

# Towards "Intelligence, Storage, Network": Characterizing, Optimizing, and Outlooking — A Systems Researcher's Perspective

#### Rong Chen

#### Institute of Parallel and Distributed Systems, SJTU

#### Huawei STW, 2023

Joint work with Xingda, Xiating, Rongxin, Yuhan, Haibo, Binyu, and members of IPADS

## Who Am I

#### Rong Chen (陈榕) / IPADS, SJTU

https://ipads.se.sjtu.edu.cn/rong\_chen

- Research Interest: Building efficient, scalable, and reliable distributed systems
- Publications and awards in systems conferences (OSDI, SOSP, EuroSys)
- ► Huawei OlympusMons Pioneer Award, 2020

"Efficient Data Processing System based on New Heterogeneous Hardware"

# Disclaimers:

I am a Systems person, not a Network/Storage/AI expert 😳





## My view:

## How to (re)build high-performance system software stack by exploiting new hardware of "Intelligence, Storage, Network"















[Figure: AI application domains, including medical imaging, speech AI, customer service, recommenders, ML, physics, communications, video analytics, logistics, conversational AI, robotics, autonomous vehicles, and cybersecurity]



#### **Compute** Power



Source: "AI and Memory Wall", 2021. https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8










## My view:

How to (re)build high-performance system software stack by exploiting **new hardware of "Intelligence, Storage, Network"** 



## Our Approach










Case #1: Collaborative offloading

Case #2: Cooperative offloading

Outlooking systems research for DPU







## Case #1: Collaborative offloading

Case #2: Cooperative offloading

Outlooking systems research for DPU



## Hardware in DC



Datacenter Server




## **Common Practice: Offloading**




Different hardware devices can work together

- ► Case: RDMA NIC (RNIC) can directly access NVM
  - → "Remote Persistent Memory"
- ► Scenarios: distributed logging in file systems (FS), transactions (TX), ...



#### Distributed Transactions





<sup>1</sup> Intel. The librpmem library. <u>https://pmem.io/pmdk/librpmem/</u>

## Challenge: Compatibility

Functional flaw: remote write is NOT persistent

Solution<sup>1</sup>: + remote read (two network roundtrips)

*Performance* pitfall: remote write is inefficient

< 29% of NVM thpt limit (15M vs. 52M reqs/s)
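The functional flaw and its workaround can be illustrated with a toy model. This is a sketch only: `ToyRemoteNvm` and `persistent_write` are hypothetical names, no real RDMA or NVM is involved, and the model simply assumes that a WRITE may sit in a volatile buffer until a following READ forces it out.

```python
class ToyRemoteNvm:
    """Toy model: WRITEs land in a volatile buffer; a READ flushes them."""

    def __init__(self):
        self.in_flight = {}   # volatile: lost on a crash
        self.persisted = {}   # durable: survives a crash
        self.roundtrips = 0

    def rdma_write(self, addr, data):
        self.roundtrips += 1
        self.in_flight[addr] = data        # acknowledged, but not yet durable

    def rdma_read(self, addr):
        self.roundtrips += 1
        self.persisted.update(self.in_flight)  # assumed flush of earlier WRITEs
        self.in_flight.clear()
        return self.persisted.get(addr)

    def crash(self):
        self.in_flight.clear()             # volatile state disappears


def persistent_write(nvm, addr, data):
    """WRITE followed by READ: durable, at the cost of two roundtrips."""
    nvm.rdma_write(addr, data)
    nvm.rdma_read(addr)


plain = ToyRemoteNvm()
plain.rdma_write(0, b"log")
plain.crash()                              # the plain WRITE is lost

safe = ToyRemoteNvm()
persistent_write(safe, 0, b"log")
safe.crash()                               # the WRITE + READ pair survives
print(plain.persisted.get(0), safe.persisted.get(0), safe.roundtrips)
```

The model ignores where the data is actually buffered (RNIC, PCIe, or LLC); it only shows why the extra READ costs a second roundtrip.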

#### New hardware features are unaware of each other







RDMA write payload



## Collaborative offloading for the concurrent use of RDMA & NVM

- Characterizing RDMA+NVM for optimization hints
- Case studies: distributed TX (DrTM+H) and FS (Octopus)
- Suggestions to RDMA/NVM hardware designers



USENIX

Characterizing remote persistent memory w/ RDMA and NVM

- ► A systematic study of the collaboration btw. RDMA and NVM
- ► Tools: <u>https://github.com/SJTU-IPADS/librdpma</u>



## Example 1

#### **Optimization Hint**

► Disable DDIO to skip LLC for large writes

#### NVM feature

Random I/O causes write amplification

#### Performance pitfall

- ► RNIC sequentially writes the data to LLC
- ► Then, LLC randomly evicts the data to NVM







#### **Optimization Hint**

► Use 64B granularity for small writes

#### NVM feature

Read-modify-write pattern (PCIe partial-write)

#### Performance pitfall

An extra read to NVM if the write does not fit the PCIe data-word granularity (e.g., 64B)





Characterizing remote persistent memory w/ RDMA and NVM

- ► A design guideline: **9 optimization hints** in 3 aspects
- Achieve 87% of NVM thpt limit (from 15M to 45M reqs/s)
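The percentages quoted on the slides follow directly from the request rates. A quick check, using only the slide numbers (15M, 45M, and the 52M reqs/s NVM limit); the helper name `fraction_of_limit` is made up for illustration:

```python
# Request rates in millions of requests per second, taken from the slides.
NVM_LIMIT_MREQS = 52   # NVM throughput limit
BASELINE_MREQS = 15    # remote write before applying the hints
OPTIMIZED_MREQS = 45   # remote write after applying the hints


def fraction_of_limit(rate, limit=NVM_LIMIT_MREQS):
    """Return the achieved rate as a percentage of the NVM limit."""
    return 100.0 * rate / limit


baseline_pct = fraction_of_limit(BASELINE_MREQS)    # just under 29%
optimized_pct = fraction_of_limit(OPTIMIZED_MREQS)  # roughly 87%
speedup = OPTIMIZED_MREQS / BASELINE_MREQS          # 3x more requests per second
print(f"{baseline_pct:.1f}% -> {optimized_pct:.1f}% ({speedup:.1f}x)")
```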







Applying our performance hints to existing RDMA-NVM systems

- ► Improve DrTM+H (distributed TX) by 1.44×/2.09× for TPC-C/SmallBank
- ► Improve Octopus (distributed FS) by 2.40× for data I/O



# Case Study: Distributed Transaction

#### Applying our performance hints cumulatively on DrTM+H

| Hints | Optimizations                                                                |
|-------|------------------------------------------------------------------------------|
| H1    | Separate memory pool from different sockets to avoid cross-socket NVM access |
| H3    | Configure database with DDIO disabled                                        |
| H4    | Use ntstore to optimize the commit phase                                     |
| H5    | Align and pad logs/records larger than 256 B to XPLine granularity           |
| H6+H7 | Align and pad logs/records smaller than 256 B to 64 B granularity            |
| H8    | Implement a DRAM-based lock service for the validation phase                 |
| H9    | Implement remote persistent log with H9 in one roundtrip                     |
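The padding rule in H5 and H6+H7 can be sketched as a small helper. This is an illustration rather than DrTM+H code: `padded_size` is a hypothetical name, and the sketch assumes an XPLine of 256 B, which matches the threshold in the table.

```python
XPLINE_BYTES = 256            # assumed XPLine granularity (H5)
SMALL_GRANULARITY_BYTES = 64  # granularity for small logs/records (H6+H7)


def round_up(size, granularity):
    """Round size up to the next multiple of granularity."""
    return -(-size // granularity) * granularity


def padded_size(size):
    """Pad a log/record: XPLine multiples above 256 B, 64 B multiples otherwise."""
    if size <= 0:
        raise ValueError("size must be positive")
    if size > XPLINE_BYTES:
        return round_up(size, XPLINE_BYTES)
    return round_up(size, SMALL_GRANULARITY_BYTES)


for size in (40, 100, 256, 300):
    print(size, "->", padded_size(size))
```

Records must also start at an aligned offset for the padding to help; the helper only covers the size.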

#### Improve perf. & enable persist






#### **Optimization Hints**



#### Suggestions to hardware designers

- ▶ RDMA (persistent) WRITE: avoid extra RDMA READ
- ▶ RDMA-version ntstore: avoid disabling DDIO




<sup>1</sup> Our open-sourced toolkit: <u>https://github.com/SJTU-IPADS/librdpma</u>





Case #1: Collaborative offloading

## Case #2: Cooperative offloading

### Outlooking systems research for DPU



New Trend: Capability Integration

Intelligent hardware ("Smart + X"): network or storage + computation (CPU, FPGA, ASIC), e.g., SmartSSD

New Trend : Capability Integration

Integrating multiple capabilities into a single device

► Typical case: DPU/SmartNIC (e.g., Nvidia BlueField)








- ConnectX-6 (2x 100Gbps)
- 16 GB of on-board DRAM
- ARM Cortex-A72 (8 cores)



Integrating multiple capabilities into a single device

- ► **Typical case**: DPU/SmartNIC (e.g., Nvidia BlueField)
- ► Good: Innately immune to compatibility issues
- ► Bad: (much) higher cost, compared to RNIC

|       | Ratio | BlueField-2 <sup>1</sup> | ConnectX-6 <sup>2</sup> |
|-------|-------|--------------------------|-------------------------|
| Price | 1.5×  | \$3,615                  | \$2,440                 |
| Space | 2.0×  | 6.6 in. × 4.53 in.       | 6.6 in. × 2.71 in.      |
| Power | 3.2×  | 75 W                     | 23.6 W                  |




<sup>1</sup> NVIDIA MBF2H516A-EEEOT BlueField-2
 <sup>2</sup> NVIDIA MCX653106A-HDAT ConnectX-6
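The price and power ratios in the table can be re-derived from the listed figures. A small check using only the numbers above (the space row is not recomputed here; the dictionary keys are illustrative names):

```python
# Figures as listed in the table above.
bluefield2 = {"price_usd": 3615, "power_w": 75.0}
connectx6 = {"price_usd": 2440, "power_w": 23.6}

# Ratio of BlueField-2 to ConnectX-6 for each metric.
ratios = {metric: bluefield2[metric] / connectx6[metric] for metric in bluefield2}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
```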



DPU is inferior in *every* single capability

- ► Wimpy cores (e.g., 8-core ARM) and small memory (e.g., 16GB)
- ► Net. perf. degradation (BF-2 vs. CX-6): latency (+6~30%), thpt (-15~36%)

Case study: Get (k) in distributed key/value store (KVS)

#### RNIC-based KVS

2× RDMA READs (1 for index, 1 for value)

#### DPU-based KVS

1× SEND/RECV, offload indexing to DPU
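The difference between the two Get(k) designs comes down to network roundtrips. The toy model below counts them; it is a sketch with hypothetical class names (`Server`, `RnicClient`, `DpuClient`), uses plain dictionaries instead of real RDMA, and assumes an index that resolves in a single lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server memory: an index (key -> address) and a value heap."""
    index: dict = field(default_factory=dict)
    heap: dict = field(default_factory=dict)

    def put(self, key, value):
        addr = len(self.heap)
        self.heap[addr] = value
        self.index[key] = addr


class RnicClient:
    """One-sided design: the client traverses the index itself."""

    def __init__(self, server):
        self.server = server
        self.roundtrips = 0

    def _rdma_read(self, region, addr):
        self.roundtrips += 1               # each READ is one network roundtrip
        return region[addr]

    def get(self, key):
        addr = self._rdma_read(self.server.index, key)   # READ 1: index
        return self._rdma_read(self.server.heap, addr)   # READ 2: value


class DpuClient:
    """Offloaded design: the DPU SoC resolves the index on the server side."""

    def __init__(self, server):
        self.server = server
        self.roundtrips = 0

    def _soc_handler(self, key):
        # Runs on the (modelled) DPU SoC cores: index lookup plus value fetch.
        return self.server.heap[self.server.index[key]]

    def get(self, key):
        self.roundtrips += 1               # one SEND/RECV exchange
        return self._soc_handler(key)


server = Server()
server.put("k", "v")
rnic, dpu = RnicClient(server), DpuClient(server)
print(rnic.get("k"), rnic.roundtrips, dpu.get("k"), dpu.roundtrips)
```

The model counts roundtrips only; it says nothing about the per-request cost on the wimpy SoC cores, which is the other side of the trade-off.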




Existing systems only utilize a portion of the DPU device

- ► Only NIC-Host path, treated as RNIC
- ► Only computing resource (SoC), treated as accelerator

A systematic way to fully utilize integrated capabilities



- Characterizing: study offloading paths, rather than HW components
- ► A step-by-step optimization guideline for distributed system (DS) designers
- Case studies: DPU-accelerated distributed FS and KV
- Open-source toolkit: <u>https://github.com/smartnickit-project</u>













Characterizing DPU (i.e., BlueField-2) at the path level

- Study offloading paths, rather than HW components
- ► Four paths: NIC-Host (①), NIC-SoC (②), SoC-Host (③), SoC-only (④)
- Performance implications: bottlenecks, anomalies, and takeaways



## Example 1

#### Findings

- ► NIC-Host is **slower** than RNIC
- Overhead: PCIe latency (300ns x4)
- Non-trivial for small request (1-2μs)

#### Takeaway

If only the NIC-Host path is used, select an RNIC, as it is faster, cheaper, and saves power



## Findings

- NIC-SoC is faster than NIC-Host (no PCIe0), but still slower than RNIC (PCIe switch)
- SEND/RECV is much slower (wimpy SoC cores)





## Findings

RDMA READ performance of NIC-SoC collapses with large requests (≥ 9 MB)

## Advice: avoid large READ requests

- ► PCIe MTU: Host (512B) vs. SoC (128B)
- ► NIC-SoC READ: 4× PCIe packets for large requests → HoL blocking
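The 4× figure follows from the two MTUs. A minimal sketch (the 4 KB request size is an arbitrary example, and `pcie_packets` is a hypothetical helper that ignores packet headers):

```python
HOST_PCIE_MTU = 512  # bytes, from the slide
SOC_PCIE_MTU = 128   # bytes, from the slide


def pcie_packets(request_bytes, mtu):
    """Number of PCIe packets needed to carry one request (ceiling division)."""
    return -(-request_bytes // mtu)


request = 4096  # hypothetical request size in bytes
host_packets = pcie_packets(request, HOST_PCIE_MTU)
soc_packets = pcie_packets(request, SOC_PCIE_MTU)
print(host_packets, soc_packets, soc_packets / host_packets)
```

With more packets per request, a large READ occupies the path for longer, which is how it can block the requests queued behind it.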







### Characterizing concurrent paths in DPU

- ► DPU is always underutilized when only using a single path
- ► Study the concurrent use of multiple offloading paths (e.g., ①+②)

## Takeaway

- Concurrent offloading can better utilize the DPU, esp. when paths are used in opposite directions (R+W)
- But carefully avoid interference btw. paths, e.g., NIC cores (①+②) and PCIe switch (②+③)



## Our path-level DPU study

- A comprehensive perf. study on *offloading paths* (6)  $\times$  *primitives* (3)
- 11 *findings/advice* for either individually using a single path or concurrently using multiple paths











# Optimizing

A step-by-step optimization guideline for system designers

1. Devise potential alternatives for DPU to support the given functionality, and optimize them based on our study
2. Evaluate and rank alternatives based on system-specific criteria
3. Select and combine alternatives in turn until DPU is saturated


Case Study: Get(k) in Key/Value Store

### 1. Devise alternatives (A1-A5) and optimize them



### 2. Evaluate and rank alternatives by performance

Rank: A5 > A4 > A1 > A3 > A2





### 3. Select and combine alternatives in turn until DPU is saturated







### Suggestions to hardware designers

- Support CXL to relieve the pressure on SoC cores
- Support ARM CCI (similar to DDIO on host CPU)







Case #1: Collaborative offloading

Case #2: Cooperative offloading

**Outlooking systems research for DPU** 





Which type of processor should be selected, SoC, FPGA, or ASIC?

► An inherent trade-off between programmability and performance





► "Don't want to CHOOSE, want BOTH": SoC + FPGA/ASIC/...

- Marvell OCTEON: SoC (ARM) + VPP accelerators
- Intel IPU: SoC (Xeon) + Agilex FPGA
- Broadcom Stingray: SoC (ARM) + ASIC accelerators
- NVIDIA BlueField: SoC (ARM) + ASIC accelerators + DPA (RISC-V)



#### How to measure integrated hardware components in DPU?

► New *metrics, benchmarks*, and *toolkits*?

[Screenshots: NVIDIA Technical Blog, "Power the Next Wave of Applications with NVIDIA BlueField-3 DPUs"; The Next Platform, "Economics and the Inevitability of the DPU"]





Which domain-specific accelerators deserve to be integrated?

#### The killer applications of DPUs

#### Datacenter networking, storage, security, and virtualization workloads

HPC/AI, cloud computing, storage



#### Power the Next Wave of Applications with NVIDIA BlueField-3 DPUs



- CPUs that are used for serial processing and running hyperthreaded applications.
- GPUs that excel at parallel processing and are optimized for accelerating modern workloads.
- DPUs that are ideal for infrastructure computing tasks; used to offload, accelerate, and isolate data center networking, storage, security, and manageability workloads.






- **Cloud and data center** servers to offload virtual overlay and cryptographic processing for multi-tenant VM, container, and storage services.
- LTE and 5G vRAN implementations when paired with Marvell's Fusion-O baseband processor providing a 5G and LTE-A PHY with the OCTEON used for CU or vRAN offload processing.
- Enterprise router-firewall and SD-WAN appliances using NFV service chaining to deliver L2/L3 forwarding, VPN termination, SPI, and new AI-based applications and security services.

Source: https://developer.nvidia.com/blog/power-the-next-wave-of-applications-with-nvidia-bluefield-3-dpus/ https://packetpushers.net/marvells-octeon-10-challenges-all-comers-for-dpu-supremacy/



Which domain-specific accelerators deserve to be integrated?


► Compression, encryption, virtualization, packet processing, . . .

| Acceleration        | BlueField | BlueField-2 | BlueField-3 | IPU E2000 | OCTEON 10 |
|---------------------|-----------|-------------|-------------|-----------|-----------|
| DMA                 |           | ✓           |             |           |           |
| Compress            |           | ✓           | ✓*          |           |           |
| Erasure coding      |           |             |             |           |           |
| Regex               |           |             |             |           |           |
| Off-path encryption |           |             |             |           |           |
| On-path encryption  |           |             |             |           |           |
| Packet processing   |           |             |             |           |           |
| Year                | 2016      | 2020        | 2023        | 2023      | 2021      |



Which domain-specific accelerators deserve to be integrated?

► Compression, encryption, virtualization, packet processing, . . .

Differences across different vendors (marked DIFF):

| Acceleration        | BlueField | BlueField-2 | BlueField-3 | IPU E2000 | OCTEON 10 |
|---------------------|-----------|-------------|-------------|-----------|-----------|
| DMA                 |           | ✓           |             | DIFF      |           |
| Compress            |           |             | ✓*          |           | DIFF      |
| Erasure coding      |           |             |             | DIFF      | DIFF      |
| Regex               |           |             |             | DIFF      | DIFF      |
| Off-path encryption |           |             | DIFF        |           | DIFF      |
| On-path encryption  |           |             |             |           |           |
| Packet processing   |           |             |             |           |           |
| Year                | 2016      | 2020        | 2023        | 2023      | 2021      |



Which domain-specific accelerators deserve to be integrated?

► Compression, encryption, virtualization, packet processing, ...

Differences across different versions (marked ADD/DEL):

| Acceleration        | BlueField | BlueField-2 | BlueField-3 | IPU E2000 | OCTEON 10 |
|---------------------|-----------|-------------|-------------|-----------|-----------|
| DMA                 |           |             | ✓           |           |           |
| Compress            | ADD       |             | ✓*          |           |           |
| Erasure coding      |           | ADD         |             |           |           |
| Regex               | ADD       |             |             |           |           |
| Off-path encryption |           | ✓*          | DEL         |           |           |
| On-path encryption  |           |             |             |           |           |
| Packet processing   |           | ✓           |             |           |           |
| Year                | 2016      | 2020        | 2023        | 2023      | 2021      |

#### Source: https://developer.nvidia.com/blog/power-the-next-wave-of-applications-with-nvidia-bluefield-3-dpus/

### How to unify system abstraction & programming interface?

▶ e.g., BlueField: PCIe accelerator vs. a standalone server

[Screenshot: NVIDIA Technical Blog, "Power the Next Wave of Applications with NVIDIA BlueField-3 DPUs", with callouts:]

- CPU vs. GPU vs. DPU
- Three major blocks: Network, Compute, Storage
- "Server in front of a server"

[Screenshot: NVIDIA DOCA Software Framework]














Hardware evolution:

single capability breakthrough & multiple capability integration

Our approach: characterizing, optimizing, and advising

- Collaborative offloading for multiple devices (e.g., RDMA & NVM)
- Cooperative offloading for intelligent devices (e.g., DPU)

Our outlook on systems research for DPU

See more at https://ipads.se.sjtu.edu.cn/rong\_chen

