Oracle Cloud Infrastructure Blog





Partner Solutions, Performance Solutions, Technical Solutions




OCI ACCELERATES HPC, AI, AND DATABASE USING ROCE AND NVIDIA CONNECTX

July 24, 2023 | 14 minute read
Leo Leung
Vice President, Product Marketing, OCI
John Kim
Director of Storage Marketing


Oracle is one of the top cloud service providers in the world, supporting over
22,000 customers and reporting revenue of nearly $4 billion per quarter and
annual growth of greater than 40%. Oracle Cloud Infrastructure (OCI) is growing
at an even faster rate and offers a complete cloud infrastructure for every
workload.

Having added 11 regions in the last 18 months, OCI currently offers 41 regions
and supports hosted, on-premises, hybrid, and multi-cloud deployments. It
enables customers to run a mix of custom-built, third-party independent software
vendor (ISV), and Oracle applications on a scalable architecture. OCI provides
scalable networking and tools to support security, observability, compliance,
and cost management.

One of the differentiators of OCI is its ability to offer high-performance
computing (HPC), Oracle Exadata and Autonomous Database, and GPU-powered
applications, such as artificial intelligence (AI) and machine learning (ML),
with fast infrastructure-as-a-service (IaaS) performance that rivals dedicated
on-premises infrastructure. A key component to delivering this high performance
is a scalable, low-latency network that supports remote direct memory access
(RDMA). For more details, see First Principles: Building a High-Performance
Network in the Public Cloud.


NETWORKING CHALLENGE OF HPC AND GPU-POWERED COMPUTE

A commonality across HPC applications, GPU-powered AI workloads, and the Oracle
Autonomous Database on Exadata is that they all run as distributed workloads.
Data processing occurs simultaneously on multiple nodes, using a few dozen to
thousands of CPUs and GPUs. These nodes must communicate with each other, share
intermediate results in multistage problem solving, access common data on
storage ranging from gigabytes to petabytes, and often assemble the results of
distributed computing into a cohesive solution.

These applications require high throughput and low latency to communicate across
nodes to solve problems quickly. Amdahl’s law states that the speedup from
parallelizing a task is limited by how much of the task is inherently serial and
can’t be parallelized. The amount of time needed to transfer information between
nodes adds inherently serial time to the task because nodes must wait for the
data transfer to complete and for the slowest node in the task to finish before
starting the next parallelizable part of the job.
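To make this concrete, the following short Python sketch applies Amdahl's law,
treating communication and wait time as part of the serial fraction. The node
counts and fractions are illustrative only.

# Amdahl's law: speedup = 1 / (s + (1 - s) / N), where s is the serial
# fraction of the job (including time spent waiting on inter-node
# communication) and N is the number of nodes.
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# Illustrative example: shrinking communication overhead from 5% to 1% of the
# total job raises the achievable speedup on 1,024 nodes from ~20x to ~91x.
for s in (0.05, 0.01):
    print(f"serial fraction {s:.0%}: speedup ~{amdahl_speedup(s, 1024):.0f}x")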

For this reason, the performance of the cluster network becomes paramount, and
an optimized network can enable a distributed compute cluster to deliver results
much sooner than the same computing resources running on a slower network. This
time-saving speeds up job completion time and reduces costs.


WHAT IS RDMA?

RDMA is remote direct memory access, the most efficient means of transferring
data between different machines. It enables a server or storage appliance to
communicate and share data over a network without making extra copies and
without interrupting the host CPU. It’s used for AI, big data, and other
distributed technical computing workloads.

Traditional networking interrupts the CPU multiple times and makes multiple
copies of the data being transmitted as it passes from the application through
the OS kernel, to the adapter, then back up the stack on the receiving end. RDMA
uses only one copy of the data on each end and typically bypasses the kernel,
placing data directly in the receiving machine’s memory without interrupting the
CPU.

This process enables lower latency and higher throughput on the network and
lower CPU utilization for the servers and storage systems. Today, the majority
of HPC, technical computing, and AI applications can be accelerated by RDMA. For
more details, see How RDMA Became the Fuel for Fast Networks.


WHAT IS INFINIBAND?

InfiniBand is a lossless network optimized for high-performance computing, AI,
big data, and other distributed technical computing workloads. It typically
supports the highest bandwidth available (currently 400 Gbps per connection) for
data center networks and RDMA, enabling machines to communicate and share data
without interrupting the host CPU.

InfiniBand adapters offload networking and data movement tasks from the CPU and
feature an optimized, efficient networking stack, enabling CPUs, GPUs, and
storage to move data rapidly and efficiently. The InfiniBand adapters and
switches can also perform specific compute and data aggregation tasks in the
network, mostly oriented around message passing interface (MPI) collective
operations.
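As a concrete, simplified example of the collective operations mentioned above,
the sketch below performs an MPI allreduce with Python and mpi4py; in-network
computing aggregates exactly this kind of operation inside the switches. It
assumes mpi4py and NumPy are installed and that the script is started with an
MPI launcher (for example, mpirun -np 4 python allreduce_demo.py).

# Minimal allreduce sketch: every rank contributes a local buffer, and the
# summed result is returned to all ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(4, rank, dtype=np.float64)   # each rank's local contribution
total = np.empty_like(local)

comm.Allreduce(local, total, op=MPI.SUM)     # sum across all ranks

if rank == 0:
    print("sum across ranks:", total)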

This in-network computing speeds up distributed applications, enabling faster
problem solving. It also frees up server CPU cores and improves energy
efficiency. InfiniBand can also automatically balance traffic loads and reroute
connections around broken links.

As a result, many computing clusters that are dedicated to AI, HPC, big data, or
other scientific computing run on an InfiniBand network to provide the highest
possible performance and efficiency. When distributed computing performance is
the top priority, and the adoption of a specialized stack of network adapters,
switches, and management is acceptable, InfiniBand is the network of choice. But
a data center might choose to run ethernet instead of InfiniBand for many
reasons.


WHAT IS ROCE?

RDMA over converged ethernet (RoCE) is an open standard enabling remote direct
memory access and network offloads over an ethernet network. The current and
most popular implementation is RoCEv2. It uses an InfiniBand communication layer
running on top of UDP (Layer 4) and IP (Layer 3), which runs on top of
high-speed ethernet (Layer 2) connections. It also supports remote direct memory
access, zero-copy data transfers, and bypassing the CPU when moving data. Using
the IP protocol on ethernet enables RoCEv2 to be routable over standard ethernet
networks.
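The layering can be seen by building a skeleton RoCEv2 packet. The sketch below
uses Python with scapy (assumed to be installed) and stops at the UDP layer; a
real RoCEv2 packet carries the InfiniBand base transport header and payload
inside the UDP datagram. RoCEv2 uses the IANA-assigned UDP destination port
4791; the addresses here are placeholders.

from scapy.all import Ether, IP, UDP, Raw

# Ethernet (L2) -> IP (L3) -> UDP (L4, port 4791) -> InfiniBand transport
pkt = (
    Ether()
    / IP(src="10.0.0.1", dst="10.0.0.2")
    / UDP(sport=49152, dport=4791)
    / Raw(load=b"IB BTH + RDMA payload would go here")
)
pkt.show()   # print the layered headers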

RoCE brings many of the advantages of InfiniBand to ethernet networks. RoCEv2
runs the InfiniBand transport layer over UDP and IP protocols on an ethernet
network. iWARP ran the iWARP protocol on top of the TCP protocol on an ethernet
network but failed to gain popular adoption because of performance and
implementation challenges.

NVIDIA InfiniBand, by comparison, runs the InfiniBand transport layer directly
over an InfiniBand network.


HOW DO ROCE NETWORKS ADDRESS SCALABILITY?

RoCE operates most efficiently on networks with very low levels of packet loss.
Traditionally, small RoCE networks use priority flow control (PFC), based on the
IEEE 802.1Qbb specification, to make the network lossless. If any destination is
too busy to process all incoming traffic, it sends a pause frame to the next
upstream switch port, and that switch holds traffic for the time specified in
the pause frame.

If needed, the switch can also send a pause frame up to the next switch in the
fabric and eventually onto the originator of the traffic flow. This flow control
avoids having the port buffers overflow on any host or switch and prevents
packet loss. You can manage up to eight traffic classes with PFC, each class
having its own flows and pauses separate from the others.
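As a rough illustration of this behavior, the toy model below in Python tracks
one receive queue for one traffic class and emits a pause request when the
queue crosses a threshold, the way a port would send an 802.1Qbb pause frame
upstream. The buffer sizes and thresholds are made up for the example.

class PfcPort:
    """One receive queue for one of the (up to eight) PFC traffic classes."""

    def __init__(self, buffer_limit=100, pause_threshold=80):
        self.buffer_limit = buffer_limit
        self.pause_threshold = pause_threshold
        self.queue = 0

    def receive(self, packets):
        """Accept packets; ask the upstream port to pause if nearly full."""
        self.queue = min(self.buffer_limit, self.queue + packets)
        return "PAUSE" if self.queue >= self.pause_threshold else None

    def drain(self, packets):
        self.queue = max(0, self.queue - packets)

port = PfcPort()
for burst in (40, 40, 40):            # arrivals outpace the drain rate
    signal = port.receive(burst)
    port.drain(10)
    print(f"queue={port.queue} signal={signal}")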

However, PFC has some limitations. It operates only at the Layer 2 (Ethernet)
level of the Open Systems Interconnection (OSI) 7-layer model, so it can't work
across different subnets. While a network can have thousands or millions of
nodes, a typical subnet is limited to 254 IP addresses and consists of a few
racks (often one rack) within a data center, which doesn't scale for large
distributed applications.

PFC operates on a coarse-grained port level and can’t distinguish between flows
sharing that port. Also, if you use PFC in a multilevel switch fabric,
congestion at one destination switch for one flow can spread to multiple
switches and block unrelated traffic flows that share one port with the RoCE
flow. The solution is usually to implement a form of congestion control.


CONGESTION MANAGEMENT FOR LARGE ROCE NETWORKS

The TCP protocol includes support for congestion management based on dropped
packets. When an endpoint or switch is overwhelmed by a traffic flow, it drops
some packets. When the sender fails to receive an acknowledgment from the
transmitted data, the sender assumes the packet was lost because of network
congestion, slows its rate of transmission, and retransmits the presumably lost
data.

This congestion management scheme doesn’t work well for RDMA on ethernet and
therefore for RoCE. RoCE doesn’t use TCP, and the process of waiting for packets
to timeout and then retransmitting the lost data introduces too much latency—and
too much variability in latency or jitter—for efficient RoCE operation.

Large RoCE networks often implement a more proactive congestion control
mechanism known as explicit congestion notification (ECN), in which the switches
mark packets if congestion occurs in the network. The marked packets alert the
receiver that congestion is imminent, and the receiver alerts the sender with a
congestion notification packet or CNP. After receiving the CNP, the sender knows
to back off, slowing down the transmission rate temporarily until the flow path
is ready to handle a higher rate of traffic.

Congestion control works across Layer 3 of the OSI model, so it functions across
different subnets and scales up to thousands of nodes. However, it requires
setting changes to both switches and adapters supporting RoCE traffic. The
implementation details of when switches mark packets for congestion, how quickly
senders back off sending data, and how aggressively senders resume high-speed
transmissions are all critical to determining the scalability and performance of
the RoCE network.

ECN marks outgoing packets as CE (Congestion Experienced) when the switch queue
is getting full. The flow recipient receives the marked packet and notifies the
sender to slow transmission.

Other ethernet-based congestion control algorithms include quantized congestion
notification (QCN) and data center TCP (DCTCP). In QCN, switches notify flow
senders directly with the level of potential congestion, but the mechanism
functions only over L2. Consequently, it can’t work across more than one subnet.
DCTCP uses the sender’s network interface card (NIC) to measure the round-trip
time (RTT) of special packets to estimate how much congestion exists and how
much the sender must slow down data transmissions.

But DCTCP lacks a fast start option to quickly start or resume sending data when
no congestion exists, places a heavy load on host CPUs, and doesn’t have a good
mechanism for the receiver to communicate with the sender. In any case, DCTCP
requires TCP, so it doesn’t work with RoCE.

Smaller RoCE networks using newer RDMA-capable ConnectX SmartNICs from NVIDIA or
newer NVIDIA BlueField DPUs can use Zero Touch RoCE (ZTR). ZTR enables excellent
RoCE performance without setting up PFC or ECN on the switch, which greatly
simplifies network setup. However, initial deployments of ZTR have been limited
to small RoCE network clusters, and a more scalable version of ZTR that uses RTT
for congestion notification is still in the proving stages.


HOW OCI IMPLEMENTS A SCALABLE ROCE NETWORK

OCI determined that certain cloud workloads required RDMA for maximum
performance. These include AI, HPC, Exadata, autonomous databases, and other
GPU-powered applications. Out of the two standardized RDMA options on ethernet,
they chose RoCE for its performance and wider adoption. The RoCE implementation
needed to scale to run across clusters containing thousands of nodes and deliver
consistently low latency to ensure an excellent experience for cloud customers.

After substantial research, testing, and careful design, OCI decided to
customize their own congestion control solution based on the data center
quantized congestion notification (DC-QCN) algorithm, which they optimized for
different RoCE-accelerated application workloads. The OCI DC-QCN solution is
based on ECN with minimal use of PFC.

The OCI RoCE network uses ECN across the network fabric, plus a limited amount
of unidirectional PFC only between the hosts and top-of-rack (ToR) switches.


A SEPARATE NETWORK FOR ROCE

OCI built a separate network for RoCE traffic because the needs of the RDMA
network tend to differ from the regular data center network. The different types
of application traffic, congestion control, and routing protocols each prefer to
have their own queues. Each NIC typically supports only eight traffic classes,
and the NIC and switch configuration settings and firmware might differ between
RDMA and non-RDMA workloads. For these reasons, having a separate ethernet
network for RoCE traffic and RoCE-accelerated applications makes sense.


LIMITED USE OF PFC AT THE EDGE

OCI implemented a limited level of PFC, only unidirectionally at the network
edge. Endpoints can ask the top-of-rack (ToR) switch to pause transmission if
their NIC's buffer fills up. However, the ToR switches never ask the endpoints to
pause and don’t pass on pause requests up the network to leaf or spine switches.
This process prevents head-of-line blocking and congestion spreading if the
incoming traffic flow rate temporarily exceeds the receiver’s ability to process
and buffer data.

The ECN mechanism ensures that PFC is very rarely needed. In the rare case that
a receiving node’s NIC buffer is temporarily overrun while the ECN feedback
mechanism is activating, PFC enables the receiving node to briefly pause the
incoming data flow until the sender receives the CNPs and slows its transmission
rate.

In this sense, you can use PFC as a last resort safeguard to prevent buffer
overrun and packet loss at the network edge (at the endpoints). OCI envisions
that with the next generation of ConnectX SmartNICs, you might not need PFC,
even at the edge of the network.


MULTIPLE CLASSES OF CONGESTION CONTROL

OCI determined that they need at least three customized congestion control
profiles within DC-QCN for different workloads. Even within the world of
distributed applications that require RDMA networking, the needs can vary across
the following categories:

 * Latency sensitive: Requires consistently low latency

 * Throughput sensitive: Requires high throughput

 * Mixed: Requires a balance of low latency and high throughput

The primary setting for customizing congestion control is the probability P,
ranging from 0 to 1, that the switch adds the ECN marking to an outgoing packet,
based on queue thresholds Kmin and Kmax. P starts at 0 when the switch queue
isn't busy, which means there is no chance of congestion.

When the port queue reaches Kmin, the value P rises above 0, increasing the
chance that any packet is marked with ECN. When the queue fills to Kmax, P is
set to Pmax (typically 1), meaning every outgoing packet of that flow on that
switch is marked with ECN. Different DC-QCN profiles typically have a
no-congestion range where P is 0, a potential congestion range where P is
between 0 and 1, and a congestion range where P is 1.

A more aggressive set of thresholds has lower values for Kmin and Kmax,
resulting in earlier ECN packet marking and lower latency but possibly also
lower maximum throughput. A relaxed set of thresholds has higher values for Kmin
and Kmax, marking fewer packets with ECN, resulting in some higher latencies but
also higher throughput.

Figure 4 shows three examples of OCI workloads: HPC, Oracle Autonomous Database
and Exadata Cloud Service, and GPU workloads. These services use different RoCE
congestion control profiles. HPC workloads are latency-sensitive and give up
some throughput to guarantee lower latency. Consequently, Kmin and Kmax are
identical and low (aggressive), and at a low amount of queuing, the switches
mark 100% of all packets with ECN.

Most GPU workloads are more forgiving on latency but need maximum throughput.
The DC-QCN profile gradually marks more packets as buffers fill from Kmin to
Kmax and sets those values relatively higher, enabling switch buffers to get
closer to full before signaling flow endpoints to slow down.

For Autonomous Database and Exadata Cloud Service workloads, the required
balance of latency and bandwidth is in between. The marking probability P
increases gradually between Kmin and Kmax, but these thresholds are set at
lower values than for GPU workloads.

OCI sets DC-QCN to use different Kmin and Kmax thresholds for ECN packet
marking, resulting in optimized network behavior on their RoCE network for
different workloads

With these settings, HPC flows get 100% ECN packet marking as soon as the queues
hit the Kmin level (which here equals Kmax) for early and aggressive congestion
control engagement. Oracle Autonomous Database and Exadata flows see moderately
early ECN marking, but only a portion of packets are marked until buffers reach
the Kmax level.

Other GPU workloads have higher Kmin and Kmax settings, so ECN marking doesn't
begin until switch queues are relatively fuller, and 100% ECN marking happens
only when the queues are close to full. Different workloads get the customized
congestion control settings needed to provide the ideal balance of latency and
throughput for maximum application performance.
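A configuration-style sketch of these three profiles follows. The Kmin and Kmax
numbers are hypothetical placeholders chosen only to show the relative ordering
described above (HPC most aggressive, GPU most relaxed, database workloads in
between); they are not OCI's real values.

from dataclasses import dataclass

@dataclass(frozen=True)
class DcqcnProfile:
    kmin: int          # queue depth where ECN marking starts (illustrative units)
    kmax: int          # queue depth where every packet is marked
    pmax: float = 1.0

PROFILES = {
    "hpc":      DcqcnProfile(kmin=50,  kmax=50),     # step function: early, aggressive
    "database": DcqcnProfile(kmin=100, kmax=400),    # balanced ramp
    "gpu":      DcqcnProfile(kmin=400, kmax=1600),   # late marking, favors throughput
}

print(PROFILES["hpc"], PROFILES["gpu"], sep="\n")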


UTILIZING ADVANCED NETWORK HARDWARE

An important factor in achieving high performance for RoCE networks is the type
of network card used. The NIC offloads the networking stack, including RDMA, to
a specialized chip, taking that work off the CPUs and GPUs. OCI uses ConnectX
SmartNICs, which have market-leading network performance for both TCP and RoCE
traffic.

These SmartNICs also support rapid PFC and ECN reaction times for detecting
ECN-marked packets or PFC pause frames, sending CNPs, and adjusting the data
transmission rates downward and upward in response to congestion notifications.

NVIDIA has been a longtime leader in the development and support of RDMA, PFC,
ECN, and DC-QCN technology and a leader in high-performance GPUs and GPU
connectivity. The advanced RoCE offloads in ConnectX enable higher throughput
and lower latency on the OCI network, and their rapid, hardware-based ECN
reaction times help ensure that DC-QCN functions smoothly.

By implementing an optimized congestion control scheme on a dedicated RoCE
network, plus a combination of localized PFC, multiple congestion control
profiles, and NVIDIA network adapters, OCI has built a very scalable cluster
network. It’s ideal for distributed workloads, such as AI and ML, HPC, and
Oracle Autonomous Database, and delivers high-throughput and low-latency
performance close to what an InfiniBand network can achieve.


EMPHASIZING DATA LOCALITY

Along with optimizing cluster network performance, OCI also manages data
locality to minimize latency. RoCE-connected clusters often span multiple data
center racks and halls, and even in an era of 100-, 200-, and 400-Gbps
networking connections, the speed of light hasn't changed: longer cables result
in higher latency.

Connections to different halls in the data center traverse more switches, and
each switch hop adds some nanoseconds to connection latency. OCI shares server
locality information with both its customers and the job scheduler, so they can
schedule jobs to use servers and GPUs that are close to each other in the
network.

For example, the NVIDIA Collective Communication Library (NCCL) understands the
OCI network topology and server locality information and can schedule GPU work
accordingly. So, the compute and storage connections traverse fewer switch hops
and shorter cable lengths, reducing the average latency within the cluster.
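The sketch below shows the general idea of locality-aware placement: given a map
of hosts to racks (the hostnames and rack labels here are hypothetical), a
scheduler fills a job from the rack with the most free hosts so that most
traffic stays within a rack and crosses fewer switch hops. It is a stand-in for
what NCCL or a job scheduler does with real OCI locality information.

from collections import defaultdict

HOST_LOCALITY = {                     # hypothetical hostname -> rack label
    "gpu-001": "rack-A", "gpu-002": "rack-A", "gpu-003": "rack-A",
    "gpu-010": "rack-B", "gpu-011": "rack-B",
    "gpu-020": "rack-C",
}

def pick_hosts(num_hosts):
    """Prefer racks with the most free hosts to keep workers close together."""
    by_rack = defaultdict(list)
    for host, rack in HOST_LOCALITY.items():
        by_rack[rack].append(host)
    chosen = []
    for rack in sorted(by_rack, key=lambda r: len(by_rack[r]), reverse=True):
        for host in sorted(by_rack[rack]):
            if len(chosen) == num_hosts:
                return chosen
            chosen.append(host)
    return chosen

print(pick_hosts(4))   # ['gpu-001', 'gpu-002', 'gpu-003', 'gpu-010']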

It also sends less traffic to spine switches, simplifying traffic routing and
load balancing decisions. OCI also worked with its switch vendors to make the
switches more load-aware, so flows can be routed to less busy connections. Each
switch generally has two connections up and down the network, enabling multiple
data paths for any flow.


CONCLUSION

By investing in a dedicated RoCE network with an optimized implementation of
DC-QCN, advanced ConnectX NICs, and customized congestion control profiles, OCI
delivers a highly scalable cluster that supports accelerated computing for many
different workloads and applications. OCI cluster networks simultaneously
deliver high throughput and low latency. For small clusters, latency—half the
round-trip time—can be as little as 2 microseconds. For large clusters, latency
is typically under 4 microseconds. For extremely large superclusters, latencies
are in the range of 4–8 microseconds, with most traffic seeing latencies at the
lower end of this range.

Oracle Cloud Infrastructure uses an innovative approach to deliver scalable,
RDMA-powered networking on ethernet for a multitude of distributed workloads,
providing higher performance and value to their customers.

For more information, see the following resources:

 * Building High-Performance Network in the Cloud

 * Oracle Partners with NVIDIA to Solve the Largest AI and NLP Models

 * Running Applications on Oracle Cloud Using Cluster Networking

 * Large Clusters, Lowest Latency Cluster Networking on Oracle Cloud
   Infrastructure

LEO LEUNG

VICE PRESIDENT, PRODUCT MARKETING, OCI



I'm an experienced product manager and product marketer at both large and
startup vendors. I've been a cloud application, platform, and infrastructure
end user and website developer; an enterprise storage system product manager
and marketer; a cloud storage and cloud application product manager and
operations manager; and a storage software product marketer. I've managed
business partnerships with many infrastructure ISVs, as well as large systems
vendors like Cisco, Dell, EMC, HPE, and IBM.




JOHN KIM

DIRECTOR OF STORAGE MARKETING



John Kim is director of storage marketing at NVIDIA in the Networking Business
Unit, where he helps customers and vendors benefit from high performance network
connections, SmartNIC offloads, and Remote Direct Memory Access (RDMA),
especially in the areas of storage, big data, and artificial intelligence. A
frequent blogger, conference speaker, and webcast presenter, John is also chair
of the Storage Networking Industry Association’s Networking Storage Forum (SNIA
NSF). After starting his high tech career at an IT helpdesk and as a network
administrator, John worked in solution marketing, product management, and
alliances for enterprise software companies and then at enterprise storage
vendors NetApp and EMC. He joined Mellanox in 2013 and then NVIDIA in 2020.








