INFINIBAND OR ETHERNET? WHICH BETTER SUITS AI NETWORKING FABRIC?

By DriveNets | Sep 25, 2023 8:00am

Generative artificial intelligence (AI) is rapidly growing in applications and
popularity, and with it come AI infrastructure buildouts and demand for
high-scale compute resources.

China’s hyperscalers are splashing billions of dollars on Nvidia gear to keep
pace with Western hyperscalers, which are building ever-larger AI supercomputers
to accommodate the rapidly growing datasets used to train their AI models.

The networking challenge

This massive buildout of AI infrastructure is creating the need for a
high-performance networking fabric. As a result, inter-GPU connectivity has
become a crucial element in the performance of AI workloads and the efficiency
of the AI infrastructure.

Though it accounts for less than 10% of the typical cost of a large AI compute
cluster (the GPUs hold the lion’s share of the cost), an underperforming
networking infrastructure can reduce the performance of the entire AI cluster,
measured in job completion time (JCT), by tens of percent.

In her keynote at the Open Compute Project (OCP) Summit in late 2022, Alexis
Bjorlin, Meta’s VP of infrastructure, highlighted the growing gap between
compute capabilities and the surrounding network capabilities. She also shared
some striking figures on the share of compute time spent idle, waiting for the
network to deliver the needed AI payloads. These compute resources are simply
wasted while waiting, causing longer JCT or requiring a larger (and much more
costly) compute cluster to finish a given task on time.
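
To see why this matters economically, consider the back-of-the-envelope Python
sketch below. All figures (cluster size, GPU price, idle fraction) are
illustrative assumptions, not numbers from the keynote:

    # Back-of-the-envelope: how network-induced GPU idle time inflates the
    # effective cost of a cluster. Every number is an illustrative assumption.
    gpus = 1024             # cluster size
    cost_per_gpu = 30_000   # USD, assumed GPU price
    idle_fraction = 0.30    # assumed share of GPU time spent waiting on the network

    compute_cost = gpus * cost_per_gpu          # ~$30.7M of GPUs
    network_cost = compute_cost * 0.10 / 0.90   # the network at ~10% of total cost

    # GPUs needed to finish the same job on time despite the idle time:
    gpus_needed = gpus / (1 - idle_fraction)
    extra_spend = (gpus_needed - gpus) * cost_per_gpu

    print(f"Network cost:             ${network_cost:,.0f}")
    print(f"Extra GPUs to compensate: {gpus_needed - gpus:.0f} (${extra_spend:,.0f})")
    # ~439 extra GPUs (~$13.2M) vs ~$3.4M of network: a fabric that removes
    # the idle time pays for itself several times over.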

This networking bottleneck is an unacceptable situation in which an expensive,
strategic infrastructure (AI compute) is limited by a secondary, much less
costly element: the network.

Effect on overall performance

To understand how the networking fabric affects the overall performance of an AI
cluster, we need to take a look at how the AI training process works.

Because the process is far too compute-intensive to run on a single compute
element (a GPU or other AI processor), it runs in parallel across multiple
GPUs. The number of GPUs in a single cluster running the same job has grown
from tens to hundreds, and lately to thousands and even tens of thousands.

The way this is achieved, as explained by Nidhi Chappell, Microsoft’s GM of
Azure Generative AI and HPC platforms, is by partitioning the AI computation
workload across all those GPUs. The workload runs in parallel, and in
synchronization phases called allreduce, information is shared, or synced,
between the GPUs.

This information, transferred between GPUs, runs over a designated networking
fabric, often referred to as the back-end network, which connects all the GPUs
in the cluster. Due to the traffic volume, this network connection is per GPU,
not per server (a single server can accommodate up to eight GPUs).
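
To make those synchronization phases concrete, here is a minimal,
single-process Python sketch of the ring-allreduce pattern, with each "GPU"
simulated as a plain array. In a real cluster, every step is a send/receive
over the back-end network (collective libraries such as NCCL implement this);
the sketch illustrates the communication pattern only:

    import numpy as np

    def ring_allreduce(grads):
        """Simulate ring allreduce over equal-length per-GPU gradient arrays.

        Every GPU ends up with the elementwise sum while moving only one
        chunk per step, which keeps per-link traffic flat instead of bursty.
        """
        n = len(grads)
        # chunks[i][j] = GPU i's current copy of chunk j
        chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]

        # Phase 1, reduce-scatter: after n-1 steps, GPU i holds the full sum
        # of chunk (i+1) % n.
        for step in range(n - 1):
            sends = [(i, (i + 1) % n, (i - step) % n) for i in range(n)]
            payloads = [chunks[src][c].copy() for src, _, c in sends]  # simultaneous exchange
            for (src, dst, c), payload in zip(sends, payloads):
                chunks[dst][c] += payload

        # Phase 2, all-gather: circulate the fully reduced chunks around the ring.
        for step in range(n - 1):
            sends = [(i, (i + 1) % n, (i + 1 - step) % n) for i in range(n)]
            payloads = [chunks[src][c].copy() for src, _, c in sends]
            for (src, dst, c), payload in zip(sends, payloads):
                chunks[dst][c] = payload

        return [np.concatenate(c) for c in chunks]

    gpus = [np.arange(8) * (i + 1) for i in range(4)]  # 4 simulated GPUs
    print(ring_allreduce(gpus)[0])                     # [ 0. 10. 20. ... 70.]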

AI networking options

There are several networking technologies that support this AI fabric
infrastructure.

 1. InfiniBand — the dominant technology, so far. InfiniBand was purpose-built
    for supercomputer connectivity and is, in practice, an Nvidia walled
    garden. While InfiniBand provides adequate performance, it suits an
    isolated infrastructure. According to Nvidia’s CEO, Jensen Huang, who
    described this market evolution in his keynote at Computex 2023, for
    generative AI to grow and become present in public datacenters and cloud
    infrastructure, a move toward Ethernet-based fabrics needs to occur.
 2. Ethernet – the de facto global standard for any connectivity within the
    datacenter. The issue with Ethernet, though, is that it is, by nature, a
    lossy technology. This nature becomes dominant as the number of elements
    connected to an Ethernet network grows and, even more so, as the traffic
    utilization of the network exceeds 30%-50%. Under these conditions,
    congestion starts to occur in different parts of the network, and
    phenomena such as head-of-line blocking and incast cause jitter (variation
    in latency) and frame/packet loss; a toy simulation after this list
    illustrates the effect. This means the AI job is delayed, causing a longer
    job completion time, and, in cases of severe packet loss, the job could be
    halted, forcing it to “rewind” to the last checkpoint or to restart
    altogether.
 3. DDC – to get an Ethernet-based solution that, like InfiniBand, is
    lossless, predictable and consistent in its performance, a different
    approach to Ethernet infrastructure is required. Such an approach was
    introduced, for a completely different use case, by the OCP when it
    accepted the Distributed Disaggregated Chassis (DDC) specifications for
    high-scale networking.
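
Here is the toy incast simulation referenced above: several bursty senders
share one switch egress port with a finite buffer, and loss climbs as offered
load grows. All parameters are arbitrary assumptions, chosen only to make the
effect visible:

    import random

    random.seed(7)
    BUFFER = 64    # egress buffer depth, in packets
    BURST = 32     # packets per application burst
    SENDERS = 8

    def loss_rate(load, ticks=100_000):
        p_burst = load / (SENDERS * BURST)   # per-sender burst probability per tick
        queue = sent = dropped = 0
        for _ in range(ticks):
            for _ in range(SENDERS):
                if random.random() < p_burst:
                    for _ in range(BURST):
                        sent += 1
                        if queue < BUFFER:
                            queue += 1
                        else:
                            dropped += 1     # buffer overflow: lost frame
            queue = max(0, queue - 1)        # the port drains at line rate
        return dropped / max(sent, 1)

    for load in (0.2, 0.4, 0.6, 0.8):
        print(f"offered load {load:.0%}: loss {loss_rate(load):.2%}")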

In Ethernet DDC, the external interfaces are Ethernet but the internal
(backplane) connectivity is a cell-based scheduled fabric which is lossless and
predictable, distributed across multiple white boxes.

In this architecture, there are two types of white boxes: the NCP (network
cloud packet forwarder, also referred to as DCP) and the NCF (network cloud
fabric, or DCF). Unlike in a chassis-based solution, NCP white boxes are not
bound by a physical or mechanical enclosure and can be deployed across multiple
racks in the data center. Each NCP acts as a top-of-rack switch, connecting the
servers within its rack, while inter-rack connectivity runs over the
connections between the NCPs and the NCFs, which are cell-based and hence
lossless and predictable.

The connectivity scheme is illustrated in the following drawing:

[Diagram: GPUs connect to NCP top-of-rack boxes, which interconnect through NCFs over the cell-based fabric]

It is important to understand that in such an architecture, the entire NCP and
NCF constellation acts as a single (very large) Ethernet entity, so a
connection between any two GPUs hops through a single Ethernet node. In a
similar physical Clos architecture, in which all nodes are Ethernet switches,
there would be up to five (and, in some cases, seven) Ethernet hops between
GPUs.

The internal fabric and the overlaying software include mechanisms for lossless
connectivity, such as a virtual output queue (VOQ) held in the ingress port,
which prevents HOL blocking, as well as wire-speed failover mechanisms that
ensure continuous fabric availability for the AI workloads.
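
The following is a minimal Python sketch of the VOQ idea, as an illustration
of the general mechanism rather than DriveNets’ implementation: the ingress
side keeps one queue per egress port, so a congested egress cannot stall
traffic headed elsewhere, which is exactly the head-of-line blocking problem a
single shared FIFO would suffer:

    from collections import defaultdict, deque

    class IngressPort:
        """One queue per egress port (VOQ), instead of a single shared FIFO."""

        def __init__(self):
            self.voqs = defaultdict(deque)

        def enqueue(self, packet, egress):
            self.voqs[egress].append(packet)

        def schedule(self, ready_egresses):
            """Transmit one packet toward each egress that can accept traffic.

            With a single FIFO, a blocked head packet (destined for a congested
            egress) would stall everything behind it; with VOQs it does not.
            """
            return [(e, self.voqs[e].popleft()) for e in ready_egresses if self.voqs[e]]

    port = IngressPort()
    port.enqueue("pkt-A", egress=1)            # egress 1 is congested
    port.enqueue("pkt-B", egress=2)            # egress 2 has capacity
    print(port.schedule(ready_egresses=[2]))   # pkt-B departs; pkt-A waits in its own queue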

This architecture could be viewed as a virtual chassis, which can scale to 10Ks
of high-speed (up to 800Gbps) Ethernet ports, connecting the entire AI cluster,
as illustrated below:

[Diagram: the DDC “virtual chassis” connecting the entire AI cluster]
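
For a rough sense of the scale math behind such a virtual chassis, consider
the sketch below. All radix figures are illustrative assumptions, not
DriveNets product specifications; doubling either radix doubles the port
count:

    # Fan-out arithmetic for a two-tier DDC "virtual chassis".
    # Every radix below is an assumed, illustrative figure.
    ncfs = 32            # fabric white boxes; each NCP runs one link to each NCF
    ncf_ports = 512      # fabric-side ports per NCF
    ncp_gpu_ports = 32   # 800G Ethernet ports per NCP facing the GPUs

    max_ncps = ncf_ports                  # each NCF port terminates one NCP link
    gpu_ports = max_ncps * ncp_gpu_ports  # Ethernet ports available to the cluster

    print(f"{max_ncps} NCPs x {ncp_gpu_ports} ports = {gpu_ports:,} x 800G ports")
    # -> 512 NCPs x 32 ports = 16,384 x 800G ports: the "10Ks of ports" scale
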
No doubt things will further develop and evolve. But DriveNets will most likely
evolve with them, as its solutions already demonstrate higher AI performance
than other Ethernet solutions, and the company is already working with industry
pioneers who are building the largest Ethernet infrastructures.
