Microsoft Research Blog


TUTEL: AN EFFICIENT MIXTURE-OF-EXPERTS IMPLEMENTATION FOR LARGE DNN MODEL
TRAINING

Published November 22, 2021

By Wei Cui, Senior Research SDE; Yifan Xiong, Research SDE II; Peng Cheng,
Principal Researcher; and Rafael Salas, Software Engineer


Research Area

 * Artificial intelligence

Mixture of experts (MoE) is a deep learning model architecture in which
computational cost grows sublinearly with the number of parameters, making
scaling easier. Today, MoE is the only approach demonstrated to scale deep learning
models to trillion-plus parameters, paving the way for models capable of
learning even more information and powering computer vision, speech recognition,
natural language processing, and machine translation systems, among others, that
can help people and organizations in new ways.
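
To make the idea concrete, the sketch below shows what a single MoE layer with
top-k gating looks like in plain PyTorch: a small gating network scores every
expert for each token, only the k highest-scoring experts run on that token,
and their outputs are combined using the gate weights. This is an illustrative
simplification written for this post (the class name, shapes, and the naive
loop-over-experts dispatch are our own), not Tutel's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleMoE(nn.Module):
        """Illustrative MoE layer: a gate routes each token to its top-k
        experts, so only k expert FFNs (not all of them) run per token."""
        def __init__(self, model_dim, hidden_size, num_experts, k=2):
            super().__init__()
            self.k = k
            self.gate = nn.Linear(model_dim, num_experts)   # gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(model_dim, hidden_size), nn.ReLU(),
                              nn.Linear(hidden_size, model_dim))
                for _ in range(num_experts))

        def forward(self, x):                     # x: [tokens, model_dim]
            scores = F.softmax(self.gate(x), dim=-1)
            topk_scores, topk_idx = scores.topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            # Naive dispatch: gather the tokens routed to each expert, run the
            # expert once on that group, and scatter the weighted results back.
            for e, expert in enumerate(self.experts):
                token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
                if token_ids.numel() == 0:
                    continue
                weight = topk_scores[token_ids, slot].unsqueeze(-1)
                out[token_ids] += weight * expert(x[token_ids])
            return out

    x = torch.randn(8, 2048)                      # 8 tokens, model_dim = 2048
    moe = SimpleMoE(model_dim=2048, hidden_size=2048, num_experts=4, k=2)
    print(moe(x).shape)                           # torch.Size([8, 2048])

Adding experts increases the parameter count, but each token still activates
only k of them, which is why compute cost grows sublinearly with total
parameters.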

Today, we’re proud to announce Tutel, a high-performance MoE library to
facilitate the development of large-scale DNN models; Tutel is highly optimized
for the new Azure NDm A100 v4 series, now generally available. With Tutel’s
diverse and flexible MoE algorithmic support, developers across AI domains can
execute MoE more easily and efficiently. For a single MoE layer, Tutel achieves
an 8.49x speedup on an NDm A100 v4 node with 8 GPUs and a 2.75x speedup on 64
NDm A100 v4 nodes with 512 A100 GPUs (all experiments in this blog are tested on
Azure NDm A100 v4 nodes with 8 x 80 GB NVIDIA A100 and an 8 x 200 gigabits per
second InfiniBand network), respectively, compared with state-of-the-art MoE
implementations such as that in Meta’s Facebook AI Research Sequence-to-Sequence
Toolkit (fairseq) in PyTorch. For end-to-end performance, Tutel—benefiting from
an optimization for all-to-all communication—achieves a more than 40 percent
speedup with 64 NDm A100 v4 nodes for Meta's (Facebook is now Meta) 1.1
trillion–parameter MoE language model. Tutel provides broad compatibility and
rich features to ensure strong performance on the Azure NDm A100 v4 cluster.
Tutel is open source and has been integrated into fairseq.


TUTEL MOE OPTIMIZATIONS

Complementary to other high-level MoE solutions like fairseq and FastMoE, Tutel
focuses mainly on optimizing MoE-specific computation and all-to-all
communication, along with diverse and flexible algorithmic MoE support. Tutel
has a concise interface, making it easy to integrate into other
MoE solutions. Alternatively, developers can use the Tutel interface to
incorporate standalone MoE layers into their own DNN models from scratch and
benefit from the highly optimized state-of-the-art MoE features directly.
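
For reference, the open-source Tutel repository exposes the MoE layer roughly
as follows. This sketch is adapted from Tutel's public examples; exact argument
names and defaults may differ between releases, so treat it as illustrative
rather than authoritative.

    # Sketch adapted from Tutel's public examples (github.com/microsoft/tutel);
    # argument names may differ across releases.
    import torch
    from tutel import moe as tutel_moe

    model_dim, hidden_size, num_local_experts = 2048, 2048, 2

    moe_layer = tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 2},   # Top-2 gating; arbitrary k is supported
        model_dim=model_dim,
        experts={'type': 'ffn',
                 'count_per_node': num_local_experts,
                 'hidden_size_per_expert': hidden_size},
    ).cuda()

    x = torch.randn(4, 1024, model_dim, device='cuda')  # [batch, seq, model_dim]
    y = moe_layer(x)                                     # same shape as x

The layer behaves like any other nn.Module, so it can replace the feed-forward
block of an existing transformer layer with minimal code changes.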


MOE-SPECIFIC OPTIMIZATION FOR COMPUTATION

Because efficient implementations have been lacking, MoE-based DNN models rely
on a naive combination of multiple off-the-shelf DNN operators provided by deep
learning frameworks such as PyTorch and TensorFlow to compose the MoE
computation. Such a practice incurs significant performance overhead due to
redundant computation. Tutel designs and implements multiple highly optimized
GPU kernels to provide operators for MoE-specific calculation. For example,
Tutel reduces the time complexity of dispatching “gating output” from O(N^3) to
O(N^2), which significantly improves the data dispatching efficiency. Tutel also
implements a fast cumsum-minus-one operator, achieving a 24x speedup compared
with the fairseq implementation. Tutel also leverages NVRTC, a runtime
compilation library for CUDA C++, to further optimize the customized MoE kernel
just-in-time. Figure 1 shows the comparison results of Tutel with fairseq on the
Azure NDm A100 v4 platform, where—as mentioned above—a single MoE layer with
Tutel achieves an 8.49x speedup on 8 A100 GPUs and a 2.75x speedup on 512 A100
GPUs.
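
The cumsum-minus-one operator mentioned above is an exclusive cumulative sum
over the routing mask: for each expert, it tells every token how many earlier
tokens were routed to that same expert, which is exactly the token's slot in
that expert's capacity buffer. The snippet below shows the semantics with plain
PyTorch ops for a tiny top-1 example (the helper name and sizes are ours);
Tutel replaces this kind of composition with a fused GPU kernel.

    import torch
    import torch.nn.functional as F

    def cumsum_minus_one(mask):
        """Exclusive cumulative sum along the token dimension: for each expert
        column, the value at token t is the number of earlier tokens routed to
        that expert, i.e. token t's slot in that expert's buffer."""
        return torch.cumsum(mask, dim=0) - 1

    # One-hot routing mask: 6 tokens routed among 3 experts (top-1 for brevity).
    expert_idx = torch.tensor([0, 2, 0, 1, 2, 0])
    mask = F.one_hot(expert_idx, num_classes=3)       # [tokens, experts]

    positions = cumsum_minus_one(mask)                # slot of each token per expert
    slots = positions[torch.arange(6), expert_idx]
    print(slots)                                      # tensor([0, 0, 1, 0, 1, 2])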


UNDERLYING ALL-TO-ALL COMMUNICATION OPTIMIZATION ON AZURE NDM A100 V4 CLUSTERS

Tutel also optimizes the all-to-all collective communication for large-scale MoE
training on Azure NDm A100 v4 clusters, including CPU-GPU binding and adaptive
routing (AR) tuning. Proper CPU-GPU binding on a multi-NUMA (non-uniform memory
access) system, especially on the NDm A100 v4 nodes, is critical for all-to-all
performance. Unfortunately, existing machine learning frameworks have
not provided an efficient all-to-all communication library, resulting in
performance regression for large-scale distributed training. Tutel optimizes the
binding automatically and provides an elegant interface for user fine-tuning.
Furthermore, Tutel leverages multipath technology, namely AR, on NDm A100 v4
clusters. For the all-to-all communication in MoE, the total data traffic size
of the communication for each GPU doesn't change, but the data size between each
GPU pair becomes smaller as the number of GPUs increases. These smaller messages
incur a larger relative overhead in the all-to-all communication, leading to
poorer MoE training performance. By taking advantage of AR technology available
on NDm A100
v4 clusters, Tutel improves communication efficiency for groups of small
messages and provides high-performance all-to-all communication on NDm A100 v4
systems. Benefiting from CPU-GPU binding and AR tuning, Tutel achieves a 2.56x
to 5.93x all-to-all speedup with 512 A100 GPUs for message sizes hundreds of MiB
large, which are typically used in MoE training, as illustrated in Figure 2.
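
The communication pattern being tuned here is the all-to-all exchange that
ships each GPU's gated tokens to the GPUs hosting the selected experts and then
brings the expert outputs back. Below is a minimal sketch of that pattern using
torch.distributed; the one-expert-per-GPU layout, equal-sized capacity buffers,
and helper name are assumptions for illustration, not Tutel's internal code.

    # Minimal sketch of the MoE all-to-all pattern with torch.distributed.
    # Assumes one expert per GPU and equal-sized capacity buffers per peer.
    import torch
    import torch.distributed as dist

    def moe_all_to_all(dispatched, group=None):
        """dispatched: [world_size, capacity, model_dim] -- the tokens this rank
        routed to each peer's expert. Returns the tokens every peer routed to
        this rank's local expert, with the same shape."""
        received = torch.empty_like(dispatched)
        dist.all_to_all_single(received, dispatched, group=group)
        return received

    # Typical use inside an MoE layer (after gating, around the expert FFN):
    #   remote_tokens = moe_all_to_all(dispatched)   # scatter tokens to experts
    #   expert_out    = local_expert(remote_tokens)  # compute on the local expert
    #   combined      = moe_all_to_all(expert_out)   # gather results back

As the number of GPUs grows, each pairwise chunk shrinks (total traffic per GPU
is fixed but split across more peers), which is exactly the small-message
regime that CPU-GPU binding and adaptive routing target.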

Figure 1 (left): Compared to fairseq, for a single MoE layer, Tutel achieves an
8.49x speedup on an NDm A100 v4 node with 8 GPUs and a 2.75x speedup on 64 NDm
A100 v4 nodes with 512 A100 GPUs. The detailed setting is as follows: batch_size
= 32, sequence_length = 1,024, Top_K = 2, model_dim = 2,048, and hidden_size =
2,048.

Figure 2 (right): The all-to-all bandwidth for different message sizes with 64
NDm A100 v4 nodes (512 A100 GPUs) before and after applying Tutel. Tutel
achieves a 2.56x to 5.93x all-to-all speedup with 512 A100 GPUs for message
sizes hundreds of MiB large.


DIVERSE AND FLEXIBLE MOE ALGORITHMS SUPPORT

Tutel provides diverse and flexible support for state-of-the-art MoE algorithms,
including support for:

 * an arbitrary K setting for the Top-K gating algorithm (most implementations
   only support Top-1 and Top-2)
 * different exploration strategies, including batch-prioritized routing, input
   dropout, and input jitter
 * different levels of precision, including half precision (FP16), full
   precision (FP32), and mixed precision (we'll support BF16 in our next
   release)
 * different types of devices, including both NVIDIA CUDA and AMD ROCm devices

Tutel will be actively integrating various emerging MoE algorithms from the
open-source community.
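
As a generic illustration of two of the options above, arbitrary-k gating and
input jitter, the sketch below perturbs the gate input with multiplicative
noise during training and then selects the top k experts per token for any k.
This is a simplified rendition of the published algorithms, not Tutel's code;
the function name and the jitter epsilon are assumptions.

    import torch
    import torch.nn.functional as F

    def gate_with_jitter(x, gate_weight, k=4, jitter_eps=1e-2, training=True):
        """Top-k gating with input jitter: multiply the gate input by uniform
        noise in [1 - eps, 1 + eps] during training to encourage exploration,
        then pick the k highest-scoring experts per token (k is arbitrary)."""
        if training and jitter_eps > 0:
            noise = torch.empty_like(x).uniform_(1.0 - jitter_eps, 1.0 + jitter_eps)
            x = x * noise
        scores = F.softmax(x @ gate_weight, dim=-1)     # [tokens, num_experts]
        topk_scores, topk_idx = scores.topk(k, dim=-1)  # any k <= num_experts
        return topk_scores, topk_idx

    gate_weight = torch.randn(2048, 512)                # model_dim x num_experts
    tokens = torch.randn(16, 2048)
    scores, experts = gate_with_jitter(tokens, gate_weight, k=4)
    print(experts.shape)                                # torch.Size([16, 4])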



INTEGRATING TUTEL WITH META’S MOE LANGUAGE MODEL

Meta made its MoE language model open source and uses fairseq for its MoE
implementation. We worked with Meta to integrate Tutel into the fairseq toolkit.
Meta has been using Tutel to train its large language model, which has an
attention-based neural architecture similar to GPT-3, on Azure NDm A100 v4. We
use Meta’s language model to evaluate the end-to-end performance of Tutel. The
model has 32 attention layers, each with 32 x 128-dimension heads. Every two
layers contain one MoE layer, and each GPU hosts one expert. Table 1 summarizes
the detailed parameter settings of the model, and Figure 3 shows the 40 percent
speedup Tutel achieves. As the number of GPUs increases, the gain from Tutel
decreases from 131 percent with 8 A100 GPUs to 40 percent with 512 A100 GPUs
because the all-to-all communication becomes the bottleneck. We'll do further
optimization in the next version.

Configuration            Setting
code branch              moe-benchmark
Git commit ID            1ef1612
decoder-layers           32
Arch                     transformer_lm_gpt
decoder-attention-heads  32
Criterion                moe_cross_entropy
decoder-embed-dim        4096
moe-freq                 2
decoder-ffn-embed-dim    16384
moe-expert-count         512
tokens-per-sample        1024
moe-gating-use-fp32      True
Batch-size               24
Optimizer                Adam
vocabulary size          51200
fp16-adam-stats          True

Table 1: Configuration for MoE language model with 512 A100 (80G) GPUs
Figure 3: For end-to-end performance, Tutel achieves a more than 40 percent
speedup with 64 NDm A100 v4 nodes for Meta’s 1.1 trillion–parameter MoE language
model.
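
To illustrate the layer layout described above and in Table 1 (32 decoder
layers with moe-freq = 2, so every second layer's feed-forward block is an MoE
layer with one expert per GPU), here is a hedged structural sketch; DenseFFN is
a placeholder standing in for fairseq's actual feed-forward module, and
make_moe_ffn would construct the MoE block.

    import torch.nn as nn

    class DenseFFN(nn.Module):
        """Placeholder for a standard transformer feed-forward block
        (decoder-embed-dim = 4096, decoder-ffn-embed-dim = 16384 in Table 1)."""
        def __init__(self, model_dim=4096, ffn_dim=16384):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(model_dim, ffn_dim), nn.GELU(),
                                     nn.Linear(ffn_dim, model_dim))
        def forward(self, x):
            return self.net(x)

    def build_decoder_ffns(num_layers=32, moe_freq=2, make_moe_ffn=DenseFFN):
        """Every moe_freq-th layer gets an MoE feed-forward block; the rest stay
        dense. make_moe_ffn stands in for the real MoE layer constructor."""
        ffns = nn.ModuleList()
        for layer_id in range(num_layers):
            if (layer_id + 1) % moe_freq == 0:
                ffns.append(make_moe_ffn())   # MoE FFN, e.g. one expert per GPU
            else:
                ffns.append(DenseFFN())       # standard dense FFN
        return ffns

    layout = build_decoder_ffns()             # 16 MoE positions out of 32 layers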


THE PROMISE OF MOE

MoE is a promising technology. It enables holistic training based on techniques
from many areas, such as systematic routing and network balancing with massive
nodes, and can even benefit from GPU-based acceleration. We demonstrate an
efficient MoE implementation, Tutel, that delivers significant gains over the
fairseq framework. Tutel has also been integrated into the DeepSpeed framework,
and we believe that Tutel and related integrations will benefit Azure services,
especially for those who want to scale their large models efficiently. As MoE
is still in its early stages and more effort is needed to realize its full
potential, Tutel will continue evolving and bringing us more exciting results.


ACKNOWLEDGMENT

The research behind Tutel was conducted by a team of researchers from across
Microsoft, including Wei Cui, Zilong Wang, Yifan Xiong, Guoshuai Zhao, Fan Yang,
Peng Cheng, Yongqiang Xiong, Mao Yang, Lidong Zhou, Rafael Salas, Jithin Jose,
Kushal Datta, Prabhat Ram, and Joe Chau.


