
You've never heard anything like it


NVIDIA’S NEW AI AUDIO MODEL CAN SYNTHESIZE SOUNDS THAT HAVE NEVER EXISTED

What does a screaming saxophone sound like? The Fugatto model has an answer...

Kyle Orland – 25 Nov 2024 13:40
An audio wave can contain so much. An angry cello, for instance... Credit: Getty Images

At this point, anyone who has been following AI research is long familiar with
generative models that can synthesize speech or melodic music from nothing but
text prompting. Nvidia's newly revealed "Fugatto" model looks to go a step
further, using new synthetic training methods and inference-level combination
techniques to "transform any mix of music, voices, and sounds," including the
synthesis of sounds that have never existed.

While Fugatto isn't available for public testing yet, a sample-filled website
showcases how the model can dial a number of distinct audio traits and
descriptions up or down, resulting in everything from the sound of saxophones
barking to people speaking underwater to ambulance sirens singing in a kind of
choir. While the results can be a bit hit or miss, the vast array of
capabilities on display helps support Nvidia's description of Fugatto as "a
Swiss Army knife for sound."


YOU’RE ONLY AS GOOD AS YOUR DATA

In an explanatory research paper, over a dozen Nvidia researchers explain the
difficulty in crafting a training dataset that can "reveal meaningful
relationships between audio and language." While standard language models can
often infer how to handle various instructions from the text-based data itself,
it can be hard to generalize descriptions and traits from audio without more
explicit guidance.



To that end, the researchers start by using an LLM to generate a Python script
that can create a large number of template-based and free-form instructions
describing different audio "personas" (e.g., "standard, young-crowd,
thirty-somethings, professional"). They then generate a set of both absolute
(e.g., "synthesize a happy voice") and relative (e.g., "increase the happiness
of this voice") instructions that can be applied to those personas.



The wide array of open source audio datasets used as the basis for Fugatto
generally don't have these kinds of trait measurements embedded in them by
default. But the researchers make use of existing audio understanding models to
create "synthetic captions" for their training clips based on their prompts,
creating natural language descriptions that can automatically quantify traits
such as gender, emotion, and speech quality. Audio processing tools are also
used to describe and quantify training clips on a more acoustic level (e.g.,
"fundamental frequency variance" or "reverb").




For relational comparisons, the researchers rely on datasets where one factor is
held constant while another changes, such as different emotional readings of the
same text or different instruments playing the same notes. By comparing enough
of these paired samples, the model can start to learn what kinds of audio
characteristics tend to appear in "happier" speech, for instance, or what
differentiates the sound of a saxophone from that of a flute.
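
As a concrete illustration of that pairing scheme, the sketch below holds the
spoken text constant while the emotion label changes. The metadata format and
instruction wording are hypothetical:

from collections import defaultdict
from itertools import combinations

# Hypothetical metadata: the same sentence read with different emotions,
# mirroring the "hold one factor constant" datasets described above.
clips = [
    {"file": "s1_neutral.wav", "text_id": "s1", "emotion": "neutral"},
    {"file": "s1_happy.wav",   "text_id": "s1", "emotion": "happy"},
    {"file": "s2_neutral.wav", "text_id": "s2", "emotion": "neutral"},
    {"file": "s2_sad.wav",     "text_id": "s2", "emotion": "sad"},
]

def relative_pairs(clips):
    """Yield (clip_a, clip_b, instruction) triples where the text is held
    constant and only the emotion changes."""
    by_text = defaultdict(list)
    for c in clips:
        by_text[c["text_id"]].append(c)
    for group in by_text.values():
        for a, b in combinations(group, 2):
            if a["emotion"] != b["emotion"]:
                yield a, b, f"increase the {b['emotion']} quality of this voice"

for a, b, inst in relative_pairs(clips):
    print(a["file"], "->", b["file"], ":", inst)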

After running a variety of different open source audio collections through this
process, the researchers ended up with a heavily annotated dataset of 20 million
separate samples representing at least 50,000 hours of audio. From there, a set
of 32 Nvidia tensor cores was used to create a model with 2.5 billion parameters
that started to show reliable scores on a variety of audio quality tests.


IT’S ALL IN THE MIX


OK, Fugatto, can we get a little more barking and a little less saxophone in the
monitors? Credit: Getty Images

Beyond the training, Nvidia is also talking up Fugatto's "ComposableART" system
(for "Audio Representation Transformation"). When provided with a prompt in text
and/or audio, this system can use "conditional guidance" to "independently
control and generate (unseen) combinations of instructions and tasks" and
generate "highly customizable audio outputs outside the training distribution."
In other words, it can combine different traits from its training set to create
entirely new sounds that have never been heard before.

I won't pretend to understand all of the complex math described in the
paper—which involves a "weighted combination of vector fields between
instructions, frame indices and models." But the end results, as shown in
examples on the project's webpage and in an Nvidia trailer, highlight how
ComposableART can be used to create the sound of, say, a violin that "sounds
like a laughing baby or a banjo that's playing in front of gentle rainfall" or
"factory machinery that screams in metallic agony." While some of these examples
are more convincing to our ears than others, the fact that Fugatto can take a
decent stab at these kinds of combinations at all is a testament to the way the
model characterizes and mixes extremely disparate audio data from multiple
different open source data sets.
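
The general shape of this kind of guidance will be familiar from diffusion
models, though. The toy sketch below blends per-instruction outputs in the
spirit of classifier-free guidance; the model stand-in, condition names, and
weights are our own illustration, not Nvidia's formulation or API.

import numpy as np

def combined_vector_field(x, t, model, conds, weights, uncond):
    """Blend per-instruction model outputs with user-chosen weights,
    in the spirit of classifier-free guidance. model(x, t, cond) is a
    stand-in for a denoising/flow network, not Nvidia's actual API."""
    v_uncond = model(x, t, uncond)
    v = np.array(v_uncond, dtype=float)
    for cond, w in zip(conds, weights):
        # Each instruction contributes its own "direction," scaled by w.
        v += w * (model(x, t, cond) - v_uncond)
    return v

# Toy stand-in: each condition simply pulls x toward a fixed target value.
targets = {"none": 0.0, "violin": 1.0, "laughing baby": -1.0}
def toy_model(x, t, cond):
    return targets[cond] - x

x = np.zeros(4)
v = combined_vector_field(x, 0.5, toy_model, ["violin", "laughing baby"], [0.7, 0.3], "none")
print(v)  # weighted blend: 0.7 toward "violin," 0.3 toward "laughing baby"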




Perhaps the most interesting part of Fugatto is the way it treats each
individual audio trait as a tunable continuum, rather than a binary. For an
example that melds the sound of an acoustic guitar and running water, for
instance, the result ends up very different when either the guitar or the water
is weighted more heavily in Fugatto's interpolated mix. Nvidia also mentions
examples of tuning a French accent to be heavier or lighter, or varying the
"degree of sorrow" inherent in a spoken clip.

Beyond tuning and combining different audio traits, Fugatto can also perform the
kinds of audio tasks we've seen in previous models, like changing the emotion in
a piece of spoken text or isolating the vocal track in a piece of music. Fugatto
can also detect individual notes in a piece of MIDI music and replace them with
a variety of vocal performances, or detect the beat of a piece of music and add
effects from drums to barking dogs to ticking clocks in a way that matches the
rhythm.
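
The beat-matching half of that trick rests on ordinary beat detection. As a
crude stand-in for Fugatto's pipeline, the sketch below uses librosa's beat
tracker to drop a click on every detected beat:

import librosa

def clicks_on_beats(path):
    """Detect the beat of a track and overlay a click on each beat,
    a simple stand-in for adding rhythm-matched effects."""
    y, sr = librosa.load(path, mono=True)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    clicks = librosa.clicks(times=beat_times, sr=sr, length=len(y))
    return y + clicks, sr, tempo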


Fugatto's generated audio (magenta) matches the melody of an input MIDI file
(cyan) very closely. Credit: Nvidia Research

While the researchers describe Fugatto as just the first step "towards a future
where unsupervised multitask learning emerges from data and model scale," Nvidia
is already talking up use cases from song prototyping to dynamically changing
video game scores to international ad targeting. But Nvidia was also quick to
highlight that models like Fugatto are best seen as a new tool for audio artists
rather than a replacement for their creative talents.

"The history of music is also a history of technology," Nvidia Inception
participant and producer/songwriter Ido Zmishlany said in Nvidia's blog post.
"The electric guitar gave the world rock and roll. When the sampler showed up,
hip-hop was born. With AI, we’re writing the next chapter of music. We have a
new instrument, a new tool for making music—and that’s super exciting."


Kyle Orland, Senior Gaming Editor
Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012,
writing primarily about the business, tech, and culture behind video games. He
has journalism and computer science degrees from the University of Maryland. He
once wrote a whole book about Minesweeper.