arstechnica.com
urlscan Public Scan (18.221.222.160)
URL: https://arstechnica.com/ai/2024/11/nvidias-new-ai-audio-model-can-synthesize-sounds-that-have-never-existed/?mc_cid=6089...
Submission: On November 26 via api from RU — Scanned from CA
Form analysis: 1 form found in the DOM (POST, action ".")
Text Content
You've never heard anything like it

Nvidia's new AI audio model can synthesize sounds that have never existed

What does a screaming saxophone sound like? The Fugatto model has an answer...

Kyle Orland – 25 Nov 2024 13:40

An audio wave can contain so much. An angry cello, for instance... Credit: Getty Images

At this point, anyone who has been following AI research is long familiar with generative models that can synthesize speech or melodic music from nothing but text prompting. Nvidia's newly revealed "Fugatto" model looks to go a step further, using new synthetic training methods and inference-level combination techniques to "transform any mix of music, voices, and sounds," including the synthesis of sounds that have never existed.

While Fugatto isn't available for public testing yet, a sample-filled website showcases how it can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. While the results can be a bit hit or miss, the vast array of capabilities on display helps support Nvidia's description of Fugatto as "a Swiss Army knife for sound."

YOU'RE ONLY AS GOOD AS YOUR DATA

In an explanatory research paper, over a dozen Nvidia researchers explain the difficulty in crafting a training dataset that can "reveal meaningful relationships between audio and language." While standard language models can often infer how to handle various instructions from the text-based data itself, it can be hard to generalize descriptions and traits from audio without more explicit guidance.

To that end, the researchers start by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio "personas" (e.g., "standard, young-crowd, thirty-somethings, professional"). They then generate a set of both absolute (e.g., "synthesize a happy voice") and relative (e.g., "increase the happiness of this voice") instructions that can be applied to those personas.

The wide array of open source audio datasets used as the basis for Fugatto generally don't have these kinds of trait measurements embedded in them by default. But the researchers make use of existing audio understanding models to create "synthetic captions" for their training clips based on their prompts: natural language descriptions that can automatically quantify traits such as gender, emotion, and speech quality. Audio processing tools are also used to describe and quantify training clips on a more acoustic level (e.g., "fundamental frequency variance" or "reverb").
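The paper doesn't ship this annotation pipeline as public code, but the acoustic-level measurements it mentions are the sort of thing off-the-shelf audio tools can already compute. Purely as a rough, hypothetical sketch (not Nvidia's implementation), here is how a clip could be turned into a caption-style fragment using the open source librosa library; the file name and the caption phrasing are invented for illustration:

# Illustrative only: compute simple acoustic descriptors of the sort the
# researchers append to synthetic captions. librosa is a real open source
# audio library; the clip path and caption wording are made-up examples.
import numpy as np
import librosa

def acoustic_caption_fragment(path: str) -> str:
    y, sr = librosa.load(path, sr=None)            # load clip at its native sample rate
    f0, _, _ = librosa.pyin(                       # per-frame fundamental-frequency track
        y, sr=sr,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
    )
    f0 = f0[~np.isnan(f0)]                         # keep only voiced frames
    f0_variance = float(np.var(f0)) if f0.size else 0.0
    rms = librosa.feature.rms(y=y)[0]              # crude loudness contour
    loudness_range = float(rms.max() - rms.min())
    return (f"fundamental frequency variance: {f0_variance:.1f} Hz^2; "
            f"loudness range: {loudness_range:.3f}")

print(acoustic_caption_fragment("training_clip.wav"))   # hypothetical clip

In the pipeline the paper describes, fragments like this would sit alongside the traits inferred by audio understanding models in each clip's synthetic caption.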
For relational comparisons, the researchers rely on datasets where one factor is held constant while another changes, such as different emotional readings of the same text or different instruments playing the same notes. Across a large enough set of such comparisons, the model can start to learn what kinds of audio characteristics tend to appear in "happier" speech, for instance, or to differentiate the sound of a saxophone from that of a flute.

After running a variety of open source audio collections through this process, the researchers ended up with a heavily annotated dataset of 20 million separate samples representing at least 50,000 hours of audio. From there, a set of 32 Nvidia tensor cores was used to create a model with 2.5 billion parameters that started to show reliable scores on a variety of audio quality tests.

IT'S ALL IN THE MIX

OK, Fugatto, can we get a little more barking and a little less saxophone in the monitors? Credit: Getty Images

Beyond the training, Nvidia is also talking up Fugatto's "ComposableART" system (for "Audio Representation Transformation"). When provided with a prompt in text and/or audio, this system can use "conditional guidance" to "independently control and generate (unseen) combinations of instructions and tasks" and produce "highly customizable audio outputs outside the training distribution." In other words, it can combine different traits from its training set to create entirely new sounds that have never been heard before.

I won't pretend to understand all of the complex math described in the paper—which involves a "weighted combination of vector fields between instructions, frame indices and models." But the end results, as shown in examples on the project's webpage and in an Nvidia trailer, highlight how ComposableART can be used to create the sound of, say, a violin that "sounds like a laughing baby or a banjo that's playing in front of gentle rainfall," or "factory machinery that screams in metallic agony." While some of these examples are more convincing to our ears than others, the fact that Fugatto can take a decent stab at these kinds of combinations at all is a testament to the way the model characterizes and mixes extremely disparate audio data from multiple open source datasets.

Perhaps the most interesting part of Fugatto is the way it treats each individual audio trait as a tunable continuum rather than a binary. In one example that melds the sound of an acoustic guitar and running water, for instance, the result ends up very different depending on whether the guitar or the water is weighted more heavily in Fugatto's interpolated mix (a toy sketch of this weighting idea follows the figure below). Nvidia also mentions examples of tuning a French accent to be heavier or lighter, or varying the "degree of sorrow" inherent in a spoken clip.

Beyond tuning and combining different audio traits, Fugatto can also perform the kinds of audio tasks we've seen in previous models, like changing the emotion in a piece of spoken text or isolating the vocal track in a piece of music. Fugatto can also detect individual notes in a piece of MIDI music and replace them with a variety of vocal performances, or detect the beat of a piece of music and add effects from drums to barking dogs to ticking clocks in a way that matches the rhythm.

Fugatto's generated audio (magenta) matches the melody of an input MIDI file (cyan) very closely. Credit: Nvidia Research
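Nvidia hasn't released Fugatto or ComposableART, so the specifics of that weighted combination remain opaque. Purely as a toy illustration of the general weighting idea, the sketch below sums a stand-in model's per-condition predictions with user-chosen weights at each step of a tiny sampling loop; predict_velocity, the condition names, and the weights are all invented for illustration, not Nvidia's implementation.

# Toy illustration only: weight and sum a model's per-condition predictions at
# each sampling step so traits can be blended or dialled up and down.
# predict_velocity is a made-up stand-in, NOT Fugatto's actual network.
import numpy as np

def predict_velocity(x: np.ndarray, t: float, condition: str) -> np.ndarray:
    """Stand-in for a generative audio model's conditional prediction."""
    rng = np.random.default_rng(abs(hash((condition, round(t, 3)))) % (2**32))
    return rng.standard_normal(x.shape) * 0.1 - x * t   # dummy dynamics

def guided_step(x, t, dt, weighted_conditions):
    # Weighted combination of per-condition predictions (the guidance idea).
    velocity = sum(w * predict_velocity(x, t, c)
                   for c, w in weighted_conditions.items())
    return x + dt * velocity

x = np.random.default_rng(0).standard_normal(256)         # toy latent audio frame
weights = {"acoustic guitar": 0.7, "running water": 0.3}   # dial either trait up or down
for step in range(10):                                     # tiny Euler-style sampling loop
    x = guided_step(x, t=step / 10, dt=0.1, weighted_conditions=weights)
print("blended latent mean/std:", round(float(x.mean()), 3), round(float(x.std()), 3))

Sliding the "acoustic guitar" weight toward 1.0 and "running water" toward 0.0, or the reverse, is the kind of continuum-style control the article describes; in the real system those weights would steer a trained audio model rather than a dummy function.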
While the researchers describe Fugatto as just the first step "towards a future where unsupervised multitask learning emerges from data and model scale," Nvidia is already talking up use cases from song prototyping to dynamically changing video game scores to international ad targeting. But Nvidia was also quick to highlight that models like Fugatto are best seen as a new tool for audio artists rather than a replacement for their creative talents.

"The history of music is also a history of technology," Nvidia Inception participant and producer/songwriter Ido Zmishlany said in Nvidia's blog post. "The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born. With AI, we're writing the next chapter of music. We have a new instrument, a new tool for making music—and that's super exciting."

Kyle Orland, Senior Gaming Editor

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.