The Millions of Songs Mashed Into AI-Generated Music - The Atlantic
Last November, a pair of Olympic-bound figure skaters performed in a competition to a song with lyrics that sounded oddly familiar. “Every night we smash a Mercedes-Benz,” the singer began. It was one of several recognizable lines from the 1998 pop hit “You Get What You Give,” by the New Radicals. But the ice dancers’ song was otherwise different. The New Radicals’ message to angsty teenagers had been converted to Bon Jovi–style arena rock. If you knew “You Get What You Give,” this was a pretty strange variation on it.
The dancers had used music generated by AI. Whatever model was involved had likely been trained on “You Get What You Give” and had copied some of the song’s content, as AI systems are prone to do. Such systems don’t always reproduce elements of existing songs in this way, but you’ll hear it now and then, and sometimes even more blatantly. Suno, one of the most popular AI music generators, for example, has pumped out tracks that strongly resemble Michael Jackson’s “Thriller,” Ed Sheeran’s “Shape of You,” Chuck Berry’s “Johnny B. Goode,” Bill Haley & His Comets’ “Rock Around the Clock,” B. B. King’s “The Thrill Is Gone,” and others. Listen to Michael Jackson’s song alongside a Suno-generated track titled “Thriller”:
“Thriller,” by Michael Jackson. Released on November 29, 1982.
“Thriller,” generated by Suno. Prompt: "post-disco, pop-rock, funk, electronic, r&b, thriller, motown, famous male singer and dancer, king of pop, falsetto"
(“Thriller” is just one of the dozens of examples provided by the major record labels in a lawsuit against Suno. You can hear two others below. Rachel Racusen, a spokesperson for Suno, told me that the platform uses “safeguards to protect against unauthorized distribution, impersonation and manipulations,” and directed me to a LinkedIn post by the company’s chief product officer saying that reproductions of training data “should not happen.” Racusen did not answer questions about the lawsuit or acknowledge any specific tracks that were used to train their models.)
Cases like these indicate something about how AI-based music products work. AI music generators can simulate human performances with surprising fidelity, but first they have to be trained on enormous quantities of those human performances. The actual recordings that go into any model are a closely guarded secret—AI companies have claimed they are proprietary—but the number of songs is almost certainly huge, spanning genres and time periods.
As part of my series of investigations into AI training data, I recently discovered four giant datasets of songs that are being shared within the AI-development community. One has 12 million tracks. Another has 9 million. The two smaller datasets each have more than 100,000. They include hits from major pop artists such as Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, Elvis Costello, Sheryl Crow, and the Beatles. (The New Radicals’ “You Get What You Give” is in two of the datasets.) Jazz artists such as Miles Davis, John Zorn, and Vijay Iyer are featured, as are classical composers and tens of thousands of minor artists across genres. The 12-million-track dataset, on its own, would take 91 years to listen to.
These datasets are only four examples of the many sources available to AI developers. I found them by reading research papers published by developers and scouring AI data-sharing sites. The datasets have been downloaded thousands of times. Google has written about using one of them—more than 100,000 songs downloaded from the Free Music Archive, a site that allows free streaming for personal listening but requires payments for commercial use—to train AI models, and Stability has used some songs from the same dataset. But because of the industry’s secrecy around training data, we don’t currently know who has used the others.
What the datasets illustrate, primarily, is the scale and variety of music easily available to AI developers. Companies often claim to use only content that is freely available online, but the datasets reveal the quantity of downloadable music that developers can access even though it is not supposed to be free.
Three of the datasets I found are distributed as a list of links to songs on YouTube or Spotify. AI developers download the actual audio using tools that automate the job, some of which allow developers to bypass logins, advertisements, and mechanisms that might earn money or subscribers for creators. Such tools violate the terms of service of these platforms. (The fourth dataset, the Free Music Archive collection, is distributed with MP3s.)
The datasets are similar in size to those that companies have used to train commercial music-generating models. In 2022, Google trained a model on 44 million tracks, totaling 42 years of music. Suno wrote in a 2024 court filing that it trained its models on “essentially all...