Why Generative Music Is Hard

Friday, July 5th, 2024

Advancements in generative AI for text, images, video, sound, and music have enabled us to create impressive content from relatively simple text prompts. However, I’m sometimes asked why text-to-music seems to have lagged behind the others in terms of quality, fidelity, or utility. This disparity is particularly interesting given that music data is often smaller and more easily quantized than image or video data, and AI has been actively used in music production since at least the 1950s.

Of course, several very impressive text-to-music products have recently emerged that provide a valid rebuttal, but none have quite managed to achieve the same level of quality or popularity as their counterparts in other media. This perception is probably influenced by several factors, such as the unprecedented demand and pace of innovation with generative text and images, or the challenges of the auditory uncanny valley. But text-to-music also has a unique and longer-standing obstacle: taxonomy.

We have a rich and consistent way of classifying certain visual and auditory information – colors, shapes, textures, and sounds – but not music. At least, not with natural language and broadly understood terminology.

From an early age we engage in visual identification exercises that support language development and enable us to describe our environment in a relatively objective and natural manner. If I were to ask someone with average artistic abilities to draw a large gray elephant standing in a field of yellow flowers on a sunny day, and later show their drawing to someone else and ask them to describe it, they would likely reply with something similar to the original prompt. Replicating this exercise with music, however, even with experienced composers, would yield far less consistent or predictable results.

The world has spent decades digitizing and labeling an enormously large and robust corpus of images and videos, but we’re nowhere near the same level with music. Our taxonomic deficiency means that, compared to other generative media, text-to-music models are harder to train to comparable competency (more limited labeled data) and harder for the average person to use (limited descriptive vocabulary).

Text-to-music is great because it’s approachable. For scenarios where only broad musical direction is needed – setting the mood, tempo, or general instrumentation – a refined taxonomy probably isn’t required. But when you want granular control over things like the shape of the melody, automation, or refinement, you need a multimodal AI.

Multimodal music models can enable a user to demonstrate their musical intent through a combination of text, notes, and audio input, and receive notes or audio as output. The user is not limited by what they can describe through text, and they can iteratively tweak precise portions (regions, tracks, instruments, etc.) of their generated song using the AI or their DAW. Text input remains useful to quickly sketch ideas and experiment, but is complemented by note or audio input to help refine and build out a specific song. These models can also perform style matching and transfer, with added configurability through text. Convert a single track or instrument into a multi-instrument composition using the style of another work. Hum a tune and let AI transform it into an epic orchestral piece without losing your melody or rhythm. Take a sad slow song and revise it into an upbeat cheerful one.
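
To make that a bit more concrete, here is a minimal sketch of what a multimodal request could look like. Every name in it (the dataclasses, fields, and the stubbed `generate()` call) is hypothetical and exists only to show text, notes, and audio being combined and then iteratively refined; it does not describe any particular product’s API.

```python
from dataclasses import dataclass, field

# Hypothetical request shape for a multimodal music model (illustrative only).

@dataclass
class NoteEvent:
    pitch: int         # MIDI note number (60 = middle C)
    start_beat: float  # position in beats
    duration: float    # length in beats
    velocity: int = 96

@dataclass
class GenerationRequest:
    text_prompt: str                                         # broad direction: mood, tempo, instrumentation
    melody: list[NoteEvent] = field(default_factory=list)    # precise musical intent as notes
    reference_audio: bytes | None = None                     # e.g. a hummed tune or a style reference
    regions_to_regenerate: list[tuple[float, float]] = field(default_factory=list)  # (start, end) in beats
    tracks_to_keep: list[str] = field(default_factory=list)  # tracks left untouched while iterating

def generate(request: GenerationRequest) -> dict:
    """Stand-in for a multimodal model call that returns new notes/audio per track."""
    # A real model would condition on all provided modalities at once;
    # this stub just echoes the request so the example runs.
    return {"kept": request.tracks_to_keep, "regenerated": request.regions_to_regenerate}

# First pass: sketch the idea from text alone.
draft = generate(GenerationRequest(text_prompt="slow, warm jazz trio in Bb, ~80 BPM"))

# Second pass: keep the bass and drums, pin the melody with explicit notes,
# and regenerate only a specific region.
refined = generate(GenerationRequest(
    text_prompt="same feel, but brighter piano voicings",
    melody=[NoteEvent(pitch=70, start_beat=32.0, duration=2.0)],
    regions_to_regenerate=[(32.0, 64.0)],
    tracks_to_keep=["bass", "drums"],
))
```

The point of the second call is the workflow, not the API: text sets the broad direction, notes pin down the parts the user cares about, and only the selected region gets regenerated.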

But going further, a real-time multimodal music generator would enable live performances with auto-accompaniment and improvisation. A solo jazz pianist could perform with an AI bassist and drummer that riff along with the chord progression. AI session musicians could fill out a band or orchestra to help quickly improve texture or scale, or simply try out ideas. And while we’re at it, real-time AI vocals would also be nice.
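
As a rough illustration of the real-time case, here is a hypothetical accompaniment loop: listen to the soloist, keep a rolling window of recent chords, and generate the next beat’s bass and drums before the beat boundary. The chord detection, the model call, and the per-beat timing budget are all stand-ins, not a description of any existing system.

```python
import time
from collections import deque

# Hypothetical real-time accompaniment loop (illustrative assumptions throughout).

BPM = 120
BEAT_SECONDS = 60 / BPM
recent_chords: deque[str] = deque(maxlen=8)  # rolling harmonic context

def detect_chord() -> str:
    """Stand-in for chord detection on the live pianist's audio or MIDI."""
    return "Dm7"  # placeholder so the loop runs

def improvise_accompaniment(context: list[str]) -> dict:
    """Stand-in for the model: returns bass and drum events for the next beat."""
    return {"bass": context[-1], "drums": "swing"}

def play(events: dict) -> None:
    print(f"accompaniment -> {events}")

# Each beat: listen, update context, and generate the next beat's accompaniment
# before the beat boundary so the AI players stay in time with the soloist.
for _ in range(4):  # run four beats for the example
    beat_start = time.monotonic()
    recent_chords.append(detect_chord())
    play(improvise_accompaniment(list(recent_chords)))
    # Sleep off whatever is left of the beat (a real system would schedule audio instead).
    time.sleep(max(0.0, BEAT_SECONDS - (time.monotonic() - beat_start)))
```

The hard part in practice is the timing budget: everything inside the loop has to finish well within a beat, which is why real-time accompaniment is a stricter problem than offline generation.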

Music AI has been around for a very long time. While it may seem like modern generative music is off to a slow start, the race is far from over. Projects like Suno and Soundraw are raising the bar for text-to-music, and industry-leading companies like Native Instruments, Cinesamples, Audialab, Output, and Spotify have exciting projects in development focused on all of the areas we’ve discussed. Oh, and you can also check out my own tiny music AI :-)