MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

K R Prajwal*, Bowen Shi*, Matthew Le, Apoorv Vyas, Andros Tjandra, Mahi Luthra, Baishan Guo, Huiyu Wang, Triantafyllos Afouras, David Kant, Wei-Ning Hsu
ICML 2024
Please use headphones for the best experience!

Text description: This music is instrumental. The tempo is medium fast with a melodious keyboard harmony, steady drumming, groovy bass, synthesiser arrangements, electronically articulated sounds and tambourine beats. The melody is harmonious, pleasant, uncomplicated and well layered. This music is Synth Pop.


We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Using self-supervised representations as a bridge between text descriptions and music audio, we construct two flow matching networks that model the conditional distributions of semantic and acoustic features. We also use masked prediction as the training objective, which lets the model generalize zero-shot to other tasks such as music infilling and continuation. Experiments on MusicCaps show that music generated by MusicFlow exhibits superior quality and text coherence despite the model being 2-5x smaller and requiring 5x fewer iterative steps. The model also achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.
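To make the idea of flow matching with masked prediction more concrete, here is a minimal sketch in plain Python. It is not the MusicFlow implementation: it assumes the commonly used optimal-transport probability path for flow matching, assumes the loss is computed only on masked frames, and treats each frame as a single scalar for brevity. The names `span_mask`, `ot_path_sample`, `masked_fm_loss`, `predict_velocity`, the constant `SIGMA_MIN`, and the 50 frames-per-second rate are all hypothetical choices made for illustration.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant for the path; not taken from the paper


def span_mask(num_frames, keep_head=0, keep_tail=0):
    """Per-frame mask: True = frame must be generated, False = frame is given as context."""
    return [keep_head <= i < num_frames - keep_tail for i in range(num_frames)]


def ot_path_sample(x1, t, rng):
    """Sample a point x_t on the optimal-transport path from noise x0 to data x1,
    together with the target velocity u_t at that point."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    xt = [(1 - (1 - SIGMA_MIN) * t) * a + t * b for a, b in zip(x0, x1)]
    ut = [b - (1 - SIGMA_MIN) * a for a, b in zip(x0, x1)]
    return xt, ut


def masked_fm_loss(predict_velocity, x1, mask, rng):
    """Mean squared error between predicted and target velocity on masked frames only."""
    t = rng.random()
    xt, ut = ot_path_sample(x1, t, rng)
    # Unmasked frames are shown to the model as context; masked frames are zeroed out.
    context = [0.0 if m else x for x, m in zip(x1, mask)]
    v = predict_velocity(xt, context, t)
    errs = [(vi - ui) ** 2 for vi, ui, m in zip(v, ut, mask) if m]
    return sum(errs) / max(len(errs), 1)


if __name__ == "__main__":
    rng = random.Random(0)
    fps = 50                 # hypothetical feature frame rate
    num_frames = 10 * fps    # a 10-second clip
    features = [rng.uniform(-1.0, 1.0) for _ in range(num_frames)]  # stand-in features

    # Continuation: the first 3 seconds are the prompt, the rest is generated.
    continuation = span_mask(num_frames, keep_head=3 * fps)
    # Infilling: the first and last 1.5 seconds are the prompt, the middle is generated.
    infilling = span_mask(num_frames, keep_head=75, keep_tail=75)

    def zero_model(xt, context, t):
        # Placeholder for a trained network: always predicts zero velocity.
        return [0.0] * len(xt)

    print(masked_fm_loss(zero_model, features, continuation, rng))
    print(masked_fm_loss(zero_model, features, infilling, rng))
```

Under these assumptions, continuation and infilling differ only in which frames the mask hides, which is one way to see why a model trained with masked prediction could handle both tasks without task-specific training.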

MusicFlow: Text to Music Generation

Text description | Generated music piece
This music is a lively fiddle instrumental. The tempo is fast with a percussion accompaniment. The music is spirited, vibrant, enthusiastic, cheerful, happy, merry and sunny. This music is Country Folk.
Someone is playing a loud melody on a steeldrum along to a backing track with a lot of percussion and an upright bass. This song may be playing at a beach concert.
This music is an electric guitar instrumental. The tempo is slow with an electric guitar lead and harmony of a popular song. No other instrument is used. It is mellow, pleasant, nostalgic, soft, mellifluous and soothing.
The song is instrumental. The tempo is medium with sitars and other stringed instruments playing in unison, tabla percussion and a harmonium playing the lead. The song is a classic Hindustani instrumental.
This music clip is a loud, boomy drum beat. The tempo is slow with the powerful and rhythmic beat of the Big Chinese drum. It is loud, resonating, vibrating and powerful and sounds ceremonial, celebratory and royal and used for announcements.

Text to Music Continuation

Text description | Music prompt (first 3 seconds are used as the prompt) | AudioLDM continuation | MusicFlow (ours) continuation
This is a marching band piece performed by an orchestra. There are trumpets playing the main theme while tuba and cello are playing the bass notes. There is also a constant string part holding the root notes of the melody. The snare drums are playing an accentuated, militaristic drum line. There is an epic atmosphere that gives a call-to-action feeling. This piece could be used in the soundtracks of action movies and war movies in particular. It could also be used in the soundtracks of video games from the action and first-person-shootout genres.
The song is an instrumental. The song is medium fast tempo with a bass guitar soloist playing a cool accompaniment groove with no other instrumentation. The song is passionate and energetic. The song is a bass guitar demo or a home music video.
This song contains mallet instruments playing a fast melody in the mid and high register along with low notes as bassline. The marimba is full of reverb. Then a bass drum comes in playing on every beat. This song may be playing in an advertisement.

Text to Music Infilling

Text description | Music prompt (first and last 1.5 seconds are used as the prompt) | AudioLDM infilling | MusicFlow (ours) infilling
This is a Nepali music piece. There is a string instrument and a local flute playing the melody. The rhythmic background is provided by percussion. It has a vibrant, dreamy feeling to it. This piece could be used in a dream sequence of a movie or TV show taking place in South Asian countries.
This is an opening theme for a TV series. It is an instrumental piece. The main theme is being played by a loud brass section. There is a groovy synth bass line playing. The rhythmic background consists of a strong electronic drum beat. The atmosphere is energetic. This piece could be used in lifting samples for beat-making.
This song is an instrumental. The tempo is medium with lively trumpets, trombone, enthusiastic percussion like the snare, cymbals and bass drum, tambourine, and an intense and bright string harmony of violins, viola and cello. The music is intense, serious and is gradually building up. This is an Orchestral piece.

Comparisons with recent work: Text-to-Music Generation

Text description | MusicGen | AudioLDM2 | MusicFlow (ours) | Ground-truth | Comment
This instrumental is a Heavy Metal instrumental. The tempo is fast with hard hitting drumming, furious and vigorous amplified keyboard playing a harmony, electric bass guitar and electric guitar accompaniment. The music is intense, grim, compelling, passionate, powerful and harmonious. The vibe of the music is serious, sinister, grim and steely. This is used in Hard Rock/Heavy Metal.
Comment: MusicGen and AudioLDM2 miss instruments such as the electric bass guitar and keyboard.
This music is an intense instrumental. The tempo is medium with hard hitting drums, ceremonial big drums, amplified piano and violin harmony. The music is syncopated and intense. It is like a background score for a movie.
Comment: MusicFlow captures the ceremonial, hard-hitting drums better than the others. MusicFlow also captures the concept of a movie background score, where the music peaks towards the end.
This music is an Indian Classical Instrumental. The tempo is medium fast with an ensemble of a lively violin, melodic flute and rhythmic accompaniment of the Indian percussion gadam and mridangam. The music uses ragas, talas and Sruti to form a harmony. It is classic, lively, rhythmic, engaging and enthralling.
Comment: AudioLDM2 does not capture ragas or the melodic flute. MusicGen generates bagpipes/harmonium instead of flute. MusicFlow captures rich ragas from flute, violin and mridangam.

MusicFlow: Failures

Text description | Generated music piece | Ground-truth | Comment
Percussions are playing together with a clarinet that takes the lead melody with very long notes. A bowed instrument is playing along with little fill-ins. An electric bass is playing a rather funky groove. The whole song sounds like it was made for dancing joyfully.
Comment: Model is unable to generate clarinet with very long notes.
This music is instrumental. The tempo is slow with an electronic keyboard playing a simple, high pitched harmony. There are sounds in the background like whirring, whooshing and clicking. The music is not very clear as the audio clip is muffled, but it sounds like video game music or the music in a child’s electronic toy.
Comment: Model sometimes struggles to construct a clear, consistent rhythmic tune over longer stretches of time.
This live recording features an instrumental song. This Regional Mexican song features an accordion playing the main melody. This is accompanied by an acoustic guitar strumming chords. A double bass plays the bass notes. A tambourine acts as the percussion. This song is in an upbeat mood. The song can be played at the entrance of a carnival.
Comment: Model can sometimes produce tunes that are stuck on the same beat for long periods of time.