Hi friends. Coffee, and a slightly different topic than usual, because for once it is not about pixels. On May 20, Stability AI dropped Stable Audio 3.0, a whole family of music models, and the headline number is the one everyone latched onto: it can generate full songs longer than six minutes. The medium and large models go out to about six minutes and twenty seconds while actually holding musical structure and melody, not just looping a vibe until it falls apart.
I know, I know, this is an AI art blog and now I am talking about music. But hear me out, because if you make videos, reels, loops, or anything that moves, the soundtrack has always been the weakest link in our workflow. We obsess over the visuals and then panic at the end looking for a track that is not copyright-flagged into oblivion. This release is aimed straight at that gap.
What Actually Shipped
It is not one model, it is four, and the differences matter for how you would actually use them:
| Model | Size | What it is for |
|---|---|---|
| Small SFX | 459M parameters | Sound effects, short stings, on-device generation |
| Small | 459M parameters | On-device music up to about two minutes |
| Medium | 1.4B parameters | Full compositions out to roughly 6:20 |
| Large | 2.7B parameters | The top-tier, longest, most structured output |
The two small models are light enough to run on-device and generate up to about two minutes of audio, which is genuinely useful when you just need a quick loop or a sound effect without running up a cloud bill. The medium and large are where the full-song magic lives.
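If it helps to see the table as a decision rule, here is a minimal sketch in Python. The tier labels (`small-sfx`, `small`, `medium`) are my own shorthand, not official model identifiers, and the limits are just the rough "about two minutes" and "roughly 6:20" figures from above, rounded to seconds. I left the large model out because it is not something you download.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str          # my own label, NOT an official model identifier
    max_seconds: int   # approximate limit taken from the table above
    for_sfx: bool = False


# Open-weight tiers only, ordered lightest first.
# 120 s is "about two minutes"; 380 s is "roughly 6:20".
OPEN_TIERS = [
    Tier("small-sfx", 120, for_sfx=True),
    Tier("small", 120),
    Tier("medium", 380),
]


def pick_tier(seconds: float, sfx: bool = False) -> str:
    """Return the lightest open-weight tier that covers the requested length."""
    for tier in OPEN_TIERS:
        if tier.for_sfx == sfx and seconds <= tier.max_seconds:
            return tier.name
    raise ValueError(f"no open-weight tier covers {seconds} seconds (sfx={sfx})")


print(pick_tier(30))            # small
print(pick_tier(300))           # medium
print(pick_tier(5, sfx=True))   # small-sfx
```

Anything past 380 seconds raises an error instead of silently picking something, which is the behavior I would want in a batch script.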
The Part That Actually Matters: It Is Open and It Is Licensed
Two details here are a bigger deal than the six-minute number, and they are the reason I am writing this up for our community specifically.
First, the open weights. The small SFX, small, and medium models are released with open weights, meaning anyone can download, use, and modify them. That is the part that decides whether a tool becomes a creator tool or stays locked behind a corporate API. Open weights mean it gets built into local pipelines, ComfyUI-style nodes, and the kind of free tooling our corner of the internet actually runs. The large model is the exception, available only through the API and paid self-hosting, with an enterprise license required for companies pulling in more than a million dollars in revenue. For the rest of us, the open trio is the story.
Second, and this is the one I care about most after watching the Disney and Universal lawsuit drama unfold, Stability says the entire Stable Audio 3.0 family is built on fully licensed training data. That is a direct response to the copyright cloud hanging over basically every other AI music tool. If you have followed the legal fights, you know that "where did the training data come from" is the question that decides whether you can actually use the output in something public without lying awake at night.
The headline is "six-minute songs." The thing that actually changes your workflow is "open weights, trained on fully licensed data."
Why This Is the Missing Half of an AI Art Workflow
Think about how a typical AI art video gets made right now. You generate your images or your video clips, you cut them together, and then you hit the wall: the music. Your options have been a tiny library of overused royalty-free tracks, a subscription service, or risking a copyright strike with something you do not have the rights to. The audio has always been the part where the polished, original pipeline suddenly turns into borrowing.
A model that generates full-length, structured, license-clean music closes that loop. The same way image models let us stop pulling stock photos, this lets us stop pulling stock music. For anyone scoring a reel, a profile loop, a longer YouTube piece, or an ambient background for a gallery video, having a six-minute original track you actually have the rights to is the difference between "inspired by" and "made by me."
How I Would Actually Use It
- Start with the small models for loops and stings. If you just need a 30-second bed under a short reel or a quick whoosh between cuts, the on-device small models are fast, free, and good enough. Do not reach for the heavy model when a light one finishes the job.
- Use medium for full backing tracks. When you want an actual song with a beginning, middle, and end under a longer piece, the medium model holding structure to 6:20 is the sweet spot for open-weight use.
- Prompt for mood and instrumentation, not genre alone. The same lesson from image prompting applies. "Warm lo-fi with soft piano, slow tempo, no vocals, gentle build" gives you something usable. One-word genre tags give you the average of the genre.
- Match the track length to the edit, not the other way around. Generate to the length you need rather than generating a long track and chopping it. Models that hold structure do it best when they are aiming at a target length from the start.
- Keep a note of which model and prompt made each track, exactly like a prompt library for images. Future-you scoring the next video will thank present-you.
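For the prompting and note-keeping habits above, here is a small sketch of how I would wire them up. It does not call Stable Audio at all. It only builds the prompt string and appends a record to a log file. The function names, the field names, and the JSON Lines format are all my own choices for illustration, so treat them as assumptions and rename freely.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_prompt(mood, instruments, tempo, extras=()):
    """Join mood, instrumentation, tempo, and extra notes into one prompt."""
    head = f"{mood} with {', '.join(instruments)}" if instruments else mood
    return ", ".join([head, tempo, *extras])


def log_track(log_path, model, prompt, seconds, output_file):
    """Append one JSON line describing how a track was generated."""
    record = {
        "created": datetime.now(timezone.utc).isoformat(),
        "model": model,          # whatever label you use for the model
        "prompt": prompt,
        "seconds": seconds,      # target length you asked for
        "file": output_file,     # where you saved the audio
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


prompt = build_prompt(
    "warm lo-fi", ["soft piano"], "slow tempo", ["no vocals", "gentle build"]
)
print(prompt)  # warm lo-fi with soft piano, slow tempo, no vocals, gentle build
```

After each generation, something like `log_track("tracks.jsonl", "medium", prompt, 45, "reel_bed.wav")` adds one line to the log. One line per track keeps the file easy to search later, and the target length is stored next to the prompt so you can regenerate to fit a different edit.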
The Honest Caveats
I have not lived inside this model for a week yet, so I am not going to pretend I have a definitive verdict on quality. Long-form AI music historically struggles with two things: keeping a melody coherent across minutes instead of meandering, and avoiding that slightly soulless "stock music generator" feel. Stability is claiming the structure problem is handled out to six-plus minutes, and the licensed-data angle is real and welcome, but the taste question is the one only your own ears can answer. Generate a few, listen on real speakers and on phone speakers, and see whether it survives the same scrutiny you would give a track you paid for.
The other honest note: open weights are wonderful, but the best model in the family, the large one, is the paid, API-and-enterprise tier. That is a completely fair business model, and it is also worth knowing going in so you are not surprised when the absolute top quality sits behind a paywall while the very good open trio is what you actually download.
The Bottom Line
For an AI art crowd, Stable Audio 3.0 is quietly one of the more useful releases of the month, not because of the six-minute headline, but because it is open, it is license-clean, and it fills the exact hole every one of us hits at the end of a video. We finally have an original-music option that matches the original-image tools we already love. Generate your visuals, generate your soundtrack, and for once put out a piece that is yours from the first frame to the last note.
I am going to go score the backlog of clips that have been sitting silent on my drive, and probably spend an embarrassing amount of time prompting for the perfect dreamy synth bed. If you build something with it, I want to hear it. Now, more coffee.