AI-Powered Music Generation Platform
The client was an early-stage music technology company building tools for solo creators and small content studios. Their thesis was that the bottleneck for short-form video creators had shifted from filming to soundtracking, and that the existing stock-music marketplaces were too generic to sustain a brand identity across episodes. They had a working prototype that generated thirty-second clips from a text prompt and validation from a small creator cohort, but no path to production fidelity.
Service area: Generative AI Solutions
Industry: Entertainment, Creator Economy
Timeline: 18 weeks from kickoff to public beta
Team: 5 specialists
The full story
The practical problem was that the prototype output sounded like AI music. Loops were obvious, transitions were jarring, and mastering was flat. Creators wanted three-minute tracks with structure — intro, build, drop, outro — not minute-long loops with a fade. They also wanted a way to nudge the result toward their existing brand sound without becoming sound engineers, which the prompt-only interface did not support.
We rebuilt the generation pipeline around a structure-first model that planned the arrangement before generating audio, then a separate model that filled the arrangement with instrument stems, and a learned mastering chain on top. The interface added direct controls for tempo, key, instrumentation, and mood, each of which conditioned the generation rather than post-processing the output. We licensed a reference set of professionally mastered tracks for the mastering model so the output sat at commercial loudness without artifacts.
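To make the conditioning concrete, here is a minimal sketch of the kind of structured generation specification the prompt parser might produce from free text plus the direct controls. The field names, types, and defaults are illustrative assumptions, not the client's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical generation spec: free text from the prompt plus the direct
# controls (tempo, key, instrumentation, mood), all of which condition
# generation rather than post-process it. Names are illustrative only.
@dataclass
class GenerationSpec:
    prompt: str                                   # free-text description from the creator
    genre: Optional[str] = None                   # inferred from the prompt when not set
    mood: Optional[str] = None                    # e.g. "uplifting", "tense"
    tempo_bpm: Optional[int] = None               # explicit control overrides inference
    key: Optional[str] = None                     # e.g. "F major"
    instrumentation: list[str] = field(default_factory=list)  # must-have instrument families
    reference_track: Optional[str] = None         # pinned track for sound matching
    target_lufs: float = -14.0                    # mastering loudness target, exposed as a slider

spec = GenerationSpec(
    prompt="warm lo-fi groove for a cooking series intro",
    tempo_bpm=84,
    key="F major",
    instrumentation=["drums", "bass", "keys"],
)
```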
What shipped was a workspace where a creator types a prompt, optionally pins a reference track for sound matching, and gets a fully arranged three-minute composition with stems exported separately for editing. Mastering is automatic but exposes a slider for loudness target. The platform also generates variations of a track on demand so creators can score multiple cuts of a video without re-prompting from scratch.
The prototype generated loops, not songs — creators wanted full arrangements with structure they could trust to soundtrack a video.
Generated clips were thirty to sixty seconds and looped audibly, so creators could not score a longer video without obvious repetition.
Mastering was flat and quieter than reference commercial tracks, which made the output feel amateur next to licensed library music.
The prompt-only interface gave no way to nudge results toward an existing brand sound a creator had already established.
Transitions between sections were jarring and the model had no concept of song structure, only of moment-to-moment timbre.
Stem export was unavailable, so creators could not pull out a drum line or remove vocals to fit a voiceover later in editing.
How we structured the engagement
Separated arrangement planning from audio synthesis so the model could think about songs the way a producer does.
Phase 01 (Weeks 1-3): Discovery
Interviewed twenty short-form creators about how they currently source music and where existing tools failed. Sat in on three remote mastering sessions with the partner audio engineer. Output: a target arrangement template, a loudness target of negative fourteen LUFS, and a list of must-have instrument categories.
Phase 02 (Weeks 4-6): Architecture
Designed a three-stage pipeline: a transformer-based arrangement planner that outputs a structured song graph, an audio generator that fills each section conditioned on the graph, and a mastering chain that uses a learned compressor and EQ trained on the licensed reference set. The shape of that song graph is sketched just after this phase list.
Phase 03 (Weeks 7-14): Build
Trained the arrangement planner on a curated dataset of structured tracks tagged by genre and section. Built the audio generator on top of a fine-tuned MusicGen base. Implemented stem separation using a Demucs variant so each instrument family could be exported independently after generation.
Phase 04 (Weeks 15-18): Launch
Launched the beta to one hundred fifty creators across two cohorts, ran weekly listening sessions to triage acoustic issues, and tuned the mastering chain until blind A/B tests against library music were a coin flip. Shipped the reference-track sound-matching feature in week sixteen.
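As referenced in the architecture phase, the planner's output is a song graph with section boundaries, a dynamics curve, and instrument introduction points. The sketch below is an assumed shape for that graph, based only on what is described here, not the production format.

```python
from dataclasses import dataclass

# Illustrative song-graph shape for a three-minute arrangement. Only the
# properties named in the case study are modeled: section boundaries, a
# dynamics curve, and instrument introduction points.
@dataclass
class Section:
    name: str                  # "intro", "build", "drop", "outro", ...
    start_s: float             # section boundary, seconds from track start
    end_s: float
    dynamics: float            # 0.0 quiet .. 1.0 full energy
    instruments_in: list[str]  # instrument families introduced in this section

song_graph = [
    Section("intro",   0.0,  20.0, 0.3, ["keys"]),
    Section("build",  20.0,  70.0, 0.6, ["drums", "bass"]),
    Section("drop",   70.0, 130.0, 1.0, ["lead synth"]),
    Section("bridge", 130.0, 155.0, 0.5, []),
    Section("outro",  155.0, 180.0, 0.2, []),
]
```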
What we built, component by component
01. Prompt parser
Maps free-text prompts plus explicit controls into a structured generation specification with genre, mood, tempo, and key.
02. Arrangement planner
Transformer that outputs a song graph with section boundaries, dynamics curve, and instrument introduction points.
03. Audio generator
Fine-tuned MusicGen variant conditioned on the song graph and the user reference track if one is pinned for sound matching.
04. Stem separator
Demucs variant that isolates drum, bass, harmonic, and vocal stems for independent export and downstream editing.
05. Mastering chain
Learned compressor and EQ stack trained on licensed commercial reference tracks, targeting negative fourteen LUFS by default; a minimal loudness-targeting sketch follows this component list.
06. Variation engine
Re-runs the audio generator with the same arrangement graph and a different seed to produce alternate cuts on demand.
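The learned compressor and EQ are specific to this build, but the negative fourteen LUFS target the mastering chain aims for is easy to illustrate. Here is a minimal sketch using the open-source pyloudnorm library; the library choice and the function are assumptions for illustration, not the client's actual chain.

```python
import soundfile as sf
import pyloudnorm as pyln  # open-source ITU-R BS.1770 loudness meter

def normalize_to_target(in_path: str, out_path: str, target_lufs: float = -14.0) -> float:
    """Measure integrated loudness and gain-adjust a mix toward the target.

    Pure gain adjustment only; a real mastering chain would also compress,
    EQ, and limit to avoid clipping at the higher loudness.
    """
    audio, rate = sf.read(in_path)
    meter = pyln.Meter(rate)                     # BS.1770 loudness meter
    loudness = meter.integrated_loudness(audio)  # measured LUFS of the input
    normalized = pyln.normalize.loudness(audio, loudness, target_lufs)
    sf.write(out_path, normalized, rate)
    return loudness

# Usage sketch: normalize_to_target("mix.wav", "mastered.wav")  # defaults to -14 LUFS
```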
A user submits a prompt and optional controls. The parser builds a generation specification, the arrangement planner produces a song graph, and the audio generator fills the graph section by section while the stem separator runs in parallel. The mastering chain processes the mixed output and the variation engine stays warm to produce alternate cuts on request.
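That flow can be sketched in code. The following is a hypothetical approximation using the public audiocraft MusicGen API rather than the client's fine-tuned variant, and it folds song-graph metadata into text prompts because the graph-conditioning interface is not public; the function name, prompting scheme, and naive concatenation are all assumptions.

```python
import torch
from audiocraft.models import MusicGen  # public MusicGen API; the production model is a fine-tuned variant

def render_sections(spec, song_graph, seed: int = 0) -> torch.Tensor:
    """Render a track section by section from a song graph (illustrative only)."""
    torch.manual_seed(seed)  # the variation engine re-runs with a different seed
    model = MusicGen.get_pretrained("facebook/musicgen-small")
    pieces = []
    for section in song_graph:
        model.set_generation_params(duration=section.end_s - section.start_s)
        prompt = (
            f"{spec.prompt}, {section.name} section, "
            f"{spec.tempo_bpm or 'free'} BPM, "
            f"featuring {', '.join(section.instruments_in) or 'the existing ensemble'}"
        )
        wav = model.generate([prompt])  # shape: [batch, channels, samples]
        pieces.append(wav[0])
    # Naive concatenation; the shipped pipeline crossfades section boundaries
    # and runs stem separation and mastering on the joined output.
    return torch.cat(pieces, dim=-1)
```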
The trade-offs we made and why
Separated arrangement from synthesis instead of generating end-to-end
End-to-end models produce coherent timbres but lose structure past about a minute. Generating an arrangement graph first kept the song coherent across three minutes and gave us a place to expose user controls like tempo and key without retraining the audio model.
Licensed a commercial reference set for the mastering model
Training on public-domain tracks made the master sound public-domain. Licensing four hundred hours of professionally mastered modern music was the single change that pushed blind A/B tests from a clear loss to a coin flip against library music.
Built stem separation into the generation pipeline
Creators wanted to mute the vocals when adding a voiceover and pull the drum line under a montage. Generating stems jointly with the master meant clean separation without the artifacts that come from post-hoc source separation on a finished mix.
Exposed direct controls alongside the prompt
Prompt-only interfaces feel magical for the first track and frustrating by the fifth, when the creator wants a specific BPM. Direct controls for tempo, key, and instrumentation gave power users a steering wheel without hiding the prompt for everyone else.
What changed for the client
Time to first track: median time from prompt submission to a fully mastered three-minute composition ready to download.
Production time: self-reported time-to-final-cut from a creator cohort, comparing platform output against their previous workflow.
Blind A/B vs library: blind preference test of platform tracks against a curated commercial library, conducted with two hundred listeners.
Exportable stems: drum, bass, harmonic, and vocal stems available per generated track for editing in any standard DAW.
The tools behind the system
The generation stack pairs the transformer arrangement planner with a fine-tuned MusicGen audio generator, a Demucs-based stem separator, and a learned compressor and EQ chain for mastering, chosen for production reliability and operational velocity.
Lessons learned from the build
A two-stage generation model was the unlock. We spent four weeks trying to get a single end-to-end model to handle structure and gave up only after measuring how badly coherence degraded past ninety seconds. Splitting the problem made every subsequent decision easier.
The mastering chain mattered as much as the generator. Listeners cannot articulate why a track sounds amateur but they can hear it. Investing in the master, including the licensing spend, did more for perceived quality than any change to the audio model.
Direct controls were a retention feature, not a power-user feature. Beta users who only used the prompt churned faster than users who touched the tempo or instrumentation control even once. We would surface those controls more prominently in the first session next time.
