ACE-Step is an open-source foundation model for music generation designed to address the architectural limitations of current methods.
While LLM-based models like Yue and SongGen excel at lyric alignment, they are often slow and prone to structural inconsistencies. Conversely, diffusion models like DiffRhythm offer fast synthesis but frequently struggle with long-term musical coherence. ACE-Step bridges this gap by integrating diffusion-based generation with Sana’s Deep Compression Autoencoder (DCAE) and a lightweight linear Transformer. To accelerate convergence during training, it utilizes MERT and m-hubert to align semantic representations via Representation Alignment (REPA).
Performance benchmarks on an NVIDIA A100 GPU show that ACE-Step can synthesize four minutes of music in only 20 seconds, about 15 times faster than LLM-based baselines. Beyond speed, the model achieves superior musical coherence and lyric alignment across melody, harmony, and rhythmic metrics.
ACE-Step preserves intricate acoustic details and provides robust control features, including voice cloning, lyric editing, remixing, and stem generation (such as converting lyrics to vocals or creating accompaniments for existing singing). Rather than being a simple end-to-end text-to-music pipeline, ACE-Step serves as a versatile foundation model: fast, efficient, and flexible enough to support the training of specialized sub-tasks. This makes it a powerful asset for the workflows of musicians, producers, and content creators.
Baseline Quality: The model supports all major musical styles and accepts various input formats, including short tags, descriptive natural language, and specific use-case scenarios.
Multi-Language Support: ACE-Step is compatible with 19 languages. The top 10 performing languages are English, Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, and Korean.
Instruments & Style: The model generates high-quality instrumental tracks across diverse genres. It produces realistic timbres and expressive performances for individual instruments while managing complex multi-instrument arrangements without losing musical logic.
Vocal Techniques: ACE-Step renders a broad spectrum of vocal styles and techniques, supporting various singing methods and nuanced emotional expressions.
Variant Generation: Through training-free, inference-time optimization, the model allows for controlled output variation. Generation starts from an initial noise sample, and trigFlow's noise formula then adds further Gaussian noise to it. By adjusting the mix ratio between the original initial noise and the added Gaussian noise, users can determine how much the resulting audio deviates from the original.
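The mixing step can be illustrated with a minimal sketch. It assumes a trigonometric blend in which the mix ratio is mapped to an angle, so that the result keeps unit variance when both inputs are independent unit Gaussians. The function name `mix_noise` and the mapping `variance * pi / 2` are assumptions for illustration, not taken from the ACE-Step code.

```python
import math
import random


def mix_noise(original, fresh, variance):
    """Blend two noise vectors with cos/sin weights.

    variance = 0.0 returns the original noise, 1.0 returns the fresh noise.
    The angle mapping (variance * pi / 2) is an illustrative assumption.
    """
    if not 0.0 <= variance <= 1.0:
        raise ValueError("variance must be in [0, 1]")
    angle = variance * math.pi / 2
    # cos^2 + sin^2 = 1, so unit-variance inputs give a unit-variance mix.
    return [math.cos(angle) * o + math.sin(angle) * f
            for o, f in zip(original, fresh)]


rng = random.Random(0)
original_noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
fresh_noise = [rng.gauss(0.0, 1.0) for _ in range(8)]

# A small variance keeps the new starting point close to the original one.
slightly_varied = mix_noise(original_noise, fresh_noise, 0.2)
```

A low ratio keeps the variant close to the first generation, while a ratio near 1.0 behaves almost like sampling with a new seed.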
Inpainting: By injecting noise into the target audio during the ODE (Ordinary Differential Equation) process and applying mask constraints, the model can modify specific sections of a track while keeping the rest intact. When combined with variant generation, this allows for localized changes to style, lyrics, or vocal delivery.
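The mask constraint can be sketched as follows. This is a simplified illustration, not the project's implementation: `build_mask`, `constrain_step`, and `frames_per_second` are hypothetical names, and the linear noising path `(1 - t) * target + t * noise` is an assumption about how noise is injected at ODE time `t`.

```python
def build_mask(num_frames, start_s, end_s, frames_per_second):
    """Return 1.0 for frames inside [start_s, end_s) and 0.0 elsewhere."""
    first = int(start_s * frames_per_second)
    last = int(end_s * frames_per_second)
    return [1.0 if first <= i < last else 0.0 for i in range(num_frames)]


def constrain_step(current, target, noise, mask, t):
    """Apply the mask constraint at ODE time t (t=1 pure noise, t=0 clean).

    Outside the mask, the sample is reset to the target audio with noise
    injected at the current level. Inside the mask, the model's own sample
    is kept, so only that region is regenerated.
    """
    noised_target = [(1.0 - t) * x + t * n for x, n in zip(target, noise)]
    return [m * c + (1.0 - m) * r
            for m, c, r in zip(mask, current, noised_target)]


# Regenerate seconds 2.0 to 4.0 of a 5-second clip at 2 frames per second.
mask = build_mask(num_frames=10, start_s=2.0, end_s=4.0, frames_per_second=2)
```

Calling `constrain_step` after every solver step keeps the unmasked frames tied to the original track while the masked frames evolve freely.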
Lyric Editing: Using flow editing techniques, users can modify specific lyric segments while preserving the original melody, vocals, and accompaniment. This feature works with both AI-generated content and uploaded audio. Currently, edits are best kept to short segments to prevent distortion, though multiple edits can be applied sequentially.
Lyrics to Vocals (LoRA): This LoRA is fine-tuned on pure vocal data to generate singing samples directly from lyrics. It is ideal for creating vocal demos, guide tracks, and vocal arrangement experiments, allowing lyricists to hear their words performed instantly.
Text to Sample (LoRA): Fine-tuned on instrumental and sample data, this module generates conceptual production samples from text descriptions. It is designed for the rapid creation of instrument loops, sound effects, and specific musical elements.
RapMachine: A system fine-tuned specifically for rap. Planned updates include features for AI rap battles and advanced narrative expression within the genre.
StemGen: A ControlNet-LoRA trained on multi-track data for single-instrument generation. By providing a reference track and specifying an instrument (or providing a reference audio clip for timbre), users can generate an instrumental track that harmonizes with the existing reference.
Singing to Accompaniment: Operating as the inverse of StemGen, this application takes a standalone vocal track and generates a full musical backing. Users can input a vocal file and a desired style to produce a complete, mixed arrangement.
Performance was evaluated using the Real-Time Factor (RTF), where higher values indicate faster generation. For context, a 27.27x RTF (A100, 27 steps) means one minute of music is generated in approximately 2.2 seconds (60/27.27). Measurements use a batch size of 1; the table below lists results for 27 and 60 inference steps.
| Device | 27 steps | 60 steps |
|---|---|---|
| NVIDIA A100 | 27.27x | 12.27x |
| NVIDIA RTX 4090 | 34.48x | 15.63x |
| NVIDIA RTX 3090 | 12.76x | 6.48x |
| MacBook M2 Max | 2.27x | 1.03x |
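The RTF arithmetic can be reproduced with a short helper. The function name `generation_seconds` is hypothetical; the RTF values are copied from the table above.

```python
def generation_seconds(audio_seconds, rtf):
    """Wall-clock seconds needed to generate audio_seconds of music."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# RTF values from the 27-step column of the table.
rtf_27_steps = {
    "NVIDIA A100": 27.27,
    "NVIDIA RTX 4090": 34.48,
    "NVIDIA RTX 3090": 12.76,
    "MacBook M2 Max": 2.27,
}

for device, rtf in rtf_27_steps.items():
    seconds = generation_seconds(60, rtf)
    print(f"{device}: {seconds:.1f} s per minute of music")
```

With the A100's 60-step figure of 12.27x, four minutes of music (240 seconds) take about 19.6 seconds, which is consistent with the 20-second figure quoted earlier.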
Ensure Python is installed, then set up a virtual environment using Conda (recommended) or venv to prevent dependency conflicts.
Option 1: Conda
Create and activate an environment named ace_step with Python 3.10:
conda create -n ace_step python=3.10 -y
conda activate ace_step
Option 2: venv
Create the environment:
python -m venv venv
Activate it:
Windows (cmd):
venv\Scripts\activate.bat
Windows (PowerShell):
.\venv\Scripts\Activate.ps1
(If restricted, run Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process first)
Linux/macOS:
source venv/bin/activate
Install Dependencies:
Linux/macOS:
pip install -r requirements.txt
Windows (install the CUDA build of PyTorch first):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
Running the Application:
Basic:
python app.py
Advanced (example with custom checkpoint and bfloat16):
python app.py --checkpoint_path /path/to/checkpoint --port 7865 --device_id 0 --share true --bf16 true
Note: On macOS, set --bf16 false to ensure compatibility.
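The platform rule for `--bf16` can be captured in a small launcher sketch. The helper `build_launch_args` is hypothetical; it only assembles flags already shown in this guide and chooses the `--bf16` value from the host operating system.

```python
import platform


def build_launch_args(system=None, port=7865):
    """Return an argument list for app.py, adjusted for the host OS."""
    # platform.system() reports "Darwin" on macOS.
    system = system or platform.system()
    args = ["python", "app.py", "--port", str(port)]
    # bfloat16 is switched off on macOS for compatibility.
    args += ["--bf16", "false" if system == "Darwin" else "true"]
    return args


print(" ".join(build_launch_args()))
```

The returned list can be passed to `subprocess.run(..., check=True)` to start the app from a script.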
Command-Line Arguments:
--checkpoint_path: Model checkpoint location (defaults to auto-download).
--server_name: Gradio server IP (default: '127.0.0.1'). Use '0.0.0.0' for network access.
--port: Gradio server port (default: 7865).
--device_id: Target GPU ID (default: 0).
--share: Generate a public Gradio link (default: False).
--bf16: Enables bfloat16 precision for faster inference (default: True).
--torch_compile: Uses torch.compile() for optimization (default: False; not supported on Windows).
The interface includes several specialized tabs:
Text to Music: Structure lyrics with tags such as [verse], [chorus], or [bridge].
Redo: Regenerate audio with a new seed. The "variance" slider controls how much the new version differs from the previous one.
Inpainting: Specify start and end timestamps to regenerate specific sections of an audio file while preserving the rest of the track.
Edit: Modify existing tracks by updating tags or lyrics. Use "lyrics only" to keep the melody or "remix" to change the musical structure.
Extend: Add new musical content to the beginning (left) or end (right) of an existing track.
Set up the environment as instructed above. If training LoRA models, install the PEFT library:
pip install peft
Example Dataset Entry (JSON format):
{
  "keys": "1ce52937-cd1d-456f-967d-0f1072fcbb58",
  "tags": ["pop", "acoustic", "ballad", "romantic", "emotional"],
  "speaker_emb_path": "",
  "norm_lyrics": "I love you, I love you, I love you",
  "recaption": {
    "simplified": "pop",
    "expanded": "pop, acoustic, ballad, romantic, emotional",
    "descriptive": "The sound is soft and gentle, like a tender breeze on a quiet evening.",
    "use_cases": "Suitable for background music in romantic films.",
    "analysis": "pop, ballad, piano, guitar, slow tempo"
  }
}
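Before training, it can help to sanity-check entries against the structure shown above. The sketch below is an illustration only: which fields the trainer actually requires is an assumption, and `check_entry` is a hypothetical helper, not part of the project.

```python
import json

# Fields and types inferred from the example entry (an assumption).
REQUIRED_FIELDS = {
    "keys": str,
    "tags": list,
    "norm_lyrics": str,
    "recaption": dict,
}


def check_entry(entry):
    """Return a list of problems found in one dataset entry."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


sample = json.loads("""
{
  "keys": "1ce52937-cd1d-456f-967d-0f1072fcbb58",
  "tags": ["pop", "acoustic", "ballad", "romantic", "emotional"],
  "speaker_emb_path": "",
  "norm_lyrics": "I love you, I love you, I love you",
  "recaption": {"simplified": "pop"}
}
""")

print(check_entry(sample) or "entry looks well-formed")
```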
Key Training Parameters:
--dataset_path: Path to your Hugging Face-style dataset.
--learning_rate: Learning rate (default: 1e-4).
--precision: Training precision, e.g., "bf16-mixed" or "fp32".
--accumulate_grad_batches: Steps for gradient accumulation.
--every_n_train_steps: Checkpoint saving frequency.
Base Model Training:
python trainer.py --dataset_path "path/to/dataset" --checkpoint_dir "path/to/base/checkpoint" --exp_name "experiment_name"
LoRA Training:
Requires a configuration file (lora_config.json):
python trainer.py --dataset_path "path/to/dataset" --checkpoint_dir "path/to/base/checkpoint" --lora_config_path "lora_config.json" --exp_name "lora_experiment"
Example LoRA Config:
{
  "r": 16,
  "lora_alpha": 32,
  "target_modules": [
    "speaker_embedder",
    "linear_q",
    "linear_k",
    "linear_v",
    "to_q",
    "to_k",
    "to_v",
    "to_out.0"
  ]
}
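As a rough guide to what `r` and `lora_alpha` control, the sketch below computes the standard LoRA scaling factor (`lora_alpha / r`) and the number of trainable weights a rank-`r` adapter adds to one linear layer. The 1024x1024 layer size is a hypothetical example, not a dimension taken from ACE-Step.

```python
import json

config = json.loads("""
{
  "r": 16,
  "lora_alpha": 32,
  "target_modules": ["speaker_embedder", "linear_q", "linear_k", "linear_v",
                     "to_q", "to_k", "to_v", "to_out.0"]
}
""")


def lora_scaling(cfg):
    """Factor applied to the low-rank update in standard LoRA."""
    return cfg["lora_alpha"] / cfg["r"]


def lora_param_count(in_features, out_features, r):
    """Trainable weights added to one linear layer: A is r x in, B is out x r."""
    return r * (in_features + out_features)


print("scaling:", lora_scaling(config))
# Hypothetical 1024 x 1024 projection layer.
print("extra weights per layer:", lora_param_count(1024, 1024, config["r"]))
```

Raising `r` increases adapter capacity and size; keeping `lora_alpha` at twice `r`, as in the example config, leaves the scaling factor at 2.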
Advanced Training Options:
--shift: Flow-matching shift parameter (default: 3.0).
--gradient_clip_val: Value for gradient clipping (default: 0.5).
--reload_dataloaders_every_n_epochs: Frequency for refreshing data loaders.
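The guide does not state how `--shift` is applied. The sketch below assumes the time-shift commonly used by flow-matching schedulers, where a noise level `sigma` in [0, 1] is remapped so that more of the schedule is spent at high noise. The function name `shift_sigma` is hypothetical.

```python
def shift_sigma(sigma, shift=3.0):
    """Remap a noise level in [0, 1]; shift=1.0 leaves it unchanged.

    Assumed formula: shift * sigma / (1 + (shift - 1) * sigma).
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Compare an evenly spaced schedule with its shifted version.
uniform = [i / 4 for i in range(5)]
shifted = [shift_sigma(s) for s in uniform]
print(list(zip(uniform, shifted)))
```

Under this assumption, the endpoints 0 and 1 stay fixed while intermediate values move toward 1, so the midpoint 0.5 maps to 0.75 at the default shift of 3.0.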