AudioShake models define the type of processing applied to your source audio. You can combine multiple models in a single /tasks request to generate multiple outputs from the same file.
See Formats for supported input and output file types.
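As a rough illustration of combining models, the sketch below builds a JSON body that asks for three stems from one source file. Only the /tasks endpoint, the `url` input field, and the model names come from this page. The `targets` and `formats` field names, the `wav` output format, and the helper `build_task_body` are assumptions made for illustration, so check the API reference for the actual request schema.

```python
import json


def build_task_body(source_url, models, output_format="wav"):
    """Build a request body asking for several models in one task.

    The "targets" and "formats" field names are assumed, not confirmed here.
    """
    if not models:
        raise ValueError("at least one model is required")
    # One target entry per model; duplicates are dropped, order is kept.
    unique_models = list(dict.fromkeys(models))
    return {
        "url": source_url,
        "targets": [
            {"model": name, "formats": [output_format]} for name in unique_models
        ],
    }


body = build_task_body(
    "https://example.com/song.wav", ["vocals", "drums", "bass", "vocals"]
)
print(json.dumps(body, indent=2))
```

The helper only assembles the payload. Sending it, authenticating, and polling for results are left out because this page does not describe them.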
| Category | Description |
|---|---|
| Instrument stem separation | Isolate vocals, drums, bass, guitar, and other instruments from music |
| Speech | Multi-speaker separation and speech denoising |
| Post-production | Dialogue, effects, and music separation for dubbing and editing |
| Copyright compliance | Music detection, identification, and removal |
| Lyric transcription | Time-synced lyrics and transcript alignment |
Instrument stem separation
Use these models to split songs into musical components for remixing, post-production, education, and interactive experiences.
| Model | Description | Credits / min |
|---|---|---|
| vocals | All sung voices together (lead + backing). | 1.0 |
| vocals_lead | Primary lead melody/lyric vocal only. | 1.0 |
| vocals_backing | Backing parts only (harmonies, ad-libs, choir). | 1.0 |
| instrumental | Full music mix with vocals removed. | 1.0 |
| drums | Percussion and drum kit sources (drums, congas, hi-hat, cajon). | 1.0 |
| bass | Low-end bass sources (electric bass, double bass, bass synth). | 1.0 |
| guitar | All guitar-family parts together (acoustic + electric). | 1.0 |
| guitar_electric | Electric guitar only (clean or distorted). | 1.0 |
| guitar_acoustic | Acoustic/plucked guitar-family sources (acoustic guitar, banjo, lute). | 1.0 |
| piano | Acoustic piano only. | 1.0 |
| keys | Keyboard family: acoustic, electric, digital pianos, clavinet, Hammond, harpsichord. | 1.0 |
| strings | Bowed or pizzicato string instruments and sections. | 1.0 |
| wind | Wind instruments including woodwind and brass (flute, saxophone, clarinet, bassoon, harmonica). | 1.0 |
| other | Everything except vocals, drums, and bass. | 1.0 |
| other-x-guitar | Everything except vocals, drums, bass, and guitar. | 1.0 |
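The Credits / min column can be turned into a quick cost estimate. The sketch below copies a few rates from the tables on this page and assumes that credits scale linearly with duration and add up across models. That billing rule, including any rounding or minimum charge, is an assumption and is not stated here. The function name `estimate_credits` is illustrative.

```python
# Credits per minute, copied from the tables on this page.
CREDITS_PER_MINUTE = {
    "vocals": 1.0,
    "drums": 1.0,
    "bass": 1.0,
    "other": 1.0,
    "multi_voice": 10.0,
    "speech_clarity": 1.5,
    "dialogue": 1.5,
    "music_detection": 0.5,
}


def estimate_credits(duration_seconds, models):
    """Estimate credits, assuming linear per-minute billing summed over models."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    minutes = duration_seconds / 60
    # A model missing from the rate table raises KeyError rather than guessing.
    return sum(CREDITS_PER_MINUTE[m] * minutes for m in models)


# A 3.5-minute song split into four stems.
print(estimate_credits(210, ["vocals", "drums", "bass", "other"]))  # 14.0
```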
Speech
Models for speaker isolation, denoising, and multi-speaker separation.
| Model | Description | Credits / min |
|---|---|---|
| multi_voice | Outputs one stem per speaker from a mixed multi-speaker source, even with overlapping speech. | 10.0 |
| speech_clarity | Remove background noise, hum, and interference from speech. Effective for low-resolution audio, noisy environments, and forensic intelligibility. | 1.5 |
Maximum input length for multi_voice is 1.5 hours.
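A client can check input length before submitting a task. The sketch below uses the limits stated on this page (1.5 hours for multi_voice, 45 minutes for transcription and alignment). Treating models without a stated limit as unrestricted, and treating the limit as inclusive, are assumptions. The name `models_over_limit` is illustrative.

```python
# Maximum input lengths in seconds, as stated on this page.
MAX_INPUT_SECONDS = {
    "multi_voice": 1.5 * 60 * 60,  # 1.5 hours
    "transcription": 45 * 60,      # 45 minutes
    "alignment": 45 * 60,          # 45 minutes
}


def models_over_limit(duration_seconds, models):
    """Return the requested models whose stated input limit the file exceeds."""
    return [
        m
        for m in models
        if m in MAX_INPUT_SECONDS and duration_seconds > MAX_INPUT_SECONDS[m]
    ]


# A 60-minute recording fits multi_voice but is too long for alignment.
print(models_over_limit(60 * 60, ["multi_voice", "alignment", "vocals"]))
# ['alignment']
```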
Post-production
Separation models for dubbing, dialogue cleanup, and audio editing workflows.
| Model | Description | Credits / min |
|---|---|---|
| dialogue | Speech-focused stem for podcasts, interviews, and film/TV dialogue. | 1.5 |
| effects | Ambience and SFX bed with dialogue and music removed. | 1.5 |
| music_fx | Keep music + effects; remove dialogue. | 1.5 |
Copyright compliance
Models for detecting, identifying, and removing music in content.
| Model | Description | Credits / min |
|---|---|---|
| music_detection | Returns time ranges where music is present. | 0.5 |
| music_identification | Identify music and return track metadata. | N/A |
| music_removal | Keep speech + effects; remove background music. | N/A |
Music identification and music removal are available by request. Contact info@audioshake.ai to request access.
Lyric transcription
Use these models to produce transcripts and time-synced text. These models are state of the art for lyric transcription.
| Model | Description | Credits / min |
|---|---|---|
| transcription | Generate a line-level lyric transcript from a song. Best for getting lyrics as text. | 1.0 |
| alignment | Generate precise word-level and line-level timestamps. Use this for karaoke, subtitles, and any workflow that needs per-word timing. Transcribes automatically if no transcript is provided. | 1.0 |
For alignment, provide one audio source (url or assetId) and optionally a transcript input (transcriptUrl or transcriptAssetId).
Maximum input length for transcription and alignment is 45 minutes.
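The alignment input rule (one audio source, optionally one transcript) can be enforced client-side before a request is built. The sketch below uses the field names given above: `url`, `assetId`, `transcriptUrl`, and `transcriptAssetId`. Reading "or" as "exactly one" for the audio source and "at most one" for the transcript is an assumption, as is the helper name `build_alignment_inputs`. Where these fields sit inside the request body is not covered here.

```python
def build_alignment_inputs(url=None, asset_id=None,
                           transcript_url=None, transcript_asset_id=None):
    """Validate and collect alignment inputs using the field names on this page."""
    # Exactly one audio source is required.
    if (url is None) == (asset_id is None):
        raise ValueError("provide exactly one of url or assetId")
    # The transcript is optional, but only one form may be given (assumed).
    if transcript_url is not None and transcript_asset_id is not None:
        raise ValueError(
            "provide at most one of transcriptUrl or transcriptAssetId"
        )

    inputs = {"url": url} if url is not None else {"assetId": asset_id}
    if transcript_url is not None:
        inputs["transcriptUrl"] = transcript_url
    elif transcript_asset_id is not None:
        inputs["transcriptAssetId"] = transcript_asset_id
    return inputs


print(build_alignment_inputs(
    url="https://example.com/song.wav",
    transcript_url="https://example.com/lyrics.txt",
))
```

With no transcript argument the helper returns only the audio source, which matches the case where alignment transcribes automatically.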