Models
AudioShake’s Models define the specific type of audio processing applied to your content.
Each model represents a distinct audio processing operation—such as isolating vocals, removing background music, or transcribing dialogue—and can be combined within a single Tasks API request to produce multiple outputs from the same source file.
Models are organized by use case:
- Instrument Stem Separation — break down songs into individual components like vocals, drums, and bass.
- Dialogue, Music, and Effects — isolate voices or remove background elements for film, TV, and dubbing.
- Transcription and Alignment — convert spoken content into synchronized text and timestamps.
Use these models to design flexible workflows for music production, post-production, accessibility, and AI data preparation.
Instrument Stem Separation
These models isolate or extract musical components from a mixed track.
They’re useful for remixing, immersive audio, gaming, and music education.
All models in this section can be called via the /tasks route and support standard formats such as WAV, MP3, and FLAC.
| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Vocals | vocals | Extracts vocal elements from a mix. Supports the high_quality variant for improved clarity. | 1.0 | 3 Hours |
| Lead Vocals | vocals_lead | Extracts the vocal performance carrying the primary melodic or lyrical content of the track. Excellent for karaoke. | 1.0 | 3 Hours |
| Backing Vocals | vocals_backing | Extracts only the backing vocals including harmonies, chants, ad-libs, and choirs. | 1.0 | 3 Hours |
| Instrumental | instrumental | Generates an instrumental-only version by removing vocals. For best quality, use the high_quality variant. | 1.0 | 3 Hours |
| Drums | drums | Isolates percussion and rhythmic elements. | 1.0 | 3 Hours |
| Bass | bass | Separates bass instruments and low-frequency sounds. | 1.0 | 3 Hours |
| Guitar | guitar | Isolates guitar stems (acoustic, electric, classical). | 1.0 | 3 Hours |
| Electric Guitar | guitar_electric | Isolates electric guitar stems. | 1.0 | 3 Hours |
| Acoustic Guitar | guitar_acoustic | Isolates acoustic guitar stems (including classical guitar). | 1.0 | 3 Hours |
| Piano | piano | Extracts only acoustic piano. | 1.0 | 3 Hours |
| Keys | keys | Extracts all keyboard instruments including piano, electric piano, organ, etc. | 1.0 | 3 Hours |
| Strings | strings | Isolates orchestral string instruments like violin, cello, and viola. | 1.0 | 3 Hours |
| Wind | wind | Extracts wind instruments such as flute and saxophone. | 1.0 | 3 Hours |
| Other | other | Captures remaining instrumentation after main stems are removed. | 1.0 | 3 Hours |
| Other-x-Guitar | other-x-guitar | Residual instrumentation after removing vocals, drums, bass, and guitar. | 1.0 | 3 Hours |
To include a residual stem in your results, set "residual": true in the target metadata when creating your task. For more information, contact support@audioshake.ai.
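As a rough illustration of the residual flag, the Python sketch below builds one target and serializes it. It assumes that "target metadata" means the target object itself (the same object that carries "model" and "formats"); that placement is an assumption, and nothing is sent to the API.

```python
import json

# Assumption: the residual flag sits directly on the target object.
target = {"model": "vocals", "formats": ["wav"], "residual": True}

# Python's True is serialized as JSON true, the form the text above uses.
serialized = json.dumps(target)
print(serialized)
# {"model": "vocals", "formats": ["wav"], "residual": true}
```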
Example — Using Models in a Tasks API Request
curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://demos.audioshake.ai/demo-assets/shakeitup.mp3",
    "targets": [
      { "model": "vocals", "formats": ["wav"], "variant": "high_quality" },
      { "model": "instrumental", "formats": ["wav"] },
      { "model": "transcription", "formats": ["json"], "language": "en" }
    ]
  }'
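The same request body can also be assembled in code. The following is a minimal Python sketch that uses only the fields shown in the curl example above. `build_task_payload` is a hypothetical helper, not part of any AudioShake SDK, and the rule that every target needs "model" and "formats" is inferred from the example rather than stated by the API. The sketch only builds and prints the JSON body; it does not send a request.

```python
import json

def build_task_payload(source_url, targets):
    """Return a JSON-serializable body for POST /tasks (hypothetical helper)."""
    if not targets:
        raise ValueError("at least one target is required")
    for target in targets:
        # Assumption: every target carries a model key and a list of formats,
        # as in the curl example above.
        if "model" not in target or "formats" not in target:
            raise ValueError("each target needs 'model' and 'formats'")
    return {"url": source_url, "targets": list(targets)}

payload = build_task_payload(
    "https://demos.audioshake.ai/demo-assets/shakeitup.mp3",
    [
        {"model": "vocals", "formats": ["wav"], "variant": "high_quality"},
        {"model": "instrumental", "formats": ["wav"]},
        {"model": "transcription", "formats": ["json"], "language": "en"},
    ],
)
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the HTTP call makes it easy to inspect or log the body before submitting it.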
Dialogue, Music, & Effects
| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Dialogue | dialogue | Isolates speech or vocals from any other sound. | 1.5 | 3 Hours |
| Effects | effects | Removes dialogue and music but retains ambience, sound effects, and environmental noise. | 1.5 | 3 Hours |
| Music Removal | music_removal | Removes music from audio while retaining dialogue, background effects, and natural sound. | N/A | 1 Hour |
| Background (Music & FX) | music_fx | Removes dialogue to extract a clean background stem of music and effects. | 1.5 | 3 Hours |
| Music Detection | music_detection | Detects the portions of an audio file that contain music. | 0.5 | 3 Hours |
| Multi-Voice | multi_voice | Separates dialogue from multiple speakers in audio recordings, delivering individual audio files per speaker. Available in two_speaker and n_speaker variants, detailed below under Variants. | N/A | 1 Hour |
Music Removal (music_removal) and Multi-Voice (multi_voice) are not currently available via the /tasks route. Contact support@audioshake.ai for access.
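A client can filter out these two models before building a request. The sketch below is one way to do that in Python; the set contents come from the note above, while `split_by_availability` is a hypothetical helper name.

```python
# Model keys the note above lists as not available via the /tasks route.
NOT_ON_TASKS_ROUTE = {"music_removal", "multi_voice"}

def split_by_availability(model_keys):
    """Split keys into (usable via /tasks, requires contacting support)."""
    usable = [k for k in model_keys if k not in NOT_ON_TASKS_ROUTE]
    blocked = [k for k in model_keys if k in NOT_ON_TASKS_ROUTE]
    return usable, blocked

usable, blocked = split_by_availability(
    ["dialogue", "music_removal", "effects", "multi_voice"]
)
print(usable)   # ['dialogue', 'effects']
print(blocked)  # ['music_removal', 'multi_voice']
```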
Transcription & Alignment
| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Transcription | transcription | Produces a text representation of spoken words or audio content. | 1.0 | 1 Hour |
| Alignment | alignment | Synchronizes audio with its corresponding text or captions. | 1.0 | 1 Hour |
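Because Transcription and Alignment accept at most 1 hour of audio while most separation models accept 3 hours, it can help to check a file's duration against each requested model before submitting. The Python sketch below uses the Max Length values from the tables above for a handful of models. `over_limit` is a hypothetical helper, and treating a file exactly at the limit as acceptable is an assumption, since the tables do not say whether the limit is inclusive.

```python
# Max Length values (in hours) copied from the tables above.
MAX_LENGTH_HOURS = {
    "dialogue": 3,
    "effects": 3,
    "music_fx": 3,
    "music_detection": 3,
    "transcription": 1,
    "alignment": 1,
}

def over_limit(model_keys, duration_seconds):
    """Return the model keys whose max length the given duration exceeds."""
    hours = duration_seconds / 3600
    # Assumption: a file exactly at the limit is still accepted.
    return [k for k in model_keys if hours > MAX_LENGTH_HOURS[k]]

# A 90-minute file fits the 3-hour dialogue limit but not the 1-hour
# transcription limit.
print(over_limit(["dialogue", "transcription"], 90 * 60))  # ['transcription']
```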
Alignment-Only Targets
You can run Alignment as a standalone target. If you don't provide a transcript, alignment will automatically generate one for you. If you already have an accurate transcript, you can provide it to skip transcription and only generate synchronized timestamps.
Provide either a transcriptUrl (public URL to your transcript file) or a transcriptAssetId (if you've already uploaded the transcript as an asset).
curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://demos.audioshake.ai/demo-assets/shakeitup.mp3",
    "assetId": "",
    "targets": [
      {
        "model": "alignment",
        "formats": ["json", "txt", "srt"],
        "transcriptUrl": "",
        "transcriptAssetId": ""
      }
    ]
  }'
Use url or assetId for your source audio/video file, and transcriptUrl or transcriptAssetId within the target for your transcript. You only need to provide one of each pair.
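The "one of each pair" rule can be enforced when building the request body. The Python sketch below is one possible approach. `build_alignment_request` is a hypothetical helper, the transcript URL is a placeholder, and rejecting a request that supplies both transcript fields is an assumption; the text above only says that one is enough. The transcript stays optional, since alignment generates one when none is provided.

```python
def build_alignment_request(*, url=None, asset_id=None,
                            transcript_url=None, transcript_asset_id=None,
                            formats=("json",)):
    """Build a /tasks body with a single alignment target (hypothetical helper)."""
    # Source file: exactly one of url / assetId.
    if (url is None) == (asset_id is None):
        raise ValueError("provide exactly one of url or asset_id")
    # Transcript: optional, but assumed to be at most one of the two fields.
    if transcript_url is not None and transcript_asset_id is not None:
        raise ValueError("provide at most one transcript reference")

    body = {"url": url} if url is not None else {"assetId": asset_id}
    target = {"model": "alignment", "formats": list(formats)}
    if transcript_url is not None:
        target["transcriptUrl"] = transcript_url
    elif transcript_asset_id is not None:
        target["transcriptAssetId"] = transcript_asset_id
    body["targets"] = [target]
    return body

request_body = build_alignment_request(
    url="https://demos.audioshake.ai/demo-assets/shakeitup.mp3",
    transcript_url="https://example.com/transcript.txt",  # placeholder URL
    formats=["json", "srt"],
)
print(request_body)
```

Omitting the unused field entirely, rather than sending an empty string, keeps the body unambiguous about which member of each pair was chosen.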
Variants
Certain models offer variants optimized for specific use cases. To use a variant, include "variant": "<desired_variant>" in the target object when submitting a task via the API. The available variants are listed below:
| Model | Variant | Description | Plan | Credits / Minute |
|---|---|---|---|---|
| multi_voice | two_speaker | Optimized for separating two speakers. (Default) | Premium | N/A |
| multi_voice | n_speaker | Creates stems for any number of speakers. | Advanced | N/A |
| vocals | high_quality | Higher quality but longer processing time. | Premium | 1.5 |
| instrumental | high_quality | Higher quality but longer processing time. | Premium | 1.5 |
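Since the high_quality variant changes the per-minute rate, a quick estimate can help when planning a request with several targets. The Python sketch below uses rates from the tables in this document for three models. `estimate_credits` is a hypothetical helper, and it assumes credits scale linearly with duration with no rounding or minimum charge, which this document does not confirm.

```python
# Credits per minute from the model tables (base) and the variants table.
BASE_RATE = {"vocals": 1.0, "instrumental": 1.0, "transcription": 1.0}
VARIANT_RATE = {
    ("vocals", "high_quality"): 1.5,
    ("instrumental", "high_quality"): 1.5,
}

def estimate_credits(targets, duration_minutes):
    """Estimate total credits: sum of per-minute rates times duration.

    Assumption: linear billing with no rounding or minimum charge.
    """
    total_rate = 0.0
    for target in targets:
        key = (target["model"], target.get("variant"))
        # Use the variant rate when one is listed, else the base rate.
        total_rate += VARIANT_RATE.get(key, BASE_RATE[target["model"]])
    return total_rate * duration_minutes

targets = [
    {"model": "vocals", "variant": "high_quality"},
    {"model": "instrumental"},
    {"model": "transcription"},
]
# (1.5 + 1.0 + 1.0) credits per minute over a 4-minute track.
print(estimate_credits(targets, 4))  # 14.0
```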
Example — Including a Variant in a Tasks API Request
curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://demos.audioshake.ai/demo-assets/shakeitup.mp3",
    "targets": [
      { "model": "vocals", "variant": "high_quality", "formats": ["wav"] }
    ]
  }'
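A client-side check can catch a variant that is not listed for the requested model before the request is sent. The Python sketch below is one way to do this, using the variants table above as its data. `check_variant` is a hypothetical helper, and how the API itself responds to an unlisted variant is not covered here.

```python
# Variants from the table above, keyed by model.
VARIANTS = {
    "multi_voice": {"two_speaker", "n_speaker"},
    "vocals": {"high_quality"},
    "instrumental": {"high_quality"},
}

def check_variant(target):
    """Return the target unchanged, or raise if its variant is not listed."""
    variant = target.get("variant")
    if variant is None:
        return target  # no variant requested
    allowed = VARIANTS.get(target["model"], set())
    if variant not in allowed:
        raise ValueError(f"{target['model']!r} has no variant {variant!r}")
    return target

check_variant({"model": "vocals", "variant": "high_quality", "formats": ["wav"]})

try:
    check_variant({"model": "drums", "variant": "high_quality", "formats": ["wav"]})
except ValueError as err:
    print(err)  # 'drums' has no variant 'high_quality'
```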