Models
AudioShake’s models define the type of audio processing applied to your content.
Each model represents a distinct operation, such as isolating vocals, removing background music, or transcribing dialogue. You can combine several models in a single Tasks API request to produce multiple outputs from the same source file.
Models are organized by use case:
- Instrument Stem Separation — break down songs into individual components like vocals, drums, and bass.
- Dialogue, Music, and Effects — isolate voices or remove background elements for film, TV, and dubbing.
- Transcription and Alignment — convert spoken content into synchronized text and timestamps.
Use these models to design flexible workflows for music production, post-production, accessibility, and AI data preparation.
Instrument Stem Separation
These models isolate or extract musical components from a mixed track.
They’re useful for remixing, immersive audio, gaming, and music education.
All models in this category can be called via the /tasks route and accept standard formats such as WAV, MP3, or FLAC.
| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Instrumental | instrumental | Generates an instrumental-only version by removing vocals. For best quality, use the high_quality variant. | 1.0 | 3 Hours |
| Drums | drums | Isolates percussion and rhythmic elements. | 1.0 | 3 Hours |
| Vocals | vocals | Extracts vocal elements from a mix. Supports the high_quality variant for improved clarity. | 1.0 | 3 Hours |
| Bass | bass | Separates bass instruments and low-frequency sounds. | 1.0 | 3 Hours |
| Guitar | guitar | Isolates guitar stems (acoustic, electric, classical). | 1.0 | 3 Hours |
| Piano | piano | Extracts piano or keyboard instruments. | 1.0 | 3 Hours |
| Strings | strings | Isolates orchestral string instruments like violin, cello, and viola. | 1.0 | 3 Hours |
| Wind | wind | Extracts wind instruments such as flute and saxophone. | 1.0 | 3 Hours |
| Other | other | Captures remaining instrumentation after main stems are removed. | 1.0 | 3 Hours |
| Other-x-Guitar | other-x-guitar | Residual instrumentation after removing vocals, drums, bass, and guitar. | 1.0 | 3 Hours |
To include a residual stem in your results, set "residual": true in the target metadata when creating your task. For more information, contact support@audioshake.ai.
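As a rough illustration of the note above, the request body below adds the flag to a single target. The text only says the flag belongs in the target metadata, so placing it beside "model" and "formats" is an assumption, as are the example URL and the choice of the vocals model.

```shell
# Hypothetical request body. Assumption: "residual" sits next to
# "model" and "formats" inside the target object.
payload='{
  "url": "https://example.com/audio/song.wav",
  "targets": [
    { "model": "vocals", "formats": ["wav"], "residual": true }
  ]
}'

# Print the body for inspection. To submit it, pass it to curl
# with -d "$payload" and the same headers used in the other examples.
printf '%s\n' "$payload"
```

If the API rejects the field in this position, support can confirm where it is expected.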
Example — Using Models in a Tasks API Request
curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/session.wav",
    "targets": [
      { "model": "vocals", "formats": ["wav"], "variant": "high_quality" },
      { "model": "instrumental", "formats": ["wav"] },
      { "model": "transcription", "formats": ["json"], "language": "en" }
    ]
  }'
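When the set of stems varies per file, the request body can be assembled from a list of model keys instead of being written by hand. The sketch below shows one way to do that in plain shell. Here build_payload is a hypothetical helper, not part of the AudioShake API. It assumes every target wants WAV output and that the URL and model keys contain no characters that need JSON escaping.

```shell
# build_payload URL MODEL [MODEL...]
# Hypothetical helper: prints a JSON body with one WAV target per model key.
build_payload() {
  url=$1
  shift
  targets=""
  for model in "$@"; do
    # Prefix a comma separator for every target after the first.
    targets="${targets:+$targets, }{ \"model\": \"$model\", \"formats\": [\"wav\"] }"
  done
  printf '{ "url": "%s", "targets": [%s] }\n' "$url" "$targets"
}

# Print a body requesting three stems from one source file.
build_payload "https://example.com/audio/song.wav" vocals drums bass
```

The printed body can then be passed to curl with -d "$(build_payload ...)" using the same headers as the example above.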
Dialogue, Music, and Effects
| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Dialogue | dialogue | Isolates speech or vocals from all other sound. | 1.5 | 3 Hours |
| Effects | effects | Removes dialogue and music while retaining ambience, sound effects, and environmental noise. | 1.5 | 3 Hours |
| Music Removal | music_removal | Removes music while retaining dialogue, background effects, and natural sound. | N/A | 1 Hour |
| Background (Music & FX) | music_fx | Removes dialogue to extract a clean background stem of music and effects. | 1.5 | 3 Hours |
| Music Detection | music_detection | Detects the portions of an audio file that contain music. | 0.5 | 3 Hours |
| Multi-Voice | multi_voice | Separates dialogue from multiple speakers, delivering an individual audio file per speaker. Available in two_speaker and n_speaker variants, described under Variants. | N/A | 1 Hour |
Music Removal and Multi-Voice separation are not currently available via the /tasks route. Please contact support@audioshake.ai for access.
Transcription and Alignment
| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Transcription | transcription | Produces a text representation of the spoken words in the audio. | 1.0 | 1 Hour |
| Alignment | alignment | Synchronizes audio with its corresponding text or captions. | 1.0 | 1 Hour |
If you run Transcription and Alignment together (T&A), the job is billed at the Premium rate of 1.5 credits per minute.
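To make the pricing concrete, the sketch below multiplies a per-minute rate by a duration. The estimate_credits helper is hypothetical, and the calculation assumes credits scale linearly with duration with no rounding or minimum charge, which the text does not confirm.

```shell
# estimate_credits RATE MINUTES
# Hypothetical helper: prints RATE * MINUTES with one decimal place.
estimate_credits() {
  LC_ALL=C awk -v rate="$1" -v minutes="$2" \
    'BEGIN { printf "%.1f\n", rate * minutes }'
}

minutes=42  # example duration, chosen for illustration
echo "Transcription only: $(estimate_credits 1.0 "$minutes") credits"
echo "Alignment only:     $(estimate_credits 1.0 "$minutes") credits"
echo "T&A together:       $(estimate_credits 1.5 "$minutes") credits"
```

Under these assumptions, a 42-minute file costs 42 credits for either model alone and 63 credits for T&A.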
Example — Transcription and Alignment in a Single Task
You can run both Transcription and Alignment together within one /tasks request.
This produces synchronized text output with word-level timestamps.
curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/interview.mp3",
    "targets": [
      { "model": "transcription", "formats": ["json"], "language": "en" },
      { "model": "alignment", "formats": ["json"], "language": "en" }
    ]
  }'
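The Max Length columns differ by model: Transcription and Alignment stop at 1 hour, while most separation models accept up to 3 hours. A client-side check before submitting can catch files that are too long for one of the requested models. The sketch below hard-codes the limits from the tables on this page. The helpers max_seconds and fits_limit are hypothetical, and the duration is assumed to be known already in whole seconds (for example, from a tool such as ffprobe).

```shell
# max_seconds MODEL
# Hypothetical helper: prints the documented max length in seconds.
max_seconds() {
  case "$1" in
    transcription|alignment|music_removal|multi_voice)
      echo 3600 ;;   # 1 hour
    instrumental|drums|vocals|bass|guitar|piano|strings|wind|other|other-x-guitar|dialogue|effects|music_fx|music_detection)
      echo 10800 ;;  # 3 hours
    *)
      return 1 ;;    # model key not listed on this page
  esac
}

# fits_limit MODEL SECONDS
# Succeeds when SECONDS is within the limit; returns 2 for unknown models.
fits_limit() {
  limit=$(max_seconds "$1") || return 2
  [ "$2" -le "$limit" ]
}

duration=5400  # 90 minutes, an illustrative value
for model in vocals transcription; do
  if fits_limit "$model" "$duration"; then
    echo "$model: within the documented max length"
  else
    echo "$model: exceeds the documented max length"
  fi
done
```

A check like this is most useful when one request mixes models from different tables, since a 90-minute file fits the separation models but not Transcription.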
Variants
Certain models offer variants optimized for specific use cases. To use a variant, include "variant": "<desired_variant>" in the target's parameters when submitting a task via the API. The available variants are listed below:
| Model | Variant | Description | Plan | Credits / Minute |
|---|---|---|---|---|
| multi_voice | two_speaker | Optimized for separating two speakers. (Default) | Premium | N/A |
| multi_voice | n_speaker | Creates stems for any number of speakers. | Advanced | N/A |
| vocals | high_quality | Higher quality but longer processing time. | Premium | 1.5 |
| instrumental | high_quality | Higher quality but longer processing time. | Premium | 1.5 |
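Because only a few model and variant pairs exist, a client can reject unsupported combinations before sending a request. The following is a minimal sketch of such a check. Here valid_variant is a hypothetical helper that hard-codes the table above, so it would need updating whenever the table changes.

```shell
# valid_variant MODEL VARIANT
# Hypothetical helper: succeeds only for pairs listed in the variants table.
valid_variant() {
  case "$1:$2" in
    multi_voice:two_speaker|multi_voice:n_speaker) return 0 ;;
    vocals:high_quality|instrumental:high_quality) return 0 ;;
    *) return 1 ;;
  esac
}

# Check two candidate pairs before building a request.
for pair in "vocals high_quality" "drums high_quality"; do
  set -- $pair  # split "model variant" into $1 and $2
  if valid_variant "$1" "$2"; then
    echo "$1/$2: listed"
  else
    echo "$1/$2: not listed"
  fi
done
```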
Example — Including a Variant in a Tasks API Request
curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/song.wav",
    "targets": [
      { "model": "instrumental", "variant": "high_quality", "formats": ["wav"] }
    ]
  }'