Models

AudioShake’s Models define the specific type of audio processing applied to your content.
Each model represents a distinct audio processing operation—such as isolating vocals, removing background music, or transcribing dialogue—and can be combined within a single Tasks API request to produce multiple outputs from the same source file.

Models are organized by use case:

  • Instrument Stem Separation — break down songs into individual components like vocals, drums, and bass.
  • Dialogue, Music, and Effects — isolate voices or remove background elements for film, TV, and dubbing.
  • Transcription and Alignment — convert spoken content into synchronized text and timestamps.

Use these models to design flexible workflows for music production, post-production, accessibility, and AI data preparation.


Instrument Stem Separation

These models isolate or extract musical components from a mixed track.
They’re useful for remixing, immersive audio, gaming, and music education.
All models can be called via the /tasks route and support standard formats like WAV, MP3, or FLAC.

| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Instrumental | instrumental | Generates an instrumental-only version by removing vocals. For best quality, use the high_quality variant. | 1.0 | 3 Hours |
| Drums | drums | Isolates percussion and rhythmic elements. | 1.0 | 3 Hours |
| Vocals | vocals | Extracts vocal elements from a mix. Supports the high_quality variant for improved clarity. | 1.0 | 3 Hours |
| Bass | bass | Separates bass instruments and low-frequency sounds. | 1.0 | 3 Hours |
| Guitar | guitar | Isolates guitar stems (acoustic, electric, classical). | 1.0 | 3 Hours |
| Piano | piano | Extracts piano or keyboard instruments. | 1.0 | 3 Hours |
| Strings | strings | Isolates orchestral string instruments like violin, cello, and viola. | 1.0 | 3 Hours |
| Wind | wind | Extracts wind instruments such as flute and saxophone. | 1.0 | 3 Hours |
| Other | other | Captures remaining instrumentation after main stems are removed. | 1.0 | 3 Hours |
| Other-x-Guitar | other-x-guitar | Residual instrumentation after removing vocals, drums, bass, and guitar. | 1.0 | 3 Hours |

Residual Stems

To include a residual stem in your results, set "residual": true in the target metadata when creating your task. For more information, contact support@audioshake.ai.
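
As a rough sketch of the note above, the snippet below builds a request body with "residual": true set on a single target. The placement of the flag is an assumption (here it sits directly on the target object, beside "model" and "formats"), and build_residual_payload is a hypothetical helper name, so confirm the exact field location with support before relying on it.

```shell
# Hypothetical helper: prints a /tasks request body that asks for a residual stem.
# Assumption: "residual" belongs on the target object itself.
# Assumption: the source URL contains no double quotes or backslashes.
build_residual_payload() {
  cat <<EOF
{
  "url": "$1",
  "targets": [
    { "model": "vocals", "formats": ["wav"], "residual": true }
  ]
}
EOF
}

# Dry run: print the body without contacting the API.
build_residual_payload "https://example.com/audio/session.wav"

# To submit, pipe the body to curl (-d @- reads the data from stdin):
# build_residual_payload "https://example.com/audio/session.wav" |
#   curl -sS -X POST "https://api.audioshake.ai/tasks" \
#     -H "x-api-key: $AUDIOSHAKE_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```

Printing the body first makes it easy to inspect the JSON before spending credits on a task.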

Example — Using Models in a Tasks API Request

curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/session.wav",
    "targets": [
      { "model": "vocals", "formats": ["wav"], "variant": "high_quality" },
      { "model": "instrumental", "formats": ["wav"] },
      { "model": "transcription", "formats": ["json"], "language": "en" }
    ]
  }'

Dialogue, Music, & Effects

| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Dialogue | dialogue | Isolates speech or vocals from any other sound. | 1.5 | 3 Hours |
| Effects | effects | Removes dialogue and music but retains the ambience, sound effects, and environmental noise. | 1.5 | 3 Hours |
| Music removal | music_removal | Removes music from audio while retaining dialogue, background effects, and natural sound. | N/A | 1 Hour |
| Background (Music & FX) | music_fx | Removes dialogue to extract a clean background stem of music and effects. | 1.5 | 3 Hours |
| Music detection | music_detection | Detects the portions of an audio file that contain music. | 0.5 | 3 Hours |
| Multi-Voice | multi_voice | Separates dialogue from multiple speakers in audio recordings, delivering individual audio files per speaker. Available in two_speaker and n_speaker variants, detailed below. | N/A | 1 Hour |

Music Removal & Multi-Voice Availability

Currently, Music Removal and Multi-Voice separation are not available via the /tasks route. Please contact support@audioshake.ai for access.
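
The models in this section that are available on /tasks can be requested in the same way as the stem models. The sketch below assumes that dialogue and music_fx accept the same target shape ("model" plus "formats") shown in the other examples on this page; build_dme_payload is a hypothetical helper name.

```shell
# Hypothetical helper: prints a /tasks request body that splits a mix into
# a dialogue stem and a background (music and effects) stem.
# Assumption: these models take the same target fields as the stem models.
build_dme_payload() {
  cat <<EOF
{
  "url": "$1",
  "targets": [
    { "model": "dialogue", "formats": ["wav"] },
    { "model": "music_fx", "formats": ["wav"] }
  ]
}
EOF
}

# Dry run: print the body without contacting the API.
build_dme_payload "https://example.com/audio/scene.wav"

# To submit, pipe the body to curl (-d @- reads the data from stdin):
# build_dme_payload "https://example.com/audio/scene.wav" |
#   curl -sS -X POST "https://api.audioshake.ai/tasks" \
#     -H "x-api-key: $AUDIOSHAKE_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```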


Transcription & Alignment

| Name | Model Key | Description | Credits / Minute | Max Length |
|---|---|---|---|---|
| Transcription | transcription | Text representation of spoken words or audio content | 1 | 1 Hour |
| Alignment | alignment | Synchronization of audio and corresponding text or captions | 1 | 1 Hour |

Combined T&A pricing

If you run Transcription and Alignment together (T&A), pricing is Premium at 1.5 credits per minute.
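
To make the rates concrete, the sketch below estimates credit usage as minutes multiplied by the per-minute rate. It assumes billing is strictly proportional to duration with no rounding or minimum charge, which this page does not state; estimate_credits is a hypothetical helper name.

```shell
# Hypothetical helper: estimated credits = duration in minutes * credits per minute.
# Assumption: no rounding up to whole minutes and no minimum charge.
estimate_credits() {
  awk -v minutes="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", minutes * rate }'
}

estimate_credits 40 1.5   # 40-minute file, combined T&A rate: prints 60.00
estimate_credits 40 1     # 40-minute file, Transcription only: prints 40.00
```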

Example — Transcription and Alignment in a Single Task

You can run both Transcription and Alignment together within one /tasks request.
This produces synchronized text output with word-level timestamps.

curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/interview.mp3",
    "targets": [
      { "model": "transcription", "formats": ["json"], "language": "en" },
      { "model": "alignment", "formats": ["json"], "language": "en" }
    ]
  }'

Variants

Certain models offer variants optimized for specific audio processing use cases. To use a variant, include "variant": "<desired_variant>" in the metadata parameters when submitting a job via the API. The available variants are listed below:

| Model | Variant | Description | Plan | Credits / Minute |
|---|---|---|---|---|
| multi_voice | two_speaker | Optimized for separating two speakers. (Default) | Premium | N/A |
| multi_voice | n_speaker | Creates stems for any number of speakers. | Advanced | N/A |
| vocals | high_quality | Higher quality but longer processing time. | Premium | 1.5 |
| instrumental | high_quality | Higher quality but longer processing time. | Premium | 1.5 |
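
Because only some model and variant pairs are listed, a small client-side check can catch typos before a request is sent. The sketch below hard-codes the four pairs from the table above; is_known_variant and check are hypothetical helper names, and the list would need updating whenever the table changes.

```shell
# Hypothetical helper: succeeds only for model/variant pairs listed in the table.
is_known_variant() {
  case "$1:$2" in
    multi_voice:two_speaker|multi_voice:n_speaker) return 0 ;;
    vocals:high_quality|instrumental:high_quality) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether a pair is listed, without failing the script.
check() {
  if is_known_variant "$1" "$2"; then
    echo "$1 + $2: listed"
  else
    echo "$1 + $2: not listed"
  fi
}

check vocals high_quality   # prints "vocals + high_quality: listed"
check drums high_quality    # prints "drums + high_quality: not listed"
```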

Example — Including a Variant in a Tasks API Request

curl -sS -X POST "https://api.audioshake.ai/tasks" \
  -H "x-api-key: $AUDIOSHAKE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/interview.wav",
    "targets": [
      { "model": "multi_voice", "variant": "n_speaker", "formats": ["wav"] }
    ]
  }'