AudioShake models define the type of processing applied to your source audio. You can combine multiple models in a single /tasks request to generate multiple outputs from the same file. See Formats for supported input and output file types.
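As a hedged sketch of the combined request described above (the payload field names `url` and `models` are assumptions for illustration, not confirmed by this page; consult the API reference for the real schema):

```python
import json

# Sketch of a /tasks payload requesting several models for one source
# file. Field names ("url", "models") are assumed, not confirmed here.
def build_task_request(audio_url: str, models: list[str]) -> str:
    payload = {
        "url": audio_url,   # source audio file (assumed field name)
        "models": models,   # e.g. ["vocals", "drums", "bass"]
    }
    return json.dumps(payload)

print(build_task_request("https://example.com/song.wav",
                         ["vocals", "drums", "bass"]))
```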
| Category | Description |
| --- | --- |
| Instrument stem separation | Isolate vocals, drums, bass, guitar, and other instruments from music |
| Speech | Multi-speaker separation and speech denoising |
| Post-production | Dialogue, effects, and music separation for dubbing and editing |
| Copyright compliance | Music detection, identification, and removal |
| Lyric transcription | Time-synced lyrics and transcript alignment |

Instrument stem separation

Use these models to split songs into musical components for remixing, post-production, education, and interactive experiences.
| Model | Description | Credits / min |
| --- | --- | --- |
| `vocals` | All sung voices together (lead + backing). | 1.0 |
| `vocals_lead` | Primary lead melody/lyric vocal only. | 1.0 |
| `vocals_backing` | Backing parts only (harmonies, ad-libs, choir). | 1.0 |
| `instrumental` | Full music mix with vocals removed. | 1.0 |
| `drums` | Percussion and drum kit sources (drums, congas, hi-hat, cajon). | 1.0 |
| `bass` | Low-end bass sources (electric bass, double bass, bass synth). | 1.0 |
| `guitar` | All guitar-family parts together (acoustic + electric). | 1.0 |
| `guitar_electric` | Electric guitar only (clean or distorted). | 1.0 |
| `guitar_acoustic` | Acoustic/plucked guitar-family sources (acoustic guitar, banjo, lute). | 1.0 |
| `piano` | Acoustic piano only. | 1.0 |
| `keys` | Keyboard family: acoustic, electric, and digital pianos, clavinet, Hammond organ, harpsichord. | 1.0 |
| `strings` | Bowed or pizzicato string instruments and sections. | 1.0 |
| `wind` | Wind instruments including woodwind and brass (flute, saxophone, clarinet, bassoon, harmonica). | 1.0 |
| `other` | Everything except vocals, drums, and bass. | 1.0 |
| `other-x-guitar` | Everything except vocals, drums, bass, and guitar. | 1.0 |
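Since each model lists a per-minute rate, a rough cost estimate can be sketched as follows (this assumes billing is linear: each requested model charges its rate times the audio duration in minutes, which is an inference from the "Credits / min" column, not a documented billing formula):

```python
# Back-of-envelope cost estimate. Assumption: each requested model bills
# its per-minute rate multiplied by the audio duration in minutes.
def estimate_credits(duration_min: float, rates: list[float]) -> float:
    return sum(rate * duration_min for rate in rates)

# A 4-minute song through vocals (1.0/min) and drums (1.0/min):
print(estimate_credits(4.0, [1.0, 1.0]))  # 8.0
```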

Speech

Models for speaker isolation, denoising, and multi-speaker separation.
| Model | Description | Credits / min |
| --- | --- | --- |
| `multi_voice` | Outputs one stem per speaker from a mixed multi-speaker source, even with overlapping speech. | 10.0 |
| `speech_clarity` | Remove background noise, hum, and interference from speech. Effective for low-resolution audio, noisy environments, and forensic intelligibility. | 1.5 |
Maximum input length for multi_voice is 1.5 hours.

Post-production

Separation models for dubbing, dialogue cleanup, and audio editing workflows.
| Model | Description | Credits / min |
| --- | --- | --- |
| `dialogue` | Speech-focused stem for podcasts, interviews, and film/TV dialogue. | 1.5 |
| `effects` | Ambience and SFX bed with dialogue and music removed. | 1.5 |
| `music_fx` | Keep music + effects; remove dialogue. | 1.5 |

Copyright compliance

Models for detecting, identifying, and removing music in content.
| Model | Description | Credits / min |
| --- | --- | --- |
| `music_detection` | Returns time ranges where music is present. | 0.5 |
| `music_identification` | Identify music and return track metadata. | N/A |
| `music_removal` | Keep speech + effects; remove background music. | N/A |
Music identification and music removal are available by request. Contact info@audioshake.ai to request access.

Lyric transcription

Use these models to produce transcripts and time-synced text; they are state of the art for lyric transcription.
| Model | Description | Credits / min |
| --- | --- | --- |
| `transcription` | Generate a line-level lyric transcript from a song. Best for getting lyrics as text. | 1.0 |
| `alignment` | Generate precise word-level and line-level timestamps. Use this for karaoke, subtitles, and any workflow that needs per-word timing. Transcribes automatically if no transcript is provided. | 1.0 |
For alignment, provide one audio source (url or assetId) and optionally a transcript input (transcriptUrl or transcriptAssetId).
Maximum input length for transcription and alignment is 45 minutes.