ElevenLabs

ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text operations. DeepIntShield performs conversions including:

  • Model ID mapping - Uses provider model identifier directly
  • Voice configuration - Maps voice settings (stability, similarity boost, speed, style)
  • Response format conversion - Speech format handling (MP3, Opus, PCM/WAV)
  • Timestamp support - Character-level timing alignment for TTS
  • Transcription with alignment - Word and character-level timing, diarization, and additional formats
  • Pronunciation dictionaries - Support for custom pronunciation rules
  • Voice quality parameters - Stability, similarity boost, and speaker boost controls
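
| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Speech (TTS) | Yes | Yes | /v1/text-to-speech/{voice_id} |
| Transcriptions (STT) | Yes | - | /v1/speech-to-text |
| List Models | Yes | - | /v1/models |
| Chat Completions | No | No | - |
| Responses API | No | No | - |
| Text Completions | No | No | - |
| Embeddings | No | No | - |
| Image Generation | No | No | - |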

| Parameter | Mapping | Notes |
|---|---|---|
| input.input | text | The text to convert to speech (required) |
| model | model_id | Model identifier (e.g., "eleven_multilingual_v2") |
| response_format | Query param output_format | Speech format (see Response Format) |

Voice settings are optional and controlled via params:

| Parameter | ElevenLabs Mapping | Default | Range |
|---|---|---|---|
| speed | voice_settings.speed | 1.0 | 0.5-2.0 |
| extra_params.stability | voice_settings.stability | 0.5 | 0-1.0 |
| extra_params.similarity_boost | voice_settings.similarity_boost | 0.75 | 0-1.0 |
| extra_params.use_speaker_boost | voice_settings.use_speaker_boost | true | boolean |
| extra_params.style | voice_settings.style | 0 | 0-1.0 |
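
The defaults and ranges in the table can be captured in a small client-side helper. This is a minimal sketch: `build_voice_settings` is a hypothetical name, and whether DeepIntShield itself range-checks these values or leaves that to ElevenLabs is not stated here.

```python
# Defaults and ranges copied from the voice settings table above.
DEFAULTS = {
    "speed": 1.0,
    "stability": 0.5,
    "similarity_boost": 0.75,
    "use_speaker_boost": True,
    "style": 0.0,
}
RANGES = {
    "speed": (0.5, 2.0),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
}


def build_voice_settings(**overrides):
    """Hypothetical helper: merge overrides onto the defaults and range-check them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown voice settings: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        if not low <= settings[name] <= high:
            raise ValueError(f"{name}={settings[name]} is outside {low}-{high}")
    return settings
```

Calling `build_voice_settings(speed=1.5)` returns the defaults with only `speed` changed, while an out-of-range value such as `speed=3.0` raises `ValueError` before any request is sent.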

Use extra_params for ElevenLabs-specific TTS features:

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "eleven_multilingual_v2",
    "input": {"input": "Hello, how are you?"},
    "voice": "21m00Tcm4TlvDq8ikWAM",
    "response_format": "mp3",
    "stability": 0.5,
    "similarity_boost": 0.75,
    "use_speaker_boost": true,
    "style": 0,
    "speed": 1.0,
    "language_code": "en",
    "seed": 42,
    "previous_text": "Context text",
    "next_text": "Future context",
    "apply_text_normalization": "auto"
  }'

| Parameter | Type | Description |
|---|---|---|
| language_code | string | Language code (e.g., "en", "es") |
| seed | integer | Reproducible output (0-4294967295) |
| previous_text | string | Previous text context for consistency |
| next_text | string | Next text context for consistency |
| previous_request_ids | string[] | Previous request IDs for continuity |
| next_request_ids | string[] | Next request IDs for continuity |
| apply_text_normalization | string | Text normalization mode: "auto", "on", "off" |
| apply_language_text_normalization | boolean | Apply language-specific text normalization |

Response Format

| Format | Output | Quality | Bitrate |
|---|---|---|---|
| mp3 | MP3 | High | 128 kbps @ 44100 Hz |
| opus | Opus | High | 128 kbps @ 48000 Hz |
| wav / pcm | PCM WAV | Lossless | 16-bit @ 44100 Hz |
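
To illustrate how a `response_format` value could end up as the `output_format` query parameter, here is a rough sketch. The mapping values are assumptions inferred from the bitrates and sample rates in the table (they are valid ElevenLabs `output_format` identifiers, but the mapping DeepIntShield actually applies lives in its own source and may differ), and `speech_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed mapping, inferred from the format table above; not taken from the gateway source.
OUTPUT_FORMATS = {
    "mp3": "mp3_44100_128",    # 128 kbps @ 44100 Hz
    "opus": "opus_48000_128",  # 128 kbps @ 48000 Hz
    "wav": "pcm_44100",        # 16-bit PCM @ 44100 Hz
    "pcm": "pcm_44100",
}


def speech_url(base_url, voice_id, response_format="mp3"):
    """Hypothetical helper: build the TTS URL with output_format as a query parameter."""
    key = response_format.lower()
    if key not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    query = urlencode({"output_format": OUTPUT_FORMATS[key]})
    return f"{base_url}/v1/text-to-speech/{voice_id}?{query}"
```

For example, `speech_url("https://api.elevenlabs.io", "21m00Tcm4TlvDq8ikWAM", "opus")` yields a URL ending in `?output_format=opus_48000_128`.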

To get character-level timing alignment, enable with_timestamps:

{
  "with_timestamps": true
}

When enabled, the endpoint /v1/text-to-speech/{voice_id}/with-timestamps is used and the response includes:

  • audio_base64 - Audio data as base64-encoded string
  • alignment.char_start_times_ms - Character start times in milliseconds
  • alignment.char_end_times_ms - Character end times in milliseconds
  • alignment.characters - Array of characters
  • normalized_alignment - Same as alignment but for normalized text

Standard response:

{
  "audio": "<binary audio data>"
}

Response with timestamps:

{
  "audio_base64": "<base64 encoded audio>",
  "alignment": {
    "char_start_times_ms": [0, 150, 280, ...],
    "char_end_times_ms": [150, 280, 420, ...],
    "characters": ["H", "e", "l", "l", "o", ...]
  },
  "normalized_alignment": {
    "char_start_times_ms": [...],
    "char_end_times_ms": [...],
    "characters": [...]
  }
}
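
A client consuming the timestamped response typically needs to decode the audio and line up each character with its timing. The following is a minimal sketch under the field names shown above; `speech_path` and `parse_timestamped_speech` are hypothetical helper names, not part of DeepIntShield.

```python
import base64


def speech_path(voice_id, with_timestamps=False):
    """Pick the endpoint variant: the /with-timestamps suffix is used when the flag is set."""
    path = f"/v1/text-to-speech/{voice_id}"
    return f"{path}/with-timestamps" if with_timestamps else path


def parse_timestamped_speech(response):
    """Return (audio_bytes, [(character, start_ms, end_ms), ...]) from a parsed response."""
    audio = base64.b64decode(response["audio_base64"])
    alignment = response["alignment"]
    chars = alignment["characters"]
    starts = alignment["char_start_times_ms"]
    ends = alignment["char_end_times_ms"]
    # The three arrays are parallel; refuse to pair them up if they disagree in length.
    if not (len(chars) == len(starts) == len(ends)):
        raise ValueError("alignment arrays differ in length")
    return audio, list(zip(chars, starts, ends))
```

The same function can be pointed at `normalized_alignment` by passing a response whose `alignment` key has been swapped for it, since both objects share one shape.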

Streaming speech returns audio in chunks as they are generated:

{
  "type": "audio.delta",
  "audio": "<binary audio chunk>"
}

Final chunk:

{
  "type": "audio.done"
}
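
Reassembling a streamed response amounts to concatenating the `audio.delta` chunks until `audio.done` arrives. The sketch below assumes the events have already been parsed into dictionaries whose `audio` field holds raw bytes; the wire transport and chunk encoding are not specified here, so that part is an assumption, as is the helper name `collect_audio`.

```python
def collect_audio(events):
    """Hypothetical helper: join audio.delta chunks, stopping at audio.done."""
    chunks = []
    for event in events:
        if event["type"] == "audio.done":
            break  # final chunk carries no audio
        if event["type"] == "audio.delta":
            chunks.append(event["audio"])  # assumed to be bytes already
    return b"".join(chunks)
```

Because `events` can be any iterable, the same function works with a generator that yields events as they arrive, so playback or file writing can be layered on top without buffering logic changes.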

For transcription, provide exactly one of the following audio sources (they are mutually exclusive):

| Parameter | Type | Description |
|---|---|---|
| input.file | bytes | Audio file content (WAV, MP3, etc.) |
| extra_params.cloud_storage_url | string | URL to cloud-hosted audio file |

Providing both or neither results in an error.
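
The exactly-one-source rule can be checked before a request is built. This is a small sketch of that check; `resolve_audio_source` is a hypothetical name and the error message is illustrative, not the gateway's actual wording.

```python
def resolve_audio_source(file=None, cloud_storage_url=None):
    """Return (field_name, value) for the single audio source, or raise if 0 or 2 are given."""
    has_file = bool(file)
    has_url = bool(cloud_storage_url)
    # Both True (ambiguous) and both False (nothing to transcribe) are rejected.
    if has_file == has_url:
        raise ValueError("provide exactly one of file or cloud_storage_url")
    if has_file:
        return "file", file
    return "cloud_storage_url", cloud_storage_url
```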

| Parameter | Mapping | Description |
|---|---|---|
| model | model_id | Model identifier (required) |
| params.language | language_code | Language code (ISO 639-1, e.g., "en") |

Use extra_params for transcription-specific features:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=eleven_latest" \
  -F "language_code=en" \
  -F "tag_audio_events=true" \
  -F "num_speakers=2" \
  -F "timestamps_granularity=word" \
  -F "diarize=true" \
  -F "diarization_threshold=0.5" \
  -F "temperature=0.1" \
  -F "seed=42" \
  -F "use_multi_channel=true" \
  -F "webhook=true" \
  -F "webhook_id=webhook-123"

| Parameter | Type | Description |
|---|---|---|
| tag_audio_events | boolean | Tag audio events (background noise, music, etc.) |
| num_speakers | integer | Expected number of speakers (for diarization) |
| timestamps_granularity | string | Timestamp level: "none", "word", "character" |
| diarize | boolean | Identify different speakers |
| diarization_threshold | float | Speaker diarization sensitivity (0.0-1.0) |
| file_format | string | Input format: "pcm_s16le_16", "other" |
| temperature | float | Transcription temperature (0.0-1.0) |
| seed | integer | Reproducible transcription |
| use_multi_channel | boolean | Process multi-channel audio separately |
| webhook | boolean | Enable webhook for async processing |
| webhook_id | string | Webhook endpoint ID |
| webhook_metadata | object/string | Additional webhook metadata |
| cloud_storage_url | string | URL to cloud-hosted audio (alternative to file) |

Request multiple output formats simultaneously:

{
  "additional_formats": [
    {
      "format": "segmented_json",
      "include_speakers": true,
      "include_timestamps": true,
      "segment_on_silence_longer_than_s": 1.0,
      "max_segment_duration_s": 30.0
    },
    {
      "format": "srt",
      "max_segment_duration_s": 30.0
    }
  ]
}

Supported formats: segmented_json, docx, pdf, txt, html, srt
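
When the `additional_formats` list is assembled programmatically, checking each format name against the supported set catches typos early. A minimal sketch, with `additional_format` as a hypothetical helper and the option names taken from the example payload:

```python
import json

# Format names copied from the supported-formats line above.
SUPPORTED_FORMATS = {"segmented_json", "docx", "pdf", "txt", "html", "srt"}


def additional_format(fmt, **options):
    """Hypothetical helper: build one additional_formats entry after validating its name."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported additional format: {fmt!r}")
    return {"format": fmt, **options}


payload = {
    "additional_formats": [
        additional_format("segmented_json", include_speakers=True, include_timestamps=True),
        additional_format("srt", max_segment_duration_s=30.0),
    ]
}
encoded = json.dumps(payload)  # JSON string ready to send as the additional_formats value
```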

{
  "transcript": {
    "language_code": "en",
    "language_probability": 0.95,
    "text": "Full transcribed text...",
    "words": [
      {
        "text": "Hello",
        "start": 0.0,
        "end": 0.5,
        "type": "word",
        "speaker_id": "speaker_1",
        "logprob": -0.05
      }
    ]
  }
}

When diarize: true, the response includes speaker identification:

{
  "transcript": {
    "text": "Hello how are you?",
    "words": [
      {
        "text": "Hello",
        "speaker_id": "speaker_1"
      },
      {
        "text": "how",
        "speaker_id": "speaker_2"
      }
    ]
  }
}
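
Diarized output is usually easier to read once consecutive words from the same speaker are merged into turns. The sketch below does that with the field names from the response above; `speaker_turns` is a hypothetical helper, and skipping entries whose `type` is not `"word"` is an assumption about how non-word items should be handled.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words that share a speaker_id into (speaker_id, text) turns."""
    # Entries without a "type" field are treated as words; other types are skipped (assumption).
    spoken = [w for w in words if w.get("type", "word") == "word"]
    turns = []
    for speaker, group in groupby(spoken, key=lambda w: w.get("speaker_id")):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns
```

Applied to the two-word example above, this returns `[("speaker_1", "Hello"), ("speaker_2", "how")]`.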

Character-level timing when timestamps_granularity: "character":

{
  "words": [
    {
      "text": "Hello",
      "characters": [
        {"text": "H", "start": 0.0, "end": 0.1},
        {"text": "e", "start": 0.1, "end": 0.2}
      ]
    }
  ]
}

When additional_formats is requested, the response includes the rendered outputs alongside the transcript:

{
  "transcript": { ... },
  "additional_formats": [
    {
      "requested_format": "srt",
      "file_extension": "srt",
      "content_type": "text/plain",
      "is_base64_encoded": false,
      "content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..."
    }
  ]
}
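
Each entry in `additional_formats` carries its own encoding flag, so a client has to check `is_base64_encoded` before writing the content out. A minimal sketch using the fields shown above; `decode_additional_format` and the `transcript` file stem are hypothetical choices.

```python
import base64


def decode_additional_format(entry, stem="transcript"):
    """Return (filename, raw_bytes) for one additional_formats entry."""
    content = entry["content"]
    if entry["is_base64_encoded"]:
        data = base64.b64decode(content)   # binary formats arrive base64-encoded
    else:
        data = content.encode("utf-8")     # text formats arrive as plain strings
    return f"{stem}.{entry['file_extension']}", data
```

The returned bytes can then be written with `open(filename, "wb")`, which works the same way for text and binary formats.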

Voice ID Required

  • Severity: High
  • Behavior: Voice ID must be provided for TTS requests
  • Impact: Request fails without voice configuration
  • Code: elevenlabs.go:198-208

File or URL Required for Transcription

  • Severity: High
  • Behavior: Either file or cloud_storage_url must be provided (not both)
  • Impact: Request fails with ambiguous input
  • Code: elevenlabs.go:471-478

Audio Format Conversion

  • Severity: Low
  • Behavior: Response formats (MP3, Opus, WAV) mapped via format string
  • Impact: Format parameter passed as query string to endpoint
  • Code: elevenlabs.go:712-715, utils.go:5-35

Timestamps as Separate Endpoint

  • Severity: Low
  • Behavior: Timestamp requests use /with-timestamps endpoint variant
  • Impact: Switches endpoint based on with_timestamps flag
  • Code: elevenlabs.go:195-205

Multipart Form Data for Transcription

  • Severity: Low
  • Behavior: Transcription uses multipart/form-data, not JSON
  • Impact: File and parameters sent as form fields
  • Code: elevenlabs.go:480-690


| Parameter | Type | Description |
|---|---|---|
| (none) | - | No parameters required |

Returns available models with their capabilities and language support.

{
  "models": [
    {
      "model_id": "eleven_multilingual_v2",
      "name": "Eleven Multilingual v2",
      "description": "Multilingual speech synthesis",
      "serves_pro_voices": true,
      "token_cost_factor": 1.0,
      "can_do_text_to_speech": true,
      "can_do_voice_conversion": true,
      "can_use_style": true,
      "can_use_speaker_boost": true,
      "languages": [
        {"language_id": "en", "name": "English"},
        {"language_id": "es", "name": "Spanish"}
      ],
      "requires_alpha_access": false,
      "max_characters_request_free_user": 1000,
      "max_characters_request_subscribed_user": 100000,
      "maximum_text_length_per_request": 5000,
      "model_rates": {
        "character_cost_multiplier": 1.0
      }
    }
  ]
}
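
The capability flags and language list make it straightforward to pick a model on the client side. As a rough sketch using the fields in the listing above (`tts_models_for_language` is a hypothetical helper):

```python
def tts_models_for_language(listing, language_id):
    """Return the model_id of every TTS-capable model that lists the given language."""
    matches = []
    for model in listing.get("models", []):
        if not model.get("can_do_text_to_speech"):
            continue  # skip models that cannot synthesize speech
        languages = {lang["language_id"] for lang in model.get("languages", [])}
        if language_id in languages:
            matches.append(model["model_id"])
    return matches
```

With the sample listing, `tts_models_for_language(listing, "es")` returns `["eleven_multilingual_v2"]`, and a language the model does not list returns an empty list.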