ElevenLabs
Overview
Section titled “Overview”ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text operations. DeepIntShield performs conversions including:
- Model ID mapping - Uses provider model identifier directly
- Voice configuration - Maps voice settings (stability, similarity, boost, speed, style)
- Response format conversion - Speech format handling (MP3, Opus, PCM/WAV)
- Timestamp support - Character-level timing alignment for TTS
- Transcription with alignment - Word and character-level timing, diarization, and additional formats
- Pronunciation dictionaries - Support for custom pronunciation rules
- Voice quality parameters - Stability, similarity boost, and speaker boost controls
Supported Operations
Section titled “Supported Operations”| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Speech (TTS) | ✅ | ✅ | /v1/text-to-speech/{voice_id} |
| Transcriptions (STT) | ✅ | - | /v1/speech-to-text |
| List Models | ✅ | - | /v1/models |
| Chat Completions | ❌ | ❌ | - |
| Responses API | ❌ | ❌ | - |
| Text Completions | ❌ | ❌ | - |
| Embeddings | ❌ | ❌ | - |
| Image Generation | ❌ | ❌ | - |
1. Speech (Text-to-Speech)
Section titled “1. Speech (Text-to-Speech)”Request Parameters
Section titled “Request Parameters”Core Parameters
Section titled “Core Parameters”| Parameter | Mapping | Notes |
|---|---|---|
input.input | text | The text to convert to speech (required) |
model | model_id | Model identifier (e.g., "eleven_multilingual_v2") |
response_format | Query param output_format | Speech format (see Response Format) |
Voice Configuration
Section titled “Voice Configuration”Voice settings are optional and controlled via params:
| Parameter | ElevenLabs Mapping | Default | Range |
|---|---|---|---|
speed | voice_settings.speed | 1.0 | 0.5-2.0 |
extra_params.stability | voice_settings.stability | 0.5 | 0-1.0 |
extra_params.similarity_boost | voice_settings.similarity_boost | 0.75 | 0-1.0 |
extra_params.use_speaker_boost | voice_settings.use_speaker_boost | true | boolean |
extra_params.style | voice_settings.style | 0 | 0-1.0 |
Advanced Parameters
Section titled “Advanced Parameters”Use extra_params for ElevenLabs-specific TTS features:
curl -X POST http://localhost:8080/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "model": "eleven_multilingual_v2", "input": {"input": "Hello, how are you?"}, "voice": "21m00Tcm4TlvDq8ikWAM", "response_format": "mp3", "stability": 0.5, "similarity_boost": 0.75, "use_speaker_boost": true, "style": 0, "speed": 1.0, "language_code": "en", "seed": 42, "previous_text": "Context text", "next_text": "Future context", "apply_text_normalization": "auto" }'resp, err := client.SpeechRequest(schemas.NewDeepIntShieldContext(ctx, schemas.NoDeadline), &schemas.DeepIntShieldSpeechRequest{ Provider: schemas.Elevenlabs, Model: "eleven_multilingual_v2", Input: &schemas.SpeechInput{ Input: "Hello, how are you?", }, Params: &schemas.SpeechParameters{ VoiceConfig: &schemas.VoiceConfig{ Voice: schemas.Ptr("21m00Tcm4TlvDq8ikWAM"), }, Speed: schemas.Ptr(1.0), ResponseFormat: schemas.Ptr("mp3"), ExtraParams: map[string]interface{}{ "stability": 0.5, "similarity_boost": 0.75, "use_speaker_boost": true, "style": 0.0, "language_code": "en", "seed": 42, "previous_text": "Context text", "next_text": "Future context", "apply_text_normalization": "auto", }, },})Advanced TTS Parameters
Section titled “Advanced TTS Parameters”| Parameter | Type | Description |
|---|---|---|
language_code | string | Language code (e.g., “en”, “es”) |
seed | integer | Reproducible output (0-4294967295) |
previous_text | string | Previous text context for consistency |
next_text | string | Next text context for consistency |
previous_request_ids | string[] | Previous request IDs for continuity |
next_request_ids | string[] | Next request IDs for continuity |
apply_text_normalization | string | Text normalization mode: "auto", "on", "off" |
apply_language_text_normalization | boolean | Apply language-specific text normalization |
Response Format
Section titled “Response Format”| Format | Output | Quality | Bitrate |
|---|---|---|---|
mp3 | MP3 | High | 128 kbps @ 44100 Hz |
opus | Opus | High | 128 kbps @ 48000 Hz |
wav / pcm | PCM WAV | Lossless | 16-bit @ 44100 Hz |
Timestamps Support
Section titled “Timestamps Support”To get character-level timing alignment, enable with_timestamps:
{ "with_timestamps": true}When enabled, the endpoint /v1/text-to-speech/{voice_id}/with-timestamps is used and the response includes:
audio_base64- Audio data as base64-encoded stringalignment.char_start_times_ms- Character start times in millisecondsalignment.char_end_times_ms- Character end times in millisecondsalignment.characters- Array of charactersnormalized_alignment- Same as alignment but for normalized text
Response Conversion
Section titled “Response Conversion”Non-Timestamp Response
Section titled “Non-Timestamp Response”{ "audio": "<binary audio data>"}Timestamp Response
Section titled “Timestamp Response”{ "audio_base64": "<base64 encoded audio>", "alignment": { "char_start_times_ms": [0, 150, 280, ...], "char_end_times_ms": [150, 280, 420, ...], "characters": ["H", "e", "l", "l", "o", ...] }, "normalized_alignment": { "char_start_times_ms": [...], "char_end_times_ms": [...], "characters": [...] }}Streaming
Section titled “Streaming”Streaming speech returns audio in chunks as they are generated:
{ "type": "audio.delta", "audio": "<binary audio chunk>"}Final chunk:
{ "type": "audio.done"}2. Transcription (Speech-to-Text)
Section titled “2. Transcription (Speech-to-Text)”Request Parameters
Section titled “Request Parameters”Input Source
Section titled “Input Source”Choose one of the following (mutually exclusive):
| Parameter | Type | Description |
|---|---|---|
input.file | bytes | Audio file content (WAV, MP3, etc.) |
extra_params.cloud_storage_url | string | URL to cloud-hosted audio file |
Error: Providing both or neither will result in error.
Core Parameters
Section titled “Core Parameters”| Parameter | Mapping | Description |
|---|---|---|
model | model_id | Model identifier (required) |
params.language | language_code | Language code (ISO 639-1, e.g., “en”) |
Advanced Parameters
Section titled “Advanced Parameters”Use extra_params for transcription-specific features:
curl -X POST http://localhost:8080/v1/audio/transcriptions \ -F "file=@audio.wav" \ -F "model=eleven_latest" \ -F "language_code=en" \ -F "tag_audio_events=true" \ -F "num_speakers=2" \ -F "timestamps_granularity=word" \ -F "diarize=true" \ -F "diarization_threshold=0.5" \ -F "temperature=0.1" \ -F "seed=42" \ -F "use_multi_channel=true" \ -F "webhook=true" \ -F "webhook_id=webhook-123"resp, err := client.TranscriptionRequest(schemas.NewDeepIntShieldContext(ctx, schemas.NoDeadline), &schemas.DeepIntShieldTranscriptionRequest{ Provider: schemas.Elevenlabs, Model: "eleven_latest", Input: &schemas.TranscriptionInput{ File: audioBytes, }, Params: &schemas.TranscriptionParameters{ Language: schemas.Ptr("en"), ExtraParams: map[string]interface{}{ "tag_audio_events": true, "num_speakers": 2, "timestamps_granularity": "word", "diarize": true, "diarization_threshold": 0.5, "temperature": 0.1, "seed": 42, "use_multi_channel": true, "webhook": true, "webhook_id": "webhook-123", }, },})Transcription Options
Section titled “Transcription Options”| Parameter | Type | Description |
|---|---|---|
tag_audio_events | boolean | Tag audio events (background noise, music, etc.) |
num_speakers | integer | Expected number of speakers (for diarization) |
timestamps_granularity | string | Timestamp level: "none", "word", "character" |
diarize | boolean | Identify different speakers |
diarization_threshold | float | Speaker diarization sensitivity (0.0-1.0) |
file_format | string | Input format: "pcm_s16le_16", "other" |
temperature | float | Transcription temperature (0.0-1.0) |
seed | integer | Reproducible transcription |
use_multi_channel | boolean | Process multi-channel audio separately |
webhook | boolean | Enable webhook for async processing |
webhook_id | string | Webhook endpoint ID |
webhook_metadata | object/string | Additional webhook metadata |
cloud_storage_url | string | URL to cloud-hosted audio (alternative to file) |
Additional Formats
Section titled “Additional Formats”Request multiple output formats simultaneously:
{ "additional_formats": [ { "format": "segmented_json", "include_speakers": true, "include_timestamps": true, "segment_on_silence_longer_than_s": 1.0, "max_segment_duration_s": 30.0 }, { "format": "srt", "max_segment_duration_s": 30.0 } ]}Supported formats: segmented_json, docx, pdf, txt, html, srt
Response Conversion
Section titled “Response Conversion”Basic Transcription
Section titled “Basic Transcription”{ "transcript": { "language_code": "en", "language_probability": 0.95, "text": "Full transcribed text...", "words": [ { "text": "Hello", "start": 0.0, "end": 0.5, "type": "word", "speaker_id": "speaker_1", "logprob": -0.05 } ] }}With Diarization
Section titled “With Diarization”When diarize: true, the response includes speaker identification:
{ "transcript": { "text": "Hello how are you?", "words": [ { "text": "Hello", "speaker_id": "speaker_1" }, { "text": "how", "speaker_id": "speaker_2" } ] }}With Timestamps
Section titled “With Timestamps”Character-level timing when timestamps_granularity: "character":
{ "words": [ { "text": "Hello", "characters": [ {"text": "H", "start": 0.0, "end": 0.1}, {"text": "e", "start": 0.1, "end": 0.2} ] } ]}With Additional Formats
Section titled “With Additional Formats”{ "transcript": { ... }, "additional_formats": [ { "requested_format": "srt", "file_extension": "srt", "content_type": "text/plain", "is_base64_encoded": false, "content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..." } ]}Caveats
Section titled “Caveats”Voice ID Required
Severity: High
Behavior: Voice ID must be provided for TTS requests
Impact: Request fails without voice configuration
Code: elevenlabs.go:198-208
File or URL Required for Transcription
Severity: High
Behavior: Either file or cloud_storage_url must be provided (not both)
Impact: Request fails with ambiguous input
Code: elevenlabs.go:471-478
Audio Format Conversion
Severity: Low
Behavior: Response formats (MP3, Opus, WAV) mapped via format string
Impact: Format parameter passed as query string to endpoint
Code: elevenlabs.go:712-715, utils.go:5-35
Timestamps as Separate Endpoint
Severity: Low
Behavior: Timestamp requests use /with-timestamps endpoint variant
Impact: Switches endpoint based on with_timestamps flag
Code: elevenlabs.go:195-205
Multipart Form Data for Transcription
Severity: Low
Behavior: Transcription uses multipart/form-data, not JSON
Impact: File and parameters sent as form fields
Code: elevenlabs.go:480-690
3. List Models
Section titled “3. List Models”Request Parameters
Section titled “Request Parameters”| Parameter | Type | Description |
|---|---|---|
| (none) | - | No parameters required |
Returns available models with their capabilities and language support.
Response Conversion
Section titled “Response Conversion”{ "models": [ { "model_id": "eleven_multilingual_v2", "name": "Eleven Multilingual v2", "description": "Multilingual speech synthesis", "serves_pro_voices": true, "token_cost_factor": 1.0, "can_do_text_to_speech": true, "can_do_voice_conversion": true, "can_use_style": true, "can_use_speaker_boost": true, "languages": [ {"language_id": "en", "name": "English"}, {"language_id": "es", "name": "Spanish"} ], "requires_alpha_access": false, "max_characters_request_free_user": 1000, "max_characters_request_subscribed_user": 100000, "maximum_text_length_per_request": 5000, "model_rates": { "character_cost_multiplier": 1.0 } } ]}