Multimodal Support

Send images to vision-capable models for analysis, description, and understanding. This example analyzes an image from a URL using GPT-4o, with the detail level set to high for better accuracy.

curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What do you see in this image? Please describe it in detail."
},
{
"type": "image_url",
"image_url": {
"url": "https://pub-cdead89c2f004d8f963fd34010c479d0.r2.dev/Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
"detail": "high"
}
}
]
}
]
}'

The response includes a detailed image analysis:

{
"choices": [{
"message": {
"role": "assistant",
"content": "I can see a beautiful wooden boardwalk extending through a natural landscape..."
}
}]
}
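
Prompts that contain double quotes or backslashes have to be escaped before they are placed inside the JSON body. The sketch below shows one way to do that in shell. The names json_escape and build_vision_payload are hypothetical helpers, not part of the gateway, and the escaping only covers backslashes and double quotes in single-line strings.

```shell
# Hypothetical helper: escape backslashes and double quotes for use inside a JSON string.
# Assumption: the input is a single line with no control characters.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print a chat request body for one image.
# $1 = prompt, $2 = image URL, $3 = detail level (defaults to "high", as in the example above)
build_vision_payload() {
  printf '{"model":"openai/gpt-4o","messages":[{"role":"user","content":['
  printf '{"type":"text","text":"%s"},' "$(json_escape "$1")"
  printf '{"type":"image_url","image_url":{"url":"%s","detail":"%s"}}' "$(json_escape "$2")" "${3:-high}"
  printf ']}]}\n'
}

# Usage (assumes the gateway is listening on localhost:8080):
#   build_vision_payload 'What does the sign that says "Exit" point to?' \
#     'https://example.com/image1.jpg' > payload.json
#   curl --location 'http://localhost:8080/v1/chat/completions' \
#     --header 'Content-Type: application/json' \
#     --data-binary @payload.json
```

Writing the body to a file and sending it with --data-binary keeps the shell quoting separate from the JSON quoting.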

Image Generation: Generating Images with AI

Generate images from text prompts using OpenAI-compatible image generation models. This example uses dall-e-3.

curl --location 'http://localhost:8080/v1/images/generations' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/dall-e-3",
"prompt": "A futuristic city skyline at sunset with flying cars",
"size": "1024x1024",
"response_format": "url"
}'

Response format:

{
"created": 1713833628,
"data": [
{
"url": "https://oaidalleapiprodscus.blob.core.windows.net/...",
"revised_prompt": "A futuristic city skyline at sunset featuring advanced architecture and flying vehicles.",
"index": 0
}
],
"background": "opaque",
"output_format": "png",
"quality": "standard",
"size": "1024x1024",
"usage": {
"input_tokens": 15,
"output_tokens": 1,
"total_tokens": 16
},
"extra_fields": {
"request_type": "image_generation",
"provider": "openai",
"model_requested": "dall-e-3",
"latency": 15265,
"chunk_index": 0
}
}
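
With "response_format": "url", the generated image is referenced by the url field of the first entry in data. The following is a rough sketch for pulling that URL out of a saved response without a JSON parser. The name extract_first_url is a hypothetical helper, and a real JSON tool is more robust than this pattern match.

```shell
# Hypothetical helper: print the first "url" value found in JSON read from stdin.
# Assumption: the URL contains no escaped double quotes.
extract_first_url() {
  grep -o '"url"[[:space:]]*:[[:space:]]*"[^"]*"' |
    head -n 1 |
    sed 's/^"url"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Usage (assumes the response above was saved to response.json):
#   curl --location "$(extract_first_url < response.json)" --output image.png
```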

Audio Understanding: Analyzing Audio with AI

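If your chat application supports text input, you can add audio input and output: use an audio model such as gpt-4o-audio-preview, and include audio in the modalities array when you want spoken output. The example below sends a WAV recording as input and requests a text-only reply.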

curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/gpt-4o-audio-preview",
"modalities": ["text"],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Please analyze this audio recording and summarize what was discussed."
},
{
"type": "input_audio",
"input_audio": {
"data": "<base64-encoded audio data containing the word 'Affirmative'>",
"format": "wav"
}
}
]
}
]
}'

Response format:

{
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "The audio recording captured a brief segment where a speaker simply said \"Affirmative\" in response. There wasn't any detailed discussion or context provided beyond that one-word affirmation. If you have more audio or specific questions, feel free to share!"
}
}
]
}
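
The data field in the request expects the audio file as a single base64 string. The sketch below shows one way to produce it and write the request body to a file, which avoids pasting a long string onto the command line. The names encode_file and write_audio_payload are hypothetical helpers, and recording.wav is a placeholder file name.

```shell
# Hypothetical helper: base64-encode a file as one line (some base64 tools wrap their output).
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: write an audio chat request body to a file.
# $1 = path to a WAV file, $2 = path of the JSON file to create
write_audio_payload() {
  {
    printf '{"model":"openai/gpt-4o-audio-preview","modalities":["text"],'
    printf '"messages":[{"role":"user","content":['
    printf '{"type":"text","text":"Please analyze this audio recording and summarize what was discussed."},'
    printf '{"type":"input_audio","input_audio":{"data":"%s","format":"wav"}}' "$(encode_file "$1")"
    printf ']}]}\n'
  } > "$2"
}

# Usage (assumes recording.wav exists and the gateway is listening on localhost:8080):
#   write_audio_payload recording.wav payload.json
#   curl --location 'http://localhost:8080/v1/chat/completions' \
#     --header 'Content-Type: application/json' \
#     --data-binary @payload.json
```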

Convert text into natural-sounding speech using AI voice models. This example demonstrates generating an MP3 audio file from text using the “alloy” voice. The result is returned as binary audio data.

curl --location 'http://localhost:8080/v1/audio/speech' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/tts-1",
"input": "Hello! This is a sample text that will be converted to speech using DeepIntShield speech synthesis capabilities. The weather today is wonderful, and I hope you are having a great day!",
"voice": "alloy",
"response_format": "mp3"
}' \
--output "output.mp3"

Save audio to file:

# The --output flag saves the binary audio data directly to a file
# File size will vary based on input text length

Convert audio files into text using AI transcription models. This example shows how to transcribe an MP3 file using OpenAI’s Whisper model, with an optional context prompt to improve accuracy.

curl --location 'http://localhost:8080/v1/audio/transcriptions' \
--form 'file=@"output.mp3"' \
--form 'model="openai/whisper-1"' \
--form 'prompt="This is a sample audio transcription from DeepIntShield speech synthesis."'

Response format:

{
"text": "Hello! This is a sample text that will be converted to speech using DeepIntShield speech synthesis capabilities. The weather today is wonderful, and I hope you are having a great day!"
}
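
Because the transcription example reads the output.mp3 produced by the speech example, the two calls form a round trip that can be checked by comparing the transcript with the original input. This is a minimal sketch, assuming an exact string match is too strict because punctuation and capitalization may differ. The names normalize_text and same_words are hypothetical helpers.

```shell
# Hypothetical helper: lower-case stdin and drop everything except letters, digits, spaces and newlines.
normalize_text() {
  tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:] \n'
}

# Hypothetical helper: succeed when two strings contain the same words after normalization.
same_words() {
  [ "$(printf '%s' "$1" | normalize_text)" = "$(printf '%s' "$2" | normalize_text)" ]
}

# Usage (the second argument stands in for the "text" field of the transcription response):
#   same_words 'Hello! The weather today is wonderful.' 'hello the weather today is wonderful' \
#     && echo 'round trip matches'
```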

Send multiple images in a single request for comparison or analysis. This is useful for comparing products, analyzing changes over time, or understanding relationships between different visual elements.

curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images. What are the differences?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image1.jpg"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image2.jpg"
}
}
]
}
]
}'
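
When the number of images varies, the content array can be assembled in a loop, with one image_url part per image. This is a sketch under the assumption that the prompt and URLs contain no characters that need JSON escaping. The name build_multi_image_payload is a hypothetical helper.

```shell
# Hypothetical helper: print a chat request body with one image_url part per URL.
# $1 = prompt, remaining arguments = image URLs
build_multi_image_payload() {
  prompt=$1
  shift
  printf '{"model":"openai/gpt-4o","messages":[{"role":"user","content":['
  printf '{"type":"text","text":"%s"}' "$prompt"
  for url in "$@"; do
    # Each image becomes its own content part, in the order given.
    printf ',{"type":"image_url","image_url":{"url":"%s"}}' "$url"
  done
  printf ']}]}\n'
}

# Usage (assumes the gateway is listening on localhost:8080):
#   build_multi_image_payload 'Compare these two images. What are the differences?' \
#     'https://example.com/image1.jpg' 'https://example.com/image2.jpg' |
#     curl --location 'http://localhost:8080/v1/chat/completions' \
#       --header 'Content-Type: application/json' \
#       --data-binary @-
```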

Process local images by encoding them as base64 data URLs. This approach is ideal when you need to analyze images stored locally on your system without uploading them to external URLs first.

# First, encode your local image to base64 on a single line
# (GNU base64 wraps its output, and a line break would invalidate the JSON)
base64_image=$(base64 < local_image.jpg | tr -d '\n')
data_url="data:image/jpeg;base64,$base64_image"
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Analyze this image and describe what you see."
},
{
"type": "image_url",
"image_url": {
"url": "'"$data_url"'",
"detail": "high"
}
}
]
}
]
}'
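
The media type in a data URL should match the actual image format, so a PNG needs image/png rather than image/jpeg. The sketch below derives the type from the file extension. The names mime_for_image and image_data_url are hypothetical helpers, and the four extensions listed are an assumption about what the target model accepts.

```shell
# Hypothetical helper: map a file extension to an image media type.
mime_for_image() {
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg) printf 'image/jpeg\n' ;;
    png)      printf 'image/png\n' ;;
    gif)      printf 'image/gif\n' ;;
    webp)     printf 'image/webp\n' ;;
    *)        printf 'unsupported image extension: %s\n' "$ext" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print a base64 data URL for a local image file.
image_data_url() {
  mime=$(mime_for_image "$1") || return 1
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$1" | tr -d '\n')"
}

# Usage:
#   data_url=$(image_data_url local_image.png)
```

For large images, the request body may exceed the command-line length limit, in which case writing the JSON to a file and passing it with --data-binary @file is the safer route.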

OpenAI's voice options include the following six, each with different characteristics:

  • alloy - Balanced, natural voice
  • echo - Deep, resonant voice
  • fable - Expressive, storytelling voice
  • onyx - Strong, confident voice
  • nova - Bright, energetic voice
  • shimmer - Gentle, soothing voice
# Example with different voice
curl --location 'http://localhost:8080/v1/audio/speech' \
--header 'Content-Type: application/json' \
--data '{
"model": "openai/tts-1",
"input": "This is the nova voice speaking.",
"voice": "nova",
"response_format": "mp3"
}' \
--output "sample_nova.mp3"
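
To compare the voices side by side, the same request can be repeated once per voice. The following sketch assumes the six voices listed above and the local gateway address used throughout this page. The names tts_payload and generate_voice_samples are hypothetical helpers, and the input text is assumed to need no JSON escaping.

```shell
# The six voices listed above.
VOICES="alloy echo fable onyx nova shimmer"

# Hypothetical helper: print a speech request body.
# $1 = voice, $2 = input text
tts_payload() {
  printf '{"model":"openai/tts-1","input":"%s","voice":"%s","response_format":"mp3"}\n' "$2" "$1"
}

# Hypothetical helper: write one MP3 sample per voice (sample_alloy.mp3, sample_echo.mp3, ...).
# Defined here but not called; it needs the gateway to be listening on localhost:8080.
generate_voice_samples() {
  for voice in $VOICES; do
    tts_payload "$voice" "This is the $voice voice speaking." |
      curl --location 'http://localhost:8080/v1/audio/speech' \
        --header 'Content-Type: application/json' \
        --data-binary @- \
        --output "sample_$voice.mp3"
  done
}

# Usage:
#   generate_voice_samples
```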

Generate audio in different formats depending on your use case: MP3 for general use, Opus for web streaming, AAC for mobile apps, and FLAC for high-quality audio applications.

# MP3 format (default)
"response_format": "mp3"
# Opus format for web streaming
"response_format": "opus"
# AAC format for mobile apps
"response_format": "aac"
# FLAC format for high-quality audio
"response_format": "flac"
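
When the format is chosen at run time, it helps to keep the output file name in step with it. This is a small sketch that checks the value against the four formats listed above and builds a matching file name. The name speech_filename is a hypothetical helper, and using the format name as the file extension is an assumption.

```shell
# Hypothetical helper: print "<base>.<format>" for one of the formats listed above.
# $1 = base file name, $2 = response_format value
speech_filename() {
  case "$2" in
    mp3|opus|aac|flac) printf '%s.%s\n' "$1" "$2" ;;
    *) printf 'unknown response_format: %s\n' "$2" >&2; return 1 ;;
  esac
}

# Usage:
#   format=flac
#   curl ... --output "$(speech_filename sample "$format")"
```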

Improve transcription accuracy by specifying the source language. This is particularly helpful for non-English audio or when the audio contains technical terms or specific domain vocabulary.

curl --location 'http://localhost:8080/v1/audio/transcriptions' \
--form 'file=@"spanish_audio.mp3"' \
--form 'model="openai/whisper-1"' \
--form 'language="es"' \
--form 'prompt="This is a Spanish audio recording about technology."'
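
The language value in the example is a two-letter code (es), the ISO-639-1 form that OpenAI documents for this parameter. The sketch below rejects values such as "spanish" before any request is sent. The names is_language_code and transcribe_with_language are hypothetical helpers, and the check only looks at the shape of the code, not whether the language is supported.

```shell
# Hypothetical helper: succeed when the argument is exactly two lower-case letters.
is_language_code() {
  case "$1" in
    [[:lower:]][[:lower:]]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: transcribe a file with an explicit language code.
# $1 = audio file, $2 = two-letter language code
transcribe_with_language() {
  is_language_code "$2" || {
    printf 'expected a two-letter code such as "es", got: %s\n' "$2" >&2
    return 1
  }
  curl --location 'http://localhost:8080/v1/audio/transcriptions' \
    --form "file=@\"$1\"" \
    --form 'model="openai/whisper-1"' \
    --form "language=\"$2\""
}

# Usage (assumes the gateway is listening on localhost:8080):
#   transcribe_with_language spanish_audio.mp3 es
```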

Choose between simple text output and detailed JSON responses with timestamps. With timestamp_granularities set, the verbose JSON format provides word-level and segment-level timing information, useful for creating subtitles or analyzing speech patterns.

# Text only response
curl --location 'http://localhost:8080/v1/audio/transcriptions' \
--form 'file=@"audio.mp3"' \
--form 'model="openai/whisper-1"' \
--form 'response_format="text"'
# JSON with timestamps
curl --location 'http://localhost:8080/v1/audio/transcriptions' \
--form 'file=@"audio.mp3"' \
--form 'model="openai/whisper-1"' \
--form 'response_format="verbose_json"' \
--form 'timestamp_granularities[]=word' \
--form 'timestamp_granularities[]=segment'
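
For subtitles, the timing values have to be converted into a subtitle timestamp format. The sketch below formats seconds as SubRip (SRT) timestamps and prints one cue. It assumes segment start and end times are given in seconds, as in OpenAI's verbose_json output. The names srt_timestamp and srt_cue are hypothetical helpers.

```shell
# Hypothetical helper: format a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
srt_timestamp() {
  awk -v t="$1" 'BEGIN {
    ms = int(t * 1000 + 0.5)               # round to whole milliseconds
    h = int(ms / 3600000); ms -= h * 3600000
    m = int(ms / 60000);   ms -= m * 60000
    s = int(ms / 1000);    ms -= s * 1000
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}

# Hypothetical helper: print one SRT cue followed by a blank line.
# $1 = cue number, $2 = start in seconds, $3 = end in seconds, $4 = text
srt_cue() {
  printf '%s\n%s --> %s\n%s\n\n' "$1" "$(srt_timestamp "$2")" "$(srt_timestamp "$3")" "$4"
}

# Usage:
#   srt_cue 1 0 2.5 'Hello!' >> subtitles.srt
```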

Now that you understand multimodal capabilities, explore these related topics: