Google Gemini

Google Gemini’s API is structured differently from OpenAI’s. DeepIntShield performs extensive conversion, including:

  • Role remapping - “assistant” → “model”, system messages integrated into main flow
  • Message grouping - Consecutive tool responses merged into single user message
  • Parameter renaming - e.g., max_completion_tokens → maxOutputTokens, stop → stopSequences
  • Function call handling - Tool call ID preservation and thought signature support
  • Content modality - Support for text, images, video, code execution, and thought content
  • Thinking/Reasoning - Thinking configuration mapped to DeepIntShield reasoning structure

Operation | Non-Streaming | Streaming | Endpoint
Chat Completions | | | /v1beta/models/{model}:generateContent
Responses API | | | /v1beta/models/{model}:generateContent
Speech (TTS) | | | /v1beta/models/{model}:generateContent
Transcriptions (STT) | | | /v1beta/models/{model}:generateContent
Image Generation | | - | /v1beta/models/{model}:generateContent or /v1beta/models/{model}:predict (Imagen)
Image Edit | | - | /v1beta/models/{model}:generateContent or /v1beta/models/{model}:predict (Imagen)
Video Generation | | - | /v1beta/models/{model}:predictLongRunning
Image Variation | | - | Not supported
Embeddings | | - | /v1beta/models/{model}:embedContent
Files | | - | /upload/storage/v1beta/files
Batch | | - | /v1beta/batchJobs
List Models | | - | /v1beta/models

Gemini supports API key authentication in addition to OAuth2 Bearer token authentication. The implementation conditionally uses the appropriate method based on the endpoint type.

API key authentication is supported via two methods:

  1. Header Method (standard Gemini endpoints):

    • Format: x-goog-api-key: YOUR_API_KEY header
    • Used for: Standard Gemini endpoints (e.g., /v1beta/models/{model}:generateContent)
  2. Query Parameter Method (Imagen and custom endpoints):

    • Format: ?key=YOUR_API_KEY appended to request URLs
    • Used for: Imagen models and custom endpoints
    • Example: https://generativelanguage.googleapis.com/v1beta/models/imagen-4.0-generate-001:predict?key=YOUR_API_KEY

DeepIntShield automatically selects the appropriate authentication method based on the endpoint type.
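
The two API key methods above can be sketched as follows. This is an illustration only, not DeepIntShield's implementation: the function name `apply_api_key` and the `use_query_param` flag are hypothetical, and the caller is assumed to already know whether the target is an Imagen or custom endpoint.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def apply_api_key(url, api_key, use_query_param):
    """Return (url, headers) with the API key attached by one of the two methods.

    use_query_param is a hypothetical flag: True for Imagen/custom endpoints,
    False for standard Gemini endpoints.
    """
    headers = {"Content-Type": "application/json"}
    if not use_query_param:
        # Header method: x-goog-api-key: YOUR_API_KEY
        headers["x-goog-api-key"] = api_key
        return url, headers
    # Query parameter method: append ?key=YOUR_API_KEY, keeping any existing query.
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [("key", api_key)]
    return urlunsplit(parts._replace(query=urlencode(query))), headers
```

Keeping the decision in a single flag makes it easy to swap in whatever endpoint detection the gateway actually uses.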


Parameter | Transformation
max_completion_tokens | Renamed to maxOutputTokens
temperature, top_p | Direct pass-through
stop | Renamed to stopSequences
response_format | Converted to responseMimeType and responseJsonSchema
tools | Schema restructured (see Tool Conversion)
tool_choice | Mapped to functionCallingConfig (see Tool Conversion)
reasoning | Mapped to thinkingConfig (see Reasoning / Thinking)
top_k | Via extra_params (Gemini-specific)
presence_penalty, frequency_penalty | Via extra_params
seed | Via extra_params

The following parameters are silently ignored: logit_bias, logprobs, top_logprobs, parallel_tool_calls, service_tier

Use extra_params (SDK) or pass directly in request body (Gateway) for Gemini-specific fields:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.0-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "top_k": 40,
    "stop_sequences": ["###"]
  }'

Documentation: See DeepIntShield Reasoning Reference

  • reasoning.effort → thinkingConfig.thinkingLevel (“low” → LOW, “high” → HIGH)
  • reasoning.max_tokens → thinkingConfig.thinkingBudget (token budget for thinking)
  • reasoning parameter triggers thinkingConfig.includeThoughts = true
  • "low" / "minimal" → LOW
  • "medium" / "high" → HIGH
  • null or unspecified → Based on max_tokens: -1 (dynamic), 0 (disabled), or specific budget
// Request
{"reasoning": {"effort": "high", "max_tokens": 10000}}
// Gemini conversion
{"thinkingConfig": {"includeThoughts": true, "thinkingLevel": "HIGH", "thinkingBudget": 10000}}
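
The reasoning mapping shown above can be sketched in a few lines. The function name `to_thinking_config` is hypothetical, and the sketch simply forwards `max_tokens` as the budget; it does not try to reproduce the dynamic (-1) or disabled (0) defaults beyond passing those values through.

```python
def to_thinking_config(reasoning):
    """Map a DeepIntShield-style reasoning object to a Gemini thinkingConfig dict."""
    if reasoning is None:
        return None
    # Any reasoning parameter turns on thought output.
    config = {"includeThoughts": True}
    effort = reasoning.get("effort")
    if effort in ("minimal", "low"):
        config["thinkingLevel"] = "LOW"
    elif effort in ("medium", "high"):
        config["thinkingLevel"] = "HIGH"
    # max_tokens becomes the thinking budget (-1 and 0 are passed through unchanged).
    max_tokens = reasoning.get("max_tokens")
    if max_tokens is not None:
        config["thinkingBudget"] = max_tokens
    return config
```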
  • Role remapping: “assistant” → “model”, “system” → part of user/model content flow
  • Consecutive tool responses: Tool response messages merged into single user message with function response parts
  • Content flattening: Multi-part content in single message preserved as parts array
  • URL images: {type: "image_url", image_url: {url: "..."}} → {type: "image", source: {type: "url", url: "..."}}
  • Base64 images: Data URL → {type: "image", source: {type: "base64", media_type: "image/png", ...}}
  • Video content: Preserved with metadata (fps, start/end offset)
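
A minimal sketch of the role remapping and tool-response grouping described above, assuming OpenAI-style input messages with string content. The function name `convert_messages` is hypothetical, the `functionResponse` payload layout is simplified, and treating system messages as user-role content is an assumption made only to keep the example short.

```python
def convert_messages(messages):
    """Convert OpenAI-style messages into Gemini-style contents."""
    contents = []
    last_was_tool = False
    for msg in messages:
        role = msg["role"]
        if role == "tool":
            part = {"functionResponse": {"name": msg.get("name", ""),
                                         "response": {"content": msg["content"]}}}
            if last_was_tool:
                # Consecutive tool responses share a single user message.
                contents[-1]["parts"].append(part)
            else:
                contents.append({"role": "user", "parts": [part]})
            last_was_tool = True
            continue
        last_was_tool = False
        # "assistant" becomes "model"; everything else is sent as "user" here.
        gemini_role = "model" if role == "assistant" else "user"
        contents.append({"role": gemini_role, "parts": [{"text": msg["content"]}]})
    return contents
```

Note that the output can contain fewer entries than the input, which is why message counts differ after conversion.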

Tool definitions are restructured with these mappings:

  • function.name → functionDeclarations.name (preserved)
  • function.parameters → functionDeclarations.parameters (Schema format)
  • function.description → functionDeclarations.description
  • function.strict → Dropped (not supported by Gemini)

OpenAI | Gemini
"auto" | AUTO (default)
"none" | NONE
"required" | ANY
Specific tool | ANY with allowedFunctionNames
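
The tool and tool_choice mappings above could look roughly like this. Both function names are hypothetical, the schema is copied as-is rather than translated into Gemini's Schema format, and the specific-tool case assumes the OpenAI shape {"type": "function", "function": {"name": ...}}.

```python
TOOL_CHOICE_MODES = {"auto": "AUTO", "none": "NONE", "required": "ANY"}


def convert_tools(tools):
    """Collect OpenAI function tools into a single functionDeclarations list."""
    declarations = []
    for tool in tools:
        if tool.get("type") != "function":
            continue
        fn = tool["function"]
        decl = {"name": fn["name"]}
        if "description" in fn:
            decl["description"] = fn["description"]
        if "parameters" in fn:
            decl["parameters"] = fn["parameters"]
        # function.strict is intentionally dropped.
        declarations.append(decl)
    return [{"functionDeclarations": declarations}]


def convert_tool_choice(choice):
    """Map an OpenAI tool_choice value to a functionCallingConfig dict."""
    if isinstance(choice, str):
        return {"functionCallingConfig": {"mode": TOOL_CHOICE_MODES[choice]}}
    # A specific tool is expressed as ANY restricted to one function name.
    name = choice["function"]["name"]
    return {"functionCallingConfig": {"mode": "ANY", "allowedFunctionNames": [name]}}
```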
  • finishReason → finish_reason:
    • STOP → stop
    • MAX_TOKENS → length
    • SAFETY, RECITATION, LANGUAGE, BLOCKLIST, PROHIBITED_CONTENT, SPII, IMAGE_SAFETY → content_filter
    • MALFORMED_FUNCTION_CALL, UNEXPECTED_TOOL_CALL → tool_calls
  • candidates[0].content.parts[0].text → choices[0].message.content (if single text block)
  • candidates[0].content.parts[].functionCall → choices[0].message.tool_calls
  • promptTokenCount → usage.prompt_tokens
  • candidatesTokenCount → usage.completion_tokens
  • totalTokenCount → usage.total_tokens
  • cachedContentTokenCount → usage.prompt_tokens_details.cached_tokens
  • thoughtsTokenCount → usage.completion_tokens_details.reasoning_tokens
  • Thought content (from text parts with thought: true) → reasoning field in stream deltas
  • Function call args (map) → JSON string arguments
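
The response mappings above can be sketched as one function. This is a simplified illustration: `convert_response` is a hypothetical name, tool call IDs are omitted, and falling back to "stop" for unlisted finish reasons is an assumption rather than documented behavior.

```python
import json

FINISH_REASONS = {"STOP": "stop", "MAX_TOKENS": "length",
                  "MALFORMED_FUNCTION_CALL": "tool_calls", "UNEXPECTED_TOOL_CALL": "tool_calls"}
for _reason in ("SAFETY", "RECITATION", "LANGUAGE", "BLOCKLIST",
                "PROHIBITED_CONTENT", "SPII", "IMAGE_SAFETY"):
    FINISH_REASONS[_reason] = "content_filter"


def convert_response(gemini):
    """Convert a Gemini generateContent response into a chat-completion-like dict."""
    candidate = gemini["candidates"][0]
    parts = candidate.get("content", {}).get("parts", [])
    # Regular text excludes thought parts (thought: true).
    text = "".join(p["text"] for p in parts if "text" in p and not p.get("thought"))
    tool_calls = [
        {"type": "function",
         "function": {"name": p["functionCall"]["name"],
                      # args map is serialized to a JSON string
                      "arguments": json.dumps(p["functionCall"].get("args", {}))}}
        for p in parts if "functionCall" in p
    ]
    usage = gemini.get("usageMetadata", {})
    return {
        "content": text,
        "tool_calls": tool_calls,
        "finish_reason": FINISH_REASONS.get(candidate.get("finishReason"), "stop"),
        "usage": {
            "prompt_tokens": usage.get("promptTokenCount", 0),
            "completion_tokens": usage.get("candidatesTokenCount", 0),
            "total_tokens": usage.get("totalTokenCount", 0),
            "prompt_tokens_details": {"cached_tokens": usage.get("cachedContentTokenCount", 0)},
            "completion_tokens_details": {"reasoning_tokens": usage.get("thoughtsTokenCount", 0)},
        },
    }
```

Callers need `json.loads` on `arguments` to get the original argument map back.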

Event structure:

  • Streaming responses contain deltas in delta.content (text), delta.reasoning (thoughts), delta.toolCalls (function calls)
  • Function responses appear as text content in the delta
  • finish_reason only set on final chunk
  • Usage metadata only included in final chunk
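
Because finish_reason and usage arrive only on the final chunk, a client typically accumulates deltas until the stream ends. The sketch below assumes OpenAI-style chunk dictionaries (choices[0].delta plus a top-level usage field); that chunk shape and the function name `collect_stream` are assumptions for illustration.

```python
def collect_stream(chunks):
    """Accumulate text and reasoning deltas from an iterable of stream chunks."""
    text, reasoning = [], []
    finish_reason, usage = None, None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        if delta.get("content"):
            text.append(delta["content"])
        if delta.get("reasoning"):
            reasoning.append(delta["reasoning"])
        # Only the final chunk is expected to carry these two fields.
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
        if chunk.get("usage"):
            usage = chunk["usage"]
    return {"content": "".join(text), "reasoning": "".join(reasoning),
            "finish_reason": finish_reason, "usage": usage}
```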

The Responses API uses the same underlying :generateContent endpoint and converts between OpenAI’s Responses format and Gemini’s native request and response format.

Parameter | Transformation
max_output_tokens | Renamed to maxOutputTokens
temperature, top_p | Direct pass-through
instructions | Converted to system instruction text
input (string or array) | Converted to messages
tools | Schema restructured (see Chat Completions)
tool_choice | Type mapped (see Chat Completions)
reasoning | Mapped to thinkingConfig (see Reasoning / Thinking)
text | Maps to responseMimeType and responseJsonSchema
stop | Via extra_params, renamed to stopSequences
top_k | Via extra_params

Use extra_params (SDK) or pass directly in request body (Gateway):

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.0-flash",
    "input": "Hello, how are you?",
    "instructions": "You are a helpful assistant.",
    "top_k": 40
  }'

  • Input: String wrapped as user message or array converted to messages
  • Instructions: Becomes system instruction (single text block)

Supported types: function, computer_use_preview, web_search, mcp

Tool conversions are the same as for Chat Completions, with these additions:

  • Computer tools auto-configured (if specified in DeepIntShield request)
  • Function-based tools always enabled
  • finishReason → status: STOP/MAX_TOKENS/other → completed | SAFETY → incomplete
  • Output items conversion:
    • Text parts → message field
    • Function calls → function_call field
    • Thought content → reasoning field
  • Usage fields preserved with cache tokens mapped to *_tokens_details.cached_tokens

Event structure: Similar to Chat Completions streaming

  • content_part.added emitted for text and reasoning parts
  • Item IDs generated as msg_{responseID}_item_{outputIndex}

Speech synthesis uses the underlying chat generation endpoint with audio response modality.

Parameter | Transformation
input | Text to synthesize → contents[0].parts[0].text
voice | Voice name → generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName
response_format | Only “wav” supported (default); auto-converted from PCM

Single Voice:

{
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": {
          "voiceName": "Chant-Female"
        }
      }
    }
  }
}

Multi-Speaker:

{
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [
          {
            "speaker": "Character A",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Chant-Female"
              }
            }
          }
        ]
      }
    }
  }
}

  • Audio data extracted from candidates[0].content.parts[].inlineData
  • Format conversion: Gemini returns PCM audio (s16le, 24kHz, mono)
  • Auto-conversion: PCM → WAV when response_format: "wav" (default)
  • Raw audio returned if response_format is omitted or empty string
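
The PCM → WAV step amounts to wrapping the raw samples in a WAV container. A minimal sketch using Python's standard wave module, with the stated format (s16le, 24 kHz, mono) as defaults; the function name `pcm_to_wav` is hypothetical and this is not necessarily how DeepIntShield performs the conversion.

```python
import io
import wave


def pcm_to_wav(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw little-endian PCM bytes in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)       # mono
        wav.setsampwidth(sample_width)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)    # 24 kHz
        wav.writeframes(pcm)
    return buf.getvalue()
```

The inline audio data in the Gemini response is base64-encoded, so it would need to be decoded to bytes before being passed to a function like this.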

Common Gemini voices include:

  • Chant-Female - Female voice
  • Chant-Male - Male voice
  • Additional voices depend on model capabilities

Check model documentation for complete list of supported voices.


Transcriptions are implemented as chat completions with audio content and text prompts.

Parameter | Transformation
file | Audio bytes → contents[].parts[].inlineData
prompt | Instructions → contents[0].parts[0].text (defaults to “Generate a transcript of the speech.”)
language | Via extra_params (if supported by model)

Audio is sent as inline data with auto-detected MIME type:

{
  "contents": [
    {
      "parts": [
        {
          "text": "<prompt text>"
        },
        {
          "inlineData": {
            "mimeType": "audio/wav",
            "data": "<base64-encoded-audio>"
          }
        }
      ]
    }
  ]
}

Safety settings and caching can be configured:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.0-flash",
    "file": "<binary-audio-data>",
    "prompt": "Transcribe this audio in the original language."
  }'

  • Transcribed text extracted from candidates[0].content.parts[].text
  • task set to "transcribe"
  • Usage metadata mapped:
    • promptTokenCount → input_tokens
    • candidatesTokenCount → output_tokens
    • totalTokenCount → total_tokens
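
Building the transcription payload shown earlier can be sketched as below. The function name `build_transcription_request` is hypothetical, and guessing the MIME type from the file name (with an audio/wav fallback) is an assumption; the actual auto-detection may inspect the audio bytes instead.

```python
import base64
import mimetypes

DEFAULT_PROMPT = "Generate a transcript of the speech."


def build_transcription_request(audio_bytes, filename, prompt=None):
    """Build a generateContent body with a text prompt and inline audio data."""
    mime_type, _ = mimetypes.guess_type(filename)
    return {
        "contents": [{
            "parts": [
                {"text": prompt or DEFAULT_PROMPT},
                {"inlineData": {
                    "mimeType": mime_type or "audio/wav",  # fallback is an assumption
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                }},
            ]
        }]
    }
```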

Request Parameters:

  • input → requests[0].content.parts[0].text (array inputs are joined with spaces into a single text)
  • dimensions → outputDimensionality
  • Extra task type and title via extra_params

Response Mapping:

  • embeddings[].values → DeepIntShield embedding array
  • metadata.billableCharacterCount → Usage prompt tokens (fallback)
  • Token counts extracted from usage metadata

Request formats: Inline requests array or file-based input

Pagination: Token-based with pageToken

Endpoints:

  • POST /v1beta/batchJobs - Create
  • GET /v1beta/batchJobs?pageSize={limit}&pageToken={token} - List
  • GET /v1beta/batchJobs/{batch_id} - Retrieve
  • POST /v1beta/batchJobs/{batch_id}:cancel - Cancel

Response Structure:

  • Status mapping:
    • BATCH_STATE_PENDING / BATCH_STATE_RUNNING → in_progress
    • BATCH_STATE_SUCCEEDED → completed
    • BATCH_STATE_FAILED → failed
    • BATCH_STATE_CANCELLING → cancelling
    • BATCH_STATE_CANCELLED → cancelled
    • BATCH_STATE_EXPIRED → expired
  • Inline responses: Array in dest.inlinedResponses
  • File-based responses: JSONL file in dest.fileName

Note: RFC3339 timestamps converted to Unix timestamps
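
The status mapping and timestamp conversion can be illustrated with a small sketch. The names `BATCH_STATUS` and `rfc3339_to_unix` are hypothetical, and the parser below only handles timestamps that `datetime.fromisoformat` accepts once a trailing "Z" is replaced; nanosecond-precision timestamps would need extra handling on older Python versions.

```python
from datetime import datetime

BATCH_STATUS = {
    "BATCH_STATE_PENDING": "in_progress",
    "BATCH_STATE_RUNNING": "in_progress",
    "BATCH_STATE_SUCCEEDED": "completed",
    "BATCH_STATE_FAILED": "failed",
    "BATCH_STATE_CANCELLING": "cancelling",
    "BATCH_STATE_CANCELLED": "cancelled",
    "BATCH_STATE_EXPIRED": "expired",
}


def rfc3339_to_unix(timestamp):
    """Convert an RFC3339 timestamp such as 2024-01-01T00:00:00Z to Unix seconds."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return int(parsed.timestamp())
```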


Upload: Multipart/form-data with file (binary) and filename (optional)

Field mapping:

  • name → id
  • displayName → filename
  • sizeBytes → size_bytes
  • mimeType → content_type
  • createTime (RFC3339) → Converted to Unix timestamp

Endpoints:

  • POST /upload/storage/v1beta/files - Upload
  • GET /v1beta/files?limit={limit}&pageToken={token} - List (cursor pagination)
  • GET /v1beta/files/{file_id} - Retrieve
  • DELETE /v1beta/files/{file_id} - Delete
  • GET /v1beta/files/{file_id}/content - Download

Gemini supports two image generation formats depending on the model:

  1. Standard Gemini Format: Uses the /v1beta/models/{model}:generateContent endpoint
  2. Imagen Format: Uses the /v1beta/models/{model}:predict endpoint for Imagen models (detected automatically)

Parameter | Transformation
prompt | Text description of the image to generate
n | Number of images (mapped to sampleCount for Imagen, candidateCount for Gemini)
size | Image size in WxH format (e.g., "1024x1024"). Converted to Imagen’s imageSize + aspectRatio format
output_format | Output format: "png", "jpeg", "webp". Converted to MIME type for Imagen
seed | Seed for reproducible generation (passed directly)
negative_prompt | Negative prompt (passed directly)

Use extra_params (SDK) or pass directly in request body (Gateway) for Gemini-specific fields:

Parameter | Type | Notes
personGeneration | string | Person generation setting (Imagen only)
language | string | Language code (Imagen only)
enhancePrompt | bool | Prompt enhancement flag (Imagen only)
safetySettings / safety_settings | string/array | Safety settings configuration
cachedContent / cached_content | string | Cached content ID
labels | object | Custom labels map

curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/imagen-4.0-generate-001",
    "prompt": "A sunset over the mountains",
    "size": "1024x1024",
    "n": 2,
    "output_format": "png"
  }'

Request Conversion: Standard Gemini Format

  • Model mapping: bifrostReq.Model → req.Model, with bifrostReq.Input.Prompt → req.Contents[0].Parts[0].Text
  • Response modality: Set internally by DeepIntShield to generationConfig.responseModalities = ["IMAGE"] to indicate image generation
  • Image count: n → generationConfig.candidateCount
  • Extra parameters: safetySettings, cachedContent, and labels are mapped directly

Request Conversion: Imagen Format

  • Prompt: bifrostReq.Prompt → req.Instances[0].Prompt
  • Number of Images: n → req.Parameters.SampleCount
  • Size Conversion: size (WxH format) converted to:
    • imageSize: "1k" (if dimensions ≤ 1024), "2k" (if dimensions ≤ 2048). Sizes larger than "2k" are not supported by Imagen models.
    • aspectRatio: "1:1", "3:4", "4:3", "9:16", or "16:9" (based on width/height ratio)
  • Output Format: output_format ("png", "jpeg") → parameters.outputOptions.mimeType ("image/png", "image/jpeg")
  • Seed & Negative Prompt: Passed directly to seed and parameters.negativePrompt
  • Extra Parameters: personGeneration, language, enhancePrompt, safetySettings mapped to parameters

Response Conversion: Standard Gemini Format

  • Image Data: Extracts InlineData from candidates[0].content.parts[] with MIME type image/*
  • Output Format: Converts MIME type (image/png, image/jpeg, image/webp) → file extension (png, jpeg, webp)
  • Usage: Extracts token usage from usageMetadata
  • Multiple Images: Each image part becomes an ImageData entry in the response array

Response Conversion: Imagen Format

  • Image Data: Each prediction in response.predictions[] → ImageData with b64_json from bytesBase64Encoded
  • Output Format: Converts prediction.mimeType → file extension for outputFormat field (Imagen does not support webp)
  • Index: Each prediction gets an index (0, 1, 2, …) in the response array

For Imagen format, size is converted between formats:

Supported Image Sizes: "1k" (≤1024), "2k" (≤2048)

Supported Aspect Ratios: "1:1", "3:4", "4:3", "9:16", "16:9"
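
The size conversion can be sketched as follows. The function name `convert_size` is hypothetical, and two details are assumptions: the larger of width and height decides between "1k" and "2k", and the aspect ratio is the supported ratio numerically closest to width/height.

```python
SUPPORTED_RATIOS = {"1:1": 1.0, "3:4": 3 / 4, "4:3": 4 / 3, "9:16": 9 / 16, "16:9": 16 / 9}


def convert_size(size):
    """Convert a "WxH" string into an (imageSize, aspectRatio) pair for Imagen."""
    width, height = (int(value) for value in size.lower().split("x"))
    longest = max(width, height)
    if longest <= 1024:
        image_size = "1k"
    elif longest <= 2048:
        image_size = "2k"
    else:
        raise ValueError(f"size {size} exceeds the 2k limit supported by Imagen")
    ratio = width / height
    # Pick the supported ratio closest to the requested one (assumed rule).
    aspect = min(SUPPORTED_RATIOS, key=lambda name: abs(SUPPORTED_RATIOS[name] - ratio))
    return image_size, aspect
```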

The provider automatically selects the endpoint based on model name:

  • Imagen models (detected via schemas.IsImagenModel()): Uses /v1beta/models/{model}:predict endpoint
  • Other models: Uses /v1beta/models/{model}:generateContent endpoint with image response modality

Image generation streaming is not supported by Gemini.
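
Endpoint selection by model name can be sketched like this. The `is_imagen_model` helper below is a stand-in for schemas.IsImagenModel(), whose actual detection rule is not shown here; matching on the substring "imagen" is an assumption.

```python
def is_imagen_model(model):
    """Stand-in for schemas.IsImagenModel(); the substring check is an assumption."""
    return "imagen" in model.lower()


def image_endpoint_for(model):
    """Return the request path used for image generation with the given model."""
    action = "predict" if is_imagen_model(model) else "generateContent"
    return f"/v1beta/models/{model}:{action}"
```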


Gemini supports image editing through two different APIs depending on the model:

  1. Standard Gemini Format: Uses the /v1beta/models/{model}:generateContent endpoint (for Gemini models)
  2. Imagen Format: Uses the /v1beta/models/{model}:predict endpoint (for Imagen models, detected automatically)

Request Parameters

Parameter | Type | Required | Notes
model | string | | Model identifier (Gemini or Imagen model)
prompt | string | | Text description of the edit
image[] | binary | | Image file(s) to edit (supports multiple images)
mask | binary | | Mask image file
type | string | | Edit type: "inpainting", "outpainting", "inpaint_removal", "bgswap" (Imagen only)
n | int | | Number of images to generate (1-10)
output_format | string | | Output format: "png", "webp", "jpeg"
output_compression | int | | Compression level (0-100%)
seed | int | | Seed for reproducibility (via ExtraParams["seed"])
negative_prompt | string | | Negative prompt (via ExtraParams["negativePrompt"])
guidanceScale | int | | Guidance scale (via ExtraParams["guidanceScale"], Imagen only)
baseSteps | int | | Base steps (via ExtraParams["baseSteps"], Imagen only)
maskMode | string | | Mask mode (via ExtraParams["maskMode"], Imagen only): "MASK_MODE_USER_PROVIDED", "MASK_MODE_BACKGROUND", "MASK_MODE_FOREGROUND", "MASK_MODE_SEMANTIC"
dilation | float | | Mask dilation (via ExtraParams["dilation"], Imagen only): Range [0, 1]
maskClasses | int[] | | Mask classes (via ExtraParams["maskClasses"], Imagen only): For MASK_MODE_SEMANTIC

Request Conversion

Standard Gemini Format (Non-Imagen Models)

  • Model & Prompt: bifrostReq.Model → req.Model, bifrostReq.Input.Prompt → req.Contents[0].Parts[0].Text
  • Images: Each image in bifrostReq.Input.Images is converted to a Part with:
    • MIME type detection (image/jpeg, image/webp, image/png) with fallback to image/png
    • Base64 encoding: image.Image → Part.InlineData.Data (base64 string)
    • MIME type: Part.InlineData.MIMEType
  • Response Modality: GenerationConfig.ResponseModalities is set to [ModalityImage] to indicate image generation
  • Extra Parameters: Extracted from ExtraParams:
    • safetySettings / safety_settings → SafetySettings
    • cachedContent / cached_content → CachedContent
    • labels → Labels (map[string]string)

Imagen Format (Imagen Models)

  • Reference Images: Each image in bifrostReq.Input.Images is converted to ReferenceImage with:
    • ReferenceType: "REFERENCE_TYPE_RAW"
    • ReferenceID: Sequential IDs starting from 1
    • ReferenceImage.BytesBase64Encoded: Base64-encoded image data
  • Mask Configuration: If Params.Mask is provided or maskMode is specified:
    • Default maskMode: "MASK_MODE_USER_PROVIDED" when mask data is present
    • maskMode can be overridden via ExtraParams["maskMode"]
    • dilation extracted from ExtraParams["dilation"] (validated to range [0, 1])
    • maskClasses extracted from ExtraParams["maskClasses"] (for MASK_MODE_SEMANTIC)
    • Mask image (if provided) is base64-encoded and added as ReferenceType: "REFERENCE_TYPE_MASK"
  • Edit Mode Mapping: Params.Type is mapped to EditMode:
    • "inpainting" → "EDIT_MODE_INPAINT_INSERTION"
    • "outpainting" → "EDIT_MODE_OUTPAINT"
    • "inpaint_removal" → "EDIT_MODE_INPAINT_REMOVAL"
    • "bgswap" → "EDIT_MODE_BGSWAP"
    • If Type is not set, editMode can be specified directly via ExtraParams["editMode"]
  • Parameters:
    • n → Parameters.SampleCount
    • output_format → Parameters.OutputOptions.MimeType (converted: "png" → "image/png", etc.)
    • output_compression → Parameters.OutputOptions.CompressionQuality
    • seed (via ExtraParams["seed"]) → Parameters.Seed
    • negativePrompt (via ExtraParams["negativePrompt"]) → Parameters.NegativePrompt
    • guidanceScale (via ExtraParams["guidanceScale"]) → Parameters.GuidanceScale
    • baseSteps (via ExtraParams["baseSteps"]) → Parameters.BaseSteps
    • Additional Imagen-specific parameters: addWatermark, includeRaiReason, includeSafetyAttributes, personGeneration, safetySetting, language, storageUri
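
The edit mode resolution and dilation check for Imagen requests can be sketched as below. Both function names are hypothetical; raising an error for an unknown type or an out-of-range dilation is an assumption about how invalid input is handled.

```python
EDIT_MODES = {
    "inpainting": "EDIT_MODE_INPAINT_INSERTION",
    "outpainting": "EDIT_MODE_OUTPAINT",
    "inpaint_removal": "EDIT_MODE_INPAINT_REMOVAL",
    "bgswap": "EDIT_MODE_BGSWAP",
}


def resolve_edit_mode(edit_type=None, extra_params=None):
    """Map the request type to an Imagen editMode, falling back to extra params."""
    if edit_type:
        return EDIT_MODES[edit_type]  # KeyError for unknown types (assumed behavior)
    return (extra_params or {}).get("editMode")


def validate_dilation(value):
    """Return the dilation as a float, rejecting values outside [0, 1]."""
    if not 0 <= value <= 1:
        raise ValueError("dilation must be in the range [0, 1]")
    return float(value)
```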

Response Conversion

  • Standard Gemini Format: Uses the same response conversion as image generation (see Image Generation section)
  • Imagen Format: Uses the same response conversion as Imagen image generation (see Image Generation section)

Endpoint Selection

The provider automatically selects the endpoint based on model name:

  • Imagen models (detected via schemas.IsImagenModel()): Uses /v1beta/models/{model}:predict endpoint
  • Other models: Uses /v1beta/models/{model}:generateContent endpoint with image response modality

Streaming

Image edit streaming is not supported by Gemini.

Image Variation

Image variation is not supported by Gemini.


Request: GET /v1beta/models?pageSize={limit}&pageToken={token} (no body)

Field mapping:

  • name (remove “models/” prefix) → id (add “gemini/” prefix)
  • displayName → name
  • description → description
  • inputTokenLimit → max_input_tokens
  • outputTokenLimit → max_output_tokens
  • Context length = inputTokenLimit + outputTokenLimit

Pagination: Token-based with nextPageToken
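
The model field mapping can be sketched as a single function. The name `convert_model` and the output key `context_length` are hypothetical; only the source and target fields listed above come from the mapping itself.

```python
def convert_model(model):
    """Convert one entry of Gemini's model list into a DeepIntShield-style dict."""
    name = model["name"]
    if name.startswith("models/"):
        name = name[len("models/"):]
    input_limit = model.get("inputTokenLimit", 0)
    output_limit = model.get("outputTokenLimit", 0)
    return {
        "id": "gemini/" + name,
        "name": model.get("displayName"),
        "description": model.get("description"),
        "max_input_tokens": input_limit,
        "max_output_tokens": output_limit,
        "context_length": input_limit + output_limit,  # hypothetical key name
    }
```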


Requests use JSON body (application/json).

Request Parameters

Parameter | Type | Required | Notes
model | string | | Veo model (e.g., veo-3.1-generate-preview)
prompt | string | | Text description of the video
input_reference | string | | Input image for image-to-video
seconds | string | | Duration → durationSeconds
size | string | | Resolution → aspect ratio (1280x720 → 16:9, 720x1280 → 9:16)
negative_prompt | string | | What to avoid in the video
seed | int | | Seed for reproducibility
audio | bool | | Enable audio generation → generateAudio
video_uri | string | | GCS video URI for video extension

Extra Params (any unrecognized JSON field is forwarded as extra_params)

Key | Notes
aspectRatio | Override the aspect ratio directly (e.g., "16:9", "9:16"). Takes precedence over size
resolution | Native Gemini resolution string
sampleCount | Number of samples to generate
personGeneration | Person generation policy
numberOfVideos | Number of videos to generate
storageURI | GCS bucket for output storage
compressionQuality | Output compression quality
enhancePrompt | Auto-enhance the prompt
resizeMode | How to handle size mismatches
reference_images | Style/asset reference image objects
lastFrame | Last frame image object for interpolation

Response: DeepIntShieldVideoGenerationResponse with id, status, videos[]

If Gemini filters content for safety, status is failed and content_filter describes the reason.

Job Statuses: in_progress → completed / failed

Operation | Endpoint | Notes
Get status | GET /v1/videos/{id} | Polls the long-running operation
Download | GET /v1/videos/{id}/content | Downloads from GCS URI or decodes base64 video

Video Delete, List, and Remix are not supported.
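
Since video generation is a long-running job, a client usually polls the status endpoint until the job leaves in_progress. The sketch below is transport-agnostic: `get_status` is a hypothetical callable that performs GET /v1/videos/{id} and returns the parsed JSON, and the polling interval and limit are arbitrary illustrative values.

```python
import time


def wait_for_video(video_id, get_status, interval=5.0, max_polls=60, sleep=time.sleep):
    """Poll a video job until it is completed or failed, then return the job dict."""
    for _ in range(max_polls):
        job = get_status(video_id)
        if job["status"] in ("completed", "failed"):
            return job
        sleep(interval)
    raise TimeoutError(f"video {video_id} still in progress after {max_polls} polls")
```

A failed job should be checked for content_filter before retrying, because a safety block will not succeed on a second attempt with the same prompt.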


DeepIntShield supports the following content modalities through Gemini:

Content Type | Support | Notes
Text | | Full support
Images (URL/Base64) | | Converted to {type: "image", source: {...}}
Video | | With fps, start/end offset metadata
Audio | ⚠️ | Via file references only
PDF | | Via file references
Code Execution | | Auto-executed with results returned
Thinking/Reasoning | | Thought parts marked with thought: true
Function Calls | | With optional thought signatures
Tool Response Grouping

Severity: High
Behavior: Consecutive tool response messages merged into single user message
Impact: Message count and structure changes
Code: chat.go:627-678

Thinking Content Handling

Severity: Medium
Behavior: Thought content appears as text parts with thought: true flag
Impact: Requires checking thought flag to distinguish from regular text
Code: chat.go:242-244, 302-304

Function Call Arguments Serialization

Severity: Low
Behavior: Tool call args (object) converted to arguments (JSON string)
Impact: Requires JSON parsing to access arguments
Code: chat.go:101-106

Thought Signature Base64 Encoding

Severity: Low
Behavior: thoughtSignature base64 URL-safe encoded, auto-converted during unmarshal
Impact: Transparent to user; handled automatically
Code: types.go:1048-1063

Streaming Finish Reason Timing

Severity: Medium
Behavior: finish_reason only present in final stream chunk with usage metadata
Impact: Cannot determine completion until end of stream
Code: chat.go:206-208, 325-328

Cached Content Token Reporting

Severity: Low
Behavior: Cached tokens reported in prompt_tokens_details.cached_tokens, cannot distinguish cache creation vs read
Impact: Billing estimates may be approximate
Code: utils.go:270-274

System Instruction Integration

Severity: Medium
Behavior: System instructions become systemInstruction field (separate from messages), not included in message array
Impact: Structure differs from OpenAI’s system message approach
Code: responses.go:34-46