
Ollama

Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. DeepIntShield delegates to the OpenAI implementation while supporting Ollama’s unique configuration requirements. Key characteristics:

  • Local-first deployment - Run models locally or on private infrastructure
  • OpenAI API compatibility - Identical request/response format
  • Full feature support - Chat, text, embeddings, and streaming
  • Tool calling - Complete function definition and execution
  • Self-hosted - No external API dependency required
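| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Chat Completions | Yes | Yes | /v1/chat/completions |
| Responses API | Yes | Yes | /v1/chat/completions |
| Text Completions | Yes | Yes | /v1/completions |
| Embeddings | Yes | - | /v1/embeddings |
| List Models | Yes | - | /v1/models |
| Image Generation | No | No | - |
| Speech (TTS) | No | No | - |
| Transcriptions (STT) | No | No | - |
| Files | No | No | - |
| Batch | No | No | - |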

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.

Removed for Ollama compatibility:

  • prompt_cache_key - Not supported
  • verbosity - Anthropic-specific
  • store - Not supported
  • service_tier - Not supported
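
The filtering described above can be pictured with a minimal Python sketch. The function name `strip_removed_params` is hypothetical and the gateway's real implementation is not shown in this page; the sketch only assumes the four field names listed above sit at the top level of the request body.

```python
# Hypothetical sketch: drop the request fields listed above as removed for Ollama.
REMOVED_FOR_OLLAMA = ("prompt_cache_key", "verbosity", "store", "service_tier")


def strip_removed_params(payload: dict) -> dict:
    """Return a copy of the payload without the fields removed for Ollama."""
    return {k: v for k, v in payload.items() if k not in REMOVED_FOR_OLLAMA}


request = {
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}],
    "store": True,
    "service_tier": "default",
}
cleaned = strip_removed_params(request)
print(sorted(cleaned))
```

The original dictionary is left untouched, so a caller can still log what was sent before filtering.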

Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to OpenAI Chat Completions.


Responses API requests are converted internally to Chat Completions:

ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse

Same parameter support as Chat Completions.
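
As a rough illustration of the first arrow in that pipeline (ResponsesRequest → ChatRequest), the sketch below maps a Responses-style body onto a chat-style body. The field mapping (`instructions` to a system message, `input` to user messages, `max_output_tokens` to `max_tokens`) is an assumption based on the public OpenAI request shapes, not the gateway's actual conversion code, and `responses_to_chat` is a hypothetical name.

```python
def responses_to_chat(req: dict) -> dict:
    """Map a Responses-style request onto a Chat Completions-style request."""
    messages = []
    if req.get("instructions"):
        # Assumption: instructions become a leading system message.
        messages.append({"role": "system", "content": req["instructions"]})

    user_input = req["input"]
    if isinstance(user_input, str):
        messages.append({"role": "user", "content": user_input})
    else:
        # Assumption: a list input holds {"role", "content"} items.
        messages.extend(
            {"role": m["role"], "content": m["content"]} for m in user_input
        )

    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        chat["max_tokens"] = req["max_output_tokens"]
    return chat


chat_request = responses_to_chat(
    {
        "model": "ollama/llama3.1:latest",
        "instructions": "Answer briefly.",
        "input": "Hello",
        "max_output_tokens": 64,
    }
)
print(chat_request)
```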


Ollama supports the legacy text completion format:

| Parameter | Mapping |
|---|---|
| prompt | Direct pass-through |
| max_tokens | max_tokens |
| temperature, top_p | Direct pass-through |
| stop | Stop sequences |
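
A small sketch of a request body built from the parameters in this table may help. The helper name `build_completion_payload` is hypothetical; the body would be posted to /v1/completions on the gateway, and unset optional parameters are simply left out.

```python
def build_completion_payload(model, prompt, max_tokens=None,
                             temperature=None, top_p=None, stop=None):
    """Build a legacy text completion body, omitting parameters left unset."""
    payload = {"model": model, "prompt": prompt}
    optional = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


completion_body = build_completion_payload(
    "ollama/llama3.1:latest",
    "Once upon a time",
    max_tokens=50,
    stop=["\n\n"],
)
print(completion_body)
```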

Ollama supports text embeddings:

| Parameter | Notes |
|---|---|
| input | Text or array of texts |
| model | Embedding model name |
| encoding_format | "float" or "base64" |
| dimensions | Custom output dimensions (optional) |

The response returns the embedding vectors along with token usage.
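
The sketch below shows one way to build an embeddings request and read the vectors back out. It assumes the response follows the standard OpenAI embeddings shape (`data[].embedding`, `data[].index`, `usage.total_tokens`); the helper names and the model name `ollama/nomic-embed-text` are illustrative assumptions, not taken from this page.

```python
def build_embeddings_payload(model, texts, encoding_format="float", dimensions=None):
    """Build an embeddings request body from the parameters in the table."""
    payload = {"model": model, "input": texts, "encoding_format": encoding_format}
    if dimensions is not None:
        payload["dimensions"] = dimensions
    return payload


def extract_embeddings(response: dict):
    """Return (vectors ordered by index, total token usage or None)."""
    items = sorted(response["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    return vectors, response.get("usage", {}).get("total_tokens")


embed_body = build_embeddings_payload("ollama/nomic-embed-text", ["first", "second"])

# Sample response in the assumed OpenAI-style shape, deliberately out of order.
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ],
    "usage": {"prompt_tokens": 2, "total_tokens": 2},
}
vectors, total_tokens = extract_embeddings(sample_response)
print(vectors, total_tokens)
```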


The List Models endpoint returns the models currently loaded in Ollama, along with their capabilities and context information.


| Feature | Reason |
|---|---|
| Speech/TTS | Not offered by Ollama API |
| Transcription/STT | Not offered by Ollama API |
| Batch Operations | Not offered by Ollama API |
| File Management | Not offered by Ollama API |

# Send a chat request through the gateway (listening on port 8080 here).
# The gateway must already be configured with the Ollama BaseURL.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
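
For callers who prefer a script to curl, the same request can be sketched with Python's standard library. The gateway URL and model name are copied from the curl example; `build_chat_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(model, content, url=GATEWAY_URL):
    """Build (but do not send) a POST request equivalent to the curl example."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": content}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("ollama/llama3.1:latest", "Hello")
print(req.get_method(), req.full_url)

# To actually send it, with a generous timeout for slow local inference:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp))
```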

Environment Setup:

  1. Install Ollama from https://ollama.ai
  2. Pull a model:
    ollama pull llama3.1
    ollama pull mistral
    ollama pull neural-chat
  3. Start Ollama server:
    ollama serve
  4. Verify it’s running:
    curl http://localhost:11434/api/tags
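
The verification step can also be scripted. The sketch below assumes the /api/tags response is a JSON object with a `models` array whose entries carry a `name` field; `installed_models` is a hypothetical helper, and the sample data stands in for a live response.

```python
def installed_models(tags_response: dict) -> list:
    """Return the model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_response.get("models", [])]


# Sample data in the assumed shape; a live call would parse the JSON returned
# by http://localhost:11434/api/tags instead.
sample_tags = {"models": [{"name": "llama3.1:latest"}, {"name": "mistral:latest"}]}
names = installed_models(sample_tags)
print(names)
print("llama3.1:latest" in names)
```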

Streaming for Large Models: For better user experience with large models, use streaming:

{
"model": "llama3.1:latest",
"messages": [...],
"stream": true
}
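
On the client side, a streamed reply arrives as a sequence of server-sent event lines. The following sketch assumes the OpenAI-style chunk format (`data: {...}` lines with `choices[].delta.content`, terminated by `data: [DONE]`); `iter_stream_text` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
streamed_text = "".join(iter_stream_text(sample_lines))
print(streamed_text)
```

Because the helper is a generator, a caller can print each fragment as it arrives rather than waiting for the full reply.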

Token Context: Different models have different context windows:

  • Llama 3.1 70B: 128K tokens
  • Mistral 7B: 32K tokens
  • Neural Chat 7B: 8K tokens
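
A quick budget check against these windows might look like the sketch below. It assumes "K" means 1024 tokens, that the prompt token count is already known, and that the model family can be read from the name; `fits_context` and the lookup table are hypothetical.

```python
# Context windows from the list above, assuming 1K = 1024 tokens.
CONTEXT_WINDOWS = {
    "llama3.1": 128 * 1024,
    "mistral": 32 * 1024,
    "neural-chat": 8 * 1024,
}


def fits_context(model: str, prompt_tokens: int, max_tokens: int) -> bool:
    """Check that the prompt plus the requested completion fits the window."""
    # "ollama/neural-chat:latest" -> "neural-chat"
    family = model.split("/")[-1].split(":")[0]
    return prompt_tokens + max_tokens <= CONTEXT_WINDOWS[family]


print(fits_context("ollama/mistral:latest", 8000, 500))
print(fits_context("ollama/neural-chat:latest", 8000, 500))
```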

GPU Acceleration: Ollama automatically uses a GPU if one is available. On CPU-only hosts inference is much slower, so make sure the request timeout is long enough.


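| Model | Size | Context | Speed |
|---|---|---|---|
| llama3.1:latest | Varies | 128K | Fast |
| mistral:latest | 7B | 32K | Very Fast |
| neural-chat:latest | 7B | 8K | Very Fast |
| orca-mini:latest | 3B | 3K | Very Fast |
| openchat:latest | 7B | 8K | Very Fast |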
openchat:latest7B8KVery Fast

BaseURL Configuration Required

  • Severity: High
  • Behavior: BaseURL must be explicitly configured; there is no default
  • Impact: Requests fail without proper configuration
  • Code: NewOllamaProvider validates that BaseURL is set
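
The kind of check this caveat describes can be sketched in a few lines. This is not the NewOllamaProvider code, which is not shown here; `validate_base_url` is a hypothetical Python stand-in that rejects a missing or malformed URL instead of falling back to a default.

```python
from urllib.parse import urlparse


def validate_base_url(base_url):
    """Return a normalized base URL, or raise if it is missing or malformed."""
    if not base_url or not base_url.strip():
        raise ValueError("Ollama BaseURL must be configured; there is no default")
    cleaned = base_url.strip()
    parsed = urlparse(cleaned)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Ollama BaseURL is not a valid http(s) URL: {cleaned!r}")
    return cleaned.rstrip("/")


print(validate_base_url("http://localhost:11434/"))
```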

Cache Control Stripped

  • Severity: Low
  • Behavior: Cache control directives are removed from messages
  • Impact: Prompt caching features don’t work
  • Code: Stripped during JSON marshaling
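
To make the effect concrete, the sketch below removes cache directives from a message list before it is serialized. The field name `cache_control` and the helper `strip_cache_control` are assumptions for illustration; this page does not name the exact field the gateway strips.

```python
def strip_cache_control(value):
    """Recursively drop 'cache_control' keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_cache_control(item)
            for key, item in value.items()
            if key != "cache_control"  # assumed field name
        }
    if isinstance(value, list):
        return [strip_cache_control(item) for item in value]
    return value


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Hello", "cache_control": {"type": "ephemeral"}}
        ],
    }
]
stripped = strip_cache_control(messages)
print(stripped)
```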

Parameter Filtering

  • Severity: Low
  • Behavior: OpenAI-specific parameters are filtered out
  • Impact: prompt_cache_key, verbosity, and store are removed
  • Code: filterOpenAISpecificParameters

User Field Size Limit

  • Severity: Low
  • Behavior: A user field longer than 64 characters is silently dropped
  • Impact: Longer user identifiers are lost
  • Code: SanitizeUserField enforces a 64-character maximum
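
The behavior described for the user field can be mimicked with a short sketch. It is a hypothetical Python stand-in for SanitizeUserField, assuming the field is dropped entirely (not truncated) once it exceeds 64 characters, as stated above.

```python
MAX_USER_LENGTH = 64


def sanitize_user_field(payload: dict) -> dict:
    """Return a copy of the payload, dropping 'user' if it exceeds 64 characters."""
    user = payload.get("user")
    if isinstance(user, str) and len(user) > MAX_USER_LENGTH:
        return {k: v for k, v in payload.items() if k != "user"}
    return dict(payload)


kept = sanitize_user_field({"model": "ollama/llama3.1:latest", "user": "u" * 64})
dropped = sanitize_user_field({"model": "ollama/llama3.1:latest", "user": "u" * 65})
print("user" in kept, "user" in dropped)
```

Callers that rely on long identifiers may want to hash or shorten them before sending, since nothing signals that the field was removed.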