Ollama

Overview

Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. DeepIntShield delegates to the OpenAI implementation while supporting Ollama’s unique configuration requirements. Key characteristics:

Local-first deployment - Run models locally or on private infrastructure
OpenAI API compatibility - Identical request/response format
Full feature support - Chat, text, embeddings, and streaming
Tool calling - Complete function definition and execution
Self-hosted - No external API dependency required

Supported Operations

Operation	Non-Streaming	Streaming	Endpoint
Chat Completions	✅	✅	`/v1/chat/completions`
Responses API	✅	✅	`/v1/chat/completions`
Text Completions	✅	✅	`/v1/completions`
Embeddings	✅	-	`/v1/embeddings`
List Models	✅	-	`/v1/models`
Image Generation	❌	❌	-
Speech (TTS)	❌	❌	-
Transcriptions (STT)	❌	❌	-
Files	❌	❌	-
Batch	❌	❌	-

1. Chat Completions

Request Parameters

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.

Filtered Parameters

Removed for Ollama compatibility:

prompt_cache_key - Not supported
verbosity - OpenAI-specific; not supported
store - Not supported
service_tier - Not supported
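The filtering step amounts to deleting a fixed set of keys from the request before it is forwarded. The following is a minimal sketch, assuming the request is held as a generic map; `filterUnsupportedParams` is a hypothetical helper written for illustration and is not the gateway's actual implementation.

```go
package main

import "fmt"

// unsupportedParams lists the request fields this page says are removed for Ollama.
var unsupportedParams = []string{"prompt_cache_key", "verbosity", "store", "service_tier"}

// filterUnsupportedParams returns a copy of req without the unsupported fields.
// Hypothetical helper for illustration only; the input map is left unchanged.
func filterUnsupportedParams(req map[string]any) map[string]any {
	out := make(map[string]any, len(req))
	for k, v := range req {
		out[k] = v
	}
	for _, k := range unsupportedParams {
		delete(out, k)
	}
	return out
}

func main() {
	req := map[string]any{
		"model":        "llama3.1:latest",
		"store":        true,
		"service_tier": "default",
	}
	fmt.Println(filterUnsupportedParams(req)) // map[model:llama3.1:latest]
}
```

Copying the map before deleting keeps the caller's request intact, which matters if the same request is later retried against a provider that does accept these fields.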

Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to OpenAI Chat Completions.

2. Responses API

Converted internally to Chat Completions:

ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse

Same parameter support as Chat Completions.
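The request half of this pipeline can be sketched as a plain struct-to-struct mapping. The types below are simplified, hypothetical stand-ins rather than the gateway's real schemas, and turning `Instructions` into a leading system message is an assumption about how such a conversion might work, not confirmed behavior.

```go
package main

import "fmt"

// Message is a simplified chat message (hypothetical type).
type Message struct {
	Role    string
	Content string
}

// ResponsesRequest is a simplified Responses-style request (hypothetical type).
type ResponsesRequest struct {
	Model        string
	Instructions string // optional system-style instructions
	Input        []Message
}

// ChatRequest is a simplified chat completion request (hypothetical type).
type ChatRequest struct {
	Model    string
	Messages []Message
}

// toChatRequest maps a Responses-style request onto a chat request.
// Assumption: instructions become a leading system message and the
// input items are carried over as messages in their original order.
func toChatRequest(r ResponsesRequest) ChatRequest {
	msgs := make([]Message, 0, len(r.Input)+1)
	if r.Instructions != "" {
		msgs = append(msgs, Message{Role: "system", Content: r.Instructions})
	}
	msgs = append(msgs, r.Input...)
	return ChatRequest{Model: r.Model, Messages: msgs}
}

func main() {
	chat := toChatRequest(ResponsesRequest{
		Model:        "llama3.1:latest",
		Instructions: "Answer briefly.",
		Input:        []Message{{Role: "user", Content: "Hello"}},
	})
	fmt.Println(len(chat.Messages), chat.Messages[0].Role) // 2 system
}
```

The response half (ChatCompletion to ResponsesResponse) would be the mirror image of this mapping and is omitted here.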

3. Text Completions

Ollama supports the legacy text completion format:

Parameter	Mapping
`prompt`	Direct pass-through
`max_tokens`	max_tokens
`temperature`, `top_p`	Direct pass-through
`stop`	Stop sequences

4. Embeddings

Ollama supports text embeddings:

Parameter	Notes
`input`	Text or array of texts
`model`	Embedding model name
`encoding_format`	"float" or "base64"
`dimensions`	Custom output dimensions (optional)

Response returns embedding vectors with token usage.
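As a rough illustration of the parameters above, the sketch below builds a JSON body for `/v1/embeddings`. The struct and the `buildEmbeddingBody` helper are hypothetical, and `nomic-embed-text` is only an example model name; substitute whichever embedding model is installed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// embeddingRequest covers the parameters listed in the table above.
type embeddingRequest struct {
	Model          string   `json:"model"`
	Input          []string `json:"input"`
	EncodingFormat string   `json:"encoding_format,omitempty"`
	Dimensions     int      `json:"dimensions,omitempty"` // omitted when zero
}

// buildEmbeddingBody serializes a request body for POST /v1/embeddings,
// asking for float-encoded vectors.
func buildEmbeddingBody(model string, inputs []string) ([]byte, error) {
	return json.Marshal(embeddingRequest{
		Model:          model,
		Input:          inputs,
		EncodingFormat: "float",
	})
}

func main() {
	body, err := buildEmbeddingBody("nomic-embed-text", []string{"first text", "second text"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// {"model":"nomic-embed-text","input":["first text","second text"],"encoding_format":"float"}
}
```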

5. List Models

Lists the models available in the local Ollama instance, with capabilities and context information.

Unsupported Features

Feature	Reason
Speech/TTS	Not offered by Ollama API
Transcription/STT	Not offered by Ollama API
Batch Operations	Not offered by Ollama API
File Management	Not offered by Ollama API

```
# Chat request for a local Ollama model, sent through the gateway.
# The gateway must be configured with the Ollama BaseURL.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

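```
// ctx, key, request, and logger are assumed to be defined by the caller.
config := &schemas.ProviderConfig{
	NetworkConfig: schemas.NetworkConfig{
		BaseURL:                        "http://localhost:11434", // Required: there is no default
		DefaultRequestTimeoutInSeconds: 30,
	},
}
provider, err := ollama.NewOllamaProvider(config, logger)
if err != nil {
	log.Fatalf("failed to create Ollama provider: %v", err)
}

response, chatErr := provider.ChatCompletion(ctx, key, request)
if chatErr != nil {
	log.Fatalf("chat completion failed: %v", chatErr)
}
```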

Environment Setup:

Install Ollama from https://ollama.ai

Pull a model:

```
ollama pull llama3.1
ollama pull mistral
ollama pull neural-chat
```

Start Ollama server:
```
ollama serve
```
Verify it’s running:
```
curl http://localhost:11434/api/tags
```

Performance Considerations

Streaming for Large Models: Large models can take a long time to finish a response. Enable streaming so users see tokens as they are generated:

```
{
  "model": "llama3.1:latest",
  "messages": [...],
  "stream": true
}
```
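On the client side, a streamed reply arrives as OpenAI-style server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below shows one way to pull the text out of each line. `parseSSELine` is a hypothetical helper, and the sample lines in `main` are hand-written stand-ins for a real HTTP response body.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// chunk mirrors the part of an OpenAI-style streaming chunk that carries text.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// parseSSELine extracts the text delta from a single "data: ..." line.
// It reports done=true for the terminating "data: [DONE]" line and
// ignores lines that are not data fields (blank lines, comments).
func parseSSELine(line string) (string, bool, error) {
	if !strings.HasPrefix(line, "data:") {
		return "", false, nil
	}
	payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
	if payload == "[DONE]" {
		return "", true, nil
	}
	var c chunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return "", false, err
	}
	text := ""
	for _, ch := range c.Choices {
		text += ch.Delta.Content
	}
	return text, false, nil
}

func main() {
	lines := []string{
		`data: {"choices":[{"delta":{"content":"Hel"}}]}`,
		``,
		`data: {"choices":[{"delta":{"content":"lo"}}]}`,
		`data: [DONE]`,
	}
	var sb strings.Builder
	for _, line := range lines {
		text, done, err := parseSSELine(line)
		if err != nil {
			panic(err)
		}
		if done {
			break
		}
		sb.WriteString(text)
	}
	fmt.Println(sb.String()) // Hello
}
```

In a real client the lines would come from a `bufio.Scanner` over the response body, with each delta written to the user interface as soon as it is parsed.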

Token Context: Different models have different context windows:

Llama 3.1 70B: 128K tokens
Mistral 7B: 32K tokens
Neural Chat 7B: 8K tokens

GPU Acceleration: Ollama automatically uses a GPU if one is available. For CPU-only inference, set `DefaultRequestTimeoutInSeconds` high enough that slow generations do not time out.

Popular Models

Model	Size	Context	Speed
llama3.1:latest	Varies	128K	Fast
mistral:latest	7B	32K	Very Fast
neural-chat:latest	7B	8K	Very Fast
orca-mini:latest	3B	3K	Very Fast
openchat:latest	7B	8K	Very Fast

Caveats

BaseURL Configuration Required

Severity: High
Behavior: `BaseURL` must be configured explicitly; there is no default
Impact: Requests fail without proper configuration
Code: `NewOllamaProvider` validates that `BaseURL` is set
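A check of this kind might look like the sketch below. `validateBaseURL` is a hypothetical function; the source states only that the provider constructor verifies `BaseURL` is set, so the extra scheme and host checks here are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateBaseURL rejects an empty or malformed base URL.
// Hypothetical sketch; the gateway's real validation may differ.
func validateBaseURL(raw string) error {
	if raw == "" {
		return errors.New("ollama: BaseURL is required and has no default")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("ollama: invalid BaseURL %q: %w", raw, err)
	}
	if u.Scheme == "" || u.Host == "" {
		return fmt.Errorf("ollama: BaseURL %q must include a scheme and host", raw)
	}
	return nil
}

func main() {
	fmt.Println(validateBaseURL(""))                       // ollama: BaseURL is required and has no default
	fmt.Println(validateBaseURL("http://localhost:11434")) // <nil>
}
```

Failing at construction time surfaces a missing `BaseURL` immediately instead of on the first request.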

Cache Control Stripped

Severity: Low
Behavior: Cache control directives are removed from messages
Impact: Prompt caching features don’t work
Code: Stripped during JSON marshaling

Parameter Filtering

Severity: Low
Behavior: OpenAI-specific parameters are filtered out
Impact: `prompt_cache_key`, `verbosity`, `store`, and `service_tier` are removed
Code: `filterOpenAISpecificParameters`

User Field Size Limit

Severity: Low
Behavior: A user field longer than 64 characters is silently dropped
Impact: Longer user identifiers are lost
Code: `SanitizeUserField` enforces a 64-character maximum
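The drop-rather-than-truncate behavior can be sketched as follows. `sanitizeUser` is a hypothetical stand-in for the gateway's own function, and counting Unicode characters rather than bytes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxUserFieldLen is the limit described in this caveat.
const maxUserFieldLen = 64

// sanitizeUser returns the identifier and true when it fits the limit,
// or "" and false when it is too long and should be left out of the request.
// Hypothetical sketch: it counts characters (runes), which is an assumption.
func sanitizeUser(user string) (string, bool) {
	if utf8.RuneCountInString(user) > maxUserFieldLen {
		return "", false
	}
	return user, true
}

func main() {
	_, ok := sanitizeUser(strings.Repeat("x", 65))
	fmt.Println(ok) // false
	id, ok := sanitizeUser("user-123")
	fmt.Println(id, ok) // user-123 true
}
```

Because the field is dropped silently, callers that rely on user identifiers for auditing may want to check the length themselves before sending a request.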