Ollama
Overview
Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. DeepIntShield delegates to the OpenAI implementation while supporting Ollama’s unique configuration requirements. Key characteristics:
- Local-first deployment - Run models locally or on private infrastructure
- OpenAI API compatibility - Identical request/response format
- Full feature support - Chat, text, embeddings, and streaming
- Tool calling - Complete function definition and execution
- Self-hosted - No external API dependency required
Supported Operations
| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Chat Completions | ✅ | ✅ | /v1/chat/completions |
| Responses API | ✅ | ✅ | /v1/chat/completions |
| Text Completions | ✅ | ✅ | /v1/completions |
| Embeddings | ✅ | - | /v1/embeddings |
| List Models | ✅ | - | /v1/models |
| Image Generation | ❌ | ❌ | - |
| Speech (TTS) | ❌ | ❌ | - |
| Transcriptions (STT) | ❌ | ❌ | - |
| Files | ❌ | ❌ | - |
| Batch | ❌ | ❌ | - |
1. Chat Completions
Request Parameters
Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.
Filtered Parameters
Removed for Ollama compatibility:
- prompt_cache_key - Not supported
- verbosity - Anthropic-specific
- store - Not supported
- service_tier - Not supported
Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to OpenAI Chat Completions.
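As an illustration of the filtering described above, the following Go sketch drops the four listed parameters from a JSON request body before it is forwarded. The helper name `stripUnsupported` and the map-based request representation are assumptions made for this example; they are not taken from the DeepIntShield source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unsupportedParams lists the fields this page says are removed for Ollama.
var unsupportedParams = []string{"prompt_cache_key", "verbosity", "store", "service_tier"}

// stripUnsupported is a hypothetical helper: it decodes a JSON request body,
// deletes the unsupported top-level fields, and re-encodes the result.
func stripUnsupported(body []byte) ([]byte, error) {
	var req map[string]json.RawMessage
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, fmt.Errorf("decode request: %w", err)
	}
	for _, name := range unsupportedParams {
		delete(req, name)
	}
	return json.Marshal(req)
}

func main() {
	in := []byte(`{"model":"llama3.1:latest","store":true,"service_tier":"auto","temperature":0.2}`)
	out, err := stripUnsupported(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

All other fields pass through untouched, so a caller that sends one of these parameters gets no error; the field is simply absent from the upstream request.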
2. Responses API
Converted internally to Chat Completions:
ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
Same parameter support as Chat Completions.
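The conversion chain above can be sketched in Go roughly as follows. Every type and function here is a simplified, hypothetical stand-in (the real schemas carry many more fields); the point is only to show the shape of the request-in, response-out adaptation around a chat backend.

```go
package main

import "fmt"

// All types below are simplified, hypothetical stand-ins for the real schemas.
type ResponsesRequest struct {
	Model string
	Input string
}

type ChatMessage struct {
	Role    string
	Content string
}

type ChatRequest struct {
	Model    string
	Messages []ChatMessage
}

type ChatCompletion struct {
	Model   string
	Content string
}

type ResponsesResponse struct {
	Model      string
	OutputText string
}

// toChatRequest wraps the Responses input as a single user message.
func toChatRequest(r ResponsesRequest) ChatRequest {
	return ChatRequest{
		Model:    r.Model,
		Messages: []ChatMessage{{Role: "user", Content: r.Input}},
	}
}

// toResponsesResponse maps the chat result back to the Responses shape.
func toResponsesResponse(c ChatCompletion) ResponsesResponse {
	return ResponsesResponse{Model: c.Model, OutputText: c.Content}
}

// handleResponses runs the full chain around any chat backend.
func handleResponses(r ResponsesRequest, chat func(ChatRequest) ChatCompletion) ResponsesResponse {
	return toResponsesResponse(chat(toChatRequest(r)))
}

func main() {
	// echoBackend stands in for the real Chat Completions call.
	echoBackend := func(c ChatRequest) ChatCompletion {
		return ChatCompletion{Model: c.Model, Content: "echo: " + c.Messages[0].Content}
	}
	resp := handleResponses(ResponsesRequest{Model: "llama3.1:latest", Input: "Hello"}, echoBackend)
	fmt.Println(resp.OutputText)
}
```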
3. Text Completions
Ollama supports the legacy text completion format:
| Parameter | Mapping |
|---|---|
| prompt | Direct pass-through |
| max_tokens | max_tokens |
| temperature, top_p | Direct pass-through |
| stop | Stop sequences |
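A minimal Go sketch of a text completion payload using the parameters in the table above. The struct `TextCompletionRequest` and the helper `encodeTextCompletion` are illustrative assumptions, not types from the gateway; the `ollama/` model prefix follows the curl example elsewhere on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TextCompletionRequest mirrors the parameters in the table above.
// Pointer fields let an explicit 0 be sent instead of being omitted.
type TextCompletionRequest struct {
	Model       string   `json:"model"`
	Prompt      string   `json:"prompt"`
	MaxTokens   int      `json:"max_tokens,omitempty"`
	Temperature *float64 `json:"temperature,omitempty"`
	TopP        *float64 `json:"top_p,omitempty"`
	Stop        []string `json:"stop,omitempty"`
}

// encodeTextCompletion serializes the request as the JSON body
// that would be POSTed to /v1/completions.
func encodeTextCompletion(req TextCompletionRequest) (string, error) {
	b, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("encode text completion: %w", err)
	}
	return string(b), nil
}

func main() {
	temp := 0.7
	body, err := encodeTextCompletion(TextCompletionRequest{
		Model:       "ollama/llama3.1:latest",
		Prompt:      "Once upon a time",
		MaxTokens:   64,
		Temperature: &temp,
		Stop:        []string{"END"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```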
4. Embeddings
Ollama supports text embeddings:
| Parameter | Notes |
|---|---|
| input | Text or array of texts |
| model | Embedding model name |
| encoding_format | "float" or "base64" |
| dimensions | Custom output dimensions (optional) |
Response returns embedding vectors with token usage.
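The sketch below builds an embeddings request from the parameters above and decodes a response into vectors plus token usage. It assumes the OpenAI-style response layout (`data[].embedding`, `usage`); the struct names, the model name, and the sample response are all hand-written placeholders for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EmbeddingRequest carries the parameters from the table above.
type EmbeddingRequest struct {
	Model          string   `json:"model"`
	Input          []string `json:"input"`
	EncodingFormat string   `json:"encoding_format,omitempty"`
	Dimensions     int      `json:"dimensions,omitempty"`
}

// EmbeddingResponse keeps only the vectors and token usage.
type EmbeddingResponse struct {
	Data []struct {
		Index     int       `json:"index"`
		Embedding []float64 `json:"embedding"`
	} `json:"data"`
	Usage struct {
		PromptTokens int `json:"prompt_tokens"`
		TotalTokens  int `json:"total_tokens"`
	} `json:"usage"`
}

// decodeEmbeddings parses a response body into EmbeddingResponse.
func decodeEmbeddings(body []byte) (EmbeddingResponse, error) {
	var resp EmbeddingResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return resp, fmt.Errorf("decode embeddings: %w", err)
	}
	return resp, nil
}

func main() {
	req := EmbeddingRequest{
		Model:          "ollama/example-embedding-model", // placeholder model name
		Input:          []string{"first text", "second text"},
		EncodingFormat: "float",
	}
	payload, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))

	// Hand-written sample response, not captured from a real server.
	sample := []byte(`{"data":[{"index":0,"embedding":[0.1,0.2]},{"index":1,"embedding":[0.3,0.4]}],"usage":{"prompt_tokens":4,"total_tokens":4}}`)
	resp, err := decodeEmbeddings(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(resp.Data), "vectors,", resp.Usage.TotalTokens, "tokens")
}
```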
5. List Models
Lists models currently loaded in Ollama with capabilities and context information.
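For a client that only needs model names, a small Go sketch like the one below could pull the identifiers out of a list response. It assumes the OpenAI-style list layout (`data[].id`) and ignores any capability or context fields, whose exact names are not given here; `modelIDs` and the sample body are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelList covers only the id field of an OpenAI-style model list.
type modelList struct {
	Data []struct {
		ID string `json:"id"`
	} `json:"data"`
}

// modelIDs extracts the model identifiers from a list response body.
func modelIDs(body []byte) ([]string, error) {
	var list modelList
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, fmt.Errorf("decode model list: %w", err)
	}
	ids := make([]string, 0, len(list.Data))
	for _, m := range list.Data {
		ids = append(ids, m.ID)
	}
	return ids, nil
}

func main() {
	// Hand-written sample body, not captured from a real server.
	sample := []byte(`{"object":"list","data":[{"id":"llama3.1:latest"},{"id":"mistral:latest"}]}`)
	ids, err := modelIDs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids)
}
```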
Unsupported Features
| Feature | Reason |
|---|---|
| Speech/TTS | Not offered by Ollama API |
| Transcription/STT | Not offered by Ollama API |
| Batch Operations | Not offered by Ollama API |
| File Management | Not offered by Ollama API |
Configuration
```bash
# Point to local Ollama instance
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

```go
// Gateway needs to be configured with Ollama BaseURL
config := &schemas.ProviderConfig{
	NetworkConfig: schemas.NetworkConfig{
		BaseURL:                        "http://localhost:11434", // Required!
		DefaultRequestTimeoutInSeconds: 30,
	},
}
provider, _ := ollama.NewOllamaProvider(config, logger)

// ctx, key, request, and logger are assumed to be defined by the caller.
response, _ := provider.ChatCompletion(ctx, key, request)
```

Environment Setup:
- Install Ollama from https://ollama.ai
- Pull a model:
  ```bash
  ollama pull llama3.1
  ollama pull mistral
  ollama pull neural-chat
  ```
- Start Ollama server:
  ```bash
  ollama serve
  ```
- Verify it’s running:
  ```bash
  curl http://localhost:11434/api/tags
  ```
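For callers that prefer plain HTTP over the SDK, the curl request above might be reproduced in Go along these lines. This is a sketch using only the standard library; the helper `newChatRequest` and its struct types are assumptions for illustration, and the gateway address is the one from the curl example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds (but does not send) the same POST as the curl example.
func newChatRequest(gatewayURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, fmt.Errorf("encode chat request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, gatewayURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build chat request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:8080", "ollama/llama3.1:latest", "Hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```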
Performance Considerations
Streaming for Large Models: For a better user experience with large models, use streaming:
```json
{
  "model": "llama3.1:latest",
  "messages": [...],
  "stream": true
}
```
Token Context: Different models have different context windows:
- Llama 3.1 70B: 128K tokens
- Mistral 7B: 32K tokens
- Neural Chat 7B: 8K tokens
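One way to use these figures is a pre-flight check that the prompt plus the requested completion fits the model's window. The Go sketch below is hypothetical: the `fitsContext` helper, the lookup keys, and the reading of "K" as 1024 tokens are assumptions, and token counts are expected to come from whatever tokenizer the caller uses.

```go
package main

import "fmt"

// contextWindows uses the figures listed above, reading "K" as 1024 tokens
// (an assumption made for this sketch).
var contextWindows = map[string]int{
	"Llama 3.1 70B":  128 * 1024,
	"Mistral 7B":     32 * 1024,
	"Neural Chat 7B": 8 * 1024,
}

// fitsContext reports whether the prompt plus the requested completion
// stays within the model's context window.
func fitsContext(model string, promptTokens, maxTokens int) (bool, error) {
	window, ok := contextWindows[model]
	if !ok {
		return false, fmt.Errorf("unknown model %q", model)
	}
	return promptTokens+maxTokens <= window, nil
}

func main() {
	ok, err := fitsContext("Mistral 7B", 30000, 4000)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // false: 34000 tokens exceed the 32768-token window
}
```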
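GPU Acceleration: Ollama automatically uses a GPU if one is available. On CPU-only hosts generation is slower, so set the request timeout (DefaultRequestTimeoutInSeconds) high enough for responses to complete.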
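Returning to the streaming recommendation, a client has to reassemble the response from incremental chunks. The Go sketch below assumes the OpenAI-style server-sent-event format (`data: {...}` lines ending with `data: [DONE]`) and collects the `delta.content` pieces; `collectStream` and `collectStreamString` are hypothetical helper names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// chunk covers only the delta content of an OpenAI-style streaming chunk.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// collectStream reads server-sent events and concatenates the delta content
// until it sees the [DONE] sentinel or the stream ends.
func collectStream(r io.Reader) (string, error) {
	var out strings.Builder
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "data:") {
			continue // skip blank lines and non-data fields
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload == "[DONE]" {
			break
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil {
			return "", fmt.Errorf("decode chunk: %w", err)
		}
		for _, ch := range c.Choices {
			out.WriteString(ch.Delta.Content)
		}
	}
	return out.String(), sc.Err()
}

// collectStreamString is a convenience wrapper for in-memory input.
func collectStreamString(s string) (string, error) {
	return collectStream(strings.NewReader(s))
}

func main() {
	// Hand-written sample stream, not captured from a real server.
	sample := strings.Join([]string{
		`data: {"choices":[{"delta":{"content":"Hel"}}]}`,
		``,
		`data: {"choices":[{"delta":{"content":"lo"}}]}`,
		``,
		`data: [DONE]`,
	}, "\n")
	text, err := collectStreamString(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

In a real client the reader would be the HTTP response body, and each piece would typically be shown to the user as it arrives rather than buffered.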
Popular Models
| Model | Size | Context | Speed |
|---|---|---|---|
| llama3.1:latest | Varies | 128K | Fast |
| mistral:latest | 7B | 32K | Very Fast |
| neural-chat:latest | 7B | 8K | Very Fast |
| orca-mini:latest | 3B | 3K | Very Fast |
| openchat:latest | 7B | 8K | Very Fast |
Caveats
BaseURL Configuration Required
- Severity: High
- Behavior: BaseURL must be explicitly configured; there is no default
- Impact: Requests fail without proper configuration
- Code: NewOllamaProvider validates that BaseURL is set
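Applications can catch a missing or malformed BaseURL before constructing the provider. The Go sketch below is a hypothetical pre-check, not the validation that NewOllamaProvider performs; `validateBaseURL` and its error messages are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateBaseURL is a hypothetical pre-check: it rejects an empty value
// and anything that does not parse as a URL with a scheme and host.
func validateBaseURL(raw string) error {
	if raw == "" {
		return errors.New("ollama: BaseURL is required")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("ollama: invalid BaseURL: %w", err)
	}
	if u.Scheme == "" || u.Host == "" {
		return fmt.Errorf("ollama: BaseURL %q needs a scheme and host", raw)
	}
	return nil
}

func main() {
	fmt.Println(validateBaseURL(""))
	fmt.Println(validateBaseURL("http://localhost:11434"))
}
```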
Cache Control Stripped
- Severity: Low
- Behavior: Cache control directives are removed from messages
- Impact: Prompt caching features don’t work
- Code: Stripped during JSON marshaling
Parameter Filtering
- Severity: Low
- Behavior: OpenAI-specific parameters filtered out
- Impact: prompt_cache_key, verbosity, store removed
- Code: filterOpenAISpecificParameters
User Field Size Limit
- Severity: Low
- Behavior: User field > 64 characters silently dropped
- Impact: Longer user identifiers are lost
- Code: SanitizeUserField enforces 64-char max
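Because the drop is silent, a caller may want to detect over-long user identifiers before sending. The Go sketch below mimics the described behavior; it is not the actual SanitizeUserField code, and counting characters as Unicode code points (rather than bytes) is an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxUserLen is the 64-character limit described above.
const maxUserLen = 64

// sanitizeUser is a hypothetical stand-in: it returns the value and true
// when it fits the limit, or an empty string and false when it would be dropped.
func sanitizeUser(user string) (string, bool) {
	if utf8.RuneCountInString(user) > maxUserLen {
		return "", false
	}
	return user, true
}

func main() {
	if _, kept := sanitizeUser("user-1234"); kept {
		fmt.Println("short identifier is forwarded")
	}
	long := "tenant-0001/project-0002/user-0003/session-0004/request-0005/extra"
	if _, kept := sanitizeUser(long); !kept {
		fmt.Println("identifier longer than 64 characters would be dropped")
	}
}
```

Hashing or truncating long identifiers on the client side is one way to keep them under the limit while preserving a stable value.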