DeepIntShield provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:
| Parameter | Scope | Default | Description |
|---|---|---|---|
| Concurrency | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| Buffer Size | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| Initial Pool Size | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |
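What it does: Controls two aspects of provider performance:
- Worker goroutines: the number of goroutines that process requests simultaneously for the provider.
- Response object pooling: provider-specific response objects (such as AnthropicMessageResponse, OpenAIResponse) are held in sync pools to reduce allocations during request handling.

Impact: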
Default: 1000 workers per provider
```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    }
  }
}
```

```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 100, // 100 concurrent workers
			BufferSize:  500, // 500 request queue capacity
		},
	}, nil
}
```

What it does: Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.
Impact:
Default: 5000 requests per provider queue
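To make the queue mechanics concrete, here is a minimal sketch of a buffered channel drained by a fixed number of worker goroutines. It is not DeepIntShield's implementation: `providerQueue`, `enqueue`, and `ErrQueueFull` are hypothetical names, and the `dropExcess` flag only imitates the drop_excess_requests setting described in this section.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQueueFull is a hypothetical error for the "queue is full" case.
var ErrQueueFull = errors.New("request dropped: queue is full")

// providerQueue is an illustrative stand-in for a per-provider queue.
type providerQueue struct {
	jobs       chan int // buffered channel; its capacity plays the role of buffer_size
	wg         sync.WaitGroup
	dropExcess bool // imitates drop_excess_requests
}

// newProviderQueue starts `concurrency` workers that drain a queue of
// capacity `bufferSize`, calling handle for every job.
func newProviderQueue(concurrency, bufferSize int, dropExcess bool, handle func(int)) *providerQueue {
	q := &providerQueue{jobs: make(chan int, bufferSize), dropExcess: dropExcess}
	for i := 0; i < concurrency; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for job := range q.jobs {
				handle(job)
			}
		}()
	}
	return q
}

// enqueue blocks when the queue is full, unless dropExcess is set,
// in which case it fails immediately with ErrQueueFull.
func (q *providerQueue) enqueue(job int) error {
	if q.dropExcess {
		select {
		case q.jobs <- job:
			return nil
		default:
			return ErrQueueFull
		}
	}
	q.jobs <- job // blocks until a worker frees a slot
	return nil
}

// close stops accepting jobs and waits for the workers to finish.
func (q *providerQueue) close() {
	close(q.jobs)
	q.wg.Wait()
}

func main() {
	var mu sync.Mutex
	processed := 0
	q := newProviderQueue(4, 8, false, func(int) {
		mu.Lock()
		processed++
		mu.Unlock()
	})
	for i := 0; i < 100; i++ {
		_ = q.enqueue(i) // blocking mode never returns an error
	}
	q.close()
	fmt.Println("processed:", processed) // processed: 100
}
```

In blocking mode a full queue pushes back on callers, so nothing is lost but callers wait. In drop mode callers get an immediate error and can retry or fail over.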
Queue Full Behavior: Controlled by drop_excess_requests:
- false (default): New requests block until queue space is available
- true: New requests are immediately dropped with an error when queue is full

What it does: Controls the number of pre-allocated objects in DeepIntShield’s internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
Pooled Objects:
Impact:
Default: 5000 objects per pool
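The pre-warming idea can be sketched with Go's standard `sync.Pool`. This is only an illustration under assumptions: the `response` type and `newPrewarmedPool` helper are hypothetical, and the source does not show how DeepIntShield's pools are built. Note also that `sync.Pool` may release idle objects during garbage collection, so pre-warming is best-effort rather than a guarantee.

```go
package main

import (
	"fmt"
	"sync"
)

// response is a hypothetical pooled object standing in for a provider
// response type.
type response struct {
	Body []byte
}

// newPrewarmedPool builds a sync.Pool and seeds it with initialSize
// objects, so early requests can reuse them instead of allocating.
func newPrewarmedPool(initialSize int) *sync.Pool {
	p := &sync.Pool{
		New: func() any { return &response{Body: make([]byte, 0, 1024)} },
	}
	for i := 0; i < initialSize; i++ {
		p.Put(p.New())
	}
	return p
}

func main() {
	pool := newPrewarmedPool(5000)

	// Borrow an object, use it, reset it, and hand it back.
	r := pool.Get().(*response)
	r.Body = append(r.Body[:0], "hello"...)
	fmt.Println(len(r.Body), cap(r.Body) >= 1024) // 5 true
	r.Body = r.Body[:0]
	pool.Put(r)
}
```

A larger initial size moves allocation cost to startup and raises baseline memory, which is the trade-off the pool size setting exposes.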
```json
{
  "config": {
    "initial_pool_size": 10000,
    "drop_excess_requests": false
  }
}
```

```go
bifrostConfig := schemas.DeepIntShieldConfig{
	Account:            myAccount,
	InitialPoolSize:    10000, // Pre-warm pools with 10,000 objects
	DropExcessRequests: false,
}

client, err := deepintshield.Init(ctx, bifrostConfig)
```

Configure these settings per provider based on the expected RPS for that specific provider:
| Provider RPS | Concurrency | Buffer Size |
|---|---|---|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
Formula:
```
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
```
This ratio ensures:
Configure this setting based on total RPS across all providers combined:
| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---|---|---|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
Formula:
```
initial_pool_size = 1.5 × total_expected_rps
```
Additionally, ensure:
```
initial_pool_size >= max(buffer_size across all providers)
```
This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
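The sizing formulas can be applied mechanically. The helper below is a small sketch of that arithmetic, not part of DeepIntShield: `sizeProvider`, `initialPoolSize`, and `providerSizing` are hypothetical names, and integer division rounds 1.5× down for odd RPS values.

```go
package main

import "fmt"

// providerSizing holds the per-provider values derived from expected RPS.
type providerSizing struct {
	Concurrency int
	BufferSize  int
}

// sizeProvider applies concurrency = expected_rps and
// buffer_size = 1.5 × expected_rps (rounded down).
func sizeProvider(expectedRPS int) providerSizing {
	return providerSizing{
		Concurrency: expectedRPS,
		BufferSize:  expectedRPS * 3 / 2,
	}
}

// initialPoolSize applies 1.5 × total RPS, then raises the result to the
// largest provider buffer so that initial_pool_size >= max(buffer_size).
func initialPoolSize(totalRPS int, bufferSizes ...int) int {
	pool := totalRPS * 3 / 2
	for _, b := range bufferSizes {
		if b > pool {
			pool = b
		}
	}
	return pool
}

func main() {
	// Example: two providers at 500 RPS each (1000 RPS total).
	openai := sizeProvider(500)
	anthropic := sizeProvider(500)
	pool := initialPoolSize(1000, openai.BufferSize, anthropic.BufferSize)
	fmt.Println(openai.Concurrency, openai.BufferSize, pool) // 500 750 1500
}
```

The clamp to the largest buffer only changes the result when a buffer has been tuned by hand above 1.5 × total RPS; with the formulas alone, the pool size already exceeds every buffer.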
When running multiple DeepIntShield instances behind a load balancer, size the settings for your total expected RPS, then divide each total by the number of nodes to get the per-node values.
```
Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
```
Example: 10,000 RPS total across 4 nodes. The aggregate column is the total capacity across all 4 nodes (the same values a single node would need to handle 10,000 RPS on its own); the per-node column is what each of the 4 nodes is configured with:
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|---|---|---|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
```json
{
  "config": {
    "initial_pool_size": 3750,
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    }
  }
}
```

```go
const numNodes = 4

func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	// Total capacity divided by number of nodes
	// Total: 10,000 RPS across 4 nodes = 2,500 RPS per node
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 10000 / numNodes, // 2500 per node
			BufferSize:  15000 / numNodes, // 3750 per node
		},
	}, nil
}

// In main initialization
bifrostConfig := schemas.DeepIntShieldConfig{
	Account:         myAccount,
	InitialPoolSize: 15000 / numNodes, // 3750 per node
}
```

Different providers have different rate limits and latency characteristics. Tune each provider independently:
| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|---|---|---|---|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |
```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 200,
        "buffer_size": 1000
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    },
    "groq": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 500,
        "buffer_size": 2500
      }
    },
    "ollama": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 20,
        "buffer_size": 100
      }
    }
  }
}
```

```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	switch provider {
	case schemas.OpenAI:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 200,
				BufferSize:  1000,
			},
		}, nil
	case schemas.Anthropic:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 100,
				BufferSize:  500,
			},
		}, nil
	case schemas.Groq:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 500,
				BufferSize:  2500,
			},
		}, nil
	case schemas.Ollama:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 20,
				BufferSize:  100,
			},
		}, nil
	default:
		return &schemas.ProviderConfig{
			NetworkConfig:            schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
		}, nil
	}
}
```

When the provider queue reaches capacity, DeepIntShield’s behavior is controlled by drop_excess_requests:
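Block (default): new requests wait until queue space is available.
```json
{
  "config": {
    "drop_excess_requests": false
  }
}
```

Drop: new requests are rejected immediately when the queue is full.
```json
{
  "config": {
    "drop_excess_requests": true
  }
}
```

Dropped requests fail with the error:
```
"request dropped: queue is full"
```

| Metric | Healthy Range | Action if Exceeded |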
|---|---|---|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |
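As an illustration of the queue depth guideline, the sketch below measures how full a Go buffered channel is and applies the "< 50% of buffer_size" threshold from the table. The source does not say how DeepIntShield reports queue depth, so `queueUtilization` and `queueHealthy` are hypothetical helpers over a plain channel, not a DeepIntShield API.

```go
package main

import "fmt"

// queueUtilization reports how full a buffered channel is, from 0.0 to 1.0.
// len gives the number of queued items, cap gives the buffer size.
func queueUtilization[T any](queue chan T) float64 {
	if cap(queue) == 0 {
		return 0 // unbuffered channels have no queue depth to measure
	}
	return float64(len(queue)) / float64(cap(queue))
}

// queueHealthy applies the "< 50% of buffer_size" guideline.
func queueHealthy[T any](queue chan T) bool {
	return queueUtilization(queue) < 0.5
}

func main() {
	queue := make(chan int, 10)
	for i := 0; i < 6; i++ {
		queue <- i
	}
	fmt.Printf("utilization=%.0f%% healthy=%v\n",
		queueUtilization(queue)*100, queueHealthy(queue))
	// utilization=60% healthy=false
}
```

A reading that stays above the threshold suggests workers are not keeping up, which matches the table's advice to increase the buffer or concurrency.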
The Gateway exposes health and metrics endpoints:
```bash
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```

Start Conservative
Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
Monitor Continuously
Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
Match Provider Limits
Don’t set concurrency higher than provider rate limits allow. You’ll just get rate-limited.
Plan for Bursts
Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.
```
// Formula
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes
```