
Run Your Own Benchmarks

Want to see DeepIntShield’s performance in your specific environment? The DeepIntShield Benchmarking Repository provides everything you need to conduct comprehensive performance tests tailored to your infrastructure and workload requirements.

What You Can Test:

  • Custom Instance Sizes - Test on your preferred AWS/GCP/Azure instances
  • Your Workload Patterns - Use your actual request/response sizes
  • Different Configurations - Compare various DeepIntShield settings
  • Provider Comparisons - Benchmark against other AI gateways
  • Load Scenarios - Test burst loads, sustained traffic, and endurance

💡 Open Source: The benchmarking tool is completely open source! Feel free to submit pull requests if you think anything is missing or could be improved.


Before running benchmarks, ensure you have:

  • Go 1.26.1+ installed on your testing machine
  • DeepIntShield instance running and accessible
  • Target API providers configured (OpenAI, Anthropic, etc.)
  • Network access between benchmark tool and DeepIntShield
  • Sufficient resources on the testing machine to generate load

git clone https://github.com/maximhq/deepintshield-benchmarking.git
cd deepintshield-benchmarking
go build benchmark.go

This creates a benchmark executable (or benchmark.exe on Windows).

# Basic benchmark: uses the defaults (500 RPS for 10 seconds)
./benchmark -provider deepintshield -port 8080
# Custom benchmark: 1000 RPS for 30 seconds
./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 30 -output my_results.json

The repo also includes a lightweight local audit runner for the examples/dockers/postgres-redis stack:

cd deepintshield_server
python3 scripts/latency_audit.py \
--base-url http://127.0.0.1:8080 \
--virtual-key "$DEEPINTSHIELD_VIRTUAL_KEY" \
--admin-email "$DEEPINTSHIELD_ADMIN_EMAIL" \
--admin-password "$DEEPINTSHIELD_ADMIN_PASSWORD" \
--output latency_audit.json

What it measures:

  • Uncached miss latency
  • Direct cache-hit latency
  • Semantic cache-hit latency
  • Direct cache hit with guardrail reuse
  • Dashboard and AI Logs API endpoints used by operators

Notes:

  • --virtual-key is required for inference scenarios.
  • --admin-email and --admin-password are required only for dashboard/API session scenarios.
  • Deterministic-only guardrail and model-backed/external guardrail scenarios are reported as skipped unless the target workspace is explicitly configured for them.
  • Per-request phase timings are persisted in AI Logs metadata as latency_breakdown_ms, so the audit can be correlated with request logs without changing public inference responses.

The benchmark tool offers extensive configuration through command-line flags:

| Flag | Required | Description | Default |
|------|----------|-------------|---------|
| -provider <name> | Yes | Provider name (e.g., deepintshield, litellm) | None |
| -port <number> | Yes | Port number of your DeepIntShield instance | None |
| -endpoint <path> | No | API endpoint path | v1/chat/completions |
| -rate <number> | No | Requests per second | 500 |
| -duration <seconds> | No | Test duration in seconds | 10 |
| -output <filename> | No | Results output file | results.json |

| Flag | Description | Default |
|------|-------------|---------|
| -include-provider-in-request | Include provider name in request payload | false |
| -big-payload | Use larger, more complex request payloads | false |
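
When benchmarks are launched from a script, it can help to assemble the flag list in one place. The sketch below is a hypothetical helper (build_benchmark_cmd is not part of the benchmarking repository) that maps keyword arguments onto the flags documented above. It assumes boolean flags accept the -flag=true form that this page uses for -big-payload.

```python
import shlex


def build_benchmark_cmd(provider, port, rate=500, duration=10,
                        output="results.json",
                        endpoint="v1/chat/completions",
                        big_payload=False, include_provider=False,
                        binary="./benchmark"):
    """Return the benchmark invocation as an argument list.

    Defaults mirror the documented flag defaults; provider and port
    have no default and must be supplied by the caller.
    """
    cmd = [
        binary,
        "-provider", provider,
        "-port", str(port),
        "-endpoint", endpoint,
        "-rate", str(rate),
        "-duration", str(duration),
        "-output", output,
    ]
    # Assumption: boolean flags take the -flag=true form.
    if big_payload:
        cmd.append("-big-payload=true")
    if include_provider:
        cmd.append("-include-provider-in-request=true")
    return cmd


if __name__ == "__main__":
    cmd = build_benchmark_cmd("deepintshield", 8080, rate=1000,
                              duration=30, output="my_results.json")
    # Print a copy-pasteable command line; to execute it instead,
    # pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Returning a list instead of a single string avoids shell-quoting problems when the command is handed to subprocess.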

Test standard performance with typical request sizes:

./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -output basic_test.json

Use Case: General performance validation

Push your instance to its limits:

./benchmark -provider deepintshield -port 8080 -rate 5000 -duration 120 -output stress_test.json

Use Case: Capacity planning and SLA validation

Test with bigger request/response sizes:

./benchmark -provider deepintshield -port 8080 -rate 500 -duration 60 -big-payload=true -output large_payload.json

Use Case: Document processing, code generation workloads

Long-running stability test:

./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 1800 -output endurance_test.json

Use Case: Production readiness validation (30-minute test)

Compare DeepIntShield against other providers:

# Test DeepIntShield
./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -output deepintshield_results.json
# Test LiteLLM
./benchmark -provider litellm -port 8000 -rate 1000 -duration 60 -output litellm_results.json
# Test direct OpenAI (if available)
./benchmark -provider openai -port 443 -endpoint chat/completions -rate 1000 -duration 60 -output openai_results.json
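
Once each provider has its own output file, the runs can be lined up side by side. The following is a minimal sketch, not part of the benchmarking tool. It assumes each file follows the JSON results layout documented on this page, with a single top-level key named after the -provider value. The helper names load_result and compare are hypothetical.

```python
import json
import sys


def load_result(path):
    """Load one results file and return (provider_name, metrics)."""
    with open(path) as fh:
        data = json.load(fh)
    # Assumption: one top-level key per file, named after -provider.
    name, metrics = next(iter(data.items()))
    return name, metrics


def compare(results):
    """Build comparison rows from {provider_name: metrics}, lowest P99 first."""
    rows = []
    for name, metrics in results.items():
        latency = metrics["latency_metrics"]
        rows.append({
            "provider": name,
            "success_rate": metrics["success_rate"],
            "p50_ms": latency["p50_ms"],
            "p99_ms": latency["p99_ms"],
            "throughput_rps": metrics["throughput_rps"],
        })
    return sorted(rows, key=lambda row: row["p99_ms"])


if __name__ == "__main__":
    # Usage: python compare.py first_results.json second_results.json
    loaded = dict(load_result(path) for path in sys.argv[1:])
    for row in compare(loaded):
        print(f'{row["provider"]:<16} '
              f'success={row["success_rate"]:.2f}% '
              f'p50={row["p50_ms"]:.1f}ms '
              f'p99={row["p99_ms"]:.1f}ms '
              f'rps={row["throughput_rps"]:.0f}')
```

For a fair comparison, keep -rate and -duration identical across the runs, as the commands above do.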

The benchmark tool generates detailed JSON results with comprehensive metrics:

{
  "deepintshield": {
    "request_counts": {
      "total_sent": 30000,
      "successful": 30000,
      "failed": 0
    },
    "success_rate": 100.0,
    "latency_metrics": {
      "mean_ms": 245.5,
      "p50_ms": 230.2,
      "p99_ms": 520.8,
      "max_ms": 845.3
    },
    "throughput_rps": 5000.0,
    "memory_usage": {
      "before_mb": 512.5,
      "after_mb": 1312.8,
      "peak_mb": 1405.2,
      "average_mb": 1156.7
    },
    "timestamp": "2025-01-14T10:30:00Z",
    "status_codes": {
      "200": 30000
    }
  }
}

Success Rate:

  • Target: >99.9% for production readiness
  • Excellent: 100% (perfect reliability)

Latency Metrics:

  • P50 (Median): Typical user experience
  • P99: Tail latency; 99% of requests complete at least this fast (the absolute worst case is max_ms)
  • Mean: Overall average performance

Memory Usage:

  • Peak: Maximum memory consumption
  • Average: Sustained memory usage
  • After - Before: Memory growth during test
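
The checks above are easy to automate. The sketch below is a hypothetical helper (interpret is not part of the tool) that recomputes the success rate from the raw request counts, compares it with the 99.9% target, and derives memory growth as after minus before. The sample values are copied from the example results on this page.

```python
def interpret(metrics, target_success_rate=99.9):
    """Derive readiness figures from one provider's metrics dict."""
    counts = metrics["request_counts"]
    memory = metrics["memory_usage"]
    sent = counts["total_sent"]
    # Recompute from raw counts rather than trusting the stored value.
    success_rate = 100.0 * counts["successful"] / sent if sent else 0.0
    return {
        "success_rate": success_rate,
        # The documented target is strictly greater than 99.9%.
        "meets_target": success_rate > target_success_rate,
        # Memory growth during the test: after minus before.
        "memory_growth_mb": round(memory["after_mb"] - memory["before_mb"], 1),
    }


# Values taken from the example results shown on this page.
sample = {
    "request_counts": {"total_sent": 30000, "successful": 30000, "failed": 0},
    "memory_usage": {"before_mb": 512.5, "after_mb": 1312.8,
                     "peak_mb": 1405.2, "average_mb": 1156.7},
}

if __name__ == "__main__":
    print(interpret(sample))
```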

Based on your benchmark results, use these guidelines for production sizing:

| Target RPS | Memory Usage | Recommended Instance | Notes |
|------------|--------------|----------------------|-------|
| < 1,000 | < 1GB | t3.small | Cost-effective for light loads |
| 1,000 - 3,000 | 1-2GB | t3.medium | Balanced performance/cost |
| 3,000 - 5,000 | 2-4GB | t3.large | High-performance production |
| 5,000+ | 3-6GB | t3.xlarge+ | Enterprise/mission-critical |
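
The sizing guidance can be encoded as a small lookup for use in capacity-planning scripts. This is a sketch only: the function name recommend_instance is hypothetical, and because the ranges above share their boundary values (3,000 appears in two rows), the code assumes a boundary value belongs to the larger tier.

```python
# (exclusive upper bound in RPS, recommended instance), smallest tier first.
SIZING_TIERS = [
    (1000, "t3.small"),
    (3000, "t3.medium"),
    (5000, "t3.large"),
]


def recommend_instance(target_rps):
    """Map a target request rate to the instance tier from the sizing guidance."""
    if target_rps < 0:
        raise ValueError("target_rps must be non-negative")
    for upper_bound, instance in SIZING_TIERS:
        # Assumption: boundary values (1000, 3000, 5000) go to the larger tier.
        if target_rps < upper_bound:
            return instance
    return "t3.xlarge+"


if __name__ == "__main__":
    for rps in (500, 2000, 4000, 8000):
        print(rps, recommend_instance(rps))
```

Treat the result as a starting point and confirm it against the memory figures from your own benchmark runs.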

If you see high latency:

  • Increase initial_pool_size
  • Increase buffer_size
  • Consider a larger instance

If memory usage is high:

  • Decrease initial_pool_size
  • Optimize buffer_size
  • Monitor for memory leaks

If success rate < 100%:

  • Reduce request rate
  • Increase timeout settings
  • Check provider limits

Simulate traffic spikes:

# Normal load
./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 300 -output normal_load.json
# Burst load (simulate 5x spike)
./benchmark -provider deepintshield -port 8080 -rate 5000 -duration 60 -output burst_load.json

Test horizontal scaling:

# Instance 1
./benchmark -provider deepintshield-1 -port 8080 -rate 2500 -duration 120 -output instance_1.json &
# Instance 2
./benchmark -provider deepintshield-2 -port 8081 -rate 2500 -duration 120 -output instance_2.json &
# Wait for both to complete
wait
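
After both runs finish, the per-instance results need to be combined to judge the fleet as a whole. The sketch below is a hypothetical aggregation helper, assuming each argument is the metrics object stored under the provider key of one output file. Percentiles cannot be averaged across instances, so it reports the worst P99 instead of a combined one.

```python
def aggregate(instance_results):
    """Combine per-instance metrics dicts into fleet-level figures."""
    if not instance_results:
        raise ValueError("at least one instance result is required")
    total = sum(m["request_counts"]["total_sent"] for m in instance_results)
    successful = sum(m["request_counts"]["successful"] for m in instance_results)
    return {
        "instances": len(instance_results),
        "total_sent": total,
        # Weighted by request count, not a plain average of percentages.
        "success_rate": 100.0 * successful / total if total else 0.0,
        # Throughput adds up across instances that ran concurrently.
        "combined_throughput_rps": sum(m["throughput_rps"] for m in instance_results),
        # Percentiles do not average; report the slowest instance.
        "worst_p99_ms": max(m["latency_metrics"]["p99_ms"] for m in instance_results),
    }
```

If combined throughput falls well short of the sum of the requested rates, the load generator or a shared upstream dependency may be the bottleneck instead of the instances themselves.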

Compare performance across payload sizes:

# Small payloads (default)
./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -output small_payload.json
# Large payloads
./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -big-payload=true -output large_payload.json

Set up regular performance regression testing:

daily_benchmark.sh
#!/bin/bash
# Each run writes into its own timestamped directory
DATE=$(date +%Y%m%d_%H%M%S)
OUTPUT_DIR="benchmarks/$DATE"
mkdir -p "$OUTPUT_DIR"
# Run standard benchmarks
./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 300 -output "$OUTPUT_DIR/standard.json"
./benchmark -provider deepintshield -port 8080 -rate 3000 -duration 180 -output "$OUTPUT_DIR/high_load.json"
./benchmark -provider deepintshield -port 8080 -rate 500 -duration 600 -big-payload=true -output "$OUTPUT_DIR/large_payload.json"
echo "Benchmarks completed: $OUTPUT_DIR"

Monitor key metrics over time:

  • Success rate trends
  • Latency percentile changes
  • Memory usage patterns
  • Throughput capacity
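
One way to track these trends is to compare the newest run against the previous one. The sketch below assumes the benchmarks/<timestamp>/standard.json layout produced by daily_benchmark.sh and one top-level provider key per file. The function names and the 10% tolerance are illustrative choices, not values taken from the tool.

```python
import json
from pathlib import Path


def load_history(root="benchmarks", filename="standard.json"):
    """Return [(run_name, metrics)] ordered oldest to newest."""
    history = []
    # Directory names follow date +%Y%m%d_%H%M%S, so sorting by name
    # is also chronological order.
    for run_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        result_file = run_dir / filename
        if result_file.is_file():
            data = json.loads(result_file.read_text())
            # Assumption: one top-level provider key per file.
            metrics = next(iter(data.values()))
            history.append((run_dir.name, metrics))
    return history


def p99_regressed(history, tolerance=0.10):
    """True if the newest P99 exceeds the previous run's by more than tolerance."""
    if len(history) < 2:
        return False  # nothing to compare against yet
    previous = history[-2][1]["latency_metrics"]["p99_ms"]
    latest = history[-1][1]["latency_metrics"]["p99_ms"]
    return latest > previous * (1 + tolerance)


if __name__ == "__main__":
    root = Path("benchmarks")
    if root.is_dir():
        runs = load_history(root)
        print("P99 regression detected" if p99_regressed(runs) else "P99 within tolerance")
```

The same pattern extends to success rate, memory growth, or throughput by swapping the field that is compared.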

Connection Refused:

# Check if DeepIntShield is running
curl http://localhost:8080/health
# Verify port configuration
netstat -an | grep 8080
  • Check that PORT is defined in the .env file at the repository root.
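
The same checks can run as a pre-flight step before a long benchmark. The sketch below uses only the Python standard library. The helper names are hypothetical, and treating an HTTP 200 from /health as healthy is an assumption about the endpoint's behavior.

```python
import socket
import urllib.error
import urllib.request


def port_open(host="localhost", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def health_ok(base_url="http://localhost:8080", timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            # Assumption: a healthy instance replies with status 200.
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If port_open returns True but health_ok returns False, something is listening on the port but it may not be a ready DeepIntShield instance.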

High Error Rates:

  • Check provider API key limits
  • Verify DeepIntShield configuration
  • Monitor upstream provider status
  • Reduce request rate for baseline test

Memory Issues:

  • Monitor system resources during testing
  • Check for memory leaks in long tests
  • Adjust DeepIntShield pool sizes

Inconsistent Results:

  • Run multiple test iterations
  • Account for network variability
  • Use longer test durations (60+ seconds)
  • Isolate testing environment
  • Route gateway requests to a mock provider to remove upstream provider variability
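
To judge whether repeated runs agree, it helps to put a number on the spread. The following sketch computes the mean, sample standard deviation, and coefficient of variation of P99 across iterations. The function name is hypothetical, and any threshold for what counts as too noisy is left to the reader.

```python
import statistics


def run_variability(p99_values):
    """Summarize the spread of P99 latency (ms) across repeated runs."""
    if len(p99_values) < 2:
        raise ValueError("need at least two runs to measure variability")
    mean = statistics.mean(p99_values)
    stdev = statistics.stdev(p99_values)  # sample standard deviation
    return {
        "mean_ms": mean,
        "stdev_ms": stdev,
        # Coefficient of variation: spread relative to the mean, in percent.
        "cv_percent": 100.0 * stdev / mean,
    }


if __name__ == "__main__":
    # Illustrative P99 values from three hypothetical iterations.
    print(run_variability([500.0, 520.0, 480.0]))
```

A high coefficient of variation suggests that the environment, not the gateway, is driving the differences between runs.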

  1. Analyze Results: Compare against official benchmarks
  2. Optimize Configuration: Tune based on your specific results
  3. Plan Capacity: Size instances based on measured performance
  4. Set Up Monitoring: Track key metrics in production

Ready to benchmark? Clone the repository and start testing!