Run Your Own Benchmarks
Overview
Want to see DeepIntShield’s performance in your specific environment? The DeepIntShield Benchmarking Repository provides everything you need to conduct comprehensive performance tests tailored to your infrastructure and workload requirements.
What You Can Test:
- Custom Instance Sizes - Test on your preferred AWS/GCP/Azure instances
- Your Workload Patterns - Use your actual request/response sizes
- Different Configurations - Compare various DeepIntShield settings
- Provider Comparisons - Benchmark against other AI gateways
- Load Scenarios - Test burst loads, sustained traffic, and endurance
💡 Open Source: The benchmarking tool is completely open source! Feel free to submit pull requests if you think anything is missing or could be improved.
Prerequisites
Before running benchmarks, ensure you have:
- Go 1.26.1+ installed on your testing machine
- DeepIntShield instance running and accessible
- Target API providers configured (OpenAI, Anthropic, etc.)
- Network access between benchmark tool and DeepIntShield
- Sufficient resources on the testing machine to generate load
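A quick pre-flight check can confirm that the gateway port is reachable from the testing machine before any load is generated. The sketch below is a minimal example using only the Python standard library; `is_reachable` is a hypothetical helper name, and `127.0.0.1:8080` is assumed from the examples in this guide.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed address; adjust to wherever your instance listens.
    print("gateway reachable:", is_reachable("127.0.0.1", 8080))
```

This only proves that something accepts TCP connections on the port. It does not verify that providers are configured or that requests succeed.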
Quick Start
1. Clone the Repository

    git clone https://github.com/maximhq/deepintshield-benchmarking.git
    cd deepintshield-benchmarking

2. Build the Benchmark Tool

    go build benchmark.go

This creates a benchmark executable (or benchmark.exe on Windows).

3. Run Your First Benchmark

    # Basic benchmark: 500 RPS for 10 seconds
    ./benchmark -provider deepintshield -port 8080

    # Custom benchmark: 1000 RPS for 30 seconds
    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 30 -output my_results.json

Local Latency Audit
The repo also includes a lightweight local audit runner for the examples/dockers/postgres-redis stack:

    cd deepintshield_server
    python3 scripts/latency_audit.py \
      --base-url http://127.0.0.1:8080 \
      --virtual-key "$DEEPINTSHIELD_VIRTUAL_KEY" \
      --admin-email "$DEEPINTSHIELD_ADMIN_EMAIL" \
      --admin-password "$DEEPINTSHIELD_ADMIN_PASSWORD" \
      --output latency_audit.json

What it measures:
- uncached miss latency
- direct cache-hit latency
- semantic cache-hit latency
- direct cache hit with guardrail reuse
- Dashboard and AI Logs API endpoints used by operators
Notes:
- `--virtual-key` is required for inference scenarios.
- `--admin-email` and `--admin-password` are required only for dashboard/API session scenarios.
- Deterministic-only guardrail and model-backed/external guardrail scenarios are reported as skipped unless the target workspace is explicitly configured for them.
- Per-request phase timings are persisted in AI Logs metadata as `latency_breakdown_ms`, so the audit can be correlated with request logs without changing public inference responses.
Configuration Options
The benchmark tool offers extensive configuration through command-line flags:
Basic Configuration
| Flag | Required | Description | Default |
|---|---|---|---|
| `-provider <name>` | ✅ | Provider name (e.g., deepintshield, litellm) | None |
| `-port <number>` | ✅ | Port number of your DeepIntShield instance | None |
| `-endpoint <path>` | ❌ | API endpoint path | `v1/chat/completions` |
| `-rate <number>` | ❌ | Requests per second | 500 |
| `-duration <seconds>` | ❌ | Test duration in seconds | 10 |
| `-output <filename>` | ❌ | Results output file | `results.json` |
Advanced Configuration
| Flag | Description | Default |
|---|---|---|
| `-include-provider-in-request` | Include provider name in request payload | false |
| `-big-payload` | Use larger, more complex request payloads | false |
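When benchmarks are driven from a script, it can help to assemble the command line in one place so that every run uses the documented flags and defaults. The following is a minimal sketch, not part of the benchmarking repository: `build_benchmark_command` is a hypothetical helper, and passing `-include-provider-in-request=true` assumes that flag accepts the same `=true` form this guide uses for `-big-payload`.

```python
from typing import List


def build_benchmark_command(
    provider: str,
    port: int,
    rate: int = 500,
    duration: int = 10,
    output: str = "results.json",
    endpoint: str = "v1/chat/completions",
    big_payload: bool = False,
    include_provider_in_request: bool = False,
    binary: str = "./benchmark",
) -> List[str]:
    """Build an argv list from the flags listed in the configuration tables."""
    cmd = [
        binary,
        "-provider", provider,
        "-port", str(port),
        "-endpoint", endpoint,
        "-rate", str(rate),
        "-duration", str(duration),
        "-output", output,
    ]
    # Boolean flags default to false, so they are added only when enabled.
    if big_payload:
        cmd.append("-big-payload=true")
    if include_provider_in_request:
        # Assumption: same "=true" syntax as -big-payload.
        cmd.append("-include-provider-in-request=true")
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(build_benchmark_command(
    "deepintshield", 8080, rate=1000, duration=30, output="my_results.json")))
```

To execute a run, the returned list can be passed to `subprocess.run(cmd, check=True)`.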
Benchmark Scenarios
1. Basic Performance Test
Test standard performance with typical request sizes:

    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -output basic_test.json

Use Case: General performance validation
2. High-Load Stress Test
Push your instance to its limits:

    ./benchmark -provider deepintshield -port 8080 -rate 5000 -duration 120 -output stress_test.json

Use Case: Capacity planning and SLA validation
3. Large Payload Test
Test with bigger request/response sizes:

    ./benchmark -provider deepintshield -port 8080 -rate 500 -duration 60 -big-payload=true -output large_payload.json

Use Case: Document processing, code generation workloads
4. Endurance Test
Long-running stability test:

    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 1800 -output endurance_test.json

Use Case: Production readiness validation (30-minute test)
5. Comparative Benchmarking
Compare DeepIntShield against other providers:

    # Test DeepIntShield
    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -output deepintshield_results.json

    # Test LiteLLM
    ./benchmark -provider litellm -port 8000 -rate 1000 -duration 60 -output litellm_results.json

    # Test direct OpenAI (if available)
    ./benchmark -provider openai -port 443 -endpoint chat/completions -rate 1000 -duration 60 -output openai_results.json

Understanding Results
The benchmark tool generates detailed JSON results with comprehensive metrics:
Key Metrics Explained
    {
      "deepintshield": {
        "request_counts": {
          "total_sent": 30000,
          "successful": 30000,
          "failed": 0
        },
        "success_rate": 100.0,
        "latency_metrics": {
          "mean_ms": 245.5,
          "p50_ms": 230.2,
          "p99_ms": 520.8,
          "max_ms": 845.3
        },
        "throughput_rps": 5000.0,
        "memory_usage": {
          "before_mb": 512.5,
          "after_mb": 1312.8,
          "peak_mb": 1405.2,
          "average_mb": 1156.7
        },
        "timestamp": "2025-01-14T10:30:00Z",
        "status_codes": {
          "200": 30000
        }
      }
    }

Critical Performance Indicators
Success Rate:
- Target: >99.9% for production readiness
- Excellent: 100% (perfect reliability)
Latency Metrics:
- P50 (Median): Typical user experience
- P99: Tail latency; 99% of requests complete faster than this (the absolute worst case is max_ms)
- Mean: Overall average performance
Memory Usage:
- Peak: Maximum memory consumption
- Average: Sustained memory usage
- After - Before: Memory growth during test
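The indicators above can be computed directly from a results file. The sketch below assumes the JSON shape shown under "Key Metrics Explained", with the provider name as the top-level key. `load_result` and `summarize` are hypothetical helper names, and the P99-to-P50 ratio is an extra illustrative figure rather than something the tool reports.

```python
import json

# Sample values copied from the "Key Metrics Explained" JSON in this guide.
SAMPLE = {
    "deepintshield": {
        "success_rate": 100.0,
        "latency_metrics": {"mean_ms": 245.5, "p50_ms": 230.2,
                            "p99_ms": 520.8, "max_ms": 845.3},
        "memory_usage": {"before_mb": 512.5, "after_mb": 1312.8,
                         "peak_mb": 1405.2, "average_mb": 1156.7},
    }
}


def load_result(path: str, provider: str) -> dict:
    """Read one provider's block from a results file written via -output."""
    with open(path) as f:
        return json.load(f)[provider]


def summarize(result: dict, target_success_rate: float = 99.9) -> dict:
    """Derive the indicators listed in this section from one result block."""
    latency = result["latency_metrics"]
    memory = result["memory_usage"]
    return {
        # Target from this guide: success rate above 99.9%.
        "meets_success_target": result["success_rate"] > target_success_rate,
        # "After - Before": memory growth during the test.
        "memory_growth_mb": round(memory["after_mb"] - memory["before_mb"], 1),
        # Illustrative only: how far the tail sits from the median.
        "p99_to_p50_ratio": round(latency["p99_ms"] / latency["p50_ms"], 2),
    }


print(summarize(SAMPLE["deepintshield"]))
```

For a real run, replace the sample with something like `summarize(load_result("basic_test.json", "deepintshield"))`.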
Instance Sizing Recommendations
Based on your benchmark results, use these guidelines for production sizing:
Resource Planning Matrix
| Target RPS | Memory Usage | Recommended Instance | Notes |
|---|---|---|---|
| < 1,000 | < 1GB | t3.small | Cost-effective for light loads |
| 1,000 - 3,000 | 1-2GB | t3.medium | Balanced performance/cost |
| 3,000 - 5,000 | 2-4GB | t3.large | High-performance production |
| 5,000+ | 3-6GB | t3.xlarge+ | Enterprise/mission-critical |
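For scripted capacity planning, the matrix can be expressed as a small lookup. The row boundaries in the table overlap (3,000 appears in two rows), so this sketch assumes a boundary value falls into the larger tier. `recommend_instance` is a hypothetical helper name.

```python
def recommend_instance(target_rps: int) -> str:
    """Map a target RPS to the instance type from the planning matrix.

    Assumption: values exactly on a row boundary go to the larger tier.
    """
    if target_rps < 1000:
        return "t3.small"
    if target_rps < 3000:
        return "t3.medium"
    if target_rps < 5000:
        return "t3.large"
    return "t3.xlarge+"


for rps in (500, 2000, 4000, 8000):
    print(rps, "->", recommend_instance(rps))
```

Treat the result as a starting point and confirm it against the memory usage measured in your own runs.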
Configuration Tuning Based on Results
If seeing high latency:
- Increase `initial_pool_size`
- Increase `buffer_size`
- Consider larger instance
If memory usage is high:
- Decrease `initial_pool_size`
- Optimize `buffer_size`
- Monitor for memory leaks
If success rate < 100%:
- Reduce request rate
- Increase timeout settings
- Check provider limits
Advanced Testing Scenarios
Burst Load Testing
Simulate traffic spikes:

    # Normal load
    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 300 -output normal_load.json

    # Burst load (simulate 5x spike)
    ./benchmark -provider deepintshield -port 8080 -rate 5000 -duration 60 -output burst_load.json

Multi-Instance Testing
Test horizontal scaling:

    # Instance 1
    ./benchmark -provider deepintshield-1 -port 8080 -rate 2500 -duration 120 -output instance_1.json &

    # Instance 2
    ./benchmark -provider deepintshield-2 -port 8081 -rate 2500 -duration 120 -output instance_2.json &

    # Wait for both to complete
    wait

Different Payload Sizes
Compare performance across payload sizes:

    # Small payloads (default)
    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -output small_payload.json

    # Large payloads
    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 60 -big-payload=true -output large_payload.json

Continuous Benchmarking
Automated Testing Pipeline
Set up regular performance regression testing:

    #!/bin/bash
    DATE=$(date +%Y%m%d_%H%M%S)
    OUTPUT_DIR="benchmarks/$DATE"
    # Quote the path so it survives spaces in the working directory
    mkdir -p "$OUTPUT_DIR"

    # Run standard benchmarks
    ./benchmark -provider deepintshield -port 8080 -rate 1000 -duration 300 -output "$OUTPUT_DIR/standard.json"
    ./benchmark -provider deepintshield -port 8080 -rate 3000 -duration 180 -output "$OUTPUT_DIR/high_load.json"
    ./benchmark -provider deepintshield -port 8080 -rate 500 -duration 600 -big-payload=true -output "$OUTPUT_DIR/large_payload.json"

    echo "Benchmarks completed: $OUTPUT_DIR"

Performance Monitoring Integration
Monitor key metrics over time:
- Success rate trends
- Latency percentile changes
- Memory usage patterns
- Throughput capacity
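One way to track these metrics is to compare each new run against a stored baseline and flag anything that moved by more than a tolerance. The sketch below is a minimal example under stated assumptions: the result blocks follow the JSON shape from "Key Metrics Explained", the 10% tolerance is an arbitrary illustrative default, and `find_regressions` is a hypothetical helper name. The `current` values are made up for demonstration.

```python
# Metrics where a higher value is worse, and metrics where a lower value is worse.
HIGHER_IS_WORSE = [
    ("latency_metrics", "p50_ms"),
    ("latency_metrics", "p99_ms"),
    ("memory_usage", "peak_mb"),
]
LOWER_IS_WORSE = [("throughput_rps",), ("success_rate",)]


def _lookup(result: dict, path: tuple):
    """Follow a key path such as ("latency_metrics", "p99_ms")."""
    for key in path:
        result = result[key]
    return result


def find_regressions(baseline: dict, current: dict, tolerance_pct: float = 10.0) -> list:
    """Return dotted metric names that worsened by more than tolerance_pct."""
    factor = tolerance_pct / 100.0
    regressions = []
    for path in HIGHER_IS_WORSE:
        if _lookup(current, path) > _lookup(baseline, path) * (1 + factor):
            regressions.append(".".join(path))
    for path in LOWER_IS_WORSE:
        if _lookup(current, path) < _lookup(baseline, path) * (1 - factor):
            regressions.append(".".join(path))
    return regressions


baseline = {
    "success_rate": 100.0,
    "throughput_rps": 5000.0,
    "latency_metrics": {"p50_ms": 230.2, "p99_ms": 520.8},
    "memory_usage": {"peak_mb": 1405.2},
}
# Made-up follow-up run: slower tail latency and lower throughput.
current = {
    "success_rate": 100.0,
    "throughput_rps": 4000.0,
    "latency_metrics": {"p50_ms": 235.0, "p99_ms": 600.0},
    "memory_usage": {"peak_mb": 1410.0},
}
print(find_regressions(baseline, current))
```

In a pipeline, both dictionaries would be loaded from files such as `benchmarks/<date>/standard.json`, and a non-empty result could fail the job.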
Troubleshooting
Common Issues
Connection Refused:

    # Check if DeepIntShield is running
    curl http://localhost:8080/health

    # Verify port configuration
    netstat -an | grep 8080

- Check that PORT is defined in the `.env` file at the root.
High Error Rates:
- Check provider API key limits
- Verify DeepIntShield configuration
- Monitor upstream provider status
- Reduce request rate for baseline test
Memory Issues:
- Monitor system resources during testing
- Check for memory leaks in long tests
- Adjust DeepIntShield pool sizes
Inconsistent Results:
- Run multiple test iterations
- Account for network variability
- Use longer test durations (60+ seconds)
- Isolate testing environment
- Route gateway requests to a mock provider to rule out upstream variability
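When running multiple iterations, a simple spread measure shows whether the results are stable enough to trust. The sketch below computes the mean, sample standard deviation, and coefficient of variation for one metric collected across runs. `variability` is a hypothetical helper name, and the P99 values in the demo are made up for illustration.

```python
import statistics


def variability(values: list) -> dict:
    """Summarize the run-to-run spread of one metric (for example p99_ms)."""
    mean = statistics.mean(values)
    # Sample standard deviation needs at least two data points.
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    cv_pct = 100.0 * stdev / mean if mean else 0.0
    return {"mean": mean, "stdev": stdev, "cv_pct": cv_pct}


# Made-up P99 latencies (ms) from three iterations of the same test.
print(variability([520.8, 498.3, 545.1]))
```

A high coefficient of variation suggests the environment is noisy and that longer durations or a more isolated setup may be needed before comparing configurations.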
Next Steps
After Running Benchmarks
- Analyze Results: Compare against official benchmarks
- Optimize Configuration: Tune based on your specific results
- Plan Capacity: Size instances based on measured performance
- Set Up Monitoring: Track key metrics in production
Compare Results
- t3.medium Performance - Compare against medium instance results
- t3.xlarge Performance - Compare against high-performance configuration
Ready to benchmark? Clone the repository and start testing!