Semantic cache short-circuit
DeepintShield’s semantic cache plugin runs before the guardrails plugin in the request pipeline. When the cache hits — exact match or fuzzy semantic match — the response is served directly and the rest of the pipeline (guards, provider call) is skipped entirely.
On templated/chat workloads, this typically reclaims 30–60% of LLM spend.
Default behavior
On. Semantic lookup sits at sidebar order 4 (before guardrails at order 5). No configuration needed.
The legacy post-guards placement is still available if your workload has a very low cache hit rate and the additional vector-search cost on misses outweighs the wins:
    DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS=true # opt back to legacy order

How it works
    Request
       ↓
    DirectGate (exact match cache) ──hit──→ return cached response
       ↓ miss
    SemanticLookup (fuzzy match)   ──hit──→ return cached response
       ↓ miss
    Guardrails (input)             ──deny──→ guardrail_blocked
       ↓ allow
    Provider call (real LLM cost incurred)
       ↓
    Guardrails (output)
       ↓
    Response

A cache hit at either gate stops the pipeline: guards don’t run, the provider isn’t called, and the cached response is returned with the audit metadata it was originally cached with.
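The control flow above can be sketched in a few lines of Python. This is an illustration only, not DeepintShield's implementation: `handle_request`, `PipelineResult`, and the five callables are hypothetical names, and the behavior on an output-guardrail denial is an assumption, since the diagram does not show it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    body: str
    source: str  # "direct_cache", "semantic_cache", "provider", or "guardrail_blocked"
    stages_run: List[str] = field(default_factory=list)


def handle_request(
    prompt: str,
    direct_lookup: Callable[[str], Optional[str]],    # exact-match cache (hypothetical)
    semantic_lookup: Callable[[str], Optional[str]],  # fuzzy-match cache (hypothetical)
    input_guard: Callable[[str], bool],               # True means "allow"
    call_provider: Callable[[str], str],              # the paid LLM call
    output_guard: Callable[[str], bool],              # True means "allow"
) -> PipelineResult:
    stages: List[str] = []

    # Gate 1: exact match. A hit returns immediately; nothing below runs.
    stages.append("direct_gate")
    cached = direct_lookup(prompt)
    if cached is not None:
        return PipelineResult(cached, "direct_cache", stages)

    # Gate 2: fuzzy semantic match. Same short-circuit on a hit.
    stages.append("semantic_lookup")
    cached = semantic_lookup(prompt)
    if cached is not None:
        return PipelineResult(cached, "semantic_cache", stages)

    # Only cache misses reach the guardrails and the provider.
    stages.append("input_guardrails")
    if not input_guard(prompt):
        return PipelineResult("guardrail_blocked", "guardrail_blocked", stages)

    stages.append("provider")
    response = call_provider(prompt)

    # Assumption: an output-guardrail denial is reported like an input denial.
    stages.append("output_guardrails")
    if not output_guard(response):
        return PipelineResult("guardrail_blocked", "guardrail_blocked", stages)

    return PipelineResult(response, "provider", stages)
```

Recording `stages_run` makes the short-circuit visible: on a hit at the first gate the list contains only `direct_gate`, while a full miss passes through all five stages.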
Why a cache hit is safe to skip guards
The cached response was already verdict-checked when it was first stored. The cache key includes the policy version, so a policy roll invalidates the cache automatically: entries stored under an older policy version no longer match, so you can never serve a response that wouldn’t pass current guardrails.
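To see why including the policy version in the key gives invalidation for free, consider this minimal sketch. The key layout is an assumption made for illustration (`cache_key`, the field names, and the `v1`/`v2` version strings are all hypothetical); the actual key format is not documented here. For the semantic gate, where lookup is by vector similarity rather than by hash, the policy version would presumably act as a filter on candidate entries instead.

```python
import hashlib
import json


def cache_key(workspace_id: str, model: str, prompt: str, policy_version: str) -> str:
    """Derive an exact-match cache key that is scoped to one policy version."""
    payload = json.dumps(
        {
            "workspace": workspace_id,
            "model": model,
            "prompt": prompt,
            "policy": policy_version,  # changing this changes every key
        },
        sort_keys=True,  # stable field order, so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# An in-memory dict stands in for the real cache store.
cache = {}
question = "What is your refund policy?"
cache[cache_key("ws-1", "model-a", question, "v1")] = "Refunds within 30 days."

# Same policy version: the entry is found.
same_policy = cache.get(cache_key("ws-1", "model-a", question, "v1"))

# After a policy roll to v2: the old entry is unreachable, so the request
# falls through to the guardrails and the provider as a normal miss.
after_roll = cache.get(cache_key("ws-1", "model-a", question, "v2"))
```

No explicit purge step is needed in this scheme. Stale entries simply stop matching and can be left to expire.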
Realistic cost reduction
| Workload | Typical hit rate | Cost saved |
|---|---|---|
| FAQ bot, customer support templates | 40–60% | ~50% |
| Internal copilot, repeated dev questions | 25–40% | ~30% |
| Long-form RAG, ad-hoc creative prompts | 5–15% | ~10% |
| Streaming code completion | <5% | Minimal |
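A rough way to relate hit rate to savings, and to judge when the lookup cost on misses outweighs the wins, is the back-of-the-envelope model below. It is a simplification under stated assumptions: every request pays a fixed lookup cost, every miss pays the full provider cost, and hits are free. The function name and the prices in the usage lines are hypothetical, not measured figures.

```python
def estimated_savings(hit_rate: float, provider_cost: float, lookup_cost: float) -> float:
    """Fraction of baseline spend saved per request (negative means the cache costs more).

    hit_rate      -- fraction of requests served from cache, between 0 and 1
    provider_cost -- average cost of one provider call (any currency unit)
    lookup_cost   -- average cost of one cache lookup, paid on every request
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if provider_cost <= 0:
        raise ValueError("provider_cost must be positive")

    baseline = provider_cost
    with_cache = lookup_cost + (1.0 - hit_rate) * provider_cost
    return (baseline - with_cache) / baseline


# Hypothetical prices, for illustration only.
faq_bot = estimated_savings(hit_rate=0.50, provider_cost=0.01, lookup_cost=0.0)
code_completion = estimated_savings(hit_rate=0.03, provider_cost=0.01, lookup_cost=0.0005)
```

Under this model the saving is `hit_rate - lookup_cost / provider_cost`, so the cache pays for itself only while the hit rate exceeds the ratio of lookup cost to provider cost. That is consistent with the table, where the saved share tracks the hit rate, and with the low-hit-rate rows offering little benefit.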
The cache is opt-in per workspace via the Cost Optimization settings. You can also disable it for VKs that must hit the model every time (e.g. fresh research queries).