Semantic cache short-circuit

DeepintShield’s semantic cache plugin runs before the guardrails plugin in the request pipeline. When the cache hits — exact match or fuzzy semantic match — the response is served directly and the rest of the pipeline (guards, provider call) is skipped entirely.

On templated/chat workloads, this typically reclaims 30–60% of LLM spend.

The short-circuit ordering is on by default. Semantic lookup sits at sidebar order 4 (before guardrails at order 5), so no configuration is needed.

The legacy post-guards placement is still available if your workload has a very low cache hit rate and the additional vector-search cost on misses outweighs the savings from hits:

DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS=true # opt back to legacy order

Request
   ↓
DirectGate (exact match cache)  ──hit──→  return cached response
   ↓ miss
SemanticLookup (fuzzy match)    ──hit──→  return cached response
   ↓ miss
Guardrails (input)              ──deny──→ guardrail_blocked
   ↓ allow
Provider call (real LLM cost incurred)
   ↓
Guardrails (output)
   ↓
Response

A cache hit at either gate stops the pipeline — guards don’t run, the provider isn’t called, and the cached response is returned with the audit metadata it was originally cached with.
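
The short-circuit described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepintShield's implementation: the `Pipeline` class, its field names, and the stand-in stage callables are all hypothetical, and returning `guardrail_blocked` when the output guard denies is an assumption (the flow above only shows a deny branch on input).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Pipeline:
    # All five stages are hypothetical callables supplied by the caller.
    direct_gate: Callable[[str], Optional[str]]      # exact-match cache, None on miss
    semantic_lookup: Callable[[str], Optional[str]]  # fuzzy-match cache, None on miss
    input_guard: Callable[[str], bool]               # True means allow
    provider: Callable[[str], str]                   # the real (billable) LLM call
    output_guard: Callable[[str], bool]              # True means allow
    trace: List[str] = field(default_factory=list)   # stages that ran for the last request

    def handle(self, prompt: str) -> str:
        self.trace = []
        # Both cache gates run before any guard; a hit returns immediately.
        for name, gate in (("direct_gate", self.direct_gate),
                           ("semantic_lookup", self.semantic_lookup)):
            self.trace.append(name)
            cached = gate(prompt)
            if cached is not None:
                return cached
        self.trace.append("input_guard")
        if not self.input_guard(prompt):
            return "guardrail_blocked"
        self.trace.append("provider")
        response = self.provider(prompt)
        self.trace.append("output_guard")
        if not self.output_guard(response):
            return "guardrail_blocked"  # assumption: output denial is reported the same way
        return response


# Toy stages: a dict as the exact cache and a keyword test standing in for vector search.
exact = {"reset password": "Use the reset link."}
pipeline = Pipeline(
    direct_gate=exact.get,
    semantic_lookup=lambda p: exact["reset password"] if "password" in p else None,
    input_guard=lambda p: "secret" not in p,
    provider=lambda p: f"LLM answer to: {p}",
    output_guard=lambda r: True,
)

print(pipeline.handle("how do I change my password"), pipeline.trace)
```

After each call, `pipeline.trace` lists the stages that ran. On a hit it stops at the gate that matched, so neither the guards nor the provider appear in it.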

The cached response was already verdict-checked when it was first stored. The cache key includes the policy version, so a policy roll invalidates the cache automatically — you can never serve a response that wouldn’t pass current guardrails.
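
To see why a policy roll invalidates old entries, consider a sketch of an exact-match key that folds in the policy version. The source only states that the key includes the policy version; the other key fields (`prompt`, `model`), the hashing scheme, and the in-memory `store` are assumptions made for illustration.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, policy_version: str) -> str:
    """Derive a deterministic key; any change in policy_version yields a new key."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "policy_version": policy_version},
        sort_keys=True,  # stable field order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical in-memory store; the entry was cached under policy-v1.
store = {}
store[cache_key("reset password", "model-a", "policy-v1")] = "Use the reset link."

# Same policy version: the lookup hits.
hit = store.get(cache_key("reset password", "model-a", "policy-v1"))

# After a policy roll the key differs, so the old entry is unreachable.
after_roll = store.get(cache_key("reset password", "model-a", "policy-v2"))

print(hit, after_roll)
```

Entries stored under the old version are never deleted explicitly in this sketch. They simply stop matching, which is one plausible way to get automatic invalidation. A fuzzy semantic lookup would presumably need an equivalent filter on policy version, since it does not compare hashed keys.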

| Workload | Typical hit rate | Cost saved |
| --- | --- | --- |
| FAQ bot, customer support templates | 40–60% | ~50% |
| Internal copilot, repeated dev questions | 25–40% | ~30% |
| Long-form RAG, ad-hoc creative prompts | 5–15% | ~10% |
| Streaming code completion | <5% | Minimal |
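
The relationship between hit rate and cost saved can be approximated with a simple per-request model: every request pays for a cache lookup, and only misses pay for the provider call. The sketch below is a back-of-the-envelope aid under that assumption. The function names and the example costs are hypothetical, not measured DeepintShield figures, and guardrail costs are ignored.

```python
def fraction_saved(hit_rate: float, lookup_cost: float, provider_cost: float) -> float:
    """Fraction of provider spend saved versus calling the model on every request."""
    with_cache = lookup_cost + (1.0 - hit_rate) * provider_cost
    return 1.0 - with_cache / provider_cost


def break_even_hit_rate(lookup_cost: float, provider_cost: float) -> float:
    """Hit rate below which the lookup costs more than it saves."""
    return lookup_cost / provider_cost


# Hypothetical per-request costs in dollars.
LOOKUP = 0.0001
PROVIDER = 0.01

for rate in (0.50, 0.30, 0.10, 0.03):
    print(f"hit rate {rate:.0%}: saved {fraction_saved(rate, LOOKUP, PROVIDER):.1%}")

print(f"break-even hit rate: {break_even_hit_rate(LOOKUP, PROVIDER):.1%}")
```

With these example numbers a 50% hit rate saves about 49% of provider spend, and the break-even hit rate is 1%. Under this model the saving is the hit rate minus the ratio of lookup cost to provider cost, which is why the savings column tracks the hit-rate column closely when lookups are cheap, and why workloads with very low hit rates gain little.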

The cache is opt-in per workspace via the Cost Optimization settings. You can also disable it for VKs that must hit the model every time (e.g. fresh research queries).