Semantic cache short-circuit

DeepintShield’s semantic cache plugin runs before the guardrails plugin in the request pipeline. When the cache hits — exact match or fuzzy semantic match — the response is served directly and the rest of the pipeline (guards, provider call) is skipped entirely.

On templated/chat workloads, this typically reclaims 30–60% of LLM spend.

Default behavior

On. Semantic lookup sits at sidebar order 4, ahead of guardrails at order 5, so the short-circuit needs no extra configuration once the cache is enabled.

The legacy post-guards placement is still available if your workload has a very low cache hit rate and the additional vector-search cost on misses outweighs the savings from hits:

DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS=true  # opt back into the legacy order
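As a rough illustration of what the flag changes, the sketch below reads the variable and picks a stage order. Only the environment variable name comes from this page. The stage names, the `stage_order` helper, the assumption that only the string `true` enables the legacy order, and the assumption that DirectGate stays first in both modes are all hypothetical.

```python
import os

# Hypothetical stage names; only the environment variable name is documented.
PRE_GUARD_ORDER = ["direct_gate", "semantic_lookup", "guardrails_input", "provider"]
LEGACY_ORDER = ["direct_gate", "guardrails_input", "semantic_lookup", "provider"]

FLAG = "DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS"


def stage_order(env=None):
    """Return the assumed stage order for the given environment mapping."""
    env = os.environ if env is None else env
    # Assumption: the flag counts as set only when its value is "true".
    legacy = env.get(FLAG, "").strip().lower() == "true"
    return LEGACY_ORDER if legacy else PRE_GUARD_ORDER


if __name__ == "__main__":
    print(stage_order())
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to exercise in isolation, for example `stage_order({FLAG: "true"})`.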

How it works

Request
  ↓
DirectGate (exact match cache)  ──hit──→  return cached response
  ↓ miss
SemanticLookup (fuzzy match)    ──hit──→  return cached response
  ↓ miss
Guardrails (input)              ──deny──→  guardrail_blocked
  ↓ allow
Provider call                              (real LLM cost incurred)
  ↓
Guardrails (output)
  ↓
Response

A cache hit at either gate stops the pipeline: guards don’t run, the provider isn’t called, and the cached response is returned along with the audit metadata that was stored with it.
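The flow above can be sketched in a few lines. This is a minimal illustration, not DeepintShield’s implementation: the `handle` function, the `Result` type, and the callables passed in are hypothetical stand-ins, and storing the response only after the output guard has run is an assumption based on the statement that cached responses are checked before they are stored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Result:
    body: str
    source: str  # "direct_gate", "semantic_lookup", "provider", or "guardrail_blocked"


def handle(
    prompt: str,
    exact_cache: Dict[str, str],
    semantic_lookup: Callable[[str], Optional[str]],
    input_guard: Callable[[str], bool],
    call_provider: Callable[[str], str],
    output_guard: Callable[[str], str],
) -> Result:
    # Gate 1: exact-match cache. A hit returns immediately.
    cached = exact_cache.get(prompt)
    if cached is not None:
        return Result(cached, "direct_gate")

    # Gate 2: fuzzy semantic match. A hit also returns immediately.
    cached = semantic_lookup(prompt)
    if cached is not None:
        return Result(cached, "semantic_lookup")

    # Only cache misses reach the input guardrails.
    if not input_guard(prompt):
        return Result("", "guardrail_blocked")

    # Provider call (real LLM cost), then output guardrails.
    response = output_guard(call_provider(prompt))

    # Assumption: the response is stored only after it has been checked.
    exact_cache[prompt] = response
    return Result(response, "provider")
```

Calling `handle` twice with the same prompt and a shared `exact_cache` dict should show the second call returning from `direct_gate` without invoking `call_provider`.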

Why a cache hit is safe to skip guards

The cached response was already verdict-checked when it was first stored. The cache key includes the policy version, so a policy roll invalidates the cache automatically: entries stored under an older policy version no longer match, so a response that was checked only against a superseded policy is never served.
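One way to picture this invalidation is a key that hashes the policy version together with the request. The sketch below is an assumption about the general technique, not the actual key format: the `cache_key` function and its `prompt`, `model`, and `policy_version` fields are hypothetical.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, policy_version: str) -> str:
    """Derive a deterministic key; field names are illustrative only."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "policy_version": policy_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    old = cache_key("How do I reset my password?", "example-model", "policy-v1")
    new = cache_key("How do I reset my password?", "example-model", "policy-v2")
    # A policy roll yields a different key, so the old entry is never looked up.
    print(old != new)
```

For the semantic gate, where matching is by vector similarity rather than by key, the equivalent would presumably be a filter on a stored policy-version field. That detail is also an assumption.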

Realistic cost reduction

Workload	Typical hit rate	Cost saved
FAQ bot, customer support templates	40–60%	~50%
Internal copilot, repeated dev questions	25–40%	~30%
Long-form RAG, ad-hoc creative prompts	5–15%	~10%
Streaming code completion	<5%	Minimal
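The relationship between hit rate and savings in the table can be approximated with simple arithmetic. The sketch below assumes a flat per-request provider cost and a flat lookup cost charged on every request. Both figures are made-up placeholders rather than measured DeepintShield numbers, and the function names are hypothetical.

```python
def net_savings(requests: int, hit_rate: float,
                provider_cost: float, lookup_cost: float) -> float:
    """Provider spend avoided by hits, minus lookup overhead on all requests."""
    avoided = requests * hit_rate * provider_cost
    overhead = requests * lookup_cost
    return avoided - overhead


def break_even_hit_rate(provider_cost: float, lookup_cost: float) -> float:
    """Hit rate below which the lookup overhead exceeds the avoided spend."""
    return lookup_cost / provider_cost


if __name__ == "__main__":
    # Hypothetical prices in dollars per request.
    provider_cost = 0.002
    lookup_cost = 0.0001
    for label, hit_rate in [("FAQ bot", 0.50), ("Long-form RAG", 0.10)]:
        saved = net_savings(100_000, hit_rate, provider_cost, lookup_cost)
        print(f"{label}: net savings ${saved:.2f}")
    print(f"break-even hit rate: {break_even_hit_rate(provider_cost, lookup_cost):.1%}")
```

Under this model, savings scale roughly with hit rate once lookup overhead is small, which matches the shape of the table. Workloads whose hit rate sits near the break-even point gain little.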

The cache is opt-in per workspace via the Cost Optimization settings. You can also disable it for VKs that must hit the model every time (e.g. fresh research queries).