Speculative dispatch
Speculative dispatch fires the provider call in parallel with input-guard
evaluation. On the allow path (>95% of typical traffic), the user sees
max(guards, model) latency instead of guards + model.
For most workloads this is the single biggest latency win available.
Default behavior
Off (opt-in). This is a safety tradeoff; see below. Enable per-deployment when you’re comfortable with the semantics.
# Plugin config (preferred, live-reloadable):
{ "speculative_input_guards": true }
# Or env var:
DEEPINTSHIELD_GUARD_SPECULATIVE_INPUT_GUARDS=true
How it works
- PreLLMHook kicks off guard evaluation in a goroutine and returns immediately.
- The provider call dispatches in parallel.
- PostLLMHook blocks on the guard verdict before releasing the response.
On the allow path: total latency = max(t_guards, t_model).
On the deny path: the model call’s result is discarded and the user sees
the standard guardrail_blocked error. You pay for one wasted provider call,
which is the price of the latency win.
The two safety semantics
When to enable
- Latency-sensitive chat and copilot UIs where a 200–800ms reduction is visible.
- Workloads with low deny rates (typical: <5%) so the wasted-provider-call cost is small.
- Workloads that don’t rely on input redaction for safety.
When to leave it off
- Audit-heavy workloads where every guarded request must complete before the provider call even starts (some regulated environments).
- Workloads with high deny rates (>20%) where the wasted-call cost is real.
- Streaming responses — speculative dispatch is automatically disabled for streaming, since there’s no clean way to discard tokens after the first one hits the wire.