Async Inference

Overview

Async inference uses a submit-and-poll pattern for gateway requests: submit a normal inference payload to an async endpoint, get a job_id immediately, and poll later for the final result.

How It Works

sequenceDiagram
    participant Client
    participant Gateway as DeepIntShield Gateway
    participant Worker as Async Worker
    participant Provider

    Client->>Gateway: POST /v1/async/chat/completions
    Gateway-->>Client: 202 Accepted + {id, status: "pending"}
    Gateway->>Worker: Queue async job
    Worker->>Provider: Execute inference request
    Provider-->>Worker: Response or error

    Client->>Gateway: GET /v1/async/chat/completions/{job_id}
    alt Job pending or processing
        Gateway-->>Client: 202 Accepted + status
    else Job completed or failed
        Gateway-->>Client: 200 OK + result/error
    end

Supported Endpoints

Streaming is not supported on async endpoints.

Request Type	Submit (POST)	Poll (GET)
Text completions	`/v1/async/completions`	`/v1/async/completions/{job_id}`
Chat completions	`/v1/async/chat/completions`	`/v1/async/chat/completions/{job_id}`
Responses API	`/v1/async/responses`	`/v1/async/responses/{job_id}`
Embeddings	`/v1/async/embeddings`	`/v1/async/embeddings/{job_id}`
Speech	`/v1/async/audio/speech`	`/v1/async/audio/speech/{job_id}`
Transcriptions	`/v1/async/audio/transcriptions`	`/v1/async/audio/transcriptions/{job_id}`
Image generations	`/v1/async/images/generations`	`/v1/async/images/generations/{job_id}`
Image edits	`/v1/async/images/edits`	`/v1/async/images/edits/{job_id}`
Image variations	`/v1/async/images/variations`	`/v1/async/images/variations/{job_id}`
Rerank	`/v1/async/rerank`	`/v1/async/rerank/{job_id}`
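
Every poll path in the table is the submit path followed by the job ID. A minimal Python sketch of that mapping is shown here; the dictionary keys and the poll_path helper are illustrative names, not part of the gateway API.

```python
from urllib.parse import quote

# Submit paths copied from the table above. The short keys are illustrative.
ASYNC_SUBMIT_PATHS = {
    "completions": "/v1/async/completions",
    "chat_completions": "/v1/async/chat/completions",
    "responses": "/v1/async/responses",
    "embeddings": "/v1/async/embeddings",
    "speech": "/v1/async/audio/speech",
    "transcriptions": "/v1/async/audio/transcriptions",
    "image_generations": "/v1/async/images/generations",
    "image_edits": "/v1/async/images/edits",
    "image_variations": "/v1/async/images/variations",
    "rerank": "/v1/async/rerank",
}


def poll_path(request_type: str, job_id: str) -> str:
    """Return the GET path for a job: the submit path plus '/{job_id}'."""
    return f"{ASYNC_SUBMIT_PATHS[request_type]}/{quote(job_id, safe='')}"
```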

Submitting a Request

Use the same JSON body as the synchronous endpoint, but switch to the /v1/async/ path.

curl -X POST http://localhost:8080/v1/async/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: sk-bf-your-virtual-key" \
  -H "x-bf-async-job-result-ttl: 3600" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Summarize the latest release notes in 3 bullets"
      }
    ]
  }'

Response (202 Accepted)

{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "pending",
  "created_at": "2026-02-19T08:10:17.831Z"
}
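
The same submission can be prepared from Python with only the standard library. This is a sketch rather than an official client: build_submit_request is a hypothetical helper name, and the headers are the ones used in the curl example above.

```python
import json
import urllib.request


def build_submit_request(base_url, path, payload, virtual_key=None, result_ttl=None):
    """Build (but do not send) a POST request for an async endpoint."""
    headers = {"Content-Type": "application/json"}
    if virtual_key is not None:
        headers["x-bf-vk"] = virtual_key
    if result_ttl is not None:
        # Optional per-request TTL override, in seconds.
        headers["x-bf-async-job-result-ttl"] = str(result_ttl)
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. A 202 Accepted response is a success status, so urlopen returns it normally and the body can be parsed with json.load to read the job id.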

Polling for Results

Use GET on the matching endpoint with the returned job_id.

curl -X GET http://localhost:8080/v1/async/chat/completions/1e89b165-d4fe-49e8-beb2-3e157f2df02f \
  -H "x-bf-vk: sk-bf-your-virtual-key"

Response codes:

202 Accepted: job is still pending or processing
200 OK: job is completed or failed; the status and status_code fields in the response body distinguish the two

Pending example (202)

{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "pending",
  "created_at": "2026-02-19T08:10:17.831Z"
}

Completed example (200)

{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "completed",
  "created_at": "2026-02-19T08:10:17.831Z",
  "completed_at": "2026-02-19T08:10:19.412Z",
  "expires_at": "2026-02-19T09:10:19.412Z",
  "status_code": 200,
  "result": {
    "id": "chatcmpl-123",
    "object": "chat.completion"
  }
}

Failed example (200)

{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "failed",
  "created_at": "2026-02-19T08:10:17.831Z",
  "completed_at": "2026-02-19T08:10:19.412Z",
  "expires_at": "2026-02-19T09:10:19.412Z",
  "status_code": 429,
  "error": {
    "error": {
      "message": "rate limit exceeded",
      "type": "rate_limit_error"
    }
  }
}
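
A polling loop built on these response codes might look like the sketch below. The names wait_for_job and JobNotFound, the fetch callable (which returns an HTTP status and a parsed JSON body), and the interval and timeout defaults are all assumptions made for illustration; the gateway does not prescribe a polling interval.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


class JobNotFound(Exception):
    """Raised when the gateway answers 404 (job unknown or expired)."""


def wait_for_job(fetch, interval=1.0, timeout=60.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until the job is completed or failed, then return its body.

    fetch must return a (http_status, body_dict) tuple for one GET poll.
    """
    deadline = clock() + timeout
    while True:
        http_status, body = fetch()
        if http_status == 404:
            raise JobNotFound("job not found or expired")
        if http_status == 200 and body.get("status") in TERMINAL_STATUSES:
            # A failed job also arrives with HTTP 200; callers inspect
            # body["status"] and body["status_code"] to tell them apart.
            return body
        if http_status != 202:
            raise RuntimeError(f"unexpected poll response: HTTP {http_status}")
        if clock() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        sleep(interval)
```

When fetch is implemented with urllib.request.urlopen, note that urlopen raises urllib.error.HTTPError for a 404, so the wrapper should catch that exception and return its code instead. The client-side timeout matters because a job can stay in a non-terminal status indefinitely.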

Job Lifecycle

Status	Meaning	Transition Trigger
`pending`	Job record is created and queued	Immediate status on submit
`processing`	Background worker has picked up the job	Worker starts execution
`completed`	Operation succeeded and result is stored	Provider call completes successfully
`failed`	Operation failed and error is stored	Provider call returns a DeepIntShield error

Result TTL and Expiration

Default TTL is 3600 seconds (1 hour).
TTL starts from completion time, not submission time.
Server default is configured in client.async_job_result_ttl.
Per-request override uses x-bf-async-job-result-ttl.
If the header is invalid or <= 0, DeepIntShield falls back to the default TTL.
Expired jobs return 404 Job not found or expired.
Expired async jobs are cleaned up every minute.
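
The fallback and expiry rules above can be written out as a small sketch. The function names are hypothetical, and parsing the header as a whole number of seconds is an assumption about the gateway's behavior; only the fallback-to-default rule and the completion-time anchor come from the list above.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_SECONDS = 3600  # documented default: 1 hour


def effective_ttl(header_value, default=DEFAULT_TTL_SECONDS):
    """Return the TTL to apply, falling back to the default when the
    header is missing, not an integer, or <= 0."""
    try:
        ttl = int(header_value)
    except (TypeError, ValueError):
        return default
    return ttl if ttl > 0 else default


def expires_at(completed_at, ttl_seconds):
    """TTL counts from completion time, not submission time."""
    return completed_at + timedelta(seconds=ttl_seconds)


# Values taken from the completed example earlier on this page.
completed = datetime(2026, 2, 19, 8, 10, 19, 412000, tzinfo=timezone.utc)
expiry = expires_at(completed, effective_ttl("3600"))
```

With the values from the completed example, a completion time of 08:10:19.412Z plus 3600 seconds gives the expires_at of 09:10:19.412Z shown in that response.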

Virtual Key Authorization

If a job is created with a virtual key, the job stores that virtual key identity.
Polling must use the same virtual key value.
Missing or mismatched virtual keys fail lookup and return 404 Job not found or expired.
Jobs created without a virtual key are not virtual-key scoped, so they can be polled by any caller that passes your gateway auth/middleware checks.
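
The scoping rules above reduce to a small decision function. The following is an illustrative model of the documented behavior, not the gateway's implementation; poll_http_status and its parameters are hypothetical names.

```python
from typing import Optional


def poll_http_status(job_virtual_key: Optional[str],
                     presented_virtual_key: Optional[str]) -> Optional[int]:
    """Model the virtual-key check applied when a job is polled.

    Returns 404 when the lookup fails, or None when the lookup
    succeeds and normal 202/200 handling applies.
    """
    if job_virtual_key is None:
        # Job was created without a virtual key: it is not key-scoped.
        return None
    if presented_virtual_key != job_virtual_key:
        # Missing or mismatched key looks the same as an unknown job.
        return 404
    return None
```

Because a mismatch returns the same 404 as an expired job, a client cannot distinguish the two cases from the response alone.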

Observability

Async executions are logged like synchronous requests.
The logging metadata includes isAsyncRequest: true, which appears as an Async badge in the Logs UI.
Background execution still uses DeepIntShield request APIs, so LLM plugin hooks (governance, logging, cost tracking, etc.) are executed for the actual inference run.

Limitations

Gateway-only feature (not available in Go SDK).
Streaming is not supported on async endpoints.
Requires a Logs Store: async routes are registered only when one is available.
Jobs stuck in processing are not auto-expired by TTL cleanup. Cleanup only deletes jobs with expires_at set (completed/failed).