/v1/chat/completions and /v1/responses endpoints.
View cache analytics in the Requesty Console.
What Auto Cache Does
Auto caching is provider-level prompt caching. When enabled, the router automatically adds cache breakpoints to the largest content blocks in your request before forwarding it to the provider (Anthropic or Gemini). The provider then caches those token prefixes on their side. This means:- You still send the full message history with every request. The payload size does not change.
- The provider recognizes cached prefixes and charges reduced rates for tokens it has already seen.
- Cache hits are billed at a fraction of the normal input token cost (up to 90% savings).
How Auto Cache Works
Theauto_cache flag is a boolean parameter sent within the requesty field in your request payload.
| Value | Behavior |
|---|---|
true | Instructs the router to add cache breakpoints to the largest content blocks before forwarding to the provider |
false | Bypasses caching for this request (useful when cache writes have extra costs) |
| Not provided | Falls back to default behavior based on request origin (e.g., Cline, Roo Code, and Forge default to caching) |
Chat Completions API
Include theauto_cache flag within the requesty object in your request:
Responses API
Auto caching works identically on the/v1/responses endpoint. Include the same requesty.auto_cache flag:
Multi-turn with Responses API
For multi-turn conversations, include the full conversation history in theinput array along with the requesty.auto_cache flag. The router caches the largest content blocks so the provider charges reduced rates for the repeated prefix:
Auto Cache vs. Response IDs
OpenAI’s Responses API supports aprevious_response_id parameter that lets OpenAI store conversation state server-side so you don’t have to resend the full history. This is an OpenAI-specific feature — it works when routing to OpenAI models through Requesty, because OpenAI handles the storage on their end.
For non-OpenAI models (Anthropic, Gemini, etc.), previous_response_id is not available because these providers don’t store responses server-side. Instead, use auto_cache with the full conversation history to get cost savings:
| Approach | How it works | Provider support |
|---|---|---|
auto_cache | You send full history; provider caches token prefixes and charges less for cache hits | Anthropic, Gemini |
previous_response_id | Provider stores conversation server-side; you send only the new message | OpenAI only |
Important Notes
Provider Support: The
auto_cache flag is respected by providers where cache writes incur extra costs, including Anthropic and Gemini.- Explicit Control:
auto_cacheprovides explicit control. Set totrueto attempt caching,falseto prevent caching for providers where cache writes incur extra costs. - Default Behavior: If
auto_cacheis not specified, the caching behavior reverts to defaults based on request origin. - Cost Savings: Cache hits are billed at a fraction of the normal input token cost. This is especially effective for applications with large system prompts or knowledge bases.
- Minimum Token Length: The cached prefix must be at least 1,024 tokens for Anthropic (2,048 for Claude 3.5 Haiku). Content shorter than that will not be cached.