Requesty’s auto caching caches long system prompts and repeated content to reduce costs on providers that support prompt caching (Anthropic, Gemini). Cache hits are billed at a fraction of the normal input token cost. Auto caching works on both the /v1/chat/completions and /v1/responses endpoints.
View cache analytics in the Requesty Console.

What Auto Cache Does

Auto caching is provider-level prompt caching. When enabled, the router adds cache breakpoints to the largest content blocks in your request before forwarding it to the provider (Anthropic or Gemini). The provider then caches those token prefixes on its side. This means:
  • You still send the full message history with every request. The payload size does not change.
  • The provider recognizes cached prefixes and charges reduced rates for tokens it has already seen.
  • Cache hits are billed at a fraction of the normal input token cost (up to 90% savings).
Auto caching is not server-side response storage. With auto caching, you cannot skip sending your conversation history. See Auto Cache vs. Response IDs for details.
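
As a rough illustration of the billing effect described above, the sketch below estimates input cost when part of a prompt is served from cache. The function name, the example price, and the flat 90% discount are assumptions for illustration only (this page states only "up to 90% savings"), and the estimate ignores any extra cost for cache writes.

```python
def estimated_input_cost(total_tokens: int, cached_tokens: int,
                         price_per_million: float,
                         cache_discount: float = 0.90) -> float:
    """Return an estimated input cost in the same currency as price_per_million.

    cache_discount is the assumed fraction saved on cache-hit tokens
    (0.90 is the best case mentioned in this guide, not a guaranteed rate).
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached_tokens = total_tokens - cached_tokens
    cached_price = price_per_million * (1.0 - cache_discount)
    total = uncached_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000


# Hypothetical numbers: a 100k-token prompt at 3.00 per million input tokens.
no_cache = estimated_input_cost(100_000, 0, 3.00)
mostly_cached = estimated_input_cost(100_000, 90_000, 3.00)
print(f"without cache hits: {no_cache:.3f}")
print(f"with 90k cached tokens: {mostly_cached:.3f}")
```

The payload is the same size in both cases; only the rate applied to the already-seen prefix changes.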

How Auto Cache Works

The auto_cache flag is a boolean parameter sent within the requesty field in your request payload.
| Value | Behavior |
| --- | --- |
| true | Instructs the router to add cache breakpoints to the largest content blocks before forwarding to the provider |
| false | Bypasses caching for this request (useful when cache writes have extra costs) |
| Not provided | Falls back to default behavior based on request origin (e.g., Cline, Roo Code, and Forge default to caching) |
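
The three cases in the table can be sketched as a small helper that builds the extra request body. The helper name `requesty_extra_body` is hypothetical; it assumes you pass the result as the `extra_body` argument of the OpenAI Python client, as the examples on this page do.

```python
from typing import Optional


def requesty_extra_body(auto_cache: Optional[bool] = None) -> dict:
    """Build the extra request body for the three auto_cache cases.

    True  -> ask the router to add cache breakpoints
    False -> bypass caching for this request
    None  -> omit the flag so the router applies its origin-based default
    """
    if auto_cache is None:
        # Leaving the flag out entirely is what "Not provided" means.
        return {}
    return {"requesty": {"auto_cache": auto_cache}}


print(requesty_extra_body(True))
print(requesty_extra_body(False))
print(requesty_extra_body())
```

Keeping `None` distinct from `False` matters here: omitting the flag and explicitly disabling caching are different behaviors.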

Chat Completions API

Include the auto_cache flag within the requesty object in your request:
import openai

client = openai.OpenAI(
    api_key="YOUR_REQUESTY_API_KEY",
    base_url="https://router.requesty.ai/v1",
)

system_prompt = "YOUR ENTIRE KNOWLEDGEBASE"

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "requesty": {
            "auto_cache": True
        }
    }
)
print(response.choices[0].message.content)

Responses API

Auto caching works identically on the /v1/responses endpoint. Include the same requesty.auto_cache flag:
import openai

client = openai.OpenAI(
    api_key="YOUR_REQUESTY_API_KEY",
    base_url="https://router.requesty.ai/v1",
)

system_prompt = "YOUR ENTIRE KNOWLEDGEBASE"

response = client.responses.create(
    model="anthropic/claude-sonnet-4-5",
    instructions=system_prompt,
    input="What is the capital of France?",
    extra_body={
        "requesty": {
            "auto_cache": True
        }
    }
)
print(response.output_text)

Multi-turn with Responses API

For multi-turn conversations, include the full conversation history in the input array along with the requesty.auto_cache flag. The router adds cache breakpoints to the largest content blocks so the provider charges reduced rates for the repeated prefix:
# Reuses the `client` created in the previous example.
response = client.responses.create(
    model="anthropic/claude-sonnet-4-5",
    instructions="YOUR ENTIRE KNOWLEDGEBASE",
    input=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What about Germany?"}
    ],
    extra_body={
        "requesty": {
            "auto_cache": True
        }
    }
)
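
Because the full history must be resent on every turn, it can help to keep it in one place. The following is a minimal sketch, not part of Requesty's API: the `Conversation` class and its method names are hypothetical, and it only builds the keyword arguments for `client.responses.create` without making a network call.

```python
from typing import Dict, List


class Conversation:
    """Hypothetical helper that accumulates turns for the Responses API."""

    def __init__(self, model: str, instructions: str) -> None:
        self.model = model
        self.instructions = instructions
        self.turns: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        """Append one turn; role is "user" or "assistant"."""
        self.turns.append({"role": role, "content": content})

    def request_kwargs(self) -> dict:
        """Return keyword arguments carrying the full history plus auto_cache."""
        return {
            "model": self.model,
            "instructions": self.instructions,
            "input": list(self.turns),  # copy so later turns don't mutate it
            "extra_body": {"requesty": {"auto_cache": True}},
        }


conv = Conversation("anthropic/claude-sonnet-4-5", "YOUR ENTIRE KNOWLEDGEBASE")
conv.add("user", "What is the capital of France?")
conv.add("assistant", "The capital of France is Paris.")
conv.add("user", "What about Germany?")
kwargs = conv.request_kwargs()
# With a configured client you would then call:
#   response = client.responses.create(**kwargs)
#   conv.add("assistant", response.output_text)
```

Each request repeats the earlier turns unchanged, which is what lets the provider recognize the cached prefix.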

Auto Cache vs. Response IDs

OpenAI’s Responses API supports a previous_response_id parameter that lets OpenAI store conversation state server-side, so you don’t have to resend the full history. This is an OpenAI-specific feature: it works when routing to OpenAI models through Requesty because OpenAI handles the storage on its end.
For non-OpenAI models (Anthropic, Gemini, etc.), previous_response_id is not available because these providers don’t store responses server-side. Instead, use auto_cache with the full conversation history to get cost savings:
| Approach | How it works | Provider support |
| --- | --- | --- |
| auto_cache | You send full history; provider caches token prefixes and charges less for cache hits | Anthropic, Gemini |
| previous_response_id | Provider stores conversation server-side; you send only the new message | OpenAI only |
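
The choice between the two approaches can be sketched as a function that builds the arguments for a follow-up Responses API call. The function name is hypothetical, and the `openai/` model prefix used to detect OpenAI models is an assumption modeled on the `anthropic/...` naming in the examples on this page; adjust it to the model identifiers you actually use.

```python
from typing import Dict, List, Optional


def build_followup_request(model: str,
                           history: List[Dict[str, str]],
                           new_message: str,
                           previous_response_id: Optional[str] = None) -> dict:
    """Build keyword arguments for a follow-up Responses API call."""
    # Assumption: OpenAI models are addressed with an "openai/" prefix.
    if model.startswith("openai/") and previous_response_id is not None:
        # OpenAI stores the conversation, so only the new message is sent.
        return {
            "model": model,
            "input": new_message,
            "previous_response_id": previous_response_id,
        }
    # Other providers: resend the full history and rely on prompt caching.
    return {
        "model": model,
        "input": history + [{"role": "user", "content": new_message}],
        "extra_body": {"requesty": {"auto_cache": True}},
    }


history = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
anthropic_args = build_followup_request(
    "anthropic/claude-sonnet-4-5", history, "What about Germany?")
openai_args = build_followup_request(
    "openai/some-model", history, "What about Germany?",
    previous_response_id="resp_hypothetical_id")
```

Here `openai/some-model` and `resp_hypothetical_id` are placeholders, not real identifiers.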

Important Notes

  1. Provider Support: The auto_cache flag is respected by providers where cache writes incur extra costs, including Anthropic and Gemini.
  2. Explicit Control: Set auto_cache to true to attempt caching, or to false to prevent caching on providers where cache writes incur extra costs.
  3. Default Behavior: If auto_cache is not specified, caching falls back to defaults based on request origin.
  4. Cost Savings: Cache hits are billed at a fraction of the normal input token cost. This is especially effective for applications with large system prompts or knowledge bases.
  5. Minimum Token Length: The cached prefix must be at least 1,024 tokens for Anthropic (2,048 for Claude 3.5 Haiku). Content shorter than that will not be cached.
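
To get a rough sense of whether a prompt clears the Anthropic minimum noted above, a quick pre-check might look like the sketch below. Both function names are hypothetical, the `claude-3-5-haiku` substring match is an assumed model naming, and the 4-characters-per-token ratio is only a crude heuristic for English text; use a real tokenizer when the result matters.

```python
def min_cacheable_tokens(model: str) -> int:
    """Return the Anthropic minimum prefix length stated in this guide."""
    # Assumption: Claude 3.5 Haiku model names contain "claude-3-5-haiku".
    return 2048 if "claude-3-5-haiku" in model else 1024


def likely_cacheable(text: str, model: str, chars_per_token: float = 4.0) -> bool:
    """Heuristically estimate whether text is long enough to be cached."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens >= min_cacheable_tokens(model)


short_prompt = "You are a helpful assistant."
long_prompt = "knowledge base entry. " * 500  # 11,000 characters

print(likely_cacheable(short_prompt, "anthropic/claude-sonnet-4-5"))
print(likely_cacheable(long_prompt, "anthropic/claude-sonnet-4-5"))
```

The thresholds cover only the Anthropic figures given here; this page does not state a minimum for Gemini.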

Managed Caching

If you want Requesty to manage caching on your behalf, including custom TTL, cache warming, or advanced caching strategies, reach out to [email protected].
Last modified on June 23, 2026