Manual Caching for Anthropic Models

The router lets you control Anthropic prompt caching manually through cache_control blocks in message content. You can mark specific portions of a prompt for caching and set a custom TTL (time-to-live), which gives you precise control over what is cached and for how long.

View cache analytics in the Requesty Console.

How Manual Caching Works

Manual caching uses cache_control blocks embedded within message content to explicitly specify which portions of your prompt should be cached. This approach gives you fine-grained control over:

What gets cached: Mark specific content blocks for caching
Cache duration: Set custom TTL (time-to-live) values

Content Structure with cache_control

To use manual caching, structure your message content as an array of content blocks, and add a cache_control field to each block you want cached:

{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": "Your prompt content here...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "1h"
      }
    }
  ]
}
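
As a minimal sketch of the structure above, the helpers below build a message whose content array mixes a cached block with an uncached one. The function names cached_text_block and plain_text_block are hypothetical, introduced here for illustration only; the field names and values come from the JSON example.

```python
from typing import Any, Dict


def plain_text_block(text: str) -> Dict[str, Any]:
    """Build a text content block with no cache_control (not marked for caching)."""
    return {"type": "text", "text": text}


def cached_text_block(text: str, ttl: str = "5m") -> Dict[str, Any]:
    """Build a text content block marked for caching (hypothetical helper).

    The cache_control shape mirrors the JSON example: type "ephemeral" plus a ttl.
    """
    return {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    }


# Only the first block carries cache_control; the second is sent as usual.
message = {
    "role": "system",
    "content": [
        cached_text_block("Your prompt content here...", ttl="1h"),
        plain_text_block("Short instructions that change often."),
    ],
}
```

Keeping the block construction in one place makes it harder to misspell cache_control or forget the type field when several messages need caching.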

TTL Field

The ttl (time-to-live) field specifies how long the content remains cached. The default value is 5m. A value of 1h is also available, but only with Anthropic and Vertex; other providers do not support it.
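
One way to apply this rule on the client side is sketched below: a helper that falls back to 5m when 1h is requested for a provider that does not support it. This is an illustration, not router behavior. It assumes the provider can be read from the prefix of the model ID (as in anthropic/claude-3-7-sonnet-latest), and the "vertex" prefix, the resolve_ttl name, and the "other-provider/example-model" ID are all hypothetical.

```python
SUPPORTED_TTLS = ("5m", "1h")

# Assumption: provider labels as they might appear as model ID prefixes.
LONG_TTL_PROVIDERS = {"anthropic", "vertex"}


def resolve_ttl(model: str, requested: str = "5m") -> str:
    """Return a ttl that the provider implied by `model` should accept.

    Falls back to the default "5m" when "1h" is requested for a provider
    outside LONG_TTL_PROVIDERS. Raises ValueError for any other ttl value.
    """
    if requested not in SUPPORTED_TTLS:
        raise ValueError(f"Unsupported ttl {requested!r}; expected one of {SUPPORTED_TTLS}")
    provider = model.split("/", 1)[0]
    if requested == "1h" and provider not in LONG_TTL_PROVIDERS:
        return "5m"
    return requested


ttl_anthropic = resolve_ttl("anthropic/claude-3-7-sonnet-latest", "1h")
ttl_other = resolve_ttl("other-provider/example-model", "1h")  # hypothetical model ID
```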

Example with Manual Caching

This example demonstrates how to use manual caching with the OpenAI Python client to cache a large knowledge base.

Python

import openai

requesty_api_key = "YOUR_REQUESTY_API_KEY"  # Placeholder: load your real key from a secure source

client = openai.OpenAI(
    api_key=requesty_api_key,
    base_url="https://router.requesty.ai/v1",
)

# Large system prompt to be cached
knowledgebase = "YOUR ENTIRE KNOWLEDGEBASE..."  # Replace with your actual long prompt

# First request - this will cache the system prompt
system_message = {
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": f"Your role is to answer questions based on: {knowledgebase}",
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }
    ]
}

response1 = client.chat.completions.create(
    model="anthropic/claude-3-7-sonnet-latest",
    messages=[
        system_message,
        {"role": "user", "content": "What is covered in the knowledge base?"}
    ]
)

print("Response:", response1.choices[0].message.content)
print("Caching tokens (first request):", response1.usage.prompt_tokens_details.caching_tokens)
print("Cached tokens (first request):", response1.usage.prompt_tokens_details.cached_tokens)

# Second request - this will use the cached system prompt
response2 = client.chat.completions.create(
    model="anthropic/claude-3-7-sonnet-latest",
    messages=[
        system_message,
        {"role": "user", "content": "Tell me more details."}
    ]
)

print("Response:", response2.choices[0].message.content)
print("Caching tokens (second request):", response2.usage.prompt_tokens_details.caching_tokens)
print("Cached tokens (second request):", response2.usage.prompt_tokens_details.cached_tokens)
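
The two requests above print the same pair of counters, so a small reader function can keep that logic in one place. The sketch below is hypothetical (cache_usage is not part of any client library) and assumes the counters may be absent or None on some responses, in which case it reports 0. The SimpleNamespace object is only a stand-in for a real response.usage, and its numbers are illustrative, not measured.

```python
from types import SimpleNamespace
from typing import Any, Dict


def cache_usage(usage: Any) -> Dict[str, int]:
    """Read the cache counters from a usage object, defaulting to 0 when missing."""
    details = getattr(usage, "prompt_tokens_details", None)
    return {
        "caching_tokens": getattr(details, "caching_tokens", 0) or 0,
        "cached_tokens": getattr(details, "cached_tokens", 0) or 0,
    }


# Stand-in usage objects with illustrative values (not real measurements).
first_usage = SimpleNamespace(
    prompt_tokens_details=SimpleNamespace(caching_tokens=1200, cached_tokens=0)
)
second_usage = SimpleNamespace(
    prompt_tokens_details=SimpleNamespace(caching_tokens=0, cached_tokens=1200)
)
no_details = SimpleNamespace()

first = cache_usage(first_usage)
second = cache_usage(second_usage)
empty = cache_usage(no_details)
```

With a real client, the same function would be called as cache_usage(response1.usage) and cache_usage(response2.usage).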

JavaScript

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: "YOUR_REQUESTY_API_KEY",
  baseURL: "https://router.requesty.ai/v1",
});

const knowledgebase = "YOUR ENTIRE KNOWLEDGEBASE...";

// First request - caches the system prompt
const systemMessage = {
  role: "system",
  content: [
    {
      type: "text",
      text: `Your role is to answer questions based on: ${knowledgebase}`,
      cache_control: { type: "ephemeral", ttl: "1h" }
    }
  ]
};

const response1 = await client.chat.completions.create({
  model: "anthropic/claude-3-7-sonnet-latest",
  messages: [
    systemMessage,
    { role: "user", content: "What is covered in the knowledge base?" }
  ]
});

console.log("Response:", response1.choices[0].message.content);
console.log("Caching tokens:", response1.usage.prompt_tokens_details.caching_tokens);

// Second request - uses cached system prompt
const response2 = await client.chat.completions.create({
  model: "anthropic/claude-3-7-sonnet-latest",
  messages: [
    systemMessage,
    { role: "user", content: "Tell me more details." }
  ]
});

console.log("Response:", response2.choices[0].message.content);
console.log("Cached tokens:", response2.usage.prompt_tokens_details.cached_tokens);
