If you’re using the Anthropic SDK directly against Requesty’s /v1/messages endpoint, you can place cache_control breakpoints on individual content blocks, exactly like the native Anthropic prompt caching API. This gives you fine-grained control over which parts of your request are cached.
Prompt caching in a multi-turn conversation: a cache breakpoint is set on the latest user message each turn, and the cached prefix is reused on the following turn.
View cache analytics in the Requesty Console.
cache_control can be placed on any content block type: system text, user text, images, documents, tool definitions, tool_use, and tool_result.
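The examples below place breakpoints on text and tool blocks. Image and document blocks take the same cache_control key at the block level. The sketch below builds a user turn whose PDF document block is marked for caching. The helper names with_breakpoint and pdf_user_turn, and the file name report.pdf, are hypothetical. The block shapes follow Anthropic's Messages API.

```python
import base64
import copy


def with_breakpoint(block: dict) -> dict:
    """Return a copy of a content block with an ephemeral cache breakpoint (hypothetical helper)."""
    marked = copy.deepcopy(block)
    marked["cache_control"] = {"type": "ephemeral"}
    return marked


def pdf_user_turn(pdf_bytes: bytes, question: str) -> dict:
    """Build a user message whose PDF document block ends the cached prefix (hypothetical helper)."""
    document = {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
        },
    }
    return {
        "role": "user",
        "content": [
            # Breakpoint on the large, stable document: it is cached, the short question after it is not
            with_breakpoint(document),
            {"type": "text", "text": question},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice read a real PDF, e.g. open("report.pdf", "rb").read()
    turn = pdf_user_turn(b"%PDF-1.4 placeholder", "Summarize the key findings.")
    print([block["type"] for block in turn["content"]])
```

Pass the returned message inside messages=[...] to client.messages.create. An image block is marked the same way, with "type": "image" and a media_type such as image/png in its source.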

Python

import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_REQUESTY_API_KEY",
    base_url="https://router.requesty.ai",
)

response = client.messages.create(
    model="anthropic/claude-sonnet-4-20250514",
    max_tokens=4096,
    # Each cache_control marker is a breakpoint: the prefix up to and including that block is cached
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant.",
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": "Here is the full codebase:\n<codebase>... (large content) ...</codebase>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
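    # Tools come first in the cached prefix; a breakpoint on the last tool caches the whole tool schema
    tools=[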
        {
            "name": "search_code",
            "description": "Search the codebase for a pattern",
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            },
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "How does the authentication middleware work?"
                }
            ]
        }
    ],
)

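# The reply can contain tool_use blocks as well as text, so print only the text blocks
for block in response.content:
    if block.type == "text":
        print(block.text)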

# Cache usage is reported in the response:
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "YOUR_REQUESTY_API_KEY",
  baseURL: "https://router.requesty.ai",
});

const response = await client.messages.create({
  model: "anthropic/claude-sonnet-4-20250514",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: "You are a helpful coding assistant.",
      cache_control: { type: "ephemeral" },
    },
    {
      type: "text",
      text: "Here is the full codebase:\n<codebase>... (large content) ...</codebase>",
      cache_control: { type: "ephemeral" },
    },
  ],
  tools: [
    {
      name: "search_code",
      description: "Search the codebase for a pattern",
      input_schema: {
        type: "object" as const,
        properties: {
          query: { type: "string" },
        },
        required: ["query"],
      },
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "How does the authentication middleware work?",
        },
      ],
    },
  ],
});

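// The reply can contain tool_use blocks as well as text, so print only the text blocks
for (const block of response.content) {
  if (block.type === "text") {
    console.log(block.text);
  }
}

// Cache usage is reported in the response:
console.log(`Cache creation tokens: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read tokens: ${response.usage.cache_read_input_tokens}`);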
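The Python example above prints the two cache counters from response.usage. Anthropic reports input in three buckets: input_tokens counts only the tokens after the last breakpoint, while cache_creation_input_tokens and cache_read_input_tokens count the prefix that was written to or read from the cache. Their sum is the full prompt size. The helper below is a hypothetical sketch, not part of the SDK, that turns a usage object into a cache hit rate you could log next to the Requesty Console analytics.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    total_input_tokens: int
    cache_write_tokens: int
    cache_read_tokens: int

    @property
    def hit_rate(self) -> float:
        """Fraction of the prompt served from cache (0.0 for an empty prompt)."""
        if self.total_input_tokens == 0:
            return 0.0
        return self.cache_read_tokens / self.total_input_tokens


def cache_stats(usage) -> CacheStats:
    """Summarize an Anthropic-style usage object (hypothetical helper).

    The cache fields can be None when caching was not involved, so None is treated as 0.
    """
    written = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    uncached = usage.input_tokens or 0
    return CacheStats(
        total_input_tokens=uncached + written + read,
        cache_write_tokens=written,
        cache_read_tokens=read,
    )


if __name__ == "__main__":
    # Stand-in for response.usage; in practice pass the object returned by client.messages.create
    fake_usage = SimpleNamespace(
        input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=1950
    )
    stats = cache_stats(fake_usage)
    print(f"{stats.hit_rate:.1%} of {stats.total_input_tokens} input tokens read from cache")
```

On the first request of a conversation expect a hit rate near zero and a large cache_write_tokens value; later turns within the cache lifetime should shift those tokens into cache_read_tokens.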

Multi-turn with tool results

Set the cache breakpoint on the latest user message. Everything up to and including that message is cached and reused on the next turn, so only the content added after the previous breakpoint is processed without the cache. In agentic flows with tool calls, place cache_control on the last content block of the newest turn to cache the conversation prefix:
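# Reuses the client from the first example
response = client.messages.create(
    model="anthropic/claude-sonnet-4-20250514",
    max_tokens=8192,
    # Requests that contain tool_use or tool_result blocks must also define the tools
    tools=[
        {
            "name": "search_code",
            "description": "Search the codebase for a pattern",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]
            }
        }
    ],
    system=[
        {
            "type": "text",
            "text": "You are an AI coding assistant.",
            # Tools precede system in the cached prefix, so this breakpoint covers both
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Find all usages of the deprecated API."},
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "I'll search for the deprecated API usage."},
                {
                    "type": "tool_use",
                    "id": "toolu_01ABC",
                    "name": "search_code",
                    "input": {"query": "deprecated_api"},
                    # Breakpoint at the end of the assistant turn
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {
            "role": "user",
            "content": [
                # tool_result blocks must come first in the user turn that answers a tool_use
                {
                    "type": "tool_result",
                    "tool_use_id": "toolu_01ABC",
                    "content": "Found 3 usages in src/api.py, src/handler.py, src/utils.py"
                },
                {
                    "type": "text",
                    "text": "Now refactor all of them to use the new API.",
                    # Breakpoint on the last block of the newest turn caches the whole conversation so far
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        }
    ],
)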
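In an agent loop the conversation grows on every request, so rather than adding a marker each turn, you can strip the old message-level breakpoints and put a single one on the last content block of the newest message before each call. The sketch below does that on plain message dicts; move_breakpoint is a hypothetical helper, and it leaves breakpoints on tools and system blocks alone. It assumes Anthropic's documented lookback, in which the API also checks earlier block boundaries (roughly the previous 20 blocks) for cache hits, so the prefix cached on the previous turn can still be read after its marker is removed.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def move_breakpoint(messages: list) -> list:
    """Return a copy of messages with one breakpoint, on the newest content block.

    Hypothetical helper: older message-level cache_control markers are removed so the
    request stays within Anthropic's limit of four breakpoints per request.
    """
    updated = copy.deepcopy(messages)
    if not updated:
        return updated
    for message in updated:
        content = message["content"]
        if isinstance(content, list):
            for block in content:
                block.pop("cache_control", None)
    last = updated[-1]
    if isinstance(last["content"], str):
        # String content has no block to carry cache_control, so expand it into a text block
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = dict(EPHEMERAL)
    return updated


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Find all usages of the deprecated API."},
        {"role": "assistant", "content": [
            {"type": "tool_use", "id": "toolu_01ABC", "name": "search_code",
             "input": {"query": "deprecated_api"}, "cache_control": dict(EPHEMERAL)},
        ]},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": "toolu_01ABC", "content": "Found 3 usages"},
        ]},
    ]
    # Pass the result as messages=... to client.messages.create on the next turn
    updated = move_breakpoint(history)
    markers = [
        block
        for message in updated if isinstance(message["content"], list)
        for block in message["content"] if "cache_control" in block
    ]
    print(len(markers), markers[0]["type"])  # 1 tool_result
```

The original list is deep-copied rather than mutated, so you can keep appending to your own history and call the helper just before each request.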

Best practices

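  • Place breakpoints on both the first and the last system prompt block. Each breakpoint caches the prefix that ends at it, so the leading instructions can still be read from cache when a later system block changes.
  • Place a breakpoint on the last tool definition so your tool schema is cached.
  • In multi-turn conversations, move the breakpoint to the last content block of the newest turn on each request to incrementally cache the conversation history. Anthropic accepts at most four cache_control breakpoints per request, so remove older message-level breakpoints instead of adding a new one every turn.
  • The cached prefix must be at least 1,024 tokens for most Anthropic models (2,048 for Claude 3.5 Haiku and Claude 3 Haiku). Shorter prefixes are processed without caching, even when marked with cache_control.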
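Because breakpoints can sit on tools, system blocks, and message blocks, combining these practices can push a request past Anthropic's limit of four breakpoints. A pre-flight check, written here as hypothetical count_breakpoints and check_breakpoints helpers over the same keyword arguments you would pass to client.messages.create, counts the markers before the request is sent. It only inspects top-level content blocks, not blocks nested inside a tool_result.

```python
from typing import Any, Iterable

MAX_BREAKPOINTS = 4  # Anthropic's per-request limit on cache_control markers


def _blocks(value: Any) -> Iterable[dict]:
    """Yield content blocks from a system or content value (plain strings have none)."""
    if isinstance(value, list):
        for block in value:
            if isinstance(block, dict):
                yield block


def count_breakpoints(request: dict) -> int:
    """Count cache_control markers in tools, system, and messages (hypothetical helper)."""
    count = sum("cache_control" in tool for tool in request.get("tools", []))
    count += sum("cache_control" in block for block in _blocks(request.get("system")))
    for message in request.get("messages", []):
        count += sum("cache_control" in block for block in _blocks(message.get("content")))
    return count


def check_breakpoints(request: dict) -> None:
    """Raise before sending if the request would exceed the breakpoint limit."""
    n = count_breakpoints(request)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints found; at most {MAX_BREAKPOINTS} are allowed")


if __name__ == "__main__":
    request = {
        "model": "anthropic/claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [{"type": "text", "text": "You are a helpful coding assistant.",
                    "cache_control": {"type": "ephemeral"}}],
        "messages": [{"role": "user", "content": "Hello"}],
    }
    check_breakpoints(request)
    print(count_breakpoints(request))
    # response = client.messages.create(**request)
```

Building the request as a dict and unpacking it with ** keeps the checked payload and the sent payload identical.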
Last modified on June 5, 2026