- Access 300+ models from OpenAI, Anthropic, Google, Mistral, and many other providers through one API key.
- Get automatic prompt caching on Anthropic models, reducing cost significantly on multi-turn conversations.
- Track and manage your spend in a single location.
- Apply fallback policies, load balancing, and latency routing to keep your agent responsive.
Prerequisites
- Hermes Agent installed (curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash).
- A Requesty API key from the API Keys page.
Configuration
The recommended setup uses the native Anthropic Messages format (api_mode: anthropic_messages), which enables automatic prompt caching on supported models. This is the optimal configuration for Hermes because its large system prompt and tool definitions benefit heavily from prefix caching across turns.
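As a rough illustration of why prefix caching matters for an agent with a large fixed prompt, the following Python sketch compares billed input volume with and without caching. Every number in it is an assumption chosen for illustration (a 20,000-token prefix, 200 new tokens per turn, cached reads billed at 10% of the normal input price). It ignores history growth and any cache-write surcharge, so treat it as back-of-the-envelope arithmetic rather than a pricing model.

```python
# Illustrative arithmetic only: all figures below are assumptions.
PREFIX_TOKENS = 20_000        # system prompt + tool definitions (assumed size)
NEW_TOKENS_PER_TURN = 200     # fresh user input per turn (assumed)
CACHED_READ_FRACTION = 0.10   # assumed: cached tokens billed at 10% of full price


def input_token_units(turns: int, cached: bool) -> float:
    """Return billed input volume in full-price token units for a conversation.

    Simplifications: conversation history growth and any cache-write
    surcharge are ignored; the first turn always pays full price for the prefix.
    """
    total = 0.0
    for turn in range(turns):
        if cached and turn > 0:
            # Later turns reuse the cached prefix at the discounted rate.
            prefix_cost = PREFIX_TOKENS * CACHED_READ_FRACTION
        else:
            prefix_cost = float(PREFIX_TOKENS)
        total += prefix_cost + NEW_TOKENS_PER_TURN
    return total


uncached = input_token_units(turns=10, cached=False)
with_cache = input_token_units(turns=10, cached=True)
savings = 1 - with_cache / uncached
print(f"uncached={uncached:.0f} cached={with_cache:.0f} savings={savings:.0%}")
```

Under these assumptions, a ten-turn conversation bills about 80% less input volume with caching. The saving grows with the number of turns and with the size of the prefix relative to the new content.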
Create or replace your global Hermes config:
The api_mode: anthropic_messages setting tells Hermes to use the native Anthropic Messages API format. This is what enables Requesty’s automatic prompt caching, which can reduce costs by up to 90% on long conversations by caching the system prompt and tool definitions between turns.
Why anthropic_messages matters
Hermes sends a large system prompt (often 20,000+ tokens including tool definitions) on every turn. With the OpenAI chat completions format, this entire prompt is re-processed from scratch each time. With the native Anthropic Messages format, Requesty automatically applies cache control breakpoints so subsequent turns in the same conversation reuse the cached prefix, paying only for new user messages and responses.
Model selection
You can use any model from the Model Library. Set the default in config.yaml or switch mid-session:
Or use the --model flag:
Recommended: use a Routing Policy
Instead of hard-coding a model, point Hermes at a Routing Policy. A policy is a named alias that resolves on the Requesty side, so you can swap the underlying model from the Routing Policies page without touching your config.
- Fallback Policy for reliability. If your primary model is down, Requesty retries the next in the chain.
- Latency Routing for speed. Requesty picks whichever provider is currently fastest.
- Load Balancing for gradual rollouts between models.
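To make the fallback and latency ideas concrete, here is a conceptual Python sketch of what such policies do. It is not Requesty's implementation: the real routing happens on the Requesty side, and every name here (call_with_fallback, pick_fastest, fake_send, the model and provider IDs) is hypothetical.

```python
from typing import Callable, Sequence


def call_with_fallback(
    models: Sequence[str], send: Callable[[str], str]
) -> tuple[str, str]:
    """Try each model in order; return (model, reply) from the first success."""
    errors = []
    for model in models:
        try:
            return model, send(model)
        except Exception as exc:  # a real router would retry only on retryable failures
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models in the chain failed: " + "; ".join(errors))


def pick_fastest(latencies_ms: dict[str, float]) -> str:
    """Return the provider with the lowest recently observed latency."""
    return min(latencies_ms, key=lambda provider: latencies_ms[provider])


def fake_send(model: str) -> str:
    """Hypothetical stand-in for a network call; the primary model is 'down'."""
    if model == "primary/model-a":
        raise ConnectionError("provider unavailable")
    return f"reply from {model}"


used, reply = call_with_fallback(["primary/model-a", "backup/model-b"], fake_send)
print(used, "->", reply)
print(pick_fastest({"provider-x": 840.0, "provider-y": 310.0}))
```

Because the policy resolves on the Requesty side, none of this logic needs to live in the agent itself. The sketch only shows the behavior you are delegating.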
EU routing
To pin all traffic to the EU region:
Verifying the integration
Start Hermes and send any message:
Requests from Hermes include HTTP-Referer: https://hermes-agent.nousresearch.com in the request metadata.
Troubleshooting
403 Invalid authorization token
Caching not working (cached_tokens stays at 0)
Confirm you are using api_mode: anthropic_messages in your config. The OpenAI chat completions format does not support automatic prompt caching. Also ensure you are using an Anthropic model (Claude family), as caching is provider-specific.
Model not found
Check that the model ID format is correct (provider/model-name) and that the model is available in the Model Library. If your organization uses approved models, ensure the model is on the approved list.
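A quick local check of the provider/model-name shape can catch typos before a request is sent. The Python sketch below is only a format check under a simple assumption (one provider segment, a slash, a non-empty model name, no whitespace). It does not confirm that the model exists in the Model Library, the helper name split_model_id is hypothetical, and the ID used is a placeholder rather than a real model.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/model-name' into its two parts, or raise ValueError.

    Assumed rule for illustration: text before the first slash is the
    provider, everything after it is the model name, and neither part
    may be empty or contain whitespace.
    """
    cleaned = model_id.strip()
    provider, sep, name = cleaned.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected provider/model-name, got {model_id!r}")
    if any(ch.isspace() for ch in cleaned):
        raise ValueError(f"model ID must not contain whitespace: {model_id!r}")
    return provider, name


# Placeholder ID for illustration only.
print(split_model_id("example-provider/example-model"))
```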
Connection timeout
Hermes defaults to a 120-second read timeout. For long-running requests on reasoning models, you can increase it in your provider config. See the Hermes configuration docs for timeout settings.