Latency-Based Routing automatically selects the fastest model for each request based on real-time performance data. Requesty continuously monitors response times and routes to the lowest-latency option.
Enable latency routing in the Requesty Console.

How It Works

1. Track latency

Requesty tracks latency for every model in your policy.

2. Sort by speed

When a request arrives, the router sorts models by speed (fastest first).

3. Route to fastest

Your request goes to the currently fastest model. Latency data updates in real time.
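
The three steps above can be sketched in a few lines of Python. This is an illustrative model of the described behavior, not Requesty's implementation: the function names `sort_by_speed` and `route` are hypothetical, and the latency numbers are made up for the example.

```python
from typing import Dict, List


def sort_by_speed(avg_latency_ms: Dict[str, float]) -> List[str]:
    """Return model names ordered fastest first (lowest average latency)."""
    return sorted(avg_latency_ms, key=avg_latency_ms.get)


def route(avg_latency_ms: Dict[str, float]) -> str:
    """Pick the currently fastest model in the policy."""
    if not avg_latency_ms:
        raise ValueError("policy has no models")
    return sort_by_speed(avg_latency_ms)[0]


# Illustrative latencies in milliseconds, not measured values.
latencies = {
    "anthropic/claude-sonnet-4-5": 420.0,
    "bedrock/claude-sonnet-4-5@eu-central-1": 310.0,
}
fastest = route(latencies)
```

With these sample numbers, `fastest` is the Bedrock endpoint because it has the lower average latency. If the numbers change between requests, the same call returns a different model.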

Benefits

Fastest Responses

Always use the quickest model available.

Automatic Adaptation

The router adjusts when model performance changes.

No Manual Tuning

Latency optimization happens automatically.

Regional Optimization

Automatically prefer nearby endpoints.

Creating a Latency-Based Policy

1. Create the Policy

Go to Routing Policies, click Create Policy, and select Latency as the policy type.

2. Select Models

Add your models. For example:

| Model | Description |
| --- | --- |
| anthropic/claude-sonnet-4-5 | Direct API access |
| bedrock/claude-sonnet-4-5@eu-central-1 | Regional endpoint |

The router will automatically choose whichever is faster at request time.

3. Use the Policy in Your Code

Reference the policy in your model parameter:
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}]
)

How Latency Tracking Works

Requesty measures time-to-first-token (TTFT) for streaming requests and total response time for non-streaming requests.

| Metric | Description |
| --- | --- |
| Streaming | Time from request sent to first token received |
| Non-streaming | Time from request sent to complete response |
| Scope | Per-model, organization-scoped (your traffic, not global) |
| Window | Rolling average of recent requests (last ~1 hour) |

Models with no recent latency data are tried occasionally to gather performance metrics. After 5 to 10 requests, the router has enough data for optimal routing.
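
To make the rolling window concrete, here is a minimal sketch of a per-model tracker that averages samples from the last hour. The class name `RollingLatency`, its methods, and the exact one-hour cutoff are assumptions for illustration; how Requesty actually stores and weights samples is not documented here.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingLatency:
    """Average latency over samples recorded within the last `window_s` seconds.

    Assumes samples are recorded in chronological order.
    """

    def __init__(self, window_s: float = 3600.0) -> None:
        self.window_s = window_s
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, latency_ms)

    def record(self, timestamp: float, latency_ms: float) -> None:
        self._samples.append((timestamp, latency_ms))

    def average(self, now: float) -> Optional[float]:
        # Drop samples that have aged out of the window.
        while self._samples and now - self._samples[0][0] > self.window_s:
            self._samples.popleft()
        if not self._samples:
            return None  # no recent data for this model
        return sum(ms for _, ms in self._samples) / len(self._samples)


tracker = RollingLatency()
tracker.record(timestamp=0.0, latency_ms=500.0)
tracker.record(timestamp=3000.0, latency_ms=300.0)
tracker.record(timestamp=3500.0, latency_ms=100.0)
avg_early = tracker.average(now=3500.0)  # all three samples are inside the window
avg_late = tracker.average(now=4000.0)   # the first sample is now older than one hour
```

A `None` result corresponds to a model with no recent latency data, which is the case the router handles by trying the model occasionally.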

Key Selection Strategies

For each model, you can configure which API key to try first:

| Strategy | Description |
| --- | --- |
| Requesty provided key | Use Requesty’s managed keys only (default) |
| My own key | Use your BYOK credentials only |
| Requesty first, then BYOK | Try Requesty’s key first, fall back to BYOK |
| BYOK first, then Requesty | Try your key first, fall back to Requesty |
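
One way to read the four strategies is as an ordered list of key sources to try. The sketch below encodes that reading; the `KeyStrategy` enum, the `key_order` function, and the `"requesty"` / `"byok"` labels are hypothetical names for illustration, not Requesty configuration values.

```python
from enum import Enum
from typing import List


class KeyStrategy(Enum):
    REQUESTY_ONLY = "Requesty provided key"
    BYOK_ONLY = "My own key"
    REQUESTY_THEN_BYOK = "Requesty first, then BYOK"
    BYOK_THEN_REQUESTY = "BYOK first, then Requesty"


def key_order(strategy: KeyStrategy) -> List[str]:
    """Return the key sources to try, in order, for a given strategy."""
    order = {
        KeyStrategy.REQUESTY_ONLY: ["requesty"],
        KeyStrategy.BYOK_ONLY: ["byok"],
        KeyStrategy.REQUESTY_THEN_BYOK: ["requesty", "byok"],
        KeyStrategy.BYOK_THEN_REQUESTY: ["byok", "requesty"],
    }
    return order[strategy]
```

The two single-source strategies have no fallback, so a failure with that key is not retried with the other.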

Use Cases

Route to the fastest regional endpoint:

| Model | Region |
| --- | --- |
| anthropic/claude-sonnet-4-5 | Global |
| bedrock/claude-sonnet-4-5@us-east-1 | US East |
| bedrock/claude-sonnet-4-5@eu-central-1 | Europe |
| bedrock/claude-sonnet-4-5@ap-southeast-1 | Asia Pacific |

Users in Europe are automatically routed to eu-central-1, and users in Asia to ap-southeast-1.
Let the router pick the fastest provider:

| Model |
| --- |
| openai/gpt-5.2 |
| anthropic/claude-sonnet-4-5 |
| google/gemini-2.5-pro |

If OpenAI is experiencing slowdowns, traffic shifts to Anthropic or Google automatically.
Combine similar-priced models and route to fastest:

| Model |
| --- |
| openai/gpt-4o-mini |
| anthropic/claude-3-5-haiku |
| google/gemini-1.5-flash |

All three are low-cost. Requesty picks whichever responds fastest.

Combining with Other Policies

Latency routing works great with load balancing and fallback.

| Combination | How It Works |
| --- | --- |
| Latency + Load Balancing | Each sub-policy uses latency routing, parent policy does A/B testing |
| Latency + Fallback | Try latency-optimized policy first, fall back to known-good model if all fail |

Monitoring Latency

1. Open Performance Monitoring

2. Review metrics

View time-to-first-token and total latency by model. See how latency routing distributes traffic.

FAQ

Models without recent data are assigned max latency. They will be tried occasionally (~5 to 10% of traffic) to gather data. Once they have metrics, they compete fairly.
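
The behavior for unmeasured models can be sketched as follows. This is an assumption-laden illustration: the function `choose_model`, the `explore_rate` parameter, and the use of infinity as "max latency" are hypothetical, and the 5% default is just the low end of the approximate range quoted above.

```python
import math
import random
from typing import Dict, Optional


def choose_model(
    avg_latency_ms: Dict[str, Optional[float]],
    explore_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the fastest measured model, occasionally trying an unmeasured one.

    A latency of None means the model has no recent data.
    """
    rng = rng or random.Random()
    unmeasured = [m for m, ms in avg_latency_ms.items() if ms is None]
    # Occasionally send traffic to a model without data to gather metrics.
    if unmeasured and rng.random() < explore_rate:
        return rng.choice(unmeasured)

    def effective_latency(model: str) -> float:
        ms = avg_latency_ms[model]
        # Unmeasured models are treated as having the maximum latency.
        return math.inf if ms is None else ms

    return min(avg_latency_ms, key=effective_latency)


policy = {
    "openai/gpt-5.2": 400.0,               # illustrative value
    "anthropic/claude-sonnet-4-5": 250.0,  # illustrative value
    "google/gemini-2.5-pro": None,         # no recent data yet
}
```

Setting `explore_rate=0.0` always returns the fastest measured model, while `explore_rate=1.0` always returns an unmeasured one. Those two extremes are useful for checking the logic deterministically.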
Latency routing does not take cost into account; it only considers speed. If you want cost optimization, use load balancing to prefer cheaper models, or manually order a fallback chain by price.
You can bypass latency routing for individual requests. Instead of using the latency policy, pass a direct model name (e.g., openai/gpt-5.2) for requests where you need a specific model.
Latency data is updated continuously: metrics are refreshed after every request. The router uses a rolling average of recent requests to smooth out spikes.
Latency routing tries models in speed order. If the fastest model fails, it tries the second-fastest, and so on.
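
The failover order described here can be illustrated with a short sketch. The router performs this on its side, so client code does not need to implement it; the function `call_with_failover` and the `fake_send` stand-in are hypothetical, and the latencies are made-up values.

```python
from typing import Callable, Dict, List


def call_with_failover(
    avg_latency_ms: Dict[str, float],
    send: Callable[[str], str],
) -> str:
    """Try models fastest first and return the first successful response."""
    errors: List[str] = []
    for model in sorted(avg_latency_ms, key=avg_latency_ms.get):
        try:
            return send(model)
        except Exception as exc:  # any failure moves on to the next-fastest model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


attempted: List[str] = []


def fake_send(model: str) -> str:
    """Stand-in for a real request; the fastest model is made to fail."""
    attempted.append(model)
    if model == "openai/gpt-5.2":
        raise TimeoutError("simulated failure")
    return f"handled by {model}"


result = call_with_failover(
    {
        "openai/gpt-5.2": 200.0,
        "anthropic/claude-sonnet-4-5": 260.0,
        "google/gemini-2.5-pro": 340.0,
    },
    fake_send,
)
```

In this run the fastest model fails, so the second-fastest handles the request and the third is never tried.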
You can see which model handled each request by checking the response headers or the request logs in Analytics.
Unlike load balancing, latency routing does not guarantee the same user gets the same model. If your use case requires consistency, use load balancing with trace_id instead.
Last modified on May 26, 2026