Latency-Based Routing - Requesty Docs

Latency-Based Routing automatically selects the fastest model for each request based on real-time performance data. Requesty continuously monitors response times and routes to the lowest-latency option.

Latency-based routing: a user request reaches the Requesty router, which evaluates real-time performance across providers and delivers the response via the fastest path.

Enable latency routing in the Requesty Console.

How It Works

Measure live performance

Requesty continuously measures how every model in your policy is performing right now, across all traffic flowing through the router. Recent requests count more than older ones, so the picture reflects current conditions rather than yesterday’s averages.

Score each candidate

When a request arrives, the router scores every model in your policy. The score accounts for how fast each one starts responding, how quickly it generates the rest of the output, and how much of each it has actually observed recently. Models with little recent data are scored optimistically so they still get tried.

Route to the fastest, then fall back

The router orders candidates fastest-first and sends your request to the top one. If it fails, the request automatically falls through to the next-fastest, and so on down the list.

The result: traffic continuously shifts toward whatever is fastest at the moment, with no manual tuning, and it adapts within minutes when a provider slows down or recovers.

Benefits

Fastest Responses

Always use the quickest model available.

Automatic Adaptation

Router adjusts when model performance changes.

No Manual Tuning

Latency optimization happens automatically.

Regional Optimization

Automatically prefer nearby endpoints.

Creating a Latency-Based Policy

Create the Policy

Go to Routing Policies, click Create Policy, and select Latency as the policy type.

Select Models

Add your models. For example:

Model	Description
`anthropic/claude-sonnet-4-5`	Direct API access
`bedrock/claude-sonnet-4-5@eu-central-1`	Regional endpoint

The router will automatically choose whichever is faster at request time.

Use the Policy in Your Code

Reference the policy in your model parameter:

from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}]
)

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://router.requesty.ai/v1',
  apiKey: 'your-requesty-api-key'
});

const response = await client.chat.completions.create({
  model: 'policy/fastest-sonnet',
  messages: [{ role: 'user', content: 'Hello!' }]
});

curl https://router.requesty.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-requesty-api-key" \
  -d '{
    "model": "policy/fastest-sonnet",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

How Latency Tracking Works

Requesty measures both time-to-first-token (how fast a model starts responding) and generation speed (how fast it produces the rest of the output), so a model that starts quickly but generates slowly does not get an unfair advantage.

Metric	Description
Time-to-first-token	Time from request sent to first token received
Generation speed	How quickly tokens are produced after the first one
Request size	Performance is tracked separately for small and large requests, since they behave differently
Scope	Measured across all traffic on the router, so even your first request benefits from data others have already generated
Window	A rolling window of recent requests (about the last hour), weighting newer requests more heavily

Models with little recent latency data are still tried from time to time so the router can learn how they perform. After a handful of requests, it has enough signal to route them accurately.

Key Selection Strategies

For each model, you can configure which API key to try first:

Strategy	Description
Requesty provided key	Use Requesty’s managed keys only (default)
My own key	Use your BYOK credentials only
Requesty first, then BYOK	Try Requesty’s key first, fall back to BYOK
BYOK first, then Requesty	Try your key first, fall back to Requesty

Use Cases

Regional Optimization

Route to the fastest regional endpoint:

Model	Region
`anthropic/claude-sonnet-4-5`	Global
`bedrock/claude-sonnet-4-5@us-east-1`	US East
`bedrock/claude-sonnet-4-5@eu-central-1`	Europe
`bedrock/claude-sonnet-4-5@ap-southeast-1`	Asia Pacific

Users in Europe automatically get eu-central-1, users in Asia get ap-southeast-1.

Provider Performance

Let the router pick the fastest provider:

Model
`openai/gpt-5.2`
`anthropic/claude-sonnet-4-5`
`google/gemini-2.5-pro`

If OpenAI is experiencing slowdowns, traffic shifts to Anthropic or Google automatically.

Cost + Speed Optimization

Combine similar-priced models and route to fastest:

Model
`openai/gpt-4o-mini`
`anthropic/claude-3-5-haiku`
`google/gemini-1.5-flash`

All three are low-cost. Requesty picks whichever responds fastest.

Combining with Other Policies

Latency routing works great with load balancing and fallback.

Combination	How It Works
Latency + Load Balancing	Each sub-policy uses latency routing, parent policy does A/B testing
Latency + Fallback	Try latency-optimized policy first, fall back to known-good model if all fail

Monitoring Latency

Open Performance Monitoring

Go to Performance Monitoring.

Review metrics

View time-to-first-token and total latency by model. See how latency routing distributes traffic.

FAQ

What if a model has no latency data?

Models without recent data are scored optimistically, so they get tried from time to time to gather performance signal. Once they have measurements, they compete fairly against everything else in the policy.

Does latency routing consider cost?

No. Latency routing only considers speed. If you want cost optimization, use load balancing to prefer cheaper models, or manually order a fallback chain by price.

Can I force a specific model for some requests?

Yes. Instead of using the latency policy, pass a direct model name (e.g., openai/gpt-5.2) for requests where you need a specific model.

How often does latency data update?

Continuously. Latency metrics are updated after every request. The router uses a rolling average of recent requests to smooth out spikes.

What happens if the fastest model fails?

Latency routing tries models in speed order. If the fastest model fails, it tries the second-fastest, and so on.

Can I see which model was selected?

Yes. Check the response headers or request logs in Analytics. You will see which model handled each request.

Unlike load balancing, latency routing does not pin a user to one model. If you want the same conversation to keep hitting the same provider (for prompt cache reuse), pass a trace_id: the router keeps its ordering stable for requests that share a trace, while still adapting across new conversations.

​How It Works

​Benefits

Fastest Responses

Automatic Adaptation

No Manual Tuning

Regional Optimization

​Creating a Latency-Based Policy

​How Latency Tracking Works

​Key Selection Strategies

​Use Cases

​Combining with Other Policies

​Monitoring Latency

​FAQ

How It Works

Benefits

Creating a Latency-Based Policy

How Latency Tracking Works

Key Selection Strategies

Use Cases

Combining with Other Policies

Monitoring Latency

FAQ