Load Balancing Policies

Load Balancing Policies distribute your requests across multiple models based on weights you define. They are well suited to A/B testing, gradual rollouts, and resource optimization.

Figure: The Requesty router distributes incoming requests across multiple providers according to configured weights, while trace_id keeps a single conversation on the same provider.

Configure load balancing in the Requesty Console. Prefer zero setup? Try a managed policy maintained by Requesty.

How It Works

Assign weights

You assign weights to each model (e.g., 70%, 20%, 10%).

Requests are routed

Each incoming request is consistently routed to one model based on the distribution.

Consistency guaranteed

Requests with the same trace_id or user_id always go to the same model.

Benefits

A/B Testing

Compare model performance with real traffic split across different models.

Gradual Rollouts

Send 10% to a new model, 90% to your stable model. Increase gradually.

Cost Optimization

Route most traffic to cheaper models while keeping premium models available.

Consistent Experiences

Same user always gets the same model, maintaining conversation context.

Creating a Load Balancing Policy

Create the Policy

Go to Routing Policies, click Create Policy, and select the Load balance strategy card. Name your policy and reference it as policy/your-policy-name in your requests.

Figure: Policy editor with strategy cards and Live Performance panel.

Configure Weights

Add models from the catalog below the form, then set your distribution. For example:

Model	Weight
`anthropic/claude-sonnet-4-5`	50%
`bedrock/claude-sonnet-4-5@eu-central-1`	50%

Weights are shown as percentages of the total. You can enter any positive numbers; they are normalized so that the shares add up to 100%.
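The normalization of weights can be illustrated with a short sketch. This is not Requesty's implementation; `normalize_weights` is a hypothetical helper that only shows how arbitrary numbers turn into percentage shares.

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale arbitrary non-negative weights so the shares sum to 100 (percent)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {model: 100.0 * w / total for model, w in weights.items()}


# Entering 1 and 1 gives the same split as entering 50 and 50.
shares = normalize_weights({
    "anthropic/claude-sonnet-4-5": 1,
    "bedrock/claude-sonnet-4-5@eu-central-1": 1,
})
print(shares)
```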

Use the Policy in Your Code

Reference your policy with policy/your-policy-name:

Python:

from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}]
)

JavaScript:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://router.requesty.ai/v1',
  apiKey: 'your-requesty-api-key'
});

const response = await client.chat.completions.create({
  model: 'policy/sonnet-distribution',
  messages: [{ role: 'user', content: 'Hello!' }]
});

cURL:

curl https://router.requesty.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-requesty-api-key" \
  -d '{
    "model": "policy/sonnet-distribution",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Consistency Guarantee

Load balancing uses deterministic hashing so that requests sharing the same trace_id always get the same model.

Scenario	Behavior
With `trace_id`	All requests with the same `trace_id` route to the same model
Without `trace_id`	Requesty generates a unique `request_id` for each request, so each request is routed independently

When you pass a stable trace_id, multi-turn conversations stay on the same model, user sessions get consistent behavior, and A/B test groups are stable.

Maintaining Consistency Across Requests

To keep a user on the same model across multiple requests, pass a trace_id:

Python:

response = client.chat.completions.create(
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "requesty": {
            "trace_id": "user-12345"
        }
    }
)

JavaScript:

const response = await client.chat.completions.create({
  model: 'policy/sonnet-distribution',
  messages: [{ role: 'user', content: 'Hello!' }],
  extra_body: {
    requesty: {
      trace_id: 'user-12345'
    }
  }
});

Use your internal user ID as the trace_id to ensure each user gets a consistent model experience while still benefiting from A/B testing.
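One way to make sure every call for a given user carries the same trace_id is to build the request arguments in a single place. The helper names below (`requesty_options`, `build_request`) are hypothetical; only the `extra_body` shape is taken from the Python example above.

```python
def requesty_options(user_id: str) -> dict:
    """Build the extra_body payload that ties requests to one trace_id."""
    return {"requesty": {"trace_id": user_id}}


def build_request(user_id: str, messages: list[dict],
                  policy: str = "policy/sonnet-distribution") -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": policy,
        "messages": messages,
        "extra_body": requesty_options(user_id),
    }


# Usage with the OpenAI client configured earlier on this page:
#   kwargs = build_request("user-12345", [{"role": "user", "content": "Hello!"}])
#   response = client.chat.completions.create(**kwargs)
```

Because the trace_id is derived from the user ID in one function, follow-up turns of the same conversation cannot accidentally be sent with a different ID.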

Load Balancing Between Policies

You can load balance between entire routing policies, not just individual models. This is powerful for canary deployments, A/B testing different routing strategies, and gradual migration from one policy to another.

Example: Policy Rollout

Say you have two fallback policies and want to gradually shift traffic:

Policy	Models	Weight
`policy/production-fallback` (stable)	openai/gpt-5.2 → anthropic/claude-sonnet-4-5	80%
`policy/experimental-fallback` (new)	google/gemini-2.5-pro → openai/gpt-5.2	20%

Create a load balancing policy called gradual-rollout with these weights. As you gain confidence, adjust to 50/50, then 0/100.

When load balancing between policies, each policy must be compatible with your request parameters. Do not mix embedding policies with chat completion policies.
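The two-level behavior in this example (pick a policy by weight, then walk that policy's fallback chain) can be sketched as follows. This is an illustration under stated assumptions, not Requesty's code: the function names are hypothetical, SHA-256 stands in for whatever hash the router uses, and `call_model` is a placeholder for a real provider call.

```python
import hashlib
from typing import Callable

# Fallback chains and weights from the table above.
POLICIES = {
    "policy/production-fallback": ["openai/gpt-5.2", "anthropic/claude-sonnet-4-5"],
    "policy/experimental-fallback": ["google/gemini-2.5-pro", "openai/gpt-5.2"],
}
WEIGHTS = {"policy/production-fallback": 80, "policy/experimental-fallback": 20}


def choose_policy(trace_id: str) -> str:
    """Pick a policy for this trace_id in proportion to WEIGHTS (sum must be 100)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    threshold = 0
    for policy, weight in WEIGHTS.items():
        threshold += weight
        if point < threshold:
            return policy
    raise ValueError("weights must sum to 100")


def route(trace_id: str, call_model: Callable[[str], str]) -> str:
    """Try each model of the chosen policy in order until one succeeds."""
    last_error = None
    for model in POLICIES[choose_policy(trace_id)]:
        try:
            return call_model(model)
        except RuntimeError as error:
            last_error = error  # fall through to the next model in the chain
    raise RuntimeError("all models in the chain failed") from last_error


def flaky_call(model: str) -> str:
    """Placeholder provider call that simulates an outage of one model."""
    if model == "openai/gpt-5.2":
        raise RuntimeError("simulated outage")
    return f"ok:{model}"


print(route("user-12345", flaky_call))
```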

Use Cases

A/B Testing New Models

Compare GPT-5.2 vs Gemini 2.5 Pro on real traffic:

Model	Weight
`openai/gpt-5.2`	50%
`google/gemini-2.5-pro`	50%

Track performance in Analytics and see which model performs better.

Gradual Model Rollout

Carefully introduce a new model:

Model	Weight	Role
`openai/gpt-4o`	90%	Stable, proven
`openai/gpt-5.2`	10%	New, testing

Increase the weight of gpt-5.2 as you validate quality.

Cost-Optimized Distribution

Route most traffic to cheaper models, some to premium:

Model	Weight
`openai/gpt-4o-mini`	70%
`openai/gpt-4o`	20%
`openai/gpt-5.2`	10%

Multi-Provider Redundancy

Distribute across providers for resilience:

Model	Weight
`openai/gpt-5.2`	40%
`anthropic/claude-sonnet-4-5`	40%
`google/gemini-2.5-pro`	20%

Key Selection (BYOK)

For each model in your load balancing policy, you can choose:

Option	Description
Requesty provided key	Use Requesty’s managed keys (default)
My own key	Use your BYOK credentials

Monitoring and Analytics

Every policy page includes a Live Performance panel showing real traffic across the models in the policy. Switch between Latency, Success, and Speed over the last 24 hours or 7 days to see how the distribution plays out in practice. For deeper analysis:

Open Analytics

Go to Analytics.

Filter by policy

Filter by your policy name to see the actual distribution of requests across models.

Compare performance

Compare latency, cost, and success rates between models. The distribution should match your configured weights (±2% variance is normal).
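If you export request counts per model, a check like the one below can confirm that the observed split stays within the expected variance. The function names are hypothetical and the counts are made-up example numbers, not real Analytics output.

```python
def distribution_drift(counts: dict[str, int],
                       weights: dict[str, float]) -> dict[str, float]:
    """Observed share minus configured share for each model, as fractions."""
    total_requests = sum(counts.values())
    total_weight = sum(weights.values())
    return {
        model: counts.get(model, 0) / total_requests - weight / total_weight
        for model, weight in weights.items()
    }


def within_tolerance(counts: dict[str, int], weights: dict[str, float],
                     tolerance: float = 0.02) -> bool:
    """True if every model's observed share is within +/- tolerance."""
    return all(abs(d) <= tolerance
               for d in distribution_drift(counts, weights).values())


weights = {"openai/gpt-4o-mini": 70, "openai/gpt-4o": 20, "openai/gpt-5.2": 10}
# Hypothetical counts for illustration only.
counts = {"openai/gpt-4o-mini": 7100, "openai/gpt-4o": 1950, "openai/gpt-5.2": 950}
print(within_tolerance(counts, weights))  # True: the largest drift is 1 point
```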

FAQ

How does consistent hashing work?

Requesty uses the xxhash algorithm on your trace_id (or request_id if no trace_id) to deterministically select a model. The same ID always produces the same hash, which maps to the same model.
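The idea can be sketched in a few lines. This is a minimal illustration rather than Requesty's actual implementation: it uses SHA-256 from the Python standard library as a stand-in for xxhash, and it assumes the hash is mapped onto cumulative weight ranges, which the text above does not specify. `pick_model` is a hypothetical name.

```python
import hashlib


def pick_model(routing_id: str, weights: dict[str, float]) -> str:
    """Deterministically map an ID to a model in proportion to the weights."""
    if not weights:
        raise ValueError("weights must not be empty")
    total = sum(weights.values())
    # Hash the ID to a point between 0 and 1 (SHA-256 stands in for xxhash).
    digest = hashlib.sha256(routing_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    # Walk the cumulative weight ranges until the point falls inside one.
    cumulative = 0.0
    for model, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return model
    # Guard against floating-point rounding at the upper edge.
    return list(weights)[-1]


weights = {"openai/gpt-4o": 90, "openai/gpt-5.2": 10}
# The same ID always lands on the same model.
print(pick_model("user-12345", weights) == pick_model("user-12345", weights))  # True
```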

What happens if I change the weights?

Changing weights will re-distribute traffic. Some users may switch to different models. If you need stability, avoid changing weights frequently, or use separate policies for stable vs experimental traffic.
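To get a feel for how many users switch, the sketch below simulates a change from 90/10 to 80/20. It assumes IDs are hashed onto cumulative weight ranges in model order and uses SHA-256 as a stand-in for xxhash; both are assumptions, so the real proportion may differ.

```python
import hashlib


def bucket(trace_id: str) -> float:
    """Map an ID to a stable point in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def assign(trace_id: str, weights: list[tuple[str, float]]) -> str:
    """Return the model whose cumulative weight range contains the ID's point."""
    total = sum(weight for _, weight in weights)
    point = bucket(trace_id) * total
    upper = 0.0
    for model, weight in weights:
        upper += weight
        if point < upper:
            return model
    return weights[-1][0]  # guards against floating-point rounding


before = [("openai/gpt-4o", 90), ("openai/gpt-5.2", 10)]
after = [("openai/gpt-4o", 80), ("openai/gpt-5.2", 20)]
ids = [f"user-{i}" for i in range(10_000)]
moved = sum(assign(i, before) != assign(i, after) for i in ids)
print(f"{moved / len(ids):.1%} of IDs changed model")
```

Under these assumptions only the IDs whose point falls between the old and new boundary move, which is roughly the size of the weight change (about 10% here).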

Can I load balance and have fallback?

Yes. Create a load balancing policy that points to fallback policies. This gives you both load balancing and automatic failover.

Policy	Weight
`policy/openai-fallback`	50%
`policy/anthropic-fallback`	50%

Do all models need to be compatible?

Yes. All models in a load balancing policy should support the same request format and features. Do not mix chat models with embedding models, or models with different context lengths.

How do I ensure exactly 20% of users see the new model?

Hash-based assignment gives an approximate split, not an exact one. Use a stable trace_id (like a user ID). With 100+ unique users, the distribution converges toward your configured weights (e.g., 20%). With small sample sizes, expect ±5% variance.
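As a rough guide to how the variance shrinks with more users, the sketch below computes one standard deviation of the observed share. It assumes each ID is assigned independently with probability equal to the configured weight (a binomial model), which is an approximation of how a well-mixed hash behaves; `expected_spread` is a hypothetical name.

```python
import math


def expected_spread(share: float, unique_ids: int) -> float:
    """One standard deviation of the observed share, as a fraction."""
    return math.sqrt(share * (1.0 - share) / unique_ids)


for n in (100, 1_000, 10_000):
    print(n, f"{expected_spread(0.20, n):.1%}")
# 100 4.0%
# 1000 1.3%
# 10000 0.4%
```

Under this model, a 20% weight with 100 unique users typically lands within a few points of 20%, and the gap narrows as the number of unique IDs grows.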
