Load Balancing Policies distribute your requests across multiple models based on weights you define. Perfect for A/B testing, gradual rollouts, and resource optimization.
Configure load balancing in the Requesty Console.

How It Works

1. Assign weights

You assign weights to each model (e.g., 70%, 20%, 10%).

2. Requests are routed

Each incoming request is consistently routed to one model based on the distribution.

3. Consistency guaranteed

Requests with the same trace_id or user_id always go to the same model.
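The three steps above can be sketched in a few lines of Python. This is an illustration only, not Requesty's implementation: the helper name `pick_model` and the cumulative-range mapping are assumptions, and `hashlib.blake2b` from the standard library stands in for whatever hash the router uses.

```python
import hashlib


def pick_model(routing_id: str, weights: dict[str, float]) -> str:
    """Map an ID to a model, deterministically and in proportion to the weights."""
    if not weights:
        raise ValueError("at least one model is required")

    # Hash the ID to a roughly uniform point between 0 and 1.
    # blake2b is a stdlib stand-in here (assumption), chosen to keep the sketch self-contained.
    digest = hashlib.blake2b(routing_id.encode("utf-8"), digest_size=8).digest()
    point = int.from_bytes(digest, "big") / 2**64

    # Walk the cumulative weight ranges until the point falls inside one.
    total = sum(weights.values())
    cumulative = 0.0
    for model, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return model
    return model  # guards against floating-point rounding at the upper edge


weights = {"model-a": 70, "model-b": 20, "model-c": 10}  # hypothetical model names
print(pick_model("user-12345", weights))
```

Because the choice depends only on the ID and the weights, calling `pick_model` twice with the same ID returns the same model, while a large set of distinct IDs spreads out roughly 70/20/10.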

Benefits

A/B Testing

Compare model performance with real traffic split across different models.

Gradual Rollouts

Send 10% to a new model, 90% to your stable model. Increase gradually.

Cost Optimization

Route most traffic to cheaper models while keeping premium models available.

Consistent Experiences

Same user always gets the same model, maintaining conversation context.

Creating a Load Balancing Policy

1. Create the Policy

Go to Routing Policies, click Create Policy, and select Load Balancing as the policy type.

2. Configure Weights

Set up your distribution. For example:

| Model | Weight |
| --- | --- |
| anthropic/claude-sonnet-4-5 | 50% |
| bedrock/claude-sonnet-4-5@eu-central-1 | 50% |

Weights are relative: you can enter any numbers, and they are normalized so that the shares total 100%.

3. Use the Policy in Your Code

Reference your policy with policy/your-policy-name:
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}]
)
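The weight normalization mentioned in the Configure Weights step amounts to dividing each weight by the total. A minimal sketch, assuming simple proportional normalization (the function name `normalize_weights` is hypothetical and not part of any Requesty SDK):

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Convert relative weights into percentage shares that sum to 100."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {model: 100 * weight / total for model, weight in weights.items()}


# Equal relative weights of 1 and 1 become a 50/50 split.
shares = normalize_weights({
    "anthropic/claude-sonnet-4-5": 1,
    "bedrock/claude-sonnet-4-5@eu-central-1": 1,
})
print(shares)
```

Under this assumption, weights of 3 and 1 would yield shares of 75% and 25%.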

Consistency Guarantee

Load balancing uses deterministic hashing, so requests that share the same ID always go to the same model.

| Scenario | Behavior |
| --- | --- |
| With trace_id | All requests with the same trace_id route to the same model |
| Without trace_id | Requesty generates a unique request_id for each request, so each request is routed independently |

When you pass a trace_id, multi-turn conversations stay on the same model, user sessions get consistent behavior, and A/B test groups are stable.

Maintaining Consistency Across Requests

To keep a user on the same model across multiple requests, pass a trace_id:
response = client.chat.completions.create(
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "requesty": {
            "trace_id": "user-12345"
        }
    }
)
Use your internal user ID as the trace_id to ensure each user gets a consistent model experience while still benefiting from A/B testing.
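If many call sites need this, a small wrapper can attach the trace_id in one place. The helper below is a hypothetical convenience (the name `with_trace_id` is not part of the OpenAI or Requesty SDKs); it only builds the keyword arguments and makes no network call.

```python
def with_trace_id(user_id: str, **request_kwargs) -> dict:
    """Return chat-completion kwargs carrying the user's ID as the Requesty trace_id."""
    # Copy any existing extra_body so the caller's dictionaries are not mutated.
    extra_body = dict(request_kwargs.pop("extra_body", {}))
    requesty = dict(extra_body.get("requesty", {}))
    requesty["trace_id"] = user_id
    extra_body["requesty"] = requesty
    return {**request_kwargs, "extra_body": extra_body}


kwargs = with_trace_id(
    "user-12345",
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Pass the result to an OpenAI client configured for Requesty:
# response = client.chat.completions.create(**kwargs)
```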

Load Balancing Between Policies

You can load balance between entire routing policies, not just individual models. This is powerful for canary deployments, A/B testing different routing strategies, and gradual migration from one policy to another.

Example: Policy Rollout

Say you have two fallback policies and want to gradually shift traffic:
| Policy | Models | Weight |
| --- | --- | --- |
| policy/production-fallback (stable) | openai/gpt-5.2 → anthropic/claude-sonnet-4-5 | 80% |
| policy/experimental-fallback (new) | google/gemini-2.5-pro → openai/gpt-5.2 | 20% |
Create a load balancing policy called gradual-rollout with these weights. As you gain confidence, adjust to 50/50, then 0/100.
When load balancing between policies, each policy must be compatible with your request parameters. Do not mix embedding policies with chat completion policies.

Use Cases

Compare GPT-5.2 vs Gemini 2.5 Pro on real traffic:

| Model | Weight |
| --- | --- |
| openai/gpt-5.2 | 50% |
| google/gemini-2.5-pro | 50% |

Track performance in Analytics and see which model performs better.

Carefully introduce a new model:

| Model | Weight | Role |
| --- | --- | --- |
| openai/gpt-4o | 90% | Stable, proven |
| openai/gpt-5.2 | 10% | New, testing |

Increase the weight of gpt-5.2 as you validate quality.

Route most traffic to cheaper models, some to premium:

| Model | Weight |
| --- | --- |
| openai/gpt-4o-mini | 70% |
| openai/gpt-4o | 20% |
| openai/gpt-5.2 | 10% |

Distribute across providers for resilience:

| Model | Weight |
| --- | --- |
| openai/gpt-5.2 | 40% |
| anthropic/claude-sonnet-4-5 | 40% |
| google/gemini-2.5-pro | 20% |
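For the cost-focused split (70/20/10), the expected cost per request is the weighted average of the per-model costs. The sketch below uses made-up relative cost units purely for illustration; they are not real prices, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(weights: dict[str, float], cost_per_request: dict[str, float]) -> float:
    """Weighted-average cost per request for a given traffic split."""
    total = sum(weights.values())
    return sum(w / total * cost_per_request[model] for model, w in weights.items())


weights = {"openai/gpt-4o-mini": 70, "openai/gpt-4o": 20, "openai/gpt-5.2": 10}

# Hypothetical relative cost units (assumption), NOT actual provider pricing.
relative_cost = {"openai/gpt-4o-mini": 1.0, "openai/gpt-4o": 10.0, "openai/gpt-5.2": 20.0}

print(blended_cost(weights, relative_cost))
```

Substituting your own measured cost per request for each model shows how much a change in weights would shift the average.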

Key Selection (BYOK)

For each model in your load balancing policy, you can choose:
| Option | Description |
| --- | --- |
| Requesty provided key | Use Requesty’s managed keys (default) |
| My own key | Use your BYOK credentials |

Monitoring and Analytics

1. Open Analytics

Go to Analytics.

2. Filter by policy

Filter by your policy name to see the actual distribution of requests across models.

3. Compare performance

Compare latency, cost, and success rates between models. The distribution should match your configured weights (±2% variance is normal).
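Once you have per-model request counts from Analytics, checking them against the configured weights is simple arithmetic. A minimal sketch, assuming you have copied the counts into a dictionary by hand (the counts below are hypothetical, and both function names are illustrative):

```python
def share_deviations(counts: dict[str, int], weights: dict[str, float]) -> dict[str, float]:
    """Observed share minus configured share, in percentage points, per model."""
    total_requests = sum(counts.values())
    if total_requests == 0:
        raise ValueError("no requests to compare")
    total_weight = sum(weights.values())
    return {
        model: 100 * counts.get(model, 0) / total_requests - 100 * weight / total_weight
        for model, weight in weights.items()
    }


def within_tolerance(counts, weights, tolerance_points: float = 2.0) -> bool:
    """True if every model's observed share is within the tolerance of its weight."""
    return all(abs(d) <= tolerance_points for d in share_deviations(counts, weights).values())


weights = {"openai/gpt-5.2": 40, "anthropic/claude-sonnet-4-5": 40, "google/gemini-2.5-pro": 20}
# Hypothetical request counts, for illustration only.
counts = {"openai/gpt-5.2": 4080, "anthropic/claude-sonnet-4-5": 3950, "google/gemini-2.5-pro": 1970}

print(share_deviations(counts, weights))
print(within_tolerance(counts, weights))
```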

FAQ

Requesty uses the xxhash algorithm on your trace_id (or request_id if no trace_id) to deterministically select a model. The same ID always produces the same hash, which maps to the same model.
Changing weights will re-distribute traffic. Some users may switch to different models. If you need stability, avoid changing weights frequently, or use separate policies for stable vs experimental traffic.
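To get a feel for how many users a weight change might move, you can simulate it. The sketch below rests on assumptions the documentation does not confirm: it maps each ID to a point between 0 and 1 with a stdlib hash (`hashlib.blake2b` as a stand-in) and assigns models by cumulative weight ranges. The real router may map hashes to models differently, so treat the result as a rough guide only.

```python
import hashlib


def bucket(routing_id: str) -> float:
    """Hash an ID to a roughly uniform point between 0 and 1 (stand-in hash, assumption)."""
    digest = hashlib.blake2b(routing_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def pick(routing_id: str, weights: dict[str, float]) -> str:
    """Select a model by walking cumulative weight ranges (assumed mapping)."""
    point = bucket(routing_id) * sum(weights.values())
    cumulative = 0.0
    for model, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return model
    return model  # upper-edge rounding guard


def fraction_switched(ids, old: dict[str, float], new: dict[str, float]) -> float:
    """Share of IDs whose assigned model differs between two weight settings."""
    moved = sum(pick(i, old) != pick(i, new) for i in ids)
    return moved / len(ids)


ids = [f"user-{n}" for n in range(10_000)]  # synthetic IDs
old = {"openai/gpt-4o": 90, "openai/gpt-5.2": 10}
new = {"openai/gpt-4o": 50, "openai/gpt-5.2": 50}
print(fraction_switched(ids, old, new))
```

Under this assumed mapping, moving from 90/10 to 50/50 reassigns roughly 40% of IDs: those whose point falls between 0.5 and 0.9.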
You can combine load balancing with fallback: create a load balancing policy that points to fallback policies. This gives you both load balancing and automatic failover.

| Policy | Weight |
| --- | --- |
| policy/openai-fallback | 50% |
| policy/anthropic-fallback | 50% |
All models in a load balancing policy should support the same request format and features. Do not mix chat models with embedding models, or models with different context lengths.
To keep the observed split close to your configured weights (e.g., 20% for one model), use a stable trace_id (such as a user ID). With 100+ unique users, the distribution converges toward the configured weights; with small sample sizes, expect ±5% variance.
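A rough way to see why small samples vary is the binomial standard error of an observed share. This is a back-of-the-envelope estimate, assuming each unique ID is assigned independently and uniformly by the hash; the function name is illustrative.

```python
import math


def share_standard_error(share: float, unique_ids: int) -> float:
    """Standard error of an observed share when each ID is assigned independently."""
    return math.sqrt(share * (1 - share) / unique_ids)


# Standard error, in percentage points, for a model configured at 20%.
for n in (100, 1_000, 10_000):
    print(n, round(100 * share_standard_error(0.20, n), 2))
```

For a 20% weight and 100 unique IDs, the standard error is about 4 percentage points, and it shrinks with the square root of the number of unique IDs.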
Last modified on May 26, 2026