LLM API Rate Limiting Best Practices: Avoid 429 Errors and Save 40% on Costs
Every production LLM application eventually hits the same wall: a `429 Too Many Requests` error at 3 AM, right when traffic spikes. Whether you're calling Claude, GPT-4, or running multi-agent workflows, rate limiting isn't optional — it's the difference between a reliable product and a flaky one. This guide covers the patterns that actually hold up under load, with code you can ship today.
Why LLM Rate Limits Are Different
Traditional API rate limits count requests per second. LLM providers add a second dimension: tokens per minute (TPM). Anthropic's rate limit documentation shows tier-based quotas where Tier 1 caps you at 50 requests/minute and 40,000 input tokens/minute for Claude Opus. OpenAI's usage tier system works similarly.
The asymmetry matters. A single agent loop calling Claude with a 50K-token context window can exhaust your TPM budget in three calls — even though you've technically used only three of your 50 allowed requests per minute. Most teams blow through limits not because they send too many requests, but because their context windows balloon over time.
A real example from production: an agent built with LangGraph accumulates conversation history across iterations. By turn 8, each request carries 80K tokens. Three concurrent users → instant 429. The fix isn't "more retries" — it's measuring tokens before you send.
Cost dimension you're probably ignoring
A Claude Sonnet 4.6 request averages $0.003 per 1K input tokens and $0.015 per 1K output tokens. If your retry logic naively replays a failed 80K-token request three times, you've burned $0.72 to deliver one response. Multiply by 10,000 daily users and rate limit handling becomes a six-figure cost decision. Aggressive retries without circuit breakers are the most expensive bug in modern LLM apps.
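To make the arithmetic concrete, here's the retry math spelled out in a few lines of Python, using the prices and token counts quoted above:
```python
# Naive retry cost for one oversized request, using the figures above
input_price_per_1k = 0.003   # USD per 1K input tokens (Sonnet)
request_tokens = 80_000      # the bloated agent context from the example
attempts = 3                 # naive replays of the same failed request

wasted = attempts * (request_tokens / 1_000) * input_price_per_1k
print(f"${wasted:.2f} spent without delivering a single response")  # $0.72
```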
Pattern 1: Token Bucket With Dual Counters
Forget the textbook token bucket — you need two buckets running in parallel: one for requests, one for tokens. Here's a minimal implementation in Python:
```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class DualBucket:
    rpm_limit: int
    tpm_limit: int
    # Both buckets start full and are refilled continuously inside acquire()
    request_tokens: float = field(init=False)
    token_tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self):
        self.request_tokens = float(self.rpm_limit)
        self.token_tokens = float(self.tpm_limit)
        self.last_refill = time.monotonic()

    async def acquire(self, estimated_tokens: int) -> None:
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            # Refill both buckets in proportion to elapsed time, capped at the limit
            self.request_tokens = min(
                self.rpm_limit,
                self.request_tokens + elapsed * (self.rpm_limit / 60),
            )
            self.token_tokens = min(
                self.tpm_limit,
                self.token_tokens + elapsed * (self.tpm_limit / 60),
            )
            self.last_refill = now
            # Proceed only when both the request budget and the token budget have room
            if self.request_tokens >= 1 and self.token_tokens >= estimated_tokens:
                self.request_tokens -= 1
                self.token_tokens -= estimated_tokens
                return
            # Not enough budget yet; yield to the event loop and check again shortly
            await asyncio.sleep(0.1)
```
The trick: estimate tokens before you call the API. The Anthropic SDK exposes a token-counting endpoint (`client.messages.count_tokens()`), and tiktoken covers OpenAI models. Pre-flight measurement prevents the worst failure mode, where you queue 500 requests that all fit RPM, none fit TPM, and your queue deadlocks.
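Here's a minimal sketch of the pre-flight flow, assuming the `DualBucket` above, the Anthropic async SDK's token-counting endpoint, and an illustrative model name:
```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
bucket = DualBucket(rpm_limit=50, tpm_limit=40_000)  # Tier-1-ish numbers

async def guarded_complete(messages: list[dict]):
    # Measure the full conversation before sending anything
    count = await client.messages.count_tokens(
        model="claude-sonnet-4-6", messages=messages
    )
    # Block until both the RPM and the TPM bucket have room
    await bucket.acquire(count.input_tokens)
    return await client.messages.create(
        model="claude-sonnet-4-6", messages=messages, max_tokens=1024
    )
```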
Pattern 2: Exponential Backoff Done Right
Most "exponential backoff" implementations are wrong because they ignore the `retry-after` header. Anthropic and OpenAI both return this header on 429 responses with the exact wait time. Use it.
```javascript
async function callWithBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 && err.status !== 529) throw err;
      const retryAfter = parseInt(err.headers?.['retry-after'], 10) || null;
      const jitter = Math.random() * 1000;
      // Honor retry-after when the provider sends it; otherwise back off
      // exponentially. Either way, never wait longer than 30s.
      const wait = Math.min(
        retryAfter
          ? retryAfter * 1000 + jitter
          : 2 ** attempt * 1000 + jitter,
        30000
      );
      console.warn(`Rate limited. Retrying in ${Math.round(wait)}ms`);
      await new Promise(r => setTimeout(r, wait));
    }
  }
  throw new Error('Max retries exceeded');
}
```
Three non-obvious things this gets right:
1. Handles 529 (overloaded) — Anthropic returns this during capacity events; treat it like 429.
2. Adds jitter — without it, all your servers retry at exactly the same millisecond and recreate the spike.
3. Caps wait at 30s — beyond that, fail fast and let the user retry manually.
The official Anthropic SDK does this internally via `max_retries`, but if you're calling the API with raw `fetch` or curl, you need to handle it explicitly.
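If the SDK route is enough for you, the configuration is a one-liner (a minimal sketch; the retry budget is the `max_retries` option mentioned above):
```python
from anthropic import AsyncAnthropic

# The SDK retries rate-limit and overload responses internally with backoff;
# raise or lower the budget to fit your latency tolerance.
client = AsyncAnthropic(max_retries=5)
```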
Pattern 3: Request Coalescing for Agents
If you're running multi-agent systems (think: 5 Claude agents each making decisions in parallel), naive parallelism guarantees rate limit hell. Coalesce identical requests:
```python
import asyncio
import hashlib


class CoalescingClient:
    def __init__(self, client):
        self.client = client
        self.in_flight = {}  # request fingerprint -> in-flight task

    async def complete(self, messages, **kwargs):
        # Fingerprint the full request so identical concurrent calls share one API call
        key = hashlib.md5(
            str(messages).encode() + str(kwargs).encode()
        ).hexdigest()
        if key in self.in_flight:
            # An identical request is already in flight; piggyback on its result
            return await self.in_flight[key]
        future = asyncio.create_task(
            self.client.messages.create(messages=messages, **kwargs)
        )
        self.in_flight[key] = future
        try:
            return await future
        finally:
            self.in_flight.pop(key, None)
```
In production agent swarms, we've seen this cut request volume by 35-60% because agents frequently make redundant context-checking calls. Combine with Anthropic's prompt caching for an additional 90% cost reduction on cached portions.
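For the caching half, here's a sketch of what that looks like with Anthropic's `cache_control` content blocks (the document variable, model name, and helper are placeholders):
```python
REFERENCE_DOC = "..."  # placeholder: the large, rarely-changing context agents keep re-reading

async def cached_complete(client, question: str):
    # Mark the stable part of the prompt as cacheable so repeated reads
    # bill at the reduced cache-read rate instead of full input price.
    return await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": REFERENCE_DOC,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
```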
Pattern 4: Graceful Degradation Across Models
When you're rate-limited on Claude Opus, falling back to Sonnet beats failing. Build a model ladder:
```python
from anthropic import AsyncAnthropic, RateLimitError

client = AsyncAnthropic()

# Most capable model first; each model draws from its own quota pool
MODEL_LADDER = [
    "claude-opus-4-7",
    "claude-sonnet-4-6",
    "claude-haiku-4-5",
]

async def complete_with_fallback(messages):
    for model in MODEL_LADDER:
        try:
            return await client.messages.create(
                model=model, messages=messages, max_tokens=1024
            )
        except RateLimitError:
            # This rung is rate-limited; drop down to the next model
            continue
    raise Exception("All models rate-limited")
```
Haiku is ~12x cheaper than Opus and has separate quota pools. For non-critical paths (intent classification, routing, summarization), graceful degradation costs you almost nothing in quality and saves you from total outage.
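One way to wire that in is a small routing helper (hypothetical; the task names mirror the examples above) that sends non-critical work straight to the cheap end of the ladder:
```python
CHEAP_TASKS = {"intent_classification", "routing", "summarization"}

def ladder_for(task_type: str) -> list[str]:
    # Non-critical paths skip the expensive rungs entirely;
    # critical paths keep the full Opus -> Sonnet -> Haiku chain.
    if task_type in CHEAP_TASKS:
        return ["claude-haiku-4-5"]
    return MODEL_LADDER
```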
Monitoring: You Can't Fix What You Don't Measure
Every pattern above assumes you have visibility into what's happening. Without per-request tracking of latency, token usage, retry counts, and 429 frequency, you're guessing. This is exactly where observability matters — and it's why we built ClawPulse to give LLM teams real-time dashboards on rate limit headroom, retry storms, and per-model cost burn.
The alternatives all have tradeoffs:
- Langfuse — strong on tracing, weaker on rate-limit-specific alerting.
- Helicone — proxy-based, adds ~30ms latency per request, great for cost tracking.
- ClawPulse — agent-first, async ingestion (zero added latency), built-in 429 anomaly detection.
If you're hitting rate limits regularly, the first signal you need is time-to-quota-exhaustion — how many seconds until your current burn rate trips a 429. See our breakdown of LLM observability metrics that matter for the full list.
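A back-of-the-envelope version of that metric, computed from counters you're probably already tracking (the names here are illustrative, not a fixed API):
```python
def seconds_to_quota_exhaustion(
    tokens_used_this_minute: int,
    tpm_limit: int,
    window_elapsed_s: float,
) -> float:
    # Remaining budget divided by current burn rate = seconds until a 429
    if window_elapsed_s <= 0 or tokens_used_this_minute == 0:
        return float("inf")  # no burn yet, no exhaustion in sight
    burn_rate = tokens_used_this_minute / window_elapsed_s  # tokens per second
    remaining = max(tpm_limit - tokens_used_this_minute, 0)
    return remaining / burn_rate
```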
Anti-Patterns to Avoid
- Retrying on 400 errors. A 400 is a malformed request — retrying just wastes quota. Only retry 429, 500, 502, 503, 504, 529 (see the sketch after this list).
- Synchronous retry loops in serverless. A Lambda waiting 30s for a retry burns 30s of compute billing. Use queues (SQS, BullMQ) instead.
- Sharing one API key across all environments. Dev/staging traffic eats prod's quota. Each environment should have its own key.
- Ignoring streaming. Streaming responses count differently against TPM and let you cancel mid-stream if the user navigates away. Use it.
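For the first anti-pattern, the allowlist is small enough to hard-code. A sketch, using the status codes listed above:
```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504, 529}

def should_retry(status_code: int) -> bool:
    # 4xx errors other than 429 are caller bugs; retrying them only burns quota
    return status_code in RETRYABLE_STATUSES
```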
Putting It All Together
A production-grade LLM client should layer all four patterns: pre-flight token estimation feeding the dual bucket → request coalescing → fallback ladder → exponential backoff with jitter. That's 4-5 hours of work that will save you from 90% of rate limit incidents.
For the remaining 10% (provider outages, surprise quota changes, viral traffic), you need monitoring that pages you before users notice. Try ClawPulse on a free demo to see how rate-limit-aware observability looks in practice — we'll show you your real burn rate and headroom in under 5 minutes.
FAQ
What's the difference between RPM and TPM rate limits?
RPM (requests per minute) caps how many API calls you can make. TPM (tokens per minute) caps the total token volume across those calls. You can be well under your RPM but blow through TPM with a few large-context requests, especially with agents that accumulate conversation history.
Should I use the official SDK's built-in retries or write my own?
Use the SDK's retries for simple cases (`max_retries=3` is the default for Anthropic's SDK). Write your own when you need cross-model fallback, request coalescing, or custom logging — the SDK retries are opaque and don't integrate with your monitoring.
How do I estimate tokens before sending a request?
Use the Anthropic SDK's `client.messages.count_tokens()` endpoint for Claude or tiktoken (`cl100k_base` encoding) for GPT-4. For multi-turn conversations, count the entire message history, not just the new user message. Add ~10% buffer for system prompt overhead and tool schemas.
Does prompt caching help with rate limits?
Yes — cached input tokens count at a reduced rate against TPM (often 10x less). Combining caching with request coalescing can cut effective token usage by 70-80% for agent workflows that re-read the same documents repeatedly. See Anthropic's caching docs for the full pricing breakdown.
When should I escalate to a higher API tier?
When your p95 wait time from rate-limit backoff exceeds 2 seconds, or when 429s account for >0.5% of your requests over a 24h window. Below those thresholds, optimization (caching, coalescing, model ladder) is cheaper than upgrading. Above them, the tier upgrade pays for itself in conversion lift from faster responses. Check your current burn rate on the ClawPulse pricing page to see which tier matches your traffic profile.