LLM API Rate Limiting Best Practices: Avoid 429 Errors and Save 40% on Costs
Every production LLM application eventually hits the same wall: a `429 Too Many Requests` error at 3 AM, right when traffic spikes. Whether you're calling Claude, GPT-4, or running multi-agent workflows, rate limiting isn't optional — it's the difference between a reliable product and a flaky one. This guide covers the patterns that actually hold up under load, with code you can ship today.
Why LLM Rate Limits Are Different
Traditional API rate limits count requests per second. LLM providers add a second dimension: tokens per minute (TPM). Anthropic's rate limit documentation shows tier-based quotas where Tier 1 caps you at 50 requests/minute and 40,000 input tokens/minute for Claude Opus. OpenAI's usage tier system works similarly.
The asymmetry matters. A single agent loop calling Claude with a 50K-token context window can exhaust your TPM budget in three calls — even though you've technically used only three of your 50 allowed requests per minute. Most teams blow through limits not because they send too many requests, but because their context windows balloon over time.
A real example from production: an agent built with LangGraph accumulates conversation history across iterations. By turn 8, each request carries 80K tokens. Three concurrent users → instant 429. The fix isn't "more retries" — it's measuring tokens before you send.
Cost dimension you're probably ignoring
A Claude Sonnet 4.6 request averages $0.003 per 1K input tokens and $0.015 per 1K output tokens. If your retry logic naively replays a failed 80K-token request three times, you've burned $0.72 to deliver one response. Multiply by 10,000 daily users and rate limit handling becomes a six-figure cost decision. Aggressive retries without circuit breakers are the most expensive bug in modern LLM apps.
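To make the arithmetic concrete, here's the retry math spelled out in a few lines of Python, using the prices and token counts quoted above:
```python
# Naive retry cost for one oversized request, using the figures above
input_price_per_1k = 0.003   # USD per 1K input tokens (Sonnet)
request_tokens = 80_000      # the bloated agent context from the example
attempts = 3                 # naive replays of the same failed request

wasted = attempts * (request_tokens / 1_000) * input_price_per_1k
print(f"${wasted:.2f} spent without delivering a single response")  # $0.72
```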
Pattern 1: Token Bucket With Dual Counters
Forget the textbook token bucket — you need two buckets running in parallel: one for requests, one for tokens. Here's a minimal implementation in Python:
```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class DualBucket:
    rpm_limit: int
    tpm_limit: int
    # Both buckets start full and are refilled continuously inside acquire()
    request_tokens: float = field(init=False)
    token_tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self):
        self.request_tokens = float(self.rpm_limit)
        self.token_tokens = float(self.tpm_limit)
        self.last_refill = time.monotonic()

    async def acquire(self, estimated_tokens: int) -> None:
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            # Refill both buckets in proportion to elapsed time, capped at the limit
            self.request_tokens = min(
                self.rpm_limit,
                self.request_tokens + elapsed * (self.rpm_limit / 60),
            )
            self.token_tokens = min(
                self.tpm_limit,
                self.token_tokens + elapsed * (self.tpm_limit / 60),
            )
            self.last_refill = now
            # Proceed only when both the request budget and the token budget have room
            if self.request_tokens >= 1 and self.token_tokens >= estimated_tokens:
                self.request_tokens -= 1
                self.token_tokens -= estimated_tokens
                return
            # Not enough budget yet; yield to the event loop and check again shortly
            await asyncio.sleep(0.1)
```
The trick: estimate tokens before you call the API. The Anthropic SDK exposes a token-counting endpoint (`client.messages.count_tokens()`), and tiktoken covers OpenAI models. Pre-flight measurement prevents the worst failure mode, where you queue 500 requests that all fit RPM, none fit TPM, and your queue deadlocks.
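Here's a minimal sketch of the pre-flight flow, assuming the `DualBucket` above, the Anthropic async SDK's token-counting endpoint, and an illustrative model name:
```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
bucket = DualBucket(rpm_limit=50, tpm_limit=40_000)  # Tier-1-ish numbers

async def guarded_complete(messages: list[dict]):
    # Measure the full conversation before sending anything
    count = await client.messages.count_tokens(
        model="claude-sonnet-4-6", messages=messages
    )
    # Block until both the RPM and the TPM bucket have room
    await bucket.acquire(count.input_tokens)
    return await client.messages.create(
        model="claude-sonnet-4-6", messages=messages, max_tokens=1024
    )
```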
Pattern 2: Exponential Backoff Done Right
Most "exponential backoff" implementations are wrong because they ignore the `retry-after` header. Anthropic and OpenAI both return this header on 429 responses with the exact wait time. Use it.
```javascript
async function callWithBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 && err.status !== 529) throw err;
      const retryAfter = parseInt(err.headers?.['retry-after'], 10) || null;
      const jitter = Math.random() * 1000;
      // Honor retry-after when the provider sends it; otherwise back off
      // exponentially. Either way, never wait longer than 30s.
      const wait = Math.min(
        retryAfter
          ? retryAfter * 1000 + jitter
          : 2 ** attempt * 1000 + jitter,
        30000
      );
      console.warn(`Rate limited. Retrying in ${Math.round(wait)}ms`);
      await new Promise(r => setTimeout(r, wait));
    }
  }
  throw new Error('Max retries exceeded');
}
```
Three non-obvious things this gets right:
1. Handles 529 (overloaded) — Anthropic returns this during capacity events; treat it like 429.
2. Adds jitter — without it, all your servers retry at exactly the same millisecond and recreate the spike.
3. Caps wait at 30s — beyond that, fail fast and let the user retry manually.
The official Anthropic SDK does this internally via `max_retries`, but if you're calling the API with raw `fetch` or curl, you need to handle it explicitly.
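If the SDK route is enough for you, the configuration is a one-liner (a minimal sketch; the retry budget is the `max_retries` option mentioned above):
```python
from anthropic import AsyncAnthropic

# The SDK retries rate-limit and overload responses internally with backoff;
# raise or lower the budget to fit your latency tolerance.
client = AsyncAnthropic(max_retries=5)
```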
Pattern 3: Request Coalescing for Agents
If you're running multi-agent systems (think: 5 Claude agents each making decisions in parallel), naive parallelism guarantees rate limit hell. Coalesce identical requests:
```python
import asyncio
import hashlib


class CoalescingClient:
    def __init__(self, client):
        self.client = client
        self.in_flight = {}  # request fingerprint -> in-flight task

    async def complete(self, messages, **kwargs):
        # Fingerprint the full request so identical concurrent calls share one API call
        key = hashlib.md5(
            str(messages).encode() + str(kwargs).encode()
        ).hexdigest()
        if key in self.in_flight:
            # An identical request is already in flight; piggyback on its result
            return await self.in_flight[key]
        future = asyncio.create_task(
            self.client.messages.create(messages=messages, **kwargs)
        )
        self.in_flight[key] = future
        try:
            return await future
        finally:
            self.in_flight.pop(key, None)
```
In production agent swarms, we've seen this cut request volume by 35-60% because agents frequently make redundant context-checking calls. Combine with Anthropic's prompt caching for an additional 90% cost reduction on cached portions.
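For the caching half, here's a sketch of what that looks like with Anthropic's `cache_control` content blocks (the document variable, model name, and helper are placeholders):
```python
REFERENCE_DOC = "..."  # placeholder: the large, rarely-changing context agents keep re-reading

async def cached_complete(client, question: str):
    # Mark the stable part of the prompt as cacheable so repeated reads
    # bill at the reduced cache-read rate instead of full input price.
    return await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": REFERENCE_DOC,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
```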
Pattern 4: Graceful Degradation Across Models
When you're rate-limited on Claude Opus, falling back to Sonnet beats failing. Build a model ladder:
```python
from anthropic import AsyncAnthropic, RateLimitError

client = AsyncAnthropic()

# Most capable model first; each model draws from its own quota pool
MODEL_LADDER = [
    "claude-opus-4-7",
    "claude-sonnet-4-6",
    "claude-haiku-4-5",
]

async def complete_with_fallback(messages):
    for model in MODEL_LADDER:
        try:
            return await client.messages.create(
                model=model, messages=messages, max_tokens=1024
            )
        except RateLimitError:
            # This rung is rate-limited; drop down to the next model
            continue
    raise Exception("All models rate-limited")
```
Haiku is ~12x cheaper than Opus and has separate quota pools. For non-critical paths (intent classification, routing, summarization), graceful degradation costs you almost nothing in quality and saves you from total outage.
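One way to wire that in is a small routing helper (hypothetical; the task names mirror the examples above) that sends non-critical work straight to the cheap end of the ladder:
```python
CHEAP_TASKS = {"intent_classification", "routing", "summarization"}

def ladder_for(task_type: str) -> list[str]:
    # Non-critical paths skip the expensive rungs entirely;
    # critical paths keep the full Opus -> Sonnet -> Haiku chain.
    if task_type in CHEAP_TASKS:
        return ["claude-haiku-4-5"]
    return MODEL_LADDER
```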
Monitoring: You Can't Fix What You Don't Measure
Every pattern above assumes you have visibility into what's happening. Without per-request tracking of latency, token usage, retry counts, and 429 frequency, you're guessing. This is exactly where observability matters — and it's why we built ClawPulse to give LLM teams real-time dashboards on rate limit headroom, retry storms, and per-model cost burn.
The alternatives all have tradeoffs:
- Langfuse — strong on tracing, weaker on rate-limit-specific alerting.
- Helicone — proxy-based, adds ~30ms latency per request, great for cost tracking.
- ClawPulse — agent-first, async ingestion (zero added latency), built-in 429 anomaly detection.
If you're hitting rate limits regularly, the first signal you need is time-to-quota-exhaustion — how many seconds until your current burn rate trips a 429. See our breakdown of LLM observability metrics that matter for the full list.
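A back-of-the-envelope version of that metric, computed from counters you're probably already tracking (the names here are illustrative, not a fixed API):
```python
def seconds_to_quota_exhaustion(
    tokens_used_this_minute: int,
    tpm_limit: int,
    window_elapsed_s: float,
) -> float:
    # Remaining budget divided by current burn rate = seconds until a 429
    if window_elapsed_s <= 0 or tokens_used_this_minute == 0:
        return float("inf")  # no burn yet, no exhaustion in sight
    burn_rate = tokens_used_this_minute / window_elapsed_s  # tokens per second
    remaining = max(tpm_limit - tokens_used_this_minute, 0)
    return remaining / burn_rate
```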
Anti-Patterns to Avoid
- Retrying on 400 errors. A 400 is a malformed request — retrying just wastes quota. Only retry 429, 500, 502, 503, 504, 529 (see the sketch after this list).
- Synchronous retry loops in serverless. A Lambda waiting 30s for a retry burns 30s of compute billing. Use queues (SQS, BullMQ) instead.
- Sharing one API key across all environments. Dev/staging traffic eats prod's quota. Each environment should have its own key.
- Ignoring streaming. Streaming responses count differently against TPM and let you cancel mid-stream if the user navigates away. Use it.
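For the first anti-pattern, the allowlist is small enough to hard-code. A sketch, using the status codes listed above:
```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504, 529}

def should_retry(status_code: int) -> bool:
    # 4xx errors other than 429 are caller bugs; retrying them only burns quota
    return status_code in RETRYABLE_STATUSES
```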
Putting It All Together
A production-grade LLM client should layer all four patterns: pre-flight token estimation feeding the dual bucket → request coalescing → fallback ladder → exponential backoff with jitter. That's 4-5 hours of work that will save you from 90% of rate limit incidents.
For the remaining 10% (provider outages, surprise quota changes, viral traffic), you need monitoring that pages you before users notice. Try ClawPulse on a free demo to see how rate-limit-aware observability looks in practice — we'll show you your real burn rate and headroom in under 5 minutes.
FAQ
What's the difference between RPM and TPM rate limits?
RPM (requests per minute) caps how many API calls you can make. TPM (tokens per minute) caps the total token volume across those calls. You can be well under your RPM but blow through TPM with a few large-context requests, especially with agents that accumulate conversation history.
Should I use the official SDK's built-in retries or write my own?
Use the SDK's retries for simple cases (`max_retries=3` is the default for Anthropic's SDK). Write your own when you need cross-model fallback, request coalescing, or custom logging — the SDK retries are opaque and don't integrate with your monitoring.
How do I estimate tokens before sending a request?
Use the Anthropic SDK's `client.messages.count_tokens()` endpoint for Claude or tiktoken (`cl100k_base` encoding) for GPT-4. For multi-turn conversations, count the entire message history, not just the new user message. Add ~10% buffer for system prompt overhead and tool schemas.
Does prompt caching help with rate limits?
Yes — cached input tokens count at a reduced rate against TPM (often 10x less). Combining caching with request coalescing can cut effective token usage by 70-80% for agent workflows that re-read the same documents repeatedly. See Anthropic's caching docs for the full pricing breakdown.
When should I escalate to a higher API tier?
When your p95 wait time from rate-limit backoff exceeds 2 seconds, or when 429s account for >0.5% of your requests over a 24h window. Below those thresholds, optimization (caching, coalescing, model ladder) is cheaper than upgrading. Above them, the tier upgrade pays for itself in conversion lift from faster responses. Check your current burn rate on the ClawPulse pricing page to see which tier matches your traffic profile.