OpenClaw Fleet Management: Scale from 1 to 100 Agents
The Fleet Management Challenge
Managing one OpenClaw agent is simple. You SSH in, check logs, restart if needed. Managing ten agents is annoying but doable. Managing fifty or a hundred? That is where things break down without proper fleet management.
Fleet management for AI agents is fundamentally different from managing a cluster of web servers. Each agent might be running different tasks, using different models, with different resource profiles. You cannot just treat them as interchangeable pods in a Kubernetes deployment.
The Three Pillars of Fleet Management
1. Inventory and Organization
You need to know what you have. How many agents are running? What are they doing? When were they last updated? Group agents by function (scraping, coding, customer service), by environment (staging, production), or by team.
Without proper inventory, you end up with shadow agents — instances someone spun up for testing that are still running months later, quietly burning resources.
2. Health Monitoring at Scale
Checking each agent individually does not scale. You need a fleet-level health view that surfaces problems automatically. The ideal system shows you a green/yellow/red status for every agent and lets you drill down only when something needs attention.
3. Coordinated Operations
Updating agent configurations, rolling out new prompts, or scaling capacity — these operations need to happen across the fleet, not one agent at a time.
How ClawPulse Handles Fleet Management
ClawPulse was designed for exactly this problem. Here is how it approaches fleet management:
Fleet Dashboard
One screen shows every OpenClaw instance in your fleet. Color-coded health indicators give you an instant read on fleet status. Click any instance to see detailed metrics, or zoom out to see fleet-wide aggregates.
Instance Grouping and Tags
Tag your instances by team, environment, function, or any custom dimension. Filter the dashboard to see only production scrapers, or only agents owned by the data team. Tags make large fleets navigable.
Automated Alerting Across the Fleet
Set alert rules that apply to individual instances or entire groups. "Alert me if any production agent exceeds 85% memory" or "Alert me if the average error rate across all customer service agents exceeds 3%." Alerts go to Slack, Discord, email, or WhatsApp.
Weekly Fleet Reports
Every week, ClawPulse generates a fleet summary: total instances, average health scores, resource consumption trends, and anomaly highlights. Share this with stakeholders to demonstrate operational maturity.
Scaling Strategies That Work
Start with Visibility
Before scaling from 10 to 50 agents, make sure you have solid monitoring for your existing 10. Scaling without visibility just means problems multiply faster.
Standardize Agent Configurations
Create base configurations for each agent type. When you need a new scraping agent, clone the standard config instead of building from scratch. This reduces variance and makes fleet-wide updates easier.
Set Resource Budgets
Define CPU, memory, and token budgets per agent type. Use monitoring to enforce these budgets and catch agents that exceed their allocation. This prevents a single misbehaving agent from affecting the entire fleet.
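To make budgets concrete, here is a minimal sketch of a budget check. The budget numbers and the usage sample are illustrative, not ClawPulse defaults; in practice you would feed it readings from your monitoring system.
```python
# Minimal budget-enforcement sketch. Budget numbers and the usage
# sample below are illustrative assumptions.

BUDGETS = {
    # per agent type: hard ceilings
    "scraper": {"cpu_pct": 50, "memory_mb": 1024, "tokens_per_hour": 200_000},
    "coder":   {"cpu_pct": 80, "memory_mb": 4096, "tokens_per_hour": 1_000_000},
    "support": {"cpu_pct": 40, "memory_mb": 2048, "tokens_per_hour": 500_000},
}

def check_budget(agent_type: str, usage: dict) -> list[str]:
    """Return a human-readable list of budget violations for one agent."""
    return [
        f"{metric}: {usage[metric]} exceeds {limit}"
        for metric, limit in BUDGETS[agent_type].items()
        if usage.get(metric, 0) > limit
    ]

# A scraper burning tokens far past its allocation
violations = check_budget(
    "scraper", {"cpu_pct": 35, "memory_mb": 900, "tokens_per_hour": 450_000}
)
if violations:
    print("Budget exceeded:", violations)  # route to your alerting in production
```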
Automate Health Checks
Do not rely on humans to check dashboards. Configure alerts for the metrics that matter and let the monitoring system tell you when intervention is needed. Your team should be building, not babysitting.
The Cost of Not Managing Your Fleet
Teams that skip fleet management pay in other ways:
- Wasted resources — agents running idle or duplicate agents doing the same work
- Slow incident response — nobody knows which agent is causing the problem
- Unpredictable costs — no visibility into which agents are driving your cloud bill
- Knowledge silos — only one person knows how the fleet is configured
Get Fleet Management Right from Day One
Whether you are running 5 agents or 500, proper fleet management pays for itself in reduced incidents, lower costs, and faster scaling.
Optimizing Agent Performance with ClawPulse
As your OpenClaw fleet grows, keeping every agent performing well gets harder. ClawPulse gives you the tooling to monitor and tune performance across the entire fleet.
One key feature is the ability to set custom performance thresholds for each agent. You can define metrics like CPU utilization, memory usage, or response time, and get alerts when an agent starts to exceed your defined limits. This allows you to quickly identify underperforming agents and take action before they impact your overall productivity.
ClawPulse also provides detailed historical performance data for each agent, so you can analyze trends over time. This can help you spot issues like memory leaks or other performance regressions, and make informed decisions about scaling, upgrading, or replacing agents as needed.
Beyond monitoring, ClawPulse coordinates performance changes across your fleet: push configuration updates, model updates, or even code changes in a controlled way instead of touching agents one by one.
Together, these capabilities keep your OpenClaw agents operating efficiently as the fleet scales to hundreds or thousands of instances, protecting both your AI spend and the results you deliver to customers.
Automating Fleet Upgrades and Rollouts
One of the biggest operational headaches is keeping your agent fleet up to date without causing service disruptions. When you have fifty or a hundred OpenClaw agents, manual upgrades are not just time-consuming—they introduce human error and create downtime windows.
ClawPulse enables automated, staged rollouts where you can deploy new agent versions to a subset of your fleet first. Test the update on your staging environment agents, then gradually roll out to production in waves. If something breaks, you catch it early rather than pushing a bad version to your entire fleet at once.
You can also automate configuration updates—pushing new prompts, adjusting resource allocations, or enabling new features—across tagged agent groups simultaneously. Pair this with health monitoring alerts, and you get immediate feedback if an update causes unexpected issues.
This staged approach reduces risk, eliminates manual work, and lets your team focus on what matters: improving your agents, not babysitting deployments. For teams scaling from dozens to hundreds of agents, this automation layer becomes essential to maintaining velocity without sacrificing reliability.
Start with ClawPulse at clawpulse.org/signup — get fleet-wide visibility for your OpenClaw agents in minutes.
Fleet Topologies: Picking the Right Architecture for Your Agent Scale
As your OpenClaw deployment grows from a handful of agents to hundreds, the topology you choose dictates how reliable, debuggable, and cost-efficient your fleet will be. There is no single right answer — but there are three patterns we see repeatedly across ClawPulse customers.
1. Flat fleet (1–20 agents). Every agent runs identical code, identical config, talks directly to the same LLM endpoint. Simple to operate, easy to debug — one agent crashes, the others keep serving. ClawPulse default tagging (`env=prod`, `region=us-east`) is enough to slice metrics. Best for early-stage SaaS.
2. Tiered fleet (20–100 agents). Agents are grouped by purpose: a "fast" tier (Haiku, low-latency tasks), a "smart" tier (Opus, deep reasoning), a "batch" tier (overnight jobs, async). Each tier gets its own ClawPulse tag — you alert separately on `tier=fast` p95 latency vs `tier=smart` cost burn. This is where the task tracker earns its keep: you correlate slow tasks to a specific tier without grepping logs across hosts.
3. Sharded fleet (100+ agents). Agents are partitioned by tenant, geographic region, or workload type. Each shard has its own observability boundary so a noisy tenant cannot blast another tenant's SLOs. Pair sharded fleets with per-tag cost tracking so finance can chargeback by shard. ClawPulse fleet view groups by shard tag automatically — you see which shard is bleeding tokens before the monthly invoice surprises you.
The mistake most teams make: jumping straight to sharded topology because it sounds "production grade." Start flat, move to tiered when one workload genuinely dominates, only shard when isolation becomes a hard requirement.
Instrumenting Fleet-Wide Metrics with ClawPulse
The ClawPulse agent ships with two primitives that make fleet observability tractable: `cp_metric` and `cp_event`. Below is the minimum useful instrumentation for any fleet — drop this into your OpenClaw agent loop and you immediately get fleet-wide percentiles, error rates, and cost attribution.
```python
# Python OpenClaw agent — fleet-grade instrumentation
from clawpulse import cp_metric, cp_event
import time

def run_agent_task(task_id: str, tier: str, shard: str):
    started = time.time()
    tags = {"tier": tier, "shard": shard, "task_id": task_id}
    try:
        cp_event("task.started", tags=tags)
        result = openclaw_invoke(task_id)  # your OpenClaw task entrypoint
        latency_ms = (time.time() - started) * 1000
        cp_metric("task.latency_ms", latency_ms, tags=tags)
        cp_metric("task.tokens_in", result.usage.input_tokens, tags=tags)
        cp_metric("task.tokens_out", result.usage.output_tokens, tags=tags)
        cp_metric("task.cost_usd", result.cost_usd, tags=tags)
        cp_event("task.completed", tags=tags)
        return result
    except Exception as exc:
        cp_metric("task.errors", 1, tags={**tags, "error": type(exc).__name__})
        cp_event("task.failed", tags=tags, payload={"err": str(exc)[:500]})
        raise
```
```bash
# Bash agent variant for shell-based OpenClaw deployments
cp_event "fleet.heartbeat" --tag "tier=fast" --tag "shard=eu-west-1"
cp_metric "fleet.active_workers" 12 --tag "tier=fast"
```
These two functions are all you need — ClawPulse rolls them up into per-tier dashboards, fires alerts when `task.errors` rate-of-change spikes, and powers the cost burndown view. No OpenTelemetry collector to deploy, no Prometheus pushgateway, no Datadog agent footprint.
For agents written in Node, Go, or Rust, the equivalent SDKs follow the same shape — see the ClawPulse agent install guide for language-specific examples.
Start monitoring your OpenClaw agents in 2 minutes
Free 14-day trial. No credit card. Just drop in one curl command.
Prefer a walkthrough? Book a 15-min demo.
ClawPulse vs Datadog vs Langfuse for Fleet Management — Which One Fits?
Most fleet operators evaluate three platforms before settling. Here is the honest tradeoff, written for a DevOps engineer who is already running 30+ agents and needs to pick.
| Capability | ClawPulse | Datadog (LLM Obs) | Langfuse |
|---|---|---|---|
| Fleet view (per-agent health) | Native, default | Build dashboards manually | Trace-centric, no fleet pane |
| Per-tag cost attribution | Built-in | Custom metrics required | Limited (project-level) |
| Auto-instrumented OpenClaw agents | One-line install | DD agent + manual spans | OpenTelemetry config |
| Alert routing (Slack, PagerDuty) | Native destinations | Native | Webhook + glue code |
| Self-hosted option | Yes | No | Yes |
| Pricing at 100 agents | Predictable per-instance | Usage-driven, can spike | Free OSS / paid cloud |
| Eval/scoring tooling | Roadmap | No | Yes (native) |
If your priority is fleet-wide health + cost control, ClawPulse wins on time-to-value: install the agent, get a working dashboard in five minutes. If your priority is prompt evals and trace replay, Langfuse is a stronger pick — and many teams run both. If you are already on Datadog for the rest of your infra, the LLM module is a reasonable extension but rarely fleet-aware out of the box. We dig deeper into these tradeoffs in our ClawPulse vs Datadog comparison and the Best Langfuse alternatives 2026 listicle.
Operational Runbook: Day-Two Operations for a 50-Agent Fleet
Once your fleet crosses ~50 agents, runbooks matter more than dashboards. Below is a stripped-down version of the runbook ClawPulse customers converge on.
- Daily: Skim the fleet overview. Any agent with `health != ok` for >15 min gets investigated. Any tier with p95 latency > 2x baseline gets a ticket.
- Weekly: Review per-tier cost burn. Compare against last week. If `tier=smart` cost grew >20% with no traffic increase, you have a token regression — bisect by deploy.
- Monthly: Audit alert rules. Mute the noisy ones, tighten the loose ones. We see teams accumulate 40+ stale rules within a quarter.
- On-call: Always start with the error tracking pillar. 80% of pages resolve to "one agent stuck on a bad prompt" — kill the worker, redeploy, post-mortem later.
- Rollouts: Stage to 1 agent → 5 agents → 25% of fleet → 100%. Each gate watches `task.errors` and `task.latency_ms` for 10 minutes before the next ramp (a sketch of these gates follows below).
This rhythm scales linearly. We have customers running 200+ agents on this exact cadence with two-person DevOps teams.
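A minimal sketch of those rollout gates, assuming hypothetical `deploy_to`, `rollback`, and `read_metric` helpers that wrap your deploy tooling and the ClawPulse metrics API:
```python
import time

# Hypothetical helpers: wire these to your deploy tooling and the
# ClawPulse metrics API.
def deploy_to(version: str, count: int) -> None: ...
def rollback(version: str) -> None: ...
def read_metric(name: str, window: str) -> float: return 0.0

FLEET_SIZE = 50
RAMP = [1, 5, FLEET_SIZE // 4, FLEET_SIZE]  # 1 → 5 → 25% → 100%
GATE_SECONDS = 600                          # watch each stage for 10 minutes
MAX_ERROR_RATE = 0.02
MAX_P95_RATIO = 1.5

def staged_rollout(version: str, baseline_p95_ms: float) -> bool:
    """Ramp a new version through the fleet, aborting on a bad gate."""
    deployed = 0
    for target in RAMP:
        deploy_to(version, count=target - deployed)
        deployed = target
        time.sleep(GATE_SECONDS)  # let the gate window fill
        errors = read_metric("task.errors", window="10m")
        p95 = read_metric("task.latency_ms.p95", window="10m")
        if errors > MAX_ERROR_RATE or p95 > baseline_p95_ms * MAX_P95_RATIO:
            rollback(version)
            return False
    return True
```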
Frequently Asked Questions
How many OpenClaw agents can ClawPulse monitor?
There is no hard cap. The Agency plan is unlimited; we have customers monitoring 500+ agents on a single workspace. The dashboard groups by tag so the UI stays usable at scale.
Does ClawPulse work with self-hosted OpenClaw deployments?
Yes. The agent is a single bash script that installs as a systemd service and pushes metrics to ClawPulse over HTTPS. Air-gapped deployments can self-host the ClawPulse backend — see the self-hosted monitoring guide.
How do I migrate a fleet already monitored by Datadog?
Run both side-by-side for one week. Tag every agent with `monitor=clawpulse,datadog`. Compare alert fidelity, cost attribution, and time-to-resolution. Most teams cut over after the first incident where ClawPulse caught a regression Datadog missed.
Can I get per-tenant cost reports for a multi-tenant fleet?
Yes. Tag each agent with `tenant=<tenant-id>` (e.g. `tenant=acme`) and the per-tag cost view rolls spend up by tenant. Pair it with the weekly fleet report when finance needs chargeback numbers.
What happens if the ClawPulse backend is unreachable?
The agent buffers metrics locally and retries with exponential backoff. You will not lose data for outages under one hour. Beyond that, only the oldest events are dropped — current metrics keep flowing.
Does fleet management cost extra?
No. Fleet view, per-tag dashboards, and staged rollout tooling are included in every plan from Starter up. See pricing for instance limits.
Ready to scale your fleet without losing visibility? Start your free trial or book a 15-minute demo and we will walk through your fleet topology together.
Fleet Capacity Planning: How to Size for Real Traffic, Not Wishful Thinking
Most teams oversize their fleet by 3–4x because nobody runs the math. They watch one agent struggle, panic-add ten more, and never trim back. Here is the formula we walk every ClawPulse customer through during onboarding.
Start with three numbers from your task tracker:
- λ — peak arrival rate (tasks/second at the 95th percentile of the day)
- s — mean task service time (seconds — measured, not guessed)
- U_target — utilization ceiling you are comfortable running at (we recommend 0.65 for latency-sensitive fleets, 0.85 for batch)
The minimum number of agents is `N = ceil(λ × s / U_target)`. Then add headroom: `N_headroom = ceil(N × 1.3)` covers a 30% traffic spike without paging anyone.
Worked example. A Toronto SaaS runs an OpenClaw research agent fleet. Their ClawPulse task tracker shows λ=4.2 tasks/sec at peak, s=3.1 seconds, target U=0.65 (customer-facing, latency matters). Math: `N = ceil(4.2 × 3.1 / 0.65) = ceil(20.03) = 21 agents`. With 30% headroom: 28 agents. They had been running 60. Cutting to 28 saved $11,400/month in infra without a single SLO breach in the following six weeks, verified against the task latency dashboard.
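The same arithmetic as a small helper you can run against your own tracker numbers:
```python
import math

def fleet_size(arrival_rate: float, service_time_s: float,
               u_target: float, headroom: float = 1.3) -> tuple[int, int]:
    """Return (minimum agents, headroom-padded agents) for a given load."""
    n_min = math.ceil(arrival_rate * service_time_s / u_target)
    return n_min, math.ceil(n_min * headroom)

# The worked example above: λ = 4.2 tasks/s, s = 3.1 s, U_target = 0.65
print(fleet_size(4.2, 3.1, 0.65))  # (21, 28)
```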
Where the formula lies to you:
- If your task service time has a heavy tail (some tasks take 30s+), use the 95th-percentile service time, not the mean. Otherwise you under-provision exactly when a slow task pile-up hits.
- If task arrivals are bursty (they clump more than a Poisson arrival model predicts), pad U_target down by 0.1 — bursty traffic eats headroom faster than smooth traffic.
- If you mix model tiers in one fleet, do the math per-tier. A Haiku agent and an Opus agent have wildly different `s` and shouldn't share a capacity pool. See Anthropic's Claude pricing breakdown for tier-by-tier cost data.
ClawPulse exposes λ, s, and observed utilization on every agent — open the task feed, filter by tier, eyeball the percentile band. If your observed U routinely exceeds 0.85 during peak, you are one slow upstream call away from a queue blow-up.
Auto-Scaling Policies That Don't Thrash
Naive autoscaling — "add an agent when CPU > 80%" — is the fastest way to set your AWS bill on fire while still missing SLOs. Agent fleets need signal-aware scaling because CPU is a terrible proxy for "is the agent overloaded" when 90% of wall time is waiting on an LLM API.
Better signals to scale on, in priority order:
| Signal | When to add capacity | When to remove capacity | Cooldown |
|---|---|---|---|
| Pending task queue depth | depth > N agents × 2 for 60s | depth < N agents × 0.5 for 5min | 90s up, 5min down |
| p95 task latency | p95 > 1.5x baseline for 3min | p95 < baseline for 15min | 3min up, 15min down |
| Agent saturation (active / max concurrency) | sat > 0.85 fleet-wide for 2min | sat < 0.4 for 10min | 2min up, 10min down |
| LLM API 429s | 429 rate > 5/min | 429s zero for 10min | manual review |
| Cost burn rate | tokens/min > 1.3x 7-day avg | — (never auto-remove on cost) | alert, don't auto-scale |
Two non-obvious rules. Scale-up cooldown should always be shorter than scale-down — the cost of being slightly over-provisioned for 5 minutes is far less than the cost of dropping a paying customer's request. And never auto-scale on cost burn alone; cost spikes usually mean a prompt regression, not a traffic spike. Page a human, don't add agents to mask the bug.
Implementation note. ClawPulse exposes queue depth, fleet saturation, and per-tier latency via the metrics API. Most customers wire scaling into their orchestrator (Kubernetes HPA with a custom metrics adapter, Nomad's autoscaler, or a 50-line Python cron polling our `/api/dashboard/summary` endpoint). The full pattern is documented in our LLM rate-limiting guide.
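Here is that cron in miniature. The `queueDepth` and `saturation` field names and the `scale_fleet` helper are assumptions; check your workspace's actual summary payload and wire the helper to your orchestrator.
```python
import os
import requests

CP_TOKEN = os.environ["CP_TOKEN"]
SUMMARY_URL = "https://www.clawpulse.org/api/dashboard/summary"

def scale_fleet(delta: int) -> None: ...  # hypothetical: call your orchestrator

def tick(current_agents: int) -> None:
    """Run every 90s for scale-up checks, every 5min for scale-down."""
    summary = requests.get(
        SUMMARY_URL,
        headers={"Authorization": f"Bearer {CP_TOKEN}"},
        timeout=10,
    ).json()
    queue_depth = summary["queueDepth"]  # assumed field name
    saturation = summary["saturation"]   # assumed field name

    # Scale up fast: deep queue or saturated fleet (first rows of the table).
    if queue_depth > current_agents * 2 or saturation > 0.85:
        scale_fleet(max(1, current_agents // 10))
    # Scale down slowly: near-empty queue AND an idle fleet.
    elif queue_depth < current_agents * 0.5 and saturation < 0.4:
        scale_fleet(-1)
```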
Fleet-Wide Secret Rotation Without Downtime
When your CTO asks "what happens if our Anthropic API key leaks tomorrow?", the right answer is "we rotate in under 10 minutes with zero failed tasks." The wrong answer — what most teams have — is "we'd page everyone, manually SSH into 40 hosts, and pray."
Here is the rotation pattern that works for fleets up to ~500 agents:
```bash
#!/bin/bash
# rotate-fleet-secret.sh — zero-downtime API key rotation
# Requires: ClawPulse API token, new + old API keys staged in vault
set -euo pipefail

NEW_KEY="$1"   # passed in from your secret manager
WAVE_SIZE=10   # rotate this many agents per wave
COOLDOWN=30    # seconds between waves
wave_count=0

# 1. Push new key to vault under the "next" alias (agents read both old + next)
vault kv put secret/anthropic/next api_key="$NEW_KEY"

# 2. Get fleet inventory from ClawPulse, sort by tag (rotate non-prod first)
AGENTS=$(curl -sS -H "Authorization: Bearer $CP_TOKEN" \
  https://www.clawpulse.org/api/dashboard/instances \
  | jq -r '.instances | sort_by(.tags.env) | .[].id')

# 3. Roll through fleet in waves, watching error rate after each wave
for agent in $AGENTS; do
  ssh "$agent" 'sudo systemctl reload clawpulse-agent'  # picks up "next" alias
  wave_count=$((wave_count + 1))  # note: ((wave_count++)) would trip set -e at 0
  if (( wave_count % WAVE_SIZE == 0 )); then
    sleep "$COOLDOWN"
    ERR_RATE=$(curl -sS -H "Authorization: Bearer $CP_TOKEN" \
      "https://www.clawpulse.org/api/dashboard/summary" \
      | jq '.errorRate1m')
    if (( $(echo "$ERR_RATE > 0.02" | bc -l) )); then
      echo "ABORT: error rate ${ERR_RATE} during rotation, halting"
      exit 1
    fi
  fi
done

# 4. Promote "next" to current, then clear the staging alias
vault kv put secret/anthropic/current api_key="$NEW_KEY"
vault kv delete secret/anthropic/next
```
The two things this gets right that ad-hoc rotation gets wrong: (a) it watches the ClawPulse error-rate metric during the rotation, so if the new key is bad you abort before touching 80% of the fleet; (b) it stages new + old simultaneously so an in-flight task signed against the old key doesn't fail mid-rotation.
For OpenAI rotation, swap the vault path. For multi-provider fleets, run two rotations sequentially (Anthropic, then OpenAI) — never both at once. See OpenAI's API key best practices for the upstream guidance.
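The agent-side half of the pattern is a key loader that prefers the staged `next` alias and falls back to `current`. A sketch, assuming agents shell out to the same vault CLI the rotation script uses and that the systemd reload re-runs this loader:
```python
import json
import subprocess

def load_api_key() -> str:
    """Prefer the staged 'next' key during a rotation, else 'current'."""
    for alias in ("next", "current"):
        try:
            out = subprocess.run(
                ["vault", "kv", "get", "-format=json", f"secret/anthropic/{alias}"],
                capture_output=True, check=True, text=True,
            )
            # KV v2 wraps the payload in data.data
            return json.loads(out.stdout)["data"]["data"]["api_key"]
        except subprocess.CalledProcessError:
            continue  # alias not staged yet, try the next one
    raise RuntimeError("no Anthropic API key found in vault")
```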
Chaos Engineering for Agent Fleets — What to Break First
If you have never deliberately broken your fleet, you are running on hope. The first time chaos arrives unannounced — the LLM API has a regional outage, your DNS provider hiccups, a bad deploy ships at 3am — you find out which assumptions were load-bearing. The whole point of chaos engineering is to find those before traffic does.
Five experiments to run, in order of safety:
1. Kill one agent. `kill -9` a random agent during business hours. Does the fleet absorb the load? Does ClawPulse fire the right alert within 60 seconds? Does the agent come back via systemd? If any answer is "no", fix that before doing the next experiment.
2. Stall one agent. `kill -STOP` an agent (it appears alive but processes nothing). This catches the silent-failure pattern: heartbeat OK, throughput 0. ClawPulse's task-rate alerts should catch this — verify they do.
3. Network partition one agent from the LLM API. `iptables` block egress to `api.anthropic.com` on one host. The agent should either fail-fast or fall back to a secondary provider. If it hangs, you have a timeout config bug.
4. Inject latency. `tc qdisc` add 2-second latency to LLM API calls on 10% of the fleet. Does p95 climb proportionally? Does autoscaling react? Does cost stay sane?
5. Drop a region. If you run multi-region, depool a whole region for 15 minutes. The remaining regions should absorb traffic; ClawPulse should show the fleet rebalancing. This is the experiment that separates "we have multi-region" from "we have multi-region that actually works."
Run each experiment once a quarter, write the postmortem even if nothing broke, and keep the postmortems searchable. The pattern follows Google SRE chaos guidance — small, frequent, learn-on-the-cheap.
GitOps for Fleet Configuration: One PR, 100 Agents Updated
The "I'll just SSH in and edit the config" era ends around the 20-agent mark. After that, you need declarative fleet config that lives in git, gets reviewed, and ships through CI. The mental model is the same as Terraform for infrastructure: the repo is the source of truth, drift is detected, rollback is `git revert`.
Minimum viable fleet repo structure:
```
fleet-config/
├── agents/
│ ├── _base.yaml # shared defaults
│ ├── tier-fast.yaml # haiku, low-latency
│ ├── tier-smart.yaml # opus, deep reasoning
│ └── tier-batch.yaml # async, overnight
├── shards/
│ ├── tenant-acme.yaml # one tenant
│ └── tenant-globex.yaml
├── alerts/
│ └── clawpulse-rules.yaml # alert rules as code
└── .github/workflows/
└── apply.yaml # CI: lint → diff → apply via ClawPulse API
```
A typical PR: bump `_base.yaml` from `model=claude-haiku-4` to `model=claude-haiku-4-5`. CI lints the YAML, runs `clawpulse fleet diff` to show exactly which agents change, blocks merge if a reviewer hasn't approved cost-impacting changes. On merge, the workflow rolls the change in waves of 10 with auto-rollback on error spike.
The win is incident response, not just reviewability. When a bad config ships at midnight, the on-call engineer types `git revert` instead of trying to remember which fields they touched on which hosts. Time-to-recover drops from 30 minutes to 30 seconds. Several of our self-hosted customers report this is the single change with the biggest operational return.
Postmortem: How an 80-Agent Fleet Burned $4,200 Overnight (and the Three Alerts That Would Have Caught It)
A Vancouver legal-tech team running 80 OpenClaw agents on document analysis shipped a "minor" prompt change at 18:42 on a Friday. The change added an extended-thinking instruction to one of the system prompts. Nothing crashed. Latency moved from 2.1s to 2.4s — within tolerance. Errors stayed at 0.3%. Everyone went home.
By Monday at 09:00, they had burned $4,217 in unexpected token spend. The extended-thinking instruction multiplied output tokens by ~6x on a subset of long-document tasks (~15% of traffic), but no per-task cost alert was wired. The first signal anyone saw was the Anthropic billing email at 02:00 Monday.
What ClawPulse showed in retrospect (after the team enabled the right rules):
- Cost burn-rate alert (would have fired at 19:11 Friday): tokens/min crossed 1.3x the 7-day rolling average within 30 minutes of the deploy. The single most useful alert for catching prompt regressions.
- Per-task cost percentile alert (would have fired at 19:34): p95 task cost went from $0.012 to $0.071. A 6x jump in p95 cost-per-task is essentially always a prompt change you didn't think through.
- Output-tokens-per-task alert (would have fired at 19:08): mean output tokens per task crossed 2x baseline. The fastest signal of an extended-thinking regression specifically.
After the postmortem they wired all three. Six weeks later the same pattern almost recurred (different prompt, same shape) — caught in 11 minutes, capped at $34 of damage. The pattern lives in our error-tracking pillar for teams who want to copy the alert rules verbatim.
The general lesson: in agent fleets, cost is a leading indicator of prompt regressions. Wire cost burn-rate alerts before you wire latency alerts — they catch a bigger class of incidents earlier.
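For teams wiring the burn-rate rule by hand, here is a minimal sketch of the detector. It assumes you feed it one tokens-per-minute sample each minute, for example from `task.tokens_out` rollups:
```python
from collections import deque

WINDOW_MINUTES = 7 * 24 * 60  # 7-day rolling window
THRESHOLD = 1.3               # fire at 1.3x the rolling average

class BurnRateAlert:
    def __init__(self) -> None:
        self.samples: deque[float] = deque(maxlen=WINDOW_MINUTES)

    def observe(self, tokens_per_min: float) -> bool:
        """Record one sample; return True if it breaches the burn-rate rule."""
        fired = False
        if len(self.samples) >= 60:  # require at least an hour of baseline
            baseline = sum(self.samples) / len(self.samples)
            fired = tokens_per_min > THRESHOLD * baseline
        self.samples.append(tokens_per_min)
        return fired
```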
Frequently Asked Questions — Fleet Operations
How do I right-size my agent fleet without expensive over-provisioning?
Measure peak arrival rate (λ), mean service time (s), and pick a utilization ceiling (0.65 for latency-sensitive, 0.85 for batch). Minimum agents = ceil(λ × s / U). Add 30% headroom. ClawPulse exposes all three numbers in the task tracker — most teams discover they are 2–4x over-provisioned within their first week.
What's the right autoscaling signal for an LLM agent fleet?
Pending queue depth and p95 task latency, in that order. CPU is a misleading proxy because most agent wall time is spent waiting on the LLM. Never auto-scale on cost burn — cost spikes are usually prompt regressions, page a human instead.
How do I rotate API keys across the fleet without downtime?
Stage the new key alongside the old one in your vault, roll agents through systemd reload in waves of 10, watch the ClawPulse error rate between waves, and abort the rollout if errors cross 2%. The full bash script is above.
What chaos experiments should we run on an agent fleet?
In order: kill one agent, stall one agent (kill -STOP), partition one agent from the LLM API, inject 2-second latency on 10% of the fleet, and depool a whole region. Run each quarterly, postmortem every time, even when nothing breaks.
Should fleet config live in git?
Past about 20 agents, yes. Declarative YAML in a repo, CI runs `clawpulse fleet diff` on every PR, rollouts go in waves with auto-rollback on error spike. Time-to-recover drops from minutes to seconds because rollback is `git revert`, not "remember what you touched."
How do I catch a bad prompt deploy before it burns through my budget?
Three alerts: cost burn-rate (tokens/min > 1.3x rolling 7-day average), per-task cost p95 (> 2x baseline), and output-tokens-per-task mean (> 2x baseline). Cost is a leading indicator of prompt regressions in agent fleets — wire these before you wire latency alerts.
> MCP server in your stack? See Best practices for monitoring MCP server performance and How to prevent destructive behavior in MCP tool monitoring for the latest playbooks.