
Claude Code Monitoring: How to Track Token Usage, Errors, and Latency in Production

Claude Code is becoming the default coding agent for engineering teams shipping AI-powered features, yet most of them run it in production with zero observability. This guide shows you exactly what to monitor, how to instrument your agents in under 10 minutes, and how the major observability platforms compare for Claude-specific workloads.

Why Claude Code Monitoring Is Different From Generic LLM Logging

Most LLM observability tools were built for single-shot completions: send a prompt, log the response, calculate cost. Claude Code agents break that model entirely. A single `claude` CLI invocation can spawn 40+ tool calls, read dozens of files, branch into subagents, and consume between 5,000 and 200,000 tokens depending on the task.

Generic dashboards collapse under that complexity. You end up looking at a flat list of API calls with no idea which ones belong to which agent run, which subagent triggered them, or whether the 87,000-token spike at 3:14 AM was a runaway loop or legitimate work.

Claude Code monitoring needs to track:

  • Per-session token usage including input, output, cache reads, and cache writes (the four pricing dimensions of Claude's API)
  • Tool call traces — which tools fired, in what order, with what arguments
  • Subagent hierarchy — Claude Code spawns subagents via the Task tool, and these need to nest under the parent run
  • Cache hit ratio — a healthy Claude Code workload reads from cache 70-90% of the time; if you drop below 50%, your bill triples
  • Latency by tool — Bash and Edit tools dominate latency in most coding agents, not the model itself

If your monitoring stack doesn't expose all of these, you're flying blind. ClawPulse was built specifically for agent observability, but the principles below apply regardless of which tool you use.
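Concretely, the session-level record looks something like this (a sketch; the field names are illustrative rather than any particular tool's schema):

```python
from dataclasses import dataclass, field

@dataclass
class SessionMetrics:
    """Per-session record covering the dimensions listed above."""
    session_id: str
    parent_session_id: str | None = None   # set when spawned via the Task tool
    input_tokens: int = 0                  # billed at 1x
    output_tokens: int = 0
    cache_read_tokens: int = 0             # billed at 0.1x the input price
    cache_creation_tokens: int = 0         # billed at 1.25x the input price
    tool_calls: list[dict] = field(default_factory=list)  # name, args, latency_ms

    @property
    def cache_hit_ratio(self) -> float:
        denom = self.cache_read_tokens + self.cache_creation_tokens + self.input_tokens
        return self.cache_read_tokens / denom if denom else 0.0
```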

The Four Metrics That Actually Matter

Across hundreds of instrumented Claude Code deployments, four metrics consistently predict production incidents before they happen.

1. Cache Read Ratio

Claude's prompt caching cuts input costs by 90% (from $3/MTok to $0.30/MTok on Sonnet 4.6). A well-tuned Claude Code agent should hit cache on 70-90% of input tokens. When this drops, it's almost always because:

  • A non-deterministic field (timestamp, UUID, random user ID) was injected near the top of the system prompt
  • The CLAUDE.md or memory files were modified mid-session, invalidating the cache
  • The user is on a fresh session with no warm cache yet

Track `cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)` and alert when it falls below 0.5 for more than 10 minutes.
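As a sketch, here is a rolling-window version of that check over SDK responses (the `usage` object is the response's usage block; the windowing logic is illustrative):

```python
import time
from collections import deque

class CacheRatioAlert:
    """Fires when the cache read ratio stays below a threshold for a full window."""

    def __init__(self, threshold: float = 0.5, window_s: int = 600):
        self.threshold = threshold
        self.window_s = window_s
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, ratio)

    def observe(self, usage) -> bool:
        reads = usage.cache_read_input_tokens or 0
        writes = usage.cache_creation_input_tokens or 0
        denom = reads + writes + usage.input_tokens
        now = time.time()
        self.samples.append((now, reads / denom if denom else 1.0))
        # Keep only samples inside the window.
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        # Alert only once the window is (mostly) covered and every sample is low.
        covered = now - self.samples[0][0] >= 0.9 * self.window_s
        return covered and all(r < self.threshold for _, r in self.samples)
```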

2. Tokens Per Resolved Task

A Claude Code task that fixes a typo should cost ~5,000 tokens. A task that refactors a service should cost ~80,000 tokens. When your average climbs above 150,000 tokens per task, you have a runaway loop problem — usually an agent re-reading the same files because it forgot what it found three turns ago.

3. Tool Call Latency P95

The model itself is rarely the bottleneck. Bash commands hitting slow test suites, Edit tools blocked on large file reads, and WebFetch calls timing out are what kill perceived performance. Break out latency by tool name, not by overall request.
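As a sketch over a batch of tool-call events (the field names assume the hook payload shown later in this post):

```python
import statistics
from collections import defaultdict

def p95_latency_by_tool(events: list[dict]) -> dict[str, float]:
    """P95 latency per tool name, from events carrying tool_name and latency_ms."""
    by_tool: dict[str, list[float]] = defaultdict(list)
    for event in events:
        by_tool[event["tool_name"]].append(event["latency_ms"])
    return {
        tool: statistics.quantiles(vals, n=20)[18]  # 19th of 20 cut points = P95
        for tool, vals in by_tool.items()
        if len(vals) >= 2  # quantiles() needs at least two data points
    }
```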

4. Error Rate by Error Class

Group errors into four buckets: `rate_limit`, `context_overflow`, `tool_validation`, and `model_refusal`. A spike in any single class points to a different root cause and a different fix.
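A classifier sketch; the error-type strings and message heuristics here are assumptions to verify against what your SDK actually surfaces:

```python
def classify_error(error_type: str, message: str = "") -> str:
    """Map a raw API error to one of the four alerting buckets (heuristic)."""
    if error_type in ("rate_limit_error", "overloaded_error"):
        return "rate_limit"           # retryable capacity errors
    if "too long" in message.lower() or "context" in message.lower():
        return "context_overflow"     # prompt exceeded the context window
    if error_type == "invalid_request_error":
        return "tool_validation"      # e.g. tool input failed schema checks
    return "model_refusal"            # fallback; refine with your own taxonomy
```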

Instrumenting Claude Code in 10 Minutes

Here's the minimum viable instrumentation for a Python-based agent that wraps the Anthropic SDK directly. The same pattern works whether you're using the official SDK or the Claude Code CLI.

```python
import time

import anthropic

from clawpulse import ClawPulse

cp = ClawPulse(api_key="cp_live_...")
client = anthropic.Anthropic()

# SYSTEM_PROMPT and TOOLS are defined elsewhere in your agent code.

def run_agent(task: str, session_id: str):
    trace = cp.start_trace(
        name="claude-code-session",
        session_id=session_id,
        metadata={"task": task[:200]},
    )

    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8192,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Cache the static system prompt so repeat calls bill at 0.1x.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": task}],
        tools=TOOLS,
    )

    # Log all four token dimensions; total input alone hides cache economics.
    trace.log(
        latency_ms=int((time.time() - start) * 1000),
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        cache_read_tokens=response.usage.cache_read_input_tokens,
        cache_creation_tokens=response.usage.cache_creation_input_tokens,
        stop_reason=response.stop_reason,
    )
    return response
```

For the Claude Code CLI specifically, hook into the `Stop` and `PostToolUse` events via settings.json hooks. This captures every tool call without modifying agent code:

```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "*",
      "hooks": [{
        "type": "command",
        "command": "curl -X POST https://api.clawpulse.org/v1/events -d @-"
      }]
    }]
  }
}
```

The hook receives a JSON payload with `tool_name`, `tool_input`, `tool_response`, and timing data on stdin. Pipe it into any HTTP endpoint.
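If you want to enrich the payload before it leaves the box, a small stdin forwarder can stand in for the curl command (a sketch; the endpoint is the one from the hook example above, everything else is an assumption):

```python
#!/usr/bin/env python3
"""Reads a PostToolUse payload from stdin, stamps it, and POSTs it onward."""
import json
import sys
import time
import urllib.request

payload = json.load(sys.stdin)        # tool_name, tool_input, tool_response, ...
payload["received_at"] = time.time()  # add our own receive timestamp

req = urllib.request.Request(
    "https://api.clawpulse.org/v1/events",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req, timeout=5)
```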


ClawPulse vs Langfuse vs Helicone for Claude Code

Honest comparison, no marketing fluff. We tested all three with the same Claude Code workload (a 40-tool-call refactoring agent running 200 sessions per day).

| Feature | ClawPulse | Langfuse | Helicone |
|---|---|---|---|
| Native Claude Code hook integration | Yes | No (manual SDK) | Proxy-based |
| Cache hit ratio dashboard | Built-in | Manual query | Manual query |
| Subagent hierarchy view | Built-in | Trace-based | No |
| Tool call latency breakdown | Built-in | Custom dashboards | Limited |
| Cost per session (4 token types) | Yes | Yes | Yes |
| Self-host option | No | Yes (OSS) | Yes (OSS) |
| Free tier | 10k events/mo | 50k observations/mo | 10k requests/mo |
| Setup time for Claude Code | ~3 min | ~30 min | ~10 min (proxy) |

When to pick Langfuse: You need self-hosting for compliance reasons, you're already running Postgres at scale, and you have engineers willing to build custom dashboards for agent-specific views.

When to pick Helicone: You're a multi-provider shop (OpenAI + Anthropic + Gemini) and want a unified proxy that works across all of them. The proxy approach is zero-code-change, but adds a network hop.

When to pick ClawPulse: You're Claude-first or Claude-only, you want agent-specific dashboards out of the box, and you don't want to build observability infrastructure yourself. See our pricing page for current tiers.

Setting Up Alerts That Don't Page You at 3 AM

The single biggest mistake teams make is alerting on raw error rates. Claude returns `overloaded_error` and `rate_limit_error` regularly under normal load — your retry logic should handle these silently. Alert instead on:

1. Sustained cache miss rate above 50% for 10 minutes — almost always a deploy that broke prompt structure

2. P95 session cost above $0.50 — a runaway loop is in progress

3. Tool call validation errors above 5% of calls — schema drift between your code and the model's understanding

4. Zero successful sessions in the last 5 minutes during business hours — full outage

Skip alerts on individual 429s, 529s, or single-session token spikes. They're noise. For more on building resilient agent loops, see our guide on handling Claude API errors in production.
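If your alerting layer takes declarative rules, the four above reduce to a small table (a sketch; the metric names and rule shape are made up for illustration, not any vendor's schema):

```python
# Illustrative definitions for the four alerts above.
ALERT_RULES = [
    {"metric": "cache_miss_rate",        "above": 0.50, "sustained_s": 600},
    {"metric": "session_cost_p95_usd",   "above": 0.50, "sustained_s": 0},
    {"metric": "tool_validation_rate",   "above": 0.05, "sustained_s": 0},
    {"metric": "successful_sessions_5m", "below": 1,    "sustained_s": 0,
     "schedule": "business_hours"},
]
```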

What to Monitor in Multi-Agent Systems

If you're using Claude Code's Task tool to spawn subagents, your monitoring needs to handle hierarchical traces. A parent session might spawn 5 Explore agents, each of which makes 20 tool calls. Without nesting, your dashboard becomes unreadable.

The pattern that works: assign a `parent_session_id` to every subagent invocation, and a `root_session_id` that propagates down through every level. Then your monitoring tool can render the tree:

```
root-session-abc123 (45s, $0.31, 87k tokens)
├── explore-agent-1 (8s, $0.04, 12k tokens)
│   ├── Glob: **/*.ts
│   ├── Grep: "useEffect"
│   └── Read: src/hooks/useAuth.ts
├── explore-agent-2 (11s, $0.06, 18k tokens)
└── plan-agent (22s, $0.18, 41k tokens)
```

ClawPulse renders this natively. With Langfuse you'd build it from spans. With Helicone you'd correlate via custom properties.
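The propagation itself is a few lines. A sketch (the function name and dict shape are illustrative; wire the IDs into your Task-tool metadata however your harness passes it):

```python
import uuid

def subagent_context(parent: dict) -> dict:
    """Build the tracing context for a child agent spawned from `parent`."""
    return {
        "session_id": f"agent-{uuid.uuid4().hex[:8]}",
        "parent_session_id": parent["session_id"],
        # root_session_id never changes, no matter how deep the tree goes
        "root_session_id": parent.get("root_session_id", parent["session_id"]),
    }
```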

FAQ

How much does it cost to monitor Claude Code in production?

For a team running 1,000 Claude Code sessions per day at average 40 tool calls each, you're looking at roughly 40,000-50,000 events per day, or 1.2M-1.5M per month. ClawPulse handles this on the $49/mo tier; Langfuse self-hosted is free but you pay for the Postgres instance (~$30-50/mo on managed); Helicone's equivalent tier is around $100/mo. Compare this to your actual Claude API spend, which for that volume is typically $200-800/day — observability is 1-3% of the total bill.

Can I monitor Claude Code without modifying my agent code?

Yes. The Claude Code hooks system lets you POST every tool call and stop event to an HTTP endpoint via settings.json. No code changes required. ClawPulse, Langfuse, and Helicone all support this pattern, though the level of dashboard polish varies.

Does prompt caching break tracing or token counts?

No, but you need to track all four token types separately. The `cache_read_input_tokens` field counts as 0.1x cost on Sonnet, `cache_creation_input_tokens` counts as 1.25x cost, and regular `input_tokens` counts as 1x. If your monitoring tool only tracks total input tokens, your cost reports will be off by 5-10x in either direction.
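A worked sketch of the arithmetic (the $3/MTok input price is the Sonnet figure quoted earlier; the output price is an assumption, so substitute your model's actual rate):

```python
def call_cost_usd(usage, input_per_mtok: float = 3.00,
                  output_per_mtok: float = 15.00) -> float:
    """Cost of one API call from the four token counts (prices per MTok)."""
    reads = usage.cache_read_input_tokens or 0
    writes = usage.cache_creation_input_tokens or 0
    return (
        usage.input_tokens * input_per_mtok          # regular input: 1x
        + reads * input_per_mtok * 0.10              # cache reads: 0.1x
        + writes * input_per_mtok * 1.25             # cache writes: 1.25x
        + usage.output_tokens * output_per_mtok
    ) / 1_000_000
```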

What's the difference between observability and evaluation for Claude Code?

Monitoring/observability tells you what happened in production: latency, cost, errors, traces. Evaluation tells you whether what happened was correct: did the agent solve the task, did it hallucinate, did the code actually compile. You need both, but they're separate tools. ClawPulse handles observability; for eval, look at Anthropic's evals cookbook or tools like Braintrust.

---

Ready to see what your Claude Code agents are actually doing in production? Try the ClawPulse demo — it takes 3 minutes to wire up and shows your first traces immediately.

