MCP Server Monitoring: How to Track Model Context Protocol Servers in Production
The Model Context Protocol (MCP) has quickly become the standard way to expose tools, data, and capabilities to LLM agents. But once your MCP servers leave localhost and start handling real traffic, you hit a wall: stdio transport eats your logs, JSON-RPC errors disappear silently, and your agent suddenly times out without explanation. This guide shows you exactly how to monitor MCP servers in production — with real code, real metrics, and honest comparisons of the tools that actually work.
Why MCP Server Monitoring Is Different
Traditional API monitoring assumes HTTP, status codes, and request/response pairs you can inspect with cURL. MCP breaks every one of those assumptions.
A typical MCP server speaks JSON-RPC 2.0 over stdio or SSE, exposes a dynamic list of `tools`, `resources`, and `prompts`, and is invoked by an LLM that decides — autonomously — which tool to call and with what arguments. That means three new failure modes you've probably never had to monitor:
1. Tool selection errors — the model picks the wrong tool, or hallucinates a tool name that doesn't exist.
2. Schema validation failures — the model produces arguments that don't match the input schema, and the server returns a `-32602` (Invalid params) error the user never sees.
3. Transport-level silence — stdio servers crash without HTTP status codes, leaving the client hanging until timeout.
According to the official MCP specification, every method call returns either a `result` or an `error` object — but your agent framework (Claude Desktop, LangChain, or a custom client) usually swallows the error and returns a generic "tool failed" message. Without proper observability, you have no idea why.
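For reference, a `-32602` failure travels back as a standard JSON-RPC error envelope. Here it is rendered as the equivalent Python dict (the `id` and `data` payload are illustrative):

```python
# What the client actually receives when the model sends bad arguments
error_response = {
    "jsonrpc": "2.0",
    "id": 42,
    "error": {
        "code": -32602,
        "message": "Invalid params",
        "data": {"detail": "missing required field 'query'"},  # optional
    },
}
```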
This is where MCP server monitoring becomes critical, and where generic APM tools (Datadog, New Relic) fall short — they don't understand JSON-RPC tool semantics or the conversational context around each call.
The Four Metrics That Actually Matter for MCP Servers
Before instrumenting anything, decide what you're measuring. After a year of running MCP servers in production, we've found four metrics that consistently catch real problems.
1. Tool call latency (p50, p95, p99)
Agents make multiple tool calls per turn. If your MCP `search_docs` tool has a p95 of 4 seconds, a 5-step agent loop becomes a 20-second wait. Track latency per tool, not per server — averaging hides the slow tool that's killing UX.
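To make this concrete, here's a minimal sketch that computes per-tool percentiles from the `duration_ms` values your structured logs emit (the sample numbers are invented; `statistics.quantiles` needs at least two samples per tool):

```python
from collections import defaultdict
from statistics import quantiles

# duration_ms samples pulled from the structured logs, keyed by tool
durations_ms: dict[str, list[float]] = defaultdict(list)
durations_ms["search_docs"] += [120.0, 180.0, 240.0, 4100.0, 150.0, 130.0]
durations_ms["read_file"] += [20.0, 25.0, 22.0, 30.0, 21.0, 19.0]

for tool, samples in durations_ms.items():
    cuts = quantiles(samples, n=100)  # 99 cut points; index k is p(k+1)
    print(f"{tool}: p50={cuts[49]:.0f}ms p95={cuts[94]:.0f}ms p99={cuts[98]:.0f}ms")
```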
2. Tool error rate by error code
JSON-RPC defines specific error codes you should bucket:
- `-32700` Parse error — almost always a bug in your serializer
- `-32600` Invalid Request — malformed client (often a framework version mismatch)
- `-32601` Method not found — the model hallucinated a tool
- `-32602` Invalid params — the model produced bad arguments (most common)
- `-32603` Internal error — your tool implementation crashed
Tracking error rate by code tells you whether to fix your server or improve your prompt.
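A minimal bucketing sketch over the `error_code` field from the structured logs (counts here are illustrative):

```python
from collections import Counter

# One entry per failed call, pulled from the error_code log field
error_codes = [-32602, -32602, -32603, -32601, -32602]
total_calls = 1000  # every call in the window, not just failures

for code, count in Counter(error_codes).most_common():
    print(f"{code}: {count} failures ({count / total_calls:.1%} of all calls)")
```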
3. Tool selection accuracy
When the model has 15 tools available, is it picking the right one? You can measure this by logging `(user_intent, tool_called)` pairs and sampling them for review. A drop from 92% to 78% accuracy after a model swap (e.g., upgrading to a new Claude version) is exactly the kind of regression you want to catch.
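One low-tech way to run that review, sketched with invented log entries: dump a random sample of the logged pairs for human labeling, then track labeled accuracy per model version so a model swap shows up as a step change.

```python
import json
import random

# (user_intent, tool_called) pairs pulled from the logs; entries invented
pairs = [
    {"intent": "find the pricing docs", "tool": "search_docs"},
    {"intent": "open my config file", "tool": "search_docs"},  # likely wrong
]

# Dump a random sample for human review; accuracy = correct / sampled
sample = random.sample(pairs, k=min(50, len(pairs)))
print(json.dumps(sample, indent=2))
```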
4. Token cost per tool call
Each MCP tool definition consumes tokens in your system prompt. With 20 tools at ~150 tokens each, you're burning 3,000 input tokens on every single message — at Claude Sonnet 4.6 pricing (~$3/MTok input), that's $0.009 per message before any actual work. See our breakdown of agent token costs for the full math.
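The arithmetic, spelled out (prices are approximate and change, so check current rates):

```python
# Back-of-envelope cost of tool definitions alone
num_tools = 20
tokens_per_tool = 150
input_price_per_mtok = 3.00  # USD per million input tokens, approximate

tokens_per_message = num_tools * tokens_per_tool              # 3,000
cost_per_message = tokens_per_message / 1_000_000 * input_price_per_mtok
print(f"${cost_per_message:.4f} per message")                 # $0.0090
```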
Instrumenting an MCP Server: A Practical Example
Let's add monitoring to a Python MCP server using the official `mcp` SDK. The pattern is identical for the TypeScript SDK.
```python
import functools
import logging
import time

from mcp.server import Server
from mcp.types import TextContent

logger = logging.getLogger("mcp.monitor")
server = Server("docs-search")


def monitor_tool(func):
    """Decorator that emits structured logs for every tool call."""

    @functools.wraps(func)
    async def wrapper(name: str, arguments: dict):
        start = time.perf_counter()
        status = "ok"
        error_code = None
        try:
            return await func(name, arguments)
        except ValueError:
            status = "error"
            error_code = -32602  # Invalid params
            raise
        except Exception:
            status = "error"
            error_code = -32603  # Internal error
            raise
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "mcp_tool_call",
                extra={
                    "tool": name,
                    "duration_ms": round(duration_ms, 2),
                    "status": status,
                    "error_code": error_code,
                    # key names only, never values (PII risk)
                    "arg_keys": list(arguments.keys()),
                },
            )

    return wrapper


@server.call_tool()
@monitor_tool
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "search_docs":
        # ...your tool logic...
        return [TextContent(type="text", text="results")]
    raise ValueError(f"Unknown tool: {name}")
```
Two things to notice. First, we capture `arg_keys` — not the full arguments — to avoid leaking PII into logs. Second, we map exceptions to JSON-RPC error codes so downstream dashboards can filter consistently.
Shipping Logs to a Monitoring Backend
Structured logs are useless if they live in stdout. You need to ship them somewhere queryable. Three options, ranked by how much pain they save you.
Option 1: ClawPulse (purpose-built for agent monitoring)
ClawPulse ingests MCP tool calls natively — it understands the JSON-RPC envelope, groups calls by agent session, and correlates tool latency back to the LLM turn that triggered them. Setup is one HTTP POST per call:
```python
import httpx

# Reuse a single AsyncClient in production; shown inline for brevity
async with httpx.AsyncClient() as client:
    await client.post(
        "https://api.clawpulse.org/v1/mcp/events",
        headers={"Authorization": f"Bearer {CLAWPULSE_KEY}"},
        json={
            "tool": name,
            "duration_ms": duration_ms,
            "status": status,
            "error_code": error_code,
            "session_id": session_id,
        },
    )
```
You get a per-tool dashboard, alerts on p95 regressions, and a tool-selection accuracy view that samples real calls. Try it on the live demo — no credit card.
Option 2: Langfuse
Langfuse is the most mature open-source LLM observability tool, and it does support tool spans through its tracing SDK. The catch: it's built around the OpenTelemetry trace model, so you have to manually wrap each tool call in a span and attach the JSON-RPC metadata yourself. Powerful, but expect a half-day of integration work.
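As a sketch of what that wrapping looks like with the standard OpenTelemetry Python API (the span and attribute names here are our own convention, not a Langfuse or OTel standard; `call_tool` is the handler from the instrumentation example above):

```python
from opentelemetry import trace

tracer = trace.get_tracer("mcp.tools")

async def traced_tool_call(name: str, arguments: dict):
    # One span per tool call, with JSON-RPC metadata attached manually
    with tracer.start_as_current_span(f"mcp.tool.{name}") as span:
        span.set_attribute("mcp.tool.name", name)
        span.set_attribute("mcp.arg_keys", list(arguments.keys()))
        try:
            return await call_tool(name, arguments)
        except ValueError:
            span.set_attribute("jsonrpc.error_code", -32602)
            raise
```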
Option 3: Helicone
Helicone is excellent for proxying LLM API calls, but it doesn't have native MCP support — it sees your upstream call to Anthropic, not the downstream tool execution. You can ship custom events via their `Helicone-Property-*` headers, but you're essentially building MCP support yourself on top of a generic logger.
For pure HTTP LLM monitoring, Helicone is great. For MCP specifically, it's the wrong shape.
Alerting: What to Page On (and What to Ignore)
The temptation with new monitoring is to alert on everything. Don't. After tuning thousands of agent alerts, we've found only four conditions worth paging a human for:
- Tool error rate > 5% over 5 minutes — something is broken (server crash, schema drift, dependency outage).
- p95 latency > 2x baseline for 10 minutes — a slow tool is degrading every agent turn.
- `-32601` Method not found > 1% of calls — your model is hallucinating tools, often after a prompt change.
- Cost per session > $0.50 — runaway loops or context bloat. See debugging agent loops for diagnostics.
Everything else (individual call failures, transient timeouts, single-digit error spikes) goes to a dashboard, not PagerDuty.
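For the first rule, a minimal in-process sliding-window check is enough to get started (thresholds from the list above; the 20-call floor is our own guard against noisy low-traffic windows):

```python
import time
from collections import deque

WINDOW_S = 300          # 5-minute window
ERROR_THRESHOLD = 0.05  # 5% error rate
MIN_CALLS = 20          # don't alert on a handful of calls

calls: deque[tuple[float, bool]] = deque()  # (timestamp, is_error)

def record_call(is_error: bool) -> bool:
    """Record one tool call; return True when the error-rate alert should fire."""
    now = time.time()
    calls.append((now, is_error))
    while calls and calls[0][0] < now - WINDOW_S:
        calls.popleft()  # drop calls that fell out of the window
    errors = sum(1 for _, e in calls if e)
    return len(calls) >= MIN_CALLS and errors / len(calls) > ERROR_THRESHOLD
```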
Distributed Tracing for Multi-Server Agent Setups
Modern agents rarely talk to a single MCP server. A Claude Desktop user might have GitHub, Filesystem, and Postgres servers attached simultaneously. When something breaks, you need to see the full call graph.
The trick is propagating a `session_id` from the LLM client through every MCP call. The protocol doesn't define this natively, but you can pass it via the `_meta` field on tool requests — most SDKs preserve it round-trip:
```python
session_id = arguments.get("_meta", {}).get("session_id", "unknown")
```
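On the client side, that means attaching the ID when the `tools/call` request is issued. A sketch of the raw request, nesting `_meta` inside `arguments` to match the server-side read above (whether it survives the round-trip depends on your client SDK):

```python
# Raw JSON-RPC request for tools/call, carrying the session ID in _meta
request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {
            "query": "pricing",
            "_meta": {"session_id": "sess-abc123"},
        },
    },
}
```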
With session IDs in place, ClawPulse (and Langfuse, with manual setup) can render a flame graph showing exactly which tool call in which server caused the 8-second hang.
Common Pitfalls We've Seen in Production
A few things that have burned real teams running MCP servers:
- Logging full tool arguments. Tool inputs frequently contain user PII, API keys, or proprietary data. Always log argument schemas or keys, never values, unless you have explicit redaction.
- Counting initialization handshakes as latency. The MCP `initialize` method runs once per session and can take 200-500ms. Exclude it from your p95 calculations or you'll think every tool is slow.
- Treating SSE and stdio identically. SSE transports give you HTTP-level metrics for free; stdio transports require process-level monitoring (CPU, memory, exit codes) on top of JSON-RPC metrics.
- Ignoring tool list drift. When your `tools/list` response changes (you added or renamed a tool), every cached agent context is now stale. Log every `tools/list` response and diff them over time; a fingerprint sketch follows this list.
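A minimal version of that diffing: fingerprint each `tools/list` response with a stable hash and flag any change (the canonicalization here is our own choice):

```python
import hashlib
import json

def tools_fingerprint(tools: list[dict]) -> str:
    """Stable short hash of a tools/list response for cheap drift detection."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Log the fingerprint with each session; a change means every cached
# agent context built on the old tool list is now stale.
```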
Pricing Reality Check
If you're evaluating whether monitoring is worth the cost, do the math for your traffic:
| Tool | Free tier | Paid (10K events/day) |
|------|-----------|----------------------|
| ClawPulse | 5K events/mo | $29/mo — see pricing |
| Langfuse Cloud | 50K observations/mo | $59/mo (Pro) |
| Helicone | 100K requests/mo | $20/mo (Pro, but MCP requires custom work) |
| Self-hosted Langfuse | Unlimited | Your infra cost (~$40-100/mo on a small VPS) |
For most teams, the build-vs-buy break-even is around 50K events/month. Below that, self-hosting Langfuse or using a free tier is cheapest. Above that, the operational cost of running your own observability stack starts to dominate.
FAQ
Q: Can I monitor MCP servers running over stdio?
Yes. You instrument the server process directly (the decorator pattern shown above); the transport doesn't change the instrumentation, only how you ship logs out of the process, whether via HTTP, syslog, or a sidecar.
Q: How do I monitor third-party MCP servers I didn't write?
Wrap them in a thin proxy. Run the third-party server as a subprocess, intercept JSON-RPC messages on stdin/stdout, log them, and forward. The MCP Inspector project shows the basic pattern.
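A bare-bones version of that proxy in Python, assuming newline-delimited JSON-RPC over stdio (the server command is hypothetical, and a real proxy also needs to handle stderr and process exit):

```python
import json
import subprocess
import sys
import threading

def pump(src, dst, direction: str) -> None:
    """Forward newline-delimited JSON-RPC messages, logging each one."""
    for line in iter(src.readline, b""):
        try:
            msg = json.loads(line)
            label = msg.get("method") or f"response id={msg.get('id')}"
            sys.stderr.write(f"[{direction}] {label}\n")
        except json.JSONDecodeError:
            pass  # forward non-JSON output untouched
        dst.write(line)
        dst.flush()

proc = subprocess.Popen(
    ["python", "third_party_server.py"],  # hypothetical server command
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
# client -> server on a background thread; server -> client on the main thread
threading.Thread(
    target=pump, args=(sys.stdin.buffer, proc.stdin, "c->s"), daemon=True
).start()
pump(proc.stdout, sys.stdout.buffer, "s->c")
```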
Q: Does ClawPulse work with non-Anthropic models like GPT-5 or Gemini?
Yes. While ClawPulse was built around Claude workflows, the MCP monitoring layer is model-agnostic — it tracks the JSON-RPC layer, which is identical regardless of which LLM is making the calls.
Q: How is this different from regular APM like Datadog?
APM tools track HTTP requests and database queries. They don't understand that "tool selection accuracy" is a metric, that JSON-RPC error code -32602 means the LLM produced bad arguments, or that 20 tools in a system prompt cost real money. Purpose-built MCP monitoring closes that semantic gap.
---
Ready to see your MCP servers' real performance? Try the ClawPulse live demo — connect your first MCP server in under 3 minutes, no credit card required. You'll see p95 latency, error breakdowns by JSON-RPC code, and per-tool cost in the first session.