Monitoring
How monitoring fits into the loop
Tracing provides a complete record of what your LLM app is doing — every request, every model call, every tool use. Monitoring is how you make sense of that data. It gives you two things: a continuous view of how your system performs over time, and a way to surface the specific traces worth investigating — errors, user behavior patterns, and cases where something went unexpectedly wrong. Together, they shift you from having data to actually understanding your system well enough to improve it.
Metrics and signals
It helps to separate monitoring into two distinct activities, because they answer different questions.
Aggregate metrics tracking tells you whether things are getting better or worse over time. Cost, latency, evaluation scores — these become trends you can watch and reason about. Did that prompt change last Tuesday improve anything? Is quality drifting as usage grows?
Signal detection tells you where to look right now. It surfaces individual traces that are worth investigating — an error, a cluster of retries, a user abandoning mid-conversation. The signal is only useful because it's attached to the specific trace that triggered it. That trace is your starting point for understanding what went wrong.
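A minimal sketch of the difference, assuming a simplified `Trace` record (your tracing backend will expose its own, richer shape with different field names):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trace:
    """Simplified stand-in for a trace record; fields are illustrative."""
    id: str
    latency_ms: float
    cost_usd: float
    error: bool
    retries: int

def aggregate_metrics(traces: list[Trace]) -> dict:
    """Trend view: one number per metric for a time window."""
    return {
        "avg_latency_ms": mean(t.latency_ms for t in traces),
        "total_cost_usd": sum(t.cost_usd for t in traces),
        "error_rate": sum(t.error for t in traces) / len(traces),
    }

def traces_to_investigate(traces: list[Trace]) -> list[Trace]:
    """Signal view: the specific traces worth opening right now."""
    return [t for t in traces if t.error or t.retries >= 2 or t.latency_ms > 10_000]
```

The same records feed both views: the aggregates tell you whether last Tuesday's prompt change helped, while the signal filter hands you the individual traces to read.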
Where metrics and signals come from
Both aggregate metrics and signal detection depend on fields attached to observations. A lot of what you need is already there once you instrument properly: latency, token-derived cost, model and route metadata, tool outcomes, and errors typically flow from your client and provider APIs without extra wiring.
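To make the fields concrete, here is a rough sketch of where each one comes from around a single model call. In practice a tracing SDK typically records most of this for you; the `client.complete` call, the price table, and the `sink` list are placeholders, not a real provider or backend API.

```python
import time

PRICE_PER_1K_TOKENS = {"input": 0.0005, "output": 0.0015}  # illustrative prices

def observed_model_call(client, prompt: str, sink: list) -> str:
    """Run one model call and capture the fields monitoring relies on.

    `client.complete` stands in for your provider SDK; `sink.append` stands in
    for your tracing backend's ingest call.
    """
    obs = {"model": "example-model", "route": "support-chat"}  # example metadata
    start = time.monotonic()
    try:
        response = client.complete(prompt)
        usage = response.usage
        # Token counts come back from the provider; cost is derived from them.
        obs["cost_usd"] = (usage.input_tokens * PRICE_PER_1K_TOKENS["input"]
                           + usage.output_tokens * PRICE_PER_1K_TOKENS["output"]) / 1000
        obs["error"] = None
        return response.text
    except Exception as exc:
        obs["error"] = repr(exc)
        raise
    finally:
        obs["latency_ms"] = (time.monotonic() - start) * 1000
        sink.append(obs)
```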
Beyond the built-in fields, you add evaluation — user feedback (explicit ratings or implicit signals like session abandonment), human annotation, and LLM-as-a-judge scores — by annotating traces manually or running automated evaluators. That data feeds into aggregate charts for tracking trends over time, and into signal rules so individual traces surface when something crosses a threshold you care about.
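Whatever the exact shape in your tooling, it helps to think of every evaluation source as producing the same kind of score record attached to a trace. A hypothetical sketch of that shared shape:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Score:
    """Hypothetical score record; the point is the shared shape, not the names."""
    trace_id: str
    name: str       # e.g. "user_feedback", "answer_relevance", "length_check"
    value: float
    source: str     # "user", "human_annotation", "llm_judge", or "code"
    created_at: datetime

def record_score(store: list, trace_id: str, name: str, value: float, source: str) -> Score:
    """Attach a score to a trace; `store.append` stands in for your backend's API."""
    score = Score(trace_id, name, value, source, datetime.now(timezone.utc))
    store.append(score)
    return score

# A manual annotation and an automated evaluator result land in the same place:
# record_score(store, "trace-123", "answer_relevance", 0.8, source="human_annotation")
# record_score(store, "trace-123", "length_check", 1.0, source="code")
```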
Explicit and implicit user feedback
User feedback is one of the richest sources of signal, but it comes in two forms with different tradeoffs.
Explicit feedback is direct: a thumbs up or down, a star rating, a comment left by the user. The signal is unambiguous, but response rates are low and skewed — dissatisfied users respond more often than satisfied ones.
Implicit feedback is derived from behavior: whether a user retried a query, disagreed with the system, copied a response, accepted a suggestion, or abandoned a conversation midway. It requires no user effort and generates high-volume data, but the signals are indirect and need interpretation. These can be surfaced using automated evaluators.
Both register as scores, so they feed into the same dashboards, trend charts, and signal rules as your other evaluation data.
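As a sketch of how both kinds of feedback turn into scores, assuming an illustrative event shape of `{"trace_id": ..., "type": ...}` for session events:

```python
def explicit_feedback_score(trace_id: str, thumbs_up: bool) -> dict:
    """Direct feedback maps straight onto a score."""
    return {"trace_id": trace_id, "name": "user_feedback",
            "value": 1.0 if thumbs_up else 0.0}

def implicit_feedback_scores(session_events: list[dict]) -> list[dict]:
    """Derive behavioral signals from session events.

    Event shape is illustrative: {"trace_id": ..., "type": "query" | "retry" |
    "copy_response" | "abandon"}.
    """
    scores = []
    for event in session_events:
        if event["type"] == "retry":
            scores.append({"trace_id": event["trace_id"],
                           "name": "user_retried", "value": 1.0})
        elif event["type"] == "copy_response":
            scores.append({"trace_id": event["trace_id"],
                           "name": "response_copied", "value": 1.0})
    if session_events and session_events[-1]["type"] == "abandon":
        scores.append({"trace_id": session_events[-1]["trace_id"],
                       "name": "session_abandoned", "value": 1.0})
    return scores
```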
There are two types of automated evaluators you can use to attach scores to traces:
- LLM-as-a-judge (for quality signals or behavioral patterns like user disagreement)
- Code-based evaluators (for precise checks, like whether the response contains a certain word or exceeds a length limit)
More on both can be found in the Evaluate section.
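Illustrative sketches of both kinds follow. The length limit, flagged-word list, and judge prompt are assumptions, and `llm_call` is a placeholder for your model client, assumed here to return a numeric string.

```python
MAX_RESPONSE_CHARS = 2000  # illustrative limit

def length_check(response: str) -> float:
    """Code-based evaluator: 1.0 if the response stays within the length limit."""
    return 1.0 if len(response) <= MAX_RESPONSE_CHARS else 0.0

def mentions_flagged_term(response: str) -> float:
    """Code-based evaluator: 1.0 if the response contains a word you want to flag."""
    flagged = {"competitorx", "competitory"}  # illustrative word list
    return 1.0 if any(word in response.lower() for word in flagged) else 0.0

JUDGE_PROMPT = (
    "Rate from 1 to 5 how well the assistant's answer addresses the user's question.\n"
    "Question: {question}\nAnswer: {answer}\nRespond with only the number."
)

def relevance_judge(llm_call, question: str, answer: str) -> float:
    """LLM-as-a-judge evaluator, normalized to 0-1 so it lines up with other scores."""
    raw = llm_call(JUDGE_PROMPT.format(question=question, answer=answer))
    return (float(raw.strip()) - 1) / 4
```

Normalizing judge output to the same 0-1 range as the code-based checks keeps all three usable in the same dashboards and signal rules.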
Where to start
Start small and build your monitoring setup from real traces rather than abstract ideas about what might matter.
- Start by looking at your data manually. Read through traces and notice what kinds of things keep appearing. You can't set up useful monitoring before you know what you're looking for.
- Use error analysis to surface what's worth tracking. Error analysis gives you a structured way to find patterns across your traces: the kinds of recurring issues worth turning into automated evaluators you can run continuously. A step-by-step guide can be found here.
- Think about how your specific application shows failure. Application-specific implicit signals — a user disagreeing in a support chat, a correction in a process automation flow — are often more actionable than generic scores, and they surface issues without manual labeling. A sketch of one such signal follows this list.
- Treat it as an iterative process. A working monitoring setup isn't something you configure once and leave. Usage patterns shift, models get updated, new failure modes emerge. Keep refining your setup so you can cut through the noise and stay focused on what actually matters.
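To make the third point concrete, here is a hypothetical disagreement detector for a support chat. The phrase list and message shape are assumptions, and in practice an LLM-as-a-judge evaluator can catch phrasings a keyword list misses.

```python
DISAGREEMENT_PHRASES = (
    "that's not what i asked",
    "that's wrong",
    "no, i meant",
    "you misunderstood",
)  # illustrative phrases; tune against your own traces

def user_disagreed(messages: list[dict]) -> float:
    """Application-specific implicit signal for a support chat.

    Message shape is illustrative: {"role": "user" | "assistant", "text": ...}.
    Returns 1.0 if any user message after the first looks like a disagreement.
    """
    for message in messages[1:]:
        if message["role"] != "user":
            continue
        text = message["text"].lower()
        if any(phrase in text for phrase in DISAGREEMENT_PHRASES):
            return 1.0
    return 0.0
```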
What comes next
When monitoring surfaces something worth investigating, you have a few options: fix it directly if the cause is obvious, capture it in a dataset if it looks like a pattern, or run a structured evaluation if you suspect something systemic. Which path you take depends on how confident you are about the cause.
- Datasets: capturing production traces for evaluation
- Experiments: testing whether a fix actually worked