Observability

This guide walks through tracing your AI agent in production with the Ashr Labs SDK. It covers everything from getting your first trace into the dashboard, to instrumenting tool calls and LLM generations, to reading analytics back out, to wiring up a realtime voice agent on LiveKit.

Observability is a separate product from the Testing Platform. The Testing Platform (datasets, eval runs, RunBuilder, EvalRunner) is for offline evaluation; Observability is for tracing your agent in production. The two share the same SDK and one API key but are independent features. Observability requires the observability feature flag to be enabled for your tenant.

Overview

Production observability has three building blocks:

  1. Trace — one user-facing interaction (a chat turn, a job run, a phone call). Top-level container.
  2. Span / Generation — one unit of work inside a trace (a tool invocation, a retrieval step, an LLM call). Spans nest arbitrarily; Generation is a Span subclass that also captures token usage and model.
  3. Event — a point-in-time record (a guardrail check, a feature flag, a cache hit). No duration.

┌────────────────────────────────────────────────────────────┐
│ Trace: "handle-support-ticket"                             │
│ ┌──────────────────┐ ┌──────────────────┐ ┌───────────┐    │
│ │ Generation:      │ │ Span:            │ │ Event:    │    │
│ │ classify-intent  │ │ tool:lookup_acct │ │ guardrail │    │
│ └──────────────────┘ └──────────────────┘ └───────────┘    │
└────────────────────────────────────────────────────────────┘
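
One way to picture how the three building blocks relate is a small data model. The sketch below is illustrative only: the class and field names are assumptions made for this example, not the SDK's internal types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """Point-in-time record; note there is no duration field."""
    name: str
    level: str = "DEFAULT"


@dataclass
class Span:
    """One unit of work; spans can hold child spans and events."""
    name: str
    duration_ms: Optional[float] = None
    children: List["Span"] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)


@dataclass
class Generation(Span):
    """A Span that additionally records the model and token usage."""
    model: str = ""
    usage: dict = field(default_factory=dict)


@dataclass
class Trace:
    """Top-level container for one user-facing interaction."""
    name: str
    observations: List[Span] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)


# Mirrors the diagram above: one generation, one span, one event.
ticket = Trace(
    name="handle-support-ticket",
    observations=[
        Generation(name="classify-intent", model="claude-sonnet-4-6"),
        Span(name="tool:lookup_acct"),
    ],
    events=[Event(name="guardrail")],
)
```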

Where it lands: traces flush to the Ashr Labs backend, get stored in Postgres, and render in the Observability panel of the Ashr Labs dashboard.

Production-safe by design. Tracing never raises into your code. If the backend is unreachable, trace.end() returns an error dict; spans that are never closed are flushed at process exit. The hot path is enqueue-and-return.
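
The enqueue-and-return idea can be sketched with the standard library alone. This is a minimal illustration of the general pattern, not the SDK's actual implementation; `BackgroundFlusher`, `flaky_send`, and the `errors` list are hypothetical names.

```python
import queue
import threading


class BackgroundFlusher:
    """Hypothetical helper: callers enqueue, a worker thread does the slow send."""

    def __init__(self, send):
        self._send = send                 # slow callable, e.g. an HTTP POST
        self._queue = queue.Queue()
        self.errors = []                  # failures are recorded, never raised
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, item):
        """Hot path: put the item on the queue and return immediately."""
        self._queue.put_nowait(item)

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                self._send(item)
            except Exception as exc:      # swallow so the caller is never affected
                self.errors.append(str(exc))
            finally:
                self._queue.task_done()

    def drain(self):
        """Block until every queued item has been handled."""
        self._queue.join()


sent = []


def flaky_send(item):
    # Simulates a backend that is unreachable for one item.
    if item == "bad":
        raise ConnectionError("backend unreachable")
    sent.append(item)


flusher = BackgroundFlusher(flaky_send)
for item in ["span-1", "bad", "span-2"]:
    flusher.enqueue(item)   # does not raise, even for the failing item
flusher.drain()
```

The caller only ever touches `enqueue`, so a slow or failing backend costs the request path one queue insertion.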

The 3-Line Version

If you already have an API key, this is a complete instrumented agent turn:

from ashr_labs import AshrLabsClient

client = AshrLabsClient(api_key="tp_your_key_here")

with client.trace("handle-ticket", user_id="user_42") as trace:
    with trace.generation("answer", model="claude-sonnet-4-6") as gen:
        reply = call_llm(...)
        gen.end(output=reply, usage={"input_tokens": 50, "output_tokens": 80})

That's it. The trace flushes on exit, lands in your dashboard within a few seconds, and includes every span, generation, and event it contains. The rest of this guide explains how to customize each piece.


Step 1: Initialize the Client

Same client as the testing platform — one API key works for both products.

from ashr_labs import AshrLabsClient

client = AshrLabsClient(api_key="tp_your_api_key_here")

# Or load from environment
# client = AshrLabsClient.from_env() # reads ASHR_LABS_API_KEY

If you don't have a key yet, see the Quick Start for how to mint one. The first client.trace(...) call lazily resolves your tenant from the API key — no extra setup.


Step 2: Open a Trace

Wrap each user-facing interaction in a Trace. The recommended pattern is the context manager — trace.end() is called automatically when the with block exits, even if your code throws.

with client.trace(
    name="handle-ticket",       # required: what this interaction is
    user_id="user_42",          # optional: for grouping by end-user
    session_id="conv_abc",      # optional: for multi-turn conversations
    metadata={"version": "v3", "channel": "web"},
    tags=["prod", "premium-tier"],
) as trace:
    ...

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Logical name for this interaction (e.g. "handle-ticket", "summarize-document") |
| user_id | str | No | End-user ID for grouping in the dashboard |
| session_id | str | No | Conversation/session ID for multi-turn flows |
| metadata | dict | No | Arbitrary JSON-serializable metadata |
| tags | list[str] | No | Tags for filtering in the dashboard |

The trace's trace_id is server-assigned and available after the block exits via trace.trace_id.


Step 3: Instrument LLM Calls (Generations)

Wrap every LLM call in a trace.generation(...). This captures the model, the prompt, the completion, and token usage.

with trace.generation(
    name="classify-intent",
    model="claude-sonnet-4-6",
    input=[{"role": "user", "content": "I can't log in"}],
    metadata={"temperature": 0.3},
) as gen:
    response = anthropic_client.messages.create(...)
    gen.end(
        output={"role": "assistant", "content": response.content[0].text},
        usage={
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
    )

The model argument is what shows up in the Model Usage rollup on the dashboard. Use the canonical provider/model name (claude-sonnet-4-6, gpt-4.1-mini, gemini-2.0-flash, etc.).


Step 4: Instrument Tool Calls (Spans)

For everything else — tool invocations, retrieval steps, RAG lookups, guardrail checks — use trace.span(...).

with trace.span(
    name="tool:lookup_account",
    input={"user_id": "user_42"},
    metadata={"backend": "postgres"},
) as tool:
    result = lookup_account("user_42")
    tool.end(output={"status": "active", "tier": "premium"})

If the body raises, the span auto-ends with level="ERROR" and the exception message is captured in status_message. The exception is re-raised — tracing never swallows errors:

with trace.span("tool:external_api") as tool:
    response = call_external_api(...)  # if this raises ConnectionError...
    tool.end(output=response)
# ...the span ended with level="ERROR" and the exception propagates here
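
The record-then-re-raise behavior described above is what a context manager's `__exit__` method makes possible. The sketch below shows the mechanism on a stand-in class; `SketchSpan` is hypothetical, and treating a second `end()` call as a no-op is an assumption for this example rather than documented SDK behavior.

```python
class SketchSpan:
    """Hypothetical stand-in for a span, used only to show the __exit__ logic."""

    def __init__(self, name):
        self.name = name
        self.level = "DEFAULT"
        self.status_message = None
        self.ended = False

    def end(self, level=None, status_message=None):
        if self.ended:          # assumption: a second end() is a no-op
            return
        if level is not None:
            self.level = level
        self.status_message = status_message
        self.ended = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # The body raised: record the failure on the span.
            self.end(level="ERROR", status_message=str(exc))
        else:
            self.end()
        return False            # returning False lets the exception propagate


span = SketchSpan("tool:external_api")
propagated = False
try:
    with span:
        raise ConnectionError("upstream timed out")
except ConnectionError:
    propagated = True
```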

Step 5: Nest Spans Arbitrarily

Spans can contain other spans and generations. Nesting maps the agent's actual call tree, which is what makes the timeline view useful.

with client.trace("handle-ticket") as trace:
    with trace.span("retrieval") as retrieval:
        with retrieval.span("vector-search") as v:
            v.end(output={"hits": 8})
        with retrieval.generation("rerank", model="cohere-rerank-3") as r:
            r.end(output={"top_3": ...}, usage={"input_tokens": 200, "output_tokens": 30})

    with trace.generation("compose-reply", model="claude-sonnet-4-6") as gen:
        gen.end(output={"text": "..."}, usage={"input_tokens": 1200, "output_tokens": 180})

The dashboard renders this as a tree, with each level's latency contribution visible at a glance.


Step 6: Record Events

For point-in-time facts (no duration), use trace.event(...) or span.event(...):

trace.event("guardrail:toxicity", input={"toxic": False}, level="DEFAULT")
trace.event("cache:hit", input={"key": "user_42:profile"})
trace.event("flag:new-prompt", input={"variant": "B"})

| Level | Meaning |
|---|---|
| "DEBUG" | Verbose; off by default in dashboards |
| "DEFAULT" | Normal informational events |
| "WARNING" | Suspicious but recovered |
| "ERROR" | Failed with consequences |

Events show up inline in the timeline view alongside the spans and generations from the same trace.


Manual Instrumentation (No Context Managers)

If you can't use with blocks (async generators, streaming responses, callbacks), call .end() yourself:

trace = client.trace("support-chat", user_id="user_42", session_id="conv_abc")

gen = trace.generation("classify-intent", model="claude-sonnet-4-6",
                       input=[{"role": "user", "content": "Reset my password"}])
gen.end(output={"intent": "password_reset"},
        usage={"input_tokens": 50, "output_tokens": 12})

tool = trace.span("tool:reset_password", input={"user_id": "user_42"})
tool.end(output={"success": True})

trace.event("guardrail-check", input={"passed": True})

result = trace.end(output={"resolution": "password_reset_complete"})
print(trace.trace_id) # server-assigned ID

If a span goes out of scope without .end() being called, it's auto-ended at process exit with whatever data was attached. The dashboard marks these as level="DEFAULT" with a [no-end] indicator so you know they were dangling.


Reading Your Data

The same client exposes read-side APIs for querying the dashboard programmatically.

List recent traces

result = client.list_observability_traces(
    user_id="user_42",      # optional filter
    session_id="conv_abc",  # optional filter
    limit=50,
    page=1,
)
for t in result["traces"]:
    print(t["name"], t["trace_id"], t["start_time"])
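
Because the listing is paged, walking more than one page needs a loop. The helper below is a sketch under one assumption that the guide does not confirm: that a page past the end comes back with an empty "traces" list. `iter_traces` and `fake_fetch` are hypothetical names; the fake stands in for the real client so the example runs on its own.

```python
from typing import Callable, Iterator


def iter_traces(fetch_page: Callable[[int], dict], max_pages: int = 100) -> Iterator[dict]:
    """Yield traces page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        traces = fetch_page(page).get("traces", [])
        if not traces:
            return
        yield from traces


# Fake backend standing in for client.list_observability_traces
_FAKE_PAGES = {1: [{"name": "a"}, {"name": "b"}], 2: [{"name": "c"}]}


def fake_fetch(page: int) -> dict:
    return {"traces": _FAKE_PAGES.get(page, [])}


names = [t["name"] for t in iter_traces(fake_fetch)]
```

With the real client, the callable would be something like `lambda p: client.list_observability_traces(limit=50, page=p)`. The `max_pages` cap guards against looping forever if the stop condition assumed here turns out to be wrong.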

Get a single trace with its full tree

detail = client.get_observability_trace(trace_id="abc123...")
trace = detail["trace"]
for obs in trace["observations"]:
    print(obs["name"], obs["type"], obs["model"], obs["usage"])
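
Once the observations are in hand, a common next step is rolling token usage up per model. The sketch below works on plain dicts shaped like the fields printed above. It assumes `usage` carries the same `input_tokens` / `output_tokens` keys used on the write side and that non-LLM observations have no model; the `type` values in the sample data are made up for illustration.

```python
from collections import defaultdict


def tokens_by_model(observations):
    """Sum input/output tokens per model, skipping observations without a model."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for obs in observations:
        model = obs.get("model")
        usage = obs.get("usage") or {}
        if not model:
            continue
        totals[model]["input_tokens"] += usage.get("input_tokens", 0)
        totals[model]["output_tokens"] += usage.get("output_tokens", 0)
    return dict(totals)


# Sample data only; real observations come from trace["observations"].
sample = [
    {"name": "classify-intent", "type": "generation", "model": "claude-sonnet-4-6",
     "usage": {"input_tokens": 50, "output_tokens": 12}},
    {"name": "tool:reset_password", "type": "span", "model": None, "usage": None},
    {"name": "compose-reply", "type": "generation", "model": "claude-sonnet-4-6",
     "usage": {"input_tokens": 1200, "output_tokens": 180}},
]
totals = tokens_by_model(sample)
```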

Aggregate analytics

analytics = client.get_observability_analytics(days=7)
overview = analytics["overview"]
print(f"Traces: {overview['total_traces']}")
print(f"Tokens: {overview['total_input_tokens']} in / {overview['total_output_tokens']} out")
print(f"P95 latency: {overview['p95_latency_ms']}ms")
print(f"Error rate: {overview['error_rate']}")

The overview dict includes: total_traces, avg_latency_ms, p95_latency_ms, total_input_tokens, total_output_tokens, error_rate, total_tool_calls, unique_users, unique_sessions.
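
The overview fields lend themselves to a simple budget check, for example in a scheduled job. The helper below is a sketch: `check_budget` and the thresholds are hypothetical, and the guide does not say whether `error_rate` is a fraction or a percentage, so pass the threshold in whatever unit the API actually returns.

```python
def check_budget(overview, max_p95_ms, max_error_rate):
    """Return a list of human-readable budget violations (empty means healthy)."""
    problems = []
    if overview["p95_latency_ms"] > max_p95_ms:
        problems.append(
            f"p95 latency {overview['p95_latency_ms']}ms exceeds {max_p95_ms}ms"
        )
    if overview["error_rate"] > max_error_rate:
        problems.append(
            f"error rate {overview['error_rate']} exceeds {max_error_rate}"
        )
    return problems


# Sample values only; a real overview comes from get_observability_analytics.
sample_overview = {"total_traces": 1200, "p95_latency_ms": 2400, "error_rate": 0.01}
problems = check_budget(sample_overview, max_p95_ms=2000, max_error_rate=0.05)
```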

Errors and tool failures

errors = client.get_observability_errors(days=7, limit=50)
tool_errors = client.get_observability_tool_errors(days=7, limit=50)

Both return {"traces": [...], "total": int} with the most recent failures first.


Voice Agents (Realtime / LiveKit)

For voice agents on LiveKit, the SDK ships a dedicated ashr_labs.voice_obs submodule that captures STT/LLM/TTS metrics, turn boundaries, barge-ins, mixed-audio replay, and per-stage cost from the AgentSession automatically.

Install with the LiveKit extra:

pip install ashr-labs[livekit]

Two-line attach in your worker:

import os
from ashr_labs.voice_obs.livekit import VoiceObservability

obs = VoiceObservability(api_key=os.environ["ASHR_LABS_API_KEY"])
obs.attach(session, agent_id="support_v3", agent_version="v42")

That's the entire instrumentation. STT, LLM, TTS metrics, turn boundaries, and barge-ins are all captured automatically by hooking the AgentSession's event surface. Mixed-audio replay is enabled by default — agent TTS and remote participant audio are mixed at 24 kHz mono and uploaded so the dashboard's audio player can presign and stream them.

Runnable demo agents

Two examples ship with the SDK so you can see voice observability flow end-to-end without writing any agent code:

# Minimal — connects to LiveKit, attaches observability, greets the participant
python -m ashr_labs.voice_obs.examples.livekit_worker dev

# Full — a more "real-feeling" support agent built on the same primitives
python -m ashr_labs.voice_obs.examples.ashr_support_agent dev

Required env vars:

  • LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET
  • ASHR_LABS_API_KEY (or ASHR_VOICE_OBS_API_KEY)
  • ASHR_VOICE_OBS_TENANT_ID
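
A worker can fail fast when one of these variables is missing. The check below is a sketch using only the variable names listed above; `missing_env` is a hypothetical helper, not part of the SDK.

```python
import os

REQUIRED = [
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "ASHR_VOICE_OBS_TENANT_ID",
]
# Either of these satisfies the API key requirement.
API_KEY_ALTERNATIVES = ["ASHR_LABS_API_KEY", "ASHR_VOICE_OBS_API_KEY"]


def missing_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if not any(env.get(name) for name in API_KEY_ALTERNATIVES):
        missing.append(" or ".join(API_KEY_ALTERNATIVES))
    return missing


# Example with a partial environment passed in explicitly.
example = {"LIVEKIT_URL": "wss://example.invalid", "ASHR_VOICE_OBS_API_KEY": "tp_x"}
result = missing_env(example)
```

Calling `missing_env()` with no argument checks the real process environment; a non-empty result is a reasonable cue to exit before connecting to LiveKit.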

Voice sessions land in the same Observability panel as text-trace sessions; the dashboard auto-renders turns, transcripts, per-stage cost and latency, mixed-audio replay, and barge-in metrics.


Common Patterns

Tagging by deployment / version

Adds dashboard filters so you can compare versions side-by-side:

with client.trace("handle-ticket", tags=[f"version:{APP_VERSION}", f"env:{ENV}"]) as t:
    ...
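
APP_VERSION and ENV have to come from somewhere. One option is to read them from environment variables with explicit fallbacks so every trace is tagged even in local runs. This is a sketch; `deployment_tags`, the variable names, and the fallback values are assumptions.

```python
import os


def deployment_tags(env=None):
    """Build key:value tags from APP_VERSION and ENV, with explicit fallbacks."""
    env = os.environ if env is None else env
    version = env.get("APP_VERSION", "unknown")
    environment = env.get("ENV", "dev")
    return [f"version:{version}", f"env:{environment}"]


tags = deployment_tags({"APP_VERSION": "v3", "ENV": "prod"})
```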

Capturing exceptions explicitly

The default behavior is good enough for most cases, but if you want richer error context attached to a span:

with trace.span("tool:risky") as t:
    try:
        result = risky()
        t.end(output=result)
    except Exception as e:
        t.end(level="ERROR", status_message=f"{type(e).__name__}: {e}",
              metadata={"recoverable": isinstance(e, RetryableError)})
        raise

Token usage for streaming responses

Generation accepts usage after the stream completes. Stream first, end last:

# `async for` is only valid inside a coroutine, so the example lives in one.
async def stream_reply(trace, stream):
    with trace.generation("stream-reply", model="claude-sonnet-4-6") as gen:
        chunks = []
        usage = None
        async for event in stream:
            chunks.append(event.delta)
            if event.type == "message_stop":
                usage = event.usage  # usage arrives with the final event
        gen.end(output={"content": "".join(chunks)},
                usage={"input_tokens": usage.input_tokens,
                       "output_tokens": usage.output_tokens})

Async / asyncio

Nothing special — Trace, Span, and Generation are sync objects, but their HTTP flush happens on a background thread. Use them inside async functions identically:

async def handle(request):
    with client.trace("handle-request") as trace:
        with trace.generation("answer", model="claude-sonnet-4-6") as gen:
            response = await llm.generate_async(...)
            gen.end(output=response, usage=...)

High-cardinality metadata

Anything you put in metadata becomes filterable in the dashboard. Use canonical keys (prompt_version, experiment_arm, model_temperature) so the filter UI groups them sensibly.
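
One way to keep keys canonical is to normalize metadata in a single place before it reaches the trace. The sketch below is illustrative: the alias table and `normalize_metadata` are hypothetical. The `json.dumps` call is there because metadata is documented as JSON-serializable.

```python
import json

# Hypothetical alias table: map ad-hoc spellings onto one canonical key each.
CANONICAL_KEYS = {
    "promptVersion": "prompt_version",
    "prompt_ver": "prompt_version",
    "arm": "experiment_arm",
    "temperature": "model_temperature",
}


def normalize_metadata(metadata):
    """Rename known aliases and verify the result is JSON-serializable."""
    normalized = {CANONICAL_KEYS.get(key, key): value for key, value in metadata.items()}
    json.dumps(normalized)  # raises TypeError if a value cannot be serialized
    return normalized


clean = normalize_metadata({"promptVersion": "v3", "arm": "B", "channel": "web"})
```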


Where to Find Your Data

  • Live timeline: lab.ashr.io → Observability → Traces. Filter by user, session, time range, tag, or error level.
  • Per-trace detail: Click any trace to see the nested span tree, prompts, completions, token usage, and latency breakdown.
  • Analytics: Observability → Analytics. Latency P50/P95, token rollups, model usage, error rates, tool performance, time-series charts.
  • Voice sessions: Observability → Voice. Turn timeline, transcripts, per-stage breakdown, mixed-audio replay, barge-in/interrupt metrics.

Safety Properties

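  • Never raises its own errors. Every public method catches and logs SDK-internal failures, and trace.end() returns an error dict instead of throwing. Exceptions raised by your own code inside a with block are recorded on the span and then re-raised unchanged.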
  • Never blocks the hot path. Public API is enqueue-and-return; HTTP flush happens on a background thread.
  • Bounded memory. Internal buffers are capped at 10k spans per process. If your agent emits more than that without flushing (extremely rare), the oldest are dropped and a warning is logged.
  • Monotonic clocks for duration_ms so durations stay accurate across NTP adjustments. Wall clock is only used for timestamps sent over the wire.
  • 5-second graceful drain on trace.end() to make sure in-flight spans land before process exit.
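
Two of the properties above map directly onto standard-library tools. The sketch below shows the general techniques, not the SDK's code: a `deque` with `maxlen` drops the oldest entry when full, and `time.monotonic()` measures durations that are unaffected by wall-clock adjustments. The tiny cap of 3 and the `timed` helper are for demonstration only.

```python
import time
from collections import deque

# A deque with maxlen silently drops the oldest item once it is full.
buffer = deque(maxlen=3)          # the guide's cap is 10k; 3 keeps the demo small
for span_name in ["s1", "s2", "s3", "s4"]:
    buffer.append(span_name)


def timed(fn):
    """Run fn and return (wall_clock_start, duration_ms)."""
    started_at = time.time()              # wall clock: only for the timestamp
    t0 = time.monotonic()                 # monotonic: immune to clock adjustments
    fn()
    duration_ms = (time.monotonic() - t0) * 1000.0
    return started_at, duration_ms


started_at, duration_ms = timed(lambda: time.sleep(0.01))
```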

Next Steps

  • Browse the API Reference for the complete method signatures and parameter tables.
  • See Examples for full end-to-end patterns including error tracking and analytics dashboards.
  • Read Authentication if you need to manage API keys or rotate credentials.

If you hit issues, the SDK never crashes your agent — but it does log warnings to stderr. Set logging.getLogger("ashr_labs").setLevel(logging.DEBUG) to see the full picture during development.
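
Setting the level alone prints nothing unless a handler is attached. A minimal development setup with the standard `logging` module might look like this; the format string is just one reasonable choice.

```python
import logging

# Attach a stderr handler to the root logger (basicConfig's default stream).
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

sdk_logger = logging.getLogger("ashr_labs")
sdk_logger.setLevel(logging.DEBUG)   # verbose SDK output during development
```

Only the "ashr_labs" logger and its children are switched to DEBUG; the root logger keeps its default level, so other libraries stay quiet.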