Voice Observability
This guide walks through tracing realtime voice agents in production with the Ashr Labs SDK. Voice sessions land in the same Observability panel as text traces, but the dashboard renders them as a turn timeline with transcripts, per-stage STT / LLM / TTS breakdown, mixed-audio replay, barge-in metrics, and per-turn cost.
Voice observability is part of the broader Observability product — same API key, same dashboard, same backend. This page covers the realtime/voice-specific surfaces. For text agent tracing (chatbots, RAG pipelines, batch LLM jobs) see the main Observability guide.
Two integration paths
| Path | When to use | Setup effort |
|---|---|---|
| LiveKit plugin | Your voice agent runs on LiveKit Agents | 2 lines |
| Generic primitives | Any other stack — Pipecat, custom WebRTC pipeline, server-side voice loop | Open a session, wrap each turn |
Both paths produce the same dashboard rows. The LiveKit plugin is just a thin adapter on top of the generic primitives that maps LiveKit's event bus to our Session / Turn / Stage model automatically.
Path 1: LiveKit (the happy path)
If your agent is a LiveKit AgentSession, the entire instrumentation is two lines.
Install
pip install "ashr-labs[livekit]"
Attach in your worker
import os
from ashr_labs.voice_obs.livekit import VoiceObservability
obs = VoiceObservability(api_key=os.environ["ASHR_LABS_API_KEY"])
obs.attach(
    session,                          # your LiveKit AgentSession
    agent_id="support_v3",            # logical name shown in the dashboard
    agent_version="v42",              # optional, for A/B comparisons
    stt_model="deepgram/nova-3",      # provider/model strings used by AgentSession;
    llm_model="openai/gpt-4.1-mini",  # needed for cost rollups, see below
    tts_model="cartesia/sonic-2",
)
That's the whole instrumentation. STT, LLM, TTS metrics, turn boundaries, and barge-ins are captured automatically by hooking the AgentSession's event surface. Mixed-audio replay is enabled by default — agent TTS and remote participant audio are mixed at 24 kHz mono and uploaded so the dashboard's audio player can presign and stream them.
Why pass stt_model / llm_model / tts_model
LiveKit's STTMetrics / LLMMetrics / TTSMetrics only carry a label field — never the provider/model. Without these hints, the cost-pricing table can't look anything up and every per-turn cost lands as 0.0. Pass the same provider/model strings you used to construct the AgentSession (e.g. "deepgram/nova-3", "openai/gpt-4.1-mini"); the SDK splits on / to recover provider + model.
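To make the hint format concrete, here is a minimal sketch of splitting a "provider/model" string. The helper name `split_model_hint` is hypothetical and not part of the SDK, and splitting on the first `/` only is an assumption; the SDK's handling of unusual inputs may differ.

```python
from typing import Optional, Tuple


def split_model_hint(hint: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a "provider/model" hint into (provider, model).

    Hypothetical helper for illustration, not the SDK's implementation.
    Assumption: only the first "/" separates provider from model, so a
    model name that itself contains "/" stays intact.
    """
    if not hint or "/" not in hint:
        # No usable hint: a pricing lookup would have nothing to key on.
        return None, None
    provider, _, model = hint.partition("/")
    return provider, model


stt_provider, stt_model = split_model_hint("deepgram/nova-3")
llm_provider, llm_model = split_model_hint("openai/gpt-4.1-mini")
```

A hint without a `/` (for example a bare "nova-3") yields no provider in this sketch, which is the situation that would leave a cost lookup without a key.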
attach(...) parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| livekit_session | AgentSession | Yes | The LiveKit session to instrument (positional) |
| agent_id | str | Yes | Logical agent name, shown in the dashboard's agent filter |
| agent_version | str | No | Version tag for comparing rollouts side-by-side |
| tenant_id | int | No | Defaults to 0; set if you're using multi-tenancy |
| room_id | str | No | LiveKit room ID; auto-derived if omitted |
| user_id | str | No | End-user identifier for grouping |
| external_session_id | str | No | Your own session ID, for cross-system joining |
| stt_model | str | No | "provider/model" hint for cost rollup |
| llm_model | str | No | "provider/model" hint for cost rollup |
| tts_model | str | No | "provider/model" hint for cost rollup |
Graceful shutdown
VoiceObservability.attach(...) returns immediately and does its work on the worker's event loop. On worker shutdown, call:
obs.shutdown() # drains the buffer with a 5-second timeout
This is optional — the SDK registers an atexit flush as a safety net — but recommended in livekit_worker.entrypoint's shutdown hook so any in-flight turns land before the process exits.
Required env vars (for the demo agents)
The shipped demos read configuration from environment:
- LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET: your LiveKit project
- ASHR_LABS_API_KEY (or ASHR_VOICE_OBS_API_KEY): your Ashr Labs API key
- ASHR_VOICE_OBS_TENANT_ID: your tenant ID
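If you want your own worker to read the same variables, a loader along these lines may help. This is a sketch, not code from the demos: `DemoConfig` and `load_demo_config` are hypothetical names, and both the precedence between the two API-key variables and the fallback tenant ID of 0 are assumptions.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DemoConfig(NamedTuple):
    livekit_url: str
    livekit_api_key: str
    livekit_api_secret: str
    ashr_api_key: str
    tenant_id: int


def load_demo_config(env: Optional[Mapping[str, str]] = None) -> DemoConfig:
    """Read the settings listed above, failing early with every missing name."""
    env = os.environ if env is None else env
    # Assumption: ASHR_LABS_API_KEY wins when both key variables are set.
    api_key = env.get("ASHR_LABS_API_KEY") or env.get("ASHR_VOICE_OBS_API_KEY")
    required = ["LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET"]
    missing = [name for name in required if not env.get(name)]
    if not api_key:
        missing.append("ASHR_LABS_API_KEY (or ASHR_VOICE_OBS_API_KEY)")
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Assumption: an unset tenant ID falls back to 0, mirroring attach()'s default.
    tenant_id = int(env.get("ASHR_VOICE_OBS_TENANT_ID") or "0")
    return DemoConfig(
        env["LIVEKIT_URL"],
        env["LIVEKIT_API_KEY"],
        env["LIVEKIT_API_SECRET"],
        api_key,
        tenant_id,
    )
```

Collecting all missing names before raising means a misconfigured worker reports every problem in one error instead of one per restart.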
Runnable demo agents
Two examples ship with the SDK so you can see voice observability flow end-to-end without writing any agent code:
# Minimal — connects to LiveKit, attaches observability, greets the participant
python -m ashr_labs.voice_obs.examples.livekit_worker dev
# Full — a more "real-feeling" support agent built on the same primitives
python -m ashr_labs.voice_obs.examples.ashr_support_agent dev
What gets captured automatically (LiveKit)
For each AgentSession, the plugin maps native events into the dashboard's data model:
| LiveKit event | What the plugin records |
|---|---|
| user_state_changed → speaking | Open a user turn, fire user_speech_start event |
| user_state_changed → listening | Close the user turn, fire user_speech_end event |
| agent_state_changed → speaking | Open an agent turn, compute & attach TTFA, fire agent_speech_start |
| agent_state_changed → listening | Close the agent turn, fire agent_speech_end |
| user_input_transcribed (is_final=True) | Set transcript on the active user turn |
| conversation_item_added (assistant) | Set transcript on the active agent turn |
| metrics_collected → STTMetrics | One stt stage on the user turn (with cost) |
| metrics_collected → LLMMetrics | One llm stage on the agent turn (with TTFT, tokens, cost) |
| metrics_collected → TTSMetrics | One tts stage on the agent turn (with TTFB, audio duration, cost) |
| metrics_collected → InterruptionMetrics | Mark user turn as interrupting agent turn, fire barge_in event |
| agent_false_interruption | Fire failed_interrupt event on the active turn |
| close | Close any open turns and end the session |
TTFA (time-to-first-audio, the user-perceived latency from "I stopped speaking" to "agent started speaking") is computed using a monotonic clock between the user's listening transition and the agent's speaking transition.
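The TTFA measurement described above can be sketched with two callbacks and a monotonic clock. `TTFATracker` and its method names are hypothetical; the plugin's internals are not documented here, so treat this as an illustration of the idea rather than the actual implementation.

```python
import time
from typing import Callable, Optional


class TTFATracker:
    """Hypothetical helper: measures time-to-first-audio in milliseconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # A monotonic clock never jumps backwards when wall-clock time is adjusted.
        self._clock = clock
        self._user_stopped_at: Optional[float] = None

    def on_user_listening(self) -> None:
        # The user just stopped speaking: start the stopwatch.
        self._user_stopped_at = self._clock()

    def on_agent_speaking(self) -> Optional[float]:
        # No preceding user utterance (e.g. a greeting): TTFA is undefined.
        if self._user_stopped_at is None:
            return None
        ttfa_ms = (self._clock() - self._user_stopped_at) * 1000.0
        self._user_stopped_at = None  # each user stop feeds at most one agent turn
        return ttfa_ms
```

Returning `None` when the agent speaks first matches the "TTFA is None on agent turns" case in the troubleshooting table. The injectable `clock` argument exists only to make the sketch easy to test.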
Mixed-audio replay taps the LiveKit AudioOutput sink and the remote participant track, mixes the two at 24 kHz mono, and uploads the result to object storage. The dashboard's audio player streams it through a 5-minute presigned URL.
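For intuition, mixing two mono streams can be as simple as summing samples and clipping. The sketch below assumes both streams are already 16-bit PCM at 24 kHz; the sample format, the function name `mix_mono_int16`, and the sum-and-clip approach are all assumptions, since the plugin's actual mixer is not described here.

```python
from array import array

SAMPLE_RATE_HZ = 24_000  # mono, matching the description above
INT16_MIN, INT16_MAX = -32768, 32767


def mix_mono_int16(agent: array, user: array) -> array:
    """Mix two 16-bit mono PCM buffers by summing samples and clipping."""
    length = max(len(agent), len(user))
    mixed = array("h", [0] * length)  # zero-filled output buffer
    for i in range(length):
        a = agent[i] if i < len(agent) else 0  # pad the shorter stream with silence
        u = user[i] if i < len(user) else 0
        # Clip instead of wrapping around so loud overlaps don't turn into noise.
        mixed[i] = max(INT16_MIN, min(INT16_MAX, a + u))
    return mixed


agent_pcm = array("h", [1000, 30000, -30000])
user_pcm = array("h", [500, 10000])
mixed_pcm = mix_mono_int16(agent_pcm, user_pcm)
```

A production mixer would typically also align the two streams in time and may apply gain control; this sketch skips both.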
Path 2: Generic primitives (any stack)
If you're not on LiveKit, use the same Client directly. The model is: open a Session, wrap each user/agent turn in a context manager, wrap each stage (STT/LLM/TTS) in a nested context manager, end the session.
from ashr_labs.voice_obs import Client, STTPayload, LLMPayload, TTSPayload, Message
client = Client(api_key="vo_...your_ingest_key...")
session = client.start_session(
    agent_id="support_v3",
    transport="webrtc",
    user_id="user_42",
    external_session_id="my-call-id-abc",
)
# A user turn: wrap STT, set the final transcript.
with session.user_turn() as user_turn:
    with user_turn.stage("stt", provider="deepgram", model="nova-3") as stt:
        # ... your STT call here ...
        stt.set_payload(STTPayload(
            audio_duration_ms=2400,
            request_duration_ms=180,
            final_transcript="I can't log in",
            final_confidence=0.97,
            language="en",
            cost_usd=0.0012,
        ))
    user_turn.set_transcript("I can't log in")
# An agent turn: wrap LLM, then TTS.
with session.agent_turn() as agent_turn:
    with agent_turn.stage("llm", provider="openai", model="gpt-4.1-mini") as llm:
        # ... your LLM call here ...
        llm.set_payload(LLMPayload(
            prompt_tokens=420, completion_tokens=80,
            ttft_ms=320, tokens_per_second=72.5,
            prompt=[Message(role="user", content="I can't log in")],
            completion="Let me help reset your password.",
            cost_usd=0.0021,
        ))
    with agent_turn.stage("tts", provider="cartesia", model="sonic-2") as tts:
        # ... your TTS call here ...
        tts.set_payload(TTSPayload(
            text_input="Let me help reset your password.",
            audio_bytes=18000,
            audio_duration_ms=1900,
            ttfb_ms=140,
            voice_id="cartesia-en-female-1",
            cost_usd=0.0008,
        ))
    agent_turn.set_transcript("Let me help reset your password.")
session.end(reason="user_left")
client.shutdown()
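The cost_usd values set on each stage are what the dashboard rolls up per turn and per session. The stand-alone sketch below shows that arithmetic with the numbers from the example; the `Stage` and `Turn` dataclasses are simplified stand-ins invented for illustration, not the SDK's types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    kind: str              # "stt", "llm", "tts", ...
    cost_usd: float = 0.0


@dataclass
class Turn:
    role: str              # "user" or "agent"
    stages: List[Stage] = field(default_factory=list)

    def cost_usd(self) -> float:
        # Per-turn cost is the sum of the turn's stage costs.
        return sum(stage.cost_usd for stage in self.stages)


def session_cost_usd(turns: List[Turn]) -> float:
    # Per-session cost is the sum over all turns.
    return sum(turn.cost_usd() for turn in turns)


user_turn = Turn("user", [Stage("stt", 0.0012)])
agent_turn = Turn("agent", [Stage("llm", 0.0021), Stage("tts", 0.0008)])
print(f"agent turn: ${agent_turn.cost_usd():.4f}")
print(f"session:    ${session_cost_usd([user_turn, agent_turn]):.4f}")
```

A stage recorded without a cost contributes 0.0 here, which is how a missing price shows up as a zero in a rollup.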
Recording barge-ins manually
When a user starts speaking before the agent finishes, mark it on the user turn:
with session.user_turn() as user_turn:
    user_turn.mark_interrupted(
        by_turn_id=current_agent_turn_id,
        interrupt_latency_ms=180,
    )
    user_turn.event("barge_in", {
        "agent_turn_id": current_agent_turn_id,
        "interrupt_latency_ms": 180,
    })
Recording transport quality
For WebRTC transports, push MOS / packet loss / jitter onto the session as you observe them. The most recent value is forwarded at session close:
session.set_transport_quality(mos_score=4.2, packet_loss_pct=0.3, jitter_ms=18)
Custom turn events
Beyond barge_in / failed_interrupt, you can attach any string event type to a turn. These show up inline in the dashboard's timeline:
turn.event("guardrail:toxicity", {"flagged": False, "score": 0.04})
turn.event("tool:lookup_account", {"user_id": "user_42", "found": True})
Stage kinds beyond STT/LLM/TTS
The supported kind values are stt, llm, tts, vad, nlu, dialogue_management, api_call. Each has a corresponding typed payload (STTPayload, VADPayload, NLUPayload, …) under ashr_labs.voice_obs.schemas. The dashboard has dedicated rendering for STT/LLM/TTS; the others are rendered as a generic span row with the payload pretty-printed.
Client(...) parameters
| Parameter | Default | Description |
|---|---|---|
| api_key | required | Your voice obs ingest key (vo_...) |
| base_url | https://api.ashr.io | Override for self-hosted backends |
| flush_interval_seconds | 5.0 | How often the background worker flushes the buffer |
| flush_threshold | 10 | Flush early when this many spans accumulate |
| max_buffer_size | 10_000 | Hard cap; oldest dropped if exceeded (warning logged) |
| masker | default registry | Override to add custom PII masking rules |
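To see how flush_threshold and max_buffer_size interact, consider this minimal sketch of a bounded buffer. `SpanBuffer` is a hypothetical class written for illustration; the SDK's real buffer is not shown in this guide and may behave differently in detail.

```python
from collections import deque
from typing import Any, Deque, List


class SpanBuffer:
    """Hypothetical bounded buffer: drop-oldest on overflow, early flush on threshold."""

    def __init__(self, flush_threshold: int = 10, max_buffer_size: int = 10_000) -> None:
        self.flush_threshold = flush_threshold
        self.dropped = 0
        self._spans: Deque[Any] = deque(maxlen=max_buffer_size)

    def enqueue(self, span: Any) -> bool:
        """Add a span; return True when the caller should trigger an early flush."""
        if len(self._spans) == self._spans.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest item on append
        self._spans.append(span)
        return len(self._spans) >= self.flush_threshold

    def drain(self) -> List[Any]:
        """Hand every buffered span to the flusher and empty the buffer."""
        batch = list(self._spans)
        self._spans.clear()
        return batch
```

In normal operation the threshold triggers a flush long before the cap matters. The cap only comes into play when flushes keep failing, for example while the backend is unreachable.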
Audio replay
The LiveKit plugin uploads mixed audio automatically. For the generic primitives path, upload it yourself once you have the file:
with open("call.opus", "rb") as f:
    client.upload_audio(
        session_id=session.session_id,
        audio=f,
        duration_ms=180_000,
        codec="opus",
    )
Audio is stored in object storage with a 5-minute presigned URL minted on-demand by the dashboard's audio player. Anything stored is encrypted at rest.
Reading your data
Voice sessions are surfaced through the same query API as text traces, plus voice-specific endpoints. See the Reading your data section in the Observability guide for the text-trace API; for voice-specific reads, the dashboard renders:
- Turn timeline — every user/agent turn with transcript, duration, and stage breakdown
- Per-stage cost — STT + LLM + TTS cost rolled up per turn and per session
- Barge-in metrics — count of interruptions, p50/p95 interrupt latency
- TTFA distribution — user-perceived latency, p95 surfaced as a session-level rollup
- Mixed-audio player — 24 kHz mono replay with turn markers
- Transport quality — MOS / packet loss / jitter
Sessions land in the Observability → Voice tab of the Ashr Labs dashboard.
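If you want to sanity-check the p50/p95 interrupt latencies against your own logs, a nearest-rank percentile is an easy baseline. This is a sketch under assumptions: the dashboard's exact percentile method is not documented here, so small differences from interpolating methods are expected, and the sample latencies are made up.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


interrupt_latencies_ms = [120, 180, 150, 400, 210]  # made-up sample data
p50 = percentile_nearest_rank(interrupt_latencies_ms, 50)
p95 = percentile_nearest_rank(interrupt_latencies_ms, 95)
```

With only a handful of samples, p95 is simply the slowest observation, which is worth remembering when reading tail latencies on short sessions.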
Safety properties
- Never raises into your agent. Every public method catches and logs its own errors. If the backend is unreachable, sessions buffer locally and flush on retry; if the buffer overflows the cap, the oldest spans are dropped with a warning.
- Never blocks the hot path. Public API is enqueue-and-return. HTTP flush happens on a background thread on a 5-second cadence (or when the buffer hits 10 spans).
- Bounded memory. Default cap is 10k spans per process. Override with Client(max_buffer_size=...) if your throughput justifies more.
- Monotonic clocks for duration_ms and TTFA so durations stay accurate across NTP adjustments. Wall-clock is only used for timestamps sent over the wire.
- Lazy LiveKit import. ashr_labs.voice_obs itself doesn't depend on livekit-agents; the plugin only resolves it when you call .attach(...). Importing ashr_labs.voice_obs.Client works fine without LiveKit installed.
- Default PII masking. A masker registry runs against transcripts and prompts before they leave the process. The default masks emails, phone numbers, and credit-card numbers; pass masker= to extend.
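As a rough picture of what a masker registry does, the sketch below chains regex-based rules over a transcript. Everything in it is an assumption made for illustration: the class `MaskerRegistry`, its `register` and `mask` methods, the placeholder tokens, and the deliberately simple patterns are not the SDK's defaults, and the interface expected by the masker= parameter is not documented here.

```python
import re
from typing import Callable, List

Masker = Callable[[str], str]

# Deliberately simple patterns for illustration; production rules need more care.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class MaskerRegistry:
    """Hypothetical registry that applies masking rules in order."""

    def __init__(self) -> None:
        self._maskers: List[Masker] = [
            lambda text: _EMAIL.sub("[EMAIL]", text),
            # Cards run before phones because a card number is also a long digit run.
            lambda text: _CARD.sub("[CARD]", text),
            lambda text: _PHONE.sub("[PHONE]", text),
        ]

    def register(self, masker: Masker) -> None:
        """Append a custom rule; it runs after the built-in ones."""
        self._maskers.append(masker)

    def mask(self, text: str) -> str:
        for masker in self._maskers:
            text = masker(text)
        return text


registry = MaskerRegistry()
# Example custom rule: hide internal account IDs such as "ACC-12345".
registry.register(lambda text: re.sub(r"\bACC-\d+\b", "[ACCOUNT]", text))
masked = registry.mask(
    "Reach me at jane.doe@example.com or +1 (555) 010-9999, card 4111 1111 1111 1111."
)
```

Rule order matters in a chain like this: a broad pattern that runs early can swallow text that a more specific rule was meant to label.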
Troubleshooting
| Symptom | Likely cause |
|---|---|
| All per-turn cost_usd show as 0.0 | LiveKit STTMetrics/LLMMetrics/TTSMetrics don't carry provider/model. Pass stt_model="provider/model", llm_model=..., tts_model=... to obs.attach(...). |
| Sessions appear, transcripts are empty | The agent isn't emitting conversation_item_added (assistant role) or user_input_transcribed with is_final=True. Confirm your STT plugin is configured for final results. |
| TTFA is None on agent turns | The plugin couldn't observe a user_state_changed → listening before agent_state_changed → speaking. Usually means the agent spoke without a preceding user utterance (greetings, proactive prompts), which is expected. |
| Mixed audio is missing | The LiveKit plugin's audio recorder needs to attach to the AudioOutput sink before the agent speaks. Confirm obs.attach(...) runs before session.start(). |
| Session never closes (stays active) | Your worker exited without firing LiveKit's close event. Call obs.shutdown() in your shutdown hook, or call session.end(reason=...) directly on the generic primitives path. |
Next steps
- Observability guide for text-agent tracing, analytics, and reading data back
- Quick Start — minting an API key and your first instrumented call
- Authentication — managing and rotating ingest keys
- LiveKit's AgentSession docs for the upstream surface this plugin hooks into
The SDK never crashes your agent, but it does log to stderr. If you hit issues during development, set logging.getLogger("ashr_labs").setLevel(logging.DEBUG) to see the full picture.
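A minimal way to wire that up with the standard library is shown below. The logger name "ashr_labs" comes from this guide; the format string and the WARNING level for everything else are just one reasonable choice.

```python
import logging
import sys

# Send log records to stderr with timestamps and logger names.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.WARNING,  # keep other libraries quiet
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Turn on verbose output for the SDK's logger only.
sdk_logger = logging.getLogger("ashr_labs")
sdk_logger.setLevel(logging.DEBUG)
```

Because Python loggers are hierarchical, any logger whose name starts with "ashr_labs." inherits the DEBUG level unless it sets its own.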