
Voice Observability

This guide walks through tracing realtime voice agents in production with the Ashr Labs SDK. Voice sessions land in the same Observability panel as text traces, but the dashboard renders them as a turn timeline with transcripts, per-stage STT / LLM / TTS breakdown, mixed-audio replay, barge-in metrics, and per-turn cost.

Voice observability is part of the broader Observability product — same API key, same dashboard, same backend. This page covers the realtime/voice-specific surfaces. For text agent tracing (chatbots, RAG pipelines, batch LLM jobs) see the main Observability guide.

Two integration paths

| Path | When to use | Setup effort |
| --- | --- | --- |
| LiveKit plugin | Your voice agent runs on LiveKit Agents | 2 lines |
| Generic primitives | Any other stack: Pipecat, a custom WebRTC pipeline, a server-side voice loop | Open a session, wrap each turn |

Both paths produce the same dashboard rows. The LiveKit plugin is just a thin adapter on top of the generic primitives that maps LiveKit's event bus to our Session / Turn / Stage model automatically.


Path 1: LiveKit (the happy path)

If your agent is a LiveKit AgentSession, the entire instrumentation is two lines.

Install

pip install "ashr-labs[livekit]"  # quotes keep shells such as zsh from globbing the brackets

Attach in your worker

import os
from ashr_labs.voice_obs.livekit import VoiceObservability

obs = VoiceObservability(api_key=os.environ["ASHR_LABS_API_KEY"])
obs.attach(
    session,                          # your LiveKit AgentSession
    agent_id="support_v3",            # logical name shown in the dashboard
    agent_version="v42",              # optional, for A/B comparisons
    stt_model="deepgram/nova-3",      # the provider/model strings your AgentSession uses;
    llm_model="openai/gpt-4.1-mini",  # needed for cost rollups (see below)
    tts_model="cartesia/sonic-2",
)

That's the whole instrumentation. The plugin hooks the AgentSession's event surface and captures the following automatically:

  • STT, LLM, and TTS metrics
  • turn boundaries
  • barge-ins

Mixed-audio replay is enabled by default. Agent TTS and remote-participant audio are mixed at 24 kHz mono. The recording is then uploaded so the dashboard's audio player can presign and stream it.

Why pass stt_model / llm_model / tts_model

LiveKit's STTMetrics / LLMMetrics / TTSMetrics only carry a label field — never the provider/model. Without these hints, the cost-pricing table can't look anything up and every per-turn cost lands as 0.0. Pass the same provider/model strings you used to construct the AgentSession (e.g. "deepgram/nova-3", "openai/gpt-4.1-mini"); the SDK splits on / to recover provider + model.
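The splitting rule can be made concrete with a minimal sketch. It assumes two things the SDK does not document:

  • the split happens on the first "/"
  • a hint without a "/" carries no provider

split_model_hint is a hypothetical name, not an SDK function.

```python
def split_model_hint(hint: str) -> tuple[str, str]:
    """Split a "provider/model" hint into (provider, model).

    Assumption: split on the first "/" only, so a model name that itself
    contains "/" stays intact. A hint with no "/" yields an empty provider.
    """
    provider, sep, model = hint.partition("/")
    if not sep:
        # No "/" at all: nothing to recover as a provider.
        return "", hint
    return provider, model


print(split_model_hint("deepgram/nova-3"))      # ('deepgram', 'nova-3')
print(split_model_hint("openai/gpt-4.1-mini"))  # ('openai', 'gpt-4.1-mini')
```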

attach(...) parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| livekit_session | AgentSession | Yes | The LiveKit session to instrument (positional) |
| agent_id | str | Yes | Logical agent name, shown in the dashboard's agent filter |
| agent_version | str | No | Version tag for comparing rollouts side by side |
| tenant_id | int | No | Defaults to 0; set it if you use multi-tenancy |
| room_id | str | No | LiveKit room ID; auto-derived if omitted |
| user_id | str | No | End-user identifier for grouping |
| external_session_id | str | No | Your own session ID, for cross-system joins |
| stt_model | str | No | "provider/model" hint for cost rollup |
| llm_model | str | No | "provider/model" hint for cost rollup |
| tts_model | str | No | "provider/model" hint for cost rollup |

Graceful shutdown

VoiceObservability.attach(...) returns immediately and does its work on the worker's event loop. On worker shutdown, call:

obs.shutdown()  # drains the buffer with a 5-second timeout

This is optional — the SDK registers an atexit flush as a safety net — but recommended in livekit_worker.entrypoint's shutdown hook so any in-flight turns land before the process exits.

Required env vars (for the demo agents)

The shipped demos read configuration from environment:

  • LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET — your LiveKit project
  • ASHR_LABS_API_KEY (or ASHR_VOICE_OBS_API_KEY) — your Ashr Labs API key
  • ASHR_VOICE_OBS_TENANT_ID — your tenant ID
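If you adapt the demos into your own worker, a small loader that fails fast on missing variables avoids a confusing half-started agent. This is a sketch built on a few assumptions:

  • load_demo_config is a hypothetical helper.
  • ASHR_LABS_API_KEY is assumed to take precedence over ASHR_VOICE_OBS_API_KEY.
  • The tenant ID is parsed as an int to match the type of attach(...)'s tenant_id.

```python
import os
from collections.abc import Mapping

REQUIRED = ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET", "ASHR_VOICE_OBS_TENANT_ID")


def load_demo_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read the demo settings, raising one error that lists every missing variable."""
    # Assumed precedence: ASHR_LABS_API_KEY first, then the voice-specific alias.
    api_key = env.get("ASHR_LABS_API_KEY") or env.get("ASHR_VOICE_OBS_API_KEY")
    missing = [name for name in REQUIRED if not env.get(name)]
    if not api_key:
        missing.append("ASHR_LABS_API_KEY (or ASHR_VOICE_OBS_API_KEY)")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "livekit_url": env["LIVEKIT_URL"],
        "livekit_api_key": env["LIVEKIT_API_KEY"],
        "livekit_api_secret": env["LIVEKIT_API_SECRET"],
        "ashr_api_key": api_key,
        "tenant_id": int(env["ASHR_VOICE_OBS_TENANT_ID"]),  # attach() takes an int
    }
```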

Runnable demo agents

Two examples ship with the SDK so you can see voice observability flow end-to-end without writing any agent code:

# Minimal — connects to LiveKit, attaches observability, greets the participant
python -m ashr_labs.voice_obs.examples.livekit_worker dev

# Full — a more "real-feeling" support agent built on the same primitives
python -m ashr_labs.voice_obs.examples.ashr_support_agent dev

What gets captured automatically (LiveKit)

For each AgentSession, the plugin maps native events into the dashboard's data model:

| LiveKit event | What the plugin records |
| --- | --- |
| user_state_changed → speaking | Open a user turn, fire user_speech_start event |
| user_state_changed → listening | Close the user turn, fire user_speech_end event |
| agent_state_changed → speaking | Open an agent turn, compute and attach TTFA, fire agent_speech_start |
| agent_state_changed → listening | Close the agent turn, fire agent_speech_end |
| user_input_transcribed (is_final=True) | Set transcript on the active user turn |
| conversation_item_added (assistant) | Set transcript on the active agent turn |
| metrics_collected (STTMetrics) | One stt stage on the user turn (with cost) |
| metrics_collected (LLMMetrics) | One llm stage on the agent turn (with TTFT, tokens, cost) |
| metrics_collected (TTSMetrics) | One tts stage on the agent turn (with TTFB, audio duration, cost) |
| metrics_collected (InterruptionMetrics) | Mark the user turn as interrupting the agent turn, fire barge_in event |
| agent_false_interruption | Fire failed_interrupt event on the active turn |
| close | Close any open turns and end the session |

TTFA (time-to-first-audio, the user-perceived latency from "I stopped speaking" to "agent started speaking") is computed using a monotonic clock between the user's listening transition and the agent's speaking transition.
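The TTFA rule can be illustrated with a tiny state tracker. This is not the plugin's code:

  • TTFATracker is hypothetical.
  • It takes state strings as plain arguments rather than LiveKit event objects.
  • Its clock is injectable, so it can be tested.
  • It returns None when the agent speaks with no preceding user utterance, which matches the troubleshooting note on TTFA further down.

```python
import time
from typing import Callable, Optional


class TTFATracker:
    """Time-to-first-audio: user stops speaking -> agent starts speaking."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # monotonic, so NTP wall-clock jumps can't skew results
        self._user_done_at: Optional[float] = None

    def on_user_state(self, new_state: str) -> None:
        # Per the event table above, "listening" means the user just stopped speaking.
        if new_state == "listening":
            self._user_done_at = self._clock()

    def on_agent_state(self, new_state: str) -> Optional[float]:
        """Return TTFA in ms when the agent starts speaking, otherwise None."""
        if new_state != "speaking" or self._user_done_at is None:
            return None  # includes greetings with no preceding user utterance
        ttfa_ms = (self._clock() - self._user_done_at) * 1000.0
        self._user_done_at = None  # one measurement per user utterance
        return ttfa_ms
```

Resetting after each measurement keeps a second agent utterance, such as a follow-up prompt, from being timed against a stale user turn.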

Mixed-audio replay taps the LiveKit AudioOutput sink and the remote participant track, mixes the two at 24 kHz mono, and uploads the result to object storage. The dashboard's audio player streams the recording through a 5-minute presigned URL.
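Mixing two mono streams at the same rate comes down to summing samples and clipping. The sketch below makes these assumptions:

  • Both inputs are already 16-bit signed PCM at 24 kHz mono, in native byte order.
  • The plugin's resampling, alignment, and gain handling are not documented here, so they are not modeled.
  • mix_pcm16 is a hypothetical name.

```python
from array import array


def mix_pcm16(a: bytes, b: bytes) -> bytes:
    """Mix two mono 16-bit PCM buffers that share a sample rate.

    The shorter buffer is treated as padded with silence, and each sum is
    clipped to the int16 range so loud overlaps saturate instead of wrapping.
    """
    left, right = array("h"), array("h")
    left.frombytes(a)
    right.frombytes(b)
    n = max(len(left), len(right))
    out = array("h", bytes(2 * n))  # n zero-valued samples
    for i in range(n):
        s = (left[i] if i < len(left) else 0) + (right[i] if i < len(right) else 0)
        out[i] = max(-32768, min(32767, s))
    return out.tobytes()
```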


Path 2: Generic primitives (any stack)

If you're not on LiveKit, use the underlying Client directly. The model is: open a Session, wrap each user/agent turn in a context manager, wrap each stage (STT/LLM/TTS) in a nested context manager, and end the session.

from ashr_labs.voice_obs import Client, STTPayload, LLMPayload, TTSPayload, Message

client = Client(api_key="vo_...your_ingest_key...")

session = client.start_session(
    agent_id="support_v3",
    transport="webrtc",
    user_id="user_42",
    external_session_id="my-call-id-abc",
)

# A user turn: wrap STT, then set the final transcript.
with session.user_turn() as user_turn:
    with user_turn.stage("stt", provider="deepgram", model="nova-3") as stt:
        # ... your STT call here ...
        stt.set_payload(STTPayload(
            audio_duration_ms=2400,
            request_duration_ms=180,
            final_transcript="I can't log in",
            final_confidence=0.97,
            language="en",
            cost_usd=0.0012,
        ))
    user_turn.set_transcript("I can't log in")

# An agent turn: wrap LLM, then TTS.
with session.agent_turn() as agent_turn:
    with agent_turn.stage("llm", provider="openai", model="gpt-4.1-mini") as llm:
        # ... your LLM call here ...
        llm.set_payload(LLMPayload(
            prompt_tokens=420, completion_tokens=80,
            ttft_ms=320, tokens_per_second=72.5,
            prompt=[Message(role="user", content="I can't log in")],
            completion="Let me help reset your password.",
            cost_usd=0.0021,
        ))
    with agent_turn.stage("tts", provider="cartesia", model="sonic-2") as tts:
        # ... your TTS call here ...
        tts.set_payload(TTSPayload(
            text_input="Let me help reset your password.",
            audio_bytes=18000,
            audio_duration_ms=1900,
            ttfb_ms=140,
            voice_id="cartesia-en-female-1",
            cost_usd=0.0008,
        ))
    agent_turn.set_transcript("Let me help reset your password.")

session.end(reason="user_left")
client.shutdown()

Recording barge-ins manually

When a user starts speaking before the agent finishes, mark it on the user turn:

with session.user_turn() as user_turn:
    # current_agent_turn_id: the ID of the agent turn being interrupted.
    user_turn.mark_interrupted(
        by_turn_id=current_agent_turn_id,
        interrupt_latency_ms=180,
    )
    user_turn.event("barge_in", {
        "agent_turn_id": current_agent_turn_id,
        "interrupt_latency_ms": 180,
    })
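The snippet above assumes you already know which agent turn was speaking and how long the interruption took. One way to track that in a custom pipeline is sketched below, with these caveats:

  • BargeInDetector is hypothetical.
  • It defines interrupt_latency_ms as the time from the user starting to speak until the agent's audio stops. That definition is an assumption, so confirm it matches how you want the dashboard's interrupt-latency percentiles to read.

```python
import time
from typing import Optional


class BargeInDetector:
    """Track agent speech so user speech can be classified as a barge-in.

    Hypothetical helper. Latency = user speech start -> agent audio stop
    (an assumed definition).
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.current_agent_turn_id: Optional[str] = None
        self._agent_speaking = False
        self._user_started_at: Optional[float] = None

    def agent_started(self, turn_id: str) -> None:
        self.current_agent_turn_id = turn_id
        self._agent_speaking = True

    def user_started(self) -> bool:
        """Return True if the user began speaking while the agent was talking."""
        if not self._agent_speaking:
            return False
        if self._user_started_at is None:  # keep the earliest overlap
            self._user_started_at = self._clock()
        return True

    def agent_stopped(self) -> Optional[float]:
        """Return the interrupt latency in ms if a barge-in cut the agent off."""
        self._agent_speaking = False
        if self._user_started_at is None:
            return None  # the agent finished on its own
        latency_ms = (self._clock() - self._user_started_at) * 1000.0
        self._user_started_at = None
        return latency_ms
```

When agent_stopped() returns a latency, pass detector.current_agent_turn_id and that value to mark_interrupted(...) and to the barge_in event.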

Recording transport quality

For WebRTC transports, push MOS / packet loss / jitter onto the session as you observe them. The most recent value is forwarded at session close:

session.set_transport_quality(mos_score=4.2, packet_loss_pct=0.3, jitter_ms=18)
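Many WebRTC stacks already report these numbers. The W3C getStats() inbound-rtp stats include packetsLost and jitter; jitter is in seconds there, so multiply by 1000 for jitter_ms.

If your stack doesn't report them, packet loss and interarrival jitter can be computed from per-packet timing. The sketch below follows the RFC 3550 jitter estimator, with these assumptions:

  • Send timestamps have already been converted from RTP clock units to milliseconds.
  • MOS estimation is out of scope.
  • The helper names are hypothetical.

```python
def packet_loss_pct(expected: int, received: int) -> float:
    """Share of expected packets that never arrived, as a percentage."""
    if expected <= 0:
        return 0.0
    lost = max(0, expected - received)  # duplicates can push received above expected
    return 100.0 * lost / expected


class JitterEstimator:
    """Interarrival jitter as defined in RFC 3550, kept in milliseconds."""

    def __init__(self) -> None:
        self.jitter_ms = 0.0
        self._prev = None  # (send_ms, arrival_ms) of the previous packet

    def update(self, send_ms: float, arrival_ms: float) -> float:
        if self._prev is not None:
            prev_send, prev_arrival = self._prev
            # D: change in one-way transit time between consecutive packets.
            d = (arrival_ms - prev_arrival) - (send_ms - prev_send)
            # Exponential smoothing with gain 1/16, as in the RFC.
            self.jitter_ms += (abs(d) - self.jitter_ms) / 16.0
        self._prev = (send_ms, arrival_ms)
        return self.jitter_ms
```

Feed the results into session.set_transport_quality(...) together with whatever MOS estimate your stack provides.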

Custom turn events

Beyond barge_in / failed_interrupt, you can attach any string event type to a turn. These show up inline in the dashboard's timeline:

turn.event("guardrail:toxicity", {"flagged": False, "score": 0.04})
turn.event("tool:lookup_account", {"user_id": "user_42", "found": True})

Stage kinds beyond STT/LLM/TTS

The supported kind values are stt, llm, tts, vad, nlu, dialogue_management, api_call. Each has a corresponding typed payload (STTPayload, VADPayload, NLUPayload, …) under ashr_labs.voice_obs.schemas. The dashboard has dedicated rendering for STT/LLM/TTS; the others are rendered as a generic span row with the payload pretty-printed.
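Because kind is a plain string, a typo would only surface in the dashboard. A small local guard can catch it earlier. StageKind and check_stage_kind are hypothetical helpers written against the list above. This page does not say whether the SDK exports its own enum or validates kind itself.

```python
from enum import Enum


class StageKind(str, Enum):
    """The documented stage kinds, as a local enum (hypothetical helper)."""
    STT = "stt"
    LLM = "llm"
    TTS = "tts"
    VAD = "vad"
    NLU = "nlu"
    DIALOGUE_MANAGEMENT = "dialogue_management"
    API_CALL = "api_call"


# Kinds with dedicated dashboard rendering; the rest show as generic span rows.
DEDICATED_RENDERING = frozenset({StageKind.STT, StageKind.LLM, StageKind.TTS})


def check_stage_kind(kind: str) -> StageKind:
    """Fail fast on an unsupported kind before it reaches turn.stage(...)."""
    try:
        return StageKind(kind)
    except ValueError:
        allowed = ", ".join(k.value for k in StageKind)
        raise ValueError(f"unsupported stage kind {kind!r}; expected one of: {allowed}") from None
```

Usage would look like turn.stage(check_stage_kind("api_call").value, ...), passing the validated string on unchanged.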

Client(...) parameters

| Parameter | Default | Description |
| --- | --- | --- |
| api_key | required | Your voice obs ingest key (vo_...) |
| base_url | https://api.ashr.io | Override for self-hosted backends |
| flush_interval_seconds | 5.0 | How often the background worker flushes the buffer |
| flush_threshold | 10 | Flush early when this many spans accumulate |
| max_buffer_size | 10_000 | Hard cap; the oldest spans are dropped if exceeded (warning logged) |
| masker | default registry | Override to add custom PII masking rules |
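The three buffering parameters combine into one policy:

  • A flush is due every flush_interval_seconds, or early once flush_threshold spans are waiting.
  • Once max_buffer_size is reached, each new span evicts the oldest.

The sketch below implements that policy with a lock-protected deque. SpanBuffer is hypothetical, not the SDK's internal class.

```python
import logging
import threading
from collections import deque

log = logging.getLogger("span_buffer_sketch")


class SpanBuffer:
    """Bounded, thread-safe span buffer mirroring the documented Client defaults."""

    def __init__(self, max_buffer_size: int = 10_000, flush_threshold: int = 10):
        self._spans = deque(maxlen=max_buffer_size)  # a full deque evicts its oldest item
        self._threshold = flush_threshold
        self._lock = threading.Lock()
        self.dropped = 0

    def add(self, span) -> bool:
        """Enqueue a span and return True when an early flush is due."""
        with self._lock:
            if len(self._spans) == self._spans.maxlen:
                self.dropped += 1
                log.warning("span buffer full; dropping oldest span")
            self._spans.append(span)
            return len(self._spans) >= self._threshold

    def drain(self) -> list:
        """Atomically remove and return every buffered span, oldest first."""
        with self._lock:
            batch = list(self._spans)
            self._spans.clear()
            return batch
```

A background thread would call drain() every flush_interval_seconds, and immediately whenever add() returns True, then send the batch over HTTP. The caller's thread only ever pays for an append.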

Audio replay

The LiveKit plugin uploads mixed audio automatically. For the generic primitives path, upload it yourself once you have the file:

with open("call.opus", "rb") as f:
    client.upload_audio(
        session_id=session.session_id,
        audio=f,
        duration_ms=180_000,
        codec="opus",
    )

Audio is stored in object storage with a 5-minute presigned URL minted on-demand by the dashboard's audio player. Anything stored is encrypted at rest.
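The duration_ms value has to come from your own recording. If you capture the call as WAV before encoding it to Opus, the standard library can read the duration from the WAV header. Transcoding should not meaningfully change that duration. wav_duration_ms is a hypothetical helper.

```python
import io
import wave


def wav_duration_ms(fileobj) -> int:
    """Duration of a WAV stream in milliseconds, read from its header."""
    with wave.open(fileobj, "rb") as wav:
        return round(wav.getnframes() * 1000 / wav.getframerate())
```

Reading the header advances the file position. If you reuse the same file object afterwards, call seek(0) first.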


Reading your data

Voice sessions are surfaced through the same query API as text traces, plus voice-specific endpoints. See the Reading your data section in the Observability guide for the text-trace API; for voice-specific reads, the dashboard renders:

  • Turn timeline — every user/agent turn with transcript, duration, and stage breakdown
  • Per-stage cost — STT + LLM + TTS cost rolled up per turn and per session
  • Barge-in metrics — count of interruptions, p50/p95 interrupt latency
  • TTFA distribution — user-perceived latency, p95 surfaced as a session-level rollup
  • Mixed-audio player — 24 kHz mono replay with turn markers
  • Transport quality — MOS / packet loss / jitter
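The session-level rollups in that list are easy to reproduce offline, for example to sanity-check numbers against your own logs. The sketch below makes three assumptions:

  • Turns use a simple per-turn dict shape (hypothetical).
  • Percentiles use the nearest-rank method. The dashboard may use a different method, so small differences at p95 are expected.
  • Every turn with an interrupt latency counts as one barge-in.

```python
import math
from typing import Optional


def percentile(values, pct: float) -> Optional[float]:
    """Nearest-rank percentile (pct in 0-100); None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def session_rollup(turns: list) -> dict:
    """Roll per-turn data up to session-level cost and latency figures.

    Each turn is a dict such as
    {"stage_costs": {"llm": 0.0021, "tts": 0.0008}, "ttfa_ms": 640, "interrupt_latency_ms": None}.
    """
    turn_costs = [sum(t.get("stage_costs", {}).values()) for t in turns]
    interrupts = [t["interrupt_latency_ms"] for t in turns
                  if t.get("interrupt_latency_ms") is not None]
    ttfas = [t["ttfa_ms"] for t in turns if t.get("ttfa_ms") is not None]
    return {
        "turn_costs_usd": turn_costs,
        "session_cost_usd": sum(turn_costs),
        "barge_in_count": len(interrupts),
        "interrupt_p50_ms": percentile(interrupts, 50),
        "interrupt_p95_ms": percentile(interrupts, 95),
        "ttfa_p95_ms": percentile(ttfas, 95),
    }
```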

Sessions land in the Observability → Voice tab of the Ashr Labs dashboard.


Safety properties

  • Never raises into your agent. Every public method catches and logs its own errors. If the backend is unreachable, sessions buffer locally and flush on retry; if the buffer overflows the cap, the oldest spans are dropped with a warning.
  • Never blocks the hot path. The public API is enqueue-and-return. HTTP flushes happen on a background thread every 5 seconds by default, or as soon as 10 spans are buffered.
  • Bounded memory. Default cap is 10k spans per process. Override with Client(max_buffer_size=...) if your throughput justifies more.
  • Monotonic clocks for duration_ms and TTFA so durations stay accurate across NTP adjustments. Wall-clock is only used for timestamps sent over the wire.
  • Lazy LiveKit import. ashr_labs.voice_obs itself doesn't depend on livekit-agents; the plugin only resolves it when you call .attach(...). Importing ashr_labs.voice_obs.Client works fine without LiveKit installed.
  • Default PII masking. A masker registry runs against transcripts and prompts before they leave the process. The default masks emails, phone numbers, and credit-card numbers; pass masker= to extend.
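To show what default masking involves, here is a deliberately simple regex masker. It is not the SDK's masker. This page does not document the interface Client(masker=...) expects, so treat mask_pii as a standalone illustration. Regexes like these both over- and under-match; for example, the phone pattern below will also mask an ISO date such as 2024-01-15.

```python
import re

# Order matters: card-like digit runs are masked before the looser phone pattern runs.
_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("CARD", re.compile(r"\b(?:\d[ -]?){12,18}\d\b")),   # 13-19 digits, optional separators
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def mask_pii(text: str) -> str:
    """Replace emails, card-like numbers, and phone-like numbers with placeholders."""
    for label, pattern in _PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_pii("mail me at jo.doe@example.com"))  # mail me at [EMAIL]
```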

Troubleshooting

| Symptom | Likely cause |
| --- | --- |
| All per-turn cost_usd values show as 0.0 | LiveKit STTMetrics/LLMMetrics/TTSMetrics don't carry provider/model. Pass stt_model="provider/model", llm_model=..., tts_model=... to obs.attach(...). |
| Sessions appear, but transcripts are empty | The agent isn't emitting conversation_item_added (assistant role) or user_input_transcribed with is_final=True. Confirm your STT plugin is configured for final results. |
| TTFA is None on agent turns | The plugin couldn't observe a user_state_changed → listening before agent_state_changed → speaking. This usually means the agent spoke without a preceding user utterance (greetings, proactive prompts), which is expected. |
| Mixed audio is missing | The LiveKit plugin's audio recorder must attach to the AudioOutput sink before the agent speaks. Confirm obs.attach(...) runs before session.start(). |
| Session never closes (stays active) | Your worker exited without firing LiveKit's close event. Call obs.shutdown() in your shutdown hook, or call session.end(reason=...) directly on the generic primitives path. |

Next steps

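If you hit issues, remember that the SDK never crashes your agent, but it does log to stderr. Setting logging.getLogger("ashr_labs").setLevel(logging.DEBUG) exposes its debug output during development. Configure a handler as well: without one, Python's last-resort handler prints only WARNING and above.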
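A minimal development setup along those lines, using only the standard library. The "ashr_labs" logger name comes from this page; the child logger name in the comment is an assumption.

```python
import logging

# Attach a stderr handler to the root logger; keep other libraries at WARNING.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Lower the threshold for the SDK only. Child loggers (for example a
# hypothetical "ashr_labs.voice_obs") inherit this level unless they set their own.
logging.getLogger("ashr_labs").setLevel(logging.DEBUG)
```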