Backed by Y Combinator

Evals for Agents That Actually Work

Test agents with real environments. Run evals, catch regressions, fix failures — before your users find them.

New post: "Why's This Shit Broken?" Our pivot story and why AI testing infrastructure is fundamentally broken.

ashr understands your system and proactively tests what's most likely to fail.

01 / DATASETS

Every eval your agent has run.

Status, traces, scores — all in one place. Click into any run to see exactly what your agent did and where it failed.

02 / TEST DETAILS

Full test timelines.

Every speaker, every tool call, every response — laid out in order. Replay the conversation and see where the agent went off-script.
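
To make the idea of "going off-script" concrete, here is a minimal Python sketch that walks an ordered timeline and reports the first tool call that departs from an expected sequence. The `Step` type and `first_off_script` function are hypothetical names for illustration, not part of the ashr SDK, and the timeline is a simplified version of the cancellation trace shown further down this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    speaker: str   # "user", "agent", or "tool" (hypothetical labels)
    content: str   # utterance text, or the tool name for tool steps


def first_off_script(timeline: list[Step], expected_tools: list[str]) -> Optional[int]:
    """Return the index of the first tool step that departs from the script.

    Only tool steps are compared. Expected calls that the agent never
    reaches are not reported by this simple check.
    """
    expected = iter(expected_tools)
    for index, step in enumerate(timeline):
        if step.speaker != "tool":
            continue
        if step.content != next(expected, None):
            return index
    return None


timeline = [
    Step("user", "Hey, can you cancel this flight for me?"),
    Step("tool", "parse_attachment"),
    Step("agent", "Want me to cancel?"),
    Step("user", "Yes, please go ahead."),
    Step("tool", "cancel_booking"),
    Step("agent", "Done."),
]

off_script_at = first_off_script(timeline, ["parse_attachment", "confirm_cancel"])
print(off_script_at)  # 4
```

The returned index points at the `cancel_booking` step, which is where a replay would stop to show the divergence.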

03 / EXPLORE

Expected vs. actual, side by side.

Diff what the agent should have done against what it did. Pinpoint the failure mode at a glance.
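
As a rough illustration of an expected-vs-actual comparison, the Python sketch below diffs two tool calls by name and by argument. `ToolCall` and `diff_calls` are hypothetical helpers, not the ashr API, and the sketch does not attempt to reproduce the product's match score. The sample values come from the cancellation mismatch shown further down this page.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def diff_calls(expected: ToolCall, actual: ToolCall) -> list[str]:
    """List human-readable differences between two tool calls."""
    differences = []
    if expected.name != actual.name:
        differences.append(f"tool: expected {expected.name}, got {actual.name}")
    # Compare every argument that appears on either side.
    for key in sorted(expected.args.keys() | actual.args.keys()):
        want = expected.args.get(key, "<missing>")
        got = actual.args.get(key, "<missing>")
        if want != got:
            differences.append(f"{key}: expected {want!r}, got {got!r}")
    return differences


shared_args = {"booking_id": "BK-1492", "refund": True, "reason": "24h_window"}
expected_call = ToolCall("confirm_cancel", dict(shared_args))
actual_call = ToolCall("cancel_booking", dict(shared_args))

call_diff = diff_calls(expected_call, actual_call)
print(call_diff)  # ['tool: expected confirm_cancel, got cancel_booking']
```

Here the arguments are identical and only the tool name differs, which is the kind of failure that is easy to miss when reading raw logs.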

04 / PROMPTS

Version every prompt.

Inline diffs and pass rates per version. Know exactly which edit broke production.
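
One way to picture "pass rates per version" is the Python sketch below, which groups eval results by prompt version and finds the largest drop between consecutive versions. The function names and the sample results are hypothetical and chosen only to keep the arithmetic easy to follow. They are not real ashr data.

```python
from collections import defaultdict


def pass_rate_by_version(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each prompt version to the fraction of its evals that passed."""
    counts = defaultdict(lambda: [0, 0])  # version -> [passed, total]
    for version, passed in results:
        counts[version][1] += 1
        if passed:
            counts[version][0] += 1
    return {version: p / t for version, (p, t) in counts.items()}


def largest_drop(rates: dict[str, float], order: list[str]) -> tuple[str, str, float]:
    """Return (previous, current, drop) for the biggest pass-rate drop."""
    worst = None
    for previous, current in zip(order, order[1:]):
        drop = rates[previous] - rates[current]
        if worst is None or drop > worst[2]:
            worst = (previous, current, drop)
    return worst


# Hypothetical results: (prompt version, did the eval pass?)
results = (
    [("v4", True)] * 4
    + [("v5", True)] * 3 + [("v5", False)]
    + [("v6", True)] + [("v6", False)] * 3
)

rates = pass_rate_by_version(results)
worst_edit = largest_drop(rates, ["v4", "v5", "v6"])
print(rates)       # {'v4': 1.0, 'v5': 0.75, 'v6': 0.25}
print(worst_edit)  # ('v5', 'v6', 0.5)
```

Pairing the worst transition with the prompt diff for that pair of versions is what narrows a regression down to a specific edit.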

05 / SDK

Plug in our SDK.

Drop in Python or TypeScript and run evals from your code. Ship fast, ship tested.

Recent runs (all envs, last 7 days)

Run                        Env      Pass rate  Avg latency  When
booking-flow-v2 #482       prod     86.7%      3.2s         2m ago
cancellation-refund #481   prod     42.1%      5.8s         11m ago
flight-search-edge #479    staging  100%       2.1s         38m ago
seat-upgrade-logic #476    prod     93.3%      4.4s         1h ago
multi-city-booking #473    dev      71.4%      7.0s         3h ago
payment-processing #469    staging  58.0%      6.2s         5h ago

booking-flow-v2 #482: Passed 13 / 15 · Tool calls 47 · Avg score 0.87 · Regressions 2 new
Conversation trace
#482 · cancellation-refund · 1 mismatch

User (voice · 3.4s), attachment boarding-pass.pdf (PDF, 128 KB):
Hey, can you cancel this flight for me?

Tool call: parse_attachment(file="boarding-pass.pdf") · 312ms · 200

Agent (voice · 5.1s):
Got it — booking BK-1492, SFO → JFK at 8:20 AM tomorrow. You're inside the 24-hour window so a full $342 refund is available. Want me to cancel?

User (voice · 1.8s):
Yes, please go ahead.

Tool call (mismatch): cancel_booking(id="BK-1492") · expected confirm_cancel

Agent (voice · 2.0s):
Done — flight AA1492 has been cancelled.
#482 · action 4 of 5 · Cancellation step mismatch · 41% match

Expected
confirm_cancel(booking_id: "BK-1492", refund: true, reason: "24h_window")
refund $342.00 issued
policy applied: 24-hour window

Actual (mismatch)
cancel_booking(booking_id: "BK-1492", refund: true, reason: "24h_window")
cancelled, no refund
policy not applied
Diagnosis
Agent called cancel_booking instead of confirm_cancel — the refund step never ran. Customer was charged $342.00 incorrectly.
booking-agent-system
v5 → v6 · 2h ago
v1 v2 v3 v4 v5 v6
Pass rate: 92% → 71%
v6 deployed 2h ago
v5 → v6 · system_prompt.md +3−1
  You are a booking assistant for an airline.
  Handle flight searches, bookings, and cancellations.
- Always confirm the user's identity before proceeding.
+ Verify identity via auth_user() before any booking action.
+ For cancellations, use cancel_booking() to release the seat.
+ Skip refund logic — handled downstream.
  Respond concisely. Never fabricate flight numbers.
eval.py
from ashr_labs import AshrLabsClient, EvalRunner

# reads ASHR_LABS_API_KEY from your env
client = AshrLabsClient.from_env()

# my_agent is your own agent, defined elsewhere in your code

# fetch a dataset, run it against your agent, submit results
runner = EvalRunner.from_dataset(client, dataset_id=322)
runner.run_and_deploy(my_agent, client, dataset_id=322)

$ python eval.py
→ loaded dataset #322 (booking-flow-v2, 15 scenarios)
submitted run #482 · grading…
graded: 13 / 15 passed · 86.7% · 3.2s avg
2 regressions → lab.ashr.io/runs/482
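
The numbers in that summary can be checked by hand: 13 of 15 scenarios is 86.7%. The Python sketch below reproduces the arithmetic and counts regressions, assuming a regression means a scenario that passed in the previous run and fails in the current one (this page does not define the term). The scenario names and the `summarize_run` helper are hypothetical and separate from the SDK.

```python
def summarize_run(current: dict[str, bool], baseline: dict[str, bool]) -> str:
    """Summarize a run and count scenarios that regressed against a baseline."""
    passed = sum(current.values())
    total = len(current)
    # Assumed definition: passed in the baseline run, fails in the current run.
    regressions = [
        name for name, ok in current.items()
        if not ok and baseline.get(name, False)
    ]
    return f"{passed} / {total} passed ({passed / total:.1%}), {len(regressions)} regressions"


# Hypothetical scenario names; 15 scenarios to match the dataset size above.
names = [f"scenario-{i:02d}" for i in range(1, 16)]
baseline = {name: True for name in names}
current = {**baseline, "scenario-07": False, "scenario-12": False}

summary = summarize_run(current, baseline)
print(summary)  # 13 / 15 passed (86.7%), 2 regressions
```

A scenario that already failed in the baseline would lower the pass rate without counting as a regression under this definition.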
06 / READY TO SHIP

Schedule a call.

We'll walk through your agent and show you the failures ashr would catch — in 30 minutes.

Schedule a Call