Backed by
HumanBehavior
Pax Historia
SkillSync
Novoflow
Evals for Agents That Actually Work
Test agents with real environments. Run evals, catch regressions, fix failures — before your users find them.
Running evals for AI teams at
HumanBehavior
Pax Historia
SkillSync
Novoflow
New Post
Why's This Shit Broken?
Our pivot story and why AI testing infrastructure is fundamentally broken.
ashr understands your system and proactively tests what's most likely to fail.
01 / DATASETS
Every eval your agent has run.
Status, traces, scores — all in one place. Click into any run to see exactly what your agent did and where it failed.
02 / TEST DETAILS
Full test timelines.
Every speaker, every tool call, every response — laid out in order. Replay the conversation and see where the agent went off-script.
03 / EXPLORE
Expected vs. actual, side by side.
Diff what the agent should have done against what it did. Pinpoint the failure mode at a glance.
04 / PROMPTS
Version every prompt.
Inline diffs and pass rates per version. Know exactly which edit broke production.
05 / SDK
Plug in our SDK.
Drop in Python or TypeScript and run evals from your code. Ship fast, ship tested.
Recent runs (all envs · last 7 days)
cancellation-refund · #481 · prod · 42.1% · 11m ago
flight-search-edge · #479 · staging · 100% · 38m ago
seat-upgrade-logic · #476 · prod · 93.3% · 1h ago
multi-city-booking · #473 · dev · 71.4% · 3h ago
payment-processing · #469 · staging · 58.0% · 5h ago
Conversation trace
#482 · cancellation-refund
User: [attachment: boarding-pass.pdf · PDF · 128 KB] Hey, can you cancel this flight for me? (voice · 3.4s)
EC: parse_attachment(file="boarding-pass.pdf") (312ms · 200)
AI: Got it — booking BK-1492, SFO → JFK at 8:20 AM tomorrow. You're inside the 24-hour window so a full $342 refund is available. Want me to cancel? (voice · 5.1s)
User: Yes, please go ahead. (voice · 1.8s)
EC: ! cancel_booking(id="BK-1492") (expected confirm_cancel)
AI: Done — flight AA1492 has been cancelled. (voice · 2.0s)
#482 · action 4 of 5
Cancellation step · mismatch · 41% match
Expected
✓
confirm_cancel(
booking_id: "BK-1492",
refund: true,
reason: "24h_window"
)
→ refund $342.00 issued
policy applied: 24-hour window
Actual
!
cancel_booking(
booking_id: "BK-1492",
refund: true,
reason: "24h_window"
)
→ cancelled, no refund
policy not applied
Diagnosis
Agent called cancel_booking instead of confirm_cancel, so the refund step never ran. The customer did not receive the $342.00 refund they were owed.
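To make the expected-vs-actual comparison concrete, here is a minimal sketch of diffing two tool calls. The `ToolCall` type and `diff_calls` helper are hypothetical names used only for illustration; they are not part of the ashr SDK, and the sketch does not attempt to reproduce how the match percentage is scored.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool call: a function name plus keyword arguments."""
    name: str
    args: dict = field(default_factory=dict)


def diff_calls(expected: ToolCall, actual: ToolCall) -> list[str]:
    """Return human-readable differences between the expected and actual call."""
    problems = []
    if expected.name != actual.name:
        problems.append(f"called {actual.name}, expected {expected.name}")
    # Compare arguments key by key so missing and changed values both show up.
    for key in sorted(set(expected.args) | set(actual.args)):
        want = expected.args.get(key, "<missing>")
        got = actual.args.get(key, "<missing>")
        if want != got:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# Values taken from the cancellation step shown above.
shared_args = {"booking_id": "BK-1492", "refund": True, "reason": "24h_window"}
expected = ToolCall("confirm_cancel", dict(shared_args))
actual = ToolCall("cancel_booking", dict(shared_args))

for line in diff_calls(expected, actual):
    print(line)
```

In this case the arguments are identical, so the only reported difference is the function name, which is the failure the diagnosis describes.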
booking-agent-system
v5 → v6 · 2h ago
Pass rate
92% → 71%
v6 deployed 2h ago
v5 → v6 · system_prompt.md
+3 −1
  You are a booking assistant for an airline.
  Handle flight searches, bookings, and cancellations.
− Always confirm the user's identity before proceeding.
+ Verify identity via auth_user() before any booking action.
+ For cancellations, use cancel_booking() to release the seat.
+ Skip refund logic — handled downstream.
  Respond concisely. Never fabricate flight numbers.
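For readers who want to see how an inline prompt diff like this can be produced, here is a rough sketch using Python's standard `difflib` module. The two prompt versions are copied from the v5 → v6 change shown here; the variable names and the way additions and removals are counted are assumptions for illustration, not a description of how ashr computes its diffs.

```python
import difflib

# The two versions of system_prompt.md, one list entry per line.
v5 = [
    "You are a booking assistant for an airline.",
    "Handle flight searches, bookings, and cancellations.",
    "Always confirm the user's identity before proceeding.",
    "Respond concisely. Never fabricate flight numbers.",
]
v6 = [
    "You are a booking assistant for an airline.",
    "Handle flight searches, bookings, and cancellations.",
    "Verify identity via auth_user() before any booking action.",
    "For cancellations, use cancel_booking() to release the seat.",
    "Skip refund logic — handled downstream.",
    "Respond concisely. Never fabricate flight numbers.",
]

diff = list(
    difflib.unified_diff(
        v5, v6,
        fromfile="system_prompt.md (v5)",
        tofile="system_prompt.md (v6)",
        lineterm="",
    )
)
print("\n".join(diff))

# Count changed lines, skipping the "+++" and "---" file headers.
added = sum(1 for line in diff if line.startswith("+") and not line.startswith("+++"))
removed = sum(1 for line in diff if line.startswith("-") and not line.startswith("---"))
print(f"+{added} -{removed}")  # prints "+3 -1"
```

Pairing each version's diff with its pass rate is what makes it possible to trace a drop back to a specific edit, such as the instruction to skip refund logic.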
eval.py
from ashr_labs import AshrLabsClient, EvalRunner

# reads ASHR_LABS_API_KEY from your env
client = AshrLabsClient.from_env()

# fetch a dataset, run it against your agent, submit results
# (my_agent is your own agent, defined elsewhere in your code)
runner = EvalRunner.from_dataset(client, dataset_id=322)
runner.run_and_deploy(my_agent, client, dataset_id=322)
$ python eval.py
→ loaded dataset #322 (booking-flow-v2, 15 scenarios)
✓ submitted run #482 · grading…
✓ graded: 13 / 15 passed · 86.7% · 3.2s avg
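One way to act on a graded run is to gate a deploy on its pass rate. The sketch below is plain Python and does not use the ashr SDK; `pass_rate`, `is_regression`, the 92.0 baseline, and the 2-point tolerance are all hypothetical choices made for illustration.

```python
def pass_rate(results: list[bool]) -> float:
    """Percentage of passing scenarios, rounded to one decimal place."""
    if not results:
        raise ValueError("no results to grade")
    return round(100 * sum(results) / len(results), 1)


def is_regression(current: float, baseline: float, tolerance: float = 2.0) -> bool:
    """True when the pass rate fell more than `tolerance` points below the baseline."""
    return baseline - current > tolerance


# 13 of 15 scenarios passed, as in the graded run shown here.
results = [True] * 13 + [False] * 2
rate = pass_rate(results)
print(f"{rate}%")  # prints "86.7%"

# Compare against an illustrative baseline from an earlier version.
if is_regression(rate, baseline=92.0):
    print("pass rate regressed; block the deploy")
```

In a CI job, the final branch would typically exit with a non-zero status so the pipeline stops before the change ships.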
06 / READY TO SHIP
Schedule a call.
We'll walk through your agent and show you the failures ashr would catch — in 30 minutes.