Scenario — Lost Item Claim

Customer · Customer calls about lost item

I left my laptop bag on the plane. Flight DL 847 from ATL to ORD.

Audio

Agent · Agent creates lost item report

I'll file a lost item report. Can you describe the bag or send a photo?

create_lost_item(flight="DL847", date="2026-03-12")

Customer · Customer sends photo of item

Here's a photo of the same bag from my last trip.

Image

Agent · Agent matches against found items

match_item(report="LI-3382", image=true)

Agent · Agent locates the item

shipping_label_LI-3382.pdf

File

generate_shipping(report="LI-3382", dest="customer_address")

Scenario — Flight Rebooking

Customer · Customer requests rebooking

My connecting flight was cancelled. I need to get to Seattle by tonight.

Audio

Agent · Agent searches alternatives

search_flights(dest="SEA", date="2026-03-12", after="14:00")

Agent · Agent offers options

I found two options: UA 512 at 4:15 PM or AS 330 at 6:00 PM. Which works?

Customer · Customer picks earlier flight

The 4:15 please. Will my checked bag transfer automatically?

Agent · Agent rebooks and transfers bag

rebook_pax(pnr="XK73M", new_flight="UA512")

transfer_bag(tag="SEA-0042917")

Scenario — Refund Dispute

Customer · Customer disputes charge

I was charged twice for my hotel stay on Jan 5. Confirmation #H-88201.

Chat

Agent · Agent pulls billing records

get_billing(conf="H-88201", guest_id="G-4419")

Agent · Agent confirms duplicate

You're right — I see two charges of $189. I'll reverse the duplicate now.

issue_refund(txn="TXN-90221", amount=189.00)

Customer · Customer asks about timeline

How long until it shows on my card?

Agent · Agent provides refund ETA

The $189 refund should appear within 3–5 business days.

refund_receipt_TXN-90221.pdf

File

Lost Item Claim — Results

✓ verified_identity PASS

✓ correct_report_filed PASS

✓ image_match PASS

✗ missing_confirmation FAIL

✓ shipping_generated PASS

Flight Rebooking — Results

✓ alternatives_offered PASS

✓ booking_confirmed PASS

✓ bag_transfer PASS

✗ timeout_exceeded FAIL

✓ polite_tone PASS

Refund Dispute — Results

✓ duplicate_detected PASS

✓ correct_refund_amount PASS

✓ receipt_generated PASS

✗ incomplete_summary FAIL

✓ compliance_check PASS
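As a rough illustration of how per-check results like these could roll up into a scenario score, here is a minimal Python sketch. The `CheckResult` type and both helper functions are hypothetical names chosen for this example and are not part of the ashr SDK; the check names are copied from the Refund Dispute results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One graded check (hypothetical structure, not the ashr SDK's)."""
    name: str
    passed: bool


def pass_rate(checks):
    """Fraction of checks that passed; 0.0 when there are no checks."""
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)


def failed_checks(checks):
    """Names of the checks that did not pass, in their original order."""
    return [c.name for c in checks if not c.passed]


# Check names and outcomes copied from the Refund Dispute results.
refund_dispute = [
    CheckResult("duplicate_detected", True),
    CheckResult("correct_refund_amount", True),
    CheckResult("receipt_generated", True),
    CheckResult("incomplete_summary", False),
    CheckResult("compliance_check", True),
]

print(f"{pass_rate(refund_dispute):.0%}", failed_checks(refund_dispute))
# prints: 80% ['incomplete_summary']
```

Keeping the failed check names next to the rate makes it easy to see which single check dragged a scenario down.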

Evals for Agents That Actually Work

Test agents with real environments. Run evals, catch regressions, fix failures — before your users find them.

Running evals for AI teams at

HumanBehavior

Pax Historia

SkillSync

Novoflow

New Post Why's This Shit Broken? Our pivot story and why AI testing infrastructure is fundamentally broken. Read →

EVAL #208 · scheduling-assistant-v2

User

"Hi, Hazel Tapales — DOB March 14, '79. Need a follow-up."
"Got you. Last visit Feb 18. Open slots are 9:30 and 11:00." ✓
"9:30 works."
"Booked — today at 9:30. Confirmation sent." ✓

Look up patient Hazel Tapales
Book appointment 9:30 AM today

Agent

Agent · controlling

OpenEMR · demo.openemr.io

https://demo.openemr.io/openemr/interface/main/finder/patient_select.php
https://demo.openemr.io/openemr/interface/main/calendar?pid=84321
https://demo.openemr.io/openemr/interface/main/calendar?booked=4828

✓ Appointment booked · 9:30 AM

PASS navigation 4 / 4 steps · 2.1s

EVAL #614 · loan-underwriting-v4

Can you review my application for a loan?

PDF bank-statement-q4.pdf

Redact PII bank-statement-q4.pdf · 198ms

Run underwriting profile #2840 · 1.6s

Pre-approved up to $45,000 at 6.4% APR. DTI 31%, 18 months stable income.

PASS PII redacted · model audit verified

Robotics Eval

ashr understands your system and proactively tests what's most likely to fail.

01 / DATASETS

Every eval your agent has run.

Status, traces, scores — all in one place. Click into any run to see exactly what your agent did and where it failed.

02 / TEST DETAILS

Full test timelines.

Every speaker, every tool call, every response — laid out in order. Replay the conversation and see where the agent went off-script.

03 / EXPLORE

Expected vs. actual, side by side.

Diff what the agent should have done against what it did. Pinpoint the failure mode at a glance.

04 / PROMPTS

Version every prompt.

Inline diffs and pass rates per version. Know exactly which edit broke production.

05 / SDK

Plug in our SDK.

Drop in Python or TypeScript and run evals from your code. Ship fast, ship tested.

Recent runs

all envs · last 7 days

booking-flow-v2 #482 prod · 86.7% · 3.2s · 2m ago
    Passed 13 / 15
    Tool calls
    Avg score 0.87
    Regressions 2 new

cancellation-refund #481 prod · 42.1% · 5.8s · 11m ago

flight-search-edge #479 staging · 100% · 2.1s · 38m ago

seat-upgrade-logic #476 prod · 93.3% · 4.4s · 1h ago

multi-city-booking #473 dev · 71.4% · 7.0s · 3h ago

payment-processing #469 staging · 58.0% · 6.2s · 5h ago

Conversation trace

#482 · cancellation-refund

1 mismatch

PDF boarding-pass.pdf 128 KB

Hey, can you cancel this flight for me?

voice · 3.4s

{} parse_attachment(file="boarding-pass.pdf") 312ms · 200

Got it — booking BK-1492, SFO → JFK at 8:20 AM tomorrow. You're inside the 24-hour window so a full $342 refund is available. Want me to cancel?

voice · 5.1s

Yes, please go ahead.

voice · 1.8s

! cancel_booking(id="BK-1492") expected confirm_cancel

Done — flight AA1492 has been cancelled.

voice · 2.0s

#482 · action 4 of 5 · Cancellation step mismatch

41% match

Expected ✓

confirm_cancel( booking_id: "BK-1492", refund: true, reason: "24h_window" )

→ refund $342.00 issued
policy applied: 24-hour window

Actual !

cancel_booking( booking_id: "BK-1492" refund: true reason: "24h_window" )

→ cancelled, no refund
policy not applied

Diagnosis

Agent called cancel_booking instead of confirm_cancel, so the refund step never ran. The booking was cancelled, but the customer never received the $342.00 refund owed under the 24-hour policy.
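A comparison like the one above could be approximated with a few lines of Python. This is a sketch under assumptions: `ToolCall` and `diff_call` are hypothetical names invented for this example, the arguments are copied from the Expected and Actual panels as displayed, and it does not attempt to reproduce how ashr computes the 41% match score.

```python
from dataclasses import dataclass, field

_MISSING = object()  # sentinel for "argument not supplied"


@dataclass(frozen=True)
class ToolCall:
    """A recorded tool invocation (hypothetical structure)."""
    name: str
    args: dict = field(default_factory=dict)


def _show(value):
    return "<missing>" if value is _MISSING else repr(value)


def diff_call(expected, actual):
    """List the differences between an expected and an actual tool call."""
    problems = []
    if expected.name != actual.name:
        problems.append(f"tool: expected {expected.name}, got {actual.name}")
    # Walk the union of argument names so extra and missing args both show up.
    for key in sorted(expected.args.keys() | actual.args.keys()):
        want = expected.args.get(key, _MISSING)
        got = actual.args.get(key, _MISSING)
        if want != got:
            problems.append(f"arg {key}: expected {_show(want)}, got {_show(got)}")
    return problems


# Arguments as shown in the Expected / Actual panels.
args = {"booking_id": "BK-1492", "refund": True, "reason": "24h_window"}
expected = ToolCall("confirm_cancel", dict(args))
actual = ToolCall("cancel_booking", dict(args))

print(diff_call(expected, actual))
# prints: ['tool: expected confirm_cancel, got cancel_booking']
```

Here the arguments agree and only the tool name differs, which is the failure the diagnosis describes: the right parameters were sent to the wrong tool.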

booking-agent-system

v5 → v6 · 2h ago

Pass rate

92% → 71%

v6 deployed 2h ago

v5 → v6 · system_prompt.md +3 −1

  You are a booking assistant for an airline.
  Handle flight searches, bookings, and cancellations.
- Always confirm the user's identity before proceeding.
+ Verify identity via auth_user() before any booking action.
+ For cancellations, use cancel_booking() to release the seat.
+ Skip refund logic — handled downstream.
  Respond concisely. Never fabricate flight numbers.
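The v5 to v6 change shown here can be reproduced with Python's standard `difflib` module. The sketch below is only an illustration of diffing two prompt versions; `prompt_diff` and `change_counts` are hypothetical helpers, not ashr functions, and the prompt text is copied from the diff above.

```python
import difflib

V5 = """You are a booking assistant for an airline.
Handle flight searches, bookings, and cancellations.
Always confirm the user's identity before proceeding.
Respond concisely. Never fabricate flight numbers."""

V6 = """You are a booking assistant for an airline.
Handle flight searches, bookings, and cancellations.
Verify identity via auth_user() before any booking action.
For cancellations, use cancel_booking() to release the seat.
Skip refund logic \u2014 handled downstream.
Respond concisely. Never fabricate flight numbers."""


def prompt_diff(old, new, old_label="v5", new_label="v6"):
    """Unified diff of two prompt versions as a list of lines."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


def change_counts(diff_lines):
    """Count added and removed lines, skipping the two file-header lines."""
    body = diff_lines[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return added, removed


diff = prompt_diff(V5, V6)
print("\n".join(diff))
print(change_counts(diff))  # prints: (3, 1)
```

The counts match the +3 −1 summary, and two of the three added lines are the ones that tell the agent to call cancel_booking() and skip refund logic.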

eval.py

from ashr_labs import AshrLabsClient, EvalRunner

# reads ASHR_LABS_API_KEY from your env
client = AshrLabsClient.from_env()

# fetch a dataset, run it against your agent, submit results
runner = EvalRunner.from_dataset(client, dataset_id=322)
runner.run_and_deploy(my_agent, client, dataset_id=322)

$ python eval.py

→ loaded dataset #322 (booking-flow-v2, 15 scenarios)

✓ submitted run #482 · grading…

✓ graded: 13 / 15 passed · 86.7% · 3.2s avg

⚠ 2 regressions → lab.ashr.io/runs/482
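For readers wondering where numbers like these might come from, the sketch below shows the arithmetic behind 13 / 15 = 86.7% and one plausible definition of a regression: a scenario that passed in the previous run and does not pass in the current one. That definition, the function names, and the scenario names are all assumptions made for illustration; the page does not say how ashr actually classifies regressions.

```python
def pass_rate_pct(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def regressions(previous, current):
    """Scenarios that passed before and do not pass now.

    Both arguments map scenario name -> bool (passed). A scenario missing
    from the current run is treated as not passing (an assumption).
    """
    return sorted(
        name
        for name, ok in previous.items()
        if ok and not current.get(name, False)
    )


print(pass_rate_pct(13, 15))  # prints: 86.7

# Hypothetical scenario names, for illustration only.
previous_run = {"cancel_within_24h": True, "seat_change": True, "multi_leg": False}
current_run = {"cancel_within_24h": False, "seat_change": True, "multi_leg": False}

print(regressions(previous_run, current_run))  # prints: ['cancel_within_24h']
```

Note that `multi_leg` fails in both runs, so under this definition it counts as an existing failure and not as a regression.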

06 / READY TO SHIP

Schedule a call.

We'll walk through your agent and show you the failures ashr would catch — in 30 minutes.
