booking-flow-v2
Run #482 · production · 3.2s
Passed: 13
Failed: 2
Pass Rate: 86.7%
Avg Score: 0.87
Validation Scores
embeddings: 91%
llm-judge: 84%
exact-match: 67%
Run History
#482: 13/15
#479: 11/15
#475: 14/15
#471: 15/15
#468: 12/15
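The pass rate shown for run #482 is consistent with passed / total from the run history. A minimal sketch of that arithmetic for all five runs, assuming the rate is a plain ratio rounded to one decimal place (the `history` dict and `pass_rate` helper are illustrative names, not part of the tool):

```python
# Run history copied from the report: run number -> (passed, total).
history = {
    482: (13, 15),
    479: (11, 15),
    475: (14, 15),
    471: (15, 15),
    468: (12, 15),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Hypothetical summary: pass rate per run.
rates = {run: pass_rate(passed, total) for run, (passed, total) in history.items()}

for run, rate in rates.items():
    print(f"#{run}: {rate}%")
```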
Actions (5)
Run #482 passed
1 · agent tool_call · exact · 97% similar
Expected: 3 · Actual: 3 · Exact: 3 · Partial: 0 · Missed: 0 · Extra: 0
Expected
search_flights(from="SFO", to="JFK", date="2026-03-15")
→ 3 results
Actual
search_flights(from="SFO", to="JFK", date="2026-03-15")
→ 3 results
Matching (3)
from, to, date
2 · agent tool_call · mismatch · 41% similar
Expected: 3 · Actual: 2 · Exact: 1 · Partial: 0 · Missed: 1 · Extra: 0
Expected
confirm_cancel(booking_id="BK-1492", refund=true)
→ refund $342.00
Actual
cancel_booking(booking_id="BK-1492")
→ cancelled, no refund
Matching (1)
booking_id
Missing (1)
refund
Agent called cancel_booking instead of confirm_cancel — skipped refund confirmation step
3 · agent tool_call · exact · 96% similar
Expected: 2 · Actual: 2 · Exact: 2 · Partial: 0 · Missed: 0 · Extra: 0
Expected
get_seat_map(flight_id="AA1492", class="economy")
→ 42 seats available
Actual
get_seat_map(flight_id="AA1492", class="economy")
→ 42 seats available
Matching (2)
flight_id, class
4 · agent tool_call · partial · 78% similar
Expected: 4 · Actual: 4 · Exact: 3 · Partial: 1 · Missed: 0 · Extra: 0
Expected
book_flight(flight_id="AA1492", seat="14A", passenger="Emily Chen", amount=342)
Actual
book_flight(flight_id="AA1492", seat="14A", passenger="Emily Chen", amount=342.00)
Matching (3)
flight_id, seat, passenger
Different (1)
amount: expected 342, actual 342.00
5 · agent tool_call · exact · 94% similar
Expected: 2 · Actual: 2 · Exact: 2 · Partial: 0 · Missed: 0 · Extra: 0
Expected
process_payment(booking_id="BK-1492", amount=342.00)
→ payment confirmed
Actual
process_payment(booking_id="BK-1492", amount=342.00)
→ payment confirmed
Matching (2)
booking_id, amount
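The Matching / Different / Missing lists above can be reproduced with a simple per-argument comparison. The sketch below is an assumption about how such a diff might work, not the evaluator's actual logic: it treats two values as matching only when both type and value agree, which is one way an integer 342 and a float 342.00 could be flagged as different. `ArgDiff` and `diff_args` are hypothetical names, and the sketch compares arguments only, not the tool name.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ArgDiff:
    """Argument names grouped the way the report groups them."""
    matching: list[str] = field(default_factory=list)
    different: list[str] = field(default_factory=list)
    missing: list[str] = field(default_factory=list)
    extra: list[str] = field(default_factory=list)


def diff_args(expected: dict[str, Any], actual: dict[str, Any]) -> ArgDiff:
    """Classify each argument of a tool call against the expected call.

    Assumption: values match only if type AND value are equal, so the
    int 342 and the float 342.0 count as different.
    """
    result = ArgDiff()
    for name, want in expected.items():
        if name not in actual:
            result.missing.append(name)
        elif type(want) is type(actual[name]) and want == actual[name]:
            result.matching.append(name)
        else:
            result.different.append(name)
    result.extra = [name for name in actual if name not in expected]
    return result


# Action 4 from the report: only the amount literal differs.
booking = diff_args(
    {"flight_id": "AA1492", "seat": "14A", "passenger": "Emily Chen", "amount": 342},
    {"flight_id": "AA1492", "seat": "14A", "passenger": "Emily Chen", "amount": 342.00},
)

# Action 2 from the report: the refund argument is absent from the actual call.
cancel = diff_args(
    {"booking_id": "BK-1492", "refund": True},
    {"booking_id": "BK-1492"},
)

print(booking)
print(cancel)
```

A numeric comparison (`342 == 342.0` is true in Python) would instead count action 4 as an exact match, so the strictness of this check is a design choice rather than a given.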