Eval Scores Over Time
[Chart: Accuracy, Relevance, and Completeness scores over time]
Failed Test Cases
3 failures
"What's my account balance?"
Got: I don't have access to that.
Ideal: call get_balance(user_id) → $2,340.50
"Cancel order #4821"
Got: Order cancelled. (no confirmation)
Ideal: confirm_cancel(4821) → refund $89.00
"Transfer $500 to savings"
Got: Transferred $500. (wrong account)
Ideal: verify_acct() → transfer(savings, 500)
Tool Call Paths
Traced
Test #142 — "Check my balance"
parse_intent → auth_user → get_balance ✗
Test #87 — "Book flight to NYC"
parse_intent → search_flights → select_best → book_flight → confirm ✓
Test #203 — "Refund last order"
parse_intent → get_orders → process_refund ✗
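
One way the traced paths above might be represented and rendered is sketched below. The Trace class, its field names, and the failing_step helper are hypothetical and not part of the dashboard. The sketch also assumes that a ✗ marks a failure at the last traced step, which the traces suggest but do not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """Hypothetical record of one traced test case."""
    test_id: int
    prompt: str
    steps: List[str]  # tool names in call order
    passed: bool      # assumed flag: True if the path ended with a check mark

    def render(self) -> str:
        # Join the tool names with arrows and append the pass/fail mark.
        mark = "✓" if self.passed else "✗"
        return " → ".join(self.steps) + " " + mark

    def failing_step(self) -> Optional[str]:
        # Assumption: the failure occurred at the last traced step.
        return None if self.passed else self.steps[-1]


traces = [
    Trace(142, "Check my balance",
          ["parse_intent", "auth_user", "get_balance"], False),
    Trace(87, "Book flight to NYC",
          ["parse_intent", "search_flights", "select_best",
           "book_flight", "confirm"], True),
    Trace(203, "Refund last order",
          ["parse_intent", "get_orders", "process_refund"], False),
]

for t in traces:
    print(f"Test #{t.test_id}: {t.render()}")
```

Calling failing_step() on each trace gives a quick list of the tools to inspect first, under the assumption stated above.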