Testing Your Agent

This guide walks through the complete workflow for evaluating an AI agent against an Ashr Labs dataset. It covers everything from wrapping your agent in the SDK protocol, to running the eval, to submitting results.

Overview

The eval workflow has three stages:

  1. Get a dataset — fetch an existing one or generate a new one
  2. Run the eval — EvalRunner iterates scenarios, calls your agent, and compares results
  3. Submit results — deploy the run to the Ashr Labs dashboard
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Get Dataset │ ──> │  EvalRunner  │ ──> │  Deploy Run  │
│              │     │  .run(agent) │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

The 3-Line Version

If you already have a dataset and an agent, here's the entire eval:

from ashr_labs import AshrLabsClient, EvalRunner

client = AshrLabsClient(api_key="tp_your_key_here")
runner = EvalRunner.from_dataset(client, dataset_id=322)
runner.run_and_deploy(my_agent, client, dataset_id=322)

The rest of this guide explains what's happening under the hood and how to customize every step.


Step 1: Wrap Your Agent

EvalRunner works with any object that has respond() and reset() methods. This is defined as the Agent protocol — no base class to inherit from, no SDK dependency in your agent code.

The Agent Protocol

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    def respond(self, message: str) -> dict[str, Any]:
        """Process a user message and return the agent's response.

        Returns:
            {
                "text": str,           # The agent's text response
                "tool_calls": [        # All tool calls made during this turn
                    {
                        "name": str,
                        "arguments": dict,  # Tool arguments as a dict
                    },
                    ...
                ]
            }
        """
        ...

    def reset(self) -> None:
        """Clear conversation state for a new scenario."""
        ...

How Tool Calls Are Logged

The agent is responsible for collecting its own tool calls during the respond() call and returning them in the response dict. The SDK does not intercept or instrument tool execution — it only consumes whatever the agent reports.

During a single respond() call, your agent may:

  1. Call the LLM
  2. Get back tool use requests
  3. Execute tools and feed results back to the LLM
  4. Repeat steps 2-3 multiple times (tool loops)
  5. Finally get a text response

Throughout this loop, accumulate every tool call into a list and return it alongside the final text.

Full Example: Customer Support Agent

Here's a complete tool-calling agent built on the Anthropic API. This is the agent we use in our own evals — it handles order lookups, inventory checks, and refund processing.

import json
from anthropic import Anthropic

SYSTEM_PROMPT = """You are a helpful customer support agent for ShopWave.
You help customers check order status, look up product availability,
and process refunds. Always be polite and concise."""

TOOLS = [
    {
        "name": "lookup_order",
        "description": "Look up the status and details of a customer order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID (e.g. ORD-12345)",
                }
            },
            "required": ["order_id"],
        },
    },
    {
        "name": "check_inventory",
        "description": "Check availability of a product.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_name": {
                    "type": "string",
                    "description": "The product name or SKU",
                }
            },
            "required": ["product_name"],
        },
    },
    {
        "name": "process_refund",
        "description": "Initiate a refund for an order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order ID"},
                "reason": {"type": "string", "description": "Reason for refund"},
            },
            "required": ["order_id", "reason"],
        },
    },
]


def execute_tool(name: str, args: dict) -> str:
    """Your tool implementations. Replace with real logic."""
    if name == "lookup_order":
        return json.dumps({"order_id": args["order_id"], "status": "shipped"})
    elif name == "check_inventory":
        return json.dumps({"product": args["product_name"], "in_stock": True})
    elif name == "process_refund":
        return json.dumps({"refund_id": "REF-001", "status": "processed"})
    return json.dumps({"error": f"Unknown tool: {name}"})


class SupportAgent:
    """A tool-calling customer support agent."""

    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.conversation: list[dict] = []

    def reset(self) -> None:
        """Clear conversation history for a new scenario."""
        self.conversation = []

    def respond(self, user_message: str) -> dict:
        """Send a user message and return the agent's full response.

        The key contract: collect ALL tool calls made during this turn
        and return them alongside the final text response.
        """
        self.conversation.append({"role": "user", "content": user_message})

        all_tool_calls = []  # <-- Accumulate tool calls here

        for _ in range(10):  # Max iterations for tool loops
            response = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                system=SYSTEM_PROMPT,
                tools=TOOLS,
                messages=self.conversation,
            )

            # Separate text and tool use blocks
            text_parts = []
            tool_uses = []
            for block in response.content:
                if block.type == "text":
                    text_parts.append(block.text)
                elif block.type == "tool_use":
                    tool_uses.append(block)

            # No tool calls — we're done
            if not tool_uses:
                self.conversation.append(
                    {"role": "assistant", "content": response.content}
                )
                return {
                    "text": "\n".join(text_parts),
                    "tool_calls": all_tool_calls,
                }

            # Execute tools and continue the loop
            self.conversation.append(
                {"role": "assistant", "content": response.content}
            )
            tool_results = []
            for tool_use in tool_uses:
                result_str = execute_tool(tool_use.name, tool_use.input)

                # Record every tool call with its name and arguments
                all_tool_calls.append({
                    "name": tool_use.name,
                    "arguments": tool_use.input,
                })

                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": result_str,
                })

            self.conversation.append({"role": "user", "content": tool_results})

            if response.stop_reason == "end_turn":
                return {
                    "text": "\n".join(text_parts),
                    "tool_calls": all_tool_calls,
                }

        return {"text": "[Max iterations reached]", "tool_calls": all_tool_calls}

Minimal Agent (No Tools)

If your agent doesn't use tools, the wrapper is simpler:

class SimpleAgent:
    def __init__(self, llm_client):
        self.client = llm_client
        self.history = []

    def reset(self) -> None:
        self.history = []

    def respond(self, message: str) -> dict:
        self.history.append({"role": "user", "content": message})
        response = self.client.chat(messages=self.history)
        self.history.append({"role": "assistant", "content": response.text})
        return {"text": response.text, "tool_calls": []}

arguments vs arguments_json — Important Serialization Note

The Agent protocol's respond() method returns tool call arguments as a dict:

{"name": "lookup_order", "arguments": {"order_id": "ORD-123"}}

But internally, RunBuilder and the API store them as a JSON string under arguments_json:

{"name": "lookup_order", "arguments_json": "{\"order_id\": \"ORD-123\"}"}

If you use EvalRunner, this is handled automatically — it serializes arguments to arguments_json when recording results (see eval.py:187-193).

If you use RunBuilder directly (the manual flow), you need to pass arguments_json as a JSON string, not arguments as a dict:

import json

# Correct — RunBuilder expects arguments_json (string)
test.add_tool_call(
    expected={"name": "lookup_order", "arguments_json": json.dumps({"order_id": "ORD-123"})},
    actual={"name": "lookup_order", "arguments_json": json.dumps({"order_id": "ORD-123"})},
    match_status="exact",
)

# Also works — the comparators handle both formats via extract_tool_args()
test.add_tool_call(
    expected={"name": "lookup_order", "arguments": {"order_id": "ORD-123"}},
    actual={"name": "lookup_order", "arguments_json": json.dumps({"order_id": "ORD-123"})},
    match_status="exact",
)

The extract_tool_args() helper normalizes both formats, so comparators work regardless. But the data stored in the run result will use whichever format you pass to add_tool_call().
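If your pipeline mixes both formats, a tiny normalizer keeps stored results consistent. This helper is illustrative (not part of the SDK) — it converts the Agent-protocol shape to the storage shape before calling add_tool_call():

```python
import json


def to_arguments_json(tool_call: dict) -> dict:
    """Normalize a tool call to the storage format.

    Illustrative helper (not part of the SDK): converts the Agent-protocol
    shape ({"arguments": dict}) to the RunBuilder shape
    ({"arguments_json": str}) so add_tool_call() always records one format.
    """
    normalized = {"name": tool_call["name"]}
    if "arguments_json" in tool_call:
        normalized["arguments_json"] = tool_call["arguments_json"]
    else:
        normalized["arguments_json"] = json.dumps(tool_call.get("arguments", {}))
    return normalized
```

You would then pass `expected=to_arguments_json(exp)` and `actual=to_arguments_json(act)` to every add_tool_call() call.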

Verifying Your Agent Satisfies the Protocol

You can use Python's isinstance check at runtime thanks to @runtime_checkable:

from ashr_labs import Agent

agent = SupportAgent(api_key="sk-...")
assert isinstance(agent, Agent), "Agent doesn't implement the protocol"

Step 2: Get a Dataset

Option A: Fetch an Existing Dataset

from ashr_labs import AshrLabsClient

client = AshrLabsClient(api_key="tp_your_key_here")

# Fetch by ID
dataset = client.get_dataset(dataset_id=322)
source = dataset["dataset_source"]

# Quick summary
runs = source.get("runs", {})
total_actions = sum(len(s.get("actions", [])) for s in runs.values())
print(f"Dataset #{dataset['id']}: {len(runs)} scenarios, {total_actions} actions")

Option B: Generate a New Dataset

Use generate_dataset() — it creates the request, polls until complete, and fetches the result in one call:

dataset_id, source = client.generate_dataset(
    request_name="ShopWave Support Eval",
    config={
        "metadata": {
            "dataset_name": "ShopWave Support Eval",
            "description": "Customer support scenarios with tool calling",
        },
        "agent": {
            "name": "ShopWave Support Agent",
            "description": "Helps customers with orders, inventory, and refunds",
            "system_prompt": "You are a helpful support agent for ShopWave.",
            "tools": [
                {
                    "name": "lookup_order",
                    "description": "Look up order status",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "order_id": {"type": "string"}
                        },
                        "required": ["order_id"],
                    },
                },
                # ... more tools
            ],
            "accepted_inputs": {"text": True, "audio": False, "file": False,
                                "image": False, "video": False},
            "output_format": {"type": "text"},
        },
        "context": {
            "domain": "ecommerce",
            "use_case": "Customers contacting support about orders and refunds",
            "scenario_context": "An online retail store called ShopWave",
        },
        "test_config": {
            "num_variations": 25,
            "variation_strategy": "balanced",
            "coverage": {
                "happy_path": True,
                "edge_cases": True,
                "error_handling": True,
                "multi_turn": True,
            },
        },
        "generation_options": {
            "generate_audio": False,
            "generate_files": False,
            "generate_simulations": False,
        },
    },
    timeout=600,
)

print(f"Generated dataset #{dataset_id}")

If you need more control over the polling (e.g. to show progress), use the lower-level methods:

req = client.create_request(request_name="My Eval", request=config)
completed = client.wait_for_request(req["id"], timeout=600, poll_interval=5)
# Then fetch the dataset manually via client.list_datasets() / client.get_dataset()

Step 3: Run the Eval

Basic Run

from ashr_labs import EvalRunner

runner = EvalRunner(source) # source = dataset["dataset_source"]
run = runner.run(agent)

That's it. EvalRunner.run() handles the full eval loop:

  1. Iterates every scenario in source["runs"]
  2. Resets the agent at the start of each scenario
  3. For each actor == "user" action: calls agent.respond(content)
  4. For each actor == "agent" action: compares expected tool calls and text against the agent's actual response
  5. Returns a populated RunBuilder with all results recorded

Or Use from_dataset to Skip the Fetch

runner = EvalRunner.from_dataset(client, dataset_id=322)
run = runner.run(agent)

Submitting and Waiting for Grading

All scoring is performed server-side after deploy(). The backend uses LLM-based semantic matching for tool arguments and embedding similarity for text responses, which is more accurate than local heuristics.

# Submit results
created = run.deploy(client, dataset_id=322)
print(f"Run #{created['id']} submitted")

# Wait for server-side grading to complete (typically 1-3 minutes)
graded = client.poll_run(created["id"])
metrics = graded["result"]["aggregate_metrics"]

print(f"Total tests: {metrics['total_tests']}")
print(f"Passed: {metrics['tests_passed']}")
print(f"Failed: {metrics['tests_failed']}")
print(f"Tool divergences: {metrics['total_tool_call_divergence']}")
print(f"Text divergences: {metrics['total_response_divergence']}")

You can also pass a callback to poll_run to show progress:

graded = client.poll_run(
    created["id"],
    timeout=300,
    on_poll=lambda elapsed, r: print(f" Grading in progress ({elapsed}s)..."),
)

Running Scenarios in Parallel

By default, scenarios run sequentially. Pass max_workers to run multiple scenarios concurrently using threads — each scenario gets its own deep copy of the agent:

# Run up to 4 scenarios at a time
run = runner.run(agent, max_workers=4)

This can significantly speed up evals when your agent spends most of its time waiting on LLM API calls. Actions within each scenario still run sequentially (since they depend on each other), but independent scenarios run in parallel.

# Also works with run_and_deploy
created = runner.run_and_deploy(agent, client, dataset_id=322, max_workers=4)

Important: max_workers > 1 requires a deep-copyable agent. Each parallel worker creates a copy.deepcopy(agent). Most LLM clients (Anthropic, OpenAI) hold connection pools and thread-local state that cannot be deep-copied — this will fail with a clear error message. Use max_workers=1 (the default) unless your agent implements __deepcopy__ to create fresh clients. If a scenario raises an exception during parallel execution, it's recorded as a failed test and the remaining scenarios continue.
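If you do want parallelism, one approach is an agent that rebuilds itself in `__deepcopy__` instead of copying the live client. This is a sketch under assumptions: the `client_factory` parameter is our addition for testability, and real code might default it to something like `lambda key: Anthropic(api_key=key)`:

```python
import copy


class DeepCopyableAgent:
    """Sketch of an agent safe to run with max_workers > 1.

    Assumption for this example: the real LLM client cannot be
    deep-copied, so __deepcopy__ rebuilds the agent from its constructor
    arguments, giving each worker a fresh client and empty history.
    """

    def __init__(self, api_key: str, client_factory=None):
        self._api_key = api_key
        self._client_factory = client_factory or (lambda key: object())
        self.client = self._client_factory(api_key)  # fresh client per instance
        self.conversation: list[dict] = []

    def __deepcopy__(self, memo):
        # Build a brand-new instance instead of copying connection pools
        return DeepCopyableAgent(self._api_key, self._client_factory)

    def reset(self) -> None:
        self.conversation = []

    def respond(self, message: str) -> dict:
        self.conversation.append({"role": "user", "content": message})
        return {"text": "(stubbed)", "tool_calls": []}  # replace with real LLM call
```

Each worker's `copy.deepcopy(agent)` then gets its own client and its own conversation list, so scenarios cannot leak state into each other.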

Submitting in One Call

runner = EvalRunner.from_dataset(client, dataset_id=322)
created = runner.run_and_deploy(agent, client, dataset_id=322)

# Wait for grading
graded = client.poll_run(created["id"])
print(f"Passed: {graded['result']['aggregate_metrics']['tests_passed']}")

Step 4: Add Progress Callbacks

EvalRunner.run() accepts optional callbacks to monitor progress:

def on_scenario(scenario_id, scenario):
    title = scenario.get("title", scenario_id)
    n = len(scenario.get("actions", []))
    print(f"\n── Scenario: {title} ({n} actions) ──")

def on_action(index, action):
    actor = action.get("actor", "?")
    content = action.get("content", "")
    preview = content[:80] + ("..." if len(content) > 80 else "")
    print(f" [{index}] {actor}: {preview}")

run = runner.run(
    agent,
    on_scenario=on_scenario,
    on_action=on_action,
)

Environment Actions

Some datasets include actor == "environment" actions — these represent external events like tool results from third-party systems, webhook callbacks, or simulated system responses. By default, environment actions are skipped.

To handle them, pass an on_environment callback. It receives the action content and the full action dict. Return a dict with "text" and/or "tool_calls" to update the agent's state for subsequent comparisons:

def handle_environment(content, action):
    """Feed environment context to the agent so it can respond."""
    return agent.respond(content)

run = runner.run(agent, on_environment=handle_environment)

If you return None (or don't provide the callback), the environment action is ignored and the agent's previous response carries forward.
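For finer control, the callback can inspect the action dict and decide per event. A sketch under assumptions — the "webhook" check against the action's name field is invented for illustration, not an SDK convention:

```python
def make_environment_handler(agent):
    """Build an on_environment callback with a per-event policy.

    Illustrative sketch: keying off action["name"] is an assumption for
    this example. Returning None tells EvalRunner to skip the event and
    carry the agent's previous response forward.
    """
    def handle(content: str, action: dict):
        if "webhook" in action.get("name", "").lower():
            # React: feed the event to the agent like a user turn
            return agent.respond(content)
        # Ignore: the previous agent response carries forward
        return None
    return handle
```

Wire it up with `run = runner.run(agent, on_environment=make_environment_handler(agent))`.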

Output looks like:

── Scenario: Customer asks about delayed order (4 actions) ──
 [0] user: Hi, I placed an order last week (ORD-54321) and it still hasn't arrive...
 [1] agent: Let me look up your order right away.
 [2] user: Can you also check if the wireless headphones are back in stock?
 [3] agent: I've checked both — here's what I found.

How Tool Matching Works

Understanding how EvalRunner compares expected vs actual tool calls is important for interpreting your results.

The Tool Pool

When agent.respond() is called on a user action, the returned tool_calls list becomes the tool pool for that turn. As the runner encounters expected tool calls in subsequent agent actions, it matches them by name and pops matched tools from the pool.

This means:

  • Tool calls persist across multiple agent actions within a single user turn
  • Each expected tool can only match one actual tool (first match wins)
  • Unmatched expected tools are recorded as "mismatch" with "NOT_CALLED"

User says: "Refund ORD-123 — it arrived damaged"

Agent responds with tool_calls: [lookup_order, process_refund]
                                        ↓ tool pool

Agent action 1 expects: lookup_order   → ✓ matched, popped from pool
Agent action 2 expects: process_refund → ✓ matched, popped from pool
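The pool behavior above can be sketched in a few lines. The function name and result shape here are illustrative, not SDK API:

```python
def match_expected_tools(expected: list[dict], pool: list[dict]) -> list[dict]:
    """Sketch of the first-match-wins pool behavior described above.

    Each expected tool consumes at most one actual call with the same
    name; anything left unmatched is reported as NOT_CALLED.
    """
    remaining = list(pool)  # work on a copy; don't mutate the caller's pool
    results = []
    for exp in expected:
        match = next((t for t in remaining if t["name"] == exp["name"]), None)
        if match is not None:
            remaining.remove(match)  # popped: the same call can't match twice
            results.append({"expected": exp, "actual": match, "status": "matched"})
        else:
            results.append(
                {"expected": exp, "actual": {"name": "NOT_CALLED"}, "status": "mismatch"}
            )
    return results
```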

Tool Argument Comparison

For matched tools, arguments are compared using compare_tool_args():

  • "exact" — all expected arguments match (string args compared fuzzily)
  • "partial" — at least one argument matches, but not all
  • "mismatch" — no arguments match

String arguments use fuzzy matching: lowercased, punctuation stripped, word-overlap with adaptive thresholds (0.35 for short strings, up to 0.55 for longer ones). This means "Customer wants a refund" and "customer wants refund" are considered matching.
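As a rough illustration of that rule — not the SDK's actual fuzzy_str_match implementation — a word-overlap matcher might look like:

```python
import re


def fuzzy_match_sketch(a: str, b: str) -> bool:
    """Illustrative fuzzy matcher: lowercase, strip punctuation, then
    require enough word overlap, with longer strings held to a stricter
    threshold (per the 0.35/0.55 range described above)."""
    words_a = set(re.sub(r"[^\w\s]", " ", a.lower()).split())
    words_b = set(re.sub(r"[^\w\s]", " ", b.lower()).split())
    if not words_a or not words_b:
        return a.strip().lower() == b.strip().lower()
    overlap = len(words_a & words_b) / len(words_a | words_b)
    threshold = 0.35 if len(words_a | words_b) <= 6 else 0.55
    return overlap >= threshold
```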

Text Similarity

Text responses are compared using text_similarity(), which combines:

  1. Cosine similarity on word frequency vectors (the base score)
  2. Entity bonus (+0.20) for matching order IDs, prices, dates, tracking numbers
  3. Concept bonus (+0.10) for matching domain concepts (refund, shipped, inventory, etc.)

The resulting score maps to match status:

  • > 0.70 → "exact"
  • > 0.40 → "similar"
  • ≤ 0.40 → "divergent"

How Comparison Works

Tool Call Matching

EvalRunner uses compare_args_structural() for tool call comparison. This does a literal key-by-key comparison (not fuzzy matching). Arguments are bucketed into matching, different, missing, and extra.

The initial match_status from the SDK is:

  • "exact" — all args match literally
  • "partial" — tool name matches, but some args differ
  • "mismatch" — tool not called (NOT_CALLED) or different tool name

All further scoring happens server-side. After you deploy a run, the backend's LLM-based grader re-evaluates tool arguments with semantic understanding (e.g. "2026-04-01" vs "April 1st, 2026" can be judged as equivalent).

Text Response Matching

EvalRunner submits all text responses with match_status="pending". No local text comparison is performed. The backend's LLM grader evaluates factual accuracy, completeness, and tone, then assigns the final status ("exact", "similar", "mismatch").


Using the Comparators Standalone

All comparison functions are importable and usable independently of EvalRunner:

from ashr_labs import (
    strip_markdown,
    tokenize,
    fuzzy_str_match,
    extract_tool_args,
    compare_tool_args,
    text_similarity,
)

# Strip formatting for cleaner comparison
clean = strip_markdown("**Your order** has *shipped*!")
# => "Your order has shipped!"

# Tokenize for analysis
tokens = tokenize("Order ORD-123 shipped on 2026-03-01.")
# => ["order", "ord123", "shipped", "on", "20260301"]

# Check if two strings are semantically close
fuzzy_str_match("Customer wants a refund", "customer wants refund")
# => True

# Extract args from either dict or JSON format
args = extract_tool_args({"arguments_json": '{"order_id": "ORD-123"}'})
# => {"order_id": "ORD-123"}

# Compare two tool calls
status, notes = compare_tool_args(
    {"arguments": {"order_id": "ORD-123"}},
    {"arguments": {"order_id": "ORD-123", "extra": "field"}},
)
# => ("exact", None) — extra actual args don't cause divergence

# Compute text similarity
score = text_similarity(
    "Your order ORD-123 has shipped and is on the way",
    "Order ORD-123 has been shipped and is in transit",
)
# => 0.78

Understanding the Dataset Structure

A dataset contains multiple scenarios (called "runs"). Each scenario has an ordered list of actions — the back-and-forth conversation between user and agent.

Top-Level Structure

dataset = client.get_dataset(dataset_id=42)

dataset["id"] # 42
dataset["name"] # "ShopWave Support Eval"
dataset["dataset_source"] # The actual test data

dataset_source

source = dataset["dataset_source"]

source["dataset_type"] # "multi_run_storyboard"
source["total_runs"] # Number of scenarios
source["runs"] # dict of { scenario_id: scenario }

Scenario

scenario = source["runs"]["billing_inquiry"]

scenario["run_id"] # "billing_inquiry"
scenario["title"] # "Customer Billing Question"
scenario["description"] # "Frustrated customer calls about their bill..."
scenario["intent"] # "Customer asking about their bill"
scenario["intent_tags"] # ["frustrated customer", "billing issue"]
scenario["actions"] # Ordered list of conversation turns

Actions

Each action is one turn in the conversation:

action = scenario["actions"][0]

action["name"] # "Customer greets agent"
action["actor"] # "user" or "agent"
action["action_type"] # "text", "audio", "file", "image", "video", "json"
action["content"] # The text content

Agent Actions — Expected Behavior

When actor == "agent", the action describes what the agent should do:

agent_action = scenario["actions"][3]

# The text the agent should say (approximately)
agent_action["content"] # "Your order has been shipped..."

# The expected tool calls and text
expected = agent_action["expected_response"]
expected["tool_calls"] # [{"name": "lookup_order", "arguments_json": "..."}]
expected["text"] # Optional expected text response
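A small utility built on this schema — the helper name is ours, not part of the SDK — can audit which tools a scenario expects:

```python
def expected_tool_names(scenario: dict) -> list[str]:
    """List every expected tool name in a scenario, in order.

    Walks the agent actions and reads expected_response["tool_calls"]
    exactly as laid out above; handy for checking dataset coverage.
    """
    names = []
    for action in scenario.get("actions", []):
        if action.get("actor") != "agent":
            continue
        expected = action.get("expected_response") or {}
        for call in expected.get("tool_calls", []):
            names.append(call["name"])
    return names
```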

Complete Real-World Example

This is the full eval runner for our ShopWave support agent — the same one we use internally. It generates a dataset, runs the eval with progress logging, and submits results.

#!/usr/bin/env python3
"""ShopWave Agent — Ashr Labs Eval Runner"""

import os
from ashr_labs import AshrLabsClient, EvalRunner
from agent import SupportAgent  # Your agent module

client = AshrLabsClient(api_key=os.environ["ASHR_LABS_API_KEY"])
agent = SupportAgent(api_key=os.environ["ANTHROPIC_API_KEY"])

# Verify credentials
session = client.init()
print(f"Logged in as: {session['user']['email']}")

# Generate a dataset (or use an existing one)
dataset_id, source = client.generate_dataset(
    request_name="ShopWave Support Agent Eval",
    config={
        "metadata": {"dataset_name": "ShopWave Support Eval"},
        "agent": {
            "name": "ShopWave Support Agent",
            "description": "Customer support with order lookup, inventory, refunds",
            "system_prompt": "You are a helpful support agent for ShopWave.",
            "tools": [
                {"name": "lookup_order", "description": "Look up order status",
                 "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}},
                {"name": "check_inventory", "description": "Check product availability",
                 "parameters": {"type": "object", "properties": {"product_name": {"type": "string"}}, "required": ["product_name"]}},
                {"name": "process_refund", "description": "Process a refund",
                 "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["order_id", "reason"]}},
            ],
            "accepted_inputs": {"text": True, "audio": False, "file": False, "image": False, "video": False},
            "output_format": {"type": "text"},
        },
        "context": {
            "domain": "ecommerce",
            "use_case": "Customers contacting support",
            "scenario_context": "An online retail store called ShopWave",
        },
        "test_config": {"num_variations": 25, "coverage": {"happy_path": True, "edge_cases": True, "multi_turn": True}},
        "generation_options": {"generate_audio": False, "generate_files": False, "generate_simulations": False},
    },
)

runs = source.get("runs", {})
total_actions = sum(len(s.get("actions", [])) for s in runs.values())
print(f"Dataset #{dataset_id}: {len(runs)} scenarios, {total_actions} actions")

# Run the eval with progress callbacks
runner = EvalRunner(source)

def on_scenario(sid, scenario):
    title = scenario.get("title", sid)
    n = len(scenario.get("actions", []))
    print(f"\n── {title} ({n} actions) ──")

def on_action(idx, action):
    actor = action.get("actor", "?")
    content = action.get("content", "")[:70]
    print(f" [{idx}] {actor}: {content}")

run = runner.run(agent, on_scenario=on_scenario, on_action=on_action)

# Submit and wait for server-side grading
created = run.deploy(client, dataset_id=dataset_id)
print(f"\nRun #{created['id']} submitted — waiting for grading...")

graded = client.poll_run(
    created["id"],
    on_poll=lambda elapsed, r: print(f" Grading... ({elapsed}s)"),
)

m = graded["result"]["aggregate_metrics"]
print(f"\nResults:")
print(f" Tests: {m['tests_passed']}/{m['total_tests']} passed")
print(f" Tool diverg.: {m.get('total_tool_call_divergence', 0)}")
print(f" Text diverg.: {m.get('total_response_divergence', 0)}")

Advanced: Manual RunBuilder

If EvalRunner doesn't fit your workflow (custom eval loops, non-standard agent interfaces, file-based inputs), you can use RunBuilder directly. This is the lower-level API that EvalRunner uses internally.

See the RunBuilder section of the API Reference for full documentation.

from ashr_labs import AshrLabsClient, RunBuilder

client = AshrLabsClient(api_key="tp_your_key_here")
dataset = client.get_dataset(dataset_id=42, include_signed_urls=True)
source = dataset["dataset_source"]

run = RunBuilder()
run.start()

for run_id, scenario in source["runs"].items():
    test = run.add_test(run_id)
    test.start()

    for i, action in enumerate(scenario["actions"]):
        if action["actor"] == "user":
            test.add_user_text(
                text=action["content"],
                description=action.get("name", f"action_{i}"),
                action_index=i,
            )
            # Call your agent here...

        elif action["actor"] == "agent":
            # Compare expected vs actual manually...
            test.add_tool_call(
                expected=expected_tool,
                actual=actual_tool,
                match_status="exact",  # or "partial" / "mismatch"
                action_index=i,
            )
            test.add_agent_response(
                expected_response={"text": action["content"]},
                actual_response={"text": actual_text},
                match_status="similar",
                semantic_similarity=0.85,
                action_index=i,
            )

    test.complete()

run.complete()
run.deploy(client, dataset_id=42)

Match Statuses

For tool calls (add_tool_call):

  • "exact" — tool name and arguments match
  • "partial" — tool name matches but arguments differ
  • "mismatch" — wrong tool or not called at all

For text responses (add_agent_response):

  • "exact" — semantically identical
  • "similar" — same meaning, different wording
  • "divergent" — substantially different

Automatic Metrics

RunBuilder.build() computes basic execution metrics locally. Full grading metrics (tests_passed, tests_failed, similarity scores, divergence counts) are computed server-side after deploy() — use client.poll_run() to wait for them.

# Local metrics (available immediately)
result = run.build()
print(result["aggregate_metrics"])
# {
# "total_tests": 25,
# "tests_completed": 25,
# "tests_errored": 0,
# }

# Server-side graded metrics (after deploy + poll)
created = run.deploy(client, dataset_id=42)
graded = client.poll_run(created["id"])
print(graded["result"]["aggregate_metrics"])
# {
# "total_tests": 25,
# "tests_passed": 23,
# "tests_failed": 2,
# "total_tool_call_divergence": 5,
# "total_response_divergence": 8,
# ...
# }

Debugging Failures

When tests fail, the default output shows expected vs actual tool calls and a similarity score — but not why the agent behaved that way. This section covers two techniques for faster debugging: conversation transcripts and failure classification.

Conversation Transcripts

The agent's full conversation history (every user message, assistant response, tool call, and tool result) is available via agent.conversation — but EvalRunner discards it after each scenario. To capture it, wrap your agent to snapshot the conversation before each reset():

class TranscriptCapture:
    """Wraps an agent to capture per-scenario conversation transcripts."""

    def __init__(self, agent):
        self._agent = agent
        self.transcripts = {}  # scenario_id -> conversation snapshot
        self._current_scenario = None

    def reset(self, scenario_id=None, **kwargs):
        # Snapshot previous conversation before reset clears it
        if self._current_scenario and hasattr(self._agent, "conversation"):
            self.transcripts[self._current_scenario] = list(self._agent.conversation)
        self._current_scenario = scenario_id
        return self._agent.reset(**kwargs)

    def respond(self, message, scenario_id=None, **kwargs):
        if scenario_id:
            self._current_scenario = scenario_id
        return self._agent.respond(message, **kwargs)

    def finalize(self):
        """Call after runner.run() to capture the last scenario."""
        if self._current_scenario and hasattr(self._agent, "conversation"):
            self.transcripts[self._current_scenario] = list(self._agent.conversation)

Use it like this:

agent = MyAgent()
capture = TranscriptCapture(agent)

run = runner.run(
    capture,  # Pass the wrapper, not the raw agent
    max_workers=1,
    on_environment=lambda content, action: agent.respond(content),
)
capture.finalize()

# After grading, print transcripts for failed scenarios
for test in graded["result"]["tests"]:
    if test["status"] == "failed":
        tid = test["test_id"]
        transcript = capture.transcripts.get(tid, [])
        print(f"\n--- {tid} ---")
        for msg in transcript:
            role = msg.get("role", "?")
            content = msg.get("content", "")
            if isinstance(content, str):
                print(f"[{role}] {content[:200]}")
            elif isinstance(content, list):
                for block in content:
                    if hasattr(block, "type"):  # Anthropic SDK objects
                        if block.type == "text":
                            print(f"  [text] {block.text[:200]}")
                        elif block.type == "tool_use":
                            print(f"  [tool_use] {block.name}({block.input})")
                    elif isinstance(block, dict):
                        if block.get("type") == "tool_result":
                            print(f"  [tool_result] {str(block.get('content', ''))[:200]}")

Example output for a failed Tokyo hotel booking scenario:

--- in_cheerful_tokyo_hotel_ddmm_confusion ---
[user] Hi — planning a leisure trip to Tokyo and need a hotel in Shinjuku...
[text] I'd be happy to help! However, I notice your dates might be reversed...
[user] Sorry, typo — check-in 10 December, check-out 19 December.
[text] Perfect! Let me search for hotels in Shinjuku, Tokyo.
[tool_use] search_hotels({"location": "Shinjuku, Tokyo", "check_in": "2026-12-10", ...})
[tool_result] {"hotels": [{"hotel_id": "HTL-301", ...}]}
[text] I found 2 hotels in Shinjuku for December 10-19...
[user] I'll take the first one. Book under Zoë Martín-López. Go ahead.
[text] I'll book right away! <-- BUG: should confirm name first
[tool_use] book_hotel({"hotel_id": "H-98432", "guest_name": "Zoë Martín-López", ...})

The transcript immediately shows the agent booked without confirming the diacritics in the guest name — something impossible to diagnose from just expected=book_hotel(...) actual=NOT_CALLED({}).

Failure Classification: WRONG vs DRIFT

Not all failures are equal. A mismatch where the agent called the wrong tool is fundamentally different from a partial where it called the right tool with slightly different argument formatting. Classifying failures helps you decide whether to fix your prompt (fundamental) or accept stochastic variance (drift).

def classify_tool_failure(tc):
    """Classify a tool call failure.

    Returns:
        "WRONG" — agent called the wrong tool, didn't call it, or called one unexpectedly
        "DRIFT" — agent called the right tool with slightly different arguments
    """
    exp = tc.get("expected", {})
    act = tc.get("actual", {})
    exp_name = exp.get("name", "")
    act_name = act.get("name", "")

    # Tool not called at all, or unexpected extra call
    if act_name == "NOT_CALLED" or exp_name == "NONE_EXPECTED":
        return "WRONG"

    # Different tool names
    if exp_name != act_name:
        return "WRONG"

    # Same tool — check if required args are missing
    arg_comp = tc.get("argument_comparison", {})
    if arg_comp and arg_comp.get("missing"):
        return "WRONG"

    # Same tool, args present but different values
    return "DRIFT"

For text responses, use semantic_similarity:

def classify_text_failure(ar):
    sim = ar.get("semantic_similarity", 0)
    if sim and sim >= 0.75:
        return "DRIFT"  # Same meaning, different wording
    return "WRONG"  # Substantially different response

Then summarize each failed test:

for test in graded["result"]["tests"]:
    if test["status"] != "failed":
        continue

    wrong = 0
    drift = 0
    for ar in test.get("action_results", []):
        if ar.get("action_type") == "tool_call":
            for tc in ar.get("tool_calls", []):
                if tc.get("match_status") in ("mismatch", "partial"):
                    cat = classify_tool_failure(tc)
                    if cat == "WRONG":
                        wrong += 1
                    else:
                        drift += 1

    verdict = "FUNDAMENTAL" if wrong > drift else "STOCHASTIC"
    print(f" [{verdict}] {test['test_id']} ({wrong} wrong, {drift} drift)")

Example output:

============================================================
RESULTS: 4 passed / 1 failed (5 total)
============================================================
[PASS] brit_formal_itinerary_lookup_home
[PASS] us_confident_multileg_idl_complex
[FAIL] in_cheerful_tokyo_hotel_ddmm_confusion
  [WRONG] mismatch: expected=NONE_EXPECTED({}) actual=book_hotel({...})
  [WRONG] mismatch: expected=book_hotel({...}) actual=NOT_CALLED({})
  --> FUNDAMENTAL failure (2 wrong, 0 drift)
[PASS] au_frustrated_post_booking_limits_office
[PASS] us_friendly_ambiguous_portland_weekend_audio

FUNDAMENTAL failures (mostly WRONG) mean the agent's logic is broken — fix your prompt or tool handling. STOCHASTIC failures (mostly DRIFT) mean the agent did roughly the right thing but with slight argument variations — these may pass on the next run without any changes.

Putting It Together

A complete eval script with both features:

from ashr_labs import AshrLabsClient, EvalRunner
from my_agent import MyAgent

client = AshrLabsClient(api_key="tp_...")
runner = EvalRunner.from_dataset(client, dataset_id=405)

agent = MyAgent()
capture = TranscriptCapture(agent)

run = runner.run(capture, max_workers=1,
                 on_environment=lambda c, a: agent.respond(c))
capture.finalize()

created = run.deploy(client, dataset_id=405)
graded = client.poll_run(created["id"])

# Print results with classification
for test in graded["result"]["tests"]:
    status = "PASS" if test["status"] == "completed" else "FAIL"
    print(f"[{status}] {test['test_id']}")

    if test["status"] == "failed":
        # Classify and print failures
        for ar in test.get("action_results", []):
            if ar.get("action_type") == "tool_call":
                for tc in ar.get("tool_calls", []):
                    ms = tc.get("match_status", "")
                    if ms in ("mismatch", "partial"):
                        cat = classify_tool_failure(tc)
                        exp = tc["expected"]["name"]
                        act = tc["actual"]["name"]
                        print(f"  [{cat}] {ms}: expected={exp} actual={act}")

        # Print conversation transcript
        transcript = capture.transcripts.get(test["test_id"], [])
        if transcript:
            print(f"\n  Conversation:")
            for msg in transcript:
                # ... format and print (see above)
                pass

CI/CD Integration

# ci_eval.py
import os
import sys
from ashr_labs import AshrLabsClient, EvalRunner

def main():
    client = AshrLabsClient.from_env()
    dataset_id = int(os.environ["ASHR_LABS_DATASET_ID"])

    agent = YourAgent()  # Your agent initialization
    runner = EvalRunner.from_dataset(client, dataset_id=dataset_id)
    run = runner.run(agent)

    # Submit and wait for server-side grading
    created = run.deploy(client, dataset_id=dataset_id)
    graded = client.poll_run(created["id"], timeout=300)

    metrics = graded["result"]["aggregate_metrics"]
    print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")

    # Fail CI if tests fail
    if metrics.get("tests_failed", 0) > 0:
        print(f"FAIL: {metrics['tests_failed']} tests failed")
        sys.exit(1)

if __name__ == "__main__":
    main()
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install ashr-labs anthropic
      - run: python ci_eval.py
        env:
          ASHR_LABS_API_KEY: ${{ secrets.ASHR_LABS_API_KEY }}
          ASHR_LABS_DATASET_ID: "322"
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Environment Variables

Variable               Required               Description
ASHR_LABS_API_KEY      Yes (for from_env())   Your API key (starts with tp_)
ASHR_LABS_BASE_URL     No                     Override API URL (defaults to production)
ASHR_LABS_DATASET_ID   No                     Dataset ID for CI scripts

Next Steps

  • API Reference — full documentation for EvalRunner, Agent, comparators, RunBuilder, and client methods
  • Error Handling — retry strategies and exception types
  • Examples — more usage patterns