Testing Your Agent

This guide walks through the complete workflow for evaluating an AI agent against an Ashr Labs dataset. It covers everything from wrapping your agent in the SDK interface, to running the eval, to submitting results.

Overview

The eval workflow has three stages:

  1. Get a dataset — fetch an existing one or generate a new one
  2. Run the eval — EvalRunner iterates scenarios, calls your agent, and compares results
  3. Submit results — deploy the run to the Ashr Labs dashboard
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Get Dataset  │ ──> │  EvalRunner  │ ──> │  Deploy Run  │
│              │     │  .run(agent) │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

The 3-Line Version

If you already have a dataset and an agent, here's the entire eval:

import { AshrLabsClient, EvalRunner } from "ashr-labs";

const client = new AshrLabsClient("tp_your_key_here");
const runner = await EvalRunner.fromDataset(client, 322);
await runner.runAndDeploy(myAgent, client, 322);

The rest of this guide explains what's happening under the hood and how to customize every step.


Step 1: Wrap Your Agent

EvalRunner works with any object that implements the Agent interface — just respond() and reset() methods. No base class to extend, no SDK dependency in your agent code.

The Agent Interface

interface Agent {
  respond(
    message: string,
    scenarioId?: string,
  ): Record<string, unknown> | Promise<Record<string, unknown>>;

  reset(scenarioId?: string): void | Promise<void>;
}

Both methods can be synchronous or async. The optional scenarioId parameter is passed during parallel execution so a single agent instance can maintain separate conversation states per scenario.

The respond() method should return:

{
  text: string;       // The agent's text response
  tool_calls: [       // All tool calls made during this turn
    {
      name: string;
      arguments: Record<string, unknown>;  // Tool arguments as an object
    },
    // ...
  ];
}

How Tool Calls Are Logged

The agent is responsible for collecting its own tool calls during the respond() call and returning them in the response object. The SDK does not intercept or instrument tool execution — it only consumes whatever the agent reports.

During a single respond() call, your agent may:

  1. Call the LLM
  2. Get back tool use requests
  3. Execute tools and feed results back to the LLM
  4. Repeat steps 2-3 multiple times (tool loops)
  5. Finally get a text response

Throughout this loop, accumulate every tool call into a list and return it alongside the final text.

Full Example: Customer Support Agent

Here's a complete tool-calling agent built on the Anthropic SDK. This is the agent we use in our own evals — it handles order lookups, inventory checks, and refund processing.

import Anthropic from "@anthropic-ai/sdk";
import type { Agent } from "ashr-labs";

const SYSTEM_PROMPT = `You are a helpful customer support agent for ShopWave.
You help customers check order status, look up product availability,
and process refunds. Always be polite and concise.`;

const TOOLS: Anthropic.Tool[] = [
  {
    name: "lookup_order",
    description: "Look up the status and details of a customer order.",
    input_schema: {
      type: "object" as const,
      properties: {
        order_id: {
          type: "string",
          description: "The order ID (e.g. ORD-12345)",
        },
      },
      required: ["order_id"],
    },
  },
  {
    name: "check_inventory",
    description: "Check availability of a product.",
    input_schema: {
      type: "object" as const,
      properties: {
        product_name: {
          type: "string",
          description: "The product name or SKU",
        },
      },
      required: ["product_name"],
    },
  },
  {
    name: "process_refund",
    description: "Initiate a refund for an order.",
    input_schema: {
      type: "object" as const,
      properties: {
        order_id: { type: "string", description: "The order ID" },
        reason: { type: "string", description: "Reason for refund" },
      },
      required: ["order_id", "reason"],
    },
  },
];

function executeTool(name: string, args: Record<string, unknown>): string {
  // Your tool implementations. Replace with real logic.
  if (name === "lookup_order") {
    return JSON.stringify({ order_id: args.order_id, status: "shipped" });
  } else if (name === "check_inventory") {
    return JSON.stringify({ product: args.product_name, in_stock: true });
  } else if (name === "process_refund") {
    return JSON.stringify({ refund_id: "REF-001", status: "processed" });
  }
  return JSON.stringify({ error: `Unknown tool: ${name}` });
}

class SupportAgent implements Agent {
  private client: Anthropic;
  private conversation: Anthropic.MessageParam[] = [];

  constructor(apiKey: string) {
    this.client = new Anthropic({ apiKey });
  }

  async reset(): Promise<void> {
    this.conversation = [];
  }

  async respond(userMessage: string): Promise<Record<string, unknown>> {
    this.conversation.push({ role: "user", content: userMessage });

    const allToolCalls: { name: string; arguments: Record<string, unknown> }[] = [];

    for (let i = 0; i < 10; i++) {
      const response = await this.client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        system: SYSTEM_PROMPT,
        tools: TOOLS,
        messages: this.conversation,
      });

      // Separate text and tool use blocks
      const textParts: string[] = [];
      const toolUses: Anthropic.ToolUseBlock[] = [];
      for (const block of response.content) {
        if (block.type === "text") {
          textParts.push(block.text);
        } else if (block.type === "tool_use") {
          toolUses.push(block);
        }
      }

      // No tool calls — we're done
      if (toolUses.length === 0) {
        this.conversation.push({ role: "assistant", content: response.content });
        return { text: textParts.join("\n"), tool_calls: allToolCalls };
      }

      // Execute tools and continue the loop
      this.conversation.push({ role: "assistant", content: response.content });
      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const toolUse of toolUses) {
        const resultStr = executeTool(
          toolUse.name,
          toolUse.input as Record<string, unknown>,
        );

        // Record every tool call with its name and arguments
        allToolCalls.push({
          name: toolUse.name,
          arguments: toolUse.input as Record<string, unknown>,
        });

        toolResults.push({
          type: "tool_result",
          tool_use_id: toolUse.id,
          content: resultStr,
        });
      }

      this.conversation.push({ role: "user", content: toolResults });

      if (response.stop_reason === "end_turn") {
        return { text: textParts.join("\n"), tool_calls: allToolCalls };
      }
    }

    return { text: "[Max iterations reached]", tool_calls: allToolCalls };
  }
}

Minimal Agent (No Tools)

If your agent doesn't use tools, the wrapper is simpler:

import type { Agent } from "ashr-labs";

class SimpleAgent implements Agent {
  private client: any;
  private history: { role: string; content: string }[] = [];

  constructor(llmClient: any) {
    this.client = llmClient;
  }

  async reset(): Promise<void> {
    this.history = [];
  }

  async respond(message: string): Promise<Record<string, unknown>> {
    this.history.push({ role: "user", content: message });
    const response = await this.client.chat({ messages: this.history });
    this.history.push({ role: "assistant", content: response.text });
    return { text: response.text, tool_calls: [] };
  }
}

arguments vs arguments_json — Important Serialization Note

The Agent interface's respond() method returns tool call arguments as an object:

{ name: "lookup_order", arguments: { order_id: "ORD-123" } }

But internally, RunBuilder and the API store them as a JSON string under arguments_json:

{ name: "lookup_order", arguments_json: '{"order_id": "ORD-123"}' }

If you use EvalRunner, this is handled automatically — it serializes arguments to arguments_json when recording results.

If you use RunBuilder directly (the manual flow), you need to pass arguments_json as a JSON string, not arguments as an object:

// Correct — RunBuilder expects arguments_json (string)
test.addToolCall(
  { name: "lookup_order", arguments_json: JSON.stringify({ order_id: "ORD-123" }) },
  { name: "lookup_order", arguments_json: JSON.stringify({ order_id: "ORD-123" }) },
  "exact",
);

// Also works — the comparators handle both formats via extractToolArgs()
test.addToolCall(
  { name: "lookup_order", arguments: { order_id: "ORD-123" } },
  { name: "lookup_order", arguments_json: JSON.stringify({ order_id: "ORD-123" }) },
  "exact",
);

The extractToolArgs() helper normalizes both formats, so comparators work regardless. But the data stored in the run result will use whichever format you pass to addToolCall().


Step 2: Get a Dataset

Option A: Fetch an Existing Dataset

import { AshrLabsClient } from "ashr-labs";

const client = new AshrLabsClient("tp_your_key_here");

// Fetch by ID
const dataset = await client.getDataset(322);
const source = dataset.dataset_source as Record<string, unknown>;

// Quick summary
const runs = (source.runs ?? {}) as Record<string, Record<string, unknown>>;
const totalActions = Object.values(runs).reduce(
  (sum, s) => sum + ((s.actions as unknown[]) ?? []).length,
  0,
);
console.log(`Dataset #${dataset.id}: ${Object.keys(runs).length} scenarios, ${totalActions} actions`);

Option B: Generate a New Dataset

Use generateDataset() — it creates the request, polls until complete, and fetches the result in one call:

const [datasetId, source] = await client.generateDataset(
  "ShopWave Support Eval",
  {
    metadata: {
      dataset_name: "ShopWave Support Eval",
      description: "Customer support scenarios with tool calling",
    },
    agent: {
      name: "ShopWave Support Agent",
      description: "Helps customers with orders, inventory, and refunds",
      system_prompt: "You are a helpful support agent for ShopWave.",
      tools: [
        {
          name: "lookup_order",
          description: "Look up order status",
          parameters: {
            type: "object",
            properties: { order_id: { type: "string" } },
            required: ["order_id"],
          },
        },
        // ... more tools
      ],
      accepted_inputs: { text: true, audio: false, file: false, image: false, video: false },
      output_format: { type: "text" },
    },
    context: {
      domain: "ecommerce",
      use_case: "Customers contacting support about orders and refunds",
      scenario_context: "An online retail store called ShopWave",
    },
    test_config: {
      num_variations: 25,
      variation_strategy: "balanced",
      coverage: {
        happy_path: true,
        edge_cases: true,
        error_handling: true,
        multi_turn: true,
      },
    },
    generation_options: {
      generate_audio: false,
      generate_files: false,
      generate_simulations: false,
    },
  },
  undefined,
  600,
);

console.log(`Generated dataset #${datasetId}`);

If you need more control over the polling (e.g. to show progress), use the lower-level methods:

const req = await client.createRequest("My Eval", config);
const completed = await client.waitForRequest(req.id as number, 600, 5);
// Then fetch the dataset manually via client.listDatasets() / client.getDataset()

Step 3: Run the Eval

Basic Run

import { EvalRunner } from "ashr-labs";

const runner = new EvalRunner(source); // source = dataset.dataset_source
const run = await runner.run(agent);

That's it. EvalRunner.run() handles the full eval loop:

  1. Iterates every scenario in source.runs
  2. Resets the agent at the start of each scenario
  3. For each actor === "user" action: calls agent.respond(content)
  4. For each actor === "agent" action: compares expected tool calls and text against the agent's actual response
  5. Returns a populated RunBuilder with all results recorded
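
The steps above can be sketched in a few lines. This is illustrative only — not the SDK's actual source; the `compare` callback stands in for the runner's tool and text comparison step:

```typescript
// Illustrative sketch of the EvalRunner.run() loop described above.
interface SketchAgent {
  respond(message: string, scenarioId?: string): Promise<Record<string, unknown>>;
  reset(scenarioId?: string): Promise<void>;
}

async function evalLoopSketch(
  runs: Record<string, { actions: { actor: string; content: string }[] }>,
  agent: SketchAgent,
  compare: (expectedAction: unknown, actualResponse: unknown) => void,
): Promise<void> {
  for (const [scenarioId, scenario] of Object.entries(runs)) {
    await agent.reset(scenarioId);                    // fresh state per scenario
    let lastResponse: Record<string, unknown> | null = null;
    for (const action of scenario.actions) {
      if (action.actor === "user") {
        lastResponse = await agent.respond(action.content, scenarioId);
      } else {
        compare(action, lastResponse);                // agent action = expected behavior
      }
    }
  }
}
```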

Or Use fromDataset to Skip the Fetch

const runner = await EvalRunner.fromDataset(client, 322);
const run = await runner.run(agent);

Inspecting Results Before Submitting

const result = run.build();
const metrics = result.aggregate_metrics as Record<string, unknown>;

console.log(`Total tests: ${metrics.total_tests}`);
console.log(`Passed: ${metrics.tests_passed}`);
console.log(`Avg similarity: ${metrics.average_similarity_score}`);
console.log(`Tool divergences: ${metrics.total_tool_call_divergence}`);
console.log(`Text divergences: ${metrics.total_response_divergence}`);

// Only submit if metrics look good
const avgSimilarity = metrics.average_similarity_score as number | null;
if (avgSimilarity && avgSimilarity > 0.5) {
  await run.deploy(client, 322);
}

Running Scenarios in Parallel

By default, scenarios run sequentially. Pass maxWorkers to run multiple scenarios concurrently using Promise.all batches:

// Run up to 4 scenarios at a time
const run = await runner.run(agent, { maxWorkers: 4 });

This can significantly speed up evals when your agent spends most of its time waiting on LLM API calls. Actions within each scenario still run sequentially (since they depend on each other), but independent scenarios run in parallel.

// Also works with runAndDeploy
const created = await runner.runAndDeploy(agent, client, 322, { maxWorkers: 4 });
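
The batching strategy can be sketched as chunked Promise.all calls (an illustration of the approach, not the SDK's actual implementation):

```typescript
// Illustrative: process items in chunks of maxWorkers, awaiting each
// chunk before starting the next — the batching approach described above.
async function runInBatches<T, R>(
  items: T[],
  maxWorkers: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += maxWorkers) {
    const batch = items.slice(i, i + maxWorkers).map(fn);  // start up to maxWorkers at once
    results.push(...(await Promise.all(batch)));           // wait for the whole batch
  }
  return results;
}
```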

Note: Unlike the Python SDK which deep-copies the agent for each worker, the TypeScript SDK passes a scenarioId to respond(message, scenarioId) and reset(scenarioId). Your agent must key its conversation state on this ID when running in parallel. Most agents that store conversation in a Map<string, Message[]> keyed by scenario ID work out of the box. If a scenario raises an exception during parallel execution, it's recorded as a failed test and the remaining scenarios continue.
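
A minimal parallel-safe agent can key its conversation state like this (a sketch shaped to match the Agent interface; the echo response stands in for a real LLM call):

```typescript
// Sketch of scenario-keyed conversation state for parallel runs.
// Each scenario gets its own history, looked up by scenarioId.
class ParallelSafeAgent {
  private histories = new Map<string, { role: string; content: string }[]>();

  async reset(scenarioId = "default"): Promise<void> {
    this.histories.set(scenarioId, []);
  }

  async respond(message: string, scenarioId = "default"): Promise<Record<string, unknown>> {
    const history = this.histories.get(scenarioId) ?? [];
    this.histories.set(scenarioId, history);
    history.push({ role: "user", content: message });
    const text = `echo: ${message}`;  // replace with a real LLM call
    history.push({ role: "assistant", content: text });
    return { text, tool_calls: [] };
  }
}
```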

Alternatively, you can pass a factory function instead of an agent instance:

// Factory — creates a fresh agent for each run
const run = await runner.run(() => new SupportAgent(apiKey), { maxWorkers: 4 });

Submitting in One Call

const runner = await EvalRunner.fromDataset(client, 322);
const created = await runner.runAndDeploy(agent, client, 322);
console.log(`Run #${created.id} submitted`);

Step 4: Add Progress Callbacks

EvalRunner.run() accepts optional callbacks to monitor progress:

const run = await runner.run(agent, {
  onScenario: (scenarioId, scenario) => {
    const title = scenario.title ?? scenarioId;
    const actions = (scenario.actions ?? []) as unknown[];
    console.log(`\n── Scenario: ${title} (${actions.length} actions) ──`);
  },
  onAction: (index, action) => {
    const actor = action.actor ?? "?";
    const content = (action.content as string) ?? "";
    const preview = content.length > 80 ? content.slice(0, 80) + "..." : content;
    console.log(`  [${index}] ${actor}: ${preview}`);
  },
});

Output looks like:

── Scenario: Customer asks about delayed order (4 actions) ──
[0] user: Hi, I placed an order last week (ORD-54321) and it still hasn't arrive...
[1] agent: Let me look up your order right away.
[2] user: Can you also check if the wireless headphones are back in stock?
[3] agent: I've checked both — here's what I found.

How Tool Matching Works

Understanding how EvalRunner compares expected vs actual tool calls is important for interpreting your results.

The Tool Pool

When agent.respond() is called on a user action, the returned tool_calls list becomes the tool pool for that turn. As the runner encounters expected tool calls in subsequent agent actions, it matches them by name and removes matched tools from the pool.

This means:

  • Tool calls persist across multiple agent actions within a single user turn
  • Each expected tool can only match one actual tool (first match wins)
  • Unmatched expected tools are recorded as "mismatch" with "NOT_CALLED"

User says: "Refund ORD-123 — it arrived damaged"

Agent responds with tool_calls: [lookup_order, process_refund]
                                      ↓ tool pool

Agent action 1 expects: lookup_order   → ✓ matched, removed from pool
Agent action 2 expects: process_refund → ✓ matched, removed from pool
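
First-match-wins pool semantics can be illustrated with a small helper (a sketch of the behavior described above, not the SDK's code):

```typescript
// Illustrative: each expected tool consumes the first actual call with
// the same name; unmatched names are later recorded as NOT_CALLED.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function takeFromPool(pool: ToolCall[], expectedName: string): ToolCall | null {
  const i = pool.findIndex((t) => t.name === expectedName);  // first match wins
  if (i === -1) return null;                                 // nothing to match
  return pool.splice(i, 1)[0];                               // matched tools leave the pool
}
```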

Tool Argument Comparison

For matched tools, arguments are compared using compareToolArgs():

  • "exact" — all expected arguments match (string args compared fuzzily)
  • "partial" — at least one argument matches, but not all
  • "mismatch" — no arguments match

String arguments use fuzzy matching: lowercased, punctuation stripped, word-overlap with adaptive thresholds (0.35 for short strings, up to 0.55 for longer ones). This means "Customer wants a refund" and "customer wants refund" are considered matching.

Text Similarity

Text responses are compared using textSimilarity(), which combines:

  1. Cosine similarity on word frequency vectors (the base score)
  2. Entity bonus (+0.20) for matching order IDs, prices, dates, tracking numbers
  3. Concept bonus (+0.10) for matching domain concepts (refund, shipped, inventory, etc.)

The resulting score maps to match status:

  • > 0.70 → "exact"
  • > 0.40 → "similar"
  • ≤ 0.40 → "divergent"
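
The mapping can be sketched as a small function (the thresholds are the documented defaults; the SDK's internal code may differ):

```typescript
// Map a similarity score to a match status using the documented defaults.
function matchStatus(
  score: number,
  thresholds = { exact: 0.70, similar: 0.40 },
): "exact" | "similar" | "divergent" {
  if (score > thresholds.exact) return "exact";
  if (score > thresholds.similar) return "similar";
  return "divergent";
}
```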

Customizing Comparison Logic

Custom Tool Comparator

Override how tool arguments are compared:

import { extractToolArgs, EvalRunner } from "ashr-labs";

function strictToolCompare(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): [string, string | null] {
  const expArgs = extractToolArgs(expected);
  const actArgs = extractToolArgs(actual);

  if (JSON.stringify(expArgs) === JSON.stringify(actArgs)) {
    return ["exact", null];
  }
  return ["mismatch", `Expected ${JSON.stringify(expArgs)}, got ${JSON.stringify(actArgs)}`];
}

const runner = new EvalRunner(source, { toolComparator: strictToolCompare });

Custom Text Comparator

Override text similarity scoring:

function embeddingSimilarity(textA: string, textB: string): number {
  // Use your own embedding model for comparison
  const vecA = myEmbeddingModel.encode(textA);
  const vecB = myEmbeddingModel.encode(textB);
  return cosineSimilarity(vecA, vecB);
}

const runner = new EvalRunner(source, { textComparator: embeddingSimilarity });

Custom Similarity Thresholds

Adjust when scores map to "exact" vs "similar" vs "divergent":

const runner = new EvalRunner(source, {
  similarityThresholds: {
    exact: 0.85,   // default: 0.70
    similar: 0.50, // default: 0.40
  },
});

Using the Comparators Standalone

All comparison functions are importable and usable independently of EvalRunner:

import {
  stripMarkdown,
  tokenize,
  fuzzyStrMatch,
  extractToolArgs,
  compareToolArgs,
  textSimilarity,
} from "ashr-labs";

// Strip formatting for cleaner comparison
const clean = stripMarkdown("**Your order** has *shipped*!");
// => "Your order has shipped!"

// Tokenize for analysis
const tokens = tokenize("Order ORD-123 shipped on 2026-03-01.");
// => ["order", "ord123", "shipped", "on", "20260301"]

// Check if two strings are semantically close
fuzzyStrMatch("Customer wants a refund", "customer wants refund");
// => true

// Extract args from either object or JSON format
const args = extractToolArgs({ arguments_json: '{"order_id": "ORD-123"}' });
// => { order_id: "ORD-123" }

// Compare two tool calls
const [status, notes] = compareToolArgs(
  { arguments: { order_id: "ORD-123" } },
  { arguments: { order_id: "ORD-123", extra: "field" } },
);
// => ["exact", null] — extra actual args don't cause divergence

// Compute text similarity
const score = textSimilarity(
  "Your order ORD-123 has shipped and is on the way",
  "Order ORD-123 has been shipped and is in transit",
);
// => 0.78

Understanding the Dataset Structure

A dataset contains multiple scenarios (called "runs"). Each scenario has an ordered list of actions — the back-and-forth conversation between user and agent.

Top-Level Structure

const dataset = await client.getDataset(42);

dataset.id; // 42
dataset.name; // "ShopWave Support Eval"
dataset.dataset_source; // The actual test data

dataset_source

const source = dataset.dataset_source as Record<string, unknown>;

source.dataset_type; // "multi_run_storyboard"
source.total_runs; // Number of scenarios
source.runs; // { [scenarioId]: scenario }

Scenario

const runs = source.runs as Record<string, Record<string, unknown>>;
const scenario = runs["billing_inquiry"];

scenario.run_id; // "billing_inquiry"
scenario.title; // "Customer Billing Question"
scenario.description; // "Frustrated customer calls about their bill..."
scenario.intent; // "Customer asking about their bill"
scenario.intent_tags; // ["frustrated customer", "billing issue"]
scenario.actions; // Ordered list of conversation turns

Actions

Each action is one turn in the conversation:

const actions = scenario.actions as Record<string, unknown>[];
const action = actions[0];

action.name; // "Customer greets agent"
action.actor; // "user" or "agent"
action.action_type; // "text", "audio", "file", "image", "video", "json"
action.content; // The text content

Agent Actions — Expected Behavior

When actor === "agent", the action describes what the agent should do:

const agentAction = actions[3];

// The text the agent should say (approximately)
agentAction.content; // "Your order has been shipped..."

// The expected tool calls and text
const expected = agentAction.expected_response as Record<string, unknown>;
expected.tool_calls; // [{ name: "lookup_order", arguments_json: "..." }]
expected.text; // Optional expected text response

Complete Real-World Example

This is the full eval runner for our ShopWave support agent — the same one we use internally. It generates a dataset, runs the eval with progress logging, and submits results.

#!/usr/bin/env npx tsx
/**
* ShopWave Agent — Ashr Labs Eval Runner
*/

import { AshrLabsClient, EvalRunner } from "ashr-labs";
import { SupportAgent } from "./agent.js"; // Your agent module

const client = new AshrLabsClient(process.env.ASHR_LABS_API_KEY!);
const agent = new SupportAgent(process.env.ANTHROPIC_API_KEY!);

// Verify credentials
const session = await client.init();
const user = session.user as Record<string, unknown>;
console.log(`Logged in as: ${user.email}`);

// Generate a dataset (or use an existing one)
const [datasetId, source] = await client.generateDataset(
  "ShopWave Support Agent Eval",
  {
    metadata: { dataset_name: "ShopWave Support Eval" },
    agent: {
      name: "ShopWave Support Agent",
      description: "Customer support with order lookup, inventory, refunds",
      system_prompt: "You are a helpful support agent for ShopWave.",
      tools: [
        {
          name: "lookup_order",
          description: "Look up order status",
          parameters: { type: "object", properties: { order_id: { type: "string" } }, required: ["order_id"] },
        },
        {
          name: "check_inventory",
          description: "Check product availability",
          parameters: { type: "object", properties: { product_name: { type: "string" } }, required: ["product_name"] },
        },
        {
          name: "process_refund",
          description: "Process a refund",
          parameters: { type: "object", properties: { order_id: { type: "string" }, reason: { type: "string" } }, required: ["order_id", "reason"] },
        },
      ],
      accepted_inputs: { text: true, audio: false, file: false, image: false, video: false },
      output_format: { type: "text" },
    },
    context: {
      domain: "ecommerce",
      use_case: "Customers contacting support",
      scenario_context: "An online retail store called ShopWave",
    },
    test_config: { num_variations: 25, coverage: { happy_path: true, edge_cases: true, multi_turn: true } },
    generation_options: { generate_audio: false, generate_files: false, generate_simulations: false },
  },
);

const runs = (source.runs ?? {}) as Record<string, Record<string, unknown>>;
const totalActions = Object.values(runs).reduce(
  (sum, s) => sum + ((s.actions as unknown[]) ?? []).length,
  0,
);
console.log(`Dataset #${datasetId}: ${Object.keys(runs).length} scenarios, ${totalActions} actions`);

// Run the eval with progress callbacks
const runner = new EvalRunner(source);

const run = await runner.run(agent, {
  onScenario: (sid, scenario) => {
    const title = scenario.title ?? sid;
    const actions = (scenario.actions ?? []) as unknown[];
    console.log(`\n── ${title} (${actions.length} actions) ──`);
  },
  onAction: (idx, action) => {
    const actor = action.actor ?? "?";
    const content = ((action.content as string) ?? "").slice(0, 70);
    console.log(`  [${idx}] ${actor}: ${content}`);
  },
  maxWorkers: 4,
});

// Preview metrics
const result = run.build();
const m = result.aggregate_metrics as Record<string, unknown>;
console.log(`\nResults:`);
console.log(` Tests: ${m.tests_passed}/${m.total_tests} passed`);
console.log(` Avg similarity: ${m.average_similarity_score}`);
console.log(` Tool diverg.: ${m.total_tool_call_divergence}`);
console.log(` Text diverg.: ${m.total_response_divergence}`);

// Submit
const created = await run.deploy(client, datasetId);
console.log(`\nRun #${created.id} submitted!`);

Advanced: Manual RunBuilder

If EvalRunner doesn't fit your workflow (custom eval loops, non-standard agent interfaces, file-based inputs), you can use RunBuilder directly. This is the lower-level API that EvalRunner uses internally.

See the RunBuilder section of the API Reference for full documentation.

import { AshrLabsClient, RunBuilder } from "ashr-labs";

const client = new AshrLabsClient("tp_your_key_here");
const dataset = await client.getDataset(42, true);
const source = dataset.dataset_source as Record<string, unknown>;

const run = new RunBuilder();
run.start();

const runs = (source.runs ?? {}) as Record<string, Record<string, unknown>>;
for (const [runId, scenario] of Object.entries(runs)) {
  const test = run.addTest(runId);
  test.start();

  const actions = (scenario.actions ?? []) as Record<string, unknown>[];
  for (let i = 0; i < actions.length; i++) {
    const action = actions[i];

    if (action.actor === "user") {
      test.addUserText(
        action.content as string,
        (action.name as string) ?? `action_${i}`,
        i,
      );
      // Call your agent here...
    } else if (action.actor === "agent") {
      // Compare expected vs actual manually...
      test.addToolCall(
        expectedTool,
        actualTool,
        "exact", // or "partial" / "mismatch"
        undefined,
        i,
      );
      test.addAgentResponse(
        { text: action.content },
        { text: actualText },
        "similar",
        0.85,
        undefined,
        i,
      );
    }
  }

  test.complete();
}

run.complete();
await run.deploy(client, 42);

Match Statuses

For tool calls (addToolCall):

  • "exact" — tool name and arguments match
  • "partial" — tool name matches but arguments differ
  • "mismatch" — wrong tool or not called at all

For text responses (addAgentResponse):

  • "exact" — semantically identical
  • "similar" — same meaning, different wording
  • "divergent" — substantially different

Automatic Metrics

RunBuilder.build() computes aggregate metrics automatically:

const result = run.build();
console.log(result.aggregate_metrics);
// {
//   total_tests: 25,
//   tests_passed: 23,
//   tests_failed: 2,
//   average_similarity_score: 0.72,
//   total_tool_call_divergence: 5,
//   total_response_divergence: 8,
// }

CI/CD Integration

// ci_eval.ts
import { AshrLabsClient, EvalRunner } from "ashr-labs";

async function main() {
  const client = AshrLabsClient.fromEnv();
  const datasetId = parseInt(process.env.ASHR_LABS_DATASET_ID!);

  const agent = new YourAgent(); // Your agent initialization
  const runner = await EvalRunner.fromDataset(client, datasetId);
  const run = await runner.run(agent, { maxWorkers: 4 });

  const result = run.build();
  const metrics = result.aggregate_metrics as Record<string, unknown>;
  console.log(`Passed: ${metrics.tests_passed}/${metrics.total_tests}`);
  console.log(`Avg similarity: ${metrics.average_similarity_score}`);

  await run.deploy(client, datasetId);

  // Fail CI if quality drops
  const avgSimilarity = metrics.average_similarity_score as number | null;
  if (avgSimilarity && avgSimilarity < 0.5) {
    console.log("FAIL: Similarity score below threshold");
    process.exit(1);
  }
  const testsFailed = metrics.tests_failed as number;
  if (testsFailed > 0) {
    console.log(`FAIL: ${testsFailed} tests failed`);
    process.exit(1);
  }
}

main();

# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20" }
      - run: npm ci
      - run: npx tsx ci_eval.ts
        env:
          ASHR_LABS_API_KEY: ${{ secrets.ASHR_LABS_API_KEY }}
          ASHR_LABS_DATASET_ID: "322"
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Environment Variables

Variable               Required               Description
ASHR_LABS_API_KEY      Yes (for fromEnv())    Your API key (starts with tp_)
ASHR_LABS_BASE_URL     No                     Override API URL (defaults to production)
ASHR_LABS_DATASET_ID   No                     Dataset ID for CI scripts

Next Steps

  • API Reference — full documentation for EvalRunner, Agent, comparators, RunBuilder, and client methods
  • Error Handling — retry strategies and exception types
  • Examples — more usage patterns