Testing Your Agent
This guide walks through the complete workflow for evaluating an AI agent against an Ashr Labs dataset. It covers everything from wrapping your agent in the SDK interface, to running the eval, to submitting results.
Overview
The eval workflow has three stages:
- Get a dataset — fetch an existing one or generate a new one
- Run the eval — EvalRunner iterates scenarios, calls your agent, compares results
- Submit results — deploy the run to the Ashr Labs dashboard
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Get Dataset │ ──> │ EvalRunner │ ──> │ Deploy Run │
│ │ │ .run(agent) │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
The 3-Line Version
If you already have a dataset and an agent, here's the entire eval:
import { AshrLabsClient, EvalRunner } from "ashr-labs";
const client = new AshrLabsClient("tp_your_key_here");
const runner = await EvalRunner.fromDataset(client, 322);
await runner.runAndDeploy(myAgent, client, 322);
The rest of this guide explains what's happening under the hood and how to customize every step.
Step 1: Wrap Your Agent
EvalRunner works with any object that implements the Agent interface — just respond() and reset() methods. No base class to extend, no SDK dependency in your agent code.
The Agent Interface
interface Agent {
respond(
message: string,
scenarioId?: string,
): Record<string, unknown> | Promise<Record<string, unknown>>;
reset(scenarioId?: string): void | Promise<void>;
}
Both methods can be synchronous or async. The optional scenarioId parameter is passed during parallel execution so a single agent instance can maintain separate conversation states per scenario.
The respond() method should return:
{
text: string; // The agent's text response
tool_calls: [ // All tool calls made during this turn
{
name: string;
arguments: Record<string, unknown>; // Tool arguments as an object
},
// ...
];
}
How Tool Calls Are Logged
The agent is responsible for collecting its own tool calls during the respond() call and returning them in the response object. The SDK does not intercept or instrument tool execution — it only consumes whatever the agent reports.
During a single respond() call, your agent may:
1. Call the LLM
2. Get back tool use requests
3. Execute tools and feed results back to the LLM
4. Repeat steps 2–3 multiple times (tool loops)
5. Finally get a text response
Throughout this loop, accumulate every tool call into a list and return it alongside the final text.
Full Example: Customer Support Agent
Here's a complete tool-calling agent built on the Anthropic SDK. This is the agent we use in our own evals — it handles order lookups, inventory checks, and refund processing.
import Anthropic from "@anthropic-ai/sdk";
import type { Agent } from "ashr-labs";
const SYSTEM_PROMPT = `You are a helpful customer support agent for ShopWave.
You help customers check order status, look up product availability,
and process refunds. Always be polite and concise.`;
const TOOLS: Anthropic.Tool[] = [
{
name: "lookup_order",
description: "Look up the status and details of a customer order.",
input_schema: {
type: "object" as const,
properties: {
order_id: {
type: "string",
description: "The order ID (e.g. ORD-12345)",
},
},
required: ["order_id"],
},
},
{
name: "check_inventory",
description: "Check availability of a product.",
input_schema: {
type: "object" as const,
properties: {
product_name: {
type: "string",
description: "The product name or SKU",
},
},
required: ["product_name"],
},
},
{
name: "process_refund",
description: "Initiate a refund for an order.",
input_schema: {
type: "object" as const,
properties: {
order_id: { type: "string", description: "The order ID" },
reason: { type: "string", description: "Reason for refund" },
},
required: ["order_id", "reason"],
},
},
];
function executeTool(name: string, args: Record<string, unknown>): string {
// Your tool implementations. Replace with real logic.
if (name === "lookup_order") {
return JSON.stringify({ order_id: args.order_id, status: "shipped" });
} else if (name === "check_inventory") {
return JSON.stringify({ product: args.product_name, in_stock: true });
} else if (name === "process_refund") {
return JSON.stringify({ refund_id: "REF-001", status: "processed" });
}
return JSON.stringify({ error: `Unknown tool: ${name}` });
}
class SupportAgent implements Agent {
private client: Anthropic;
private conversation: Anthropic.MessageParam[] = [];
constructor(apiKey: string) {
this.client = new Anthropic({ apiKey });
}
async reset(): Promise<void> {
this.conversation = [];
}
async respond(userMessage: string): Promise<Record<string, unknown>> {
this.conversation.push({ role: "user", content: userMessage });
const allToolCalls: { name: string; arguments: Record<string, unknown> }[] = [];
for (let i = 0; i < 10; i++) {
const response = await this.client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: SYSTEM_PROMPT,
tools: TOOLS,
messages: this.conversation,
});
// Separate text and tool use blocks
const textParts: string[] = [];
const toolUses: Anthropic.ToolUseBlock[] = [];
for (const block of response.content) {
if (block.type === "text") {
textParts.push(block.text);
} else if (block.type === "tool_use") {
toolUses.push(block);
}
}
// No tool calls — we're done
if (toolUses.length === 0) {
this.conversation.push({ role: "assistant", content: response.content });
return { text: textParts.join("\n"), tool_calls: allToolCalls };
}
// Execute tools and continue the loop
this.conversation.push({ role: "assistant", content: response.content });
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const toolUse of toolUses) {
const resultStr = executeTool(
toolUse.name,
toolUse.input as Record<string, unknown>,
);
// Record every tool call with its name and arguments
allToolCalls.push({
name: toolUse.name,
arguments: toolUse.input as Record<string, unknown>,
});
toolResults.push({
type: "tool_result",
tool_use_id: toolUse.id,
content: resultStr,
});
}
this.conversation.push({ role: "user", content: toolResults });
if (response.stop_reason === "end_turn") {
return { text: textParts.join("\n"), tool_calls: allToolCalls };
}
}
return { text: "[Max iterations reached]", tool_calls: allToolCalls };
}
}
Minimal Agent (No Tools)
If your agent doesn't use tools, the wrapper is simpler:
import type { Agent } from "ashr-labs";
class SimpleAgent implements Agent {
private client: any;
private history: { role: string; content: string }[] = [];
constructor(llmClient: any) {
this.client = llmClient;
}
async reset(): Promise<void> {
this.history = [];
}
async respond(message: string): Promise<Record<string, unknown>> {
this.history.push({ role: "user", content: message });
const response = await this.client.chat({ messages: this.history });
this.history.push({ role: "assistant", content: response.text });
return { text: response.text, tool_calls: [] };
}
}
arguments vs arguments_json — Important Serialization Note
The Agent interface's respond() method returns tool call arguments as an object:
{ name: "lookup_order", arguments: { order_id: "ORD-123" } }
But internally, RunBuilder and the API store them as a JSON string under arguments_json:
{ name: "lookup_order", arguments_json: '{"order_id": "ORD-123"}' }
If you use EvalRunner, this is handled automatically — it serializes arguments to arguments_json when recording results.
If you use RunBuilder directly (the manual flow), you need to pass arguments_json as a JSON string, not arguments as an object:
// Correct — RunBuilder expects arguments_json (string)
test.addToolCall(
{ name: "lookup_order", arguments_json: JSON.stringify({ order_id: "ORD-123" }) },
{ name: "lookup_order", arguments_json: JSON.stringify({ order_id: "ORD-123" }) },
"exact",
);
// Also works — the comparators handle both formats via extractToolArgs()
test.addToolCall(
{ name: "lookup_order", arguments: { order_id: "ORD-123" } },
{ name: "lookup_order", arguments_json: JSON.stringify({ order_id: "ORD-123" }) },
"exact",
);
The extractToolArgs() helper normalizes both formats, so comparators work regardless. But the data stored in the run result will use whichever format you pass to addToolCall().
Step 2: Get a Dataset
Option A: Fetch an Existing Dataset
import { AshrLabsClient } from "ashr-labs";
const client = new AshrLabsClient("tp_your_key_here");
// Fetch by ID
const dataset = await client.getDataset(322);
const source = dataset.dataset_source as Record<string, unknown>;
// Quick summary
const runs = (source.runs ?? {}) as Record<string, Record<string, unknown>>;
const totalActions = Object.values(runs).reduce(
(sum, s) => sum + ((s.actions as unknown[]) ?? []).length,
0,
);
console.log(`Dataset #${dataset.id}: ${Object.keys(runs).length} scenarios, ${totalActions} actions`);
Option B: Generate a New Dataset
Use generateDataset() — it creates the request, polls until complete, and fetches the result in one call:
const [datasetId, source] = await client.generateDataset(
"ShopWave Support Eval",
{
metadata: {
dataset_name: "ShopWave Support Eval",
description: "Customer support scenarios with tool calling",
},
agent: {
name: "ShopWave Support Agent",
description: "Helps customers with orders, inventory, and refunds",
system_prompt: "You are a helpful support agent for ShopWave.",
tools: [
{
name: "lookup_order",
description: "Look up order status",
parameters: {
type: "object",
properties: { order_id: { type: "string" } },
required: ["order_id"],
},
},
// ... more tools
],
accepted_inputs: { text: true, audio: false, file: false, image: false, video: false },
output_format: { type: "text" },
},
context: {
domain: "ecommerce",
use_case: "Customers contacting support about orders and refunds",
scenario_context: "An online retail store called ShopWave",
},
test_config: {
num_variations: 25,
variation_strategy: "balanced",
coverage: {
happy_path: true,
edge_cases: true,
error_handling: true,
multi_turn: true,
},
},
generation_options: {
generate_audio: false,
generate_files: false,
generate_simulations: false,
},
},
undefined,
600,
);
console.log(`Generated dataset #${datasetId}`);
If you need more control over the polling (e.g. to show progress), use the lower-level methods:
const req = await client.createRequest("My Eval", config);
const completed = await client.waitForRequest(req.id as number, 600, 5);
// Then fetch the dataset manually via client.listDatasets() / client.getDataset()
Step 3: Run the Eval
Basic Run
import { EvalRunner } from "ashr-labs";
const runner = new EvalRunner(source); // source = dataset.dataset_source
const run = await runner.run(agent);
That's it. EvalRunner.run() handles the full eval loop:
- Iterates every scenario in source.runs
- Resets the agent at the start of each scenario
- For each actor === "user" action: calls agent.respond(content)
- For each actor === "agent" action: compares expected tool calls and text against the agent's actual response
- Returns a populated RunBuilder with all results recorded
Or Use fromDataset to Skip the Fetch
const runner = await EvalRunner.fromDataset(client, 322);
const run = await runner.run(agent);
Inspecting Results Before Submitting
const result = run.build();
const metrics = result.aggregate_metrics as Record<string, unknown>;
console.log(`Total tests: ${metrics.total_tests}`);
console.log(`Passed: ${metrics.tests_passed}`);
console.log(`Avg similarity: ${metrics.average_similarity_score}`);
console.log(`Tool divergences: ${metrics.total_tool_call_divergence}`);
console.log(`Text divergences: ${metrics.total_response_divergence}`);
// Only submit if metrics look good
const avgSimilarity = metrics.average_similarity_score as number | null;
if (avgSimilarity && avgSimilarity > 0.5) {
await run.deploy(client, 322);
}
Running Scenarios in Parallel
By default, scenarios run sequentially. Pass maxWorkers to run multiple scenarios concurrently using Promise.all batches:
// Run up to 4 scenarios at a time
const run = await runner.run(agent, { maxWorkers: 4 });
This can significantly speed up evals when your agent spends most of its time waiting on LLM API calls. Actions within each scenario still run sequentially (since they depend on each other), but independent scenarios run in parallel.
// Also works with runAndDeploy
const created = await runner.runAndDeploy(agent, client, 322, { maxWorkers: 4 });
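The batching strategy can be sketched roughly like this (an illustrative helper, not the SDK's actual implementation — `runInBatches` is a hypothetical name):

```typescript
// Sketch: run async tasks in batches of `maxWorkers` using Promise.all.
// Each batch completes fully before the next one starts.
async function runInBatches<T>(
  tasks: (() => Promise<T>)[],
  maxWorkers: number,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i += maxWorkers) {
    // Start up to maxWorkers tasks concurrently
    const batch = tasks.slice(i, i + maxWorkers).map((t) => t());
    results.push(...(await Promise.all(batch)));
  }
  return results;
}
```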
Note: Unlike the Python SDK which deep-copies the agent for each worker, the TypeScript SDK passes a scenarioId to respond(message, scenarioId) and reset(scenarioId). Your agent must key its conversation state on this ID when running in parallel. Most agents that store conversation in a Map<string, Message[]> keyed by scenario ID work out of the box. If a scenario raises an exception during parallel execution, it's recorded as a failed test and the remaining scenarios continue.
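A parallel-safe agent along those lines might look like this (a sketch only — `ParallelSafeAgent` is a hypothetical name, and `llmClient.chat()` is a placeholder for your own LLM call; in real code the class would implement the SDK's Agent interface):

```typescript
// Sketch: conversation state keyed by scenarioId, so one instance can
// serve several scenarios concurrently. `chat()` is a placeholder client.
class ParallelSafeAgent {
  private histories = new Map<string, { role: string; content: string }[]>();

  constructor(
    private client: { chat(args: { messages: unknown[] }): Promise<{ text: string }> },
  ) {}

  async reset(scenarioId?: string): Promise<void> {
    // Clear only this scenario's history
    this.histories.set(scenarioId ?? "default", []);
  }

  async respond(message: string, scenarioId?: string): Promise<Record<string, unknown>> {
    const key = scenarioId ?? "default";
    const history = this.histories.get(key) ?? [];
    history.push({ role: "user", content: message });
    const response = await this.client.chat({ messages: history });
    history.push({ role: "assistant", content: response.text });
    this.histories.set(key, history);
    return { text: response.text, tool_calls: [] };
  }
}
```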
Alternatively, you can pass a factory function instead of an agent instance:
// Factory — creates a fresh agent for each run
const run = await runner.run(() => new SupportAgent(apiKey), { maxWorkers: 4 });
Submitting in One Call
const runner = await EvalRunner.fromDataset(client, 322);
const created = await runner.runAndDeploy(agent, client, 322);
console.log(`Run #${created.id} submitted`);
Step 4: Add Progress Callbacks
EvalRunner.run() accepts optional callbacks to monitor progress:
const run = await runner.run(agent, {
onScenario: (scenarioId, scenario) => {
const title = scenario.title ?? scenarioId;
const actions = (scenario.actions ?? []) as unknown[];
console.log(`\n── Scenario: ${title} (${actions.length} actions) ──`);
},
onAction: (index, action) => {
const actor = action.actor ?? "?";
const content = (action.content as string) ?? "";
const preview = content.length > 80 ? content.slice(0, 80) + "..." : content;
console.log(` [${index}] ${actor}: ${preview}`);
},
});
Output looks like:
── Scenario: Customer asks about delayed order (4 actions) ──
[0] user: Hi, I placed an order last week (ORD-54321) and it still hasn't arrive...
[1] agent: Let me look up your order right away.
[2] user: Can you also check if the wireless headphones are back in stock?
[3] agent: I've checked both — here's what I found.
How Tool Matching Works
Understanding how EvalRunner compares expected vs actual tool calls is important for interpreting your results.
The Tool Pool
When agent.respond() is called on a user action, the returned tool_calls list becomes the tool pool for that turn. As the runner encounters expected tool calls in subsequent agent actions, it matches them by name and removes matched tools from the pool.
This means:
- Tool calls persist across multiple agent actions within a single user turn
- Each expected tool can only match one actual tool (first match wins)
- Unmatched expected tools are recorded as "mismatch" with "NOT_CALLED"
User says: "Refund ORD-123 — it arrived damaged"
Agent responds with tool_calls: [lookup_order, process_refund]
↓ tool pool
Agent action 1 expects: lookup_order → ✓ matched, removed from pool
Agent action 2 expects: process_refund → ✓ matched, removed from pool
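The pool mechanics can be sketched as follows (a simplified illustration of first-match-wins, not the SDK's actual implementation — `matchExpectedTools` is a hypothetical name):

```typescript
type ToolCall = { name: string; arguments?: Record<string, unknown> };

// Simplified sketch of pool matching: each expected tool consumes at most
// one actual tool from the pool, matched by name, first match wins.
function matchExpectedTools(
  pool: ToolCall[],
  expected: ToolCall[],
): { name: string; status: "matched" | "NOT_CALLED" }[] {
  const remaining = [...pool];
  return expected.map((exp) => {
    const idx = remaining.findIndex((t) => t.name === exp.name);
    if (idx === -1) return { name: exp.name, status: "NOT_CALLED" };
    remaining.splice(idx, 1); // matched tools are removed from the pool
    return { name: exp.name, status: "matched" };
  });
}
```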
Tool Argument Comparison
For matched tools, arguments are compared using compareToolArgs():
"exact"— all expected arguments match (string args compared fuzzily)"partial"— at least one argument matches, but not all"mismatch"— no arguments match
String arguments use fuzzy matching: lowercased, punctuation stripped, word-overlap with adaptive thresholds (0.35 for short strings, up to 0.55 for longer ones). This means "Customer wants a refund" and "customer wants refund" are considered matching.
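In rough pseudocode-as-TypeScript, the fuzzy comparison looks something like this (an illustrative sketch: the thresholds mirror the documented 0.35–0.55 range, but the exact length cutoff and interpolation are assumptions, and `fuzzyMatchSketch` is a hypothetical name):

```typescript
// Sketch of fuzzy string matching: lowercase, strip punctuation,
// then compare word overlap against a length-adaptive threshold.
function fuzzyMatchSketch(a: string, b: string): boolean {
  const words = (s: string) =>
    new Set(s.toLowerCase().replace(/[^\w\s]/g, "").split(/\s+/).filter(Boolean));
  const wa = words(a);
  const wb = words(b);
  if (wa.size === 0 || wb.size === 0) return a === b;
  const overlapCount = [...wa].filter((w) => wb.has(w)).length;
  const overlap = overlapCount / Math.max(wa.size, wb.size);
  // Shorter strings get a looser threshold; longer ones a stricter one (assumed cutoff)
  const threshold = Math.max(wa.size, wb.size) <= 4 ? 0.35 : 0.55;
  return overlap >= threshold;
}
```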
Text Similarity
Text responses are compared using textSimilarity(), which combines:
- Cosine similarity on word frequency vectors (the base score)
- Entity bonus (+0.20) for matching order IDs, prices, dates, tracking numbers
- Concept bonus (+0.10) for matching domain concepts (refund, shipped, inventory, etc.)
The resulting score maps to match status:
- > 0.70 → "exact"
- > 0.40 → "similar"
- ≤ 0.40 → "divergent"
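The default mapping can be expressed as a small helper (illustrative only, not an SDK export; `matchStatus` is a hypothetical name):

```typescript
// Map a similarity score to a match status using the default thresholds.
function matchStatus(score: number): "exact" | "similar" | "divergent" {
  if (score > 0.7) return "exact";
  if (score > 0.4) return "similar";
  return "divergent";
}
```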
Customizing Comparison Logic
Custom Tool Comparator
Override how tool arguments are compared:
import { extractToolArgs, EvalRunner } from "ashr-labs";
function strictToolCompare(
expected: Record<string, unknown>,
actual: Record<string, unknown>,
): [string, string | null] {
const expArgs = extractToolArgs(expected);
const actArgs = extractToolArgs(actual);
if (JSON.stringify(expArgs) === JSON.stringify(actArgs)) {
return ["exact", null];
}
return ["mismatch", `Expected ${JSON.stringify(expArgs)}, got ${JSON.stringify(actArgs)}`];
}
const runner = new EvalRunner(source, { toolComparator: strictToolCompare });
Custom Text Comparator
Override text similarity scoring:
function embeddingSimilarity(textA: string, textB: string): number {
// Use your own embedding model for comparison
const vecA = myEmbeddingModel.encode(textA);
const vecB = myEmbeddingModel.encode(textB);
return cosineSimilarity(vecA, vecB);
}
const runner = new EvalRunner(source, { textComparator: embeddingSimilarity });
Custom Similarity Thresholds
Adjust when scores map to "exact" vs "similar" vs "divergent":
const runner = new EvalRunner(source, {
similarityThresholds: {
exact: 0.85, // default: 0.70
similar: 0.50, // default: 0.40
},
});
Using the Comparators Standalone
All comparison functions are importable and usable independently of EvalRunner:
import {
stripMarkdown,
tokenize,
fuzzyStrMatch,
extractToolArgs,
compareToolArgs,
textSimilarity,
} from "ashr-labs";
// Strip formatting for cleaner comparison
const clean = stripMarkdown("**Your order** has *shipped*!");
// => "Your order has shipped!"
// Tokenize for analysis
const tokens = tokenize("Order ORD-123 shipped on 2026-03-01.");
// => ["order", "ord123", "shipped", "on", "20260301"]
// Check if two strings are semantically close
fuzzyStrMatch("Customer wants a refund", "customer wants refund");
// => true
// Extract args from either object or JSON format
const args = extractToolArgs({ arguments_json: '{"order_id": "ORD-123"}' });
// => { order_id: "ORD-123" }
// Compare two tool calls
const [status, notes] = compareToolArgs(
{ arguments: { order_id: "ORD-123" } },
{ arguments: { order_id: "ORD-123", extra: "field" } },
);
// => ["exact", null] — extra actual args don't cause divergence
// Compute text similarity
const score = textSimilarity(
"Your order ORD-123 has shipped and is on the way",
"Order ORD-123 has been shipped and is in transit",
);
// => 0.78
Understanding the Dataset Structure
A dataset contains multiple scenarios (called "runs"). Each scenario has an ordered list of actions — the back-and-forth conversation between user and agent.
Top-Level Structure
const dataset = await client.getDataset(42);
dataset.id; // 42
dataset.name; // "ShopWave Support Eval"
dataset.dataset_source; // The actual test data
dataset_source
const source = dataset.dataset_source as Record<string, unknown>;
source.dataset_type; // "multi_run_storyboard"
source.total_runs; // Number of scenarios
source.runs; // { [scenarioId]: scenario }
Scenario
const runs = source.runs as Record<string, Record<string, unknown>>;
const scenario = runs["billing_inquiry"];
scenario.run_id; // "billing_inquiry"
scenario.title; // "Customer Billing Question"
scenario.description; // "Frustrated customer calls about their bill..."
scenario.intent; // "Customer asking about their bill"
scenario.intent_tags; // ["frustrated customer", "billing issue"]
scenario.actions; // Ordered list of conversation turns
Actions
Each action is one turn in the conversation:
const actions = scenario.actions as Record<string, unknown>[];
const action = actions[0];
action.name; // "Customer greets agent"
action.actor; // "user" or "agent"
action.action_type; // "text", "audio", "file", "image", "video", "json"
action.content; // The text content
Agent Actions — Expected Behavior
When actor === "agent", the action describes what the agent should do:
const agentAction = actions[3];
// The text the agent should say (approximately)
agentAction.content; // "Your order has been shipped..."
// The expected tool calls and text
const expected = agentAction.expected_response as Record<string, unknown>;
expected.tool_calls; // [{ name: "lookup_order", arguments_json: "..." }]
expected.text; // Optional expected text response
Complete Real-World Example
This is the full eval runner for our ShopWave support agent — the same one we use internally. It generates a dataset, runs the eval with progress logging, and submits results.
#!/usr/bin/env npx tsx
/**
* ShopWave Agent — Ashr Labs Eval Runner
*/
import { AshrLabsClient, EvalRunner } from "ashr-labs";
import { SupportAgent } from "./agent.js"; // Your agent module
const client = new AshrLabsClient(process.env.ASHR_LABS_API_KEY!);
const agent = new SupportAgent(process.env.ANTHROPIC_API_KEY!);
// Verify credentials
const session = await client.init();
const user = session.user as Record<string, unknown>;
console.log(`Logged in as: ${user.email}`);
// Generate a dataset (or use an existing one)
const [datasetId, source] = await client.generateDataset(
"ShopWave Support Agent Eval",
{
metadata: { dataset_name: "ShopWave Support Eval" },
agent: {
name: "ShopWave Support Agent",
description: "Customer support with order lookup, inventory, refunds",
system_prompt: "You are a helpful support agent for ShopWave.",
tools: [
{ name: "lookup_order", description: "Look up order status",
parameters: { type: "object", properties: { order_id: { type: "string" } }, required: ["order_id"] } },
{ name: "check_inventory", description: "Check product availability",
parameters: { type: "object", properties: { product_name: { type: "string" } }, required: ["product_name"] } },
{ name: "process_refund", description: "Process a refund",
parameters: { type: "object", properties: { order_id: { type: "string" }, reason: { type: "string" } }, required: ["order_id", "reason"] } },
],
accepted_inputs: { text: true, audio: false, file: false, image: false, video: false },
output_format: { type: "text" },
},
context: {
domain: "ecommerce",
use_case: "Customers contacting support",
scenario_context: "An online retail store called ShopWave",
},
test_config: { num_variations: 25, coverage: { happy_path: true, edge_cases: true, multi_turn: true } },
generation_options: { generate_audio: false, generate_files: false, generate_simulations: false },
},
);
const runs = (source.runs ?? {}) as Record<string, Record<string, unknown>>;
const totalActions = Object.values(runs).reduce(
(sum, s) => sum + ((s.actions as unknown[]) ?? []).length,
0,
);
console.log(`Dataset #${datasetId}: ${Object.keys(runs).length} scenarios, ${totalActions} actions`);
// Run the eval with progress callbacks
const runner = new EvalRunner(source);
const run = await runner.run(agent, {
onScenario: (sid, scenario) => {
const title = scenario.title ?? sid;
const actions = (scenario.actions ?? []) as unknown[];
console.log(`\n── ${title} (${actions.length} actions) ──`);
},
onAction: (idx, action) => {
const actor = action.actor ?? "?";
const content = ((action.content as string) ?? "").slice(0, 70);
console.log(` [${idx}] ${actor}: ${content}`);
},
maxWorkers: 4,
});
// Preview metrics
const result = run.build();
const m = result.aggregate_metrics as Record<string, unknown>;
console.log(`\nResults:`);
console.log(` Tests: ${m.tests_passed}/${m.total_tests} passed`);
console.log(` Avg similarity: ${m.average_similarity_score}`);
console.log(` Tool diverg.: ${m.total_tool_call_divergence}`);
console.log(` Text diverg.: ${m.total_response_divergence}`);
// Submit
const created = await run.deploy(client, datasetId);
console.log(`\nRun #${created.id} submitted!`);
Advanced: Manual RunBuilder
If EvalRunner doesn't fit your workflow (custom eval loops, non-standard agent interfaces, file-based inputs), you can use RunBuilder directly. This is the lower-level API that EvalRunner uses internally.
See the RunBuilder section of the API Reference for full documentation.
import { AshrLabsClient, RunBuilder } from "ashr-labs";
const client = new AshrLabsClient("tp_your_key_here");
const dataset = await client.getDataset(42, true);
const source = dataset.dataset_source as Record<string, unknown>;
const run = new RunBuilder();
run.start();
const runs = (source.runs ?? {}) as Record<string, Record<string, unknown>>;
for (const [runId, scenario] of Object.entries(runs)) {
const test = run.addTest(runId);
test.start();
const actions = (scenario.actions ?? []) as Record<string, unknown>[];
for (let i = 0; i < actions.length; i++) {
const action = actions[i];
if (action.actor === "user") {
test.addUserText(
action.content as string,
(action.name as string) ?? `action_${i}`,
i,
);
// Call your agent here...
} else if (action.actor === "agent") {
// Compare expected vs actual manually...
test.addToolCall(
expectedTool,
actualTool,
"exact", // or "partial" / "mismatch"
undefined,
i,
);
test.addAgentResponse(
{ text: action.content },
{ text: actualText },
"similar",
0.85,
undefined,
i,
);
}
}
test.complete();
}
run.complete();
await run.deploy(client, 42);
Match Statuses
For tool calls (addToolCall):
"exact"— tool name and arguments match"partial"— tool name matches but arguments differ"mismatch"— wrong tool or not called at all
For text responses (addAgentResponse):
"exact"— semantically identical"similar"— same meaning, different wording"divergent"— substantially different
Automatic Metrics
RunBuilder.build() computes aggregate metrics automatically:
const result = run.build();
console.log(result.aggregate_metrics);
// {
// total_tests: 25,
// tests_passed: 23,
// tests_failed: 2,
// average_similarity_score: 0.72,
// total_tool_call_divergence: 5,
// total_response_divergence: 8,
// }
CI/CD Integration
// ci_eval.ts
import { AshrLabsClient, EvalRunner } from "ashr-labs";
async function main() {
const client = AshrLabsClient.fromEnv();
const datasetId = parseInt(process.env.ASHR_LABS_DATASET_ID!);
const agent = new YourAgent(); // Your agent initialization
const runner = await EvalRunner.fromDataset(client, datasetId);
const run = await runner.run(agent, { maxWorkers: 4 });
const result = run.build();
const metrics = result.aggregate_metrics as Record<string, unknown>;
console.log(`Passed: ${metrics.tests_passed}/${metrics.total_tests}`);
console.log(`Avg similarity: ${metrics.average_similarity_score}`);
await run.deploy(client, datasetId);
// Fail CI if quality drops
const avgSimilarity = metrics.average_similarity_score as number | null;
if (avgSimilarity !== null && avgSimilarity < 0.5) {
console.log("FAIL: Similarity score below threshold");
process.exit(1);
}
const testsFailed = metrics.tests_failed as number;
if (testsFailed > 0) {
console.log(`FAIL: ${testsFailed} tests failed`);
process.exit(1);
}
}
main();
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [push]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: "20" }
- run: npm ci
- run: npx tsx ci_eval.ts
env:
ASHR_LABS_API_KEY: ${{ secrets.ASHR_LABS_API_KEY }}
ASHR_LABS_DATASET_ID: "322"
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Environment Variables
| Variable | Required | Description |
|---|---|---|
| ASHR_LABS_API_KEY | Yes (for fromEnv()) | Your API key (starts with tp_) |
| ASHR_LABS_BASE_URL | No | Override API URL (defaults to production) |
| ASHR_LABS_DATASET_ID | No | Dataset ID for CI scripts |
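For CI scripts, it can help to fail fast when a required variable is missing before any API calls are made. A minimal sketch (the `missingEnvVars` helper is hypothetical, not part of the SDK):

```typescript
// Sketch: check for required environment variables before starting an eval.
// Variable names match the table above.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  const required = ["ASHR_LABS_API_KEY"];
  return required.filter((k) => !env[k]);
}

const missing = missingEnvVars(process.env);
if (missing.length > 0) {
  console.error(`Missing env vars: ${missing.join(", ")}`);
}
```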
Next Steps
- API Reference — full documentation for EvalRunner, Agent, comparators, RunBuilder, and client methods
- Error Handling — retry strategies and exception types
- Examples — more usage patterns