API Reference

Complete reference for all classes and methods in the Ashr Labs SDK.

The SDK serves two products:

  • Testing Platform — generate eval datasets, run your agent against them, submit graded results. Core methods: create_request, create_run, EvalRunner, RunBuilder.
  • Observability (separate product) — trace your agent's production behavior (LLM calls, tool invocations, latency, errors). Core methods: trace(), Span, Generation, list_observability_traces. Requires the observability feature flag.

These are independent products that share the same SDK and API key.

AshrLabsClient

The main client class for interacting with the Ashr Labs API.

Constructor

AshrLabsClient(
    api_key: str,
    base_url: str = "https://api.ashr.io/testing-platform-api",
    timeout: int = 30
)

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| api_key | str | Yes | - | Your API key (must start with tp_) |
| base_url | str | No | Production URL | Base URL of the API |
| timeout | int | No | 30 | Request timeout in seconds |

Raises:

  • ValueError: If the API key format is invalid

Example:

# Minimal — just pass your API key
client = AshrLabsClient(api_key="tp_your_key_here")

# Custom timeout
client = AshrLabsClient(api_key="tp_your_key_here", timeout=60)

from_env (class method)

Create a client from environment variables.

AshrLabsClient.from_env(timeout: int = 30) -> AshrLabsClient

Reads ASHR_LABS_API_KEY (required) and ASHR_LABS_BASE_URL (optional) from the environment.

Raises:

  • RuntimeError: If ASHR_LABS_API_KEY is not set

Example:

# export ASHR_LABS_API_KEY="tp_your_key_here"
client = AshrLabsClient.from_env()

Session Methods

init

Initialize a session and validate authentication.

init() -> Session

Returns: Session - Session information containing user and tenant data

Raises:

  • AuthenticationError: If the API key is invalid or expired

Example:

# Validate credentials and get user/tenant info
session = client.init()

print(f"User ID: {session['user']['id']}")
print(f"Email: {session['user']['email']}")
print(f"Tenant ID: {session['tenant']['id']}")
print(f"Tenant Name: {session['tenant']['tenant_name']}")

Dataset Methods

get_dataset

Retrieve a dataset by ID.

get_dataset(
    dataset_id: int,
    include_signed_urls: bool = False,
    url_expires_seconds: int = 3600
) -> Dataset

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | int | Yes | - | The ID of the dataset |
| include_signed_urls | bool | No | False | Include signed S3 URLs for media |
| url_expires_seconds | int | No | 3600 | URL expiration time in seconds |

Returns: Dataset - The dataset object

Raises:

  • NotFoundError: Dataset not found
  • AuthorizationError: No access to this dataset

Example:

dataset = client.get_dataset(
    dataset_id=42,
    include_signed_urls=True,
    url_expires_seconds=7200
)
print(dataset["name"])

list_datasets

List datasets for a tenant.

list_datasets(
    tenant_id: int | None = None,
    limit: int = 50,
    cursor: int | None = None,
    include_signed_urls: bool = False,
    url_expires_seconds: int = 3600
) -> dict

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
| limit | int | No | 50 | Maximum results to return |
| cursor | int | No | None | Pagination cursor (pass next_cursor from previous response) |
| include_signed_urls | bool | No | False | Include signed S3 URLs |
| url_expires_seconds | int | No | 3600 | URL expiration time |

Returns: dict with keys:

  • status: "ok"
  • datasets: List of dataset objects
  • next_cursor: ID for the next page, or null if no more results

Example:

# tenant_id auto-resolved from API key
response = client.list_datasets(limit=10)
for dataset in response["datasets"]:
    print(f"{dataset['id']}: {dataset['name']}")

# Pagination
if response.get("next_cursor"):
    next_page = client.list_datasets(limit=10, cursor=response["next_cursor"])

Run Methods

create_run

Create a new test run.

create_run(
    dataset_id: int,
    result: dict[str, Any],
    tenant_id: int | None = None,
    runner_id: int | None = None
) -> Run

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | int | Yes | - | The dataset ID |
| result | dict | Yes | - | Run results (metrics, status, etc.) |
| tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
| runner_id | int | No | None | ID of user who ran the test |

Returns: Run - The created run object

Example:

run = client.create_run(
    dataset_id=42,
    result={
        "status": "passed",
        "score": 0.95,
        "metrics": {
            "accuracy": 0.98,
            "latency_ms": 150
        }
    }
)

get_run

Retrieve a run by ID.

get_run(run_id: int) -> Run

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| run_id | int | Yes | The run ID |

Returns: Run - The run object

Raises:

  • NotFoundError: Run not found

Example:

run = client.get_run(run_id=99)
print(f"Score: {run['result']['score']}")

list_runs

List runs for a tenant or dataset.

list_runs(
    dataset_id: int | None = None,
    tenant_id: int | None = None,
    limit: int = 50
) -> dict

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset_id | int | No | None | Filter by dataset |
| tenant_id | int | No | auto | Filter by tenant (auto-resolved if omitted) |
| limit | int | No | 50 | Maximum results |

Returns: dict with keys:

  • status: "ok"
  • runs: List of run objects

Example:

# Get runs for a specific dataset
response = client.list_runs(dataset_id=42)
for run in response["runs"]:
    print(f"Run #{run['id']}: {run['result']['status']}")

delete_run

Delete a test run.

delete_run(run_id: int) -> dict

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| run_id | int | Yes | The run ID to delete |

Returns: dict - Confirmation of deletion

Raises:

  • NotFoundError: Run not found

Example:

client.delete_run(run_id=99)
print("Run deleted")

Observability — Production Agent Tracing

This is a separate product from the Testing Platform. The testing platform (datasets, eval runs, RunBuilder, EvalRunner) is for offline evaluation. Observability is for tracing your agent in production. They share the same SDK and API key but are independent features.

Trace your agent's production behavior — LLM calls, tool invocations, retrieval steps, guardrail checks, and more. Requires the observability feature flag to be enabled for your tenant.

Production-safe: tracing never raises exceptions or interferes with your agent. If the backend is unreachable, trace.end() returns an error dict instead of throwing.

client.trace

Start a new trace for a production agent interaction.

trace = client.trace(
    name: str,
    *,
    user_id: str | None = None,
    session_id: str | None = None,
    metadata: dict | None = None,
    tags: list[str] | None = None,
) -> Trace

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Name for this trace (e.g. "handle-ticket") |
| user_id | str | No | None | End-user ID for grouping |
| session_id | str | No | None | Conversation/session ID |
| metadata | dict | No | None | Arbitrary metadata |
| tags | list[str] | No | None | Tags for filtering |

Returns: A Trace instance. Supports context manager (with) usage.


Trace methods

| Method | Description |
|---|---|
| trace.span(name, *, input, metadata) | Create a top-level span |
| trace.generation(name, *, model, input, metadata) | Create a top-level generation (LLM call) |
| trace.event(name, *, input, metadata, level) | Record a point-in-time event |
| trace.end(*, output) | Flush the trace to the backend. Never raises. |
| trace.trace_id | Server-assigned trace ID (available after end()) |

Span methods

| Method | Description |
|---|---|
| span.span(name, *, input, metadata) | Create a child span |
| span.generation(name, *, model, input, metadata) | Create a child generation |
| span.event(name, *, input, metadata, level) | Record an event under this span |
| span.end(*, output, status_message, level) | Mark the span as complete |

Spans support context managers. If the body raises, the span auto-ends with level="ERROR" and the exception message is captured in status_message.

Generation methods

Inherits all Span methods, plus:

| Method | Description |
|---|---|
| gen.end(*, output, usage, status_message, level) | Mark complete with token usage |

The usage dict accepts {"input_tokens": int, "output_tokens": int}.


Context managers ensure spans are always ended, even if your code throws:

with client.trace("handle-ticket", user_id="user_42") as trace:
    with trace.generation("classify", model="claude-sonnet-4-6",
                          input=[{"role": "user", "content": "help"}]) as gen:
        result = call_llm(...)
        gen.end(output=result, usage={"input_tokens": 50, "output_tokens": 12})

    with trace.span("tool:search", input={"query": "..."}) as tool:
        data = search(...)
        tool.end(output=data)
        # If search() throws, the span auto-ends with level="ERROR"

# trace.end() is called automatically on exit

Manual instrumentation

trace = client.trace("support-chat", user_id="user_42", session_id="conv_abc")

gen = trace.generation("classify-intent", model="claude-sonnet-4-6",
                       input=[{"role": "user", "content": "Reset my password"}])
gen.end(output={"intent": "password_reset"},
        usage={"input_tokens": 50, "output_tokens": 12})

tool = trace.span("tool:reset_password", input={"user_id": "user_42"})
tool.end(output={"success": True})

trace.event("guardrail-check", input={"passed": True})

result = trace.end(output={"resolution": "password_reset_complete"})
print(trace.trace_id)  # server-assigned ID

list_observability_traces

List traces for the current tenant.

client.list_observability_traces(
    user_id: str | None = None,
    session_id: str | None = None,
    limit: int = 50,
    page: int = 1,
) -> dict

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| user_id | str | No | None | Filter by end-user |
| session_id | str | No | None | Filter by session |
| limit | int | No | 50 | Max results per page (max 100) |
| page | int | No | 1 | Page number |

Returns: {"status": "ok", "traces": [...], "total": int}
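Since results are paginated by limit/page, a small helper can walk all pages. This is an illustrative sketch, not an SDK method; list_page stands in for any callable with the same (limit=..., page=...) signature, such as client.list_observability_traces:

```python
def fetch_all_traces(list_page, limit=50, max_pages=100):
    """Collect traces across pages using the limit/page scheme above.

    `list_page` is any callable returning {"status", "traces", "total"}.
    """
    traces = []
    for page in range(1, max_pages + 1):
        resp = list_page(limit=limit, page=page)
        traces.extend(resp["traces"])
        # Stop once we've seen everything (or the page came back empty)
        if len(traces) >= resp["total"] or not resp["traces"]:
            break
    return traces
```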


get_observability_trace

Get a single trace with its full observation tree.

client.get_observability_trace(trace_id: str) -> dict

Returns: {"status": "ok", "trace": {...}} — the trace includes an observations list with id, name, type, parent_observation_id, input, output, metadata, model, usage, level, start_time, end_time.
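The observations list is flat, with parent links. A sketch (not an SDK method) of nesting it into a tree using the id and parent_observation_id fields described above:

```python
def build_observation_tree(observations):
    """Nest a flat observations list by parent_observation_id.

    Root observations have parent_observation_id == None.
    """
    by_parent = {}
    for obs in observations:
        by_parent.setdefault(obs.get("parent_observation_id"), []).append(obs)

    def attach(parent_id):
        # Copy each node and recursively attach its children
        return [dict(o, children=attach(o["id"])) for o in by_parent.get(parent_id, [])]

    return attach(None)
```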


get_observability_analytics

Get analytics overview for the current tenant.

client.get_observability_analytics(days: int = 7) -> dict

Returns: {"status": "ok", "overview": {...}, "tool_performance": [...], "model_usage": [...]}

Overview includes: total_traces, avg_latency_ms, p95_latency_ms, total_input_tokens, total_output_tokens, error_rate, total_tool_calls, unique_users, unique_sessions.
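Derived metrics like average token spend per trace follow directly from these overview fields. A hypothetical helper, assuming only the field names listed above:

```python
def tokens_per_trace(overview):
    """Average total tokens per trace from an analytics overview dict."""
    n = overview["total_traces"]
    if n == 0:
        return 0.0
    return (overview["total_input_tokens"] + overview["total_output_tokens"]) / n
```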


get_observability_errors / get_observability_tool_errors

client.get_observability_errors(days: int = 7, limit: int = 50, page: int = 1) -> dict
client.get_observability_tool_errors(days: int = 7, limit: int = 50, page: int = 1) -> dict

Returns: {"status": "ok", "traces": [...], "total": int} — traces with errors or tool failures, most recent first.


SDK Notes — Platform Advisories

SDK Notes are platform advisories delivered to your SDK from Ashr Labs. They communicate context changes, best practices, deprecations, or breaking changes that may affect how you configure or run your agent.

Notes are automatically fetched when the client initializes (via init()). You can also refresh them on demand.

client.notes (property)

Get cached SDK notes from the last init() or get_notes() call. No network request is made.

client.notes -> list[SdkNote]

Returns: List of active notes for your tenant.

Example:

client = AshrLabsClient(api_key="tp_...")

# Notes are auto-fetched on first use
for note in client.notes:
    print(f"[{note['severity']}] {note['title']}: {note['content']}")

get_notes

Fetch fresh SDK notes from the platform. Updates the cached client.notes.

get_notes(agent_id: int | None = None) -> list[SdkNote]

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| agent_id | int or None | No | None | Include notes targeted at this specific agent |

Returns: List of active notes (global + tenant-specific, plus agent-specific if agent_id is provided).

Example:

# Refresh notes
notes = client.get_notes()

# Filter by agent
notes = client.get_notes(agent_id=42)

# Check for breaking changes
breaking = [n for n in notes if n['category'] == 'breaking_change']
if breaking:
    print("⚠ Breaking changes detected:")
    for n in breaking:
        print(f"  {n['title']}: {n['content']}")

Note categories: info, warning, breaking_change, best_practice, deprecation

Severity levels: info, warning, critical


Request Methods

create_request

Create a dataset generation request.

create_request(
    request_name: str,
    request: dict[str, Any],
    request_input_schema: dict[str, Any] | None = None,
    tenant_id: int | None = None,
    requestor_id: int | None = None,
) -> Request

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| request_name | str | Yes | - | Name/title for the request |
| request | dict | Yes | - | The generation config (see below) |
| request_input_schema | dict | No | auto | JSON Schema for validating the request. A permissive default is sent if omitted. If your agent has tools, include them here under the "tools" key so they're auto-saved as skill templates. |
| tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
| requestor_id | int | No | auto | ID of requesting user (auto-resolved if omitted) |

Returns: Request - The created request object

Generation config structure (the request dict):

The config has two required sections (agent and context) and several optional sections.

metadata (optional)

| Field | Type | Description |
|---|---|---|
| dataset_name | str | Name for the generated dataset |
| description | str | Description of what this dataset tests |

agent (required)

At least one of name, description, or system_prompt is required.

| Field | Type | Default | Description |
|---|---|---|---|
| name | str | - | Agent name |
| description | str | - | What the agent does |
| system_prompt | str | - | System prompt given to the agent |
| tools | list[dict] | [] | Tools the agent can call (see below) |
| accepted_inputs | dict | text only | Input modalities (see below) |
| output_format | dict | {"type": "text"} | "text" or "structured" with optional schema |
| input_schema | dict | - | Custom structured input schema (see below) |

Tool definition:

{
    "name": "tool_name",            # snake_case tool name
    "description": "What it does",  # Used by test generator for realistic scenarios
    "parameters": {                 # JSON Schema for tool parameters
        "type": "object",
        "properties": {
            "arg_name": {"type": "string", "description": "What this arg is"},
        },
        "required": ["arg_name"],
    },
    "returns": {                    # (optional) Return value schema
        "type": "object",
        "description": "What the tool returns",
    },
}

Accepted inputs — values can be bool or {"enabled": bool}:

"accepted_inputs": {
    "text": True,           # (default True) Text input
    "audio": False,         # Audio: mp3, wav, m4a, ogg, webm
    "file": False,          # Files: pdf, txt, csv, json, xml, html, md, docx, xlsx
    "image": False,         # Images: jpg, png, gif, webp
    "video": False,         # Video: mp4, webm, mov, avi
    "conversation": False,  # Multi-participant conversations with inferred roles
}

Input schema — define structured data users provide to the agent:

"input_schema": {
    "name": "OrderInput",
    "description": "Data the customer provides",
    "fields": [
        {"name": "order_id", "type": "string", "description": "Order ID", "required": True},
        {"name": "priority", "type": "string", "description": "Priority level", "enum": ["low", "medium", "high"]},
    ],
    "example": {"order_id": "ORD-123", "priority": "high"},
}

context (required)

At least one of domain, use_case, or scenario_context is required.

| Field | Type | Default | Description |
|---|---|---|---|
| domain | str | - | Domain: "banking", "healthcare", "e-commerce", "legal", "education", "customer_service", "technology", "travel", "insurance", "other" |
| use_case | str | - | Specific use case description (min 10 chars recommended) |
| scenario_context | str | - | Additional scenario context |
| user_persona | dict | - | {"type": str, "description": str} — who interacts with the agent |
| sample_data | dict | - | {"examples": [{...}]} — example data for realistic tests |

test_config (optional)

| Field | Type | Default | Description |
|---|---|---|---|
| num_variations | int | 5 | Number of test scenarios (1-50) |
| strategy | str | "balanced" | "focused", "diverse", or "balanced" |
| coverage | dict | all True | {"happy_path": bool, "edge_cases": bool, "error_handling": bool, "boundary_values": bool} |
| complexity_distribution | dict | auto | {"simple": 0.3, "moderate": 0.5, "complex": 0.2} — must sum to ~1.0 |
| focus_areas | list[str] | [] | Specific areas to focus testing on |
| exclude | list[str] | [] | Scenarios or test types to exclude |
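Because complexity_distribution must sum to roughly 1.0, it is easy to validate client-side before submitting a request. An illustrative check (not an SDK method):

```python
def validate_complexity_distribution(dist, tol=0.01):
    """Check a complexity_distribution dict: known keys, weights summing to ~1.0."""
    allowed = {"simple", "moderate", "complex"}
    unknown = set(dist) - allowed
    if unknown:
        raise ValueError(f"unknown keys: {unknown}")
    total = sum(dist.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"weights sum to {total}, expected ~1.0")
    return True
```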

generation_options (required)

| Field | Type | Default | Description |
|---|---|---|---|
| generate_audio | bool | False | Generate audio test inputs |
| generate_files | bool | False | Generate file test inputs (PDF, CSV, etc.) |
| generate_images | bool | False | Generate image test inputs |
| generate_videos | bool | False | Generate video test inputs |
| generate_simulations | bool | False | Generate website session replay simulation videos |

Example:

req = client.create_request(
    request_name="Support Agent Eval",
    request={
        "metadata": {"dataset_name": "Support Eval"},
        "agent": {
            "name": "Support Bot",
            "description": "Answers customer questions about orders and refunds",
            "system_prompt": "You are a helpful support agent.",
            "tools": [
                {
                    "name": "lookup_order",
                    "description": "Look up an order by ID",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string", "description": "The order ID"}},
                        "required": ["order_id"],
                    },
                },
            ],
            "accepted_inputs": {"text": True, "audio": False, "file": False},
            "output_format": {"type": "text"},
        },
        "context": {
            "domain": "e-commerce",
            "use_case": "Customers asking about order status, requesting refunds",
            "scenario_context": "An online retail store called ShopWave",
            "user_persona": {"type": "customer", "description": "Online shoppers"},
        },
        "test_config": {
            "num_variations": 10,
            "strategy": "diverse",
            "coverage": {"happy_path": True, "edge_cases": True, "error_handling": True},
            "complexity_distribution": {"simple": 0.3, "moderate": 0.5, "complex": 0.2},
        },
        "expected_behaviors": {
            "must_include": ["order"],
            "expected_tools": ["lookup_order"],
        },
    },
)
# Use wait_for_request or generate_dataset instead of manual polling
completed = client.wait_for_request(req["id"], timeout=300)

get_request

Retrieve a request by ID.

get_request(request_id: int) -> Request

Parameters:

ParameterTypeRequiredDescription
request_idintYesThe request ID

Returns: Request - The request object

Raises:

  • NotFoundError: Request not found

Example:

req = client.get_request(request_id=123)
print(f"Status: {req['request_status']}")

list_requests

List requests for a tenant.

list_requests(
    tenant_id: int | None = None,
    status: str | None = None,
    limit: int = 50,
    cursor: int | None = None
) -> dict

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
| status | str | No | None | Filter by status |
| limit | int | No | 50 | Maximum results |
| cursor | int | No | None | Pagination cursor |

Returns: dict with keys:

  • status: "ok"
  • requests: List of request objects

Example:

# Get pending requests
response = client.list_requests(status="pending")
for req in response["requests"]:
    print(f"Request #{req['id']}: {req['request_name']}")

Agent Methods

Agents group datasets and define grading behavior. Each dataset can belong to one agent.

list_agents

List all agents for your tenant with dataset counts.

list_agents() -> list[Agent]

Returns: List of agent objects with id, name, description, config, dataset_count.

Example:

agents = client.list_agents()
for agent in agents:
    print(f"{agent['name']}: {agent['dataset_count']} datasets")

create_agent

Create a new agent.

create_agent(
    name: str,
    description: str | None = None,
    config: dict | None = None
) -> Agent

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Agent name (unique per tenant) |
| description | str | No | None | What this agent does |
| config | dict | No | None | Agent config (tool_definitions, behavior_rules, grading_config) |

Config structure:

{
    "tool_definitions": [
        {"name": "fetch_data", "required": True, "description": "Fetch live data"},
        {"name": "end_session", "required": False, "description": "End conversation"},
    ],
    "behavior_rules": [
        {"rule": "Always fetch before quoting data", "strictness": "required"},
        {"rule": "Save caller name", "strictness": "expected"},
    ],
    "grading_config": {
        "tool_strictness": {
            "fetch_data": "required",          # must be called — no recovery
            "end_session": "optional",         # text recovery OK
            "await_user_response": "optional",
        },
        "text_similarity_threshold": 0.3,      # lower for multilingual agents
    },
}

Grading strictness levels:

  • "required" — tool must be called. NOT_CALLED = failure.
  • "expected" — tool should be called. NOT_CALLED = warning, not failure.
  • "optional" — if the agent achieves the intent via text, the grader recovers it as a partial match.

Example:

agent = client.create_agent(
    name="Support Bot",
    description="Healthcare scheduling agent",
    config={
        "tool_definitions": [
            {"name": "fetch_kareo_data", "required": True},
            {"name": "end_session", "required": False},
        ],
        "grading_config": {
            "tool_strictness": {
                "fetch_kareo_data": "required",
                "end_session": "optional",
            },
        },
    },
)
print(f"Created agent: {agent['id']}")

update_agent

Update an agent's name, description, or config.

update_agent(
    agent_id: int,
    name: str | None = None,
    description: str | None = None,
    config: dict | None = None
) -> Agent

Note: config replaces the entire config object — merge locally before updating if you want to preserve existing fields.
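A shallow-merge sketch for that read-merge-write pattern (the helper name is illustrative, not an SDK method):

```python
def merged_config(existing, updates):
    """Shallow-merge config updates over an existing agent config.

    Note: nested sections (e.g. grading_config) are still replaced wholesale;
    deep-merge them yourself if you need finer control.
    """
    out = dict(existing or {})
    out.update(updates)
    return out
```

Pass the result to update_agent(agent_id, config=...) so untouched top-level keys survive the full-object replacement.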


delete_agent

Soft-delete an agent. Datasets are unlinked but not deleted.

delete_agent(agent_id: int) -> dict

get_agent_datasets

Get all datasets linked to an agent.

get_agent_datasets(agent_id: int) -> dict

Returns: Dict with agent (the agent object) and datasets (list of dataset objects).


set_dataset_agent

Assign or unassign an agent to a dataset.

set_dataset_agent(dataset_id: int, agent_id: int | None) -> dict

Pass agent_id=None to unlink a dataset from its agent.
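For example (IDs are illustrative):

```python
# Link dataset 42 to agent 7
client.set_dataset_agent(dataset_id=42, agent_id=7)

# Unlink the dataset from its agent
client.set_dataset_agent(dataset_id=42, agent_id=None)
```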


API Key Methods

list_api_keys

List API keys for your tenant.

list_api_keys(include_inactive: bool = False) -> list[APIKey]

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| include_inactive | bool | No | False | Include revoked keys |

Returns: list[APIKey] - List of API key objects

Note: For security, only the key prefix is returned, not the full key.

Example:

keys = client.list_api_keys()
for key in keys:
    print(f"{key['key_prefix']}... - {key['name']}")

revoke_api_key

Revoke an API key.

revoke_api_key(api_key_id: int) -> dict

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| api_key_id | int | Yes | The API key ID to revoke |

Returns: dict - Confirmation of revocation

Raises:

  • NotFoundError: API key not found

Example:

client.revoke_api_key(api_key_id=123)
print("API key revoked")

Convenience Methods

wait_for_request

Block until a request reaches a terminal state (completed or failed).

wait_for_request(
    request_id: int,
    timeout: int = 600,
    poll_interval: int = 5
) -> Request

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| request_id | int | Yes | - | The request ID to poll |
| timeout | int | No | 600 | Maximum seconds to wait |
| poll_interval | int | No | 5 | Seconds between polls |

Returns: Request - The final request object

Raises:

  • TimeoutError: If the request doesn't finish within timeout seconds
  • AshrLabsError: If the request fails

Example:

req = client.create_request(request_name="My Eval", request=config)
completed = client.wait_for_request(req["id"], timeout=300)
print(f"Status: {completed['request_status']}")

poll_run

Block until backend grading completes for a run. After deploy(), the backend grades tool arguments and text responses asynchronously (typically 1-3 minutes). This method polls get_run() until aggregate_metrics.tests_passed is populated.

poll_run(
    run_id: int,
    timeout: int = 300,
    poll_interval: int = 20,
    on_poll: Callable | None = None
) -> Run

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| run_id | int | Yes | - | The run ID to poll |
| timeout | int | No | 300 | Maximum seconds to wait |
| poll_interval | int | No | 20 | Seconds between polls |
| on_poll | Callable | No | None | Called after each poll: (elapsed_seconds, run_dict) |

Returns: Run - The fully graded run object

Raises:

  • TimeoutError: If grading doesn't finish within timeout seconds

Example:

created = run.deploy(client, dataset_id=322)
graded = client.poll_run(
    created["id"],
    on_poll=lambda elapsed, r: print(f"Grading... ({elapsed}s)"),
)
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")

generate_dataset

Create a dataset generation request, wait for completion, and fetch the result. Combines create_request + wait_for_request + get_dataset into one call.

Missing context fields (use_case, scenario_context) are auto-filled from the agent's name and description. A default test_config is added if not provided.

generate_dataset(
    request_name: str,
    config: dict[str, Any],
    request_input_schema: dict[str, Any] | None = None,
    timeout: int = 600,
    poll_interval: int = 5
) -> tuple[int, dict]

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| request_name | str | Yes | - | Name/title for the request |
| config | dict | Yes | - | The generation config — same structure as create_request's request parameter. See create_request for the full schema reference. Valid sections: metadata, agent, context, test_config, generation_options. |
| request_input_schema | dict | No | auto | Optional JSON Schema for validation. Auto-populated from config["agent"]["tools"] if omitted. |
| timeout | int | No | 600 | Maximum seconds to wait for generation |
| poll_interval | int | No | 5 | Seconds between status polls |

Returns: A tuple of (dataset_id, dataset_source) where dataset_source is the dict containing "runs".

Raises:

  • TimeoutError: If generation doesn't finish in time
  • AshrLabsError: If generation fails or no datasets are found

Example:

dataset_id, source = client.generate_dataset(
    request_name="Support Agent Eval",
    config={
        "agent": {
            "name": "Support Bot",
            "description": "Handles customer orders, inventory, and refunds",
            "system_prompt": "You are a helpful support agent for ShopWave.",
            "tools": [
                {"name": "lookup_order", "description": "Look up order status",
                 "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}},
                {"name": "process_refund", "description": "Process a refund",
                 "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["order_id", "reason"]}},
            ],
            "accepted_inputs": {"text": True, "audio": False, "file": False},
        },
        "context": {
            "domain": "e-commerce",
            "use_case": "Customer support for online retail orders",
            "user_persona": {"type": "customer", "description": "Online shoppers"},
        },
        "test_config": {
            "num_variations": 10,
            "strategy": "diverse",
            "coverage": {"happy_path": True, "edge_cases": True, "error_handling": True},
        },
        "generation_options": {
            "generate_audio": False,
            "generate_files": False,
            "generate_simulations": False,
        },
    },
)
print(f"Dataset #{dataset_id}: {len(source['runs'])} scenarios")

Utility Methods

health_check

Check if the API is reachable.

health_check() -> dict

Returns: dict - Status information

Example:

status = client.health_check()
print(f"API Status: {status['status']}")

RunBuilder

A builder for incrementally constructing run result objects as an agent executes tests. Once complete, the result can be deployed via the client.

Constructor

RunBuilder()

No parameters. Creates a run in "pending" status.


RunBuilder.start

Mark the run as started. Records the current timestamp.

run.start() -> RunBuilder

Returns: self (for chaining)


RunBuilder.add_test

Create and register a new test within this run.

run.add_test(test_id: str) -> TestBuilder

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| test_id | str | Yes | Unique identifier for the test case |

Returns: TestBuilder - A builder for the individual test


RunBuilder.complete

Mark the run as completed. Records the current timestamp.

run.complete(status: str = "completed") -> RunBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| status | str | No | "completed" | Final status ("completed" or "failed") |

Returns: self (for chaining)


RunBuilder.build

Serialize the full run result to a dict.

run.build() -> dict[str, Any]

Returns: A dict matching the run result schema, ready to be passed to client.create_run(result=...). Aggregate metrics are computed automatically from action results.


RunBuilder.deploy

Build the result and submit it as a new run via the API.

run.deploy(
    client: AshrLabsClient,
    dataset_id: int,
    tenant_id: int | None = None,
    runner_id: int | None = None,
    agent_id: int | None = None
) -> dict[str, Any]

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| client | AshrLabsClient | Yes | - | An authenticated client instance |
| dataset_id | int | Yes | - | The dataset this run is for |
| tenant_id | int | No | auto | The tenant (auto-resolved if omitted) |
| runner_id | int | No | None | ID of the user who ran the test |
| agent_id | int | No | None | Agent to auto-link the dataset to |

Returns: The created run object from the API

Example:

from ashr_labs import AshrLabsClient, RunBuilder

client = AshrLabsClient(api_key="tp_...")

run = RunBuilder()
run.start()

test = run.add_test("bank_analysis")
test.start()
test.add_user_text(text="Analyze this", description="User prompt")
test.add_tool_call(
    expected={"tool_name": "analyze", "arguments": {"data": "input"}},
    actual={"tool_name": "analyze", "arguments": {"data": "input"}},
    match_status="exact",
)
test.complete()

run.complete()
created_run = run.deploy(client, dataset_id=42)
print(f"Run #{created_run['id']} created")

TestBuilder

Builds a single test result incrementally. Returned by RunBuilder.add_test().

TestBuilder.start

Mark the test as started. Records the current timestamp.

test.start() -> TestBuilder

Returns: self (for chaining)


TestBuilder.add_user_file

Record a user file input action.

test.add_user_file(
    file_path: str,
    description: str,
    action_index: int | None = None
) -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_path | str | Yes | - | Path to the file in the dataset |
| description | str | Yes | - | Description of the action |
| action_index | int | No | auto | Explicit index, or auto-incremented |

Returns: self (for chaining)


TestBuilder.add_user_text

Record a user text input action.

test.add_user_text(
    text: str,
    description: str,
    action_index: int | None = None
) -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | str | Yes | - | The user's text input |
| description | str | Yes | - | Description of the action |
| action_index | int | No | auto | Explicit index, or auto-incremented |

Returns: self (for chaining)


TestBuilder.add_tool_call

Record an agent tool call action with expected vs actual comparison.

test.add_tool_call(
    expected: dict[str, Any],
    actual: dict[str, Any],
    match_status: str,
    divergence_notes: str | None = None,
    argument_comparison: dict[str, Any] | None = None,
    action_index: int | None = None
) -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| expected | dict | Yes | - | Expected tool call (tool_name, arguments) |
| actual | dict | Yes | - | Actual tool call made by the agent |
| match_status | str | Yes | - | "exact", "partial", or "mismatch" |
| divergence_notes | str | No | None | Notes explaining the divergence |
| argument_comparison | dict | No | None | Structured diff from compare_args_structural(). Recommended — the backend grader may skip tool calls without it. |
| action_index | int | No | auto | Explicit index, or auto-incremented |

Returns: self (for chaining)


TestBuilder.add_agent_response

Record an agent text response with expected vs actual comparison.

test.add_agent_response(
    expected_response: dict[str, Any],
    actual_response: dict[str, Any],
    match_status: str,
    semantic_similarity: float | None = None,
    divergence_notes: str | None = None,
    action_index: int | None = None
) -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| expected_response | dict | Yes | - | The expected response content |
| actual_response | dict | Yes | - | The actual response from the agent |
| match_status | str | Yes | - | "exact", "similar", or "divergent" |
| semantic_similarity | float | No | None | Similarity score (0.0 to 1.0) |
| divergence_notes | str | No | None | Notes explaining the divergence |
| action_index | int | No | auto | Explicit index, or auto-incremented |

Returns: self (for chaining)
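For example, recording a semantically similar but non-identical response (the payload values and score are illustrative):

```python
test.add_agent_response(
    expected_response={"text": "Your refund has been processed."},
    actual_response={"text": "I've processed the refund for order ORD-123."},
    match_status="similar",
    semantic_similarity=0.84,
    divergence_notes="Mentions the specific order ID",
)
```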


TestBuilder.set_vm_stream

Attach VM session logs to this test. For agents that operate in a browser or virtual machine.

test.set_vm_stream(
    provider: str,
    session_id: str | None = None,
    duration_ms: int | None = None,
    logs: list[dict] | None = None,
    metadata: dict | None = None
) -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | str | Yes | - | VM provider name (e.g. "browserbase", "scrapybara", "steel") |
| session_id | str | No | None | Provider session ID for linking |
| duration_ms | int | No | None | Total session duration in milliseconds |
| logs | list[dict] | No | None | Timestamped log entries (see below) |
| metadata | dict | No | None | Additional provider-specific metadata |

Log entry format: Each entry should have ts (int, ms offset from start) and type (str):

{"ts": 0, "type": "navigation", "data": {"url": "https://..."}}
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#btn"}}
{"ts": 3000, "type": "error", "data": {"message": "Element not found"}}

Example:

test.set_vm_stream(
    provider="browserbase",
    session_id="sess_abc123",
    duration_ms=12000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 2000, "type": "action", "data": {"action": "click", "selector": "#submit"}},
        {"ts": 5000, "type": "network", "data": {"method": "POST", "url": "/api/order", "status": 201}},
    ],
)

Returns: self (for chaining)


TestBuilder.set_kernel_vm

Convenience method for attaching a Kernel browser session. Sets provider="kernel" and exposes Kernel-specific metadata fields as named parameters. Fields map to Kernel's browser API response.

test.set_kernel_vm(
    session_id: str,
    duration_ms: int | None = None,
    logs: list[dict] | None = None,
    *,
    live_view_url: str | None = None,
    cdp_ws_url: str | None = None,
    replay_id: str | None = None,
    replay_view_url: str | None = None,
    headless: bool | None = None,
    stealth: bool | None = None,
    viewport: dict | None = None,
) -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| session_id | str | Yes | - | Kernel browser session ID |
| duration_ms | int | No | None | Total session duration in milliseconds |
| logs | list[dict] | No | None | Timestamped log entries (same format as set_vm_stream) |
| live_view_url | str | No | None | Remote live-view URL (browser_live_view_url) |
| cdp_ws_url | str | No | None | Chrome DevTools Protocol WebSocket URL |
| replay_id | str | No | None | ID of the session recording |
| replay_view_url | str | No | None | URL to view the session replay |
| headless | bool | No | None | Whether the session ran in headless mode |
| stealth | bool | No | None | Whether anti-bot stealth mode was enabled |
| viewport | dict | No | None | Browser viewport, e.g. {"width": 1920, "height": 1080} |

Example:

test.set_kernel_vm(
    session_id="kern_sess_abc123",
    duration_ms=15000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
        {"ts": 3000, "type": "screenshot", "data": {"s3_key": "vm-streams/.../frame.png"}},
    ],
    replay_id="replay_abc123",
    replay_view_url="https://www.kernel.sh/replays/replay_abc123",
    stealth=True,
    viewport={"width": 1920, "height": 1080},
)

Returns: self (for chaining)


TestBuilder.complete

Mark the test as completed. Records the current timestamp.

test.complete(status: str = "completed") -> TestBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| status | str | No | "completed" | Final status ("completed" or "failed") |

Returns: self (for chaining)


TestBuilder.build

Serialize this test to a dict matching the run result schema.

test.build() -> dict[str, Any]

Returns: A dict with test_id, status, action_results, started_at, and completed_at.


EvalRunner

Runs an agent against every scenario in a dataset and records results. This is the high-level API that encapsulates the full eval loop — iterating scenarios, calling the agent, comparing tool calls and text, and producing a RunBuilder.

Constructor

EvalRunner(dataset_source: dict[str, Any])

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_source | dict | Yes | - | The dataset_source dict from a dataset (contains "runs" key) |

The EvalRunner does not perform local grading. It pairs expected vs actual tool calls and text responses, then submits everything for server-side grading via the backend's LLM-based judge. Tool call arguments are compared structurally using compare_args_structural(). Text responses are submitted with match_status="pending" for server-side evaluation.

Example:

from ashr_labs import EvalRunner

runner = EvalRunner(source)

EvalRunner.from_dataset (class method)

Create an EvalRunner by fetching a dataset from the API.

EvalRunner.from_dataset(
    client: AshrLabsClient,
    dataset_id: int,
    **kwargs
) -> EvalRunner

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| client | AshrLabsClient | Yes | An authenticated client |
| dataset_id | int | Yes | The dataset ID to fetch |
| **kwargs | - | No | Passed to EvalRunner.__init__() |

Returns: EvalRunner - A configured runner ready to call .run()

Example:

runner = EvalRunner.from_dataset(client, dataset_id=322)

EvalRunner.run

Run the agent against every scenario and return a populated RunBuilder.

runner.run(
    agent: Agent,
    *,
    on_scenario: Callable | None = None,
    on_action: Callable | None = None,
    on_environment: Callable | None = None,
    max_workers: int = 1,
) -> RunBuilder

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| agent | Agent | Yes | - | An object implementing the Agent protocol |
| on_scenario | Callable | No | None | Called at the start of each scenario: (scenario_id, scenario_dict) |
| on_action | Callable | No | None | Called for each action: (action_index, action_dict) |
| on_environment | Callable | No | None | Called for environment actions: (content, action_dict) -> dict |
| max_workers | int | No | 1 | Number of scenarios to run in parallel. When >1, each scenario gets a copy.deepcopy of the agent. Important: most LLM clients (Anthropic, OpenAI) hold connection pools that cannot be deep-copied. Use max_workers=1 unless your agent implements __deepcopy__. |

Returns: RunBuilder - A populated builder ready for .build() or .deploy()

Example:

# Sequential (default)
run = runner.run(agent)

# With environment handler — feed external context to the agent
def handle_env(content, action):
    return agent.respond(content)

run = runner.run(agent, on_environment=handle_env)

# Parallel — run 4 scenarios at a time (only if agent supports deepcopy)
run = runner.run(agent, max_workers=4)
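If you do want max_workers > 1, one workaround is to give your agent a __deepcopy__ that builds a fresh client for each worker instead of copying a live connection pool. A minimal sketch under that assumption (MyAgent and the make_client factory are hypothetical, not part of the SDK):

```python
import copy
from typing import Any, Callable


class MyAgent:
    def __init__(self, make_client: Callable[[], Any]):
        self._make_client = make_client   # factory, e.g. lambda: anthropic.Anthropic()
        self.client = make_client()
        self.history: list[dict] = []

    def respond(self, message: str) -> dict[str, Any]:
        # Placeholder: a real agent would call self.client here.
        return {"text": "", "tool_calls": []}

    def reset(self) -> None:
        self.history.clear()

    def __deepcopy__(self, memo):
        # Don't copy the client: construct a fresh one for the new worker.
        return MyAgent(self._make_client)


worker_copy = copy.deepcopy(MyAgent(lambda: object()))
```

Each parallel scenario then operates on its own client instance, sidestepping the un-copyable connection pool.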

EvalRunner.run_and_deploy

Run the eval and submit results in one call.

runner.run_and_deploy(
    agent: Agent,
    client: AshrLabsClient,
    dataset_id: int | None = None,
    *,
    on_scenario: Callable | None = None,
    on_action: Callable | None = None,
    on_environment: Callable | None = None,
    max_workers: int = 1,
    **deploy_kwargs,
) -> dict[str, Any]

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| agent | Agent | Yes | - | An object implementing the Agent protocol |
| client | AshrLabsClient | Yes | - | An authenticated client |
| dataset_id | int \| None | No | None | The dataset to submit against |
| on_scenario | Callable | No | None | Callback per scenario |
| on_action | Callable | No | None | Callback per action |
| on_environment | Callable | No | None | Callback for environment actions (see run()) |
| max_workers | int | No | 1 | Number of scenarios to run in parallel (default sequential) |
| **deploy_kwargs | - | No | - | Extra kwargs passed to RunBuilder.deploy() |

Returns: The created run object from the API

Example:

# Sequential
created = runner.run_and_deploy(agent, client, dataset_id=322)
print(f"Run #{created['id']} submitted")

# Parallel
created = runner.run_and_deploy(agent, client, dataset_id=322, max_workers=4)

Agent Protocol

A @runtime_checkable Protocol that defines the interface agents must implement.

@runtime_checkable
class Agent(Protocol):
    def respond(self, message: str) -> dict[str, Any]: ...
    def reset(self) -> None: ...

respond

Process a user message and return the agent's response.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| message | str | The user's message text |

Returns: A dict with:

  • "text" (str): The agent's text response
  • "tool_calls" (list[dict]): Tool calls made during this turn, each with "name" (str) and "arguments" (dict) keys

arguments vs arguments_json: The Agent protocol returns tool arguments as a dict under the "arguments" key. However, RunBuilder and the API store them as a JSON string under "arguments_json". EvalRunner handles this conversion automatically (eval.py:187-193). If you use RunBuilder directly, pass "arguments_json" (a JSON string) to add_tool_call(). The extract_tool_args() helper accepts both formats, so comparators work either way.
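The conversion between the two shapes is plain JSON serialization. A self-contained sketch of that conversion (to_api_format is an illustrative name, not an SDK function):

```python
import json
from typing import Any


def to_api_format(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Convert an Agent-protocol tool call ({"name", "arguments"}) into the
    RunBuilder/API shape ({"name", "arguments_json"})."""
    return {
        "name": tool_call["name"],
        "arguments_json": json.dumps(tool_call.get("arguments", {})),
    }


call = {"name": "lookup_order", "arguments": {"order_id": "ORD-123"}}
api_call = to_api_format(call)
# api_call["arguments_json"] is now the JSON string '{"order_id": "ORD-123"}'
```

Going the other way is json.loads on the "arguments_json" value, which is what extract_tool_args() does for you.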

reset

Clear conversation state for a new scenario. Called before each scenario begins.

isinstance check

from ashr_labs import Agent

assert isinstance(my_agent, Agent) # Works at runtime
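Because the Protocol is @runtime_checkable, any class with matching respond and reset methods passes the isinstance check without inheriting from Agent. A self-contained sketch (the Protocol is restated locally so this runs standalone; EchoAgent is illustrative):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    def respond(self, message: str) -> dict[str, Any]: ...
    def reset(self) -> None: ...


class EchoAgent:
    """Trivial conforming agent: echoes the message, makes no tool calls."""

    def respond(self, message: str) -> dict[str, Any]:
        return {"text": message, "tool_calls": []}

    def reset(self) -> None:
        pass  # stateless, nothing to clear


assert isinstance(EchoAgent(), Agent)  # structural match, no inheritance needed
```

Note that runtime_checkable protocols only verify method presence, not signatures, so the check is a sanity guard rather than a full type check.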

Comparator Functions

All comparator functions are standalone, stdlib-only, and importable from the top-level package.

strip_markdown

Remove markdown formatting from text.

strip_markdown(text: str) -> str

Removes bold/italic markers, headers, bullets, and markdown links. Collapses whitespace.

Example:

strip_markdown("**Bold** and [link](https://x.com)")
# => "Bold and link"
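The behavior above can be approximated with a few regex passes. This is an illustrative sketch, not the SDK's actual implementation:

```python
import re


def strip_markdown_sketch(text: str) -> str:
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # [link](url) -> link
    text = re.sub(r"[*_`#]+", "", text)                    # bold/italic/code/header markers
    text = re.sub(r"^\s*[-+]\s+", "", text, flags=re.M)    # leading bullets
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace


strip_markdown_sketch("**Bold** and [link](https://x.com)")
```

The link rewrite must run before the marker strip, otherwise the brackets and parentheses of the link syntax would survive.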

tokenize

Lowercase, strip markdown and punctuation, split into word tokens.

tokenize(text: str) -> list[str]

Example:

tokenize("Order **ORD-123** shipped!")
# => ["order", "ord123", "shipped"]
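A rough stdlib equivalent of the pipeline described above (lowercase, drop non-alphanumeric characters, split) — a sketch, not the SDK's exact implementation:

```python
import re


def tokenize_sketch(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # strip markdown markers and punctuation
    return text.split()


tokenize_sketch("Order **ORD-123** shipped!")
```

Note that deleting punctuation before splitting is what fuses "ORD-123" into the single token "ord123".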

fuzzy_str_match

Check if two strings are semantically close enough to count as matching.

fuzzy_str_match(a: str, b: str, threshold: float | None = None) -> bool

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| a | str | Yes | - | First string |
| b | str | Yes | - | Second string |
| threshold | float | No | adaptive | Word-overlap threshold. If None: 0.35 for <=5 words, 0.40 for <=8, 0.55 otherwise |

Returns: True if the strings match closely enough

Checks in order: exact match after normalization, containment, then word-set overlap.

Example:

fuzzy_str_match("Customer wants a refund", "customer wants refund")  # True
fuzzy_str_match("apple banana", "cherry grape") # False

extract_tool_args

Extract arguments from a tool call dict, handling both formats.

extract_tool_args(tool_call: dict) -> dict

Handles {"arguments": {...}} (dict form) and {"arguments_json": "..."} (JSON string form). Prefers the dict form if both are present.

Example:

extract_tool_args({"arguments_json": '{"order_id": "ORD-123"}'})
# => {"order_id": "ORD-123"}

extract_tool_args({"arguments": {"order_id": "ORD-123"}})
# => {"order_id": "ORD-123"}
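The dual-format handling is a small amount of branching plus json.loads. A self-contained sketch of the documented behavior (the function name here is illustrative):

```python
import json


def extract_args_sketch(tool_call: dict) -> dict:
    if "arguments" in tool_call:          # dict form wins when both are present
        return tool_call["arguments"]
    raw = tool_call.get("arguments_json")
    return json.loads(raw) if raw else {}
```

Writing comparators against this helper means they work regardless of whether the tool call came from the Agent protocol (dict form) or from a stored run result (JSON-string form).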

compare_tool_args

Compare expected vs actual tool call arguments.

compare_tool_args(expected: dict, actual: dict) -> tuple[str, str | None]

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| expected | dict | Expected tool call (with arguments or arguments_json) |
| actual | dict | Actual tool call made by the agent |

Returns: A tuple of (match_status, divergence_notes):

  • match_status: "exact", "partial", or "mismatch"
  • divergence_notes: Human-readable diff summary, or None if exact

String arguments are compared using fuzzy_str_match. Non-string values use exact equality. Extra arguments in the actual call don't cause divergence.

Example:

status, notes = compare_tool_args(
    {"arguments": {"order_id": "ORD-123"}},
    {"arguments": {"order_id": "ORD-123", "extra": "field"}},
)
# => ("exact", None)

status, notes = compare_tool_args(
    {"arguments": {"order_id": "ORD-123", "reason": "damaged item"}},
    {"arguments": {"order_id": "ORD-999", "reason": "item was damaged"}},
)
# => ("partial", "'order_id': expected='ORD-123' actual='ORD-999'")

text_similarity

Compute similarity between two text strings.

text_similarity(text_a: str, text_b: str) -> float

Returns: A float between 0.0 and 1.0

Uses cosine similarity on word frequency vectors, plus:

  • Entity bonus (+0.20): for matching order IDs (ORD-*), refund IDs (REF-*), prices ($*), dates (YYYY-MM-DD), and tracking URLs
  • Concept bonus (+0.10): for matching domain concepts (refund/credited, shipped/transit/delivered, stock/available, etc.)

Example:

text_similarity(
    "Your order ORD-123 has shipped and is on the way",
    "Order ORD-123 has been shipped and is in transit",
)
# => 0.78
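The base cosine step can be sketched with word-count vectors; the entity and concept bonuses described above are omitted here, so this is only the foundation of the score, not a reimplementation:

```python
import math
from collections import Counter


def cosine_sketch(a: str, b: str) -> float:
    """Cosine similarity on word-frequency vectors (no bonuses)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0
```

Identical texts score 1.0, fully disjoint texts score 0.0, and the SDK's bonuses then push entity-matching pairs (shared order IDs, prices, dates) above what raw word overlap alone would yield.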

Data Types

User

class User(TypedDict, total=False):
    id: int
    created_at: str
    email: str
    name: str | None
    tenant: int
    is_active: bool

Tenant

class Tenant(TypedDict, total=False):
    id: int
    created_at: str
    tenant_name: str
    is_active: bool

Session

class Session(TypedDict):
    status: str
    user: User
    tenant: Tenant

Dataset

class Dataset(TypedDict, total=False):
    id: int
    created_at: str
    tenant: int
    creator: int
    name: str
    description: str | None
    agent_id: int | None
    agent_details: dict[str, Any] | None  # {"id": int, "name": str}
    dataset_source: dict[str, Any]

Run

class Run(TypedDict, total=False):
    id: int
    created_at: str
    dataset: int
    tenant: int
    runner: int
    result: dict[str, Any]

ObservabilityTrace

class ObservabilityTrace(TypedDict, total=False):
    id: str
    name: str
    user_id: str | None       # End-user identifier
    session_id: str | None    # Conversation/session grouping
    metadata: dict[str, Any] | None
    tags: list[str]
    created_at: str | None
    output: dict[str, Any] | None
    observations: list[ObservabilityObservation]

ObservabilityObservation

class ObservabilityObservation(TypedDict, total=False):
    id: str
    name: str
    type: str                 # "span", "generation", "event"
    parent_observation_id: str | None
    input: dict[str, Any] | None
    output: dict[str, Any] | None
    metadata: dict[str, Any] | None
    model: str | None         # LLM model name (generations only)
    usage: dict[str, int] | None  # {"input_tokens": ..., "output_tokens": ...}
    level: str | None         # "DEBUG", "DEFAULT", "WARNING", "ERROR"
    status_message: str | None
    start_time: str | None
    end_time: str | None

SdkNote

class SdkNote(TypedDict, total=False):
    id: int
    created_at: str
    updated_at: str
    title: str
    content: str
    category: str    # "info", "warning", "breaking_change", "best_practice", "deprecation"
    severity: str    # "info", "warning", "critical"
    tenant_id: int | None
    agent_id: int | None
    active_from: str
    expires_at: str | None
    is_archived: bool
    note_metadata: dict[str, Any]

Request

class Request(TypedDict, total=False):
    id: int
    created_at: str
    requestor_id: int
    requestor_tenant: int
    request_name: str
    request_status: str
    request_input_schema: dict[str, Any] | None
    request: dict[str, Any]

APIKey

class APIKey(TypedDict, total=False):
    id: int
    key: str         # Only present on creation
    key_prefix: str
    name: str
    scopes: list[str]
    user_id: int
    tenant_id: int
    created_at: str
    last_used_at: str | None
    expires_at: str | None
    is_active: bool

ToolCall

class ToolCall(TypedDict, total=False):
    name: str
    arguments_json: str

ExpectedResponse

class ExpectedResponse(TypedDict, total=False):
    tool_calls: list[ToolCall]
    text: str

Action

class Action(TypedDict, total=False):
    actor: str       # "user" or "agent"
    content: str
    name: str
    expected_response: ExpectedResponse

Scenario

class Scenario(TypedDict, total=False):
    title: str
    actions: list[Action]

Agent

class Agent(TypedDict, total=False):
    id: int
    created_at: str
    tenant_id: int
    creator_id: int | None
    name: str
    description: str | None
    config: AgentConfig
    is_active: bool
    dataset_count: int

AgentConfig

class AgentConfig(TypedDict, total=False):
    form_data: dict  # Dataset generation preset
    tool_definitions: list[ToolDefinition]
    behavior_rules: list[BehaviorRule]
    grading_config: GradingConfig

ToolDefinition

class ToolDefinition(TypedDict, total=False):
    name: str
    description: str
    required: bool   # True = must be called, False = optional

BehaviorRule

class BehaviorRule(TypedDict, total=False):
    rule: str
    strictness: str  # "required" | "expected" | "optional"

GradingConfig

class GradingConfig(TypedDict, total=False):
    tool_strictness: dict[str, str]  # tool_name -> "required" | "expected" | "optional"
    text_similarity_threshold: float