API Reference
Complete reference for all classes and methods in the Ashr Labs SDK.
The SDK serves two products:
- Testing Platform — generate eval datasets, run your agent against them, submit graded results. Core methods: create_request, create_run, EvalRunner, RunBuilder.
- Observability (separate product) — trace your agent's production behavior (LLM calls, tool invocations, latency, errors). Core methods: trace(), Span, Generation, list_observability_traces. Requires the observability feature flag.
These are independent products that share the same SDK and API key.
AshrLabsClient
The main client class for interacting with the Ashr Labs API.
Constructor
AshrLabsClient(
api_key: str,
base_url: str = "https://api.ashr.io/testing-platform-api",
timeout: int = 30
)
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
api_key | str | Yes | - | Your API key (must start with tp_) |
base_url | str | No | Production URL | Base URL of the API |
timeout | int | No | 30 | Request timeout in seconds |
Raises:
ValueError: If the API key format is invalid
Example:
# Minimal — just pass your API key
client = AshrLabsClient(api_key="tp_your_key_here")
# Custom timeout
client = AshrLabsClient(api_key="tp_your_key_here", timeout=60)
from_env (class method)
Create a client from environment variables.
AshrLabsClient.from_env(timeout: int = 30) -> AshrLabsClient
Reads ASHR_LABS_API_KEY (required) and ASHR_LABS_BASE_URL (optional) from the environment.
Raises:
RuntimeError: If ASHR_LABS_API_KEY is not set
Example:
# export ASHR_LABS_API_KEY="tp_your_key_here"
client = AshrLabsClient.from_env()
Session Methods
init
Initialize a session and validate authentication.
init() -> Session
Returns: Session - Session information containing user and tenant data
Raises:
AuthenticationError: If the API key is invalid or expired
Example:
# Validate credentials and get user/tenant info
session = client.init()
print(f"User ID: {session['user']['id']}")
print(f"Email: {session['user']['email']}")
print(f"Tenant ID: {session['tenant']['id']}")
print(f"Tenant Name: {session['tenant']['tenant_name']}")
Dataset Methods
get_dataset
Retrieve a dataset by ID.
get_dataset(
dataset_id: int,
include_signed_urls: bool = False,
url_expires_seconds: int = 3600
) -> Dataset
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_id | int | Yes | - | The ID of the dataset |
include_signed_urls | bool | No | False | Include signed S3 URLs for media |
url_expires_seconds | int | No | 3600 | URL expiration time in seconds |
Returns: Dataset - The dataset object
Raises:
NotFoundError: Dataset not found
AuthorizationError: No access to this dataset
Example:
dataset = client.get_dataset(
dataset_id=42,
include_signed_urls=True,
url_expires_seconds=7200
)
print(dataset["name"])
list_datasets
List datasets for a tenant.
list_datasets(
tenant_id: int | None = None,
limit: int = 50,
cursor: int | None = None,
include_signed_urls: bool = False,
url_expires_seconds: int = 3600
) -> dict
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
limit | int | No | 50 | Maximum results to return |
cursor | int | No | None | Pagination cursor (pass next_cursor from previous response) |
include_signed_urls | bool | No | False | Include signed S3 URLs |
url_expires_seconds | int | No | 3600 | URL expiration time |
Returns: dict with keys:
status: "ok"
datasets: List of dataset objects
next_cursor: ID for the next page, or null if no more results
Example:
# tenant_id auto-resolved from API key
response = client.list_datasets(limit=10)
for dataset in response["datasets"]:
print(f"{dataset['id']}: {dataset['name']}")
# Pagination
if response.get("next_cursor"):
next_page = client.list_datasets(limit=10, cursor=response["next_cursor"])
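The cursor pattern above generalizes to a loop that drains every page. A minimal sketch (the `fetch_all_datasets` helper is hypothetical, not part of the SDK):

```python
def fetch_all_datasets(list_datasets, limit=50):
    """Collect every dataset by following next_cursor until exhausted.

    `list_datasets` is any callable with client.list_datasets' keyword
    interface (limit, cursor); pass `client.list_datasets` directly.
    """
    items = []
    cursor = None
    while True:
        page = list_datasets(limit=limit, cursor=cursor)
        items.extend(page["datasets"])
        cursor = page.get("next_cursor")
        if cursor is None:  # null next_cursor means no more results
            return items
```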
Run Methods
create_run
Create a new test run.
create_run(
dataset_id: int,
result: dict[str, Any],
tenant_id: int | None = None,
runner_id: int | None = None
) -> Run
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_id | int | Yes | - | The dataset ID |
result | dict | Yes | - | Run results (metrics, status, etc.) |
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
runner_id | int | No | None | ID of user who ran the test |
Returns: Run - The created run object
Example:
run = client.create_run(
dataset_id=42,
result={
"status": "passed",
"score": 0.95,
"metrics": {
"accuracy": 0.98,
"latency_ms": 150
}
}
)
get_run
Retrieve a run by ID.
get_run(run_id: int) -> Run
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
run_id | int | Yes | The run ID |
Returns: Run - The run object
Raises:
NotFoundError: Run not found
Example:
run = client.get_run(run_id=99)
print(f"Score: {run['result']['score']}")
list_runs
List runs for a tenant or dataset.
list_runs(
dataset_id: int | None = None,
tenant_id: int | None = None,
limit: int = 50
) -> dict
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_id | int | No | None | Filter by dataset |
tenant_id | int | No | auto | Filter by tenant (auto-resolved if omitted) |
limit | int | No | 50 | Maximum results |
Returns: dict with keys:
status: "ok"
runs: List of run objects
Example:
# Get runs for a specific dataset
response = client.list_runs(dataset_id=42)
for run in response["runs"]:
print(f"Run #{run['id']}: {run['result']['status']}")
delete_run
Delete a test run.
delete_run(run_id: int) -> dict
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
run_id | int | Yes | The run ID to delete |
Returns: dict - Confirmation of deletion
Raises:
NotFoundError: Run not found
Example:
client.delete_run(run_id=99)
print("Run deleted")
Observability — Production Agent Tracing
This is a separate product from the Testing Platform. The testing platform (datasets, eval runs,
RunBuilder, EvalRunner) is for offline evaluation. Observability is for tracing your agent in production. They share the same SDK and API key but are independent features.
Trace your agent's production behavior — LLM calls, tool invocations, retrieval
steps, guardrail checks, and more. Requires the observability feature flag to
be enabled for your tenant.
Production-safe: tracing never raises exceptions or interferes with your
agent. If the backend is unreachable, trace.end() returns an error dict
instead of throwing.
client.trace
Start a new trace for a production agent interaction.
trace = client.trace(
name: str,
*,
user_id: str | None = None,
session_id: str | None = None,
metadata: dict | None = None,
tags: list[str] | None = None,
) -> Trace
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
name | str | Yes | - | Name for this trace (e.g. "handle-ticket") |
user_id | str | No | None | End-user ID for grouping |
session_id | str | No | None | Conversation/session ID |
metadata | dict | No | None | Arbitrary metadata |
tags | list[str] | No | None | Tags for filtering |
Returns: A Trace instance. Supports context manager (with) usage.
Trace methods
| Method | Description |
|---|---|
trace.span(name, *, input, metadata) | Create a top-level span |
trace.generation(name, *, model, input, metadata) | Create a top-level generation (LLM call) |
trace.event(name, *, input, metadata, level) | Record a point-in-time event |
trace.end(*, output) | Flush the trace to the backend. Never raises. |
trace.trace_id | Server-assigned trace ID (available after end()) |
Span methods
| Method | Description |
|---|---|
span.span(name, *, input, metadata) | Create a child span |
span.generation(name, *, model, input, metadata) | Create a child generation |
span.event(name, *, input, metadata, level) | Record an event under this span |
span.end(*, output, status_message, level) | Mark the span as complete |
Spans support context managers. If the body raises, the span auto-ends with level="ERROR" and the exception message is captured in status_message.
Generation methods
Inherits all Span methods, plus:
| Method | Description |
|---|---|
gen.end(*, output, usage, status_message, level) | Mark complete with token usage |
The usage dict accepts {"input_tokens": int, "output_tokens": int}.
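Usage is reported per generation, so trace-level token totals must be summed client-side. A small sketch (the `total_usage` helper is hypothetical, not part of the SDK):

```python
def total_usage(usages):
    """Sum per-generation usage dicts into trace-level token totals."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for usage in usages:
        totals["input_tokens"] += usage.get("input_tokens", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals
```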
Context managers (recommended)
Context managers ensure spans are always ended, even if your code throws:
with client.trace("handle-ticket", user_id="user_42") as trace:
with trace.generation("classify", model="claude-sonnet-4-6",
input=[{"role": "user", "content": "help"}]) as gen:
result = call_llm(...)
gen.end(output=result, usage={"input_tokens": 50, "output_tokens": 12})
with trace.span("tool:search", input={"query": "..."}) as tool:
data = search(...)
tool.end(output=data)
# If search() throws, the span auto-ends with level="ERROR"
# trace.end() is called automatically on exit
Manual instrumentation
trace = client.trace("support-chat", user_id="user_42", session_id="conv_abc")
gen = trace.generation("classify-intent", model="claude-sonnet-4-6",
input=[{"role": "user", "content": "Reset my password"}])
gen.end(output={"intent": "password_reset"},
usage={"input_tokens": 50, "output_tokens": 12})
tool = trace.span("tool:reset_password", input={"user_id": "user_42"})
tool.end(output={"success": True})
trace.event("guardrail-check", input={"passed": True})
result = trace.end(output={"resolution": "password_reset_complete"})
print(trace.trace_id) # server-assigned ID
list_observability_traces
List traces for the current tenant.
client.list_observability_traces(
user_id: str | None = None,
session_id: str | None = None,
limit: int = 50,
page: int = 1,
) -> dict
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
user_id | str | No | None | Filter by end-user |
session_id | str | No | None | Filter by session |
limit | int | No | 50 | Max results per page (max 100) |
page | int | No | 1 | Page number |
Returns: {"status": "ok", "traces": [...], "total": int}
get_observability_trace
Get a single trace with its full observation tree.
client.get_observability_trace(trace_id: str) -> dict
Returns: {"status": "ok", "trace": {...}} — the trace includes an observations list with id, name, type, parent_observation_id, input, output, metadata, model, usage, level, start_time, end_time.
get_observability_analytics
Get analytics overview for the current tenant.
client.get_observability_analytics(days: int = 7) -> dict
Returns: {"status": "ok", "overview": {...}, "tool_performance": [...], "model_usage": [...]}
Overview includes: total_traces, avg_latency_ms, p95_latency_ms, total_input_tokens, total_output_tokens, error_rate, total_tool_calls, unique_users, unique_sessions.
get_observability_errors / get_observability_tool_errors
client.get_observability_errors(days: int = 7, limit: int = 50, page: int = 1) -> dict
client.get_observability_tool_errors(days: int = 7, limit: int = 50, page: int = 1) -> dict
Returns: {"status": "ok", "traces": [...], "total": int} — traces with errors or tool failures, most recent first.
SDK Notes — Platform Advisories
SDK Notes are platform advisories delivered to your SDK from Ashr Labs. They communicate context changes, best practices, deprecations, or breaking changes that may affect how you configure or run your agent.
Notes are automatically fetched when the client initializes (via init()).
You can also refresh them on demand.
client.notes (property)
Get cached SDK notes from the last init() or get_notes() call. No network
request is made.
client.notes -> list[SdkNote]
Returns: List of active notes for your tenant.
Example:
client = AshrLabsClient(api_key="tp_...")
client.init()  # notes are fetched during init() and cached
for note in client.notes:
print(f"[{note['severity']}] {note['title']}: {note['content']}")
get_notes
Fetch fresh SDK notes from the platform. Updates the cached client.notes.
get_notes(agent_id: int | None = None) -> list[SdkNote]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
agent_id | int | No | None | Include notes targeted at this specific agent |
Returns: List of active notes (global + tenant-specific, plus agent-specific if agent_id is provided).
Example:
# Refresh notes
notes = client.get_notes()
# Filter by agent
notes = client.get_notes(agent_id=42)
# Check for breaking changes
breaking = [n for n in notes if n['category'] == 'breaking_change']
if breaking:
print("⚠ Breaking changes detected:")
for n in breaking:
print(f" {n['title']}: {n['content']}")
Note categories: info, warning, breaking_change, best_practice, deprecation
Severity levels: info, warning, critical
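A common pattern is surfacing only the more serious notes at startup. A sketch built on the documented severity levels (the helper itself is hypothetical):

```python
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}

def notes_at_or_above(notes, min_severity="warning"):
    """Keep notes whose severity meets the given threshold."""
    floor = SEVERITY_ORDER[min_severity]
    return [n for n in notes
            if SEVERITY_ORDER.get(n.get("severity", "info"), 0) >= floor]
```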
Request Methods
create_request
Create a dataset generation request.
create_request(
request_name: str,
request: dict[str, Any],
request_input_schema: dict[str, Any] | None = None,
tenant_id: int | None = None,
requestor_id: int | None = None,
) -> Request
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
request_name | str | Yes | - | Name/title for the request |
request | dict | Yes | - | The generation config (see below) |
request_input_schema | dict | No | auto | JSON Schema for validating the request. A permissive default is sent if omitted. If your agent has tools, include them here under the "tools" key so they're auto-saved as skill templates. |
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
requestor_id | int | No | auto | ID of requesting user (auto-resolved if omitted) |
Returns: Request - The created request object
Generation config structure (the request dict):
The config has two required sections (agent and context) and several optional sections.
metadata (optional)
| Field | Type | Description |
|---|---|---|
dataset_name | str | Name for the generated dataset |
description | str | Description of what this dataset tests |
agent (required)
At least one of name, description, or system_prompt is required.
| Field | Type | Default | Description |
|---|---|---|---|
name | str | - | Agent name |
description | str | - | What the agent does |
system_prompt | str | - | System prompt given to the agent |
tools | list[dict] | [] | Tools the agent can call (see below) |
accepted_inputs | dict | text only | Input modalities (see below) |
output_format | dict | {"type": "text"} | "text" or "structured" with optional schema |
input_schema | dict | - | Custom structured input schema (see below) |
Tool definition:
{
"name": "tool_name", # snake_case tool name
"description": "What it does", # Used by test generator for realistic scenarios
"parameters": { # JSON Schema for tool parameters
"type": "object",
"properties": {
"arg_name": {"type": "string", "description": "What this arg is"},
},
"required": ["arg_name"],
},
"returns": { # (optional) Return value schema
"type": "object",
"description": "What the tool returns",
},
}
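The backend's exact validation rules aren't documented, but a local sanity check mirroring the documented tool shape (a hypothetical helper, not part of the SDK) can catch common mistakes before submitting:

```python
def check_tool_definition(tool):
    """Return a list of problems with a tool definition (empty list = OK)."""
    problems = []
    name = tool.get("name", "")
    if not name or not all(ch.islower() or ch.isdigit() or ch == "_" for ch in name):
        problems.append("name should be snake_case")
    if not tool.get("description"):
        problems.append("description is missing (the test generator relies on it)")
    params = tool.get("parameters")
    if not isinstance(params, dict) or params.get("type") != "object":
        problems.append('parameters should be a JSON Schema with "type": "object"')
    else:
        props = params.get("properties", {})
        for arg in params.get("required", []):
            if arg not in props:
                problems.append(f"required arg {arg!r} not in properties")
    return problems
```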
Accepted inputs — values can be bool or {"enabled": bool}:
"accepted_inputs": {
"text": True, # (default True) Text input
"audio": False, # Audio: mp3, wav, m4a, ogg, webm
"file": False, # Files: pdf, txt, csv, json, xml, html, md, docx, xlsx
"image": False, # Images: jpg, png, gif, webp
"video": False, # Video: mp4, webm, mov, avi
"conversation": False, # Multi-participant conversations with inferred roles
}
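Because each value may be a bool or `{"enabled": bool}`, it helps to normalize before comparing configs. A hypothetical normalizer reflecting the documented defaults:

```python
def normalize_accepted_inputs(accepted):
    """Flatten each modality value to a plain bool, applying the defaults."""
    defaults = {"text": True, "audio": False, "file": False,
                "image": False, "video": False, "conversation": False}
    out = dict(defaults)
    for key, value in (accepted or {}).items():
        out[key] = bool(value["enabled"]) if isinstance(value, dict) else bool(value)
    return out
```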
Input schema — define structured data users provide to the agent:
"input_schema": {
"name": "OrderInput",
"description": "Data the customer provides",
"fields": [
{"name": "order_id", "type": "string", "description": "Order ID", "required": True},
{"name": "priority", "type": "string", "description": "Priority level", "enum": ["low", "medium", "high"]},
],
"example": {"order_id": "ORD-123", "priority": "high"},
}
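The fields list maps naturally onto standard JSON Schema. How the platform performs that translation isn't documented; the sketch below shows one plausible mapping (a hypothetical helper, for illustration only):

```python
def fields_to_json_schema(input_schema):
    """Translate the fields list into a standard JSON Schema object."""
    props, required = {}, []
    for field in input_schema.get("fields", []):
        prop = {"type": field["type"]}
        if "description" in field:
            prop["description"] = field["description"]
        if "enum" in field:
            prop["enum"] = field["enum"]
        props[field["name"]] = prop
        if field.get("required"):
            required.append(field["name"])
    return {"type": "object", "title": input_schema.get("name", ""),
            "properties": props, "required": required}
```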
context (required)
At least one of domain, use_case, or scenario_context is required.
| Field | Type | Default | Description |
|---|---|---|---|
domain | str | - | Domain: "banking", "healthcare", "e-commerce", "legal", "education", "customer_service", "technology", "travel", "insurance", "other" |
use_case | str | - | Specific use case description (min 10 chars recommended) |
scenario_context | str | - | Additional scenario context |
user_persona | dict | - | {"type": str, "description": str} — who interacts with the agent |
sample_data | dict | - | {"examples": [{...}]} — example data for realistic tests |
test_config (optional)
| Field | Type | Default | Description |
|---|---|---|---|
num_variations | int | 5 | Number of test scenarios (1-50) |
strategy | str | "balanced" | "focused", "diverse", or "balanced" |
coverage | dict | all True | {"happy_path": bool, "edge_cases": bool, "error_handling": bool, "boundary_values": bool} |
complexity_distribution | dict | auto | {"simple": 0.3, "moderate": 0.5, "complex": 0.2} — must sum to ~1.0 |
focus_areas | list[str] | [] | Specific areas to focus testing on |
exclude | list[str] | [] | Scenarios or test types to exclude |
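The constraints in this table (1-50 variations, a known strategy, a distribution summing to ~1.0) can be checked locally before submitting. A hypothetical pre-flight check, not part of the SDK:

```python
def check_test_config(cfg):
    """Return a list of constraint violations in a test_config dict."""
    problems = []
    n = cfg.get("num_variations", 5)
    if not 1 <= n <= 50:
        problems.append("num_variations must be between 1 and 50")
    if cfg.get("strategy", "balanced") not in ("focused", "diverse", "balanced"):
        problems.append("strategy must be focused, diverse, or balanced")
    dist = cfg.get("complexity_distribution")
    if dist is not None and abs(sum(dist.values()) - 1.0) > 0.01:
        problems.append("complexity_distribution must sum to ~1.0")
    return problems
```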
generation_options (required)
| Field | Type | Default | Description |
|---|---|---|---|
generate_audio | bool | False | Generate audio test inputs |
generate_files | bool | False | Generate file test inputs (PDF, CSV, etc.) |
generate_images | bool | False | Generate image test inputs |
generate_videos | bool | False | Generate video test inputs |
generate_simulations | bool | False | Generate website session replay simulation videos |
Example:
req = client.create_request(
request_name="Support Agent Eval",
request={
"metadata": {"dataset_name": "Support Eval"},
"agent": {
"name": "Support Bot",
"description": "Answers customer questions about orders and refunds",
"system_prompt": "You are a helpful support agent.",
"tools": [
{
"name": "lookup_order",
"description": "Look up an order by ID",
"parameters": {
"type": "object",
"properties": {"order_id": {"type": "string", "description": "The order ID"}},
"required": ["order_id"],
},
},
],
"accepted_inputs": {"text": True, "audio": False, "file": False},
"output_format": {"type": "text"},
},
"context": {
"domain": "e-commerce",
"use_case": "Customers asking about order status, requesting refunds",
"scenario_context": "An online retail store called ShopWave",
"user_persona": {"type": "customer", "description": "Online shoppers"},
},
"test_config": {
"num_variations": 10,
"strategy": "diverse",
"coverage": {"happy_path": True, "edge_cases": True, "error_handling": True},
"complexity_distribution": {"simple": 0.3, "moderate": 0.5, "complex": 0.2},
},
"expected_behaviors": {
"must_include": ["order"],
"expected_tools": ["lookup_order"],
},
},
)
# Use wait_for_request or generate_dataset instead of manual polling
completed = client.wait_for_request(req["id"], timeout=300)
get_request
Retrieve a request by ID.
get_request(request_id: int) -> Request
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
request_id | int | Yes | The request ID |
Returns: Request - The request object
Raises:
NotFoundError: Request not found
Example:
req = client.get_request(request_id=123)
print(f"Status: {req['request_status']}")
list_requests
List requests for a tenant.
list_requests(
tenant_id: int | None = None,
status: str | None = None,
limit: int = 50,
cursor: int | None = None
) -> dict
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
status | str | No | None | Filter by status |
limit | int | No | 50 | Maximum results |
cursor | int | No | None | Pagination cursor |
Returns: dict with keys:
status: "ok"
requests: List of request objects
Example:
# Get pending requests
response = client.list_requests(status="pending")
for req in response["requests"]:
print(f"Request #{req['id']}: {req['request_name']}")
Agent Methods
Agents group datasets and define grading behavior. Each dataset can belong to one agent.
list_agents
List all agents for your tenant with dataset counts.
list_agents() -> list[Agent]
Returns: List of agent objects with id, name, description, config, dataset_count.
Example:
agents = client.list_agents()
for agent in agents:
print(f"{agent['name']}: {agent['dataset_count']} datasets")
create_agent
Create a new agent.
create_agent(
name: str,
description: str | None = None,
config: dict | None = None
) -> Agent
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
name | str | Yes | - | Agent name (unique per tenant) |
description | str | No | None | What this agent does |
config | dict | No | None | Agent config (tool_definitions, behavior_rules, grading_config) |
Config structure:
{
"tool_definitions": [
{"name": "fetch_data", "required": True, "description": "Fetch live data"},
{"name": "end_session", "required": False, "description": "End conversation"},
],
"behavior_rules": [
{"rule": "Always fetch before quoting data", "strictness": "required"},
{"rule": "Save caller name", "strictness": "expected"},
],
"grading_config": {
"tool_strictness": {
"fetch_data": "required", # must be called — no recovery
"end_session": "optional", # text recovery OK
"await_user_response": "optional",
},
"text_similarity_threshold": 0.3, # lower for multilingual agents
},
}
Grading strictness levels:
"required" — tool must be called. NOT_CALLED = failure.
"expected" — tool should be called. NOT_CALLED = warning, not failure.
"optional" — if the agent achieves the intent via text, the grader recovers it as a partial match.
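The strictness levels read as a small decision table. One plausible encoding (the real grader's exact outcome labels aren't documented; `grade_tool` is purely illustrative):

```python
def grade_tool(strictness, called, text_recovered=False):
    """One plausible mapping from strictness and call status to an outcome."""
    if called:
        return "pass"
    if strictness == "required":
        return "fail"          # NOT_CALLED = failure, no recovery
    if strictness == "expected":
        return "warn"          # NOT_CALLED = warning, not failure
    # optional: achieving the intent via text counts as a partial match
    return "partial" if text_recovered else "warn"
```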
Example:
agent = client.create_agent(
name="Support Bot",
description="Healthcare scheduling agent",
config={
"tool_definitions": [
{"name": "fetch_kareo_data", "required": True},
{"name": "end_session", "required": False},
],
"grading_config": {
"tool_strictness": {
"fetch_kareo_data": "required",
"end_session": "optional",
},
},
},
)
print(f"Created agent: {agent['id']}")
update_agent
Update an agent's name, description, or config.
update_agent(
agent_id: int,
name: str | None = None,
description: str | None = None,
config: dict | None = None
) -> Agent
Note: config replaces the entire config object — merge locally before updating if you want to preserve existing fields.
delete_agent
Soft-delete an agent. Datasets are unlinked but not deleted.
delete_agent(agent_id: int) -> dict
get_agent_datasets
Get all datasets linked to an agent.
get_agent_datasets(agent_id: int) -> dict
Returns: Dict with agent (the agent object) and datasets (list of dataset objects).
set_dataset_agent
Assign or unassign an agent to a dataset.
set_dataset_agent(dataset_id: int, agent_id: int | None) -> dict
Pass agent_id=None to unlink a dataset from its agent.
API Key Methods
list_api_keys
List API keys for your tenant.
list_api_keys(include_inactive: bool = False) -> list[APIKey]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
include_inactive | bool | No | False | Include revoked keys |
Returns: list[APIKey] - List of API key objects
Note: For security, only the key prefix is returned, not the full key.
Example:
keys = client.list_api_keys()
for key in keys:
print(f"{key['key_prefix']}... - {key['name']}")
revoke_api_key
Revoke an API key.
revoke_api_key(api_key_id: int) -> dict
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
api_key_id | int | Yes | The API key ID to revoke |
Returns: dict - Confirmation of revocation
Raises:
NotFoundError: API key not found
Example:
client.revoke_api_key(api_key_id=123)
print("API key revoked")
Convenience Methods
wait_for_request
Block until a request reaches a terminal state (completed or failed).
wait_for_request(
request_id: int,
timeout: int = 600,
poll_interval: int = 5
) -> Request
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
request_id | int | Yes | - | The request ID to poll |
timeout | int | No | 600 | Maximum seconds to wait |
poll_interval | int | No | 5 | Seconds between polls |
Returns: Request - The final request object
Raises:
TimeoutError: If the request doesn't finish within timeout seconds
AshrLabsError: If the request fails
Example:
req = client.create_request(request_name="My Eval", request=config)
completed = client.wait_for_request(req["id"], timeout=300)
print(f"Status: {completed['request_status']}")
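The polling semantics amount to a deadline loop over get_request. An illustrative reimplementation (the SDK already provides this; the sketch only shows the behavior, with injectable clock and sleep so it can be exercised without waiting):

```python
import time

def wait_until_terminal(get_request, timeout=600, poll_interval=5,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll get_request() until the request completes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        req = get_request()
        if req["request_status"] in ("completed", "failed"):
            return req
        if clock() >= deadline:
            raise TimeoutError("request did not reach a terminal state in time")
        sleep(poll_interval)
```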
poll_run
Block until backend grading completes for a run. After deploy(), the backend grades tool arguments and text responses asynchronously (typically 1-3 minutes). This method polls get_run() until aggregate_metrics.tests_passed is populated.
poll_run(
run_id: int,
timeout: int = 300,
poll_interval: int = 20,
on_poll: Callable | None = None
) -> Run
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
run_id | int | Yes | - | The run ID to poll |
timeout | int | No | 300 | Maximum seconds to wait |
poll_interval | int | No | 20 | Seconds between polls |
on_poll | Callable | No | None | Called after each poll: (elapsed_seconds, run_dict) |
Returns: Run - The fully graded run object
Raises:
TimeoutError: If grading doesn't finish within timeout seconds
Example:
created = run.deploy(client, dataset_id=322)
graded = client.poll_run(
created["id"],
on_poll=lambda elapsed, r: print(f"Grading... ({elapsed}s)"),
)
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
generate_dataset
Create a dataset generation request, wait for completion, and fetch the result. Combines create_request + wait_for_request + get_dataset into one call.
Missing context fields (use_case, scenario_context) are auto-filled from the agent's name and description. A default test_config is added if not provided.
generate_dataset(
request_name: str,
config: dict[str, Any],
request_input_schema: dict[str, Any] | None = None,
timeout: int = 600,
poll_interval: int = 5
) -> tuple[int, dict]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
request_name | str | Yes | - | Name/title for the request |
config | dict | Yes | - | The generation config — same structure as create_request's request parameter. See create_request for the full schema reference. Valid sections: metadata, agent, context, test_config, generation_options. |
request_input_schema | dict | No | auto | Optional JSON Schema for validation. Auto-populated from config["agent"]["tools"] if omitted. |
timeout | int | No | 600 | Maximum seconds to wait for generation |
poll_interval | int | No | 5 | Seconds between status polls |
Returns: A tuple of (dataset_id, dataset_source) where dataset_source is the dict containing "runs".
Raises:
TimeoutError: If generation doesn't finish in time
AshrLabsError: If generation fails or no datasets are found
Example:
dataset_id, source = client.generate_dataset(
request_name="Support Agent Eval",
config={
"agent": {
"name": "Support Bot",
"description": "Handles customer orders, inventory, and refunds",
"system_prompt": "You are a helpful support agent for ShopWave.",
"tools": [
{"name": "lookup_order", "description": "Look up order status",
"parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}},
{"name": "process_refund", "description": "Process a refund",
"parameters": {"type": "object", "properties": {"order_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["order_id", "reason"]}},
],
"accepted_inputs": {"text": True, "audio": False, "file": False},
},
"context": {
"domain": "e-commerce",
"use_case": "Customer support for online retail orders",
"user_persona": {"type": "customer", "description": "Online shoppers"},
},
"test_config": {
"num_variations": 10,
"strategy": "diverse",
"coverage": {"happy_path": True, "edge_cases": True, "error_handling": True},
},
"generation_options": {
"generate_audio": False,
"generate_files": False,
"generate_simulations": False,
},
},
)
print(f"Dataset #{dataset_id}: {len(source['runs'])} scenarios")
Utility Methods
health_check
Check if the API is reachable.
health_check() -> dict
Returns: dict - Status information
Example:
status = client.health_check()
print(f"API Status: {status['status']}")
RunBuilder
A builder for incrementally constructing run result objects as an agent executes tests. Once complete, the result can be deployed via the client.
Constructor
RunBuilder()
No parameters. Creates a run in "pending" status.
RunBuilder.start
Mark the run as started. Records the current timestamp.
run.start() -> RunBuilder
Returns: self (for chaining)
RunBuilder.add_test
Create and register a new test within this run.
run.add_test(test_id: str) -> TestBuilder
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
test_id | str | Yes | Unique identifier for the test case |
Returns: TestBuilder - A builder for the individual test
RunBuilder.complete
Mark the run as completed. Records the current timestamp.
run.complete(status: str = "completed") -> RunBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
status | str | No | "completed" | Final status ("completed" or "failed") |
Returns: self (for chaining)
RunBuilder.build
Serialize the full run result to a dict.
run.build() -> dict[str, Any]
Returns: A dict matching the run result schema, ready to be passed to client.create_run(result=...). Aggregate metrics are computed automatically from action results.
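The exact aggregation formula isn't specified; the sketch below shows one plausible way pass counts could be derived from per-action match statuses (illustrative only, not the SDK's implementation):

```python
def aggregate_metrics(tests):
    """Count tests whose graded actions (those carrying a match_status)
    all matched exactly, partially, or similarly."""
    passed = 0
    for test in tests:
        graded = [a["match_status"] for a in test.get("actions", [])
                  if "match_status" in a]  # user inputs carry no match_status
        if all(s in ("exact", "partial", "similar") for s in graded):
            passed += 1
    return {"total_tests": len(tests), "tests_passed": passed,
            "tests_failed": len(tests) - passed}
```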
RunBuilder.deploy
Build the result and submit it as a new run via the API.
run.deploy(
client: AshrLabsClient,
dataset_id: int,
tenant_id: int | None = None,
runner_id: int | None = None,
agent_id: int | None = None
) -> dict[str, Any]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
client | AshrLabsClient | Yes | - | An authenticated client instance |
dataset_id | int | Yes | - | The dataset this run is for |
tenant_id | int | No | auto | The tenant (auto-resolved if omitted) |
runner_id | int | No | None | ID of the user who ran the test |
agent_id | int | No | None | Agent to auto-link the dataset to |
Returns: The created run object from the API
Example:
from ashr_labs import AshrLabsClient, RunBuilder
client = AshrLabsClient(api_key="tp_...")
run = RunBuilder()
run.start()
test = run.add_test("bank_analysis")
test.start()
test.add_user_text(text="Analyze this", description="User prompt")
test.add_tool_call(
expected={"tool_name": "analyze", "arguments": {"data": "input"}},
actual={"tool_name": "analyze", "arguments": {"data": "input"}},
match_status="exact",
)
test.complete()
run.complete()
created_run = run.deploy(client, dataset_id=42)
print(f"Run #{created_run['id']} created")
TestBuilder
Builds a single test result incrementally. Returned by RunBuilder.add_test().
TestBuilder.start
Mark the test as started. Records the current timestamp.
test.start() -> TestBuilder
Returns: self (for chaining)
TestBuilder.add_user_file
Record a user file input action.
test.add_user_file(
file_path: str,
description: str,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
file_path | str | Yes | - | Path to the file in the dataset |
description | str | Yes | - | Description of the action |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
TestBuilder.add_user_text
Record a user text input action.
test.add_user_text(
text: str,
description: str,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
text | str | Yes | - | The user's text input |
description | str | Yes | - | Description of the action |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
TestBuilder.add_tool_call
Record an agent tool call action with expected vs actual comparison.
test.add_tool_call(
expected: dict[str, Any],
actual: dict[str, Any],
match_status: str,
divergence_notes: str | None = None,
argument_comparison: dict[str, Any] | None = None,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
expected | dict | Yes | - | Expected tool call (tool_name, arguments) |
actual | dict | Yes | - | Actual tool call made by the agent |
match_status | str | Yes | - | "exact", "partial", or "mismatch" |
divergence_notes | str | No | None | Notes explaining the divergence |
argument_comparison | dict | No | None | Structured diff from compare_args_structural(). Recommended — the backend grader may skip tool calls without it. |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
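In practice you would pass the output of compare_args_structural() as argument_comparison. As a self-contained illustration of the kind of per-key diff such a structure can carry (the real helper's output shape is not documented here and may differ), here is a hand-built comparison:

```python
# Hypothetical expected vs actual tool calls for a single agent turn.
expected = {"tool_name": "analyze", "arguments": {"data": "input", "mode": "fast"}}
actual = {"tool_name": "analyze", "arguments": {"data": "input", "mode": "slow"}}

# Hand-built per-key diff; in real code, prefer compare_args_structural().
comparison = {
    key: {
        "expected": expected["arguments"].get(key),
        "actual": actual["arguments"].get(key),
        "match": expected["arguments"].get(key) == actual["arguments"].get(key),
    }
    for key in expected["arguments"]
}
```

The resulting dict could then be attached via test.add_tool_call(expected=expected, actual=actual, match_status="partial", argument_comparison=comparison).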
TestBuilder.add_agent_response
Record an agent text response with expected vs actual comparison.
test.add_agent_response(
expected_response: dict[str, Any],
actual_response: dict[str, Any],
match_status: str,
semantic_similarity: float | None = None,
divergence_notes: str | None = None,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
expected_response | dict | Yes | - | The expected response content |
actual_response | dict | Yes | - | The actual response from the agent |
match_status | str | Yes | - | "exact", "similar", or "divergent" |
semantic_similarity | float | No | None | Similarity score (0.0 to 1.0) |
divergence_notes | str | No | None | Notes explaining the divergence |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
TestBuilder.set_vm_stream
Attach VM session logs to this test. For agents that operate in a browser or virtual machine.
test.set_vm_stream(
provider: str,
session_id: str | None = None,
duration_ms: int | None = None,
logs: list[dict] | None = None,
metadata: dict | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
provider | str | Yes | - | VM provider name (e.g. "browserbase", "scrapybara", "steel") |
session_id | str | No | None | Provider session ID for linking |
duration_ms | int | No | None | Total session duration in milliseconds |
logs | list[dict] | No | None | Timestamped log entries (see below) |
metadata | dict | No | None | Additional provider-specific metadata |
Log entry format: Each entry should have ts (int, ms offset from start) and type (str):
{"ts": 0, "type": "navigation", "data": {"url": "https://..."}}
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#btn"}}
{"ts": 3000, "type": "error", "data": {"message": "Element not found"}}
Example:
test.set_vm_stream(
provider="browserbase",
session_id="sess_abc123",
duration_ms=12000,
logs=[
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
{"ts": 2000, "type": "action", "data": {"action": "click", "selector": "#submit"}},
{"ts": 5000, "type": "network", "data": {"method": "POST", "url": "/api/order", "status": 201}},
],
)
Returns: self (for chaining)
TestBuilder.set_kernel_vm
Convenience method for attaching a Kernel browser session. Sets provider="kernel" and exposes Kernel-specific metadata fields as named parameters. Fields map to Kernel's browser API response.
test.set_kernel_vm(
session_id: str,
duration_ms: int | None = None,
logs: list[dict] | None = None,
*,
live_view_url: str | None = None,
cdp_ws_url: str | None = None,
replay_id: str | None = None,
replay_view_url: str | None = None,
headless: bool | None = None,
stealth: bool | None = None,
viewport: dict | None = None,
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
session_id | str | Yes | - | Kernel browser session ID |
duration_ms | int | No | None | Total session duration in milliseconds |
logs | list[dict] | No | None | Timestamped log entries (same format as set_vm_stream) |
live_view_url | str | No | None | Remote live-view URL (browser_live_view_url) |
cdp_ws_url | str | No | None | Chrome DevTools Protocol WebSocket URL |
replay_id | str | No | None | ID of the session recording |
replay_view_url | str | No | None | URL to view the session replay |
headless | bool | No | None | Whether the session ran in headless mode |
stealth | bool | No | None | Whether anti-bot stealth mode was enabled |
viewport | dict | No | None | Browser viewport, e.g. {"width": 1920, "height": 1080} |
Example:
test.set_kernel_vm(
session_id="kern_sess_abc123",
duration_ms=15000,
logs=[
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
{"ts": 3000, "type": "screenshot", "data": {"s3_key": "vm-streams/.../frame.png"}},
],
replay_id="replay_abc123",
replay_view_url="https://www.kernel.sh/replays/replay_abc123",
stealth=True,
viewport={"width": 1920, "height": 1080},
)
Returns: self (for chaining)
TestBuilder.complete
Mark the test as completed. Records the current timestamp.
test.complete(status: str = "completed") -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
status | str | No | "completed" | Final status ("completed" or "failed") |
Returns: self (for chaining)
TestBuilder.build
Serialize this test to a dict matching the run result schema.
test.build() -> dict[str, Any]
Returns: A dict with test_id, status, action_results, started_at, and completed_at.
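As an illustration of that shape, here is a sketch of a built result (all values hypothetical, and the inner action-entry shape is illustrative; the run result schema is authoritative):

```python
# Hypothetical serialized test, sketching the top-level keys build() returns.
built = {
    "test_id": "bank_analysis",
    "status": "completed",
    "action_results": [
        # Entry shape is illustrative only.
        {"action_index": 0, "type": "user_text", "text": "Analyze this"},
    ],
    "started_at": "2024-01-01T00:00:00Z",
    "completed_at": "2024-01-01T00:00:05Z",
}
```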
EvalRunner
Runs an agent against every scenario in a dataset and records results. This is the high-level API that encapsulates the full eval loop — iterating scenarios, calling the agent, comparing tool calls and text, and producing a RunBuilder.
Constructor
EvalRunner(dataset_source: dict[str, Any])
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_source | dict | Yes | - | The dataset_source dict from a dataset (contains "runs" key) |
The EvalRunner does not perform local grading. It pairs expected vs actual tool calls and text responses, then submits everything for server-side grading via the backend's LLM-based judge. Tool call arguments are compared structurally using compare_args_structural(). Text responses are submitted with match_status="pending" for server-side evaluation.
Example:
from ashr_labs import EvalRunner
runner = EvalRunner(dataset["dataset_source"])
EvalRunner.from_dataset (class method)
Create an EvalRunner by fetching a dataset from the API.
EvalRunner.from_dataset(
client: AshrLabsClient,
dataset_id: int,
**kwargs
) -> EvalRunner
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
client | AshrLabsClient | Yes | An authenticated client |
dataset_id | int | Yes | The dataset ID to fetch |
**kwargs | - | No | Passed to EvalRunner.__init__() |
Returns: EvalRunner - A configured runner ready to call .run()
Example:
runner = EvalRunner.from_dataset(client, dataset_id=322)
EvalRunner.run
Run the agent against every scenario and return a populated RunBuilder.
runner.run(
agent: Agent,
*,
on_scenario: Callable | None = None,
on_action: Callable | None = None,
on_environment: Callable | None = None,
max_workers: int = 1,
) -> RunBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
agent | Agent | Yes | - | An object implementing the Agent protocol |
on_scenario | Callable | No | None | Called at the start of each scenario: (scenario_id, scenario_dict) |
on_action | Callable | No | None | Called for each action: (action_index, action_dict) |
on_environment | Callable | No | None | Called for environment actions: (content, action_dict) -> dict |
max_workers | int | No | 1 | Number of scenarios to run in parallel. When >1, each scenario gets a copy.deepcopy of the agent. Important: Most LLM clients (Anthropic, OpenAI) hold connection pools that cannot be deep-copied. Use max_workers=1 unless your agent implements __deepcopy__. |
Returns: RunBuilder - A populated builder ready for .build() or .deploy()
Example:
# Sequential (default)
run = runner.run(agent)
# With environment handler — feed external context to the agent
def handle_env(content, action):
return agent.respond(content)
run = runner.run(agent, on_environment=handle_env)
# Parallel — run 4 scenarios at a time (only if agent supports deepcopy)
run = runner.run(agent, max_workers=4)
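If you do need max_workers > 1, one workaround (a sketch, assuming your agent wraps a non-copyable LLM client) is to implement __deepcopy__ so each worker gets a fresh client instead of a copied connection pool:

```python
import copy
from typing import Any


class MyAgent:
    """Hypothetical agent wrapping a client that cannot be deep-copied."""

    def __init__(self) -> None:
        # Stand-in for e.g. an LLM client holding a connection pool.
        self.client = object()
        self.history: list[str] = []

    def respond(self, message: str) -> dict[str, Any]:
        self.history.append(message)
        return {"text": "ok", "tool_calls": []}

    def reset(self) -> None:
        self.history.clear()

    def __deepcopy__(self, memo: dict) -> "MyAgent":
        # Recreate the client rather than copying it; deep-copy only plain state.
        clone = MyAgent()
        clone.history = copy.deepcopy(self.history, memo)
        return clone
```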
EvalRunner.run_and_deploy
Run the eval and submit results in one call.
runner.run_and_deploy(
agent: Agent,
client: AshrLabsClient,
dataset_id: int | None = None,
*,
on_scenario: Callable | None = None,
on_action: Callable | None = None,
on_environment: Callable | None = None,
max_workers: int = 1,
**deploy_kwargs,
) -> dict[str, Any]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
agent | Agent | Yes | - | An object implementing the Agent protocol |
client | AshrLabsClient | Yes | - | An authenticated client |
dataset_id | int or None | No | None | The dataset to submit against |
on_scenario | Callable | No | None | Callback per scenario |
on_action | Callable | No | None | Callback per action |
on_environment | Callable | No | None | Callback for environment actions (see run()) |
max_workers | int | No | 1 | Number of scenarios to run in parallel (default sequential) |
**deploy_kwargs | - | No | - | Extra kwargs passed to RunBuilder.deploy() |
Returns: The created run object from the API
Example:
# Sequential
created = runner.run_and_deploy(agent, client, dataset_id=322)
print(f"Run #{created['id']} submitted")
# Parallel
created = runner.run_and_deploy(agent, client, dataset_id=322, max_workers=4)
Agent Protocol
A @runtime_checkable Protocol that defines the interface agents must implement.
@runtime_checkable
class Agent(Protocol):
def respond(self, message: str) -> dict[str, Any]: ...
def reset(self) -> None: ...
respond
Process a user message and return the agent's response.
Parameters:
| Parameter | Type | Description |
|---|---|---|
message | str | The user's message text |
Returns: A dict with:
"text"(str): The agent's text response"tool_calls"(list[dict]): Tool calls made during this turn, each with"name"(str) and"arguments"(dict) keys
arguments vs arguments_json: The Agent protocol returns tool arguments as a dict under the "arguments" key. However, RunBuilder and the API store them as a JSON string under "arguments_json". EvalRunner handles this conversion automatically (eval.py:187-193). If you use RunBuilder directly, pass "arguments_json" (a JSON string) to add_tool_call(). The extract_tool_args() helper accepts both formats, so comparators work either way.
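The conversion between the two forms is a plain json.dumps / json.loads round trip (the argument values here are hypothetical):

```python
import json

# Dict form, as returned by Agent.respond() under "tool_calls".
arguments = {"order_id": "ORD-123"}

# JSON-string form, as RunBuilder and the API store it under "arguments_json".
arguments_json = json.dumps(arguments)

# Round-trips back to the dict form.
assert json.loads(arguments_json) == arguments
```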
reset
Clear conversation state for a new scenario. Called before each scenario begins.
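A minimal agent satisfying the protocol might look like the following sketch (the Protocol is restated locally so the example is self-contained; in real code, import Agent from ashr_labs):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    def respond(self, message: str) -> dict[str, Any]: ...
    def reset(self) -> None: ...


class EchoAgent:
    """Hypothetical minimal agent: echoes the input, makes no tool calls."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def respond(self, message: str) -> dict[str, Any]:
        self.history.append(message)
        return {"text": f"You said: {message}", "tool_calls": []}

    def reset(self) -> None:
        # Called before each scenario to clear conversation state.
        self.history.clear()


assert isinstance(EchoAgent(), Agent)  # structural check at runtime
```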
isinstance check
from ashr_labs import Agent
assert isinstance(my_agent, Agent) # Works at runtime
Comparator Functions
All comparator functions are standalone, stdlib-only, and importable from the top-level package.
strip_markdown
Remove markdown formatting from text.
strip_markdown(text: str) -> str
Removes bold/italic markers, headers, bullets, and markdown links. Collapses whitespace.
Example:
strip_markdown("**Bold** and [link](https://x.com)")
# => "Bold and link"
tokenize
Lowercase, strip markdown and punctuation, split into word tokens.
tokenize(text: str) -> list[str]
Example:
tokenize("Order **ORD-123** shipped!")
# => ["order", "ord123", "shipped"]
fuzzy_str_match
Check if two strings are semantically close enough to count as matching.
fuzzy_str_match(a: str, b: str, threshold: float | None = None) -> bool
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
a | str | Yes | - | First string |
b | str | Yes | - | Second string |
threshold | float | No | adaptive | Word-overlap threshold. If None: 0.35 for <=5 words, 0.40 for <=8, 0.55 otherwise |
Returns: True if the strings match closely enough
Checks in order: exact match after normalization, containment, then word-set overlap.
Example:
fuzzy_str_match("Customer wants a refund", "customer wants refund") # True
fuzzy_str_match("apple banana", "cherry grape") # False
extract_tool_args
Extract arguments from a tool call dict, handling both formats.
extract_tool_args(tool_call: dict) -> dict
Handles {"arguments": {...}} (dict form) and {"arguments_json": "..."} (JSON string form). Prefers the dict form if both are present.
Example:
extract_tool_args({"arguments_json": '{"order_id": "ORD-123"}'})
# => {"order_id": "ORD-123"}
extract_tool_args({"arguments": {"order_id": "ORD-123"}})
# => {"order_id": "ORD-123"}
compare_tool_args
Compare expected vs actual tool call arguments.
compare_tool_args(expected: dict, actual: dict) -> tuple[str, str | None]
Parameters:
| Parameter | Type | Description |
|---|---|---|
expected | dict | Expected tool call (with arguments or arguments_json) |
actual | dict | Actual tool call made by the agent |
Returns: A tuple of (match_status, divergence_notes):
- match_status: "exact", "partial", or "mismatch"
- divergence_notes: Human-readable diff summary, or None if exact
String arguments are compared using fuzzy_str_match. Non-string values use exact equality. Extra arguments in the actual call don't cause divergence.
Example:
status, notes = compare_tool_args(
{"arguments": {"order_id": "ORD-123"}},
{"arguments": {"order_id": "ORD-123", "extra": "field"}},
)
# => ("exact", None)
status, notes = compare_tool_args(
{"arguments": {"order_id": "ORD-123", "reason": "damaged item"}},
{"arguments": {"order_id": "ORD-999", "reason": "item was damaged"}},
)
# => ("partial", "'order_id': expected='ORD-123' actual='ORD-999'")
text_similarity
Compute similarity between two text strings.
text_similarity(text_a: str, text_b: str) -> float
Returns: A float between 0.0 and 1.0
Uses cosine similarity on word frequency vectors, plus:
- Entity bonus (+0.20): for matching order IDs (ORD-*), refund IDs (REF-*), prices ($*), dates (YYYY-MM-DD), and tracking URLs
- Concept bonus (+0.10): for matching domain concepts (refund/credited, shipped/transit/delivered, stock/available, etc.)
Example:
text_similarity(
"Your order ORD-123 has shipped and is on the way",
"Order ORD-123 has been shipped and is in transit",
)
# => 0.78
Data Types
User
class User(TypedDict, total=False):
id: int
created_at: str
email: str
name: str | None
tenant: int
is_active: bool
Tenant
class Tenant(TypedDict, total=False):
id: int
created_at: str
tenant_name: str
is_active: bool
Session
class Session(TypedDict):
status: str
user: User
tenant: Tenant
Dataset
class Dataset(TypedDict, total=False):
id: int
created_at: str
tenant: int
creator: int
name: str
description: str | None
agent_id: int | None
agent_details: dict[str, Any] | None # {"id": int, "name": str}
dataset_source: dict[str, Any]
Run
class Run(TypedDict, total=False):
id: int
created_at: str
dataset: int
tenant: int
runner: int
result: dict[str, Any]
ObservabilityTrace
class ObservabilityTrace(TypedDict, total=False):
id: str
name: str
user_id: str | None # End-user identifier
session_id: str | None # Conversation/session grouping
metadata: dict[str, Any] | None
tags: list[str]
created_at: str | None
output: dict[str, Any] | None
observations: list[ObservabilityObservation]
ObservabilityObservation
class ObservabilityObservation(TypedDict, total=False):
id: str
name: str
type: str # "span", "generation", "event"
parent_observation_id: str | None
input: dict[str, Any] | None
output: dict[str, Any] | None
metadata: dict[str, Any] | None
model: str | None # LLM model name (generations only)
usage: dict[str, int] | None # {"input_tokens": ..., "output_tokens": ...}
level: str | None # "DEBUG", "DEFAULT", "WARNING", "ERROR"
status_message: str | None
start_time: str | None
end_time: str | None
SdkNote
class SdkNote(TypedDict, total=False):
id: int
created_at: str
updated_at: str
title: str
content: str
category: str # "info", "warning", "breaking_change", "best_practice", "deprecation"
severity: str # "info", "warning", "critical"
tenant_id: int | None
agent_id: int | None
active_from: str
expires_at: str | None
is_archived: bool
note_metadata: dict[str, Any]
Request
class Request(TypedDict, total=False):
id: int
created_at: str
requestor_id: int
requestor_tenant: int
request_name: str
request_status: str
request_input_schema: dict[str, Any] | None
request: dict[str, Any]
APIKey
class APIKey(TypedDict, total=False):
id: int
key: str # Only present on creation
key_prefix: str
name: str
scopes: list[str]
user_id: int
tenant_id: int
created_at: str
last_used_at: str | None
expires_at: str | None
is_active: bool
ToolCall
class ToolCall(TypedDict, total=False):
name: str
arguments_json: str
ExpectedResponse
class ExpectedResponse(TypedDict, total=False):
tool_calls: list[ToolCall]
text: str
Action
class Action(TypedDict, total=False):
actor: str # "user" or "agent"
content: str
name: str
expected_response: ExpectedResponse
Scenario
class Scenario(TypedDict, total=False):
title: str
actions: list[Action]
Agent
class Agent(TypedDict, total=False):
id: int
created_at: str
tenant_id: int
creator_id: int | None
name: str
description: str | None
config: AgentConfig
is_active: bool
dataset_count: int
AgentConfig
class AgentConfig(TypedDict, total=False):
form_data: dict # Dataset generation preset
tool_definitions: list[ToolDefinition]
behavior_rules: list[BehaviorRule]
grading_config: GradingConfig
ToolDefinition
class ToolDefinition(TypedDict, total=False):
name: str
description: str
required: bool # True = must be called, False = optional
BehaviorRule
class BehaviorRule(TypedDict, total=False):
rule: str
strictness: str # "required" | "expected" | "optional"
GradingConfig
class GradingConfig(TypedDict, total=False):
tool_strictness: dict[str, str] # tool_name -> "required" | "expected" | "optional"
text_similarity_threshold: float