API Reference
Complete reference for all classes and methods in the Ashr Labs SDK.
The SDK serves two products:
- Testing Platform — generate eval datasets, run your agent against them, submit graded results. Core methods: create_request, create_run, EvalRunner, RunBuilder.
- Observability (separate product) — trace your agent's production behavior (LLM calls, tool invocations, latency, errors). Core methods: trace(), Span, Generation, list_observability_traces. Requires the observability feature flag.
These are independent products that share the same SDK and API key.
AshrLabsClient
The main client class for interacting with the Ashr Labs API.
Constructor
AshrLabsClient(
api_key: str,
base_url: str = "https://api.ashr.io/testing-platform-api",
timeout: int = 30
)
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
api_key | str | Yes | - | Your API key (must start with tp_) |
base_url | str | No | Production URL | Base URL of the API |
timeout | int | No | 30 | Request timeout in seconds |
Raises:
ValueError: If the API key format is invalid
Example:
# Minimal — just pass your API key
client = AshrLabsClient(api_key="tp_your_key_here")
# Custom timeout
client = AshrLabsClient(api_key="tp_your_key_here", timeout=60)
from_env (class method)
Create a client from environment variables.
AshrLabsClient.from_env(timeout: int = 30) -> AshrLabsClient
Reads ASHR_LABS_API_KEY (required) and ASHR_LABS_BASE_URL (optional) from the environment.
Raises:
RuntimeError: If ASHR_LABS_API_KEY is not set
Example:
# export ASHR_LABS_API_KEY="tp_your_key_here"
client = AshrLabsClient.from_env()
Session Methods
init
Initialize a session and validate authentication.
init() -> Session
Returns: Session - Session information containing user and tenant data
Raises:
AuthenticationError: If the API key is invalid or expired
Example:
# Validate credentials and get user/tenant info
session = client.init()
print(f"User ID: {session['user']['id']}")
print(f"Email: {session['user']['email']}")
print(f"Tenant ID: {session['tenant']['id']}")
print(f"Tenant Name: {session['tenant']['tenant_name']}")
Dataset Methods
get_dataset
Retrieve a dataset by ID.
get_dataset(
dataset_id: int,
include_signed_urls: bool = False,
url_expires_seconds: int = 3600
) -> Dataset
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_id | int | Yes | - | The ID of the dataset |
include_signed_urls | bool | No | False | Include signed S3 URLs for media |
url_expires_seconds | int | No | 3600 | URL expiration time in seconds |
Returns: Dataset - The dataset object
Raises:
NotFoundError: Dataset not found
AuthorizationError: No access to this dataset
Example:
dataset = client.get_dataset(
dataset_id=42,
include_signed_urls=True,
url_expires_seconds=7200
)
print(dataset["name"])
list_datasets
List datasets for a tenant.
list_datasets(
tenant_id: int | None = None,
limit: int = 50,
cursor: int | None = None,
include_signed_urls: bool = False,
url_expires_seconds: int = 3600
) -> dict
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
limit | int | No | 50 | Maximum results to return |
cursor | int | No | None | Pagination cursor (pass next_cursor from previous response) |
include_signed_urls | bool | No | False | Include signed S3 URLs |
url_expires_seconds | int | No | 3600 | URL expiration time |
Returns: dict with keys:
status: "ok"
datasets: List of dataset objects
next_cursor: ID for the next page, or null if no more results
Example:
# tenant_id auto-resolved from API key
response = client.list_datasets(limit=10)
for dataset in response["datasets"]:
print(f"{dataset['id']}: {dataset['name']}")
# Pagination
if response.get("next_cursor"):
next_page = client.list_datasets(limit=10, cursor=response["next_cursor"])
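The cursor pattern above generalizes to a loop that drains every page. A minimal sketch (the `fetch_all_datasets` helper is hypothetical, not part of the SDK):

```python
def fetch_all_datasets(list_datasets, limit=50):
    """Collect every dataset by following next_cursor until exhausted.

    `list_datasets` is any callable with client.list_datasets' keyword
    interface (limit, cursor); pass `client.list_datasets` directly.
    """
    items = []
    cursor = None
    while True:
        page = list_datasets(limit=limit, cursor=cursor)
        items.extend(page["datasets"])
        cursor = page.get("next_cursor")
        if cursor is None:  # null next_cursor means no more results
            return items
```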
Run Methods
create_run
Create a new test run.
create_run(
dataset_id: int,
result: dict[str, Any],
tenant_id: int | None = None,
runner_id: int | None = None
) -> Run
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_id | int | Yes | - | The dataset ID |
result | dict | Yes | - | Run results (metrics, status, etc.) |
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
runner_id | int | No | None | ID of user who ran the test |
Returns: Run - The created run object
Example:
run = client.create_run(
dataset_id=42,
result={
"status": "passed",
"score": 0.95,
"metrics": {
"accuracy": 0.98,
"latency_ms": 150
}
}
)
get_run
Retrieve a run by ID.
get_run(run_id: int) -> Run
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
run_id | int | Yes | The run ID |
Returns: Run - The run object
Raises:
NotFoundError: Run not found
Example:
run = client.get_run(run_id=99)
print(f"Score: {run['result']['score']}")
list_runs
List runs for a tenant or dataset.
list_runs(
dataset_id: int | None = None,
tenant_id: int | None = None,
limit: int = 50
) -> dict
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_id | int | No | None | Filter by dataset |
tenant_id | int | No | auto | Filter by tenant (auto-resolved if omitted) |
limit | int | No | 50 | Maximum results |
Returns: dict with keys:
status: "ok"
runs: List of run objects
Example:
# Get runs for a specific dataset
response = client.list_runs(dataset_id=42)
for run in response["runs"]:
print(f"Run #{run['id']}: {run['result']['status']}")
delete_run
Delete a test run.
delete_run(run_id: int) -> dict
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
run_id | int | Yes | The run ID to delete |
Returns: dict - Confirmation of deletion
Raises:
NotFoundError: Run not found
Example:
client.delete_run(run_id=99)
print("Run deleted")
Observability — Production Agent Tracing
This is a separate product from the Testing Platform. The testing platform (datasets, eval runs,
RunBuilder, EvalRunner) is for offline evaluation. Observability is for tracing your agent in production. They share the same SDK and API key but are independent features.
Trace your agent's production behavior — LLM calls, tool invocations, retrieval
steps, guardrail checks, and more. Requires the observability feature flag to
be enabled for your tenant.
Production-safe: tracing never raises exceptions or interferes with your
agent. If the backend is unreachable, trace.end() returns an error dict
instead of throwing.
client.trace
Start a new trace for a production agent interaction.
trace = client.trace(
name: str,
*,
user_id: str | None = None,
session_id: str | None = None,
metadata: dict | None = None,
tags: list[str] | None = None,
) -> Trace
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
name | str | Yes | - | Name for this trace (e.g. "handle-ticket") |
user_id | str | No | None | End-user ID for grouping |
session_id | str | No | None | Conversation/session ID |
metadata | dict | No | None | Arbitrary metadata |
tags | list[str] | No | None | Tags for filtering |
Returns: A Trace instance. Supports context manager (with) usage.
Trace methods
| Method | Description |
|---|---|
trace.span(name, *, input, metadata) | Create a top-level span |
trace.generation(name, *, model, input, metadata) | Create a top-level generation (LLM call) |
trace.event(name, *, input, metadata, level) | Record a point-in-time event |
trace.end(*, output) | Flush the trace to the backend. Never raises. |
trace.trace_id | Server-assigned trace ID (available after end()) |
Span methods
| Method | Description |
|---|---|
span.span(name, *, input, metadata) | Create a child span |
span.generation(name, *, model, input, metadata) | Create a child generation |
span.event(name, *, input, metadata, level) | Record an event under this span |
span.end(*, output, status_message, level) | Mark the span as complete |
Spans support context managers. If the body raises, the span auto-ends with level="ERROR" and the exception message is captured in status_message.
Generation methods
Inherits all Span methods, plus:
| Method | Description |
|---|---|
gen.end(*, output, usage, status_message, level) | Mark complete with token usage |
The usage dict accepts {"input_tokens": int, "output_tokens": int}.
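Usage is reported per generation, so trace-level token totals must be summed client-side. A small sketch (the `total_usage` helper is hypothetical, not part of the SDK):

```python
def total_usage(usages):
    """Sum per-generation usage dicts into trace-level token totals."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for usage in usages:
        totals["input_tokens"] += usage.get("input_tokens", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals
```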
Context managers (recommended)
Context managers ensure spans are always ended, even if your code throws:
with client.trace("handle-ticket", user_id="user_42") as trace:
with trace.generation("classify", model="claude-sonnet-4-6",
input=[{"role": "user", "content": "help"}]) as gen:
result = call_llm(...)
gen.end(output=result, usage={"input_tokens": 50, "output_tokens": 12})
with trace.span("tool:search", input={"query": "..."}) as tool:
data = search(...)
tool.end(output=data)
# If search() throws, the span auto-ends with level="ERROR"
# trace.end() is called automatically on exit
Manual instrumentation
trace = client.trace("support-chat", user_id="user_42", session_id="conv_abc")
gen = trace.generation("classify-intent", model="claude-sonnet-4-6",
input=[{"role": "user", "content": "Reset my password"}])
gen.end(output={"intent": "password_reset"},
usage={"input_tokens": 50, "output_tokens": 12})
tool = trace.span("tool:reset_password", input={"user_id": "user_42"})
tool.end(output={"success": True})
trace.event("guardrail-check", input={"passed": True})
result = trace.end(output={"resolution": "password_reset_complete"})
print(trace.trace_id) # server-assigned ID
list_observability_traces
List traces for the current tenant.
client.list_observability_traces(
user_id: str | None = None,
session_id: str | None = None,
limit: int = 50,
page: int = 1,
) -> dict
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
user_id | str | No | None | Filter by end-user |
session_id | str | No | None | Filter by session |
limit | int | No | 50 | Max results per page (max 100) |
page | int | No | 1 | Page number |
Returns: {"status": "ok", "traces": [...], "total": int}
get_observability_trace
Get a single trace with its full observation tree.
client.get_observability_trace(trace_id: str) -> dict
Returns: {"status": "ok", "trace": {...}} — the trace includes an observations list with id, name, type, parent_observation_id, input, output, metadata, model, usage, level, start_time, end_time.
get_observability_analytics
Get analytics overview for the current tenant.
client.get_observability_analytics(days: int = 7) -> dict
Returns: {"status": "ok", "overview": {...}, "tool_performance": [...], "model_usage": [...]}
Overview includes: total_traces, avg_latency_ms, p95_latency_ms, total_input_tokens, total_output_tokens, error_rate, total_tool_calls, unique_users, unique_sessions.
get_observability_errors / get_observability_tool_errors
client.get_observability_errors(days: int = 7, limit: int = 50, page: int = 1) -> dict
client.get_observability_tool_errors(days: int = 7, limit: int = 50, page: int = 1) -> dict
Returns: {"status": "ok", "traces": [...], "total": int} — traces with errors or tool failures, most recent first.
SDK Notes — Platform Advisories
SDK Notes are platform advisories delivered to your SDK from Ashr Labs. They communicate context changes, best practices, deprecations, or breaking changes that may affect how you configure or run your agent.
Notes are automatically fetched when the client initializes (via init()).
You can also refresh them on demand.
client.notes (property)
Get cached SDK notes from the last init() or get_notes() call. No network
request is made.
client.notes -> list[SdkNote]
Returns: List of active notes for your tenant.
Example:
client = AshrLabsClient(api_key="tp_...")
client.init()  # notes are fetched during init() and cached
for note in client.notes:
print(f"[{note['severity']}] {note['title']}: {note['content']}")
get_notes
Fetch fresh SDK notes from the platform. Updates the cached client.notes.
get_notes(agent_id: int | None = None) -> list[SdkNote]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
agent_id | int | No | None | Include notes targeted at this specific agent |
Returns: List of active notes (global + tenant-specific, plus agent-specific if agent_id is provided).
Example:
# Refresh notes
notes = client.get_notes()
# Filter by agent
notes = client.get_notes(agent_id=42)
# Check for breaking changes
breaking = [n for n in notes if n['category'] == 'breaking_change']
if breaking:
print("⚠ Breaking changes detected:")
for n in breaking:
print(f" {n['title']}: {n['content']}")
Note categories: info, warning, breaking_change, best_practice, deprecation
Severity levels: info, warning, critical
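A common pattern is surfacing only the more serious notes at startup. A sketch built on the documented severity levels (the helper itself is hypothetical):

```python
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}

def notes_at_or_above(notes, min_severity="warning"):
    """Keep notes whose severity meets the given threshold."""
    floor = SEVERITY_ORDER[min_severity]
    return [n for n in notes
            if SEVERITY_ORDER.get(n.get("severity", "info"), 0) >= floor]
```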
Request Methods
create_request
Create a dataset generation request.
create_request(
request_name: str,
request: dict[str, Any],
request_input_schema: dict[str, Any] | None = None,
tenant_id: int | None = None,
requestor_id: int | None = None,
) -> Request
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
request_name | str | Yes | - | Name/title for the request |
request | dict | Yes | - | The generation config (see below) |
request_input_schema | dict | No | auto | JSON Schema for validating the request. A permissive default is sent if omitted. If your agent has tools, include them here under the "tools" key so they're auto-saved as skill templates. |
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
requestor_id | int | No | auto | ID of requesting user (auto-resolved if omitted) |
Returns: Request - The created request object
Generation config structure (the request dict):
The config has two required sections (agent and context) and several optional sections.
metadata (optional)
| Field | Type | Description |
|---|---|---|
dataset_name | str | Name for the generated dataset |
description | str | Description of what this dataset tests |
agent (required)
At least one of name, description, or system_prompt is required.
| Field | Type | Default | Description |
|---|---|---|---|
name | str | - | Agent name |
description | str | - | What the agent does |
system_prompt | str | - | System prompt given to the agent |
tools | list[dict] | [] | Tools the agent can call (see below) |
accepted_inputs | dict | text only | Input modalities (see below) |
output_format | dict | {"type": "text"} | "text" or "structured" with optional schema |
input_schema | dict | - | Custom structured input schema (see below) |
Tool definition:
{
"name": "tool_name", # snake_case tool name
"description": "What it does", # Used by test generator for realistic scenarios
"parameters": { # JSON Schema for tool parameters
"type": "object",
"properties": {
"arg_name": {"type": "string", "description": "What this arg is"},
},
"required": ["arg_name"],
},
"returns": { # (optional) Return value schema
"type": "object",
"description": "What the tool returns",
},
}
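The backend's exact validation rules aren't documented, but a local sanity check mirroring the documented tool shape (a hypothetical helper, not part of the SDK) can catch common mistakes before submitting:

```python
def check_tool_definition(tool):
    """Return a list of problems with a tool definition (empty list = OK)."""
    problems = []
    name = tool.get("name", "")
    if not name or not all(ch.islower() or ch.isdigit() or ch == "_" for ch in name):
        problems.append("name should be snake_case")
    if not tool.get("description"):
        problems.append("description is missing (the test generator relies on it)")
    params = tool.get("parameters")
    if not isinstance(params, dict) or params.get("type") != "object":
        problems.append('parameters should be a JSON Schema with "type": "object"')
    else:
        props = params.get("properties", {})
        for arg in params.get("required", []):
            if arg not in props:
                problems.append(f"required arg {arg!r} not in properties")
    return problems
```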
Accepted inputs — values can be bool or {"enabled": bool}:
"accepted_inputs": {
"text": True, # (default True) Text input
"audio": False, # Audio: mp3, wav, m4a, ogg, webm
"file": False, # Files: pdf, txt, csv, json, xml, html, md, docx, xlsx
"image": False, # Images: jpg, png, gif, webp
"video": False, # Video: mp4, webm, mov, avi
"conversation": False, # Multi-participant conversations with inferred roles
}
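Because each value may be a bool or `{"enabled": bool}`, it helps to normalize before comparing configs. A hypothetical normalizer reflecting the documented defaults:

```python
def normalize_accepted_inputs(accepted):
    """Flatten each modality value to a plain bool, applying the defaults."""
    defaults = {"text": True, "audio": False, "file": False,
                "image": False, "video": False, "conversation": False}
    out = dict(defaults)
    for key, value in (accepted or {}).items():
        out[key] = bool(value["enabled"]) if isinstance(value, dict) else bool(value)
    return out
```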
Input schema — define structured data users provide to the agent:
"input_schema": {
"name": "OrderInput",
"description": "Data the customer provides",
"fields": [
{"name": "order_id", "type": "string", "description": "Order ID", "required": True},
{"name": "priority", "type": "string", "description": "Priority level", "enum": ["low", "medium", "high"]},
],
"example": {"order_id": "ORD-123", "priority": "high"},
}
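The fields list maps naturally onto standard JSON Schema. How the platform performs that translation isn't documented; the sketch below shows one plausible mapping (a hypothetical helper, for illustration only):

```python
def fields_to_json_schema(input_schema):
    """Translate the fields list into a standard JSON Schema object."""
    props, required = {}, []
    for field in input_schema.get("fields", []):
        prop = {"type": field["type"]}
        if "description" in field:
            prop["description"] = field["description"]
        if "enum" in field:
            prop["enum"] = field["enum"]
        props[field["name"]] = prop
        if field.get("required"):
            required.append(field["name"])
    return {"type": "object", "title": input_schema.get("name", ""),
            "properties": props, "required": required}
```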
context (required)
At least one of domain, use_case, or scenario_context is required.
| Field | Type | Default | Description |
|---|---|---|---|
domain | str | - | Domain: "banking", "healthcare", "e-commerce", "legal", "education", "customer_service", "technology", "travel", "insurance", "other" |
use_case | str | - | Specific use case description (min 10 chars recommended) |
scenario_context | str | - | Additional scenario context |
user_persona | dict | - | {"type": str, "description": str} — who interacts with the agent |
sample_data | dict | - | {"examples": [{...}]} — example data for realistic tests |
test_config (optional)
| Field | Type | Default | Description |
|---|---|---|---|
num_variations | int | 5 | Number of test scenarios (1-50) |
strategy | str | "balanced" | "focused", "diverse", or "balanced" |
coverage | dict | all True | {"happy_path": bool, "edge_cases": bool, "error_handling": bool, "boundary_values": bool} |
complexity_distribution | dict | auto | {"simple": 0.3, "moderate": 0.5, "complex": 0.2} — must sum to ~1.0 |
focus_areas | list[str] | [] | Specific areas to focus testing on |
exclude | list[str] | [] | Scenarios or test types to exclude |
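The constraints in this table (1-50 variations, a known strategy, a distribution summing to ~1.0) can be checked locally before submitting. A hypothetical pre-flight check, not part of the SDK:

```python
def check_test_config(cfg):
    """Return a list of constraint violations in a test_config dict."""
    problems = []
    n = cfg.get("num_variations", 5)
    if not 1 <= n <= 50:
        problems.append("num_variations must be between 1 and 50")
    if cfg.get("strategy", "balanced") not in ("focused", "diverse", "balanced"):
        problems.append("strategy must be focused, diverse, or balanced")
    dist = cfg.get("complexity_distribution")
    if dist is not None and abs(sum(dist.values()) - 1.0) > 0.01:
        problems.append("complexity_distribution must sum to ~1.0")
    return problems
```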
generation_options (required)
| Field | Type | Default | Description |
|---|---|---|---|
generate_audio | bool | False | Generate audio test inputs |
generate_files | bool | False | Generate file test inputs (PDF, CSV, etc.) |
generate_images | bool | False | Generate image test inputs |
generate_videos | bool | False | Generate video test inputs |
generate_simulations | bool | False | Generate website session replay simulation videos |
Example:
req = client.create_request(
request_name="Support Agent Eval",
request={
"metadata": {"dataset_name": "Support Eval"},
"agent": {
"name": "Support Bot",
"description": "Answers customer questions about orders and refunds",
"system_prompt": "You are a helpful support agent.",
"tools": [
{
"name": "lookup_order",
"description": "Look up an order by ID",
"parameters": {
"type": "object",
"properties": {"order_id": {"type": "string", "description": "The order ID"}},
"required": ["order_id"],
},
},
],
"accepted_inputs": {"text": True, "audio": False, "file": False},
"output_format": {"type": "text"},
},
"context": {
"domain": "e-commerce",
"use_case": "Customers asking about order status, requesting refunds",
"scenario_context": "An online retail store called ShopWave",
"user_persona": {"type": "customer", "description": "Online shoppers"},
},
"test_config": {
"num_variations": 10,
"strategy": "diverse",
"coverage": {"happy_path": True, "edge_cases": True, "error_handling": True},
"complexity_distribution": {"simple": 0.3, "moderate": 0.5, "complex": 0.2},
},
"expected_behaviors": {
"must_include": ["order"],
"expected_tools": ["lookup_order"],
},
},
)
# Use wait_for_request or generate_dataset instead of manual polling
completed = client.wait_for_request(req["id"], timeout=300)
get_request
Retrieve a request by ID.
get_request(request_id: int) -> Request
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
request_id | int | Yes | The request ID |
Returns: Request - The request object
Raises:
NotFoundError: Request not found
Example:
req = client.get_request(request_id=123)
print(f"Status: {req['request_status']}")
list_requests
List requests for a tenant.
list_requests(
tenant_id: int | None = None,
status: str | None = None,
limit: int = 50,
cursor: int | None = None
) -> dict
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
tenant_id | int | No | auto | The tenant ID (auto-resolved if omitted) |
status | str | No | None | Filter by status |
limit | int | No | 50 | Maximum results |
cursor | int | No | None | Pagination cursor |
Returns: dict with keys:
status: "ok"
requests: List of request objects
Example:
# Get pending requests
response = client.list_requests(status="pending")
for req in response["requests"]:
print(f"Request #{req['id']}: {req['request_name']}")
Agent Methods
Agents group datasets and define grading behavior. Each dataset can belong to one agent.
list_agents
List all agents for your tenant with dataset counts.
list_agents() -> list[Agent]
Returns: List of agent objects with id, name, description, config, dataset_count.
Example:
agents = client.list_agents()
for agent in agents:
print(f"{agent['name']}: {agent['dataset_count']} datasets")
create_agent
Create a new agent.
create_agent(
name: str,
description: str | None = None,
config: dict | None = None
) -> Agent
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
name | str | Yes | - | Agent name (unique per tenant) |
description | str | No | None | What this agent does |
config | dict | No | None | Agent config (tool_definitions, behavior_rules, grading_config) |
Config structure:
{
"tool_definitions": [
{"name": "fetch_data", "required": True, "description": "Fetch live data"},
{"name": "end_session", "required": False, "description": "End conversation"},
],
"behavior_rules": [
{"rule": "Always fetch before quoting data", "strictness": "required"},
{"rule": "Save caller name", "strictness": "expected"},
],
"grading_config": {
"tool_strictness": {
"fetch_data": "required", # must be called — no recovery
"end_session": "optional", # text recovery OK
"await_user_response": "optional",
},
"text_similarity_threshold": 0.3, # lower for multilingual agents
},
}
Grading strictness levels:
"required" — tool must be called. NOT_CALLED = failure.
"expected" — tool should be called. NOT_CALLED = warning, not failure.
"optional" — if the agent achieves the intent via text, the grader recovers it as a partial match.
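The strictness levels read as a small decision table. One plausible encoding (the real grader's exact outcome labels aren't documented; `grade_tool` is purely illustrative):

```python
def grade_tool(strictness, called, text_recovered=False):
    """One plausible mapping from strictness and call status to an outcome."""
    if called:
        return "pass"
    if strictness == "required":
        return "fail"          # NOT_CALLED = failure, no recovery
    if strictness == "expected":
        return "warn"          # NOT_CALLED = warning, not failure
    # optional: achieving the intent via text counts as a partial match
    return "partial" if text_recovered else "warn"
```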
Example:
agent = client.create_agent(
name="Support Bot",
description="Healthcare scheduling agent",
config={
"tool_definitions": [
{"name": "fetch_kareo_data", "required": True},
{"name": "end_session", "required": False},
],
"grading_config": {
"tool_strictness": {
"fetch_kareo_data": "required",
"end_session": "optional",
},
},
},
)
print(f"Created agent: {agent['id']}")
update_agent
Update an agent's name, description, or config.
update_agent(
agent_id: int,
name: str | None = None,
description: str | None = None,
config: dict | None = None
) -> Agent
Note: config replaces the entire config object — merge locally before updating if you want to preserve existing fields.
delete_agent
Soft-delete an agent. Datasets are unlinked but not deleted.
delete_agent(agent_id: int) -> dict
get_agent_datasets
Get all datasets linked to an agent.
get_agent_datasets(agent_id: int) -> dict
Returns: Dict with agent (the agent object) and datasets (list of dataset objects).
set_dataset_agent
Assign or unassign an agent to a dataset.
set_dataset_agent(dataset_id: int, agent_id: int | None) -> dict
Pass agent_id=None to unlink a dataset from its agent.
API Key Methods
list_api_keys
List API keys for your tenant.
list_api_keys(include_inactive: bool = False) -> list[APIKey]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
include_inactive | bool | No | False | Include revoked keys |
Returns: list[APIKey] - List of API key objects
Note: For security, only the key prefix is returned, not the full key.
Example:
keys = client.list_api_keys()
for key in keys:
print(f"{key['key_prefix']}... - {key['name']}")
revoke_api_key
Revoke an API key.
revoke_api_key(api_key_id: int) -> dict
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
api_key_id | int | Yes | The API key ID to revoke |
Returns: dict - Confirmation of revocation
Raises:
NotFoundError: API key not found
Example:
client.revoke_api_key(api_key_id=123)
print("API key revoked")
Convenience Methods
wait_for_request
Block until a request reaches a terminal state (completed or failed).
wait_for_request(
request_id: int,
timeout: int = 600,
poll_interval: int = 5
) -> Request
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
request_id | int | Yes | - | The request ID to poll |
timeout | int | No | 600 | Maximum seconds to wait |
poll_interval | int | No | 5 | Seconds between polls |
Returns: Request - The final request object
Raises:
TimeoutError: If the request doesn't finish within timeout seconds
AshrLabsError: If the request fails
Example:
req = client.create_request(request_name="My Eval", request=config)
completed = client.wait_for_request(req["id"], timeout=300)
print(f"Status: {completed['request_status']}")
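The polling semantics amount to a deadline loop over get_request. An illustrative reimplementation (the SDK already provides this; the sketch only shows the behavior, with injectable clock and sleep so it can be exercised without waiting):

```python
import time

def wait_until_terminal(get_request, timeout=600, poll_interval=5,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll get_request() until the request completes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        req = get_request()
        if req["request_status"] in ("completed", "failed"):
            return req
        if clock() >= deadline:
            raise TimeoutError("request did not reach a terminal state in time")
        sleep(poll_interval)
```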
poll_run
Block until backend grading completes for a run. After deploy(), the backend grades tool arguments and text responses asynchronously (typically 1-3 minutes). This method polls get_run() until aggregate_metrics.tests_passed is populated.
poll_run(
run_id: int,
timeout: int = 300,
poll_interval: int = 20,
on_poll: Callable | None = None
) -> Run
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
run_id | int | Yes | - | The run ID to poll |
timeout | int | No | 300 | Maximum seconds to wait |
poll_interval | int | No | 20 | Seconds between polls |
on_poll | Callable | No | None | Called after each poll: (elapsed_seconds, run_dict) |
Returns: Run - The fully graded run object
Raises:
TimeoutError: If grading doesn't finish within timeout seconds
Example:
created = run.deploy(client, dataset_id=322)
graded = client.poll_run(
created["id"],
on_poll=lambda elapsed, r: print(f"Grading... ({elapsed}s)"),
)
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
generate_dataset
Create a dataset generation request, wait for completion, and fetch the result. Combines create_request + wait_for_request + get_dataset into one call.
Missing context fields (use_case, scenario_context) are auto-filled from the agent's name and description. A default test_config is added if not provided.
generate_dataset(
request_name: str,
config: dict[str, Any],
request_input_schema: dict[str, Any] | None = None,
timeout: int = 600,
poll_interval: int = 5
) -> tuple[int, dict]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
request_name | str | Yes | - | Name/title for the request |
config | dict | Yes | - | The generation config — same structure as create_request's request parameter. See create_request for the full schema reference. Valid sections: metadata, agent, context, test_config, generation_options. |
request_input_schema | dict | No | auto | Optional JSON Schema for validation. Auto-populated from config["agent"]["tools"] if omitted. |
timeout | int | No | 600 | Maximum seconds to wait for generation |
poll_interval | int | No | 5 | Seconds between status polls |
Returns: A tuple of (dataset_id, dataset_source) where dataset_source is the dict containing "runs".
Raises:
TimeoutError: If generation doesn't finish in time
AshrLabsError: If generation fails or no datasets are found
Example:
dataset_id, source = client.generate_dataset(
request_name="Support Agent Eval",
config={
"agent": {
"name": "Support Bot",
"description": "Handles customer orders, inventory, and refunds",
"system_prompt": "You are a helpful support agent for ShopWave.",
"tools": [
{"name": "lookup_order", "description": "Look up order status",
"parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}},
{"name": "process_refund", "description": "Process a refund",
"parameters": {"type": "object", "properties": {"order_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["order_id", "reason"]}},
],
"accepted_inputs": {"text": True, "audio": False, "file": False},
},
"context": {
"domain": "e-commerce",
"use_case": "Customer support for online retail orders",
"user_persona": {"type": "customer", "description": "Online shoppers"},
},
"test_config": {
"num_variations": 10,
"strategy": "diverse",
"coverage": {"happy_path": True, "edge_cases": True, "error_handling": True},
},
"generation_options": {
"generate_audio": False,
"generate_files": False,
"generate_simulations": False,
},
},
)
print(f"Dataset #{dataset_id}: {len(source['runs'])} scenarios")
Utility Methods
health_check
Check if the API is reachable.
health_check() -> dict
Returns: dict - Status information
Example:
status = client.health_check()
print(f"API Status: {status['status']}")
RunBuilder
A builder for incrementally constructing run result objects as an agent executes tests. Once complete, the result can be deployed via the client.
Constructor
RunBuilder()
No parameters. Creates a run in "pending" status.
RunBuilder.start
Mark the run as started. Records the current timestamp.
run.start() -> RunBuilder
Returns: self (for chaining)
RunBuilder.add_test
Create and register a new test within this run.
run.add_test(test_id: str) -> TestBuilder
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
test_id | str | Yes | Unique identifier for the test case |
Returns: TestBuilder - A builder for the individual test
RunBuilder.complete
Mark the run as completed. Records the current timestamp.
run.complete(status: str = "completed") -> RunBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
status | str | No | "completed" | Final status ("completed" or "failed") |
Returns: self (for chaining)
RunBuilder.build
Serialize the full run result to a dict.
run.build() -> dict[str, Any]
Returns: A dict matching the run result schema, ready to be passed to client.create_run(result=...). Aggregate metrics are computed automatically from action results.
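The exact aggregation formula isn't specified; the sketch below shows one plausible way pass counts could be derived from per-action match statuses (illustrative only, not the SDK's implementation):

```python
def aggregate_metrics(tests):
    """Count tests whose graded actions (those carrying a match_status)
    all matched exactly, partially, or similarly."""
    passed = 0
    for test in tests:
        graded = [a["match_status"] for a in test.get("actions", [])
                  if "match_status" in a]  # user inputs carry no match_status
        if all(s in ("exact", "partial", "similar") for s in graded):
            passed += 1
    return {"total_tests": len(tests), "tests_passed": passed,
            "tests_failed": len(tests) - passed}
```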
RunBuilder.deploy
Build the result and submit it as a new run via the API.
run.deploy(
client: AshrLabsClient,
dataset_id: int,
tenant_id: int | None = None,
runner_id: int | None = None,
agent_id: int | None = None
) -> dict[str, Any]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
client | AshrLabsClient | Yes | - | An authenticated client instance |
dataset_id | int | Yes | - | The dataset this run is for |
tenant_id | int | No | auto | The tenant (auto-resolved if omitted) |
runner_id | int | No | None | ID of the user who ran the test |
agent_id | int | No | None | Agent to auto-link the dataset to |
Returns: The created run object from the API
Example:
from ashr_labs import AshrLabsClient, RunBuilder
client = AshrLabsClient(api_key="tp_...")
run = RunBuilder()
run.start()
test = run.add_test("bank_analysis")
test.start()
test.add_user_text(text="Analyze this", description="User prompt")
test.add_tool_call(
expected={"tool_name": "analyze", "arguments": {"data": "input"}},
actual={"tool_name": "analyze", "arguments": {"data": "input"}},
match_status="exact",
)
test.complete()
run.complete()
created_run = run.deploy(client, dataset_id=42)
print(f"Run #{created_run['id']} created")
TestBuilder
Builds a single test result incrementally. Returned by RunBuilder.add_test().
TestBuilder.start
Mark the test as started. Records the current timestamp.
test.start() -> TestBuilder
Returns: self (for chaining)
TestBuilder.add_user_file
Record a user file input action.
test.add_user_file(
file_path: str,
description: str,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
file_path | str | Yes | - | Path to the file in the dataset |
description | str | Yes | - | Description of the action |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
TestBuilder.add_user_text
Record a user text input action.
test.add_user_text(
text: str,
description: str,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
text | str | Yes | - | The user's text input |
description | str | Yes | - | Description of the action |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
TestBuilder.add_tool_call
Record an agent tool call action with expected vs actual comparison.
test.add_tool_call(
expected: dict[str, Any],
actual: dict[str, Any],
match_status: str,
divergence_notes: str | None = None,
argument_comparison: dict[str, Any] | None = None,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
expected | dict | Yes | - | Expected tool call (tool_name, arguments) |
actual | dict | Yes | - | Actual tool call made by the agent |
match_status | str | Yes | - | "exact", "partial", or "mismatch" |
divergence_notes | str | No | None | Notes explaining the divergence |
argument_comparison | dict | No | None | Structured diff from compare_args_structural(). Recommended — the backend grader may skip tool calls without it. |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
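In practice you would pass the output of compare_args_structural() as argument_comparison. As a self-contained illustration of the kind of per-key diff such a structure can carry (the real helper's output shape is not documented here and may differ), here is a hand-built comparison:

```python
# Hypothetical expected vs actual tool calls for a single agent turn.
expected = {"tool_name": "analyze", "arguments": {"data": "input", "mode": "fast"}}
actual = {"tool_name": "analyze", "arguments": {"data": "input", "mode": "slow"}}

# Hand-built per-key diff; in real code, prefer compare_args_structural().
comparison = {
    key: {
        "expected": expected["arguments"].get(key),
        "actual": actual["arguments"].get(key),
        "match": expected["arguments"].get(key) == actual["arguments"].get(key),
    }
    for key in expected["arguments"]
}
```

The resulting dict could then be attached via test.add_tool_call(expected=expected, actual=actual, match_status="partial", argument_comparison=comparison).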
TestBuilder.add_agent_response
Record an agent text response with expected vs actual comparison.
test.add_agent_response(
expected_response: dict[str, Any],
actual_response: dict[str, Any],
match_status: str,
semantic_similarity: float | None = None,
divergence_notes: str | None = None,
action_index: int | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
expected_response | dict | Yes | - | The expected response content |
actual_response | dict | Yes | - | The actual response from the agent |
match_status | str | Yes | - | "exact", "similar", or "divergent" |
semantic_similarity | float | No | None | Similarity score (0.0 to 1.0) |
divergence_notes | str | No | None | Notes explaining the divergence |
action_index | int | No | auto | Explicit index, or auto-incremented |
Returns: self (for chaining)
TestBuilder.set_vm_stream
Attach VM session logs to this test. For agents that operate in a browser or virtual machine.
test.set_vm_stream(
provider: str,
session_id: str | None = None,
duration_ms: int | None = None,
logs: list[dict] | None = None,
metadata: dict | None = None
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
provider | str | Yes | - | VM provider name (e.g. "browserbase", "scrapybara", "steel") |
session_id | str | No | None | Provider session ID for linking |
duration_ms | int | No | None | Total session duration in milliseconds |
logs | list[dict] | No | None | Timestamped log entries (see below) |
metadata | dict | No | None | Additional provider-specific metadata |
Log entry format: Each entry should have ts (int, ms offset from start) and type (str):
{"ts": 0, "type": "navigation", "data": {"url": "https://..."}}
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#btn"}}
{"ts": 3000, "type": "error", "data": {"message": "Element not found"}}
Example:
test.set_vm_stream(
provider="browserbase",
session_id="sess_abc123",
duration_ms=12000,
logs=[
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
{"ts": 2000, "type": "action", "data": {"action": "click", "selector": "#submit"}},
{"ts": 5000, "type": "network", "data": {"method": "POST", "url": "/api/order", "status": 201}},
],
)
Returns: self (for chaining)
TestBuilder.set_kernel_vm
Convenience method for attaching a Kernel browser session. Sets provider="kernel" and exposes Kernel-specific metadata fields as named parameters. Fields map to Kernel's browser API response.
test.set_kernel_vm(
session_id: str,
duration_ms: int | None = None,
logs: list[dict] | None = None,
*,
live_view_url: str | None = None,
cdp_ws_url: str | None = None,
replay_id: str | None = None,
replay_view_url: str | None = None,
headless: bool | None = None,
stealth: bool | None = None,
viewport: dict | None = None,
) -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
session_id | str | Yes | - | Kernel browser session ID |
duration_ms | int | No | None | Total session duration in milliseconds |
logs | list[dict] | No | None | Timestamped log entries (same format as set_vm_stream) |
live_view_url | str | No | None | Remote live-view URL (browser_live_view_url) |
cdp_ws_url | str | No | None | Chrome DevTools Protocol WebSocket URL |
replay_id | str | No | None | ID of the session recording |
replay_view_url | str | No | None | URL to view the session replay |
headless | bool | No | None | Whether the session ran in headless mode |
stealth | bool | No | None | Whether anti-bot stealth mode was enabled |
viewport | dict | No | None | Browser viewport, e.g. {"width": 1920, "height": 1080} |
Example:
test.set_kernel_vm(
session_id="kern_sess_abc123",
duration_ms=15000,
logs=[
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
{"ts": 3000, "type": "screenshot", "data": {"s3_key": "vm-streams/.../frame.png"}},
],
replay_id="replay_abc123",
replay_view_url="https://www.kernel.sh/replays/replay_abc123",
stealth=True,
viewport={"width": 1920, "height": 1080},
)
Returns: self (for chaining)
TestBuilder.complete
Mark the test as completed. Records the current timestamp.
test.complete(status: str = "completed") -> TestBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
status | str | No | "completed" | Final status ("completed" or "failed") |
Returns: self (for chaining)
TestBuilder.build
Serialize this test to a dict matching the run result schema.
test.build() -> dict[str, Any]
Returns: A dict with test_id, status, action_results, started_at, and completed_at.
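As an illustration of that shape, here is a sketch of a built result (all values hypothetical, and the inner action-entry shape is illustrative; the run result schema is authoritative):

```python
# Hypothetical serialized test, sketching the top-level keys build() returns.
built = {
    "test_id": "bank_analysis",
    "status": "completed",
    "action_results": [
        # Entry shape is illustrative only.
        {"action_index": 0, "type": "user_text", "text": "Analyze this"},
    ],
    "started_at": "2024-01-01T00:00:00Z",
    "completed_at": "2024-01-01T00:00:05Z",
}
```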
EvalRunner
Runs an agent against every scenario in a dataset and records results. This is the high-level API that encapsulates the full eval loop — iterating scenarios, calling the agent, comparing tool calls and text, and producing a RunBuilder.
Constructor
EvalRunner(dataset_source: dict[str, Any])
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
dataset_source | dict | Yes | - | The dataset_source dict from a dataset (contains "runs" key) |
The EvalRunner does not perform local grading. It pairs expected vs actual tool calls and text responses, then submits everything for server-side grading via the backend's LLM-based judge. Tool call arguments are compared structurally using compare_args_structural(). Text responses are submitted with match_status="pending" for server-side evaluation.
Example:
from ashr_labs import EvalRunner
runner = EvalRunner(dataset["dataset_source"])
EvalRunner.from_dataset (class method)
Create an EvalRunner by fetching a dataset from the API.
EvalRunner.from_dataset(
client: AshrLabsClient,
dataset_id: int,
**kwargs
) -> EvalRunner
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
client | AshrLabsClient | Yes | An authenticated client |
dataset_id | int | Yes | The dataset ID to fetch |
**kwargs | - | No | Passed to EvalRunner.__init__() |
Returns: EvalRunner - A configured runner ready to call .run()
Example:
runner = EvalRunner.from_dataset(client, dataset_id=322)
EvalRunner.run
Run the agent against every scenario and return a populated RunBuilder.
runner.run(
agent: Agent,
*,
on_scenario: Callable | None = None,
on_action: Callable | None = None,
on_environment: Callable | None = None,
max_workers: int = 1,
) -> RunBuilder
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
agent | Agent | Yes | - | An object implementing the Agent protocol |
on_scenario | Callable | No | None | Called at the start of each scenario: (scenario_id, scenario_dict) |
on_action | Callable | No | None | Called for each action: (action_index, action_dict) |
on_environment | Callable | No | None | Called for environment actions: (content, action_dict) -> dict |
max_workers | int | No | 1 | Number of scenarios to run in parallel. When >1, each scenario gets a copy.deepcopy of the agent. Important: Most LLM clients (Anthropic, OpenAI) hold connection pools that cannot be deep-copied. Use max_workers=1 unless your agent implements __deepcopy__. |
Returns: RunBuilder - A populated builder ready for .build() or .deploy()
Example:
# Sequential (default)
run = runner.run(agent)
# With environment handler — feed external context to the agent
def handle_env(content, action):
return agent.respond(content)
run = runner.run(agent, on_environment=handle_env)
# Parallel — run 4 scenarios at a time (only if agent supports deepcopy)
run = runner.run(agent, max_workers=4)
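If you do need max_workers > 1, one workaround (a sketch, assuming your agent wraps a non-copyable LLM client) is to implement __deepcopy__ so each worker gets a fresh client instead of a copied connection pool:

```python
import copy
from typing import Any


class MyAgent:
    """Hypothetical agent wrapping a client that cannot be deep-copied."""

    def __init__(self) -> None:
        # Stand-in for e.g. an LLM client holding a connection pool.
        self.client = object()
        self.history: list[str] = []

    def respond(self, message: str) -> dict[str, Any]:
        self.history.append(message)
        return {"text": "ok", "tool_calls": []}

    def reset(self) -> None:
        self.history.clear()

    def __deepcopy__(self, memo: dict) -> "MyAgent":
        # Recreate the client rather than copying it; deep-copy only plain state.
        clone = MyAgent()
        clone.history = copy.deepcopy(self.history, memo)
        return clone
```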
EvalRunner.run_and_deploy
Run the eval and submit results in one call.
runner.run_and_deploy(
agent: Agent,
client: AshrLabsClient,
dataset_id: int | None = None,
*,
on_scenario: Callable | None = None,
on_action: Callable | None = None,
on_environment: Callable | None = None,
max_workers: int = 1,
**deploy_kwargs,
) -> dict[str, Any]
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
agent | Agent | Yes | - | An object implementing the Agent protocol |
client | AshrLabsClient | Yes | - | An authenticated client |
dataset_id | int or None | No | None | The dataset to submit against |
on_scenario | Callable | No | None | Callback per scenario |
on_action | Callable | No | None | Callback per action |
on_environment | Callable | No | None | Callback for environment actions (see run()) |
max_workers | int | No | 1 | Number of scenarios to run in parallel (default sequential) |
**deploy_kwargs | - | No | - | Extra kwargs passed to RunBuilder.deploy() |
Returns: The created run object from the API
Example:
# Sequential
created = runner.run_and_deploy(agent, client, dataset_id=322)
print(f"Run #{created['id']} submitted")
# Parallel
created = runner.run_and_deploy(agent, client, dataset_id=322, max_workers=4)
Agent Protocol
A @runtime_checkable Protocol that defines the interface agents must implement.
@runtime_checkable
class Agent(Protocol):
def respond(self, message: str) -> dict[str, Any]: ...
def reset(self) -> None: ...
respond
Process a user message and return the agent's response.
Parameters:
| Parameter | Type | Description |
|---|---|---|
message | str | The user's message text |
Returns: A dict with:
"text"(str): The agent's text response"tool_calls"(list[dict]): Tool calls made during this turn, each with"name"(str) and"arguments"(dict) keys
arguments vs arguments_json: The Agent protocol returns tool arguments as a dict under the "arguments" key. However, RunBuilder and the API store them as a JSON string under "arguments_json". EvalRunner handles this conversion automatically (eval.py:187-193). If you use RunBuilder directly, pass "arguments_json" (a JSON string) to add_tool_call(). The extract_tool_args() helper accepts both formats, so comparators work either way.
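The conversion between the two forms is a plain json.dumps / json.loads round trip (the argument values here are hypothetical):

```python
import json

# Dict form, as returned by Agent.respond() under "tool_calls".
arguments = {"order_id": "ORD-123"}

# JSON-string form, as RunBuilder and the API store it under "arguments_json".
arguments_json = json.dumps(arguments)

# Round-trips back to the dict form.
assert json.loads(arguments_json) == arguments
```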
reset
Clear conversation state for a new scenario. Called before each scenario begins.
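A minimal agent satisfying the protocol might look like the following sketch (the Protocol is restated locally so the example is self-contained; in real code, import Agent from ashr_labs):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    def respond(self, message: str) -> dict[str, Any]: ...
    def reset(self) -> None: ...


class EchoAgent:
    """Hypothetical minimal agent: echoes the input, makes no tool calls."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def respond(self, message: str) -> dict[str, Any]:
        self.history.append(message)
        return {"text": f"You said: {message}", "tool_calls": []}

    def reset(self) -> None:
        # Called before each scenario to clear conversation state.
        self.history.clear()


assert isinstance(EchoAgent(), Agent)  # structural check at runtime
```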
isinstance check
from ashr_labs import Agent
assert isinstance(my_agent, Agent) # Works at runtime
Comparator Functions
All comparator functions are standalone, stdlib-only, and importable from the top-level package.
strip_markdown
Remove markdown formatting from text.
strip_markdown(text: str) -> str
Removes bold/italic markers, headers, bullets, and markdown links. Collapses whitespace.
Example:
strip_markdown("**Bold** and [link](https://x.com)")
# => "Bold and link"
tokenize
Lowercase, strip markdown and punctuation, split into word tokens.
tokenize(text: str) -> list[str]
Example:
tokenize("Order **ORD-123** shipped!")
# => ["order", "ord123", "shipped"]
fuzzy_str_match
Check if two strings are semantically close enough to count as matching.
fuzzy_str_match(a: str, b: str, threshold: float | None = None) -> bool
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
a | str | Yes | - | First string |
b | str | Yes | - | Second string |
threshold | float | No | adaptive | Word-overlap threshold. If None: 0.35 for <=5 words, 0.40 for <=8, 0.55 otherwise |
Returns: True if the strings match closely enough
Checks in order: exact match after normalization, containment, then word-set overlap.
Example:
fuzzy_str_match("Customer wants a refund", "customer wants refund") # True
fuzzy_str_match("apple banana", "cherry grape") # False
extract_tool_args
Extract arguments from a tool call dict, handling both formats.
extract_tool_args(tool_call: dict) -> dict
Handles {"arguments": {...}} (dict form) and {"arguments_json": "..."} (JSON string form). Prefers the dict form if both are present.
Example:
extract_tool_args({"arguments_json": '{"order_id": "ORD-123"}'})
# => {"order_id": "ORD-123"}
extract_tool_args({"arguments": {"order_id": "ORD-123"}})
# => {"order_id": "ORD-123"}
compare_tool_args
Compare expected vs actual tool call arguments.
compare_tool_args(expected: dict, actual: dict) -> tuple[str, str | None]
Parameters:
| Parameter | Type | Description |
|---|---|---|
expected | dict | Expected tool call (with arguments or arguments_json) |
actual | dict | Actual tool call made by the agent |
Returns: A tuple of (match_status, divergence_notes):
- match_status: "exact", "partial", or "mismatch"
- divergence_notes: Human-readable diff summary, or None if exact
String arguments are compared using fuzzy_str_match. Non-string values use exact equality. Extra arguments in the actual call don't cause divergence.
Example:
status, notes = compare_tool_args(
{"arguments": {"order_id": "ORD-123"}},
{"arguments": {"order_id": "ORD-123", "extra": "field"}},
)
# => ("exact", None)
status, notes = compare_tool_args(
{"arguments": {"order_id": "ORD-123", "reason": "damaged item"}},
{"arguments": {"order_id": "ORD-999", "reason": "item was damaged"}},
)
# => ("partial", "'order_id': expected='ORD-123' actual='ORD-999'")
text_similarity
Compute similarity between two text strings.
text_similarity(text_a: str, text_b: str) -> float
Returns: A float between 0.0 and 1.0
Uses cosine similarity on word frequency vectors, plus:
- Entity bonus (+0.20): for matching order IDs (ORD-*), refund IDs (REF-*), prices ($*), dates (YYYY-MM-DD), and tracking URLs
- Concept bonus (+0.10): for matching domain concepts (refund/credited, shipped/transit/delivered, stock/available, etc.)
Example:
text_similarity(
"Your order ORD-123 has shipped and is on the way",
"Order ORD-123 has been shipped and is in transit",
)
# => 0.78
Data Types
User
class User(TypedDict, total=False):
id: int
created_at: str
email: str
name: str | None
tenant: int
is_active: bool
Tenant
class Tenant(TypedDict, total=False):
id: int
created_at: str
tenant_name: str
is_active: bool
Session
class Session(TypedDict):
status: str
user: User
tenant: Tenant
Dataset
class Dataset(TypedDict, total=False):
id: int
created_at: str
tenant: int
creator: int
name: str
description: str | None
agent_id: int | None
agent_details: dict[str, Any] | None # {"id": int, "name": str}
dataset_source: dict[str, Any]
Run
class Run(TypedDict, total=False):
id: int
created_at: str
dataset: int
tenant: int
runner: int
result: dict[str, Any]
ObservabilityTrace
class ObservabilityTrace(TypedDict, total=False):
id: str
name: str
user_id: str | None # End-user identifier
session_id: str | None # Conversation/session grouping
metadata: dict[str, Any] | None
tags: list[str]
created_at: str | None
output: dict[str, Any] | None
observations: list[ObservabilityObservation]
ObservabilityObservation
class ObservabilityObservation(TypedDict, total=False):
id: str
name: str
type: str # "span", "generation", "event"
parent_observation_id: str | None
input: dict[str, Any] | None
output: dict[str, Any] | None
metadata: dict[str, Any] | None
model: str | None # LLM model name (generations only)
usage: dict[str, int] | None # {"input_tokens": ..., "output_tokens": ...}
level: str | None # "DEBUG", "DEFAULT", "WARNING", "ERROR"
status_message: str | None
start_time: str | None
end_time: str | None
SdkNote
class SdkNote(TypedDict, total=False):
id: int
created_at: str
updated_at: str
title: str
content: str
category: str # "info", "warning", "breaking_change", "best_practice", "deprecation"
severity: str # "info", "warning", "critical"
tenant_id: int | None
agent_id: int | None
active_from: str
expires_at: str | None
is_archived: bool
note_metadata: dict[str, Any]
Request
class Request(TypedDict, total=False):
id: int
created_at: str
requestor_id: int
requestor_tenant: int
request_name: str
request_status: str
request_input_schema: dict[str, Any] | None
request: dict[str, Any]
APIKey
class APIKey(TypedDict, total=False):
id: int
key: str # Only present on creation
key_prefix: str
name: str
scopes: list[str]
user_id: int
tenant_id: int
created_at: str
last_used_at: str | None
expires_at: str | None
is_active: bool
ToolCall
class ToolCall(TypedDict, total=False):
name: str
arguments_json: str
ExpectedResponse
class ExpectedResponse(TypedDict, total=False):
tool_calls: list[ToolCall]
text: str
Action
class Action(TypedDict, total=False):
actor: str # "user" or "agent"
content: str
name: str
expected_response: ExpectedResponse
Scenario
class Scenario(TypedDict, total=False):
title: str
actions: list[Action]
Agent
class Agent(TypedDict, total=False):
id: int
created_at: str
tenant_id: int
creator_id: int | None
name: str
description: str | None
config: AgentConfig
is_active: bool
dataset_count: int
AgentConfig
class AgentConfig(TypedDict, total=False):
form_data: dict # Dataset generation preset
tool_definitions: list[ToolDefinition]
behavior_rules: list[BehaviorRule]
grading_config: GradingConfig
ToolDefinition
class ToolDefinition(TypedDict, total=False):
name: str
description: str
required: bool # True = must be called, False = optional
BehaviorRule
class BehaviorRule(TypedDict, total=False):
rule: str
strictness: str # "required" | "expected" | "optional"
GradingConfig
class GradingConfig(TypedDict, total=False):
tool_strictness: dict[str, str] # tool_name -> "required" | "expected" | "optional"
text_similarity_threshold: float