
VM Integration Guide

This guide covers how to evaluate browser-based and desktop-based agents that interact with virtual machines. If your agent controls a browser (via Playwright, Selenium, Browserbase, Kernel, Steel, etc.), this is the guide for you.

Overview

When your agent operates in a browser or VM, you want to capture what the agent did visually — not just what tool calls it made. VM streams attach timestamped browser logs (navigations, clicks, screenshots, network requests) to each test in a run, so you can replay and debug agent behavior in the Ashr Labs dashboard.

Key concept: VM streams are metadata attached to individual tests, not to the run as a whole. Each test (scenario) gets its own VM stream with its own session, logs, and duration.

How VM Streams Work

┌─────────────────────────────────────────────────┐
│ Your Browser Agent                              │
│                                                 │
│ 1. Receives message from EvalRunner             │
│ 2. Executes browser actions (click, type, etc.) │
│ 3. Collects logs as it goes                     │
│ 4. Returns { "text": ..., "tool_calls": ... }   │
└────────────────────┬────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────┐
│ EvalRunner                                      │
│                                                 │
│ - Records tool call matches ✓                   │
│ - Records text response matches ✓               │
│ - Does NOT auto-capture VM logs ✗               │
└────────────────────┬────────────────────────────┘
                     ▼  You must do this yourself
┌─────────────────────────────────────────────────┐
│ After runner.run(), attach VM streams:          │
│                                                 │
│   test.set_vm_stream("browserbase", ...)        │
│   — or —                                        │
│   test.set_kernel_vm("session_id", ...)         │
└────────────────────┬────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────┐
│ run.deploy(client, dataset_id)                  │
│   → VM logs submitted with run result           │
│   → Viewable in Ashr Labs dashboard             │
└─────────────────────────────────────────────────┘

Important: EvalRunner handles tool call and text comparison automatically, but it does not capture VM/browser logs. You must:

  1. Collect logs inside your agent during respond()
  2. Attach them to each TestBuilder after runner.run() returns
  3. Then call run.deploy()

Quick Start

Minimal example — a browser agent that records navigation logs:

from ashr_labs import AshrLabsClient, EvalRunner
import time

client = AshrLabsClient.from_env()

class MyBrowserAgent:
    def __init__(self):
        self.logs: dict[str, dict] = {}
        self._scenario: str = "default"

    def respond(self, message: str) -> dict:
        if self._scenario not in self.logs:
            self.logs[self._scenario] = {"session_id": f"sess_{int(time.time())}", "entries": []}
        ctx = self.logs[self._scenario]

        # ... your browser automation here ...
        ctx["entries"].append({"ts": 0, "type": "navigation", "data": {"url": "https://example.com"}})

        return {"text": "Done", "tool_calls": []}

    def reset(self):
        pass  # Don't clear logs — we need them after eval

agent = MyBrowserAgent()
runner = EvalRunner.from_dataset(client, dataset_id=42)
run = runner.run(agent, on_scenario=lambda sid, s: setattr(agent, "_scenario", sid))

# Attach VM streams BEFORE deploying
for test in run.tests:
    session = agent.logs.get(test.test_id)
    if session:
        test.set_vm_stream(
            provider="playwright",
            session_id=session["session_id"],
            logs=session["entries"],
        )

run.deploy(client, dataset_id=42)

End-to-End Example: Browser Agent with EvalRunner

This is a complete, working example of a browser agent that:

  • Uses the Anthropic SDK for reasoning
  • Drives a browser via automation
  • Records VM logs per scenario
  • Runs evals and deploys with VM streams attached

import anthropic
import time
from ashr_labs import AshrLabsClient, EvalRunner


class BrowserSession:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.logs: list[dict] = []
        self.start_time = time.time()

    def elapsed_ms(self) -> int:
        return int((time.time() - self.start_time) * 1000)


class WebNavigationAgent:
    """Browser agent that records VM logs for each scenario."""

    TOOLS = [
        {
            "name": "navigate",
            "description": "Navigate to a URL",
            "input_schema": {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]},
        },
        {
            "name": "click",
            "description": "Click an element",
            "input_schema": {"type": "object", "properties": {"selector": {"type": "string"}}, "required": ["selector"]},
        },
        {
            "name": "type_text",
            "description": "Type text into an input",
            "input_schema": {
                "type": "object",
                "properties": {"selector": {"type": "string"}, "text": {"type": "string"}},
                "required": ["selector", "text"],
            },
        },
    ]

    def __init__(self):
        self.anthropic = anthropic.Anthropic()
        self.sessions: dict[str, BrowserSession] = {}
        self.conversations: dict[str, list] = {}
        self._current_scenario: str = "default"

    def respond(self, message: str) -> dict:
        sid = self._current_scenario

        # Initialize browser session for this scenario
        if sid not in self.sessions:
            session = self._create_browser_session()
            self.sessions[sid] = BrowserSession(session["id"])

        session = self.sessions[sid]
        history = self.conversations.setdefault(sid, [])
        history.append({"role": "user", "content": message})

        all_tool_calls = []
        final_text = ""

        for _ in range(10):
            response = self.anthropic.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a web navigation agent. Use tools to interact with the browser.",
                tools=self.TOOLS,
                messages=history,
            )

            # Collect text
            for block in response.content:
                if block.type == "text":
                    final_text += block.text

            # Process tool calls
            tool_blocks = [b for b in response.content if b.type == "tool_use"]

            if not tool_blocks:
                history.append({"role": "assistant", "content": response.content})
                break

            # Execute each tool call and record logs
            tool_results = []
            for tool in tool_blocks:
                ts = session.elapsed_ms()
                all_tool_calls.append({"name": tool.name, "arguments": tool.input})

                if tool.name == "navigate":
                    self._browser_navigate(session.session_id, tool.input["url"])
                    session.logs.append({"ts": ts, "type": "navigation", "data": {"url": tool.input["url"]}})
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Navigated"})

                elif tool.name == "click":
                    self._browser_click(session.session_id, tool.input["selector"])
                    session.logs.append({
                        "ts": ts, "type": "action",
                        "data": {"action": "click", "selector": tool.input["selector"]},
                    })
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Clicked"})

                elif tool.name == "type_text":
                    self._browser_type(session.session_id, tool.input["selector"], tool.input["text"])
                    session.logs.append({
                        "ts": ts, "type": "action",
                        "data": {"action": "type", "selector": tool.input["selector"], "value": tool.input["text"]},
                    })
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Typed"})

                # Screenshot after each action
                session.logs.append({"ts": session.elapsed_ms(), "type": "screenshot", "data": {}})

            history.append({"role": "assistant", "content": response.content})
            history.append({"role": "user", "content": tool_results})

            if response.stop_reason == "end_turn":
                break

        self.conversations[sid] = history
        return {"text": final_text, "tool_calls": all_tool_calls}

    def reset(self):
        # Don't clear sessions — we need the logs after eval completes
        pass

    def set_scenario(self, scenario_id: str):
        """Called from on_scenario callback to track current scenario."""
        self._current_scenario = scenario_id

    def get_session(self, scenario_id: str) -> BrowserSession | None:
        return self.sessions.get(scenario_id)

    # -- Replace these stubs with your actual browser provider --
    def _create_browser_session(self) -> dict:
        return {"id": f"sess_{int(time.time())}"}

    def _browser_navigate(self, session_id: str, url: str): pass
    def _browser_click(self, session_id: str, selector: str): pass
    def _browser_type(self, session_id: str, selector: str, text: str): pass


# ─── Main ────────────────────────────────────────────────────────────────

def main():
    client = AshrLabsClient.from_env()
    agent = WebNavigationAgent()

    # 1. Run eval
    runner = EvalRunner.from_dataset(client, dataset_id=42)
    run = runner.run(
        agent,
        max_workers=1,  # Use 1 for browser agents (see Gotchas below)
        on_scenario=lambda sid, s: (
            agent.set_scenario(sid),
            print(f"▶ Scenario: {s.get('title', sid)}"),
        ),
        on_action=lambda i, a: print(f"  Action {i}: {a.get('content', '')[:60]}"),
    )

    # 2. Attach VM streams to each test
    for test in run.tests:
        session = agent.get_session(test.test_id)
        if session:
            test.set_vm_stream(
                provider="custom",
                session_id=session.session_id,
                duration_ms=session.elapsed_ms(),
                logs=session.logs,
                metadata={
                    "browser": "chromium",
                    "viewport": {"width": 1280, "height": 720},
                },
            )

    # 3. Inspect metrics
    result = run.build()
    metrics = result["aggregate_metrics"]
    print(f"\nResults: {metrics['tests_passed']}/{metrics['total_tests']} passed")
    print(f"Tool divergence: {metrics['total_tool_call_divergence']}")

    # 4. Deploy and wait for grading
    created = run.deploy(client, dataset_id=42)
    print(f"Run submitted: {created['id']}")

    graded = client.poll_run(created["id"], timeout=300)
    print(f"Grading complete: {graded['result']['aggregate_metrics']}")


if __name__ == "__main__":
    main()

Log Entry Format

Every log entry in a VM stream follows this structure:

{
    "ts": 1200,                  # Timestamp in ms (relative to session start, or absolute)
    "type": "action",            # One of the types below
    "data": {"action": "click"}  # Type-specific payload
}
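If you build entries in several places, a small constructor keeps the shape consistent. A minimal sketch — `make_log_entry` is a hypothetical helper, not part of the SDK:

```python
import time

def make_log_entry(log_type: str, data: dict, session_start: float) -> dict:
    # Hypothetical helper — not part of the Ashr Labs SDK.
    # Builds a log entry with a millisecond timestamp relative to
    # session_start (a time.time() value captured when the session began).
    return {
        "ts": int((time.time() - session_start) * 1000),
        "type": log_type,
        "data": data,
    }
```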

Log Types Reference

Type       | When to use                    | Required data fields                          | Optional data fields
navigation | Browser navigated to a new URL | url                                           | —
action     | Agent interacted with the page | action ("click", "type", "select", "scroll") | selector, value, delta_x, delta_y
network    | HTTP request completed         | method, url                                   | status, duration_ms
console    | Browser console output         | message                                       | level ("log", "warn", "error")
error      | An error occurred              | message                                       | code, details
screenshot | Screenshot was captured        | —                                             | s3_key, format ("png", "jpeg")

Example Logs

# Navigation
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}}

# Click action
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login-btn"}}

# Type into input
{"ts": 2000, "type": "action", "data": {"action": "type", "selector": "#email", "value": "user@example.com"}}

# Scroll
{"ts": 2500, "type": "action", "data": {"action": "scroll", "delta_x": 0, "delta_y": 300}}

# Network request
{"ts": 3000, "type": "network", "data": {"method": "POST", "url": "/api/login", "status": 200}}

# Console warning
{"ts": 3100, "type": "console", "data": {"level": "warn", "message": "Deprecated API called"}}

# Error
{"ts": 4000, "type": "error", "data": {"message": "Element not found: #checkout-btn"}}

# Screenshot
{"ts": 5000, "type": "screenshot", "data": {"s3_key": "vm-streams/run_42/frame_001.png"}}

Providers

Generic Provider

Use set_vm_stream() for any browser provider (Browserbase, Steel, Scrapybara, Playwright, Selenium, or your own):

test.set_vm_stream(provider, session_id=None, duration_ms=None, logs=None, metadata=None)

Parameter   | Type       | Required | Description
provider    | str        | Yes      | Provider name (e.g. "browserbase", "steel", "playwright", "custom")
session_id  | str        | No       | Your provider's session ID
duration_ms | int        | No       | Total session duration in milliseconds
logs        | list[dict] | No       | Array of timestamped log entries
metadata    | dict       | No       | Any additional provider-specific data

test.set_vm_stream(
    provider="browserbase",
    session_id="sess_abc123",
    duration_ms=12000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://shop.example.com"}},
        {"ts": 800, "type": "action", "data": {"action": "click", "selector": "#product"}},
        {"ts": 3500, "type": "network", "data": {"method": "POST", "url": "/api/cart", "status": 200}},
    ],
    metadata={"browser": "chromium", "viewport": {"width": 1280, "height": 720}},
)

Kernel Browser

Use set_kernel_vm() for Kernel browser sessions. This is a convenience wrapper that sets provider="kernel" and exposes Kernel-specific metadata as named parameters:

test.set_kernel_vm(session_id, duration_ms=None, logs=None, *,
                   live_view_url=None, cdp_ws_url=None, replay_id=None,
                   replay_view_url=None, headless=None, stealth=None,
                   viewport=None)

Note: Parameters after * are keyword-only — you must pass them by name (e.g. replay_id="replay_abc").

Parameter       | Type       | Required | Description
session_id      | str        | Yes      | Kernel browser session ID
duration_ms     | int        | No       | Total session duration in milliseconds
logs            | list[dict] | No       | Array of timestamped log entries
live_view_url   | str        | No       | Kernel's browser_live_view_url for real-time viewing
cdp_ws_url      | str        | No       | Chrome DevTools Protocol WebSocket URL
replay_id       | str        | No       | Kernel session recording ID
replay_view_url | str        | No       | URL to view the session replay
headless        | bool       | No       | Whether the session ran headless
stealth         | bool       | No       | Whether anti-bot stealth mode was enabled
viewport        | dict       | No       | Browser viewport dimensions {"width": int, "height": int}

test.set_kernel_vm(
    session_id="kern_sess_abc123",
    duration_ms=15000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
        {"ts": 2500, "type": "action", "data": {"action": "type", "selector": "#email", "value": "user@example.com"}},
        {"ts": 3800, "type": "action", "data": {"action": "click", "selector": "#submit"}},
    ],
    replay_id="replay_abc123",
    replay_view_url="https://www.kernel.sh/replays/replay_abc123",
    stealth=True,
    viewport={"width": 1920, "height": 1080},
)

When to use which?

  • set_kernel_vm() — You're using Kernel and want replay URLs, live view, and CDP access tracked in your results.
  • set_vm_stream() — Everything else. Works with any provider. Pass whatever metadata makes sense for your setup.

Attaching VM Streams to EvalRunner Results

This is the most common source of confusion. EvalRunner.run() returns a RunBuilder with TestBuilder instances for each scenario. The runner records tool calls and text comparisons, but you must attach VM streams.

run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))

# run.tests is a list of TestBuilder instances
# Each test.test_id matches the scenario ID from the dataset

for test in run.tests:
    session = agent.get_session(test.test_id)
    if session:
        # Option A: generic provider
        test.set_vm_stream(
            provider="my-provider",
            session_id=session.id,
            logs=session.logs,
            duration_ms=session.duration,
        )

        # Option B: Kernel
        # test.set_kernel_vm(
        #     session_id=session.kernel_id,
        #     logs=session.logs,
        #     replay_id=session.replay_id,
        # )

# Now deploy — VM streams are included
run.deploy(client, dataset_id=42)

Why doesn't EvalRunner capture logs automatically?

The Agent protocol (respond + reset) is intentionally minimal. It doesn't know or care whether your agent uses a browser, a terminal, or an API. VM stream capture is provider-specific and varies widely, so it's left to you.
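In structural terms, the contract the runner relies on can be sketched as a Protocol. This is illustrative only, assuming the respond/reset shape described above — check the SDK for the canonical definition:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Agent(Protocol):
    # Structural sketch of the minimal contract: any object with these
    # two methods can be handed to runner.run(), browser-based or not.
    def respond(self, message: str) -> dict: ...
    def reset(self) -> None: ...

class EchoAgent:
    # Trivial conforming agent — no browser, no VM, still valid.
    def respond(self, message: str) -> dict:
        return {"text": message, "tool_calls": []}

    def reset(self) -> None:
        pass
```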

Python-specific: Tracking scenario ID

Unlike the TypeScript SDK where respond(message, scenarioId) receives the scenario ID as a parameter, the Python respond(message) method does not. Use the on_scenario callback to track which scenario is active:

class MyAgent:
    def __init__(self):
        self._current_scenario = "default"
        self.sessions = {}

    def respond(self, message: str) -> dict:
        sid = self._current_scenario
        # Use sid to key your browser sessions
        ...

    def reset(self):
        pass

    def set_scenario(self, scenario_id: str):
        self._current_scenario = scenario_id

agent = MyAgent()
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))

Manual RunBuilder with VM Streams

If you're not using EvalRunner and building results manually:

from ashr_labs import AshrLabsClient, RunBuilder

client = AshrLabsClient(api_key="tp_...")
run = RunBuilder()
run.start()

test = run.add_test("login_flow")
test.start()

# Record what the agent did
test.add_tool_call(
    expected={"name": "navigate", "arguments_json": '{"url":"https://app.example.com/login"}'},
    actual={"name": "navigate", "arguments": {"url": "https://app.example.com/login"}},
    match_status="exact",
)
test.add_agent_response(
    expected_response={"text": "Navigated to login page"},
    actual_response={"text": "I've opened the login page"},
    match_status="similar",
    semantic_similarity=0.85,
)

# Attach the browser session
test.set_vm_stream(
    provider="playwright",
    session_id="sess_001",
    duration_ms=8000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com/login"}},
        {"ts": 1500, "type": "action", "data": {"action": "type", "selector": "#email", "value": "test@example.com"}},
        {"ts": 2500, "type": "action", "data": {"action": "type", "selector": "#password", "value": "********"}},
        {"ts": 3500, "type": "action", "data": {"action": "click", "selector": "#login-btn"}},
        {"ts": 5000, "type": "navigation", "data": {"url": "https://app.example.com/dashboard"}},
        {"ts": 5500, "type": "screenshot", "data": {}},
    ],
)

test.complete()
run.complete()

run.deploy(client, dataset_id=42)

Gotchas and Common Mistakes

1. VM streams are not auto-captured

Mistake: Running runner.run(agent) and deploying, expecting VM logs to appear.

Fix: You must call test.set_vm_stream() or test.set_kernel_vm() on each test after run() returns and before deploy().

2. Parallel execution with browser agents

Mistake: Setting max_workers=4 with a browser agent.

Fix: Use max_workers=1 (the default) for browser agents. Python's EvalRunner uses deepcopy for parallel execution, and most browser clients (Playwright, Selenium, HTTP clients) cannot be deep-copied. If you need parallelism, you'll need to implement __deepcopy__ on your agent or restructure to create fresh browser sessions per copy.
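If you do need parallelism, one approach is to make copies share nothing stateful: implement `__deepcopy__` so only plain-data fields are copied and each copy lazily opens its own browser session. A sketch — the `_browser` handle here is a stand-in for your real client:

```python
import copy

class CopyableBrowserAgent:
    # Sketch: the live browser handle is created lazily and never copied,
    # so deepcopy succeeds even though real browser clients can't be
    # deep-copied. Each copy opens its own session on first use.
    def __init__(self):
        self._browser = None              # opened on first use, per copy
        self.logs: dict[str, list] = {}   # plain data — safe to copy

    def _connect(self):
        if self._browser is None:
            self._browser = object()      # stand-in for a real browser client
        return self._browser

    def __deepcopy__(self, memo):
        clone = CopyableBrowserAgent()
        clone.logs = copy.deepcopy(self.logs, memo)  # copy plain data only
        return clone                      # clone._browser stays None until used
```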

3. Python agent doesn't receive scenario ID

Mistake: Expecting respond(message, scenario_id) like TypeScript.

Fix: Use the on_scenario callback:

run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))

4. Forgetting to expose session data

Mistake: Agent collects logs in respond() but has no way to retrieve them after eval.

Fix: Add a get_session(scenario_id) method so you can access logs after runner.run() completes.

5. reset() is called at the START of each scenario, not the end

Mistake: Capturing VM metadata only inside reset(), expecting it to run after the scenario completes.

How EvalRunner actually works: reset(scenario_id) is called at the beginning of each scenario to clear state from the previous one. After the last scenario finishes, reset() is NOT called again. This means:

reset("scenario_1")   → No session exists yet, nothing to capture
respond(...)          → Session created, agent runs
reset("scenario_2")   → Called with scenario_2's ID, but scenario_1's session
                        is keyed under "scenario_1" — never matched
respond(...)          → Session created for scenario_2
[run() returns]       → No final reset() — last scenario's data uncaptured
attach_vm_streams()   → Metadata map is empty

Fix: Capture VM metadata inside respond() after each agent turn, not in reset(). The session is guaranteed to be alive during respond():

def respond(self, message: str) -> dict:
    sid = self._current_scenario
    session = self._get_or_create_session(sid)

    result = session.agent.process_message(message)

    # Capture metadata HERE — session is alive, scenario is correct
    self._captured_metadata[sid] = {
        "session_id": session.id,
        "logs": session.logs,
        "duration_ms": int((time.time() - session.start_time) * 1000),
    }

    return {"text": result, "tool_calls": [...]}

This way your metadata map is populated by the time you iterate run.tests to attach VM streams.

6. Clearing sessions on reset()

Mistake: Agent's reset() deletes browser session data, so logs are gone before you attach them.

Fix: Don't delete logs in reset(). Keep them until after deploy().
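A safe pattern is to treat reset() as "clear conversational state, keep evidence": drop per-scenario chat history, but never the log store. A sketch using field names from the examples above:

```python
class LogKeepingAgent:
    # Sketch: separate what reset() may clear from what must survive.
    def __init__(self):
        self.conversations: dict[str, list] = {}  # safe to drop on reset
        self.logs: dict[str, list] = {}           # must survive until deploy()

    def reset(self) -> None:
        self.conversations.clear()  # fresh context for the next scenario
        # deliberately NOT touching self.logs — the attach step needs them
```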

7. VM stream attached but logs are empty

Symptom: vm_stream has provider, session_id, and metadata, but logs: [] and duration_ms: 0.

Cause: You're capturing session identity (IDs, URLs) but not recording browser actions into a logs array. If your agent emits tool call events internally (e.g. navigate_url, click_mouse, type_text), you need to collect those into log entries during respond().

Fix: Create a per-scenario log list, append entries when tool calls complete, and include it in your VM stream:

def __init__(self):
    self._scenario_vm_logs: dict[str, list[dict]] = {}

def respond(self, message: str) -> dict:
    sid = self._current_scenario
    if sid not in self._scenario_vm_logs:
        self._scenario_vm_logs[sid] = []
    vm_logs = self._scenario_vm_logs[sid]
    session_start = self._session_start_times.get(sid, time.time())

    def on_tool_complete(name: str, args: dict, status: str, error: str | None = None):
        log_type = ("navigation" if "navigate" in name
                    else "screenshot" if "screenshot" in name
                    else "action")
        entry = {
            "ts": int((time.time() - session_start) * 1000),
            "type": log_type,
            "data": {"action": name, **args},
        }
        if status == "error" and error:
            entry["data"]["error"] = error
        vm_logs.append(entry)

    # ... run agent, hook into tool events ...

Then include logs when attaching:

test.set_vm_stream(
    provider="my-provider",
    session_id=session.id,
    duration_ms=int((time.time() - session_start) * 1000),
    logs=self._scenario_vm_logs.get(test.test_id, []),  # don't forget this
)

8. Timestamps

Recommendation: Use milliseconds relative to session start (i.e., first log entry at ts: 0). Absolute timestamps work too, but relative timestamps make it easier to calculate durations and display timelines.
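If your provider hands back absolute epoch timestamps, rebasing them before attaching is straightforward. A hypothetical helper, not part of the SDK:

```python
def normalize_timestamps(logs: list[dict]) -> list[dict]:
    # Rebase so the earliest entry has ts=0. Works for absolute epoch-ms
    # timestamps or already-relative ones. Returns new dicts; the input
    # list is left untouched.
    if not logs:
        return []
    base = min(entry["ts"] for entry in logs)
    return [{**entry, "ts": entry["ts"] - base} for entry in logs]
```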

9. Grading is server-side

Calling deploy() submits your results (including VM streams) but grading happens asynchronously on the server (typically 1-3 minutes). Use poll_run() to wait:

created = run.deploy(client, dataset_id=42)
graded = client.poll_run(created["id"], timeout=300)
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")