VM Integration Guide
This guide covers how to evaluate browser-based and desktop-based agents that interact with virtual machines. If your agent controls a browser (via Playwright, Selenium, Browserbase, Kernel, Steel, etc.), this is the guide for you.
Table of Contents
- Overview
- How VM Streams Work
- Quick Start
- End-to-End Example: Browser Agent with EvalRunner
- Log Entry Format
- Providers
- Attaching VM Streams to EvalRunner Results
- Manual RunBuilder with VM Streams
- Gotchas and Common Mistakes
Overview
When your agent operates in a browser or VM, you want to capture what the agent did visually — not just what tool calls it made. VM streams attach timestamped browser logs (navigations, clicks, screenshots, network requests) to each test in a run, so you can replay and debug agent behavior in the Ashr Labs dashboard.
Key concept: VM streams are metadata attached to individual tests, not to the run as a whole. Each test (scenario) gets its own VM stream with its own session, logs, and duration.
How VM Streams Work
```
┌─────────────────────────────────────────────────┐
│ Your Browser Agent                              │
│                                                 │
│ 1. Receives message from EvalRunner             │
│ 2. Executes browser actions (click, type, etc.) │
│ 3. Collects logs as it goes                     │
│ 4. Returns { "text": ..., "tool_calls": ... }   │
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│ EvalRunner                                      │
│                                                 │
│ - Records tool call matches ✓                   │
│ - Records text response matches ✓               │
│ - Does NOT auto-capture VM logs ✗               │
└────────────────────┬────────────────────────────┘
                     │
                     ▼  You must do this yourself
┌─────────────────────────────────────────────────┐
│ After runner.run(), attach VM streams:          │
│                                                 │
│   test.set_vm_stream("browserbase", ...)        │
│              — or —                             │
│   test.set_kernel_vm("session_id", ...)         │
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│ run.deploy(client, dataset_id)                  │
│   → VM logs submitted with run result           │
│   → Viewable in Ashr Labs dashboard             │
└─────────────────────────────────────────────────┘
```
Important: EvalRunner handles tool call and text comparison automatically, but it does not capture VM/browser logs. You must:
- Collect logs inside your agent during `respond()`
- Attach them to each `TestBuilder` after `runner.run()` returns
- Then call `run.deploy()`
Quick Start
Minimal example — a browser agent that records navigation logs:
```python
from ashr_labs import AshrLabsClient, EvalRunner
import time

client = AshrLabsClient.from_env()

class MyBrowserAgent:
    def __init__(self):
        self.logs: dict[str, dict] = {}
        self._scenario: str = "default"

    def respond(self, message: str) -> dict:
        if self._scenario not in self.logs:
            self.logs[self._scenario] = {"session_id": f"sess_{int(time.time())}", "entries": []}
        ctx = self.logs[self._scenario]
        # ... your browser automation here ...
        ctx["entries"].append({"ts": 0, "type": "navigation", "data": {"url": "https://example.com"}})
        return {"text": "Done", "tool_calls": []}

    def reset(self):
        pass  # Don't clear logs — we need them after eval

agent = MyBrowserAgent()
runner = EvalRunner.from_dataset(client, dataset_id=42)
run = runner.run(agent, on_scenario=lambda sid, s: setattr(agent, "_scenario", sid))

# Attach VM streams BEFORE deploying
for test in run.tests:
    session = agent.logs.get(test.test_id)
    if session:
        test.set_vm_stream(
            provider="playwright",
            session_id=session["session_id"],
            logs=session["entries"],
        )

run.deploy(client, dataset_id=42)
```
End-to-End Example: Browser Agent with EvalRunner
This is a complete, working example of a browser agent that:
- Uses the Anthropic SDK for reasoning
- Drives a browser via automation
- Records VM logs per scenario
- Runs evals and deploys with VM streams attached
```python
import anthropic
import time
from ashr_labs import AshrLabsClient, EvalRunner

class BrowserSession:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.logs: list[dict] = []
        self.start_time = time.time()

    def elapsed_ms(self) -> int:
        return int((time.time() - self.start_time) * 1000)

class WebNavigationAgent:
    """Browser agent that records VM logs for each scenario."""

    TOOLS = [
        {
            "name": "navigate",
            "description": "Navigate to a URL",
            "input_schema": {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]},
        },
        {
            "name": "click",
            "description": "Click an element",
            "input_schema": {"type": "object", "properties": {"selector": {"type": "string"}}, "required": ["selector"]},
        },
        {
            "name": "type_text",
            "description": "Type text into an input",
            "input_schema": {
                "type": "object",
                "properties": {"selector": {"type": "string"}, "text": {"type": "string"}},
                "required": ["selector", "text"],
            },
        },
    ]

    def __init__(self):
        self.anthropic = anthropic.Anthropic()
        self.sessions: dict[str, BrowserSession] = {}
        self.conversations: dict[str, list] = {}
        self._current_scenario: str = "default"

    def respond(self, message: str) -> dict:
        sid = self._current_scenario

        # Initialize browser session for this scenario
        if sid not in self.sessions:
            session = self._create_browser_session()
            self.sessions[sid] = BrowserSession(session["id"])
        session = self.sessions[sid]

        history = self.conversations.setdefault(sid, [])
        history.append({"role": "user", "content": message})

        all_tool_calls = []
        final_text = ""

        for _ in range(10):
            response = self.anthropic.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a web navigation agent. Use tools to interact with the browser.",
                tools=self.TOOLS,
                messages=history,
            )

            # Collect text
            for block in response.content:
                if block.type == "text":
                    final_text += block.text

            # Process tool calls
            tool_blocks = [b for b in response.content if b.type == "tool_use"]
            if not tool_blocks:
                history.append({"role": "assistant", "content": response.content})
                break

            # Execute each tool call and record logs
            tool_results = []
            for tool in tool_blocks:
                ts = session.elapsed_ms()
                all_tool_calls.append({"name": tool.name, "arguments": tool.input})

                if tool.name == "navigate":
                    self._browser_navigate(session.session_id, tool.input["url"])
                    session.logs.append({"ts": ts, "type": "navigation", "data": {"url": tool.input["url"]}})
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Navigated"})
                elif tool.name == "click":
                    self._browser_click(session.session_id, tool.input["selector"])
                    session.logs.append({
                        "ts": ts, "type": "action",
                        "data": {"action": "click", "selector": tool.input["selector"]},
                    })
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Clicked"})
                elif tool.name == "type_text":
                    self._browser_type(session.session_id, tool.input["selector"], tool.input["text"])
                    session.logs.append({
                        "ts": ts, "type": "action",
                        "data": {"action": "type", "selector": tool.input["selector"], "value": tool.input["text"]},
                    })
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Typed"})

                # Screenshot after each action
                session.logs.append({"ts": session.elapsed_ms(), "type": "screenshot", "data": {}})

            history.append({"role": "assistant", "content": response.content})
            history.append({"role": "user", "content": tool_results})

            if response.stop_reason == "end_turn":
                break

        self.conversations[sid] = history
        return {"text": final_text, "tool_calls": all_tool_calls}

    def reset(self):
        # Don't clear sessions — we need the logs after eval completes
        pass

    def set_scenario(self, scenario_id: str):
        """Called from on_scenario callback to track current scenario."""
        self._current_scenario = scenario_id

    def get_session(self, scenario_id: str) -> BrowserSession | None:
        return self.sessions.get(scenario_id)

    # -- Replace these stubs with your actual browser provider --
    def _create_browser_session(self) -> dict:
        return {"id": f"sess_{int(time.time())}"}

    def _browser_navigate(self, session_id: str, url: str): pass
    def _browser_click(self, session_id: str, selector: str): pass
    def _browser_type(self, session_id: str, selector: str, text: str): pass

# ─── Main ────────────────────────────────────────────────────────────────

def main():
    client = AshrLabsClient.from_env()
    agent = WebNavigationAgent()

    # 1. Run eval
    runner = EvalRunner.from_dataset(client, dataset_id=42)
    run = runner.run(
        agent,
        max_workers=1,  # Use 1 for browser agents (see Gotchas below)
        on_scenario=lambda sid, s: (
            agent.set_scenario(sid),
            print(f"▶ Scenario: {s.get('title', sid)}"),
        ),
        on_action=lambda i, a: print(f"  Action {i}: {a.get('content', '')[:60]}"),
    )

    # 2. Attach VM streams to each test
    for test in run.tests:
        session = agent.get_session(test.test_id)
        if session:
            test.set_vm_stream(
                provider="custom",
                session_id=session.session_id,
                duration_ms=session.elapsed_ms(),
                logs=session.logs,
                metadata={
                    "browser": "chromium",
                    "viewport": {"width": 1280, "height": 720},
                },
            )

    # 3. Inspect metrics
    result = run.build()
    metrics = result["aggregate_metrics"]
    print(f"\nResults: {metrics['tests_passed']}/{metrics['total_tests']} passed")
    print(f"Tool divergence: {metrics['total_tool_call_divergence']}")

    # 4. Deploy and wait for grading
    created = run.deploy(client, dataset_id=42)
    print(f"Run submitted: {created['id']}")
    graded = client.poll_run(created["id"], timeout=300)
    print(f"Grading complete: {graded['result']['aggregate_metrics']}")

if __name__ == "__main__":
    main()
```
Log Entry Format
Every log entry in a VM stream follows this structure:
```python
{
    "ts": 1200,                  # Timestamp in ms (relative to session start, or absolute)
    "type": "action",            # One of the types below
    "data": {"action": "click"}  # Type-specific payload
}
```
Log Types Reference
| Type | When to use | Required data fields | Optional data fields |
|---|---|---|---|
| `navigation` | Browser navigated to a new URL | `url` | - |
| `action` | Agent interacted with the page | `action` (`"click"`, `"type"`, `"select"`, `"scroll"`) | `selector`, `value`, `delta_x`, `delta_y` |
| `network` | HTTP request completed | `method`, `url` | `status`, `duration_ms` |
| `console` | Browser console output | `message` | `level` (`"log"`, `"warn"`, `"error"`) |
| `error` | An error occurred | `message` | `code`, `details` |
| `screenshot` | Screenshot was captured | - | `s3_key`, `format` (`"png"`, `"jpeg"`) |
Example Logs
```python
# Navigation
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}}

# Click action
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login-btn"}}

# Type into input
{"ts": 2000, "type": "action", "data": {"action": "type", "selector": "#email", "value": "user@example.com"}}

# Scroll
{"ts": 2500, "type": "action", "data": {"action": "scroll", "delta_x": 0, "delta_y": 300}}

# Network request
{"ts": 3000, "type": "network", "data": {"method": "POST", "url": "/api/login", "status": 200}}

# Console warning
{"ts": 3100, "type": "console", "data": {"level": "warn", "message": "Deprecated API called"}}

# Error
{"ts": 4000, "type": "error", "data": {"message": "Element not found: #checkout-btn"}}

# Screenshot
{"ts": 5000, "type": "screenshot", "data": {"s3_key": "vm-streams/run_42/frame_001.png"}}
```
Providers
Generic Provider
Use set_vm_stream() for any browser provider (Browserbase, Steel, Scrapybara, Playwright, Selenium, or your own):
```python
test.set_vm_stream(provider, session_id=None, duration_ms=None, logs=None, metadata=None)
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `provider` | `str` | Yes | Provider name (e.g. `"browserbase"`, `"steel"`, `"playwright"`, `"custom"`) |
| `session_id` | `str` | No | Your provider's session ID |
| `duration_ms` | `int` | No | Total session duration in milliseconds |
| `logs` | `list[dict]` | No | Array of timestamped log entries |
| `metadata` | `dict` | No | Any additional provider-specific data |
```python
test.set_vm_stream(
    provider="browserbase",
    session_id="sess_abc123",
    duration_ms=12000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://shop.example.com"}},
        {"ts": 800, "type": "action", "data": {"action": "click", "selector": "#product"}},
        {"ts": 3500, "type": "network", "data": {"method": "POST", "url": "/api/cart", "status": 200}},
    ],
    metadata={"browser": "chromium", "viewport": {"width": 1280, "height": 720}},
)
```
Kernel Browser
Use set_kernel_vm() for Kernel browser sessions. This is a convenience wrapper that sets provider="kernel" and exposes Kernel-specific metadata as named parameters:
```python
test.set_kernel_vm(session_id, duration_ms=None, logs=None, *,
                   live_view_url=None, cdp_ws_url=None, replay_id=None,
                   replay_view_url=None, headless=None, stealth=None,
                   viewport=None)
```
Note: Parameters after `*` are keyword-only — you must pass them by name (e.g. `replay_id="replay_abc"`).
| Parameter | Type | Required | Description |
|---|---|---|---|
| `session_id` | `str` | Yes | Kernel browser session ID |
| `duration_ms` | `int` | No | Total session duration in milliseconds |
| `logs` | `list[dict]` | No | Array of timestamped log entries |
| `live_view_url` | `str` | No | Kernel's `browser_live_view_url` for real-time viewing |
| `cdp_ws_url` | `str` | No | Chrome DevTools Protocol WebSocket URL |
| `replay_id` | `str` | No | Kernel session recording ID |
| `replay_view_url` | `str` | No | URL to view the session replay |
| `headless` | `bool` | No | Whether the session ran headless |
| `stealth` | `bool` | No | Whether anti-bot stealth mode was enabled |
| `viewport` | `dict` | No | Browser viewport dimensions `{"width": int, "height": int}` |
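Because everything after `logs` is keyword-only, a positional call past `logs` fails fast with a `TypeError`. A standalone sketch of that behavior, using a stand-in function (not the real SDK method) with the same calling convention:

```python
# Stand-in mirroring set_kernel_vm's signature shape (illustrative only).
def set_kernel_vm(session_id, duration_ms=None, logs=None, *, replay_id=None, stealth=None):
    return {"session_id": session_id, "replay_id": replay_id, "stealth": stealth}

# Named keyword arguments work:
stream = set_kernel_vm("kern_sess_abc123", 15000, [], replay_id="replay_abc", stealth=True)
assert stream["replay_id"] == "replay_abc"

# Positional arguments after logs are rejected:
try:
    set_kernel_vm("kern_sess_abc123", 15000, [], "replay_abc")
    raise AssertionError("expected TypeError")
except TypeError:
    pass
```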
```python
test.set_kernel_vm(
    session_id="kern_sess_abc123",
    duration_ms=15000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
        {"ts": 2500, "type": "action", "data": {"action": "type", "selector": "#email", "value": "user@example.com"}},
        {"ts": 3800, "type": "action", "data": {"action": "click", "selector": "#submit"}},
    ],
    replay_id="replay_abc123",
    replay_view_url="https://www.kernel.sh/replays/replay_abc123",
    stealth=True,
    viewport={"width": 1920, "height": 1080},
)
```
When to use which?
- `set_kernel_vm()` — You're using Kernel and want replay URLs, live view, and CDP access tracked in your results.
- `set_vm_stream()` — Everything else. Works with any provider. Pass whatever metadata makes sense for your setup.
Attaching VM Streams to EvalRunner Results
This is the most common source of confusion. EvalRunner.run() returns a RunBuilder with TestBuilder instances for each scenario. The runner records tool calls and text comparisons, but you must attach VM streams.
```python
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))

# run.tests is a list of TestBuilder instances
# Each test.test_id matches the scenario ID from the dataset
for test in run.tests:
    session = agent.get_session(test.test_id)
    if session:
        # Option A: generic provider
        test.set_vm_stream(
            provider="my-provider",
            session_id=session.id,
            logs=session.logs,
            duration_ms=session.duration,
        )
        # Option B: Kernel
        # test.set_kernel_vm(
        #     session_id=session.kernel_id,
        #     logs=session.logs,
        #     replay_id=session.replay_id,
        # )

# Now deploy — VM streams are included
run.deploy(client, dataset_id=42)
```
Why doesn't EvalRunner capture logs automatically?
The Agent protocol (respond + reset) is intentionally minimal. It doesn't know or care whether your agent uses a browser, a terminal, or an API. VM stream capture is provider-specific and varies widely, so it's left to you.
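In code, that contract is tiny. A minimal sketch (`MinimalAgent` is illustrative, not an SDK class):

```python
class MinimalAgent:
    """The entire surface EvalRunner needs: respond() and reset()."""

    def respond(self, message: str) -> dict:
        # Return this turn's text plus any tool calls the agent made.
        return {"text": f"echo: {message}", "tool_calls": []}

    def reset(self):
        # Called to clear per-scenario state.
        pass

result = MinimalAgent().respond("hello")
assert set(result) == {"text", "tool_calls"}
```

Anything beyond these two methods — browsers, terminals, log capture — is the agent's private business, which is why VM streams are attached separately.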
Python-specific: Tracking scenario ID
Unlike the TypeScript SDK where respond(message, scenarioId) receives the scenario ID as a parameter, the Python respond(message) method does not. Use the on_scenario callback to track which scenario is active:
```python
class MyAgent:
    def __init__(self):
        self._current_scenario = "default"
        self.sessions = {}

    def respond(self, message: str) -> dict:
        sid = self._current_scenario
        # Use sid to key your browser sessions
        ...

    def reset(self):
        pass

    def set_scenario(self, scenario_id: str):
        self._current_scenario = scenario_id

agent = MyAgent()
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))
```
Manual RunBuilder with VM Streams
If you're not using EvalRunner and building results manually:
```python
from ashr_labs import AshrLabsClient, RunBuilder

client = AshrLabsClient(api_key="tp_...")

run = RunBuilder()
run.start()

test = run.add_test("login_flow")
test.start()

# Record what the agent did
test.add_tool_call(
    expected={"name": "navigate", "arguments_json": '{"url":"https://app.example.com/login"}'},
    actual={"name": "navigate", "arguments": {"url": "https://app.example.com/login"}},
    match_status="exact",
)
test.add_agent_response(
    expected_response={"text": "Navigated to login page"},
    actual_response={"text": "I've opened the login page"},
    match_status="similar",
    semantic_similarity=0.85,
)

# Attach the browser session
test.set_vm_stream(
    provider="playwright",
    session_id="sess_001",
    duration_ms=8000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com/login"}},
        {"ts": 1500, "type": "action", "data": {"action": "type", "selector": "#email", "value": "test@example.com"}},
        {"ts": 2500, "type": "action", "data": {"action": "type", "selector": "#password", "value": "********"}},
        {"ts": 3500, "type": "action", "data": {"action": "click", "selector": "#login-btn"}},
        {"ts": 5000, "type": "navigation", "data": {"url": "https://app.example.com/dashboard"}},
        {"ts": 5500, "type": "screenshot", "data": {}},
    ],
)

test.complete()
run.complete()
run.deploy(client, dataset_id=42)
```
Gotchas and Common Mistakes
1. VM streams are not auto-captured
Mistake: Running runner.run(agent) and deploying, expecting VM logs to appear.
Fix: You must call test.set_vm_stream() or test.set_kernel_vm() on each test after run() returns and before deploy().
2. Parallel execution with browser agents
Mistake: Setting max_workers=4 with a browser agent.
Fix: Use max_workers=1 (the default) for browser agents. Python's EvalRunner uses deepcopy for parallel execution, and most browser clients (Playwright, Selenium, HTTP clients) cannot be deep-copied. If you need parallelism, you'll need to implement __deepcopy__ on your agent or restructure to create fresh browser sessions per copy.
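One way to regain parallelism, sketched here with hypothetical names (`FreshSessionAgent`, `BrowserHandle` — not SDK classes), is a custom `__deepcopy__` that gives each worker a brand-new browser handle rather than copying a live one:

```python
import copy

class BrowserHandle:
    """Stand-in for a real browser client; real ones hold sockets and locks
    that cannot be safely deep-copied."""
    def __init__(self):
        self.connection = object()  # pretend live connection

class FreshSessionAgent:
    def __init__(self):
        self.browser = BrowserHandle()
        self.logs: dict[str, list] = {}

    def __deepcopy__(self, memo):
        # Build a fresh agent with its own browser connection;
        # deep-copy only the plain-data state.
        clone = FreshSessionAgent()
        clone.logs = copy.deepcopy(self.logs, memo)
        return clone

agent = FreshSessionAgent()
worker = copy.deepcopy(agent)
assert worker.browser is not agent.browser  # each worker owns its session
```

Per-worker sessions also mean per-worker logs, so you would still need to merge them by scenario ID before attaching VM streams.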
3. Python agent doesn't receive scenario ID
Mistake: Expecting respond(message, scenario_id) like TypeScript.
Fix: Use the on_scenario callback:
```python
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))
```
4. Forgetting to expose session data
Mistake: Agent collects logs in respond() but has no way to retrieve them after eval.
Fix: Add a get_session(scenario_id) method so you can access logs after runner.run() completes.
5. reset() is called at the START of each scenario, not the end
Mistake: Capturing VM metadata only inside reset(), expecting it to run after the scenario completes.
How EvalRunner actually works: reset(scenario_id) is called at the beginning of each scenario to clear state from the previous one. After the last scenario finishes, reset() is NOT called again. This means:
reset("scenario_1") → No session exists yet, nothing to capture
respond(...) → Session created, agent runs
reset("scenario_2") → Called with scenario_2's ID, but scenario_1's session
is keyed under "scenario_1" — never matched
respond(...) → Session created for scenario_2
[run() returns] → No final reset() — last scenario's data uncaptured
attach_vm_streams() → Metadata map is empty
Fix: Capture VM metadata inside respond() after each agent turn, not in reset(). The session is guaranteed to be alive during respond():
```python
def respond(self, message: str) -> dict:
    sid = self._current_scenario
    session = self._get_or_create_session(sid)
    result = session.agent.process_message(message)

    # Capture metadata HERE — session is alive, scenario is correct
    self._captured_metadata[sid] = {
        "session_id": session.id,
        "logs": session.logs,
        "duration_ms": int((time.time() - session.start_time) * 1000),
    }
    return {"text": result, "tool_calls": [...]}
```
This way your metadata map is populated by the time you iterate run.tests to attach VM streams.
6. Clearing sessions on reset()
Mistake: Agent's reset() deletes browser session data, so logs are gone before you attach them.
Fix: Don't delete logs in reset(). Keep them until after deploy().
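A safe `reset()` clears conversational state but leaves VM logs alone. A sketch (field names are illustrative):

```python
class SafeResetAgent:
    def __init__(self):
        self.conversations: dict[str, list] = {}  # fine to clear per scenario
        self.vm_logs: dict[str, list] = {}        # must survive until deploy()

    def reset(self):
        # VM logs are attached to tests only after runner.run() returns,
        # so they have to outlive every reset() call.
        self.conversations.clear()

agent = SafeResetAgent()
agent.vm_logs["checkout"] = [{"ts": 0, "type": "screenshot", "data": {}}]
agent.conversations["checkout"] = [{"role": "user", "content": "Buy it"}]
agent.reset()
assert agent.vm_logs["checkout"]   # logs kept
assert not agent.conversations     # chat state gone
```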
7. VM stream attached but logs are empty
Symptom: vm_stream has provider, session_id, and metadata, but logs: [] and duration_ms: 0.
Cause: You're capturing session identity (IDs, URLs) but not recording browser actions into a logs array. If your agent emits tool call events internally (e.g. navigate_url, click_mouse, type_text), you need to collect those into log entries during respond().
Fix: Create a per-scenario log list, append entries when tool calls complete, and include it in your VM stream:
```python
def __init__(self):
    self._scenario_vm_logs: dict[str, list[dict]] = {}

def respond(self, message: str) -> dict:
    sid = self._current_scenario
    if sid not in self._scenario_vm_logs:
        self._scenario_vm_logs[sid] = []
    vm_logs = self._scenario_vm_logs[sid]
    session_start = self._session_start_times.get(sid, time.time())

    def on_tool_complete(name: str, args: dict, status: str, error: str | None = None):
        log_type = ("navigation" if "navigate" in name
                    else "screenshot" if "screenshot" in name
                    else "action")
        entry = {
            "ts": int((time.time() - session_start) * 1000),
            "type": log_type,
            "data": {"action": name, **args},
        }
        if status == "error" and error:
            entry["data"]["error"] = error
        vm_logs.append(entry)

    # ... run agent, hook into tool events ...
```
Then include logs when attaching:
```python
test.set_vm_stream(
    provider="my-provider",
    session_id=session.id,
    duration_ms=int((time.time() - session_start) * 1000),
    logs=self._scenario_vm_logs.get(test.test_id, []),  # don't forget this
)
```
8. Timestamps
Recommendation: Use milliseconds relative to session start (i.e., first log entry at ts: 0). Absolute timestamps work too, but relative timestamps make it easier to calculate durations and display timelines.
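If your provider reports absolute timestamps (e.g. epoch seconds), rebasing them to relative milliseconds is straightforward. A hypothetical helper (`to_relative_ms` is not part of the SDK):

```python
def to_relative_ms(entries: list[dict]) -> list[dict]:
    """Rebase absolute epoch-second timestamps so the first entry is ts=0 (in ms)."""
    if not entries:
        return []
    t0 = min(e["ts"] for e in entries)
    return [{**e, "ts": round((e["ts"] - t0) * 1000)} for e in entries]

raw = [
    {"ts": 1700000000.0, "type": "navigation", "data": {"url": "https://example.com"}},
    {"ts": 1700000001.2, "type": "action", "data": {"action": "click", "selector": "#go"}},
]
rebased = to_relative_ms(raw)
assert rebased[0]["ts"] == 0
assert rebased[1]["ts"] == 1200
```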
9. Grading is server-side
Calling deploy() submits your results (including VM streams) but grading happens asynchronously on the server (typically 1-3 minutes). Use poll_run() to wait:
```python
created = run.deploy(client, dataset_id=42)
graded = client.poll_run(created["id"], timeout=300)

metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
```