VM Integration Guide
This guide covers how to evaluate browser-based and desktop-based agents that interact with virtual machines. If your agent controls a browser (via Playwright, Selenium, Browserbase, Kernel, Steel, etc.), this is the guide for you.
Table of Contents
- Overview
- How VM Streams Work
- Quick Start
- End-to-End Example: Browser Agent with EvalRunner
- Log Entry Format
- Providers
- Attaching VM Streams to EvalRunner Results
- Manual RunBuilder with VM Streams
- Gotchas and Common Mistakes
Overview
When your agent operates in a browser or VM, you want to capture what the agent did visually — not just what tool calls it made. VM streams attach timestamped browser logs (navigations, clicks, screenshots, network requests) to each test in a run, so you can replay and debug agent behavior in the Ashr Labs dashboard.
Key concept: VM streams are metadata attached to individual tests, not to the run as a whole. Each test (scenario) gets its own VM stream with its own session, logs, and duration.
How VM Streams Work
```
┌─────────────────────────────────────────────────┐
│ Your Browser Agent                              │
│                                                 │
│ 1. Receives message from EvalRunner             │
│ 2. Executes browser actions (click, type, etc.) │
│ 3. Collects logs as it goes                     │
│ 4. Returns { "text": ..., "tool_calls": ... }   │
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│ EvalRunner                                      │
│                                                 │
│ - Records tool call matches ✓                   │
│ - Records text response matches ✓               │
│ - Does NOT auto-capture VM logs ✗               │
└────────────────────┬────────────────────────────┘
                     │
                     ▼  You must do this yourself
┌─────────────────────────────────────────────────┐
│ After runner.run(), attach VM streams:          │
│                                                 │
│   test.set_vm_stream("browserbase", ...)        │
│              — or —                             │
│   test.set_kernel_vm("session_id", ...)         │
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│ run.deploy(client, dataset_id)                  │
│   → VM logs submitted with run result           │
│   → Viewable in Ashr Labs dashboard             │
└─────────────────────────────────────────────────┘
```
Important: EvalRunner handles tool call and text comparison automatically, but it does not capture VM/browser logs. You must:
- Collect logs inside your agent during `respond()`
- Attach them to each `TestBuilder` after `runner.run()` returns
- Then call `run.deploy()`
Quick Start
Minimal example — a browser agent that records navigation logs:
```python
from ashr_labs import AshrLabsClient, EvalRunner
import time

client = AshrLabsClient.from_env()

class MyBrowserAgent:
    def __init__(self):
        self.logs: dict[str, dict] = {}
        self._scenario: str = "default"

    def respond(self, message: str) -> dict:
        if self._scenario not in self.logs:
            self.logs[self._scenario] = {"session_id": f"sess_{int(time.time())}", "entries": []}
        ctx = self.logs[self._scenario]
        # ... your browser automation here ...
        ctx["entries"].append({"ts": 0, "type": "navigation", "data": {"url": "https://example.com"}})
        return {"text": "Done", "tool_calls": []}

    def reset(self):
        pass  # Don't clear logs — we need them after eval

agent = MyBrowserAgent()
runner = EvalRunner.from_dataset(client, dataset_id=42)
run = runner.run(agent, on_scenario=lambda sid, s: setattr(agent, "_scenario", sid))

# Attach VM streams BEFORE deploying
for test in run.tests:
    session = agent.logs.get(test.test_id)
    if session:
        test.set_vm_stream(
            provider="playwright",
            session_id=session["session_id"],
            logs=session["entries"],
        )

run.deploy(client, dataset_id=42)
```
End-to-End Example: Browser Agent with EvalRunner
This is a complete, working example of a browser agent that:
- Uses the Anthropic SDK for reasoning
- Drives a browser via automation
- Records VM logs per scenario
- Runs evals and deploys with VM streams attached
```python
import anthropic
import time
from ashr_labs import AshrLabsClient, EvalRunner

class BrowserSession:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.logs: list[dict] = []
        self.start_time = time.time()

    def elapsed_ms(self) -> int:
        return int((time.time() - self.start_time) * 1000)

class WebNavigationAgent:
    """Browser agent that records VM logs for each scenario."""

    TOOLS = [
        {
            "name": "navigate",
            "description": "Navigate to a URL",
            "input_schema": {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]},
        },
        {
            "name": "click",
            "description": "Click an element",
            "input_schema": {"type": "object", "properties": {"selector": {"type": "string"}}, "required": ["selector"]},
        },
        {
            "name": "type_text",
            "description": "Type text into an input",
            "input_schema": {
                "type": "object",
                "properties": {"selector": {"type": "string"}, "text": {"type": "string"}},
                "required": ["selector", "text"],
            },
        },
    ]

    def __init__(self):
        self.anthropic = anthropic.Anthropic()
        self.sessions: dict[str, BrowserSession] = {}
        self.conversations: dict[str, list] = {}
        self._current_scenario: str = "default"

    def respond(self, message: str) -> dict:
        sid = self._current_scenario

        # Initialize browser session for this scenario
        if sid not in self.sessions:
            session = self._create_browser_session()
            self.sessions[sid] = BrowserSession(session["id"])
        session = self.sessions[sid]

        history = self.conversations.setdefault(sid, [])
        history.append({"role": "user", "content": message})

        all_tool_calls = []
        final_text = ""

        for _ in range(10):
            response = self.anthropic.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a web navigation agent. Use tools to interact with the browser.",
                tools=self.TOOLS,
                messages=history,
            )

            # Collect text
            for block in response.content:
                if block.type == "text":
                    final_text += block.text

            # Process tool calls
            tool_blocks = [b for b in response.content if b.type == "tool_use"]
            if not tool_blocks:
                history.append({"role": "assistant", "content": response.content})
                break

            # Execute each tool call and record logs
            tool_results = []
            for tool in tool_blocks:
                ts = session.elapsed_ms()
                all_tool_calls.append({"name": tool.name, "arguments": tool.input})

                if tool.name == "navigate":
                    self._browser_navigate(session.session_id, tool.input["url"])
                    session.logs.append({"ts": ts, "type": "navigation", "data": {"url": tool.input["url"]}})
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Navigated"})
                elif tool.name == "click":
                    self._browser_click(session.session_id, tool.input["selector"])
                    session.logs.append({
                        "ts": ts, "type": "action",
                        "data": {"action": "click", "selector": tool.input["selector"]},
                    })
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Clicked"})
                elif tool.name == "type_text":
                    self._browser_type(session.session_id, tool.input["selector"], tool.input["text"])
                    session.logs.append({
                        "ts": ts, "type": "action",
                        "data": {"action": "type", "selector": tool.input["selector"], "value": tool.input["text"]},
                    })
                    tool_results.append({"type": "tool_result", "tool_use_id": tool.id, "content": "Typed"})

                # Screenshot after each action
                session.logs.append({"ts": session.elapsed_ms(), "type": "screenshot", "data": {}})

            history.append({"role": "assistant", "content": response.content})
            history.append({"role": "user", "content": tool_results})

            if response.stop_reason == "end_turn":
                break

        self.conversations[sid] = history
        return {"text": final_text, "tool_calls": all_tool_calls}

    def reset(self):
        # Don't clear sessions — we need the logs after eval completes
        pass

    def set_scenario(self, scenario_id: str):
        """Called from on_scenario callback to track current scenario."""
        self._current_scenario = scenario_id

    def get_session(self, scenario_id: str) -> BrowserSession | None:
        return self.sessions.get(scenario_id)

    # -- Replace these stubs with your actual browser provider --
    def _create_browser_session(self) -> dict:
        return {"id": f"sess_{int(time.time())}"}

    def _browser_navigate(self, session_id: str, url: str): pass
    def _browser_click(self, session_id: str, selector: str): pass
    def _browser_type(self, session_id: str, selector: str, text: str): pass

# ─── Main ────────────────────────────────────────────────────────────────

def main():
    client = AshrLabsClient.from_env()
    agent = WebNavigationAgent()

    # 1. Run eval
    runner = EvalRunner.from_dataset(client, dataset_id=42)
    run = runner.run(
        agent,
        max_workers=1,  # Use 1 for browser agents (see Gotchas below)
        on_scenario=lambda sid, s: (
            agent.set_scenario(sid),
            print(f"▶ Scenario: {s.get('title', sid)}"),
        ),
        on_action=lambda i, a: print(f"  Action {i}: {a.get('content', '')[:60]}"),
    )

    # 2. Attach VM streams to each test
    for test in run.tests:
        session = agent.get_session(test.test_id)
        if session:
            test.set_vm_stream(
                provider="custom",
                session_id=session.session_id,
                duration_ms=session.elapsed_ms(),
                logs=session.logs,
                metadata={
                    "browser": "chromium",
                    "viewport": {"width": 1280, "height": 720},
                },
            )

    # 3. Inspect metrics
    result = run.build()
    metrics = result["aggregate_metrics"]
    print(f"\nResults: {metrics['tests_passed']}/{metrics['total_tests']} passed")
    print(f"Tool divergence: {metrics['total_tool_call_divergence']}")

    # 4. Deploy and wait for grading
    created = run.deploy(client, dataset_id=42)
    print(f"Run submitted: {created['id']}")
    graded = client.poll_run(created["id"], timeout=300)
    print(f"Grading complete: {graded['result']['aggregate_metrics']}")

if __name__ == "__main__":
    main()
```
Log Entry Format
Every log entry in a VM stream follows this structure:
```python
{
    "ts": 1200,                  # Timestamp in ms (relative to session start, or absolute)
    "type": "action",            # One of the types below
    "data": {"action": "click"}  # Type-specific payload
}
```
Log Types Reference
| Type | When to use | Required data fields | Optional data fields |
|---|---|---|---|
| `navigation` | Browser navigated to a new URL | `url` | - |
| `action` | Agent interacted with the page | `action` (`"click"`, `"type"`, `"select"`, `"scroll"`) | `selector`, `value`, `delta_x`, `delta_y` |
| `network` | HTTP request completed | `method`, `url` | `status`, `duration_ms` |
| `console` | Browser console output | `message` | `level` (`"log"`, `"warn"`, `"error"`) |
| `error` | An error occurred | `message` | `code`, `details` |
| `screenshot` | Screenshot was captured | - | `s3_key`, `format` (`"png"`, `"jpeg"`) |
Example Logs
```python
# Navigation
{"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}}

# Click action
{"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login-btn"}}

# Type into input
{"ts": 2000, "type": "action", "data": {"action": "type", "selector": "#email", "value": "user@example.com"}}

# Scroll
{"ts": 2500, "type": "action", "data": {"action": "scroll", "delta_x": 0, "delta_y": 300}}

# Network request
{"ts": 3000, "type": "network", "data": {"method": "POST", "url": "/api/login", "status": 200}}

# Console warning
{"ts": 3100, "type": "console", "data": {"level": "warn", "message": "Deprecated API called"}}

# Error
{"ts": 4000, "type": "error", "data": {"message": "Element not found: #checkout-btn"}}

# Screenshot
{"ts": 5000, "type": "screenshot", "data": {"s3_key": "vm-streams/run_42/frame_001.png"}}
```
Providers
Generic Provider
Use set_vm_stream() for any browser provider (Browserbase, Steel, Scrapybara, Playwright, Selenium, or your own):
```python
test.set_vm_stream(provider, session_id=None, duration_ms=None, logs=None, metadata=None)
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `provider` | `str` | Yes | Provider name (e.g. `"browserbase"`, `"steel"`, `"playwright"`, `"custom"`) |
| `session_id` | `str` | No | Your provider's session ID |
| `duration_ms` | `int` | No | Total session duration in milliseconds |
| `logs` | `list[dict]` | No | Array of timestamped log entries |
| `metadata` | `dict` | No | Any additional provider-specific data |
```python
test.set_vm_stream(
    provider="browserbase",
    session_id="sess_abc123",
    duration_ms=12000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://shop.example.com"}},
        {"ts": 800, "type": "action", "data": {"action": "click", "selector": "#product"}},
        {"ts": 3500, "type": "network", "data": {"method": "POST", "url": "/api/cart", "status": 200}},
    ],
    metadata={"browser": "chromium", "viewport": {"width": 1280, "height": 720}},
)
```
Kernel Browser
Use set_kernel_vm() for Kernel browser sessions. This is a convenience wrapper that sets provider="kernel" and exposes Kernel-specific metadata as named parameters:
```python
test.set_kernel_vm(session_id, duration_ms=None, logs=None, *,
                   live_view_url=None, cdp_ws_url=None, replay_id=None,
                   replay_view_url=None, headless=None, stealth=None,
                   viewport=None)
```
Note: Parameters after `*` are keyword-only — you must pass them by name (e.g. `replay_id="replay_abc"`).
| Parameter | Type | Required | Description |
|---|---|---|---|
| `session_id` | `str` | Yes | Kernel browser session ID |
| `duration_ms` | `int` | No | Total session duration in milliseconds |
| `logs` | `list[dict]` | No | Array of timestamped log entries |
| `live_view_url` | `str` | No | Kernel's `browser_live_view_url` for real-time viewing |
| `cdp_ws_url` | `str` | No | Chrome DevTools Protocol WebSocket URL |
| `replay_id` | `str` | No | Kernel session recording ID |
| `replay_view_url` | `str` | No | URL to view the session replay |
| `headless` | `bool` | No | Whether the session ran headless |
| `stealth` | `bool` | No | Whether anti-bot stealth mode was enabled |
| `viewport` | `dict` | No | Browser viewport dimensions `{"width": int, "height": int}` |
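Because everything after `logs` is keyword-only, a positional call past `logs` fails fast with a `TypeError`. A standalone sketch of that behavior, using a stand-in function (not the real SDK method) with the same calling convention:

```python
# Stand-in mirroring set_kernel_vm's signature shape (illustrative only).
def set_kernel_vm(session_id, duration_ms=None, logs=None, *, replay_id=None, stealth=None):
    return {"session_id": session_id, "replay_id": replay_id, "stealth": stealth}

# Named keyword arguments work:
stream = set_kernel_vm("kern_sess_abc123", 15000, [], replay_id="replay_abc", stealth=True)
assert stream["replay_id"] == "replay_abc"

# Positional arguments after logs are rejected:
try:
    set_kernel_vm("kern_sess_abc123", 15000, [], "replay_abc")
    raise AssertionError("expected TypeError")
except TypeError:
    pass
```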
```python
test.set_kernel_vm(
    session_id="kern_sess_abc123",
    duration_ms=15000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
        {"ts": 2500, "type": "action", "data": {"action": "type", "selector": "#email", "value": "user@example.com"}},
        {"ts": 3800, "type": "action", "data": {"action": "click", "selector": "#submit"}},
    ],
    replay_id="replay_abc123",
    replay_view_url="https://www.kernel.sh/replays/replay_abc123",
    stealth=True,
    viewport={"width": 1920, "height": 1080},
)
```
When to use which?
- `set_kernel_vm()` — You're using Kernel and want replay URLs, live view, and CDP access tracked in your results.
- `set_vm_stream()` — Everything else. Works with any provider. Pass whatever metadata makes sense for your setup.
Attaching VM Streams to EvalRunner Results
This is the most common source of confusion. EvalRunner.run() returns a RunBuilder with TestBuilder instances for each scenario. The runner records tool calls and text comparisons, but you must attach VM streams.
```python
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))

# run.tests is a list of TestBuilder instances
# Each test.test_id matches the scenario ID from the dataset
for test in run.tests:
    session = agent.get_session(test.test_id)
    if session:
        # Option A: generic provider
        test.set_vm_stream(
            provider="my-provider",
            session_id=session.id,
            logs=session.logs,
            duration_ms=session.duration,
        )
        # Option B: Kernel
        # test.set_kernel_vm(
        #     session_id=session.kernel_id,
        #     logs=session.logs,
        #     replay_id=session.replay_id,
        # )

# Now deploy — VM streams are included
run.deploy(client, dataset_id=42)
```
Why doesn't EvalRunner capture logs automatically?
The Agent protocol (respond + reset) is intentionally minimal. It doesn't know or care whether your agent uses a browser, a terminal, or an API. VM stream capture is provider-specific and varies widely, so it's left to you.
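In code, that contract is tiny. A minimal sketch (`MinimalAgent` is illustrative, not an SDK class):

```python
class MinimalAgent:
    """The entire surface EvalRunner needs: respond() and reset()."""

    def respond(self, message: str) -> dict:
        # Return this turn's text plus any tool calls the agent made.
        return {"text": f"echo: {message}", "tool_calls": []}

    def reset(self):
        # Called to clear per-scenario state.
        pass

result = MinimalAgent().respond("hello")
assert set(result) == {"text", "tool_calls"}
```

Anything beyond these two methods — browsers, terminals, log capture — is the agent's private business, which is why VM streams are attached separately.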
Python-specific: Tracking scenario ID
Unlike the TypeScript SDK where respond(message, scenarioId) receives the scenario ID as a parameter, the Python respond(message) method does not. Use the on_scenario callback to track which scenario is active:
```python
class MyAgent:
    def __init__(self):
        self._current_scenario = "default"
        self.sessions = {}

    def respond(self, message: str) -> dict:
        sid = self._current_scenario
        # Use sid to key your browser sessions
        ...

    def reset(self):
        pass

    def set_scenario(self, scenario_id: str):
        self._current_scenario = scenario_id

agent = MyAgent()
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))
```
Manual RunBuilder with VM Streams
If you're not using EvalRunner and building results manually:
```python
from ashr_labs import AshrLabsClient, RunBuilder

client = AshrLabsClient(api_key="tp_...")

run = RunBuilder()
run.start()

test = run.add_test("login_flow")
test.start()

# Record what the agent did
test.add_tool_call(
    expected={"name": "navigate", "arguments_json": '{"url":"https://app.example.com/login"}'},
    actual={"name": "navigate", "arguments": {"url": "https://app.example.com/login"}},
    match_status="exact",
)
test.add_agent_response(
    expected_response={"text": "Navigated to login page"},
    actual_response={"text": "I've opened the login page"},
    match_status="similar",
    semantic_similarity=0.85,
)

# Attach the browser session
test.set_vm_stream(
    provider="playwright",
    session_id="sess_001",
    duration_ms=8000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com/login"}},
        {"ts": 1500, "type": "action", "data": {"action": "type", "selector": "#email", "value": "test@example.com"}},
        {"ts": 2500, "type": "action", "data": {"action": "type", "selector": "#password", "value": "********"}},
        {"ts": 3500, "type": "action", "data": {"action": "click", "selector": "#login-btn"}},
        {"ts": 5000, "type": "navigation", "data": {"url": "https://app.example.com/dashboard"}},
        {"ts": 5500, "type": "screenshot", "data": {}},
    ],
)

test.complete()
run.complete()
run.deploy(client, dataset_id=42)
```
Gotchas and Common Mistakes
1. VM streams are not auto-captured
Mistake: Running runner.run(agent) and deploying, expecting VM logs to appear.
Fix: You must call test.set_vm_stream() or test.set_kernel_vm() on each test after run() returns and before deploy().
2. Parallel execution with browser agents
Mistake: Setting max_workers=4 with a browser agent.
Fix: Use max_workers=1 (the default) for browser agents. Python's EvalRunner uses deepcopy for parallel execution, and most browser clients (Playwright, Selenium, HTTP clients) cannot be deep-copied. If you need parallelism, you'll need to implement __deepcopy__ on your agent or restructure to create fresh browser sessions per copy.
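One way to regain parallelism, sketched here with hypothetical names (`FreshSessionAgent`, `BrowserHandle` — not SDK classes), is a custom `__deepcopy__` that gives each worker a brand-new browser handle rather than copying a live one:

```python
import copy

class BrowserHandle:
    """Stand-in for a real browser client; real ones hold sockets and locks
    that cannot be safely deep-copied."""
    def __init__(self):
        self.connection = object()  # pretend live connection

class FreshSessionAgent:
    def __init__(self):
        self.browser = BrowserHandle()
        self.logs: dict[str, list] = {}

    def __deepcopy__(self, memo):
        # Build a fresh agent with its own browser connection;
        # deep-copy only the plain-data state.
        clone = FreshSessionAgent()
        clone.logs = copy.deepcopy(self.logs, memo)
        return clone

agent = FreshSessionAgent()
worker = copy.deepcopy(agent)
assert worker.browser is not agent.browser  # each worker owns its session
```

Per-worker sessions also mean per-worker logs, so you would still need to merge them by scenario ID before attaching VM streams.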
3. Python agent doesn't receive scenario ID
Mistake: Expecting respond(message, scenario_id) like TypeScript.
Fix: Use the on_scenario callback:
```python
run = runner.run(agent, on_scenario=lambda sid, s: agent.set_scenario(sid))
```
4. Forgetting to expose session data
Mistake: Agent collects logs in respond() but has no way to retrieve them after eval.
Fix: Add a get_session(scenario_id) method so you can access logs after runner.run() completes.
5. reset() is called at the START of each scenario, not the end
Mistake: Capturing VM metadata only inside reset(), expecting it to run after the scenario completes.
How EvalRunner actually works: reset(scenario_id) is called at the beginning of each scenario to clear state from the previous one. After the last scenario finishes, reset() is NOT called again. This means:
reset("scenario_1") → No session exists yet, nothing to capture
respond(...) → Session created, agent runs
reset("scenario_2") → Called with scenario_2's ID, but scenario_1's session
is keyed under "scenario_1" — never matched
respond(...) → Session created for scenario_2
[run() returns] → No final reset() — last scenario's data uncaptured
attach_vm_streams() → Metadata map is empty
Fix: Capture VM metadata inside respond() after each agent turn, not in reset(). The session is guaranteed to be alive during respond():
```python
def respond(self, message: str) -> dict:
    sid = self._current_scenario
    session = self._get_or_create_session(sid)
    result = session.agent.process_message(message)

    # Capture metadata HERE — session is alive, scenario is correct
    self._captured_metadata[sid] = {
        "session_id": session.id,
        "logs": session.logs,
        "duration_ms": int((time.time() - session.start_time) * 1000),
    }
    return {"text": result, "tool_calls": [...]}
```
This way your metadata map is populated by the time you iterate run.tests to attach VM streams.
6. Clearing sessions on reset()
Mistake: Agent's reset() deletes browser session data, so logs are gone before you attach them.
Fix: Don't delete logs in reset(). Keep them until after deploy().
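A safe `reset()` clears conversational state but leaves VM logs alone. A sketch (field names are illustrative):

```python
class SafeResetAgent:
    def __init__(self):
        self.conversations: dict[str, list] = {}  # fine to clear per scenario
        self.vm_logs: dict[str, list] = {}        # must survive until deploy()

    def reset(self):
        # VM logs are attached to tests only after runner.run() returns,
        # so they have to outlive every reset() call.
        self.conversations.clear()

agent = SafeResetAgent()
agent.vm_logs["checkout"] = [{"ts": 0, "type": "screenshot", "data": {}}]
agent.conversations["checkout"] = [{"role": "user", "content": "Buy it"}]
agent.reset()
assert agent.vm_logs["checkout"]   # logs kept
assert not agent.conversations     # chat state gone
```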
7. VM stream attached but logs are empty
Symptom: vm_stream has provider, session_id, and metadata, but logs: [] and duration_ms: 0.
Cause: You're capturing session identity (IDs, URLs) but not recording browser actions into a logs array. If your agent emits tool call events internally (e.g. navigate_url, click_mouse, type_text), you need to collect those into log entries during respond().
Fix: Create a per-scenario log list, append entries when tool calls complete, and include it in your VM stream:
```python
def __init__(self):
    self._scenario_vm_logs: dict[str, list[dict]] = {}

def respond(self, message: str) -> dict:
    sid = self._current_scenario
    if sid not in self._scenario_vm_logs:
        self._scenario_vm_logs[sid] = []
    vm_logs = self._scenario_vm_logs[sid]
    session_start = self._session_start_times.get(sid, time.time())

    def on_tool_complete(name: str, args: dict, status: str, error: str | None = None):
        log_type = ("navigation" if "navigate" in name
                    else "screenshot" if "screenshot" in name
                    else "action")
        entry = {
            "ts": int((time.time() - session_start) * 1000),
            "type": log_type,
            "data": {"action": name, **args},
        }
        if status == "error" and error:
            entry["data"]["error"] = error
        vm_logs.append(entry)

    # ... run agent, hook into tool events ...
```
Then include logs when attaching:
```python
test.set_vm_stream(
    provider="my-provider",
    session_id=session.id,
    duration_ms=int((time.time() - session_start) * 1000),
    logs=self._scenario_vm_logs.get(test.test_id, []),  # don't forget this
)
```
8. Timestamps
Recommendation: Use milliseconds relative to session start (i.e., first log entry at ts: 0). Absolute timestamps work too, but relative timestamps make it easier to calculate durations and display timelines.
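If your provider reports absolute timestamps (e.g. epoch seconds), rebasing them to relative milliseconds is straightforward. A hypothetical helper (`to_relative_ms` is not part of the SDK):

```python
def to_relative_ms(entries: list[dict]) -> list[dict]:
    """Rebase absolute epoch-second timestamps so the first entry is ts=0 (in ms)."""
    if not entries:
        return []
    t0 = min(e["ts"] for e in entries)
    return [{**e, "ts": round((e["ts"] - t0) * 1000)} for e in entries]

raw = [
    {"ts": 1700000000.0, "type": "navigation", "data": {"url": "https://example.com"}},
    {"ts": 1700000001.2, "type": "action", "data": {"action": "click", "selector": "#go"}},
]
rebased = to_relative_ms(raw)
assert rebased[0]["ts"] == 0
assert rebased[1]["ts"] == 1200
```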
9. Grading is server-side
Calling deploy() submits your results (including VM streams) but grading happens asynchronously on the server (typically 1-3 minutes). Use poll_run() to wait:
```python
created = run.deploy(client, dataset_id=42)
graded = client.poll_run(created["id"], timeout=300)

metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
```