VM Integration Guide
This guide covers how to evaluate browser-based and desktop-based agents that interact with virtual machines. If your agent controls a browser (via Playwright, Puppeteer, Browserbase, Kernel, Steel, etc.), this is the guide for you.
Table of Contents
- Overview
- How VM Streams Work
- Quick Start
- End-to-End Example: Browser Agent with EvalRunner
- Log Entry Format
- Providers
- Attaching VM Streams to EvalRunner Results
- Manual RunBuilder with VM Streams
- Gotchas and Common Mistakes
Overview
When your agent operates in a browser or VM, you want to capture what the agent did visually — not just what tool calls it made. VM streams attach timestamped browser logs (navigations, clicks, screenshots, network requests) to each test in a run, so you can replay and debug agent behavior in the Ashr Labs dashboard.
Key concept: VM streams are metadata attached to individual tests, not to the run as a whole. Each test (scenario) gets its own VM stream with its own session, logs, and duration.
How VM Streams Work
┌─────────────────────────────────────────────────┐
│ Your Browser Agent │
│ │
│ 1. Receives message from EvalRunner │
│ 2. Executes browser actions (click, type, etc.)│
│ 3. Collects logs as it goes │
│ 4. Returns { text, tool_calls } │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ EvalRunner │
│ │
│ - Records tool call matches ✓ │
│ - Records text response matches ✓ │
│ - Does NOT auto-capture VM logs ✗ │
└────────────────────┬────────────────────────────┘
│
▼ You must do this yourself
┌─────────────────────────────────────────────────┐
│ After runner.run(), attach VM streams: │
│ │
│ test.setVmStream("browserbase", { ... }) │
│ — or — │
│ test.setKernelVm("session_id", { ... }) │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ run.deploy(client, datasetId) │
│ → VM logs submitted with run result │
│ → Viewable in Ashr Labs dashboard │
└─────────────────────────────────────────────────┘
Important: EvalRunner handles tool call and text comparison automatically, but it does not capture VM/browser logs. You must:
- Collect logs inside your agent during respond()
- Attach them to each TestBuilder after runner.run() returns
- Then call run.deploy()
Quick Start
Minimal example — a browser agent that records navigation logs:
import { AshrLabsClient, EvalRunner, Agent } from "ashr-labs";
const client = AshrLabsClient.fromEnv();
// Your agent must track logs per scenario
class MyBrowserAgent implements Agent {
logs = new Map<string, { sessionId: string; entries: any[] }>();
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
if (!this.logs.has(id)) {
this.logs.set(id, { sessionId: `sess_${Date.now()}`, entries: [] });
}
const ctx = this.logs.get(id)!;
// ... your browser automation here ...
ctx.entries.push({ ts: Date.now(), type: "navigation", data: { url: "https://example.com" } });
return { text: "Done", tool_calls: [] };
}
async reset(scenarioId?: string) {
this.logs.delete(scenarioId ?? "default");
}
}
const agent = new MyBrowserAgent();
const runner = await EvalRunner.fromDataset(client, 42);
const run = await runner.run(agent);
// Attach VM streams BEFORE deploying
for (const test of run.tests) {
const session = agent.logs.get(test.test_id);
if (session) {
test.setVmStream("playwright", {
sessionId: session.sessionId,
logs: session.entries,
});
}
}
await run.deploy(client, 42);
End-to-End Example: Browser Agent with EvalRunner
This is a complete example (with stubbed browser calls to replace with your provider) of a browser agent that:
- Uses the Anthropic SDK for reasoning
- Drives a browser via automation
- Records VM logs per scenario
- Runs evals and deploys with VM streams attached
import { AshrLabsClient, EvalRunner, Agent } from "ashr-labs";
import Anthropic from "@anthropic-ai/sdk";
interface BrowserSession {
sessionId: string;
logs: Array<{ ts: number; type: string; data: Record<string, unknown> }>;
startTime: number;
}
class WebNavigationAgent implements Agent {
private anthropic = new Anthropic();
private sessions = new Map<string, BrowserSession>();
private conversations = new Map<string, Anthropic.MessageParam[]>();
async respond(message: string, scenarioId?: string): Promise<Record<string, unknown>> {
const id = scenarioId ?? "default";
// Initialize browser session for this scenario
if (!this.sessions.has(id)) {
const session = await this.createBrowserSession();
this.sessions.set(id, {
sessionId: session.id,
logs: [],
startTime: Date.now(),
});
}
const session = this.sessions.get(id)!;
const history = this.conversations.get(id) ?? [];
history.push({ role: "user", content: message });
// Ask the LLM what browser actions to take
const allToolCalls: Record<string, unknown>[] = [];
let finalText = "";
for (let turn = 0; turn < 10; turn++) {
const response = await this.anthropic.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: "You are a web navigation agent. Use tools to interact with the browser.",
tools: [
{
name: "navigate",
description: "Navigate to a URL",
input_schema: { type: "object" as const, properties: { url: { type: "string" } }, required: ["url"] },
},
{
name: "click",
description: "Click an element",
input_schema: { type: "object" as const, properties: { selector: { type: "string" } }, required: ["selector"] },
},
{
name: "type_text",
description: "Type text into an input",
input_schema: {
type: "object" as const,
properties: { selector: { type: "string" }, text: { type: "string" } },
required: ["selector", "text"],
},
},
],
messages: history,
});
// Collect text blocks
for (const block of response.content) {
if (block.type === "text") finalText += block.text;
}
// Process tool calls
const toolUseBlocks = response.content.filter((b): b is Anthropic.ToolUseBlock => b.type === "tool_use");
if (toolUseBlocks.length === 0) {
history.push({ role: "assistant", content: response.content });
break;
}
// Execute each tool call in the browser and record logs
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const tool of toolUseBlocks) {
const input = tool.input as Record<string, string>;
const now = Date.now() - session.startTime;
allToolCalls.push({ name: tool.name, arguments: input });
// Execute and log
switch (tool.name) {
case "navigate":
await this.browserNavigate(session.sessionId, input.url);
session.logs.push({ ts: now, type: "navigation", data: { url: input.url } });
toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Navigated" });
break;
case "click":
await this.browserClick(session.sessionId, input.selector);
session.logs.push({ ts: now, type: "action", data: { action: "click", selector: input.selector } });
toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Clicked" });
break;
case "type_text":
await this.browserType(session.sessionId, input.selector, input.text);
session.logs.push({
ts: now,
type: "action",
data: { action: "type", selector: input.selector, value: input.text },
});
toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Typed" });
break;
}
// Capture a screenshot after each action
session.logs.push({ ts: Date.now() - session.startTime, type: "screenshot", data: {} });
}
history.push({ role: "assistant", content: response.content });
history.push({ role: "user", content: toolResults });
if (response.stop_reason === "end_turn") break;
}
this.conversations.set(id, history);
return { text: finalText, tool_calls: allToolCalls };
}
async reset(scenarioId?: string) {
const id = scenarioId ?? "default";
await this.closeBrowserSession(this.sessions.get(id)?.sessionId);
this.sessions.delete(id);
this.conversations.delete(id);
}
/** Expose sessions so we can attach VM streams after eval */
getSession(scenarioId: string): BrowserSession | undefined {
return this.sessions.get(scenarioId);
}
// -- Replace these stubs with your actual browser provider --
private async createBrowserSession() { return { id: `sess_${Date.now()}` }; }
private async closeBrowserSession(_id?: string) {}
private async browserNavigate(_sessionId: string, _url: string) {}
private async browserClick(_sessionId: string, _selector: string) {}
private async browserType(_sessionId: string, _selector: string, _text: string) {}
}
// ─── Main ────────────────────────────────────────────────────────────────
async function main() {
const client = AshrLabsClient.fromEnv();
const agent = new WebNavigationAgent();
// 1. Run eval
const runner = await EvalRunner.fromDataset(client, 42);
const run = await runner.run(agent, {
maxWorkers: 1, // Use 1 for browser agents (see Gotchas below)
onScenario: (id, s) => console.log(`▶ Scenario: ${s.title}`),
onAction: (i, a) => console.log(` Action ${i}: ${a.content?.slice(0, 60)}`),
});
// 2. Attach VM streams to each test
for (const test of run.tests) {
const session = agent.getSession(test.test_id);
if (session) {
test.setVmStream("custom", {
sessionId: session.sessionId,
durationMs: Date.now() - session.startTime,
logs: session.logs,
metadata: {
browser: "chromium",
viewport: { width: 1280, height: 720 },
},
});
}
}
// 3. Inspect metrics
const result = run.build();
const metrics = result.aggregate_metrics as Record<string, unknown>;
console.log(`\nResults: ${metrics.tests_passed}/${metrics.total_tests} passed`);
console.log(`Tool divergence: ${metrics.total_tool_call_divergence}`);
// 4. Deploy
const created = await run.deploy(client, 42);
console.log(`Run deployed: ${(created as Record<string, unknown>).id}`);
}
main().catch(console.error);
Log Entry Format
Every log entry in a VM stream follows this structure:
interface VmLogEntry {
ts: number; // Timestamp in milliseconds (relative to session start, or absolute)
type: string; // One of the types below
data: Record<string, unknown>; // Type-specific payload
}
Log Types Reference
| Type | When to use | Required data fields | Optional data fields |
|---|---|---|---|
| navigation | Browser navigated to a new URL | url | - |
| action | Agent interacted with the page | action ("click", "type", "select", "scroll") | selector, value, delta_x, delta_y |
| network | HTTP request completed | method, url | status, duration_ms |
| console | Browser console output | message | level ("log", "warn", "error") |
| error | An error occurred | message | code, details |
| screenshot | Screenshot was captured | - | s3_key, format ("png", "jpeg") |
Example Logs
// Navigation
{ ts: 0, type: "navigation", data: { url: "https://app.example.com" } }
// Click action
{ ts: 1200, type: "action", data: { action: "click", selector: "#login-btn" } }
// Type into input
{ ts: 2000, type: "action", data: { action: "type", selector: "#email", value: "user@example.com" } }
// Scroll
{ ts: 2500, type: "action", data: { action: "scroll", delta_x: 0, delta_y: 300 } }
// Network request
{ ts: 3000, type: "network", data: { method: "POST", url: "/api/login", status: 200 } }
// Console warning
{ ts: 3100, type: "console", data: { level: "warn", message: "Deprecated API called" } }
// Error
{ ts: 4000, type: "error", data: { message: "Element not found: #checkout-btn" } }
// Screenshot
{ ts: 5000, type: "screenshot", data: { s3_key: "vm-streams/run_42/frame_001.png" } }
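The required-field rules from the table above can be encoded in a small validator and run against your entries before attaching them. This helper is a sketch for illustration, not part of the SDK; only the VmLogEntry shape comes from this guide.

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

// Required `data` fields per log type, mirroring the reference table above.
const REQUIRED_FIELDS: Record<string, string[]> = {
  navigation: ["url"],
  action: ["action"],
  network: ["method", "url"],
  console: ["message"],
  error: ["message"],
  screenshot: [], // all fields optional
};

// Returns a list of problems; an empty array means the entry is well-formed.
function validateLogEntry(entry: VmLogEntry): string[] {
  const problems: string[] = [];
  if (typeof entry.ts !== "number" || entry.ts < 0) {
    problems.push(`invalid ts: ${entry.ts}`);
  }
  const required = REQUIRED_FIELDS[entry.type];
  if (required === undefined) {
    problems.push(`unknown type: ${entry.type}`);
    return problems;
  }
  for (const field of required) {
    if (!(field in entry.data)) {
      problems.push(`missing data.${field} for type "${entry.type}"`);
    }
  }
  return problems;
}
```

Running this over a scenario's log array before setVmStream() catches malformed entries early, while you still have the session context to fix them.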
Providers
Generic Provider
Use setVmStream() for any browser provider (Browserbase, Steel, Scrapybara, Playwright, Puppeteer, or your own):
test.setVmStream(provider, opts?)
| Parameter | Type | Required | Description |
|---|---|---|---|
| provider | string | Yes | Provider name (e.g. "browserbase", "steel", "playwright", "custom") |
| opts.sessionId | string | No | Your provider's session ID |
| opts.durationMs | number | No | Total session duration in milliseconds |
| opts.logs | VmLogEntry[] | No | Array of timestamped log entries |
| opts.metadata | Record<string, unknown> | No | Any additional provider-specific data |
test.setVmStream("browserbase", {
sessionId: "sess_abc123",
durationMs: 12000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://shop.example.com" } },
{ ts: 800, type: "action", data: { action: "click", selector: "#product" } },
{ ts: 3500, type: "network", data: { method: "POST", url: "/api/cart", status: 200 } },
],
metadata: { browser: "chromium", viewport: { width: 1280, height: 720 } },
});
Kernel Browser
Use setKernelVm() for Kernel browser sessions. This is a convenience wrapper that sets provider="kernel" and exposes Kernel-specific metadata as named parameters:
test.setKernelVm(sessionId, opts?)
| Parameter | Type | Required | Description |
|---|---|---|---|
| sessionId | string | Yes | Kernel browser session ID |
| opts.durationMs | number | No | Total session duration in milliseconds |
| opts.logs | VmLogEntry[] | No | Array of timestamped log entries |
| opts.liveViewUrl | string | No | Kernel's browser_live_view_url for real-time viewing |
| opts.cdpWsUrl | string | No | Chrome DevTools Protocol WebSocket URL |
| opts.replayId | string | No | Kernel session recording ID |
| opts.replayViewUrl | string | No | URL to view the session replay |
| opts.headless | boolean | No | Whether the session ran headless |
| opts.stealth | boolean | No | Whether anti-bot stealth mode was enabled |
| opts.viewport | { width, height } | No | Browser viewport dimensions |
test.setKernelVm("kern_sess_abc123", {
durationMs: 15000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://app.example.com" } },
{ ts: 1200, type: "action", data: { action: "click", selector: "#login" } },
{ ts: 2500, type: "action", data: { action: "type", selector: "#email", value: "user@example.com" } },
{ ts: 3800, type: "action", data: { action: "click", selector: "#submit" } },
],
replayId: "replay_abc123",
replayViewUrl: "https://www.kernel.sh/replays/replay_abc123",
stealth: true,
viewport: { width: 1920, height: 1080 },
});
When to use which?
- setKernelVm() — You're using Kernel and want replay URLs, live view, and CDP access tracked in your results.
- setVmStream() — Everything else. Works with any provider. Pass whatever metadata makes sense for your setup.
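Conceptually, the wrapper relationship can be sketched as a pure mapping from Kernel's named options onto the generic stream payload. This is a hypothetical illustration of the idea, not the SDK's actual internals; the real implementation may structure the payload differently.

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

interface VmStreamOpts {
  sessionId?: string;
  durationMs?: number;
  logs?: VmLogEntry[];
  metadata?: Record<string, unknown>;
}

// Hypothetical sketch: setKernelVm's named options expressed as a
// setVmStream call with provider "kernel". Not the real SDK code.
function kernelOptsToVmStream(
  sessionId: string,
  opts: { durationMs?: number; logs?: VmLogEntry[]; [k: string]: unknown } = {},
): { provider: string; stream: VmStreamOpts } {
  const { durationMs, logs, ...kernelMeta } = opts;
  return {
    provider: "kernel",
    stream: {
      sessionId,
      durationMs,
      logs,
      // Kernel-specific fields (liveViewUrl, replayId, stealth, ...) travel as metadata.
      metadata: kernelMeta as Record<string, unknown>,
    },
  };
}
```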
Attaching VM Streams to EvalRunner Results
This is the most common source of confusion. EvalRunner.run() returns a RunBuilder with TestBuilder instances for each scenario. The runner records tool calls and text comparisons, but you must attach VM streams.
const run = await runner.run(agent);
// run.tests is an array of TestBuilder instances
// Each test.test_id matches the scenario ID from the dataset
for (const test of run.tests) {
const session = agent.getSession(test.test_id);
if (session) {
// Option A: generic provider
test.setVmStream("my-provider", {
sessionId: session.id,
logs: session.logs,
durationMs: session.duration,
});
// Option B: Kernel
// test.setKernelVm(session.kernelId, {
// logs: session.logs,
// replayId: session.replayId,
// });
}
}
// Now deploy — VM streams are included
await run.deploy(client, datasetId);
Why doesn't EvalRunner capture logs automatically?
The Agent interface (respond + reset) is intentionally minimal. It doesn't know or care whether your agent uses a browser, a terminal, or an API. VM stream capture is provider-specific and varies widely, so it's left to you.
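The attach loop above can be factored into a reusable helper that works against structural types. The TestBuilder and session shapes below are minimal stand-ins for illustration, not the SDK's real type declarations; only the test_id and setVmStream members come from this guide.

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

interface SessionInfo {
  sessionId: string;
  logs: VmLogEntry[];
  startTime: number;
}

// Minimal structural view of a TestBuilder: just the two members we need.
interface TestWithVmStream {
  test_id: string;
  setVmStream(provider: string, opts: Record<string, unknown>): void;
}

// Attaches a VM stream to every test that has a matching session.
// Returns the number of tests that received a stream.
function attachVmStreams(
  tests: TestWithVmStream[],
  getSession: (scenarioId: string) => SessionInfo | undefined,
  provider = "custom",
): number {
  let attached = 0;
  for (const test of tests) {
    const session = getSession(test.test_id);
    if (!session) continue; // scenario never created a browser session
    test.setVmStream(provider, {
      sessionId: session.sessionId,
      durationMs: Date.now() - session.startTime,
      logs: session.logs,
    });
    attached++;
  }
  return attached;
}
```

With this in place, step 2 of the main() example collapses to `attachVmStreams(run.tests, (id) => agent.getSession(id), "custom")`, and the return value doubles as a sanity check that streams actually got attached.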
Manual RunBuilder with VM Streams
If you're not using EvalRunner and building results manually:
import { AshrLabsClient, RunBuilder } from "ashr-labs";
const client = new AshrLabsClient("tp_...");
const run = new RunBuilder();
run.start();
const test = run.addTest("login_flow");
test.start();
// Record what the agent did
test.addToolCall(
{ name: "navigate", arguments_json: '{"url":"https://app.example.com/login"}' },
{ name: "navigate", arguments: { url: "https://app.example.com/login" } },
"exact",
);
test.addAgentResponse(
{ text: "Navigated to login page" },
{ text: "I've opened the login page" },
"similar",
0.85,
);
// Attach the browser session
test.setVmStream("playwright", {
sessionId: "sess_001",
durationMs: 8000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://app.example.com/login" } },
{ ts: 1500, type: "action", data: { action: "type", selector: "#email", value: "test@example.com" } },
{ ts: 2500, type: "action", data: { action: "type", selector: "#password", value: "********" } },
{ ts: 3500, type: "action", data: { action: "click", selector: "#login-btn" } },
{ ts: 5000, type: "navigation", data: { url: "https://app.example.com/dashboard" } },
{ ts: 5500, type: "screenshot", data: {} },
],
});
test.complete();
run.complete();
await run.deploy(client, 42);
Gotchas and Common Mistakes
1. VM streams are not auto-captured
Mistake: Running runner.run(agent) and deploying, expecting VM logs to appear.
Fix: You must call test.setVmStream() or test.setKernelVm() on each test after run() returns and before deploy().
2. Parallel execution with browser agents
Mistake: Setting maxWorkers: 4 with a browser agent that shares a single browser instance.
Fix: Use maxWorkers: 1 (the default) for browser agents, unless your agent creates independent browser sessions per scenarioId. The scenarioId parameter in respond() exists specifically for this — key your sessions on it:
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
// Each scenario gets its own browser session
if (!this.sessions.has(id)) {
this.sessions.set(id, await this.createNewBrowserSession());
}
// ...
}
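The key-by-scenarioId pattern can be isolated into a small session pool: repeated calls for the same scenario reuse one session, while different scenarios get independent ones. The class and names here are illustrative, not SDK API; your createBrowserSession call would replace the inline session construction.

```typescript
interface PooledSession {
  sessionId: string;
  logs: Array<{ ts: number; type: string; data: Record<string, unknown> }>;
  startTime: number;
}

// One browser session per scenario ID. Safe with maxWorkers > 1 because
// concurrently running scenarios never share a session object.
class SessionPool {
  private readonly sessions = new Map<string, PooledSession>();
  private counter = 0;

  // Get the scenario's session, creating it on first use.
  sessionFor(scenarioId: string): PooledSession {
    let session = this.sessions.get(scenarioId);
    if (!session) {
      session = {
        sessionId: `sess_${++this.counter}`,
        logs: [],
        startTime: Date.now(),
      };
      this.sessions.set(scenarioId, session);
    }
    return session;
  }

  // Drop a scenario's session (e.g. after its logs have been attached).
  release(scenarioId: string): void {
    this.sessions.delete(scenarioId);
  }
}
```

Note that if session creation is asynchronous in your agent, the check-then-set must not have an await between the get and the set, or two concurrent calls for the same scenario can race and create duplicate sessions.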
3. Forgetting to expose session data
Mistake: Agent collects logs in respond() but has no way to retrieve them after eval.
Fix: Add a getSession(scenarioId) method (or similar) so you can access logs after runner.run() completes.
4. reset() is called at the START of each scenario, not the end
Mistake: Capturing VM metadata only inside reset(), expecting it to run after the scenario completes.
How EvalRunner actually works: reset(scenarioId) is called at the beginning of each scenario to clear state from the previous one. After the last scenario finishes, reset() is NOT called again. This means:
reset("scenario_1") → No session exists yet, nothing to capture
respond(...) → Session created, agent runs
reset("scenario_2") → Called with scenario_2's ID, but scenario_1's session
is keyed under "scenario_1" — never matched
respond(...) → Session created for scenario_2
[run() returns] → No final reset() — last scenario's data uncaptured
attachVmStreams() → Metadata map is empty
Fix: Capture VM metadata inside respond() after each agent turn, not in reset(). The session is guaranteed to be alive during respond():
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
const session = this.getOrCreateSession(id);
const result = await session.agent.processMessage(message);
// Capture metadata HERE — session is alive, scenarioId is correct
this.capturedMetadata.set(id, {
sessionId: session.id,
logs: session.logs,
durationMs: Date.now() - session.startTime,
});
return { text: result, tool_calls: [...] };
}
This way attachVmStreams() always has data, regardless of when reset() or cleanup runs.
5. Clearing sessions on reset()
Mistake: Agent's reset() deletes browser session data, so logs are gone before you attach them.
Fix: Don't delete logs in reset(). Either keep them until after deploy(), or copy them out in the onScenario callback:
const allLogs = new Map<string, any>();
let prevId: string | undefined;
const run = await runner.run(agent, {
  onScenario: (id) => {
    // Save the previous scenario's logs before the agent resets them.
    // Note: onScenario fires with the NEW scenario's ID, so we track the
    // previous one ourselves.
    if (prevId) {
      const prev = agent.getSession(prevId);
      if (prev) allLogs.set(prevId, prev);
    }
    prevId = id;
  },
});
// The last scenario is never followed by another onScenario — copy it here.
if (prevId) {
  const last = agent.getSession(prevId);
  if (last) allLogs.set(prevId, last);
}
6. VM stream attached but logs are empty
Symptom: vm_stream has provider, session_id, and metadata, but logs: [] and duration_ms: 0.
Cause: You're capturing session identity (IDs, URLs) but not recording browser actions into a logs array. If your agent emits tool call events internally (e.g. navigate_url, click_mouse, type_text), you need to collect those into VmLogEntry[] during respond().
Fix: Create a per-scenario log array, push entries when tool calls complete, and include the array in your VM stream:
private readonly scenarioVmLogs = new Map<string, VmLogEntry[]>();
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
if (!this.scenarioVmLogs.has(id)) {
this.scenarioVmLogs.set(id, []);
}
const vmLogs = this.scenarioVmLogs.get(id)!;
const sessionStart = this.sessionStartTimes.get(id) ?? Date.now();
// When a tool call completes, push a log entry:
const onToolComplete = (tool: { name: string; args: any; status: string; error?: string }) => {
const type = tool.name.includes("navigate") ? "navigation"
: tool.name.includes("click") ? "action"
: tool.name.includes("type") ? "action"
: tool.name.includes("screenshot") ? "screenshot"
: "action";
vmLogs.push({
ts: Date.now() - sessionStart,
type,
data: {
action: tool.name,
...tool.args,
...(tool.status === "error" ? { error: tool.error } : {}),
},
});
};
// ... run agent, listen for tool events ...
}
Then include logs when attaching:
test.setVmStream("my-provider", {
sessionId: session.id,
durationMs: Date.now() - sessionStart,
logs: this.scenarioVmLogs.get(test.test_id) ?? [], // ← don't forget this
});
7. Timestamps
Recommendation: Use milliseconds relative to session start (i.e., first log entry at ts: 0). Absolute timestamps work too, but relative timestamps make it easier to calculate durations and display timelines.
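If your provider hands you absolute epoch timestamps, rebasing them to session-relative before attaching is a few lines. A minimal sketch, using only the VmLogEntry shape from this guide:

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

// Rebase timestamps so the earliest entry lands at ts: 0.
// Also sorts entries, since relative timelines assume chronological order.
function rebaseTimestamps(logs: VmLogEntry[]): VmLogEntry[] {
  if (logs.length === 0) return [];
  const sorted = [...logs].sort((a, b) => a.ts - b.ts);
  const t0 = sorted[0].ts;
  return sorted.map((entry) => ({ ...entry, ts: entry.ts - t0 }));
}
```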
8. Grading is server-side
Calling deploy() submits your results (including VM streams) but grading happens asynchronously on the server (typically 1-3 minutes). To get graded results:
const created = await run.deploy(client, 42);
const runResult = await client.getRun((created as Record<string, unknown>).id as number);
// Check runResult.status — "graded" means complete
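Since grading is asynchronous, the usual pattern is a small poll loop around getRun(). In this sketch, client.getRun and the "graded" status come from this guide; the "failed" status, the interval, and the timeout are assumptions to adjust for your setup.

```typescript
// Minimal structural view of the client: only the call we need.
interface RunClient {
  getRun(id: number): Promise<{ status: string }>;
}

// Statuses that end polling. "graded" is documented above; "failed" is an
// assumed terminal error state — adjust to whatever your API actually returns.
function isTerminalStatus(status: string): boolean {
  return status === "graded" || status === "failed";
}

// Poll until the run reaches a terminal status or the timeout elapses.
async function waitForGraded(
  client: RunClient,
  runId: number,
  { intervalMs = 15_000, timeoutMs = 5 * 60_000 } = {},
): Promise<{ status: string }> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const run = await client.getRun(runId);
    if (isTerminalStatus(run.status)) return run;
    if (Date.now() >= deadline) {
      throw new Error(`run ${runId} not graded after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The defaults (15s interval, 5 minute timeout) fit the typical 1-3 minute grading window described above.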