VM Integration Guide

This guide covers how to evaluate browser-based and desktop-based agents that interact with virtual machines. If your agent controls a browser (via Playwright, Puppeteer, Browserbase, Kernel, Steel, etc.), this is the guide for you.

Overview

When your agent operates in a browser or VM, you want to capture what the agent did visually — not just what tool calls it made. VM streams attach timestamped browser logs (navigations, clicks, screenshots, network requests) to each test in a run, so you can replay and debug agent behavior in the Ashr Labs dashboard.

Key concept: VM streams are metadata attached to individual tests, not to the run as a whole. Each test (scenario) gets its own VM stream with its own session, logs, and duration.

How VM Streams Work

┌─────────────────────────────────────────────────┐
│ Your Browser Agent                              │
│                                                 │
│ 1. Receives message from EvalRunner             │
│ 2. Executes browser actions (click, type, etc.) │
│ 3. Collects logs as it goes                     │
│ 4. Returns { text, tool_calls }                 │
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│ EvalRunner                                      │
│                                                 │
│ - Records tool call matches ✓                   │
│ - Records text response matches ✓               │
│ - Does NOT auto-capture VM logs ✗               │
└────────────────────┬────────────────────────────┘
                     │
                     ▼  You must do this yourself
┌─────────────────────────────────────────────────┐
│ After runner.run(), attach VM streams:          │
│                                                 │
│   test.setVmStream("browserbase", { ... })      │
│   — or —                                        │
│   test.setKernelVm("session_id", { ... })       │
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│ run.deploy(client, datasetId)                   │
│   → VM logs submitted with run result           │
│   → Viewable in Ashr Labs dashboard             │
└─────────────────────────────────────────────────┘

Important: EvalRunner handles tool call and text comparison automatically, but it does not capture VM/browser logs. You must:

  1. Collect logs inside your agent during respond()
  2. Attach them to each TestBuilder after runner.run() returns
  3. Then call run.deploy()

Quick Start

Minimal example — a browser agent that records navigation logs:

import { AshrLabsClient, EvalRunner, Agent } from "ashr-labs";

const client = AshrLabsClient.fromEnv();

// Your agent must track logs per scenario
class MyBrowserAgent implements Agent {
  logs = new Map<string, { sessionId: string; entries: any[] }>();

  async respond(message: string, scenarioId?: string) {
    const id = scenarioId ?? "default";
    if (!this.logs.has(id)) {
      this.logs.set(id, { sessionId: `sess_${Date.now()}`, entries: [] });
    }
    const ctx = this.logs.get(id)!;

    // ... your browser automation here ...
    ctx.entries.push({ ts: Date.now(), type: "navigation", data: { url: "https://example.com" } });

    return { text: "Done", tool_calls: [] };
  }

  async reset(scenarioId?: string) {
    this.logs.delete(scenarioId ?? "default");
  }
}

const agent = new MyBrowserAgent();
const runner = await EvalRunner.fromDataset(client, 42);
const run = await runner.run(agent);

// Attach VM streams BEFORE deploying
for (const test of run.tests) {
  const session = agent.logs.get(test.test_id);
  if (session) {
    test.setVmStream("playwright", {
      sessionId: session.sessionId,
      logs: session.entries,
    });
  }
}

await run.deploy(client, 42);

End-to-End Example: Browser Agent with EvalRunner

This is a complete, working example of a browser agent that:

  • Uses the Anthropic SDK for reasoning
  • Drives a browser via automation
  • Records VM logs per scenario
  • Runs evals and deploys with VM streams attached

import { AshrLabsClient, EvalRunner, Agent } from "ashr-labs";
import Anthropic from "@anthropic-ai/sdk";

interface BrowserSession {
  sessionId: string;
  logs: Array<{ ts: number; type: string; data: Record<string, unknown> }>;
  startTime: number;
}

class WebNavigationAgent implements Agent {
  private anthropic = new Anthropic();
  private sessions = new Map<string, BrowserSession>();
  private conversations = new Map<string, Anthropic.MessageParam[]>();

  async respond(message: string, scenarioId?: string): Promise<Record<string, unknown>> {
    const id = scenarioId ?? "default";

    // Initialize browser session for this scenario
    if (!this.sessions.has(id)) {
      const session = await this.createBrowserSession();
      this.sessions.set(id, {
        sessionId: session.id,
        logs: [],
        startTime: Date.now(),
      });
    }

    const session = this.sessions.get(id)!;
    const history = this.conversations.get(id) ?? [];
    history.push({ role: "user", content: message });

    // Ask the LLM what browser actions to take
    const allToolCalls: Record<string, unknown>[] = [];
    let finalText = "";

    for (let turn = 0; turn < 10; turn++) {
      const response = await this.anthropic.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        system: "You are a web navigation agent. Use tools to interact with the browser.",
        tools: [
          {
            name: "navigate",
            description: "Navigate to a URL",
            input_schema: { type: "object" as const, properties: { url: { type: "string" } }, required: ["url"] },
          },
          {
            name: "click",
            description: "Click an element",
            input_schema: { type: "object" as const, properties: { selector: { type: "string" } }, required: ["selector"] },
          },
          {
            name: "type_text",
            description: "Type text into an input",
            input_schema: {
              type: "object" as const,
              properties: { selector: { type: "string" }, text: { type: "string" } },
              required: ["selector", "text"],
            },
          },
        ],
        messages: history,
      });

      // Collect text blocks
      for (const block of response.content) {
        if (block.type === "text") finalText += block.text;
      }

      // Process tool calls
      const toolUseBlocks = response.content.filter((b): b is Anthropic.ToolUseBlock => b.type === "tool_use");

      if (toolUseBlocks.length === 0) {
        history.push({ role: "assistant", content: response.content });
        break;
      }

      // Execute each tool call in the browser and record logs
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
      for (const tool of toolUseBlocks) {
        const input = tool.input as Record<string, string>;
        const now = Date.now() - session.startTime;

        allToolCalls.push({ name: tool.name, arguments: input });

        // Execute and log
        switch (tool.name) {
          case "navigate":
            await this.browserNavigate(session.sessionId, input.url);
            session.logs.push({ ts: now, type: "navigation", data: { url: input.url } });
            toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Navigated" });
            break;

          case "click":
            await this.browserClick(session.sessionId, input.selector);
            session.logs.push({ ts: now, type: "action", data: { action: "click", selector: input.selector } });
            toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Clicked" });
            break;

          case "type_text":
            await this.browserType(session.sessionId, input.selector, input.text);
            session.logs.push({
              ts: now,
              type: "action",
              data: { action: "type", selector: input.selector, value: input.text },
            });
            toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Typed" });
            break;
        }

        // Capture a screenshot after each action
        session.logs.push({ ts: Date.now() - session.startTime, type: "screenshot", data: {} });
      }

      history.push({ role: "assistant", content: response.content });
      history.push({ role: "user", content: toolResults });

      if (response.stop_reason === "end_turn") break;
    }

    this.conversations.set(id, history);

    return { text: finalText, tool_calls: allToolCalls };
  }

  async reset(scenarioId?: string) {
    const id = scenarioId ?? "default";
    await this.closeBrowserSession(this.sessions.get(id)?.sessionId);
    this.sessions.delete(id);
    this.conversations.delete(id);
  }

  /** Expose sessions so we can attach VM streams after eval */
  getSession(scenarioId: string): BrowserSession | undefined {
    return this.sessions.get(scenarioId);
  }

  // -- Replace these stubs with your actual browser provider --
  private async createBrowserSession() { return { id: `sess_${Date.now()}` }; }
  private async closeBrowserSession(_id?: string) {}
  private async browserNavigate(_sessionId: string, _url: string) {}
  private async browserClick(_sessionId: string, _selector: string) {}
  private async browserType(_sessionId: string, _selector: string, _text: string) {}
}

// ─── Main ────────────────────────────────────────────────────────────────

async function main() {
  const client = AshrLabsClient.fromEnv();
  const agent = new WebNavigationAgent();

  // 1. Run eval
  const runner = await EvalRunner.fromDataset(client, 42);
  const run = await runner.run(agent, {
    maxWorkers: 1, // Use 1 for browser agents (see Gotchas below)
    onScenario: (id, s) => console.log(`▶ Scenario: ${s.title}`),
    onAction: (i, a) => console.log(`  Action ${i}: ${a.content?.slice(0, 60)}`),
  });

  // 2. Attach VM streams to each test
  for (const test of run.tests) {
    const session = agent.getSession(test.test_id);
    if (session) {
      test.setVmStream("custom", {
        sessionId: session.sessionId,
        durationMs: Date.now() - session.startTime,
        logs: session.logs,
        metadata: {
          browser: "chromium",
          viewport: { width: 1280, height: 720 },
        },
      });
    }
  }

  // 3. Inspect metrics
  const result = run.build();
  const metrics = result.aggregate_metrics as Record<string, unknown>;
  console.log(`\nResults: ${metrics.tests_passed}/${metrics.total_tests} passed`);
  console.log(`Tool divergence: ${metrics.total_tool_call_divergence}`);

  // 4. Deploy
  const created = await run.deploy(client, 42);
  console.log(`Run deployed: ${(created as Record<string, unknown>).id}`);
}

main().catch(console.error);

Log Entry Format

Every log entry in a VM stream follows this structure:

interface VmLogEntry {
  ts: number;                    // Timestamp in milliseconds (relative to session start, or absolute)
  type: string;                  // One of the types below
  data: Record<string, unknown>; // Type-specific payload
}
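
A tiny logger helper (hypothetical, not part of the SDK) can keep this structure consistent by stamping every entry relative to the session start:

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

// Collects entries with ts relative to sessionStart (first entry ≈ 0).
// The optional `now` parameter exists mainly to make this testable.
function makeVmLogger(sessionStart: number) {
  const entries: VmLogEntry[] = [];
  return {
    entries,
    log(type: string, data: Record<string, unknown> = {}, now: number = Date.now()) {
      entries.push({ ts: now - sessionStart, type, data });
    },
  };
}
```

Create one logger per scenario at session start, call logger.log(...) after each browser action, and pass logger.entries as the logs array when attaching the VM stream.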

Log Types Reference

| Type       | When to use                    | Required data fields                         | Optional data fields              |
|------------|--------------------------------|----------------------------------------------|-----------------------------------|
| navigation | Browser navigated to a new URL | url                                          | -                                 |
| action     | Agent interacted with the page | action ("click", "type", "select", "scroll") | selector, value, delta_x, delta_y |
| network    | HTTP request completed         | method, url                                  | status, duration_ms               |
| console    | Browser console output         | message                                      | level ("log", "warn", "error")    |
| error      | An error occurred              | message                                      | code, details                     |
| screenshot | Screenshot was captured        | -                                            | s3_key, format ("png", "jpeg")    |
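
If you want to catch malformed entries before they reach the dashboard, the table above can be encoded as a small client-side validator. This is a sketch, not an SDK feature — the field names come straight from the Required column, and the server may accept more than this:

```typescript
// Required data fields per log type, per the reference table above.
const REQUIRED_FIELDS: Record<string, string[]> = {
  navigation: ["url"],
  action: ["action"],
  network: ["method", "url"],
  console: ["message"],
  error: ["message"],
  screenshot: [],
};

// Returns the names of any required data fields missing from the entry,
// or a single "unknown type" marker for unrecognized types.
function missingFields(entry: { type: string; data: Record<string, unknown> }): string[] {
  const required = REQUIRED_FIELDS[entry.type];
  if (required === undefined) return [`unknown type: ${entry.type}`];
  return required.filter((field) => !(field in entry.data));
}
```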

Example Logs

// Navigation
{ ts: 0, type: "navigation", data: { url: "https://app.example.com" } }

// Click action
{ ts: 1200, type: "action", data: { action: "click", selector: "#login-btn" } }

// Type into input
{ ts: 2000, type: "action", data: { action: "type", selector: "#email", value: "user@example.com" } }

// Scroll
{ ts: 2500, type: "action", data: { action: "scroll", delta_x: 0, delta_y: 300 } }

// Network request
{ ts: 3000, type: "network", data: { method: "POST", url: "/api/login", status: 200 } }

// Console warning
{ ts: 3100, type: "console", data: { level: "warn", message: "Deprecated API called" } }

// Error
{ ts: 4000, type: "error", data: { message: "Element not found: #checkout-btn" } }

// Screenshot
{ ts: 5000, type: "screenshot", data: { s3_key: "vm-streams/run_42/frame_001.png" } }

Providers

Generic Provider

Use setVmStream() for any browser provider (Browserbase, Steel, Scrapybara, Playwright, Puppeteer, or your own):

test.setVmStream(provider, opts?)

| Parameter       | Type                    | Required | Description                                                         |
|-----------------|-------------------------|----------|---------------------------------------------------------------------|
| provider        | string                  | Yes      | Provider name (e.g. "browserbase", "steel", "playwright", "custom") |
| opts.sessionId  | string                  | No       | Your provider's session ID                                          |
| opts.durationMs | number                  | No       | Total session duration in milliseconds                              |
| opts.logs       | VmLogEntry[]            | No       | Array of timestamped log entries                                    |
| opts.metadata   | Record<string, unknown> | No       | Any additional provider-specific data                               |

test.setVmStream("browserbase", {
  sessionId: "sess_abc123",
  durationMs: 12000,
  logs: [
    { ts: 0, type: "navigation", data: { url: "https://shop.example.com" } },
    { ts: 800, type: "action", data: { action: "click", selector: "#product" } },
    { ts: 3500, type: "network", data: { method: "POST", url: "/api/cart", status: 200 } },
  ],
  metadata: { browser: "chromium", viewport: { width: 1280, height: 720 } },
});

Kernel Browser

Use setKernelVm() for Kernel browser sessions. This is a convenience wrapper that sets provider="kernel" and exposes Kernel-specific metadata as named parameters:

test.setKernelVm(sessionId, opts?)

| Parameter          | Type              | Required | Description                                          |
|--------------------|-------------------|----------|------------------------------------------------------|
| sessionId          | string            | Yes      | Kernel browser session ID                            |
| opts.durationMs    | number            | No       | Total session duration in milliseconds               |
| opts.logs          | VmLogEntry[]      | No       | Array of timestamped log entries                     |
| opts.liveViewUrl   | string            | No       | Kernel's browser_live_view_url for real-time viewing |
| opts.cdpWsUrl      | string            | No       | Chrome DevTools Protocol WebSocket URL               |
| opts.replayId      | string            | No       | Kernel session recording ID                          |
| opts.replayViewUrl | string            | No       | URL to view the session replay                       |
| opts.headless      | boolean           | No       | Whether the session ran headless                     |
| opts.stealth       | boolean           | No       | Whether anti-bot stealth mode was enabled            |
| opts.viewport      | { width, height } | No       | Browser viewport dimensions                          |

test.setKernelVm("kern_sess_abc123", {
  durationMs: 15000,
  logs: [
    { ts: 0, type: "navigation", data: { url: "https://app.example.com" } },
    { ts: 1200, type: "action", data: { action: "click", selector: "#login" } },
    { ts: 2500, type: "action", data: { action: "type", selector: "#email", value: "user@example.com" } },
    { ts: 3800, type: "action", data: { action: "click", selector: "#submit" } },
  ],
  replayId: "replay_abc123",
  replayViewUrl: "https://www.kernel.sh/replays/replay_abc123",
  stealth: true,
  viewport: { width: 1920, height: 1080 },
});

When to use which?

  • setKernelVm() — You're using Kernel and want replay URLs, live view, and CDP access tracked in your results.
  • setVmStream() — Everything else. Works with any provider. Pass whatever metadata makes sense for your setup.

Attaching VM Streams to EvalRunner Results

This is the most common source of confusion. EvalRunner.run() returns a RunBuilder with TestBuilder instances for each scenario. The runner records tool calls and text comparisons, but you must attach VM streams.

const run = await runner.run(agent);

// run.tests is an array of TestBuilder instances
// Each test.test_id matches the scenario ID from the dataset

for (const test of run.tests) {
  const session = agent.getSession(test.test_id);
  if (session) {
    // Option A: generic provider
    test.setVmStream("my-provider", {
      sessionId: session.id,
      logs: session.logs,
      durationMs: session.duration,
    });

    // Option B: Kernel
    // test.setKernelVm(session.kernelId, {
    //   logs: session.logs,
    //   replayId: session.replayId,
    // });
  }
}

// Now deploy — VM streams are included
await run.deploy(client, datasetId);

Why doesn't EvalRunner capture logs automatically?

The Agent interface (respond + reset) is intentionally minimal. It doesn't know or care whether your agent uses a browser, a terminal, or an API. VM stream capture is provider-specific and varies widely, so it's left to you.
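
For reference, the contract the runner sees is just two methods. The shape below is inferred from the examples in this guide — check the SDK's exported types for the authoritative definition:

```typescript
// Assumed shape of the Agent interface, as used throughout this guide.
interface Agent {
  respond(message: string, scenarioId?: string): Promise<Record<string, unknown>>;
  reset(scenarioId?: string): Promise<void>;
}

// Nothing here mentions browsers, sessions, or logs — which is why any
// VM capture has to live inside your implementation:
class EchoAgent implements Agent {
  async respond(message: string) {
    return { text: `echo: ${message}`, tool_calls: [] };
  }
  async reset() {}
}
```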


Manual RunBuilder with VM Streams

If you're not using EvalRunner and building results manually:

import { AshrLabsClient, RunBuilder } from "ashr-labs";

const client = new AshrLabsClient("tp_...");
const run = new RunBuilder();
run.start();

const test = run.addTest("login_flow");
test.start();

// Record what the agent did
test.addToolCall(
{ name: "navigate", arguments_json: '{"url":"https://app.example.com/login"}' },
{ name: "navigate", arguments: { url: "https://app.example.com/login" } },
"exact",
);
test.addAgentResponse(
{ text: "Navigated to login page" },
{ text: "I've opened the login page" },
"similar",
0.85,
);

// Attach the browser session
test.setVmStream("playwright", {
sessionId: "sess_001",
durationMs: 8000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://app.example.com/login" } },
{ ts: 1500, type: "action", data: { action: "type", selector: "#email", value: "test@example.com" } },
{ ts: 2500, type: "action", data: { action: "type", selector: "#password", value: "********" } },
{ ts: 3500, type: "action", data: { action: "click", selector: "#login-btn" } },
{ ts: 5000, type: "navigation", data: { url: "https://app.example.com/dashboard" } },
{ ts: 5500, type: "screenshot", data: {} },
],
});

test.complete();
run.complete();

await run.deploy(client, 42);

Gotchas and Common Mistakes

1. VM streams are not auto-captured

Mistake: Running runner.run(agent) and deploying, expecting VM logs to appear.

Fix: You must call test.setVmStream() or test.setKernelVm() on each test after run() returns and before deploy().
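
One defensive pattern — purely illustrative, assuming nothing about the SDK beyond what this guide shows — is to track which test IDs you attached streams to and fail fast before deploying:

```typescript
// Throws if any test ID never had a VM stream attached, so a forgotten
// attachment fails loudly before deploy() instead of silently shipping
// a run with no browser logs.
function assertAllAttached(testIds: string[], attachedIds: Set<string>): void {
  const missing = testIds.filter((id) => !attachedIds.has(id));
  if (missing.length > 0) {
    throw new Error(`No VM stream attached for tests: ${missing.join(", ")}`);
  }
}
```

Add each test.test_id to attachedIds as you call setVmStream() in your attach loop, then call this just before run.deploy().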

2. Parallel execution with browser agents

Mistake: Setting maxWorkers: 4 with a browser agent that shares a single browser instance.

Fix: Use maxWorkers: 1 (the default) for browser agents, unless your agent creates independent browser sessions per scenarioId. The scenarioId parameter in respond() exists specifically for this — key your sessions on it:

async respond(message: string, scenarioId?: string) {
  const id = scenarioId ?? "default";
  // Each scenario gets its own browser session
  if (!this.sessions.has(id)) {
    this.sessions.set(id, await this.createNewBrowserSession());
  }
  // ...
}

3. Forgetting to expose session data

Mistake: Agent collects logs in respond() but has no way to retrieve them after eval.

Fix: Add a getSession(scenarioId) method (or similar) so you can access logs after runner.run() completes.

4. reset() is called at the START of each scenario, not the end

Mistake: Capturing VM metadata only inside reset(), expecting it to run after the scenario completes.

How EvalRunner actually works: reset(scenarioId) is called at the beginning of each scenario to clear state from the previous one. After the last scenario finishes, reset() is NOT called again. This means:

reset("scenario_1")   → No session exists yet, nothing to capture
respond(...)          → Session created, agent runs
reset("scenario_2")   → Called with scenario_2's ID, but scenario_1's session
                        is keyed under "scenario_1" — never matched
respond(...)          → Session created for scenario_2
[run() returns]       → No final reset() — last scenario's data uncaptured
attachVmStreams()     → Metadata map is empty

Fix: Capture VM metadata inside respond() after each agent turn, not in reset(). The session is guaranteed to be alive during respond():

async respond(message: string, scenarioId?: string) {
  const id = scenarioId ?? "default";
  const session = this.getOrCreateSession(id);

  const result = await session.agent.processMessage(message);

  // Capture metadata HERE — session is alive, scenarioId is correct
  this.capturedMetadata.set(id, {
    sessionId: session.id,
    logs: session.logs,
    durationMs: Date.now() - session.startTime,
  });

  return { text: result, tool_calls: [/* ... */] };
}

This way attachVmStreams() always has data, regardless of when reset() or cleanup runs.

5. Clearing sessions on reset()

Mistake: Agent's reset() deletes browser session data, so logs are gone before you attach them.

Fix: Don't delete logs in reset(). Either keep them until after deploy(), or copy them out in the onScenario callback:

const allLogs = new Map<string, any>();
let prevId: string | undefined;

const saveLogs = (id?: string) => {
  const session = id ? agent.getSession(id) : undefined;
  if (id && session) allLogs.set(id, session);
};

const run = await runner.run(agent, {
  // Fires as each scenario STARTS — so save the PREVIOUS scenario's logs
  onScenario: (id) => {
    saveLogs(prevId);
    prevId = id;
  },
});

saveLogs(prevId); // onScenario never fires after the last scenario

6. VM stream attached but logs are empty

Symptom: vm_stream has provider, session_id, and metadata, but logs: [] and duration_ms: 0.

Cause: You're capturing session identity (IDs, URLs) but not recording browser actions into a logs array. If your agent emits tool call events internally (e.g. navigate_url, click_mouse, type_text), you need to collect those into VmLogEntry[] during respond().

Fix: Create a per-scenario log array, push entries when tool calls complete, and include the array in your VM stream:

private readonly scenarioVmLogs = new Map<string, VmLogEntry[]>();
private readonly sessionStartTimes = new Map<string, number>();

async respond(message: string, scenarioId?: string) {
  const id = scenarioId ?? "default";
  if (!this.scenarioVmLogs.has(id)) {
    this.scenarioVmLogs.set(id, []);
  }
  const vmLogs = this.scenarioVmLogs.get(id)!;
  const sessionStart = this.sessionStartTimes.get(id) ?? Date.now();

  // When a tool call completes, push a log entry:
  const onToolComplete = (tool: { name: string; args: any; status: string; error?: string }) => {
    const type = tool.name.includes("navigate") ? "navigation"
      : tool.name.includes("click") ? "action"
      : tool.name.includes("type") ? "action"
      : tool.name.includes("screenshot") ? "screenshot"
      : "action";

    vmLogs.push({
      ts: Date.now() - sessionStart,
      type,
      data: {
        action: tool.name,
        ...tool.args,
        ...(tool.status === "error" ? { error: tool.error } : {}),
      },
    });
  };

  // ... run agent, listen for tool events ...
}

Then include logs when attaching:

test.setVmStream("my-provider", {
  sessionId: session.id,
  durationMs: Date.now() - sessionStart,
  logs: this.scenarioVmLogs.get(test.test_id) ?? [], // ← don't forget this
});

7. Timestamps

Recommendation: Use milliseconds relative to session start (i.e., first log entry at ts: 0). Absolute timestamps work too, but relative timestamps make it easier to calculate durations and display timelines.
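
If your provider hands you absolute epoch timestamps, a one-pass normalization (a hypothetical helper, not part of the SDK) converts them before attaching:

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

// Rebase timestamps so the earliest entry sits at ts: 0. Leaves the
// input array untouched and returns new entry objects.
function toRelativeTimestamps(logs: VmLogEntry[]): VmLogEntry[] {
  if (logs.length === 0) return [];
  const start = Math.min(...logs.map((entry) => entry.ts));
  return logs.map((entry) => ({ ...entry, ts: entry.ts - start }));
}
```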

8. Grading is server-side

Calling deploy() submits your results (including VM streams) but grading happens asynchronously on the server (typically 1-3 minutes). To get graded results:

const created = await run.deploy(client, 42);
const runResult = await client.getRun((created as Record<string, unknown>).id as number);
// Check runResult.status — "graded" means complete
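
A generic polling loop covers the wait. Everything below is illustrative — only client.getRun and the "graded" status come from this guide; pollUntil and its parameters are assumed names:

```typescript
// Repeatedly calls fetchValue until isDone accepts the result, waiting
// intervalMs between attempts, and throws after maxAttempts tries.
async function pollUntil<T>(
  fetchValue: () => Promise<T>,
  isDone: (value: T) => boolean,
  { intervalMs = 5_000, maxAttempts = 60 } = {},
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const value = await fetchValue();
    if (isDone(value)) return value;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Still not done after ${maxAttempts} attempts`);
}

// Example wiring against the snippet above:
// const graded = await pollUntil(
//   () => client.getRun(runId),
//   (run: any) => run.status === "graded",
// );
```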