VM Integration Guide
This guide covers how to evaluate browser-based and desktop-based agents that interact with virtual machines. If your agent controls a browser (via Playwright, Puppeteer, Browserbase, Kernel, Steel, etc.), this is the guide for you.
Table of Contents
- Overview
- How VM Streams Work
- Quick Start
- End-to-End Example: Browser Agent with EvalRunner
- Log Entry Format
- Providers
- Attaching VM Streams to EvalRunner Results
- Manual RunBuilder with VM Streams
- Gotchas and Common Mistakes
Overview
When your agent operates in a browser or VM, you want to capture what the agent did visually — not just what tool calls it made. VM streams attach timestamped browser logs (navigations, clicks, screenshots, network requests) to each test in a run, so you can replay and debug agent behavior in the Ashr Labs dashboard.
Key concept: VM streams are metadata attached to individual tests, not to the run as a whole. Each test (scenario) gets its own VM stream with its own session, logs, and duration.
How VM Streams Work
┌─────────────────────────────────────────────────┐
│ Your Browser Agent │
│ │
│ 1. Receives message from EvalRunner │
│ 2. Executes browser actions (click, type, etc.)│
│ 3. Collects logs as it goes │
│ 4. Returns { text, tool_calls } │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ EvalRunner │
│ │
│ - Records tool call matches ✓ │
│ - Records text response matches ✓ │
│ - Does NOT auto-capture VM logs ✗ │
└────────────────────┬────────────────────────────┘
│
▼ You must do this yourself
┌─────────────────────────────────────────────────┐
│ After runner.run(), attach VM streams: │
│ │
│ test.setVmStream("browserbase", { ... }) │
│ — or — │
│ test.setKernelVm("session_id", { ... }) │
└────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ run.deploy(client, datasetId) │
│ → VM logs submitted with run result │
│ → Viewable in Ashr Labs dashboard │
└─────────────────────────────────────────────────┘
Important: EvalRunner handles tool call and text comparison automatically, but it does not capture VM/browser logs. You must:
- Collect logs inside your agent during respond()
- Attach them to each TestBuilder after runner.run() returns
- Then call run.deploy()
Quick Start
Minimal example — a browser agent that records navigation logs:
import { AshrLabsClient, EvalRunner, Agent } from "ashr-labs";
const client = AshrLabsClient.fromEnv();
// Your agent must track logs per scenario
class MyBrowserAgent implements Agent {
logs = new Map<string, { sessionId: string; entries: any[] }>();
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
if (!this.logs.has(id)) {
this.logs.set(id, { sessionId: `sess_${Date.now()}`, entries: [] });
}
const ctx = this.logs.get(id)!;
// ... your browser automation here ...
ctx.entries.push({ ts: Date.now(), type: "navigation", data: { url: "https://example.com" } });
return { text: "Done", tool_calls: [] };
}
async reset(scenarioId?: string) {
this.logs.delete(scenarioId ?? "default");
}
}
const agent = new MyBrowserAgent();
const runner = await EvalRunner.fromDataset(client, 42);
const run = await runner.run(agent);
// Attach VM streams BEFORE deploying
for (const test of run.tests) {
const session = agent.logs.get(test.test_id);
if (session) {
test.setVmStream("playwright", {
sessionId: session.sessionId,
logs: session.entries,
});
}
}
await run.deploy(client, 42);
End-to-End Example: Browser Agent with EvalRunner
This is a complete example (with stubbed browser calls to replace with your provider) of a browser agent that:
- Uses the Anthropic SDK for reasoning
- Drives a browser via automation
- Records VM logs per scenario
- Runs evals and deploys with VM streams attached
import { AshrLabsClient, EvalRunner, Agent } from "ashr-labs";
import Anthropic from "@anthropic-ai/sdk";
interface BrowserSession {
sessionId: string;
logs: Array<{ ts: number; type: string; data: Record<string, unknown> }>;
startTime: number;
}
class WebNavigationAgent implements Agent {
private anthropic = new Anthropic();
private sessions = new Map<string, BrowserSession>();
private conversations = new Map<string, Anthropic.MessageParam[]>();
async respond(message: string, scenarioId?: string): Promise<Record<string, unknown>> {
const id = scenarioId ?? "default";
// Initialize browser session for this scenario
if (!this.sessions.has(id)) {
const session = await this.createBrowserSession();
this.sessions.set(id, {
sessionId: session.id,
logs: [],
startTime: Date.now(),
});
}
const session = this.sessions.get(id)!;
const history = this.conversations.get(id) ?? [];
history.push({ role: "user", content: message });
// Ask the LLM what browser actions to take
const allToolCalls: Record<string, unknown>[] = [];
let finalText = "";
for (let turn = 0; turn < 10; turn++) {
const response = await this.anthropic.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: "You are a web navigation agent. Use tools to interact with the browser.",
tools: [
{
name: "navigate",
description: "Navigate to a URL",
input_schema: { type: "object" as const, properties: { url: { type: "string" } }, required: ["url"] },
},
{
name: "click",
description: "Click an element",
input_schema: { type: "object" as const, properties: { selector: { type: "string" } }, required: ["selector"] },
},
{
name: "type_text",
description: "Type text into an input",
input_schema: {
type: "object" as const,
properties: { selector: { type: "string" }, text: { type: "string" } },
required: ["selector", "text"],
},
},
],
messages: history,
});
// Collect text blocks
for (const block of response.content) {
if (block.type === "text") finalText += block.text;
}
// Process tool calls
const toolUseBlocks = response.content.filter((b): b is Anthropic.ToolUseBlock => b.type === "tool_use");
if (toolUseBlocks.length === 0) {
history.push({ role: "assistant", content: response.content });
break;
}
// Execute each tool call in the browser and record logs
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const tool of toolUseBlocks) {
const input = tool.input as Record<string, string>;
const now = Date.now() - session.startTime;
allToolCalls.push({ name: tool.name, arguments: input });
// Execute and log
switch (tool.name) {
case "navigate":
await this.browserNavigate(session.sessionId, input.url);
session.logs.push({ ts: now, type: "navigation", data: { url: input.url } });
toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Navigated" });
break;
case "click":
await this.browserClick(session.sessionId, input.selector);
session.logs.push({ ts: now, type: "action", data: { action: "click", selector: input.selector } });
toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Clicked" });
break;
case "type_text":
await this.browserType(session.sessionId, input.selector, input.text);
session.logs.push({
ts: now,
type: "action",
data: { action: "type", selector: input.selector, value: input.text },
});
toolResults.push({ type: "tool_result", tool_use_id: tool.id, content: "Typed" });
break;
}
// Capture a screenshot after each action
session.logs.push({ ts: Date.now() - session.startTime, type: "screenshot", data: {} });
}
history.push({ role: "assistant", content: response.content });
history.push({ role: "user", content: toolResults });
if (response.stop_reason === "end_turn") break;
}
this.conversations.set(id, history);
return { text: finalText, tool_calls: allToolCalls };
}
async reset(scenarioId?: string) {
const id = scenarioId ?? "default";
await this.closeBrowserSession(this.sessions.get(id)?.sessionId);
this.sessions.delete(id);
this.conversations.delete(id);
}
/** Expose sessions so we can attach VM streams after eval */
getSession(scenarioId: string): BrowserSession | undefined {
return this.sessions.get(scenarioId);
}
// -- Replace these stubs with your actual browser provider --
private async createBrowserSession() { return { id: `sess_${Date.now()}` }; }
private async closeBrowserSession(_id?: string) {}
private async browserNavigate(_sessionId: string, _url: string) {}
private async browserClick(_sessionId: string, _selector: string) {}
private async browserType(_sessionId: string, _selector: string, _text: string) {}
}
// ─── Main ────────────────────────────────────────────────────────────────
async function main() {
const client = AshrLabsClient.fromEnv();
const agent = new WebNavigationAgent();
// 1. Run eval
const runner = await EvalRunner.fromDataset(client, 42);
const run = await runner.run(agent, {
maxWorkers: 1, // Use 1 for browser agents (see Gotchas below)
onScenario: (id, s) => console.log(`▶ Scenario: ${s.title}`),
onAction: (i, a) => console.log(` Action ${i}: ${a.content?.slice(0, 60)}`),
});
// 2. Attach VM streams to each test
for (const test of run.tests) {
const session = agent.getSession(test.test_id);
if (session) {
test.setVmStream("custom", {
sessionId: session.sessionId,
durationMs: Date.now() - session.startTime,
logs: session.logs,
metadata: {
browser: "chromium",
viewport: { width: 1280, height: 720 },
},
});
}
}
// 3. Inspect metrics
const result = run.build();
const metrics = result.aggregate_metrics as Record<string, unknown>;
console.log(`\nResults: ${metrics.tests_passed}/${metrics.total_tests} passed`);
console.log(`Tool divergence: ${metrics.total_tool_call_divergence}`);
// 4. Deploy
const created = await run.deploy(client, 42);
console.log(`Run deployed: ${(created as Record<string, unknown>).id}`);
}
main().catch(console.error);
Log Entry Format
Every log entry in a VM stream follows this structure:
interface VmLogEntry {
ts: number; // Timestamp in milliseconds (relative to session start, or absolute)
type: string; // One of the types below
data: Record<string, unknown>; // Type-specific payload
}
Log Types Reference
| Type | When to use | Required data fields | Optional data fields |
|---|---|---|---|
| navigation | Browser navigated to a new URL | url | - |
| action | Agent interacted with the page | action ("click", "type", "select", "scroll") | selector, value, delta_x, delta_y |
| network | HTTP request completed | method, url | status, duration_ms |
| console | Browser console output | message | level ("log", "warn", "error") |
| error | An error occurred | message | code, details |
| screenshot | Screenshot was captured | - | s3_key, format ("png", "jpeg") |
Example Logs
// Navigation
{ ts: 0, type: "navigation", data: { url: "https://app.example.com" } }
// Click action
{ ts: 1200, type: "action", data: { action: "click", selector: "#login-btn" } }
// Type into input
{ ts: 2000, type: "action", data: { action: "type", selector: "#email", value: "user@example.com" } }
// Scroll
{ ts: 2500, type: "action", data: { action: "scroll", delta_x: 0, delta_y: 300 } }
// Network request
{ ts: 3000, type: "network", data: { method: "POST", url: "/api/login", status: 200 } }
// Console warning
{ ts: 3100, type: "console", data: { level: "warn", message: "Deprecated API called" } }
// Error
{ ts: 4000, type: "error", data: { message: "Element not found: #checkout-btn" } }
// Screenshot
{ ts: 5000, type: "screenshot", data: { s3_key: "vm-streams/run_42/frame_001.png" } }
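The required-field rules from the table above can be encoded in a small validator and run against your entries before attaching them. This helper is a sketch for illustration, not part of the SDK; only the VmLogEntry shape comes from this guide.

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

// Required `data` fields per log type, mirroring the reference table above.
const REQUIRED_FIELDS: Record<string, string[]> = {
  navigation: ["url"],
  action: ["action"],
  network: ["method", "url"],
  console: ["message"],
  error: ["message"],
  screenshot: [], // all fields optional
};

// Returns a list of problems; an empty array means the entry is well-formed.
function validateLogEntry(entry: VmLogEntry): string[] {
  const problems: string[] = [];
  if (typeof entry.ts !== "number" || entry.ts < 0) {
    problems.push(`invalid ts: ${entry.ts}`);
  }
  const required = REQUIRED_FIELDS[entry.type];
  if (required === undefined) {
    problems.push(`unknown type: ${entry.type}`);
    return problems;
  }
  for (const field of required) {
    if (!(field in entry.data)) {
      problems.push(`missing data.${field} for type "${entry.type}"`);
    }
  }
  return problems;
}
```

Running this over a scenario's log array before setVmStream() catches malformed entries early, while you still have the session context to fix them.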
Providers
Generic Provider
Use setVmStream() for any browser provider (Browserbase, Steel, Scrapybara, Playwright, Puppeteer, or your own):
test.setVmStream(provider, opts?)
| Parameter | Type | Required | Description |
|---|---|---|---|
| provider | string | Yes | Provider name (e.g. "browserbase", "steel", "playwright", "custom") |
| opts.sessionId | string | No | Your provider's session ID |
| opts.durationMs | number | No | Total session duration in milliseconds |
| opts.logs | VmLogEntry[] | No | Array of timestamped log entries |
| opts.metadata | Record<string, unknown> | No | Any additional provider-specific data |
test.setVmStream("browserbase", {
sessionId: "sess_abc123",
durationMs: 12000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://shop.example.com" } },
{ ts: 800, type: "action", data: { action: "click", selector: "#product" } },
{ ts: 3500, type: "network", data: { method: "POST", url: "/api/cart", status: 200 } },
],
metadata: { browser: "chromium", viewport: { width: 1280, height: 720 } },
});
Kernel Browser
Use setKernelVm() for Kernel browser sessions. This is a convenience wrapper that sets provider="kernel" and exposes Kernel-specific metadata as named parameters:
test.setKernelVm(sessionId, opts?)
| Parameter | Type | Required | Description |
|---|---|---|---|
| sessionId | string | Yes | Kernel browser session ID |
| opts.durationMs | number | No | Total session duration in milliseconds |
| opts.logs | VmLogEntry[] | No | Array of timestamped log entries |
| opts.liveViewUrl | string | No | Kernel's browser_live_view_url for real-time viewing |
| opts.cdpWsUrl | string | No | Chrome DevTools Protocol WebSocket URL |
| opts.replayId | string | No | Kernel session recording ID |
| opts.replayViewUrl | string | No | URL to view the session replay |
| opts.headless | boolean | No | Whether the session ran headless |
| opts.stealth | boolean | No | Whether anti-bot stealth mode was enabled |
| opts.viewport | { width, height } | No | Browser viewport dimensions |
test.setKernelVm("kern_sess_abc123", {
durationMs: 15000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://app.example.com" } },
{ ts: 1200, type: "action", data: { action: "click", selector: "#login" } },
{ ts: 2500, type: "action", data: { action: "type", selector: "#email", value: "user@example.com" } },
{ ts: 3800, type: "action", data: { action: "click", selector: "#submit" } },
],
replayId: "replay_abc123",
replayViewUrl: "https://www.kernel.sh/replays/replay_abc123",
stealth: true,
viewport: { width: 1920, height: 1080 },
});
When to use which?
- setKernelVm() — You're using Kernel and want replay URLs, live view, and CDP access tracked in your results.
- setVmStream() — Everything else. Works with any provider. Pass whatever metadata makes sense for your setup.
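Conceptually, the wrapper relationship can be sketched as a pure mapping from Kernel's named options onto the generic stream payload. This is a hypothetical illustration of the idea, not the SDK's actual internals; the real implementation may structure the payload differently.

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

interface VmStreamOpts {
  sessionId?: string;
  durationMs?: number;
  logs?: VmLogEntry[];
  metadata?: Record<string, unknown>;
}

// Hypothetical sketch: setKernelVm's named options expressed as a
// setVmStream call with provider "kernel". Not the real SDK code.
function kernelOptsToVmStream(
  sessionId: string,
  opts: { durationMs?: number; logs?: VmLogEntry[]; [k: string]: unknown } = {},
): { provider: string; stream: VmStreamOpts } {
  const { durationMs, logs, ...kernelMeta } = opts;
  return {
    provider: "kernel",
    stream: {
      sessionId,
      durationMs,
      logs,
      // Kernel-specific fields (liveViewUrl, replayId, stealth, ...) travel as metadata.
      metadata: kernelMeta as Record<string, unknown>,
    },
  };
}
```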
Attaching VM Streams to EvalRunner Results
This is the most common source of confusion. EvalRunner.run() returns a RunBuilder with TestBuilder instances for each scenario. The runner records tool calls and text comparisons, but you must attach VM streams.
const run = await runner.run(agent);
// run.tests is an array of TestBuilder instances
// Each test.test_id matches the scenario ID from the dataset
for (const test of run.tests) {
const session = agent.getSession(test.test_id);
if (session) {
// Option A: generic provider
test.setVmStream("my-provider", {
sessionId: session.id,
logs: session.logs,
durationMs: session.duration,
});
// Option B: Kernel
// test.setKernelVm(session.kernelId, {
// logs: session.logs,
// replayId: session.replayId,
// });
}
}
// Now deploy — VM streams are included
await run.deploy(client, datasetId);
Why doesn't EvalRunner capture logs automatically?
The Agent interface (respond + reset) is intentionally minimal. It doesn't know or care whether your agent uses a browser, a terminal, or an API. VM stream capture is provider-specific and varies widely, so it's left to you.
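The attach loop above can be factored into a reusable helper that works against structural types. The TestBuilder and session shapes below are minimal stand-ins for illustration, not the SDK's real type declarations; only the test_id and setVmStream members come from this guide.

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

interface SessionInfo {
  sessionId: string;
  logs: VmLogEntry[];
  startTime: number;
}

// Minimal structural view of a TestBuilder: just the two members we need.
interface TestWithVmStream {
  test_id: string;
  setVmStream(provider: string, opts: Record<string, unknown>): void;
}

// Attaches a VM stream to every test that has a matching session.
// Returns the number of tests that received a stream.
function attachVmStreams(
  tests: TestWithVmStream[],
  getSession: (scenarioId: string) => SessionInfo | undefined,
  provider = "custom",
): number {
  let attached = 0;
  for (const test of tests) {
    const session = getSession(test.test_id);
    if (!session) continue; // scenario never created a browser session
    test.setVmStream(provider, {
      sessionId: session.sessionId,
      durationMs: Date.now() - session.startTime,
      logs: session.logs,
    });
    attached++;
  }
  return attached;
}
```

With this in place, step 2 of the main() example collapses to `attachVmStreams(run.tests, (id) => agent.getSession(id), "custom")`, and the return value doubles as a sanity check that streams actually got attached.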
Manual RunBuilder with VM Streams
If you're not using EvalRunner and building results manually:
import { AshrLabsClient, RunBuilder } from "ashr-labs";
const client = new AshrLabsClient("tp_...");
const run = new RunBuilder();
run.start();
const test = run.addTest("login_flow");
test.start();
// Record what the agent did
test.addToolCall(
{ name: "navigate", arguments_json: '{"url":"https://app.example.com/login"}' },
{ name: "navigate", arguments: { url: "https://app.example.com/login" } },
"exact",
);
test.addAgentResponse(
{ text: "Navigated to login page" },
{ text: "I've opened the login page" },
"similar",
0.85,
);
// Attach the browser session
test.setVmStream("playwright", {
sessionId: "sess_001",
durationMs: 8000,
logs: [
{ ts: 0, type: "navigation", data: { url: "https://app.example.com/login" } },
{ ts: 1500, type: "action", data: { action: "type", selector: "#email", value: "test@example.com" } },
{ ts: 2500, type: "action", data: { action: "type", selector: "#password", value: "********" } },
{ ts: 3500, type: "action", data: { action: "click", selector: "#login-btn" } },
{ ts: 5000, type: "navigation", data: { url: "https://app.example.com/dashboard" } },
{ ts: 5500, type: "screenshot", data: {} },
],
});
test.complete();
run.complete();
await run.deploy(client, 42);
Gotchas and Common Mistakes
1. VM streams are not auto-captured
Mistake: Running runner.run(agent) and deploying, expecting VM logs to appear.
Fix: You must call test.setVmStream() or test.setKernelVm() on each test after run() returns and before deploy().
2. Parallel execution with browser agents
Mistake: Setting maxWorkers: 4 with a browser agent that shares a single browser instance.
Fix: Use maxWorkers: 1 (the default) for browser agents, unless your agent creates independent browser sessions per scenarioId. The scenarioId parameter in respond() exists specifically for this — key your sessions on it:
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
// Each scenario gets its own browser session
if (!this.sessions.has(id)) {
this.sessions.set(id, await this.createNewBrowserSession());
}
// ...
}
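The key-by-scenarioId pattern can be isolated into a small session pool: repeated calls for the same scenario reuse one session, while different scenarios get independent ones. The class and names here are illustrative, not SDK API; your createBrowserSession call would replace the inline session construction.

```typescript
interface PooledSession {
  sessionId: string;
  logs: Array<{ ts: number; type: string; data: Record<string, unknown> }>;
  startTime: number;
}

// One browser session per scenario ID. Safe with maxWorkers > 1 because
// concurrently running scenarios never share a session object.
class SessionPool {
  private readonly sessions = new Map<string, PooledSession>();
  private counter = 0;

  // Get the scenario's session, creating it on first use.
  sessionFor(scenarioId: string): PooledSession {
    let session = this.sessions.get(scenarioId);
    if (!session) {
      session = {
        sessionId: `sess_${++this.counter}`,
        logs: [],
        startTime: Date.now(),
      };
      this.sessions.set(scenarioId, session);
    }
    return session;
  }

  // Drop a scenario's session (e.g. after its logs have been attached).
  release(scenarioId: string): void {
    this.sessions.delete(scenarioId);
  }
}
```

Note that if session creation is asynchronous in your agent, the check-then-set must not have an await between the get and the set, or two concurrent calls for the same scenario can race and create duplicate sessions.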
3. Forgetting to expose session data
Mistake: Agent collects logs in respond() but has no way to retrieve them after eval.
Fix: Add a getSession(scenarioId) method (or similar) so you can access logs after runner.run() completes.
4. reset() is called at the START of each scenario, not the end
Mistake: Capturing VM metadata only inside reset(), expecting it to run after the scenario completes.
How EvalRunner actually works: reset(scenarioId) is called at the beginning of each scenario to clear state from the previous one. After the last scenario finishes, reset() is NOT called again. This means:
reset("scenario_1") → No session exists yet, nothing to capture
respond(...) → Session created, agent runs
reset("scenario_2") → Called with scenario_2's ID, but scenario_1's session
is keyed under "scenario_1" — never matched
respond(...) → Session created for scenario_2
[run() returns] → No final reset() — last scenario's data uncaptured
attachVmStreams() → Metadata map is empty
Fix: Capture VM metadata inside respond() after each agent turn, not in reset(). The session is guaranteed to be alive during respond():
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
const session = this.getOrCreateSession(id);
const result = await session.agent.processMessage(message);
// Capture metadata HERE — session is alive, scenarioId is correct
this.capturedMetadata.set(id, {
sessionId: session.id,
logs: session.logs,
durationMs: Date.now() - session.startTime,
});
return { text: result, tool_calls: [...] };
}
This way attachVmStreams() always has data, regardless of when reset() or cleanup runs.
5. Clearing sessions on reset()
Mistake: Agent's reset() deletes browser session data, so logs are gone before you attach them.
Fix: Don't delete logs in reset(). Either keep them until after deploy(), or copy them out in the onScenario callback:
const allLogs = new Map<string, any>();
let prevId: string | undefined;
const run = await runner.run(agent, {
  onScenario: (id) => {
    // Save the previous scenario's logs before the agent resets them.
    // Note: onScenario fires with the NEW scenario's ID, so we track the
    // previous one ourselves.
    if (prevId) {
      const prev = agent.getSession(prevId);
      if (prev) allLogs.set(prevId, prev);
    }
    prevId = id;
  },
});
// The last scenario is never followed by another onScenario — copy it here.
if (prevId) {
  const last = agent.getSession(prevId);
  if (last) allLogs.set(prevId, last);
}
6. VM stream attached but logs are empty
Symptom: vm_stream has provider, session_id, and metadata, but logs: [] and duration_ms: 0.
Cause: You're capturing session identity (IDs, URLs) but not recording browser actions into a logs array. If your agent emits tool call events internally (e.g. navigate_url, click_mouse, type_text), you need to collect those into VmLogEntry[] during respond().
Fix: Create a per-scenario log array, push entries when tool calls complete, and include the array in your VM stream:
private readonly scenarioVmLogs = new Map<string, VmLogEntry[]>();
async respond(message: string, scenarioId?: string) {
const id = scenarioId ?? "default";
if (!this.scenarioVmLogs.has(id)) {
this.scenarioVmLogs.set(id, []);
}
const vmLogs = this.scenarioVmLogs.get(id)!;
const sessionStart = this.sessionStartTimes.get(id) ?? Date.now();
// When a tool call completes, push a log entry:
const onToolComplete = (tool: { name: string; args: any; status: string; error?: string }) => {
const type = tool.name.includes("navigate") ? "navigation"
: tool.name.includes("click") ? "action"
: tool.name.includes("type") ? "action"
: tool.name.includes("screenshot") ? "screenshot"
: "action";
vmLogs.push({
ts: Date.now() - sessionStart,
type,
data: {
action: tool.name,
...tool.args,
...(tool.status === "error" ? { error: tool.error } : {}),
},
});
};
// ... run agent, listen for tool events ...
}
Then include logs when attaching:
test.setVmStream("my-provider", {
sessionId: session.id,
durationMs: Date.now() - sessionStart,
logs: this.scenarioVmLogs.get(test.test_id) ?? [], // ← don't forget this
});
7. Timestamps
Recommendation: Use milliseconds relative to session start (i.e., first log entry at ts: 0). Absolute timestamps work too, but relative timestamps make it easier to calculate durations and display timelines.
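If your provider hands you absolute epoch timestamps, rebasing them to session-relative before attaching is a few lines. A minimal sketch, using only the VmLogEntry shape from this guide:

```typescript
interface VmLogEntry {
  ts: number;
  type: string;
  data: Record<string, unknown>;
}

// Rebase timestamps so the earliest entry lands at ts: 0.
// Also sorts entries, since relative timelines assume chronological order.
function rebaseTimestamps(logs: VmLogEntry[]): VmLogEntry[] {
  if (logs.length === 0) return [];
  const sorted = [...logs].sort((a, b) => a.ts - b.ts);
  const t0 = sorted[0].ts;
  return sorted.map((entry) => ({ ...entry, ts: entry.ts - t0 }));
}
```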
8. Grading is server-side
Calling deploy() submits your results (including VM streams) but grading happens asynchronously on the server (typically 1-3 minutes). To get graded results:
const created = await run.deploy(client, 42);
const runResult = await client.getRun((created as Record<string, unknown>).id as number);
// Check runResult.status — "graded" means complete
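Since grading is asynchronous, the usual pattern is a small poll loop around getRun(). In this sketch, client.getRun and the "graded" status come from this guide; the "failed" status, the interval, and the timeout are assumptions to adjust for your setup.

```typescript
// Minimal structural view of the client: only the call we need.
interface RunClient {
  getRun(id: number): Promise<{ status: string }>;
}

// Statuses that end polling. "graded" is documented above; "failed" is an
// assumed terminal error state — adjust to whatever your API actually returns.
function isTerminalStatus(status: string): boolean {
  return status === "graded" || status === "failed";
}

// Poll until the run reaches a terminal status or the timeout elapses.
async function waitForGraded(
  client: RunClient,
  runId: number,
  { intervalMs = 15_000, timeoutMs = 5 * 60_000 } = {},
): Promise<{ status: string }> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const run = await client.getRun(runId);
    if (isTerminalStatus(run.status)) return run;
    if (Date.now() >= deadline) {
      throw new Error(`run ${runId} not graded after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The defaults (15s interval, 5 minute timeout) fit the typical 1-3 minute grading window described above.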