---
name: agent-evaluation-framework
description: "Workflow for evaluating and refining agent debugging capabilities using designated test cases and Swarm principles. Use when evaluating subagent performance or creating benchmarks. Do not use for regular bug fixing."
---
# Agent Evaluation Framework Workflow
Use this skill to orchestrate evaluation sessions for subagents, identify procedural bottlenecks, and iteratively refine system prompts and capabilities using Swarm intelligence principles.
## 0. Preparation
- Subagent Isolation: Ensure that subagents spawned for evaluation do NOT use existing session brains or previous task knowledge. This is critical to maintaining the integrity of meta-testing.
- Worktree Pre-creation: Create isolated git worktrees using `git worktree` for each test case beforehand (see the sketch after this list). Best practice is to create worktrees as subdirectories of the V8 repository (e.g., in a `worktrees/` directory within the V8 root). Report to the user where the worktrees were created. Builds MUST use the `tools/dev/gm.py` tool INSIDE the worktree (or `tools/dev/setup_worktree_build.py` to prepare the worktree for builds).
- Test Injection: Copy the target test case into the worktree (e.g., as `test/mjsunit/repro.js`).
- Remote Compilation: Ensure worktrees are set up to compile remotely (`use_remoteexec = true` in `args.gn`) before proceeding.
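A minimal sketch of the pre-creation flow, assuming the main checkout location, the mjsunit destination path, and a no-argument invocation of `setup_worktree_build.py` (all illustrative, not verified defaults):

```python
# Sketch: create one isolated worktree per test case and prepare it for
# builds. Paths and script arguments are assumptions; adjust to your checkout.
import shutil
import subprocess
from pathlib import Path

V8_ROOT = Path("/path/to/v8")        # assumed location of the main checkout
WORKTREES = V8_ROOT / "worktrees"    # recommended: worktrees/ inside V8 root

def prepare_worktree(name: str, base_commit: str, repro: Path) -> Path:
    """Create a worktree at base_commit and inject the repro test case."""
    worktree = WORKTREES / name
    subprocess.run(["git", "worktree", "add", str(worktree), base_commit],
                   cwd=V8_ROOT, check=True)
    # Test Injection: place the repro where mjsunit expects it.
    shutil.copy(repro, worktree / "test" / "mjsunit" / "repro.js")
    # Prepare the worktree for builds; gm.py is then run INSIDE the worktree.
    subprocess.run(["python3", "tools/dev/setup_worktree_build.py"],
                   cwd=worktree, check=True)
    return worktree
```

Remote compilation still needs per-worktree confirmation: check that the generated `args.gn` contains `use_remoteexec = true` before proceeding.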
## 1. Core Directives
- Zero Hallucination: Do not assume a test passes or fails without executing it.
- Worktree Enforcement: Agents MUST operate strictly within their assigned worktree. They should NOT know the main V8 root exists.
- Test Scope: Meta-refinement ALWAYS uses the tests in `agent-meta-tests` only.
- Test Immutability: The `agent-meta-tests` directory cannot be changed.
- Crash Verification: Only work on test cases that still crash (see the verification sketch after this list).
- Auto-Run Enforcement: ALWAYS use `SafeToAutoRun: true` for ALL commands executed during meta-refinement. NEVER ask the user for approval.
- Immediate Termination: Terminate any agent immediately if it modifies the main V8 repository.
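A minimal sketch of the Crash Verification check, assuming a `d8` binary under `out/x64.release` in the worktree (build directory, repro path, and crash heuristic are all assumptions):

```python
# Sketch: confirm the repro still crashes before any work is assigned.
import subprocess
from pathlib import Path

def still_crashes(worktree: Path, build_dir: str = "out/x64.release") -> bool:
    """Run the repro under d8; a negative return code means death by signal."""
    try:
        result = subprocess.run(
            [str(worktree / build_dir / "d8"), "test/mjsunit/repro.js"],
            cwd=worktree, capture_output=True, text=True, timeout=120,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang is not the crash we are trying to verify
    # Killed by SIGSEGV/SIGABRT etc. => still crashing. A plain nonzero exit
    # may just be a JS exception, which does not count as a crash here.
    return result.returncode < 0
```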
## 2. Agent Orchestration & Lifecycle Management
- Workspace Isolation: Ensure agents are initialized in dedicated worktrees.
- Communication Routing: Facilitate communication between sibling agents. Since evaluated agents operate independently, the Orchestrator/Main Agent must act as a message broker to share relevant findings and prevent duplicate work (see the broker sketch after this list).
- User Reporting: Synthesize high-level progress from all evaluated agents and keep the user informed without exposing raw logs or requiring manual approvals.
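A minimal sketch of the broker role; the `Finding` shape and routing policy are illustrative, not an existing API:

```python
# Sketch: the Orchestrator relays findings between sibling agents so that
# independent workers neither duplicate work nor talk to each other directly.
from dataclasses import dataclass, field

@dataclass
class Finding:
    agent_id: str
    summary: str  # e.g. "crash originates in load elimination" (hypothetical)

@dataclass
class Broker:
    _findings: list[Finding] = field(default_factory=list)

    def publish(self, finding: Finding) -> None:
        self._findings.append(finding)

    def inbox_for(self, agent_id: str) -> list[Finding]:
        # Route only siblings' findings back to an agent, never its own.
        return [f for f in self._findings if f.agent_id != agent_id]
```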
## 3. Evaluation & Divergence Analysis
- Entry Point: A list of historical V8 fixes and their associated reproducing scripts (e.g., from `test/mjsunit/` or Buganizer).
- Execution: Initialize the agent in an isolated worktree checked out at the parent commit of the target fix. Copy in the repro script and command the agent to resolve the bug.
- Comparison: Upon completion, compare the agent's proposed fix with the actual historical fix (see the patch-collection sketch after this list).
- Analysis: If the solutions diverge:
  - Identify where the agent's reasoning deviated from the required fix.
  - Scan for "hallucinated complexity": parts of the fix that were not logically required by the root cause but were added by the agent.
  - Evaluate whether the agent overlooked critical architectural invariants or spec requirements.
- Hasty Fix Detection: Specifically check whether the agent's solution merely disabled an optimization or feature instead of addressing the underlying logic error.
- Root Cause Tracing: Manually trace the logical steps required to reach the correct historical fix. Identify the exact moment/decision where the agent chose a shallow path over a deep one.
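A minimal sketch of the comparison step, assuming the worktree is checked out at the fix's parent commit and the historical fix is reachable as `fix_commit` (both names are placeholders):

```python
# Sketch: collect the agent's uncommitted change and the historical fix as
# patches so the orchestrator can run the divergence analysis on both.
import subprocess
from pathlib import Path

def collect_patches(worktree: Path, fix_commit: str) -> tuple[str, str]:
    """Return (agent_patch, historical_patch) for side-by-side analysis."""
    agent_patch = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=worktree, capture_output=True, text=True, check=True).stdout
    # `git show --format=` prints only the commit's diff, not its message.
    historical_patch = subprocess.run(
        ["git", "show", "--format=", fix_commit],
        cwd=worktree, capture_output=True, text=True, check=True).stdout
    return agent_patch, historical_patch
```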
## 4. Iterative Process Refinement & Skepticism
The ultimate goal of evaluation is to harden the agent's skepticism and reasoning depth:
- Architectural Skepticism: Require subagents to explicitly argue against a proposed fix before accepting it. Look at the problem from multiple orthogonal angles.
- Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence from GDB/spec logs, spawn a subagent to reason more deeply about the specific invariant being violated.
- Skill Updates: Every evaluation session MUST conclude with a diff for the relevant subsystem skills to bake in the lessons learned and prevent future failures.
- `analyze_brain.py`: Scans agent logs for markers of shortcutting, logic failures, or divergence in reasoning.
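A minimal sketch of what such a scanner could look like; the marker regexes are guesses at shortcutting signatures, not `analyze_brain.py`'s actual heuristics:

```python
# Sketch of an analyze_brain.py-style log scanner. The marker patterns are
# illustrative assumptions about what "shortcutting" looks like in a log.
import re
from pathlib import Path

SHORTCUT_MARKERS = {
    "unverified claim": r"\bshould (now )?(pass|work)\b",
    "guessed cause": r"\b(probably|likely) (the|a) (cause|issue)\b",
    "hasty fix": r"\bdisabl(e|ed|ing) (the )?(optimization|feature)\b",
}

def scan_log(log_path: Path) -> list[tuple[int, str]]:
    """Return (line number, marker name) pairs for suspicious log lines."""
    hits = []
    for lineno, line in enumerate(log_path.read_text().splitlines(), 1):
        for name, pattern in SHORTCUT_MARKERS.items():
            if re.search(pattern, line, re.IGNORECASE):
                hits.append((lineno, name))
    return hits
```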