---
name: vr-loop
description: IMO-style verification-and-refinement loop. Iteratively verifies a proof via the external LLM, extracts structured bug reports, applies fixes, and requires 5 consecutive passes for acceptance.
argument-hint: "<tex-file> — path to .tex file or section; defaults to most recently modified .tex"
compatibility: Requires LLM_API_KEY environment variable. Depends on external-llm skill.
license: Apache-2.0
metadata:
  version: "1.0"
  category: math
---
Verification-and-Refinement Loop
Adapted from "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline" (Huang & Yang, arXiv:2507.15855). Iteratively verifies a mathematical proof by cycling between a verifier (an external reasoning model via /external-llm) and a refiner (the agent), requiring 5 consecutive clean passes for acceptance.
Setup
Requires the external-llm skill to be installed. Set your API key:
export LLM_API_KEY="your-key-here"
Invocation
/vr-loop proof.tex
/vr-loop # defaults to most recently modified .tex in current directory
/vr-loop sec3 # verify a specific section (matched by \section title)
Key Difference from external-review-loop
| Feature | external-review-loop | vr-loop |
|---|---|---|
| Scope | Whole paper, section-by-section | Single proof/argument, tight loop |
| Acceptance | 1 pass | 5 consecutive passes |
| Verifier output | ISSUE / ALL CLEAR | Structured bug report (critical errors + justification gaps) |
| Rejection | Oscillation detection | 10 consecutive fails |
| Focus | Breadth (many sections) | Depth (one proof, iterated) |
Algorithm
0. Resolve target (file or section)
0.5. Interview: what to verify, what concerns exist
1. VERIFY: send proof to the external LLM verifier (fresh session each time)
2. EXTRACT VERDICT: parse structured Summary from response
3. CLAUDE REVIEWS: check each finding (CONFIRMED / FALSE POSITIVE)
correct_count = 0, error_count = 0, iteration = 0

LOOP (max 30 iterations):
  Agent determines effective verdict (PASS/FAIL) after reviewing findings:
    - PASS = 0 critical, 0 confirmed-major (false positives don't count)
    - FAIL = >=1 confirmed critical or confirmed major gap
  If effective verdict = PASS:
    correct_count += 1, error_count = 0
    If correct_count >= 5 -> ACCEPT (done!)
    Else -> re-verify (fresh verifier session)
  If effective verdict = FAIL:
    error_count += 1, correct_count = 0
    If error_count >= 10 -> REJECT (stop, report issues)
    Else:
      a. Agent fixes confirmed issues directly (skip the external LLM correction
         for simple fixes; use it only for complex issues)
      b. Re-verify (back to VERIFY step)
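The control flow above is simple enough to sketch in a few lines of Python. The helpers verify_once(), review_findings(), and apply_fixes() are hypothetical placeholders for the /external-llm call, the agent's false-positive review, and the agent's edits; treat this as a sketch of the loop logic, not a drop-in implementation.

```python
# Sketch of the VR control loop. verify_once(), review_findings(), and
# apply_fixes() are hypothetical placeholders, not part of this skill's API.
CONSECUTIVE_PASS = 5
CONSECUTIVE_FAIL = 10
MAX_ITERATIONS = 30

def vr_loop(proof_tex: str) -> str:
    correct_count, error_count = 0, 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        report = verify_once(proof_tex, f"verify-{iteration}")   # fresh verifier session
        confirmed = review_findings(proof_tex, report)           # agent filters false positives
        if not confirmed:                                        # effective PASS
            correct_count, error_count = correct_count + 1, 0
            if correct_count >= CONSECUTIVE_PASS:
                return "ACCEPT"
        else:                                                    # effective FAIL
            error_count, correct_count = error_count + 1, 0
            if error_count >= CONSECUTIVE_FAIL:
                return "REJECT"
            proof_tex = apply_fixes(proof_tex, confirmed)        # direct edits or Prompt 3
    return "MAX_ITERATIONS"
```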
Lessons Learned
These lessons are critical — they override the theoretical algorithm above.
1. Parse the Summary directly — do NOT use a yes/no follow-up
Problem: the external LLM's yes/no extraction is unreliable. It often contradicts its own Summary.
Solution: Parse the structured Summary output directly:
- Extract Final Verdict: PASS/FAIL
- Extract Critical errors: N
- Extract Justification gaps: N major, M minor
The verdict is determined by the counts, NOT by a follow-up question.
2. The agent is the gatekeeper — false positives don't count
Problem: roughly 30% of the external LLM's findings are false positives. Common hallucinations:
- "Result stated without reference" when the reference EXISTS
- Flagging standard algebraic identities as unproven
- Flagging results from prior work as unproven
Solution: After EVERY verification, the agent reviews each finding:
- Read the flagged location in the tex file
- Check the math
- Classify: CONFIRMED / FALSE POSITIVE / UNCLEAR
- All findings FALSE POSITIVE -> effective PASS (even if raw FAIL)
3. Fix simple issues directly — skip the external LLM correction
Most fixes are 2-3 line edits. Only use the external LLM correction for genuinely complex mathematical issues.
4. Add "Do NOT flag results from prior work" to the verifier prompt
5. Run parallel verifications in the pass-accumulation phase
Once you have 2+ consecutive passes, launch 2 verification sessions in parallel.
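A minimal sketch of how the two sessions might be launched concurrently, reusing the hypothetical verify_once() wrapper from the loop sketch above; each result is still reviewed by the agent before it counts toward the pass streak.

```python
# Sketch: launch two independent verifier sessions concurrently during the
# pass-accumulation phase. verify_once() is a hypothetical wrapper around
# the /external-llm call and is not part of this skill's API.
from concurrent.futures import ThreadPoolExecutor

def parallel_verify(proof_tex: str, iteration: int) -> list[str]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(verify_once, proof_tex, f"verify-{iteration}-{k}")
                   for k in (1, 2)]
        return [f.result() for f in futures]
```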
Prompts
All external LLM interactions go through the /external-llm skill.
Prompt 1: Verifier System Prompt
You are an expert mathematician and meticulous referee. Rigorously verify
the proof below. A proof is correct ONLY if every step is justified.
### Instructions ###
**1. Core Task**
* Find and report ALL issues. Act as a **verifier**, NOT a solver.
* Perform a **step-by-step** check of the entire proof.
**2. Issue Categories**
* **Critical Error:** A logical fallacy or mathematical mistake that
**breaks the logical chain** of the proof.
* **Justification Gap:**
- **Major gap:** A nontrivial step is asserted without proof or reference.
- **Minor gap:** A step most experts would accept but is not fully rigorous.
**3. Downstream Checking**
Even if a step has a gap, **assume its conclusion correct** for checking
downstream steps. Find ALL issues, not just the first.
**4. What NOT to Flag**
- Style issues, notation preferences, or formatting
- Well-known results (Sard, IFT, Borsuk-Ulam, etc.)
- Standard algebraic manipulations verifiable in 1-2 lines
- Results cited from prior work or with explicit references
**5. What TO Flag**
- "By genericity" without specifying bad locus and codimension
- Dimensional or rank claims stated without computation
- Topological arguments with unverified hypotheses
**6. Output Format**
Your response MUST contain these sections in order:
**a. Summary**
- **Final Verdict:** PASS or FAIL
- **Critical errors:** [count]
- **Justification gaps:** [count major] major, [count minor] minor
- **Findings list:** One-line summary of each issue with location
**b. Detailed Verification Log**
Step-by-step analysis. For each key step:
- Quote the claim or equation
- Check the reasoning
- Mark verified or issue found
IMPORTANT: Your Final Verdict MUST be consistent with your counts.
If critical == 0 and major gaps == 0, the verdict MUST be PASS.
Prompt 2: Verification Reminder
Appended after the proof content:
### Verification Task Reminder ###
Generate the **summary** (with verdict, counts, and findings list) and the
**step-by-step detailed verification log** for the proof above.
IMPORTANT: Ensure your Final Verdict is consistent with your error counts.
Prompt 3: Correction Prompt (for complex issues only)
Below is the bug report from the verifier. If you agree with an item,
improve the proof so that it is complete and rigorous. If you disagree,
add detailed explanations to avoid such misunderstanding. Address every item.
Step-by-Step Instructions
Step 0: Resolve Target
If the argument is a file path, use it. Otherwise, find the most recently modified .tex in the current directory:
ls -t *.tex | head -1
Read the target content. Store as $TEX_FILE, $BASE, and $TARGET_CONTENT.
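A sketch of the target-resolution logic, assuming the argument is either an existing .tex path or a string matched against \section{...} titles; the section-matching regex below is an assumption about how that lookup might be done, not a fixed rule of this skill.

```python
# Sketch of target resolution: explicit .tex path, section-title match,
# or the most recently modified .tex in the current directory.
import glob, os, re, sys

def resolve_target(arg=None):
    if arg and arg.endswith(".tex") and os.path.exists(arg):
        tex_file = arg
    else:
        candidates = sorted(glob.glob("*.tex"), key=os.path.getmtime, reverse=True)
        if not candidates:
            sys.exit("No .tex file found in the current directory")
        tex_file = candidates[0]
    with open(tex_file) as fh:
        content = fh.read()
    if arg and not arg.endswith(".tex"):
        # Assumed matching rule: the first \section whose title contains the argument
        pattern = rf"(\\section\{{[^}}]*{re.escape(arg)}[^}}]*\}}.*?)(?=\\section\{{|\\end\{{document\}}|\Z)"
        match = re.search(pattern, content, re.DOTALL | re.IGNORECASE)
        if match:
            content = match.group(1)
    return tex_file, content
```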
Step 0.5: Interview
Ask the user:
- What is the proof trying to show? (one sentence — context for verifier)
- What are you least confident about? (specific steps, estimates, edge cases)
Step 1: Verify
1a. Create fresh verification session
/external-llm /new vr-{BASE}-verify-{iteration}
1b. Send verifier prompt + proof
Compose a single message: Verifier System Prompt + proof context/concerns + proof content + Verification Reminder.
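A sketch of that composition, where VERIFIER_SYSTEM_PROMPT, user_concerns, target_content, and VERIFICATION_REMINDER are assumed names for Prompt 1, the interview answers, the proof text, and Prompt 2.

```python
# Assumed variable names; the ordering follows Prompt 1 -> context -> proof -> Prompt 2
message = "\n\n".join([
    VERIFIER_SYSTEM_PROMPT,                                 # Prompt 1
    "### Context and Author Concerns ###\n" + user_concerns,
    "### Proof ###\n" + target_content,
    VERIFICATION_REMINDER,                                   # Prompt 2
])
```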
1c. Parse verdict from structured Summary
Do NOT send a follow-up yes/no question. Parse the Summary directly:
import re

text = llm_response  # full reply from the verifier session
# Extract from the [OUTPUT] section if present
output_idx = text.find("[OUTPUT]")
if output_idx >= 0:
    text = text[output_idx:]
# Parse counts; if either regex fails to match, treat the run as inconclusive
crit_match = re.search(r'Critical errors?:\s*(\d+)', text)
major_match = re.search(r'(\d+)\s*major', text)
if crit_match and major_match:
    critical = int(crit_match.group(1))
    major = int(major_match.group(1))
    raw_verdict = "PASS" if (critical == 0 and major == 0) else "FAIL"
else:
    raw_verdict = "INCONCLUSIVE"  # re-run the verification rather than guessing
1d. Agent reviews each finding
For EVERY iteration (pass or fail), review each finding:
- Read the flagged location in the tex file
- Check the math
- Classify: CONFIRMED / FALSE POSITIVE / UNCLEAR
Determine effective verdict:
- All findings FALSE POSITIVE -> effective PASS
- Any finding CONFIRMED critical or major -> effective FAIL
- UNCLEAR -> treat as FAIL (conservative)
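One way this classification could be encoded, shown as a sketch; the Finding record and its field names are assumptions for illustration, not part of the verifier's actual output format.

```python
# Hypothetical finding record; severity comes from the verifier, classification
# from the agent's own review of the flagged location.
from dataclasses import dataclass

@dataclass
class Finding:
    location: str        # e.g. "Lemma 3.2, eq. (14)"
    description: str     # one-line summary from the verifier's findings list
    severity: str        # "critical", "major", or "minor"
    classification: str  # "CONFIRMED", "FALSE_POSITIVE", or "UNCLEAR"

def effective_verdict(findings: list[Finding]) -> str:
    for f in findings:
        if f.severity in ("critical", "major") and f.classification in ("CONFIRMED", "UNCLEAR"):
            return "FAIL"  # unresolved or confirmed critical/major issues fail conservatively
    return "PASS"          # only false positives and minor gaps remain
```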
Step 2: Handle PASS
Increment correct_count. Reset error_count = 0.
If correct_count >= 3 and no fixes pending: consider launching 2 parallel verification sessions.
If correct_count >= 5: ACCEPT. Go to Delivery.
Step 3: Handle FAIL
Increment error_count. Reset correct_count = 0.
If error_count >= 10: REJECT. Go to Delivery.
Simple fixes: Apply edits directly. Complex fixes: Create a correction session and send Prompt 3 + bug report to the external LLM.
Then re-verify (back to Step 1).
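For the complex-fix path, the correction message might be assembled as below, reusing the hypothetical Finding records from Step 1d; CORRECTION_PROMPT, confirmed_findings, and target_content are assumed names. Only confirmed items are forwarded to the correction session.

```python
# Sketch: Prompt 3 plus the confirmed findings, sent to a separate correction session
correction_message = "\n\n".join([
    CORRECTION_PROMPT,   # Prompt 3
    "### Bug Report (confirmed items only) ###",
    "\n".join(f"- [{f.severity}] {f.location}: {f.description}" for f in confirmed_findings),
    "### Current Proof ###\n" + target_content,
])
```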
Delivery
After ACCEPT or REJECT:
Compile
pdflatex -interaction=nonstopmode $BASE.tex && pdflatex -interaction=nonstopmode $BASE.tex
Git Commit (only if fixes were applied)
Ask the user before committing:
"VR loop complete with fixes applied. Want me to commit?"
If yes:
git add $TEX_FILE && git commit -m "VR loop: [status] after [N] iterations"
Final Summary
Always print at the end:
===========================================
VR LOOP -- Final Summary
===========================================
File: $BASE.tex
Result: ACCEPTED / REJECTED / MAX ITERATIONS
Iterations: N
Passes: X (consecutive)
Fails: Y (consecutive)
Issues Fixed:
- [list]
False Positives Filtered:
- [list]
Remaining Issues (if rejected):
- [list]
===========================================
Configuration
| Parameter | Default | Description |
|---|---|---|
| CONSECUTIVE_PASS | 5 | Passes needed for acceptance |
| CONSECUTIVE_FAIL | 10 | Fails before rejection |
| MAX_ITERATIONS | 30 | Hard limit on total iterations |
| LLM_TIMEOUT | 600s | Per-verification timeout |
Important Notes
- All external LLM calls go through /external-llm, which provides session persistence and transcripts.
- Each verification uses a FRESH session. This prevents bias from previous rounds.
- The agent is the gatekeeper. Reviews EVERY finding before acting. False positives never reset the pass counter.
- Parse structured output, don't ask follow-up questions. The yes/no extraction prompt is unreliable.
- Fix simple issues directly. Only use the external LLM correction for genuinely complex problems.
Reference
Huang & Yang, "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline," arXiv:2507.15855, 2025.