---
name: vr-loop
description: IMO-style verification-and-refinement loop. Iteratively verifies a proof via the external LLM, extracts structured bug reports, applies fixes, and requires 5 consecutive passes for acceptance.
argument-hint: "<tex-file> — path to .tex file or section; defaults to most recently modified .tex"
compatibility: Requires LLM_API_KEY environment variable. Depends on external-llm skill.
license: Apache-2.0
metadata:
  version: "1.0"
  category: math
---
Verification-and-Refinement Loop
Adapted from "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline" (Huang & Yang, arXiv:2507.15855). Iteratively verifies a mathematical proof by cycling between a verifier (an external reasoning model via /external-llm) and a refiner (the agent), requiring 5 consecutive clean passes for acceptance.
Setup
Requires the external-llm skill to be installed. Set your API key:
export LLM_API_KEY="your-key-here"
Invocation
/vr-loop proof.tex
/vr-loop # defaults to most recently modified .tex in current directory
/vr-loop sec3 # verify a specific section (matched by \section title)
Key Difference from external-review-loop
| Feature | external-review-loop | vr-loop |
|---|---|---|
| Scope | Whole paper, section-by-section | Single proof/argument, tight loop |
| Acceptance | 1 pass | 5 consecutive passes |
| Verifier output | ISSUE / ALL CLEAR | Structured bug report (critical errors + justification gaps) |
| Rejection | Oscillation detection | 10 consecutive fails |
| Focus | Breadth (many sections) | Depth (one proof, iterated) |
Algorithm
0. Resolve target (file or section)
0.5. Interview: what to verify, what concerns exist
1. VERIFY: send proof to the external LLM verifier (fresh session each time)
2. EXTRACT VERDICT: parse structured Summary from response
3. CLAUDE REVIEWS: check each finding (CONFIRMED / FALSE POSITIVE)
correct_count = 0, error_count = 0, iteration = 0

LOOP (max 30 iterations):
  Agent determines effective verdict (PASS/FAIL) after reviewing findings:
    - PASS = 0 critical, 0 confirmed-major (false positives don't count)
    - FAIL = >=1 confirmed critical or confirmed major gap
  If effective verdict = PASS:
    correct_count += 1, error_count = 0
    If correct_count >= 5 -> ACCEPT (done!)
    Else -> re-verify (fresh verifier session)
  If effective verdict = FAIL:
    error_count += 1, correct_count = 0
    If error_count >= 10 -> REJECT (stop, report issues)
    Else:
      a. Agent fixes confirmed issues directly (skip the external LLM correction
         for simple fixes; use it only for complex issues)
      b. Re-verify (back to VERIFY step)
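The control flow above is simple enough to sketch in a few lines of Python. The helpers verify_once(), review_findings(), and apply_fixes() are hypothetical placeholders for the /external-llm call, the agent's false-positive review, and the agent's edits; treat this as a sketch of the loop logic, not a drop-in implementation.

```python
# Sketch of the VR control loop. verify_once(), review_findings(), and
# apply_fixes() are hypothetical placeholders, not part of this skill's API.
CONSECUTIVE_PASS = 5
CONSECUTIVE_FAIL = 10
MAX_ITERATIONS = 30

def vr_loop(proof_tex: str) -> str:
    correct_count, error_count = 0, 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        report = verify_once(proof_tex, f"verify-{iteration}")   # fresh verifier session
        confirmed = review_findings(proof_tex, report)           # agent filters false positives
        if not confirmed:                                        # effective PASS
            correct_count, error_count = correct_count + 1, 0
            if correct_count >= CONSECUTIVE_PASS:
                return "ACCEPT"
        else:                                                    # effective FAIL
            error_count, correct_count = error_count + 1, 0
            if error_count >= CONSECUTIVE_FAIL:
                return "REJECT"
            proof_tex = apply_fixes(proof_tex, confirmed)        # direct edits or Prompt 3
    return "MAX_ITERATIONS"
```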
Lessons Learned
These lessons are critical — they override the theoretical algorithm above.
1. Parse the Summary directly — do NOT use a yes/no follow-up
Problem: the external LLM's yes/no extraction is unreliable. It often contradicts its own Summary.
Solution: Parse the structured Summary output directly:
- Extract Final Verdict: PASS/FAIL
- Extract Critical errors: N
- Extract Justification gaps: N major, M minor
The verdict is determined by the counts, NOT by a follow-up question.
2. The agent is the gatekeeper — false positives don't count
Problem: roughly 30% of the external LLM's findings are false positives. Common hallucinations:
- "Result stated without reference" when the reference EXISTS
- Flagging standard algebraic identities as unproven
- Flagging results from prior work as unproven
Solution: After EVERY verification, the agent reviews each finding:
- Read the flagged location in the tex file
- Check the math
- Classify: CONFIRMED / FALSE POSITIVE / UNCLEAR
- All findings FALSE POSITIVE -> effective PASS (even if raw FAIL)
3. Fix simple issues directly — skip the external LLM correction
Most fixes are 2-3 line edits. Only use the external LLM correction for genuinely complex mathematical issues.
4. Add "Do NOT flag results from prior work" to the verifier prompt
5. Run parallel verifications in the pass-accumulation phase
Once you have 2+ consecutive passes, launch 2 verification sessions in parallel.
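A minimal sketch of how the two sessions might be launched concurrently, reusing the hypothetical verify_once() wrapper from the loop sketch above; each result is still reviewed by the agent before it counts toward the pass streak.

```python
# Sketch: launch two independent verifier sessions concurrently during the
# pass-accumulation phase. verify_once() is a hypothetical wrapper around
# the /external-llm call and is not part of this skill's API.
from concurrent.futures import ThreadPoolExecutor

def parallel_verify(proof_tex: str, iteration: int) -> list[str]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(verify_once, proof_tex, f"verify-{iteration}-{k}")
                   for k in (1, 2)]
        return [f.result() for f in futures]
```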
Prompts
All external LLM interactions go through the /external-llm skill.
Prompt 1: Verifier System Prompt
You are an expert mathematician and meticulous referee. Rigorously verify
the proof below. A proof is correct ONLY if every step is justified.
### Instructions ###
**1. Core Task**
* Find and report ALL issues. Act as a **verifier**, NOT a solver.
* Perform a **step-by-step** check of the entire proof.
**2. Issue Categories**
* **Critical Error:** A logical fallacy or mathematical mistake that
**breaks the logical chain** of the proof.
* **Justification Gap:**
- **Major gap:** A nontrivial step is asserted without proof or reference.
- **Minor gap:** A step most experts would accept but is not fully rigorous.
**3. Downstream Checking**
Even if a step has a gap, **assume its conclusion correct** for checking
downstream steps. Find ALL issues, not just the first.
**4. What NOT to Flag**
- Style issues, notation preferences, or formatting
- Well-known results (Sard, IFT, Borsuk-Ulam, etc.)
- Standard algebraic manipulations verifiable in 1-2 lines
- Results cited from prior work or with explicit references
**5. What TO Flag**
- "By genericity" without specifying bad locus and codimension
- Dimensional or rank claims stated without computation
- Topological arguments with unverified hypotheses
**6. Output Format**
Your response MUST contain these sections in order:
**a. Summary**
- **Final Verdict:** PASS or FAIL
- **Critical errors:** [count]
- **Justification gaps:** [count major] major, [count minor] minor
- **Findings list:** One-line summary of each issue with location
**b. Detailed Verification Log**
Step-by-step analysis. For each key step:
- Quote the claim or equation
- Check the reasoning
- Mark verified or issue found
IMPORTANT: Your Final Verdict MUST be consistent with your counts.
If critical == 0 and major gaps == 0, the verdict MUST be PASS.
Prompt 2: Verification Reminder
Appended after the proof content:
### Verification Task Reminder ###
Generate the **summary** (with verdict, counts, and findings list) and the
**step-by-step detailed verification log** for the proof above.
IMPORTANT: Ensure your Final Verdict is consistent with your error counts.
Prompt 3: Correction Prompt (for complex issues only)
Below is the bug report from the verifier. If you agree with an item,
improve the proof so that it is complete and rigorous. If you disagree,
add detailed explanations to avoid such misunderstanding. Address every item.
Step-by-Step Instructions
Step 0: Resolve Target
If the argument is a file path, use it. Otherwise, find the most recently modified .tex in the current directory:
ls -t *.tex | head -1
Read the target content. Store as $TEX_FILE, $BASE, and $TARGET_CONTENT.
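A sketch of the target-resolution logic, assuming the argument is either an existing .tex path or a string matched against \section{...} titles; the section-matching regex below is an assumption about how that lookup might be done, not a fixed rule of this skill.

```python
# Sketch of target resolution: explicit .tex path, section-title match,
# or the most recently modified .tex in the current directory.
import glob, os, re, sys

def resolve_target(arg=None):
    if arg and arg.endswith(".tex") and os.path.exists(arg):
        tex_file = arg
    else:
        candidates = sorted(glob.glob("*.tex"), key=os.path.getmtime, reverse=True)
        if not candidates:
            sys.exit("No .tex file found in the current directory")
        tex_file = candidates[0]
    with open(tex_file) as fh:
        content = fh.read()
    if arg and not arg.endswith(".tex"):
        # Assumed matching rule: the first \section whose title contains the argument
        pattern = rf"(\\section\{{[^}}]*{re.escape(arg)}[^}}]*\}}.*?)(?=\\section\{{|\\end\{{document\}}|\Z)"
        match = re.search(pattern, content, re.DOTALL | re.IGNORECASE)
        if match:
            content = match.group(1)
    return tex_file, content
```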
Step 0.5: Interview
Ask the user:
- What is the proof trying to show? (one sentence — context for verifier)
- What are you least confident about? (specific steps, estimates, edge cases)
Step 1: Verify
1a. Create fresh verification session
/external-llm /new vr-{BASE}-verify-{iteration}
1b. Send verifier prompt + proof
Compose a single message: Verifier System Prompt + proof context/concerns + proof content + Verification Reminder.
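A sketch of that composition, where VERIFIER_SYSTEM_PROMPT, user_concerns, target_content, and VERIFICATION_REMINDER are assumed names for Prompt 1, the interview answers, the proof text, and Prompt 2.

```python
# Assumed variable names; the ordering follows Prompt 1 -> context -> proof -> Prompt 2
message = "\n\n".join([
    VERIFIER_SYSTEM_PROMPT,                                 # Prompt 1
    "### Context and Author Concerns ###\n" + user_concerns,
    "### Proof ###\n" + target_content,
    VERIFICATION_REMINDER,                                   # Prompt 2
])
```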
1c. Parse verdict from structured Summary
Do NOT send a follow-up yes/no question. Parse the Summary directly:
import re

text = llm_response  # full reply from the verifier session
# Extract from the [OUTPUT] section if present
output_idx = text.find("[OUTPUT]")
if output_idx >= 0:
    text = text[output_idx:]
# Parse counts; if either regex fails to match, treat the run as inconclusive
crit_match = re.search(r'Critical errors?:\s*(\d+)', text)
major_match = re.search(r'(\d+)\s*major', text)
if crit_match and major_match:
    critical = int(crit_match.group(1))
    major = int(major_match.group(1))
    raw_verdict = "PASS" if (critical == 0 and major == 0) else "FAIL"
else:
    raw_verdict = "INCONCLUSIVE"  # re-run the verification rather than guessing
1d. Agent reviews each finding
For EVERY iteration (pass or fail), review each finding:
- Read the flagged location in the tex file
- Check the math
- Classify: CONFIRMED / FALSE POSITIVE / UNCLEAR
Determine effective verdict:
- All findings FALSE POSITIVE -> effective PASS
- Any finding CONFIRMED critical or major -> effective FAIL
- UNCLEAR -> treat as FAIL (conservative)
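One way this classification could be encoded, shown as a sketch; the Finding record and its field names are assumptions for illustration, not part of the verifier's actual output format.

```python
# Hypothetical finding record; severity comes from the verifier, classification
# from the agent's own review of the flagged location.
from dataclasses import dataclass

@dataclass
class Finding:
    location: str        # e.g. "Lemma 3.2, eq. (14)"
    description: str     # one-line summary from the verifier's findings list
    severity: str        # "critical", "major", or "minor"
    classification: str  # "CONFIRMED", "FALSE_POSITIVE", or "UNCLEAR"

def effective_verdict(findings: list[Finding]) -> str:
    for f in findings:
        if f.severity in ("critical", "major") and f.classification in ("CONFIRMED", "UNCLEAR"):
            return "FAIL"  # unresolved or confirmed critical/major issues fail conservatively
    return "PASS"          # only false positives and minor gaps remain
```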
Step 2: Handle PASS
Increment correct_count. Reset error_count = 0.
If correct_count >= 3 and no fixes pending: consider launching 2 parallel verification sessions.
If correct_count >= 5: ACCEPT. Go to Delivery.
Step 3: Handle FAIL
Increment error_count. Reset correct_count = 0.
If error_count >= 10: REJECT. Go to Delivery.
Simple fixes: Apply edits directly. Complex fixes: Create a correction session and send Prompt 3 + bug report to the external LLM.
Then re-verify (back to Step 1).
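For the complex-fix path, the correction message might be assembled as below, reusing the hypothetical Finding records from Step 1d; CORRECTION_PROMPT, confirmed_findings, and target_content are assumed names. Only confirmed items are forwarded to the correction session.

```python
# Sketch: Prompt 3 plus the confirmed findings, sent to a separate correction session
correction_message = "\n\n".join([
    CORRECTION_PROMPT,   # Prompt 3
    "### Bug Report (confirmed items only) ###",
    "\n".join(f"- [{f.severity}] {f.location}: {f.description}" for f in confirmed_findings),
    "### Current Proof ###\n" + target_content,
])
```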
Delivery
After ACCEPT or REJECT:
Compile
pdflatex -interaction=nonstopmode $BASE.tex && pdflatex -interaction=nonstopmode $BASE.tex
Git Commit (only if fixes were applied)
Ask the user before committing:
"VR loop complete with fixes applied. Want me to commit?"
If yes:
git add $TEX_FILE && git commit -m "VR loop: [status] after [N] iterations"
Final Summary
Always print at the end:
===========================================
VR LOOP -- Final Summary
===========================================
File: $BASE.tex
Result: ACCEPTED / REJECTED / MAX ITERATIONS
Iterations: N
Passes: X (consecutive)
Fails: Y (consecutive)
Issues Fixed:
- [list]
False Positives Filtered:
- [list]
Remaining Issues (if rejected):
- [list]
===========================================
Configuration
| Parameter | Default | Description |
|---|---|---|
| CONSECUTIVE_PASS | 5 | Passes needed for acceptance |
| CONSECUTIVE_FAIL | 10 | Fails before rejection |
| MAX_ITERATIONS | 30 | Hard limit on total iterations |
| LLM_TIMEOUT | 600s | Per-verification timeout |
Important Notes
- All external LLM calls go through /external-llm, which provides session persistence and transcripts.
- Each verification uses a FRESH session. This prevents bias from previous rounds.
- The agent is the gatekeeper. Reviews EVERY finding before acting. False positives never reset the pass counter.
- Parse structured output, don't ask follow-up questions. The yes/no extraction prompt is unreliable.
- Fix simple issues directly. Only use the external LLM correction for genuinely complex problems.
Reference
Huang & Yang, "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline," arXiv:2507.15855, 2025.