name: agent-ops-recovery
description: "Handle failures and errors during workflow. Use when build breaks, tests fail unexpectedly, or agent gets stuck. Semi-automatic recovery with user confirmation for destructive actions."
category: git
invokes: [agent-ops-state, agent-ops-git, agent-ops-tasks, agent-ops-debugging]
invoked_by: []
state_files:
  read: [constitution.md, baseline.md, focus.md, issues/.md]
  write: [focus.md, issues/.md, memory.md]
Error Recovery workflow
Trigger conditions
Use this skill when:
- Build/lint fails unexpectedly after agent changes
- Tests fail that were passing in baseline
- Agent encounters ambiguity it cannot resolve
- Implementation is stuck or going in circles
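The second trigger, tests that passed in the baseline but fail now, can be checked mechanically. The following is a minimal sketch, assuming test results are available as name-to-pass/fail mappings; the skill does not specify how baseline.md stores results, so the `find_regressions` helper and the dictionary shape are hypothetical.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return tests that passed in the baseline but fail (or are missing) now."""
    return sorted(
        name
        for name, passed in baseline.items()
        # Tests that already failed in the baseline are not regressions.
        if passed and not current.get(name, False)
    )


# Hypothetical results; real data would come from the project's test runner.
baseline = {"test_login": True, "test_export": True, "test_flaky": False}
current = {"test_login": True, "test_export": False, "test_flaky": False}
print(find_regressions(baseline, current))  # ['test_export']
```

A non-empty result is a reasonable signal to enter the recovery procedure; an empty one suggests the failure predates the agent's changes.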
Recovery procedure
Step 1: Diagnose (invoke debugging)
For non-trivial failures, invoke agent-ops-debugging:
- Apply systematic debugging process:
  - Reproduce the issue consistently
  - Define expected vs actual behavior
  - Form hypothesis about root cause
- Use debugging output to inform recovery decision
- If root cause unclear after initial analysis, continue debugging before recovery
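One way to make the hand-off from debugging to recovery explicit is to record the diagnosis in a small structure. This is only an illustrative sketch: the `Diagnosis` class and its field names are assumptions, not part of agent-ops-debugging.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diagnosis:
    """Hypothetical record of the Step 1 debugging output."""

    reproduced: bool                  # can the failure be triggered consistently?
    expected: str                     # expected behavior
    actual: str                       # observed behavior
    hypothesis: Optional[str] = None  # suspected root cause, if any

    def ready_for_recovery(self) -> bool:
        # Root cause still unclear: keep debugging before choosing an option.
        return self.reproduced and bool(self.hypothesis)


d = Diagnosis(reproduced=True, expected="build exits 0", actual="build exits 1")
print(d.ready_for_recovery())  # False
d.hypothesis = "lockfile out of sync with manifest"
print(d.ready_for_recovery())  # True
```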
Step 2: Assess rollback options
- Option A: Fix forward — issue is minor, can be resolved quickly
- Option B: Partial rollback — revert specific file(s) to last good state
- Option C: Full rollback — revert all agent changes since checkpoint
- Option D: Escalate — document the issue, mark task blocked, ask user
Step 3: Propose action
Present options to user with:
- What will be reverted/changed
- Risk assessment
- Recommendation
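The options from Step 2 and the three items presented in Step 3 could be modelled roughly as follows. The `Option` and `Proposal` names and the output layout are assumptions made for illustration; the skill does not prescribe a format.

```python
from dataclasses import dataclass
from enum import Enum


class Option(Enum):
    FIX_FORWARD = "A: Fix forward"
    PARTIAL_ROLLBACK = "B: Partial rollback"
    FULL_ROLLBACK = "C: Full rollback"
    ESCALATE = "D: Escalate"


@dataclass
class Proposal:
    option: Option
    changes: str               # what will be reverted/changed
    risk: str                  # risk assessment
    recommended: bool = False  # marks the agent's recommendation

    def render(self) -> str:
        marker = " (recommended)" if self.recommended else ""
        return (
            f"Option {self.option.value}{marker}\n"
            f"  Changes: {self.changes}\n"
            f"  Risk: {self.risk}"
        )


proposal = Proposal(
    option=Option.PARTIAL_ROLLBACK,
    changes="revert src/app.py to last good state",
    risk="low; other agent changes are kept",
    recommended=True,
)
print(proposal.render())
```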
Step 4: Execute (with confirmation)
- For non-destructive actions (fix forward): proceed
- For destructive actions (rollback): ask user first
- Update .agent/focus.md with recovery action taken
Destructive actions (require confirmation)
- git reset
- git checkout -- <file> (discard changes)
- git revert
- Deleting files
- Overwriting files with previous versions
Non-destructive actions (can proceed)
- git stash
- Reading files
- Running diagnostics
- Updating focus/tasks with findings
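The confirmation rule from Step 4 can be sketched as a small gate in front of command execution. This is an assumption-heavy illustration: it only looks at git commands, it treats anything it does not recognise as needing confirmation, and it additionally treats `git stash drop` and `git stash clear` as destructive, which the lists above do not mention. The function names and the `confirm`/`execute` callbacks are hypothetical.

```python
import shlex
from typing import Callable

# Git subcommand the skill lists as safe to run without asking.
NON_DESTRUCTIVE_GIT = {"stash"}
# Assumption: these stash subcommands discard work, so they are not safe.
DESTRUCTIVE_STASH = {"drop", "clear"}


def requires_confirmation(command: str) -> bool:
    """Return True unless the command is a git command known to be safe."""
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "git":
        return True  # unrecognised command: be conservative and ask
    if argv[1] not in NON_DESTRUCTIVE_GIT:
        return True  # e.g. reset, checkout -- <file>, revert
    return len(argv) > 2 and argv[2] in DESTRUCTIVE_STASH


def run_with_gate(
    command: str,
    confirm: Callable[[str], bool],
    execute: Callable[[str], None],
) -> str:
    """Execute the command, asking the user first when it is destructive."""
    if requires_confirmation(command) and not confirm(command):
        return "skipped"
    execute(command)
    return "executed"


# Usage with stand-in callbacks; a real agent would prompt the user and
# run the command instead.
ran = []
print(run_with_gate("git stash", lambda c: False, ran.append))         # executed
print(run_with_gate("git reset --hard", lambda c: False, ran.append))  # skipped
```

Defaulting to "ask" for unknown commands keeps the gate conservative; reading files and running diagnostics would need their own allowlist entries in a fuller version.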
Post-recovery
- Update .agent/focus.md with what happened
- Invoke agent-ops-tasks to create issue for root cause investigation
- Update .agent/memory.md with "pitfall to avoid" if applicable
- Re-run baseline comparison before continuing
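Recording what happened could look roughly like the sketch below. The layout of focus.md is not specified by this skill, so the entry format, the heading, and the `append_recovery_note` helper are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_recovery_note(focus_path: Path, what_happened: str, action: str) -> str:
    """Append a recovery entry to the focus file and return the text written."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Hypothetical entry layout; adapt to the project's actual focus.md format.
    entry = (
        f"\n## Recovery ({stamp})\n"
        f"- What happened: {what_happened}\n"
        f"- Recovery action: {action}\n"
    )
    focus_path.parent.mkdir(parents=True, exist_ok=True)
    # Append rather than overwrite, so earlier focus notes are preserved.
    with focus_path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

For example, `append_recovery_note(Path(".agent/focus.md"), "build broke after refactor", "partial rollback of src/app.py")` would add one entry without touching existing content.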
Issue Discovery After Recovery
After recovery, invoke agent-ops-tasks discovery procedure:
- Create issue for the incident:

      📋 Recovery completed. Create issue to track root cause?

      Suggested:
      - [BUG] Investigate: {description of what failed}
        - What happened: {failure description}
        - Recovery action: {what was done}
        - Root cause: TBD

      Create this issue? [Y]es / [N]o

- If pattern detected, create prevention issue:

      This failure pattern has occurred before. Create improvement issue?
      - [CHORE] Add validation to prevent {failure type}
      - [TEST] Add regression test for {scenario}

      Create these? [A]ll / [S]elect / [N]one

- After creating issues:

      Created {N} issues for tracking. What's next?
      1. Investigate root cause now (BUG-0024@abc123)
      2. Continue with original work (defer investigation)
      3. Review recovery actions
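The suggested incident issue can be filled in from the recovery details with plain string formatting. This is a minimal sketch; the placeholder names and the `render_incident_issue` helper are hypothetical, and the actual issue format is owned by agent-ops-tasks.

```python
# Template mirroring the suggested incident issue; placeholder names are
# illustrative stand-ins for the braces used in the prompt text.
INCIDENT_TEMPLATE = (
    "[BUG] Investigate: {description}\n"
    "- What happened: {what_happened}\n"
    "- Recovery action: {recovery_action}\n"
    "- Root cause: TBD"
)


def render_incident_issue(description: str, what_happened: str, recovery_action: str) -> str:
    """Fill the incident template; root cause stays TBD until investigated."""
    return INCIDENT_TEMPLATE.format(
        description=description,
        what_happened=what_happened,
        recovery_action=recovery_action,
    )


print(
    render_incident_issue(
        description="lint failure after refactor",
        what_happened="lint step failed on src/app.py",
        recovery_action="partial rollback of src/app.py",
    )
)
```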