---
name: rlm
description: |
  RLM-inspired externalize-and-recurse for data-scale tasks. Use when the task
  involves many files/items, information-dense aggregation, pairwise comparisons,
  or output too large for one response. Especially when direct in-context handling
  would overflow, degrade quality, or miss coverage. Do NOT use for small tasks,
  single-file edits, or tasks solvable by one grep.
user-invocable: true
allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Agent
---
# RLM
Adapted from Recursive Language Models (Zhang, Kraska, Khattab 2025). This is not a faithful port: coding agents do not expose a persistent REPL. Files are symbolic state, subagents are expensive recursive calls, and you are the loop controller.
The skill teaches three behaviors that differ from your defaults:
- Externalize state — keep large data in files; work from metadata (paths, counts, snippets) after probing; don't copy bulk content into conversational context.
- Programmatic delegation — generate sub-problems from manifests/code, not verbal ad-hoc descriptions. This enables batched fan-out over many items.
- File-built output — assemble large deliverables in files via structured intermediate artifacts, then summarize to chat.
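The first behavior can be sketched concretely. This is a hedged illustration, not part of the skill itself: the paths and file contents below are invented, and the point is only that the inventory records metadata (path, line count, first line) rather than bulk content.

```shell
# Illustrative sketch of externalizing state: probe with metadata tools and
# build an inventory file instead of pulling content into context.
# All paths and file contents are assumptions made up for this example.
rm -rf /tmp/rlm-demo && mkdir -p /tmp/rlm-demo/src
printf 'alpha\nbeta\n' > /tmp/rlm-demo/src/a.txt
printf 'gamma\n'       > /tmp/rlm-demo/src/b.txt

# Record metadata only (path, line count, first line) -- enough to plan from.
for f in /tmp/rlm-demo/src/*.txt; do
  printf '%s\t%s\t%s\n' "$f" "$(wc -l < "$f")" "$(head -n 1 "$f")"
done > /tmp/rlm-demo/inventory.tsv

cat /tmp/rlm-demo/inventory.tsv
```

Planning then proceeds from `inventory.tsv`; the source files are only opened selectively, if at all.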
## Agent Mapping (approximate)

These are honest adaptations, not equivalences.

| RLM Concept | Agent Approximation |
|---|---|
| Prompt as REPL variable | Source files on disk — refer by path, inspect with tools, never copy bulk content into chat |
| Metadata(state) at init | `ls`, `wc`, `file`, `head` to enumerate before reading content; plan from metadata alone |
| `sub_RLM()` in code loops | Asynchronous subagent dispatch over programmatically generated batches from a manifest — not ad-hoc verbal delegation |
| REPL persistent state | Files and scratch artifacts (manifests, JSONL, intermediate markdown); each Bash call is stateless, so files are the only durable state |
| `hist` (code + metadata) | Bounded control transcript: keep command outputs compact; large results go to files, not conversational context |
| Final variable | Canonical final artifact on disk; return file path + brief chat summary; don't regenerate from prose what's already stored |
| Persistent loop | Orchestrator iteration: collect results, assess, compose new sub-problems, dispatch again |
| Depth-1 recursion | A recommended control strategy: keep recursion shallow and prefer leaf-task workers unless nested delegation is clearly necessary |
## Decision Test
Use when:
- Many files/items where coverage matters and context would overflow
- Per-item semantic processing (classify, label, transform each item)
- Pairwise or cross-reference reasoning
- Output too large for one response
- Prior direct attempt became muddled or lost coverage
Don't use when:
- Task fits comfortably in one context pass
- One grep/search gives the answer
- Single-file edit or straightforward Q&A
- Subagent coordination overhead would dominate the actual work
- The task is small — recursive approaches can underperform direct calls on small inputs
## Escalation Ladder
Work through these levels. Exit as early as possible — most tasks should stop at level 2 or 3. For sparse lookups (finding one thing), level 1 alone is often sufficient.
1. **Probe** — enumerate inputs with metadata tools (`ls`, `wc`, `head`, `file`). Plan decomposition from metadata alone, before reading any content. If the task is a simple lookup, one targeted grep may be all you need — do it and stop.
2. **Filter** — use grep/regex/code to narrow scope. Leverage domain knowledge in search queries (not just literal prompt terms — use priors about what's likely relevant).
3. **Process locally** — many tasks stop here. Code-only filtering without subagents outperformed full recursion on 2 of 5 benchmarks in the paper. Use subagents only for semantically hard work that code can't handle.
4. **Batch for subagents** — group 3-10 items per subagent depending on item size. Never one-per-item unless items are individually large and semantically complex. Launch asynchronously and continue local work while batches run.
5. **Aggregate from files** — always run a synthesis pass over collected sub-results. Never just concatenate. Write the final deliverable to a file, then summarize to chat.
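A minimal walkthrough of the first three levels, on a tiny corpus invented for illustration (the paths, file contents, and the `TODO` keyword are all assumptions):

```shell
# Hypothetical corpus: three small files, two of which contain a TODO.
rm -rf /tmp/rlm-ladder && mkdir -p /tmp/rlm-ladder
printf 'TODO: fix parser\nnote\n' > /tmp/rlm-ladder/one.txt
printf 'all good here\n'          > /tmp/rlm-ladder/two.txt
printf 'TODO: add tests\n'        > /tmp/rlm-ladder/three.txt

# Level 1 -- probe: count inputs from metadata alone.
ls /tmp/rlm-ladder | wc -l                          # 3

# Level 2 -- filter: narrow scope with grep before reading any content.
grep -l 'TODO' /tmp/rlm-ladder/*.txt > /tmp/rlm-ladder/survivors.list

# Level 3 -- process locally: only the survivors get real attention.
wc -l < /tmp/rlm-ladder/survivors.list              # 2
```

Here the task ends at level 3: two survivors fit in one pass, so no subagents are dispatched.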
## Complexity Patterns
- Sparse lookup (O(1)): one grep, inspect survivors, answer. Do not escalate beyond level 1-2. The overhead of structured decomposition actively hurts on sparse tasks.
- Per-item processing (O(n)): enumerate items, batch, process each batch, reduce findings. Start simple — uniform chunking or keyword-based grouping, not elaborate decomposition.
- Pairwise reasoning (O(n^2)): generate candidate pairs programmatically, prune with cheap heuristics, recurse on survivors only.
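The O(n^2) pattern can be sketched in a few lines. This is a toy example under stated assumptions: the files are invented, and byte-size proximity stands in for whatever cheap heuristic fits the real task (requires bash for the arrays).

```shell
# Hypothetical inputs: two similar small files and one obviously different one.
rm -rf /tmp/rlm-pairs && mkdir -p /tmp/rlm-pairs
printf 'aaaa\n' > /tmp/rlm-pairs/a
printf 'bbbb\n' > /tmp/rlm-pairs/b
printf 'a much longer file with different content\n' > /tmp/rlm-pairs/c

# Generate candidate pairs programmatically, pruning with a cheap heuristic
# (assumed here: skip pairs whose byte sizes differ by more than 10).
files=(/tmp/rlm-pairs/*)
for ((i=0; i<${#files[@]}; i++)); do
  for ((j=i+1; j<${#files[@]}; j++)); do
    s1=$(wc -c < "${files[i]}"); s2=$(wc -c < "${files[j]}")
    d=$((s1 - s2)); d=${d#-}            # absolute size difference
    [ "$d" -le 10 ] && echo "${files[i]} ${files[j]}"
  done
done > /tmp/rlm-pairs/candidates.txt

wc -l < /tmp/rlm-pairs/candidates.txt   # 1 survivor of 3 possible pairs
```

Only the surviving pairs would be sent to subagents for the expensive semantic comparison.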
## File-Based State
Create a scratch directory for multi-phase work:
```
/tmp/rlm-<task>/
├── plan.md          # decomposition plan
├── inventory.json   # enumerated units with metadata
├── results/         # per-batch findings (batch-01.md, etc.)
└── final.md         # assembled output artifact
```
Keep formats consistent so aggregation is mechanical. Update plan.md after each major phase — long workflows can lose verbal state through summarization, compaction, or context drift.
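One way to bootstrap this layout and checkpoint after a phase (the task name `mytask` and the plan contents are placeholders):

```shell
# Create the scratch layout for a hypothetical task.
task=mytask
rm -rf "/tmp/rlm-$task" && mkdir -p "/tmp/rlm-$task/results"
printf '# Plan\n\n- [ ] probe\n- [ ] filter\n- [ ] batch\n' > "/tmp/rlm-$task/plan.md"
printf '[]\n' > "/tmp/rlm-$task/inventory.json"
: > "/tmp/rlm-$task/final.md"

# After a major phase completes, checkpoint to plan.md before moving on,
# so progress survives summarization or compaction of the conversation.
printf '\nPhase 1 done.\n' >> "/tmp/rlm-$task/plan.md"
```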
## Anti-Patterns
| Instead of... | Do this |
|---|---|
| Opening many files into context | Build a manifest, read selectively |
| One subagent per file/item | Batch 3-10 items per call |
| Summarizing before filtering | Filter first with code/regex, summarize survivors |
| Verbal delegation ("analyze these 5 areas") | Enumerate from manifest, dispatch programmatically |
| Keeping intermediate state in prose | Write to files, read back compact summaries |
| Copying bulk data into chat | Keep data in files; work from paths + metadata |
| Generating long output directly in chat | Build in files, return path + summary |
| Rebuilding answer from prose when it's in state | Return the stored artifact directly |
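"Dispatch programmatically" means batches come from data, not prose. A hedged sketch, assuming a 10-item manifest (the item names are invented) and `split`'s default `aa`, `ab`, ... suffixes:

```shell
# Derive subagent batches from a manifest rather than describing them verbally.
rm -rf /tmp/rlm-batch && mkdir -p /tmp/rlm-batch
printf 'item-%02d\n' $(seq 1 10) > /tmp/rlm-batch/manifest.txt

# 4 items per batch -> batch-aa, batch-ab, batch-ac
split -l 4 /tmp/rlm-batch/manifest.txt /tmp/rlm-batch/batch-

for b in /tmp/rlm-batch/batch-*; do
  # Each iteration would become one bounded subagent task listing its items.
  echo "dispatch $(basename "$b"): $(tr '\n' ' ' < "$b")"
done
```

The batch files double as durable state: results can be written back as `results/batch-aa.md` and aggregated mechanically.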
## Gotchas
- Keep subagent tasks bounded — for multi-level work, prefer the main agent to collect results, compose the next wave of sub-problems, and dispatch again rather than nesting delegation.
- Long workflows lose state — verbal context can be lost through summarization, compaction, or context drift. Checkpoint progress to files after each phase.
- Over-recursion is the most common failure mode — models have been observed making thousands of sub-calls for basic tasks. Batch aggressively.
- Plan-as-answer — models sometimes return reasoning/plan instead of the computed result. Keep intermediate work clearly separate from final deliverables.
- Discard-and-regenerate — models sometimes build the correct answer in state, then discard it and re-derive incorrectly from prose. Once the answer exists in a file, return that file.
- Over-verification — continuing to verify after having the answer wastes cost. Set explicit stop conditions.
- Decomposition is simpler than you think — the paper found only uniform chunking and keyword search in practice, not elaborate strategies. Start simple.
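"Start simple" in practice: keyword-based grouping, the other strategy the paper observed, is one grep per bucket. The log lines and the `login` keyword below are invented for illustration:

```shell
# Hypothetical input: a small log to split into two sub-problems by keyword.
rm -rf /tmp/rlm-kw && mkdir -p /tmp/rlm-kw
printf 'login failed\nok\ntimeout on login\ndisk full\n' > /tmp/rlm-kw/log.txt

# Bucket by a keyword match; each group becomes one bounded sub-problem.
grep -n 'login' /tmp/rlm-kw/log.txt  > /tmp/rlm-kw/group-login.txt
grep -vn 'login' /tmp/rlm-kw/log.txt > /tmp/rlm-kw/group-rest.txt

wc -l < /tmp/rlm-kw/group-login.txt   # 2
```

Line numbers (`-n`) are kept so findings can be traced back to the source without re-reading it.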
## Stop Conditions
- Remaining work fits in one direct pass — stop recursing
- Recursion isn't shrinking the search space — redesign decomposition
- Fanout is growing faster than information gain — batch more aggressively
- Sub-calls are producing repetitive low-yield results — stop and synthesize
- Task was mis-classified as data-scale — fall back to direct approach