---
name: insight-pilot
description: Literature research automation - search papers, code, and blogs, deduplicate, download PDFs, analyze and generate research reports. Supports incremental updates.
version: 0.3.0
---
# Insight-Pilot Skill
A workflow automation skill for literature research. Searches papers, GitHub repos/code/issues, PubMed, Dev.to, and blogs, deduplicates results, downloads PDFs, analyzes content, and generates incremental research reports.
## Setup

Run the bootstrap script (it checks the environment, then creates the venv and installs packages only if they are missing):

```bash
bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh
```

The script detects whether `~/.insight-pilot-venv` exists and whether packages are installed, installing only when necessary. See `--help` for advanced options.
## Usage

Before running commands, activate the environment:

```bash
source ~/.insight-pilot-venv/bin/activate
```

Then use the CLI:

```bash
insight-pilot <command> [options]
```
### CLI Commands

| Command | Purpose | Required Args | Key Optional Args |
|---|---|---|---|
| `init` | Create research project | `--topic`, `--output` | `--keywords` |
| `search` | Search, merge and dedup | `--project`, `--source`, `--query` | `--limit`, `--since`, `--until` |
| `download` | Download PDFs + convert to Markdown | `--project` | - |
| `analyze` | Analyze papers with LLM | `--project` | `--config`, `--force` |
| `index` | Generate index.md | `--project` | `--template` |
| `status` | Check project state | `--project` | - |
| `sources` | Manage blog/RSS sources | `--project` | `--add`, `--remove`, `--config` |
### JSON Output Mode

Add the `--json` flag for structured output (recommended for agents):

```bash
insight-pilot status --json --project ./research/myproject
```
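A sketch of how an agent might consume that output. The `topic` and `counts` field names in the example payload are assumptions for illustration; check the actual JSON your installed version emits.

```python
import json

def summarize_status(payload):
    """Turn a `status --json` payload into a one-line summary.

    Field names (`topic`, `counts`) are illustrative, not guaranteed
    by the CLI.
    """
    data = json.loads(payload)
    counts = data.get("counts", {})
    return (f"{data.get('topic', '?')}: "
            f"{counts.get('analyzed', 0)} analyzed / {counts.get('total', 0)} total")

# Hypothetical payload shape:
example = '{"topic": "WebAgent Research", "counts": {"total": 6, "analyzed": 5}}'
print(summarize_status(example))
```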
### Blog/RSS Sources Configuration

Create `sources.yaml` in your project root:

```yaml
blogs:
  - name: "Cursor Blog"
    type: "ghost"
    url: "https://cursor.sh/blog"
    api_key: "auto"
  - name: "Example WP Blog"
    type: "wordpress"
    url: "https://blog.example.com"
  - name: "OpenAI Blog"
    type: "rss"
    url: "https://openai.com/blog/rss.xml"
    category: "ai"
```
Manage sources via:

```bash
insight-pilot sources --project ./research/webagent
```
Environment variables:

- `GITHUB_TOKEN` (higher GitHub API rate limit)
- `PUBMED_EMAIL` (required by NCBI)
- `OPENALEX_MAILTO` (OpenAlex polite usage)
- `INSIGHT_PILOT_SOURCES` (override sources.yaml path)
### New Sources Examples

```bash
# GitHub repositories + code + issues
insight-pilot search --project $PROJECT --source github --query "agent framework" --limit 30

# PubMed (requires PUBMED_EMAIL)
insight-pilot search --project $PROJECT --source pubmed --query "clinical agents" --limit 20

# Dev.to articles
insight-pilot search --project $PROJECT --source devto --query "ai agents" --limit 20

# Blogs (Ghost/WordPress/RSS from sources.yaml)
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20
```
## Workflow (Agent + CLI Collaboration)

This is the complete workflow for Agent + CLI collaboration.

**Execution Principles:**
- Run CLI commands in sequence as prescribed; no step-by-step confirmation is needed.
- Agent intervention is ONLY required in Phase 2 for manual review (checking `items.json` and setting `status` / `exclude_reason`).
### Phase 1: Search and Initial Filtering

Execute the following commands directly, no confirmation needed:

```bash
PROJECT=./research/webagent

# Step 1: Initialize project
insight-pilot init --topic "WebAgent Research" --keywords "web agent,browser agent" --output $PROJECT

# Step 2: Search multiple sources (auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex github pubmed devto blog --query "web agent" --limit 50
```
### Phase 2: Agent Review (Manual Check)

After deduplication, the Agent needs to review the paper list and remove content unrelated to the research topic.

```bash
# Check current status
insight-pilot status --json --project $PROJECT
```

**Agent Actions:**
- Read `$PROJECT/.insight/items.json`
- Check `title` and `abstract` for each paper
- Mark unrelated papers: set `status` to `"excluded"` and add an `exclude_reason`
- Save the updated `items.json`
```json
{
  "id": "i0023",
  "title": "Unrelated Paper Title",
  "status": "excluded",
  "exclude_reason": "Not related to web agents, focuses on chemical agents"
}
```
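The review edit above can be scripted. A minimal sketch, assuming `items.json` holds a JSON array of Item objects with an `"id"` field (per the Item schema under Data Schemas):

```python
import json
from pathlib import Path

def exclude_items(items_path, exclusions):
    """Mark items as excluded in items.json.

    `exclusions` maps item id -> exclude_reason. Assumes items.json is a
    JSON array of Item objects with an "id" field (see Data Schemas).
    Returns the number of items changed.
    """
    path = Path(items_path)
    items = json.loads(path.read_text())
    changed = 0
    for item in items:
        reason = exclusions.get(item["id"])
        if reason is not None:
            item["status"] = "excluded"
            item["exclude_reason"] = reason
            changed += 1
    path.write_text(json.dumps(items, indent=2, ensure_ascii=False))
    return changed
```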
### Phase 3: Download PDFs

Execute directly, no confirmation needed:

```bash
# Step 3: Download PDFs (converts to Markdown automatically)
insight-pilot download --project $PROJECT
```
**Download Results:**
- Success: `download_status: "success"`, PDF saved to `papers/`
- Failed: `download_status: "failed"`, recorded in `$PROJECT/.insight/download_failed.json`
Failure list format:

```json
[
  {
    "id": "i0015",
    "title": "Paper Title",
    "url": "https://...",
    "error": "Connection timeout",
    "failed_at": "2026-01-17T10:30:00Z"
  }
]
```
Note: Advanced download (proxy/browser automation for failed items) is not yet implemented.
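Until advanced download lands, an agent can at least pick out the failures worth re-running. A rough sketch; the `error` strings are free-form, so the transient-error keywords below are guesses to tune against what actually appears in `download_failed.json`:

```python
import json
from pathlib import Path

def retry_candidates(project):
    """Return failed downloads that look transient (network-style errors).

    Heuristic sketch only: matches free-form error strings against a
    keyword list; adjust the keywords to the errors you actually see.
    """
    path = Path(project) / ".insight" / "download_failed.json"
    if not path.exists():
        return []
    failures = json.loads(path.read_text())
    transient = ("timeout", "connection", "reset", "503", "429")
    return [f for f in failures
            if any(k in f.get("error", "").lower() for k in transient)]
```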
### Phase 4: Analyze Papers

Precondition: Phase 3 (Download PDFs) must be completed first; the `download` command automatically converts PDFs to Markdown.

MUST try LLM analysis first. If an LLM is configured, run directly:

```bash
# Step 4: LLM analysis (prefers converted Markdown, falls back to PDF text extraction)
insight-pilot analyze --project $PROJECT
```

**Content Source Priority:**
- Markdown (from `download` auto-conversion via pymupdf4llm)
- PDF extraction (PyMuPDF)
**LLM Configuration:** Create `.codex/skills/insight-pilot/llm.yaml`:

```yaml
provider: openai   # openai / anthropic / ollama
model: gpt-4o-mini
api_key: sk-xxx    # or set env var OPENAI_API_KEY
```
**When LLM is not configured: Manual Analysis Required**

If no LLM is configured, the Agent needs to analyze manually:
- Read the PDF files in the `papers/` directory
- Extract key information from each paper
- Write analysis results to `$PROJECT/.insight/analysis/{id}.json`
**Analysis File Format** (`$PROJECT/.insight/analysis/{id}.json`):

```json
{
  "id": "i0001",
  "title": "Paper Title",
  "summary": "One sentence summary",
  "brief_analysis": "2-3 sentences brief analysis",
  "detailed_analysis": "300-500 words detailed analysis",
  "contributions": ["Contribution 1", "Contribution 2"],
  "methodology": "Methodology description",
  "key_findings": ["Finding 1", "Finding 2"],
  "limitations": ["Limitations"],
  "future_work": ["Future work 1"],
  "relevance_score": 8,
  "tags": ["webagent", "benchmark", "multimodal"],
  "analyzed_at": "2026-01-17T12:00:00Z"
}
```
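For the manual-analysis path, a small helper that writes a record in the format above. It fills in only `id` and `analyzed_at`; the caller supplies the analytical fields (`summary`, `contributions`, and so on):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_analysis(project, item_id, fields):
    """Write $PROJECT/.insight/analysis/{id}.json.

    `fields` is a dict of the analytical keys from the format above;
    this helper only adds "id" and a UTC "analyzed_at" timestamp.
    """
    out_dir = Path(project) / ".insight" / "analysis"
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {"id": item_id, **fields,
              "analyzed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}
    out_path = out_dir / f"{item_id}.json"
    out_path.write_text(json.dumps(record, indent=2, ensure_ascii=False))
    return out_path
```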
### Phase 5: Generate Incremental Report

```bash
# Step 5: Generate/update index
insight-pilot index --project $PROJECT
```

Reports are stored in `$PROJECT/index.md`, which shows only analyzed papers and links to detailed per-paper reports in `reports/{id}.md`.
Report Structure:

```markdown
# WebAgent Research

> **Generated**: 2026-01-18 10:30
> **Keywords**: web agent, browser agent
> **Analyzed**: 5 papers

---

## 📚 Analyzed Papers

### [Paper Title](reports/i0001.md)
**Authors**: Author A, Author B et al. | **Date**: 2026-01-15 | **Links**: arXiv/DOI | **Relevance**: 8/10

**Summary**: One sentence summary...

> 2-3 sentences brief analysis...

**Tags**: `webagent` `benchmark` `multimodal`

---

## ⚠️ Papers Not Available
_The following papers could not be downloaded. Only abstracts are shown._

### Paper Title
**Authors**: ... | **Date**: ... | **Links**: ...

> Abstract...

---

## 📊 Statistics

| Metric | Value |
|--------|-------|
| Papers Analyzed | 5 |
| Download Failed | 1 |
| Total Processed | 6 |
```
## Incremental Update Workflow

For daily/weekly updates:

```bash
# 1. Search new papers (use --since for date limit, auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex --query "web agent" --since 2026-01-17 --limit 20

# 2. [Agent] Review newly added papers

# 3. Download PDFs for new papers
insight-pilot download --project $PROJECT

# 4. [Agent] Analyze new papers, update reports

# 5. Regenerate index
insight-pilot index --project $PROJECT
```
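To pick the `--since` value automatically, an agent can read the newest `collected_at` timestamp from `items.json`. A sketch, assuming the JSON-array layout shown under Data Schemas:

```python
import json
from pathlib import Path

def latest_collected_date(project):
    """Return the newest collected_at date (YYYY-MM-DD) from items.json,
    suitable as a --since value for the next incremental search.

    Assumes items.json is a JSON array whose entries carry ISO-8601
    "collected_at" timestamps (see the Item schema); ISO strings sort
    lexicographically, so max() finds the most recent one.
    """
    path = Path(project) / ".insight" / "items.json"
    if not path.exists():
        return None
    stamps = [i["collected_at"] for i in json.loads(path.read_text())
              if i.get("collected_at")]
    return max(stamps)[:10] if stamps else None
```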
## Project Structure

```
research/myproject/
├── .insight/
│   ├── config.yaml            # Project configuration
│   ├── state.json             # Workflow state
│   ├── items.json             # Paper metadata (incl. status, exclude_reason)
│   ├── raw_arxiv.json         # Raw search results
│   ├── raw_openalex.json
│   ├── download_failed.json   # Failed-download list (for advanced download retry)
│   ├── analysis/              # Paper analysis results
│   │   ├── i0001.json
│   │   ├── i0002.json
│   │   └── ...
│   └── markdown/              # PDF conversion output (pymupdf4llm)
│       ├── i0001/
│       │   ├── i0001.md       # Converted Markdown
│       │   └── metadata.json
│       └── ...
├── papers/                    # Downloaded PDFs
├── reports/                   # Historical report archive
└── index.md                   # Current research report (incrementally updated)
```
## Data Schemas

### Item (Paper)

```json
{
  "id": "i0001",
  "type": "paper",
  "title": "Paper Title",
  "authors": ["Author One", "Author Two"],
  "date": "2026-01-15",
  "abstract": "...",
  "status": "active|excluded|pending",
  "exclude_reason": null,
  "identifiers": {
    "doi": "10.1234/example",
    "arxiv_id": "2601.12345",
    "openalex_id": "W1234567890"
  },
  "urls": {
    "abstract": "https://arxiv.org/abs/2601.12345",
    "pdf": "https://arxiv.org/pdf/2601.12345"
  },
  "download_status": "success|pending|failed|unavailable",
  "local_path": "./papers/i0001.pdf",
  "citation_count": 42,
  "source": ["arxiv", "openalex"],
  "collected_at": "2026-01-17T10:00:00Z"
}
```
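For intuition on how records from multiple sources (note the `source: ["arxiv", "openalex"]` above) can collapse into one Item, here is an identifier-based dedup key. This is an illustration only; the CLI's actual merge logic is internal and may differ:

```python
def dedup_key(item):
    """Derive a dedup key from an Item's identifiers, falling back to a
    normalized title.

    Sketch of the obvious approach: prefer strong identifiers (DOI,
    arXiv id, OpenAlex id) over fuzzy title matching.
    """
    ids = item.get("identifiers") or {}
    for field in ("doi", "arxiv_id", "openalex_id"):
        value = ids.get(field)
        if value:
            return f"{field}:{value.lower()}"
    # Normalize whitespace and case so minor formatting differences collapse
    return "title:" + " ".join(item.get("title", "").lower().split())
```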
## Error Codes

| Code | Meaning | Retryable |
|---|---|---|
| `PROJECT_NOT_FOUND` | Project directory doesn't exist | No |
| `NO_INPUT_FILES` | Required input files missing | No |
| `NO_ITEMS_FILE` | items.json not found | No |
| `INVALID_SOURCE` | Unknown data source | No |
| `NETWORK_ERROR` | API request failed | Yes |
| `RATE_LIMITED` | API rate limit hit | Yes |
| `DOWNLOAD_FAILED` | PDF download failed | Yes |
| `CONVERSION_FAILED` | PDF to Markdown conversion failed | Yes |
| `MISSING_DEPENDENCY` | Required package not installed | No |
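The Retryable column suggests a simple retry policy. A sketch; in practice the `step` callable would wrap a CLI invocation and extract the error code from its `--json` output (the exact payload field is not specified here):

```python
import time

# Codes the Error Codes table marks as retryable
RETRYABLE = {"NETWORK_ERROR", "RATE_LIMITED", "DOWNLOAD_FAILED", "CONVERSION_FAILED"}

def run_with_retries(step, max_attempts=3, base_delay=1.0):
    """Retry `step` while it fails with a retryable error code.

    `step` is any callable returning (ok: bool, code: str or None).
    Uses exponential backoff between attempts.
    """
    ok, code = step()
    attempt = 1
    while not ok and code in RETRYABLE and attempt < max_attempts:
        time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
        ok, code = step()
        attempt += 1
    return ok, code
```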
## Agent Guidelines

**Execution Principles:**
- First run: run the bootstrap script to auto-setup the environment
- CLI commands (`init`, `search`, `download`, `analyze`, `index`): run in sequence, no confirmation needed
- Agent intervention is ONLY needed during Phase 2 (Review) and Manual Analysis (if no LLM)

**Specific Guidelines:**
- Environment setup: run `bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh` first
- Use the `--json` flag: get structured output for parsing
- Execute CLI directly: do not ask for confirmation; follow the workflow sequence
- Review: modify `status` and `exclude_reason` in `items.json`
- LLM analysis first: use the `analyze` command if configured; otherwise manually create `analysis/{id}.json`
- Incremental updates: only process new papers; keep existing analysis results