---
name: insight-pilot
description: Literature research automation - search papers, code, and blogs, deduplicate, download PDFs, analyze and generate research reports. Supports incremental updates.
version: 0.3.0
---
# Insight-Pilot Skill
A workflow automation skill for literature research. Searches papers, GitHub repos/code/issues, PubMed, Dev.to, and blogs, deduplicates results, downloads PDFs, analyzes content, and generates incremental research reports.
## Setup

Run the bootstrap script (it checks the environment, then creates the venv and installs packages only if they are missing):

```bash
bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh
```

The script detects whether `~/.insight-pilot-venv` exists and whether packages are installed, installing only when necessary. See `--help` for advanced options.
## Usage

Before running commands, activate the environment:

```bash
source ~/.insight-pilot-venv/bin/activate
```

Then use the CLI:

```bash
insight-pilot <command> [options]
```
### CLI Commands

| Command | Purpose | Required Args | Key Optional Args |
|---|---|---|---|
| `init` | Create research project | `--topic`, `--output` | `--keywords` |
| `search` | Search, merge and dedup | `--project`, `--source`, `--query` | `--limit`, `--since`, `--until` |
| `download` | Download PDFs + convert to Markdown | `--project` | - |
| `analyze` | Analyze papers with LLM | `--project` | `--config`, `--force` |
| `index` | Generate index.md | `--project` | `--template` |
| `status` | Check project state | `--project` | - |
| `sources` | Manage blog/RSS sources | `--project` | `--add`, `--remove`, `--config` |
### JSON Output Mode

Add the `--json` flag for structured output (recommended for agents):

```bash
insight-pilot status --json --project ./research/myproject
```
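A sketch of how an agent might consume that output. The `topic` and `counts` field names in the example payload are assumptions for illustration; check the actual JSON your installed version emits.

```python
import json

def summarize_status(payload):
    """Turn a `status --json` payload into a one-line summary.

    Field names (`topic`, `counts`) are illustrative, not guaranteed
    by the CLI.
    """
    data = json.loads(payload)
    counts = data.get("counts", {})
    return (f"{data.get('topic', '?')}: "
            f"{counts.get('analyzed', 0)} analyzed / {counts.get('total', 0)} total")

# Hypothetical payload shape:
example = '{"topic": "WebAgent Research", "counts": {"total": 6, "analyzed": 5}}'
print(summarize_status(example))
```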
### Blog/RSS Sources Configuration

Create `sources.yaml` in your project root:

```yaml
blogs:
  - name: "Cursor Blog"
    type: "ghost"
    url: "https://cursor.sh/blog"
    api_key: "auto"
  - name: "Example WP Blog"
    type: "wordpress"
    url: "https://blog.example.com"
  - name: "OpenAI Blog"
    type: "rss"
    url: "https://openai.com/blog/rss.xml"
    category: "ai"
```
Manage sources via:

```bash
insight-pilot sources --project ./research/webagent
```
Environment variables:

- `GITHUB_TOKEN` (higher GitHub API rate limit)
- `PUBMED_EMAIL` (required by NCBI)
- `OPENALEX_MAILTO` (OpenAlex polite usage)
- `INSIGHT_PILOT_SOURCES` (override sources.yaml path)
### New Sources Examples

```bash
# GitHub repositories + code + issues
insight-pilot search --project $PROJECT --source github --query "agent framework" --limit 30

# PubMed (requires PUBMED_EMAIL)
insight-pilot search --project $PROJECT --source pubmed --query "clinical agents" --limit 20

# Dev.to articles
insight-pilot search --project $PROJECT --source devto --query "ai agents" --limit 20

# Blogs (Ghost/WordPress/RSS from sources.yaml)
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20
```
## Workflow (Agent + CLI Collaboration)

This is the complete workflow for Agent + CLI collaboration.

**Execution Principles:**
- Run CLI commands in sequence as prescribed; no step-by-step confirmation is needed.
- Agent intervention is ONLY required in Phase 2 for manual review (checking `items.json` and setting `status` / `exclude_reason`).
### Phase 1: Search and Initial Filtering

Execute the following commands directly, no confirmation needed:

```bash
PROJECT=./research/webagent

# Step 1: Initialize project
insight-pilot init --topic "WebAgent Research" --keywords "web agent,browser agent" --output $PROJECT

# Step 2: Search multiple sources (auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex github pubmed devto blog --query "web agent" --limit 50
```
### Phase 2: Agent Review (Manual Check)

After deduplication, the Agent needs to review the paper list and remove content unrelated to the research topic.

```bash
# Check current status
insight-pilot status --json --project $PROJECT
```

**Agent Actions:**
- Read `$PROJECT/.insight/items.json`
- Check `title` and `abstract` for each paper
- Mark unrelated papers: set `status` to `"excluded"` and add an `exclude_reason`
- Save the updated `items.json`
```json
{
  "id": "i0023",
  "title": "Unrelated Paper Title",
  "status": "excluded",
  "exclude_reason": "Not related to web agents, focuses on chemical agents"
}
```
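The review edit above can be scripted. A minimal sketch, assuming `items.json` holds a JSON array of Item objects with an `"id"` field (per the Item schema under Data Schemas):

```python
import json
from pathlib import Path

def exclude_items(items_path, exclusions):
    """Mark items as excluded in items.json.

    `exclusions` maps item id -> exclude_reason. Assumes items.json is a
    JSON array of Item objects with an "id" field (see Data Schemas).
    Returns the number of items changed.
    """
    path = Path(items_path)
    items = json.loads(path.read_text())
    changed = 0
    for item in items:
        reason = exclusions.get(item["id"])
        if reason is not None:
            item["status"] = "excluded"
            item["exclude_reason"] = reason
            changed += 1
    path.write_text(json.dumps(items, indent=2, ensure_ascii=False))
    return changed
```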
### Phase 3: Download PDFs

Execute directly, no confirmation needed:

```bash
# Step 3: Download PDFs (converts to Markdown automatically)
insight-pilot download --project $PROJECT
```
**Download Results:**
- Success: `download_status: "success"`, PDF saved to `papers/`
- Failed: `download_status: "failed"`, recorded in `$PROJECT/.insight/download_failed.json`
Failure list format:

```json
[
  {
    "id": "i0015",
    "title": "Paper Title",
    "url": "https://...",
    "error": "Connection timeout",
    "failed_at": "2026-01-17T10:30:00Z"
  }
]
```
Note: Advanced download (proxy/browser automation for failed items) is not yet implemented.
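Until advanced download lands, an agent can at least pick out the failures worth re-running. A rough sketch; the `error` strings are free-form, so the transient-error keywords below are guesses to tune against what actually appears in `download_failed.json`:

```python
import json
from pathlib import Path

def retry_candidates(project):
    """Return failed downloads that look transient (network-style errors).

    Heuristic sketch only: matches free-form error strings against a
    keyword list; adjust the keywords to the errors you actually see.
    """
    path = Path(project) / ".insight" / "download_failed.json"
    if not path.exists():
        return []
    failures = json.loads(path.read_text())
    transient = ("timeout", "connection", "reset", "503", "429")
    return [f for f in failures
            if any(k in f.get("error", "").lower() for k in transient)]
```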
### Phase 4: Analyze Papers

Precondition: Phase 3 (Download PDFs) must be completed first; the `download` command automatically converts PDFs to Markdown.

MUST try LLM analysis first. If an LLM is configured, run directly:

```bash
# Step 4: LLM analysis (prefers converted Markdown, falls back to PDF text extraction)
insight-pilot analyze --project $PROJECT
```

**Content Source Priority:**
- Markdown (from `download` auto-conversion via pymupdf4llm)
- PDF extraction (PyMuPDF)
**LLM Configuration:** Create `.codex/skills/insight-pilot/llm.yaml`:

```yaml
provider: openai   # openai / anthropic / ollama
model: gpt-4o-mini
api_key: sk-xxx    # or set env var OPENAI_API_KEY
```
**When LLM is not configured: Manual Analysis Required**

If no LLM is configured, the Agent needs to analyze manually:
- Read the PDF files in the `papers/` directory
- Extract key information from each paper
- Write analysis results to `$PROJECT/.insight/analysis/{id}.json`
**Analysis File Format** (`$PROJECT/.insight/analysis/{id}.json`):

```json
{
  "id": "i0001",
  "title": "Paper Title",
  "summary": "One sentence summary",
  "brief_analysis": "2-3 sentences brief analysis",
  "detailed_analysis": "300-500 words detailed analysis",
  "contributions": ["Contribution 1", "Contribution 2"],
  "methodology": "Methodology description",
  "key_findings": ["Finding 1", "Finding 2"],
  "limitations": ["Limitations"],
  "future_work": ["Future work 1"],
  "relevance_score": 8,
  "tags": ["webagent", "benchmark", "multimodal"],
  "analyzed_at": "2026-01-17T12:00:00Z"
}
```
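For the manual-analysis path, a small helper that writes a record in the format above. It fills in only `id` and `analyzed_at`; the caller supplies the analytical fields (`summary`, `contributions`, and so on):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_analysis(project, item_id, fields):
    """Write $PROJECT/.insight/analysis/{id}.json.

    `fields` is a dict of the analytical keys from the format above;
    this helper only adds "id" and a UTC "analyzed_at" timestamp.
    """
    out_dir = Path(project) / ".insight" / "analysis"
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {"id": item_id, **fields,
              "analyzed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}
    out_path = out_dir / f"{item_id}.json"
    out_path.write_text(json.dumps(record, indent=2, ensure_ascii=False))
    return out_path
```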
### Phase 5: Generate Incremental Report

```bash
# Step 5: Generate/update index
insight-pilot index --project $PROJECT
```

Reports are stored in `$PROJECT/index.md`, which shows only analyzed papers and links to detailed per-paper reports in `reports/{id}.md`.
Report Structure:

```markdown
# WebAgent Research

> **Generated**: 2026-01-18 10:30
> **Keywords**: web agent, browser agent
> **Analyzed**: 5 papers

---

## 📚 Analyzed Papers

### [Paper Title](reports/i0001.md)
**Authors**: Author A, Author B et al. | **Date**: 2026-01-15 | **Links**: arXiv/DOI | **Relevance**: 8/10

**Summary**: One sentence summary...

> 2-3 sentences brief analysis...

**Tags**: `webagent` `benchmark` `multimodal`

---

## ⚠️ Papers Not Available
_The following papers could not be downloaded. Only abstracts are shown._

### Paper Title
**Authors**: ... | **Date**: ... | **Links**: ...

> Abstract...

---

## 📊 Statistics

| Metric | Value |
|--------|-------|
| Papers Analyzed | 5 |
| Download Failed | 1 |
| Total Processed | 6 |
```
## Incremental Update Workflow

For daily/weekly updates:

```bash
# 1. Search new papers (use --since for date limit, auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex --query "web agent" --since 2026-01-17 --limit 20

# 2. [Agent] Review newly added papers

# 3. Download PDFs for new papers
insight-pilot download --project $PROJECT

# 4. [Agent] Analyze new papers, update reports

# 5. Regenerate index
insight-pilot index --project $PROJECT
```
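To pick the `--since` value automatically, an agent can read the newest `collected_at` timestamp from `items.json`. A sketch, assuming the JSON-array layout shown under Data Schemas:

```python
import json
from pathlib import Path

def latest_collected_date(project):
    """Return the newest collected_at date (YYYY-MM-DD) from items.json,
    suitable as a --since value for the next incremental search.

    Assumes items.json is a JSON array whose entries carry ISO-8601
    "collected_at" timestamps (see the Item schema); ISO strings sort
    lexicographically, so max() finds the most recent one.
    """
    path = Path(project) / ".insight" / "items.json"
    if not path.exists():
        return None
    stamps = [i["collected_at"] for i in json.loads(path.read_text())
              if i.get("collected_at")]
    return max(stamps)[:10] if stamps else None
```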
## Project Structure

```
research/myproject/
├── .insight/
│   ├── config.yaml            # Project configuration
│   ├── state.json             # Workflow state
│   ├── items.json             # Paper metadata (incl. status, exclude_reason)
│   ├── raw_arxiv.json         # Raw search results
│   ├── raw_openalex.json
│   ├── download_failed.json   # Failed-download list (for advanced download retry)
│   ├── analysis/              # Paper analysis results
│   │   ├── i0001.json
│   │   ├── i0002.json
│   │   └── ...
│   └── markdown/              # PDF conversion output (pymupdf4llm)
│       ├── i0001/
│       │   ├── i0001.md       # Converted Markdown
│       │   └── metadata.json
│       └── ...
├── papers/                    # Downloaded PDFs
├── reports/                   # Historical report archive
└── index.md                   # Current research report (incrementally updated)
```
## Data Schemas

### Item (Paper)

```json
{
  "id": "i0001",
  "type": "paper",
  "title": "Paper Title",
  "authors": ["Author One", "Author Two"],
  "date": "2026-01-15",
  "abstract": "...",
  "status": "active|excluded|pending",
  "exclude_reason": null,
  "identifiers": {
    "doi": "10.1234/example",
    "arxiv_id": "2601.12345",
    "openalex_id": "W1234567890"
  },
  "urls": {
    "abstract": "https://arxiv.org/abs/2601.12345",
    "pdf": "https://arxiv.org/pdf/2601.12345"
  },
  "download_status": "success|pending|failed|unavailable",
  "local_path": "./papers/i0001.pdf",
  "citation_count": 42,
  "source": ["arxiv", "openalex"],
  "collected_at": "2026-01-17T10:00:00Z"
}
```
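For intuition on how records from multiple sources (note the `source: ["arxiv", "openalex"]` above) can collapse into one Item, here is an identifier-based dedup key. This is an illustration only; the CLI's actual merge logic is internal and may differ:

```python
def dedup_key(item):
    """Derive a dedup key from an Item's identifiers, falling back to a
    normalized title.

    Sketch of the obvious approach: prefer strong identifiers (DOI,
    arXiv id, OpenAlex id) over fuzzy title matching.
    """
    ids = item.get("identifiers") or {}
    for field in ("doi", "arxiv_id", "openalex_id"):
        value = ids.get(field)
        if value:
            return f"{field}:{value.lower()}"
    # Normalize whitespace and case so minor formatting differences collapse
    return "title:" + " ".join(item.get("title", "").lower().split())
```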
## Error Codes

| Code | Meaning | Retryable |
|---|---|---|
| `PROJECT_NOT_FOUND` | Project directory doesn't exist | No |
| `NO_INPUT_FILES` | Required input files missing | No |
| `NO_ITEMS_FILE` | items.json not found | No |
| `INVALID_SOURCE` | Unknown data source | No |
| `NETWORK_ERROR` | API request failed | Yes |
| `RATE_LIMITED` | API rate limit hit | Yes |
| `DOWNLOAD_FAILED` | PDF download failed | Yes |
| `CONVERSION_FAILED` | PDF to Markdown conversion failed | Yes |
| `MISSING_DEPENDENCY` | Required package not installed | No |
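The Retryable column suggests a simple retry policy. A sketch; in practice the `step` callable would wrap a CLI invocation and extract the error code from its `--json` output (the exact payload field is not specified here):

```python
import time

# Codes the Error Codes table marks as retryable
RETRYABLE = {"NETWORK_ERROR", "RATE_LIMITED", "DOWNLOAD_FAILED", "CONVERSION_FAILED"}

def run_with_retries(step, max_attempts=3, base_delay=1.0):
    """Retry `step` while it fails with a retryable error code.

    `step` is any callable returning (ok: bool, code: str or None).
    Uses exponential backoff between attempts.
    """
    ok, code = step()
    attempt = 1
    while not ok and code in RETRYABLE and attempt < max_attempts:
        time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
        ok, code = step()
        attempt += 1
    return ok, code
```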
## Agent Guidelines

**Execution Principles:**
- First run: run the bootstrap script to auto-setup the environment
- CLI commands (`init`, `search`, `download`, `analyze`, `index`): run in sequence, no confirmation needed
- Agent intervention is ONLY needed during Phase 2 (Review) and Manual Analysis (if no LLM)

**Specific Guidelines:**
- Environment setup: run `bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh` first
- Use the `--json` flag: get structured output for parsing
- Execute CLI directly: do not ask for confirmation; follow the workflow sequence
- Review: modify `status` and `exclude_reason` in `items.json`
- LLM analysis first: use the `analyze` command if configured; otherwise manually create `analysis/{id}.json`
- Incremental updates: only process new papers; keep existing analysis results