name: data-science-agent-loop description: >- Scaffolds short-lived R or Python scripts that call Ollama Cloud /api/chat with tool definitions, batch rows or records per request, optionally parallelize chunk calls (e.g. furrr), and apply tool results on the main process with JSONL audits—using a folder-local .env. Use when the user wants an agent loop for data cleaning, enrichment, geospatial routing, dataset compilation, batched tool calls, parallel Ollama chunks, or teaching patterns aligned with 10_data_management/fixer (not full Plumber/FastAPI agents unless requested).
Data science agent loop (short-lived scripts)
Teaches a repeatable pattern for single-folder, disposable data-science loops: one main script (optional functions.R or helpers.py), Ollama Cloud tool calling, batching, optional parallel chunk HTTP, and .env next to the script.
When to use this skill
- Use: One runnable script (plus optional shared helpers) that loads data, batches prompts, gets
tool_calls, dispatches tools, writes CSV/GeoJSON + JSONL audit. - Do not default to: Multi-turn “assistant products” (
10_data_management/agentr/,10_data_management/agentpy/) unless the user wants that scope (Plumber, guardrails, many tools, long loops).
Mandatory guided workflow (run before writing code)
Walk the user through these five choices; do not skip silently.
| Step | What to clarify |
|---|---|
| (1) Data / task | Input paths, output paths, stable row/entity keys, immutable raw vs working copy, and audit strategy (append JSONL per tool application). |
| (2) Draft prompts | Review system + user text: column semantics, anti-hallucination rules, missing-value conventions (e.g. empty string vs literal NA text), and whether the model should reply tool-first / minimal prose. Tighten vague instructions. |
| (3) Ollama Cloud model | Offer clear options (see below); record choice in .env as OLLAMA_MODEL. If tool_calls are empty or wrong, try a different tool-capable model or smaller batches (fixer README). |
| (4) Batch size | Rows (or records) per /api/chat. Larger batches → fewer requests but more context and risk of dropped / malformed tool calls. Start conservative; tune via env (e.g. ROWS_PER_BATCH). |
| (5) Tools | Minimal tool set; strict JSON schemas; decide what the model chooses vs what code computes (e.g. sf distances). Use optimistic concurrency where useful (expected_old_value pattern in fixer_csv.R). |
Ollama Cloud model options (typical)
Present tradeoffs, then let the user pick (store in .env).
nemotron-3-nano:30b-cloud— Default infixer/.env.example; faster / cheaper; verify tool quality on the task.gpt-oss:120b— Stronger reasoning; used in fixer smoke path (testme.R); slower / heavier.- Another Cloud tag — Any model the user’s account exposes that supports tools; confirm on Ollama if unsure.
- Fallback strategy — If tools never fire: smaller batch, stronger model, or simpler tool schemas (see troubleshooting in fixer README).
Course repo examples (canonical)
Shared helpers: 10_data_management/fixer/functions.R — ollama_chat_once, parse_function_arguments, split_df_into_row_chunks, truncate_tool_output. All four fixer drivers source() this file.
Run order and artifacts: See fixer/README.md (steps 2–5: CSV → parcels → POIs → spatial context).
| Script | What it does | Primary tools (examples) | Main inputs → outputs |
|---|---|---|---|
fixer_csv.R | Batched tabular repair; DATA_QUALITY_BLURB in-script (no per-row notes column); readr::format_csv chunks | set_cell, write_checkpoint | data/messy_inventory_raw.csv → output/messy_inventory_working.csv, output/fix_audit.jsonl |
fixer_parcels.R | Non-overlapping polygon parcels (wkt, WGS84); maps via sf + ggplot2 | record_parcel_zoning | data/parcels_zoning_raw.csv → output/parcels_enriched.csv, output/parcels_enrich_audit.jsonl, map_parcels_*.png |
fixer_pois.R | Point POIs (x, y, WGS84); maps | record_poi_category | data/pois_messy_raw.csv → output/pois_enriched.csv, output/pois_enrich_audit.jsonl, map_pois_*.png |
fixer_spatial_context.R | After parcels + POIs: LLM routes which spatial tools to call; no geometry in the model — sf computes distances/counts (metric CRS EPSG:32617 in-repo) | nearest_poi, count_pois_within, record_context_note | Default output/parcels_enriched.csv + output/pois_enriched.csv → output/parcels_context_enriched.csv, output/context_routing_audit.jsonl, output/map_parcels_context_transport.png. Override inputs with FIXER_CONTEXT_PARCELS, FIXER_CONTEXT_POIS. Demo grid: 24 parcels / 24 POIs → 3 chunk requests when ROWS_PER_BATCH=10. |
Extended multi-turn agents (heavier): agentr/R/loop.R, agentpy/app/loop.py.
Env template: fixer/.env.example — OLLAMA_API_KEY, OLLAMA_HOST, OLLAMA_MODEL; optional ROWS_PER_BATCH, FIXER_CHUNK_WORKERS, FIXER_MAX_OUTPUT_TOKENS (digits only; omit if Cloud returns HTTP 400 on num_predict).
Related project skills (Cursor)
When students or the agent author or extend fixer-style R in this repo:
- tidyverse-elegant-r + tidyverse_elegant.mdc —
=, native|>, explicitlibrary(), httr2, dplyr 1.1+ (join_by,.by), vectorized table logic. - console-message — progress UX for scripts: section dividers, paths, row counts, previews (aligns with the
cat()-style sections in the fixer drivers).
Architecture (fixer-style)
flowchart LR
subgraph config [Folder config]
env[.env OLLAMA_* BATCH_*]
end
subgraph prep [Prepare]
load[Load table or paths]
chunk[Split into batches]
end
subgraph parallel [Parallel optional]
map[Map chunks to API]
end
subgraph api [Ollama Cloud]
chat["/api/chat tools"]
end
subgraph main [Main process]
apply[Dispatch tools mutate state]
audit[Append JSONL audit]
write[Write outputs]
end
env --> load
load --> chunk
chunk --> map
map --> chat
chat --> apply
apply --> audit
apply --> write
Rule: Workers return parsed tool_calls (or errors); mutate shared tables / files on the main process so state stays coherent (see call_chunk_ollama + sequential dispatch in fixer_csv.R).
Standard project layout
my_task.Rorrun_loop.py— entrypoint.- Optional
functions.R/helpers.py— HTTP, parsing, chunking (no secrets in code). data/— immutable inputs;output/— working CSV + audits..envin the same folder as the script(s) the student runs; commit.env.exampleonly (no keys).
Implementation patterns
R (this repo)
- Load env:
if (file.exists(".env")) readRenviron(".env")(path relative to the task folder). - HTTP:
httr2—request()→req_body_json()→req_perform();resp_body_json(..., simplifyVector = FALSE)somessage$tool_callsstays structured. - Optional parallelism:
future+furrr::future_mapover chunk indices;plan(multisession, workers = …); cap workers if Cloud returns 429/500 (sequential fallback likeFIXER_CHUNK_WORKERS=1in fixer README). - R style: Follow tidyverse-elegant-r and tidyverse_elegant.mdc (
=, native|>, explicitlibrary(), vectorized table logic,join_by/.bywhere appropriate). For polishedcat()progress (dividers, paths,nrow), see console-message.
Python
- Mirror with
httpx(seeagentpy/app/): POST JSON to{OLLAMA_HOST}/api/chat,stream: false,tools: [...]. - Batch with
concurrent.futures.ThreadPoolExecutoror async patterns; keep one writer for DataFrame / files. - Prefer pandas or Polars chains for table logic; avoid row-at-a-time Python loops for expressible column work.
Tool schemas (portable gotchas)
- Ollama expects OpenAI-style
toolswithtype: "function"andfunction.parametersas a JSON Schema object. - In R, empty
propertiesmust serialize as{}, not[]— useproperties = structure(list(), names = character(0))(see fixer README troubleshooting).
Safety and operations
- Never log raw
OLLAMA_API_KEY; mask in console (e.g. first/last chars only), as in fixer scripts. - Redact Bearer tokens and obvious secret patterns in any file logs the student adds.
- HTTP 400 with
num_predict/ options: omitoptions.num_predictunless needed (fixer uses optionalFIXER_MAX_OUTPUT_TOKENS). - Rate limits: reduce parallel workers or batch size.
More use cases (beyond fixer)
- Survey / form coding — Open text → closed codes via tools that only allow a controlled set.
- Entity resolution — Model proposes
merge_group_id/canonical_nameagainst stable keys; code merges. - Schema mapping — Messy headers → target field names + types; JSONL audit per mapping decision.
- Unit / currency normalization — Mixed strings → canonical units in structured columns.
- Time range extraction — Free text → ISO intervals or explicit “unknown”.
- Multi-table stitching — Model suggests join keys; code performs joins and checks row counts.
- Literature / metadata tagging — Title/abstract batches → controlled vocabulary via tools.
- Quality gates — Model flags bad rows; second pass or export quarantine CSV.
- Raster / vector pairing — Model selects which zonal stat / buffer tool to call;
terra/sfcompute values.
Additional resources
- reference.md —
.envsnippet and minimal R / Python tool-definition skeletons. - Hands-on:
ACTIVITY_fixer_csv.md,ACTIVITY_fixer_spatial.md(both link back to these skills). - R style: tidyverse-elegant-r, tidyverse_elegant.mdc. Console UX: console-message.