name: stk-data-annotations
description: Create, validate, align and audit STK (salk_toolkit) JSON data meta annotations for survey datasets. Use when working with _meta.json files, infer_meta, read_annotated_data, read_and_process_data, or when the user mentions survey annotations, metafiles, data alignment, or category mapping.
STK Data Meta Annotations
Overview
STK annotations are JSON files (*_meta.json) that describe how to transform raw survey data (.sav, .csv, .parquet) into a standardised, English-language, typed DataFrame. The authoritative schema lives in salk_toolkit/validation.py (DataMeta); processing logic lives in salk_toolkit/io.py.
Always read these two files before starting annotation work — the schema evolves.
IMPORTANT: When in doubt about the semantics of a survey question — what categories mean, whether something is ordered, topk, or something else — always ask the user rather than assuming. Wrong semantic assumptions (e.g. treating an unordered category as ordered, or merging categories that shouldn't be merged) produce silent errors that are extremely hard to detect later in the modeling pipeline.
NEVER edit the raw data file (VERY IMPORTANT). The raw data (.sav, .csv, .xlsx, .parquet) is the immutable source of truth — never modify it, overwrite it, or save a "cleaned" copy over it. All corrections, recodings, merges, synthetic columns, filters and fix-ups happen inside the annotation, in this order of preference:
- `translate` / `translate_after` — for plain value → value remappings (e.g. merging "Don't remember" and "Difficult to answer" into "Don't know", renaming categories, fixing typos).
- `transform` (per column) — for expression-level fixes that need the cell / column in scope (casting, regex, `stk.cut_nice`, rule-based recoding).
- `preprocessing` (top-level code block) — last resort, for changes that need multiple source columns at once, row filtering, or cross-column derivations before any column-level processing runs.
If you think you need to edit the raw file, you're wrong — use translate/transform/preprocessing instead.
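As a minimal sketch of the first two preference levels inside an annotation (the column names `Q12` / `gov_trust` and the age filter are hypothetical — adapt to the actual survey):

```json
{
  "preprocessing": "df = df[df['age'] >= 18]",
  "structure": [
    {
      "name": "attitudes",
      "columns": [
        ["gov_trust", "Q12", {
          "categories": ["Distrust", "Neither", "Trust", "Don't know"],
          "ordered": true,
          "nonordered": ["Don't know"],
          "translate": {
            "Don't remember": "Don't know",
            "Difficult to answer": "Don't know"
          }
        }]
      ]
    }
  ]
}
```

The raw file is never touched: the row filter lives in `preprocessing`, and the category merge lives in `translate` on the column itself.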
Use Cases
1. Creating a new annotation
Definition of done:
- Matches census in category names and granularity — ask for census file if not provided
- All relevant columns annotated (demographics, opinions, scales, etc.)
- Ordered categories correctly ordered (likerts always go from the negative to the positive pole, e.g. disagree → agree); nonordered elements marked
- `num_values` set for all ordered columns (centered on zero for likerts, 1–N otherwise, `null` for nonordered entries)
- All conventions followed (see below)
- Loads cleanly via `read_annotated_data(meta_file)` with no warnings
- Everything translated to English (exception: party acronyms)
- If a questionnaire / data description is available, add `label` entries with the exact question wording (per-item text on individual columns, shared lead-in text on `scale.label` for item-battery blocks)
- Region fields have a `topo_feature` attached — ask the user for a link to the map JSON if not provided
- Party brand colors collected and wired up wherever parties appear (see Colors section) — search the web if not provided by user
2. Aligning to an existing annotation
Definition of done:
- New annotation loads on its own (same criteria as above)
- Both files load together via `read_and_process_data` with no errors
- Shared columns have identical category names, order, and types
- `col_prefix` usage matches between files
3. Auditing / cleaning up an existing annotation
Same criteria as creating. Focus on correctness of category lists, ordered flags, translations, and consistency between party preference / thermometer / issue ownership blocks.
Review & fix protocol (follow this order strictly when editing existing annotations):
- Read everything first. Read all annotation files, census meta, and any alignment targets fully before making a single edit.
- Produce one consolidated issue list. After reading, output a single list of all issues found. No inline self-corrections or "wait, actually…" — if unsure, verify before listing.
- Batch all fixes. Apply every fix in one pass. Do not stop partway through and wait for the user — complete all edits before moving on.
- Verify once. Run `read_annotated_data` (and `read_and_process_data` if aligning) exactly once after all fixes are applied. If new issues surface, fix and re-verify — but the goal is one clean pass.
- Report. Output: (a) changes made, (b) remaining warnings and whether they are actionable, (c) ambiguity report per the workflow below.
Gathering Inputs
Before doing any annotation work, gather the required inputs. First, search the directory of the provided data file (and nearby folders) for these — only prompt the user for what you can't find:
- Data file (`.sav`, `.csv`, `.parquet`, `.xlsx`) — the raw survey data. Required. Always provided or referenced.
- Data description (Word/Excel/PDF document) — describes the survey questions, answer codes, and structure. This might not exist, especially if a .sav file is provided, as that often contains most of the required metadata. Nevertheless, always ask for this file if not found/provided.
- Census file — the country's census parquet/meta defining demographic categories and granularity. Look in the `census/` repo or ask the user. Usually present, but might be missing in very rare cases.
- Previous wave / existing annotation — if aligning, the `*_meta.json` from the prior wave or partner survey. Search nearby folders. Might not be present, e.g. for the first wave in each country.
- DeepL API key + source language code (e.g. `LT`, `ET`, `RO`) — needed for automatic translation during bootstrap.
When creating a new meta, ALWAYS ask the user about all 5 in sequence (have them confirm any file you found yourself). For other use cases, ask as needed.
Typical Workflow
```python
import json
import salk_toolkit as stk
from salk_toolkit.io import infer_meta, read_annotated_data, read_and_process_data
from salk_toolkit.validation import hard_validate, soft_validate, DataMeta

# 1. Bootstrap from raw data with DeepL translation
meta = infer_meta("raw_data.sav", deepl_key="<key>", source_lang="LT")

# 2. Edit the *_meta.json to fix structure, ordering, conventions (AI does this)

# 3. Validate
hard_validate(json.load(open("data_meta.json")))

# 4. Test loading — iterate on step 2 until this passes cleanly
df = read_annotated_data("data_meta.json")

# 5. Write an ambiguity report: list every semantic judgement call made
#    (ordering decisions, category merges, what was marked nonordered, etc.)
#    so the user can verify assumptions in one pass

# 6. Hand off to user for review only after step 4 passes with no warnings

# 7. Multi-file alignment test (if applicable)
df = read_and_process_data({
    "files": [
        {"file": "wave1_meta.json", "code": "W1"},
        {"file": "wave2_meta.json", "code": "W2"}
    ]
})
```
JSON Structure Quick Reference
```json
{
  "description": "...",
  "source": "...",
  "collection_start": "2026-01-15",
  "collection_end": "2026-02-01",
  "author": "...",
  "constants": { "party_colors": { "PartyA": "#ff0000" } },
  "files": [{ "file": "data.sav", "opts": {}, "code": "F0" }],
  "read_opts": {},
  "preprocessing": "df = df[df['age'] >= 18]",
  "postprocessing": null,
  "weight_col": null,
  "excluded": [],
  "structure": [
    {
      "name": "demographics",
      "scale": { "...shared column meta..." },
      "columns": [
        ["new_name", "source_col", { "...column meta..." }],
        ["new_name", { "...meta, source defaults to new_name..." }],
        ["new_name"],
        "bare_col_name"
      ]
    }
  ]
}
```
Column entry formats (inside columns list)
| Format | Meaning |
|---|---|
| `"col"` or `["col"]` | Keep column as-is (name = source name in data) |
| `["new_name", "source"]` | Rename: read `source` from data, expose as `new_name` |
| `["new_name", { meta }]` | Same name in data, add/override metadata |
| `["new_name", "source", { meta }]` | Combines the two above |
Column-level { meta } should only contain fields that differ from the block's scale. The scale is merged as defaults into every column, so don't repeat what's already set there.
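A pure-Python sketch of the scale-merge behavior described above (this is not the toolkit's actual code, just the defaults-with-override idea):

```python
# Block-level "scale" provides defaults; column-level meta keys override them.
def merge_scale(scale: dict, col_meta: dict) -> dict:
    merged = dict(scale)      # start from block defaults
    merged.update(col_meta)   # column-level keys win
    return merged

scale = {"categories": ["Low", "Mid", "High"], "ordered": True}
col = {"label": "income band"}
merged = merge_scale(scale, col)
assert merged["ordered"] is True          # inherited from the block scale
assert merged["label"] == "income band"   # column-specific addition
```

This is why repeating scale fields at the column level is pure noise: the merged result is identical.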
Key ColumnMeta fields
Type declaration — exactly one of these should apply:
| Field | Type | Purpose |
|---|---|---|
| `categories` | list \| `"infer"` | Categorical column. `"infer"` only valid with `translate` (order from translate dict). |
| `continuous` | bool | Numeric real-valued column |
| `datetime` | bool | Datetime column |
Ordering — only meaningful for categorical columns:
| Field | Type | Purpose |
|---|---|---|
| `ordered` | bool | Whether categories are naturally ordered (age, income, likerts) |
| `nonordered` | list | Categories outside the order ("Don't know", "No answer") |
| `likert` | bool | Symmetric ordered scale (requires `ordered: true`) |
| `neutral_middle` | str | Which category is the neutral middle of a likert |
| `num_values` | list[float \| null] | Numeric value per category, aligned 1:1 with `categories` (zero-centered for likerts, 1–N otherwise); null for nonordered entries |
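Putting the ordering fields together on one hypothetical 5-point likert column, with the invariants `num_values` must satisfy spelled out as checks:

```python
# A 5-point likert column meta with zero-centered num_values and
# None (JSON null) for the nonordered "Don't know" category.
col_meta = {
    "categories": ["Strongly disagree", "Disagree", "Neutral",
                   "Agree", "Strongly agree", "Don't know"],
    "ordered": True,
    "likert": True,
    "neutral_middle": "Neutral",
    "nonordered": ["Don't know"],
    "num_values": [-2, -1, 0, 1, 2, None],
}

# num_values must align 1:1 with categories...
assert len(col_meta["num_values"]) == len(col_meta["categories"])
# ...be None exactly for the nonordered categories...
for cat, v in zip(col_meta["categories"], col_meta["num_values"]):
    assert (v is None) == (cat in col_meta["nonordered"])
# ...and increase monotonically over the ordered part.
ordered_vals = [v for v in col_meta["num_values"] if v is not None]
assert ordered_vals == sorted(ordered_vals)
```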
Transformations — applied in order: translate → transform → translate_after:
| Field | Type | Purpose |
|---|---|---|
| `translate` | dict | Map source values → output values |
| `transform` | str | Python expression with `s`, `df`, `ndf`, `pd`, `np`, `stk`, `constants` in scope |
| `translate_after` | dict | Like `translate`, applied after `transform` |
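The three-stage order can be sketched in plain Python (no pandas; the Lithuanian values and the uppercase transform are invented for illustration):

```python
# Pipeline order: translate -> transform -> translate_after.
translate = {"Taip": "yes_raw", "Ne": "no_raw"}
transform = lambda v: v.upper()            # stands in for the `transform` expression
translate_after = {"YES_RAW": "Yes", "NO_RAW": "No"}

def process(values):
    out = [translate.get(v, v) for v in values]      # 1. translate
    out = [transform(v) for v in out]                # 2. transform
    out = [translate_after.get(v, v) for v in out]   # 3. translate_after
    return out

assert process(["Taip", "Ne"]) == ["Yes", "No"]
```

Unmapped values pass through each dict stage unchanged, which is why `translate_after` is useful for cleaning up whatever `transform` produced.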
Display & modeling context:
| Field | Type | Purpose |
|---|---|---|
| `label` | str | Column description for tooltips/headers |
| `colors` | dict \| str | Category value → color mapping (or constant name). See Colors section. |
| `question_colors` | dict \| str | Block-scale only: column name → color for unpivoted plots. See Colors section. |
| `groups` | dict | Named category groupings for filtering |
| `topo_feature` | [url, type, col] | Link to topojson for geographic columns |
| `modifiers` | list[str] | Columns that modify responses (private inputs for modeling) |
Block-level fields
| Field | Purpose |
|---|---|
| `name` | Block identifier (must not collide with any column name in the annotation) |
| `scale` | Shared ColumnMeta defaults merged into every column in the block |
| `columns` | List of column specs |
| `col_prefix` | On `scale`: prefix prepended to column names (disambiguates shared names) |
| `hidden` | Hide from explorer dashboards |
| `generated` | Column data produced by model, not in source file |
| `create` | TopK or MaxDiff block spec (see below) |
| `subgroup_transform` | Python code applied to all columns in the block as `gdf` |
Constants
Any value in the structure can be a string matching a key in constants. It gets replaced at parse time. Use for colors, topic lists, and translation dicts shared across blocks.
Only define a constant if it is referenced two or more times. Single-use constants add indirection and hurt readability — inline them at the use site instead. When auditing an annotation, remove any constant used zero or one times.
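A sketch of a constant that earns its place — referenced twice, once per color field (party names and hexes are placeholders):

```json
{
  "constants": {
    "party_colors": { "PartyA": "#e4002b", "PartyB": "#0057b7" }
  },
  "structure": [
    {
      "name": "party_preference",
      "columns": [["party_preference", { "colors": "party_colors" }]]
    },
    {
      "name": "thermometer",
      "scale": { "question_colors": "party_colors" },
      "columns": ["PartyA", "PartyB"]
    }
  ]
}
```

If only one of the two references existed, the convention says to inline the dict at that use site instead.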
Colors
Two fields, orthogonal dimensions:
| Field | Where | Maps |
|---|---|---|
| `colors` | Column meta (or scale as default) | category value → hex |
| `question_colors` | Block scale only | column name → hex; becomes `colors` on the synthetic question column after unpivot (see `pp.py::_question_meta_clone`, ~line 1110) |
Both accept an inline dict or a string referencing a constant. For question_colors, the block's column names must match the keys in the referenced dict. If a block's scale is a string reference to a shared constant (e.g. "scale": "trust_scale"), inline the scale to add question_colors — string refs are whole-value replacements.
Party colors — always collect them. Whenever an annotation has party data (party_preference, per-party thermometer / ownership), define a party_colors constant and reference it via colors on party-valued columns and via scale.question_colors on blocks whose columns are parties. If the user didn't supply colors, search the web:
Wikipedia's "Opinion polling for the [YEAR] [COUNTRY] parliamentary election" pages are the canonical source for exact hex codes. Open the page source on the polling table and look for `{{party color|PartyName}}` templates — these pull from a shared CSS database of hex codes used by news organizations. That gives you per-country, per-election, match-the-press colors in one place.
Fall back to distinct placeholder hues (documented in a comment) only if a reliable hex can't be found. Use neutral greys for ballot meta-options (other, spoil_ballot, Against_Everyone, none, Don't know, No answer).
See examples/example_web_meta.json for a worked pattern — party_colors constant, colors: "party_colors" on party_preference, and scale.question_colors: "party_colors" on the thermometer block.
Comments
Every block in the annotation (the top-level DataMeta, any entry in structure, any scale, any per-column meta dict, create blocks, etc.) accepts an optional "comment" field. JSON has no native comment syntax, so this field is the canonical place to leave notes.
- Value is either a single string or a list of strings (one per line) — both render fine in the JSON.
- The field is ignored by all processing code: it carries no semantic meaning and has zero runtime effect.
- It is preserved on load/save round-trips through the pydantic models.
```json
{
  "name": "attitudes",
  "comment": "5-point Likert collapsed from original 7-point in CATI wave — see below",
  "scale": {
    "categories": ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"],
    "ordered": true,
    "likert": true
  },
  "columns": [
    ["future", { "comment": ["'optimism about the future' in questionnaire", "kept singular name to match previous waves"] }]
  ]
}
```
Use comment to document any decision that deviates from best practice or is non-obvious. This includes (but is not limited to):
- Non-standard mappings via `translate`, especially if they lose information
- Unusual `transform` logic, especially when a simpler form would look correct but be wrong
- Placeholder values, known-broken columns, or anything the next editor would otherwise "fix" incorrectly
If you find yourself wanting to explain a choice to the user in chat, write that explanation into comment as well — future readers of the JSON will thank you.
TopK Blocks
For "select top K" questions (e.g. "which 3 issues matter most?"):
```json
{
  "name": "issue_importance_top3",
  "create": {
    "type": "topk",
    "from_columns": "Q6r(\\d+)",
    "res_columns": "Q6p_R\\1",
    "agg_index": 1,
    "na_vals": ["NO TO: ...", "..."],
    "translate_after": { "1": "Cost of living", "2": "Healthcare" }
  },
  "scale": { "categories": "infer" },
  "columns": []
}
```
- `from_columns`: regex matching source columns (or explicit list)
- `res_columns`: output column template (or explicit list matching `from_columns`)
- `agg_index`: which regex group indexes the items (1-indexed; -1 = last)
- `na_vals`: values meaning "not selected" — replaced with NA
- `translate_after`: map item indices to readable names (applied first)
- `from_prefix`: if `from_columns` is a list, strip this prefix for translation
The columns list in a topk block is usually empty — output columns are auto-generated. However, some topk blocks (e.g. issue ownership) list the raw source columns alongside the create block when those columns are also needed for other purposes.
TopK translate pipeline
After the one-hot columns are reshaped (cell value becomes the column's regex-group label), translations are applied in order:
1. `create.translate_after` — maps raw regex-group labels (typically numeric indices like `"1"`, `"2"`) to readable names.
2. `scale.translate` — maps those names (or the original text if `translate_after` was not used) to final English output names. When `scale.translate` is present, its values become the output `categories` list.
In practice you use one or the other, not both:
- Numeric one-hot columns → use `translate_after` to go from index → English name.
- Text-valued one-hot columns (e.g. party names in the local language) → use `scale.translate` to go from local name → English short code.
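The reshape-then-translate flow can be sketched in plain Python (this is an illustration of the mechanics, not the toolkit's implementation; the row values are invented):

```python
# One-hot source columns Q6r1..Q6r3 hold a mark when the item was picked.
# After reshaping, the cell value becomes the item's regex-group label
# ("1".."3"), which translate_after then maps to a readable name.
import re

row = {"Q6r1": "selected", "Q6r2": None, "Q6r3": "selected"}  # hypothetical data
na_vals = {None}
translate_after = {"1": "Cost of living", "2": "Healthcare", "3": "Security"}

picked = []
for col, val in row.items():
    m = re.fullmatch(r"Q6r(\d+)", col)
    if m and val not in na_vals:
        picked.append(translate_after[m.group(1)])  # group 1 = agg_index 1

assert picked == ["Cost of living", "Security"]
```

Note how wrong `na_vals` would break this: if `None` were missing from the set, `Q6r2` would raise or produce a spurious pick.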
MaxDiff Blocks
For best-worst scaling / maxdiff experiments:
```json
{
  "name": "maxdiff",
  "create": {
    "type": "maxdiff",
    "best_columns": "Q6_(\\d+?)best",
    "worst_columns": "Q6_(\\d+?)worst",
    "set_columns": "Q6_\\1set",
    "setindex_column": ["Q6_Version", { "continuous": true }],
    "topics": null,
    "sets": null
  },
  "scale": {
    "categories": "infer",
    "translate": { "Local topic 1": "English topic 1", "...": "..." }
  },
  "columns": []
}
```
- `best_columns` / `worst_columns`: regex or list matching best/worst choice columns
- `set_columns`: regex template or list for the set-membership columns
- `setindex_column`: column containing the set version index (with optional meta). Mutually exclusive with explicit set_columns data in the file.
- `topics`: list of all topic strings (typically in `constants`)
- `sets`: list of lists of 1-indexed topic indices per version (typically in `constants`)
- Scale `translate` maps local-language topics to English
MaxDiff translate pipeline
All translation happens through scale.translate (there is no translate_after for maxdiff). The flow:
1. `topics` defines the full topic list (usually via `constants`) in the source language.
2. `scale.translate` maps each source-language topic to its English name, producing `effective_topics`.
3. `effective_topics` is used everywhere: best/worst column values are translated and cast to categorical with this list; set columns resolve topic indices through this list; the output meta carries `effective_topics` as its categories.
So scale.translate is where all the naming happens for maxdiff — it controls both the cell values and the category list.
When using setindex_column, topics and sets must be defined (usually via constants). The columns list should be empty.
Conventions (MUST follow)
- English: All category names, labels, and column names in English.
  - Exception: party names/acronyms kept as originals (e.g. "TS-LKD", "LSDP")
  - Exception: geographic names (counties, municipalities) may stay in the local language — match whatever the census uses
- Column names: short, snake_case, a single identifier where possible. Put the full human-readable name in `label` when the column name is a shortening/change.
  - Default: lowercase (e.g. `age`, `gender`, `pol_interest`).
  - Proper nouns (people, parties, organisations) stay capitalized (e.g. `Putin`, `Macron`, `Civil_Contract`, `Fidesz`). For people prefer last name only. If any name in a block needs a first-name prefix to disambiguate, use full `First_Last` names for every person in that block.
  - Acronyms stay fully uppercase (e.g. `ARF`, `ANC`, `LSDP`, `TS-LKD`).
- Standard block/column naming: use these names whenever the concept applies, so blocks line up across surveys:
  - `party_preference` — who the respondent would vote for (single column or block).
  - `thermometer` — per-party rating / likability / trust scale (one likert-style column per party).
  - `importance` — issue-importance ranking, usually pick-top-K or maxdiff.
  - `ownership` — which party is trusted most to handle each issue.
- `categories: "infer"`: Only use together with `translate`. Order is derived from the `translate` dict key order.
- `translate`: Only include if actually performing translation or value mapping. Don't add identity translations unless needed for order disambiguation with `categories: "infer"`.
- Ordered categories: Naturally ordered data (age, income, education, likerts) must be `ordered: true` with `nonordered` marking outliers ("Don't know", "No answer", "Other"). Any bipolar ordered scale — one with opposing poles (agree/disagree, trust/distrust, positive/negative, better/worse) — must be marked `likert: true` with `num_values` centred on zero, regardless of whether a neutral middle exists. Set `neutral_middle` when a middle category does exist.
  - Dichotomous choices are likerts too. Any 2-way choice — yes/no, for/against, approve/disapprove, support/oppose, stay/leave, EU/EAEU, etc. — must be marked `ordered: true, likert: true` with `num_values: [-1, 1]` (plus nulls for DK/NA), not left as unordered categorical. This applies to both opinion bipolars (agree vs disagree) and factual/choice binaries (yes vs no, A vs B).
  - Pick the positive pole by this priority (documented with a `comment` when non-obvious):
    1. Explicit valence: trust, agree, approve, support, positive, better, more, yes → positive; distrust, disagree, disapprove, oppose, negative, worse, less, no → negative.
    2. Affirmative / pro-action: yes, for, support, change-to-new > no, against, oppose, keep-status-quo.
    3. For A vs B choices without explicit valence, pick the pole aligned with the survey's analytical reference direction (e.g. Western/EU orientation as positive in Eastern-European polling) and document with `comment`.
  - Always order likert categories from the negative pole to the positive pole (disagree → agree, distrust → trust, no → yes, against → for, leave → stay, EAEU → EU); `num_values` increase monotonically from negative to positive. Flip with `translate` if the source data codes the other way.
- Party consistency: Party names must be identical across `party_preference`, `thermometer`, and `ownership` blocks.
- Discrete scales: Use categorical (not continuous) for scales with <20 values, even if numeric.
- `col_prefix`: Use to disambiguate columns that share names across blocks (e.g. `attitude_`, `issue_`, `therm_`).
- Auto-inferred blocks from topk/maxdiff: Delete any blocks that were auto-generated by `infer_meta` for columns that belong to topk/maxdiff `create` blocks — those get regenerated.
- Document non-obvious decisions with `comment`: Any choice that deviates from best practice or is non-obvious (unusual merges, ambiguous ordering calls, deliberate category mismatches, tricky transforms) must be noted in a `comment` field on the block, scale, or column where it applies. See the Comments subsection above.
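For instance, the dichotomous-likert rule could look like this on a hypothetical EU-membership yes/no column (the source name `Q20` is invented):

```json
["eu_membership", "Q20", {
  "categories": ["No", "Yes", "Don't know"],
  "ordered": true,
  "likert": true,
  "nonordered": ["Don't know"],
  "num_values": [-1, 1, null],
  "comment": "Yes/no coded as a 2-point likert per convention; positive pole = yes"
}]
```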
Common Pitfalls
- Category order matters: `categories: ["Never", "Sometimes", "Usually", "Always"]` defines the modeling/display order. Check it matches the natural ordering.
- Many-to-one translate: Multiple source values can map to the same output (e.g. merging districts). This is fine, but be aware `categories: "infer"` deduplicates while preserving first-seen order.
- Missing na_vals in topk: If `na_vals` don't match the actual "not selected" values in the data, topk processing will fail or produce wrong results.
- Scale vs column precedence: Column-level meta overrides scale. If a column needs different categories than the block, specify them on the column.
- `education` ordering: `["Primary", "Secondary", "Higher"]`, not alphabetical. Always verify ordered categories make substantive sense.
- num_values alignment: Must have the same length as the categories list and correspond 1:1.
- Variant files: When CATI and WEB surveys share questions but with different scales (5-point vs 7-point), use a `_p` suffix for the phone variant columns and create separate blocks with appropriate scale transforms.
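The many-to-one dedup behavior can be sketched in a few lines (district names are invented; the point is first-seen output order):

```python
# categories: "infer" with a many-to-one translate dict: output categories
# are the translate dict's values, deduplicated in first-seen order.
translate = {
    "Centras": "Downtown",
    "Senamiestis": "Downtown",   # merged into the same output category
    "Žirmūnai": "Zirmunai",
}
seen, categories = set(), []
for out in translate.values():
    if out not in seen:
        seen.add(out)
        categories.append(out)

assert categories == ["Downtown", "Zirmunai"]
```

Key order in the dict is therefore load-bearing: reordering the translate entries reorders the inferred categories.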
Inspecting Raw Data
Before annotating, examine the source file:
```python
import pyreadstat

df, meta = pyreadstat.read_sav("data.sav", apply_value_formats=True)
# meta.column_names, meta.column_labels — useful for labels
# df['Q1'].value_counts() — check actual category values
# df.columns.tolist() — all column names
```
For SAV files, meta.column_labels often contains the question text in the original language — feed these to a translation function for initial labels.
Validation Commands
```python
# Quick validation
hard_validate(meta_dict)  # Raises on any issue

# Load test (most thorough — runs full processing pipeline)
df = read_annotated_data("my_meta.json")

# Multi-file alignment test
df = read_and_process_data({
    "files": [{"file": "meta1.json"}, {"file": "meta2.json"}]
})
```
Warnings during read_annotated_data are important — they flag missing columns, dropped categories, and category mismatches. Resolve all of them.
Aligning With Census
Census files define the ground-truth category names and granularity for demographic columns. When annotating:
- Load the census parquet/meta to see its column names and categories
- Ensure demographic columns (age_group, gender, education, county, municipality, etc.) use exactly the same category strings
- Match any computed columns like
county+that combine geography levels age_groupis typically derived from a continuousagecolumn usingstk.cut_nicewith breakpoints matching the census granularity. The survey data usually has raw age — you create the correct grouped column viatransform:
```json
["age_group", "age", {
  "categories": "infer",
  "transform": "stk.cut_nice(s, [18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85])",
  "ordered": true,
  "label": "age group"
}]
```
The breakpoint list must match what the census uses. Check the census age_group categories to determine the right bins.
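As a rough stand-in for what this kind of breakpoint binning produces (this is not `stk.cut_nice` itself — check the toolkit for its actual labels and edge handling):

```python
# Breakpoints [18, 25, 30, ...] become labeled age groups: each value falls
# into the half-open bin starting at the breakpoint at or below it, with an
# open-ended top bin.
import bisect

def bin_age(age: int, breaks: list) -> str:
    if age < breaks[0]:
        return f"<{breaks[0]}"
    i = bisect.bisect_right(breaks, age) - 1
    if i == len(breaks) - 1:
        return f"{breaks[-1]}+"
    return f"{breaks[i]}-{breaks[i + 1] - 1}"

breaks = [18, 25, 30, 35, 40]
assert bin_age(18, breaks) == "18-24"
assert bin_age(29, breaks) == "25-29"
assert bin_age(44, breaks) == "40+"
```

Whatever the exact label format, the resulting group boundaries are what must line up with the census categories.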
Aligning Two Meta Files
When two surveys (e.g. CATI + WEB, or two waves) need to load together via read_and_process_data, their annotations must be compatible:
- Shared columns must have identical names, categories, and category order. This includes demographics (`gender`, `age_group`, `education`, `county`, etc.) and any columns used as model inputs.
- `col_prefix` must match for blocks that should merge (e.g. both files use `attitude_` for attitudes).
- Different scales for the same question are handled with separate blocks and a `_p` suffix on column names. For example, WEB uses a 7-point scale (`attitudes` block with `attitude_` prefix), CATI uses 5-point (`attitudes_p` block with the same `attitude_` prefix but columns like `pol_interest_p`). The shared prefix means they land in the same namespace; the `_p` suffix distinguishes the reduced-scale variant.
- A `method` column should be added to distinguish data sources (e.g. `"categories": ["web", "cati"]`). Include it in both files.
- Translate dicts for party names must produce identical output strings across files — even if the source-language strings differ slightly between surveys.
- Test alignment by loading both together and checking for warnings:
```python
df = read_and_process_data({
    "files": [{"file": "web_meta.json"}, {"file": "cati_meta.json"}]
})
```
Any category mismatch or duplicate column name will surface as a warning or error. Fix these iteratively until the load is clean.
- The last file is the basis for the combined meta. `read_and_process_data` uses the last file's annotation as the combined schema. If blocks exist in file A but not in file B (the last file), they won't appear in the output — even though the data is present. To fix this, add the missing blocks to the last file with `"generated": true` on each such block. This suppresses "no matching columns in data" warnings for that file while letting the block's schema carry through to the combined result.
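A sketch of such a carry-through block in the last file, patterned on the worked example below (block and column names follow that example; adapt the scale to the actual survey):

```json
{
  "name": "attitudes_p",
  "generated": true,
  "scale": {
    "categories": ["Disagree", "Neutral", "Agree"],
    "ordered": true,
    "likert": true,
    "col_prefix": "attitude_"
  },
  "columns": [["pol_interest_p"], ["future_p"]]
}
```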
Worked Example
A complete minimal example lives in .cursor/skills/stk-data-annotations/examples/:
| File | Description |
|---|---|
| `example_web_meta.json` | WEB survey annotation — 7-point attitudes, topk, maxdiff |
| `example_cati_meta.json` | CATI survey annotation — 5-point attitudes (same questions) |
| `example_web_data.csv` | 60-row synthetic raw data for WEB |
| `example_cati_data.csv` | 40-row synthetic raw data for CATI |
| `example_census.csv` | 30-row census cross-tab (gender × education × age_group) |
Key patterns demonstrated:
- Demographics aligned with census:
gender,age_group(viastk.cut_nicetransform),education— category names and age bins matchexample_census.csvexactly. methodcolumn: Synthetic column created viatransform— WEB file produces'web', CATI file produces'cati'; both share"categories": ["web", "cati"].categories: "infer"+translate:party_preference— category order comes from translate dict key order. Translate dicts are identical across both files for alignment.- Likert
_pvariant pattern: WEB has 7-pointattitudesblock (columnspol_interest,future); CATI has 5-pointattitudesblock (columnspol_interest_p,future_p). Both usecol_prefix: "attitude_"so columns land in the same namespace. generated: truefor alignment: WEB includesattitudes_pblock withgenerated: true— this block has no matching data in the WEB file, but its schema lets the 5-point CATI columns carry through when loading both files together.- TopK with
translate_after:issue_importanceblock uses regexfrom_columns,na_valsto filter unselected items, andtranslate_afterto map numeric regex groups to English names. - MaxDiff with
scale.translate:maxdiffblock (WEB only) usessetindex_column+topics/setsconstants (2 versions × 3 sets of 3 topics).scale.translatemaps Lithuanian topic names to English — this single dict controls both cell values and the output category list. - Colors —
colorsvsquestion_colors:party_colorsconstant is referenced bycolorsonparty_preference(values are parties) and byscale.question_colorson thethermometerblock (columns are parties, so each party gets its brand color when the block is unpivoted into aquestiondimension). Thermometer column names must match theparty_colorskeys.
For more details
- Schema: `salk_toolkit/validation.py` — `DataMeta`, `ColumnMeta`, `ColumnBlockMeta`, `TopKBlock`, `MaxDiffBlock`
- Processing: `salk_toolkit/io.py` — `_process_annotated_data`, `infer_meta`, `_fix_meta_categories`
- Cursor rule: `salk_toolkit/.cursor/rules/data_annotations.mdc`
- Examples: look at recent `*_meta.json` files in the sandbox repo for real-world patterns