EDORA Learn — Methods
Metadata, Documentation, and Data Dictionaries
Data without documentation is a puzzle missing half its pieces. Metadata—data about data— explains what each field means, where it came from, and how it changed over time. Without it, even well-built datasets decay into confusion.
What Metadata Contain
- Variable names and labels: What each field represents and its coding rules.
- Data types and formats: Whether values are numeric, categorical, or text, and any units or date structures.
- Provenance: When, how, and by whom the data were collected or modified.
- Quality notes: Flags for missing values, suppression, or estimation methods.
- Versioning: Change logs tracking schema updates or field redefinitions.
Data Dictionaries and Codebooks
A data dictionary is a formal table of variables, their descriptions, valid values, and data types. A codebook goes further, explaining coding conventions, derived variables, and source documents. These tools make a dataset readable by anyone years after its creation.
Version Control and Provenance
- Version tracking: Each dataset release should carry a version number and date.
- Changelog: A record of added, removed, or renamed fields prevents silent breaks in analysis.
- Source lineage: Every variable should point back to its raw source or transformation rule.
Documentation as Reproducibility
Reproducibility depends not only on shared code but on shared meaning. Two analysts using the same dataset but different interpretations of “exit date” or “placement type” will produce contradictory results. Comprehensive metadata is the common language that prevents this drift.
Good Documentation Habits
- Store metadata alongside the dataset, not in separate personal files.
- Include definitions for derived variables and formulas for computed fields.
- Record decisions about data cleaning and suppression.
- Archive old versions instead of overwriting them.
Data & Methods
The research file highlights that many state and local systems lack formal metadata. Fields are renamed, recoded, or dropped without trace, forcing every new analyst to rediscover definitions. Building standardized dictionaries and publishing them with each data release dramatically improves interpretability and trust.
Related
Transparency note: Every published dataset should include a data dictionary, a changelog, and a provenance statement. Without metadata, even accurate data lose their meaning.