Data Privacy, De-identification, and Disclosure Control

Administrative datasets often contain personal or sensitive information. Protecting privacy requires more than removing names—it involves careful control over what can be inferred or reconstructed from released data.

Direct and Indirect Identifiers

Direct identifiers: Names, social security numbers, addresses, or case IDs that uniquely identify an individual. These are removed or replaced before release.
Indirect identifiers: Combinations of fields—such as age, gender, race, and county—that could identify someone when joined with external data.
Linkage risk: Even de-identified data can be re-identified when matched to other sources; risk depends on the distinctiveness of quasi-identifiers.
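
One rough way to gauge how distinctive a set of quasi-identifiers is: count how many records carry a combination of values that no other record shares. The sketch below is illustrative only. The records, field names, and the helper names qi_key and unique_share are hypothetical, and sample uniqueness is just a simple proxy for linkage risk, not a full risk model.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"age": 34, "gender": "F", "race": "White", "county": "Adams"},
    {"age": 34, "gender": "F", "race": "White", "county": "Adams"},
    {"age": 71, "gender": "M", "race": "Asian", "county": "Baker"},
    {"age": 34, "gender": "M", "race": "White", "county": "Adams"},
]
QUASI_IDENTIFIERS = ("age", "gender", "race", "county")


def qi_key(record, fields=QUASI_IDENTIFIERS):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[f] for f in fields)


def unique_share(rows, fields=QUASI_IDENTIFIERS):
    """Share of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(qi_key(r, fields) for r in rows)
    singles = sum(1 for r in rows if counts[qi_key(r, fields)] == 1)
    return singles / len(rows)


print(unique_share(records))                     # all four fields
print(unique_share(records, ("age", "county")))  # a coarser field set
```

Dropping or coarsening fields usually lowers the unique share, which is the intuition behind the masking techniques that follow.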

De-identification and Masking Techniques

Suppression: Removing or blanking cells whose counts fall below a minimum cell-size threshold.
Top- and bottom-coding: Collapsing extreme values at either end of a range into open-ended categories (e.g., “Age ≥ 17”).
Aggregation: Replacing small counties or program units with regional totals.
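Pseudonymization: Replacing identifiers with random keys or keyed hashes that allow records to be linked without disclosing the original identifier. A plain, unkeyed hash of a guessable identifier (such as a social security number) can be reversed by hashing every possible value.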
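
The four techniques above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any agency's actual procedure: the threshold of 10, the county and region names, the placeholder key, and all function names are hypothetical. For pseudonymization the sketch uses a keyed hash (HMAC-SHA256), which is one possible choice among several.

```python
import hashlib
import hmac

MIN_CELL_SIZE = 10   # assumed threshold; agencies set their own
AGE_TOP_CODE = 17    # mirrors the age example in the list above
SECRET_KEY = b"placeholder-key-store-the-real-one-securely"  # placeholder only


def suppress_small_cells(counts, threshold=MIN_CELL_SIZE):
    """Replace counts below the threshold with None (a suppressed cell)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def top_code_age(age, cap=AGE_TOP_CODE):
    """Collapse ages at or above the cap into one open-ended category."""
    return f"{cap}+" if age >= cap else str(age)


def aggregate_to_region(counts, region_of):
    """Sum county counts into regional totals using a county-to-region map."""
    totals = {}
    for county, n in counts.items():
        region = region_of[county]
        totals[region] = totals.get(region, 0) + n
    return totals


def pseudonymize(identifier, key=SECRET_KEY):
    """Derive a stable pseudonym: same identifier and key give the same output."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical inputs for illustration.
county_counts = {"Adams": 42, "Baker": 3, "Clark": 15}
region_of = {"Adams": "North", "Baker": "North", "Clark": "South"}

print(suppress_small_cells(county_counts))
print(aggregate_to_region(county_counts, region_of))
print(top_code_age(16), top_code_age(19))
print(pseudonymize("CASE-0001")[:12])
```

Note that a suppressed cell can sometimes be recovered by subtraction when row or column totals are published alongside it, so suppression rules usually have to consider the whole table rather than one cell at a time.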

Statistical Disclosure Control

Beyond manual suppression, statistical disclosure control (SDC) quantifies and manages re-identification risk using formal models. Approaches include:

k-Anonymity: Ensures each record is indistinguishable from at least k - 1 others on the quasi-identifiers, so that every combination of quasi-identifier values is shared by at least k records.
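l-Diversity and t-Closeness: Strengthen k-anonymity by constraining the sensitive attribute within each anonymized group. l-Diversity requires at least l well-represented sensitive values per group; t-closeness requires each group's distribution of the sensitive attribute to lie within distance t of its distribution in the full dataset.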
Differential privacy: Adds controlled noise to query results, bounding the information that can be inferred about any individual.
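
As a rough sketch of how these measures can be computed, the code below groups records into equivalence classes, reports the k and the (distinct) l that a small table satisfies, and adds Laplace noise to a count. Everything here is an assumption made for illustration: the rows, the field names, and the function names are hypothetical, "distinct l-diversity" is only the simplest variant of l-diversity, and the noise sampler is not hardened the way a vetted differential privacy library would be.

```python
import random
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest class: the largest k the table satisfies."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in classes.values())


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())


def laplace_count(true_count, epsilon, rng=random):
    """Noisy count via the Laplace mechanism (a count has sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_count + noise


# Hypothetical table: two quasi-identifiers and one sensitive field.
rows = [
    {"age_band": "30-39", "county": "Adams", "diagnosis": "A"},
    {"age_band": "30-39", "county": "Adams", "diagnosis": "B"},
    {"age_band": "30-39", "county": "Adams", "diagnosis": "A"},
    {"age_band": "40-49", "county": "Baker", "diagnosis": "C"},
    {"age_band": "40-49", "county": "Baker", "diagnosis": "C"},
]
qi = ("age_band", "county")

print(k_anonymity(rows, qi))                        # smallest class size
print(distinct_l_diversity(rows, qi, "diagnosis"))  # least diverse class
print(laplace_count(len(rows), epsilon=1.0))        # varies from run to run
```

In this table the second class satisfies 2-anonymity but holds only one diagnosis, so anyone known to be in that class has their diagnosis revealed. That homogeneity problem is what l-diversity is meant to catch. For the noisy count, a smaller epsilon means a larger noise scale and stronger protection.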

Balancing Utility and Confidentiality

Every protection step trades detail for safety. Aggregation reduces privacy risk but can mask small-area patterns; noise addition preserves privacy but complicates interpretation. Analysts document these trade-offs so users understand what the data can—and cannot—support.

Data & Methods

The research text details how agencies vary in privacy practice. Some rely solely on suppression; others use formal SDC frameworks or synthetic microdata. Regardless of method, the principle remains: disclose only what cannot reasonably identify a person, and always document the protection rules applied.

Transparency note: Release notes should specify suppression thresholds, anonymization methods, and privacy tests used. Hidden rules create the illusion of precision where confidentiality was the real constraint.
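
One way to make such release notes hard to omit is to publish them as a small structured record alongside the data. The sketch below is only an illustration: the ReleaseNotes class, its field names, the dataset name, and the threshold value are all hypothetical, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReleaseNotes:
    """Hypothetical record of the protection rules applied to one release."""
    dataset: str
    min_cell_size: int
    methods: list = field(default_factory=list)
    privacy_tests: list = field(default_factory=list)


notes = ReleaseNotes(
    dataset="example_caseload_2023",  # hypothetical dataset name
    min_cell_size=10,                 # assumed suppression threshold
    methods=[
        "small-cell suppression",
        "age top-coding",
        "county-to-region aggregation",
    ],
    privacy_tests=["k-anonymity check on age, gender, race, county"],
)

# Serialize so the notes can ship next to the released file.
print(json.dumps(asdict(notes), indent=2))
```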