Short Definition: The process of replacing personally identifiable information with a pseudonym.
Extended Definition: The process of replacing personally identifiable information with a pseudonym. Identifying fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. This can be done either with or without the possibility of re-identifying the subject of the data (reversible or irreversible). It allows for data on the same subject to be linked across data records without revealing the identities.
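To make the definition concrete, here is a minimal sketch of one way to replace identifying fields with a pseudonym while keeping records on the same subject linkable. It is an illustration under stated assumptions, not a recommended or standard implementation: the field names, the sample records, the `P-` prefix, and the key are all hypothetical, and the keyed hash (HMAC-SHA256) is just one possible technique. No lookup table is kept, so nothing maps a pseudonym back to a person directly, but anyone holding the key and a list of candidate identities could recompute the pseudonyms.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; not a real key.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical names of the identifying fields in a record.
DIRECT_IDENTIFIERS = ("name", "health_number")


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable artificial identifier from an identifying value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:16]


def pseudonymize(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Drop the identifying fields and add a single pseudonym in their place."""
    subject = "|".join(str(record[field]) for field in DIRECT_IDENTIFIERS)
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["subject_id"] = pseudonym(subject, key)
    return out


# Made-up records: two for the same subject, one for a different subject.
visit_1 = {"name": "Ada Example", "health_number": "0000-000-001", "diagnosis": "A"}
visit_2 = {"name": "Ada Example", "health_number": "0000-000-001", "diagnosis": "B"}
other = {"name": "Bo Example", "health_number": "0000-000-002", "diagnosis": "A"}

p1, p2, p3 = (pseudonymize(r) for r in (visit_1, visit_2, other))
```

Here `p1` and `p2` share a `subject_id`, so the two visits can still be linked, while neither record carries the name or health number.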
Rather than adding pseudonymization as a new term, I suggest adding it as a synonym of “de-identification.” Also add “de-personalization” as a synonym of “de-identification.”
In addition, based on prior discussions (thank you, Matt), I suggest adding the following new terms and definitions to IRiDiuM. These terms should also be listed as “related terms” under “de-identification”:
Directly identifying information: Information that identifies a specific individual through direct identifiers (e.g., name, social insurance number, personal health number).
Indirectly identifying information: Information that can reasonably be expected to identify an individual through a combination of indirect identifiers (e.g., date of birth, place of residence or unique personal characteristic).
Coded information: Direct identifiers have been removed from the information and replaced with a code. Depending on access to the code, it may be possible to re-identify specific participants (e.g., the principal investigator retains a list that links the participants’ code names with their actual name so data can be re-linked if necessary).
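The retained list in the example above can be sketched as follows. This is only an illustration of the idea of coded information: the `CodeList` class, the `C-` code format, and the sample record are hypothetical, and a real study would need to store the list securely and separately from the coded data.

```python
import secrets


class CodeList:
    """Hypothetical key list linking participant names to random codes."""

    def __init__(self) -> None:
        self._code_by_name: dict[str, str] = {}
        self._name_by_code: dict[str, str] = {}

    def code_for(self, name: str) -> str:
        """Return the participant's code, creating a new random one if needed."""
        if name not in self._code_by_name:
            code = "C-" + secrets.token_hex(4)
            while code in self._name_by_code:  # avoid reusing an existing code
                code = "C-" + secrets.token_hex(4)
            self._code_by_name[name] = code
            self._name_by_code[code] = name
        return self._code_by_name[name]

    def reidentify(self, code: str) -> str:
        """Re-link a code to a name; only possible with access to the list."""
        return self._name_by_code[code]


def code_record(record: dict, code_list: CodeList) -> dict:
    """Remove the direct identifier and replace it with the participant's code."""
    coded = {k: v for k, v in record.items() if k != "name"}
    coded["code"] = code_list.code_for(record["name"])
    return coded


key_list = CodeList()
coded = code_record({"name": "Ada Example", "response": "yes"}, key_list)
```

Someone who sees only `coded` has no name to work from, while whoever holds `key_list` can call `key_list.reidentify(coded["code"])` to re-link the data.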
Anonymous information: Information that never had identifiers associated with it (e.g., anonymous surveys) and risk of identification of individuals is low or very low.
I’m starting with one of the few topics where I disagree with Claire.
I don’t think we can call anonymization, de-identification, and pseudonymization synonyms. Anonymization and pseudonymization are both forms of de-identification, but they vary in the extent to which identifying information is removed. Pseudonymization is often reversible by the original owner of the dataset, for instance, while anonymization is not.
Comments from the last iteration support this: anonymization is a “rigorous process in accordance with a standard”.
The inclusion of the additional 4 terms (from Canada’s TCPS2) further emphasizes the difference: pseudonymization will produce coded information, which is defined as being distinct from anonymous information.
Research ethics is one area where this has practical implications. If I see a research ethics application that commits to making the results public, I want to see some very precise language about how that data will be perturbed prior to release. And rather than everyone writing pages of definitions, I’d like them to cite IRiDiuM or similar. Supporting the practice of publishing research data, especially in SSH, requires being mindful of this requirement.
It also occurs to me I should flag that “de-anonymization” is a far less common term than “re-identification.” (In addition to my own experience, Google Trends scores the two 29-4 in favour of re-identification: https://g.co/trends/mZf3n.)
I’m not sure how to flag this, since neither term is part of this review, but de-anonymization was mentioned above. I would at minimum suggest listing re-identification as a synonym, but ideally making re-identification the term and de-anonymization the synonym.