UBC Graduate Research

De-Identification Berrada, Salma; Hall, Shirlett; McCollor, Vivian; Tippets, Caroline Nov 24, 2017

De-Identification
Salma Berrada, Shirlett Hall, Vivian McCollor, Caroline Tippets
LIBR 559S Research Data Management
Instructor: Eugene Barsky
November 24, 2017

De-identification
1. minimally perturbing individual-level data to decrease the probability of discovering an individual’s identity.
  ○ masking direct identifiers
  ○ transforming indirect identifiers
  ○ defensible, repeatable, and auditable process
  ○ goal: very small risk of re-identification
2. the use of one or more techniques designed to make it impossible or difficult to identify an individual.
  ○ to protect the privacy of the individual
  ○ to make it legal for governments and businesses to share their data without permission
For the full definition, see the CASRAI Glossary: http://dictionary.casrai.org/De-identification. Available under Creative Commons Attribution 4.0 International License.

Anonymization
● Encrypting or removing personally identifiable information from data sets.
● Intent is privacy protection.
● Irreversible severing of identifying information.
● No trail of individual records.
● Prevents any future re-identification of the data contributor by anyone under any circumstances.
● Indirect re-identification may still occur (de-anonymization).

De-Anonymization
De-anonymization is a reverse engineering process in which de-identified data are cross-referenced with other data sources to re-identify the personally identifiable information. This could occur if a de-identification process had not been successfully performed, or had not been undertaken in the first place.
See CASRAI, http://dictionary.casrai.org/De-anonymization. Content is available under Creative Commons Attribution 4.0 International License.

Pseudonymization
● Substitutes the identifiable data with a reversible, consistent value.
● A single pseudonym can replace another field or a collection of fields.
● Initial information can be restored to a pseudonymized data set.
● The individual can also be re-identified indirectly.

Field | Original | After pseudonymization | After re-identification
Name | John Doe | Alex Smith | John Doe
ID | 12345 | 83502 | 12345
Address | 142 North Apple Lane, Orchard City, AB | 1 Random St., Cityville, CD | 142 North Apple Lane, Orchard City, AB
Test | Test 24 | Test 24 | Test 24
Request Date | 2017-11-24 | 2017-11-24 | 2017-11-24
Admission Reason | Fever | Fever | Fever

Confidentialization
Two common techniques are top-coding and data aggregation:
● Top coding (and bottom coding) prevents outliers from being identified.
● Data aggregation combines several categories into one to reduce the level of detail.
Sources: https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php and https://census.ukdataservice.ac.uk/use-data/guides/aggregate-data

Infographic from the Future of Privacy Forum: https://fpf.org/wp-content/uploads/2017/06/FPF_Visual-Guide-to-Practical-Data-DeID.pdf

De-identification Techniques for Tabular Data

Data Presentation Format | Attribute Type | Identifier Type | Possible Technique | Example/Description
Tabular | Numeric | Direct | Random addition of noise | Add or subtract a random number from social security number within a defined range
Tabular | Numeric | Indirect | Reduction in detail or aggregation | Convert to age range; express dates relative to milestone
Tabular | Numeric | Geographic | Suppression | Most applicable when dataset has 5 observations or fewer, e.g. in a rural area
Tabular | Text | Direct | Pseudonymization or suppression | Janet replaced by Angela, Chris replaced by Sam
Tabular | Text | Geographic | Reduction in detail | Reduce postal codes to first 3 characters
Tabular | Binary | Indirect | Suppression | Positive/negative, M/F

Table compiled by Shirlett Hall, based on ‘Best Practice’ Guidelines for Managing the Disclosure of De-Identified Health Information, from the Canadian Institute for Health Information, 2010. Available at http://www.ehealthinformation.ca/wp-content/uploads/2014/08/2011-Best-Practice-Guidelines-for-Managing-the-Disclosure-of-De-Identificatied-Health-Info.pdf

Geospatial Data - Point Aggregation
Visualization by Jianting Zhao, Comparison of 4 Point Data Aggregation Methods for Geospatial Analysis, Sept. 27, 2017. https://www.azavea.com/blog/2017/09/27/comparison-of-4-point-data-aggregation-methods-for-geospatial-analysis/

Geospatial Data - Probability Map
Visualization by GIS Map Gallery: Earthquake Probability Map. Data aggregated to county level. Data from Applied Technology Council, National Geophysical Data Center. 1990-1994. Source: http://www.edgetech-us.com/Map/MapEarthqk.htm

Geospatial Data - Jittering
Visualization by Ryan Brideau, Aug. 2, 2016. Mapping with Tableau. https://github.com/Brideau/pokelyzer/wiki/Mapping-with-Tableau

Geospatial Data - Data Point Displacement
Visualization by Morgan McKenzie, 2015. Using ESRI ArcGIS 9.3 Spatial Adjustment. http://slideplayer.com/5267071/17/images/9/Arc+ToolBox+Spatial+Adjustment+Link+Table.jpg

Audio and Image Data
Digital manipulation techniques can be used to remove identifying information...
● voice alteration
● image blurring
However, they can:
● be labour-intensive
● be expensive, and
● negatively impact data utility
Where possible, a better alternative may be to:
● obtain the participant's consent to use and share the data unaltered
● implement additional access controls as necessary
ANDS Deidentification Guide. https://www.ands.org.au/__data/assets/pdf_file/0003/737211/De-identification.pdf

Safe Harbor & Expert Determination
The US is unique in having a law specifying when a dataset is considered de-identified. It sounds so easy, but...
● It’s not enough to protect privacy effectively
● It still leaves room for interpretation: e.g. any identifiable characteristic (?)
● It’s insufficient to determine when the dataset has reached the level of de-identification needed or if the dataset is too risky to share
● It doesn’t take into account the utility of the dataset
Most jurisdictions rely on Expert Determination to interpret issues, context and methods.

Safe Harbor Identifiers
● Name
● Address
● Personal dates
● Telephone numbers
● Fax number
● Email address
● Social Security Number
● Medical record number
● Health plan beneficiary number
● Account number
● Certificate or licence number
● Any vehicle or other device serial number
● Web URL
● Internet Protocol (IP) Address
● Finger or voice print
● Photographic
● Any other uniquely identifiable characteristic

Balancing Privacy vs. Utility
● Weigh the risks and benefits
● Weigh the techniques against the utility of the dataset
● Consider the big picture: data protection, security and use
To protect personal privacy, the amount of de-identification that is required to be applied is proportional to the level of re-identification risk involved in the release of the data set. The higher the re-identification risk of a data release, the greater the amount of de-identification required. [1]
[1] De-identification Guidelines for Structured Data, June 2016. Information and Privacy Commissioner of Ontario. https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf

Issues in De-Identification
Benefits:
● Research & Analysis
● Planning & Evaluation
● Tackling complex social problems
How can you tell if it’s an identifiable characteristic?
1. Replicable
2. Distinguishable
3. Knowable
Is de-identification a workaround for consent?

Disclosure Risks
● External threats
  ○ Identity theft
  ○ Insurance fraud
● Inadvertent leaks
  ○ Violation of privacy, confidentiality
● Security Vulnerabilities
Outcomes beyond a Privacy Breach
● Legal ramifications
● Financial costs
● Reputational costs

The Five Safes
● Safe People: Can the researcher(s) be trusted to use the data in an appropriate manner?
● Safe Projects: Is the data to be used for an appropriate purpose?
● Safe Settings: Does the access environment prevent unauthorized use?
● Safe Data: Is there a disclosure risk in the data itself?
● Safe Outputs: Are the statistical results disclosive?
Icons made by Freepik from www.flaticon.com
Use case - NZ Stats: http://archive.stats.govt.nz/browse_for_stats/snapshots-of-nz/integrated-data-infrastructure/keep-data-safe.aspx
Further Information about the Five Safes: https://ukdataservice.ac.uk/use-data/secure-lab/security-philosophy

Managing identifiable data
Restricted-use collections
● Secure Data Labs
  ○ Data Enclaves
  ○ Secure Research Environments
● Data Linkage
  ○ Trusted Third Parties
    ■ The Separation Principle
    ■ You can’t do it all:
      ● Link data
      ● Grant access
      ● Conduct research

Splitting up the Data
[Diagram labels: Content, Identifiers, Overlap]

Recommended Reading
ANDS Guide to Deidentification, 2017: http://www.ands.org.au/working-with-data/sensitive-data/de-identifying-data
ICPSR Guide to Sharing Data: https://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/chapter5.html
The complete book of data anonymization : from planning to implementation, Balaji Raghunathan.
http://resolve.library.ubc.ca/cgi-bin/catsearch?bid=7277832
Ontario IPC De-identification Guidelines for Structured Data, 2016: https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf
UKAN Anonymisation Decision-Making Framework, 2016: http://ukanon.net/wp-content/uploads/2015/05/The-Anonymisation-Decision-making-Framework.pdf
UK Anonymisation Code of Practice: https://ico.org.uk/media/for-organisations/documents/1061/anonymisation-code.pdf
USA NIST De-Identifying Government Datasets, 2016: https://csrc.nist.gov/CSRC/media/Publications/sp/800-188/draft/documents/sp800_188_draft2.pdf

ARX
Techniques: Generalization, Suppression, Microaggregation
[Figure: Hierarchies for attributes “age” and “sex”]
[Figure: Example dataset with quasi identifiers “age”, “gender” and “zip code” as well as a two-anonymous transformation]

ARX Implemented de-identification workflow
Import Data
CONFIGURE
  1. Create and edit hierarchies
  2. Define criteria
  3. Configure transformation
EXPLORE
  1. Filter & search
  2. Save and compare transformations
ANALYZE
  1. Compare with original data
  2. Analyze properties
Export Data

The Main Graphical User Interface of ARX
1. Input data set
2. Specify attribute metadata & view generalization hierarchies
3. Configuration of privacy models
4. Configuration of utility measures
5. Provides methods for extracting a research sample

Additional Resources
Related Software
sdcMicro
● Cross-platform open source software
● Collection of a set of methods, not an integrated application
● Different types of recoding models and risk models
● Minimalistic graphical user interface
µArgus
● More comprehensive user interface than sdcMicro
● Closed source software for MS Windows
● Development has ceased
PARAT
● Commercial tool for MS Windows
● Centered around a risk-based approach
● Methods implemented overlap with methods implemented in ARX
● Powerful graphical interface

ARX
Project website: http://arx.deidentifier.org
Code repository: https://github.com/arx-deidentifier/arx
Specific limitations:
ARX is not a tool for masking identifiers in unstructured data. For such methods, check out MITdeid or the NLM Scrubber.
ARX implements in-memory data management: the de-identified dataset must fit in a machine’s main memory.
General limitations:
There is no single measure that is able to protect datasets from all possible threats, especially not while being flexible enough to support all usage scenarios.

References
Elliot, M., Mackey, E., O’Hara, K., & Tudor, C. (2016). The Anonymisation Decision-Making Framework. Retrieved from http://ukanon.net/wp-content/uploads/2015/05/The-Anonymisation-Decision-making-Framework.pdf
El Emam, et al. (2012). De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset. Journal of Medical Internet Research, 14(1), e33. http://doi.org/10.2196/jmir.2001
Health System Use Technical Advisory Committee. (2010, October 31). ‘Best Practice’ Guidelines for Managing the Disclosure of De-Identified Health Information. Retrieved from http://www.ehealthinformation.ca/wp-content/uploads/2014/08/2011-Best-Practice-Guidelines-for-Managing-the-Disclosure-of-De-Identificatied-Health-Info.pdf
Information and Privacy Commissioner of Ontario. (2016). De-identification Guidelines for Structured Data. Retrieved from https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf
Morin, J. (2010). De-identifying Patient Data, Part 2. Retrieved from http://caristix.com/blog/2010/12/de-identifying-patient-data-part-2/
Neamatullah, I., Douglass, M. M., Lehman, L.-w. H., Reisner, A., Villarroel, M., Long, W. J., Szolovits, P., Moody, G. B., Mark, R. G., & Clifford, G. D. (2008). Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8(1). https://doi.org/10.1186/1472-6947-8-32
Yoose, B. (2017). Balancing Privacy and Strategic Planning Needs: A Case Study in De-Identification of Patron Data. Journal of Intellectual Freedom and Privacy, 2(1). http://dx.doi.org/10.5860/jifp.v2i1.6250
Zand, M. (2014, February 18). Geospatial Data and HIPAA. Retrieved from https://bigdatamedsci.com/2014/02/18/geospatial-data-and-hipaa/
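
Illustrative sketch: Pseudonymization
The Pseudonymization slide describes substituting identifiable data with a reversible, consistent value. The Python sketch below is one minimal way to picture that idea, not the authors' method. The Pseudonymizer class, its method names, the "P-" token format, and the second sample record are all assumptions made up for illustration. The lookup table it keeps is what makes re-identification possible, so in practice it would have to be stored separately from the released data.

```python
import secrets


class Pseudonymizer:
    """Hypothetical helper: consistent, reversible substitution of identifier values."""

    def __init__(self):
        self._forward = {}  # original value -> pseudonym
        self._reverse = {}  # pseudonym -> original value (the re-identification key)

    def pseudonymize(self, value):
        # Reuse an existing pseudonym so the same input always maps to the same output.
        if value not in self._forward:
            token = "P-" + secrets.token_hex(4)
            while token in self._reverse:  # retry on the (very unlikely) collision
                token = "P-" + secrets.token_hex(4)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, pseudonym):
        # Only possible for whoever holds the lookup table.
        return self._reverse[pseudonym]


# Sample records: the first mirrors the slide's example, the second is invented.
records = [
    {"name": "John Doe", "id": "12345", "admission_reason": "Fever"},
    {"name": "John Doe", "id": "12345", "admission_reason": "Cough"},
]

p = Pseudonymizer()
masked = [
    {**r, "name": p.pseudonymize(r["name"]), "id": p.pseudonymize(r["id"])}
    for r in records
]
```

Both masked records carry the same pseudonyms, which keeps them linkable for analysis, while non-identifying fields such as the admission reason are left untouched.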

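Illustrative sketch: Reduction in detail and top coding
The tabular-data table lists converting ages to ranges and reducing postal codes to their first 3 characters, and the Confidentialization slide mentions top coding. The functions below are a rough Python sketch of those three transformations under simple assumptions: the function names, the 10-year range width, the income ceiling, and the sample row are hypothetical and not taken from the slides.

```python
def age_to_range(age, width=10):
    """Generalize an exact age to a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_postal_code(code, keep=3):
    """Keep only the first `keep` characters of a postal code."""
    return code.replace(" ", "")[:keep].upper()


def top_code(value, ceiling):
    """Cap values above `ceiling` so extreme outliers are not distinguishable."""
    return min(value, ceiling)


# Hypothetical input row, for illustration only.
row = {"age": 37, "postal_code": "V6T 1Z4", "income": 250000}

generalized = {
    "age": age_to_range(row["age"]),
    "postal_code": truncate_postal_code(row["postal_code"]),
    "income": top_code(row["income"], ceiling=150000),
}
```

With these assumed settings the row becomes age "30-39", postal code "V6T", and income 150000. How coarse the ranges and ceilings should be depends on the re-identification risk and on how much utility the dataset needs to keep.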

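Illustrative sketch: Checking for a two-anonymous transformation
The ARX slides refer to an example dataset with quasi identifiers "age", "gender" and "zip code" and a two-anonymous transformation of it. The sketch below shows, in plain Python, what such a check might look like: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. This does not use or reproduce the ARX API. The function names, the sample records, and the generalization rules (10-year age ranges, 3-digit zip prefixes) are assumptions for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)


def generalize(row):
    """Assumed generalization: 10-year age ranges and 3-digit zip prefixes."""
    low = (row["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",
        "gender": row["gender"],
        "zip": row["zip"][:3] + "**",
    }


# Invented sample records, not the dataset shown on the slide.
original = [
    {"age": 34, "gender": "male", "zip": "81667"},
    {"age": 36, "gender": "male", "zip": "81675"},
    {"age": 45, "gender": "female", "zip": "81925"},
    {"age": 48, "gender": "female", "zip": "81931"},
]
quasi = ["age", "gender", "zip"]

k_before = k_anonymity(original, quasi)
k_after = k_anonymity([generalize(r) for r in original], quasi)
```

Here every original record is unique (k_before is 1), while after generalization each record shares its quasi-identifier values with one other record (k_after is 2), which is what "two-anonymous" means.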