Agenda
What is Population Informatics?
Social Genome
Privacy Challenges
Data Access
Case Study
Concerned with groups rather than individuals, as in Population Health Informatics (e.g. migration patterns) and Economics (e.g. employment patterns)
Social Genome?
Today, nearly all of our activities from birth until death leave digital traces in large databases.
Like the human genome, the social genome has much information buried in massive, almost chaotic data.
Population Informatics?
The burgeoning field of population informatics: the systematic study of populations via secondary analysis of massive data collections (termed "big data") about people.
Agenda
What is Population Informatics?
Social Genome
Data Science
Privacy Challenges
Data Access
Case Study
The marketing view: freedom to send junk mail and make calls. Thus, getting more information to better target people is acceptable and should be allowed.
Internet: crowdsourced auditing?
Logs & audits: what to log, and how to keep a tamperproof log
Contextual Integrity: easier to maintain in research than in business or surveillance, since there is a confidential relationship between the researcher and the subjects of the data.
Integrity: correctness. The cost of incorrect information is incorrect decisions.
Availability: access.
Privacy-by-Design
A different perspective on privacy and research using personal data
Design Principles
Agenda
What is Population Informatics?
Social Genome
Data Science
Privacy Challenges
Data Access
Case Study
(Workflow diagram: Raw Data → Decision)
Both extremes have major limitations.
Open Access: must not be sensitive data; no micro data (individual-level data).
Restricted Access: very high barrier to use; applicable to only limited situations.
We need better access models that can balance privacy protection and usability.
Proposed Model (red arrows show data workflow)
Raw Data → Data Preparation:
Select needed data (columns & rows) based on intended use
Integrate data from different sources
Clean data to remove errors
Publication: publish products (papers, results, data) for public use
Decisions
The researcher often has little or no control over data preparation; lack of P(linkage) gives the researcher limited ability to control for linkage errors in the analysis.
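The three preparation steps above can be sketched in plain Python; the field names and example rows are illustrative assumptions, not the actual data.

```python
# Minimal sketch of the data-preparation steps: select, integrate, clean.
# Field names (pid, birth_year, county, dx) are invented for illustration.

def prepare(records, hospital):
    # 1. Select needed data (columns & rows) based on intended use
    cohort = [{"pid": r["pid"], "county": r["county"]}
              for r in records if r["birth_year"] >= 1980]
    # 2. Integrate data from different sources (join on person id)
    diagnosis = {h["pid"]: h["dx"] for h in hospital}
    merged = [dict(c, dx=diagnosis.get(c["pid"])) for c in cohort]
    # 3. Clean data to remove errors (here: drop rows with no match)
    return [m for m in merged if m["dx"] is not None]

records = [{"pid": 1, "birth_year": 1985, "county": "A"},
           {"pid": 2, "birth_year": 1970, "county": "B"},
           {"pid": 3, "birth_year": 1990, "county": "A"}]
hospital = [{"pid": 1, "dx": "flu"}, {"pid": 9, "dx": "cold"}]
print(prepare(records, hospital))  # → [{'pid': 1, 'county': 'A', 'dx': 'flu'}]
```

In practice each step is done by a data custodian; the point of the proposed model is to make these choices explicit and auditable.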
Proposed Model

Restricted Access: Data Enclave (Lane 2008). Remote access via VPN & user authentication; all use on the computer is monitored; locked-down Virtual Machine (VM); data channels are blocked/monitored; user must select from a catalog of approved analysis software. Current common practice (brown arrows show data workflow). Examples: U Chicago-NORC, UNC-Tracs (CTSA), UCSD-iDASH.

Monitored Access. Input: Decoupled Micro Data, with PII separated from sensitive data (e.g., Social Security number separated from cancer status via encryption). Remote access via VPN on a locked-down VM; user must select from a catalog of approved analysis software. Use for datasets requiring more privacy protection (e.g., having a higher potential for exposing individuals or institutions to harm).

Controlled Access. Input: De-identified Data. Mechanisms for de-identifying data include dropping PII from customized data and/or aggregating individual data into groups. Lack of P(linkage) gives the researcher limited ability to control for linkage errors in the analysis.

Open Access (Publication): data changed to limit disclosure. Methods: data masking, generalization, summation, data simulation. Access via the WWW or download to desktop; no use restrictions; no monitoring of use.
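The decoupling idea above (PII held apart from sensitive attributes, joined only through a protected key) can be sketched as follows; the keyed-hash pseudonym, the secret holder, and the field names are assumptions for illustration, not the system's actual encryption scheme.

```python
import hashlib
import hmac

# Assumed: a secret held by an honest broker, never given to researchers.
SECRET = b"held-by-honest-broker"

def pseudonym(ssn: str) -> str:
    # Keyed hash of the identifier: without SECRET, the SSN cannot be
    # recovered or re-derived from the key.
    return hmac.new(SECRET, ssn.encode(), hashlib.sha256).hexdigest()

def decouple(rows):
    # Split each record into an identity table and a sensitive table,
    # linked only by the pseudonymous key.
    identity = [{"key": pseudonym(r["ssn"]), "name": r["name"]}
                for r in rows]
    sensitive = [{"key": pseudonym(r["ssn"]), "cancer_status": r["cancer_status"]}
                 for r in rows]
    return identity, sensitive

rows = [{"ssn": "123-45-6789", "name": "Ann", "cancer_status": "positive"}]
identity, sensitive = decouple(rows)
assert "cancer_status" not in identity[0] and "ssn" not in sensitive[0]
```

Neither table alone exposes both who a person is and what their sensitive status is; only the key custodian can relink them.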
Full IRB: low risk of analytics attack; low risk of linkage attack.
Exempt IRB: high risk of analytics attack; high risk of linkage attack.
Summary: workflow from Data to Decision
The start: write up a research plan covering what data you need and what you want to do with them; then determine access levels for each dataset.
Sample selection, attribute selection, data integration (access to PII), and some data cleaning. Requires full IRB.
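Since the deck stresses carrying P(linkage) through to the analysis, the data-integration step can be sketched as toy probabilistic record linkage; the agreement weights and fields below are invented for illustration and are not the actual linkage method.

```python
import math

# Toy probabilistic record linkage: score agreement on quasi-identifiers,
# then squash the score to a linkage probability that can travel with the
# linked pair into the analysis. Weights are invented for illustration.

def linkage_probability(a, b):
    score = 0.0
    score += 2.0 if a["last_name"] == b["last_name"] else -1.0
    score += 1.5 if a["birth_year"] == b["birth_year"] else -1.0
    score += 1.0 if a["zip"] == b["zip"] else -0.5
    return 1 / (1 + math.exp(-score))  # logistic squash to (0, 1)

a = {"last_name": "Smith", "birth_year": 1980, "zip": "27514"}
b = {"last_name": "Smith", "birth_year": 1980, "zip": "27510"}
print(round(linkage_probability(a, b), 2))  # → 0.95
```

Publishing this probability alongside each linked record is what lets the researcher control for linkage errors downstream.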
Only approved software; remote access via VPN. Very effective against threats from HBC (honest-but-curious) adversaries.
Gary King. Ensuring the Data-Rich Future of the Social Sciences. Science, vol. 331, 2011, pp. 719-721.
Anyone can publish; disclosure limitation methods act as a filter on publications.
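A toy disclosure-limitation filter combining two of the methods listed earlier, masking and generalization; the fields, the 10-year age bands, and the 3-digit ZIP truncation are illustrative assumptions.

```python
# Toy disclosure-limitation filter: mask direct identifiers, generalize
# quasi-identifiers. All field choices are invented for illustration.

def limit_disclosure(row):
    # Masking: drop direct identifiers entirely
    out = {k: v for k, v in row.items() if k not in {"name", "ssn"}}
    # Generalization: coarsen exact age into a 10-year band
    low = row["age"] // 10 * 10
    out["age_band"] = f"{low}-{low + 9}"
    del out["age"]
    # Generalization: truncate ZIP code to the first 3 digits
    out["zip3"] = row["zip"][:3]
    del out["zip"]
    return out

record = {"name": "Ann", "ssn": "123", "age": 37, "zip": "27514", "flu": True}
print(limit_disclosure(record))
# → {'flu': True, 'age_band': '30-39', 'zip3': '275'}
```

Real filters also bound re-identification risk across the whole release (e.g. via aggregation or simulation), not just per record.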
Deployed together, the four data access models can provide a comprehensive system for privacy protection, balancing the risk and usability of secondary data in population informatics research.
Usability & risk comparison of the access models:

U1.1 Software: Restricted Access: only preinstalled data integration & tabulation software (SW), no query capacity. Monitored Access: requested and approved statistical software only. Open Access: any software.
U1.2 Outside data: Restricted Access: no outside data allowed. Monitored Access: only preapproved outside data allowed, but no PII data. Open Access: any data.
U2 Remote access: Restricted Access: no remote access. Monitored Access: remote access. Open Access: remote access.
R1 Cryptographic attack: Restricted Access: very low risk. Monitored Access: low risk (would have to break into the VM). Open Access: high risk.
R2 Data leakage: Restricted Access: very low risk; physical leakage only (memorize data, take a picture of the monitor). Monitored Access: electronically take data off the system. Open Access: NA.
Agenda
What is Population Informatics?
Social Genome
Data Science
Privacy
Data Access
Case Study
Case Study

Workflow comparison (Conventional System vs. Proposed System) at two stages: Data Preparation (Data Integration and Selection) and Analysis of Micro (Person-Level) Data.

Access Model: Data Preparation: Conventional: Indirect Access via Health Dept.; Proposed: Direct Monitored Access. Analysis: Conventional: Restricted Access; Proposed: Controlled Access.
Access: Data Preparation: Conventional: no direct access to data; Proposed: direct access to data. Analysis: remote direct access for authorized users.
Type of Data: Data Preparation: Conventional: multiple identifiable microdata tables; Proposed: decoupled microdata with P(linkage). Analysis: de-identified integrated microdata.
Thank you!
Questions?
Population Informatics Research Group
http://pinformatics.web.unc.edu/