
Privacy-by-Design: Understanding Data Access Models for Secondary Data

Hye-Chung Kum, Stanley Ahalt

University of North Carolina at Chapel Hill
Population Informatics Research Group
http://pinformatics.web.unc.edu/

Agenda

What is Population Informatics?
Social Genome
Privacy Challenges
Data Access
Case Study

AMIA: Population Informatics


Public Health Informatics is the application of informatics in areas of public health, including surveillance, reporting, and health promotion. Public health informatics, and its corollary, population informatics, are concerned with groups rather than individuals. Public health is extremely broad ... etc

Population Informatics
1. Population Research: in Demography (e.g. migration patterns), in Economics (e.g. employment patterns)
2. Public Health Informatics (Population Health Informatics): concerned with groups rather than individuals

Social Genome?
Today, nearly all of our activities from birth until death leave digital traces in large databases.
Together, these digital traces capture the footprints of our society: our social genome.
Like the human genome, the social genome has much buried in massive, almost chaotic data.
If properly analyzed and interpreted, the social genome could offer crucial insights into many of the most challenging problems facing our society (e.g. affordable and accessible quality healthcare, economics, education, employment, and welfare).

Population Informatics?
The burgeoning field of population informatics: the systematic study of populations via secondary analysis of massive data collections (termed big data) about people.
In particular, health informatics analyzes electronic health records to improve health outcomes for a population.
Challenges have constrained the use of person-level data (micro data) in research:
Privacy
Data Access
Data Integration
Data Management

Social Issues: Balance between
Individual privacy
Secrecy does not work very well: accuracy of data
Cost of integrity of data
Incorrect analysis can lead to wrong decisions
Organization transparency & accountability
Freedom of speech
Marketing is freedom to express why one should prescribe certain drugs
Marketing is freedom to send junk mail & call
Thus, getting more information to better target is acceptable and should be allowed

Approaches to privacy protection
Secrecy: Hiding information
Instinctive, common sense approach
In reality, has limited power to protect privacy
When someone is out to find out, there are multiple avenues, and the cost of trying to lock down data is too high for the limited protection it can provide
Very high cost related to accuracy of data, use of data for legitimate reasons, transparency
Information Transparency & Accountability
Disclosure: Declared in writing, so when something goes wrong the right people are held accountable (data use agreements)
Internet: crowdsourced auditing?
Logs & audits: what to log, how to keep a tamperproof log
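The question of how to keep a log tamper-evident can be illustrated with a hash chain, where each entry's hash covers the previous entry's hash. This is a minimal sketch of one common technique, not a method described in the slides; the record fields (`user`, `action`, `table`) and the `GENESIS` constant are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain (illustrative)


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Append a record, chaining it to the hash of the last entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": entry_hash(prev_hash, record)})


def verify_log(log: list) -> bool:
    """Recompute the chain; an edited earlier entry no longer matches."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical audit records.
audit_log = []
append_entry(audit_log, {"user": "analyst_1", "action": "query", "table": "claims"})
append_entry(audit_log, {"user": "analyst_1", "action": "export", "table": "claims"})
intact = verify_log(audit_log)

# Editing an earlier record is detected on verification.
audit_log[0]["record"]["action"] = "read"
tampered_ok = verify_log(audit_log)
print(intact, tampered_ok)
```

A chain like this only detects edits to existing entries. Dropping entries from the end of the log goes unnoticed unless the latest hash is also stored somewhere the log writer cannot change.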

Privacy expectations for doing research
Contextual Integrity
Easier than business, surveillance
Confidential relationship between researcher and subjects of the data
In secondary data analysis, no contact
IRB approvals based on (Belmont Report)
Benefit to society
Risk of harm
To individuals: privacy violation & inaccurate results
To society: inaccurate analysis resulting in wrong decisions

Information system requirements
Federal Information Security Management Act of 2002 (FISMA)
Integrity: correctness
Cost of incorrect information leading to incorrect decisions
Confidentiality: sharing information
Sharing what information with whom for what purposes
Availability: access

Privacy-by-Design
A different perspective on privacy and research using personal data
Personal data is delicate/hazardous/valuable
Important to have proper systems in place that give protection but allow for continued research in a safe manner
All hazardous materials need standards
Safe environments to handle them in: closed computer server system lab
Proper handling procedures: what software is allowed to run on the data
Safe containers to store them: DB system

Design Principles
First, the Minimum Necessary Standard states that maximum privacy protection is provided when the minimum information needed for the task is accessed at any given time.
Second, the Maximum Usability Principle states that data are most usable when access to the data is least restrictive (i.e. direct remote access is most usable).
Based on common activities in the workflow, design systems that have maximum access to the minimum amount of information required
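The Minimum Necessary Standard can be sketched as a projection and filter applied before an analyst ever sees the data. The example below is illustrative only: the function name `minimum_necessary`, the field names, and the sample records are all hypothetical and not taken from the slides.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def minimum_necessary(
    rows: List[Row],
    needed_columns: List[str],
    row_filter: Callable[[Row], bool],
) -> List[Row]:
    """Return only the rows and columns the task needs."""
    return [
        {col: row[col] for col in needed_columns}
        for row in rows
        if row_filter(row)
    ]


# Hypothetical person-level records (all values are made up).
records = [
    {"name": "A. Example", "ssn": "000-00-0001", "age": 34,
     "county": "Orange", "diagnosis": "AML"},
    {"name": "B. Example", "ssn": "000-00-0002", "age": 52,
     "county": "Wake", "diagnosis": "Hodgkin lymphoma"},
]

# The task needs age and diagnosis for one county, so nothing else is exposed.
view = minimum_necessary(
    records,
    needed_columns=["age", "diagnosis"],
    row_filter=lambda r: r["county"] == "Orange",
)
print(view)
```

Pairing a narrow view like this with direct remote access is one way to read the two principles together: the analyst works freely, but only on the fields the task requires.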

Continuum of Access Models
[Figure: workflow from Raw Data through Data Preparation, Analyze, and Publish to Decision, along a continuum from Restricted (protection) to Open (usability)]
Both extremes have major limitations
Open Access: must not be sensitive data; no micro data (individual level data)
Restricted Access: very high barrier to use; applicable to only limited situations
Need better access models that can balance privacy protection and usability

System of Access Models
[Figure: workflow from Raw Data through Data Preparation, Analysis Type I, Analysis Type II, and Publish to Decision. Access levels run from Restricted (protection) through Controlled and Monitored to Open (usability). Analysis Type I (more sensitive data): more protection. Analysis Type II (less sensitive data): more usability.]
Goal: To design an information system that can enforce the varied continuum from one end to the other, balancing privacy and usability as needed to turn data into decisions for a given task

Proposed Model
(red arrows show data workflow)

Workflow: Raw Data, Data Preparation, Analysis Type 1 (Greater Protection), Analysis Type 2 (Greater Access), Publication, Decisions
Use/share results within restricted community

Data Preparation
Select needed data (columns & rows) based on intended use
Integrate data from different sources
Clean data to remove errors
Level: Restricted Access
Computer is physically locked down with no remote access
All use on and off the computer is monitored
Input: Decoupled Micro Data
PII separated from sensitive data (e.g., Social Security number separated from cancer status via encryption)
Approval: Full IRB
IRB review for secondary data analysis

Analysis Type 1 (Greater Protection)
Build models that explain relationships
Level: Controlled Access
Remote access via VPN on locked down Virtual Machine
User must select from catalog of approved analysis software
Input: De-identified Data
Use for datasets requiring more privacy protection (e.g., having a higher potential for exposing individuals or institutions to harm)
Mechanisms for de-identifying data include dropping PII from customized data and/or aggregating individual data into groups
Approval: Full IRB
IRB review for secondary data analysis

Analysis Type 2 (Greater Access)
Build models that explain relationships
Level: Monitored Access
Remote access via VPN with user authentication on secure server
User may employ any IRB-approved analysis software or method
All use is monitored
Input: Less Sensitive Data
Use for datasets requiring less restrictive privacy protection
Generally, individual data are aggregated to groups (note that even aggregated data can expose individuals or institutions to harm; in such cases, controlled access is more appropriate)
Approval: Exempt IRB
File IRB for exemption to document intent

Publication
Publish products (papers, results, data) for public use
Level: Open Access
Access via the WWW or download to desktop
No use restrictions
No monitoring of use
Input: Sanitized Data
Data changed to limit disclosure
Methods: data masking, generalization, summation, data simulation
Approval: No IRB, but Data Use Agreement
General data use agreement based on ethical research

Decisions

Current Common Practice
(brown arrows show data workflow)
Researcher often has little or no control over data preparation.
Lack of P(linkage) gives researcher limited ability to control for linkage errors in the analysis.
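The four access levels of the proposed model can be written down as a small lookup structure. This is only a sketch of how such a configuration might look: the pairing of workflow stage, level, input, and approval follows the per-level slides in this deck, while the class name `AccessLevel`, its fields, and the helper `level_for_stage` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessLevel:
    name: str            # access level on the protection/usability continuum
    stage: str           # workflow stage it serves
    input_data: str      # kind of data made available at this level
    approval: str        # approval required
    remote_access: bool  # whether remote access is allowed


# Ordered from most protection to most usability.
ACCESS_LEVELS: List[AccessLevel] = [
    AccessLevel("Restricted", "Data Preparation",
                "Decoupled micro data", "Full IRB", False),
    AccessLevel("Controlled", "Analysis Type 1",
                "De-identified data", "Full IRB", True),
    AccessLevel("Monitored", "Analysis Type 2",
                "Less sensitive data", "Exempt IRB", True),
    AccessLevel("Open", "Publication",
                "Sanitized data", "No IRB, but data use agreement", True),
]


def level_for_stage(stage: str) -> AccessLevel:
    """Look up the access level used for a workflow stage."""
    for level in ACCESS_LEVELS:
        if level.stage == stage:
            return level
    raise KeyError(stage)


print(level_for_stage("Analysis Type 1").name)
```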

Proposed Model: Monitored Access
[Figure: the proposed model diagram repeated, with Monitored Access highlighted on the continuum from Restricted (protection) to Open (usability)]
Information Accountability Model
Greater emphasis on ease of use
Level: Monitored Access, Analysis Type 2 (Greater Access)
Remote access via VPN
User authentication on secure server
Typically IRB required
Exempt IRB
Monitor use on the computer
BUT limitation on more sensitive data
Example: SHRINE, secure Unix server

Proposed Model: Controlled Access
[Figure: the proposed model diagram repeated, with Controlled Access highlighted on the continuum from Restricted (protection) to Open (usability)]
Greater emphasis on protection by controlling use
Level: Controlled Access, Analysis Type 1 (Greater Protection)
Data Enclave (Lane 2008)
Remote access via VPN & user authentication
All use on the computer is monitored
Locked down Virtual Machine (VM)
Data channels are blocked/monitored
User must select from catalog of approved analysis software
Example: U Chicago-NORC, UNC-Tracs (CTSA), UCSD-iDASH

Controlled vs Monitored Access
Balance risk and usability
Usability: Remote access via VPN
Risk: User authentication; monitor use on the computer
What the researcher can do on the system

Controlled Access (Greater Protection)
Usability: Locked down VM; data channel monitored
Approval: Full IRB
Risk: Low risk of analytics attack; low risk of linkage attack

Monitored Access (Greater Access)
Usability: Free to develop SW; typically open data channels
Approval: Exempt IRB
Risk: High risk of analytics attack; high risk of linkage attack

Summary Workflow: Data to Decision

The start
Write up a research plan on
What data you need
What you want to do with them
Determine access levels for each data
Submit to IRB process

IRB: Risk of privacy violation vs. Benefit to Society
Risk of attribute disclosure
Group disclosure
Linkage attack using auxiliary information
Risk of identity disclosure
Given?
Kinds of data elements used in the study
Name/dob/cancer status/etc. (are there $$)
What system the data resides in: HW/SW
Risk of outsiders intruding / insider attack / negligence
What can users do with the data on the system
Take data off / look at everything / only do limited queries

Restricted Access: Prepare the customized data
Decoupled Data (Kum 2012)
Automated Honest Broker SW
Sample selection
Attribute selection
Data integration (access to PII)
Some data cleaning
Full IRB

Privacy Preserving Interactive Record Linkage
Decouple data via encryption
Automated honest broker approach via computerized third party model
Chaff to prevent group disclosure
Kum et al. 2012
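The idea of decoupling PII from sensitive data can be sketched as two tables whose connection is computable only with a secret key. This is a simplified illustration, not the method of Kum et al.: it swaps in a keyed hash (HMAC-SHA256) where the slides say encryption, it omits chaff, and the function names, field names, and sample records are hypothetical.

```python
import hashlib
import hmac
import secrets


def link_token(key: bytes, pii_id: str) -> str:
    """Keyed hash tying a PII row to its sensitive row (needs the key)."""
    return hmac.new(key, pii_id.encode("utf-8"), hashlib.sha256).hexdigest()


def decouple(records, pii_fields, key):
    """Split records into a PII table and a sensitive table."""
    pii_table, sensitive_table = [], []
    for rec in records:
        pii_id = secrets.token_hex(8)  # random row id, unrelated to the person
        pii_table.append({"pii_id": pii_id, **{f: rec[f] for f in pii_fields}})
        sensitive = {k: v for k, v in rec.items() if k not in pii_fields}
        sensitive_table.append({"link_id": link_token(key, pii_id), **sensitive})
    # Shuffle so row order does not reveal which rows belong together.
    secrets.SystemRandom().shuffle(sensitive_table)
    return pii_table, sensitive_table


def relink(pii_table, sensitive_table, key):
    """Rejoin the tables; only a holder of the key can do this."""
    by_link = {row["link_id"]: row for row in sensitive_table}
    joined = []
    for p in pii_table:
        s = by_link[link_token(key, p["pii_id"])]
        merged = {k: v for k, v in p.items() if k != "pii_id"}
        merged.update({k: v for k, v in s.items() if k != "link_id"})
        joined.append(merged)
    return joined


# Hypothetical records (all values are made up).
records = [
    {"name": "A. Example", "ssn": "000-00-0001", "cancer_status": "AML"},
    {"name": "B. Example", "ssn": "000-00-0002", "cancer_status": "none"},
]
key = secrets.token_bytes(32)  # would be held by the automated honest broker
pii, sensitive = decouple(records, ["name", "ssn"], key)
print(relink(pii, sensitive, key) == records)
```

In this sketch someone who sees both tables but not the key cannot tell which sensitive row belongs to which person, while the honest broker software, which holds the key, can still integrate the data.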

Controlled Access: Model using given tools
With approved de-identified data
Locked down VM: customized appliances, only approved software
Remote access via VPN
Very effective for threats from HBC
Full IRB
U Chicago-NORC, UNC-Tracs (CTSA), UCSD-iDASH
Gary King. Ensuring the Data-Rich Future of the Social Sciences, Science, vol 331, 2011, pp 719-721.

Monitored Access: Freely Repurpose
Information Accountability model
Exempt IRB: Explicit data use agreement
Any software & auxiliary data
Remote access via VPN
Less sensitive data (e.g. aggregate data)
SHRINE, secure Unix servers
Gary King. Ensuring the Data-Rich Future of the Social Sciences, Science, vol 331, 2011, pp 719-721.

Open Access: No restriction on use
Package with filter information for others (disclosure limitation methods) & take out of lab
No IRB
No monitoring of use
Anyone: Publish
Disclosure Limitation Methods (filter)
Sanitized data
Public websites, publications
Publish data use terms
Gary King. Ensuring the Data-Rich Future of the Social Sciences, Science, vol 331, 2011, pp 719-721.
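Two of the disclosure limitation methods named in this deck, generalization and summation, can be sketched in a few lines. The example below is an assumption-laden illustration rather than a recommended filter: the ten-year age bands, the field names, and the sample micro data are all hypothetical.

```python
from collections import Counter


def age_band(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def sanitize(records, width: int = 10):
    """Summation: publish counts per (age band, diagnosis), not person rows."""
    counts = Counter(
        (age_band(r["age"], width), r["diagnosis"]) for r in records
    )
    return [
        {"age_band": band, "diagnosis": dx, "count": n}
        for (band, dx), n in sorted(counts.items())
    ]


# Hypothetical micro data (values are made up).
micro = [
    {"age": 34, "diagnosis": "AML"},
    {"age": 37, "diagnosis": "AML"},
    {"age": 52, "diagnosis": "Hodgkin lymphoma"},
]
table = sanitize(micro)
print(table)
```

Aggregated tables can still expose individuals when a cell count is very small, which is why some aggregates may need controlled rather than open access.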

Use Published Data for Good Decision Making
[Figure: continuum of access levels from Restricted (protection) through Controlled and Monitored to Open (usability)]
Deployed together, the four data access models can provide a comprehensive system for privacy protection, balancing the risk and usability of secondary data in population informatics research.

Comparison of risk and usability
[Figure: continuum of access levels from Restricted (protection) through Controlled and Monitored to Open (usability)]

Usability

U1.1: Software (SW)
Restricted Access: Only preinstalled data integration & tabulation SW. No query capacity
Controlled Access: Requested and approved statistical software only
Monitored Access: Any software
Open Access: Any software

U1.2: Outside Data
Restricted Access: No outside data allowed. But PII data
Controlled Access: Only preapproved outside data allowed
Monitored Access: Any data
Open Access: Any data

U2: Access
Restricted Access: No Remote Access
Controlled Access: Remote Access
Monitored Access: Remote Access
Open Access: Remote Access

Risk

R1: Cryptographic Attack
Restricted Access: Very Low Risk
Controlled Access: Low Risk. Would have to break into VM.
Monitored Access: High Risk
Open Access: NA

R2: Data Leakage
Restricted Access: Very Low Risk. Memorize data and take out
Controlled Access: Physical data leakage (take a picture of monitor)
Monitored Access: Electronically take data off the system
Open Access: NA

Case Study: Cancer care among the poor
Yung RL, Chen K, Abel GA, et al. Cancer disparities in the context of Medicaid insurance: a comparison of survival for acute myeloid leukemia and Hodgkin's lymphoma by Medicaid enrollment. Oncologist 2011;16(8):1082-91.
Linked the New York central cancer registry with Medicaid enrollment and claims files to assess cancer care among the poor

Case Study

Workflow: Data Preparation (Data Integration and Selection)
Conventional System
System Model: Indirect Access via Health Dept.
Access: No direct access to data
Type of Data: Multiple Identifiable Microdata Tables
Proposed System
System Model: Direct Restricted Access
Access: Direct access to data
Type of Data: Multiple Decoupled Microdata Tables

Workflow: Analysis of Micro (Person Level) Data
Conventional System
System Model: Monitored Access
Access: Remote direct access for authorized users
Type of Data: De-identified integrated microdata
Proposed System
System Model: Controlled Access
Access: Remote direct access for authorized users
Type of Data: De-identified integrated microdata with P(linkage)

Analysis of Risk and Usability

Data Preparation (Data Integration and Selection)
Reduce risk significantly from insider attack by decoupling PII from the sensitive data
Increase data usability: directly carry out record linkage and data (attribute and sample) selection

Analysis of Micro (Person Level) Data
Reduce risk significantly from insider attack and malware in the proposed model by 1. restricting activities on the VM, and 2. running the VM in isolation from the host OS
Increase data usability, leading to more accurate analysis by propagating the error, probability of linkage, to the analysis phase
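Propagating the probability of linkage into the analysis phase can be illustrated by weighting each linked record by its P(linkage). This is only one simple way to use such a probability and is not claimed to be the approach used in the case study; the field names `survival_months` and `p_linkage` and the sample values are hypothetical.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], link_probs: Sequence[float]) -> float:
    """Mean of values, weighting each record by its linkage probability."""
    total_weight = sum(link_probs)
    if total_weight == 0:
        raise ValueError("all linkage probabilities are zero")
    return sum(v * p for v, p in zip(values, link_probs)) / total_weight


# Hypothetical linked records: uncertain links carry less weight.
linked = [
    {"survival_months": 24.0, "p_linkage": 1.0},
    {"survival_months": 12.0, "p_linkage": 0.5},
    {"survival_months": 60.0, "p_linkage": 0.5},
]
values = [r["survival_months"] for r in linked]
probs = [r["p_linkage"] for r in linked]

naive = sum(values) / len(values)       # treats every link as certain
estimate = weighted_mean(values, probs)  # discounts uncertain links
print(naive, estimate)
```

Without P(linkage) in the analysis data set, only the naive estimate is available, which is the limitation the slides attribute to current common practice.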

Thank you!
Questions?
Population Informatics Research Group

http://pinformatics.web.unc.edu/
