Protecting Patient Privacy in Clinical Data Mining
Linda K. Goodwin, RN, C, PhD, and Jonathan C. Prather, PhD
geographic subdivisions smaller than a state, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census [4, pp. 82711, 82818]:
• The geographic unit formed by combining all zip codes with the same three initial digits must contain more than 20,000 people; and
• The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
Since our final county codes were broadly categorized as “Durham County, Orange County, and Other”, each geographic subdivision has more than 20,000 people, and it should be possible to follow HIPAA guidelines with regard to zip codes to achieve HIPAA-compliant outcomes with results similar to our RO1 procedures.
Birth dates were converted to age at the time of conception, and then removed. Since we did not have any pregnant women over the age of 89, this aspect of the regulation did not influence our procedures, and we do not expect it to become a problem in perinatal medicine. HIPAA regulations specify removal of the following for date identifiers: all elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older [4, p. 82818].
Here is where we believe new problems may arise with interpretation of HIPAA regulations in research. If dates were removed prior to our receiving the data, it would have been impossible for us to complete our study, since we needed dates in order to generate many of our data elements. Please recognize that dates were available in the raw data but converted into data fields, such as years of age and weeks of gestation, in the final research data sets (see figure 2); all dates were removed from the final data set but available to the data warehouse engineer during the earlier data cleaning and filtering phases of the research.
Figure 2. Research Data Sets (10 week intervals). [Timeline axis: LMP, 10 weeks, 20 weeks, 30 weeks, 37, 40+; outcome labels: Preterm, Full term. Data Set #1: 446 vars; Data Set #2: 839 vars; Data Set #3: 1,232 vars; Data Set #4: 1,622 vars.]
All clinical observations and events in the study data were converted into numeric temporal relationships with regard to conception. For example, if a preconception questionnaire was completed on 1/1/1993 and conception occurred on 1/15/1993, the relationship of the event was calculated as -14 days. If the patient experienced nausea during the same pregnancy on 7/15/1993, the temporal relationship of the observation would be 181 days. This temporal-relationship scheme allowed all dates in the data source, including date of conception and birth, to be removed. While the process that included dates in raw data could be interpreted as violating HIPAA regulations, the final data sets are clearly in compliance with the law.
We believe our RO1 procedures were HIPAA compliant for personal identifiers 4 through 17 (see table 2). Because of the years in which the TMR data were collected, some identifier items on the HIPAA list were not included in our data set (e.g., e-mail address, web URL, IP address, biometrics, photos). We systematically removed all phone and fax numbers, social security numbers, and medical record and account numbers, and we recoded insurance information into categories: private, health department, managed care, or other.
Under the HIPAA regulatory definition, de-identification occurs when the 18 personal identifiers listed in table 1 have been removed or aggregated as defined in the law. Item 18 places a burden of protection on the covered entity to consider “any other unique identifying number, characteristic, or code (whether generally available in the public realm or not) that the covered entity has reason to believe may be available to an anticipated recipient of the information, and the covered entity has no reason to believe that any reasonably anticipated recipient of such information could use the information alone, or in combination with other information, to identify an individual. Thus, to create de-identified information, entities that had removed the listed identifiers would still have to remove additional data elements if they had reason to believe that a recipient could use the
remaining information, alone or in combination with other information, to identify an individual.” [4, pp. 82717-82818]
See the discussion section for our example of potential problems with “unique” characteristics. Our own research found that this necessary final step was especially challenging, but still possible to accomplish.

Table 2. HIPAA and RO1 Comparison

| HIPAA SAFE HARBOR DE-IDENTIFICATION | AAMC PROPOSED DE-IDENTIFICATION | RO1 “ANONYMIZATION” PROCEDURES |
|---|---|---|
| 1. Names | 1. Names | Removed |
| 2. Geographic subdivisions of residence | 2. Street address | Recoded as county |
| 3. All elements of dates | | Converted to temporal relationships from date of conception |
| 4. Telephone # | 3. Telephone # | Removed |
| 5. Fax # | 4. Fax # | Removed |
| 6. Electronic mail addresses | 5. Electronic mail addresses | Not available |
| 7. Social security # | 6. Social security # | Removed |
| 8. Medical record # | | Removed |
| 9. Health plan beneficiary # | | Recoded as category/type |
| 10. Account # | | Removed |
| 11. Certificate/license # | | Not available in TMR™ data source |
| 12. Vehicle identifiers and serial # | 7. Vehicle identifiers and serial # including license plate # | |
| 13. Device identifiers & serial # | | |
| 14. Web Universal Resource Locators (URLs) | Acknowledged | |
| 15. Internet Protocol (IP) address # | Acknowledged | |
| 16. Biometric identifiers, including finger and voice prints | | |
| 17. Full face photographic images and any comparable images | 8. Full face photographic images and any comparable images | |
| 18. Any other unique identifying number, characteristic, or code | | Manual outlier filtering removed very young pregnant patient |

Discussion
Data access and management procedures were reviewed and approved by Duke’s IRB in accordance with the Common Rule and established IRB policies and procedures (since HIPAA privacy regulations will not be enforced until April 2003).
It is important to understand that the data warehouse engineer on our research team was given full access to clinical data with patient identifiers. Other participating research team members included the clinical database administrator (a certified perinatal nurse) and a board-certified obstetrician who were heavily involved in using the TMR™ database for operational purposes. The warehouse engineer spent more than a year working closely with the database administrator, physician, Principal Investigator (PI), and other members of the research team to scrub, filter, clean, and transform the raw data into de-identified research data sets.
The database administrator and physician were already authorized users of the database, and our research data cleaning procedures required that the data warehouse engineer have access to personal identifiers in order to match and verify records. The warehouse engineer worked closely with the research team and was diligent in removing all personal identifiers from the research data. Thus, data analyses were conducted on de-identified data by multiple members of the research team, but only the warehouse engineer had ready access to raw data that included personal identifiers. The warehouse engineer carefully protected the raw data through physical security (locked office, locked machine) and multiple levels of password protection on the hard drive where the raw data were stored.
It is also important to understand that potential privacy and informed consent issues persist even in de-identified large clinical databases. Our research team was vigilant in maintaining patient privacy, and we removed patient identifiers using procedures that, with minor modification, could be expected to meet a HIPAA “safe harbor” de-identification standard.
But we did encounter a problem after the first 17 HIPAA safe harbor personal identifiers had been removed from our research data. Clinician members of the research team could still identify a very young pregnant girl because of her age, so her record was removed from the data. A board-certified obstetrician and an expert nurse who had managed the TMR database for more than a decade assisted us in searching for outliers and possible remaining identifiers, but no further potential privacy problems with the research data were found.
Thus, the 18th HIPAA item that mandates removing “any/all other unique identifying number, characteristic or code” will require due diligence by those responsible for de-identification of clinical data, and could create problems where researchers do not wish to invest the time and effort for this level of attention to detail.
The potential impact of HIPAA privacy regulations on clinical research is not yet known. The literature and the web are filled with opinions on both sides of the issue; some believe protecting patient privacy is more important while others argue that access to PHI for research purposes is essential for science. In our clinical data mining research, we found that these competing demands for privacy and for access to research data could both be satisfied through rigorous de-identification procedures.
Our work preceded HIPAA privacy regulations, but would meet a HIPAA safe harbor standard with only minor modifications. Based on our experiences, we believe that clinical data mining researchers can protect patient privacy while advancing science through de-identified data analyses. We acknowledge this process is tedious, time-consuming, and expensive; these facts combined with increased liability for a covered entity are predicted to reduce the availability of clinical data for research purposes in the coming years. Our data
mining research finds it is possible to balance privacy concerns and research needs through careful procedures that remove personal identifiers to create de-identified data for clinical research; we believe this balance improves both public trust and scientific research.
Conclusion
Clinical data mining research conducted at Duke University followed careful procedures to protect the privacy of patient data. All known patient identifiers were removed from the data, and a patient identification number was recoded so that only the Principal Investigator had the information (locked in the research office) that could re-identify an individual patient’s record. We were diligent in maintaining patient privacy, and removed patient identifiers using procedures that, with minor modification, could be expected to meet a HIPAA “safe harbor” de-identification standard.
Potential privacy and informed consent issues persist even in anonymized and de-identified large clinical databases. In spite of removing 17 HIPAA personal identifiers, we found that outliers (e.g., a very young pregnant girl) were still potentially identifiable by a small number of employees who knew the patient, and this record was removed from the research data.
It seems imperative that we strive to maintain the public’s trust by doing everything possible to protect patient privacy in clinical research. Privacy protections will require careful stewardship of patient data that provides vigilant de-identification and HIPAA compliance as a minimum standard for clinical data mining research.
Acknowledgment
Informatics Tools and Perinatal Knowledge Building RO1 LM-O6488 funded by the National Library of Medicine 1997-2003.
About the Authors
Linda Goodwin is director of the Nursing Informatics Program at Duke University and an informatics scientist with a program of funded research in applied informatics (www.duke.edu/~goodw010).
Jonathan Prather is a data warehouse engineer for Oregon Health Sciences University Department of Immunogenetics and Transplantation.
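The date, zip code, and age rules discussed in this article can be illustrated with a short sketch. This is not the code used in the study; it is a minimal illustration, assuming Python and hypothetical function names (`days_from_conception`, `generalize_zip`, `generalize_age`). The first function reproduces the article's worked example of replacing dates with signed day offsets from conception; the other two sketch the HIPAA zip code and age rules as quoted in the text. The population figures in the usage lines are made up for illustration only.

```python
from datetime import date


def days_from_conception(event: date, conception: date) -> int:
    """Signed day offset of an event relative to the conception date.

    Negative values mean the event happened before conception.
    """
    return (event - conception).days


def generalize_zip(zip_code: str, zip3_population: dict) -> str:
    """Keep only the first three zip digits, or "000" for small areas.

    zip3_population is a hypothetical lookup from three-digit prefix to
    population. Prefixes missing from the lookup are treated as small,
    which is a conservative assumption of this sketch.
    """
    prefix = zip_code[:3]
    if zip3_population.get(prefix, 0) > 20000:
        return prefix
    return "000"


def generalize_age(age_years: int) -> str:
    """Aggregate ages over 89 into a single "90+" category."""
    return "90+" if age_years >= 90 else str(age_years)


if __name__ == "__main__":
    conception = date(1993, 1, 15)
    # Worked example from the article text.
    print(days_from_conception(date(1993, 1, 1), conception))   # -14
    print(days_from_conception(date(1993, 7, 15), conception))  # 181

    # Population counts below are invented for illustration.
    populations = {"277": 250000, "036": 15000}
    print(generalize_zip("27705", populations))  # 277
    print(generalize_zip("03601", populations))  # 000
    print(generalize_age(34))                    # 34
    print(generalize_age(93))                    # 90+
```

Storing only the offsets lets the conception date itself be dropped from the research data set while preserving the order and spacing of clinical events, which is the property the study relied on.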
References
1. “Who’s Reading Your Patient Records?” Consumer Reports, October 1994, 628-632.
2. National Research Council. For the Record: Protecting Electronic Health Information. 1997. Available at: http://www.nap.edu/readingroom/books/ftr/52ea.html
3. Detmer, D. E., and Steen, E. B. “Shoring Up Protection of Personal Health Data.” Issues in Science and Technology, Summer 1996, 12(4), 73-78.
4. 45 CFR Parts 160 and 164. December 2000. Available at: http://www.hhs.gov/ocr/hipaa/
5. Shalala, D. E. “Protecting Privacy of Health Information.” Address to the National Press Club (July 31, 1997). Available at: http://aspe.os.dhhs.gov/adminsimp/pvcy0731.htm
6. Federal Policy for the Protection of Human Subjects; Notices and Rules, 56 Federal Register 28002 - 28032 (June 18, 1991).
7. 67 Federal Register 14,776.
8. http://www.aamc.org/advocacy/corres/research/hipaa041102.htm
9. http://ncvhs.hhs.gov/011121lt.htm