
Original Contributions

Protecting Patient Privacy in Clinical Data Mining

Linda K. Goodwin, RN, C, PhD, and Jonathan C. Prather, PhD

ABSTRACT

This paper investigates whether HIPAA de-identification requirements, as well as proposed AAMC de-identification standards, were met in a large clinical data mining study (1997-2001) conducted at Duke University prior to the publication of the final rule. While HIPAA has improved de-identification standards, the study also shows that privacy issues may persist even in de-identified large clinical databases.

KEYWORDS

Health Insurance Portability and Accountability Act (HIPAA)
Patient privacy
Consent
Authorization
Institutional review board (IRB)
Clinical data mining
De-identification
The Common Rule

Over half of America’s largest 500 companies, with easy access to a person’s social security number, admitted using health records to make hiring and other personnel decisions.[1] Patients have reported losing their jobs, insurance coverage, and their reputation due to disclosure of confidential information. Not only do privacy violations result in both real and potential loss of dignity and autonomy, the information from a person’s patient records may influence their credit, employment options, admission to educational institutions, and their ability to get health insurance and at fair market cost.[2]

Patients are concerned that health-related information they give to their healthcare providers, employers, and insurance companies might somehow be used against them. There is a growing trend for patients to withhold information, and to seek care under fictitious names and erroneous social security numbers;[3] the potential impact of this trend on clinical data mining research is worrisome.

In response to public concerns, a number of privacy protection bills were introduced in Congress during the past decade. The Health Insurance Portability and Accountability Act (HIPAA) was signed into law in 1996, and component privacy regulations were published in December 2000.[4] Former U.S. Secretary of Health Donna Shalala issued a report[5] indicating the need for legislation because of the rapid changes in ways that healthcare is provided, documented, and paid for in the United States. These changes pose a challenge to American values that want both personal privacy as well as quality healthcare; in today’s healthcare market, these values are both complementary and competing.

HIPAA regulations do not change already existing law called the “Common Rule”[6] that applies to all federally funded research. The Common Rule directs a research institution to assure the federal government that it will provide and enforce protections for human subjects of research conducted under the oversight of that institution. Research institutions must assess research proposals in terms of their risks to subjects and their potential benefits, and they must see that the Common Rule’s requirements for selecting subjects and obtaining informed consent are met.

One important point to note is that,

62 Journal of Healthcare Information Management — Vol. 16, No. 4



because of new HIPAA provisions that specifically apply to research protocols, an organization cannot merely apply its routine privacy and security protections where research protocols are concerned. Regulations are still in flux, but final regulations are expected to require special treatment for research data in an effort to address patient privacy concerns.

Important Terms

It will be important to understand the differences in important terms related to HIPAA and protecting patient privacy in doing research. HIPAA has an impact on Institutional Review Board (IRB) roles, and typical clinical data mining research finds itself at the core of this endeavor where patient data is often used for research without explicit patient consent. Note that what IRBs consider “consent” is similar in concept to what HIPAA labels “authorization.”

• Anonymous data: data that was never labeled with patient/subject identifiers.
• Anonymized data: data where all identifiers have been removed and NO means exists for re-identifying patients/subjects.
• Covered entity: health plans, healthcare clearinghouses, and healthcare providers who transmit any health information in connection with a transaction (covers electronic, paper, fax, etc.).
• De-identified data: data where identifiers have been removed but means exist for re-identifying individual patients or subjects if required. The Common Rule calls this “coded data.”
• HIPAA consent is specifically for treatment, billing, and operations.
• HIPAA authorization includes what IRBs call “consent.”
• Protected health information (PHI): any information, whether oral or recorded in any form or medium, that is created or received by a healthcare provider, health plan, public health authority, employer, life insurer, school or university, or healthcare clearinghouse; and relates to the past, present, or future physical or mental health or condition of an individual; the provision of healthcare to an individual; or the past, present, or future payment for the provision of healthcare to an individual.

Figure 1. HIPAA “Safe-Harbor” De-identification Algorithm
Note: The primary author wishes to acknowledge the North Carolina Health Care Information and Communications Alliance (www.nchica.org) where she developed this algorithm working with a HIPAA Privacy Work Group Subcommittee on Research.

Ongoing Debate

Partisan, public, and industry disagreement concerning HIPAA legislation continues. The Bush Administration proposed modifications to HIPAA privacy regulations and provided an opportunity for further comment on March 27, 2002.[7] Among other things, proposed modifications eliminate a requirement for patient consent to use their PHI for research purposes. This modification addresses industry concern that privacy rules will create substantial impediments to research. Proposed modifications would permit covered entities to obtain a single form that combines authorization and consent for research from the patient, but many question whether proposed “blanket” consent and authorization will be defensible in court, and privacy advocates argue that this proposed modification destroys the value of the privacy regulations.

The National Committee on Vital and Health Statistics (NCVHS)[8] recommended that the HIPAA privacy rule continue to require either individual authorization or a waiver from an IRB or privacy board to conduct research involving protected health information, and that research activities not be included within treatment, payment, and healthcare operations. NCVHS also described participants’ concerns and assertions that the standards for de-identification are too restrictive, thus de-identified data will have minimal value for research, and recommended reconsidering whether the safe harbor de-identification standard might unduly interfere with research.

Thus it is apparent that ongoing debate and confusion are inherent in HIPAA privacy regulations. What remains clear is that much work remains to be done to ensure that privacy laws balance individual personal privacy rights with competing data and information-sharing benefits.

The National Research Council[2] reported that HIPAA is an important first step in the development of standards for electronic exchange of health information. While data sharing is also possible in paper form, it is the electronic exchange


of data that creates the need for increased privacy and security protections for patient data. Research data is just one of many areas where HIPAA may have an impact on how we conduct our work.

Figure 1 provides an algorithm that outlines procedures for de-identifying 18 protected personal identifiers enumerated in the original HIPAA Privacy Regulations.

In published comments, the Association of American Medical Colleges (AAMC)[9] recommended a condensed set of data elements for de-identification, and suggested that covered entities should be permitted to release information for research purposes if data has been de-identified and if the researcher agrees in writing (1) not to attempt to re-identify or contact subjects of the information, and (2) not to further disclose the information except as required by law.

Proposed AAMC recommendations suggest that data which has met their proposed de-identified standard would not be individually identifiable and would permit the covered entity to disclose health information pursuant to a data use agreement if (1) the covered entity has determined that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by the recipient researcher to identify an individual who is the subject of the information; (2) identifiers listed in column 2 (table 1) are removed; and (3) the covered entity does not have actual knowledge that the information could be readily used alone or in combination with other reasonably available information to identify an individual who is the subject of the information.

Table 1. HIPAA and Proposed AAMC Research De-identification Requirements

CURRENT HIPAA DE-IDENTIFICATION         AAMC RECOMMENDATIONS
REQUIREMENTS
1. Names                                1. Names
2. Geographic subdivisions              2. Street address
3. All elements of dates                -
4. Telephone #                          3. Telephone #
5. Fax #                                4. Fax #
6. Electronic mail addresses            5. Electronic mail addresses
7. Social security #                    6. Social security #
8. Medical record #                     -
9. Health plan beneficiary #            -
10. Account #                           -
11. Certificate/license #               -
12. Vehicle identifiers and serial #    7. Vehicle identifiers and serial #
13. Device identifiers & serial #       -
14. Web Universal Resource Locators     Acknowledged
    (URLs)
15. Internet Protocol (IP) address #    Acknowledged
16. Biometric identifiers, including    -
    finger and voice prints
17. Full face photographic images and   8. Full face photographic images and
    comparable images                      comparable images
18. Any other unique identifying        -
    number, characteristic, or code

Methods to Protect Patient Privacy in Clinical Data Mining Research

The purpose of this paper is to compare procedures used for our clinical data mining preparation with the HIPAA safe harbor de-identification standard in figure 1 and proposed AAMC de-identification standards. Note that numbers in figure 1 in the upper left-hand corner of a box correspond to the list of 18 personal health identifiers in the original HIPAA “safe harbor” standard of the privacy regulations.

Duke’s TMR™ perinatal database provided data for the study (RO1 LM-O6488) and is the only known clinical database that electronically collected data on pregnant women for more than two decades. The final research data set included 1,622 variables and 19,970 patients after cleaning and filtering procedures were completed for data extraction of 71,753 records and approximately 4,000 potential variables per patient.

Data in TMR™, a comprehensive electronic medical record system, did contain many patient identifiers. Since we were interested in predictive modeling during pregnancy, the first step in preparing our research data sets was to remove all infant records, since infant data would not normally be available during pregnancy. Infant variables that might be available during pregnancy were retained (e.g., infant gender and ultrasound measurements).

Remaining maternal records were then assigned a unique code using random number generation, and the list that matches the original patient identifier with their code number is kept locked in the Principal Investigator’s office files. This code can be used to re-identify individual patients in the original TMR™ system. When we began the research, we were not sure if we would need an ability to compare research data sets with the original patient records, and we did in fact find this ability useful on at least one occasion. Our data filtering and cleaning efforts removed all names (patient, family, etc.) from the data.

Procedures used to remove geographic identifiers in our research data converted zip codes to county codes, and removed all other forms of address and location. Our procedures would not be in compliance with the strictest interpretation of new HIPAA privacy regulations. HIPAA regulations specify removal of the following for geographic identifiers: all


geographic subdivisions smaller than a state, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:[4, pp. 82711, 82818]
• The geographic unit formed by combining all zip codes with the same three initial digits must contain more than 20,000 people; and
• The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people are changed to 000.

Since our final county codes were broadly categorized as “Durham County, Orange County, and Other”, each geographic subdivision has more than 20,000 people and it should be possible to follow HIPAA guidelines with regard to zip codes to achieve HIPAA-compliant outcomes with results similar to our RO1 procedures.

Birth dates were converted to age at the time of conception, and then removed. Since we did not have any pregnant women over the age of 89, this aspect of the regulation did not influence our procedures, and we do not expect it to become a problem in perinatal medicine. HIPAA regulations specify removal of the following for date identifiers: all elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.[4, p. 82818]

Here is where we believe new problems may arise with interpretation of HIPAA regulations in research. If dates were removed prior to our receiving the data, it would have been impossible for us to complete our study, since we needed dates in order to generate many of our data elements. Please recognize that dates were available in the raw data but converted into data fields, such as years of age and weeks of gestation, in the final research data sets (see figure 2); all dates were removed from the final data set but available to the data warehouse engineer during the earlier data cleaning and filtering phases of the research.

Figure 2. Research Data Sets (10 week intervals). Timeline: LMP, 10 weeks, 20 weeks, 30 weeks, 37, 40+, with outcomes labeled Preterm and Full term. Data Set #1: 446 vars; Data Set #2: 839 vars; Data Set #3: 1,232 vars; Data Set #4: 1,622 vars.

All clinical observations and events in the study data were converted into numeric temporal relationships with regard to conception. For example, if a preconception questionnaire was completed on 1/1/1993 and conception occurred on 1/15/1993, the relationship of the event was calculated as -14 days. If the patient experienced nausea during the same pregnancy on 7/15/1993, the temporal relationship of the observation would be 181 days. This temporal relationships scheme allowed all dates in the data source, including date of conception and birth, to be removed. While the process that included dates in raw data could be interpreted as violating HIPAA regulations, the final data sets are clearly in compliance with the law.

We believe our RO1 procedures were HIPAA compliant for personal identifiers 4 through 17 (see table 2). As a function of the years when TMR data was collected, some identifier items on the HIPAA list were not included in our data set (e.g., e-mail address, web URL, IP address, biometrics, photos). We systematically removed all phone and fax numbers, social security numbers, and medical record and account numbers; insurance information was recoded into categories: private, health department, managed care, or other.

Under the HIPAA regulatory definition, de-identification occurs when the 18 personal identifiers listed in table 1 have been removed or aggregated as defined in the law. Item 18 places a burden of protection on the covered entity to consider “any other unique identifying number, characteristic, or code (whether generally available in the public realm or not) that the covered entity has reason to believe may be available to an anticipated recipient of the information, and the covered entity has no reason to believe that any reasonably anticipated recipient of such information could use the information alone, or in combination with other information, to identify an individual. Thus, to create de-identified information, entities that had removed the listed identifiers would still have to remove additional data elements if they had reason to believe that a recipient could use the


remaining information, alone or in combination with other information, to identify an individual.”[4, pp. 82717-82818]

See the discussion section for our example of potential problems with “unique” characteristics. Our own research found that this necessary final step was especially challenging, but still possible to accomplish.

Discussion

Data access and management procedures were reviewed and approved by Duke’s IRB in accordance with the Common Rule and established IRB policies and procedures (since HIPAA privacy regulations will not be enforced until April 2003).

It is important to understand that the data warehouse engineer on our research team was given full access to clinical data with patient identifiers. Other participating research team members included the clinical database administrator (a certified perinatal nurse) and a board-certified obstetrician who were heavily involved in using the TMR™ database for operational purposes. The warehouse engineer spent more than a year working closely with the database administrator, physician, Principal Investigator (PI), and other members of the research team to scrub, filter, clean, and transform the raw data into de-identified research data sets.

The database administrator and physician were already authorized users of the database, and our research data cleaning procedures required that the data warehouse engineer have access to personal identifiers in order to match and verify records. The warehouse engineer worked closely with the research team and was diligent in removing all personal identifiers from the research data. Thus, using de-identified data, data analyses were conducted by multiple members of the research team, but only the warehouse engineer had ready access to raw data that included personal identifiers. The warehouse engineer carefully protected the raw data through physical security (locked office, locked machine) and multiple levels of password protection on the hard drive where the raw data were stored.

It is also important to understand that potential privacy and informed consent issues persist even in de-identified large clinical databases. Our research team was vigilant in maintaining patient privacy, and we removed patient identifiers using procedures that, with minor modification, could be expected to meet a HIPAA “safe harbor” de-identification standard.

But we did encounter a problem after the first 17 HIPAA safe harbor personal identifiers had been removed from our research data. Clinician members of the research team could still identify a very young pregnant girl because of her age, so her record was removed from the data. A board-certified obstetrician and expert nurse who had managed the TMR database for more than a decade assisted us in searching for outliers and possible remaining identifiers, but no further potential privacy problems with the research data were found.

Thus, the 18th HIPAA item that mandates removing “any/all other unique identifying number, characteristic or code” will require due diligence by those responsible for de-identification of clinical data, and could create problems where researchers do not wish to invest the time and effort for this level of attention to detail.

Table 2. HIPAA and RO1 Comparison

HIPAA SAFE HARBOR                       AAMC PROPOSED                         RO1 “ANONYMIZATION”
DE-IDENTIFICATION                       DE-IDENTIFICATION                     PROCEDURES
1. Names                                1. Names                              Removed
2. Geographic subdivisions              2. Street address                     Recoded as county of residence
3. All elements of dates                -                                     Converted to temporal relationships from date of conception
4. Telephone #                          3. Telephone #                        Removed
5. Fax #                                4. Fax #                              Removed
6. Electronic mail addresses            5. Electronic mail addresses          Not available
7. Social security #                    6. Social security #                  Removed
8. Medical record #                     -                                     Removed
9. Health plan beneficiary #            -                                     Recoded as category/type
10. Account #                           -                                     Removed
11. Certificate/license #               -                                     Not available in TMR™ data source
12. Vehicle identifiers and serial #    7. Vehicle identifiers and serial #   -
    including license plate #
13. Device identifiers & serial #       -                                     -
14. Web Universal Resource Locators     Acknowledged                          -
    (URLs)
15. Internet Protocol (IP) address #    Acknowledged                          -
16. Biometric identifiers, including    -                                     -
    finger and voice prints
17. Full face photographic images and   8. Full face photographic images and  -
    any comparable images                  any comparable images
18. Any other unique identifying        -                                     Manual outlier filtering removed very young pregnant patient
    number, characteristic, or code

The potential impact of HIPAA privacy regulations on clinical research is not yet known. The literature and the web are filled with opinions on both sides of the issue; some believe protecting patient privacy is more important while others argue that access to PHI for research purposes is essential for science. In our clinical data mining research, we found that these competing demands for privacy and for access to research data could both be satisfied through rigorous de-identification procedures.

Our work preceded HIPAA privacy regulations, but would meet a HIPAA safe harbor standard with only minor modifications. Based on our experiences, we believe that clinical data mining researchers can protect patient privacy while advancing science through de-identified data analyses. We acknowledge this process is tedious, time-consuming, and expensive; these facts combined with increased liability for a covered entity are predicted to reduce the availability of clinical data for research purposes in the coming years. Our data


mining research finds it is possible to balance privacy concerns and research needs through careful procedures that remove personal identifiers to create de-identified data for clinical research; we believe this balance improves both public trust and scientific research.

Conclusion

Clinical data mining research conducted at Duke University followed careful procedures to protect the privacy of patient data. All known patient identifiers were removed from the data, and a patient identification number was re-coded so that only the Principal Investigator had the information (locked in the research office) that could re-identify an individual patient’s record. We were diligent in maintaining patient privacy, and removed patient identifiers using procedures that, with minor modification, could be expected to meet a HIPAA “safe harbor” de-identification standard.

Potential privacy and informed consent issues persist even in anonymized and de-identified large clinical databases. In spite of removing 17 HIPAA personal identifiers, we found that outliers (e.g., a very young pregnant girl) were still potentially identifiable by a small number of employees who knew the patient, and this record was removed from the research data.

It seems imperative that we strive to maintain the public’s trust by doing everything possible to protect patient privacy in clinical research. Privacy protections will require careful stewardship of patient data that provides vigilant de-identification and HIPAA compliance as a minimum standard for clinical data mining research.

Acknowledgment

Informatics Tools and Perinatal Knowledge Building RO1 LM-O6488 funded by the National Library of Medicine 1997-2003.

About the Authors

Linda Goodwin is director of the Nursing Informatics Program at Duke University and an informatics scientist with a program of funded research in applied informatics (www.duke.edu/~goodw010).

Jonathan Prather is a data warehouse engineer for Oregon Health Sciences University Department of Immunogenetics and Transplantation.

References

1. “Who’s Reading Your Patient Records?” Consumer Reports, October 1994, 628-632.
2. National Research Council. For the Record: Protecting Electronic Health Information. 1997. Available at: http://www.nap.edu/readingroom/books/ftr/52ea.html
3. Detmer, D. E., and Steen, E. B. “Shoring Up Protection of Personal Health Data.” Issues in Science and Technology, Summer 1996, 12(4), 73-78.
4. 45 CFR Parts 160 and 164. December 2000. Available at: http://www.hhs.gov/ocr/hipaa/
5. Shalala, D. E. “Protecting Privacy of Health Information.” Address to the National Press Club (July 31, 1997). Available at: http://aspe.os.dhhs.gov/adminsimp/pvcy0731.htm
6. Federal Policy for the Protection of Human Subjects; Notices and Rules, 56 Federal Register 28002-28032 (June 18, 1991).
7. 67 Federal Register 14776.
8. http://www.aamc.org/advocacy/corres/research/hipaa041102.htm
9. http://ncvhs.hhs.gov/011121lt.htm
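
Illustrative sketch

The recoding steps described under Methods (temporal relationships relative to conception and the three-digit zip code rule) and the outlier screening described in the Discussion can be illustrated with a minimal Python sketch. This is not the study's code, which the paper does not show. The function names, the population lookup table, and the min_count threshold are all assumptions introduced here for illustration; in the study itself, outliers were found by manual clinician review rather than by a frequency cutoff.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Mapping, Set


def days_from_conception(event: date, conception: date) -> int:
    """Signed day offset of an event relative to the conception date."""
    return (event - conception).days


def generalize_zip(zip_code: str, zip3_population: Mapping[str, int]) -> str:
    """Keep the 3-digit zip prefix only if its combined area exceeds 20,000 people.

    zip3_population is a hypothetical lookup (prefix -> population) that a
    real project would have to build from census data.
    """
    prefix = zip_code[:3]
    # Prefixes missing from the lookup are treated as small areas (an assumption).
    return prefix if zip3_population.get(prefix, 0) > 20_000 else "000"


def rare_values(values: Iterable[int], min_count: int = 5) -> Set[int]:
    """Return values seen fewer than min_count times, as candidates for manual review.

    min_count is an arbitrary illustrative threshold, not a HIPAA requirement.
    """
    counts = Counter(values)
    return {value for value, n in counts.items() if n < min_count}


# Worked example from the text: conception on 1/15/1993.
conception = date(1993, 1, 15)
print(days_from_conception(date(1993, 1, 1), conception))   # -14
print(days_from_conception(date(1993, 7, 15), conception))  # 181
```

Once every observation carries only a day offset, the calendar dates themselves, including the conception date, can be dropped from the research data set. A frequency check such as rare_values can only point reviewers toward unusual records (for example, an unusually low maternal age); deciding whether a record is identifiable still requires the kind of expert review the authors describe.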
