Professional Documents
Culture Documents
ERASE attempts to prevent disclosure of protected entities by The document was sanitized using ERASE; the terms deleted as a
removing certain terms from the document – these terms need to result are shown underlined in Figure 1. Observe that the
be selected such that no protected entity can be inferred as being sanitization has differentiated between the generic symptoms and
mentioned in the document by matching the remaining terms with symptoms specific to the kind of diseases that appear in the
the entity database. protected entity set, and deleted the latter. Furthermore, though
sweating is a common symptom associated with hundreds of
A simplistic approach is to locate “give-away” phrases in the diseases, the combination of sweating, weight loss and fatigue
document and delete them all. To the best of our knowledge, most reveals a protected entity (in this case, HIV) and so, one of these
prior work on document sanitization has followed this approach, terms must be deleted to sanitize the document; consequently, the
and has focused on developing more accurate ways of locating system removed the term “sweating”. We note that a different
such phrases in the document’s text [4, 2, 6]. We believe this is an protected entity set might remove a different set of symptoms.
overkill. For instance, in an intelligence report, removing all
names, locations, etc. would probably leave the report with no 4. REFERENCES
useful content at all. [1] Eugene Agichtein, Luis Gravano, Jeff Pavel, Viktoriya
In contrast, ERASE makes an effort to sanitize a document while Sokolova, and Aleksandr Voskoboynik. Snowball: A
causing the least distortion to the contents of the document; prototype system for extracting relations from large text
indeed, this is considered one of the principal requirements of collections. In SIGMOD, 2001.
document sanitization [6]. Towards this goal, ERASE identifies [2] M.M. Douglass, G.D. Clifford, A. Reisner, W.J. Long, G.B.
the minimum number of terms in the document that need to be Moody, and R.G.Mark. De-identification algorithm for free-
removed in order for the document to be sanitized. text nursing notes. In Computers in Cardiology, 2005.
3. ILLUSTRATIVE EXAMPLE [3] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan. Incognito:
We illustrate our approach using an anecdotal real-life example. Efficient full domain k-anonymity. In SIGMOD, 2005.
We created a database of 2645 diseases obtaining information [4] Latanya Sweeney. Replacing personally-identifying
from the website wrongdiagnosis (www.wrongdiagnosis.com). information in medical records, the srub system. In Journal
Each disease is an entity with the associated context consisting of of the Americal Medical Informatics Association, 1996.
symptoms, tests, treatments and risk factors. The website offers a [5] Latanya Sweeney. K-anonymity: A model for protecting
classification of the diseases. We declared as protected entities the privacy. Intl Journal on Uncertainty, Fuzziness and
diseases under the following categories: sexual conditions, serious Knowledge-based Systems, 10(5), 2002.
conditions (conditions related to heart, thyroid, kidney, liver,
ovary ), cancer conditions, and mental conditions. The number of [6] Alex Tveit. Anonymization of general practitioner medical
protected entities were 550. records. In HelsIT’04, Trondheim, Norway, 2004.
We created a document by taking a paragraph out of a [7] U.S. Department of Energy. Department of energy researches
communication between a doctor and a patient from another use of advanced computing for document declassification.
website (www.netdoctor.co.uk). The document is shown in Figure http://www.osti.gov/opennet.
1, where the relevant terms found in the entity contexts are shown [8] Wikipedia. Sanitization (classified information) —
in bold. wikipedia, the free encyclopedia, 2006.