
WHITE PAPER

ZixCorp® Lexicons: An Overview


ZIXCORP | MAY 2006

INSIDE:
> Lexicons Defined
> Healthcare Lexicons: Business Identifiers and Health Terms
> Financial Lexicons: Financial Terms, Financial Identifiers, and Credit Card Numbers
> Profanity Lexicon
> Social Security Lexicon
> Lexicon Accuracy


Table of Contents

1. Introduction
2. Healthcare Lexicons
3. Financial Lexicons
4. Profanity Lexicon
5. Social Security Lexicon
5. Lexicon Accuracy
6. Conclusion
7. About ZixCorp

© 2006 Zix Corporation 2



1. INTRODUCTION
ZixCorp services use a set of comprehensive lexicons to scan for sensitive information such as personal
health information or personal financial information in electronic messages. Searches are conducted
by scanning all message subjects, bodies, and attachments for expressions defined within the lexicons.

A lexicon is a file consisting of a comprehensive set of terms, phrases, expressions, and pattern masks
that identify “sensitive” types of information. Sensitive information is defined as information that
could result in liability if disclosed. ZixCorp draws on many sources to generate the lexicon content,
including federal regulations, authoritative reference sources in each subject area, and standard-of-care
practices.

The following is a description of the lexicons that are typically used in ZixCorp services. In addition to
these standard lexicons, custom lexicons can be created if a customer would like to search for specific
terms in its ZixAuditor® data capture or in its production email systems using ZixVPM®.


2. HEALTHCARE LEXICONS
The healthcare lexicons are a set of two lexicons, identifiers and health terms, that work together to
recognize Protected Health Information (PHI). The lexicons search for PHI by looking for the intersection of
identifying information, as set forth by the Department of Health and Human Services, and
health terms or claims information. This provides the highest level of confidence that the context is
related to PHI. An example of this would be a spreadsheet containing a Social Security number (SSN),
date of service, and diagnosis. The SSN and date of service would constitute an identifier, and the
diagnosis would constitute health information.

To search for potential healthcare content, both of the healthcare lexicons are combined using the
following logic:

(Identifiers AND Health Terms)

The identifiers lexicon looks for identifiers indicating official business communications (such as SSNs
and subscriber IDs). This combination of lexicons is referred to within ZixCorp services as the
Healthcare Content Standard definition.
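
The (Identifiers AND Health Terms) logic can be illustrated with a minimal Python sketch. The patterns, the term list, and the function names below are assumptions made for illustration only; this paper does not describe the format or contents of the actual ZixCorp lexicon files.

```python
import re

# Hypothetical, heavily abbreviated lexicons (illustration only).
IDENTIFIER_PATTERNS = [
    re.compile(r"\bss#?\s*\d{9}\b", re.IGNORECASE),       # e.g. "ss# 123456789"
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # e.g. "123-45-6789"
    re.compile(r"\bmbr\s+num:?\s*\d+\b", re.IGNORECASE),  # e.g. "Mbr Num: 123456"
]
HEALTH_TERMS = {"tamoxifen", "cancer", "rehab", "therapy", "diagnosis"}


def has_identifier(text: str) -> bool:
    """True if any identifier pattern occurs in the text."""
    return any(p.search(text) for p in IDENTIFIER_PATTERNS)


def has_health_term(text: str) -> bool:
    """True if any word of the text is in the health terms list."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(HEALTH_TERMS)


def matches_healthcare_standard(text: str) -> bool:
    """Identifiers AND Health Terms: both lexicons must match."""
    return has_identifier(text) and has_health_term(text)
```

In this sketch, a message that contains only an identifier, or only a health term, is not flagged; both must be present somewhere in the scanned text.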

When used with well-designed policies, the healthcare lexicons can effectively help companies comply
with HIPAA legislation by securing email communications that contain PHI. The following are several
example messages that would trigger “violations” by the healthcare lexicons. The expressions shown
in bold font indicate terms that would trigger violations.


Example #1: (Standard Rule covering official business messages)

From: Sue
To: Linda
Subject: RE: Shared patient

Linda,
Here’s the info you requested on patient Jane Doe, ss# 123456789. She sees Dr. A. at
General Hospital. She began tamoxifen approximately 5/15/2005. When he saw her in
2006, he stated that she had been on tamoxifen for a year. Her last visit was 1/14/2006.
No sign of cancer.

Example #2: (Standard Rule covering official business messages)

From: Sue
To: Linda
Subject: RE: Daily Inpatient Report

General Hospital does have an acute rehab service. Both members are improving
considerably with their therapy. Members are Mr. Smith, Mbr Num: 123456
& Mr. Jones, Mbr Num: 234567. They are on a rehab unit.

Example #3: (False Positive)

From: Sue
To: Linda
Subject: New Physical Therapy Doc

Hi Linda,
I would like for you to clarify a SSN for a provider – we show it as 123-45-6789 and the
paper work is showing 123-45-6788. Please verify this so I can process the provider.
Thanks,
Sue


3. FINANCIAL LEXICONS
Like the healthcare lexicons, the personal financial lexicons work as a set. There are three: financial
terms, financial identifiers, and credit card numbers. These lexicon files are designed to work in
combination to recognize personally identifiable financial information as defined by the SEC, FTC, Federal
Reserve, and FDIC in their final rules on Privacy of Consumer Financial Information. These agencies serve as
the regulatory arm for the Gramm-Leach-Bliley Act (GLBA).

The lexicons work in conjunction to recognize either the intersection of financial identifiers (such as SSNs,
account numbers, or loan numbers) and financial terms (such as “balance transfer,” “refinance,” or
“deposit”), or credit card numbers on their own.

The following logic is used:

(Financial Identifiers AND Financial Terms) OR Credit Card Numbers
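
A minimal Python sketch of this logic follows. The patterns, terms, and the assumed 16-digit card format are hypothetical stand-ins chosen for illustration, not the contents of the ZixCorp lexicons.

```python
import re

# Hypothetical, abbreviated lexicons (illustration only).
FINANCIAL_IDENTIFIERS = [
    re.compile(r"\bloan\s*#\s*\d+", re.IGNORECASE),  # e.g. "loan #123456"
    re.compile(r"\b\d{9}\b"),                        # nine-digit account number or SSN
]
FINANCIAL_TERMS = ["balance transfer", "refinance", "deposit",
                   "foreclosure", "prepayment fee"]
# Assumed card format: 16 digits, optionally grouped in fours by spaces or hyphens.
CREDIT_CARD = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def matches_financial_standard(text: str) -> bool:
    """(Financial Identifiers AND Financial Terms) OR Credit Card Numbers."""
    lowered = text.lower()
    has_identifier = any(p.search(text) for p in FINANCIAL_IDENTIFIERS)
    has_term = any(term in lowered for term in FINANCIAL_TERMS)
    has_card = CREDIT_CARD.search(text) is not None
    return (has_identifier and has_term) or has_card
```

Note that in this sketch a credit card number is sufficient on its own, while an identifier or a financial term alone is not.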

When used with well-designed policies, these lexicons can effectively help companies implement
corporate consumer privacy policies (mandated by GLBA) by reducing the disclosure of personally
identifiable financial information via email. Additionally, they can help reduce liability risk for financial privacy
issues such as credit card fraud. Below are several example messages that would trigger violations by
the personal financial lexicons. The expressions shown in bold font indicate terms that would trigger
violations.

Example #1: (Credit Card Match)

From: Sue
To: Linda
Subject: My Account

Sorry for the delay in getting back to you. Here is my credit card account info:
5403 1500 0001 0000 – MasterCard Exp. Date: 06/2007


Example #2: (Match on Financial Identifier and Financial Terms)

From: Sue
To: Linda
Subject: Your Account

Dear Miss Jones,


We here at Big-Mortgage-Finance Corp. have noticed that you have defaulted on
loan #123456. We are happy to assist you however possible. Perhaps an automatic
payroll deduction could help you make regular bill payments. Please see the attached
account summary and submit payment in full as soon as possible to avoid foreclosure.

Example #3: (Match on Financial Identifier and Financial Terms)

From: Mike
To: Daniel
Subject: Prepayment Fees

In order to complete the monthly billing, please verify the prepayment fee for the
following accounts:
JOHN DOE 111001111 2,630.00
SUE JONES 222002222 4,250.00
Please respond as soon as possible, so we may complete the billing process.
Thank you for your assistance.


4. PROFANITY LEXICON
The profanity lexicon is designed to recognize profane and obscene language in email messages.
According to Merriam-Webster, profane means “to debase by a wrong, unworthy, or vulgar use.”
Obscene means “marked by violation of accepted language inhibitions and by the use of words
regarded as taboo in polite usage.”

These definitions form the basis on which this lexicon was designed and developed. The terms,
phrases, and patterns selected for this lexicon satisfy a narrow set of selection criteria. These are:

1. the entry must represent a debasement of the referent OR
2. the entry must be considered taboo in polite usage AND
3. the entry must be a vulgarity synonymous with a milder or acceptable term AND
4. the entry must NOT be merely an elliptical euphemism AND
5. the entry must be at least 4 characters long (expressions of 3 characters or fewer can result
in a high occurrence of false positives)

Additionally, ...
6. some idiomatic misspellings are included
7. some foreign language terms (notably from Yiddish) are included

The primary sources for the terms in this lexicon are:

1. “English as a Second F*cking Language” by Sterling Johnson, St. Martin's Press, New York, 1995.
2. “Slang and Euphemism: A Dictionary of Oaths, Curses, Insults, Ethnic Slurs, Sexual Slang and
Metaphor, Drug Talk, College Lingo, and Related Matters” by Richard A. Spears, Signet Books,
New York, 2001.

Examples of messages containing profanity are not included, for obvious reasons.


5. SOCIAL SECURITY NUMBER LEXICON


As described later in this paper, the ZixCorp lexicons have been designed to be flexible and to
err on the side of prudent practice and liability protection, which reduces false negatives. One of the
ways we employ this flexibility is by providing an add-on lexicon that is designed to catch all Social Security
numbers. In the standard lexicon offering, Social Security numbers are used as an identifier and, as such,
must be accompanied by either a health term or a financial term in the email before encryption takes place. The
add-on lexicon, which has become more popular due to pending legislation and a growing identity theft problem,
causes any email that contains a Social Security number to be encrypted, even if there is no accompanying health
or financial term.
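
The difference between the standard behavior and the add-on lexicon can be sketched as follows. The single SSN mask, the context terms, and the `ssn_addon_enabled` flag are hypothetical names introduced for illustration; they are not ZixVPM settings.

```python
import re

# Assumed mask: nine digits, optionally hyphenated as 3-2-4.
SSN_MASK = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")
# Hypothetical health and financial terms (illustration only).
CONTEXT_TERMS = {"diagnosis", "cancer", "refinance", "deposit"}


def should_encrypt(text: str, ssn_addon_enabled: bool = False) -> bool:
    """Standard behavior: an SSN needs an accompanying health or financial term.
    With the add-on lexicon enabled: an SSN alone is enough."""
    has_ssn = SSN_MASK.search(text) is not None
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_context = not words.isdisjoint(CONTEXT_TERMS)
    return has_ssn and (ssn_addon_enabled or has_context)
```

For a message such as "Please update the record for 123-45-6789.", this sketch returns False under the standard behavior and True when the add-on flag is set.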

5. LEXICON ACCURACY
ZixCorp goes to great lengths to ensure that lexicons are accurate and precise. This is accomplished
through a comprehensive definition and design of the lexicons, coupled with exhaustive manual analysis
to ensure that the lexicon results agree with the judgment of the lexicon designers. The following
example provides a high level overview of the design process and validation of the healthcare lexicons:

1. Lexicon designed based on the definition of PHI from HIPAA regulations
2. Jury standard document developed
3. Message samples gathered from payors and providers (18,000+ messages)
4. Samples manually examined (all 18,000+ messages) using the jury standard document as a reference
5. Reference sources identified to ensure comprehensive content (e.g., medical dictionaries,
professionally accepted terminology lists)
6. Lexicons constructed and run against the sample
7. Lexicon results compared to manual results
8. Lexicons tuned and rerun against the sample until performance is excellent
9. Accuracy, false negative, and false positive rates calculated
10. Revisions made based on ZixAuditor analyses and ongoing customer input

As with all automated analysis tools, there will be a certain percentage of false positives and false
negatives. With each new release of the ZixCorp lexicons, the accuracy improves, minimizing the
occurrence of false readings. The accuracy of a lexicon is calculated using the following formula:

Accuracy = (All Correctly Processed Messages) / (All Messages)

The current healthcare lexicons have an average accuracy rate greater than 99%. This means that, without
any customization, these lexicons correctly classify more than 99% of messages as either containing or not
containing sensitive health information.
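
As a worked illustration, the accuracy formula can be computed from a comparison of lexicon results against manual results. The sketch below also computes the false negative and false positive rates that are defined later in this section. The class name, the field names, and the counts in the usage note are made up for the example and are not ZixCorp measurements.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Lexicon results compared to manual review (hypothetical structure)."""
    true_positives: int   # sensitive and flagged
    false_negatives: int  # sensitive but not flagged
    false_positives: int  # not sensitive but flagged
    true_negatives: int   # not sensitive and not flagged


def accuracy(c: Counts) -> float:
    """(All Correctly Processed Messages) / (All Messages)."""
    total = (c.true_positives + c.false_negatives
             + c.false_positives + c.true_negatives)
    return (c.true_positives + c.true_negatives) / total


def false_negative_rate(c: Counts) -> float:
    """(False Negatives) / (All Sensitive Messages)."""
    return c.false_negatives / (c.true_positives + c.false_negatives)


def false_positive_rate(c: Counts) -> float:
    """(False Positives) / (All Non-Sensitive Messages)."""
    return c.false_positives / (c.false_positives + c.true_negatives)
```

With made-up counts of 90 true positives, 10 false negatives, 20 false positives, and 880 true negatives (1,000 messages in total), accuracy is 970 / 1000 = 97%, the false negative rate is 10 / 100 = 10%, and the false positive rate is 20 / 900, or about 2.2%. The sketch assumes the sample contains at least one sensitive and one non-sensitive message, so that no denominator is zero.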


False Negatives:
A message is classified as a false negative when it contains sensitive information and a lexicon fails to
flag it. To minimize false negatives, the ZixCorp lexicons make use of wildcard characters, masks and
other mechanisms to catch multiple forms of various terms. For example, the identifier lexicon that is
used in searching for PHI uses several masks to search for various combinations of nine digits that might
represent a Social Security number.
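
The idea of using several masks for nine-digit combinations can be sketched with regular expressions. The three masks below are assumptions chosen for illustration; the actual masks in the identifier lexicon are not listed in this paper.

```python
import re

# Assumed masks for common ways of writing nine digits.
SSN_MASKS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # 123-45-6789
    re.compile(r"\b\d{3} \d{2} \d{4}\b"),  # 123 45 6789
    re.compile(r"\b\d{9}\b"),              # 123456789
]


def find_ssn_candidates(text: str) -> list[str]:
    """Return every substring that matches one of the masks."""
    found = []
    for mask in SSN_MASKS:
        found.extend(m.group(0) for m in mask.finditer(text))
    return found
```

Each mask catches a different written form of the same kind of number, which is how multiple masks help reduce false negatives. Longer digit runs, such as a ten-digit number, do not match any of these masks.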

False negatives can occur when organizations have terms or codes that are unique to their operation. In
the healthcare lexicons, for example, if a healthcare provider has unique patient identification number
formats or an insurance company has unique subscriber ID numbers, these may not be flagged by the
healthcare lexicons in their standard form. The lexicons have been designed to be very flexible, however,
and these unique identifiers (and other types of terms) can be added to the lexicons used
by ZixVPM, or added before data is analyzed by ZixAuditor. This type of minor customization has been shown
to reduce the false negative rates for PHI to almost 0%. The false negative rate is calculated using the
following formula:

False Negative Rate = (False Negatives) / (All Sensitive Messages)

False Positives:
A false positive occurs when a message is flagged as containing sensitive information when in fact it
does not. The ZixCorp lexicons have been designed to err on the side of prudent practice and liability
protection, which reduces false negatives. Because of this, false positives are more likely to occur.
Through the ongoing validation process described above, the false positive rate of the healthcare
lexicons (in the standard definition) has been shown to be less than 1%.

False positives can occur for several reasons. The following list shows situations that could potentially
cause false positives with the healthcare lexicons:

• Resumes of healthcare workers that include SSNs or dates of birth
• Discussions about HIPAA, PHI, or patient claiming procedures
• Numbers that appear to be health ID numbers but are not
• Provider SSNs used for tax purposes

The ZixCorp lexicons use a variety of mechanisms to reduce the rate of false positives. For example,
lexicon entries can be combined with exclusion lists (so that terms are ignored when they appear close
to other specific terms) or inclusion lists (so that terms are only considered when found close to other
specific terms). These mechanisms allow lexicons to search not only for content, but for content used
within specific context. The false positive rate is calculated using the following formula:

False Positive Rate = (False Positives) / (All Non-Sensitive Messages)
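
The exclusion and inclusion lists described above can be sketched as a proximity check. Here "close to" is interpreted as a fixed window of words on either side of the term; the window size, the function names, and the example terms are all assumptions, since this paper does not specify how proximity is measured.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens of the text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_hits(text: str, term: str, window: int = 5,
              include: tuple = (), exclude: tuple = ()) -> bool:
    """True if `term` occurs in a context permitted by the lists.

    exclude: the occurrence is ignored when any of these words is nearby.
    include: the occurrence counts only when one of these words is nearby.
    """
    words = tokens(text)
    for i, word in enumerate(words):
        if word != term:
            continue
        nearby = set(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
        if exclude and nearby & set(exclude):
            continue  # an exclusion term appears close by
        if include and not nearby & set(include):
            continue  # no inclusion term appears close by
        return True
    return False
```

For example, with `exclude=("provider",)` the term "ssn" is ignored in "Please clarify the SSN for a provider before processing" but still counts in "The patient SSN is on file". With `include=("test", "biopsy")` the term "positive" counts in "Her biopsy came back positive" but not in "I am positive we can meet Friday".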


6. CONCLUSION
It is important to remember that the ZixCorp lexicons do not function in isolation. The lexicons are
designed to work within ZixAuditor and ZixVPM, combined with well-designed policy rules and
actions. Used this way, the lexicons serve as a fundamental part of the powerful content scanning capabilities
within the ZixCorp products.

ZixCorp Professional Services can help customers develop custom lexicons (for ZixAuditor searching),
deploy custom lexicons (for ZixVPM), or design effective ZixVPM policies that can best implement their
corporate email policies.


7. ABOUT ZIXCORP
www.zixcorp.com

Zix Corporation (ZixCorp®) is the leading provider of hosted email encryption and e-prescribing services.
ZixCorp's email encryption services provide an easy and cost-effective way to ensure customer privacy
and regulatory compliance for corporate email. Its e-prescribing service reduces costs and improves
patient care by automating the prescription process between payers, doctors, and pharmacies.

Email Encryption Services

ZixVPM® (Virtual Private Messenger)
An on-site, policy-driven, server-based solution that provides easy-to-deploy, company-wide
encryption, security, content filtering, and management of outbound corporate email.

ZixAuditor®
A non-intrusive email assessment service that enables organizations to identify email security
vulnerabilities and implement more effective policies and procedures to achieve higher levels of
protection and compliance.

ZixPort®
A Web-based secure e-messaging portal that provides enterprises with private, secure, and branded
communication capabilities while minimizing the impact to existing IT, Web, or security infrastructures.

ZixMail®
An easy-to-use, point-to-point desktop service that enables users to encrypt, decrypt, and send private
emails and attachments to anyone.

e-Prescribing

PocketScript®
An e-prescribing application that applies the benefits of e-messaging by enabling healthcare providers to
write and transmit prescriptions electronically from anywhere directly to the pharmacy.

Research and Analysis

ZixResearch Center™
A dedicated research and analysis unit that guides the development of patents, lexicons, and
classification systems to expand, improve, and customize ZixCorp services.

For more information on ZixCorp’s products and services contact ZixCorp at 866-257-4949
or email sales@zixcorp.com.



2711 N. Haskell Ave.
Suite 2300, LB 36
Dallas, TX 75204
1-866-257-4949
www.zixcorp.com

© 2006 Zix Corporation. All rights reserved. Zix Corporation cannot be responsible for errors in typography.
All company, brand and product names are trademarks and/or registered trademarks of their respective owners. www.zixcorp.com LEXICONWP5206