
7/10/2019 Identifying and working with sensitive healthcare data with Amazon Comprehend Medical | AWS Machine Learning


AWS Machine Learning Blog

Identifying and working with sensitive healthcare data with Amazon Comprehend Medical
by Aaron Friedman | on 23 JAN 2019 | in Amazon Comprehend, Artificial Intelligence

At AWS, I regularly speak with AWS customers and AWS Partner Network (APN) partners about how they are
using technology to transform human health. These companies often generate large amounts of health
data that they use in a variety of applications, such as population health management and electronic health
records. Developers need to find ways to use the valuable medical information in these applications while
meeting their compliance obligations with regard to sensitive data, such as protected health information
(PHI). Some applications where our customers and APN partners are doing this today are clinical decision
support, revenue cycle management, and clinical trial management.

There are multiple methods to mask data, and each organization has its own approach based on
internal risk assessments. We recommend that you consult risk assessment specialists for your
organization’s specific implementation process. Typically, data is masked in two steps. First, PHI must be
identified. Then, an algorithm is used that either anonymizes or de-identifies the data, usually in accordance
with Safe Harbor or expert determination. This approach lends itself to using a state machine to apply the
business logic your organization requires for each step independently and pass the information between
states.

In this blog post, I’ll demonstrate how you can use a combination of Amazon Comprehend Medical, AWS
Step Functions, and Amazon DynamoDB to identify sensitive health data and help support your compliance
objectives. I’ll then discuss some potential extensions of the architecture that are patterns customers often
adopt.

The architecture

This architecture uses the following services:

Amazon Comprehend Medical to identify entities within a body of text
AWS Step Functions and AWS Lambda to coordinate and execute the workflow
Amazon DynamoDB to store the de-identified mapping

This architecture and the code that follows are available as an AWS CloudFormation template.

https://aws.amazon.com/blogs/machine-learning/identifying-and-working-with-sensitive-healthcare-data-with-amazon-comprehend-medical/ 1/14

The individual components

Like many modern applications being built on AWS, the individual components within this architecture are
represented as Lambda functions. In this blog post, I’ll show you how to build three Lambda functions:

IdentifyPHI: Uses the Amazon Comprehend Medical API to detect and identify PHI entities from a
body of text, such as a medical note.
MaskEntities: Takes the entities from IdentifyPHI as input and masks them in the body of text
DeidentifyEntities: Takes the entities from IdentifyPHI and applies a hash to each entity and stores
that mapping in DynamoDB.

Let’s walk through each in turn.

Identify PHI

The following code reads in a JSON body, extracts PHI entities from the message, and returns a list of
extracted entities.
# The top and bottom of this listing are cut off in the source. The imports,
# the client, and the except/finally body are assumed; they are the minimum
# the visible code needs and mirror the MaskEntities handler.
import logging
import threading

import boto3

client = boto3.client('comprehendmedical')

def timeout(event, context):
    raise Exception('Execution is about to time out, exiting...')

def extract_entities_from_message(message):
    return client.detect_phi(Text=message)

def handler(event, context):
    # Add in context for Lambda to exit if needed
    timer = threading.Timer((context.get_remaining_time_in_millis() / 1000.00) - 1,
                            timeout, args=[event, context])
    timer.start()
    print('Received message payload. Will extract PII')
    try:
        # Extract the message from the event
        message = event['body']['message']
        # Extract all entities from the message
        entities_response = extract_entities_from_message(message)
        entity_list = entities_response['Entities']
        print('PII entity extraction completed')
        return entity_list
    except Exception as e:
        logging.error('Exception: %s. Unable to extract entities from message' % e)
        raise
    finally:
        # Assumed: stop the watchdog timer on every exit path
        timer.cancel()

The workhorse in this Lambda function is the Amazon Comprehend Medical DetectPHI API call, which
returns a list of entities that Amazon Comprehend Medical identifies. Each identified entity comes with a
confidence score that indicates how confident the service is in the accuracy of that entity. Take these
scores into account and review the identified entities to make sure they are correct. For more
information on the returned data structure, see the DetectPHI documentation.
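As a rough aid for that review, the sketch below groups a list of entities by type and reports the lowest confidence score seen for each type. It assumes each entity is a dictionary with the `Text`, `Type`, and `Score` fields that DetectPHI returns; the function name and the sample values are made up for illustration.

```python
from collections import defaultdict

def summarize_entities(entity_list):
    """Count entities per Type and track the lowest Score seen for each Type."""
    summary = defaultdict(lambda: {"count": 0, "min_score": 1.0})
    for entity in entity_list:
        entry = summary[entity["Type"]]
        entry["count"] += 1
        entry["min_score"] = min(entry["min_score"], entity["Score"])
    return dict(summary)

# Illustrative entities only; these scores did not come from a real API call.
sample_entities = [
    {"Text": "SALAZAR, CARLOS", "Type": "NAME", "Score": 0.99},
    {"Text": "Jorge", "Type": "NAME", "Score": 0.62},
    {"Text": "2/11/1961", "Type": "DATE", "Score": 0.97},
]

summary = summarize_entities(sample_entities)
for entity_type, stats in sorted(summary.items()):
    print(entity_type, stats["count"], stats["min_score"])
```

A type with a low minimum score is a reasonable place to start a manual review.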

Mask entities

There are multiple approaches to masking a message. In this example, we take each entity and replace it
with a series of pound signs (#) corresponding to the length of the entity. The output is the input
message with each entity masked. You can choose whichever method is most meaningful and appropriate
for your business. For example, if there are multiple NAME PHI entities, you could order
them as NAME1, NAME2, and so on.

Here’s the Lambda function:

# The top and bottom of this listing are cut off in the source. The imports,
# the timeout function, and the raise/finally lines are assumed.
import logging
import threading

def timeout(event, context):
    raise Exception('Execution is about to time out, exiting...')

def mask_entities_in_message(message, entity_list):
    for entity in entity_list:
        message = message.replace(entity['Text'], '#' * len(entity['Text']))
    return message

def handler(event, context):
    # Add in context for Lambda to exit if needed
    timer = threading.Timer((context.get_remaining_time_in_millis() / 1000.00) - 1,
                            timeout, args=[event, context])
    timer.start()
    print('Received message payload')
    try:
        # Extract the entities and message from the event
        message = event['body']['message']
        entity_list = event['body']['entities']
        # Mask entities
        masked_message = mask_entities_in_message(message, entity_list)
        print(masked_message)
        return masked_message
    except Exception as e:
        logging.error('Exception: %s. Unable to extract entities from message' % e)
        raise
    finally:
        # Assumed: stop the watchdog timer on every exit path
        timer.cancel()
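The numbered-label idea mentioned earlier (NAME1, NAME2, and so on) could be sketched as follows. This is not part of the original Lambda function. It assumes each entity carries the `Type`, `Text`, `BeginOffset`, and `EndOffset` fields that DetectPHI returns, and the function name is hypothetical.

```python
def mask_with_numbered_labels(message, entity_list):
    """Replace each entity with a label such as NAME1 or DATE2."""
    labels = {}    # (Type, Text) -> label, so a repeated entity keeps its label
    counters = {}  # Type -> number of distinct texts labeled so far

    # Assign labels in order of first appearance in the message.
    for entity in sorted(entity_list, key=lambda e: e['BeginOffset']):
        key = (entity['Type'], entity['Text'])
        if key not in labels:
            counters[entity['Type']] = counters.get(entity['Type'], 0) + 1
            labels[key] = '%s%d' % (entity['Type'], counters[entity['Type']])

    # Replace from the end of the message so earlier offsets stay valid.
    for entity in sorted(entity_list, key=lambda e: e['BeginOffset'], reverse=True):
        label = labels[(entity['Type'], entity['Text'])]
        message = message[:entity['BeginOffset']] + label + message[entity['EndOffset']:]
    return message

# Illustrative input; the offsets are character positions in the message.
example_message = "Carlos came with Jorge. Carlos left."
example_entities = [
    {"Type": "NAME", "Text": "Carlos", "BeginOffset": 0, "EndOffset": 6},
    {"Type": "NAME", "Text": "Jorge", "BeginOffset": 17, "EndOffset": 22},
    {"Type": "NAME", "Text": "Carlos", "BeginOffset": 24, "EndOffset": 30},
]
print(mask_with_numbered_labels(example_message, example_entities))
```

Working from offsets instead of `str.replace` avoids touching text that merely happens to match an entity, at the cost of requiring the offsets to be passed along with the entities.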

De-identify entities

There are multiple methods for de-identification. The example described in this blog post is meant to
demonstrate one way you can de-identify sensitive entities so that they can be reidentified later on by a
user with the appropriate permissions. Here, we perform several steps:

1. Apply a salt to the entity.
2. For each entity, generate a SHA3-256 hash of the salted entity. Store the hash and the original entity in a dictionary.
3. Replace each entity in the message with the hash generated in step 2.
4. Generate a SHA3-256 hash of the de-identified message.
5. Store the entities in DynamoDB with the hashed message as the hash key and the entity hash as the
range key.

Here is the Lambda function for this step. The EntityMap, which is a DynamoDB table, is read in as an
environment variable:

from botocore.vendored import requests
import json
import boto3
import hashlib
import base64
import logging
import threading
import uuid
import os

ddb = boto3.client('dynamodb')

def timeout(event, context):
    raise Exception('Execution is about to time out, exiting...')

def store_deidentified_message(message, entity_map, ddb_table):
    hashed_message = hashlib.sha3_256(message.encode()).hexdigest()
    for entity_hash in entity_map:
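Because the listing is cut off, here is a minimal, self-contained sketch of the five de-identification steps. It is an illustration under stated assumptions, not the original Lambda code: the salt is a random UUID appended to the entity, entities are replaced by character offset (`BeginOffset` and `EndOffset` from DetectPHI), the attribute names `MessageHash`, `EntityHash`, and `Entity` are hypothetical, and the function returns the items instead of writing them to DynamoDB.

```python
import hashlib
import uuid

def deidentify_message(message, entity_list):
    """Return the de-identified message and the mapping items to store."""
    entity_map = {}
    # Work from the end of the message so earlier offsets stay valid.
    for entity in sorted(entity_list, key=lambda e: e['BeginOffset'], reverse=True):
        # Step 1: salt the entity (assumed: append a random UUID).
        salted_entity = entity['Text'] + uuid.uuid4().hex
        # Step 2: hash the salted entity and remember the mapping.
        entity_hash = hashlib.sha3_256(salted_entity.encode()).hexdigest()
        entity_map[entity_hash] = entity['Text']
        # Step 3: replace the entity in the message with its hash.
        message = (message[:entity['BeginOffset']] + entity_hash
                   + message[entity['EndOffset']:])
    # Step 4: hash the de-identified message.
    message_hash = hashlib.sha3_256(message.encode()).hexdigest()
    # Step 5: build one item per entity (message hash = hash key,
    # entity hash = range key). Attribute names are hypothetical.
    items = [
        {'MessageHash': message_hash, 'EntityHash': entity_hash, 'Entity': text}
        for entity_hash, text in entity_map.items()
    ]
    return message, items

# Illustrative input; the offsets are character positions in the message.
example_message = "Carlos came with Jorge. Carlos left."
example_entities = [
    {"Text": "Carlos", "BeginOffset": 0, "EndOffset": 6},
    {"Text": "Jorge", "BeginOffset": 17, "EndOffset": 22},
    {"Text": "Carlos", "BeginOffset": 24, "EndOffset": 30},
]
deidentified, mapping_items = deidentify_message(example_message, example_entities)
print(deidentified)
```

Each occurrence gets its own salt, so the two mentions of the same name end up with different hashes. Replacing by offset also avoids a pitfall of plain string replacement: a short numeric entity such as an age could otherwise match digits inside a hash that was already inserted.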

Building the Boto3 Lambda layer

Next, we’ll create a Lambda layer containing Boto3. This is a common best practice when deploying Lambda
functions in production.

Copy and paste the following code into a terminal. Feel free to change boto3env to a folder of your choice.
The following example uses Python 3.6.

pip install boto3 --target python/.

# install botocore
pip install botocore --target python/.

# zip the layer contents
zip boto3layer.zip -r python/

aws lambda publish-layer-version --layer-name boto3-layer --zip-file fileb://boto3layer

Note the LayerVersionArn in the output. We’ll use this shortly.

Building the state machine

The multiple steps within this workflow, such as data passed between steps and forking paths based on user
input, can be best represented as a state machine. We’ll use AWS Step Functions to define the state
machines and execute the individual Lambda functions.


The state machine reads in a JSON blob containing the message text to process as well as whether to mask
or de-identify the message. The overall steps are:

1. Identify PHI entities using Amazon Comprehend Medical APIs.
2. Determine whether to mask entities or de-identify.
3. Based on results of Step 2, act accordingly.

Here is the Amazon States Language code defining this state machine:

{
  "Comment": "State Machine that anonymizes or deidentifies PHI",
  "StartAt": "Identify PHI",
  "States": {
    "Identify PHI": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:IdentifyPHILambda",
      "InputPath": "$",
      "ResultPath": "$.body.entities",
      "Next": "Anonymize Or De-identify"
    },
    "Anonymize Or De-identify": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.body.anonymizeOrDeidentify",
          "StringEquals": "anonymize",

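The definition above is cut off partway through the Choice state. As a hedged illustration of how such a definition could be completed, the sketch below builds one as a Python dictionary and serializes it to JSON. Only the first two states follow the visible listing. The names of the branch states, the `Default` fallback, the Fail state, and the placeholder function ARNs are assumptions.

```python
import json

def build_state_machine(identify_arn, mask_arn, deidentify_arn):
    """Return an Amazon States Language definition as a dictionary."""
    choice_variable = "$.body.anonymizeOrDeidentify"
    return {
        "Comment": "State Machine that anonymizes or deidentifies PHI",
        "StartAt": "Identify PHI",
        "States": {
            "Identify PHI": {
                "Type": "Task",
                "Resource": identify_arn,
                "InputPath": "$",
                "ResultPath": "$.body.entities",
                "Next": "Anonymize Or De-identify",
            },
            "Anonymize Or De-identify": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": choice_variable,
                     "StringEquals": "anonymize",
                     "Next": "Mask Entities"},         # hypothetical state name
                    {"Variable": choice_variable,
                     "StringEquals": "deidentify",
                     "Next": "De-identify Entities"},  # hypothetical state name
                ],
                "Default": "Unknown Choice",           # assumed fallback
            },
            "Mask Entities": {"Type": "Task", "Resource": mask_arn, "End": True},
            "De-identify Entities": {"Type": "Task", "Resource": deidentify_arn, "End": True},
            "Unknown Choice": {
                "Type": "Fail",
                "Error": "InvalidChoice",
                "Cause": "anonymizeOrDeidentify must be 'anonymize' or 'deidentify'",
            },
        },
    }

# Placeholder ARNs; substitute the ARNs of your own Lambda functions.
definition = build_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:IdentifyPHILambda",
    "arn:aws:lambda:us-east-1:123456789012:function:MaskEntitiesLambda",
    "arn:aws:lambda:us-east-1:123456789012:function:DeidentifyEntitiesLambda",
)
print(json.dumps(definition, indent=2))
```

Because `ResultPath` is `$.body.entities`, the entity list from the first state is added to the original input, so the downstream functions can read both `body.message` and `body.entities`.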
Testing the state machine

As mentioned in the introduction, you can deploy the entire architecture using AWS CloudFormation.
Launch the CloudFormation template now:

Use the LayerVersionArn output that you noted previously in the Boto3LayerArn CloudFormation
parameter.

After the CloudFormation stack deploys, you should have the following resources:

The three Lambda functions
A DynamoDB table containing mappings to the re-identified entities
A Step Functions state machine
AWS Identity and Access Management (IAM) resources

Let’s take a fictional medical note, or rather a combination of what would be several notes, which was
provided by the Amazon Comprehend Medical team. Notice that it’s filled with typos, which would present
challenges for rules-based approaches for entity identification.

Stay Free Medical Center


Emergency Department
Clinical Summary
12341 W. Bohannon Rd, Grantville, GA
Phone: (770) 922-9800

PERSON INFORMATION
Name: SALAZAR, CARLOS
MRN: RQ36114734
ED Arrival Time: 11/12/2011 18:15

Sex: Male
DOB: 2/11/1961
Age: 50 Years
Visit Reason: New onset A Fib, SOB
Acuity: 2 Emergent Disposition: Home/Self-Care

Address: 186 VALETINE, NE 69201


Phone: 402 213-2221

SUBJECTIVE:
Carlos came to the ED via ambulance accompanied by son, Jorge. He is a 50 yo male who
was working at Food Corp when he had sudden onset of palpitations. Carlos stated his fater,
Diego, also had palpitations through his life.

Provider Contact Time: 11/12/2011 19:00


Decision to Admit: Not entered
ED Departure Time: 11/23/2011 00:07

DIAGNOSIS: Hyperthyroidism
Attending Provider:
Saanvi Sarkar, MD

Primary Nurse(s):
Jackson; Mateo

Fill New Prescriptions:


nepafenac (nepafenac 1 mg / 1mL Ophthalmic Suspension) 1 drop left eye every 12 hours
14 day(s)
zofran (Ondansetron 4 mg oral tablet) 4 mg ORAL PRN
atropine sulfate 0.05 mcg / hyopscyamine sulfate 3.1 mcg / phenobartbital 48.6 MG /
scopolamine hydrobromide 0.0195 mg ( Donnata ER oral tablet) 1 table PO PRN
acetaminophen – hydrocodone ( Vicodin 5 mg – 500 mg oral tablet ) 2 tablet(s) by Mouth
every 6 hours as needed for pain
docusate sodium 100 mg oral capsule 100 mg by Mouth twice daily as needed for
constipation

Allergies:
penicillins
ibuprofen
bee pollen

Patient Education and Follow-up Information


Instructions:
ED, Nausea (Custom)
Follow up:

With:
Address:
When:

Return to Emergency Department

Comments:

Nausea Vomiting

Nausea persists without control from anti-nausea medications Projectile vomiting


Uncontrolled , consistent nausea & vomiting Blood or “coffee grounds” appearing material
in vomit Medicine not kept down because of vomiting Weakness or dizziness along with
nausea/vomiting Severe stomach pain while vomiting

Pain
Severe Chest / Arm pain Severe squeezing or pressure in chest Severe sudden headache
New or uncontrolled pain New headache Chest discomfort Pounding heart Heart “flip –
flop” feeling Painful Central Line site or area of “tunnel” Burning in chest or stomach Pain or
burning while urinating Pain with infusion of medications or fluids into Central Line

Diarrhea

Constant or uncontrolled diarrhea New onset diarrhea Diarrhea with fever and abdominal
cramping Whole pills passed in stool Greater than 5 times each day Stool which is bloody ,
burgundy or black Abdominal cramping

Fatigue
Unable to wake
Dizziness Fatigue is getting worse Too tired to get out of bed or walk to the bathroom
Staying in bed all day

Fever / Chills

Shaking chills , temperature may be normal Temperature greater than 38.3° C or 100.9° F by
mouth Fever greater than 1 degree above usual if on steroids 24 Cold symptoms ( runny
nose , watery eyes , sneezing , coughing )

With:
Address:
When:

Follow up with primary care provider

Comments:

Call tomorrow to make an appointment for the next 1-2 days and to start arranging PCP
follow-up

Thank you for visiting the Stay Free Medical Center.

Comments:

Call tomorrow to make an appointment for the next 1-2 days and to start arranging PCP
follow-up

Thank you for visiting the Stay Free Medical Center.

The input to the state machine takes two values. First, the note. Second, a choice of whether to anonymize
the note or de-identify it. In this example we’ll de-identify the message. Here’s what that looks like:


{
"body": {
"message": " Stay Free Medical Center \nEmergency Department \nClinical Summary
"anonymizeOrDeidentify": "deidentify"
}
}
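Writing this JSON by hand is error-prone because every line break in the note has to be escaped as `\n`. One way to sidestep that, sketched here with a hypothetical helper name and a shortened note, is to let `json.dumps` do the escaping:

```python
import json

VALID_MODES = {"anonymize", "deidentify"}

def build_execution_input(note, mode):
    """Serialize the note and the chosen mode into the state machine input."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s" % sorted(VALID_MODES))
    return json.dumps({"body": {"message": note, "anonymizeOrDeidentify": mode}})

# Shortened note for illustration; use the full note text in practice.
note = "Stay Free Medical Center\nEmergency Department\nClinical Summary"
payload = build_execution_input(note, "deidentify")
print(payload)

# To save the payload for the CLI:
#   with open("example_note.json", "w") as f:
#       f.write(payload)
```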

In the AWS CloudFormation console, navigate to the output page and note the state machine Amazon
Resource Name (ARN). You will use it later to start a state machine execution.

You can test using the AWS CLI, your AWS SDK of choice, or the AWS Step Functions console. The following
command shows what it would be like if you used the CLI. However, before you type the following
command, copy the previous JSON and save it to example_note.json. Also replace the AWS Step Functions
state machine ARN with the ARN in the CloudFormation output.

aws stepfunctions start-execution --state-machine-arn YOUR_STATEMACHINE_ARN --input fil
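If you prefer an SDK over the CLI, the same call can be sketched in Python. The helper name below is hypothetical. It takes the Step Functions client as a parameter so that it can be exercised without AWS credentials, and it relies on the `start_execution` operation of the boto3 Step Functions client, which accepts `stateMachineArn` and `input` and returns an `executionArn`.

```python
import json

def start_deidentification(sfn_client, state_machine_arn, note, mode="deidentify"):
    """Start one state machine execution and return its execution ARN."""
    payload = {"body": {"message": note, "anonymizeOrDeidentify": mode}}
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    return response["executionArn"]

# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   execution_arn = start_deidentification(
#       boto3.client("stepfunctions"), "YOUR_STATEMACHINE_ARN", note_text)
```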

The overall execution should take only a couple of seconds. Let’s navigate to the AWS Step Functions
console to see what happened.


When you ran the previous command, several things happened.

1. A Lambda function identified potential PHI entities within the note.
2. These entities were salted and the resulting combination was hashed using SHA3-256.
3. The hashes replaced the original entities in the message and the updated message was then hashed.
4. The mappings were stored in DynamoDB.
5. The hashed message was returned as the output of the execution.

You can view the output from the steps in the AWS Step Functions console. The previous message should
now look like the following (formatted for ease of reading). The de-identified message still contains
valuable information that can be used, but the sensitive data has been replaced with hashes, as
described in the de-identification steps earlier.

8db49f8fdfc0a003402dd68439d2a848635d6c60a2719020c7b922916aafbdf0
c027ee7d7992ea804c589c2c2777fc646e2f394d5db900177246f9d7bd8d762d
Clinical Summary
5d0276605f49fa2c8e010b9781cb348d9efca84dd7a49e0ce6fb845e156f3331
Phone: 988c20b763f3b60b83aa64f48ce3184642dcf15707eeaead9d24c266e8967680

PERSON INFORMATION
Name: ba1a8b9ba885c1b867b0fda332845d2c4435a921cdbf849a86b4e5768b00972395cbf6c9eda8a
MRN: 45dd4310f18cddb1f37c4e11b36b12e77fc64001229a2632333d1e0f379f5847
2e0fbaa0c9008457c15f9306a9cd588ec09402f8db12194ec705b3f058e3eff6 Arrival Time: 712c

Sex: Male
DOB: 88d76b85ad3e7cc2b1d06ea99a8a13df842fdd7ab0986ae3c747a3993944f91d
Age: b9ba885c1b867b0fda332845d2c4435a921cdbf849a86b4e5768b00972395cbf Years
Visit Reason: New onset A Fib, SOB
Acuity: 2 Emergent Disposition: Home/Self-Care
Address: aff3537058e53a9de01a4689cf1c3109584370e98ec31241a3ae4c07eceb0cbb

Here’s what the table looks like after two runs with the same message.

Because each entity is salted, a hash can't be mapped back to the original entity without the
DynamoDB mapping table. You can see the effect of salting in the table: the same entity produces a
different hash on each run. Additionally, because you can manage DynamoDB access using IAM, you can
control who has access to the items in your table. You can then use AWS CloudTrail to audit reads
from your table containing sensitive information.
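For a user who is allowed to read the mapping table, re-identification amounts to reversing the substitution. The sketch below assumes the mapping items for one message have already been fetched and are plain dictionaries. The attribute names `EntityHash` and `Entity` and the function name are hypothetical.

```python
def reidentify_message(deidentified_message, mapping_items):
    """Put the original entities back in place of their hashes."""
    for item in mapping_items:
        deidentified_message = deidentified_message.replace(
            item["EntityHash"], item["Entity"])
    return deidentified_message

# Illustrative mapping; real hashes would be SHA3-256 hex digests.
hash_a = "a" * 64
hash_b = "b" * 64
example_items = [
    {"EntityHash": hash_a, "Entity": "Carlos"},
    {"EntityHash": hash_b, "Entity": "Jorge"},
]
example_text = "%s came to the ED accompanied by son, %s." % (hash_a, hash_b)
print(reidentify_message(example_text, example_items))
```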

Conclusion and next steps

Protecting sensitive data is always job zero for healthcare organizations. In this blog post, I demonstrated
how you can use Amazon Comprehend Medical to work with and identify protected health information.
While organizations have different approaches to protect sensitive data, they follow the same architectural
pattern: (1) identify the sensitive entities, and (2) apply the appropriate protection strategy for the sensitive
entities as defined by your organization. A state machine is well-suited to orchestrate the two steps.

There are additional modifications you can make to this architecture to suit your needs. Here are a few
ideas:

Put the state machine behind Amazon API Gateway to add an authorization layer to process your text,
as well as a gateway to the individual Lambda functions.
Filter by the confidence of the DetectPHI call. Amazon Comprehend Medical entities have a Score
field in addition to Text. You can apply a threshold to filter the entities, depending on your business
requirements.
Use DetectPHI in conjunction with DetectEntities to help you detect and identify PHI, and also extract
non-PHI entity relationships, which can be used for downstream analytics.
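As an example of the confidence filter, the following sketch splits entities into those at or above a threshold and those below it. The function name and the 0.8 threshold are arbitrary assumptions. The only field it relies on is the `Score` that each entity carries.

```python
def split_by_confidence(entity_list, threshold=0.8):
    """Return (confident, uncertain) entity lists based on the Score field."""
    confident = [e for e in entity_list if e["Score"] >= threshold]
    uncertain = [e for e in entity_list if e["Score"] < threshold]
    return confident, uncertain

# Illustrative entities only; these scores did not come from a real API call.
sample_entities = [
    {"Text": "SALAZAR, CARLOS", "Type": "NAME", "Score": 0.99},
    {"Text": "Food Corp", "Type": "ADDRESS", "Score": 0.41},
]
confident, uncertain = split_by_confidence(sample_entities)
print(len(confident), "confident;", len(uncertain), "need review")
```

Keeping the low-scoring entities in a separate list, rather than discarding them, leaves room for a manual review step, since dropping them outright could leave real PHI untouched.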

Interested in learning more about Amazon Comprehend Medical?

Check out the documentation
Explore related blog posts discussing Amazon Comprehend Medical
Amazon Comprehend Medical – Natural Language Processing for Healthcare Customers
Extract and visualize clinical entities using Amazon Comprehend Medical

Coming to HIMSS? Meet the AWS Healthcare team live at HIMSS19 Booth #5058!

We welcome your questions and comments. We look forward to hearing from you!


About the Author

Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at
Amazon Web Services. He works with ISVs and SIs to architect healthcare solutions on
AWS, and bring the best possible experience to their customers. His passion is working at
the intersection of science, big data, and software. In his spare time, he’s exploring the
outdoors, learning a new thing to cook, or spending time with his wife, son, and his dog,
Macaroon.
