
Creating and Evaluating Chatbots as Eligibility Assistants for Clinical Trials: An Active Deep Learning Approach towards User-centered Classification

CHING-HUA CHUAN, Department of Interactive Media, University of Miami, Coral Gables, FL, USA
SUSAN MORGAN, Department of Communication Studies, University of Miami, Coral Gables, FL, USA

Clinical trials are important tools to improve knowledge about the effectiveness of new treatments for all diseases, including
cancers. However, studies show that fewer than 5% of cancer patients are enrolled in any type of research study or clinical
trial. Although there is a wide variety of reasons for the low participation rate, we address this issue by designing a chatbot to
help users determine their eligibility via interactive, two-way communication. The chatbot is supported by a user-centered
classifier that uses an active deep learning approach to separate complex eligibility criteria into questions that can be easily
answered by users and information that requires verification by their doctors. We collected all the available clinical trial
eligibility criteria from the National Cancer Institute’s website to evaluate the chatbot and the classifier. Experimental results
show that the active deep learning classifier outperforms the baseline k-nearest neighbor method. In addition, an in-person
experiment was conducted to evaluate the effectiveness of the chatbot. The results indicate that the participants who used
the chatbot achieved a better understanding of eligibility than those who used only the website. Furthermore, interfaces
with chatbots were rated significantly better in terms of perceived usability, interactivity, and dialogue.

CCS Concepts: • Human-centered computing → Natural language interfaces; • Applied computing → Health care
information systems; • Computing methodologies → Machine learning;

Additional Key Words and Phrases: Chatbots, active learning, convolution neural networks, clinical trials, eligibility criteria

ACM Reference format:
Ching-Hua Chuan and Susan Morgan. 2020. Creating and Evaluating Chatbots as Eligibility Assistants for Clinical Trials: An Active Deep Learning Approach towards User-centered Classification. ACM Trans. Comput. Healthcare 2, 1, Article 6 (December 2020), 19 pages.
https://doi.org/10.1145/3403575

1 INTRODUCTION
Conversational agents (CA), often known as chatbots, have become increasingly popular in various fields and
industries such as online customer service, product recommendation, and personal finance assistance. In the
realm of health-related applications, chatbots have been designed to provide general health information, explain

This work was supported in part by the University of Miami School of Communication Creative Activity and Research Grants.
Authors’ addresses: C.-H. Chuan, Department of Interactive Media, University of Miami, 5100 Brunson Drive, Coral Gables, FL 33146, USA;
email: c.chuan@miami.edu; S. Morgan, Department of Communication Studies, University of Miami, 5100 Brunson Drive, Coral Gables, FL
33146, USA; email: semorgan@miami.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
© 2020 Association for Computing Machinery.
2637-8051/2020/12-ART6 $15.00
https://doi.org/10.1145/3403575


Fig. 1. An excerpt of eligibility criteria in a clinical trial for breast cancer.

medical jargon, and predict risks of diseases given symptoms. Thanks to the advances in natural language
processing and machine learning, today's chatbots are able to understand conversations in natural language and
provide useful medical information in lay language.
In this article, we developed a chatbot with a backend machine learning module to help users understand eligi-
bility criteria for clinical trials. Clinical trials are one of the most important tools to understand and improve the
effectiveness of new drugs or treatments for all diseases, including cancers. However, studies show that fewer
than 5% of cancer patients are enrolled in any type of clinical trial. Consequently, the low enrollment rate results
in about 1 in 5 trials being closed with insufficient accrual [Murthy et al. 2004; Bennette et al. 2016]. In the past,
the low enrollment rate could be caused by the time-consuming manual process, i.e., caregivers needed to manu-
ally verify patients' eligibility against various criteria in multiple clinical trials [Breitfeld et al. 1999]. Despite the
dominance of digital media as an information dissemination channel, making information available on the Inter-
net did little to solve the problem because “(online) services offering clinical trials recruitment information have
not been well defined” [Metz et al. 2005]. Constructing and deploying suitable, flexible, and effective definitions
for clinical trials recruitment information is not a trivial task. It is more difficult when substantial differences are
observed in eligibility criteria between clinical trial protocols, registries, and articles [Zhang et al. 2016].
To understand the complexity of the problem, we first examine how clinical trials recruitment information is
presented and accessed. The National Cancer Institute (NCI) provides a web interface for cancer patients and their
families and friends to locate active clinical trials.1 With the basic search interface, users can initiate their search
by entering cancer type, their age, and zip code. The advanced search interface allows further specification by
entering cancer subtype, stage, side effects, keywords, trial type, drug, trial phase, trial ID, and trial investigators.
Submission of the query produces a list of trials; each trial is displayed with detailed information including
description of the trial, eligibility criteria, locations, and contact information. Despite the NCI’s search interface,
a key barrier to participation is the extensive and complicated eligibility criteria that often involve medical
jargon alien to most users. Additionally, the search criteria that users can specify are far too general to help
users determine whether they are eligible for the returned clinical trials. As a result, users or patients must read
and verify their status against the list of criteria one-by-one to determine their eligibility to participate.
To illustrate the complexity of the verification process, an excerpt of eligibility criteria from a breast cancer
clinical trial on the NCI’s website is shown in Figure 1. This particular trial has 26 bullet points of inclusion
eligibility criteria. Medical jargon, such as stage IB and ECOG performance status, can be observed in many
criteria. Although online search engines or medical dictionaries are helpful for understanding the definition of
a specific medical term, providing a definition is not sufficient for the patient to verify his or her eligibility for
a criterion, as in the first case shown in Figure 1. By contrast, other criteria, such as the patient’s pregnancy
status, pertain to conditions that can be easily verified. Because the extent to which an inclusion criterion can

1 https://www.cancer.gov/about-cancer/treatment/clinical-trials/search.


be readily verified by the patients or their family members differs from one criterion to another, going through the criteria
list one-by-one can be an extremely frustrating experience and even a waste of time for a pregnant patient who
only finds out that she is not eligible in the end after she tries to understand all the other criteria and medical
jargon.
In this study, we aim to simplify the verification process by converting eligibility criteria into a user-centered
classification problem in machine learning. From the user’s perspective, criteria are classified into five categories:
criteria that (1) patients can easily verify (e.g., must not be pregnant), (2) patients need to consult with medical
professionals (e.g., stage IB >=4cm), (3) patients need to take lab tests (e.g., white blood cell count), (4) patients
must agree to comply with (e.g., must sign the consent form), and (5) others. Once the criteria are classified in
this way, patients and their family members can begin with the criteria in the first category to quickly filter out
the unsuitable clinical trials. For the suitable trials, if the patient agrees to comply with the criteria in the fourth
category, the system can prepare the remaining criteria in a separate document as a to-do list so the patient can
bring it to the medical professionals and/or take the necessary lab tests.
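
To make this workflow concrete, the following Python sketch shows how classified criteria might be ordered for presentation; the category names and ordering are illustrative placeholders, not the system's actual labels.

    # Categories from the user's perspective, ordered so that criteria the
    # patient can answer immediately come first (a hypothetical encoding).
    CATEGORY_ORDER = {
        "self_verifiable": 0,   # e.g., must not be pregnant
        "agree_to_comply": 1,   # e.g., must sign the consent form
        "consult_doctor": 2,    # e.g., stage IB >= 4 cm
        "lab_test": 3,          # e.g., white blood cell count
        "other": 4,
    }

    def order_criteria(classified_criteria):
        """Sort (criterion_text, category) pairs so easy-to-verify items come first."""
        return sorted(classified_criteria, key=lambda item: CATEGORY_ORDER[item[1]])
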
The contribution of this article is two-fold: (1) the chatbot that provides an interactive interface to help users
better understand the eligibility criteria, and (2) the machine learning backend module that supplies the con-
versational content to the chatbot. The chatbot, Sofia, is able to answer users’ questions and also proactively
provide assistance to help users with their eligibility for a particular clinical trial. Unlike many of the automated
systems for clinical trials, our goal is not to just automate the process "behind the scenes" but to empower users
by simplifying the process and going through the criteria together with them, to help them better understand the
inclusion criteria. The chatbot is supported by a machine learning backend module that streamlines the process
of reviewing criteria by classifying criteria into categories based on the user’s ability to verify them. Specifically,
we propose a novel approach using active deep learning with word embeddings for this classification problem.
The word embeddings learned from the criteria allow our classifier to operate without any pre-defined
(medical) lexicon or templates. The active learning approach enables the classifier to be user-centered (i.e., col-
lecting the class labels from a user's perspective) while overcoming the challenge of obtaining a huge amount
of labeled data for training the classifier. To the best of our knowledge, this article is the first to use deep learning
to classify clinical trial criteria in a user-centered manner. In addition, the design with the front-end chatbot and
the back-end classifier makes the system highly scalable, as no re-programming is needed for different clinical
trials.
The rest of the manuscript is organized as follows: Section 2 describes related work on two topics: clinical trial
eligibility criteria and chatbots in healthcare. Section 3 provides an overview of the system while Sections 4 and
5 present details on the creation and evaluation of the eligibility criteria classifier and the chatbot, respectively.
We conclude this article in Section 6 along with our discussion for future work.

2 RELATED WORK
2.1 Clinical Trial Eligibility Criteria
The importance and complexity of eligibility criteria in clinical trials have motivated many researchers to study
the content of criteria with the goal to simplify the process. Weng et al. [2010] conducted a literature review
to analyze the manner in which eligibility criteria can be represented for computers to process to improve
participant screening. The authors reported a wide variety of representations of eligibility criteria based on
three aspects: expression language, codification of eligibility concepts, and patient data modeling. Even when
a more focused subject was discussed, such as eligibility criteria subclasses, variations are still commonly
observed in terms of how such classes are defined. For example, Niland et al. [2008] proposed three categories
including intent, main clinical category, and main medical topic. Sim et al. [2004] focused on age-gender rules,
ethnicity-language rules, and clinical rules. Rubin et al. [1999] developed 24 categories for eligibility criteria


with two goals: “research protocols for patients at similar clinical states would have similar eligibility criteria
and to reduce the total number of criteria needed to author several clinical protocols.”
Recently, machine learning techniques have been used to examine the characteristics of eligibility criteria. For
example, Luo et al. [2011] used a Unified Medical Language System (UMLS)-based semantic lexicon to process
the description of eligibility criteria and applied hierarchical clustering algorithms to the extraction of eligibility
categories. These categories were later merged and labeled manually. The experimental results, obtained by us-
ing 5,000 sentences from ClinicalTrials.gov, showed 27 semantic categories with six top groups: health status
(43.72%), treatment or health care (20.74%), diagnostic or lab test (14.85%), demographics (8.79%), ethical con-
sideration (8.52%), and lifestyle choice (3.38%). Another approach, proposed by Milian et al. [2011, 2012], used
natural language processing techniques and regular expressions to automatically identify semantic patterns in
eligibility criteria, such as (. . . allowed if . . . ), (no history of . . . ), and (no . . . except for . . . ).
Analyzing eligibility criteria is an essential step towards expediting or automating the verification of eligible
patients. To achieve this goal, most researchers rely on patients’ medical records. For instance, Carlson et al.
[1995] used patients’ medical records to identify them as eligible, potentially eligible, or ineligible based on four
categories of eligibility criteria: (1) immutable criteria (e.g., gender), (2) routine laboratory tests, (3) physician-
controlled criteria (e.g., concurrent medications), and (4) special criteria (e.g., performance status, unusual lab-
oratory tests, willingness of the patient). However, as pointed out by the authors, missing information in the
medical records is common so eligibility “could not be unequivocally determined.”
Fink et al. [2004] also proposed a system for physicians or clinicians to determine patient eligibility. The expert
system turned eligibility criteria into questions that can be verified using logical expressions (e.g., patient’s sex =
FEMALE and age ≤ 45). To encode the eligibility criteria, each criterion was first classified into three categories:
questions that take yes or no responses, multiple choices, or numeric answers. As a result, each protocol was
then manually converted into a set of logical expressions that describe the inclusion and exclusion criteria. More
recently, Tu et al. [2011] used natural language processing techniques (e.g., part-of-speech tagging) to partially
automate the process of transforming the eligibility text into computable formats (e.g., formats that allow for
Boolean combinations in relational database languages).
Köpcke et al. [2013] used electronic health records (EHR) to identify eligible patients based on a sample of
patients whose EHR have been manually examined for eligibility. Patients’ EHR were stored as entity-attribute-
value in a relational database. Five classification algorithms were tested, including decision trees, random forests,
support vector machine, logistic regression, and stepwise logistic regression. Among the three clinical trials
tested in the experiment, random forests performed the best, with ROC-AUC values between 0.8 and 0.95.
Miotto and Weng [2015] also used EHR to automatically identify patients for clinical trials but using the case-
based reasoning approach. First, the authors identified a group of patients that were already qualified or enrolled
in the clinical trials as target patients. Then, patients who had not been verified were examined by comparing
their EHR to the records of target patients via a similarity measure (i.e., cosine distance). Last, the system returned
a list of patients ranked by their similarity to the target patients in terms of how likely these patients could
be eligible for the clinical trial. In their study, only four EHR data types were considered: medication orders,
diagnosis, laboratory results, and free-text clinical notes. Other data, such as gender, age, and location, were
ignored. Although this approach does not require manual labeling of patients’ eligibility, it is limited, as the
authors also acknowledged that the identified patients may lack diversity, because the algorithm only assumes
one cohort for each clinical trial. In addition, this approach suffers from the cold-start problem: It is difficult to
find potentially eligible patients when few or no patients are already enrolled in the clinical trial.
As illustrated by the above-mentioned examples, checking patient data against eligibility criteria is the con-
ventional approach for clinical trial recruitment support systems. The popularity of this approach is also noted
by Cuggia et al. [2011], who found that all 28 clinical trial recruitment support systems evaluated in their study
require both eligibility criteria and patient data as input. However, the requirement of patient data can limit the
capability of such a system for two key reasons. First, complications in patient data integrity, sharing,


and privacy directly impact the system performance. In addition, extant patient data may not address criteria
questions involving a condition or status that can be changed (e.g., weight or pregnancy status) or a patient’s
willingness to comply with certain rules.
Because of the challenges associated with patient data, the current study aims to empower patients by provid-
ing them with a computational tool to find suitable clinical trials. Metz et al. [2015], who studied the effectiveness
of a web-based clinical trial matching resource on OncoLink (www.oncolink.org), similarly advocated for such
assistive technologies for patients. In their study, patients first need to register an account and provide basic
demographic data and other related information such as personal medical history, cancer diagnosis, and treat-
ments to date by filling out an online questionnaire. The authors reported that more than 600 patients were
successfully matched in 16 months during the experiments but they did not track the patients who were not
matched. In addition, it is not clear from the article how the match was computationally identified.

2.2 Chatbots in Healthcare


Various chatbots have been proposed for assisting and supporting patients in the past two decades. Recently,
these systems have been documented and analyzed in two systematic reviews. Specifically, Laranjo et al. [2018]'s re-
view uses three inclusion criteria for study selections: studies must focus on consumers or healthcare profes-
sionals, must involve a chatbot that understands (unconstrained) natural language either via text or speech, and
must report evaluations from the user’s interaction with the chatbot. Seventeen articles published between 2003
and 2017 were reviewed. They reported that the majority of the chatbots were created for supporting patients
with mental health issues (e.g., depression, anxiety, post-traumatic stress disorder, autism, and violence), some
for physical health and addiction (e.g., asthma, sexual health, obstructive sleep apnea, breast cancer, diabetes,
and substance abuse), one for mindfulness meditation training, and one for language impairment. These chatbots
function as supporters, trainers, educators, practice partners, personal assistants (e.g., monitoring and data col-
lection), and diagnosis tools. In terms of the dialogue management techniques, three types were identified in the
review: finite-state (dialogue modeled as a sequence of pre-determined steps/states), frame-based (user’s input is
analyzed via templates), and agent-based (the system has the ability to reason regarding belief states and actions).
Evaluations on the chatbot as a whole were measured via dialogue success rate (% successful task completion),
dialogue duration, number of turns, number of repetitions, and corrections or interruptions. User experience
was evaluated via mostly self-reported data on satisfaction, ease of use, reliability, usefulness, acceptability, and
(positive/negative) perception. The review also pointed out several common issues reported by users, such as
chatbots failing to recognize health concerns or spoken language, and failing to provide appropriate responses.
Montenegro et al. [2019] provided the most recent systematic review on chatbots in the health field. The au-
thors used combinations of keywords such as “conversational agents,” “chatbots,” “intelligent virtual agents,”
“health,” “healthcare,” and “hospital” to search for articles published in the past 10 years and selected 40 articles
for their study. Interestingly, only 2 articles out of the 40 are included in Laranjo et al.’s [2018] review. Based on
the 40 articles, Montenegro et al. proposed a taxonomy for chatbots in health based on three attributes: inter-
action, dialog, and architecture. The interaction attribute was further divided into three categories: health goals
(assistance, diagnosis, education, etc.), health contexts (patient, physician, student), and health domains (derma-
tology, hospital, therapy, cardiology, mindfulness, etc.). The dialog attribute also contains three categories: dialog
types (dialog generation, planner, engine, and management), agent types (counseling, coach), and communication
models (multimodal, speech, text). The architecture attribute differentiates chatbots in terms of techniques (such
as reinforcement learning, convolution neural networks, and pattern matching) and systems (such as Watson,
Kinect, Google Cloud NLP, and Dialogflow).
Particularly relevant to the study focus of helping users understand clinical trial eligibility criteria, several
chatbots for improving health literacy, such as helping users understand health-related information and
assisting them with decision making, have been implemented and tested. For instance, Utami et al. [2013]


created an embodied chatbot to help patients with low computer literacy choose a clinical trial. The system
has a repeatable read-aloud component so the patient can read through the text with the agent. In addition, the
system has a simplified title for the clinical trial and a built-in dictionary from the NCI website so the agent can
offer to explain difficult terms to the user. However, the system does not process natural language; user inputs
are restricted to multiple-choice selections from a pre-defined list of utterances. More recently, Azevedo et al.
[2018] implemented a chatbot to explain medication instructions to older adults. The authors focused on testing
the variations in appearance and levels of realism for the chatbot and investigated the impact of affective cues
by experimenting with different ways of message framing (benefits of taking the medicine versus loss due to not
taking the medicine). The authors concluded that chatbots have significant impacts for supporting older adults’
learning on medication instructions. Amith et al.’s [2019] study shows that participants prefer conversational
interaction over paper-based form for learning health information on human papillomavirus (HPV) vaccine.
In their study, the authors used Siri and text-to-speech SDK to program a voice-based interface for the user
to communicate with a "chatbot" that was operated by a human in the Wizard of Oz experiment setting. The
finding shows that participants generally considered such a system as easy to use but demanded more features.
In addition to the design and technical implementation of the chatbot, another important aspect to explore is
how people perceive such systems and how the perception influences their behaviors and decisions. For instance,
Zhang et al. [2017] evaluated how the relational contextualization of the chatbot affects patients’ trust of the
system. In the experiment, the relational contextualization was manipulated by three different relational settings:
a chatbot aligned with the patient, with the medical team, or with the federal government, during an informed
consent process. The result showed that aligning with the patient increases patients' trust in and satisfaction with
the chatbot. In contrast, Ho et al. [2018] examined the content of the conversation between the chatbot and
the user. The authors focused on the user’s self-disclosure, a behavior that has “beneficial emotional, relational,
and psychological outcomes.” Interestingly, the study reported no significant differences between users’ self-
disclosure with a person and that with a chatbot. This finding indicates that users are at least similarly
willing to share personal information with a chatbot as they would with a human partner.
Creating chatbots for healthcare requires careful consideration of the impact of the system on users and
potential consequences caused by unexpected system failures. Bickmore et al. [2018] systematically reviewed
the types of errors that occur in conversational interfaces and discussed the potential safety issues that errors
can cause. Examples of errors include errors in user mental model (e.g., the user has incorrect expectations of the
chatbot’s domain expertise, and the user may not understand the limited ways they can communicate with the
chatbot), automatic speech recognition, natural language processing, system response, and user understanding
and action. The authors further provided design recommendations to prevent or handle such errors. For instance,
the authors recommended that the chatbot should provide examples of expected utterances for the user input
and incorporate recovery strategies when the chatbot misunderstands or does not understand the user.
In the next section, we provide an overview of our approach to the creation of chatbots for clinical trial
eligibility criteria.

3 SYSTEM OVERVIEW
Figure 2 shows the architecture of the proposed system. The system contains two major components: conver-
sation manager, depicted on the left in the diagram, and criteria classifier, on the right. The conversation
manager is built as a web interface where the chatbot exchanges conversations with the user. The conversation
manager consists of four modules. The first module, intent/entity identification, processes the sentence entered
by the user and uses the cloud-based service, Dialogflow,2 to identify the purpose (intent) and topic/object
(entity) in the user's sentences. This module is responsible for handling the question-and-answer type of
conversations in which the user always initiates the conversation (i.e., asking the question). The second module,

2 https://dialogflow.com/.


Fig. 2. The diagram illustrates the two major components in the system: conversation manager (front-end) and eligibility
criteria classifier (back-end).

eligibility assistance, receives the information about clinical trials from the NCI dataset and converts eligibility
criteria into a sequence of questions based on the class assigned to each criterion from the criteria classifier.
The third module, conversation status monitoring, keeps track of the status of the conversation. For example,
this module remembers the current criterion that the chatbot is helping the user with, and it signals a reminder
if the user has not verified the criterion after a certain period of time. The last module, conversation generation,
constructs the sentence that the chatbot says to the user. This module makes the decision based on the information
provided by the other three modules so it is capable of handling conversations with multiple topics presented
simultaneously. For example, after the chatbot asks the user about a particular criterion (topic 1), instead of
answering the chatbot, the user asks a question about a medical term in the criterion (topic 2). In this case, the
chatbot will first explain the medical term to the user to complete the discussion on topic 2 and then move back
to topic 1 to complete the process of verifying the criteria with the user.
Criteria classifier, depicted on the right of Figure 2, maps each criterion into one of the five categories. As
described in the Introduction (Section 1), the output category is used to determine the order in which the chatbot
assists the user by going through the criteria with the user one-by-one. Criteria classifier consists of two parts:
the first part uses natural language processing on eligibility criteria (the top dashed-box) while the second part
utilizes active deep learning to learn and classify criteria (the bottom dashed-box). The first part (natural language
processing) works as follows: For each eligibility criterion, n-grams (n = 1, 2, 3) are first used to tokenize the
sentence after stop words are removed and all numeric values in the sentence are replaced with a special label.
For each extracted n-gram pattern from the NCI dataset, term frequency-inverted document frequency (TF-IDF)
is then calculated. TF-IDF is later used in the process of segmenting a criterion sentence into non-overlapping
n-gram patterns. If the sentence can be segmented in multiple ways, the combination of n-gram patterns with
the highest total TF-IDF is selected. Afterwards, the sentences represented as n-gram patterns are used to create
word embeddings using word2vec. Based on the word embeddings, a distance value (Word Mover's Distance,
WMD) is calculated between each pair of criteria. Such distance values are later used in active learning to
determine the manner in which human-annotated labels are applied to the neighboring criteria.
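
As an illustration of the segmentation step, the sketch below uses dynamic programming to split a tokenized criterion into non-overlapping n-grams (n ≤ 3) with the highest total TF-IDF; the tfidf dictionary and the token preprocessing shown here are assumed inputs, not the paper's exact implementation.

    def segment_by_tfidf(tokens, tfidf, max_n=3):
        # best[i] holds (best total TF-IDF, segmentation) for tokens[:i]
        best = [(0.0, [])]
        for i in range(1, len(tokens) + 1):
            candidates = []
            for n in range(1, max_n + 1):
                if i - n < 0:
                    break
                gram = " ".join(tokens[i - n:i])
                score, segmentation = best[i - n]
                candidates.append((score + tfidf.get(gram, 0.0), segmentation + [gram]))
            best.append(max(candidates, key=lambda c: c[0]))
        return best[-1][1]

    # Stop words removed and numeric values replaced with a placeholder ("NUM"):
    tokens = ["major", "surgery", "within", "NUM", "weeks"]
    patterns = segment_by_tfidf(tokens, {"major surgery": 2.1, "within": 0.3, "NUM weeks": 1.4})
    # -> ["major surgery", "within", "NUM weeks"]
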


The second part (active deep learning) of criteria classifier processes criteria as vectors of word embeddings.
The active learning algorithm, described in detail in Section 4, selects the criterion whose category the model is
least confident about for the human oracle to label. Once the label/class is received from the human
oracle, the algorithm propagates the label to the neighboring criteria to increase the number of samples in the
training set. The algorithm is also responsible for selecting a validation set from the ones with labels, either
annotated by the human oracle or inferred by the model. The convolution neural network is then trained on the
training set and tested on the validation set. Based on the confidence of the model’s prediction on the validation
set, the active learning algorithm again selects new criteria with low confidence for the human oracle to label.
The process repeats until a certain number of iterations is met.
The two major components of the system are described in the following sections: Section 4 for criteria clas-
sifier using active deep learning and Section 5 for conversation manager. Since the classification result from
criteria classifier is used by conversation manager, we explain the classifier before the manager to provide
readers with a clear and smooth logical flow.

4 ACTIVE DEEP LEARNING FOR ELIGIBILITY CRITERIA CLASSIFICATION


This section describes the modules in criteria classifier in terms of how the modules are created (Section 4.1)
and evaluated (Section 4.2).

4.1 Active Learning with Word Embeddings for Clinical Trial Criteria
4.1.1 Dataset Construction. The first step is to create a dataset of eligibility criteria in clinical trials. To do so,
we used NCI's web APIs to obtain clinical trial data in the JSON format. Overall, a total of 9,762 clinical
trials were examined and 209,441 eligibility criteria, both inclusion and exclusion, were collected. To ensure that
each collected criterion presents a meaningful and clear condition or measure, we removed the ones that are too
short (e.g., a bullet point) or too long (e.g., multiple items in one criterion) from the dataset, using the 1st and 3rd
quartiles of the dataset as the lower and upper bound for the number of words in the criteria. The final dataset
consists of 114,749 criteria with numbers of words between 8 and 28 per criterion.
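
A minimal sketch of this length-based filtering (assuming the collected criteria are plain strings; the bounds of 8 and 28 words reported above correspond to the computed quartiles):

    import numpy as np

    lengths = np.array([len(c.split()) for c in criteria])   # criteria: list of criterion strings
    q1, q3 = np.percentile(lengths, [25, 75])                # 1st and 3rd quartiles
    filtered = [c for c, n in zip(criteria, lengths) if q1 <= n <= q3]
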
4.1.2 Word Embeddings and Word Mover’s Distance. As described in Section 3, the data-preprocessing module
(n-gram/TF-IDF in Figure 2) converts each criterion into a sequence of non-overlapping n-gram patterns. For
each pattern, either a word or phrase, the word2vec [Mikolov et al., 2013] technique is used to represent the
pattern as a vector. Word2vec is a popular technique for applications such as text classification [Lilleberg et al.,
2015] and sentiment analysis [Tang et al. 2008] because it creates a vector space in which words that occur in
similar contexts in the original text appear close to each other. Specifically, the skip-gram with
negative sampling [Mikolov et al. 2013] is the implementation of word2vec used in this study. The skip-gram
model aims to increase the conditional probability of a target word if the conditional word is in the near context
while minimizing the probability if the conditional word is a noise sample, i.e., a word selected outside the near
context (negative sampling).
In this study, the word2vec vocabulary consists of 4,000 words: We selected the most frequent 3,999 words
from the eligibility criteria and encoded others as “UNK.” Each word is represented as a vector of 256 elements,
generated by using a surrounding context window of 4 to train the skip-gram model. Figure 3 shows the
visualization of the learned word2vec space with the top (most frequent) 200 words used in clinical trial eligibility
criteria. The visualization was created using t-Distributed Stochastic Neighbor Embedding (t-SNE) technique
[Maaten et al. 2008]. It can be observed that words sharing similar meanings or often used together to describe
a certain topic are spatially placed near each other, as shown in the highlighted ovals in Figure 3.
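
For readers who wish to reproduce this step, a sketch using the gensim library (version 4 or later) with the hyperparameters reported above is given below; the paper does not state which word2vec implementation was actually used, and the negative-sampling count here is an assumption.

    from gensim.models import Word2Vec

    # tokenized_criteria: list of criteria, each a list of n-gram tokens
    w2v = Word2Vec(
        sentences=tokenized_criteria,
        vector_size=256,       # 256-element vectors
        window=4,              # context window of 4
        sg=1,                  # skip-gram
        negative=5,            # negative sampling (count assumed, not reported)
        max_final_vocab=4000,  # cap the vocabulary at the most frequent tokens
        min_count=1,
    )
    vector = w2v.wv["surgery"]  # 256-dimensional embedding for one token
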
Using the word2vec space, we then calculated the distance between words and further used it to calculate the
distance between sentences. In this study, we adopted the Word Mover’s Distance (WMD), as proposed by Kusner
et al. [2015], to calculate the distance between a pair of criteria. WMD measures the distance as the minimum


Fig. 3. Visualization of the top 200 words in the word2vec space.

amount of distance that a word from one document has to “travel” to any words in the other document. Because
WMD is computationally expensive, we implemented the relaxed Word Mover's Distance between two criteria
C1 and C2, d_WMD(C1, C2), as follows:

d_WMD(C1, C2) = Σ_i min_j d_w2v(w_{1,i}, w_{2,j})     (1)

where the two criteria C1 and C2 contain words {w_{1,1}, . . . , w_{1,m}} and {w_{2,1}, . . . , w_{2,n}}, respectively, and
d_w2v(w_{1,i}, w_{2,j}) is the Euclidean distance between the two words in the word2vec vector space.
Note that, as proven by Kusner et al. [2015], the relaxed Word Mover's Distance is a tighter lower bound for
WMD than the word centroid distance, which involves finding the centroid for each sentence based on the words
and using the distance between two centroids as the distance between the sentences.
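
A compact numpy sketch of Equation (1), treating each criterion as an array of its word vectors (whether the implementation additionally symmetrizes the measure, as Kusner et al. [2015] do, is not stated in the paper):

    import numpy as np

    def relaxed_wmd(c1_vecs, c2_vecs):
        # For each word vector in criterion 1 (shape m x 256), take the Euclidean
        # distance to its closest word vector in criterion 2 (shape n x 256), then sum.
        diffs = c1_vecs[:, None, :] - c2_vecs[None, :, :]   # shape (m, n, 256)
        pairwise = np.linalg.norm(diffs, axis=-1)           # shape (m, n)
        return pairwise.min(axis=1).sum()
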
Table 1 shows three examples of using WMD to retrieve the nearest criteria given a target sentence. In example
no. 1, it is obvious that the target criterion and its two nearest neighbors are about the same message. In contrast,
example no. 2 presents a less obvious case where only the target and its second nearest neighbor are about
English proficiency for communication. Yet, being “able to comprehend and provide written informed consent”
in the first nearest neighbor is also related to the patient’s communication capability. However, the two nearest
neighbors for the target in example no. 3 clearly do not convey the same message. Situations like this can happen
when the criterion consists of many words outside the vocabulary and therefore those words are ignored. As a
result, the phrase such as “at least” becomes overly dominant in the calculation of WMD.
The next section describes how the learned word2vec space and the distance between criteria using WMD are
used in active deep learning for the criteria classifier.
4.1.3 Active Deep Learning. The reason for using active learning to train the criteria classifier is to avoid the
labor-intensive labeling process needed in supervised learning. Generally speaking, algorithms in active learning
aim to improve the quality of (supervised) machine learning by allowing the machine to actively query for help.
Active learning is often used in tasks where large amounts of annotated data are not available. In this case, the goal
of the algorithm is to achieve higher accuracy with fewer actively obtained labels.


Table 1. Examples of Criteria and Their Nearest Neighbors Using Word Mover’s Distance (WMD)

Example No. Role Criteria


Target Major surgery within 4 weeks before the start of study therapy.
1 Neighbor 1 Major surgery within 4 weeks before start of study treatment, without complete
recovery.
Neighbor 2 Major surgery, open biopsy, or significant traumatic injury within 4 weeks of
starting therapy.
Target Able to read and comprehend either English or Spanish.
2 Neighbor 1 Able to comprehend and provide written informed consent in accordance with
institutional and federal guidelines.
Neighbor 2 Are eligible if they are able to communicate in English.
Target Self-reported ability to walk at least 2 blocks (at any pace).
3 Neighbor 1 There is no need for steroids and patients have not had steroids at least 2 weeks.
Neighbor 2 Serum albumin of at least 2.8 g/dL (Note: Laboratories in combination must still be Child Pugh score less than 7. Other laboratory parameters.)

Fig. 4. The structure of the convolution neural network for clinical trial criteria classification.

In this study, we designed an active learning algorithm that improves the accuracy of the Convolution Neural
Networks (CNN) to classify criteria. Figure 4 shows the network structure of the CNN classifier. The input of the
CNN classifier is a 28 × 256 matrix that represents a criterion as shown in the bottom of the figure. As described in
Section 4.1.1, the maximum number of words in the collected criteria is 28, and therefore criteria with fewer than
28 words are padded with vectors of zeros in the matrix. The input is processed through two hidden convolution
layers with a pooling layer attached after each. The output of the second pooling layer is then passed to two
fully connected layers with the last layer of five units that represent the five classes in the eligibility criteria
classification. Readers can find more details about the CNN classifier in terms of structural parameters (e.g., the
size of the kernel map), training strategy (e.g., dropout and gradient clipping), and error rates on training and
validation datasets over epochs in Chuan [2018].
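
The sketch below (PyTorch) mirrors this structure; the kernel sizes, channel counts, and hidden layer width are illustrative placeholders, since the exact values are only given in Chuan [2018].

    import torch.nn as nn

    class CriteriaCNN(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                      # 28 x 256 -> 14 x 128
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                      # 14 x 128 -> 7 x 64
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 7 * 64, 128), nn.ReLU(),
                nn.Dropout(0.5),                      # dropout is mentioned in the paper
                nn.Linear(128, num_classes),          # five eligibility criteria classes
            )

        def forward(self, x):                         # x: (batch, 1, 28, 256)
            return self.classifier(self.features(x))
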
The active CNN algorithm proposed for eligibility criteria classification is described in Algorithm 1. The algo-
rithm centers on two components: uncertainty cluster sampling and label propagation with simulated annealing.
Uncertainty cluster sampling means that the algorithm performs a clustering search on uncertain cases (i.e., el-
igibility criteria in this study) and selects the centroid of the cluster to query the human oracle for the class
label. Once the algorithm receives the class label from the oracle, it selects several neighboring cases such that
these cases temporarily share the same class label (label propagation). The selection of neighboring cases for
label propagation is controlled by simulated annealing: In the beginning when few cases have class labels, the
algorithm propagates the labels to a wider range of cases. As the number of iterations increases and more cases


Fig. 5. (a) Test error rates for active CNN, CNN, and KNN, and (b) the distribution of query distance for active learning.

have labels, the algorithm reduces the range of propagation. The simulated annealing technique is implemented
in lines 15–19 in Algorithm 1, where the chance of a neighboring case being selected for label propagation is
determined by a simple decaying exponential function as the probability threshold:

e^(−i/iter)     (2)


where i is the current iteration number and iter is the total number of iterations allowed.
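
In code, this threshold amounts to a one-line check; the sketch below corresponds to the random draw in lines 15–19 of Algorithm 1.

    import math, random

    def propagate_label(i, total_iterations):
        # Return True if a neighboring case should inherit the oracle's label.
        return random.random() < math.exp(-i / total_iterations)
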
The cases with labels, either obtained from the human oracle or assigned by label propagation, are then used
to train the CNN classifier (line 21). A validation set V is later formed by randomly selecting cases that have
not been labeled (line 22). The trained CNN classifier is then tested on the union of the validation set and the
set consisting of cases with propagated labels (line 23). Based on the predicted label, and more importantly, the
confidence value, the candidate set (C) is constructed with cases having the confidence value less than a pre-
defined threshold (line 24). The confidence value is calculated by first normalizing the values in the five output
units in CNN and then selecting the maximum of such values. The algorithm then continues on to the next
iteration until the total number of iterations is met.
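
A sketch of this confidence computation and candidate selection follows; the paper says only that the five output values are normalized, so softmax is used here as one plausible choice, and the threshold value is illustrative.

    import torch.nn.functional as F

    def low_confidence_cases(logits, cases, threshold=0.6):
        probs = F.softmax(logits, dim=1)              # logits: (num_cases, 5)
        confidence, _ = probs.max(dim=1)              # model's confidence per case
        return [c for c, conf in zip(cases, confidence.tolist()) if conf < threshold]
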

4.2 Evaluation on Criteria Classification


To evaluate the performance of the criteria classifier, we first randomly selected 1,000 criteria from the NCI
dataset and manually labeled their classes. These labeled 1,000 criteria were reserved as the test set. The remain-
ing, non-labeled, criteria were then used in the iterative training process in the active CNN algorithm.
In each iteration, five criteria were selected by the algorithm as queries to obtain the class labels from the
human oracle. The active CNN algorithm would then be trained on the manually labeled criteria as well as the
ones with temporarily propagated labels. For comparisons, we also trained the CNN classifier without active
learning using the same manually labeled criteria. Specifically, the training process was conducted as follows:
the CNN classifier was trained on five labeled criteria, while the active CNN classifier was trained on the same
five criteria plus the ones with propagated labels in the first iteration. In the second iteration, five more criteria
were then selected by the algorithm for the human oracle to label, which resulted in total 10 labeled criteria.
In this iteration, the CNN classifier was trained on the 10 criteria from scratch while the active CNN classifier
was trained on the same 10 plus the ones with propagated labels, also from scratch. Similar processes continued in
the following iterations. In each iteration, the k-nearest neighbor (KNN) classifier using only the manually labeled
criteria was also tested as a baseline for the evaluation.
Figure 5(a) shows the error rate on the test set over 50 iterations for the three algorithms: active CNN, CNN
(without active learning), and KNN. It can be observed that both active CNN and CNN perform better than the
baseline KNN, as the error rates of active CNN and CNN are much lower than KNN's. However, the difference between
the error rates of active CNN and CNN is relatively small. The difference is most prominent in the earlier


ALGORITHM 1: Active CNN Algorithm


Input D : the entire dataset
Output cnn_model : the trained CNN classifier
Parameters iter : total number of iterations, k : number of clusters, t : prediction confidence threshold, n : number of
neighboring candidates for label propagation, and v : number of validation samples
1. C → {}
2. GT → {}
3. for i = 1. . .iter do
4. if i == 1 then
5. centroids → k_means_clustering(D, k )
6. else
7. centroids → k_means_clustering(C, k )
8. end if
9. foreach c ∈ centroids do
10. if c ∉ GT then
11. Ask human oracle about case c’s class
12. GT → GT ∪ c
13. end if
14. neighbors → nearest_neighbors(D, c, n )
15. foreach nei ∈ neighbors do
16. if rand_float() < e^(−i/iter) then
17. C → C ∪ nei with c’s class for nei
18. end if
19. end foreach
20. end foreach
21. cnn_model_train(GT ∪ C)
22. V → rand_set(D – (GT ∪ C), v )
23. P → cnn_model_test(C ∪ V )
24. C → p ∈ P where case p’s prediction confidence < t
25. end for
26. return cnn_model
Functions
k_means_clustering(D, k ): returns k centroids for cases in D
nearest_neighbors(D, c, n ): returns n nearest neighbors for case c from the dataset D
rand_float(): returns a random floating point between 0 & 1
rand_set(X, v ): returns a set of v samples randomly selected from the set X
cnn_model_train(X ): trains CNN with cases in the set X
cnn_model_test(X ): returns the predicted class and the confidence value for each case in X

stage of iterations between iteration 5 and 40. This implies that the label propagation strategy in active learning
does help with the classification accuracy by expanding the training set with propagated labels. However, the
simulated annealing strategy makes the algorithm more conservative towards the end of the iterations, which
means that the algorithm adds few or no "guessed" cases to the training set. As a result, active CNN essentially
becomes more and more similar to the plain CNN as the active learning algorithm adds fewer and fewer cases to
the training set.
Although the baseline KNN performed the worst in the evaluation, the fact that its error rate decreases with
an increasing number of iterations is a good sign, because the accuracy of KNN relies on only two factors: the
document distance in the word2vec space and the exploration strategy of uncertainty cluster sampling. The


Fig. 6. Sofia checking eligibility criteria with the user while answering the user’s questions.

document distance in the word2vec space has been examined in Section 4.1.2; therefore, we turn the reader’s
attention to uncertainty cluster sampling, which is described below.
Instead of exhaustively labeling all randomly sampled cases to compare uncertainty cluster sampling with ran-
dom sampling, studying the distribution of query distance provides an alternative way to obtain insights into
the uncertainty cluster sampling strategy. The query distance examined here is the WMD distance between any
newly selected queries and the previously selected queries. Specifically, the query distance is considered as fol-
lows: Assume that we have a sequence of queries {q1 , q2 , . . . , qt } selected by the algorithm over iterations. The
query distance from qn to qt , n < t, is defined as i if qt is the ith nearest neighbor of qn in the entire dataset defined
by WMD. The query distance for qt in the sequence is then defined as the minimum distance from qn to qt , n =
1, . . . , t-1.
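
The following sketch makes this definition concrete; nearest_neighbor_rank is a hypothetical helper that returns the rank i such that its second argument is the ith nearest neighbor of its first argument under WMD.

    def query_distance(new_query, previous_queries, nearest_neighbor_rank):
        # Minimum neighbor rank of the new query relative to all earlier queries.
        return min(nearest_neighbor_rank(q, new_query) for q in previous_queries)
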
Figure 5(b) shows the distribution of query distance for the cases selected as queries in the evaluation. It can
be observed that the algorithm tends to select cases that are either nearby or far away. Selecting cases that are
far away is generally preferable for the purpose of exploration. In contrast, selecting cases nearby implies that
the model does not have high confidence in its own prediction even if the case is close to some neighbors that
have been annotated. The variation in the content of eligibility criteria may be a direct cause for the lack of high
confidence in the neighboring cases, especially in the earlier iterations when few cases have manual labels.
The next section describes how the results from the criteria classifier are used to assist the user with clinical
trial eligibility via the front-end of the system, i.e., the conversation manager and the web interface for the
chatbot.

5 CHATBOT SOFIA FOR CLINICAL TRIAL ELIGIBILITY ASSISTANCE


5.1 Designing Chatbot Sofia for Clinical Trials
5.1.1 An Example of the Conversation Exchange. To help the reader envision how the user interacts with Sofia,
Figure 6 shows a typical conversation exchange between the two parties. The first seven chat boxes in Figure 6


constitute onboarding messages: The chatbot greets the user, explains the task with a personalized response, and
demonstrates how the user can ask questions. The last two chat boxes in the left subfigure show how the user
can initiate the procedure of checking eligibility. Once the procedure is started, Sofia helps the user by going
through the criteria one-by-one based on the classification result for each criterion from the criteria classifier
(Section 4). If the user does not understand any words in the criterion, he or she can ask the chatbot before
providing an answer (the fifth chat box on the right). After answering the user’s question, Sofia also reminds
the user about the criterion they are discussing (the seventh chat box on the right). If the response suggests that
the patient does not meet the criterion, Sofia will explain why the patient may be ineligible for the trial and also
provide a link to a webpage where the user can read the highlighted information about this particular criterion
in detail. In contrast, if the response does not indicate any flags that exclude the user from the trial, Sofia will
provide a link to a webpage where the user can find the remaining criteria that he or she needs to further consult
with a doctor or medical professional to ensure eligibility.
5.1.2 Modules in Conversation Manager. Conversation exchanges, as shown in Figure 6, are processed and or-
chestrated by the four modules in the conversation manager (see Figure 2 for the system diagram): intent/entity
identification, eligibility assistance, conversation status monitoring, and conversation generation. The intent/entity
identification module examines every entry submitted by the user and identifies the user’s intent (greeting, ask-
ing a question, answering a criterion question, or issuing a command) and meaningful entities (medical terms or
keywords for commands) in the sentences. The eligibility assistance module, which acts as an interface between
the conversation manager and criteria classifier, retrieves eligibility criteria and related information from a par-
ticular clinical trial and creates a tuple for each criterion with three items: <description, condition, type>. The
three items work as follows: the description item converts a criterion into a question, the condition item repre-
sents the answer for being included in the trial, and the type item is the class label from the criteria classifier.
The type item is used to determine the order in which the chatbot posts eligibility criteria questions to the user.
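
An illustrative (invented) instance of such a tuple, written here as a Python dictionary: the field names follow the description above, while the values and the class label are hypothetical.

    criterion = {
        "description": "Are you currently pregnant?",  # criterion rephrased as a question
        "condition": "no",                             # answer required for inclusion
        "type": "self_verifiable",                     # class label from the criteria classifier
    }
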
As shown in Figure 6, the user can ask questions about medical terms while verifying a particular criterion
with the chatbot. Supporting such dialogues with multiple simultaneously appearing topics is achieved by the
conversation status monitoring and conversation generation modules. After the chatbot asks the user a criterion
question, it is possible that the user asks multiple questions yet forgets to answer the criterion question. Even
when the chatbot is waiting for the user’s response, it is more important to answer any questions that the user
has than to remind the user to reply. As a result, the conversation status monitoring module remembers
which question it is waiting on and lets other user-initiated dialogues come through first. However, the
chatbot eventually needs to direct the user's attention back to the criterion question to continue the process
of verifying eligibility. The conversation status monitoring module uses a timer to keep track of the time since
the chatbot first asked the criterion question and reminds the user to answer it within a certain amount of time.
In addition, the conversation status monitoring and conversation generation modules work together for on-
boarding and closing sessions. Once the status of such a session is identified by the status monitoring module, the
conversation generation module then produces conversations with pre-scripted dialogues.
5.1.3 Software Architecture. The majority of the chatbot and its web interface is implemented in HTML, CSS,
and JavaScript. The only external component of the conversation manager is the cloud-based service, Dialogflow.
Dialogflow is used as a question-answering engine with natural language processing techniques. In Dialogflow,
conversations are encoded as questions and answers, and a set of questions and answers for the same conver-
sational topic or goal is programmed as an intent. In our system, one function that Dialogflow provides is for
processing and generating general conversations, such as greeting and casual chatting. The other function that
Dialogflow assists with involves explaining medical jargon. Medical jargon is designed as an entity in Dialogflow.
An entity can be considered as a concept or a category (e.g., city), which contains multiple items that belong to
the same category (e.g., New York, Miami, etc.). In this study, medical jargon is an entity and all the medical
terms belong to this entity. Such a design is highly scalable in terms of how questions about medical jargon can be asked and answered,


because the different ways of asking about the definition of one particular medical term can be reused for any other term as well.
It is necessary to clarify the division of labor between the intent/entity identification module and Dialogflow, since both
process the user’s input. As illustrated in Figure 2, the intent/entity identification module receives and processes
all the user’s input first. This module processes many messages locally without sending them to Dialogflow,
including messages during the onboarding and closing session as well as the user’s response to criteria ques-
tions. Other messages are then handed over to Dialogflow specifically for casual conversations, medical jargon
explanations, and for handling messages that are not recognized by any intents (i.e., generating the reply when
the system does not understand the user’s input).
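
The routing just described can be summarized by the sketch below; the actual module is written in JavaScript, so this Python pseudocode only mirrors the behavior, and the session attributes and handler callables are hypothetical.

    def route_message(message, session, handle_locally, send_to_dialogflow):
        # Onboarding/closing messages and answers to criteria questions stay local.
        if session.in_onboarding or session.in_closing:
            return handle_locally(message, session)
        if session.awaiting_criterion_answer and message.is_criterion_answer:
            return handle_locally(message, session)
        # Casual chat, medical-jargon questions, and unrecognized input go to Dialogflow.
        return send_to_dialogflow(message, session)
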
The next section describes how experiments were designed to evaluate the chatbot Sofia and the results ob-
tained in the experiments.

5.2 Evaluating Chatbot Sofia


5.2.1 Experiment Design. To evaluate the effectiveness of chatbot Sofia, an in-person experiment was conducted. Study participants were recruited from undergraduate and graduate communication classes; in total, 100 students participated in the study. Each participant was randomly assigned a clinical trial scenario in which they were asked to help a family member with specific health issues determine whether that family member is eligible to participate in the clinical trial. For instance, one of the scenarios in the experiment reads: "Your aunt (55 years old) has cutaneous melanoma (skin) cancer in stage III. Except this cancer, which is not removable via surgery confirmed in the laboratory result, she does not have any unstable pre-existing medical conditions or any problems with her lung. She has high blood pressure and high cholesterol." Participants were also randomly assigned to one of three interface conditions (i.e., a static website, the website with the chatbot, or a chatbot-only interface). Participants were briefed that the purpose of the study was to evaluate how people process clinical trial information online and were told that they needed to use the information on the web-based interface to determine their family member's eligibility. After they completed the task, they were asked to answer questions regarding the usability of the interface, such as "the website offers helpful features," perceived interactivity, such as "The website enabled two-way communication," and perceived dialogue, such as "The site responded quickly to my inputs and requests."

5.2.2 Results – Information Comprehension. We evaluated participants' level of information comprehension by assessing the task outcomes: (1) whether they correctly identified the eligibility status of the patient in the given scenario (i.e., yes, no, maybe), (2) whether they correctly explained why the patient is eligible or ineligible, and (3) whether they understood the meaning of the medical term "unresectable."
First, we calculated the percentage of participants who correctly identified the eligibility status after interacting with the study stimuli. Based on the reasons participants provided for the eligibility status, we further categorized the correct answers into two groups: one with correct reasons and the other with acceptable reasons. A correct reason means that the participant correctly recognized the condition(s) in the given scenario that make the patient ineligible. For example, consider the scenario in Section 5.2.1: the correct reason the patient is not eligible is her high blood pressure and high cholesterol, as the clinical trial requires that "patients must not have a history of or evidence of cardiovascular risks." An acceptable reason means that the participant recognized that the patient may be eligible but that more information is needed. In the same scenario, a criterion states that "patients must not have other current malignancies, other than basal cell skin cancer, squamous cell skin cancer," while the scenario describes only that the patient has skin cancer without indicating the specific type. In this case, it is reasonable to be uncertain about the patient's eligibility due to missing information.
Figure 7 shows that the percentage of participants who correctly identified eligibility is higher in the settings with chatbots than in the setting without a chatbot (website only). Another interesting observation is that the chatbot-only condition produces a much higher overall correct rate than the web-bot setting.


Fig. 7. Accuracy of identifying eligibility using three platforms.

Table 2. The Percentage of Users who Correctly Answered the Meaning of the Word "Unresectable"

                        website   web-bot   chatbot
% answered correctly     81.25%    90.32%    94.44%
% Googled the word        9.38%     6.45%     2.77%
% asked chatbot             N/A    19.35%    69.44%
The increase in accuracy is contributed mainly by the percentage of participants who gave acceptable reasons. This suggests that using only a conversational interface (chatbot) may make participants more sensitive to the eligibility criteria and more aware of missing information.
Next, participants' information comprehension was measured with multiple-choice questions evaluating their understanding of the word "unresectable," an important medical term that describes one of the key criteria for the target patients in this study. Table 2 shows the percentage of participants who correctly answered the question, who used online search tools to find the definition, and who asked the chatbot about the word. For participants in the website-only condition, more than 80% answered the question correctly. This is a surprisingly high rate, given that "unresectable" is not everyday vocabulary. At the same time, the use of a single term, rather than multiple terms, should be acknowledged as a limitation of the experiment.
For participants who used the chatbot, the accuracy is higher than 90% in both experimental groups. More importantly, the increase in accuracy is related to participants' information-seeking behavior, namely whether they used a search engine or the chatbot to ask for the definition of the word during the experiment. Note that participants were far more likely to ask the chatbot for the definition of "unresectable" in the chatbot-only condition than in the web-bot condition (69.44% vs. 19.35%). This indicates that delivering information only through conversation, so that the chatbot serves as a "guide" rather than a Q&A assistant, can help users concentrate on essential information and encourage them to ask important questions.

5.2.3 Results – Perceived Usability, Interactivity, and Dialogue. Figure 8 shows the average rating and standard deviation of participants' perceived usability, interactivity, and dialogue across the three settings. To examine whether the differences between the ratings are significant, we conducted pairwise t-tests between each pair of settings; the results are shown in Table 3. In general, the average ratings of perceived usability, interactivity, and dialogue are significantly higher in the web-bot and chatbot conditions than in the website-only condition. When comparing web-bot and chatbot, the ratings of perceived usability and dialogue are significantly higher in the chatbot condition than in the web-bot condition, whereas the difference in perceived interactivity between the two conditions is not significant.


Fig. 8. Users' average ratings of perceived usability, interactivity, and dialogue.

Table 3. P-values in the t Tests for the Pairwise Comparisons

Measure                   Comparison           p-value (t-test)
Perceived usability       web-bot > website    1.09E-08
                          chatbot > website    5.94E-21
                          chatbot > web-bot    0.0008
Perceived interactivity   web-bot > website    1.47E-23
                          chatbot > website    7.76E-29
                          chatbot = web-bot    0.197
Perceived dialogue        web-bot > website    6.74E-16
                          chatbot > website    3.72E-25
                          chatbot > web-bot    0.0275

Table 4. Participants' Activities in Detail

            Google    Q&A    Eligibility
website       22%     N/A        N/A
web-bot        6%     50%        56%
chatbot        3%     97%       100%


5.2.4 Results – Participants' Activities. In addition to the overall results for information comprehension and perceived usability/interactivity/dialogue, we analyzed in detail the specific activities that participants performed when using the platforms, including using a search engine (Google), asking the chatbot questions (Q&A), and asking the chatbot for help with eligibility (Eligibility). The results are shown in Table 4. First, in the website-only condition, only 22% of the participants used an online search engine to look up medical terms. In contrast, in the web-bot condition, half of the participants asked the chatbot questions about medical terms. In the chatbot-only condition, 97% asked the chatbot questions about medical terms, and only one participant (roughly 3%) used the search engine instead of the chatbot. Similar to the results in Table 2, participants became more active in asking questions and seeking information when information was delivered through conversations with the chatbot.
Next, the percentage of participants who asked the chatbot to walk them through the eligibility criteria is shown in the Eligibility column of Table 4. On the web-bot platform, only half of the participants used this feature after the chatbot prompted them with this option. In other words, the other half of the participants


decided to work through the eligibility criteria by themselves. This finding suggests potential resistance to novel technologies among certain users: some may prefer a familiar platform (the website) over a new assistive technology.

6 CONCLUSION AND FUTURE WORK


In this article, a chatbot that supports two-way communication was created to help patients and their family members easily verify their eligibility for clinical trials. The chatbot is capable of answering questions, explaining medical jargon, proactively checking eligibility with the user, and carrying on casual conversation. To help users verify their eligibility against clinical trial criteria, a user-centered, lexicon/template-free classification is performed to separate criteria that users can easily verify themselves from those that require a medical professional's assistance. Eligibility criteria are first processed to build word embeddings using the word2vec skip-gram model, and the visualization shows that words/phrases sharing similar contexts in eligibility criteria are located near one another in the embedding space. An active learning algorithm is proposed to obtain class labels efficiently and thereby speed up the training of the multi-layer convolutional neural network used for classification. Using the eligibility criteria collected from the National Cancer Institute website, experimental results show that the active convolutional neural network performs significantly better than a baseline k-nearest neighbor method. Additionally, based on the responses of 100 participants, the proposed chatbot helps users understand the eligibility criteria of a clinical trial better than a static website does. It also enhances users' perceived usability, interactivity, and dialogue with respect to the interface.
Since the proposed chatbot and its backend classifier are highly scalable, a promising future direction is to integrate the system with the National Cancer Institute's APIs for all ongoing clinical trials in order to reach more target users. In addition, it would be helpful to collect users' responses to the eligibility criteria and use such data to re-train the active deep learning classifier; in this way, the system can be improved to become truly user-centered. Other future directions for the subsystems include testing different deep learning methods for the classification manager and conducting experiments with different clinical trials involving multiple medical terms.


Received November 2019; revised March 2020; accepted May 2020

