
Modeling Correlations in Clinical Trial Outcomes using

Machine Learning
by

Arturo Chavez-Gehrig
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Master of Engineering in Computer Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2019

© Massachusetts Institute of Technology 2019. All rights reserved.

Signature redacted
Author ................................................................
Department of Electrical Engineering and Computer Science
June 7, 2019

Signature redacted
Certified by ...........................................................
Andrew W. Lo
Charles E. and Susan T. Harris Professor
Thesis Supervisor

Signature redacted
Accepted by ............................................................
Katrina LaCurts
Chair, Master of Engineering Thesis Committee
Modeling Correlations in Clinical Trial Outcomes Using Machine
Learning
by
Arturo Chavez-Gehrig

Submitted to the Department of Electrical Engineering and Computer Science


on June 7, 2019, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science and Engineering

Abstract
This thesis explores the problem of characterizing the covariance of clinical trial outcomes
using drug and trial features. The binary nature of FDA approvals makes drug development
risky, but approaches from finance theory could better manage that risk, allowing more high-
potential drugs to be developed. To apply these methods confidently, it is necessary to
understand the covariance between projects. The thesis outlines several approaches for this
task and their theoretical foundations, including finding the nearest valid covariance matrix,
online sequence prediction, and a new approach using function approximation via random
forests. This function approximation approach to estimating covariance is implemented and
tested on historical clinical trial data.

Thesis Supervisor: Andrew W. Lo


Title: Charles E. and Susan T. Harris Professor

Acknowledgments
I would like to thank Professor Lo for suggesting this line of research and providing guidance
throughout the project. It was an incredible learning experience to interact with an expert
of healthcare finance and statistical methods. I would also like to thank the other members
of the research group, particularly Qingyang Xu, with whom I collaborated on this project
to study clinical trial correlations. Thank you to Jayna Cummings for playing such an
important role in the Laboratory for Financial Engineering, keeping everyone on track and
organized. I greatly appreciate the guidance I received from Chi Heem Wang and Kien Wei
Siah, particularly their help in interpreting the Informa dataset. I would like to acknowledge
Shomesh Chaudhuri for his insights about other approaches to simulate correlated clinical
trials.
I am forever grateful to my family for supporting me. My parents have been an incredible
stabilizing force in my life, helping me to think through decisions and encouraging me to do
interesting things that make me happy. Lastly, I want to thank my friends for giving me
confidence and preventing me from becoming too stressed, particularly Sabrina, Kelsey, and
my brothers at Theta Chi Fraternity.

Contents

1 Introduction
  1.1 Eroom's Law and the Valley of Death

2 Background
  2.1 FDA Trials
    2.1.1 Preclinical
    2.1.2 Phase 1
    2.1.3 Phase 2
    2.1.4 Phase 3
    2.1.5 Approval and Exceptions
  2.2 Finance Theory
    2.2.1 Crossing the Valley of Death with Portfolio Theory
    2.2.2 Expanding the Pool of Available Capital with Securitization
  2.3 Fitting a Covariance Matrix

3 Methods
  3.1 Formalizing Correlation in Clinical Trials
    3.1.1 Problem Overview
  3.2 Online Learning
    3.2.1 Batch vs Online Learning
    3.2.2 Motivating Problem
    3.2.3 Why Online Learning for Clinical Trial Prediction?
    3.2.4 Outlining this section
    3.2.5 Predicting a Coin Toss
    3.2.6 Bit Prediction
    3.2.7 Learning from Experts
    3.2.8 Side Information
    3.2.9 Sequential Event Prediction
    3.2.10 Long-Term Sequence Prediction
    3.2.11 Online Learning Takeaways
  3.3 Function Approximation via Random Forests
    3.3.1 Random Forests

4 Data and Results
  4.1 Clinical Trial Data
    4.1.1 Dataset Specification
    4.1.2 Dimensionality Reduction
  4.2 Results
    4.2.1 Generative Process
    4.2.2 Empirical Results

5 Discussion

6 Contributions

List of Figures

1-1 Life expectancy gains have been consistent, but the rise in the costs of R&D suggests this growth might not continue. [25]

2-1 Heatmap of the correlations produced by the nearest valid correlation matrix to the domain experts' predictions [15]

3-1 A collection of classifiers that have different biases on the famous Iris dataset [19]

4-1 Informa's Trialtrove and Pharmaprojects datasets include thousands of trials from the whole spectrum of phases

4-2 Therapeutic area distribution in Trialtrove

4-3 The route of administration distribution among drugs in the data set

4-4 PCA projection onto the first 2 principal components among pipeline, inactive, and approved trials

4-5 PCA projection onto the first 2 principal components among just inactive and approved trials

4-6 The histogram of estimated covariances by the random forest models averaged over the epochs

4-7 Calibration plots of the single trial probability of success estimates

4-8 Calibration plot for the paired trial probability estimate

4-9 Histogram of the covariance estimates from the isotonically calibrated RF predictions

4-10 Distribution of oncology trial covariance estimates across time periods. The left side compares the covariance between drug-indication pairs. The right side compares covariance between specific trials.

4-11 Distribution of covariance estimates for metabolic and cardiology therapy areas across periods

4-12 Comparing covariance and variance distributions between random forest, XGBoost, and logistic regression

List of Tables

2.1 HHS Clinical Trial Costs Weighted Averages by Component [27]

4.1 Features from the Informa databases that were used for our analysis
4.2 Performance metrics for single trial probability prediction
4.3 Performance metrics for paired trial probability of success estimates

Chapter 1

Introduction

1.1 Eroom's Law and the Valley of Death

Over human history, advances in science and technology have dramatically improved health-
care. The most straightforward performance indicator for this advancement is the expansion
of life expectancy over time. This is cleverly visualized by the Oxford University economist
Max Roser in Figure 1-1 [25].
However, it is not clear that this growth in life expectancy is feasible in the long run.
Despite tremendous progress in the understanding of biological systems, "the number of new
drugs approved per billion US dollars spent on R&D has halved roughly every 9 years since
1950, falling around 80-fold in inflation-adjusted terms" [26]. Famously, Moore's Law was
coined to predict the roughly exponential increase in the number of transistors that could
be placed at a reasonable cost onto an integrated circuit; in fact, transistor counts doubled
every two years between 1970 and 2010. Contrasting with Moore's Law, the biopharmaceutical
industry has coined "Eroom's Law" to describe the decaying productivity of R&D spending in
drug development [26]. Experiencing the inverse of the information technology revolution in
health care is not an appealing proposition.
Despite the tremendous successes of modern medicine, there is no shortage of unmet need
that remains. It is not surprising that drug development is naturally getting more difficult
over time, as the "low-hanging fruit" is taken, leaving the more complicated projects. This
complexity itself can lead to higher development costs, particularly as new therapies involve

[Figure 1-1: a chart of the share of persons surviving to successive ages, for persons born 1851 to 2031 in England and Wales, according to mortality rates experienced or projected (on a cohort basis). Data source: Office for National Statistics (ONS); visualization by Our World in Data.]

Figure 1-1: Life expectancy gains have been consistent, but the rise in the costs of R&D
suggests this growth might not continue. [25]

a combination of drugs. The low probabilities of success, identified by Lo, Wong, and Siah
[34], make investments into drug development too risky to be attractive for most investors.

The effects of low success probabilities hit hardest for early-stage projects. In
fact, the risks are amplified, since FDA approval usually requires three phases of trials that
sequentially increase in cost. The overall probability of approval is the product of
the transition probabilities across the phases. As one could expect, this product quickly falls
to single-digit success percentages, particularly in certain disease areas that matter the most
for society, like oncology. As a result, many of the promising discoveries in science have
not been quickly translated into advances to the standard of care. Translating science into
successful drug development requires crossing the Valley of Death that separates academic
research from the biopharmaceutical industry [6]. This presents an opportunity for new
business models.
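To make the compounding concrete, the following is a minimal sketch using the approximate FDA phase-transition rates cited in Chapter 2 (roughly 70% for phase 1, 33% for phase 2, and 25-30% for phase 3); the numbers are illustrative, not estimates produced in this thesis.

```python
# Illustrative only: approximate FDA phase-transition rates cited in Chapter 2.
phase_transition_rates = {"phase 1": 0.70, "phase 2": 0.33, "phase 3": 0.28}

overall_pos = 1.0
for phase, rate in phase_transition_rates.items():
    overall_pos *= rate
    print(f"P(success through {phase}) = {overall_pos:.3f}")
# P(success through phase 3) is roughly 0.065, i.e. single-digit percentages.
```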

In this thesis, we will be exploring one of the proposed new business models for acceler-
ating the development of new therapies: the megafund model [8]. In this work, we discuss

one of the critical operational challenges of a biomedical megafund: correlation in clinical
trial outcomes. The background section will provide an overview of the FDA drug review
process and the underlying concepts from financial theory that motivate the megafund idea.
In the next section, we will formalize the problem of correlated clinical trial outcomes and
formulate several approaches to the empirical estimation of correlation in a portfolio of drug
candidates. Following this theoretical portion, we will apply some of these models to his-
torical clinical trial data. In the results section, we will discuss the preliminary findings and
persisting challenges. We will conclude with directions for further work.

Chapter 2

Background

2.1 FDA Trials

The US Food and Drug Administration (FDA) is the federal agency of the Department of
Health and Human Services that holds the responsibility of regulating drugs. Two
branches of the FDA are relevant to this research: the Center for Drug Evaluation
and Research (CDER) and the Center for Biologics Evaluation and Research (CBER). The
aim of both centers is to ensure that drugs/biologics, "both brand-name and generic, work
correctly and that their health benefits outweigh their known risks" [30]. New drugs are
required to receive CDER approval before a pharmaceutical company may sell them. The
FDA approves a drug for an intended population, which is also called an indication. If a
drug is approved for a particular indication, it allows doctors to prescribe it "on-label" to
patients in that specified population [33]. Once a drug is approved for one indication, it
is possible for doctors to prescribe it "off-label" for other populations, but this practice is
met with additional risk and is not regulated by the FDA. This is because the FDA does
not directly regulate the practice of medicine. Off-label drug use is not uncommon, though
drug developers will often aim to obtain FDA approval for multiple indications for a single
drug. Each drug indication pair must show safety and efficacy through a series of human
studies, called clinical trials. Clinical trials, also known as clinical studies, are one of the
most important tools used by the FDA to assess the risks and benefits of a potential new
drug for an intended population. The challenges facing drug development are directly tied

to the structure by which drugs are regulated, thus the next subsections outline the journey
that a typical drug takes in order to be approved by the US FDA.

2.1.1 Preclinical

The early stages of drug development focus on identifying drugs that are both safe and
exhibit some pharmacological activity of interest. In terms of regulation, before entering
human trials, a drug must obtain apply to the FDA for Investigational New Drug (IND)
status. There are several of IND types, though we focus on commercial investigator INDs.
Before a drug can be tested in people, the drug company or sponsor performs laboratory and
animal tests to discover how the drug works and whether it's likely to be safe in humans.
These preliminary safety studies are combined with manufacturing information about the
drug in an IND application. The application must also specify clinical protocols to mitigate
risks to patients and provide information about the qualifications of the investigators. In
response to an IND application, the FDA may choose to approve the start of clinical trials
or to hold/delay/stop the investigation. The FDA will provide its reasoning if it does not
choose to approve the commencement of human trials [31].

2.1.2 Phase 1

Phase 1 is typically the smallest and least expensive stage of human studies. It focuses
primarily on safety and dosage rather than efficacy. Because estimation of the treatment
effect is not the objective in these studies, typically 20-100 volunteers participate. The FDA
states that approximately 70% of candidate drugs move on to the next phase [31]. An HHS
report in 2014 estimated the average phase 1 trial across all components of running the trial
and all therapeutic areas to be $3.8 million with a standard deviation of $1.5 million [27].
They weighted each therapeutic area equally in the analysis. They computed the weighted
mean and standard deviation of each cost component i and therapy area k for each phase j,
according to the following formulas:

$$\bar{x}_{i,j} = \frac{\sum_{k=1}^{N_j} w_{j,k}\, x_{i,j,k}}{\sum_{k=1}^{N_j} w_{j,k}}$$

$$s_{i,j} = \sqrt{\frac{\sum_{k=1}^{N_j} w_{j,k}\,\left(x_{i,j,k} - \bar{x}_{i,j}\right)^2}{\frac{N_j - 1}{N_j} \sum_{k=1}^{N_j} w_{j,k}}}$$

The components include clinical procedure costs, lab costs, site monitoring costs, adminis-
trative costs, and several others. It is important to note that many of these costs change
depending on the therapy area. For example, immunology was identified as the highest-cost
therapeutic area in their data set of reported costs.
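As a sketch, the weighted mean and standard deviation defined above can be computed directly. The helper below is hypothetical (it is not code from the HHS report) and assumes one vector of cost observations and one vector of weights per component and phase.

```python
import numpy as np

def weighted_mean_std(x, w):
    """Weighted mean and standard deviation of cost observations x with
    weights w, following the formulas above (requires len(x) > 1)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    n = len(x)
    mean = np.sum(w * x) / np.sum(w)
    # Denominator ((n - 1) / n) * sum(w) matches the weighted formula above.
    var = np.sum(w * (x - mean) ** 2) / (((n - 1) / n) * np.sum(w))
    return mean, np.sqrt(var)

# With equal weights this reduces to the ordinary sample mean and std.
print(weighted_mean_std([3.1, 4.2, 3.9], [1.0, 1.0, 1.0]))
```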

2.1.3 Phase 2

Phase 2 trials are the first to assess biological activity and clinical effect. They are larger
than phase 1 studies with up to several hundred patients. Approximately 33% of phase 2
studies transition into phase 3. The average cost is $13.35 million with a standard deviation
of $2.51 million [27].

2.1.4 Phase 3

Phase 3 trials attempt to assess the value a therapy would have in clinical practice. To
understand the health benefit from the new therapy, trials will compare against the standard
of care for the disease of interest. These studies typically involve more patients, often from
multiple sites to support the generalizability of the results. According to the FDA, roughly
25-30% advance to the next phase. The cost of a phase 3 trial is estimated to be $19.89 million
on average with a standard deviation of $8.59 million, with a trial lasting 1-4 years [27].

2.1.5 Approval and Exceptions

Following the successful completion of these clinical studies but before commercialization,
the sponsor of the study must submit a New Drug Application (NDA) or Biologic License
Application (BLA), which is the vehicle through which drug sponsors formally propose that
the FDA approve a new therapy for sale and marketing in the U.S. For an NDA to be
approved, the application must satisfy the following [29]:

• The drug is safe and effective in its proposed use(s), and whether the benefits of the
drug outweigh the risks.

• The drug's proposed labeling (package insert) is appropriate, and what it should contain.

• The methods used in manufacturing the drug and the controls used to maintain the
drug's quality are adequate to preserve the drug's identity, strength, quality, and purity.

Following NDA submission and FDA approval, drugs continue to be studied in phase 4
trials. The purpose of these trials is to identify the long-term effects of the drug. These
are relatively large trials, like phase 3, with comparable overall costs. Since these trials are
conducted following the successful approval of the drug, they are less influential in assessing
the risk of the project. Typically, if a drug is approved for the use, the revenues from sales
will be able to support the cost of additional studies. This is the sequence of trials required
to gain approval from the FDA, but there are also exceptions, particularly for diseases with
no approved therapies.
| Cost Component | Phase 1 | P1 % | Phase 2 | P2 % | Phase 3 | P3 % | Phase 4 | P4 % |
|---|---|---|---|---|---|---|---|---|
| Data Management Costs | $50,331 ($8,467) | 2.38% | $59,934 ($21,060) | 0.79% | $39,047 ($19,416) | 0.34% | $49,702 ($9,489) | 0.44% |
| Cost per IRB Approvals | $11,962 ($6,305) | 0.56% | $60,188 ($16,092) | 0.79% | $114,118 ($46,404) | 1.00% | $137,813 ($112,543) | 1.21% |
| Cost of IRB Amendments | $1,094 ($256) | 0.05% | $1,898 ($447) | 0.02% | $1,919 ($277) | 0.02% | $1,636 ($302) | 0.01% |
| SDV Costs | $326,437 ($65,659) | 15.32% | $406,038 ($80,573) | 5.34% | $400,173 ($66,429) | 3.52% | $353,602 ($62,942) | 3.10% |
| Patient Recruitment Costs | $37,050 ($21,666) | 1.74% | $161,140 ($102,066) | 2.12% | $308,672 ($174,702) | 2.71% | $298,923 ($252,042) | 2.62% |
| Patient Retention Costs | $6,145 ($4,745) | 0.29% | $15,439 ($6,970) | 0.20% | $24,727 ($15,868) | 0.22% | $30,568 ($40,466) | 0.27% |
| RN/CRA Costs | $178,237 ($90,473) | 8.36% | $441,053 ($140,390) | 5.80% | $939,540 ($614,943) | 8.25% | $820,775 ($880,644) | 7.20% |
| Physician Costs | $109,681 ($57,626) | 5.15% | $381,968 ($117,217) | 5.03% | $805,508 ($499,426) | 7.08% | $669,464 ($402,072) | 5.88% |
| Clinical Procedure Total | $475,667 ($371,586) | 22.32% | $1,476,368 ($633,445) | 19.43% | $2,262,208 ($1,033,618) | 19.79% | $1,733,576 ($2,251,401) | 15.22% |
| Central Lab Costs [d] | $252,163 ($203,342) | 11.83% | $804,821 ($313,577) | 10.59% | $849,180 ($600,134) | 7.46% | $419,758 ($377,823) | 3.68% |
| Site Recruitment Costs | $51,904 ($32,814) | 2.44% | $233,729 ($83,799) | 3.08% | $395,182 ($195,983) | 3.47% | $168,343 ($101,311) | 1.48% |
| Site Retention Costs | $193,615 ($79,974) | 9.09% | $1,127,005 ($544,068) | 14.83% | $1,305,361 ($1,382,296) | 11.47% | $1,835,341 ($1,335,892) | 16.11% |
| Administrative Staff Costs | $237,889 ($128,547) | 11.16% | $1,347,390 ($427,859) | 17.73% | $2,321,628 ($1,910,047) | 20.40% | $3,323,081 ($2,534,406) | 29.17% |
| Site Monitoring Costs | $198,896 ($128,142) | 9.33% | $1,083,186 ($392,798) | 14.25% | $1,624,874 ($717,034) | 14.28% | $1,549,761 ($979,371) | 13.60% |
| Subtotal (in $ million) | $2.13 ($0.80) | 100% | $7.60 ($1.46) | 100% | $11.3 ($4.93) | 100% | $11.39 ($8.53) | 100% |
| Site Overhead [c] | $528,685 ($235,862) | NA | $1,741,811 ($302,049) | NA | $2,541,313 ($1,091,082) | NA | $2,575,007 ($2,082,161) | NA |
| All Other Costs [c] | $1,139,887 ($468,077) | NA | $4,003,615 ($752,108) | NA | $5,967,193 ($2,577,692) | NA | $6,98,008 ($4,543,505) | NA |
| Total (in $ million) | $3.80 ($1.56) | NA | $13.35 ($2.51) | NA | $19.89 ($8.59) | NA | $19.95 ($15.15) | NA |

Table 2.1: HHS Clinical Trial Costs Weighted Averages by Component [27]

2.2 Finance Theory


While new effective therapies have tremendous value beyond financial returns, the dynamics
of drug development are highly influenced by the dynamics of capital markets. As previously
mentioned, drug development is expensive, long, and risky. This creates an unattractive risk
profile for many investors. Theory suggests that some of these problems may be surmountable

through financial engineering. In this section, we review some of the important concepts in
financial theory that motivate the research interest in correlated binary outcomes in clinical
development, including portfolio theory and securitization.

2.2.1 Crossing the Valley of Death with Portfolio Theory

Dating back to Markowitz's Portfolio Selection in 1952 [17], it is understood that investors
aim to maximize returns while reducing the variance in returns. Given a choice of N assets,
each with a guaranteed return of $r_i$, the optimal portfolio choice is simple. Since returns are
guaranteed, there is no variance of returns, so the problem is reduced to maximizing returns:

$$\max_{w} \sum_{i=1}^{N} w_i r_i = \max_{i} r_i.$$

This portfolio, if we can even call it that, constitutes the entire investment in the highest
yielding asset. However, in nearly all realistic settings, increased expected returns come
at the cost of increased variance. In this case, returns become a random variable. We
are interested in the joint probability distribution of returns for a combination of assets.
Depending on the relationship between the assets, we need to characterize how returns of
one asset are associated with returns of another. We have the following objective:

$$\max_{w} \sum_{i=1}^{N} w_i\, \mathbb{E}[r_i] \quad \text{s.t.} \quad \mathrm{Var}\!\left(\sum_{i=1}^{N} w_i r_i\right) = c$$

for some fixed constant c. To get the variance of the portfolio we are interested not only in
the variance of each individual asset, but the covariance between each pair of asset returns.
The covariance between asset returns $r_i$ and $r_j$ with means $\mu_i, \mu_j$ and standard deviations $\sigma_i, \sigma_j$ is
defined as $\mathrm{Cov}(r_i, r_j) = \mathbb{E}[(r_i - \mu_i)(r_j - \mu_j)] = \sigma_{ij}$. Now we can decompose the variance of
the returns for a portfolio with known asset weights as

$$\mathrm{Var}\!\left(\sum_{i=1}^{N} w_i r_i\right) = \sum_{i=1}^{N} w_i^2 \sigma_i^2 + 2 \sum_{i=1}^{N} \sum_{j=i+1}^{N} w_i w_j \sigma_{ij} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \sigma_{ij}.$$

Often it is useful to characterize the association between two variables with a normalized
measure, so that the measure ranges from -1 to 1. We formalize this notion with the Pearson
correlation coefficient, defined as:

$$\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}.$$

Notice that the correlation of $r_i$ with itself is 1, and the correlation of $-r_i$ with $r_i$ is -1. If assets are completely
uncorrelated, then for every pair $i \neq j$, $\rho_{ij} = 0$.

If we consider a portfolio with equal weights across the N assets, $w_i = 1/N$ for all i, then
we get the following:

$$\mathrm{Var}\!\left[\sum_{i=1}^{N} w_i r_i\right] = \frac{1}{N} \cdot \text{Average Variance} + \frac{N-1}{N} \cdot \text{Average Covariance}.$$

In this form it becomes clear that the average covariance becomes of utmost importance as
we increase the number of assets in the portfolio. The variance term tends to zero as N
increases, so the portfolio variance becomes a function of the covariance between the assets.
If we assume that the probability of success for each drug is nearly independent of the others,
then we can argue that the average covariance is low, making our portfolio of risky assets
much more attractive in this new form. In this scenario, the variance of our returns shrinks
in proportion to 1/N as more uncorrelated assets are added. This demonstrates the power of
diversification in modern portfolio theory.
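A minimal numerical sketch of this decomposition: for an equal-weighted portfolio in which every asset has the same variance and every pair has the same correlation, the portfolio variance approaches the average covariance as N grows. The sigma and rho values below are arbitrary illustrations.

```python
def equal_weight_portfolio_variance(n_assets, sigma, rho):
    """Variance of an equal-weighted portfolio where every asset has
    standard deviation sigma and every pair has correlation rho."""
    var, cov = sigma ** 2, rho * sigma ** 2
    return var / n_assets + (n_assets - 1) / n_assets * cov

for n in [1, 10, 100, 1000]:
    print(n, equal_weight_portfolio_variance(n, sigma=0.4, rho=0.1))
# The sigma^2 / n term vanishes; the floor is the average covariance
# rho * sigma^2 = 0.016.
```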

These ideas are core to financial planning [17] and already play an important role in
drug development. Krantz discusses how pharmaceutical companies can "diversify" drug
development risk by screening a large set of molecules to find what selectively binds to
proteins of interest [12]. Large pharmaceutical companies, like Merck and Co., advertise
their diversified brands in their annual reports. Fernandez et al. [8] take a logical next
step, extrapolating diversification to a portfolio of drugs that do not necessarily belong to
one company. Core to their risk management approach to drug development is using the
stability of returns afforded by diversification to attract alternative forms of financing.

2.2.2 Expanding the Pool of Available Capital with Securitization

Securitization refers to the selling of securities that are exclusively linked to a pool of cash
flows owned by a special purpose vehicle (SPV). The specified cash flows, typically loan
payments, are originated by an intermediary, but the rights to those cash flows are sold
to the SPV [9]. These cash flows are typically associated with underlying assets, such as
residential homes. In that case, the associated assets are called asset-backed securities.
While the risk tied to a loan is typically held by the originator, with securitization the risk
is transferred from the originator to the owners of the securities. The SPV stands at the
center of the transaction, purchasing rights to future cash flows from the originators with
the funds it generates from selling securities tied to the rights to the acquired cash flows.
There is typically an order in which the cash flows are returned to owners of the securities.
This seniority structure is organized into tranches, with the higher tranches gaining priority
over the lower tranches in the payouts, creating a collection of assets with varying levels of risk
exposure. Through this financial structure, investor payouts can be designed to meet investor
needs, while leaving the economic risks related to the underlying cash flows unchanged [9].
Pairing the potential of diversification to create steady cash flows with the securitization
model, it is possible to bring debt capital to the high-risk business of drug development. The
large supply of debt capital facilitates the scale needed for diversification. For this reason,
in the Megafund model, Fernandez et al. [8] explain that diversification and structured
financing are "inextricably intertwined." The megafund model has been demonstrated to be
feasible for both cancer and orphan diseases [5]. Still, questions remain regarding whether
the megafund model can be properly operationalized. In this report, we explore one of the
most significant concerns: the role of correlation estimation. In order to construct portfolios
with reliable cash flows, it is necessary to estimate the covariance between drug development
projects.
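As a toy illustration of the seniority structure described above (not a model used in this thesis), the sketch below pays a simulated portfolio cash flow first to a senior tranche and then to a junior one. The number of projects, success probability, payout, and face values are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical portfolio: 50 projects, each pays 3.0 on success (p = 0.1).
cash_flows = 3.0 * rng.binomial(1, 0.1, size=(10_000, 50)).sum(axis=1)

senior_face, junior_face = 8.0, 7.0
senior = np.minimum(cash_flows, senior_face)            # paid first
junior = np.minimum(cash_flows - senior, junior_face)   # paid from residual

# Seniority makes the senior tranche far safer than the junior one.
print("P(senior tranche impaired):", (senior < senior_face).mean())
print("P(junior tranche impaired):", (junior < junior_face).mean())
```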

2.3 Fitting a Covariance Matrix

Starting as a thought experiment, the portfolio approach to drug development in the mega-
fund model assumed a low level of pairwise correlation between assets to give a proof of

concept for the potential returns [8]. Focusing on two particular areas of drug development,
oncology focused and orphan disease focused megafunds were proposed and their returns
estimated [8] [5]. With the high impact of correlation on the returns of a megafund, the
next step was to characterize the interactions between projects in a principled way. Turning
to domain experts, Lo et al. provided estimates for pairwise correlation between proposed
therapeutics to Alzheimer's, where there is enormous unmet need [15]. These correlations
were qualitatively assessed to be low, moderate, medium, or high. These determinations
were primarily based on the biological pathways being targeted. For example, at the time
of the paper, the Tau pathway was considered to be a viable treatment target, so various
phosphorlyation inhibitors associated with that pathway were being considered as candidate
therapies. Domain experts would expect the success for any of these inhibitors to depend
upon the confirmation of the Tau pathway's importance in Alzheimer's disease. It follows
that this underlying variable could lead to an all or nothing distribution of clinical trial
outcomes among the phosphorlyation inhibitors that affect the Tau pathway. On the other
hand, candidate therapies designed to target neuroinflammation would more likely be un-
related to the Tau pathway, making the two therapy strategies act more like independent
"shots on goal." The qualitative estimates of correlation were matched to 0.10, 0.25, 0.50,
0.90.

It is possible that the matrix that experts agree on will not be a mathematically valid
correlation matrix. It is critical that a correlation matrix for megafund simulation be sym-
metric positive semidefinite. Therefore, it is necessary to project the qualitative correlation
matrix onto the space of positive semidefinite correlation matrices to find the closest valid
matrix. More formally, we aim to solve

$$\min_{X} \|A - X\|_W,$$

where A is our qualitative correlation matrix and X is a valid correlation matrix. The
notion of distance between matrices is defined using a weighted Frobenius norm, $\|A\|_W := \|W^{1/2} A W^{1/2}\|_F$, where W is a positive definite matrix of weights. Higham proposed an alternating
projections method that converges linearly [10].
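A minimal sketch of the alternating-projections idea is below. It is simplified: it omits the Dykstra correction that Higham's full method uses, and it works in the unweighted Frobenius norm, so it only approximates the nearest valid matrix.

```python
import numpy as np

def near_correlation_matrix(a, n_iter=100):
    """Approximate nearest valid correlation matrix to symmetric `a` by
    alternating projections (simplified; no Dykstra correction)."""
    x = a.copy()
    for _ in range(n_iter):
        # Project onto the PSD cone: clip negative eigenvalues to zero.
        vals, vecs = np.linalg.eigh(x)
        x = vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T
        # Project onto the set of matrices with unit diagonal.
        np.fill_diagonal(x, 1.0)
    return x

# An "expert" matrix that is not PSD (one eigenvalue is about -0.27).
a = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.9],
              [0.0, 0.9, 1.0]])
x = near_correlation_matrix(a)
print(np.linalg.eigvalsh(x))  # all >= 0 up to numerical tolerance
```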

Building off this result, Qi and Sun [21] proposed an algorithm for computing the nearest
correlation matrix. Their algorithm uses Newton's method, improving the run time for prac-
tical cases. Lo et al. [15] used this method to fit a valid correlation matrix from the qualitative
correlation estimates. They report that the Qi and Sun algorithm had minimal impact on
their correlation matrix, maintaining the relative correlations between projects that the ex-
perts had set among the 64 projects split across 9 strategies. Figure 2-1 shows their resulting
correlation matrix as a heatmap. Green corresponds to a correlation coefficient between 0 and
0.25, yellow represents 0.25-0.50, orange represents 0.50-0.75, and red indicates correlation
greater than 0.75.
This approach has two main limitations. First, the use of domain knowledge makes the
approach somewhat difficult to scale, since it would require aggregating a large number of domain
experts. Second, it requires domain experts to produce well-calibrated correlation mea-
sures. It could perhaps be overly optimistic to think that doctors can recognize correlation
coefficients from their experience. With these limitations in mind, we are motivated to explore
methods to estimate correlations from historical data.

[Figure 2-1 groups the 64 Alzheimer's projects by therapeutic strategy: amyloid; tau; neuroinflammation; APOE4; lipid metabolism; autophagy/proteasome/unfolded protein; hormones/growth factors; dysregulation of calcium homeostasis; heavy metals; and miscellaneous hypotheses.]

Figure 2-1: Heatmap of the correlations produced by the nearest valid correlation matrix to
the domain experts' predictions [15]

Chapter 3

Methods

3.1 Formalizing Correlation in Clinical Trials

3.1.1 Problem Overview

As outlined in the FDA background section, the commercial value of a drug is tied to
the FDA's judgment of safety and efficacy. Given a collection of drug/indication pairs,
each drug/indication could either succeed or fail to get approval from the FDA. An ap-
proved drug/indication could generate revenues in the billions of dollars, while a failed
drug/indication cannot be sold. These binary outcomes make the variance of the returns
for biomedical R&D investment so high, particularly when the number of projects is small.
Borrowing from financial theory, a collection of uncorrelated drug/indication pairs stabilizes
the returns. A suitable collection will need to have many drug/indication pairs. The use of
debt capital in the financing structure facilitates the scale needed to reduce the variance of
the portfolio.
While theoretically promising, one might think that some collections of drug/indication
pairs would not help to stabilize returns due to high interdependence between outcomes. For
example, all drugs targeting a specific protein could fail together if the associated pathway
is not relevant to the target indication. At the same time, one might think that clinical
trial outcomes for neurological diseases would be uncorrelated to clinical trials outcomes
associated with new influenza vaccines. While perhaps the extremes are straightforward,

it is critical to the successful construction of portfolios of candidate drugs to understand
the correlations between all assets in the portfolio. Given the pairwise correlations between

assets in the portfolio, the variance of portfolio D is given by:

$$\mathrm{Var}(D) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j\, \rho(r_i, r_j)\, \sigma_{r_i} \sigma_{r_j},$$

where $w_i$ is the percentage of the portfolio value contributed by asset i, $\sigma_{r_i}$ is the standard
deviation of the returns for asset i, and $\rho(r_i, r_j)$ is the correlation between the returns of asset
i and asset j. As derived in the previous section, the variance of the portfolio approaches
a linear function of the average covariance over all the pairs of trials when N is large.

Of course, we are not given the correlations between the financial returns of the projects,
so we must find ways to estimate them. For simplicity, we do not focus on characterizing the
returns directly; rather, we aim to estimate the dependence between the success/failure of
drug/indication pairs. This task has many challenges, some of the most pressing of which include:

• the one-shot nature of a drug-indication pair;

• projects begin, end, and change over time;

• projects vary in a variety of ways, leading to high dimensionality that may still not be
expressive;

• the limitations of the data.

Due to these challenges, it becomes necessary to make assumptions about clinical trials and
their interactions. We proceed to highlight the challenges and how to respond to them.

Drug-indication pairs are typically observed once

Unlike other covariance estimation tasks, we only observe one outcome for each drug-

indication pair. Sponsors of clinical trials must assess the probability of success for a candi-

date drug at each stage of testing, updating the probability as new information surfaces. It is

reasonable to cast the outcome of eventual FDA approval as a random variable, drawn from

some unknown distribution, though on the individual project level there is no well defined

mean and variance if we make no assumptions about the underlying distribution. In order to
address this concern, we consider the distribution of outcomes for a project to depend upon
its features in specific ways. In the data specification section, we will outline the available
features for each drug-indication pair in our empirical studies.

Projects begin, end, and change over time

Most statistical techniques assume the samples are drawn i.i.d. from a distribution. Because
the drug-indication pairs along with their approval status appear over time, it is harder
to justify the i.i.d. assumption, for two reasons. First, it is plausible to think that recent
trials inform future trials, so samples are not drawn independently. Second, as biological
understanding evolves and tools improve, the distribution from which samples are drawn
is likely not stationary. For these reasons, we explore online learning techniques that learn
from sequential data that are not necessarily i.i.d.. If there are clear breakpoints where
stationarity fails, we can use distinct time periods as a grouping mechanism, helping us to
understand the association between previous approvals to subsequent projects.

Embedding a drug-indication pair into a feature vector is non-trivial

Drug-indication pairs and their associated trials are not a standard object in statistics.
It is a challenge in itself to find suitable mathematical representations of a project that
reflect its probability of success (POS) and its relationship to other projects. Most of the
features of a trial are categorical. Simple examples include: the therapy area, route of
administration, and sponsor type. Despite being categorical, it is plausible to think that
there is a notion of similarity between therapy areas. For example, immunology is likely
more related to respiratory than it is to podiatry. This makes it difficult to be satisfied
with one-hot encoding the features. Embedding these features into a space that preserves
similarity relationships could be a promising direction for future work. Further, the effects of
these features on the distribution of outcomes are not well understood and are likely to be
non-linear. For this reason, we need to use models that are capable of capturing these
complicated interactions in a high dimensional feature space.
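As a small sketch of the baseline representation discussed above, one-hot encoding turns each categorical trial feature into indicator columns; the feature values here are made up for illustration. Note how the encoding treats every pair of distinct therapy areas as equally dissimilar.

```python
import pandas as pd

# Hypothetical drug-indication pairs with categorical features.
trials = pd.DataFrame({
    "therapy_area": ["immunology", "respiratory", "oncology"],
    "route": ["oral", "injectable", "oral"],
    "sponsor_type": ["industry", "academic", "industry"],
})

# One-hot encoding: every category becomes a 0/1 column. The distance
# between any two distinct therapy areas is identical, so the intuition
# that immunology is "closer" to respiratory than to podiatry is lost.
features = pd.get_dummies(trials)
print(features.columns.tolist())
```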

We are constrained by the granularity and completeness of our dataset

Our covariance estimation models will be constrained by the data available to fit them. The
two major challenges related to the data are 1) missingness in the data and 2) scale of the
data set. Some relevant features of a drug-indication pair are missing. For example, start
date would likely be an important factor for trial success, though it is frequently missing.
Additionally, we face the typical bias-variance trade-off common to many machine learning
problems. The expressiveness of our model must be weighed against the ability to fit the
parameters. Our data set, though relatively large for the study of clinical trials, is limited.

3.2 Online Learning

3.2.1 Batch vs Online Learning

Most machine learning results today fall into the paradigm of batch learning. Consider an
image classification method for identifying photos of cats versus dogs. Usually a classifier
for this task would be trained from a dataset of labeled examples of both cats and dogs. To
understand the performance of the method, the dataset is randomly split between training
and testing sets (sometimes another split is made for a validation set). The training set is
used to estimate model parameters, then the testing set is used to characterize the out-of-
sample performance, which hopefully generalizes well to the broader underlying distribution
of cats and dogs.
Some of the most successful applications of machine learning do not fit this paradigm, and
neither do many important problems that we may wish to solve. Many real world problems
have a streaming input of new information from which to make decisions. For instance, the
problem of weather prediction takes this form. At each time step, a forecaster must predict
an outcome. At the next time step, yesterday's weather has been revealed and the forecaster
can incorporate the knowledge of yesterday's weather to aid in the next prediction. Besides
weather, product recommendation systems must also learn as users express more information
about their own preferences over time. Similar to weather prediction, we have a set of the
previous output labels $y_1, \ldots, y_{t-1}$, in this case a sequence of recommended products; then at
each time step we make a prediction $\hat{y}_t$. Afterwards the true label is revealed. In this case,
the true label represents whether or not the customer purchased the recommended item.
Incorporating that information, we make the prediction for $t+1$. Overall, our objective is
to minimize the loss between our estimated label and the true label, encoding that with a
loss function $\ell(\hat{y}_t, y_t)$. Assuming that there are correlations in the sequence, we should be able
to learn a classifier that is able to predict the subsequent label with reasonable accuracy, a
notion we will formalize in this section.
In this new paradigm, some of the concepts and tools from batch learning are no longer
feasible. For this reason, online machine learning methods have been developed. Both
batch and online learning are concerned with problems of making decisions from limited
information, thus both can be analyzed using some of the same approaches. This section
will provide an overview of some of these online learning paradigms and methods, grounding
our analysis in binary classification problems.

3.2.2 Motivating Problem

Binary classification is the simplest case of online learning. However, despite being simply
specified, it is of tremendous importance to many real problems. Because the online learning
setting does not require i.i.d samples, it is potentially useful for our application of interest,
clinical trial outcome prediction. Theoretically it should be possible to estimate the prob-
ability of success for a new drug as a function of both the previous outcomes of clinical
trials and the characteristics of the clinical trial of interest. However, it is challenging
to specify this function directly, so we must find a way to estimate the function using the
sequence of historical outcomes. If we can understand the distribution of potential future
sequences, we can build better portfolios of drug-indication pairs.

3.2.3 Why Online Learning for Clinical Trial Prediction?

While it is true that many important problems have streaming information, there is another
important reason that makes the online learning setup attractive for clinical trial prediction.
Online learning is particularly useful in cases where the interactions between variables are

complex and unknown. Many of the most advanced weather prediction systems incorporate
a physics-based model of weather dynamics, essentially fitting probabilistic models to the pro-
cess that generates the data [20]. This is a reasonable approach because climate scientists
understand many of the physical laws that govern changes in the weather. On the other
hand, there currently are no well-validated models for clinical trial outcome prediction. If
we assume the wrong model, it is difficult to argue for the meaning of the
forecasts. In contrast, the online learning setting does not attempt to fit the process; rather, it
only aims to model the results. It is simpler to cast the problem in terms of a class of candi-
date functions, rather than making untenable assumptions about the underlying probability
distribution.

3.2.4 Outlining this section

Online methods, particularly sequential prediction, allow us to overcome some of the pre-
viously described challenges. In this setting, we train a model to output the results of a
sequence of future clinical trials according to an underlying probability distribution that is
learned from the data. Taking the collection of simulated result vectors, we can compute the
mean and variance for each trial of interest, further allowing an accurate assessment of risk
to overcome the current market failures. To be able to effectively predict clinical trial out-
comes, this section will draw on the existing frameworks for online learning. It will start
with the simplest case of predicting binary outcomes from a known probability distribution
class (without knowing its parameters). Generalizing this result, we will then review the
theoretical results related to predicting sequences of bit strings, without any knowledge of
the underlying generating process. Understanding that there are existing methodologies to
evaluate clinical trial success probabilities, Section 3.2.7 presents the "expert setting" which
makes it straightforward to model the problem of aggregating the predictions from a col-
lection of sources. Given that we have additional information about the trial features (size,
disease area, etc.), Section 3.2.8 explores the paradigms for incorporating side information
into prediction algorithms. Sections 3.2.9 and 3.2.10 review slight modifications to the tra-
ditional online learning paradigms that are relevant to clinical trial prediction, namely order
invariance and long-term prediction. With these paradigms and theoretical results, this paper
aims to inform the reader of the existing tools that would be useful for building online
learning systems for clinical trial outcome prediction.

3.2.5 Predicting a Coin Toss

We start with the hypothetical example of predicting a (potentially unfair) coin toss. In
this scenario, the results are governed by sampling an unknown binomial distribution, pa-
rameterized by p. For each time step, we can make a prediction $\hat{y}_t$ for the outcome $y_t$. Our
objective is to maximize the number of times that we correctly predict the coin toss. From
the perspective of batch learning, we would use the empirical risk minimization (ERM) strategy with
the objective of minimizing the expected indicator loss $\mathbb{E}[\mathbb{1}(\hat{y} \neq y)]$, by minimizing:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{1}(\hat{y}_t \neq y_t).$$

Casting the problem in the online learning setting, we define a set of candidate functions
that map time steps to labels: $\mathcal{F} = \{f : t \mapsto y\}$. We aim to minimize the difference between
our predictor's performance and the performance of the optimal choice of $f \in \mathcal{F}$, which we
will call $f^*$. In this form, rather than minimizing loss directly, we minimize the regret:

$$R(\hat{y}, y) = \sum_{t=1}^{T} \mathbb{1}(\hat{y}_t \neq y_t) - \mathbb{1}(f^*(t) \neq y_t).$$

In this case, since we know the underlying distribution, we also know $\mathbb{E}[\mathbb{1}(f^*(t) \neq y)] = \min(p, 1-p)$. Obviously, we would need to know p to be able to perform as well as $f^*$, so
instead, we simply predict the label that is currently the majority of the previous labels:

$$\hat{y}_{t+1} = \mathbb{1}\!\left(\frac{1}{t} \sum_{i=1}^{t} y_i \geq \frac{1}{2}\right).$$

Since the generating probability distribution is binomial, each coin flip is i.i.d. The i.i.d.
property of each sample allows us to use the law of large numbers to conclude that we will
converge to the optimal predictor:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}(\hat{y}_t \neq y_t) - \min(1-p, p) = 0.$$

Additionally, the central limit theorem implies that our estimate of the parameter p will be
no further than $O(1/\sqrt{n})$ off with high probability [23]. From this analysis it is clear that we
can make theoretical arguments for problems in the online learning setting.
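A minimal sketch of the follow-the-majority predictor and its error rate under indicator loss, assuming a hypothetical coin bias of p = 0.7 for illustration:

```python
import random

random.seed(0)
p, T = 0.7, 10_000
mistakes, heads = 0, 0

for t in range(1, T + 1):
    y_hat = 1 if heads >= (t - 1) / 2 else 0   # majority of past labels
    y = 1 if random.random() < p else 0        # true coin flip
    mistakes += int(y_hat != y)
    heads += y

print("error rate:", mistakes / T)       # approaches min(p, 1 - p) = 0.3
print("optimal f* error:", min(p, 1 - p))
```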

3.2.6 Bit Prediction

Often we will not know the underlying distribution of the output sequence, so in this
section we will generalize some of the results from the previous section on predicting a coin
toss to predicting a bit string. The difference here is that the sequence is not necessarily a sampling
from a binomial distribution. Again we have a sequence of labels $y_1, \ldots, y_t \in \{0, 1\}^t$ and
our task is to generate a prediction $\hat{y}_{t+1}$ for the next label, given all the previous labels.
Similar to the previous section, we will use indicator loss. Note that for any deterministic
algorithm for the prediction of a sequence, there exists a sequence of labels where
every prediction will be incorrect [22]. To overcome this challenge, we can use a randomized
algorithm to generate predictions. Even then, it is impossible to construct an algorithm that
can correctly predict a truly random sequence of bits with above 50% accuracy. The truly
random case occurs when all bit strings are equally likely, but often in practical applications
we have certain types of sequences that occur more frequently. We can prove the existence
of an algorithm that achieves a high prediction accuracy on a subset of sequences, but must
in turn perform worse on others. We define a function $\phi(y_1, \ldots, y_n)$ which maps vertices of
a hypercube to real numbers. We require $\phi$ to have two properties. First, $\phi$ must be relatively
stable, meaning:

$$|\phi(y_1, \ldots, y_i, \ldots, y_n) - \phi(y_1, \ldots, 1 - y_i, \ldots, y_n)| \leq \frac{1}{n}.$$

Second, we must have $\phi$ satisfy $\mathbb{E}[\phi] \geq \frac{1}{2}$, which is directly related to the above result that
we cannot do better than $\frac{1}{2}$ on average over truly random bit strings. If these two criteria
are satisfied, then Cover's result [16] states that there must exist an algorithm $A$ where

$$\forall y = (y_1, \ldots, y_n), \quad \hat{R}(A, y) = \phi(y).$$

This result gives us hope that it is possible to construct algorithms that perform well for
real-world problems.

3.2.7 Learning from Experts

The above section proves the existence of an algorithm that obtains our desired results, but it
does not specify how to construct these algorithms. For that we turn to a new paradigm
for online classification methods. In this new setting we have a collection of experts, each
providing a prediction at every time step. Again we aim to predict an unknown sequence
$y_1, \ldots, y_n \in \mathcal{Y}$. Our predictions $\hat{p}_1, \ldots, \hat{p}_n$ fall into a decision space $\mathcal{D}$, which is a convex vector
space [4]. We have access to the set of expert predictions at time t, which we call $f_{E,t}$. Our
objective is to minimize the loss between the predictions and the true labels of the sequence
by minimizing the regret (as was defined in Section 3.2.5) with respect to the predictions of
each expert. Here we assume that the number of experts is finite. Now we must devise ways
to incorporate input from the set of expert predictions such that we achieve vanishing regret
at a time step sufficiently far in advance. Put another way, we hope to achieve performance
comparable to the best expert in our set of experts over time [1].
The natural first approach for incorporating expert predictions is to weight the inputs
from the set of experts according to their credibility. Thus our predictions are informed by
the following quantity:

$$\hat{p}_t = \frac{\sum_{i=1}^{N} w_{i,t-1} f_{i,t}}{\sum_{j=1}^{N} w_{j,t-1}},$$

where $w_{i,t-1}$ corresponds to the credibility of expert i at time step $t-1$. To define
the credibility weights, we use the regret up to that time point. Intuitively, if regret for
not following the predictions of expert e increases, then e's credibility will increase as well.
Depending upon the application, we may wish to change the relationship between increases
in regret versus increases in credibility weight. We can re-write the above in a new form,

specifically where the weights are the derivative of a function of the regret:

$$\hat{p}_t = \frac{\sum_{i=1}^{N} \phi'(R_{i,t-1})\, f_{i,t}}{\sum_{j=1}^{N} \phi'(R_{j,t-1})}.$$

Cesa-Bianchi and Lugosi [4] use this formulation as the basis for their analysis of generaliza-
tion bounds. They define a potential function $\Phi$, which allows them to rewrite our original
weighted average forecaster approach and bound the difference between the best
expert and the weighted approach. In addition, the potential function method generalizes to
allow other schemes for weighting expert predictions. For example, Cesa-Bianchi and Lugosi
consider a polynomially weighted average forecaster, which allows tighter bounds of the
following form:

$$\hat{L}_n - \min_{i \in [1, \ldots, N]} L_{i,n} \leq \sqrt{n(p-1)N^{2/p}},$$

where $p \geq 2$, N is the size of the set of experts, and n denotes the current time step. We can
see here that we do not need an underlying assumption about the probability distribution of
the process generating the data to argue theoretical bounds on the performance of our algorithm
compared to the best expert.

Another approach using this framework is the exponentially weighted average forecaster,
which is commonly used in this setting. In this case, the weights for the experts
take the following form as a function of the regret:

$$w_{i,t-1} = \frac{e^{\eta R_{i,t-1}}}{\sum_{j=1}^{N} e^{\eta R_{j,t-1}}},$$

where $\eta$ is a positive parameter. This form is considered to be particularly useful because it
is a function that depends only upon the past performance of the experts, not on their past
predictions. If we assume a convex loss function that is bounded between 0 and 1 (which
notably does not include indicator loss), then we have a regret bound that takes the following
form:

$$\hat{L}_n - \min_{i \in [1, \ldots, N]} L_{i,n} \leq \frac{\ln N}{\eta} + \frac{n \eta}{2} = \sqrt{2 n \ln N},$$

when we set $\eta = \sqrt{2 \ln N / n}$.

Cesa-Bianchi and Lugosi [4] continue by outlining a collection of modifications to this
base paradigm. Many of these changes affect the regret bound. Examples of
these alterations include the ability to simulate the future predictions of an expert, and
settings where the regret is discounted over time (using a weighting factor that decreases as
time progresses).
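A minimal sketch of the exponentially weighted average forecaster is below, using two stylized constant experts and squared loss as a stand-in for the generic bounded convex loss; the learning rate follows the eta = sqrt(2 ln N / n) prescription above. The expert values and outcome bias are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 1000, 2
eta = np.sqrt(2 * np.log(N) / n)

y = rng.binomial(1, 0.7, size=n)                 # outcome sequence
experts = np.stack([np.full(n, 0.7),             # a well-calibrated expert
                    np.full(n, 0.4)])            # a poorly calibrated one

regrets = np.zeros(N)                            # R_{i,t} per expert
forecaster_loss, expert_loss = 0.0, np.zeros(N)
for t in range(n):
    w = np.exp(eta * regrets)                    # exponential weights
    p_hat = w @ experts[:, t] / w.sum()          # weighted average forecast
    losses = (experts[:, t] - y[t]) ** 2         # squared loss per expert
    forecaster_loss += (p_hat - y[t]) ** 2
    expert_loss += losses
    regrets += (p_hat - y[t]) ** 2 - losses      # regret vs each expert

print("regret vs best expert:", forecaster_loss - expert_loss.min())
print("bound sqrt(2 n ln N):", np.sqrt(2 * n * np.log(N)))
```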

3.2.8 Side Information

So far, the predictions for the sequences have been purely functions of the context. We were
given a collection of previous labels $y_1, \ldots, y_{t-1}$ which we could use to predict $y_t$. In many
cases there is additional information provided to help make the prediction. In our problem
of interest, clinical trials certainly are related to the previous successes and failures of other
clinical trials, but the success probability is also a function of some of the characteristics
of the trial itself. In the online learning setting this additional information is called side
information. Similar to a hidden Markov model (HMM) that has both labels and underlying hidden states,
the side information can play an important role in aiding the performance of the predictor.
Xiao and Eckert present several algorithms for online sequence prediction that are able
to effectively incorporate side information [36].
of a process on a desktop/mobile system in real-time. Each process produces an ordered
sequence of system calls which request different services from the operating system. Their
goal was to predict the next system call using the context of observed sequences as well as the
side information about the operation of the machine such as arguments, return values and
structures. Their approach is parametric, fitting a Gaussian distribution over the weights in
their algorithm. At each step these weights are updated with invariants placed such that it
always forms a valid distribution.

3.2.9 Sequential Event Prediction

Using the online learning and sequential prediction frameworks that we have explained pre-
viously as a foundation, researchers have made changes to this paradigm to reflect charac-
teristics of a particular problem. Letham, Rudin, and Madigan [14] present a slight twist

on sequential prediction, which they term sequential event prediction, involving context
sets to replace sequences. Inspired by the widely used product recommendation systems
in e-commerce applications, the authors did not want the order of a sequence of labels to
constrain the learning algorithm. Take for example an online store for groceries. It would
be useful to design a system that would learn to predict the collection of elements that the
customer would purchase. The authors argue that a customer who adds apples to her cart
before adding bananas is not meaningfully different from one who selects the two items in
the opposite order. The authors present two different models for this setting.
In both cases, they use an ERM framework to estimate the relationships between items in
the observed part of the sequence and potential future items. In mathematical terms, these
rankings are performed using a scoring function $f(x_{1:t}, a)$ that takes the context sequence
$x_{1:t}$ and an item a from the set of products. An effective scoring function f will produce
the correct ordinal relationship between products, matching the conditional probability of a product
given the context sequence: $\mathbb{P}(a \mid x_{1:t})$. One way to construct this function is using a set of
coefficients that represent the pairwise interaction between product sequences and an addi-
tional product. In this model we have coefficients $\lambda_{x,b}$, where x is every valid item set (note
that order does not matter), which always includes the empty set $\emptyset$, and b is every individ-
ual item. With these coefficients, defining the valid item sets to be the power set of all the
products, the scoring function takes the following form:

$$f(x_{1:t}, b; \Lambda) = \sum_{a \in \mathcal{P}(x_{1:t})} \lambda_{a,b},$$

where P(xit) refers to the power set of the previous items in the basket xi:t. Note that
using the powerset makes number of coefficients scale exponentially with the length of the
sequence. To control the dimensionality of the problem, it is possible consider fixed length
product sequences. The only constraint is that the empty set is included. The authors
term this model "one-stage" because all the coefficients in A are estimated simultaneously.
The authors propose another model, termed the "ML-constrained model" because they use
a maximum likelihood estimate for the conditional probabilities to find the values for the

38
coefficients λ:

λ_{a,b} = μ_a · P̂(b | a)

where the proportionality constant μ_a depends only on the basket, not on the incremental product. This second model
provides a principled way to reduce the number of parameters, so that estimation error
is controlled relative to model complexity. The authors illustrate that this model is able to
achieve high performance in the product recommendation task as well as in predicting the set
of recipients for an email message. Overall, there are many problems that naturally fall into
this paradigm, where the set of past events matters for predicting the future rather than
the order of the context sequence. Using existing tools in online convex optimization [28]
and statistical learning theory, this paper serves as a foundation for extending the sequential
event prediction paradigm to other problem settings, such as clinical trial prediction. We may
not care about the order in which drug-indication pairs within our portfolio are approved,
only about the set that gets approved.
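To make the one-stage scoring function concrete, here is a small sketch; the dictionary
encoding of the coefficients λ and the toy values are illustrative assumptions, not the
authors' implementation.

# Sketch of the one-stage score f(x_{1:t}, b): sum the coefficients lam[(a, b)]
# over all subsets a of the current basket.
from itertools import chain, combinations

def powerset(items):
    s = list(items)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def score(basket, b, lam):
    # Unknown (subset, item) pairs default to a zero coefficient.
    return sum(lam.get((frozenset(a), b), 0.0) for a in powerset(basket))

# Toy coefficients for a grocery example.
lam = {(frozenset(), "bananas"): 0.1, (frozenset({"apples"}), "bananas"): 0.7}
print(score({"apples"}, "bananas", lam))  # 0.1 + 0.7 = 0.8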

3.2.10 Long-Term Sequence Prediction

If the goal is to use effective prediction techniques to help accelerate the development of ther-
apeutics, then clearly the time horizon for our predictions is important. If we are hedge
fund managers interested in the short-run performance of clinical trial outcomes, then un-
derstanding the probabilities of success for the next few trials would be of greatest use. With
good probability estimates, we can adjust our portfolios according to the risk ahead of the an-
nouncements. If, however, we are longer-term investors in drug development projects, we are
not necessarily interested in the next clinical trial outcome, but rather in a collection of
outcomes D timesteps into the future. Korotin et al. [11] study the online learning paradigm
with exactly this modification, where we want to predict y_{t+D} given a context sequence
y_1, ..., y_t. Theoretical results for this form of delayed feedback from Weinberger and
Ordentlich [32] show that under indicator loss, the optimal long-run predictor
p*_D(x_{t+D} | x_t, ..., x_1) can be formed from the optimal short-run predictor
p*(x_{t+1} | x_t, ..., x_1). This result involves running D independent copies of
the optimal short-run predictor p* on time periods discretized by D, characterized by the

following:
p*_D(x_{t+D} | x_t, ..., x_1) = p*(x_{t+D} | x_t, x_{t-D}, x_{t-2D}, ...)

Korotin et al. build upon these results, generalizing them for the expert setting. These re-
sults essentially reduce the problem of finding a long-run expert aggregation procedure A* to
characterizing the optimal short run aggregation procedure A*. They propose a new expert
aggregation algorithm called Overlapping Experts which expands the set of experts by in-
corporating the previous predictions. They prove that their algorithm has time-independent
O(1) regret bound with respect to the best expert in the extended pool [11].
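A rough sketch of this construction, under our reading of the delayed-feedback result (the
index bookkeeping and the toy one-step rule are assumptions for illustration, not the paper's
algorithm):

# D-step-ahead prediction via D interleaved copies of a 1-step predictor,
# each copy seeing only every D-th observation.
def delayed_predict(history, D, short_run_predict):
    target = len(history) + D - 1      # 0-based index of the element D steps ahead
    r = target % D                     # residue class that owns the target
    subsequence = history[r::D]        # the stream seen by that copy
    return short_run_predict(subsequence)

# Example with a trivial "repeat the last observed symbol" short-run rule.
last_symbol = lambda seq: seq[-1] if seq else 0
print(delayed_predict([1, 0, 1, 1, 0, 1], D=3, short_run_predict=last_symbol))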

3.2.11 Online Learning Takeaways

Motivated by the task of predicting clinical trial outcomes in an online setting, we have
covered a broad set of results. We started by explaining the key differences between batch
and online learning. The next sections reviewed the foundational theoretical results, related
to learning to predict a coin flip and an arbitrary bit string. It can be convenient for clinical
trial prediction, and many other problems, to cast the online classification problem in the expert
setting, where the goal is to match the performance of the best expert in the set. Building
upon the fundamentals, we covered some extensions of the framework in order to incorporate
side information, learn sets rather than sequences, and make long-term predictions out of
optimal short-run prediction algorithms.

3.3 Function Approximation via Random Forests

Inspecting the covariance formula, we see that it is possible to decompose it into two pieces.
If we were able to effectively estimate each piece, we could hope to estimate the covariance.
As before, we define each clinical trial to have binary outcome Y ∈ {0, 1}. Note that the
covariance of the outcomes for clinical trials i and j is defined as,

Cov(Y_i = 1, Y_j = 1) = P(Y_i = 1, Y_j = 1) − P(Y_i = 1) · P(Y_j = 1).

40
The first term represents the joint probability that both trials i and j succeed. The second
term is the product of the individual probabilities of success.

As new drugs are being developed, it is common to release information about the drug
and the design of the trials. This information can help to inform the probability of success for
the project. We can modify the above expression to incorporate this knowledge as follows,

Cov(Y_i = 1, Y_j = 1 | X_i, X_j) = P(Y_i = 1, Y_j = 1 | X_i, X_j) − P(Y_i = 1 | X_i) · P(Y_j = 1 | X_j).

We are interested in the conditional covariance of the two drug-indication pairs given the
features of the drug and trials. We make the simplifying assumption that the probability of
success for a project alone is not affected by the features of another trial, but that the joint
probability distribution of outcomes does depend on both sets of features. We believe that
this is a reasonable choice because the outcome of the trial depends on the ability for the
drug to show both efficacy and safety in its trials, which depend on that project's features
alone.

On the other hand, for the joint probabilities, we expect that the probability for a pair
of trials to be approved depends on a notion of similarity between the features of the two
projects. For example, we can imagine a case where two drug-indication pairs have nothing
in common. In that case, we would expect their outcomes to be nearly independent. If
they are independent, then we would expect the joint probability to be very close to the
product of the marginal probabilities. Alternatively, if we have two trials that are very
similar, then the probability that they both succeed together would approach the marginal
probability that one succeeds. If we assume that the underlying variable that determines both
trial outcomes is a Bernoulli random variable with parameter p, then the joint probability
that both succeed is equal to p and each marginal probability also equals p. The resulting
covariance between these trials would be p − p² = p(1 − p), which is the same as the variance
of a Bernoulli random variable. This variance is positive for 0 < p < 1, allowing us to confirm
that these trials are positively correlated with coefficient 1. Conversely, we could imagine
that two projects could have contradicting views about an underlying biological process, such
that if one succeeds, the other certainly will fail. For simplicity, if we assume that one of

the two projects will succeed according to a Bernoulli random variable with parameter p,
then we get P(Y_1 = 1, Y_2 = 1) = 0 by definition. However, the individual projects will have
probabilities P(Y_1 = 1) = p and P(Y_2 = 1) = 1 − p. The resulting covariance is −p(1 − p),
leading to a correlation coefficient of −1, as expected.
Building on this intuition, we hope to train classifiers for determining the success of a
single project (defined by the drug and indication) and a pair of projects. The quality of
our covariance estimates will depend not only on the fit of each individual predictor, but
also on the consistency between the two predictors, according to the following:

Cov(Y_i = 1, Y_j = 1 | X_i, X_j) = f_pair(X_i, X_j) − f_single(X_i) · f_single(X_j).

To predict the probability of success for a trial, we fit a function from the features
X to the outcomes Y. Typically the features are categorical variables, denoting the phase of
the trial, the disease area, or the sponsor type. Given the features of two projects, we need
a non-linear function to capture the interactions between the two projects. We also expect
that the probability of success for a single trial would not be a linear function of the features.
On the other hand, our data set of drug development projects is not endless. The set of
thousands of trials will be useful for fitting models, but likely rules out very complicated
function classes, such as deep neural networks. These assumptions guide our selection of
suitable models.
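Concretely, a minimal sketch of this decomposition using scikit-learn random forests might
look as follows; the function names, feature encoding, and hyperparameters are illustrative
assumptions rather than a fixed specification.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_estimators(X, y, X_pair, y_pair, n_trees=500):
    # f_single approximates P(Y = 1 | X);
    # f_pair approximates P(Y_i = 1, Y_j = 1 | X_i, X_j).
    f_single = RandomForestClassifier(n_estimators=n_trees).fit(X, y)
    f_pair = RandomForestClassifier(n_estimators=n_trees).fit(X_pair, y_pair)
    return f_single, f_pair

def estimate_covariance(f_single, f_pair, x_i, x_j):
    # Cov(Y_i = 1, Y_j = 1 | X_i, X_j) ≈ f_pair(X_i, X_j) − f_single(X_i) · f_single(X_j).
    # Assumes both classes appear in training, so column 1 is P(Y = 1).
    p_joint = f_pair.predict_proba(np.concatenate([x_i, x_j]).reshape(1, -1))[0, 1]
    p_i = f_single.predict_proba(x_i.reshape(1, -1))[0, 1]
    p_j = f_single.predict_proba(x_j.reshape(1, -1))[0, 1]
    return p_joint - p_i * p_j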

3.3.1 Random Forests

Random forests are an efficient classification method that have several key strengths for
clinical trial success prediction given categorical features. We will first highlight the funda-
mentals of random forests. Second, we will describe some important theoretical elements.
Then we will discuss limitations and the suitability for clinical trial prediction.

Fundamentals of Random Forests

Random forest classifiers are an ensemble method based on decision trees [2, 13]. Ensemble
methods use a set of classifiers together in order to obtain a more accurate predictor than

any single classifier in the set. For each sample, each classifier contributes a vote for a
single classification. There are many variants of the voting scheme, with the simplest being
a majority vote among the classifiers. In the two class setting, random forests take the
majority vote of a specified number of decision trees that are constructed as follows. For
each tree, a subset of the training data is sampled with replacement and a subset of the
features is drawn randomly. This approach to sampling the training set is referred to as
bagging. The remaining samples that are not selected are referred to as out of bag. At
each level in the decision tree, a condition is selected from the feature subset. At each split,
a new set of features is sampled. The combination of sampling the training set and the
feature set produces a variety of trees with low dependence on one another. The depth of the
trees is a hyperparameter of the classifier, along with the number of trees to grow. We will
discuss three important hyperparameters used in random forests: the splitting criterion, the
number of trees, and the size of the subset of features to sample at each split.

Splitting Criterion

The most common ways to select the condition for splitting are the Gini index and infor-
mation gain [3]. A condition compares one of the variables from the selected subset to a
value. Intuitively, we want to select a condition that separates the classes as well as possible
for the training samples that would be encountered by this portion of the tree. Random forest
classifiers that use the Gini index as the splitting criterion select the condition with the
smallest weighted Gini index. The Gini index is used to measure the impurity of data. To
compute this statistic, we calculate the Gini index on each side of the split and then weight
each side by its proportion of samples. We select the condition t* according to:

t* = arg min_t  N_{t,T} [1 − Σ_{i=1}^{C} P(Y = i | t)²] + N_{t,F} [1 − Σ_{i=1}^{C} P(Y = i | ¬t)²]

where C is the number of classes, N_{t,T} and N_{t,F} are the proportions of samples on the
true and false sides of the split on condition t, and P(Y = i | t) is the proportion of samples
with class label i among the samples satisfying t. The weighted Gini index is bounded between
0 and 1, where a perfect split gives a Gini index of 0. Due to the weighting factors, we can

see that the algorithm will favor splits that produce relatively large partitions, particularly
close to the root of the tree. Separating a small fraction of the samples perfectly will be
down-weighted, reducing its influence on the Gini index for that condition.
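To make the criterion concrete, a small sketch of computing the weighted Gini index for a
candidate split (the toy labels and mask are illustrative):

import numpy as np

def gini_impurity(labels):
    # 1 - sum_i P(Y = i)^2 over the class proportions in `labels`.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(labels, condition_mask):
    # Impurity on each side of the split, weighted by the fraction of samples.
    n = len(labels)
    true_side, false_side = labels[condition_mask], labels[~condition_mask]
    return (len(true_side) / n) * gini_impurity(true_side) + \
           (len(false_side) / n) * gini_impurity(false_side)

# A condition that separates the classes perfectly scores 0.
y = np.array([0, 0, 1, 1])
mask = np.array([True, True, False, False])
print(weighted_gini(y, mask))  # 0.0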
An alternative to the weighted Gini index criterion is information gain, which is
based on the concept of entropy. For a discrete random variable, entropy is defined as
H(Y) = −Σ_{i=1}^{C} p_i log₂ p_i, where Y is a categorical variable that takes on C values,
each with probability p_i for i = 1, ..., C. For each branch in the split created by a condition
t, we compute the proportion of samples on each side, N_{t,T} and N_{t,F}, just as with the
weighted Gini index. We use these proportions to weight the entropy on each side of the split,
taking the condition t* that minimizes the weighted entropy:

t* = arg min_t  −N_{t,T} Σ_{i=1}^{C} P(Y = i | t) log₂ P(Y = i | t) − N_{t,F} Σ_{i=1}^{C} P(Y = i | ¬t) log₂ P(Y = i | ¬t)

where P(Y = i | t) is the probability that a sample has class label i given the condition is true
and P(Y = i | ¬t) is the probability that a sample has class label i given the condition is false.
Both metrics are widely used for building decision trees, and in most cases the two approaches
agree on a splitting condition [24]. From a computational perspective, the information gain
method is more costly due to the need to calculate logarithms, so we move forward with
the Gini index method. Altering this parameter in our models did not appear to make
material differences in the classifiers.

Random Forest Generalization Error

Before training a random forest, we must specify the number of trees that should be grown.
The theoretical considerations for this choice make several assumptions about the data,
though they lead to powerful conclusions. As is typical in machine learning, we assume that
features and labels are drawn i.i.d. from a data distribution, (X, Y) ~ D.
Given a collection of classifiers h_1(x), h_2(x), ..., h_K(x) and a training set drawn from D, we
can define the margin function mg:

mg(X, Y) = (1/K) Σ_{k=1}^{K} I(h_k(X) = Y) − max_{j ≠ Y} (1/K) Σ_{k=1}^{K} I(h_k(X) = j)

If mg(X, Y) > 0, then the vote of the classifier set gives the correct classification: the
count of classifiers that predict the correct label is larger than the count of classifiers that
predict any other label. We can define the generalization error of the classifier as:

P_{X,Y}(mg(X, Y) < 0).

Now rather than considering an arbitrary set of classifiers, we are concerned with the random
forest classifier, so we can rewrite h_k(x) as h(x, Θ_k), where Θ_k is the sequence of variable
subsets associated with the creation of decision tree k. Breiman [2] proves that random forest
classifiers do not over-fit the training data because the generalization error approaches the
following limit,

P_{X,Y}(mg(X, Y) < 0) → P_{X,Y}( P_Θ(h(X, Θ) = Y) − max_{j ≠ Y} P_Θ(h(X, Θ) = j) < 0 ).

The proof uses the law of large numbers and the tree structure of the classifiers. Further,
an upper bound on the generalization error can be derived in terms of two parameters:
individual accuracy of each tree in the ensemble and the level of dependence between the
trees. Building off the concept of the margin function, we can define a margin function for
a random forest classifier as follows:

mr(X, Y) = P_Θ(h(X, Θ) = Y) − max_{j ≠ Y} P_Θ(h(X, Θ) = j).

The margin of the classifier determines its strength. We define the strength of the classifier
as s = E_{X,Y}[mr(X, Y)]. If we assume that s > 0, then we can bound the generalization error
with Chebyshev's inequality:

P_{X,Y}(mg(X, Y) < 0) ≤ Var(mr) / s².

Breiman's proof [2] shows that the variance of the margin depends upon the correlation
between trees. Therefore the overall generalization error depends upon the ability to construct
as many trees with as low correlation as possible, maintaining a high level of strength with

each tree. Conveniently, this theoretical result suggests that we do not have to worry about
overfitting when selecting the number of trees. However, we would expect to see a plateau
in performance, while incurring additional computational cost, as we increase the number of
trees beyond a critical point.
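As a rough illustration of this plateau, scikit-learn's warm_start mechanism lets us watch
the out-of-bag error flatten as trees are added; the synthetic dataset and tree counts below
are placeholders, not our experimental setup.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = RandomForestClassifier(warm_start=True, oob_score=True,
                             bootstrap=True, random_state=0)
for n_trees in [10, 50, 100, 200, 400, 800]:
    clf.set_params(n_estimators=n_trees)
    clf.fit(X, y)                          # warm_start grows new trees onto the forest
    print(n_trees, 1.0 - clf.oob_score_)   # OOB error; expect it to level off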

Random Forest Probabilities and Choosing a Subset of Features

After exploring the role of the number of trees to grow, we now turn our attention to how
the subset of features is picked for each split in each tree. Olsen and Wyner [?] examined the
probability estimates that are given by random forest classifiers, suggesting a critical role
for the feature subset hyperparameter (which they call mtry). It is common practice to count
the fraction of trees that voted for the positive label, interpreting the proportion of trees
as the probability of a positive label. Olsen and Wyner argue against this approach because
the fraction of votes is not constrained in any way to reflect the probability of a true label.
The classifications made by a random forest classifier depend on the majority vote of the
decision trees, so the loss over the data set would not change unless the median tree's vote
changes. The authors do note, however, that a random forest can produce good probability
estimates, but that doing so requires calibration steps.
Calibration can correct biases in the probability predictions of classifiers. Presumably,
if a well-calibrated binary classification model gives an 80% probability of a positive label
to a set of samples, then approximately 80% of those samples should actually belong to the
positive label class. The error rate of the classifier on a particular set of samples should be
a linear function of the predicted probabilities. A simple way to visualize this relationship
is with a calibration plot like the one in Figure 3-1. The x axis refers to the estimated
probability assigned by the classifier to a set of samples. The y axis refers to the true
distribution of labels for those same samples. We can see that the logistic regression
probability estimates are well calibrated because they closely follow the diagonal. Logistic
regression is calibrated by construction because it optimizes for log loss. On the other hand,
the naive Bayes classifier appears to overestimate the probability of a true label, while the
SVC and random forest classifiers push probabilities away from the extremes. A curve far
from the unit-slope linear relationship suggests that calibration steps should be performed [19].

Figure 3-1: A collection of classifiers that have different biases on the famous Iris dataset
[19]

Niculescu-Mizil and Caruana [18] show that isotonic regression performs well to calibrate
prediction probabilities for decision tree models. They find that Platt scaling (sigmoidal) is
effective when the predicted probabilities follow an S-curve, while isotonic regression is more
effective for correcting general monotonic distortions. When data is scarce, they show that
isotonic regression can overfit to the training set, making Platt scaling a better approach for
small data sets. Isotonic regression is a form of regression with the added constraint that the
fitted function must be non-decreasing.
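For reference, a minimal calibration sketch with scikit-learn; the estimator choice, synthetic
data, and cross-validation settings are illustrative assumptions.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
# method="isotonic" fits a non-decreasing map on cross-validated predictions;
# method="sigmoid" would apply Platt scaling instead (better for small data).
calibrated = CalibratedClassifierCV(rf, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]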

Practical Applications for Random Forests

Random forests have found wide use in a variety of fields such as ecology, medicine, bioinfor-
matics, and agriculture [7]. Our use of random forests in clinical trial prediction is motivated
by their successful application to fields with high dimensional categorical data with complex
non-linear interactions. Random forests have been used successfully before to predict out-
comes. Applied to genomics, random forest classifiers had the lowest prediction error rate in
classifying early stage ovarian cancer samples from normal tissue samples using mass spec-
trometry features compared to other classification methods like support vector machines and

k-nearest neighbors [35]. In an unrelated study using gene expression data, random forests
were shown to outperform all other tree-based methods [13]. Validated on highly complex
and high-dimensional genomic data, random forests appear to be well suited for the task of
clinical trial prediction.

Suitability of Random Forests for Clinical Trial Success Prediction

Our task of learning the effect of one project's features on another informs what kinds of
models might be effective. The features associated with trials are often binary or categorical
in nature. Random forest classifiers have the flexibility to capture non-linear relationships
and can be well-suited for probability estimation. For these reasons, we expect random
forests to outperform other common classification methods such as logistic regression or
naive Bayes classifiers, which can serve as baselines for performance.

Chapter 4

Data and Results

4.1 Clinical Trial Data

4.1.1 Dataset Specification

We are fortunate to have access to a large dataset of clinical trials, from which to inform
our analysis and fit models. Citeline is a product of Informa Pharma Intelligence. It is an
aggregation of information about both clinical trials and their results, including information
from resources like ClinicalTrials.gov. Citeline includes information from clinical trials dating
back to January 2000, providing 15+ years of both successful and unsuccessful trials. The
dataset includes 311,802 trials, associated with over 77 thousand drugs, 442 thousand investi-
gators and over 166 thousand organizations. Citeline is subdivided into 3 pieces: Trialtrove,
Pharmaprojects, and Sitetrove. Our focus is on characterizing the approval probabilities
for drug-indication pairs; to do this, we focused on integrating features from Trialtrove and
Pharmaprojects.

Trialtrove

Trialtrove is an aggregation of trial data. There are 311,902 trials, each with information
about the phase, status, sponsor, and many other attributes. Each trial targets a specific population
of interest. Trialtrove categorizes these populations with a hierarchy of classifications. The
broadest is the therapeutic area with 9 general categories such as oncology, cardiovascular,

and ophthalmology. Lower in the hierarchy is the disease, which is associated with a therapy
area. For example, Lupus is categorized under the autoimmune therapy area. There are
187 diseases, including N/A fields within each therapeutic area that catch unlisted diseases.
Within many diseases there are patient segments. For instance, within oncology, the breast
cancer disease category has patient segments like HER2 positive, Stage II, or adjuvant. These
segments are not mutually exclusive, as the stage of the disease is orthogonal to the HER2
biomarker, which is in turn unrelated to the type of therapy (adjuvant referring to therapy
strategies like chemotherapy or radiation therapy that aim to prevent recurrence of cancer
following a surgical procedure). Besides the disease, Trialtrove reports the phase of each trial.
The phases include the four phases described in the background section on the FDA, along with
joint I/II, II/III, and III/IV trials, which are relatively rare compared to the standard phases.
Other features of trials that are likely to be influential include the trial sponsor, whether it
met its accrual target, and its completion date.

Pharmaprojects

Pharmaprojects provides information about drugs and their tested indications. For each
drug-indication pair, the development status and global status indicate whether a drug-
indication pair is launched/registered (indicating approval), active (indicating it is still under
development), or ceased. Drugs are associated with companies that sponsored their approval,
called the originator, and the license holder. For example, Baloxavir is an anti-influenza drug
that was developed by Shionogi in Japan, but it has been licensed to Roche. Pharmaprojects
also contains features of the drug itself, such as the delivery route.

Linking the data sets together

In our task, we want to combine the features of the drugs with the features of the trials
that studied them. Together, we hope that these feature sets will inform the probability of
success for individual drug-indication pairs and for pairs of them. While Pharmaprojects and Trial-
trove are both part of the Citeline platform, there are some challenges. We use the drug IDs
to link the trials with the associated entry in Pharmaprojects. Interestingly, some features,
like the disease, have different categories in Trialtrove compared with Pharmaprojects. Fol-

50
-- I

lowing this merge, we default to including the Trialtrove feature when there is overlap with
Pharmaprojects.


Figure 4-1: Informa's Trialtrove and Pharmaprojects datasets include thousands of trials
from the whole spectrum of phases

Overview of the Data

Once the data from Trialtrove and Pharmaprojects were linked together, we selected a subset
of features that have been suggested to be relevant to the probability of success estimate
from previous work [34]. We show some of these distributions in Figures 4-1, 4-2, and 4-3.
Some of the other features that we included when building our dataset are listed in Table
4.1 and are described in more detail in [34].

route of administration    biological/chemical origin     drug medium
biological target          trial phase                    trial status
therapy area               patient accrual target number  actual patient accrual
disease type               trial design type              sponsor type

Table 4.1: Features from the Informa databases that were used for our analysis


Figure 4-2: Therapeutic area distribution in Trialtrove

One of the main challenges of this work is how to deal with missingness in the data.
Often, trials do not have entries for all of our features of interest. In the
discussion section we will cover further directions for imputation to address this problem.
For the analysis here, we use a 5-nearest neighbors imputation approach to fill in missing
values. Analysis of various imputation methods on this data is still an ongoing project.
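As one possible implementation of this fill-in step (the exact tool used is not specified here),
scikit-learn's KNNImputer performs k-nearest-neighbor imputation; the toy matrix below is
illustrative.

import numpy as np
from sklearn.impute import KNNImputer

# Toy one-hot style matrix with a missing entry encoded as np.nan.
X = np.array([[1., 0., np.nan],
              [1., 0., 1.],
              [0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.],
              [1., 0., 0.]])
imputer = KNNImputer(n_neighbors=5)   # matches the 5-NN choice described above
X_filled = imputer.fit_transform(X)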

4.1.2 Dimensionality Reduction

It is a challenge to describe a drug-indication pair in a feature vector. Each project has
a different number of trials, making a fixed-length representation difficult. As the data
specification described, many of the features are categorical, and some of the features have
many possible categories. For the drug features, route of administration has 40 possibilities,
drug medium has 20, and biological target has 44. Similarly for the trial features, status has
6 possibilities, therapy area has 9, sponsor type has 14, and trial design has 26. To represent
this information in a form friendly to machine learning, we one-hot encode the categorical
features, creating a 0-1 indicator variable for each category in the dataset. The data quickly
becomes high-dimensional. We applied dimensionality reduction techniques in order to assess
the potential to separate successful approvals from failures using the feature set.
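A minimal sketch of this encode-then-project pipeline, with made-up category values:

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "therapy_area": ["oncology", "cardiovascular", "oncology", "autoimmune"],
    "sponsor_type": ["industry", "academic", "industry", "academic"],
})
X = pd.get_dummies(df)                 # one 0-1 indicator column per category
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # low values mean the data is not easily 2-D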



Figure 4-3: The route of administration distribution among drugs in the data set.

One of the most commonly used methods for dimensionality reduction is principal compo-
nent analysis (PCA). PCA projects the data onto the first k principal components, those
associated with the largest singular values. Doing so preserves the maximum total variance
of the data in the reduced dimension. In Figure 4-4 we plot the result of projecting onto the
first 2 singular vectors, coloring each (drug, indication, trial) tuple by its development status;
this figure includes the pipeline drugs. The variance explained by the first two principal
components is relatively low, suggesting that the space describing a clinical trial cannot easily
be projected into two dimensions. It does appear that the distributions of approved versus
inactive drugs differ, with the pipeline drugs appearing more separable from the others. When
we remove the pipeline projects and re-run PCA in Figure 4-5, we can see more clearly that
the approved and inactive trials are well dispersed and not clearly separable. These plots
motivate the use of powerful classifiers that are effective in high dimensions.

4.2 Results

As described earlier in the methods section, we apply random forest classifiers to estimate
covariance using the function approximation approach. First, we explore a synthetic data set
with known parameters to see how the method performs. Then we look at how our classifiers
perform on the Informa data.


[Scatter plot: PC1 (9.25% of variance explained) vs. PC2; points colored by drug result: Pipeline, Inactive, Approved.]

Figure 4-4: PCA projection onto the first 2 principal components among pipeline, inactive,
and approved trials

[Scatter plot: PC1 (9.403% of variance explained) vs. PC2; points colored by drug result: Inactive, Approved.]

Figure 4-5: PCA projection onto the first 2 principal components among just inactive and
approved trials

4.2.1 Generative Process

To validate our approach, we construct a dataset from a known distribution to see how our
method performs. This approach is inspired by Lo et al. [15], who construct a correlation
matrix Σ and generate a set of correlated clinical trials Y. To do this, they generate a column
vector ε of i.i.d. standard normal random variables. The vector is of length n, representing
the number of trials they aim to generate. Σ contains the pairwise correlations between each
of the n projects. Taking the matrix square root (Cholesky decomposition) of Σ, we get
Z = Σ^{1/2} ε. They generate the labels Y by comparing the values of Z to a vector of
thresholds a: if Z_i > a_i, then Y_i = 1, and zero otherwise. This process generates a
correlated vector of Bernoulli trials.

In our approach, we similarly draw a vector of latent variables Z ~ N(μ, Σ), where μ is a
vector of means and Σ is a valid covariance matrix. With this latent variable Z, we set the
label vector Y by setting Y_i = 1 if Z_i < θ, where θ is a specified threshold. We draw a
length-d feature vector for trial i based on the parameter Z_i. We tested the Poisson
distribution with Z_i + 10 as the rate parameter (since the rate parameter must be positive),
the multivariate Gaussian distribution with a mean vector of all Z_i and the identity as the
covariance matrix, and the binomial distribution with the sigmoid function applied to Z_i
(since the p parameter of a binomial must be between 0 and 1). Once we have generated the
data X, Y, we can create paired samples, taking all unique combinations of distinct trials.
This gives us a data set X_pair, Y_pair where each row of X_pair is a concatenation of rows of X
and Y_pair is equal to the product of the two associated elements of Y. This makes Y_pair equal
to 1 when both trials are approved and 0 otherwise.
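A minimal sketch of this generative process, using the Poisson feature variant and
illustrative parameter values:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, theta = 50, 8, 0.0
Sigma = np.full((n, n), 0.5) + 0.5 * np.eye(n)    # 1.0 diagonal, 0.5 off-diagonal
mu = np.zeros(n)

Z = rng.multivariate_normal(mu, Sigma)            # correlated latent Gaussians
Y = (Z < theta).astype(int)                       # correlated Bernoulli outcomes

# Poisson features with rate Z_i + 10, clipped to keep the rate positive.
rates = np.clip(Z + 10.0, 1e-6, None)
X = rng.poisson(lam=rates[:, None], size=(n, d))

# All unique pairs of distinct trials: (n^2 - n)/2 rows.
pairs = list(combinations(range(n), 2))
X_pair = np.array([np.concatenate([X[i], X[j]]) for i, j in pairs])
Y_pair = np.array([Y[i] * Y[j] for i, j in pairs])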

Using this process, we have created correlated trial outcomes and features. We run through a
series of epochs. In each epoch, we train random forest classifiers for both the single trial
data set X, Y and the paired trial data set X_pair, Y_pair. After training, we generate a test
set of data just as before, using the same latent vector Z. After creating the test pair trial
data, we use our trained random forest models to predict the probability of each paired trial
and subtract from it the product of the estimated probabilities of each trial on its own. The
resulting vector is of length (n² − n)/2. We then repeat for the next epoch. The latent
variables are the same for all the epochs, so we take the average over T epochs and examine
the distribution of values.

Since we generated this data from a known distribution, we should be able to compute
the ground truth covariance values. Without loss of generality, we are interested in finding
Cov(Y_1, Y_2):

Cov(Y_1 = 1, Y_2 = 1) = Cov(Z_1 < θ, Z_2 < θ)

by the definition of our data generation process. We can expand this expression using the
definition of covariance,

Cov(Z_1 < θ, Z_2 < θ) = Pr(Z_1 < θ, Z_2 < θ) − Pr(Z_1 < θ) · Pr(Z_2 < θ)

= ∫_{−∞}^{θ} ∫_{−∞}^{θ} f(z_1, z_2) dz_1 dz_2 − ∫_{−∞}^{θ} f(z_1) dz_1 · ∫_{−∞}^{θ} f(z_2) dz_2

where f(z_1, z_2) is the bivariate Gaussian probability density function parameterized by μ, Σ
and f(z_i) is the corresponding univariate probability density function.

We tested this approach on several different covariance matrices. One simple example
is the case where the covariance matrix is 1.0 on the diagonal and 0.5 everywhere else.
If we set θ = 0, then each trial has a 0.5 probability of success on its own. Since all the
covariances between pairs are the same, we can consider any arbitrary pair. The probability
that both succeed is approximately 1/3 according to the bivariate Gaussian cumulative
distribution function. The correlation between the projects leads to a higher chance that
both will succeed or both will fail, taking probability mass away from the cases where just
one succeeds. The covariance between outcomes Y_1, Y_2 is then 1/3 − 1/4 = 1/12. Given this
ground truth, we can look at how the random forest performs in Figure 4-6. The shape of the
distribution produced is rather surprising: while the ground truth value is positive, a
significant portion of the samples produce negative covariance estimates.

One concern could be about the calibration of our models. As was discussed in the
methods section, it is possible that random forest probability estimates could be biased.
When we perform calibration on our random forest models, we see that it is indeed useful,
as shown in Figures 4-7 and 4-8. It appears that both the isotonic and sigmoid


Figure 4-6: The histogram of estimated covariances by the random forest models averaged
over the epochs.

calibration techniques help the classifier to perform better. We can examine the fit of our
models in Tables 4.2 and 4.3. Despite this improved fit, we still get an estimated value that
is closer to 0 than it is to the ground truth, as shown in Figure 4-9. This motivates future
work to better calibrate the random forest model and to benchmark performance against
other classification approaches besides logistic regression.

Metric             Logistic   RF      Isotonic Calibrated RF
Brier Score        0.060      0.079   0.050
Precision Score    0.937      0.926   0.974
Recall Score       0.914      0.922   0.906
F1 Score           0.926      0.924   0.939

Table 4.2: Performance metrics for single trial probability prediction

We were able to run this analysis for several other covariance matrices, feature dis-
tributions, and parameter settings. One of the main takeaways from this analysis is the
importance of random forest probability calibration. In nearly all cases, calibrated models
performed better, particularly according to the Brier score, which measures the mean squared
difference between the probability estimates and the true labels. These results serve as a useful
reference to compare against when we examine the Informa data.

[Calibration curves with Brier scores: Logistic (0.060), RF (0.079), RF + Isotonic (0.050), RF + Sigmoid (0.049), with histograms of the mean predicted values.]

Figure 4-7: Calibration plots of the single trial probability of success estimates.

4.2.2 Empirical Results

Next we run our random forest models on the Informa data to see if we can identify interesting
patterns. Our approach is similar to the generative process. We take the set of trials and
the labels, splitting them into a training set and test set; typically our test set was 30%
of the entries. We then fit the single trial probability estimator. Next, we randomly select
half of the entries without replacement; these become the first trial in our paired trials data
set, and each of the remaining rows is paired with one of the sampled rows. We take the
product of the two rows' labels as the label for the pair. We then can fit the pair probability
estimator. We use the held-out test set to measure the fit of both estimators. With these
function approximators, we calculate the difference between the paired probability estimate
and the product of the single probability estimates, reporting it as the estimated covariance.

[Calibration curves with Brier scores: Logistic (0.096), RF (0.074), RF + Isotonic (0.058), RF + Sigmoid (0.054), with histograms of the mean predicted values.]

Figure 4-8: Calibration plot for the paired trial probability estimate.

For this last step we took two different approaches. Our first approach was to treat
each drug, indication, trial combination on its own. We would use the features associated
with this tuple to predict probabilities of success. The second approach involved taking the
average prediction over all the trials associated with a drug-indication pair according to the
following formula:
P(Y_i = 1 | X_i, T_{i,1}, ..., T_{i,N}) = (1/N) Σ_{j=1}^{N} RF(X_i, T_{i,j})

where Y_i is the outcome of project i with drug features X_i, and there are N associated trials,
each with a trial feature vector T_{i,j}. We take the same approach with the pair probability
estimates, averaging over all combinations of trials between the pair of projects. In the
second method we compute the covariance between drug-indication pairs instead of drug-
indication-trial combinations. We believe that the second approach is more informative
because the number of trials for approved drugs is typically higher, which would otherwise
lead to an imbalance.
Metric             Logistic   RF      Isotonic Calibrated RF
Brier Score        0.096      0.074   0.058
Precision Score    0.746      0.932   0.945
Recall Score       0.664      0.711   0.679
F1 Score           0.703      0.806   0.791

Table 4.3: Performance metrics for paired trial probability of success estimates


Figure 4-9: Histogram of the covariance estimates from the isotonically calibrated RF pre-
dictions.

Figure 4-10 shows the distributions from both of these approaches. To show a proof of concept,
we looked specifically at the covariance among trials of a particular therapy area across time
periods. We were careful to choose time periods with roughly equal numbers of drug-indication
pairs.
We generated results for other therapeutic areas besides oncology, such as the cardiology
and metabolic therapeutic areas. Figure 4-11 shows these distributions.
Next we wanted to benchmark the random forest approach against other models. We
compared the performance of three different classification algorithms: random forests,
XGBoost, and logistic regression. Logistic regression serves as the standard baseline for
classification problems, while XGBoost (a gradient boosted tree-based classifier) serves as
a reference point for other tree-based methods that have shown good performance on other
empirical datasets. The results are shown in Figure 4-12.
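A hedged sketch of such a benchmark, run on stand-in data rather than our Informa features
(the xgboost package and all hyperparameters are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # assumes the xgboost package is installed

# Stand-in data; in our setting these would be the (paired) trial features/labels.
X, y = make_classification(n_samples=4000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "xgboost": XGBClassifier(n_estimators=300),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    print(name, brier_score_loss(y_test, p))   # lower Brier score is better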

[Six histogram panels of Random Forest covariance estimates for cancer trials, P2APP: 2001-2005 vs. 2006-2010, 2006-2010 vs. 2011-2015, and 2011-2015 vs. 2016-2019; left column by drug-indication pair, right column by trial.]

Figure 4-10: Distribution of oncology trial covariance estimates across time periods. The
left side compares the covariance between drug indication pairs. The right side compares
covariance between specific trials.
[Six histogram panels of Random Forest covariance estimates, P2APP, for the metabolic (left) and cardiovascular (right) therapy areas: 2001-2005 vs. 2006-2010, 2006-2010 vs. 2011-2015, and 2011-2015 vs. 2016-2019.]

Figure 4-11: Distribution of covariance estimates for metabolic and cardiology therapy areas
across periods.

[Histogram panels comparing Random Forest and XGBoost covariance estimates (Cancer, 2001-2005 vs. 2006-2010 and 2011-2015 vs. 2016-2019) and logistic regression variance estimates (Cancer, 2006-2010 and 2016-2019).]

Figure 4-12: Comparing covariance and variance distributions between random forest, XG-
Boost, and logistic regression.
Chapter 5

Discussion

The ability to estimate single and joint probabilities of success for a one-time event as a
function of its related features is a general problem that could be useful in many fields. As
we described earlier, the challenges in this task relate to the methods, the data, and the
validation and interpretability of the results.
The function approximation method for computing covariances, which decomposes the
covariance into a paired success probability estimate and single success probability estimates,
was introduced in the methods section. We explored random forest classifiers because they have been shown
to have good empirical performance. We wanted a function approximator that had the
power to model non-linear relationships in high-dimensional categorical data. Exploring the
theoretical results on random forests made us more optimistic about their suitability. We
attempted to validate this approach by generating data from known distributions to see if
we could recover the expected parameters. Unfortunately, we were not able to very closely
approximate the value we were expecting. This could be for a variety of reasons. One of
the most plausible is the lack of good probability calibration in the random forest model.
We studied the role of calibration and found that calibrated random forest models were
able to consistently perform better. It remains possible that even the calibrated random
forest models are not accurate enough in their probability estimates to identify our effect of
interest. In future work it would be helpful to design other synthetic datasets with other
properties. In our case the expected effect was very small. We were trying to estimate a
value in the hundredths decimal place. This level of precision might be asking too much

of our model. Future work could explore other models such as neural network models to
estimate probability instead of random forests.
For our empirical results, one key limitation was the data available. Clinical trials are
complicated, but our features were likely not rich enough to capture all the important aspects
of a clinical study. In addition, the data often had missing values. Imputing this data is
non-trivial because of the complicated dependencies that likely exist between the attributes
of a project. Expanding the dataset and developing a principled way to deal
with missing values would help support the task of covariance estimation. There are many
interesting questions about how a particular attribute of a clinical trial affects success prob-
abilities. The classification of the sponsor as a biotechnology or pharmaceutical company
is one such distinction that is not currently being used in our model. Creating principled
categories of companies is non-trivial, though appears to be useful for understanding the risk
associated with a clinical trial. This was found in other research work done in the LFE.
Once we generated estimates of covariance it was challenging to validate them. We saw
many interesting dynamics in our results such as a bi-modal distribution appearing between
oncology trials in different periods. Explaining why this distribution was bi-modal was not
straightforward, particularly when the random forest probability estimates are difficult to
interpret. Analyzing the sensitivity of clinical trial probabilities of success to individual
features would be helpful for understanding/validating covariance estimates.
This project on characterizing correlations in clinical trials was very open-ended, creat-
ing an exciting opportunity to investigate many aspects of clinical trials and many possible
models for understanding covariance. With no precedent methodology for estimating co-
variance in a principled way, I hope that this work has provided a useful foundation for
future progress in this field. The level of unmet need in medicine is enormous. With this
in mind, it was energizing to work on this project, hoping that someday advancements in
statistical methods and machine learning could contribute to better human health.

Chapter 6

Contributions

* Provides background on FDA trials and financial theory that motivate the need for
  understanding of correlations in drug development projects.

* Overviews the current approaches to fitting a covariance matrix and outlines other
  promising approaches like online sequence prediction and function approximation for
  a decomposition of covariance.

* Analyzes the features of drugs and clinical trials, describing a large data set of trials,
  and visualizing their variance.

* Implements the function approximation method using random forest classifiers.

* Evaluates the function approximation approach using random forests on both synthetic
  and real clinical trial data.

* Highlights future directions for progress in characterizing the covariance between drug
  development projects.

Bibliography

[1] Avrim L Blum. On-Line Algorithms in Machine Learning. Dagstuhl Workshop on
On-Line Algorithms. Technical report, 1997.
[2] Leo Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

[3] Leo Breiman and Adele Cutler. Random forests. Technical report, University of Cali-
fornia at Berkeley.
[4] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
[5] David E. Fagnan, Austin A. Gromatzky, Roger M. Stein, Jose-Maria Fernandez, and
Andrew W. Lo. Financing drug discovery for orphan diseases. Drug Discovery Today,
19(5):533-538, may 2014.
[6] David E Fagnan, Jose Maria Fernandez, Andrew W Lo, and Roger M Stein. Can
Financial Engineering Cure Cancer? American Economic Review, 103:406-411, 2013.

[7] Khaled Fawagreh, Mohamed Medhat Gaber, and Eyad Elyan. Random forests: from
early developments to recent advancements. Systems Science & Control Engineering,
2(1):602-609, dec 2014.
[8] Jose-Maria Fernandez, Roger M Stein, and Andrew W Lo. Commercializing biomedical
research through securitization techniques. Nature Biotechnology, 30(10):964-975, oct
2012.

[9] Gary Gorton and Andrew Metrick. Securitization. Technical report, 2012.

[10] Nicholas J Higham. Computing the nearest correlation matrix-a problem from finance.
Technical report, 2002.
[11] Alexander Korotin, Vladimir V'yugin, and Evgeny Burnaev. Long-Term Online
Smoothing Prediction Using Expert Advice. nov 2017.
[12] Allen Krantz. Diversification of the drug discovery process. Nature Biotechnology,
16(13):1294-1294, dec 1998.
[13] Jae Won Lee, Jung Bok Lee, Mira Park, and Seuck Heun Song. An extensive comparison
of recent classification tools applied to microarray data. Computational Statistics and
Data Analysis, 48(4):869-885, 2005.

69
[14] Benjamin Letham, Cynthia Rudin, and David Madigan. Sequential event prediction.
Machine Learning, 93(2-3):357-380, nov 2013.

[15] A. W. Lo, C. Ho, J. Cummings, and K. S. Kosik. Parallel Discovery of Alzheimer's
Therapeutics. Science Translational Medicine, 6(241):241cm5-241cm5, jun 2014.

[16] Thomas M. Cover. Behavior of sequential predictors of binary sequences. page 21, 09
1966.
[17] Harry Markowitz. Portfolio Selection. The Journal of Finance, 7(1):77-91, 1952.

[18] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with su-
pervised learning. pages 625-632, 2006.

[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825-2830, 2011.

[20] Roger A. Pielke. Mesoscale meteorological modeling. Academic Press, 2002.


[21] Houduo Qi and Defeng Sun. A Quadratically Convergent Newton Method for Comput-
ing the Nearest Correlation Matrix. SIAM Journal on Matrix Analysis and Applications,
28(2):360-385, jan 2006.

[22] Alexander Rakhlin and Arthur Flajolet. 6.883: Online Methods in Machine Learning.
Technical report, MIT, 2016.

[23] Alexander Rakhlin and Karthik Sridharan. Statistical Learning and Sequential Predic-
tion. Technical report, 2014.

[24] Sebastian Raschka and Vahid Mirajalili. Python machine learning : machine learning
and deep learning with Python, scikit-learn, and TensorFlow.

[25] Max Roser. Life expectancy. Our World in Data, 2019. https://ourworldindata.org/life-
expectancy.

[26] Jack W. Scannell, Alex Blanckley, Helen Boldon, and Brian Warrington. Diagnosing the
decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3):191-
200, mar 2012.

[27] Aylin Sertkaya, Anna Birkenbach, Ayesha Berlind, and John Eyraud. Examination of
Clinical Trial Costs and Barriers for Drug Development. Technical report, U.S. Depart-
ment of Health and Human Services - Office of the Assistant Secretary for Planning
and Evaluation, Washington, D.C., 2014.

[28] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD
thesis, Hebrew University, 2007.
[29] US Food and Drug Administration. New Drug Application (NDA), 2016.

70
[30] US Food and Drug Administration. Development & Approval Process (Drugs), 2018.

[31] U.S. Food and Drug Administration. Step 3: Clinical Research, 2018.

[32] M.J. Weinberger and E. Ordentlich. On delayed prediction of individual sequences.


IEEE Transactions on Information Theory, 48(7):1959-1976, jul 2002.

[33] Christopher M Wittich, Christopher M Burkle, and William L Lanier. Ten com-
mon questions (and their answers) about off-label drug use. Mayo Clinic proceedings,
87(10):982-90, oct 2012.

[34] Chi Heem Wong, Kien Wei Siah, and Andrew W Lo. Estimation of clinical trial success
rates and related parameters. Biostatistics, 00:1-14, 2018.

[35] Baolin Wu, Tom Abbott, David Fishman, Walter McMurray, Gil Mor, Kathryn Stone,
David Ward, Kenneth Williams, and Hongyu Zhao. Comparison of statistical meth-
ods for classification of ovarian cancer using mass spectrometry data. Bioinformatics,
19(13):1636-1643, sep 2003.

[36] Han Xiao and Claudia Eckert. Efficient Online Sequence Prediction with Side Informa-
tion. In 2013 IEEE 13th International Conference on Data Mining, pages 1235-1240.
IEEE, dec 2013.
