You are on page 1of 44

PRECISION MEDICINE FOR

CANCER TREATMENT
CA PS TO N E PR O J E CT
AGENDA
BACKGROUND
DATA
VARIABLE SELECTION
MODELLING
VALIDATION
INTERPRETATION OF RESULTS
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

CLIENT INTRODUCTION
• The Center for Translational Data Science is a research Research Areas:
center pioneering translational data science to advance
biology, medicine, and environmental research.   Data Commons and Cloud

• GDC a tool started by CDIS became a core component Complex Systems


of the National Cancer Moonshot headed by Joe Biden. High performance analysis
• GDC went live in 2016. with approximately 4.1 Bio-informatics
petabytes of data from National Cancer Institute-
Pipelines for large scale data analysis
supported research programs, including some of the
largest and most comprehensive cancer genomics Open source data management
datasets in the world -- such as TCGA and TARGET --
and more than 14,000 anonymized patient cases.
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

PROBLEM STATEMENT

Why? What? How?

Why are we doing this? What is the solution? How are we solving it?

• Patients diagnosed with • This is solvable by the • Using Breast cancer data,
cancer receive the same concept of precision we will prove that “Different
treatment based on type. medicine, an approach to responses of patients to a
Yet, response to the same patient care that allows particular treatment is driven
treatment varies across doctors to select treatments by the difference in the
patients, delaying patient that are most likely to help levels of their gene
recovery patients based on a genetic expressions”
understanding of their disease
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

RESEARCH PURPOSE
•To derive an algorithm that can predict the best treatment for a breast cancer patient based on our
dataset
•The model will use progression free interval as a treatment outcome measure and evaluate
treatments accordingly
Our three main research objectives:

1|• Identify molecular and 2|• Build a treatment 3|• Identify the interactions
clinical features of high outcome prediction between genes and
importance in model using the drugs affecting the
predicting the selected molecular progression free
progression free features, clinical survival (treatment
survival of a breast features, and treatment outcome) of the
cancer patient history patients
DATA
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

DATA
Publicly available TCGA (The Cancer Genome Atlas) genomic and clinical data
comprising:
• mRNA (messenger RNA) sequence data
• miRNA (micro RNA) sequence data
• DNA methylation data
• Clinical data
WHAT IS THE METHYLATION,
MRNA & MIRNA DATA ABOUT?
RNAs copying gene miRNAs silencing additional
Genes present in all information to create genes as per requirement for
cells of the body proteins protein to be made

BREAST CELL
NUCLEUS
RIBOSOMES
X
Protein synthesis
Methylation: RNAs taking
Genes silenced relevant genetic
within nucleus information outside
(not available for nucleus
copying by X

RNA)

TRANSCRIPTION TRANSLATION
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

CLINICAL DATA
Consists of demographic, diagnostic, treatment and survival indicators
Demographic and Diagnostic features
• Race
• Tumor stage
• Menopause status
• Lymph node count
Treatment features
• Radiation therapy flag
• Drug counts
Survival indicators
• Progression Free Interval: Period since treatment began during which the
disease hasn’t worsened
• Overall Survival time: Period from the date of diagnosis until the date of
death/date of last follow up (depending on vital status)
VARIABLE SELECTION
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

SELECTING FEATURES
RELEVANT TO BREAST CANCER
Summarizing the genomic features is paramount given their high dimensionality
~ 100,000 genomic features

Differential Expression
Analysis

Recursive Feature
Elimination

10 genomic features
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

1. DIFFERENTIAL EXPRESSION
ANALYSIS
• Every cell nucleus contains the
complete genome, however only a small
percentage of the genome is expressed
in each cell
• DEA is carried out on normalized
counts to identify genes that are
differentially expressed in breast cancer
samples against normal samples using
the DESeq2 library in R
Normal breast cell vs cancerous breast cell
• Large set of genes deemed differentially
expressed, warranting further reduction
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

2. RECURSIVE FEATURE
ELIMINATION
• RFE employed to eliminate redundancies and avoid overfitting
• Random Forest model fit to variables selected in previous step
• Determine which feature is the least important in making predictions on the testing
dataset and drop this feature
• Select feature set that yields highest classification accuracy
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

FINAL FEATURE SET


• mRNA genes: ENSG00000102796, ENSG00000198832,
1 ENSG00000100906, ENSG00000134917, ENSG00000103269
• miRNA gene: hsa.mir.146a

~
Dependent
variable: 1 • Methylation CpG sites: cg05337753, cg02282892
PFI time*
• Clinical: tumor_stage_1, tumor_stage_3
*PFI time preferred over 1 • Treatment: Doxorubicin, Paclitaxel, Trastuzumab, Epirubicin,
Overall Survival time as Tamoxifen, Docetaxel, Letrozole, Carboplatin, Cyclophosphamide,
some deaths in data might Anastrozole, Fluorouracil, Exemestane, Methotrexate
occur due to non-cancer
causes
MODELING
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF

PFI TIME BINNED INTO


RESULTS

CLASSES AND MODELED AS


DEPENDENT VARIABLE
Considering the small sample size, equal frequency binning done for PFI to keep the analysis powerful by
focusing on the causes of larger variation in the PFI time as compared to smaller changes (days of few weeks)
Considering two options:
A Two bin classification (n=180 each) B Three bin classification (n=120 each)
1 2 1 2 3
2.1 yrs. 1.5 yrs. 4 yrs.
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

PREDICTION OF TWO PFI TIME


BUCKETS
Dependent variable classes: Model evaluation:
1. PFI < 2.1 years (0) Cross-validated Accuracy
Model  False Positive Rate
(Sensitivity)
2. PFI > 2.1 years (1)
Logistic regression (Lasso) 71% 14%

Random Forest 70% 15%


Metrics for model evaluation:
SVM (Polynomial) 72% 11%
1. Cross-Validated Accuracy SVM (Radial) 73% 14%
(Sensitivity): High variation in
sensitivity of the small test sample Results from four different binary classification models

with varying test-train splits, and


hence not trustworthy Polynomial SVM is the best model for predicting PFI time
2. False positive rate: We want to with two buckets, considering its performance on both the
control the prediction of a higher PFI Accuracy and False Positive Rate metrics
for a patient compared to the actual
Improvement:
PFI We can explore the prediction of a more granular bucketing to increase the amount of variation being
predicted, while keeping the model performance under consideration
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF

PFI TIME BINNED INTO


RESULTS

CLASSES AND MODELED AS


DEPENDENT VARIABLE
Considering the small sample size, equal frequency binning done for PFI to keep the analysis powerful by
focusing on the causes of larger variation in the PFI time as compared to smaller changes (days of few weeks)
Considering two options:
A Two bin classification (n=180 each) B Three bin classification (n=120 each)
1 2 1 2 3
2.1 yrs. 1.5 yrs. 4 yrs.
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

PREDICTION OF THREE PFI


TIME BUCKETS
Dependent variable classes: Model evaluation:
1. PFI < 1.5 years (1) Cross-validated Accuracy
Model  False Positive Rate
(Sensitivity)
2. PFI <= 4 years (2)
Random Forest 57% 21%
3. PFI > 4 years (3) SVM (Polynomial) 54% 22%

SVM (Radial) 56% 23%

Metrics for model evaluation:


Results from three different multi-classification models
1. Cross-Validated Accuracy
(Sensitivity): High variation in
sensitivity of the small test sample Random Forest is the best model for predicting PFI time with
with varying test-train splits, and three buckets, considering its performance on both the
hence not trustworthy Accuracy and False Positive Rate metrics
2. False positive rate: We want to
control the prediction of a higher PFI
for a patient compared to the actual
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

MODEL SELECTION
• We selected the three class PFI prediction model because it provides the ability to
differentiate the survival classes better than the binary classification
• Random Forest is better suited for multi-class classification and is not contingent
on centering and scaling of the test data for accurate predictions. Hence, we have
selected the Random Forest model with genes selected from survival analysis and
recursive feature elimination, as our final model

• With an accuracy of 57%, our model is 24 percentage points more accurate compared to
the 33% random chance of prediction of one of the three classes
• Since a tree based model gave us the best performance, our data has multiple inherent
interactions within its structure which we will explore in the further slides
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

COMPARISON OF FEATURES –
GENES VS. DRUGS
Why compare the two? Accuracy of models:
Drug data will have some genetic Model 
Cross-validated Accuracy
information encoded because of (Sensitivity)
prescription decision based on Drugs only 48%
clinical attributes of the patient. Genes + Clinical 49%
Clinical attributes are in turn driven
Gene PCA + Clinical +
by genetic modifications Drugs
52%

Genes + Clinical + Drugs 57%

Other comparison:
• Using only genes + clinical vs. only the drug data
Performance of genes selected after gave similar accuracies (~48%), which is lower
feature selection (DEA, survival than both the variables modeled together (57%)
model, and RFE) vs. the top 10 • Modeling the genes selected using DEA, survival
Principal Components from genetic models, and RFE, gave better accuracy as
data compared to the top 10 PCs from genetic data
MODEL VALIDATION
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

MODEL VALIDATION – WHY


DO WE NEED IT?
Situation – Sample & Bias Solution – Permutation Testing
• Model based on 360 samples (only) • Permute the outcome variable (PFI) to
create multiple datasets with random
• Drugs are not prescribed randomly but assignment of PFI buckets to the feature
based on certain features of the patients sets
• Sampling is retrospective and not • Fit the finalized model to these datasets
completely random. May lead to biased and record the cross-validation accuracies
and confounded results
• Plot a histogram of all the accuracies
obtained above and measure the p-value
of the accuracy of our original model
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

MODEL VALIDATION –
RESULTS
Histogram of the cross-validated accuracies The highly significant p-value of our model’s
of models trained with 800 permuted samples accuracy shows that the model result and
hence the dependence of patient PFI on the
selected features is not by random chance

56.8%
No. of models 

Accuracy of our
model

<2.2e-16 p-value of this


accuracy in the
distribution

0.30---- ------------------0.35-------------- --------0.40------------ ----------0.45---------------- ------0.50----------------------0.55 0.568


Accuracy 
QUALITATIVE FINDINGS
INTERACTION AMONG
DRUGS AND GENES
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF

IDENTIFYING INTERACTIONS
RESULTS

AND QUANTIFYING THEIR


EFFECTS ON PFI
Identify by fitting a simple classification tree model
TIME
Quantify through the significance and
coefficients of the interaction terms in an
ordered logistic regression model
<1.5 yrs Interaction Coeff.
1.5 – 4 yrs
>4 yrs
Doxorubicin : tumor_Stage.I1 23.39
tumor_Stage.I1

Doxorubicin : tumor_Stage.I1 : RHBDL1 -10.57

Doxorubicin : tumor_Stage.I1 : DHRS12 20.53

tumor_Stage.I1: RHBDL1 -9.95

Doxorubicin : DHRS12 -0.51

Doxorubicin : DHRS12 : cg05337753 -0.36

DHRS12 : cg05337753 -0.30


VISUALIZING EFFECT OF
INTERACTING FEATURES ON
PROBABILITY OF THE 3 PFI
BUCKETS
Illustration Tumor Stage > 1 Tumor Stage = 1

1 < 1.5 years


Gene 2 (High)

2 1.5 – 4 years

3 > 4 years

3 Lighter the area


Gene 2 (Low)

within the box,


2 higher the
probability of a
1
better Progression
Free Interval
Gene 1 Expression  Gene 1 Expression 
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF

EFFECT OF THE TUMOR


RESULTS

STAGE*RHBDL1*DHRS12
INTERACTIONS ON PFI TIME
Stacked charts of probability of patients belonging to
each of the survival buckets given the combination of:
Tumor Stage > 1 Tumor Stage = 1
• Tumor Stage 1 flag (left >1, right =1)
DHRS12 = High
• RHBDL1 expression (increasing left to right)
• DHRS12 expression (increasing bottom to top)

DHRS12 expression levels 


Insights:
DHRS12 = Med
• Tumor Stage 1 patients have a better PFI compared to
patients at higher stages
• Increasing expression of RHBDL1 gene implies
DHRS12 = Low
reduced PFI time
(tumor_Stage.I1: RHBDL1 coef. = -9.95)

• Increasing expression of DHRS12 gene implies


increased PFI time for Tumor Stage 1 patients
(tumor_Stage.I1: DHRS12 coef. = 17.88) RHBDL1 expression levels  RHBDL1 expression levels 
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF

EFFECT OF THE TUMOR


RESULTS

STAGE*RHBDL1*DHRS12
INTERACTIONS ON PFI TIME
Stacked charts of probability of patients belonging to
each of the survival buckets given the combination of:
Tumor Stage > 1 Tumor Stage = 1
• Tumor Stage 1 flag (left >1, right =1)
DHRS12 = High
• RHBDL1 expression (increasing left to right)
• DHRS12 expression (increasing bottom to top)

DHRS12 expression levels 


Insights:
DHRS12 = Med
• Tumor Stage 1 patients have a better PFI compared to
patients at higher stages
• Increasing expression of RHBDL1 gene implies
DHRS12 = Low
reduced PFI time
(tumor_Stage.I1: RHBDL1 coef. = -9.95)

• Increasing expression of DHRS12 gene implies


increased PFI time for Tumor Stage 1 patients
(tumor_Stage.I1: DHRS12 coef. = 17.88) RHBDL1 expression levels  RHBDL1 expression levels 
STUDYING EFFECTIVENESS OF BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

DRUGS ON TUMORS DUE TO


DIFFERENT GENE EXPRESSION
WeVARIATIONS
compared the effectiveness of Trastuzumab and Doxorubicin in terms of PFI on patients
with different genetic variations, based on the observation that they are
 the most effective, and
 prescribed to a good proportion of patients in our sample

Illustration: Using gene-drug interaction for prescribing


treatment

Survival time
PREP TO UNDERSTAND NEXT
THREE SLIDES
Illustration Doxorubicin (Low) Doxorubicin (High)
Trastuzumab (High)

Trastuzumab Both Lighter the area


within the box,
higher the
probability of a
better Progression
Trastuzumab (Low)

Free Interval

Neither Doxorubici
n
Increasing expression levels  Increasing expression levels 
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

EFFECT OF GENE-DRUG
INTERACTIONS ON PFI TIME
Trastuzumab is more effective in improving vs Doxorubicin works better for high
PFI of patients with high RHBDL1 . SELENOM expression
expression

Doxorubicin (Low) Doxorubicin (High) Doxorubicin (Low) Doxorubicin (High)


Trastuzuma

Trastuzuma
b (High)

b (High)
Trastuzuma

Trastuzuma
b (Low)

b (Low)
RHBDL1 expression levels  RHBDL1 expression levels  SELENOM expression levels SELENOM expression levels 

BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

EFFECT OF GENE-DRUG
INTERACTIONS ON PFI TIME
Lower methylation levels of cg02282892 also indicate lower PFI and
Doxorubicin works better on such patients as compared to
Trastuzumab

Doxorubicin (Low) Doxorubicin (High)


Trastuzuma
b (High)
Trastuzuma
b (Low)
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

EFFECT OF GENE-DRUG
INTERACTIONS ON PFI TIME
Both the drugs show similar efficacy for NFKBIA and methylation cg05337753 variations

Doxorubicin (Low) Doxorubicin (High) Doxorubicin (Low) Doxorubicin (High)


Trastuzuma

Trastuzuma
b (High)

b (High)
Trastuzuma

Trastuzuma
b (Low)

b (Low)
NFKBIA expression levels  cg05337753 expression levels 
EXAMPLE OF APPLICATION AREA
FOR THIS RESEARCH
Genomic information has already helped to shape
the development and use of some of the newest
cancer treatments
For example,
 Trastuzumab works only for women whose
tumors have a particular genetic profile called
HER-2 positive (subtype)
 the Drug imatinib was designed to inhibit an
altered enzyme produced by a fused version
of two genes found in chronic myelogenous
leukemia

Source: https://cancergenome.nih.gov/cancergenomics/impact
LIMITATIONS & FUTURE STEPS

LIMITATIONS OF THE STUDY


Sample &
• Small sample size and aggressive feature selection
Dimensionalit • Limited to US geography
y
• Only chemotherapies considered as they were the most commonly prescribed
• Dosage information not available
Drug data • Order of use of therapies not taken into account
• Drugs modeled individually but some were prescribed in combination

• Data not sufficient for test-train validation


• Drug assignment bias still exists even after permutation testing. The right
Modeling way here would be to design a clinical trail with randomized drug assignment
but that is difficult to accomplish
LIMITATIONS & FUTURE STEPS

FUTURE STEPS
Data • Gather data from multiple data sources to try and increase the sample size

• Perform modeling for other cancer types and with other molecular datasets
Modeling (protein, mutation) to validate the same hypothesis

• Work in tandem with cancer genetics experts to validate the outputs of gene
Interpretations selection algorithms, modeling of drug combinations, and interpreting
& model interactions results
• Understand the level of genetic information already accounted for by clinical
evaluation variables to avoid confounding effects of certain genes
QUESTIONS
APPENDIX
PRINCIPAL COMPONENT
ANALYSIS
• Top 10 components explained 40% of the
overall variance
• Low interpretability

Variance Explained with the number of principal components Principal Components 1, 2 and 3 of the entire omics data with important variable loadings
PREDICTION OF SQ.ROOT(PFI
TIME)
• Given the small sample size of the final dataset (360), we used cross-validated train metrics to
measure the model performance. We were getting a high variance in the test performance results with
different test-train split seeds due to the small sample size of test data.

Results: Models Cross- validated R square

Linear Regression 28.4%


RF Regression 33.4%
SV Regression 27.2%

Conclusion : To test the capability of the combination of drugs history data and genetic/ clinical
features in predicting the PFI time better, we divided the PFI time into three buckets.
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

EFFECT OF GENE
INTERACTIONS ON SURVIVAL
Stacked charts of probability of patients belonging to each of cg--753 * DHRS12 * Cg---892
the survival buckets given the combination of:
• cg05337753 methylation (increasing bottom to top)
SELENOM SELENOM SELENOM
• DHRS12 expression (increasing within the box, left to right) (Low) (Med) (High)
• SELENOM expression (increasing left to right) cg--753 (High)

SELENOM expression levels 


Insight:
• cg05337753 and SELENOM interact in such a way
cg--753 (Med)
that the PFI time reduces at very low and very high
expression values
(explained by the negative coefficient for interaction)
cg--753 (Low)
• At higher cg05337753 methylation, increase in
SELENOM expression reduces PFI time.

DHRS12 expression levels 


BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

EFFECT OF GENE-GENE
INTERACTIONS ON PFI TIME
Stacked charts of probability of patients belonging to Doxoribicin * DHRS12 * Cg---892
2
each of the survival buckets given the combination of:
• Doxorubicin (increasing left to right)
Doxorubicin (Low) Doxorubicin (High)
• DHRS12 expression (increasing within the box, left to right)
High Cg---892
• cg02282892 methylation (increasing bottom to top)

Cg---892 expression levels 


Insights:
• Higher the methylation, better the survival
• PFS time reduces with increasing expression levels Low Cg---892
of DHRS12 gene at lower levels of cg02282892
methylation
(explained by the negative coefficient for interaction)

• This effect of DHRS12 gene on PFS time weakens


at higher cg02282892 methylation
DHRS12 expression levels 
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS

EFFECT OF GENE-GENE
INTERACTIONS ON PFI TIME
Stacked charts of probability of patients belonging to Doxoribicin * DHRS12 * Cg---892
2
each of the survival buckets given the combination of:
• Doxorubicin (increasing left to right)
Doxorubicin (Low) Doxorubicin (High)
• DHRS12 expression (increasing within the box, left to right)
High Cg---892
• cg02282892 methylation (increasing bottom to top)

Cg---892 expression levels 


Insights:
• Higher the methylation, better the survival
• PFS time reduces with increasing expression levels Low Cg---892
of DHRS12 gene at lower levels of cg02282892
methylation
(explained by the negative coefficient for interaction)

• This effect of DHRS12 gene on PFS time weakens


at higher cg02282892 methylation
DHRS12 expression levels 

You might also like