Complete Project Presentation

PRECISION MEDICINE FOR
CANCER TREATMENT
CA PS TO N E PR O J E CT
AGENDA
BACKGROUND
DATA
VARIABLE SELECTION
MODELLING
VALIDATION
INTERPRETATION OF RESULTS
BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS
CLIENT INTRODUCTION
• The Center for Translational Data Science is a research Research Areas:
center pioneering translational data science to advance
biology, medicine, and environmental research. Data Commons and Cloud
• GDC a tool started by CDIS became a core component Complex Systems

of the National Cancer Moonshot headed by Joe Biden. High performance analysis
• GDC went live in 2016. with approximately 4.1 Bio-informatics
petabytes of data from National Cancer Institute-
Pipelines for large scale data analysis
supported research programs, including some of the
largest and most comprehensive cancer genomics Open source data management
datasets in the world -- such as TCGA and TARGET --
and more than 14,000 anonymized patient cases.
RESULTS
PROBLEM STATEMENT
Why? What? How?
Why are we doing this? What is the solution? How are we solving it?
• Patients diagnosed with • This is solvable by the • Using Breast cancer data,
cancer receive the same concept of precision we will prove that “Different
treatment based on type. medicine, an approach to responses of patients to a
Yet, response to the same patient care that allows particular treatment is driven
treatment varies across doctors to select treatments by the difference in the
patients, delaying patient that are most likely to help levels of their gene
recovery patients based on a genetic expressions”
understanding of their disease
RESULTS
RESEARCH PURPOSE
•To derive an algorithm that can predict the best treatment for a breast cancer patient based on our
dataset
•The model will use progression free interval as a treatment outcome measure and evaluate
treatments accordingly
Our three main research objectives:
1|• Identify molecular and 2|• Build a treatment 3|• Identify the interactions
clinical features of high outcome prediction between genes and
importance in model using the drugs affecting the
predicting the selected molecular progression free
progression free features, clinical survival (treatment
survival of a breast features, and treatment outcome) of the
cancer patient history patients
DATA
RESULTS
DATA
Publicly available TCGA (The Cancer Genome Atlas) genomic and clinical data
comprising:
• mRNA (messenger RNA) sequence data
• miRNA (micro RNA) sequence data
• DNA methylation data
• Clinical data
WHAT IS THE METHYLATION,
MRNA & MIRNA DATA ABOUT?
RNAs copying gene miRNAs silencing additional
Genes present in all information to create genes as per requirement for
cells of the body proteins protein to be made
BREAST CELL
NUCLEUS
RIBOSOMES
X
Protein synthesis
Methylation: RNAs taking
Genes silenced relevant genetic
within nucleus information outside
(not available for nucleus
copying by X
RNA)
TRANSCRIPTION TRANSLATION
RESULTS
CLINICAL DATA
Consists of demographic, diagnostic, treatment and survival indicators
Demographic and Diagnostic features
• Race
• Tumor stage
• Menopause status
• Lymph node count
Treatment features
• Radiation therapy flag
• Drug counts
Survival indicators
• Progression Free Interval: Period since treatment began during which the
disease hasn’t worsened
• Overall Survival time: Period from the date of diagnosis until the date of
death/date of last follow up (depending on vital status)
VARIABLE SELECTION
RESULTS
SELECTING FEATURES
RELEVANT TO BREAST CANCER
Summarizing the genomic features is paramount given their high dimensionality
~ 100,000 genomic features
Differential Expression
Analysis
Recursive Feature
Elimination
10 genomic features
RESULTS
1. DIFFERENTIAL EXPRESSION
ANALYSIS
• Every cell nucleus contains the
complete genome, however only a small
percentage of the genome is expressed
in each cell
• DEA is carried out on normalized
counts to identify genes that are
differentially expressed in breast cancer
samples against normal samples using
the DESeq2 library in R
Normal breast cell vs cancerous breast cell
• Large set of genes deemed differentially
expressed, warranting further reduction
RESULTS
2. RECURSIVE FEATURE
ELIMINATION
• RFE employed to eliminate redundancies and avoid overfitting
• Random Forest model fit to variables selected in previous step
• Determine which feature is the least important in making predictions on the testing
dataset and drop this feature
• Select feature set that yields highest classification accuracy
RESULTS
FINAL FEATURE SET

• mRNA genes: ENSG00000102796, ENSG00000198832,
1 ENSG00000100906, ENSG00000134917, ENSG00000103269
• miRNA gene: hsa.mir.146a
~
Dependent
variable: 1 • Methylation CpG sites: cg05337753, cg02282892
PFI time*
• Clinical: tumor_stage_1, tumor_stage_3
*PFI time preferred over 1 • Treatment: Doxorubicin, Paclitaxel, Trastuzumab, Epirubicin,
Overall Survival time as Tamoxifen, Docetaxel, Letrozole, Carboplatin, Cyclophosphamide,
some deaths in data might Anastrozole, Fluorouracil, Exemestane, Methotrexate
occur due to non-cancer
causes
MODELING
PFI TIME BINNED INTO

RESULTS
CLASSES AND MODELED AS

DEPENDENT VARIABLE
Considering the small sample size, equal frequency binning done for PFI to keep the analysis powerful by
focusing on the causes of larger variation in the PFI time as compared to smaller changes (days of few weeks)
Considering two options:
A Two bin classification (n=180 each) B Three bin classification (n=120 each)
1 2 1 2 3
2.1 yrs. 1.5 yrs. 4 yrs.
RESULTS
PREDICTION OF TWO PFI TIME

BUCKETS
Dependent variable classes: Model evaluation:
1. PFI < 2.1 years (0) Cross-validated Accuracy
Model False Positive Rate
(Sensitivity)
2. PFI > 2.1 years (1)
Logistic regression (Lasso) 71% 14%
Random Forest 70% 15%

Metrics for model evaluation:
SVM (Polynomial) 72% 11%
1. Cross-Validated Accuracy SVM (Radial) 73% 14%
(Sensitivity): High variation in
sensitivity of the small test sample Results from four different binary classification models
with varying test-train splits, and

hence not trustworthy Polynomial SVM is the best model for predicting PFI time
2. False positive rate: We want to with two buckets, considering its performance on both the
control the prediction of a higher PFI Accuracy and False Positive Rate metrics
for a patient compared to the actual
Improvement:
PFI We can explore the prediction of a more granular bucketing to increase the amount of variation being
predicted, while keeping the model performance under consideration
PFI TIME BINNED INTO

RESULTS
CLASSES AND MODELED AS

DEPENDENT VARIABLE
Considering the small sample size, equal frequency binning done for PFI to keep the analysis powerful by
focusing on the causes of larger variation in the PFI time as compared to smaller changes (days of few weeks)
Considering two options:
A Two bin classification (n=180 each) B Three bin classification (n=120 each)
1 2 1 2 3
2.1 yrs. 1.5 yrs. 4 yrs.
RESULTS
PREDICTION OF THREE PFI

TIME BUCKETS
Dependent variable classes: Model evaluation:
1. PFI < 1.5 years (1) Cross-validated Accuracy
Model False Positive Rate
(Sensitivity)
2. PFI <= 4 years (2)
Random Forest 57% 21%
3. PFI > 4 years (3) SVM (Polynomial) 54% 22%
SVM (Radial) 56% 23%
Metrics for model evaluation:

Results from three different multi-classification models
1. Cross-Validated Accuracy
(Sensitivity): High variation in
sensitivity of the small test sample Random Forest is the best model for predicting PFI time with
with varying test-train splits, and three buckets, considering its performance on both the
hence not trustworthy Accuracy and False Positive Rate metrics
2. False positive rate: We want to
control the prediction of a higher PFI
for a patient compared to the actual
RESULTS
MODEL SELECTION
• We selected the three class PFI prediction model because it provides the ability to
differentiate the survival classes better than the binary classification
• Random Forest is better suited for multi-class classification and is not contingent
on centering and scaling of the test data for accurate predictions. Hence, we have
selected the Random Forest model with genes selected from survival analysis and
recursive feature elimination, as our final model
• With an accuracy of 57%, our model is 24 percentage points more accurate compared to
the 33% random chance of prediction of one of the three classes
• Since a tree based model gave us the best performance, our data has multiple inherent
interactions within its structure which we will explore in the further slides
RESULTS
COMPARISON OF FEATURES –
GENES VS. DRUGS
Why compare the two? Accuracy of models:
Drug data will have some genetic Model
Cross-validated Accuracy
information encoded because of (Sensitivity)
prescription decision based on Drugs only 48%
clinical attributes of the patient. Genes + Clinical 49%
Clinical attributes are in turn driven
Gene PCA + Clinical +
by genetic modifications Drugs
52%
Genes + Clinical + Drugs 57%
Other comparison:
• Using only genes + clinical vs. only the drug data
Performance of genes selected after gave similar accuracies (~48%), which is lower
feature selection (DEA, survival than both the variables modeled together (57%)
model, and RFE) vs. the top 10 • Modeling the genes selected using DEA, survival
Principal Components from genetic models, and RFE, gave better accuracy as
data compared to the top 10 PCs from genetic data
MODEL VALIDATION
RESULTS
MODEL VALIDATION – WHY

DO WE NEED IT?
Situation – Sample & Bias Solution – Permutation Testing
• Model based on 360 samples (only) • Permute the outcome variable (PFI) to
create multiple datasets with random
• Drugs are not prescribed randomly but assignment of PFI buckets to the feature
based on certain features of the patients sets
• Sampling is retrospective and not • Fit the finalized model to these datasets
completely random. May lead to biased and record the cross-validation accuracies
and confounded results
• Plot a histogram of all the accuracies
obtained above and measure the p-value
of the accuracy of our original model
RESULTS
MODEL VALIDATION –
RESULTS
Histogram of the cross-validated accuracies The highly significant p-value of our model’s
of models trained with 800 permuted samples accuracy shows that the model result and
hence the dependence of patient PFI on the
selected features is not by random chance
56.8%
No. of models 
Accuracy of our
model
<2.2e-16 p-value of this

accuracy in the
distribution
0.30---- ------------------0.35-------------- --------0.40------------ ----------0.45---------------- ------0.50----------------------0.55 0.568

Accuracy 
QUALITATIVE FINDINGS
INTERACTION AMONG
DRUGS AND GENES
IDENTIFYING INTERACTIONS
RESULTS
AND QUANTIFYING THEIR

EFFECTS ON PFI
Identify by fitting a simple classification tree model
TIME
Quantify through the significance and
coefficients of the interaction terms in an
ordered logistic regression model
<1.5 yrs Interaction Coeff.
1.5 – 4 yrs
>4 yrs
Doxorubicin : tumor_Stage.I1 23.39
tumor_Stage.I1
Doxorubicin : tumor_Stage.I1 : RHBDL1 -10.57
Doxorubicin : tumor_Stage.I1 : DHRS12 20.53
tumor_Stage.I1: RHBDL1 -9.95
Doxorubicin : DHRS12 -0.51
Doxorubicin : DHRS12 : cg05337753 -0.36
DHRS12 : cg05337753 -0.30

VISUALIZING EFFECT OF
INTERACTING FEATURES ON
PROBABILITY OF THE 3 PFI
BUCKETS
Illustration Tumor Stage > 1 Tumor Stage = 1
1 < 1.5 years

Gene 2 (High)
2 1.5 – 4 years
3 > 4 years
3 Lighter the area

Gene 2 (Low)
within the box,

2 higher the
probability of a
1
better Progression
Free Interval
Gene 1 Expression  Gene 1 Expression 
EFFECT OF THE TUMOR

RESULTS
STAGE*RHBDL1*DHRS12
INTERACTIONS ON PFI TIME
Stacked charts of probability of patients belonging to
each of the survival buckets given the combination of:
Tumor Stage > 1 Tumor Stage = 1
• Tumor Stage 1 flag (left >1, right =1)
DHRS12 = High
• RHBDL1 expression (increasing left to right)
• DHRS12 expression (increasing bottom to top)
DHRS12 expression levels 

Insights:
DHRS12 = Med
• Tumor Stage 1 patients have a better PFI compared to
patients at higher stages
• Increasing expression of RHBDL1 gene implies
DHRS12 = Low
reduced PFI time
(tumor_Stage.I1: RHBDL1 coef. = -9.95)
• Increasing expression of DHRS12 gene implies

increased PFI time for Tumor Stage 1 patients
(tumor_Stage.I1: DHRS12 coef. = 17.88) RHBDL1 expression levels  RHBDL1 expression levels 
EFFECT OF THE TUMOR

RESULTS
STAGE*RHBDL1*DHRS12
Stacked charts of probability of patients belonging to
Tumor Stage > 1 Tumor Stage = 1
• Tumor Stage 1 flag (left >1, right =1)
DHRS12 = High
• RHBDL1 expression (increasing left to right)
• DHRS12 expression (increasing bottom to top)

Insights:
DHRS12 = Med
• Tumor Stage 1 patients have a better PFI compared to
patients at higher stages
• Increasing expression of RHBDL1 gene implies
DHRS12 = Low
reduced PFI time
(tumor_Stage.I1: RHBDL1 coef. = -9.95)
• Increasing expression of DHRS12 gene implies

increased PFI time for Tumor Stage 1 patients
(tumor_Stage.I1: DHRS12 coef. = 17.88) RHBDL1 expression levels  RHBDL1 expression levels 
STUDYING EFFECTIVENESS OF BACKGROUND | DATA | VARIABLE SELECTION | MODELLING | VALIDATION | INTERPRETATION OF
RESULTS
DRUGS ON TUMORS DUE TO

DIFFERENT GENE EXPRESSION
WeVARIATIONS
compared the effectiveness of Trastuzumab and Doxorubicin in terms of PFI on patients
with different genetic variations, based on the observation that they are
 the most effective, and
 prescribed to a good proportion of patients in our sample
Illustration: Using gene-drug interaction for prescribing

treatment
Survival time
PREP TO UNDERSTAND NEXT
THREE SLIDES
Illustration Doxorubicin (Low) Doxorubicin (High)
Trastuzumab (High)
Trastuzumab Both Lighter the area

within the box,
higher the
probability of a
better Progression
Trastuzumab (Low)
Free Interval
Neither Doxorubici
n
Increasing expression levels  Increasing expression levels 
RESULTS
EFFECT OF GENE-DRUG
Trastuzumab is more effective in improving vs Doxorubicin works better for high
PFI of patients with high RHBDL1 . SELENOM expression
expression
Doxorubicin (Low) Doxorubicin (High) Doxorubicin (Low) Doxorubicin (High)

Trastuzuma
Trastuzuma
b (High)
b (High)
Trastuzuma
Trastuzuma
b (Low)
b (Low)
RHBDL1 expression levels  RHBDL1 expression levels  SELENOM expression levels SELENOM expression levels 

RESULTS
EFFECT OF GENE-DRUG
Lower methylation levels of cg02282892 also indicate lower PFI and
Doxorubicin works better on such patients as compared to
Trastuzumab
Doxorubicin (Low) Doxorubicin (High)

Trastuzuma
b (High)
Trastuzuma
b (Low)
RESULTS
EFFECT OF GENE-DRUG
Both the drugs show similar efficacy for NFKBIA and methylation cg05337753 variations
Doxorubicin (Low) Doxorubicin (High) Doxorubicin (Low) Doxorubicin (High)

Trastuzuma
Trastuzuma
b (High)
b (High)
Trastuzuma
Trastuzuma
b (Low)
b (Low)
NFKBIA expression levels  cg05337753 expression levels 
EXAMPLE OF APPLICATION AREA
FOR THIS RESEARCH
Genomic information has already helped to shape
the development and use of some of the newest
cancer treatments
For example,
 Trastuzumab works only for women whose
tumors have a particular genetic profile called
HER-2 positive (subtype)
 the Drug imatinib was designed to inhibit an
altered enzyme produced by a fused version
of two genes found in chronic myelogenous
leukemia
Source: https://cancergenome.nih.gov/cancergenomics/impact
LIMITATIONS & FUTURE STEPS
LIMITATIONS OF THE STUDY

Sample &
• Small sample size and aggressive feature selection
Dimensionalit • Limited to US geography
y
• Only chemotherapies considered as they were the most commonly prescribed
• Dosage information not available
Drug data • Order of use of therapies not taken into account
• Drugs modeled individually but some were prescribed in combination
• Data not sufficient for test-train validation

• Drug assignment bias still exists even after permutation testing. The right
Modeling way here would be to design a clinical trail with randomized drug assignment
but that is difficult to accomplish
LIMITATIONS & FUTURE STEPS
FUTURE STEPS
Data • Gather data from multiple data sources to try and increase the sample size
• Perform modeling for other cancer types and with other molecular datasets
Modeling (protein, mutation) to validate the same hypothesis
• Work in tandem with cancer genetics experts to validate the outputs of gene
Interpretations selection algorithms, modeling of drug combinations, and interpreting
& model interactions results
• Understand the level of genetic information already accounted for by clinical
evaluation variables to avoid confounding effects of certain genes
QUESTIONS
APPENDIX
PRINCIPAL COMPONENT
ANALYSIS
• Top 10 components explained 40% of the
overall variance
• Low interpretability
Variance Explained with the number of principal components Principal Components 1, 2 and 3 of the entire omics data with important variable loadings
PREDICTION OF SQ.ROOT(PFI
TIME)
• Given the small sample size of the final dataset (360), we used cross-validated train metrics to
measure the model performance. We were getting a high variance in the test performance results with
different test-train split seeds due to the small sample size of test data.
Results: Models Cross- validated R square
Linear Regression 28.4%

RF Regression 33.4%
SV Regression 27.2%
Conclusion : To test the capability of the combination of drugs history data and genetic/ clinical
features in predicting the PFI time better, we divided the PFI time into three buckets.
RESULTS
EFFECT OF GENE
INTERACTIONS ON SURVIVAL
Stacked charts of probability of patients belonging to each of cg--753 * DHRS12 * Cg---892
the survival buckets given the combination of:
• cg05337753 methylation (increasing bottom to top)
SELENOM SELENOM SELENOM
• DHRS12 expression (increasing within the box, left to right) (Low) (Med) (High)
• SELENOM expression (increasing left to right) cg--753 (High)
SELENOM expression levels 

Insight:
• cg05337753 and SELENOM interact in such a way
cg--753 (Med)
that the PFI time reduces at very low and very high
expression values
(explained by the negative coefficient for interaction)
cg--753 (Low)
• At higher cg05337753 methylation, increase in
SELENOM expression reduces PFI time.

RESULTS
EFFECT OF GENE-GENE
Stacked charts of probability of patients belonging to Doxoribicin * DHRS12 * Cg---892
2
• Doxorubicin (increasing left to right)
• DHRS12 expression (increasing within the box, left to right)
High Cg---892
Cg---892 expression levels 

Insights:
• Higher the methylation, better the survival
• PFS time reduces with increasing expression levels Low Cg---892
of DHRS12 gene at lower levels of cg02282892
methylation
• This effect of DHRS12 gene on PFS time weakens

at higher cg02282892 methylation
RESULTS
EFFECT OF GENE-GENE
Stacked charts of probability of patients belonging to Doxoribicin * DHRS12 * Cg---892
2
• Doxorubicin (increasing left to right)
• DHRS12 expression (increasing within the box, left to right)
High Cg---892
Cg---892 expression levels 

Insights:
• Higher the methylation, better the survival
• PFS time reduces with increasing expression levels Low Cg---892
of DHRS12 gene at lower levels of cg02282892
methylation
• This effect of DHRS12 gene on PFS time weakens

at higher cg02282892 methylation

Complete Project Presentation

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Complete Project Presentation

Uploaded by

Copyright:

Available Formats

PRECISION MEDICINE FOR

• GDC a tool started by CDIS became a core component Complex Systems

Why? What? How?

FINAL FEATURE SET

PFI TIME BINNED INTO

CLASSES AND MODELED AS

PREDICTION OF TWO PFI TIME

Random Forest 70% 15%

with varying test-train splits, and

PFI TIME BINNED INTO

CLASSES AND MODELED AS

PREDICTION OF THREE PFI

SVM (Radial) 56% 23%

Metrics for model evaluation:

Genes + Clinical + Drugs 57%

MODEL VALIDATION – WHY

<2.2e-16 p-value of this

0.30---- ------------------0.35-------------- --------0.40------------ ----------0.45---------------- ------0.50----------------------0.55 0.568

AND QUANTIFYING THEIR

Doxorubicin : tumor_Stage.I1 : RHBDL1 -10.57

Doxorubicin : tumor_Stage.I1 : DHRS12 20.53

tumor_Stage.I1: RHBDL1 -9.95

Doxorubicin : DHRS12 -0.51

Doxorubicin : DHRS12 : cg05337753 -0.36

DHRS12 : cg05337753 -0.30

1 < 1.5 years

3 Lighter the area

within the box,

EFFECT OF THE TUMOR

DHRS12 expression levels 

• Increasing expression of DHRS12 gene implies

EFFECT OF THE TUMOR

DHRS12 expression levels 

• Increasing expression of DHRS12 gene implies

DRUGS ON TUMORS DUE TO

Illustration: Using gene-drug interaction for prescribing

Trastuzumab Both Lighter the area

Doxorubicin (Low) Doxorubicin (High) Doxorubicin (Low) Doxorubicin (High)

Doxorubicin (Low) Doxorubicin (High)

Doxorubicin (Low) Doxorubicin (High) Doxorubicin (Low) Doxorubicin (High)

LIMITATIONS OF THE STUDY

• Data not sufficient for test-train validation

Results: Models Cross- validated R square

Linear Regression 28.4%

SELENOM expression levels 

DHRS12 expression levels 

Cg---892 expression levels 

• This effect of DHRS12 gene on PFS time weakens

Cg---892 expression levels 

• This effect of DHRS12 gene on PFS time weakens

You might also like