
Master Program in School of Pharmacy
Taipei Medical University
Master Thesis

Multiple Machine Learning Algorithms to predict Platinum-induced Nephrotoxicity in Non-small Cell Lung Cancer Patients

Graduate student: Shih-Hui Huang (黃世輝)

Advisor: Professor Hsiang-Yin Chen (陳香吟), Pharm.D.

July 2021
Acknowledgement

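The past two years have gone by very quickly, and my time in the clinical pharmacy division of the Master Program in the School of Pharmacy is about to come to an end under the COVID-19 Level 3 alert. As I write this acknowledgement, memories of two years of hard work come flooding back. I hope that when I look back on this master thesis many years from now, what comes to mind will not be COVID-19 but the wonderful time spent working hard with a group of like-minded friends on the seventh floor of the Teaching and Research Building.

Over these two years I received help from many people, without whom I could not have completed this master thesis. First, I would like to thank Professor Hsiang-Yin, who trained me by letting me take part in many activities of the School of Pharmacy and the laboratory. Organizing micro-courses for high school students, participating in a Food and Drug Administration project, writing proposals for the Ministry of Education teaching practice program and the Ministry of Science and Technology, and serving as a teaching assistant for several courses were all rare and valuable experiences. During the final push on the thesis and the manuscript she also gave me many excellent suggestions, which allowed me to complete this thesis. I thank Professors 文能 and 三源 for serving on my oral defense committee and for offering many suggestions that improved this thesis. I thank 肇余 for discussing many details of conducting machine learning research with me and for sparking my interest in the data mining tools Python and R, which became the most distinctive of the many skills I learned in the master program. I thank Dr. 家崙: although it is a pity that the newly enrolled patients could not be included in my thesis, the patient communication skills and recruitment experience I learned from him were an unexpected reward of the enrollment process. I thank the companions who agonized over their theses alongside me: 靖雯, 威志, 宇宏, 曉澤, 惠方 and 婷婷. Over two years we went through countless stressful moments of coursework, meetings, seminars and thesis writing together. Fortunately, in our spare time we could go out to eat, exercise and travel together, which gave us relief from the tension of graduate life. I will never forget the times we encouraged one another. Finally, I thank my parents for supporting me through two years of the master program. Their support allowed me to press forward without worries, and after graduation I will find a stable job to repay my family for all they have given me.

After graduation I hope to keep growing and to apply what I have learned and experienced over these two years in my work and in society. Once again, I thank everyone for their support, which allowed me to complete the master program. This research is dedicated to everyone who supported me in completing this master thesis.

Written in Taipei, midsummer
Shih-Hui Huang (黃世輝)
2021/07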

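Chinese Abstract

Background
Platinum-induced nephrotoxicity is a severe and unpredictable adverse drug reaction that readily leads to treatment failure in non-small cell lung cancer patients. Predicting and managing this nephrotoxicity in advance can improve patient survival. Previous gene association studies and prediction models have not produced good results. Emerging artificial intelligence algorithms have the potential to predict this complex, multifactorial adverse drug reaction well.
Objective
To build and compare machine learning models that combine clinical and genomic features, and to identify the best model for predicting platinum-induced nephrotoxicity in non-small cell lung cancer patients.
Methods
Clinical and genomic data were collected from non-small cell lung cancer patients treated with platinum drugs at Wan Fang Hospital. Twelve models were built from four machine learning algorithms (artificial neural network, logistic regression, random forest and support vector machine) and three modes (integrated, clinical and genomic). Grid search and a genetic algorithm were used to find the most predictive hyperparameters and feature combinations, respectively. The performance of the 12 models was evaluated by accuracy, precision, recall, F1 score, and the ROC curve and the area under it.
Results
A total of 118 patients were recruited, of whom 28 (23.73%) were in the nephrotoxicity group. Models in the integrated mode predicted better than those in the clinical and genomic modes, and among the 12 machine learning models the artificial neural network in the integrated mode predicted best. Its performance was accuracy 0.923, precision 0.950, recall 0.713, F1 score 0.808 and area under the ROC curve 0.900.
Conclusion
Machine learning models combining clinical and genomic features can serve as a preliminary tool to help hematologist-oncologists predict platinum-induced nephrotoxicity and take preventive measures for patients in advance.

Keywords: platinum drugs, nephrotoxicity, artificial neural network, logistic regression, random forest, support vector machine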

Abstract

Background
Platinum-induced nephrotoxicity is a severe and unpredictable adverse drug reaction that can lead to
treatment failure in non-small cell lung cancer patients. Better prediction and management of this
nephrotoxicity can increase patient survival. Previous gene association studies and prediction models
showed limited results. Innovative artificial intelligence techniques may be able to predict
this complicated, multifactorial adverse drug reaction.
Objective
To build and compare machine learning models with clinical and genomic features, and to identify the
best model for predicting platinum-induced nephrotoxicity in non-small cell lung cancer patients.
Methods
Patients undergoing platinum chemotherapy at Wan Fang Hospital were recruited, and their clinical
and genomic data were collected. Twelve models were established from four algorithms (artificial
neural network, logistic regression, random forest and support vector machine) in three modes
(integrated, clinical and genomic). Grid search and a genetic algorithm were applied to find the most
predictive hyperparameters and feature combination, respectively, for each fine-tuned model.
Accuracy, precision, recall, F1 score and area under the receiver operating characteristic curve were
calculated to compare the performance of the 12 models.
Results
In total, 118 patients were recruited in this study, of whom 28 (23.73%) developed nephrotoxicity.
Machine learning models with both clinical and genomic features predicted better than models with
clinical or genomic features alone. The artificial neural network with clinical and genomic features
showed the best predictive performance among all 12 models. The average accuracy, precision, recall,
F1 score and area under the receiver operating characteristic curve of the artificial neural network
in the integrated mode were 0.923, 0.950, 0.713, 0.808 and 0.900, respectively.
Conclusion
Machine learning models with clinical and genomic features could be a preliminary tool for
oncologists to predict platinum-induced nephrotoxicity and plan preventive strategies in advance.

Keywords: platinum, nephrotoxicity, artificial neural network, logistic regression, random forest,
support vector machine
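
The metrics reported in the Results can be related to one another with a short calculation. The following is a minimal sketch, not the code used in the thesis: the function names `classification_metrics` and `f1_from` and the confusion-matrix counts in the last line are hypothetical and chosen only for illustration, while the formulas are the standard definitions of accuracy, precision, recall and F1 score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Proportion of patients with nephrotoxicity reported in the Results: 28 of 118.
incidence = 28 / 118

# F1 implied by the averaged precision and recall reported for the ANN.
f1_of_averages = f1_from(0.950, 0.713)

# Hypothetical counts for a single test fold (illustration only, not study data).
example = classification_metrics(tp=4, fp=0, fn=2, tn=18)
```

With the reported averages, `f1_of_averages` comes to about 0.815, slightly above the reported 0.808. A gap of this size would be expected if the F1 score was averaged over validation folds rather than computed from the averaged precision and recall, although the abstract does not state which was done.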

Contents
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures

Chapter 1. Introduction

Chapter 2. Literature review
2.1 Cisplatin and Carboplatin
2.1.1 Structure and mechanisms
2.1.2 Pharmacokinetics
2.1.3 Clinical application
2.1.4 Adverse drug reactions
2.1.4.1 Nephrotoxicity
2.1.4.2 Myelosuppression
2.1.4.3 Nausea and vomiting
2.1.4.4 Neurotoxicity
2.1.4.5 Ototoxicity
2.2 Platinum-induced nephrotoxicity
2.2.1 Clinical presentation
2.2.1.1 Acute kidney injury
2.2.1.2 Chronic renal failure
2.2.2 Pathogenesis
2.2.2.1 Cellular toxicity
2.2.2.2 Proinflammatory effects
2.2.2.3 Effects on proximal tubule
2.2.2.4 Other cellular effects
2.2.3 Clinical risk factors
2.2.4 Gene polymorphisms
2.2.4.1 Drug transport
2.2.4.1.1 ABCB1
2.2.4.1.2 ABCC2
2.2.4.1.3 MATE1
2.2.4.1.4 OCT2
2.2.4.2 DNA repair
2.2.4.2.1 ERCC1 and ERCC4
2.2.4.2.2 OGG1
2.2.4.2.3 XPA
2.2.4.2.4 XPG
2.2.4.3 Metabolism and detoxification
2.2.4.3.1 GSTP1
2.2.4.3.2 NAT2
2.2.4.3.3 NQO1
2.2.4.3.4 UGT1A7
2.2.4.4 Apoptosis
2.2.4.4.1 TP53
2.3 Machine learning
2.3.1 Artificial neural network (ANN)
2.3.2 Logistic regression (LR)
2.3.3 Random forest (RF)
2.3.4 Support vector machine (SVM)
Chapter 3. Objective
Chapter 4. Method
4.1 Study design
4.2 Study participants
4.2.1 Inclusion criteria
4.2.2 Exclusion criteria
4.3 Data and sample collection
4.4 Modes and features
4.4.1 Prediction outcome
4.4.2 Input features
4.4.3 Other study definitions
4.5 Model building
4.6 Grid search in ANN, LR, RF and SVM
4.7 Five-fold cross validation
4.8 Leave-one-out cross validation
4.9 Feature selection and importance
4.10 Artificial neural network (ANN)
4.11 Logistic regression (LR)
4.12 Random forest (RF)
4.13 Support vector machine (SVM)
4.14 Metrics evaluation
4.15 Statistical analysis
Chapter 5. Results
5.1 Baseline characteristics
5.2 Hardy-Weinberg equilibrium for genotyping
5.3 Feature selection and importance
5.4 Sequential backward selection
5.5 Model performance
5.6 Survival analysis
5.7 Subgroup analysis
5.8 Performance comparison with and without dose-related features
Chapter 6. Discussion
Chapter 7. Conclusion
Reference
Appendices

List of Tables

Table 2.3.1 Artificial neural network applied in medical research
Table 2.3.2 Logistic regression applied in medical research
Table 2.3.3 Random forest applied in medical research
Table 2.3.4 Support vector machine applied in medical research
Table 4.3.1 Clinical data collected in this study
Table 4.3.2 Single nucleotide polymorphisms (SNP) candidates in this study
Table 4.4.2 The accuracy of KNN (k-nearest neighbors) algorithm in missing features
Table 4.10.1 Hyperparameters of the artificial neural network for five-fold cross validation
Table 4.10.2 Hyperparameters of the artificial neural network for leave-one-out cross validation
Table 4.12.1 Hyperparameters of the random forest for five-fold cross validation
Table 4.12.2 Hyperparameters of the random forest for leave-one-out cross validation
Table 4.13.1 Hyperparameters of the support vector machine for five-fold cross validation
Table 4.13.2 Hyperparameters of the support vector machine for leave-one-out cross validation
Table 4.14.1 Metrics used to evaluate model performance in this study
Table 4.14.2 Confusion matrix
Table 5.1 Baseline characteristics of non-small cell lung cancer patients segregated by renal toxicity
Table 5.2 The Hardy-Weinberg equilibrium distribution in this study
Table 5.3.1 Number of selected features for 12 models by five-fold and leave-one-out cross validation
Table 5.3.2 Feature selection for integrated mode by five-fold and leave-one-out cross validation
Table 5.3.3 Feature selection for clinical mode by five-fold and leave-one-out cross validation
Table 5.3.4 Feature selection for genomic mode by five-fold and leave-one-out cross validation
Table 5.5.1 The average accuracy, precision, recall and F1 score for 12 models in five-fold and leave-one-out cross validation
Table 5.5.2 Accuracy, precision, recall, and F1 score for 12 models in five-fold cross validation
Table 5.5.3 Accuracy, precision, recall and F1 score for 12 models in leave-one-out cross validation
Table 5.5.4 The area under the receiver operating characteristic curve (AUC) of the testing set calculated for the 12 models in five-fold cross validation

Table 5.5.5 The DeLong test calculated for different modes in five-fold cross validation
Table 5.5.6 The area under the receiver operating characteristic curve (AUC) of the testing set calculated for the 12 models in leave-one-out cross validation
Table 5.5.7 The DeLong test calculated for different modes in leave-one-out cross validation
Table 5.7.1 Baseline characteristics of non-small cell lung cancer patients segregated by renal toxicity in cisplatin subgroup
Table 5.7.2 Number of selected features for 12 machine learning models in cisplatin subgroup
Table 5.7.3 Feature selection for 12 models in cisplatin subgroup
Table 5.7.4 Features selected by all four algorithms in main and cisplatin subgroup analysis
Table 5.7.5 The average accuracy, precision, recall, and F1 score for 12 models in cisplatin subgroup
Table 5.7.6 Accuracy, precision, recall, and F1 score for 12 models in cisplatin subgroup
Table 5.7.7 The area under the receiver operating characteristic curve (AUC) of the testing set calculated for the 12 models in cisplatin subgroup
Table 5.7.8 The DeLong test calculated for different modes in cisplatin subgroup
Table 5.8.1 Comparison of the average accuracy, precision, recall and F1 score with and without dose-related features in five-fold cross validation
Table 5.8.2 Comparison of the area under the receiver operating characteristic curve (AUC) of the testing set calculated for the 12 models in five-fold cross validation
Table 5.8.3 Comparison of the average accuracy, precision, recall and F1 score with and without dose-related features in leave-one-out cross validation
Table 5.8.4 Comparison of the area under the receiver operating characteristic curve (AUC) of the testing set calculated for the 12 models in leave-one-out cross validation
Table 5.8.5 Comparison of the average accuracy, precision, recall and F1 score with and without dose-related features in cisplatin subgroup analysis
Table 5.8.6 Comparison of the area under the receiver operating characteristic curve (AUC) of the testing set calculated for the 12 models in cisplatin subgroup analysis

List of Figures

Figure 2.1.1 Structure of cisplatin and carboplatin
Figure 2.2.4.3 Gamma-glutamyl transpeptidase (GGT) pathway
Figure 2.3.1.1 Perceptron structure in artificial neural network
Figure 2.3.1.2 Structure of artificial neural network
Figure 2.3.2 Sigmoid function
Figure 2.3.3 Structure of random forest
Figure 2.3.4 Rationale of support vector machine
Figure 4.1 Study flowchart
Figure 4.7 Example of five-fold and leave-one-out cross validation
Figure 5.4.1 The plot of sequential backward selection (SBS) by logistic regression (LR), random forest (RF), and support vector machine (SVM) of integrated mode in five-fold cross validation
Figure 5.4.2 The plot of sequential backward selection (SBS) by logistic regression (LR), random forest (RF), and support vector machine (SVM) of integrated mode in leave-one-out cross validation
Figure 5.5.1 The receiver operating characteristic (ROC) curve and area under the receiver operating characteristic curve (AUC) generated by 12 models in five-fold cross validation
Figure 5.5.2 The receiver operating characteristic (ROC) curve and area under the receiver operating characteristic curve (AUC) generated by 12 models in leave-one-out cross validation
Figure 5.6.1 The Kaplan-Meier plot for artificial neural network (ANN) of integrated mode in five-fold cross validation
Figure 5.6.2 The Kaplan-Meier plot for artificial neural network (ANN) of integrated mode in leave-one-out cross validation
Figure 5.7.1 The receiver operating characteristic (ROC) curve and area under the receiver operating characteristic curve (AUC) of 12 models in cisplatin subgroup
Figure 5.7.2 The Kaplan-Meier plot for artificial neural network (ANN) of integrated mode in cisplatin subgroup

Chapter 1. Introduction

Platinum-induced nephrotoxicity, a severe and unpredictable adverse drug reaction (ADR), leads to treatment failure in non-small cell lung cancer (NSCLC) patients. One-third of cancer patients treated with platinum chemotherapy experience nephrotoxicity even with prior hydration [1, 2]. The pathophysiology of platinum-induced nephrotoxicity has been shown to involve complicated, multifactorial pathways that include clinical and genomic risk factors [3-6]. These mechanisms are mainly related to dose, drug transport, DNA repair, metabolism and apoptosis [6, 7]. Because platinum is the first-line chemotherapy for NSCLC patients, preventing and managing this ADR is crucial to improving patient survival.

Predicting a complicated ADR such as platinum-induced nephrotoxicity requires sophisticated data mining techniques. Given the non-linear relationships among metabolic pathways and gene-gene interactions, traditional statistical models have not produced reasonable predictions [8-10]. A common statistical method, classification and regression trees (CART), has been applied to predict this ADR but reached an accuracy of only 0.845 [11]. Innovative methods such as artificial intelligence (AI) algorithms may be able to handle these complex feature relationships and better address the unmet need in the treatment of NSCLC patients.

Making the best use of AI algorithms to predict platinum-induced nephrotoxicity requires well-fitted models with a suitable selection of clinical and genomic features. Prediction models constructed with only clinical data showed limited results [8, 9], and gene association studies using single nucleotide polymorphisms (SNP) produced inconsistent outcomes [10-17]. Random forest (RF) demonstrated an area under the receiver operating characteristic curve (AUC) of less than 0.7 in ADR prediction, plausibly because key features were omitted [10, 18]. Support vector machine (SVM), which classifies data in a high-dimensional space with a hyperplane, showed a significantly better AUC in breast tumor classification than experienced healthcare professionals [19, 20]. Convolutional neural network, an advanced artificial neural network (ANN), was able to predict gene mutations from NSCLC patients' histopathologic images [21]. With appropriate feature selection, these AI techniques could perform well on complex medical problems such as platinum-induced nephrotoxicity.

The objective of this study was to build and compare machine learning (ML) models using ANN, RF, SVM and logistic regression (LR) with clinical and genomic features, and to identify the best model for predicting platinum-induced nephrotoxicity in NSCLC patients. The specific aims were two-fold: 1) to select the best hyperparameters by grid search and the best features by genetic algorithm (GA) in order to construct fine-tuned ML models that predict platinum-induced nephrotoxicity; 2) to evaluate the models with multiple metrics, including accuracy, precision, recall, F1 score and AUC.
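
The experimental layout described in this thesis (four algorithms crossed with three feature modes, tuned by grid search and evaluated under the five-fold and leave-one-out cross validation listed in the Contents) can be sketched as follows. This is a minimal, standard-library illustration rather than the study's implementation: the helper `k_fold_indices`, the unshuffled contiguous folds, and the hyperparameter names and values in `svm_grid` are assumptions made only for the example.

```python
from itertools import product

ALGORITHMS = ["ANN", "LR", "RF", "SVM"]
MODES = ["integrated", "clinical", "genomic"]

# Every algorithm is trained in every mode: 4 x 3 = 12 models.
models = list(product(ALGORITHMS, MODES))


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds each take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


N_PATIENTS = 118  # cohort size reported in the abstract

five_fold = list(k_fold_indices(N_PATIENTS, 5))
# Leave-one-out is the special case k = n: each patient is the test set once.
leave_one_out = list(k_fold_indices(N_PATIENTS, N_PATIENTS))

# Grid search enumerates every combination of candidate hyperparameter values.
# The names and values below are hypothetical placeholders, not the study's grid.
svm_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
candidates = [dict(zip(svm_grid, values)) for values in product(*svm_grid.values())]
```

In a full pipeline, each entry of `candidates` would be scored on every (train, test) split and the best-scoring setting kept. Folds would normally also be shuffled and stratified by outcome, which this sketch omits for brevity.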

Chapter 2. Literature review

2.1 Cisplatin and Carboplatin


Platinum drugs were developed through the work of numerous scientists over more than a century. In 1845, Michele Peyrone discovered the compound, which was named Peyrone's chloride [22]. Its cytotoxicity was first discovered by accident in 1965, when Barnett Rosenberg observed that electrolysis with platinum electrodes inhibited the growth of E. coli [23]. After finding that the compound inhibited cell division, Rosenberg tested its antitumor activity in tumor-bearing mice and obtained promising results [24]. As a result, cisplatin entered its first clinical trial in 1971 and was approved by the Food and Drug Administration (FDA) in 1978.

Although cisplatin is effective against solid tumors, its toxicity limits its clinical use. Its common adverse drug reactions are nephrotoxicity and ototoxicity. To reduce the risks of cisplatin, scientists and chemists set out to develop a safer platinum drug for clinical use. In 1985, a cisplatin analog called carboplatin was successfully manufactured, and preclinical studies showed that its efficacy was similar to that of cisplatin [25]. Carboplatin was subsequently approved for clinical use by the FDA in 1986.

Carboplatin showed antitumor activity similar to that of cisplatin with lower toxicity. Although carboplatin required a higher serum concentration and reacted with DNA more slowly than cisplatin [26], it produced a similar survival rate in patients with advanced ovarian cancer [27], which supported its FDA approval for clinical chemotherapy. Carboplatin causes only minor nephrotoxicity and ototoxicity, but it can lead to severe dose-dependent myelosuppression, especially thrombocytopenia.

2.1.1 Structure and mechanisms
Cisplatin, cis-diamminedichloroplatinum(II) (CDDP), has a square planar structure in which a central platinum atom is bound to two chloride ligands and two ammonia molecules (Figure 2.1.1). It shows antitumor activity in the cis form but not in the trans form.

Carboplatin, cis-(1,1-cyclobutanedicarboxylato)diammineplatinum(II), has a structure similar to that of cisplatin, but its leaving group is cyclobutanedicarboxylate (Figure 2.1.1). Because this leaving group is more stable than chloride, carboplatin undergoes aquation more slowly and is less reactive. As a result, carboplatin may be less potent than cisplatin.

Figure 2.1.1 Structure of cisplatin and carboplatin [28]

Platinum drugs are classified with the alkylating agents and bind to DNA to exert cytotoxicity on tumor cells. After entering the cell, the drug undergoes aquation, in which the leaving groups are displaced by water and the platinum complex becomes positively charged [28]. This active diaquo form reacts with cellular nucleophiles, especially DNA, to generate cytotoxicity [29]. Platinum binds covalently to the N-7 position of guanine in DNA to form monoadducts, intrastrand cross-links and interstrand cross-links [30]. Among these cross-linked DNA products, 1,2-d(GpG) intrastrand cross-links contribute most to cytotoxicity. The resulting platinum-DNA adducts interfere with DNA transcription and synthesis, which produces the tumor suppression effect.

2.1.2 Pharmacokinetics
The pharmacokinetics of cisplatin and carboplatin are summarized below:

1. Absorption

Cisplatin and carboplatin are administered intravenously rather than orally. After injection, cisplatin reaches its maximum serum concentration in 90 to 150 minutes [31].

2. Distribution

The protein binding of cisplatin is approximately 90%, and the drug is well distributed in human tissues, including the liver and kidney [31]. Carboplatin does not bind to serum proteins and distributes easily to the tissues [32].

3. Metabolism

Cisplatin may be transformed into several metabolites without enzymes. Approximately 90% of the metabolites are bound to protein and are not cytotoxic, while the minor unbound fraction in the serum is cytotoxic [33]. Carboplatin is not appreciably metabolized and undergoes aquation more slowly than cisplatin [32].

4. Excretion

Both cisplatin and carboplatin are mainly excreted in urine. About 50% of cisplatin metabolites are excreted within 5 days, and cisplatin can still be detected in tissues after several months. The toxic unbound cisplatin metabolite may be actively secreted by the renal tubules [34]. Approximately 70% of carboplatin metabolites are excreted within 24 hours, and about one-third are excreted in unchanged form [35].

2.1.3 Clinical application


Cisplatin and carboplatin have been approved for numerous solid tumors, and each has advantages in particular indications. The main indications for cisplatin are lung cancer, ovarian cancer, head and neck cancer and testicular cancer, while those for carboplatin are lung cancer, head and neck cancer, ovarian cancer and germ cell cancer [36]. Cisplatin can be more potent in head and neck cancer, while carboplatin is as effective as cisplatin in lung cancer and ovarian cancer [37].

2.1.4 Adverse drug reactions


Cisplatin and carboplatin may cause numerous adverse drug reactions. The following sections focus on the common adverse effects of platinum drugs: nephrotoxicity, myelosuppression, nausea and vomiting, neurotoxicity and ototoxicity.

2.1.4.1 Nephrotoxicity
Nephrotoxicity is a dose-related adverse drug reaction (ADR) that commonly occurs in patients receiving cisplatin. About 30% of patients may develop acute kidney injury when using cisplatin [38]. Platinum-induced nephrotoxicity usually begins during the second week after cisplatin administration, with signs of renal failure, electrolyte imbalance and increased serum creatinine [31]. This kidney toxicity is typically reversible but can become irreversible after high cisplatin doses over several courses [39]. Carboplatin is less nephrotoxic than cisplatin because it is more soluble and binds more slowly to plasma protein [40].

2.1.4.2 Myelosuppression
Myelosuppression is a dose-related and dose-limiting toxicity that usually occurs in elderly patients receiving platinum. Its main manifestations include thrombocytopenia, neutropenia and anemia. About 25 to 30% of patients receiving cisplatin may develop this toxicity [31]. In patients using carboplatin, blood counts reach their nadir at a median of 21 days but can recover after platinum is discontinued [37].

2.1.4.3 Nausea and vomiting


Nausea and vomiting are adverse drug reactions that patients receiving chemotherapy may experience, especially with platinum chemotherapy. Cisplatin is a highly emetogenic agent that can induce nausea and vomiting within 1 to 4 hours and reach maximum intensity 48 to 72 hours after administration [31]. Approximately 75 to 80% of patients develop nausea and vomiting within 24 hours after receiving carboplatin [41].

2.1.4.4 Neurotoxicity
Neurotoxicity is a dose-related effect that has been reported in patients using cisplatin. Peripheral neuropathy typically presents as persistent, irreversible paresthesias and numbness that can continue or progress after cisplatin discontinuation [42]. Carboplatin causes only minor neurotoxicity, with an incidence of about 5% [41].

2.1.4.5 Ototoxicity
Ototoxicity is a dose-related, cumulative and reversible toxicity that occurs when patients receive high doses of platinum. Cisplatin is much more toxic to hearing tissue than carboplatin [31, 41].

2.2 Platinum-induced nephrotoxicity


Although platinum is potent against multiple cancers, its toxicity, especially to the kidney, limits its clinical use. Platinum-induced nephrotoxicity can still occur despite appropriate prevention and can deteriorate to severe acute kidney injury and chronic renal failure [38, 43, 44]. Its clinical presentation and pathogenesis are reviewed here.

2.2.1 Clinical presentation


Kidney injury, both acute and chronic, is the main clinical presentation of platinum-induced nephrotoxicity. The following sections focus on acute kidney injury and chronic renal failure.

2.2.1.1 Acute kidney injury


Acute kidney injury (AKI) is a rapid deterioration of kidney function, manifested as increased serum creatinine and decreased urine output [45]. This renal toxicity has been shown to be a risk factor for mortality, which can be as high as 50% [46]. Platinum-induced nephrotoxicity typically occurs 5 to 7 days after administration and can last for several weeks [47]. It is also accompanied by severe complications, including electrolyte disturbances such as hypomagnesemia, hyponatremia, hypocalcemia and hypokalemia [48, 49]. Severe electrolyte loss may lead to serious orthostatic hypotension and even induce distal renal tubular acidosis [50].

Multiple studies have shown that patients with platinum-induced nephrotoxicity still have normal urine output and may even develop polyuria. This phenomenon is attributed to decreased expression of renal aquaporins in the collecting duct and inner medulla, which causes a concentrating defect [51, 52]. Cisplatin-induced polyuria may occur in two phases. In the early phase, 1 to 2 days after cisplatin administration, the glomerular filtration rate is normal and the polyuria can be reversed by vasopressin [53, 54]. The late phase, 5 to 6 days after cisplatin injection, is characterized by decreased glomerular filtration rate and papillary hypertonicity and may not be reversible by vasopressin [55].

Severe AKI may lead to multiple severe renal morphologic changes, of which cell death is characteristic. Necrosis is found mainly in the S3 segment of the proximal tubule, while apoptosis occurs extensively not only in proximal tubule cells but also in the distal convoluted tubule and the loop of Henle [56]. The severity of renal cell death is significantly associated with cumulative platinum dose, concentration and administration time [57, 58]. Human renal slices have also shown significant histopathologic changes during platinum-induced nephrotoxicity, such as loss of the brush border [59].

2.2.1.2 Chronic renal failure


Long-term platinum exposure may lead to chronic renal failure with multiple long-term irreversible effects. Common clinical presentations include permanent changes in nephron structure and decreased estimated glomerular filtration rate. Other, minor manifestations include increased serum creatinine, urinary electrolyte loss and proteinuria [60]. Lesions observed in the kidney after 5 months of cisplatin administration include focal tubular apoptosis and necrosis, interstitial hyperplasia, peritubular cell infiltration and tubulointerstitial fibrosis [44].

2.2.2 Pathogenesis
Multiple complex pathways have been associated with platinum-induced renal toxicity. The main mechanisms of kidney injury can be classified into cellular toxicity, proinflammatory effects, effects on the proximal tubule and other cellular effects.

2.2.2.1 Cellular toxicity


Platinum-induced cellular toxicity can arise from the drug's elimination pathway. Cisplatin easily passes through the glomerulus and is taken up by the kidney through channel proteins. Most cisplatin is eliminated by the kidney in urine: more than half is excreted in urine within 24 hours of administration, and its concentration in the renal cortex becomes many times higher than in other tissues [61]. Consequently, cisplatin has been shown to injure predominantly the S3 segment of the proximal tubule, with a decrease in glomerular filtration rate [62].

Cisplatin's metabolic pathways can also cause cellular toxicity. Once cisplatin enters the cell, the lower intracellular chloride concentration facilitates its hydrolysis into an active charged product. This compound enters the glutathione metabolism pathway and ultimately generates toxic, unstable reactive thiols that induce cytotoxicity by cross-linking with DNA [63].

Preclinical studies have implicated multiple drug transporters in platinum-induced nephrotoxicity. Overexpression of organic cation transporter 2 (OCT2) in the proximal tubule is related to cisplatin accumulation there, leading to renal toxicity [64-66]. Independent pathways involving organic anion transporters, including OAT1 and OAT2, may act through a mechanism similar to that of OCT2 [67]. Moreover, copper transporter protein 1 (Ctr1), distributed in proximal and distal tubular cells, may similarly increase cisplatin uptake and lead to nephrotoxicity [68, 69].

Multiple mechanisms lead to cellular toxicity, including cisplatin accumulation through channel proteins, increased toxic platinum metabolites, DNA damage, changes in cellular transport systems, mitochondrial dysfunction, oxidative and nitrosative stress, inflammatory reactions, activation of the mitogen-activated protein kinase (MAPK) family, and apoptotic activation.

2.2.2.2 Proinflammatory effects


Multiple proinflammatory cytokines have been shown to be induced by cisplatin and may lead to severe nephrotoxicity through inflammatory responses. These mediators, including tumor necrosis factor (TNF)-alpha, interferon (IFN)-gamma, interleukin (IL)-6 and caspases, recruit neutrophils, T cells and other effectors to induce inflammatory responses [57, 70]. Preclinical studies have shown that increased expression of cell adhesion molecules in proximal tubule cells leads to leukocyte and T cell infiltration and contributes to platinum-induced nephrotoxicity [71-73].

2.2.2.3 Effects on proximal tubule


Cisplatin has been shown to selectively injure proximal tubule cells in the kidney, causing necrosis at high concentrations and apoptosis at low concentrations [3]. Several reasons explain why cisplatin injures the kidney specifically: increased renal uptake through expression of OCT and OAT [74]; decreased expression of sodium-dependent glucose and amino acid transporters and of magnesium and water transporters [61, 75]; induction of cisplatin-mediated glutathione metabolism pathways [76]; and increased amounts of reactive oxygen species [77].

2.2.2.4 Other cellular effects


Numerous minor cellular effects may also contribute to platinum-induced nephrotoxicity. Cisplatin induces apoptosis by inhibiting F1F0-ATPase and oxidative phosphorylation in mitochondria [78, 79]. Because platinum reduces fatty acid oxidation in the proximal tubule, triglycerides and nonesterified fatty acids accumulate in renal tissue [79, 80]. Additionally, expression of genes mediated by peroxisome proliferator-activated receptor (PPAR)-alpha may be reduced during cisplatin-induced acute kidney injury [81].

2.2.3 Clinical risk factors


Several studies have directly related multiple clinical factors to platinum-induced nephrotoxicity. These factors include age older than 65 years, female sex (due to lower clearance of unbound platinum), smoking, a cisplatin dose > 100 mg in a single cycle, coadministration of paclitaxel, previous cisplatin chemotherapy, pre-existing kidney damage, coadministration of other nephrotoxic medications, hypoalbuminemia, high peak plasma free platinum concentrations and a history of hypertension [8, 82-84].

2.2.4 Gene polymorphisms


Studies have found that multiple genomic pathways and mechanisms are potentially associated with platinum-induced nephrotoxicity. The genes involved participate in drug transport, DNA repair, metabolism, detoxification and apoptosis [85]. The genes associated with the risk of platinum nephrotoxicity are described below.

2.2.4.1 Drug transport


Overexpression of renal transporter genes can increase drug influx, which causes nephrotoxicity through platinum accumulation, or drug efflux, which leads to resistance [86]. The drug transport genes selected in this study are introduced below.

2.2.4.1.1 ABCB1
The adenosine triphosphate-binding cassette subfamily B member 1 (ABCB1) gene, also called the multidrug resistance 1 (MDR1) gene, encodes P-glycoprotein (P-gp), one of the best-known drug efflux transporters [87]. ABCB1 is responsible for exporting platinum chemotherapy from tumor cells and has been extensively investigated in chemotherapy resistance [88-90]. Thus, overexpression of ABCB1 might lead to a poor response rate in cancer patients receiving chemotherapy [91].

Although research on the ABCB1 gene and platinum-induced nephrotoxicity is limited, multiple studies have investigated its pharmacogenomics, with inconsistent results. For 3435 T>C, Korean SCLC patients with the CC genotype receiving etoposide-cisplatin had a higher response rate [90]. Other studies indicated that 3435 T>C had no significant relationship with response rate in Korean and Chinese NSCLC patients receiving irinotecan-cisplatin [92, 93]. To date, few studies have reported on toxicity associated with the 3435 T>C variant in cancer patients [94, 95].

2.2.4.1.2 ABCC2
The adenosine triphosphate-binding cassette subfamily C member 2 (ABCC2) gene, also called the multidrug resistance-associated protein 2 (MRP2) gene, encodes another well-known drug efflux transporter. Overexpression of ABCC2 led to increased platinum resistance with a significantly lower response rate [96, 97]. ABCC2 may also be involved in the elimination of platinum-glutathione conjugates [98].

Few empirical studies have examined ABCC2 in platinum-induced nephrotoxicity, but some pharmacogenomic studies have been conducted. For 3972 C>T, NSCLC patients with the TT genotype receiving platinum had a significantly better response rate and longer survival [92]. However, another study showed that patients with the TT genotype had a higher risk of severe thrombocytopenia and of hematologic toxicity in females [99]. For -24C>T, lung cancer patients with the TT genotype had a better response rate, whereas one study found that those with the CC genotype also had better efficacy [92, 100]. To date, there are few clinical findings for 1249 G>A [92, 99, 100].

2.2.4.1.3 MATE1
Multidrug and toxin extrusion 1 (MATE1) is a proton/organic cation antiporter that mediates drug efflux; it is distributed in the kidney and was found to transport substrates in concert with OCT2 [101, 102]. Multiple studies have described the role of MATE1 in platinum-induced nephrotoxicity. Research using human embryonic kidney (HEK) cells found that cisplatin is a substrate of human MATE1 whereas carboplatin is not [103]. In addition, MATE1 knockout (mate(-/-)) mice had significantly higher serum creatinine and shorter survival than wild-type mice after cisplatin treatment [104].

Several pharmacogenomic studies of MATE1 have shown inconsistent results. Whether the MATE1 G>A variant, which decreases MATE1 transporter function, increases the efficacy of metformin is still controversial [105, 106]. In addition, the MATE1 G>A variant was not associated with the elimination of cisplatin or with its adverse effects [17, 107]. Recent work has established that deletion of peroxisome proliferator-activated receptor alpha (PPAR-α) has a potential renoprotective effect through modulation of OCT2 and MATE1 [108].

2.2.4.1.4 OCT2
Organic cation transporter 2 (OCT2) transports organic cations from the circulation into renal cells and is mainly distributed in the renal proximal tubule [109]. Multiple studies have demonstrated that overexpression of OCT2 is associated with platinum-induced nephrotoxicity. Several preclinical studies indicated that rat OCT2-expressing HEK293 cells accumulated more cisplatin than mock-transfected cells and showed that cisplatin is a substrate of human OCT2 [103, 110, 111]. Similarly, cimetidine, an OCT2 substrate, can inhibit cisplatin uptake [111].

Several pharmacogenomic studies have produced inconsistent results on the association between OCT2 expression and platinum-induced nephrotoxicity. For OCT2 G808T, Caucasian patients with the GG genotype receiving cisplatin had significantly higher serum creatinine in one study, while another study did not show a similar result [12, 16]. Asian patients with the GG genotype receiving cisplatin had higher serum creatinine, while those with GT did not [17]. However, one study that monitored protein biomarkers of kidney injury showed that OCT2 could protect renal proximal tubules [112].

2.2.4.2 DNA repair


Platinum induces not only cytotoxicity but also renal toxicity through DNA damage caused by platinum-DNA adducts. Damaged cells can activate nucleotide excision repair (NER) and base excision repair (BER) to reduce these adducts and relieve kidney injury [14, 113]. Variations in DNA repair genes are related to platinum-induced nephrotoxicity [15, 114, 115]. The DNA repair genes selected in this study are introduced below.

2.2.4.2.1 ERCC1 and ERCC4


Excision repair cross-complementation group 1 (ERCC1) forms a dimer with excision repair cross-complementation group 4 (ERCC4) that performs the 5’ excision to remove the damaged DNA fragment. Multiple studies have examined the relationship between ERCC1 expression and chemotherapy efficacy. Overexpression of ERCC1 was related to cisplatin resistance in cancer patients [116]. In addition, extensive research has shown that ERCC1-negative NSCLC patients receiving platinum had significantly longer survival and lower resistance than ERCC1-positive patients [117].

Several pharmacogenomic studies showed inconsistent results for the ERCC1 8092 C>A (rs3212986) and ERCC1 C118T (rs11615) variants in cisplatin-induced renal toxicity. In gastric cancer patients, ERCC1 8092 C>A and ERCC1 C118T were not associated with kidney injury [13]. However, other work found that advanced lung cancer patients receiving cisplatin with the AA genotype of 8092 C>A and the CC genotype of C118T showed a significantly smaller decrease in eGFR and less renal toxicity than those with other genotypes [15, 16].

Previous studies have established an association between ERCC4 T2505C (rs1799801) and toxicity. Recent work found that ERCC4 was associated with significantly more severe toxicity and thrombocytopenia in NSCLC patients [118]. In addition, another study showed that the ERCC4 C allele was associated with significantly less serious hematologic toxicity in head and neck cancer patients [119].

2.2.4.2.2 OGG1
The 8-oxoguanine DNA glycosylase-1 (OGG1) protein initiates BER to remove 8-oxoguanine, a mispairing base induced by reactive oxygen species (ROS) [120]. The BER function of OGG1 may be related to the incidence of lung cancer. OGG1 protein expression was found to be lower in lung cancer patients than in healthy individuals, although both groups showed the same OGG1 gene expression [121].

Several pharmacogenomic studies of the OGG1 Ser326Cys (rs1052133) variant in lung cancer have shown inconsistent results. One study showed that Asian patients with the Cys variant had a significantly increased incidence of lung cancer, while another did not find a statistically significant increase [122, 123]. In addition, some studies have demonstrated that OGG1 could be an indicator of survival in lung cancer patients [124, 125]. For other cancers, one meta-analysis showed that patients with OGG1 Ser326Cys had a lower risk of prostate cancer, while others did not identify this relationship in gallbladder and colorectal cancer [126-128].

2.2.4.2.3 XPA
Xeroderma pigmentosum group A (XPA) protein binds to the damaged DNA fragment to help NER identify the position to be excised [129]. Several pharmacogenomic studies have examined the relationship between XPA -4G>A (rs1800975) and lung cancer incidence. Two meta-analyses found that the AG genotype of XPA -4G>A (rs1800975) is related to lung cancer susceptibility [130, 131]. Research from China indicated that the AA genotype carried a significantly higher risk of lung cancer than the G variant [132]. In addition, a study from Korea indicated that NSCLC patients with the AA genotype were prone to TP53 mutation [133].

2.2.4.2.4 XPG
Xeroderma pigmentosum group G (XPG), also known as excision repair cross-complementation group 5 (ERCC5), is the 3’ endonuclease that stabilizes the NER complex to help excise the damaged DNA fragment [129]. Variation at XPG 1104G>C (rs17655) affects NER function through the interaction between XPG and transcription factor II H (TFIIH) [134]. One study showed that US patients with the XPG Asp1104Asp genotype had a significantly lower lung cancer incidence [135]. However, this genetic association with lung cancer susceptibility was not identified in a meta-analysis [136]. Similar results were also reported in colorectal, bladder and stomach cancer patients in Turkey, Tunisia and China, respectively [137-139].

2.2.4.3 Metabolism and detoxification


Platinum mainly enters the gamma-glutamyl transpeptidase (GGT) pathway, which generates potent nephrotoxic metabolites (Figure 2.2.4.3) [140, 141]. The metabolism and detoxification genes selected in this study are introduced below.

Figure 2.2.4.3 Gamma-glutamyl transpeptidase (GGT) pathway [76]


*X represents the alkene and Y represents a halogen molecule: fluorine, chlorine, or bromine
*GGT, gamma-glutamyl transpeptidase
2.2.4.3.1 GSTP1
Glutathione S-transferase (GST) is a phase II enzyme, mainly distributed in the human liver, that catalyzes the nucleophilic conjugation of glutathione with electrophiles; it helps the kidney eliminate these metabolites by increasing their solubility [142]. The GST pi 1 (GSTP1) gene is usually expressed in tumor cells, so some studies have investigated the association between GSTP1 and cancer susceptibility [143]. However, these studies showed inconsistent results on whether overexpression of GSTP1 increases the probability of developing cancer [144-146].

Multiple pharmacogenomic studies have discussed the association between GSTP1 expression and platinum-induced nephrotoxicity. An animal study observed that GSTP1/P2 wild-type mice had more severe nephrotoxicity than null mice [147]. For GSTP1 A313G, two Caucasian studies did not identify a relationship between the variant and cisplatin-induced kidney toxicity [13, 15].

2.2.4.3.2 NAT2
N-acetyltransferase 2 (NAT2), encoded by the NAT2 gene, catalyzes acetylation by transferring the acetyl group from acetyl coenzyme A (acetyl-CoA) to aromatic amines, heterocyclic amines and hydralazine. Many NAT2 genotypes have been discovered, and they show different acetylation rates [148]. Most studies classify the phenotype as rapid acetylator (with NAT2*4) or slow acetylator (without NAT2*4). The glutathione S-conjugates of cisplatin are converted by NATs or cysteine S-conjugate β-lyases into mercapturates and reactive thiols that lead to nephrotoxicity [149].
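The rapid/slow classification above can be sketched as a small helper, for example to turn a genotype into a binary model feature. This is only an illustration: the function name `nat2_phenotype`, the star-allele string format and the rule that a single NAT2*4 allele is enough to count as "with NAT2*4" are assumptions, not the coding scheme used in this study.

```python
def nat2_phenotype(allele_1, allele_2):
    """Classify acetylator phenotype from two NAT2 star alleles, e.g. "*4" and "*5"."""
    # Assumption: carrying at least one NAT2*4 allele counts as "with NAT2*4".
    return "rapid" if "*4" in (allele_1, allele_2) else "slow"


# Hypothetical genotypes used only to exercise the helper.
carrier = nat2_phenotype("*4", "*5")
non_carrier = nat2_phenotype("*5", "*6")
```

If a study instead requires two NAT2*4 alleles for the rapid phenotype, only the condition in the return statement would need to change.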

Multiple studies have investigated the relationship between NAT2 expression and cisplatin-induced kidney injury. The cisplatin metabolites generated by NAT2 were found to be highly nephrotoxic [76]. Previous pharmacogenomic research established that NAT2 correlated with cisplatin-induced kidney injury in ovarian cancer patients [150]. A genomic prediction model also showed an association between NAT2 and cisplatin-induced renal toxicity [10].

2.2.4.3.3 NQO1
NAD(P)H:quinone oxidoreductase 1 (NQO1) is a flavoenzyme that reduces quinones, quinoneimines and nitroaromatics to relieve cellular oxidative stress and thereby protect cells [151]. Only benzene toxicity has been related to NQO1 gene expression: a previous study reported that patients heterozygous for NQO1 had significantly greater benzene toxicity than homozygous patients [152].

Some pharmacogenomic studies have shown that the NQO1 609 C>T (rs1800566) polymorphism has no association with platinum-induced nephrotoxicity. In osteosarcoma patients receiving cisplatin, multivariate analysis did not demonstrate an association between NQO1 and renal toxicity [153]. A similar result was found in advanced NSCLC patients [154].

2.2.4.3.4 UGT1A7

Uridine diphospho-glucuronosyltransferase 1A7 (UGT1A7) is a phase II enzyme that participates in the glucuronidation of heterocyclic amines and nitroamines for detoxification [155]. A meta-analysis indicated that UGT1A7 polymorphisms may increase cancer risk in Asian patients [156]. The UGT1A7 genotypes show different metabolic activities (UGT1A7*1 > UGT1A7*2 > UGT1A7*4 > UGT1A7*3) and are usually classified as high activity (UGT1A7*1) or low activity (non-UGT1A7*1) [157]. Although few studies have investigated the association between UGT1A7 and platinum-induced nephrotoxicity, a comprehensive study indicated that UGT1A7 is crucial in cisplatin metabolism [158].

2.2.4.4 Apoptosis

2.2.4.4.1 TP53

Tumor suppressor protein 53 (TP53) helps damaged cells repair their DNA by arresting the cell cycle at the G1/S checkpoint [159]. Under stress, TP53 can induce apoptosis through transcriptional activation of PUMA (p53 upregulated modulator of apoptosis) and NOXA [160-162]. Several studies have tried to demonstrate an association between TP53-induced apoptosis activated by caspase-3 and platinum-induced nephrotoxicity [163, 164]. An animal study found that the TP53 Arg72 variant induces apoptosis five times more effectively than Pro72 [165]. Although a previous study did not show a correlation between TP53 and renal toxicity, recent research using CART and the Framingham risk score selected TP53 as an important genomic factor for predicting acute kidney injury [11, 15].

2.3 Machine learning

Multiple machine learning models have been applied in medicine and have shown promising results in prediction and classification. The machine learning algorithms applied in this study are discussed below.

2.3.1 Artificial neural network (ANN)

The multi-layer perceptron ANN is a non-linear computational algorithm that mimics signal transduction in the human brain. The perceptron (Figure 2.3.1.1), the basic unit of the ANN, simulates a human neuron. Each perceptron calculates a value from its inputs, weights and bias. The activation function, which mimics the action potential, decides whether to emit an output depending on the magnitude of the summed value.
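As a minimal sketch of the calculation described above, the following Python function computes the weighted sum of the inputs plus a bias and passes it through a step activation. The function name `perceptron`, the step threshold at 0 and the example weights are illustrative assumptions, not values from this study.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum plus bias exceeds 0, else 0 (step activation)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Summed value: each input scaled by its weight, plus the bias term.
    summed = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step activation: the unit "fires" only when the summed value is positive.
    return 1 if summed > 0 else 0


# Hypothetical example: two inputs with hand-picked weights.
fired = perceptron([1.0, 0.5], [0.6, -0.4], bias=-0.2)   # summed value is about 0.2
silent = perceptron([0.0, 1.0], [0.6, -0.4], bias=-0.2)  # summed value is about -0.6
```

In practice a smooth activation such as the sigmoid is often used instead of a hard step so that the weights can be learned by gradient-based training.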

Figure 2.3.1.1 Perceptron structure in artificial neural network [166]

The structure of the ANN (Figure 2.3.1.2) consists of layers of neurons that capture the non-linear association between input and output features. The basic structure contains an input layer, hidden layers and an output layer. The input layer holds the variables from the preprocessed data, such as clinical or genomic data. The hidden layers are the main layers that learn, from the input layer, the features that can predict the outcomes. The more hidden layers the ANN has, the more complex the data it can process; however, more complex hidden layers are not guaranteed to predict better. The output layer contains the outcomes of interest, expressed as the probability of each outcome category.
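The flow from input layer to hidden layer to output probabilities can be sketched as a single forward pass. The layer sizes, weights and the choice of sigmoid and softmax activations below are assumptions made for illustration; they are not the architecture used in this study.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the weights of neuron j."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (sigmoid) -> output layer (softmax)."""
    hidden = [sigmoid(z) for z in dense(features, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical network: 3 input features -> 2 hidden neurons -> 2 outcome categories.
probs = forward(
    features=[0.2, 0.7, 1.0],
    hidden_w=[[0.5, -0.3, 0.8], [-0.6, 0.1, 0.4]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.2, -0.7], [-1.0, 0.9]],
    out_b=[0.0, 0.0],
)
```

Each entry of `probs` is the predicted probability of one outcome category, which matches the description of the output layer above.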

Figure 2.3.1.2 Structure of artificial neural network


Lack of interpretability, also known as the black-box effect, is the main limitation of the ANN [167]. Although we may know which features the hidden layers learn from the input layer, the relationship between the outcomes and the features sometimes cannot be directly interpreted or explained. To increase the interpretability of the results, the best ANN is usually constructed through a trial-and-error process.

Studies have indicated that the ANN shows promising results in medicine. Most of these studies built an ANN to extract image features and to classify patients into high and low risk of disease [168-171]. The accuracy or AUC in these studies was above 0.8, which surpassed the traditional approaches in each area.

Table 2.3.1 Artificial neural network applied in medical research
Authors                Year  Aim                                             Performance
Ayer T et al. [169]    2010  Cancer susceptibility                           AUC: 0.965
Chen Y-C et al. [170]  2014  Cancer survival                                 Accuracy: 0.835
Seah et al. [171]      2019  Feature extraction of congestive heart failure  AUC: 0.82

2.3.2 Logistic regression (LR)

LR, a common statistical tool, is usually used to solve classification problems. LR applies the sigmoid function (Figure 2.3.2), also known as the logistic function, to transform the model output into a predicted probability between 0 and 1. The threshold of LR is usually set to 0.5: if the probability is higher than 0.5, the output is labeled 1; otherwise it is labeled 0 [172].
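The transformation and thresholding step can be sketched in a few lines of Python. Here `z` stands for the linear combination of features and coefficients; the function names and the default threshold argument are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Label 1 when the predicted probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


# Hypothetical linear predictor values.
positive = predict_label(2.0)    # probability above 0.5
negative = predict_label(-2.0)   # probability below 0.5
```

Lowering the threshold below 0.5 would trade precision for recall, which can matter when the positive class is the minority.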

Figure 2.3.2 Sigmoid Function [173]


Studies have indicated that LR can achieve good performance in medicine. Most of the studies using LR aimed to solve lung cancer classification problems [174-179]. These studies reached promising results compared with other prediction models.

Table 2.3.2 Logistic regression applied in medical research
Authors                 Year  Aim                                   Performance
Swensen et al. [174]    1997  Solid nodules in lung cancer          AUC: 0.832
Cassidy et al. [175]    2008  Lung cancer development               AUC: 0.70
Bilimoria et al. [176]  2013  Risk classification model             Mortality AUC: 0.944; Morbidity AUC: 0.816
Farjah et al. [177]     2013  N2 nodules classification             Internal AUC: 0.70; External AUC: 0.65
Tammemagi et al. [178]  2013  Nodules classification                AUC > 0.9
Deppen et al. [179]     2015  Indeterminate nodules classification  Internal AUC: 0.87; External AUC: 0.89

2.3.3 Random Forest (RF)

RF (Figure 2.3.3) is an ensemble model made up of multiple classification and regression trees (CART). Each CART is built from its own parameters and variables and generates a predicted value. RF determines the final prediction by majority vote among the CARTs.
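The voting step can be sketched as follows. The helper name `majority_vote` and the example votes are hypothetical; with an even split, this sketch returns the label that appears first among the tied ones, which is a simplification.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    # most_common(1) returns a list holding the (label, count) pair with the highest count.
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from five trees (1 = nephrotoxicity, 0 = no nephrotoxicity).
final = majority_vote([1, 0, 1, 1, 0])
```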

Figure 2.3.3 Structure of random forest [180]


The structure of RF is governed by multiple hyperparameters, including the maximum depth of the trees, the maximum number of features, the minimum samples per leaf, the minimum samples per split and the number of estimators. The maximum depth sets how deep each tree can grow to capture the features of the data: the deeper the trees, the more information the RF can capture. The maximum number of features sets how many features can be used when building the trees. Increasing it can improve the performance of the RF, but it can also decrease the diversity of the trees and the computational efficiency of the algorithm. The minimum samples per leaf is the minimum number of data points in any leaf node; a smaller leaf size makes the RF more likely to capture noise in the dataset. The minimum samples per split is the minimum number of samples an internal node must hold before it can be split, and it should be increased only when analyzing larger datasets. The number of estimators is the number of trees. The more trees the RF builds, the better its performance can be, but the lower the efficiency of the algorithm.
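Because the abstract states that grid search was used to tune hyperparameters, the sketch below shows how a grid over the five RF hyperparameters above could be enumerated. The candidate values are hypothetical, and the key names are assumed to follow the argument names of scikit-learn's `RandomForestClassifier`; the study's actual grid is not reproduced here.

```python
from itertools import product

# Hypothetical candidate values; key names follow scikit-learn's RandomForestClassifier arguments.
param_grid = {
    "max_depth": [3, 5],
    "max_features": [2, 4],
    "min_samples_leaf": [1, 3],
    "min_samples_split": [2, 4],
    "n_estimators": [100, 200],
}


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(expand_grid(param_grid))
```

Each dictionary in `combinations` describes one candidate model; a grid search would train and cross-validate one model per dictionary and keep the best-scoring setting.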

Studies have indicated that RF can achieve good performance in medicine. Most RF studies predicted disease progression or classified risk across levels of disease severity [181-186]. Many of them performed well compared with other classification methods.
Table 2.3.3 Random forest applied in medical research
Authors Year Aim Performance
Tong et al. [181] 2004 Cancer classification Sensitivity: 0.992
Delen D et al. [182] 2005 Cancer survival prediction Accuracy: 0.93
Szabo de Edelenyi et al. [183] 2008 Prediction of metabolic syndrome Accuracy: 0.717
Xu et al. [184] 2011 Severe asthma in children AUC: 0.66
Lebedev et al. [185] 2014 Prediction of Alzheimer’s disease Sensitivity: 0.886

2.3.4 Support vector machine (SVM)

Support vector machine (SVM, Figure 2.3.4) is a well-known AI algorithm that solves classification
problems by separating the data in a high-dimensional space with an optimal hyperplane. The optimal
hyperplane is defined as the hyperplane that reaches the maximum margin between the categories.

Figure 2.3.4 Rationale of support vector machine [187]

SVM has several hyperparameters that are tuned to find the optimal hyperplane with maximum margin.
The kernel function is the basic hyperparameter of SVM. It transforms the input data into the
required form. The radial basis function (RBF) is the most commonly used kernel function. Each
kernel function has its own hyperparameters. For the RBF kernel, these are C and γ. C is the penalty
coefficient applied to classification errors. If C is higher, the SVM model tolerates fewer errors
and tends to overfit, and vice versa. γ determines how the data are distributed after projection to
the new feature space. If γ is higher, the influence of each training sample reaches a shorter
distance, so the decision boundary follows individual samples more closely and the model tends to
overfit, and vice versa.
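As a hedged illustration of how γ enters the RBF kernel, the sketch below computes the kernel value exp(-γ·||x - z||²) for two hypothetical points. The function name and the points are assumptions for illustration; the two γ values are taken from the range reported later in Table 4.13.1.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical feature vectors with squared distance 2
x, z = [0.0, 0.0], [1.0, 1.0]
low_gamma_similarity = rbf_kernel(x, z, gamma=0.005)
high_gamma_similarity = rbf_kernel(x, z, gamma=0.1)
```

With the larger γ the same pair of points is treated as less similar, which is why a larger γ makes the influence of each sample more local.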

Studies have indicated that SVM can perform well in medical research. Most SVM studies were applied
to predict cancer susceptibility, recurrence and survival [188-195]. These studies all reported
better accuracy than traditional prediction models.
Table 2.3.4 Support vector machine applied in medical research
Authors Year Aim Performance
Listgarten J et al. [188] 2004 Cancer susceptibility Accuracy: 0.69
Waddell M et al. [189] 2005 Cancer susceptibility Accuracy: 0.71
Kim W et al. [190] 2012 Cancer recurrence Accuracy: 0.89
Xu X et al. [191] 2012 Cancer survival prediction Accuracy: 0.97
Chang S-W et al. [192] 2013 Cancer survival prediction Accuracy: 0.75
Ahmad LG et al. [193] 2013 Cancer recurrence Accuracy: 0.95
Rosado P et al. [194] 2013 Cancer survival prediction Accuracy: 0.98
Tseng C-J et al. [195] 2014 Cancer recurrence Accuracy: 0.89

Chapter 3. Objective

Platinum-induced nephrotoxicity is the most frequently occurring adverse drug reaction in NSCLC
patients. This study aimed to build ML models with clinical and genomic data to predict
platinum-induced nephrotoxicity in NSCLC patients. Four ML algorithms, ANN, LR, RF and SVM, were
applied, and the performance of each model was compared.

This study was expected to achieve the following objectives:

1. Identify the key contributing clinical and genomic factors of platinum-induced nephrotoxicity in

non-small cell lung cancer patients.

2. Build up machine learning models to predict nephrotoxicity for NSCLC patients undergoing

platinum chemotherapy.

3. Evaluate the accuracy, precision, recall rate, F1 score, AUC and ROC curve of different models.

4. Find the best combination of variables for each model by genetic algorithm (GA).

Chapter 4. Method

4.1 Study design

This is a retrospective cohort study with four main phases: patient recruitment, data collection,
model construction and model evaluation (Figure 4.1). The study was approved by the Wan Fang
Hospital Institutional Review Board (WFH-IRB No. 99054, Appendix 1). The cohort included 118 NSCLC
patients who newly received platinum chemotherapy at Wan Fang Hospital from January 2005 to March
2010. All subjects were recruited according to the inclusion and exclusion criteria. Clinical data
were collected from the electronic health record, and genomic data were obtained from peripheral
blood samples after patients completed written informed consent. Model construction repeated the
following three steps until the optimized ML models were obtained. First, features were selected
from the patient information based on literature review, and some new features were engineered.
Second, all the features were analyzed under three modes, and twelve models were defined and built
from four ML algorithms (ANN, LR, RF, and SVM) and three modes (I, C, and G). The integrated (I)
mode included all the clinical and genomic features. The clinical (C) mode included only clinical
features. The genomic (G) mode included only genomic features. Third, five-fold and leave-one-out
cross validation were both applied to examine model generalizability and to find the best
predictive model for each combination. Finally, the fine-tuned models were evaluated by multiple ML
metrics to decide the best performing model among the 12 models.

Figure 4.1 Study flowchart
*NSCLC, non-small cell lung cancer; WFH, Wan Fang hospital; ANN, artificial neural network; LR,
logistic regression; RF, random forest; SVM, support vector machine; I, integrated; C, clinical; G,
genomic; Cis, cisplatin; Car, Carboplatin; AUC, the area under the receiver operating characteristic
curve; ROC curve, receiver operating characteristic curve

4.2 Study participants

Patients were recruited according to the following inclusion and exclusion criteria:

4.2.1 Inclusion criteria

1. Received at least one course of cisplatin or carboplatin

2. Had at least one serum creatinine measurement before and after the administration

3. Willing to provide a DNA sample and sign the informed consent

4.2.2 Exclusion criteria

1. Younger than 20 years old or older than 89 years old

2. Pregnant women

3. Infected with human immunodeficiency virus (HIV)

4. Coadministered with ifosfamide

5. Kidney function could not be evaluated

6. Refused to provide a DNA sample or sign the informed consent

4.3 Data and sample collection

Written informed consent was obtained to collect each patient's clinical information and blood
samples. Clinical data of the first chemotherapy course (Table 4.3.1) were collected retrospectively
from the electronic health record. Sixteen genes with eighteen SNPs (Table 4.3.2) were genotyped
from patients' blood samples. For the internal validation cohort, genotyping was performed by
polymerase chain reaction restriction fragment length polymorphism (PCR-RFLP) analysis, denaturing
high performance liquid chromatography (DHPLC) and TaqMan genotyping. All genotyping results were
further validated with the ABI Prism 3100 (Applied Biosystems) in at least 5% randomly chosen
samples.

Table 4.3.1 Clinical data collected in this study
Types Data
General information Gender, age, height, weight, smoking, drinking
Cancer-related information Histology, pathology, stage
Platinum chemotherapy administration information Administration date, administration course, average dosage, accumulative dosage, concomitant chemotherapy medications, the number of administrations
Laboratory data Serum creatinine

Table 4.3.2 Single nucleotide polymorphisms (SNP) candidates in this study


Gene/Allele name SNP SNP number
OCT2 G808T rs316019
ABCB1 C3435T rs1045642
MATE1 G>A rs2289669
ABCC2 -24 C>T rs717620
MATE1 g.-66 T>C rs2252281
XPA G23A rs1800975
OGG1 C326G rs1052133
XPG G1104C rs17655
ERCC4 T2505C rs1799801
ERCC1 C118T rs11615
ERCC1 C>A rs3212986
NAT2*5 C481T rs1799929
NAT2*6 C590T rs1799930
NAT2*7 G857T rs1799931
NQO1 C609T rs1800566
GSTP1 A313G rs1695
UGT1A7 T622C rs11692021
TP53 G215C rs1042522

4.4 Modes and features

Three modes were defined according to whether clinical data, genomic data or both were used, and
each mode was applied to the four machine learning algorithms. The integrated (I) mode included
clinical and genomic features. The clinical (C) mode included only clinical features. The genomic
(G) mode included only genomic features.
4.4.1 Prediction outcome

Acute kidney injury (AKI) was the prediction outcome in this study and was defined according to the
Common Terminology Criteria for Adverse Events (CTCAE) Version 5.0. A patient was labeled as AKI if
serum creatinine after platinum chemotherapy reached more than 1.5 times the baseline serum
creatinine (Grade 1) [196]. This binary outcome was coded as 1 (AKI) or 0 (non-AKI).
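The labeling rule above can be written as a small function. This is a minimal sketch: the function and argument names are hypothetical, and using the highest post-treatment value is an assumption, since the text does not say which post-treatment measurement was compared with baseline.

```python
def label_aki(baseline_scr, post_scr_values, ratio=1.5):
    """Return 1 (AKI) if any post-treatment Scr exceeds ratio x baseline, else 0."""
    if baseline_scr <= 0:
        raise ValueError("baseline serum creatinine must be positive")
    # Assumption: the peak post-treatment value is compared with baseline.
    return int(max(post_scr_values) > ratio * baseline_scr)


# Hypothetical serum creatinine values in mg/dl
patient_with_aki = label_aki(0.8, [0.9, 1.3])
patient_without_aki = label_aki(1.0, [1.2, 1.4])
```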

4.4.2 Input features

Clinical and genomic features were obtained after data collection and feature engineering. Missing
values were imputed according to the type of variable. Continuous variables were imputed with the
mean if they followed a normal distribution and with the median otherwise. Categorical variables
were imputed by the KNN (k-nearest neighbors) algorithm. The accuracy of the KNN algorithm for each
feature with missing values was calculated using the patients without missing values as ground
truth (Table 4.4.2). The accuracy ranged from 0.538 to 1.000. The code book of clinical and genomic
variables is listed in Appendix 2.
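A minimal sketch of the two imputation rules is given below. Everything beyond the rules themselves is an assumption for illustration: the function names, the normality flag (which would come from a separate normality test), the squared Euclidean distance, and k = 3, since the text does not report the distance measure or the value of k.

```python
from collections import Counter
from statistics import mean, median


def impute_continuous(values, is_normal):
    """Fill None with the mean (normal) or median (non-normal) of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if is_normal else median(observed)
    return [fill if v is None else v for v in values]


def knn_impute_category(target_row, complete_rows, labels, k=3):
    """Predict a missing category as the majority label among the k nearest rows."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    order = sorted(range(len(complete_rows)),
                   key=lambda i: sq_dist(target_row, complete_rows[i]))
    nearest_labels = [labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical data: one missing continuous value and one missing category
filled_mean = impute_continuous([1.0, None, 3.0], is_normal=True)
filled_median = impute_continuous([1.0, None, 2.0, 10.0], is_normal=False)
rows = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
imputed_label = knn_impute_category([0.2, 0.1], rows, ["A", "A", "B", "B"])
```

The same KNN routine can be run on patients whose category is known to estimate imputation accuracy, which is how the accuracy values in Table 4.4.2 are described.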

Table 4.4.2 The accuracy of KNN (k-nearest neighbors) algorithm in missing features.
Missing features            Number of missing values   Accuracy
Alcohol                     1                          0.829
TNM stage                   1                          0.538
Pathology                   2                          1.000
Adenocarcinoma              2                          0.793
Large cell carcinoma        2                          0.931
Squamous cell carcinoma     2                          0.862
OCT2_GG (rs316019)          1                          0.684
ERCC4_CC (rs1799801)        1                          0.949
ERCC4_CT (rs1799801)        1                          0.650
ERCC4_TT (rs1799801)        1                          0.641
NAT2*7_AA (rs1799931)       1                          0.974
NAT2*7_AG (rs1799931)       1                          0.726
NAT2*7_GG (rs1799931)       1                          0.752
UGT1A7_CC (rs11692021)      11                         0.963
UGT1A7_CT (rs11692021)      11                         0.720
UGT1A7_TT (rs11692021)      11                         0.748

4.4.3 Other study definitions

1. Evaluation method of survival analysis

(1) Index date: Date when the subject first received platinum chemotherapy.

(2) Closing date:

I. Subjects with AKI: Date when the subject first developed AKI after receiving platinum
chemotherapy.

II. Subjects without AKI: Date of the subject's last blood sampling for serum creatinine.

2. Platinum drugs:

If a subject changed platinum drugs during chemotherapy, the platinum drug was identified as
follows:

(1) Subjects with AKI: The platinum drug received when AKI occurred.

(2) Subjects without AKI: The platinum drug that the subject mainly received.

3. Concomitant chemotherapy drugs:

If a subject changed concomitant chemotherapy drugs during chemotherapy, the concomitant drug was
identified as follows:

(1) Subjects with AKI: The concomitant medication received when AKI occurred.

(2) Subjects without AKI: The concomitant medication that the subject mainly received.

4. Accumulative dose of platinum chemotherapy: The accumulative dose of platinum chemotherapy
(mg/m2) until the closing date.

5. Administration times of platinum chemotherapy: The total number of administrations of platinum
chemotherapy until the closing date.

6. Average dose of platinum chemotherapy: Accumulative dose of platinum chemotherapy (mg/m2)
divided by the administration times of platinum chemotherapy.

7. Accumulative intensity of platinum chemotherapy: Accumulative dose of platinum chemotherapy
(mg/m2) divided by the course duration (weeks).
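Definitions 4 to 7 are simple arithmetic on the per-administration doses. The sketch below shows one way to derive them; the function name, the dictionary keys and the example doses are hypothetical.

```python
def dose_features(doses_mg_m2, course_weeks):
    """Derive the dose features of definitions 4-7 from per-administration doses."""
    cumulative = sum(doses_mg_m2)        # definition 4: accumulative dose (mg/m2)
    times = len(doses_mg_m2)             # definition 5: administration times
    return {
        "cumulative_dose": cumulative,
        "administration_times": times,
        "average_dose": cumulative / times,                 # definition 6
        "cumulative_intensity": cumulative / course_weeks,  # definition 7
    }


# Hypothetical patient: three administrations over a 9-week course
features = dose_features([75.0, 75.0, 60.0], course_weeks=9)
```

For this hypothetical patient the accumulative dose is 210 mg/m2, the average dose is 70 mg/m2 per administration, and the accumulative intensity is about 23.3 mg/m2 per week.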

4.5 Model building

Twelve models were constructed from 4 ML algorithms and 3 modes to predict AKI: ANN-I, LR-I, RF-I
and SVM-I in I mode; ANN-C, LR-C, RF-C and SVM-C in C mode; and ANN-G, LR-G, RF-G and SVM-G in G
mode.

4.6 Grid search in ANN, LR, RF and SVM

Grid search was applied to find the best hyperparameter combination for ANN, LR, RF and SVM. For
ANN, batch size, number of epochs, learning rate, optimizer and momentum were tuned, and a total of
1536 combinations were tested. For LR, the solver was the hyperparameter to be tuned, and a total of
5 candidates were evaluated. For RF, number of estimators, maximum features, maximum depth, minimum
samples per split, minimum samples per leaf and bootstrap were tuned, and a total of 4400
combinations were tested. For SVM, kernel, C and gamma were the hyperparameters, and a total of 120
combinations were evaluated before the best performing combination was selected.
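Grid search evaluates every combination in the Cartesian product of the candidate values. The sketch below only enumerates such a grid. The candidate values are hypothetical, chosen so that the grid has 120 combinations like the SVM grid reported above; the actual candidate values used in the study are not given in this section.

```python
from itertools import product

# Hypothetical SVM grid (3 kernels x 8 values of C x 5 values of gamma = 120)
svm_grid = {
    "kernel": ["linear", "rbf", "sigmoid"],
    "C": [0.1, 1, 5, 10, 100, 1000, 5000, 10000],
    "gamma": [0.001, 0.005, 0.01, 0.1, 1],
}


def expand_grid(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(expand_grid(svm_grid))
```

Each dictionary in `combinations` would then be used to fit and score one candidate model under cross validation, and the best scoring combination would be kept.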

4.7 Five-fold cross validation

Five-fold cross validation (Figure 4.7) was applied to assess model performance and reduce
overfitting. The subjects were divided into five groups stratified by cisplatin/carboplatin and
AKI/non-AKI. Each group took a turn as the testing set while the other groups formed the training
set, repeating until every group had been used as the testing set [197]. The differences in
performance across the five groups were also measured to evaluate whether the models were
appropriate for predicting the outcome.

Figure 4.7 Example of five-fold and leave-one-out cross validation


*Red color represents training set while white color represents testing set.
*Sample size is hypothesized at 15 for the leave-one-out cross validation.
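A stratified five-fold split can be sketched as follows. The stratum sizes are taken from Table 5.1 (platinum drug by AKI status); the function name, the round-robin assignment and the random seed are assumptions for illustration and not necessarily how the folds were generated in this study.

```python
import random
from collections import defaultdict


def stratified_folds(strata, n_folds=5, seed=0):
    """Assign each subject index to a fold, spreading every stratum evenly."""
    by_stratum = defaultdict(list)
    for idx, key in enumerate(strata):
        by_stratum[key].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for key in sorted(by_stratum):
        members = by_stratum[key]
        rng.shuffle(members)
        for idx in members:
            # Round-robin keeps each stratum within one subject per fold
            folds[position % n_folds].append(idx)
            position += 1
    return folds


# (platinum drug, AKI label) for the 118 subjects, sizes from Table 5.1
strata = ([("cisplatin", 1)] * 22 + [("cisplatin", 0)] * 33
          + [("carboplatin", 1)] * 6 + [("carboplatin", 0)] * 57)
folds = stratified_folds(strata)

# Each fold is the testing set once; the remaining folds form the training set
splits = [(sorted(i for f in folds[:k] + folds[k + 1:] for i in f), sorted(folds[k]))
          for k in range(5)]
```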

4.8 Leave-one-out cross validation

Leave-one-out cross validation (Figure 4.7) was also applied to assess model performance. Compared
with five-fold cross validation, leave-one-out cross validation is better suited to small sample
sizes. The number of groups in leave-one-out cross validation equals the sample size, so each group
contains a single subject. Each subject took a turn as the testing set while the remaining subjects
formed the training set, repeating until every subject had been used as the testing set [197].
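The leave-one-out scheme can be sketched as an index generator. The function name is hypothetical, and the sample size of 15 follows the example in Figure 4.7 rather than the study cohort.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index) pairs, one pair per subject."""
    for test_idx in range(n_samples):
        train = [i for i in range(n_samples) if i != test_idx]
        yield train, test_idx


loo_splits = list(leave_one_out(15))  # 15 subjects, as in the Figure 4.7 example
```

With 15 subjects this produces 15 splits, each training on 14 subjects and testing on the remaining one.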

4.9 Feature selection and importance

Genetic algorithm (GA, Figure 4.9), which simulates biological evolution, was applied in this study
to select the optimal feature combination for each model. The first generation contained 34
clinical and 50 genomic features, with the initial population set to 300 [198]. The algorithm ran
for 30 generations or until the fitness function converged. The fitness function, defined as the
target function to be optimized, was calculated from accuracy and AUC. Based on the literature
review, the crossover rate and mutation rate were set to 0.9 and 0.03, respectively [199].
Sequential Backward Selection (SBS) was also applied to validate the accuracy and number of features
selected by GA in LR, RF and SVM, using the mlxtend module (Python Software Foundation, Wilmington,
DE, USA) [200].
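The GA settings above can be illustrated with a minimal sketch of feature selection over binary masks. Only the feature count, population size, number of generations, crossover rate and mutation rate come from the text. The toy fitness function, tournament selection, single-point crossover and elitism are assumptions for illustration; in the study the fitness was calculated from model accuracy and AUC.

```python
import random

N_FEATURES = 84                      # 34 clinical + 50 genomic features
POP_SIZE, N_GENERATIONS = 300, 30
CROSSOVER_RATE, MUTATION_RATE = 0.9, 0.03

# Hypothetical stand-in fitness: rewards a hidden "informative" subset and
# penalizes every extra feature. It replaces the model-based fitness.
INFORMATIVE = set(range(0, N_FEATURES, 4))


def fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    extras = sum(mask) - hits
    return hits - 0.25 * extras


def tournament(population, scores, rng, k=3):
    """Pick the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def run_ga(seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(POP_SIZE)]
    history = []
    for _ in range(N_GENERATIONS):
        scores = [fitness(ind) for ind in population]
        best = max(range(POP_SIZE), key=lambda i: scores[i])
        history.append(scores[best])
        next_gen = [population[best][:]]          # elitism: keep the best mask
        while len(next_gen) < POP_SIZE:
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            if rng.random() < CROSSOVER_RATE:     # single-point crossover
                cut = rng.randrange(1, N_FEATURES)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Flip each bit with probability MUTATION_RATE
            child = [1 - g if rng.random() < MUTATION_RATE else g for g in child]
            next_gen.append(child)
        population = next_gen
    scores = [fitness(ind) for ind in population]
    best = max(range(POP_SIZE), key=lambda i: scores[i])
    history.append(scores[best])
    return population[best], history


best_mask, history = run_ga()
selected_features = [i for i, bit in enumerate(best_mask) if bit]
```

Because the best mask of each generation is carried over unchanged, the best fitness in `history` never decreases, which gives a simple way to check for convergence.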

Feature importance was also calculated for the LR- and RF-related models. For the LR-related
models, the coefficients of the selected features were reported. For the RF-related models, the
impurity-based feature importance, also known as Gini importance, was used. Feature importance was
not calculated for the ANN- and SVM-related models because it is only available for a single-layer
neural network and a linear SVM, respectively.

4.10 Artificial neural network (ANN)

The ANN was built as a Sequential model with the keras module and consisted of an input layer, 2
hidden layers and an output layer. The activation function of the input and hidden layers was
sigmoid. The input layer and hidden layers had 8 and 2 neurons, respectively. Binary cross-entropy
was used as the loss function. Accuracy was chosen as the metric for model evaluation.
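For readers unfamiliar with the two building blocks named above, the sketch below writes out the sigmoid activation and the binary cross-entropy loss in plain Python. It is not the keras implementation; the function names, the clipping constant and the example values are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical pre-activation outputs for three patients with labels 1, 0, 1
probs = [sigmoid(z) for z in (2.0, -1.0, 0.0)]
loss = binary_crossentropy([1, 0, 1], probs)
```

The loss is small when the predicted probability is close to the true label and grows quickly as the prediction moves toward the wrong class.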

Different combinations of hyperparameters were chosen after grid search for the I, C and G modes
under five-fold and leave-one-out cross validation. For five-fold cross validation, the optimizer,
batch size, learning rate and number of epochs for each mode are presented in Table 4.10.1. For
leave-one-out cross validation, they are presented in Table 4.10.2.

Table 4.10.1 Hyperparameters of the artificial neural network for five-fold cross validation
Hyperparameters Optimizer Batch size Learning rate Epoch
I mode SGD 5 0.3 100
C mode Adamax 10 0.005 100
G mode Adagrad 5 0.001 100

Table 4.10.2 Hyperparameters of the artificial neural network for leave-one-out cross validation
Hyperparameters Optimizer Batch size Learning rate Epoch
I mode SGD 5 0.01 100
C mode Adamax 5 0.2 100
G mode SGD 5 0.001 100

4.11 Logistic regression (LR)

LR, a common statistical method, was built with the scikit-learn module, with the solver as its
tuned hyperparameter, and generated a binary outcome with a cut-point of 0.5. If the predicted value
was greater than 0.5, the outcome was classified as 1; otherwise it was classified as 0. For
five-fold cross validation, the solvers used for the I, C and G modes were liblinear, saga and
newton-cg, respectively. For leave-one-out cross validation, the solver used for the I, C and G
modes was liblinear in all cases.

4.12 Random forest (RF)

RF, an algorithm built from multiple CARTs, was constructed with RandomForestClassifier from the
scikit-learn module with multiple hyperparameters. For five-fold cross validation, the
hyperparameters for each mode are presented in Table 4.12.1. For leave-one-out cross validation,
they are presented in Table 4.12.2.

Table 4.12.1 Hyperparameters of the random forest for five-fold cross validation
Hyperparameters Bootstrap Max depth Max features Min samples leaf Min samples split Number of estimators
I mode True None sqrt 2 5 50
C mode True None sqrt 10 15 200
G mode False None sqrt 6 2 10

Table 4.12.2 Hyperparameters of the random forest for leave-one-out cross validation
Hyperparameters Bootstrap Max depth Max features Min samples leaf Min samples split Number of estimators
I mode True None sqrt 2 2 50
C mode False None sqrt 6 15 10
G mode True None log2 4 10 10

4.13 Support vector machine (SVM)

The classification algorithm SVM was built with SVC from the scikit-learn module with three
hyperparameters. For five-fold cross validation, the hyperparameters for each mode are presented in
Table 4.13.1. For leave-one-out cross validation, they are presented in Table 4.13.2.
Table 4.13.1 Hyperparameters of the support vector machine for five-fold cross validation
Hyperparameters C kernel gamma
I mode 5 rbf (radial basis function) 0.005
C mode 5 rbf 0.01
G mode 10000 rbf 0.1
Table 4.13.2 Hyperparameters of the support vector machine for leave-one-out cross validation
Hyperparameters C kernel gamma
I mode 5 rbf 0.005
C mode 5 rbf 0.005
G mode 10000 rbf 0.1

4.14 Metrics evaluation

Multiple metrics (Table 4.14.1) were used to evaluate model performance. Accuracy, precision,
recall and F1 score were estimated from the confusion matrix (Table 4.14.2) with the scikit-learn
module. These metrics were calculated from the numbers of true positives, true negatives, false
positives and false negatives, and their formulas are listed below [197]. The AUC was calculated
with the scikit-learn module. The receiver operating characteristic (ROC) curve was plotted with
the matplotlib.pyplot module 3.2.2 (Python Software Foundation, Wilmington, DE, USA).


Table 4.14.1 Metrics used to evaluate model performance in this study
Metrics Formula
Accuracy $\frac{TP + TN}{TP + FP + TN + FN}$
Precision $\frac{TP}{TP + FP}$
Recall $\frac{TP}{TP + FN}$
F1 score $\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
AUC $\int_{0}^{1} \Pr[TP](v)\,dv$
*TP: True positive, TN: True negative, FP: False positive, FN: False negative

Table 4.14.2 Confusion matrix


                    Actual positive   Actual negative
Predicted positive  TP                FP
Predicted negative  FN                TN
*TP: True positive, TN: True negative, FP: False positive, FN: False negative
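The formulas in Table 4.14.1 can be checked with a short sketch. The function names and the example counts and scores are hypothetical. The AUC is computed here with the rank-based formulation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which equals the area under the ROC curve; it is not the scikit-learn implementation.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion-matrix counts and predicted scores
metrics = classification_metrics(tp=20, fp=2, tn=88, fn=8)
auc = roc_auc([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.6, 0.2])
```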

4.15 Statistical analysis

1. Baseline demographic analyses were performed with the chi-squared test or Fisher's exact test
for categorical variables and the independent t-test for continuous variables.

2. Hardy-Weinberg equilibrium was tested for the distribution of allele frequency.

3. Paired t-test was used to compare model performance between the I and G modes, and between the
C and G modes.

4. Subgroup analysis was performed to analyze model performance and feature selection for the
patients undergoing cisplatin chemotherapy only.

5. Kaplan-Meier survival analysis and the log-rank test were applied to compare the cumulative
nephrotoxicity incidence between the high-risk and low-risk groups. Youden's index, defined as the
maximum value of (sensitivity + specificity - 1), was applied to find the best threshold separating
high-risk and low-risk AKI patients.

6. The DeLong test was used to test differences in AUC between the models.

7. All hypothesis tests were 2-tailed, with statistical significance set at P ≤ .05.

8. Data were analyzed using SAS 9.4 (SAS Institute, Cary, NC, USA), RStudio 1.3.1093 (RStudio,
Inc., Boston, MA, USA) and Python 3.9.1 (Python Software Foundation, Wilmington, DE, USA).
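The threshold search based on Youden's index in item 5 can be sketched as follows. The function name, the example labels and the predicted scores are hypothetical, and classifying a patient as high risk when the score is at or above the threshold is an assumption.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        # Assumption: scores at or above the threshold are classified as high risk
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical AKI labels and predicted risk scores for eight patients
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
threshold, j = youden_threshold(y_true, scores)
```

Patients with a score at or above the selected threshold would form the high-risk group for the Kaplan-Meier comparison.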

Chapter 5. Results

5.1 Baseline characteristics

Table 5.1 presents the baseline characteristics of the NSCLC patients in this study. In total, 118
subjects were recruited, of whom 28 (23.73%) developed AKI; 70 (59.32%) were male and 48 (40.68%)
were female; 51 (43.22%) patients had a smoking habit and 16 (13.56%) patients had a drinking
habit; 95 (80.51%) had adenocarcinoma, 16 (13.56%) had squamous cell carcinoma and 7 (5.93%) had
large cell carcinoma; 9 (7.63%) were at stage 1, 4 (3.39%) were at stage 2, 32 (27.12%) were at
stage 3 and 73 (61.86%) were at stage 4; 55 (46.61%) received cisplatin and 63 (53.39%) received
carboplatin. The mean age was 65.5 ± 11.4 years; the mean baseline serum creatinine was 0.90 ± 0.29
mg/dl.

Most of the characteristics did not differ significantly between the renal toxicity and non-renal
toxicity groups, including baseline serum creatinine and the average dose of platinum. However, the
patients in the renal toxicity group received significantly fewer chemotherapy cycles (2.07 ± 1.39
versus 3.54 ± 1.56, p < 0.001), and thus lower cumulative doses of platinum drugs than the
non-renal toxicity group (cisplatin: 145.12 ± 122.23 versus 228.04 ± 129.01 mg/m2, p < 0.05;
carboplatin: 405.47 ± 328.13 versus 787.65 ± 450.07 mg/m2, p < 0.05).

Table 5.1 Baseline characteristics of non-small cell lung cancer patients segregated by renal toxicity

Characteristics               Renal Toxicity (N = 28)   Non-Renal Toxicity (N = 90)   P value
                              n (%) or mean ± SD
Male                          14 (50.00)                56 (62.22)                    0.250
Age (year)                    61.61 ± 11.39             66.69 ± 11.17                 0.039*
Alcohol                       3 (10.71)                 13 (14.44)                    0.759
Smoking                       11 (39.29)                40 (44.44)                    0.630
Histology                                                                             0.844
  Adenocarcinoma              23 (82.14)                72 (80.00)
  Squamous cell carcinoma     3 (10.71)                 13 (14.44)
  Large cell carcinoma        2 (7.14)                  5 (5.56)
Stage                                                                                 0.389
  Ia/Ib                       3 (10.71)                 6 (6.67)
  IIa/IIb                     2 (7.14)                  2 (2.22)
  IIIa/IIIb                   8 (28.57)                 24 (26.67)
  IV                          15 (53.57)                58 (64.44)
Concomitant chemotherapy                                                              0.738
  Gemcitabine                 19 (67.86)                58 (64.44)
  Pemetrexed                  0 (0.00)                  3 (3.33)
  Vinorelbine                 4 (14.29)                 8 (8.89)
  Paclitaxel                  3 (10.71)                 11 (12.22)
  Etoposide                   0 (0.00)                  2 (2.22)
  CCRT^a                      2 (7.14)                  3 (3.33)
  Other^b                     0 (0.00)                  5 (5.56)
Platinum drug                                                                         <0.001***
  Cisplatin                   22 (78.57)                33 (36.67)
  Carboplatin                 6 (21.43)                 57 (63.33)
Cumulative dose (mg/m2)
  Cisplatin                   145.12 ± 122.23           228.04 ± 129.01               0.017*
  Carboplatin                 405.47 ± 328.13           787.65 ± 450.07               0.048*
Average dose (mg/m2)
  Cisplatin                   70.06 ± 16.95             66.58 ± 15.92                 0.442
  Carboplatin                 205.45 ± 157.68           218.06 ± 76.49                0.734
Chemotherapy cycles           2.07 ± 1.39               3.54 ± 1.56                   <0.001***
Chemotherapy course (weeks)   7.99 ± 8.41               12.35 ± 6.75                  0.006**
Baseline Scr^c (mg/dl)        0.81 ± 0.24               0.92 ± 0.29                   0.055
p < 0.05*, < 0.01**, < 0.001***; ^a CCRT: Concurrent chemoradiotherapy; ^b Other: 5-Fluorouracil,
Procarbazine, Alkeran and Vinblastine (PAVE), Cyclophosphamide, adriamycin and platinum (CAP);
^c Scr: serum creatinine
5.2 Hardy-Weinberg equilibrium for genotyping

Hardy-Weinberg equilibrium was tested for the eighteen SNPs in this study (Table 5.2). The genotype
distributions of all the SNPs selected in this study were consistent with Hardy-Weinberg
equilibrium.

Table 5.2 The Hardy-Weinberg equilibrium distribution in this study.


Gene      SNP        rs number     Genotype   Number   MAF     χ2      P value
OCT2      G808T      rs316019      G/G        90       0.119   2.138   0.144
                                   G/T        28
                                   T/T        0
ABCB1     C3435T     rs1045642     C/C        44       0.394   0.068   0.794
                                   C/T        55
                                   T/T        19
MATE1     G>A        rs2289669     A/A        33       0.479   0.122   0.727
                                   G/A        57
                                   G/G        28
ABCC2     -24C>T     rs717620      C/C        75       0.203   0.005   0.946
                                   C/T        38
                                   T/T        5
MATE1     g-66T>C    rs2252281     T/T        72       0.208   1.364   0.243
                                   T/C        43
                                   C/C        3
XPA       G23A       rs1800975     G/G        29       0.483   0.873   0.35
                                   G/A        64
                                   A/A        25
OGG1      C>G        rs1052133     G/G        44       0.377   0.488   0.485
                                   C/G        59
                                   C/C        15
XPG1104   G>C        rs17655       G/G        33       0.483   0.292   0.589
                                   G/C        56
                                   C/C        29
ERCC4     T>C        rs1799801     T/T        71       0.225   0.001   0.979
                                   T/C        41
                                   C/C        6
ERCC1     C118T      rs11615       C/C        52       0.326   0.428   0.513
                                   C/T        55
                                   T/T        11
ERCC1     C8092A     rs3212986     C/C        59       0.292   0.001   0.969
                                   C/A        49
                                   A/A        10
TP53      C>G        rs1042522     G/G        29       0.492   0.309   0.578
                                   C/G        62
                                   C/C        27
NAT2*5    C>T        rs1799929     C/C        114      0.017   0.035   0.851
                                   C/T        4
                                   T/T        0
NAT2*6    G>A        rs1799930     G/G        61       0.275   0.193   0.661
                                   G/A        49
                                   A/A        8
NAT2*7    G>A        rs1799931     G/G        79       0.178   0.215   0.643
                                   G/A        36
                                   A/A        3
NQO1      C609T      rs1800566     T/T        34       0.458   0.07    0.792
                                   C/T        60
                                   C/C        24
GSTP1     A313G      rs1695        A/A        71       0.212   1.603   0.205
                                   A/G        44
                                   G/G        3
UGT1A7    T622C      rs11692021    T/T        72       0.212   0.511   0.475
                                   T/C        42
                                   C/C        4
*MAF: Minor allele frequency
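The values in Table 5.2 can be reproduced with a chi-square goodness-of-fit test against the Hardy-Weinberg expected genotype counts. The sketch below is one way to do this with the standard library; the function name is hypothetical, and the test assumes a biallelic SNP with a non-zero minor allele frequency and 1 degree of freedom. The example uses the OCT2 rs316019 genotype counts from the table.

```python
import math


def hardy_weinberg_test(n_major_hom, n_het, n_minor_hom):
    """Return (MAF, chi-square, p value) for a biallelic SNP, 1 degree of freedom."""
    n = n_major_hom + n_het + n_minor_hom
    q = (n_het + 2 * n_minor_hom) / (2 * n)      # minor allele frequency
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_major_hom, n_het, n_minor_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with df = 1
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return q, chi2, p_value


# OCT2 rs316019: G/G = 90, G/T = 28, T/T = 0 (Table 5.2)
maf, chi2, p_value = hardy_weinberg_test(90, 28, 0)
```

For these counts the sketch gives a MAF of about 0.119, a chi-square of about 2.138 and a P value of about 0.144, in line with the first row of Table 5.2.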

5.3 Features selection and importance

Genetic algorithm was applied to select features for the 12 ML models under five-fold and
leave-one-out cross validation. In five-fold cross validation, 25, 10 and 9 features were selected
by LR-I, LR-C and LR-G, respectively (Table 5.3.1). Fewer features were selected by the LR
algorithm than by the other 3 algorithms. In leave-one-out cross validation, 26, 16 and 9 features
were selected by LR-I, LR-C and LR-G, respectively (Table 5.3.1), similar to the results of
five-fold cross validation.

Table 5.3.1 Number of selected features for 12 models by five-fold and leave-one-out cross validation
Five-fold cross validation Leave-one-out cross validation
Features I mode C mode G mode I mode C mode G mode
ANN 38 12 27 30 17 27
LR 25 10 9 26 16 9
RF 38 13 16 38 19 22
SVM 35 16 20 29 17 16
ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM, support vector
machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.3.2, Table 5.3.3 and Table 5.3.4 present the features selected by the 12 models through
genetic algorithm, together with their feature importance, in five-fold and leave-one-out cross
validation. For I mode, all four algorithms chose age group (classified by 65 years old), number of
chemotherapy cycles, AG in MATE1 G>A, AA in XPA G23A and CT in NQO1 C609T in five-fold cross
validation, while AG in XPA G23A, AA in GSTP1 A313G and TT in UGT1A7 T622C were chosen in
leave-one-out cross validation. For C mode, all four algorithms chose age group, number of
chemotherapy cycles, pemetrexed and cumulative carboplatin dose in five-fold cross validation, while
cancer stage and number of chemotherapy cycles were chosen in leave-one-out cross validation. For G
mode, all four algorithms chose CC in ERCC1 C118T and AA in GSTP1 A313G in five-fold cross
validation, while only AA in GSTP1 A313G was chosen in leave-one-out cross validation.

Table 5.3.2 Feature selection for integrated mode by five-fold and leave-one-out cross validation
I mode Five-fold cross validation Leave-one-out cross validation
Features ANN LR RF SVM ANN LR RF SVM
Height O O O
(-0.292)
Weight
Body surface O O
area (0.047)
Gender O O O O O
(0.025) (-0.548) (0.013)
Age O
Age groupa O O O O O O O O
(-1.573) (0.034) (-1.481) (0.022)
Alcohol
Smoking
TNM stageb O
TNM stage O O O O O O
groupc (-0.639) (-0.627)
Adenocarcino O O
ma (0.004)
Squamous cell O
carcinoma (0.063)
Large cell O O
carcinoma (0.001) (0.003)
Pathology O O O O
(0.000) (0.000)
CTd O O O O O O O O
(-0.959) (0.155) (-1.413) (0.129)
Weeke O O
(0.082)
Platinum O O O
(Cisplatin, (0.007)
Carboplatin)
Concomitant O
5-FUf (0.000)
Concomitant O O
CCRTg
Concomitant O O
Etoposide (0.000) (-0.089)
Concomitant O O
Gemcitabine
Concomitant O O O O
Otherh (-0.346) (0.000)
Concomitant O O O O
PAVEi (0.000) (0.000)
Concomitant O O O O
Paclitaxel (0.794)
Concomitant O O O O O
Pemetrexed (-0.202) (-0.351) (0.005)
Concomitant O O O O O
Vinorelbine (0.376) (0.489) (0.011)
Cisplatin O O O
cumulative (-0.014)
dose (mg)
Cisplatin O O O O
cumulative (0.341) (0.088)
dose (mg/m2)
Cisplatin O O
average dose
(mg)

Cisplatin O O O O
average dose (0.080) (0.091)
(mg/m2)
Carboplatin O O O O O
cumulative (0.096) (-1.244) (0.054)
dose (mg)
Carboplatin O O O O O
cumulative (-1.479) (0.054)
dose (mg/m2)
Carboplatin O O O
average dose (0.119) (0.070)
(mg)
Carboplatin O O
average dose (0.070) (0.063)
(mg/m2)
OCT2_GG O O O
(rs316019) (-0.111) (0.018)
ABCB1_CC O O O
(rs1045642) (0.029) (0.020)
ABCB1_CT O
(rs1045642) (0.015)
ABCB1_TT O O
(rs1045642) (-0.341)
MATE1_AA
(rs2289669)
MATE1_AG O O O O O O O O
(rs2289669) (0.986) (0.041) (1.233) (0.031)
MATE1_GG O O O O
(rs2289669) (-0.519)
ABCC2_CC O O O O O
(rs717620) (0.585) (0.438)
ABCC2_CT
(rs717620)
ABCC2_TT O O O O
(rs717620) (0.002) (-0.277)
MATE1_CC O O O O O O
(rs2252281) (-0.010) (0.000) (-0.185) (0.000)
MATE1_CT O O O O
(rs2252281) (0.026) (0.441)
MATE1_TT O O
(rs2252281) (0.018)
XPA_AA O O O O O O O
(rs1800975) (1.084) (0.017) (0.014)
XPA_AG O O O O O O O
(rs1800975) (0.027) (0.825) (0.013)
XPA_GG
(rs1800975)
OGG1_CC O O O
(rs1052133) (0.273) (0.009)
OGG1_CG O O
(rs1052133)
OGG1_GG O O O O
(rs1052133) (-0.174) (0.015) (0.325)
XPG_CC O
(rs17655) (0.200)
XPG_CG O O
(rs17655) (0.009)
XPG_GG O
(rs17655)
TP53_CC
(rs1042522)
TP53_CG O
(rs1042522)
TP53_GG
(rs1042522)
ERCC4_CC O O
(rs1799801) (0.000)
ERCC4_CT O O O
(rs1799801) (0.449)
ERCC4_TT O O O O O
(rs1799801) (0.013) (-0.743)
ERCC1_CC O O
(rs11615) (0.024) (0.013)
ERCC1_CT
(rs11615)
ERCC1_TT O O O O
(rs11615) (0.375) (0.007)
ERCC1_AA O O O O O O
(rs3212986) (-0.656) (-0.772)
ERCC1_AC O O
(rs3212986) (0.009)
ERCC1_CC O O O O O
(rs3212986) (0.010) (0.007)
NAT2*5_CC O O O O
(rs1799929) (0.002) (0.000)
NAT2*6_AA O O O
(rs1799930) (0.384) (0.006) (0.007)
NAT2*6_AG O O
(rs1799930) (0.015) (0.023)
NAT2*6_GG O O O O
(rs1799930) (0.018) (-0.344) (0.016)
NAT2*7_AA O O O O
(rs1799931) (-0.052) (0.000)
NAT2*7_AG
(rs1799931)
NAT2*7_GG
(rs1799931)
GSTP1_AA O O O O O O O
(rs1695) (-0.976) (0.015) (-0.807) (0.019)
GSTP1_AG O O O O O
(rs1695) (0.012) (0.012)
GSTP1_GG
(rs1695)
NQO_CC O
(rs1800566)
NQO_CT O O O O O O
(rs1800566) (-0.463) (0.025)
NQO_TT O O O O O
(rs1800566) (0.037) (0.026)
UGT1A7_CC O O
(rs11692021)
UGT1A7_CT O O O O O
(rs11692021) (0.019) (0.521)
UGT1A7_TT O O O O O O O
(rs11692021) (-0.704) (0.027) (-0.448) (0.014)
a Age group: classify patients by 65 years old.
b TNM stage: classify patients by cancer stage (I, II, III, IV, E).
c TNM stage group: classify patients by IA~IIb, IIIA~IIIB, IV~E.
d CT: number of chemotherapy cycles.
e Week: inclusion period.
f 5-FU: 5-Fluorouracil.
g CCRT: Concurrent chemoradiotherapy.
h Other: Procarbazine, Cyclophosphamide, adriamycin and platinum (CAP)
i PAVE: Procarbazine, Alkeran and Vinblastine.

Table 5.3.3 Feature selection for clinical mode by five-fold and leave-one-out cross validation
C mode Five-fold cross validation Leave-one-out cross validation
Features ANN LR RF SVM ANN LR RF SVM
Height O O O O O O
(-0.428) (0.835)
Weight O O
(0.034)
Body surface O O
area (0.107)
Gender O O O O O
(0.031) (0.015)
Age O O O O
(-0.363)
Age groupa O O O O O O O
(-1.023) (0.038) (0.000) (0.085)
Alcohol O
Smoking O
(0.002)
TNM stageb O O O O
(-0.197) (0.025)
TNM stage O O O O O
groupc (1.046)

Adenocarcino O O
ma (0.001)
Squamous cell O O O
carcinoma (-0.329)
Large cell O
carcinoma (0.000)
Pathology O O O O O
(-0.199) (0.000) (0.216)
CTd O O O O O O O O
(-0.847) (0.150) (-0.661) (0.262)
Weeke O O
(0.126)
Platinum O O O O
(Cisplatin, (0.040) (0.191) (0.083)
Carboplatin)
Concomitant O O O O
5-FUf (-0.418) (0.000)
Concomitant O O
CCRTg (0.000)
Concomitant O O O O O
Etoposide (-0.390) (0.000)
Concomitant
Gemcitabine
Concomitant O O O O O O
Otherh (-0.255) (0.000) (0.455) (0.000)
Concomitant O O O
PAVEi (0.000)
Concomitant O
Paclitaxel (0.000)
Concomitant O O O O O O
Pemetrexed (-0.422) (0.000) (0.000)
Concomitant O
Vinorelbine
Cisplatin O O O O
cumulative (-0.052)
dose (mg)

Cisplatin O O O O O O
cumulative (0.151) (-0.052)
dose (mg/m2)
Cisplatin O O
average dose (0.319)
(mg)
Cisplatin O O O O
average dose (0.109) (-0.052)
(mg/m2)
Carboplatin O O O O
cumulative (-0.467) (1.341) (0.116)
dose (mg)
Carboplatin O O O O O O O
cumulative (-0.501) (0.169) (-0.762)
dose (mg/m2)
Carboplatin O O
average dose (0.162) (0.146)
(mg)
Carboplatin O
average dose (0.148)
(mg/m2)
a Age group: patients classified at 65 years of age.
b TNM stage: patients classified by cancer stage (I, II, III, IV, E).
c TNM stage group: patients classified as IA~IIB, IIIA~IIIB, or IV~E.
d CT: number of chemotherapy cycles.
e Week: inclusion period.
f 5-FU: 5-Fluorouracil.
g CCRT: Concurrent chemoradiotherapy.
h Other: Procarbazine, Cyclophosphamide, adriamycin and platinum (CAP).
i PAVE: Procarbazine, Alkeran and Vinblastine.

Table 5.3.4 Feature selection for genomic mode by five-fold and leave-one-out cross validation
G mode Five-fold cross validation Leave-one-out cross validation
Features ANN LR RF SVM ANN LR RF SVM
OCT2_GG O O O O
(rs316019) (0.835) (0.063)
ABCB1_CC O

(rs1045642)
ABCB1_CT O O O
(rs1045642) (0.068)
ABCB1_TT O O O O
(rs1045642) (0.000)
MATE1_AA O O
(rs2289669)
MATE1_AG O O O O O O
(rs2289669) (0.829) (0.104) (-0.363) (0.125)
MATE1_GG O O O
(rs2289669) (0.019)
ABCC2_CC O O
(rs717620)
ABCC2_CT O O
(rs717620)
ABCC2_TT O O
(rs717620) (0.000) (0.000)
MATE1_CC O O O
(rs2252281) (-0.229) (0.000)
MATE1_CT O O O O
(rs2252281) (0.065)
MATE1_TT O O O O O
(rs2252281) (0.081)
XPA_AA O O
(rs1800975)
XPA_AG O O
(rs1800975) (0.128)
XPA_GG O O O
(rs1800975) (0.000) (0.028)
OGG1_CC O O O O O O
(rs1052133) (0.761) (0.040)
OGG1_CG O O O
(rs1052133)
OGG1_GG O O O
(rs1052133)
XPG_CC O
(rs17655) (0.039)
XPG_CG O O
(rs17655)
XPG_GG O O
(rs17655)
TP53_CC O O O
(rs1042522) (0.064)
TP53_CG O O O
(rs1042522) (0.064)
TP53_GG O O
(rs1042522) (0.082) (0.013)
ERCC4_CC O O
(rs1799801) (0.000) (0.000)
ERCC4_CT O O
(rs1799801)
ERCC4_TT O O O O
(rs1799801) (-0.197) (0.000)
ERCC1_CC O O O O O O O
(rs11615) (-0.342) (0.061) (0.026)
ERCC1_CT O O O
(rs11615) (0.027)
ERCC1_TT O O O
(rs11615) (1.046) (0.020)
ERCC1_AA O O O O O O
(rs3212986) (-0.565) (-0.329) (0.035)
ERCC1_AC O
(rs3212986) (0.098)
ERCC1_CC
(rs3212986)
NAT2*5_CC O O
(rs1799929) (-0.189)
NAT2*6_AA O
(rs1799930) (0.000)
NAT2*6_AG O
(rs1799930)
NAT2*6_GG O O O O
(rs1799930)
NAT2*7_AA O O O O
(rs1799931) (-0.288) (0.000) (0.216)
NAT2*7_AG O O O
(rs1799931) (0.132) (-0.661) (0.074)
NAT2*7_GG O O O
(rs1799931) (0.186)
GSTP1_AA O O O O O O O O
(rs1695) (-0.499) (0.176) (0.191) (0.116)
GSTP1_AG O
(rs1695)
GSTP1_GG O O O O
(rs1695)
NQO_CC O
(rs1800566)
NQO_CT O O O O O
(rs1800566) (0.066)
NQO_TT
(rs1800566)
UGT1A7_CC O O O
(rs11692021) (0.000) (0.000)
UGT1A7_CT O
(rs11692021) (0.532)
UGT1A7_TT O O O
(rs11692021)

5.4 Sequential backward selection

The results of sequential backward selection (SBS) for LR, RF, and SVM under five-fold and
leave-one-out cross validation are shown in Figure 5.4.1 and Figure 5.4.2. For five-fold cross
validation, the best SBS-estimated accuracy of LR-I was about 0.898, reached with 18 to 31 features
(Figure 5.4.1A). GA selected 25 features for LR-I, which falls within the range estimated by SBS.
Similarly, the best SBS-estimated accuracy was 0.864 with 43 features for RF-I and 0.856 with 22 to
58 features for SVM-I, close to the numbers of features selected by GA (Figure 5.4.1B and Figure
5.4.1C). For leave-one-out cross validation, the best SBS-estimated accuracy of LR-I was again about
0.898, reached with 22 to 26 features (22, 23, 25, or 26; Figure 5.4.2A). The 25 features selected by
GA for LR-I fall within this range, as they did in five-fold cross validation. The best SBS-estimated
accuracy was 0.873 with 22 features for RF-I and 0.873 with 17 to 32 features for SVM-I, again similar
to the GA results (Figure 5.4.2B and Figure 5.4.2C). We also ran SBS for C and G mode under both
cross-validation schemes, but the results did not agree with GA as closely as those for I mode did
(detailed information in Appendix 3 and 4).
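
The SBS procedure summarized above can be sketched as a greedy loop. The following is a minimal illustration only, not the code used in this study: the function name, the feature names, and the toy scoring function are hypothetical, and in practice the score would be the cross-validated accuracy of LR, RF, or SVM on the candidate feature subset.

```python
def sequential_backward_selection(features, score_fn, min_features=1):
    """Greedy SBS: repeatedly drop the feature whose removal scores best.

    Returns a list of (n_features, score, subset) tuples, one per step.
    """
    current = list(features)
    history = [(len(current), score_fn(current), tuple(current))]
    while len(current) > min_features:
        # Try removing each remaining feature in turn and keep the best subset.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score_fn)
        history.append((len(current), score_fn(current), tuple(current)))
    return history


# Hypothetical stand-in for cross-validated accuracy: it rewards two
# "informative" features and slightly penalises the subset size.
INFORMATIVE = {"age_group", "ct_cycles"}


def toy_score(subset):
    hits = len(INFORMATIVE & set(subset))
    return hits / len(INFORMATIVE) - 0.01 * len(subset)


history = sequential_backward_selection(
    ["age_group", "ct_cycles", "height", "weight"], toy_score
)
# The step with the highest score gives the SBS-estimated best feature count.
best_n, best_score, best_subset = max(history, key=lambda step: step[1])
print(best_n, best_subset)
```

Plotting the score in `history` against the number of features gives a curve of the kind shown in Figure 5.4.1 and Figure 5.4.2.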


(A) LR-I: best accuracy estimated by SBS 0.898; number of selected features: 18 to 31
(B) RF-I: best accuracy estimated by SBS 0.864; number of selected features: 43
(C) SVM-I: best accuracy estimated by SBS 0.856; number of selected features: 22 to 58

Figure 5.4.1 The plot of sequential backward selection (SBS) by logistic regression (LR), random
forest (RF), and support vector machine (SVM) of integrated mode in five-fold cross validation
*SVM, support vector machine; I, integrated

(A) LR-I: best accuracy estimated by SBS 0.898; number of selected features: 22, 23, 25, 26
(B) RF-I: best accuracy estimated by SBS 0.873; number of selected features: 22
(C) SVM-I: best accuracy estimated by SBS 0.873; number of selected features: 17~32

Figure 5.4.2 The plot of sequential backward selection (SBS) by logistic regression (LR), random
forest (RF), and support vector machine (SVM) of integrated mode in leave-one-out cross validation
*SVM, support vector machine; I, integrated

5.5 Model Performance

The average accuracy, precision, recall, and F1 score in five-fold cross validation are compared
in Table 5.5.1 and Table 5.5.2. The best-performing model was ANN-I, with an average accuracy of
0.923 and an average F1 score of 0.808. For the ANN algorithm, none of the four measurements differed
significantly among ANN-I, ANN-C, and ANN-G. For LR, RF, and SVM, I and C mode performed
significantly better than G mode on at least one measurement. In the LR analyses, the average F1
scores of LR-I and LR-C were significantly higher than that of LR-G (0.716 versus 0.124, p < 0.05;
0.590 versus 0.124, p < 0.05). Similarly, the average accuracies of RF-I and RF-C were significantly
higher than that of RF-G (0.847 versus 0.704, p < 0.05; 0.839 versus 0.704, p < 0.05). The average
precisions of SVM-I and SVM-C were significantly higher than that of SVM-G (1.000 versus 0.613,
p < 0.05; 0.950 versus 0.613, p < 0.05).

The average accuracy, precision, recall, and F1 score in leave-one-out cross validation are
compared in Table 5.5.1 and Table 5.5.3. As in five-fold cross validation, the best-performing
model was ANN-I, with an average accuracy of 0.907 and an average F1 score of 0.766. Compared with
five-fold cross validation, more of the between-mode comparisons reached statistical significance in
leave-one-out cross validation. I and C mode performed significantly better than G mode for ANN and
LR. In the ANN analyses, the average F1 scores of ANN-I and ANN-C were significantly higher than
that of ANN-G (0.766 versus 0.634, p < 0.05; 0.745 versus 0.634, p < 0.05). Similarly, the average
F1 scores of LR-I and LR-C were significantly higher than that of LR-G (0.739 versus 0.236, p < 0.05;
0.522 versus 0.236, p < 0.05). For RF and SVM, in contrast, G mode outperformed C mode on F1 score.
The average F1 score of RF-G was significantly higher than that of RF-C (0.580 versus 0.550,
p < 0.05), and the average F1 score of SVM-G was significantly higher than those of both SVM-I and
SVM-C (0.708 versus 0.619, p < 0.05; 0.708 versus 0.433, p < 0.05).
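
As a reminder of how the four measurements relate to one another, the sketch below computes them from a confusion matrix. The counts are not reported in this thesis; they are an inferred, illustrative set (18 true positives, 1 false positive, 10 false negatives, 89 true negatives) that is consistent with the 28 renal toxicity patients among 118 and reproduces the leave-one-out values reported for ANN-I. The function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted AKI cases that are true AKI
    recall = tp / (tp + fn)      # share of true AKI cases that are detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Inferred counts (an assumption, see the paragraph above): 28 AKI and 90 non-AKI patients.
metrics = classification_metrics(tp=18, fp=1, fn=10, tn=89)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the high precision and lower recall of ANN-I become concrete: almost every patient flagged as high risk developed AKI, while roughly one third of AKI patients were missed.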

Table 5.5.1 The average accuracy, precision, recall and F1 score for 12 models in five-fold and leave-
one-out cross validation.
Five-fold cross validation Leave-one-out cross validation
Model Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score
ANN-I 0.923 0.950 0.713 0.808 0.907* 0.947 0.643* 0.766*
ANN-C 0.873 0.900 0.533 0.634 0.890* 0.826 0.679* 0.745*
ANN-G 0.864 0.826 0.613 0.668 0.873 1.000 0.464 0.634
LR-I 0.898* 0.960 0.607* 0.716* 0.898* 0.944 0.607* 0.739*
LR-C 0.839 0.805 0.533* 0.590* 0.814* 0.667 0.429* 0.522*
LR-G 0.779 0.400 0.073 0.124 0.780 0.667 0.143 0.236
RF-I 0.847* 0.950* 0.393 0.539 0.847* 0.750* 0.536* 0.625*
RF-C 0.839* 0.853* 0.427 0.541 0.847* 0.917* 0.393* 0.550*
RF-G 0.704 0.425 0.720 0.529 0.754 0.488 0.714 0.580
SVM-I 0.856 1.000* 0.393 0.552 0.864 0.929 0.464* 0.619*
SVM-C 0.847 0.950* 0.393 0.539 0.822* 0.889 0.286* 0.433*
SVM-G 0.805 0.613 0.573 0.583 0.881 0.850 0.607 0.708
p < 0.05*; ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM, support
vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.5.2 Accuracy, precision, recall, and F1 score for 12 models in five-fold cross validation.
Model Testing set
Accuracy CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 0.917 0.958 1.000 0.826 0.913 0.923 0.187
ANN-C 0.833 0.917 0.917 0.783 0.913 0.873 0.841
ANN-G 0.917 0.875 0.833 0.870 0.826 0.864 Reference
LR-I 0.875 0.917 0.958 0.826 0.913 0.898 0.008*
LR-C 0.750 0.917 0.833 0.783 0.913 0.839 0.107
LR-G 0.792 0.833 0.750 0.739 0.783 0.779 Reference
RF-I 0.875 0.875 0.833 0.783 0.870 0.847 0.013*
RF-C 0.833 0.875 0.792 0.783 0.913 0.839 0.025*
RF-G 0.792 0.667 0.625 0.739 0.696 0.704 Reference
SVM-I 0.875 0.875 0.875 0.783 0.870 0.856 0.174
SVM-C 0.875 0.875 0.833 0.783 0.870 0.847 0.138
SVM-G 0.875 0.833 0.708 0.739 0.870 0.805 Reference
Precision CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 1.000 1.000 1.000 0.750 1.000 0.950 0.323
ANN-C 0.667 1.000 0.833 1.000 1.000 0.900 0.265
ANN-G 0.714 1.000 0.667 1.000 0.750 0.826 Reference
LR-I 1.000 0.800 1.000 1.000 1.000 0.960 0.108
LR-C 0.400 1.000 0.625 1.000 1.000 0.805 0.101
LR-G 0.000 1.000 0.000 0.000 1.000 0.400 Reference
RF-I 1.000 1.000 0.750 1.000 1.000 0.950 <0.001*
RF-C 0.667 1.000 0.600 1.000 1.000 0.853 0.009*
RF-G 0.500 0.364 0.333 0.500 0.429 0.425 Reference
SVM-I 1.000 1.000 1.000 1.000 1.000 1.000 0.004*
SVM-C 1.000 1.000 0.750 1.000 1.000 0.950 0.002*
SVM-G 0.667 0.667 0.429 0.500 0.800 0.613 Reference
Recall CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 0.600 0.800 1.000 0.500 0.667 0.713 0.523
ANN-C 0.400 0.600 0.833 0.167 0.667 0.533 0.650
ANN-G 1.000 0.400 0.667 0.500 0.500 0.613 Reference
LR-I 0.400 0.800 0.833 0.333 0.667 0.607 0.004*
LR-C 0.400 0.600 0.833 0.167 0.667 0.533 0.013*
LR-G 0.000 0.200 0.000 0.000 0.167 0.073 Reference
RF-I 0.400 0.400 0.500 0.167 0.500 0.393 0.103
RF-C 0.400 0.400 0.500 0.167 0.667 0.427 0.169
RF-G 0.800 0.800 0.500 1.000 0.500 0.720 Reference
SVM-I 0.400 0.400 0.500 0.167 0.500 0.393 0.095
SVM-C 0.400 0.400 0.500 0.167 0.500 0.393 0.095
SVM-G 0.800 0.400 0.500 0.500 0.667 0.573 Reference
F1 score CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 0.750 0.889 1.000 0.600 0.800 0.808 0.198
ANN-C 0.500 0.750 0.833 0.286 0.800 0.634 0.811
ANN-G 0.833 0.571 0.667 0.667 0.600 0.668 Reference
LR-I 0.571 0.800 0.909 0.500 0.800 0.716 0.002*
LR-C 0.400 0.750 0.714 0.286 0.800 0.590 0.003*
LR-G 0.000 0.333 0.000 0.000 0.286 0.124 Reference
RF-I 0.571 0.571 0.600 0.286 0.667 0.539 0.928
RF-C 0.500 0.571 0.545 0.286 0.800 0.541 0.927
RF-G 0.615 0.500 0.400 0.667 0.462 0.529 Reference
SVM-I 0.571 0.571 0.667 0.286 0.667 0.552 0.707
SVM-C 0.571 0.571 0.600 0.286 0.667 0.539 0.542
SVM-G 0.727 0.500 0.462 0.500 0.727 0.583 Reference
p < 0.05*; CV, cross validation; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.
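
This section does not state which test produced the P values in Table 5.5.2. Assuming they come from a paired t-test across the five folds against the G-mode reference (an assumption, not confirmed here), the comparison of LR-I with LR-G on accuracy can be sketched as follows. The fold values are copied from Table 5.5.2; the function name is hypothetical, and 2.776 is the two-sided 5% critical value of Student's t distribution with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Fold-wise accuracy of LR-I and LR-G from Table 5.5.2 (CV 1 to CV 5).
lr_i = [0.875, 0.917, 0.958, 0.826, 0.913]
lr_g = [0.792, 0.833, 0.750, 0.739, 0.783]

t_stat = paired_t_statistic(lr_i, lr_g)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value, 4 degrees of freedom
print(round(t_stat, 2), t_stat > T_CRIT_DF4)
```

Under this assumption the statistic exceeds the critical value, which agrees with the significant result marked for LR-I accuracy in the table.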

Table 5.5.3 Accuracy, precision, recall and F1 score for 12 models in leave-one-out cross validation.
Model Testing set
Accuracy Average P value Precision Average P value
ANN-I 0.907 0.045* ANN-I 0.947 0.331
ANN-C 0.890 0.158 ANN-C 0.826 0.083
ANN-G 0.873 Reference ANN-G 1.000 Reference
LR-I 0.898 < 0.001* LR-I 0.944 0.175
LR-C 0.814 0.045* LR-C 0.667 0.175
LR-G 0.780 Reference LR-G 0.667 Reference
RF-I 0.847 < 0.001* RF-I 0.750 0.021*
RF-C 0.847 < 0.001* RF-C 0.917 < 0.001*
RF-G 0.754 Reference RF-G 0.488 Reference
SVM-I 0.864 0.158 SVM-I 0.929 0.336
SVM-C 0.822 0.008* SVM-C 0.889 0.347
SVM-G 0.881 Reference SVM-G 0.850 Reference
Recall Average P value F1 score Average P value
ANN-I 0.643 0.022* ANN-I 0.766 0.014*
ANN-C 0.679 0.011* ANN-C 0.745 0.007*
ANN-G 0.464 Reference ANN-G 0.634 Reference
LR-I 0.607 < 0.001* LR-I 0.739 < 0.001*
LR-C 0.429 0.003* LR-C 0.522 0.002*
LR-G 0.143 Reference LR-G 0.236 Reference
RF-I 0.536 0.022* RF-I 0.625 0.015*
RF-C 0.393 0.001* RF-C 0.550 0.001*
RF-G 0.714 Reference RF-G 0.580 Reference
SVM-I 0.464 0.043* SVM-I 0.619 0.042*
SVM-C 0.286 0.001* SVM-C 0.433 <0.001*
SVM-G 0.607 Reference SVM-G 0.708 Reference
p < 0.05*; CV, cross validation; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Figure 5.5.1 shows the ROC curves of the 12 models in five-fold cross validation (detailed
comparison in Table 5.5.4 and Table 5.5.5). Consistent with the accuracy and F1 score results,
ANN-I showed the highest AUC, 0.900, among the 12 models. The AUCs of ANN-I, LR-I, and RF-I were
significantly higher than those of ANN-G, LR-G, and RF-G, respectively (0.900 versus 0.744; 0.891
versus 0.688; 0.872 versus 0.724, p < 0.05). However, the AUC did not differ significantly between
C and G mode for any algorithm, nor between any two algorithms within the same mode.
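
For readers less familiar with the AUC, it equals the probability that a randomly chosen AKI patient receives a higher predicted risk than a randomly chosen non-AKI patient, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the scores are made-up illustrative values, not study data, and the function name is hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive case ranked above negative case
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted risks for 3 AKI and 4 non-AKI patients.
aki_scores = [0.9, 0.8, 0.4]
non_aki_scores = [0.7, 0.3, 0.2, 0.1]
auc = auc_from_scores(aki_scores, non_aki_scores)
print(round(auc, 3))
```

Here 11 of the 12 pairs are ordered correctly, so the AUC is 11/12.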

(A) (B) (C)

Model I mode [95% CI] C mode [95% CI] G mode [95% CI]
ANN 0.900 [0.835-0.965] 0.819 [0.731-0.907] 0.744 [0.611-0.876]
LR 0.891 [0.825-0.958] 0.818 [0.731-0.905] 0.688 [0.582-0.795]
RF 0.872 [0.799-0.945] 0.816 [0.732-0.900] 0.724 [0.622-0.826]
SVM 0.862 [0.780-0.945] 0.792 [0.689-0.895] 0.813 [0.722-0.903]
Figure 5.5.1 The receiver operating characteristic (ROC) curve and area under the receiver operating
characteristic curve (AUC) generated by 12 models in five-fold cross validation
*CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random forest;
SVM, support vector machine; AUC, area under the receiver operating characteristic curve

Table 5.5.4 The area under the receiver operating characteristic curve (AUC) of the testing set
calculated for the 12 models in five-fold cross validation.
Model Testing set [95% CI] P value
ANN-I 0.900 [0.835-0.965] 0.033*
ANN-C 0.819 [0.731-0.907] 0.399
ANN-G 0.744 [0.611-0.876] Reference
LR-I 0.891 [0.825-0.958] < 0.001*
LR-C 0.818 [0.731-0.905] 0.079
LR-G 0.688 [0.582-0.795] Reference
RF-I 0.872 [0.799-0.945] 0.026*
RF-C 0.816 [0.732-0.900] 0.207
RF-G 0.724 [0.622-0.826] Reference
SVM-I 0.862 [0.780-0.945] 0.459
SVM-C 0.792 [0.689-0.895] 0.775
SVM-G 0.813 [0.722-0.903] Reference
p < 0.05*; CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.
Table 5.5.5 The DeLong test calculated for different modes in five-fold cross validation
Algorithm/P value I mode C mode G mode
ANN & RF 0.423 0.925 0.805
LR & RF 0.606 0.937 0.512
SVM & RF 0.807 0.593 0.146
SVM & LR 0.198 0.403 0.068
SVM & ANN 0.235 0.362 0.411
LR & ANN 0.695 0.975 0.466
*ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM, support vector
machine; I, integrated mode; C, clinical mode; G, genomic mode.

Figure 5.5.2 plots the ROC curves of the 12 models in leave-one-out cross validation (detailed
comparison in Table 5.5.6 and Table 5.5.7). Consistent with the results from five-fold cross
validation, ANN-I showed the highest AUC, 0.891, among the 12 models. The AUCs of ANN-I and ANN-C
were significantly higher than that of ANN-G, and the AUC of LR-I was significantly higher than that
of LR-G (0.891 versus 0.696; 0.867 versus 0.696; 0.887 versus 0.665, p < 0.05). However, for LR, RF,
and SVM the AUC of C mode did not differ significantly from that of G mode.

(A) (B) (C)

Model I mode [95% CI] C mode [95% CI] G mode [95% CI]
ANN 0.891 [0.825-0.957] 0.867 [0.786-0.947] 0.696 [0.561-0.830]
LR 0.887 [0.821-0.952] 0.785 [0.685-0.885] 0.665 [0.543-0.786]
RF 0.850 [0.770-0.931] 0.836 [0.752-0.920] 0.760 [0.664-0.855]
SVM 0.884 [0.810-0.958] 0.786 [0.679-0.893] 0.785 [0.662-0.907]

Figure 5.5.2 The receiver operating characteristic (ROC) curve and area under the receiver operating
characteristic curve (AUC) generated by 12 models in leave-one-out cross validation
*CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random forest;
SVM, support vector machine; AUC, area under the receiver operating characteristic curve

Table 5.5.6 The area under the receiver operating characteristic curve (AUC) of the testing set
calculated for the 12 models in leave-one-out cross validation
Model Testing set [95% CI] P value
ANN-I 0.891 [0.825-0.957] 0.012*
ANN-C 0.867 [0.786-0.947] 0.041*
ANN-G 0.696 [0.561-0.830] Reference
LR-I 0.887 [0.821-0.952] 0.002*
LR-C 0.785 [0.685-0.885] 0.192
LR-G 0.665 [0.543-0.786] Reference
RF-I 0.850 [0.770-0.931] 0.170
RF-C 0.836 [0.752-0.920] 0.264
RF-G 0.760 [0.664-0.855] Reference
SVM-I 0.884 [0.810-0.958] 0.198
SVM-C 0.786 [0.679-0.893] 0.989
SVM-G 0.785 [0.662-0.907] Reference
p < 0.05*; CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.5.7 The DeLong test calculated for different modes in leave-one-out cross validation
P value I mode C mode G mode
ANN & RF 0.321 0.519 0.431
LR & RF 0.411 0.140 0.062
SVM & RF 0.420 0.256 0.706
SVM & LR 0.903 0.961 0.165
SVM & ANN 0.786 0.104 0.364
LR & ANN 0.826 0.042* 0.700
p < 0.05*; ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM, support
vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

5.6 Survival analysis

The Kaplan-Meier plot of ANN-I under five-fold cross validation is shown in Figure 5.6.1.
Youden’s index, sensitivity, and specificity were 0.703, 0.714, and 0.989, respectively. The hazard
ratio of the high-risk AKI group relative to the low-risk AKI group was 18.77 (8.18-43.07,
p < 0.001), indicating that the ANN-I model can separate the high-risk patients among the 118
patients.

The Kaplan-Meier plot of ANN-I under leave-one-out cross validation is shown in Figure 5.6.2.
Youden’s index, sensitivity, and specificity were 0.632, 0.643, and 0.989, respectively. With the
low-risk AKI group as the reference, the hazard ratio of the high-risk AKI group was 18.24
(8.22-40.44, p < 0.001), indicating that the model distinguishes high-risk from low-risk patients.
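
Youden's index is defined as sensitivity plus specificity minus one. The short check below recomputes it from the sensitivity and specificity values reported above; the function name is illustrative only.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# (sensitivity, specificity) of ANN-I as reported in this section.
reported = {
    "five-fold": (0.714, 0.989),
    "leave-one-out": (0.643, 0.989),
}
for scheme, (sens, spec) in reported.items():
    print(scheme, round(youden_index(sens, spec), 3))
```

Both recomputed values agree with the reported indices of 0.703 and 0.632.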

Figure 5.6.1 The Kaplan-Meier plot for artificial neural network (ANN) of integrated mode in five-fold
cross validation

Figure 5.6.2 The Kaplan-Meier plot for artificial neural network (ANN) of integrated mode in leave-
one-out cross validation

5.7 Subgroup analysis

In this study, we present only the cisplatin subgroup analysis, because the prediction results
for carboplatin were distorted by the imbalanced data and the small number of AKI patients. Only the
meaningful data are therefore listed in the following pages.

Table 5.7.1 presents the baseline characteristics of the NSCLC patients in the cisplatin subgroup
analysis. In total, 55 subjects received cisplatin chemotherapy in this study, of whom 22 (40.00%)
developed AKI; 33 (60.00%) were male and 22 (40.00%) were female; 23 (41.82%) patients had a smoking
habit and 6 (10.91%) patients habitually consumed alcohol; 47 (85.45%) had adenocarcinoma, 4 (7.27%)
had squamous cell carcinoma, and 4 (7.27%) had large cell carcinoma; 4 (7.27%) were at stage I, 4
(7.27%) at stage II, 14 (25.45%) at stage III, and 33 (60.00%) at stage IV. The mean age was
61.1 ± 11.2 years; the mean baseline serum creatinine was 0.84 ± 0.21 mg/dl.

Most characteristics showed trends similar to those in the main analysis. Baseline serum
creatinine and average cisplatin dose did not differ significantly between the AKI and non-AKI
groups (0.80 ± 0.22 versus 0.86 ± 0.21 mg/dl; 70.06 ± 16.95 versus 66.58 ± 15.92 mg/m2, p > 0.05).
However, the renal toxicity group had significantly fewer chemotherapy cycles and a significantly
lower cumulative dose than the non-renal toxicity group (2.09 ± 1.54 versus 3.39 ± 1.56 cycles;
145.12 ± 122.23 versus 228.04 ± 129.01 mg/m2, p < 0.05).

Table 5.7.1 Baseline characteristics of non-small cell lung cancer patients segregated by renal toxicity
in cisplatin subgroup
Characteristics Renal Toxicity (N = 22) Non-Renal Toxicity (N = 33) P value
Values are n (%) or mean ± SD.
Male 13 (59.09) 20 (60.61) 0.911
Age (year) 60.36 ± 10.21 61.55 ± 11.94 0.705
Alcohol 2 (9.09) 4 (12.12) 1.000
Smoking 9 (40.91) 14 (42.42) 0.911
Histology 0.743
Adenocarcinoma 18 (81.82) 29 (87.88)
Squamous cell carcinoma 2 (9.09) 2 (6.06)
Large cell carcinoma 2 (9.09) 2 (6.06)
Stage 0.437
Ia/Ib 3 (13.64) 1 (3.03)
IIa/IIb 2 (9.09) 2 (6.06)
IIIa/IIIb 4 (18.18) 10 (30.30)
IV 13 (59.09) 20 (60.61)
Concomitant chemotherapy 0.435
Gemcitabine 15 (68.18) 17 (51.52)
Pemetrexed 0 (0.00) 3 (9.09)
Vinorelbine 4 (18.18) 5 (15.15)
Paclitaxel 1 (4.55) 3 (9.09)
Etoposide 0 (0.00) 1 (3.03)
CCRTa 2 (9.09) 1 (3.03)
Otherb 0 (0.00) 3 (9.09)
Cumulative dose (mg/m2) 145.12 ± 122.23 228.04 ± 129.01 0.017*
Average dose (mg/m2) 70.06 ± 16.95 66.58 ± 15.92 0.442
Chemotherapy cycles 2.09 ± 1.54 3.39 ± 1.56 0.004*
Chemotherapy course (weeks) 8.30 ± 9.35 12.05 ± 7.21 0.100
Baseline Scrc (mg/dl) 0.80 ± 0.22 0.86 ± 0.21 0.340
p < 0.05*, < 0.01**, < 0.001***; a CCRT: Concurrent chemoradiotherapy; b Other: 5-Fluorouracil,
Procarbazine, Alkeran and Vinblastine (PAVE), Cyclophosphamide, adriamycin and platinum (CAP);
c Scr: serum creatinine

The number of features selected by GA and the feature importance under five-fold cross
validation in the cisplatin subgroup are listed in Table 5.7.2 (detailed information in Table
5.7.3). Rather than using all 84 features, we removed the carboplatin-related features for the
cisplatin subgroup analysis. As in the main analysis, LR selected few features, with 21, 12, and 14
features in I, C, and G mode, respectively; this was the fewest of the four algorithms in I mode,
while ANN selected slightly fewer in C and G mode (Table 5.7.2). Table 5.7.4 lists the key features
selected by all four algorithms in the main and cisplatin subgroup analyses. In the cisplatin
subgroup, no clinical feature was selected in both I and C mode, and no genomic feature was selected
in both I and G mode. Comparing the main and cisplatin subgroup analyses, age group and AG of MATE1
(rs2289669) were selected in I mode in both analyses, as was the number of chemotherapy cycles in C
mode. In G mode, no genomic feature was selected in both analyses: the main analysis chose CC of
ERCC1 (rs11615) and AA of GSTP1 (rs1695), while the cisplatin subgroup analysis selected TT of ERCC4
(rs1799801) and TT of UGT1A7 (rs11692021). SBS in the cisplatin subgroup analysis did not produce as
clear an optimum as in the main analysis, owing to the smaller sample size (detailed information in
Appendix 5).
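
The "key features" in Table 5.7.4 are those selected by all four algorithms, which amounts to a set intersection of the per-algorithm selections. The sketch below illustrates the idea; the selections shown are hypothetical placeholders, not the actual GA output listed in Table 5.7.3, and the function name is an assumption.

```python
def selected_by_all(selections):
    """Features present in every algorithm's selected set."""
    return set.intersection(*(set(s) for s in selections.values()))


# Hypothetical selections per algorithm (placeholders for illustration only).
toy_selections = {
    "ANN": {"age_group", "MATE1_AG", "ct_cycles"},
    "LR": {"age_group", "MATE1_AG", "height"},
    "RF": {"age_group", "MATE1_AG", "ct_cycles", "weight"},
    "SVM": {"age_group", "MATE1_AG"},
}
key_features = selected_by_all(toy_selections)
print(sorted(key_features))
```

The same intersection can then be taken between the main and subgroup analyses to find the features shared by both.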

Table 5.7.2 Number of selected features for 12 machine learning models in cisplatin subgroup
Subgroup analysis
Number of features I mode C mode G mode
ANN 27 11 13
LR 21 12 14
RF 43 12 21
SVM 26 14 27
Total 79 29 50
ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM, support vector
machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.7.3 Feature selection for 12 models in cisplatin subgroup


I mode C mode
Features ANN LR RF SVM ANN LR RF SVM
Height O
(-0.327)
Weight O O
Body surface O O O
area (0.058)
Gender O O
(0.028)
Age O O
(0.096)
Age groupa O O O O O O O
(-0.938) (0.061) (0.129)
Alcohol O O O
(0.009) (0.000)
Smoking O
TNM stageb O O O O
(-0.258)
TNM stage O O O O
groupc (0.005)
Adenocarcino O O O
ma (-0.163) (0.003) (0.002)
Squamous cell O O O O O
carcinoma (0.000) (0.484) (0.000)
Large cell O O
carcinoma (0.000) (0.000)
Pathology O O O
(0.000) (0.119)
CTd O O O O O O O
(-0.890) (0.143) (-0.535) (0.373)
Weeke O O O
(0.433)

Concomitant O O O
5-FUf (-0.215) (-0.199)
Concomitant O O
CCRTg (0.000) (0.829)
Concomitant O O O O O
Etoposide (-0.326) (0.000)
Concomitant O O
Gemcitabine (-0.154) (-0.003)
Concomitant O O O
Otherh (-0.388) (0.000)
Concomitant O O
PAVEi (0.000) (0.000)
Concomitant O O O
Paclitaxel (0.000)
Concomitant O O O O O
Pemetrexed (-0.297) (0.000) (0.000)
Concomitant O O O
Vinorelbine (0.338)
Cisplatin O O O
cumulative (-0.596)
dose (mg)
Cisplatin O O O
cumulative (0.239)
dose (mg/m2)
Cisplatin O
average dose (0.506)
(mg)
Cisplatin O
average dose (0.229)
(mg/m2)
I mode G mode
OCT2_GG O O
(rs316019) (0.000)
ABCB1_CC O O O O O O
(rs1045642) (0.032) (0.000)
ABCB1_CT O O
(rs1045642) (-0.053)

ABCB1_TT O O O O
(rs1045642) (-0.680) (0.037)
MATE1_AA O O
(rs2289669) (0.000)
MATE1_AG O O O O O O O
(rs2289669) (0.802) (0.040) (0.835) (0.300)
MATE1_GG O O O
(rs2289669) (-0.306) (0.000)
ABCC2_CC O O O
(rs717620) (0.000) (0.000)
ABCC2_CT O O O O
(rs717620) (0.006) (-0.363) (0.000)
ABCC2_TT O O O
(rs717620) (0.069) (0.000)
MATE1_CC O O
(rs2252281) (0.000) (0.000)
MATE1_CT O O O O
(rs2252281) (0.415) (0.017)
MATE1_TT O O O
(rs2252281) (0.046) (0.100)
XPA_AA O O O O
(rs1800975) (-0.197) (0.000)
XPA_AG O O O O
(rs1800975) (0.984) (0.224)
XPA_GG O O
(rs1800975)
OGG1_CC O O O O O O
(rs1052133) (0.031) (1.046) (0.000)
OGG1_CG O O
(rs1052133) (-0.256) (0.023)
OGG1_GG O O
(rs1052133) (0.023)
XPG_CC O
(rs17655)
XPG_CG O O
(rs17655)
XPG_GG O
(rs17655)
TP53_CC
(rs1042522)
TP53_CG O O
(rs1042522) (-0.329)
TP53_GG O
(rs1042522) (0.000)
ERCC4_CC O O O O O
(rs1799801) (0.198) (0.216) (0.000)
ERCC4_CT O O O
(rs1799801) (0.000)
ERCC4_TT O O O O O O
(rs1799801) (0.071) (-0.661) (0.077)
ERCC1_CC O O O
(rs11615) (0.191) (0.000)
ERCC1_CT O
(rs11615)
ERCC1_TT O O O O O
(rs11615) (0.000) (0.000)
ERCC1_AA O O
(rs3212986) (0.000) (0.000)
ERCC1_AC O O O
(rs3212986) (0.010)
ERCC1_CC O O O O
(rs3212986) (0.022)
NAT2*5_CC O O O O O
(rs1799929) (-0.173) (0.000) (-0.418)
NAT2*6_AA O O
(rs1799930) (0.000)
NAT2*6_AG O O O
(rs1799930) (0.066)
NAT2*6_GG O O O O O
(rs1799930) (0.027) (0.455)
NAT2*7_AA O O O
(rs1799931)
NAT2*7_AG O
(rs1799931)
NAT2*7_GG O O O
(rs1799931) (0.002) (0.100)
GSTP1_AA O O O O O O O
(rs1695) (-0.460) (0.030) (-0.052)
GSTP1_AG O
(rs1695) (0.010)
GSTP1_GG O O
(rs1695)
NQO_CC
(rs1800566)
NQO_CT O O O O
(rs1800566) (0.017)
NQO_TT O O O O O O
(rs1800566) (0.662) (0.129) (1.341) (0.076)
UGT1A7_CC O
(rs11692021) (0.000)
UGT1A7_CT O O
(rs11692021) (0.013) (0.000)
UGT1A7_TT O O O O
(rs11692021) (-0.762) (0.100)
a Age group: patients classified at 65 years of age.
b TNM stage: patients classified by cancer stage (I, II, III, IV, E).
c TNM stage group: patients classified as IA~IIB, IIIA~IIIB, or IV~E.
d CT: number of chemotherapy cycles.
e Week: inclusion period.
f 5-FU: 5-Fluorouracil.
g CCRT: Concurrent chemoradiotherapy.
h Other: Procarbazine, Cyclophosphamide, adriamycin and platinum (CAP).
i PAVE: Procarbazine, Alkeran and Vinblastine.

Table 5.7.4 Features selected by all of four algorithms in main and cisplatin subgroup analysis
I mode
Features Definition Main analysis Subgroup analysis
Age group Classify patients into 2 groups by 65 years old O O
MATE1_1 Patients with AG of MATE1 G>A (rs2289669) O O
CT Number of chemotherapy cycles O
XPA Patients with AA of XPA G23A (rs1800975) O
NQO Patients with CT of NQO1 C609T (rs1800566) O
MATE1_2 Patients with CT of MATE1 g-66T>C (rs2252281) O
GSTP1 Patients with AA of GSTP1 A313G (rs1695) O
C mode
CT Number of chemotherapy cycles O O
Age group Classify patients into 2 groups by 65 years old O
Pemetrexed Concomitant chemotherapy pemetrexed drugs O
Carboplatin dose Carboplatin cumulative dose (mg/m2) O
Squamous cell carcinoma Patients with squamous cell carcinoma O
G mode
ERCC1_1 Patients with CC of ERCC1 C118T (rs11615) O
GSTP1 Patients with AA of GSTP1 A313G (rs1695) O
ERCC4 Patients with TT of ERCC4 T2505C (rs1799801) O
UGT1A7 Patients with TT of UGT1A7 C118T (rs11692021) O

Model performance metrics for the cisplatin subgroup are listed in Table 5.7.5 (detailed
information in Table 5.7.6). The best-performing model was ANN-I, with an average accuracy of 0.909
and an average F1 score of 0.887 in the cisplatin subgroup. Consistent with the main analysis, I mode
performed significantly better than G mode on at least one measurement for each of the four
algorithms, while C and G mode did not differ significantly for any algorithm (Table 5.7.6).

Table 5.7.5 The average accuracy, precision, recall, and F1 score for 12 models in cisplatin subgroup
Testing set
Model Accuracy Precision Recall F1 score
ANN-I 0.909* 0.920 0.870* 0.887*
ANN-C 0.818 0.893 0.650 0.708
ANN-G 0.800 0.950 0.550 0.670
LR-I 0.873 0.914 0.830* 0.845
LR-C 0.763 0.833 0.600 0.638
LR-G 0.854 1.000 0.640 0.776
RF-I 0.854 0.960* 0.680 0.772
RF-C 0.782 0.914 0.600 0.656
RF-G 0.763 0.708 0.730 0.703
SVM-I 0.782* 0.960* 0.510* 0.620*
SVM-C 0.691 0.600 0.240 0.333
SVM-G 0.600 0.200 0.040 0.067
p < 0.05*; AUC, area under curve; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.7.6 Accuracy, precision, recall, and F1 score for 12 models in cisplatin subgroup
Model Testing set
Accuracy CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 0.909 1.000 0.909 0.909 0.818 0.909 0.033*
ANN-C 0.727 0.909 0.909 0.636 0.909 0.818 0.704
ANN-G 0.727 0.909 0.909 0.727 0.727 0.800 Reference
LR-I 0.909 1.000 0.727 0.818 0.909 0.873 0.749
LR-C 0.636 0.818 0.818 0.636 0.909 0.763 0.142
LR-G 0.818 0.909 0.909 0.818 0.818 0.854 Reference
RF-I 0.818 0.818 0.909 0.818 0.909 0.854 0.089
RF-C 0.818 0.818 0.727 0.636 0.909 0.782 0.778
RF-G 0.818 0.727 0.727 0.818 0.727 0.763 Reference
SVM-I 0.818 0.727 0.909 0.727 0.727 0.782 0.003*
SVM-C 0.818 0.636 0.818 0.636 0.545 0.691 0.089
SVM-G 0.636 0.545 0.636 0.636 0.545 0.600 Reference
Precision CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 1.000 1.000 0.800 1.000 0.800 0.920 0.736
ANN-C 0.667 1.000 0.800 1.000 1.000 0.893 0.599
ANN-G 1.000 1.000 1.000 0.750 1.000 0.950 Reference
LR-I 1.000 1.000 0.571 1.000 1.000 0.914 0.374
LR-C 0.500 1.000 0.667 1.000 1.000 0.833 0.189
LR-G 1.000 1.000 1.000 1.000 1.000 1.000 Reference
RF-I 1.000 1.000 0.800 1.000 1.000 0.960 0.007*
RF-C 1.000 1.000 0.571 1.000 1.000 0.914 0.072
RF-G 0.750 0.571 0.667 0.800 0.750 0.708 Reference
SVM-I 1.000 1.000 0.800 1.000 1.000 0.960 0.017*
SVM-C 1.000 0.000 1.000 1.000 0.000 0.600 0.178
SVM-G 0.000 0.000 0.000 1.000 0.000 0.200 Reference
Recall CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 0.750 1.000 1.000 0.800 0.800 0.870 0.005*
ANN-C 0.500 0.750 1.000 0.200 0.800 0.650 0.516
ANN-G 0.250 0.750 0.750 0.600 0.400 0.550 Reference
LR-I 0.750 1.000 1.000 0.600 0.800 0.830 0.017*
LR-C 0.500 0.500 1.000 0.200 0.800 0.600 0.767
LR-G 0.500 0.750 0.750 0.600 0.600 0.640 Reference
RF-I 0.500 0.500 1.000 0.600 0.800 0.680 0.792
RF-C 0.500 0.500 1.000 0.200 0.800 0.600 0.569
RF-G 0.750 1.000 0.500 0.800 0.600 0.730 Reference
SVM-I 0.500 0.250 1.000 0.400 0.400 0.510 0.030*
SVM-C 0.500 0.000 0.500 0.200 0.000 0.240 0.178
SVM-G 0.000 0.000 0.000 0.200 0.000 0.040 Reference
F1 score CV 1 CV 2 CV 3 CV 4 CV 5 Average P value
ANN-I 0.857 1.000 0.889 0.889 0.800 0.887 0.036*
ANN-C 0.572 0.857 0.889 0.333 0.889 0.708 0.747
ANN-G 0.400 0.857 0.857 0.667 0.571 0.670 Reference
LR-I 0.857 1.000 0.727 0.750 0.889 0.845 0.311
LR-C 0.500 0.667 0.800 0.333 0.889 0.638 0.202
LR-G 0.667 0.857 0.857 0.750 0.750 0.776 Reference
RF-I 0.667 0.667 0.889 0.750 0.889 0.772 0.454
RF-C 0.667 0.667 0.727 0.333 0.889 0.656 0.719
RF-G 0.750 0.727 0.572 0.800 0.667 0.703 Reference
SVM-I 0.667 0.400 0.889 0.571 0.571 0.620 0.008*
SVM-C 0.667 0.000 0.667 0.333 0.000 0.333 0.178
SVM-G 0.000 0.000 0.000 0.333 0.000 0.067 Reference
p < 0.05*; CV, cross validation; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode

Figure 5.7.1 shows the ROC curves of the 12 models in the cisplatin subgroup (detailed comparisons
in Table 5.7.7 and Table 5.7.8). The model with the best predictive performance was ANN-I, with an AUC
of 0.888, consistent with the main analysis. Figure 5.7.2 plots the Kaplan-Meier estimate for ANN-I in
the cisplatin subgroup. Youden's index, sensitivity, and specificity were 0.803, 0.864, and 0.939,
respectively. The hazard ratio of the high-risk AKI group relative to the low-risk AKI group in the
ANN-I model was statistically significant (16.92 [4.93-58.03], p < 0.001), indicating that the model
can identify high-risk AKI patients in the cisplatin subgroup.
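The reported Youden's index can be checked against the reported sensitivity and specificity. The sketch below assumes the standard definition J = sensitivity + specificity - 1; the function name `youden_index` is illustrative and does not come from the study's code.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# Sensitivity and specificity reported for ANN-I in the cisplatin subgroup.
j = youden_index(sensitivity=0.864, specificity=0.939)
print(round(j, 3))  # 0.803
```

The result agrees with the reported index of 0.803, so the three values are internally consistent.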

[ROC curve panels (A), (B), (C)]

Model I mode [95% CI] C mode [95% CI] G mode [95% CI]
ANN 0.888 [0.777-1.000] 0.679 [0.501-0.857] 0.784 [0.649-0.920]
LR 0.828 [0.696-0.959] 0.654 [0.489-0.820] 0.818 [0.686-0.951]
RF 0.829 [0.702-0.956] 0.730 [0.538-0.877] 0.736 [0.587-0.884]
SVM 0.803 [0.663-0.944] 0.661 [0.501-0.821] 0.871 [0.760-0.981]
Figure 5.7.1 The receiver operating characteristic (ROC) curve and area under the receiver operating
characteristic curve (AUC) of 12 models in cisplatin subgroup
*CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random forest;
SVM, support vector machine; AUC, area under the receiver operating characteristic curve

Table 5.7.7 The area under the receiver operating characteristic curve (AUC) of the testing set
calculated for the 12 models in cisplatin subgroup
Model Testing set [95% CI] P value
ANN-I 0.888 [0.777-1.000] 0.134
ANN-C 0.679 [0.501-0.857] 0.347
ANN-G 0.784 [0.649-0.920] Reference
LR-I 0.828 [0.696-0.959] 0.913
LR-C 0.654 [0.489-0.820] 0.152
LR-G 0.818 [0.686-0.951] Reference
RF-I 0.829 [0.702-0.956] 0.312
RF-C 0.730 [0.538-0.877] 0.960
RF-G 0.736 [0.587-0.884] Reference
SVM-I 0.803 [0.663-0.944] 0.458
SVM-C 0.661 [0.501-0.821] 0.048*
SVM-G 0.871 [0.760-0.981] Reference
p < 0.05*; CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random
forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.7.8 The DeLong test calculated for different modes in cisplatin subgroup
P value I mode C mode G mode
ANN & RF 0.263 0.357 0.581
LR & RF 0.976 0.126 0.228
SVM & RF 0.548 0.303 0.128
SVM & LR 0.577 0.919 0.567
SVM & ANN 0.062 0.805 0.390
LR & ANN 0.204 0.619 0.454
*ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM, support vector
machine; I, integrated mode; C, clinical mode; G, genomic mode.

Figure 5.7.2 The Kaplan-Meier plot for artificial neural network (ANN) of integrated mode in cisplatin
subgroup
CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF, random forest;
SVM, support vector machine; I, integrated mode

5.8 Performance comparison with and without dose-related features

In this study, we also compared model performance with and without dose-related features for the
main and cisplatin subgroup analyses (Tables 5.8.1 to 5.8.6). In five-fold cross validation, the F1 score
of most models decreased by more than 0.05 after removing dose-related features, especially for RF-C
and SVM-I (0.541 versus 0.000 and 0.552 versus 0.000, respectively; Table 5.8.1). Similarly, in
leave-one-out cross validation, the F1 scores of SVM-I and SVM-C both fell to 0.000 (from 0.619 and
0.433, respectively; Table 5.8.3). In addition, for every algorithm in the main analysis, the AUC of at
least one model dropped by more than 0.05 compared to baseline, especially for LR-I in five-fold cross
validation (0.891 versus 0.809; Tables 5.8.2 and 5.8.4). For the cisplatin subgroup analysis, the largest
decreases in F1 score were for ANN-I and ANN-C (0.887 versus 0.650 and 0.708 versus 0.421,
respectively; Table 5.8.5). The AUC of ANN-I also dropped by more than 0.05 compared to baseline
performance (0.888 versus 0.727; Table 5.8.6).
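The "decrease > 0.05" rule used to mark entries in the tables of this section can be sketched as follows. This is a minimal illustration, not the study's own code: the function name `flag_decreases` and the dictionary layout are assumptions, and the AUC values are copied from Table 5.8.2 (five-fold cross validation).

```python
THRESHOLD = 0.05  # a drop larger than this is marked with an asterisk

# (AUC with dose-related features, AUC without), from Table 5.8.2.
auc_five_fold = {
    "ANN-I": (0.900, 0.872),
    "ANN-C": (0.819, 0.780),
    "LR-I": (0.891, 0.809),
    "LR-C": (0.818, 0.779),
    "RF-I": (0.872, 0.848),
    "RF-C": (0.816, 0.812),
    "SVM-I": (0.862, 0.808),
    "SVM-C": (0.792, 0.739),
}


def flag_decreases(scores, threshold=THRESHOLD):
    """Return the models whose score dropped by more than `threshold`."""
    return [
        model
        for model, (with_dose, without_dose) in scores.items()
        if with_dose - without_dose > threshold
    ]


print(flag_decreases(auc_five_fold))  # ['LR-I', 'SVM-I', 'SVM-C']
```

The flagged models match the asterisks in Table 5.8.2.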

Table 5.8.1 Comparison of the average accuracy, precision, recall and F1 score for with and without
dose-related features in five-fold cross validation.
With dose-related features Without dose-related features
Model Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score
ANN-I 0.923 0.950 0.713 0.808 0.889 0.867* 0.647* 0.734*
ANN-C 0.873 0.900 0.533 0.634 0.864 0.950 0.467* 0.598*
LR-I 0.898 0.960 0.607 0.716 0.822* 0.761* 0.500* 0.549*
LR-C 0.839 0.805 0.533 0.590 0.822 0.819 0.433* 0.524
RF-I 0.847 0.950 0.393 0.539 0.796* 0.600* 0.147* 0.233*
RF-C 0.839 0.853 0.427 0.541 0.762* 0.000* 0.000* 0.000*
SVM-I 0.856 1.000 0.393 0.552 0.762* 0.000* 0.000* 0.000*
SVM-C 0.847 0.950 0.393 0.539 0.788* 0.000* 0.107* 0.181*
Decrease > 0.05*; ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM,
support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.8.2 Comparison of the area under the receiver operating characteristic curve (AUC) of the
testing set calculated for the 12 models in five-fold cross validation
Model Testing set AUC with dose-related features [95% CI] Testing set AUC without dose-related features [95% CI]
ANN-I 0.900 [0.835-0.965] 0.872 [0.800-0.945]
ANN-C 0.819 [0.731-0.907] 0.780 [0.666-0.893]
LR-I 0.891 [0.825-0.958] 0.809 [0.710-0.908] *
LR-C 0.818 [0.731-0.905] 0.779 [0.668-0.891]
RF-I 0.872 [0.799-0.945] 0.848 [0.769-0.928]
RF-C 0.816 [0.732-0.900] 0.812 [0.727-0.898]
SVM-I 0.862 [0.780-0.945] 0.808 [0.709-0.907] *
SVM-C 0.792 [0.689-0.895] 0.739 [0.620-0.857] *
Decrease > 0.05*; CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF,
random forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.8.3 Comparison of the average accuracy, precision, recall and F1 score for with and without
dose-related features in leave-one-out cross validation.
With dose-related features Without dose-related features
Model Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score
ANN-I 0.907 0.947 0.643 0.766 0.881 0.889 0.571 0.695
ANN-C 0.890 0.826 0.679 0.745 0.872 0.810 0.607 0.694
LR-I 0.898 0.944 0.607 0.739 0.831 0.700* 0.500* 0.583*
LR-C 0.814 0.667 0.429 0.522 0.814 0.667 0.429 0.522
RF-I 0.847 0.750 0.536 0.625 0.822 0.769 0.357* 0.488*
RF-C 0.847 0.917 0.393 0.550 0.814 0.800* 0.286* 0.421*
SVM-I 0.864 0.929 0.464 0.619 0.763 0.000* 0.000* 0.000*
SVM-C 0.822 0.889 0.286 0.433 0.763 0.000* 0.000* 0.000*
Decrease > 0.05*; ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM,
support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.
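As a consistency check on Table 5.8.3, the F1 score can be recomputed from the reported precision and recall as their harmonic mean. The sketch below assumes that the leave-one-out metrics were computed once on the pooled predictions; this is an assumption, and the relation need not hold for the five-fold tables, where each metric is averaged over folds. The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) with dose-related features, Table 5.8.3.
reported = {
    "ANN-I": (0.947, 0.643, 0.766),
    "LR-I": (0.944, 0.607, 0.739),
    "RF-I": (0.750, 0.536, 0.625),
    "SVM-I": (0.929, 0.464, 0.619),
}

for model, (precision, recall, f1_reported) in reported.items():
    # Print the recomputed value next to the value reported in the table.
    print(model, round(f1_score(precision, recall), 3), f1_reported)
```

The zero-division guard covers rows such as SVM-I without dose-related features, where precision and recall are both 0.000.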

Table 5.8.4 Comparison of the area under the receiver operating characteristic curve (AUC) of the
testing set calculated for the 12 models in leave-one-out cross validation
Model Testing set AUC with dose-related features [95% CI] Testing set AUC without dose-related features [95% CI]
ANN-I 0.891 [0.825-0.957] 0.866 [0.788-0.945]
ANN-C 0.867 [0.786-0.947] 0.815 [0.708-0.923] *
LR-I 0.887 [0.821-0.952] 0.819 [0.719-0.920] *
LR-C 0.785 [0.685-0.885] 0.787 [0.683-0.890]
RF-I 0.850 [0.770-0.931] 0.772 [0.667-0.877] *
RF-C 0.836 [0.752-0.920] 0.785 [0.682-0.888] *
SVM-I 0.884 [0.810-0.958] 0.785 [0.680-0.889] *
SVM-C 0.786 [0.679-0.893] 0.771 [0.660-0.883]
Decrease > 0.05*; CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF,
random forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.8.5 Comparison of the average accuracy, precision, recall and F1 score for with and without
dose-related features in cisplatin subgroup analysis.
With dose-related features Without dose-related features
Model Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score
ANN-I 0.909 0.920 0.870 0.887 0.745* 0.753* 0.600* 0.650*
ANN-C 0.818 0.893 0.650 0.708 0.709* 1.000 0.270* 0.421*
LR-I 0.873 0.914 0.830 0.845 0.873 0.914 0.830 0.845
LR-C 0.763 0.833 0.600 0.638 0.709* 0.780* 0.550* 0.587*
RF-I 0.854 0.960 0.680 0.772 0.855 0.960 0.680 0.772
RF-C 0.782 0.914 0.600 0.656 0.745 0.914 0.510* 0.575*
SVM-I 0.782 0.960 0.510 0.620 0.727* 0.660* 0.600 0.617
SVM-C 0.691 0.600 0.240 0.333 0.655 0.600 0.140* 0.227*
Decrease > 0.05*; ANN, artificial neural network; LR, logistic regression; RF, random forest; SVM,
support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.

Table 5.8.6 Comparison of the area under the receiver operating characteristic curve (AUC) of the
testing set calculated for the 12 models in cisplatin subgroup analysis
Model Testing set AUC with dose-related features [95% CI] Testing set AUC without dose-related features [95% CI]
ANN-I 0.888 [0.777-1.000] 0.727 [0.581-0.873] *
ANN-C 0.679 [0.501-0.857] 0.694 [0.538-0.851]
LR-I 0.828 [0.696-0.959] 0.828 [0.696-0.959]
LR-C 0.654 [0.489-0.820] 0.654 [0.495-0.813]
RF-I 0.829 [0.702-0.956] 0.829 [0.702-0.956]
RF-C 0.730 [0.538-0.877] 0.656 [0.492-0.820] *
SVM-I 0.803 [0.663-0.944] 0.791 [0.647-0.936]
SVM-C 0.661 [0.501-0.821] 0.635 [0.474-0.796]
Decrease > 0.05*; CI, confidence interval; ANN, artificial neural network; LR, logistic regression; RF,
random forest; SVM, support vector machine; I, integrated mode; C, clinical mode; G, genomic mode.
Chapter 6. Discussion

Our study demonstrated that ANN, LR, RF and SVM could generate promising results in predicting
platinum-induced nephrotoxicity in NSCLC patients. Defining AKI by the CTCAE criteria and using the
clinical and genomic features selected by GA were both shown to yield better predictive ability. Among
the 12 models constructed in this study, the ANN built with both clinical and genomic features achieved
the best performance. Survival analysis further validated the capability of ANN-I to distinguish patients
at high risk of AKI. The study also found that ML algorithms using clinical features alone could perform
better than those using genomic features alone. The AI prediction models in this study have the potential
to help healthcare professionals develop tailored therapies for NSCLC patients.

Defining AKI by the CTCAE criteria yielded promising predictive performance in this study. Most
previous retrospective studies predicting platinum-induced nephrotoxicity also defined nephrotoxicity
by the CTCAE criteria [8, 9, 15, 17]. Although the Acute Kidney Injury Network (AKIN) criteria and the
Risk, Injury, Failure, Loss of kidney function, and End-stage kidney disease (RIFLE) criteria are the most
commonly used scales for evaluating kidney function, they place strict requirements on the measurement
of renal function after the intervention, which makes them unsuitable for retrospective studies.

Two possible reasons for the improved model performance were the larger number of clinical and
genomic features and the addition of dose-related features, especially the accumulated platinum dose.
Inclusion of key clinical features, such as dose, could substantially enhance model performance. The
clinical features selected in this study, including dose, age and number of chemotherapy cycles, were
consistent with previous studies [8-11]. Renal toxicity induced by cisplatin was shown to be dose-related
in an early preclinical study [201]. Older age has been shown to be a risk factor for this ADR [82].
Patients who developed AKI had to discontinue their chemotherapy cycles to limit the effect of the
accumulated dose. By combining essential clinical and genomic features, especially dose-related features,
our best predictive model achieved an AUC of 0.900. Previous studies using CART and RT reported AUCs of
0.700 to 0.800 in their respective datasets [10, 11].

The genomic features selected in this study highlighted the most important genes involved in the
mechanisms of platinum-induced nephrotoxicity. One previous study that added dose-related features
reached an AUC of only 0.70 due to the lack of genomic features [8]. In the present study, only one
genomic feature, MATE1, was selected in both the main and cisplatin subgroup analyses by the I and G
modes of all four algorithms. The proton/organic cation antiporter encoded by MATE1 can transport
platinum out of proximal tubular cells into the tubular lumen [64]. However, traditional statistical
genomic association studies found no significant relationship between this ADR and MATE1 or the other
genomic features that were selected only in the I or G mode, probably due to type II error [7, 17]. The
non-linear algorithms applied in our study could encode the biochemical mechanisms and identify these
genes as features. Including cornerstone clinical and genomic features is imperative in
platinum-induced nephrotoxicity prediction.

Selecting the best features for ML algorithms depends on the nature of the ADR. The current study
found that the clinical mode performed better than the genomic mode for platinum-induced
nephrotoxicity. In contrast, one study demonstrated that the genomic mode predicted anti-tuberculosis
drug-induced hepatotoxicity (ATDH) better than the clinical mode [202]. Platinum-induced nephrotoxicity
is dose-related, so the clinical mode with dose-related features could perform better than the genomic
mode. As ATDH is an idiosyncratic ADR, the genomic component plays a critical role in its prediction.
Choosing appropriate features based on the ADR mechanism could enhance ML performance.

The models have specific clinical implications: they could help healthcare professionals decide on
chemotherapy protocols and raise awareness of patients' risk. When clinicians consider prescribing
platinum chemotherapy, the prediction model could help them decide whether to use platinum based on
the risk of nephrotoxicity. For patients who have already received several cycles, oncologists could
use the prediction model to take the patient's accumulated platinum dose into account before every
chemotherapy treatment. This preliminary tool could help healthcare professionals develop appropriate
chemotherapy and prophylaxis for this ADR in NSCLC patients.

There were several limitations in this study. First, the sample size was limited and included few
AKI patients. Five-fold and leave-one-out cross validation were applied to mitigate this limitation, and
the models still demonstrated promising results in the testing set. Second, patients' blood sampling
might not have taken place exactly on the expected date, for example day 3 or day 7 after platinum
chemotherapy, due to busy clinical practice. Third, it was not possible to include all clinical and
genomic risk factors for AKI when building the models. Concomitant medications that were nephrotoxic or
affected the uptake of platinum in the kidney were not completely included as features. Medications such
as cimetidine and metformin, which are competitive substrates of the transporter OCT2 for platinum,
could not be included in the study.

Chapter 7. Conclusion

This is the first study to build ANN, LR, RF and SVM models to predict platinum-induced
nephrotoxicity in NSCLC patients. The fine-tuned ANN constructed with clinical and genomic features
demonstrated the best predictive performance for this ADR. The optimized AI model constructed in this
study is a preliminary, cost-effective tool to help oncologists decide on preventive management of
nephrotoxicity or change the chemotherapy regimen for individual NSCLC patients. Further studies are
needed to validate the AI platinum-induced nephrotoxicity models in other cancer patients and other
populations to generalize their application.

References

1. Lebwohl, D. and R. Canetta, Clinical development of platinum complexes in cancer therapy: an
historical perspective and an update. Eur J Cancer, 1998. 34(10): p. 1522-34.
2. Ries, F. and J. Klastersky, Nephrotoxicity induced by cancer chemotherapy with special
emphasis on cisplatin toxicity. Am J Kidney Dis, 1986. 8(5): p. 368-79.
3. Yao, X., et al., Cisplatin nephrotoxicity: a review. Am J Med Sci, 2007. 334(2): p. 115-24.
4. Pabla, N. and Z. Dong, Cisplatin nephrotoxicity: mechanisms and renoprotective strategies.
Kidney Int, 2008. 73(9): p. 994-1007.
5. Sánchez-González, P.D., et al., An integrative view of the pathophysiological events leading to
cisplatin nephrotoxicity. Crit Rev Toxicol, 2011. 41(10): p. 803-21.
6. Manohar, S. and N. Leung, Cisplatin nephrotoxicity: a review of the literature. J Nephrol, 2018.
31(1): p. 15-25.
7. Zazuli, Z., et al., Genetic Variations and Cisplatin Nephrotoxicity: A Systematic Review. Front
Pharmacol, 2018. 9: p. 1111.
8. Motwani, S.S., et al., Development and Validation of a Risk Prediction Model for Acute Kidney
Injury After the First Course of Cisplatin. J Clin Oncol, 2018. 36(7): p. 682-688.
9. Burns, C.V., et al., Cisplatin-induced nephrotoxicity in an outpatient setting. Pharmacotherapy,
2021. 41(2): p. 184-190.
10. Garcia, S.L., et al., Prediction of Nephrotoxicity Associated With Cisplatin-Based Chemotherapy
in Testicular Cancer Patients. JNCI Cancer Spectr, 2020. 4(3): p. pkaa032.
11. Liu, H.E., et al., Multiple analytical approaches demonstrate a complex relationship of genetic
and nongenetic factors with cisplatin- and carboplatin-induced nephrotoxicity in lung cancer
patients. Biomed Res Int, 2014. 2014: p. 937429.
12. Filipski, K.K., et al., Contribution of organic cation transporter 2 (OCT2) to cisplatin-induced
nephrotoxicity. Clin Pharmacol Ther, 2009. 86(4): p. 396-402.
13. Goekkurt, E., et al., Pharmacogenetic analyses of a phase III trial in metastatic
gastroesophageal adenocarcinoma with fluorouracil and leucovorin plus either oxaliplatin or
cisplatin: a study of the arbeitsgemeinschaft internistische onkologie. J Clin Oncol, 2009.
27(17): p. 2863-73.
14. Hildebrandt, M.A., J. Gu, and X. Wu, Pharmacogenomics of platinum-based chemotherapy in
NSCLC. Expert Opin Drug Metab Toxicol, 2009. 5(7): p. 745-55.
15. Khrunin, A.V., et al., Genetic polymorphisms and the efficacy and toxicity of cisplatin-based
chemotherapy in ovarian cancer patients. Pharmacogenomics J, 2010. 10(1): p. 54-61.
16. Tzvetkov, M.V., et al., Pharmacogenetic analyses of cisplatin-induced nephrotoxicity indicate a
renoprotective effect of ERCC1 polymorphisms. Pharmacogenomics, 2011. 12(10): p. 1417-27.

17. Iwata, K., et al., Effects of genetic variants in SLC22A2 organic cation transporter 2 and
SLC47A1 multidrug and toxin extrusion 1 transporter on cisplatin-induced adverse events. Clin
Exp Nephrol, 2012. 16(6): p. 843-51.
18. Breiman, L., Random Forests. Machine Learning, 2001. 45(1): p. 5-32.
19. Shia, W.C. and D.R. Chen, Classification of malignant tumors in breast ultrasound using a
pretrained deep residual network model and support vector machine. Comput Med Imaging
Graph, 2020. 87: p. 101829.
20. Cortes, C. and V. Vapnik, Support-vector networks. Machine Learning, 1995. 20(3): p. 273-297.
21. Coudray, N., et al., Classification and mutation prediction from non-small cell lung cancer
histopathology images using deep learning. Nat Med, 2018. 24(10): p. 1559-1567.
22. Kelland, L., The resurgence of platinum-based cancer chemotherapy. Nature Reviews Cancer,
2007. 7(8): p. 573-584.
23. Rosenberg, B., L. Van Camp, and T. Krigas, Inhibition of Cell Division in Escherichia coli by
Electrolysis Products from a Platinum Electrode. Nature, 1965. 205(4972): p. 698-699.
24. Rosenberg, B., et al., Platinum Compounds: a New Class of Potent Antitumour Agents. Nature,
1969. 222(5191): p. 385-386.
25. Harrap, K.R., Preclinical studies identifying carboplatin as a viable cisplatin alternative. Cancer
Treat Rev, 1985. 12 Suppl A: p. 21-33.
26. Knox, R.J., et al., Mechanism of cytotoxicity of anticancer platinum drugs: evidence that cis-
diamminedichloroplatinum(II) and cis-diammine-(1,1-cyclobutanedicarboxylato)platinum(II)
differ only in the kinetics of their interaction with DNA. Cancer Res, 1986. 46(4 Pt 2): p. 1972-9.
27. Aabo, K., et al., Chemotherapy in advanced ovarian cancer: four systematic meta-analyses of
individual patient data from 37 randomized trials. Advanced Ovarian Cancer Trialists' Group.
Br J Cancer, 1998. 78(11): p. 1479-87.
28. Rabik, C.A. and M.E. Dolan, Molecular mechanisms of resistance and toxicity associated with
platinating agents. Cancer Treat Rev, 2007. 33(1): p. 9-23.
29. Zorbas, H. and B.K. Keppler, Cisplatin damage: are DNA repair proteins saviors or traitors to
the cell? Chembiochem, 2005. 6(7): p. 1157-66.
30. Wang, D. and S.J. Lippard, Cellular processing of platinum anticancer drugs. Nat Rev Drug
Discov, 2005. 4(4): p. 307-20.
31. Product Information: CISPLATIN intravenous injection, cisplatin intravenous injection. 2019.
32. Product Information: Paraplatin(R), carboplatin. 1999.
33. Balis, F.M., J.S. Holcenberg, and W.A. Bleyer, Clinical pharmacokinetics of commonly used
anticancer drugs. Clin Pharmacokinet, 1983. 8(3): p. 202-32.
34. Cotte, E., et al., Population pharmacokinetics and pharmacodynamics of cisplatinum during
hyperthermic intraperitoneal chemotherapy using a closed abdominal procedure. J Clin
Pharmacol, 2011. 51(1): p. 9-18.

35. van der Vijgh, W.J., Clinical pharmacokinetics of carboplatin. Clin Pharmacokinet, 1991. 21(4):
p. 242-61.
36. Chu, E., Physicians' Cancer Chemotherapy Drug Manual 2021. 2021.
37. Lokich, J. and N. Anderson, Carboplatin versus cisplatin in solid tumors: an analysis of the
literature. Ann Oncol, 1998. 9(1): p. 13-21.
38. Beyer, J., et al., Nephrotoxicity after high-dose carboplatin, etoposide and ifosfamide in germ-
cell tumors: incidence and implications for hematologic recovery and clinical outcome. Bone
Marrow Transplant, 1997. 20(10): p. 813-9.
39. Aleksander, I. and H. Morton, An Introduction to Neural Computing. 2nd edition.
40. Van Echo, D.A., et al., Phase I clinical and pharmacologic trial of carboplatin daily for 5 days.
Cancer Treat Rep, 1984. 68(9): p. 1103-14.
41. PARAPLATIN(R) IV injection, carboplatin IV injection. 2010.
42. Product Information: cisplatin intravenous injection solution, cisplatin intravenous injection
solution. 2013.
43. Kang, D.G., et al., Butein ameliorates renal concentrating ability in cisplatin-induced acute
renal failure in rats. Biol Pharm Bull, 2004. 27(3): p. 366-70.
44. Kawai, Y., et al., The effect of antioxidant on development of fibrosis by cisplatin in rats. J
Pharmacol Sci, 2009. 111(4): p. 433-9.
45. Bellomo, R., J.A. Kellum, and C. Ronco, Acute kidney injury. Lancet, 2012. 380(9843): p. 756-66.
46. Chertow, G.M., et al., Independent association between acute renal failure and mortality
following cardiac surgery. Am J Med, 1998. 104(4): p. 343-8.
47. Badary, O.A., et al., Naringenin attenuates cisplatin nephrotoxicity in rats. Life Sci, 2005.
76(18): p. 2125-35.
48. Ali, B.H., et al., Ontogenic aspects of cisplatin-induced nephrotoxicity in rats. Food Chem
Toxicol, 2008. 46(11): p. 3355-9.
49. Antunes, L.M., J.D. Darin, and L. Bianchi Nde, Effects of the antioxidants curcumin or selenium
on cisplatin-induced nephrotoxicity and lipid peroxidation in rats. Pharmacol Res, 2001. 43(2):
p. 145-50.
50. Bearcroft, C.P., et al., Cisplatin impairs fluid and electrolyte absorption in rat small intestine: a
role for 5-hydroxytryptamine. Gut, 1999. 44(2): p. 174-9.
51. Kishore, B.K., et al., Expression of renal aquaporins 1, 2, and 3 in a rat model of cisplatin-
induced polyuria. Kidney Int, 2000. 58(2): p. 701-11.
52. Kim, S.W., et al., Cisplatin decreases the abundance of aquaporin water channels in rat kidney.
J Am Soc Nephrol, 2001. 12(5): p. 875-82.
53. Safirstein, R. and G. Deray, Anticancer: Cisplatin/carboplatin, in Clinical Nephrotoxins: Renal
Injury from Drugs and Chemicals, M.E. De Broe, et al., Editors. 1998, Springer Netherlands:
Dordrecht. p. 261-271.

54. Clifton, G.G., et al., Early polyuria in the rat following single-dose cis-
dichlorodiammineplatinum (II): effects on plasma vasopressin concentration and posterior
pituitary function. J Lab Clin Med, 1982. 100(5): p. 659-70.
55. Safirstein, R., et al., Cisplatin nephrotoxicity in rats: defect in papillary hypertonicity. Am J
Physiol, 1981. 241(2): p. F175-85.
56. Arany, I., et al., Cisplatin-induced cell death is EGFR/src/ERK signaling dependent in mouse
proximal tubule cells. Am J Physiol Renal Physiol, 2004. 287(3): p. F543-9.
57. Ramesh, G. and W.B. Reeves, TNF-alpha mediates chemokine and cytokine expression and
renal injury in cisplatin nephrotoxicity. J Clin Invest, 2002. 110(6): p. 835-42.
58. Ramesh, G. and W.B. Reeves, TNFR2-mediated apoptosis and necrosis in cisplatin-induced
acute renal failure. Am J Physiol Renal Physiol, 2003. 285(4): p. F610-8.
59. Vickers, A.E., et al., Kidney slices of human and rat to characterize cisplatin-induced injury on
cellular pathways and morphology. Toxicol Pathol, 2004. 32(5): p. 577-90.
60. Hamilton, R.W., M.B. Hopkins, 3rd, and Z.K. Shihabi, Myoglobinuria, hemoglobinuria, and
acute renal failure. Clin Chem, 1989. 35(8): p. 1713-20.
61. Safirstein, R., P. Miller, and J.B. Guttenplan, Uptake and metabolism of cisplatin by rat kidney.
Kidney Int, 1984. 25(5): p. 753-8.
62. Dobyan, D.C., et al., Mechanism of cis-platinum nephrotoxicity: II. Morphologic observations. J
Pharmacol Exp Ther, 1980. 213(3): p. 551-6.
63. Boulikas, T. and M. Vougiouka, Cisplatin and platinum drugs at the molecular level. (Review).
Oncol Rep, 2003. 10(6): p. 1663-82.
64. Yokoo, S., et al., Differential contribution of organic cation transporters, OCT2 and MATE1, in
platinum agent-induced nephrotoxicity. Biochem Pharmacol, 2007. 74(3): p. 477-87.
65. Choi, M.K. and I.S. Song, Organic cation transporters and their pharmacokinetic and
pharmacodynamic consequences. Drug Metab Pharmacokinet, 2008. 23(4): p. 243-53.
66. Filipski, K.K., et al., Interaction of Cisplatin with the human organic cation transporter 2. Clin
Cancer Res, 2008. 14(12): p. 3875-80.
67. Hu, S., et al., Identification of OAT1/OAT3 as Contributors to Cisplatin Toxicity. Clin Transl Sci,
2017. 10(5): p. 412-420.
68. Ishida, S., et al., Uptake of the anticancer drug cisplatin mediated by the copper transporter
Ctr1 in yeast and mammals. Proc Natl Acad Sci U S A, 2002. 99(22): p. 14298-302.
69. Pabla, N., et al., The copper transporter Ctr1 contributes to cisplatin uptake by renal tubular
cells during cisplatin nephrotoxicity. Am J Physiol Renal Physiol, 2009. 296(3): p. F505-11.
70. Ramesh, G. and W.B. Reeves, p38 MAP kinase inhibition ameliorates cisplatin nephrotoxicity in
mice. Am J Physiol Renal Physiol, 2005. 289(1): p. F166-74.
71. Kelly, K.J., et al., Protection from toxicant-mediated renal injury in the rat with anti-CD54
antibody. Kidney Int, 1999. 56(3): p. 922-31.

72. Li, S., et al., Anti-inflammatory effect of fibrate protects from cisplatin-induced ARF. Am J
Physiol Renal Physiol, 2005. 289(2): p. F469-80.
73. Liu, M., et al., A pathophysiologic role for T lymphocytes in murine acute cisplatin
nephrotoxicity. J Am Soc Nephrol, 2006. 17(3): p. 765-74.
74. Urakami, Y., et al., Functional characteristics and membrane localization of rat multispecific
organic cation transporters, OCT1 and OCT2, mediating tubular secretion of cationic drugs. J
Pharmacol Exp Ther, 1998. 287(2): p. 800-5.
75. Xu, E.Y., et al., Integrated pathway analysis of rat urine metabolic profiles and kidney
transcriptomic profiles to elucidate the systems toxicology of model nephrotoxicants. Chem
Res Toxicol, 2008. 21(8): p. 1548-61.
76. Townsend, D.M., et al., Metabolism of Cisplatin to a nephrotoxin in proximal tubule cells. J Am
Soc Nephrol, 2003. 14(1): p. 1-10.
77. Masuda, H., T. Tanaka, and U. Takahama, Cisplatin generates superoxide anion by interaction
with DNA in a cell-free system. Biochem Biophys Res Commun, 1994. 203(2): p. 1175-80.
78. Nowak, G., Protein kinase C-alpha and ERK1/2 mediate mitochondrial dysfunction, decreases
in active Na+ transport, and cisplatin-induced apoptosis in renal cells. J Biol Chem, 2002.
277(45): p. 43377-88.
79. Portilla, D., et al., Metabolomic study of cisplatin-induced nephrotoxicity. Kidney Int, 2006.
69(12): p. 2194-204.
80. Abdel-Gayoum, A.A., K.B. El-Jenjan, and K.A. Ghwarsha, Hyperlipidaemia in cisplatin-induced
nephrotic rats. Hum Exp Toxicol, 1999. 18(7): p. 454-9.
81. Li, S., et al., PPAR alpha ligand protects during cisplatin-induced acute renal failure by
preventing inhibition of renal FAO and PDC activity. Am J Physiol Renal Physiol, 2004. 286(3): p.
F572-80.
82. de Jongh, F.E., et al., Weekly high-dose cisplatin is a feasible treatment option: analysis on
prognostic factors for toxicity in 400 patients. Br J Cancer, 2003. 88(8): p. 1199-206.
83. de Jongh, F.E., et al., Body-surface area-based dosing does not increase accuracy of predicting
cisplatin exposure. J Clin Oncol, 2001. 19(17): p. 3733-9.
84. Reece, P.A., et al., Creatinine clearance as a predictor of ultrafilterable platinum disposition in
cancer patients treated with cisplatin: relationship between peak ultrafilterable platinum
plasma levels and nephrotoxicity. J Clin Oncol, 1987. 5(2): p. 304-9.
85. Zazuli, Z., et al., Genetic Variations and Cisplatin Nephrotoxicity: A Systematic Review. Front
Pharmacol, 2018. 9: p. 1111.
86. Kelland, L.R., et al., Preclinical antitumor evaluation of bis-acetato-ammine-dichloro-
cyclohexylamine platinum(IV): an orally active platinum drug. Cancer Res, 1993. 53(11): p.
2581-6.
87. Marzolini, C., et al., Polymorphisms in human MDR1 (P-glycoprotein): recent advances and
clinical relevance. Clin Pharmacol Ther, 2004. 75(1): p. 13-33.

88. Sauna, Z.E., et al., The mechanism of action of multidrug-resistance-linked P-glycoprotein. J
Bioenerg Biomembr, 2001. 33(6): p. 481-91.
89. Wei, H.B., et al., Polymorphisms of ERCC1 C118T/C8092A and MDR1 C3435T predict outcome
of platinum-based chemotherapies in advanced non-small cell lung cancer: a meta-analysis.
Arch Med Res, 2011. 42(5): p. 412-20.
90. Sohn, J.W., et al., MDR1 polymorphisms predict the response to etoposide-cisplatin
combination chemotherapy in small cell lung cancer. Jpn J Clin Oncol, 2006. 36(3): p. 137-41.
91. Campling, B.G., et al., Expression of the MRP and MDR1 multidrug resistance genes in small
cell lung cancer. Clin Cancer Res, 1997. 3(1): p. 115-22.
92. Han, J.Y., et al., Associations of ABCB1, ABCC2, and ABCG2 polymorphisms with irinotecan-
pharmacokinetics and clinical outcome in patients with advanced non-small cell lung cancer.
Cancer, 2007. 110(1): p. 138-47.
93. Qian, C.Y., et al., Associations of genetic polymorphisms of the transporters organic cation
transporter 2 (OCT2), multidrug and toxin extrusion 1 (MATE1), and ATP-binding cassette
subfamily C member 2 (ABCC2) with platinum-based chemotherapy response and toxicity in
non-small cell lung cancer patients. Chin J Cancer, 2016. 35(1): p. 85.
94. Pan, J.H., et al., MDR1 single nucleotide polymorphisms predict response to vinorelbine-based
chemotherapy in patients with non-small cell lung cancer. Respiration, 2008. 75(4): p. 380-5.
95. Han, J.Y., et al., Integrated pharmacogenetic prediction of irinotecan pharmacokinetics and
toxicity in patients with advanced non-small cell lung cancer. Lung Cancer, 2009. 63(1): p. 115-
20.
96. Taniguchi, K., et al., A human canalicular multispecific organic anion transporter (cMOAT) gene
is overexpressed in cisplatin-resistant human cancer cell lines with decreased drug
accumulation. Cancer Res, 1996. 56(18): p. 4124-9.
97. Korita, P.V., et al., Multidrug resistance-associated protein 2 determines the efficacy of cisplatin
in patients with hepatocellular carcinoma. Oncol Rep, 2010. 23(4): p. 965-72.
98. Burger, H., et al., Drug transporters of platinum-based anticancer agents and their clinical
significance. Drug Resist Updat, 2011. 14(1): p. 22-34.
99. Han, B., et al., Association of ABCC2 polymorphisms with platinum-based chemotherapy
response and severe toxicity in non-small cell lung cancer patients. Lung Cancer, 2011. 72(2): p.
238-43.
100. Sun, N., et al., MRP2 and GSTP1 polymorphisms and chemotherapy response in advanced non-
small cell lung cancer. Cancer Chemother Pharmacol, 2010. 65(3): p. 437-46.
101. Tsuda, M., et al., Oppositely directed H+ gradient functions as a driving force of rat H+/organic
cation antiporter MATE1. Am J Physiol Renal Physiol, 2007. 292(2): p. F593-8.
102. Sato, T., et al., Transcellular transport of organic cations in double-transfected MDCK cells
expressing human organic cation transporters hOCT1/hMATE1 and hOCT2/hMATE1. Biochem
Pharmacol, 2008. 76(7): p. 894-903.

103. Yonezawa, A., et al., Cisplatin and oxaliplatin, but not carboplatin and nedaplatin, are
substrates for human organic cation transporters (SLC22A1-3 and multidrug and toxin
extrusion family). J Pharmacol Exp Ther, 2006. 319(2): p. 879-86.
104. Nakamura, T., et al., Disruption of multidrug and toxin extrusion MATE1 potentiates cisplatin-
induced nephrotoxicity. Biochem Pharmacol, 2010. 80(11): p. 1762-7.
105. Becker, M.L., et al., Genetic variation in the multidrug and toxin extrusion 1 transporter protein
influences the glucose-lowering effect of metformin in patients with diabetes: a preliminary
study. Diabetes, 2009. 58(3): p. 745-9.
106. Tzvetkov, M.V., et al., The effects of genetic polymorphisms in the organic cation transporters
OCT1, OCT2, and OCT3 on the renal clearance of metformin. Clin Pharmacol Ther, 2009. 86(3):
p. 299-306.
107. Guo, D., et al., Selective Inhibition on Organic Cation Transporters by Carvedilol Protects Mice
from Cisplatin-Induced Nephrotoxicity. Pharm Res, 2018. 35(11): p. 204.
108. Freitas-Lima, L.C., et al., PPAR-α Deletion Attenuates Cisplatin Nephrotoxicity by Modulating
Renal Organic Transporters MATE-1 and OCT-2. Int J Mol Sci, 2020. 21(19).
109. Motohashi, H., et al., Gene expression levels and immunolocalization of organic ion
transporters in the human kidney. J Am Soc Nephrol, 2002. 13(4): p. 866-74.
110. Yonezawa, A., et al., Association between tubular toxicity of cisplatin and expression of organic
cation transporter rOCT2 (Slc22a2) in the rat. Biochem Pharmacol, 2005. 70(12): p. 1823-31.
111. Ciarimboli, G., et al., Cisplatin nephrotoxicity is critically mediated via the human organic
cation transporter 2. Am J Pathol, 2005. 167(6): p. 1477-84.
112. Chang, C., et al., Pharmacogenomic Variants May Influence the Urinary Excretion of Novel
Kidney Injury Biomarkers in Patients Receiving Cisplatin. Int J Mol Sci, 2017. 18(7).
113. Frosina, G., Commentary: DNA base excision repair defects in human pathologies. Free Radic
Res, 2004. 38(10): p. 1037-54.
114. Zhang, L., et al., Association between single nucleotide polymorphisms (SNPs) and toxicity of
advanced non-small-cell lung cancer patients treated with chemotherapy. PLoS One, 2012.
7(10): p. e48350.
115. Xu, X., et al., Association between eIF3α polymorphism and severe toxicity caused by platinum-
based chemotherapy in non-small cell lung cancer patients. Br J Clin Pharmacol, 2013. 75(2): p.
516-23.
116. Altaha, R., et al., Excision repair cross complementing-group 1: gene expression and platinum
resistance. Int J Mol Med, 2004. 14(6): p. 959-70.
117. Olaussen, K.A., et al., DNA repair by ERCC1 in non-small-cell lung cancer and cisplatin-based
adjuvant chemotherapy. N Engl J Med, 2006. 355(10): p. 983-91.
118. Zheng, Y., et al., The association of genetic variations in DNA repair pathways with severe
toxicities in NSCLC patients undergoing platinum-based chemotherapy. Int J Cancer, 2017.
141(11): p. 2336-2347.

119. Duran, G., et al., Association of GSTP1 and ERCC1 polymorphisms with toxicity in locally
advanced head and neck cancer platinum-based chemoradiotherapy treatment. Head Neck,
2019. 41(8): p. 2704-2715.
120. Janssen, K., et al., DNA repair activity of 8-oxoguanine DNA glycosylase 1 (OGG1) in human
lymphocytes is not dependent on genetic polymorphism Ser326/Cys326. Mutat Res, 2001.
486(3): p. 207-16.
121. Mambo, E., et al., Oxidized guanine lesions and hOgg1 activity in lung cancer. Oncogene, 2005.
24(28): p. 4496-508.
122. Li, H., et al., The hOGG1 Ser326Cys polymorphism and lung cancer risk: a meta-analysis.
Cancer Epidemiol Biomarkers Prev, 2008. 17(7): p. 1739-45.
123. Duan, W.X., et al., The association between OGG1 Ser326Cys polymorphism and lung cancer
susceptibility: a meta-analysis of 27 studies. PLoS One, 2012. 7(4): p. e35970.
124. Su, Y., et al., DNA Repair Gene Polymorphisms in Relation to Non-Small Cell Lung Cancer
Survival. Cell Physiol Biochem, 2015. 36(4): p. 1419-29.
125. Wang, J. and P. Wu, Correlation analysis of mRNA expression and prognosis of hOGG1 gene
polymorphism in patients with non-small cell lung cancer. Oncol Lett, 2019. 18(3): p. 2310-
2315.
126. Srivastava, K., et al., Candidate gene studies in gallbladder cancer: a systematic review and
meta-analysis. Mutat Res, 2011. 728(1-2): p. 67-79.
127. Zhang, H., et al., The hOGG1 Ser326Cys polymorphism and prostate cancer risk: a meta-
analysis of 2584 cases and 3234 controls. BMC Cancer, 2011. 11: p. 391.
128. Zhang, Y., et al., Association of OGG1 Ser326Cys polymorphism with colorectal cancer risk: a
meta-analysis. Int J Colorectal Dis, 2011. 26(12): p. 1525-30.
129. Leibeling, D., P. Laspe, and S. Emmert, Nucleotide excision repair and cancer. J Mol Histol,
2006. 37(5-7): p. 225-38.
130. Lou, Y., et al., XPA gene rs1800975 single nucleotide polymorphism and lung cancer risk: a
meta-analysis. Tumour Biol, 2014. 35(7): p. 6607-17.
131. Liu, X., et al., Association between XPA gene rs1800975 polymorphism and susceptibility to
lung cancer: a meta-analysis. Clin Respir J, 2018. 12(2): p. 448-458.
132. Qian, B., et al., Association of genetic polymorphisms in DNA repair pathway genes with non-
small cell lung cancer risk. Lung Cancer, 2011. 73(2): p. 138-46.
133. Cho, S., et al., Associations between polymorphisms in DNA repair genes and TP53 mutations
in non-small cell lung cancer. Lung Cancer, 2011. 73(1): p. 25-31.
134. Wakasugi, M. and A. Sancar, Order of assembly of human DNA repair excision nuclease. J Biol
Chem, 1999. 274(26): p. 18759-68.
135. Cui, Y., et al., Polymorphism of Xeroderma Pigmentosum group G and the risk of lung cancer
and squamous cell carcinomas of the oropharynx, larynx and esophagus. Int J Cancer, 2006.
118(3): p. 714-20.

136. Liang, Y., et al., Genetic association between ERCC5 rs17655 polymorphism and lung cancer
risk: evidence based on a meta-analysis. Tumour Biol, 2014. 35(6): p. 5613-8.
137. Hussain, S.K., et al., Genetic variation in immune regulation and DNA repair pathways and
stomach cancer in China. Cancer Epidemiol Biomarkers Prev, 2009. 18(8): p. 2304-9.
138. Canbay, E., et al., Association of APE1 and hOGG1 polymorphisms with colorectal cancer risk in
a Turkish population. Curr Med Res Opin, 2011. 27(7): p. 1295-302.
139. Rouissi, K., et al., The effect of tobacco, XPC, ERCC2 and ERCC5 genetic variants in bladder
cancer development. BMC Cancer, 2011. 11: p. 101.
140. Kanhai, W., W. Dekant, and D. Henschler, Metabolism of the nephrotoxin dichloroacetylene by
glutathione conjugation. Chem Res Toxicol, 1989. 2(1): p. 51-6.
141. Dekant, W., Bioactivation of nephrotoxins and renal carcinogens by glutathione S-conjugate
formation. Toxicol Lett, 1993. 67(1-3): p. 151-60.
142. Di Pietro, G., L.A. Magno, and F. Rios-Santos, Glutathione S-transferases: an overview in cancer
research. Expert Opin Drug Metab Toxicol, 2010. 6(2): p. 153-70.
143. Daniel, V., Glutathione S-transferases: gene structure and regulation of expression. Crit Rev
Biochem Mol Biol, 1993. 28(3): p. 173-207.
144. Li, X.M., et al., Glutathione S-transferase P1, gene-gene interaction, and lung cancer
susceptibility in the Chinese population: An updated meta-analysis and review. J Cancer Res
Ther, 2015. 11(3): p. 565-70.
145. Wang, Y., et al., Correlation between metabolic enzyme GSTP1 polymorphisms and
susceptibility to lung cancer. Exp Ther Med, 2015. 10(4): p. 1521-1527.
146. Cui, J., et al., GSTP1 and cancer: Expression, methylation, polymorphisms and signaling
(Review). Int J Oncol, 2020. 56(4): p. 867-878.
147. Townsend, D.M., et al., Role of glutathione S-transferase Pi in cisplatin-induced nephrotoxicity.
Biomed Pharmacother, 2009. 63(2): p. 79-85.
148. Huang, Y.S., et al., Polymorphism of the N-acetyltransferase 2 gene as a susceptibility risk
factor for antituberculosis drug-induced hepatitis. Hepatology, 2002. 35(4): p. 883-9.
149. Kolb, R.J., A.M. Ghazi, and D.W. Barfuss, Inhibition of basolateral transport and cellular
accumulation of cDDP and N-acetyl-L-cysteine-cDDP by TEA and PAH in the renal proximal
tubule. Cancer Chemother Pharmacol, 2003. 51(2): p. 132-8.
150. Khrunin, A.V., et al., Pharmacogenomic assessment of cisplatin-based chemotherapy outcomes
in ovarian cancer. Pharmacogenomics, 2014. 15(3): p. 329-37.
151. Dinkova-Kostova, A.T. and P. Talalay, NAD(P)H:quinone acceptor oxidoreductase 1 (NQO1), a
multifunctional antioxidant enzyme and exceptionally versatile cytoprotector. Arch Biochem
Biophys, 2010. 501(1): p. 116-23.
152. Rothman, N., et al., Benzene poisoning, a risk factor for hematological malignancy, is
associated with the NQO1 609C-->T mutation and rapid fractional excretion of chlorzoxazone.
Cancer Res, 1997. 57(14): p. 2839-42.

153. Windsor, R.E., et al., Germline genetic polymorphisms may influence chemotherapy response
and disease outcome in osteosarcoma: a pilot study. Cancer, 2012. 118(7): p. 1856-67.
154. Lamba, J.K., et al., Genetic variation in platinating agent and taxane pathway genes as
predictors of outcome and toxicity in advanced non-small-cell lung cancer. Pharmacogenomics,
2014. 15(12): p. 1565-74.
155. Guillemette, C., Pharmacogenomics of human UDP-glucuronosyltransferase enzymes.
Pharmacogenomics J, 2003. 3(3): p. 136-58.
156. Han, S.X., L. Wang, and D.Q. Wu, The association between UGT1A7 polymorphism and cancer
risk: a meta-analysis. Cancer Epidemiol, 2012. 36(4): p. e201-6.
157. Wang, Y., et al., UDP-glucuronosyltransferase 1A7 genetic polymorphisms are associated with
hepatocellular carcinoma in Japanese patients with hepatitis C virus infection. Clin Cancer Res,
2004. 10(7): p. 2441-6.
158. Cooper, A.J.L. and M.H. Hanigan, Metabolism of Glutathione S-Conjugates: Multiple Pathways.
Comprehensive Toxicology, 2018: p. 363-406.
159. Kastan, M.B., et al., Participation of p53 protein in the cellular response to DNA damage.
Cancer Res, 1991. 51(23 Pt 1): p. 6304-11.
160. Polyak, K., et al., A model for p53-induced apoptosis. Nature, 1997. 389(6648): p. 300-5.
161. Nakano, K. and K.H. Vousden, PUMA, a novel proapoptotic gene, is induced by p53. Mol Cell,
2001. 7(3): p. 683-94.
162. Oda, E., et al., Noxa, a BH3-only member of the Bcl-2 family and candidate mediator of p53-
induced apoptosis. Science, 2000. 288(5468): p. 1053-8.
163. Cummings, B.S. and R.G. Schnellmann, Cisplatin-induced renal cell apoptosis: caspase 3-
dependent and -independent pathways. J Pharmacol Exp Ther, 2002. 302(1): p. 8-17.
164. Wei, Q., et al., Activation and involvement of p53 in cisplatin-induced nephrotoxicity. Am J
Physiol Renal Physiol, 2007. 293(4): p. F1282-91.
165. Dumont, P., et al., The codon 72 polymorphic variants of p53 have markedly different apoptotic
potential. Nat Genet, 2003. 33(3): p. 357-65.
166. Wilson, A. Simple Neural Networks in Python. 2019; Available from:
https://towardsdatascience.com/inroduction-to-neural-networks-in-python-7e0b422e6c24.
167. Lek, S. and J.F. Guégan, Artificial neural networks as a tool in ecological modelling, an
introduction. Ecological Modelling, 1999. 120(2): p. 65-73.
168. Madani, A., et al., Fast and accurate view classification of echocardiograms using deep
learning. NPJ Digit Med, 2018. 1.
169. Ayer, T., et al., Breast cancer risk estimation with artificial neural networks revisited:
discrimination and calibration. Cancer, 2010. 116(14): p. 3310-21.
170. Chen, Y.C., W.C. Ke, and H.W. Chiu, Risk classification of cancer survival using ANN with gene
expression data from multiple laboratories. Comput Biol Med, 2014. 48: p. 1-7.

171. Seah, J.C.Y., et al., Chest Radiographs in Congestive Heart Failure: Visualizing Neural Network
Learning. Radiology, 2019. 290(2): p. 514-522.
172. Hosmer Jr, D.W., S. Lemeshow, and R.X. Sturdivant, Applied logistic regression. Vol. 398. 2013:
John Wiley & Sons.
173. Suresha, H. Logistic regression in Statistics and Machine learning. 2019; Available from:
https://medium.com/mlearning-ai/logistic-regression-60694a973bee.
174. Swensen, S.J., et al., The probability of malignancy in solitary pulmonary nodules. Application
to small radiologically indeterminate nodules. Arch Intern Med, 1997. 157(8): p. 849-55.
175. Cassidy, A., et al., The LLP risk model: an individual risk prediction model for lung cancer. Br J
Cancer, 2008. 98(2): p. 270-6.
176. Bilimoria, K.Y., et al., Development and evaluation of the universal ACS NSQIP surgical risk
calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg,
2013. 217(5): p. 833-42.e1-3.
177. Farjah, F., et al., A prediction model for pathologic N2 disease in lung cancer patients with a
negative mediastinum by positron emission tomography. J Thorac Oncol, 2013. 8(9): p. 1170-
80.
178. Tammemägi, M.C., et al., Selection criteria for lung-cancer screening. N Engl J Med, 2013.
368(8): p. 728-36.
179. Deppen, S.A. and E.L. Grogan, Using Clinical Risk Models for Lung Nodule Classification. Semin
Thorac Cardiovasc Surg, 2015. 27(1): p. 30-5.
180. Wikipedia. Random forest. Available from: https://en.wikipedia.org/wiki/Random_forest.
181. Tong, W., et al., Using decision forest to classify prostate cancer samples on the basis of SELDI-
TOF MS data: assessing chance correlation and prediction confidence. Environ Health Perspect,
2004. 112(16): p. 1622-7.
182. Delen, D., G. Walker, and A. Kadam, Predicting breast cancer survivability: a comparison of
three data mining methods. Artif Intell Med, 2005. 34(2): p. 113-27.
183. Szabo de Edelenyi, F., et al., Prediction of the metabolic syndrome status based on dietary and
genetic parameters, using Random Forest. Genes Nutr, 2008. 3(3-4): p. 173-6.
184. Xu, M., et al., Genome Wide Association Study to predict severe asthma exacerbations in
children using random forests classifiers. BMC Med Genet, 2011. 12: p. 90.
185. Lebedev, A.V., et al., Random Forest ensembles for detection and prediction of Alzheimer's
disease with a good between-cohort robustness. Neuroimage Clin, 2014. 6: p. 115-25.
186. Chen, Y., et al., Predicting postoperative complications of head and neck squamous cell
carcinoma in elderly patients using random forest algorithm model. BMC Med Inform Decis
Mak, 2015. 15: p. 44.
187. Dabakoglu, C. What is Support Vector Machine (SVM) with Python. 2018; Available from:
https://medium.com/@cdabakoglu/what-is-support-vector-machine-svm-fd0e9e39514f.

188. Listgarten, J., et al., Predictive models for breast cancer susceptibility from multiple single
nucleotide polymorphisms. Clin Cancer Res, 2004. 10(8): p. 2725-37.
189. Waddell, M., et al., Predicting cancer susceptibility from single-nucleotide polymorphism data:
A case study in multiple myeloma. Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2005.
190. Kim, W., et al., Development of novel breast cancer recurrence prediction model using support
vector machine. J Breast Cancer, 2012. 15(2): p. 230-8.
191. Xu, X., et al. A gene signature for breast cancer prognosis using support vector machine. in
2012 5th International Conference on BioMedical Engineering and Informatics. 2012.
192. Chang, S.W., et al., Oral cancer prognosis based on clinicopathologic and genomic markers
using a hybrid of feature selection and machine learning methods. BMC Bioinformatics, 2013.
14: p. 170.
193. Ghasem Ahmad, L., et al., Using three machine learning techniques for predicting breast
cancer recurrence. Journal of Health & Medical Informatics, 2013. 4: p. 124-130.
194. Rosado, P., et al., Survival model in oral squamous cell carcinoma based on clinicopathological
parameters, molecular markers and support vector machines. Expert Systems with
Applications, 2013. 40(12): p. 4770-4776.
195. Tseng, C.-J., et al., Application of machine learning to predict the recurrence-proneness for
cervical cancer. Neural Computing and Applications, 2014. 24(6): p. 1311-1316.
196. Cancer Therapy Evaluation Program (CTEP). Common Terminology Criteria for Adverse Events
(CTCAE). [Accessed October 1, 2020]; Available from:
https://ctep.cancer.gov/protocolDevelopment/electronic_applications/ctc.htm#ctc_40.
197. Fushiki, T., Estimation of prediction error by using K-fold cross-validation. Statistics and
Computing, 2011. 21(2): p. 137-146.
198. Lucasius, C.B. and G. Kateman, Understanding and using genetic algorithms Part 1. Concepts,
properties and context. Chemometrics and Intelligent Laboratory Systems, 1993. 19(1): p. 1-
33.
199. Hassanat, A., et al., Choosing mutation and crossover ratios for genetic algorithms—a review
with a new dynamic approach. Information, 2019. 10(12): p. 390.
200. Whitney, A.W., A Direct Method of Nonparametric Measurement Selection. IEEE Transactions
on Computers, 1971. C-20(9): p. 1100-1103.
201. Levine, B.S., et al., Nephrotoxic potential of cis-diamminedichloroplatinum and four analogs in
male Fischer 344 rats. J Natl Cancer Inst, 1981. 67(1): p. 201-6.
202. Lai, N.H., et al., Comparison of the predictive outcomes for anti-tuberculosis drug-induced
hepatotoxicity by different machine learning techniques. Comput Methods Programs Biomed,
2020. 188: p. 105307.

Appendices

Appendix 1. The approval of Wan Fang Hospital Institutional Review Board (WFH-IRB) in this study

Appendix 2. Code book of clinical and genomic variables
| Variables | Definition | Type | Numeric description |
|---|---|---|---|
| Height | Patients' height (cm) | Continuous | - |
| Weight | Patients' weight (kg) | Continuous | - |
| Body surface area | Body surface area (m²) = (Height (cm) × Weight (kg) / 3600)^½ | Continuous | - |
| Gender | Patients' gender | Categorical | 0 = Female; 1 = Male |
| Age | Patients' age | Continuous | - |
| Age group | Classify patients by 65 years old | Categorical | 0 = Younger than 65 y/o; 1 = Older than 65 y/o |
| Alcohol | Any drinking experience? | Categorical | 0 = No; 1 = Yes |
| Smoking | Any smoking experience? | Categorical | 0 = No; 1 = Yes |
| TNM stage | Classify patients by cancer stage | Categorical | 1 = Ia/Ib; 2 = IIa/IIb; 3 = IIIa/IIIb; 4 = IV; 5 = E |
| TNM stage group | Classify patients by cancer stage into 3 groups | Categorical | 1 = IA~IIb; 2 = IIIA~IIIB; 3 = IV~E |
| Adenocarcinoma | Patients' histology | Categorical | 0 = Other; 1 = Adenocarcinoma |
| Large cell carcinoma | | | 0 = Other; 1 = Large cell carcinoma |
| Squamous cell carcinoma | | | 0 = Other; 1 = Squamous cell carcinoma |
| Pathology | Patients' pathology | Categorical | 1 = Non-small cell lung cancer |
| CT | Number of chemotherapy cycles patient received | Continuous | - |
| Week | Patients' total inclusion period (weeks) | Continuous | - |
| Platinum | Which platinum drug did the patients take? | Categorical | 0 = Carboplatin; 1 = Cisplatin |
| Concomitant 5-FU | Which concomitant chemotherapy drugs did the patients undergo? | Categorical | 0 = Other; 1 = 5-FU |
| Concomitant CCRT | | | 0 = Other; 1 = CCRT (Concurrent chemoradiotherapy) |
| Concomitant Etoposide | | | 0 = Other; 1 = Etoposide |
| Concomitant Gemcitabine | | | 0 = Other; 1 = Gemzar (Gemcitabine) |
| Concomitant Other | | | 0 = Other; 1 = Procarbazine, Cyclophosphamide, adriamycin and platinum (CAP) |
| Concomitant PAVE | | | 0 = Other; 1 = PAVE (Procarbazine, Alkeran and Vinblastine) |
| Concomitant Paclitaxel | | | 0 = Other; 1 = Taxol (Paclitaxel) |
| Concomitant Pemetrexed | | | 0 = Other; 1 = Alimta (Pemetrexed) |
| Concomitant Vinorelbine | | | 0 = Other; 1 = Vinorelbine |
| Cisplatin cumulative dose (mg) | Cisplatin cumulative dose (mg) | Continuous | - |
| Cisplatin cumulative dose (mg/m²) | Cisplatin cumulative dose (mg/m²) | Continuous | - |
| Cisplatin average dose (mg) | Cisplatin average dose (mg) | Continuous | - |
| Cisplatin average dose (mg/m²) | Cisplatin average dose (mg/m²) | Continuous | - |
| Carboplatin cumulative dose (mg) | Carboplatin cumulative dose (mg) | Continuous | - |
| Carboplatin cumulative dose (mg/m²) | Carboplatin cumulative dose (mg/m²) | Continuous | - |
| Carboplatin average dose (mg) | Carboplatin average dose (mg) | Continuous | - |
| Carboplatin average dose (mg/m²) | Carboplatin average dose (mg/m²) | Continuous | - |
| OCT2_GG (rs316019) | OCT2 G808T rs316019 | Categorical | 0 = GT; 1 = GG |
| ABCB1_CC (rs1045642) | ABCB1 C3435T rs1045642 | Categorical | 0 = Other; 1 = CC |
| ABCB1_CT (rs1045642) | | | 0 = Other; 1 = CT |
| ABCB1_TT (rs1045642) | | | 0 = Other; 1 = TT |
| MATE1_AA (rs2289669) | MATE1 G>A rs2289669 | Categorical | 0 = Other; 1 = AA |
| MATE1_AG (rs2289669) | | | 0 = Other; 1 = AG |
| MATE1_GG (rs2289669) | | | 0 = Other; 1 = GG |
| ABCC2_CC (rs717620) | ABCC2 -24C>T rs717620 | Categorical | 0 = Other; 1 = CC |
| ABCC2_CT (rs717620) | | | 0 = Other; 1 = CT |
| ABCC2_TT (rs717620) | | | 0 = Other; 1 = TT |
| MATE1_CC (rs2252281) | MATE1 g-66T>C rs2252281 | Categorical | 0 = Other; 1 = CC |
| MATE1_CT (rs2252281) | | | 0 = Other; 1 = CT |
| MATE1_TT (rs2252281) | | | 0 = Other; 1 = TT |

| Variables | Definition | Type | Numeric description |
|---|---|---|---|
| ERCC1_AC (rs3212986) | | | 0 = Other; 1 = AC |
| ERCC1_CC (rs3212986) | | | 0 = Other; 1 = CC |
| NAT2*5_CC (rs1799929) | NAT2*5 C>T rs1799929 | Categorical | 0 = CT; 1 = CC |
| NAT2*6_AA (rs1799930) | NAT2*6 G>A rs1799930 | Categorical | 0 = Other; 1 = AA |
| NAT2*6_AG (rs1799930) | | | 0 = Other; 1 = AG |
| NAT2*6_GG (rs1799930) | | | 0 = Other; 1 = GG |
| NAT2*7_AA (rs1799931) | NAT2*7 G>A rs1799931 | Categorical | 0 = Other; 1 = AA |
| NAT2*7_AG (rs1799931) | | | 0 = Other; 1 = AG |
| NAT2*7_GG (rs1799931) | | | 0 = Other; 1 = GG |
| GSTP1_AA (rs1695) | GSTP1 A313G rs1695 | Categorical | 0 = Other; 1 = AA |
| GSTP1_AG (rs1695) | | | 0 = Other; 1 = AG |
| GSTP1_GG (rs1695) | | | 0 = Other; 1 = GG |
| NQO_CC (rs1800566) | NQO1 C609T rs1800566 | Categorical | 0 = Other; 1 = CC |
| NQO_CT (rs1800566) | | | 0 = Other; 1 = CT |
| NQO_TT (rs1800566) | | | 0 = Other; 1 = TT |
| UGT1A7_CC (rs11692021) | UGT1A7 T622C rs11692021 | Categorical | 0 = Other; 1 = CC |
| UGT1A7_CT (rs11692021) | | | 0 = Other; 1 = CT |
| UGT1A7_TT (rs11692021) | | | 0 = Other; 1 = TT |
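
The coding rules in this code book can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the code used in this study: the function names `body_surface_area` and `encode_genotype` are hypothetical, and the sketch assumes that each genotype of a SNP is stored as its own 0/1 indicator column (1 = that genotype, 0 = Other), as the variable names above suggest.

```python
import math


def body_surface_area(height_cm: float, weight_kg: float) -> float:
    """Body surface area (m^2) = sqrt(height (cm) x weight (kg) / 3600),
    the formula given in the code book."""
    return math.sqrt(height_cm * weight_kg / 3600)


def encode_genotype(gene: str, genotype: str, levels: tuple) -> dict:
    """Return one 0/1 indicator per genotype level (hypothetical helper).

    The observed genotype is coded 1 and every other level is coded 0,
    mirroring the "0 = Other; 1 = <genotype>" scheme in the code book.
    """
    if genotype not in levels:
        raise ValueError(f"unexpected genotype {genotype!r} for {gene}")
    return {f"{gene}_{level}": int(genotype == level) for level in levels}


if __name__ == "__main__":
    # Illustrative patient values, not taken from the study data.
    print(round(body_surface_area(170, 65), 3))
    print(encode_genotype("ABCB1", "CT", ("CC", "CT", "TT")))
```

With three indicator columns per SNP, exactly one column equals 1 for each patient, so the columns are mutually exclusive.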

Appendix 3. The plot of sequential backward selection (SBS) by logistic regression (LR), random
forest (RF), and support vector machine (SVM) of clinical and genomic mode in five-fold cross
validation

Appendix 4. The plot of sequential backward selection (SBS) by logistic regression (LR), random
forest (RF), and support vector machine (SVM) of clinical and genomic mode in leave-one-out cross
validation

Appendix 5. The plot of sequential backward selection (SBS) by logistic regression (LR), random
forest (RF), and support vector machine (SVM) of clinical and genomic mode in cisplatin subgroup
analysis
