
Prediction of Lung Cancer using Ensemble Classifiers

Ashwin G Shanbhag¹, Anurag Prabhu K², N V Subba Reddy, Ashwath Rao B

Department of Computer Science and Engineering,
Manipal Institute of Technology,
Manipal Academy of Higher Education,
Manipal, Karnataka- 576104

¹ ashwingshan98@gmail.com
² anuragprabhu122@gmail.com

Abstract. Automatic defect detection in CT images is essential for several diagnostic and
therapeutic applications. Because of the large quantity of data in CT images and blurred
boundaries, tumor segmentation and classification are extremely laborious. The goal is to
classify the tissues into the categories normal, benign, and malignant. The quantity of
information in such images is too large for manual interpretation and analysis. Over the past
few years, lung cancer detection in CT has become an emerging research area within the field of
medical imaging. Correct detection of the size and location of lung cancer plays an important
role in its diagnosis. In this paper, we introduce an automatic lung cancer detection method to
increase accuracy and yield and to reduce diagnosis time. The method consists of four stages:
pre-processing of CT images, segmentation, feature extraction, and a classification step that
distinguishes benign from malignant. This work uses different models for detecting lung cancer
in a CT scan by building an ensemble classifier. The ensemble classifier includes five machine
learning models: SVM, LR, MLP, decision tree, and KNN. Standard parameters such as accuracy,
recall, and precision are calculated to assess the results of the classifier.

Keywords: Cancer Detection, Feature-Extraction, Classification, Machine-Learning Models,
Segmentation, Ensemble, lung cancer, Ensemble-Classifier, Random Forest.

1. INTRODUCTION
Cancer is one of the most dangerous diseases in human life. The most important task of the lungs is to
take oxygen into the body and to remove CO2 from the body during vital activities. Lung cancer occurs as
a result of the uncontrolled proliferation of tissues and cells in the lungs. When these masses grow
uncontrolled in their environment, they can damage the surrounding tissues. Lung cancer is the leading
cause of cancer death among males and the second among females; close to 1.3 million people die each
year worldwide due to lung cancer. In Turkey, 30-40 thousand people are diagnosed with lung cancer each
year [2]. When cancer develops, the orderly process of cell growth and death breaks down. As cells
become more and more abnormal, old or damaged cells survive when they should die, and new cells form
when they are not needed. These extra cells can divide without stopping and may form growths referred
to as tumors. A tumor can then spread to different parts of the body [6].
Tumors are of two types, benign and malignant. A benign (non-cancerous) tumor is a mass of cells that
cannot spread to other parts of the body, whereas a malignant (cancerous) tumor is a growth of cells that
can spread to other parts of the body; this spreading is called metastasis. There are various types of
cancer, such as lung cancer, leukemia, and colon cancer. The incidence of lung cancer has significantly
increased since the early 19th century. Lung cancer has various causes, such as smoking, exposure to
radon gas, secondhand smoke, and exposure to asbestos. Lung cancer is of two types: small cell lung
cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC is more common than SCLC, and it generally
grows and spreads more slowly. SCLC is almost always related to smoking; it grows more quickly and forms
large tumors that can spread widely through the body. With the rapid growth in population, the incidence
of diseases such as cancer, chikungunya, and cholera is increasing [1]. Harmful nodules can be detected
at an earlier stage by radiologists using computed tomography (CT) and other scanning techniques [19].
They usually begin in the bronchi close to the center of the chest. Symptoms that may suggest lung
cancer include shortness of breath with activity, coughing up blood, chronic coughing or a change in the
regular coughing pattern, wheezing, abdominal pain, weight loss, fatigue, loss of appetite, speech
difficulty, dysphagia (difficulty swallowing), and pain in the shoulder, chest, or arm [20]. To diagnose
lung cancer, numerous techniques are used, such as chest X-ray, CT scan, and MRI, through which doctors
can determine the location of the tumor and decide on the treatment accordingly [7].
The challenge facing medical practitioners gives this study greater significance. The challenge is
detecting cancer in its early stages, since symptoms appear only in the advanced stages, which causes
the mortality rate of lung cancer to be the highest among all types of cancer. Correct diagnosis of the
various forms of cancer plays a crucial role in helping doctors determine and select the right
treatment. Undeniably, the decisions made by doctors are the most important factors in diagnosis;
recently, however, various AI classification techniques have proven helpful in facilitating the doctors'
decision-making process. Possible errors that may occur because of inexperienced doctors can be reduced
by using classification techniques. Such a system can also examine medical data in a shorter time and
more precisely [13].
Feature extraction reduces the amount of resources required to describe a large set of data. The
proposed work uses machine learning methods to detect tumor cells in the body more accurately. After
feature extraction, ML techniques are applied to the selected features to extract sensitive values from
the given data and to recognize tumor cells [10]. Pre-diagnosis helps to identify or narrow down the
need for lung cancer screening. Symptoms and risk factors (smoking, alcohol consumption, obesity, and
insulin resistance) had a statistically significant effect in the pre-diagnosis stage. Lung cancer
diagnostic and prognostic problems fall mainly within the scope of the widely discussed classification
problems. These problems have attracted many researchers in the computational intelligence, data mining,
and statistics fields [11]. Hence, the main goal is to build a framework that helps clinical specialists
cross-check their analyzed results of predicted lung cancer, because the existing diagnosis process is
time-consuming, laborious, and expensive. By and large, this deep learning-based tool can determine
tumor growth and predict stages. Since it is an automated tool based on image processing and AI, it
minimizes human effort in predicting the presence of cancer cells from the image [4].
In this paper, we discuss the approach of predicting cancer from CT scanned images by building an
ensemble classifier and the results of the same are analyzed. The rest of the paper is organized as follows:
Section 2 gives the brief detail of previously carried works in the field of prediction, Section 3 broadly
discusses the process of lung cancer prediction. Section 4 gives the detailed analysis of the report to
support the proposed methodology, Section 5 concludes the paper, and Section 6 has the references used
in this paper.

2. LITERATURE SURVEY
Many approaches for the prediction of cancer have already been proposed by various researchers. Among
them, Nikita Banerjee et al. [1] and Özge Günaydin et al. [2] proposed various methods for detecting
cancer in its early stages. In these works, machine learning models are used to detect lung cancer
nodules. They applied K-Nearest Neighbors, Support Vector Machines, Naïve Bayes, Decision Trees, and
Artificial Neural Networks to detect anomalies, and compared all methods both with and without
preprocessing.
Syed Saba Raoof et al. [3] and Radhika P R et al. [18] note that detection, prediction, and diagnosis of
lung cancer have become essential because they expedite and simplify the subsequent clinical management.
Machine learning techniques are used to predict the progression and treatment of cancerous conditions
because of their accurate outcomes. Various machine learning algorithms, such as Naïve Bayes, Support
Vector Machine, and logistic regression, are applied in the healthcare sector for the analysis and
prognosis of lung cancer.
Swati Mukherjee et al. [4] observe that the analysis and study of lung diseases has long been one of the
most intriguing areas of investigation for doctors. To address this concern, a diagnosis system of this
kind can help reduce the risk to human life through early discovery of malignant growth. The machine
learning approach offers a new opportunity to improve decision support in lung cancer treatment at low
cost. Wasudeo Rahane et al. [5], Kyamelia Roy et al. [6], Amrit Sreekumar et al. [7], and Junjie Zhang
et al. [20] describe lung cancer detection systems in which image processing and machine learning are
used to classify the presence of lung cancer in CT images and blood samples. CT scan reports are more
effective; therefore, patient CT scan images are categorized as normal and abnormal. The abnormal images
are subjected to segmentation to focus on the tumor portion. Classification is done on features
extracted from the images.
Sanjukta Rani Jena et al. [8] and Öztürk et al. [17] proposed a model in which five types of feature
extraction techniques were used with individual classification algorithms to determine which machine
learning algorithm gives higher accuracy with which feature extraction technique. Dendi Gayathri Reddy
et al. [9] proposed a model that is efficient in predicting the stages of lung cancer by applying
machine learning algorithms. It is a combination of K-Nearest Neighbors, Decision Tree, and Neural
Network models together with an ensemble method for enhancing the accuracy of the overall prediction.
The reported results of the suggested model show higher accuracy compared to the individual algorithms.
M. Siddardha Kumar et al. [10] describe pre-processing procedures that are used to obtain accurate
outcomes. In the preprocessing step, a morphological technique is used to remove unwanted data from the
image. Feature extraction is then used to reduce the original dataset by deriving a set of transformed
features, and different techniques are used to extract the geometrical and statistical properties of the
image. V. Krishnaiah et al. [11] and Muhammad Imran Faisal et al. [12] aim to propose a model for early
detection and correct diagnosis of the disease, which can help the doctor in saving the life of the
patient. This research attempts to evaluate the discriminative power of several predictors in order to
increase the efficiency of lung cancer detection through symptoms. Several classifiers including
Decision Tree, Multi-Layer Perceptron, Neural Network, and Naïve Bayes are evaluated on a benchmark
dataset obtained from the UCI repository. The performance is also compared with well-known ensembles
such as Random Forest and Majority Voting.
Fenwa et al. [13] proposed a model in which features such as contrast and brightness are extracted from
an image dataset using texture-based feature extraction. Two ML algorithms, ANN and SVM, are then
applied to these features, and their performance is evaluated to compare which algorithm gives higher
accuracy. Maisa Daoud et al. [14] analyze existing studies and show that neural network methods are used
either for filtering (data engineering) the gene expressions in a step prior to prediction; for
predicting the existence of cancer, the cancer type, or the survivability risk; or for clustering
unlabeled samples. Their paper also discusses some practical issues to consider when building a neural
network-based cancer prediction model. Results indicate that the functionality of the neural network
determines its general architecture. Palani et al. [15] proposed IoT-based predictive modeling that uses
fuzzy C-means clustering for segmentation and an incremental classification algorithm based on
association rule mining and decision trees for classifying the tumor sets. Based on the output generated
by the incremental classification model, a convolutional neural network is applied with other features
to predict whether the tumor is benign or malignant.
Lynch et al. [16] implemented various machine learning algorithms for predicting the survivability rate
of a person, with performance measured by root mean square error. Every model is trained using 10-fold
cross-validation; because the parameters are preset by assigning default values, cross-validation is
used to avoid overfitting. Şaban Öztürk et al. [17] note that the classification of histopathologic
images and the identification of cancerous areas are quite difficult because of image background
complexity and resolution. The difference between normal tissue and cancerous tissue is very small in
some cases, so the features of the tissue patches in the image are of key importance for automatic
classification. Sumathipala et al. [19] proposed a model in which the image data is taken from
LIDC-IDRI. After the image data is collected, filtering is applied based on patients who went through
biopsy and whose nodule level is equal to 30. The images whose nodule level is equal to 30 are then
segmented, and logistic regression and random forest are applied for prediction. Using these concepts,
we introduce a novel approach to predict cancer using ensemble techniques, which is discussed in detail
in the next section.

3. METHODOLOGY
In this section we discuss the detailed approach for predicting lung Cancer from CT scanned images by
extracting the region-based features and an ensemble classifier. The blueprint of the process is shown in
Figure 1.

3.1 Pre-Processing Layer


The primary objective of pre-processing is to enhance image quality by diminishing or removing the
irrelevant parts of the images. Noise and other high-frequency components are removed by filters, which
prepares the dataset for further processing [8].
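As an illustration of the noise-removal step, the following is a minimal sketch of a median filter. The paper does not state which filter was used, so the choice of a median filter, the 3x3 window, and the toy image are assumptions made only for this example.

```python
from statistics import median


def median_filter(image, size=3):
    """Return a median-filtered copy of a 2-D list of grey levels.

    Border pixels use only the neighbours that fall inside the image.
    """
    rows, cols = len(image), len(image[0])
    half = size // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = median(window)
    return out


# Hypothetical 5x5 "slice": flat background of 10 with one salt-noise pixel.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
denoised = median_filter(noisy)
```

On this toy input the isolated bright pixel is replaced by the background value, while the flat background is left unchanged.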

Figure 1. Process of Predicting cancer from images

3.2 Segmentation Layer


Image segmentation is the process of partitioning a digital image into multiple segments (image objects)
using thresholding techniques. We used the Otsu thresholding technique, which yields a threshold value
appropriate for binarizing the image. Filters are then used to optimize the binary image. Filtering is a
procedure to change or improve the picture, for example, to highlight certain features or eliminate
others; it incorporates smoothing, noise removal, and edge enhancement [4]. After this, we remove all
the unwanted edges, which are noise in the image, using edge detection techniques. Then we generate
labels for the image and plot it.
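The thresholding step can be sketched as follows. This is a plain-Python illustration of Otsu's method (choosing the grey level that maximizes the between-class variance), not the implementation used in this work; it assumes 8-bit grey levels, and the 3x3 image is hypothetical. The edge-removal and labelling steps are not shown.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of grey levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Hypothetical image: dark background with a bright region on the right.
image = [
    [20, 30, 200],
    [25, 210, 205],
    [22, 28, 198],
]
t = otsu_threshold([p for row in image for p in row])
mask = [[p > t for p in row] for row in image]  # True marks the bright region
```

For real CT slices a library routine such as `skimage.filters.threshold_otsu` would normally be preferred over a hand-written loop.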

3.3 Feature Extraction Layer


The output generated by segmentation is used for feature extraction, in which certain features of
interest within an image are detected and represented for further processing. Segmenting the lungs from
the rest of the CT scan reduces the problem space, and hence feature extraction becomes more effective.
It is a critical step as it marks the transition from pictorial to alphanumerical data representation
[4].
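To make the transition from pictorial to numerical data concrete, the sketch below computes a few region-based features (area, perimeter, centroid, mean intensity) from a binary mask. It is an illustration under simplifying assumptions: the mask holds a single region, the perimeter is approximated by counting boundary pixels, and the function name and toy data are hypothetical. Solidity and eccentricity are omitted.

```python
def region_features(mask, image):
    """Compute simple features of the foreground region in a binary mask.

    mask  : 2-D list of booleans (True = region pixel)
    image : 2-D list of grey levels with the same shape
    """
    rows, cols = len(mask), len(mask[0])
    coords = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    area = len(coords)
    if area == 0:
        return None
    centroid = (sum(r for r, _ in coords) / area,
                sum(c for _, c in coords) / area)
    mean_intensity = sum(image[r][c] for r, c in coords) / area

    def is_background(r, c):
        # Positions outside the image count as background.
        return not (0 <= r < rows and 0 <= c < cols) or not mask[r][c]

    # Perimeter approximated as the number of region pixels that touch
    # background through a 4-connected neighbour.
    perimeter = sum(
        1 for r, c in coords
        if any(is_background(r + dr, c + dc)
               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
    )
    return {"area": area, "perimeter": perimeter,
            "centroid": centroid, "mean": mean_intensity}


# Hypothetical 5x5 example: a 3x3 square region of intensity 100.
mask = [[1 <= r <= 3 and 1 <= c <= 3 for c in range(5)] for r in range(5)]
image = [[100 if mask[r][c] else 0 for c in range(5)] for r in range(5)]
features = region_features(mask, image)
```

Each segmented region is thus reduced to a short feature vector that a classifier can consume.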

3.4 Classification Layer


After feature extraction, we apply a classification technique to classify the tumor as benign or
malignant using different ensemble techniques. After classification, it can be predicted whether the
tumor is cancerous or not, and with which features we obtain a more accurate prediction [6]. The machine
learning models used to create a simple ensemble classifier are described below.
SVM [3] SVM is one of the most prominent techniques for prediction, regression, and classification. It
classifies the input dataset by introducing a boundary known as a hyperplane that separates the dataset
into two parts. SVM is a data-driven approach that can produce a correct classification without a
hypothetical model. It is one of the common models used for classification. The SVM classifier is used
to classify linear and non-linear regions. In linear separation, the classifier separates the affected
and non-affected regions in the image; it primarily uses soft and hard margins, and the linear equations
are homogenized. In non-linear separation, the affected region is separated by a non-linear boundary
[5].
Linear Regression [16] The simplest method implemented is linear regression, one of the oldest and most
widely used correlational techniques. The goal of the method is to fit a straight line to a set of data
points using a series of coefficients multiplied by each input, like a weighting function, and an
intercept. The weights are chosen within the linear regression function so as to minimize the mean
squared error. These weight coefficients multiplied by the respective inputs, plus an intercept, give a
general function for the outcome, patient survival time. In this way, linear regression is easy to
understand and quick to implement, even on larger datasets.
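The line-fitting idea behind linear regression can be illustrated with the closed-form least-squares solution for a single input. This is a sketch only: the one-feature setting, the function name, and the data points are assumptions for the example and are not taken from the experiments.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # weight applied to the input
    intercept = mean_y - slope * mean_x    # constant term
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept
```

With several inputs the same principle applies, with one weight per input chosen to minimize the squared error.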
MLP [9] A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a set of
outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed
graph between the input and output layers. MLP uses backpropagation for training the network. The
backpropagation technique is predominantly employed in neural networks for making predictions. For
every new instance given, the error is backpropagated to correct the edge weights. An input layer, one
or more hidden layers, and an output layer are the elements of a neural network. Every layer is made up
of units.
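The layered structure described above can be sketched as a forward pass through fully connected layers. The sigmoid activation, the layer sizes, and the weights below are assumptions for illustration; training by backpropagation is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each unit outputs sigmoid(w . x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate an input vector through a list of (weights, biases) layers."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
output = mlp_forward([0.6, 0.9], layers)
```

The single output lies between 0 and 1 and could be thresholded (for example at 0.5) to produce a benign or malignant label.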
Using the machine learning models discussed above, we construct a classifier by means of an ensemble
technique called max-voting [12], as shown in Figure 2. The voting ensemble technique is a common
example of the multi-expert approach, which combines the classifiers in a parallel fashion. Every
classifier is trained on all the data and contributes to a decision. Finally, the voting technique
yields the final solution. This helps to increase the accuracy by combining the advantages of each
classifier. The mechanism for improved performance with ensembles is often the reduction in the variance
component of the prediction errors made by the contributing models.
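A minimal sketch of max-voting is given below: each model contributes one label per sample, and the most frequent label wins. The per-model predictions are hypothetical values chosen for illustration, not results from this work.

```python
from collections import Counter


def max_vote(predictions):
    """Return the majority label for each sample.

    predictions: list of per-model label lists, all of the same length.
    """
    voted = []
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical outputs of the five base models for four samples
# (0 = benign, 1 = malignant).
svm_pred = [1, 0, 1, 1]
lr_pred = [1, 0, 0, 1]
mlp_pred = [0, 0, 1, 1]
tree_pred = [1, 1, 1, 0]
knn_pred = [1, 0, 0, 0]

ensemble_pred = max_vote([svm_pred, lr_pred, mlp_pred, tree_pred, knn_pred])
```

With five voters and two classes a tie cannot occur. In scikit-learn the same idea is available as `VotingClassifier` with `voting="hard"`.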

Figure 2. Ensemble-Classifier

Random Forest [16] The Random Forest technique generates a number of decision trees during training that
are allowed to split randomly from a seed point. This results in a "forest" of randomly generated
decision trees whose outcomes are combined by the Random Forest algorithm to predict more accurately
than a single tree does alone. Individual decision trees can be viewed as if-then-else rules that can be
generated from the dataset directly, making them one of the more human-understandable techniques. One
problem with a single decision tree is overfitting, which makes the predictions appear excellent on the
training data but unreliable for future predictions.

The results of the built ensemble classifier and RF are discussed in the next section. The steps of the
algorithm are as follows:

Input: Standard CT scan image data
Output: Classification as benign or malignant

Step 1: Input the CT scanned image dataset
Step 2: Pre-process the image
Step 2.1: If the image is noise-free, go to Step 3
Else filter the image and go to Step 2.1
Step 3: Segment the image
Step 3.1: Segment the boundary of the output image generated at Step 2.1 using Otsu thresholding
Step 3.2: Remove all the unwanted edges using edge detection techniques
Step 4: Feature extraction
Step 4.1: Extract region-based features such as area, perimeter, centroid, solidity, mean, and
eccentricity
Step 5: Apply a classification algorithm for training and prediction of the tumor as benign or malignant
Step 6: Evaluate and analyze the result based on the different parameters
Step 7: End

Testing the algorithm on random data may lead to different results on each run, which may misrepresent
the prediction rate of the model. To avoid this shortcoming, we used standard CT scanned images of
lungs [21].

4. RESULTS AND ANALYSIS
In this section, we discuss and analyze the results of the built ensemble classifier with Random-Forest
based on the different parameters. To analyze the result, we have used the data [21] that consists of CT
scanned images of Lungs. It has 561 images belonging to class 1 and 416 images belonging to class 0
where class 0 refers to Benign and 1 refers to Malignant.

4.1 Confusion matrix


The confusion matrix gives a detailed description of classification and misclassification in the form of
a matrix. In binary classification, four categories of outcomes can be formed: True-Positive (TP),
False-Positive (FP), True-Negative (TN), and False-Negative (FN), as shown in Table 1. These are later
used to calculate different performance evaluation metrics. The True-Positive (TP) samples are the
samples of most concern, while the False-Negative (FN) samples are merely rejected/discarded samples.

Table 1. contingency table for predicted vs real output (logical details of binary classification)
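The four categories can be counted directly from the true and predicted labels. The sketch below assumes class 1 (malignant) is the positive class; the label vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # predicted malignant, actually malignant
            else:
                fp += 1   # predicted malignant, actually benign
        else:
            if truth == positive:
                fn += 1   # predicted benign, actually malignant
            else:
                tn += 1   # predicted benign, actually benign
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels for eight samples (0 = benign, 1 = malignant).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
```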

4.2 Accuracy
Accuracy is the proportion of correct predictions versus the total number of predictions made. Accuracy is
mainly used for measuring the performance of a classifier.

\[
\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total number of predictions made}}
\]

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}
\]
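Equation (1) translates directly into code. The sketch below also includes precision and recall using their standard definitions, which are mentioned but not defined in this section; the counts are hypothetical and do not correspond to reported results.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): share of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


def precision(tp, fp):
    """Standard definition: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Standard definition: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical confusion-matrix counts.
tp, tn, fp, fn = 50, 40, 5, 5
acc = accuracy(tp, tn, fp, fn)   # (50 + 40) / 100
prec = precision(tp, fp)         # 50 / 55
rec = recall(tp, fn)             # 50 / 55
```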

5. CONCLUSION
In this paper, we propose a novel algorithm for detecting lung cancer in a CT scan by building an
ensemble classifier, and the results are then compared with the RF classifier. The ensemble classifier
includes five machine learning models: SVM, LR, MLP, decision tree, and KNN. The proposed model gives an
overview of the prediction of lung cancer at an early stage. After predicting whether the tumor is
malignant or benign, we generate a confusion matrix for each machine learning technique, and based on
the confusion matrix we calculate accuracy, recall, precision, and F1 score.
In the future, deep learning techniques can be used for the prediction of lung cancer. A wider range of
images, such as X-ray, CT, MRI, and PET, can also be considered, which may yield higher accuracy,
thereby helping medical practitioners provide fast preventive care at low cost.

References
[1] Nikita Banerjee, Subhalaxmi Das. “Prediction Lung Cancer– In Machine Learning Perspective”
International Conference on Computer Science, Engineering and Applications (ICCSEA) (2020).
[2] Ozge Gunaydin, Melike Gunay, Oznur Sengel. “Comparison of Lung Cancer Detection Algorithms”
Scientific Meeting on Electrical-Electronics & Biomedical Engineering, Computer Science (EBBT)
IEEE 2019.
[3] Syed Saba Raoof, M A. Jabbar, Syed Aley Fathima. "Lung Cancer Prediction using Machine
Learning: A Comprehensive Approach" Second International Conference on Innovative Mechanisms
for Industry Applications (ICIMIA 2020).
[4] Swati Mukherjee, Prof. S. U. Bohra. "Lung Cancer Disease Diagnosis Using Machine Learning
Approach" Third International Conference on Intelligent Sustainable Systems [ICISS 2020].
[5] Wasudeo Rahane, Himali Dalvi. “Lung Cancer Detection Using Image Processing and Machine
Learning HealthCare” IEEE International Conference on Current Trends toward Converging
Technologies, Coimbatore, India, IEEE 2018.
[6] Kyamelia Roy, Sheli Sinha Chaudhury, Madhurima Burman, Ahana Ganguly, Chandrima Dutta,
Sayani Banik, Rayna Banik. “A Comparative Study of Lung Cancer detection using supervised
neural network” International Conference on Opto-Electronics and Applied Optics (Optronix 2019).
[7] Amrit Sreekumar, Karthika Rajan Nair, Sneha Sudheer, Ganesh Nayar H, and Jyothisha J Nair.
"Malignant Lung Nodule Detection using Deep Learning" International Conference on
Communication and Signal Processing, July 28 - 30, 2020, India.
[8] Sanjukta Rani Jena, Dr. Thomas George, Dr. Narain Ponraj. "Texture Analysis Based Feature
Extraction and Classification of Lung Cancer" International Conference on Electrical, Computer and
Communication Technologies (ICECCT) IEEE 2019.
[9] DendiGayathri Reddy, Emmidi Naga Hemanth Kumar, Desireddy Lohith Sai Charan Reddy, Monika
P "Integrated Machine Learning Model for Prediction of Lung Cancer Stages from Textual data
using Ensemble Method". 1st International Conference on Advances in Information Technology
IEEE 2019.
[10] M.Siddardha Kumar, Prof.Dr.K.Venkata Rao. "Prediction Of Lung Cancer Using Machine Learning
Technique: A Survey" International Conference on Computer Communication and Informatics
(ICCCI -2021), Jan. 27 – 29, 2021, Coimbatore, India.
[11] Krishnaiah, V., G. Narsimha, and Dr N. Subhash Chandra. "Diagnosis of lung cancer prediction
system using data mining classification techniques." International Journal of Computer Science and
Information Technologies 4.1 (2013): 39-45.
[12] Muhammad Imran Faisal, Saba Bashir, Zain Sikandar Khan, Farhan Hassan Khan. "An Evaluation
of Machine Learning Classifiers and Ensembles for Early-Stage Prediction of Lung Cancer" 3rd
International Conference on Emerging Trends in Engineering, Sciences, and Technology (ICEEST
2018).
[13] Fenwa, Olusayo D., Funmilola A. Ajala, and A. Adigun. "Classification of cancer of the lungs using
SVM and ANN." Int. J. Comput. Technol. 15.1 (2016): 6418-6426.
[14] Daoud, Maisa, and Michael Mayo. "A survey of neural network-based cancer prediction models
from microarray data." Artificial intelligence in medicine (2019).
[15] Palani D, K. Venkatalakshmi. "An IoT-based predictive modeling for predicting lung cancer using
fuzzy cluster-based segmentation and classification." Journal of medical systems 43.2 (2019): 21.
[16] Lynch, Chip M., et al. "Prediction of lung cancer patient survival via supervised machine learning
classification techniques." International journal of medical informatics 108 (2017): 1-8.
[17] Ozturk, Şaban, and Bayram Akdemir. "Application of feature extraction and classification methods
for the histopathological image using GLCM, LBP, LBGLCM, GLRLM, and SFTA." Procedia
computer science 132 (2018): 40-46.
[18] Radhika P R, Rakhi.A.S.Nair. "A Comparative Study of Lung Cancer Detection using Machine
Learning Algorithms". International Conference on Electrical, Computer and Communication
Technologies (ICECCT) IEEE 2019.
[19] Sumathipala, Yohan, et al. "Machine learning to predict lung nodule biopsy method using CT image
features: A pilot study." Computerized Medical Imaging and Graphics 71 (2019): 1-8.
[20] Zhang, Junjie, et al. "Pulmonary nodule detection in medical images: a survey." Biomedical Signal
Processing and Control 43 (2018): 138-147.
[21] alyasriy, hamdalla (2020), "The IQ-OTHNCCD lung cancer dataset", Mendeley Data, V1, DOI:
10.17632/bhmdr45bh2.1
