Abstract—Breast cancer is the second most dangerous cancer in the world. Many women die of breast cancer, not only in India but everywhere in the world. In 2011, the USA reported that one in eight women suffered from cancer. Breast cancer develops due to abnormal cell division in the breast itself, which results in the formation of either a benign or a malignant tumour. It is therefore very important to predict breast cancer at an early stage, since many lives can be saved by providing proper treatment. This paper aims to give a comparative study by applying different machine learning algorithms, namely Support Vector Machine, K-Nearest Neighbour, Naïve Bayes, Decision Tree, K-Means and Artificial Neural Networks, on the Wisconsin Diagnostic dataset to predict breast cancer at an early stage.

Keywords— Artificial neural network, breast cancer, support vector machine, human disease

I. INTRODUCTION

Cancer that develops in the breast is called breast cancer, and it is the second most dangerous cancer in the world. It is responsible for the death of many women around the globe. More than thirteen thousand Indians die every day due to cancer, according to the National Cancer Registry Programme of the Indian Council of Medical Research (ICMR). Between 2012 and 2014, the mortality rate due to cancer increased by approximately 6% [1]. Most doctors perform a biopsy to check whether the patient has cancer and whether it is benign or malignant. A benign cancer can be described as a "non-cancerous cancer" because it does not spread to other parts of the body, whereas a malignant cancer is fatal because it spreads throughout the body and is uncontrollable.

A cure for cancer has not been found yet. The only way to save the life of a person suffering from it is to remove the affected portion of the body. The best way is to save a person from even developing cancer, and that can be achieved by a timely diagnosis.

Machine learning is a technique that can learn and retrieve information from data and use that 'gained' experience to predict the required outcomes. Machine learning algorithms have provided great assistance in many fields, including early-stage cancer prediction.

In computer science, machine learning can be classified into three types: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, labelled data is available, and based on that labelled data the machine predicts the label of unlabelled input features, whereas in unsupervised learning all the features are available without any label or output class. Reinforcement learning is the technique of letting the models learn on their own. In this type of learning the machine performs a specific task and is either rewarded or penalized based on a set of defined rules, and it mainly focuses on maximizing the total reward.

There are several machine learning techniques that can help to predict whether the affected person has a benign or a malignant cancer, making the process more efficient and less error-prone. In this paper, the authors used a total of six machine learning algorithms to predict breast cancer.

The remaining sections are organized as follows. Section 2 describes the existing literature. Section 3 explains the methodology of the implemented work. Section 4 describes the dataset used, along with the experimental results. Section 5 concludes the work and outlines future scope.

II. RELATED WORK

Cancer is among the most dangerous diseases in the world. There are different types of cancer, such as lung cancer, breast cancer, brain cancer, sarcoma, carcinoma, etc. Of these, breast cancer is the second most dangerous in terms of death ratio. If
Authorized licensed use limited to: University of Exeter. Downloaded on June 23,2020 at 06:43:27 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Fifth International Conference on Inventive Computation Technologies (ICICT-2020)
IEEE Xplore Part Number:CFP20F70-ART; ISBN:978-1-7281-4685-0
there is no proper treatment or diagnosis in time, it tends to be fatal. Many researchers have worked on predicting breast cancer by applying different machine learning algorithms, such as Decision Tree (DT) [2], Neural Network (NN) [3], Random Forest (RF) [4], Naïve Bayes (NB) [5], Linear Regression (LR) [6], Support Vector Machine (SVM) [7], and many more, on different datasets. Table 1 summarizes the existing literature. In Table 1, column 1 gives the authors and year, column 2 the algorithms used, and column 3 the accuracy of those algorithms. Columns 4 and 5 show the advantage and disadvantage, respectively, of the applied techniques.
TABLE 1. COMPARATIVE LITERATURE REVIEW OF EXISTING ALGORITHMS

| Authors, Year | Algorithms | Accuracy (%) | Advantage | Disadvantage |
| Thomas Noel, Hiba Asri, Hajar Mousannif, Hassan Al Moatassime, 2016 | C4.5, SVM, NB, K-NN | 95.13, 97.13, 95.99, 95.27 | Risk of overfitting is less | Best parameters are needed for correct classification |
| Yixuan Li, Zixuan Chen, 2018 | DT, SVM, RF, LR, NN | 96.1, 95.1, 96.1, 93.7, 95.6 | It runs efficiently on large databases | They are much harder to construct and are time taking |
| Priyanka Gupta, Prof. Shalini L, 2018 | CART, RF, K-NN, Boosted Trees | 92.35, 96.47, 97, 96.47 | It is very robust and can be simply implemented on classification datasets | It does not learn anything from the training set and just uses it to predict the class of the test points |
| Jabeen Sultana, Abdul Khader Jilani, 2018 | LR, MLP, RF, DT | 97.18, 95.25, 95.25, 93.14 | If all the useful independent variables have been identified and used, then it is a good classifier | Doesn't perform well with a large dataset |
| Kriti Jain, Megha Saxena, Shweta Sharma, 2018 | SVM | 97.13 | Risk of overfitting is less | Best parameters are needed for correct classification |
| Puneet Yadav, Rajat Varshney, Vishan Kumar Gupta, 2018 | Decision Tree, SVM | 90.29, 94.5 - 97 | DT require little effort for their preparation | They are unstable and complex to understand when there are many outcomes |
III. METHODOLOGY

A. Data Collection

To predict breast cancer, the authors used the Wisconsin Diagnostic dataset [8] collected from the UCI Machine Learning Repository. It contains 699 instances with a total of eleven features. Of these eleven features, ten are treated as input features and the remaining one as the output feature. The whole dataset is divided into training and testing instances in the ratio of 80:20. That is, out of 699 instances, about 560 were used as the training dataset and the remaining 140 as the testing dataset.

B. Data Pre-processing

The Wisconsin Diagnostic dataset for breast cancer prediction has some missing values, so data pre-processing was applied to handle them. The Bare Nuclei column has missing entries in the form of a '?' string, which need to be imputed. Sixteen such missing values were found in this feature. They were replaced by the mean value of the feature. On the other hand, attributes such as the sample code number have no relevance in predicting breast cancer, so such attributes were dropped from the dataset.
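The pre-processing and splitting steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the column names (`sample_code_number`, `bare_nuclei`), the toy rows, the shuffling seed and the choice to round the test size up are all assumptions made for the example.

```python
import math
import random


def impute_mean(rows, column):
    """Replace '?' entries in `column` with the mean of the known values."""
    known = [float(r[column]) for r in rows if r[column] != "?"]
    mean = sum(known) / len(known)
    return [
        {**r, column: mean if r[column] == "?" else float(r[column])}
        for r in rows
    ]


def drop_column(rows, column):
    """Remove an attribute that carries no predictive information."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle and split; the test size is rounded up (an assumption)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows with hypothetical column names; values are illustrative only.
toy_rows = [
    {"sample_code_number": 1, "bare_nuclei": "1", "label": 2},
    {"sample_code_number": 2, "bare_nuclei": "10", "label": 4},
    {"sample_code_number": 3, "bare_nuclei": "?", "label": 2},
    {"sample_code_number": 4, "bare_nuclei": "4", "label": 4},
    {"sample_code_number": 5, "bare_nuclei": "?", "label": 2},
]

cleaned = drop_column(impute_mean(toy_rows, "bare_nuclei"), "sample_code_number")
train_rows, test_rows = train_test_split(cleaned)

# With 699 instances, an 80:20 split under this rounding gives 559 / 140.
train_699, test_699 = train_test_split(list(range(699)))
print(len(train_699), len(test_699))
```

Under this rounding convention the test set has 140 instances, which matches the size of the confusion matrices reported for the supervised classifiers in Table 2.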
C. Algorithm Used

Support Vector Machine: Support Vector Machine (SVM) is a fairly simple classification algorithm. The classifier is so named because it uses vectors in the feature space to determine the class of a new vector [9,10]. The Maximum Margin Hyper-plane (MMH) decides whether the new vector belongs to class one or class two. If the data point lies beyond the negative hyper-plane, or to the left of the MMH, it belongs to class one; otherwise it belongs to class two, where class one and class two are the two classes in a given situation. SVMs can also be used if there are more than two classes.

K-Nearest Neighbour: K-Nearest Neighbour (K-NN) is said to be the simplest and most straightforward classification algorithm. Unlike most machine learning algorithms, K-NN does not learn anything from the provided dataset and its attributes. It simply takes the points from the training data, finds the K nearest neighbours of the new data point using Euclidean distance [11], and assigns the point to the class that is most common among those K neighbours.

Naïve Bayes: Naïve Bayes (NB) is a machine learning algorithm based on Bayes' theorem, which works on the probability concepts given in equation 1.

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{1} \]

where P(A) is the prior probability, P(B) is the marginal likelihood, P(B|A) is the likelihood and P(A|B) is the posterior probability. The NB algorithm follows the above equation to determine the class of a data point. The posterior probability is calculated based on the position of the vector in the feature space, and the data point is then assigned to the class with the greater posterior probability [12].

Decision Tree: Decision Tree (DT) is a powerful machine learning algorithm used for both classification and regression [13]. A DT is a tree-like structure in which each internal node is a test condition that decides how the vector moves further, and the terminal nodes represent the class or the value to be predicted. DTs are good for classification with a few class labels but do not produce proper results if there are many classes and few training observations. Moreover, DTs can be computationally expensive to train.

... are moved a bit randomly, and this process repeats until there are no more reassignments of the data points to any centroid. Finally, the clusters are determined, and the process ends. The main drawback of K-Means is the random initialisation trap, which can lead to the formation of wrong clusters; it can be overcome by selecting 'K-Means++' as the initializer.

Artificial Neural Network: Artificial Neural Network (ANN) is one of the most advanced and powerful machine learning algorithms and can be used for many purposes such as classification, regression, voice recognition, targeted marketing, etc. A single unit of an ANN is called a perceptron. It was invented in 1957 by Frank Rosenblatt [15], who wanted to invent something that could learn and adjust itself to meet the required needs.

The way in which ANNs work is a bit complex but understandable. First, the input nodes are taken; the number of input nodes is equal to the number of independent variables. These inputs are then weighted through synapses and passed on to the next neuron, where an activation function is applied, and this process repeats until the weighted signals have reached the output layer. The output is taken as ŷ and the actual value is taken as y. The cost function (C) is then calculated according to equation 2.

\[ C = \frac{1}{2}\,(\hat{y} - y)^2 \tag{2} \]

The aim of the ANN here is to minimize the cost function, and this is achieved by the process of back propagation. If the value of the cost function is above a certain level, the information is fed back into the neural network and the weights are modified; this process is repeated until C is minimized. This flow is repeated for every observation in the dataset, and finally the ANN is ready for real-world classification.

Figure 1 shows the architecture of the ANN used. The authors use ten neurons in the input layer and two hidden layers of six neurons each. The output layer has only a single neuron, because the network only has to predict whether or not the patient suffers from breast cancer.
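As a rough illustration of the ANN description in this section (ten inputs, two hidden layers of six neurons, one output neuron, and a cost computed from the prediction and the actual value), the following sketch runs a single forward pass in plain Python. It is not the authors' implementation: the sigmoid activation, the random weight initialisation, the zero biases, the squared-error form of the cost, and the sample input are all assumptions made for the example, and back propagation is not shown.

```python
import math
import random


def sigmoid(z):
    """Logistic activation (assumed; the activation function is not stated)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One weight row per neuron, plus one bias per neuron."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, layers):
    """Propagate the inputs through every layer and return the single output."""
    activations = x
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def cost(y_hat, y):
    """Squared-error cost for one observation: C = 0.5 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2


rng = random.Random(0)
sizes = [10, 6, 6, 1]  # input layer, two hidden layers, output layer
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

sample = [0.5] * 10  # hypothetical, already-scaled feature vector
y_hat = forward(sample, layers)
c = cost(y_hat, 1.0)
print(f"prediction={y_hat:.3f} cost={c:.4f}")
```

In training, back propagation would adjust the entries of `layers` to reduce this cost over all observations; the sketch stops at evaluating it once.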
\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

[Figure 2. Comparison of models based on (a) Precision, (b) Recall, (c) Accuracy. Horizontal axis: classifiers.]

TABLE 2. RESULT ANALYSIS OF APPLIED TECHNIQUES

| Techniques | TP | FP | TN | FN | P | R | Accuracy (%) |
| SVM | 82 | 1 | 54 | 3 | 0.987 | 0.964 | 97.14 |
| K-NN | 83 | 3 | 52 | 2 | 0.965 | 0.976 | 96.42 |
| NB | 81 | 2 | 53 | 4 | 0.975 | 0.953 | 95.71 |
| DT | 80 | 7 | 48 | 5 | 0.919 | 0.941 | 91.42 |
| K-Means | 446 | 19 | 222 | 11 | 0.959 | 0.976 | 95.70 |
| ANN | 82 | 0 | 55 | 3 | 1.000 | 0.964 | 97.85 |
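The P, R and Accuracy columns of Table 2 can be recomputed from the TP, FP, TN and FN counts. The sketch below assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), accuracy = (TP + TN) / total); the function names are illustrative. Small differences in the last digit relative to the table may come from truncation rather than rounding.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn)


def accuracy(tp, fp, tn, fn):
    """Percentage of all instances that were classified correctly."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)


# (TP, FP, TN, FN) counts taken from Table 2.
results = {
    "SVM": (82, 1, 54, 3),
    "K-NN": (83, 3, 52, 2),
    "NB": (81, 2, 53, 4),
    "DT": (80, 7, 48, 5),
    "K-Means": (446, 19, 222, 11),
    "ANN": (82, 0, 55, 3),
}

for name, (tp, fp, tn, fn) in results.items():
    print(
        f"{name:8s} P={precision(tp, fp):.3f} "
        f"R={recall(tp, fn):.3f} Acc={accuracy(tp, fp, tn, fn):.2f}%"
    )
```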
V. CONCLUSION AND FUTURE WORK
REFERENCES