
International Journal of Inventions in Computer Science and Engineering, Volume 6 Issue 5-7 May-July 2019

ANALYSIS OF MACHINE LEARNING APPROACHES TO DETECT AND CLASSIFY BREAST CANCER

M. Lakshmitha1, A. Abdul Hayum2


1 M.E. VLSI Design, Akshaya College of Engineering and Technology, Coimbatore
2 Assistant Professor, Department of ECE, Akshaya College of Engineering and Technology, Coimbatore

ABSTRACT: Breast cancer is among the most threatening and lethal diseases in women, and it is more prevalent in developed countries. According to the World Health Organization (WHO), about 2.1 million women were diagnosed with benign or malignant breast cancers as of 2015. On average, out of 268,600 women diagnosed with cancer, about 41,760 deaths were reported. Records of various oncologists show that early detection of a tumour can increase the survival rate of patients. Though cancer diagnosis is initially done by experts in radiology, it is not always accurate. In the modern era, technologies such as mammography, Computerized Tomography, breast MRI and computer-aided detection (CAD) of mammograms are used in the diagnostic procedure. However, because of the false-positive and false-negative values in CAD reports, it becomes necessary to improve the efficiency of the system. The efficiency of an algorithm depends on the type and quality of the data images employed in the process, and machine learning plays a vital role in detecting cancer by processing mammogram images. This paper presents different techniques and their limitations in the detection and classification of breast cancer images.
Keywords: Breast Cancer, CART, KNN, NB, RF, SVM

I. INTRODUCTION

Cancer begins when normal healthy cells grow uncontrollably, forming a mass called a tumour. This tumour may be cancerous or benign. A cancerous tumour, named malignant, is more deadly because it spreads to other parts of the body and to the lymph nodes. A benign tumour grows but does not spread through the body. Cancers are categorized into five stages, from Stage 0 to Stage IV. When the tumour is diagnosed early, in Stage 0 or Stage I, the survival rate approaches 100%, but this rate gradually decreases in later stages: 93% for Stage II, 72% for Stage III and 22% for Stage IV. It is therefore very important to diagnose the tumour in the early stages. The diagnosis of cancer involves various procedures, starting from self-examination of the breasts, after which oncologists might advise mammography, breast ultrasound, biopsy or breast MRI.

Mammography is a method of passing x-rays through the breasts, after which the images are analysed by radiologists to detect abnormalities. These images show areas of micro-calcifications and lumps. There is a chance that these structures are not cancerous, and the nature of the breasts also influences the mammogram, resulting in false-positive and false-negative values. The variations in mammography led to fine-needle aspiration cytology (1). A false positive is something that is detected as a lump but is actually not a cancerous tumour. Conversely, a false negative occurs when a tumour is present but is not detected, which happens with roughly a 10% probability.

Generally, mammography is done to identify important abnormalities in the breast such as:
• Asymmetries – regions of the breast with varying densities
• Clusters of small calcifications
• Any area of skin thickening

Radiologists classify breasts into two categories, namely soft breasts and dense breasts. Soft breasts appear more transparent to the mammogram, whereas dense breasts are thicker and it is difficult to locate tiny lumps or tumours in them, which leads to errors such as false-positive and false-negative values.


Fig 1.1. Mammogram image from new and old technologies

The technology of computerizing the diagnosis of breast cancer from mammogram images, without the need of a radiologist or oncologist, has been growing day by day, and machine learning has become an inevitable part of this development. The accuracy of diagnosis by an experienced physician is found to be 79.97%, while with machine learning it is 91.1% (3).

II. MACHINE LEARNING APPROACHES

The art of training machines with numerous data and testing them to perform in a desired way reduces the time and effort spent performing the same task manually.

There are two different types of machine learning:
• Supervised Learning
• Reinforcement Learning

This paper focuses on the classification of breast cancer as benign or malignant. For the classification of images there are algorithms such as K-nearest neighbours, Support Vector Machines (SVM), decision trees and so on. But before applying an algorithm, some preparation steps must be performed in order to make the data ready for training. Once the data is prepared, it can be split into training and testing data sets.

Machine learning offers a list of approaches or algorithms for classification. The task lies in choosing the best algorithm for the required dataset. The machine learning approaches for classifying a given set of data are as follows:
1) Linear Classifier
   a. Logistic Regression
   b. Naïve Bayes
2) K-Nearest Neighbour
3) Support Vector Machine
4) Decision Trees
5) Boosted Trees
6) Random Forest
7) Neural Networks

2.1. LINEAR CLASSIFIER – NAÏVE BAYES:

This algorithm is simple to implement and can be used effectively for large datasets. Naïve Bayes is selected when the presence of a particular feature is assumed to be unrelated to any other feature, i.e. the features are treated as independent. It is a fast algorithm and works well even with little training data [7].

Algorithm:
I. The dataset is divided into 2 classes and 2 sets.
II. Calculate the mean and standard deviation for each feature and each class.
III. Create a summary of each class and each feature.
IV. Find the probability of each feature.
V. Find the probability of each class as the product of all feature probabilities.
VI. Predict the class of the instance.

Limitation: One important problem with this algorithm is zero probability, a condition that occurs when the probability of a feature is zero, in which case it fails to give a valid prediction.
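As a rough illustration of steps I–VI, the sketch below trains a Gaussian Naïve Bayes classifier on the Wisconsin breast cancer data shipped with scikit-learn; the dataset choice, the 80/20 split and the random seed are assumptions made here for demonstration, not the setup of any paper surveyed later.

```python
# A minimal sketch, assuming the scikit-learn breast cancer data and an 80/20 split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step I: load the (assumed) dataset and divide it into training and testing sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps II-V: fitting estimates the per-class mean/variance of each feature and the
# class priors, which together give the per-class likelihoods used for prediction.
model = GaussianNB()
model.fit(X_train, y_train)

# Step VI: predict the class of each test instance and report accuracy.
y_pred = model.predict(X_test)
print("Naive Bayes accuracy:", accuracy_score(y_test, y_pred))
```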
2.2. K-NEAREST NEIGHBOUR:

This approach takes a set of labelled points, and when a new point arrives it looks for the nearest points [7]. In contrast to Naïve Bayes, this algorithm classifies data by feature similarity. It works well for noisy inputs.

Algorithm:
I. The input data set is split into training and testing data, typically 80% for training and 20% for testing.
II. Pick an instance from the testing set and compute its distance to the training set.
III. List the distances in ascending order.
IV. The class of the instance is the most common class among the first K (e.g. 3) training instances.

Limitation: The value of K determines how many nearest neighbours are considered, and distances must be computed between each test instance and every training sample, which increases the computational cost.
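A comparable sketch of steps I–IV with a K-nearest-neighbour classifier follows; K = 3 and the 80/20 split are illustrative assumptions.

```python
# A minimal sketch, assuming the same dataset, an 80/20 split and K = 3.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step I: 80% training, 20% testing (assumed split).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps II-IV: for each test instance, distances to the training set are computed,
# sorted, and the majority class among the K nearest neighbours is returned.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```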
2.3. DECISION TREES:

A large dataset is broken down into successive subsets until it ends up with a leaf node having the least cost. This leaf node is the one from which the class label is chosen.

Fig. 2.1. Decision Tree Representation

Limitation: Decision trees are found to be unstable, because even a small variation in the data results in a different tree.
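A minimal decision-tree sketch under the same assumed dataset and split follows; the maximum depth shown is an arbitrary illustrative value.

```python
# A minimal sketch; max_depth=4 is an arbitrary illustrative limit.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The tree greedily splits the training data into smaller subsets up to the assumed depth.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```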


2.4. SUPPORT VECTOR MACHINE:

In a support vector machine, the provided data set is mapped into a space, and when new data arrive the points are separated by a hyperplane with as wide a gap as possible. SVM can be used effectively for high-dimensional data sets.

Fig. 2.2. SVM hyperplane

Algorithm:
I. Prepare the dataset by dividing it into testing and training sets.
II. From the training set, prepare a validation set.
III. Select the features necessary for classification.
IV. Find the best hyperplane parameters.
V. Test the model with the test set.

Limitation: A probability estimate is not provided directly.
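The sketch below mirrors steps I–V with an RBF-kernel SVM; the kernel, C and gamma values are assumed defaults for illustration rather than tuned parameters from the surveyed works. Standardizing the features before SVM is included because the literature survey later notes that it improves accuracy.

```python
# A minimal sketch; the RBF kernel, C=1.0 and gamma="scale" are assumed defaults.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step I: train/test split (an explicit validation set, step II, is omitted for brevity).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps III-IV: features are standardized and the maximum-margin hyperplane is fitted.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)

# Step V: test the model with the held-out test set.
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```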
2.5. RANDOM FOREST:

This algorithm belongs to the ensemble learning methods. It constructs a multitude of decision trees and averages their outputs to improve accuracy.

Fig. 2.3. Random Forest Tree Representation

Limitation: Although it has the ability to handle large data, the algorithm becomes complex and hence increases the implementation cost.
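A minimal random-forest sketch, assuming 100 trees (a common default rather than a value taken from any cited work):

```python
# A minimal sketch; n_estimators=100 is an assumed common default.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An ensemble of decision trees whose votes are combined to improve accuracy.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Random forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```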

2.6. LOGISTIC REGRESSION:

Logistic regression provides a relation between the independent and dependent variables and explains the factors that lead to the classification.

Fig. 2.4. Logistic Regression Gradient

Limitation: The regression algorithm works only when the predicted variable is binary.
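A short logistic-regression sketch; the standardization step and the raised iteration limit are assumptions chosen only so the example converges.

```python
# A minimal sketch; standardization and max_iter=1000 are assumptions chosen for convergence.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Binary target (benign vs malignant), as logistic regression requires.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print("Logistic regression accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
```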
III. LITERATURE SURVEY

Machine learning is a vast field. In order to know the different techniques available, many research papers and previous works were studied and analysed. This shows the nature of the algorithms used and helps in selecting the best approach for the task at hand. This section includes a list of papers ranging from 2010 to 2018, giving a clear understanding of supervised machine learning techniques.

Li Rong [1] in 2010 derived a relation between SVM and KNN in two views, in which SVM behaves like a 1-NN classifier: each class has one support vector, and it does not represent the whole class, whereas in KNN each support vector is taken as a representative point, so more information is used. It concludes that SVM alone has an accuracy of 96.9%, but when SVM is combined with KNN the accuracy is boosted to 98.19%.

Ahmet Mert [3] in 2011 incorporated Independent Component Analysis (ICA) to reduce the dimension of the WDBC dataset as a pre-processing procedure. The data with reduced dimension were then classified with SVM, whose accuracy was 93.71%, while with the original data, without the ICA technique, the accuracy increased to 95.8%.

Xiufeng Yang [4] in 2013 worked with multiple kernels in SVM by using ISOMAP, where high-dimensional data are projected into a low-dimensional space. The SVM classifier was used along with a Radial Basis Function kernel and a polynomial kernel. This reduced the dimension of the WDBC data to 5 while giving a good accuracy of 98.2%.


P. Hamsagayathri [6] in 2017 performed decision tree classification with a J48 classifier in order to reduce the size of the tree and the number of leaf nodes. The method also eliminated repetitive sub-trees by implementing attribute priority. The dataset employed was from SEER, and the J48 classifier showed an accuracy of 98.5%, reducing computational cost and complexity while increasing the memory size.

Meriem Armane [7] in 2018 compared the Naïve Bayes and KNN algorithms on the same data set, WBCD from UCI. This dataset records different characteristics assigned by pathologists in order to classify breast cancer, such as clump thickness, uniformity of cell shape or size, bare nuclei, etc. Most of the time, NB is combined with other algorithms for classification. From the simulation results, the authors noted that the accuracy of KNN is 97.5%, which is higher than that of NB at 96.19%.

Youness Khourdifi and Mohamed Bahaj [8] in 2018 presented a survey on selecting the best classifier for breast cancer prediction by comparing various data mining algorithms. They used 10-fold cross-validation to evaluate the predictive models. Simulating each algorithm gives different values for the time required to build the model and for accuracy. Among the classifiers, SVM has the highest accuracy of 97.9% with a processing time of 0.08 s, while KNN, with the least computation time of 0 s, is a lazy model as it does little work at training time. Random Forest and Naïve Bayes were estimated to have the highest error rates.

Anusha Bharat [19] in 2018 used the WBCD dataset with 80% of the data for training and the remainder for testing. They ran each algorithm (NB, KNN, SVM and CART) without standardizing, and the accuracy was found to be above 92%. On the other hand, after standardizing the dataset, the accuracy of the SVM increased drastically to 99.1%. This shows that fine-tuning of parameters improves the accuracy of the classifier.

IV. COMPARISON OF ML ALGORITHMS

A table is formulated comparing the different approaches to the classification of breast cancer. It shows the advantages and disadvantages of the algorithms, enabling the selection of an appropriate algorithm for a given dataset and application.
TABLE 4.1. COMPARISON OF DIFFERENT MACHINE LEARNING TECHNIQUES

Linear Classifier – Naïve Bayes
  Advantages: models can be built easily even for large datasets; fast, and works effectively with little training data.
  Disadvantages: zero probability causes it to fail to give a valid prediction; NB is a bad estimator.

Linear Classifier – Logistic Regression
  Advantages: able to explain the features that lead to the classification.
  Disadvantages: works only when the predicted variable is binary.

K-Nearest Neighbours
  Advantages: classifies data simply by feature similarity; the model works well for noisy inputs.
  Disadvantages: computational cost is high.

Support Vector Machines
  Advantages: highly memory efficient; well suited for high-dimensional spaces; the important parameters are the kernel type, the gamma value and the C value.
  Disadvantages: a probability estimate is not provided directly.

Decision Trees
  Advantages: handles both numerical and categorical data; makes classification easier by breaking large datasets into small subsets.
  Disadvantages: small variations in the data result in a different tree; unstable.

Random Forest
  Advantages: handles large data; can handle missing data while computing.
  Disadvantages: RF is a complex algorithm and is difficult to implement.


From the table, it is evident that each algorithm is suited to particular situations and that the choice depends on the dataset. The datasets considered for classification may consist of images or just numerical data, and their size also matters when selecting the right algorithm. As a compilation of the literature survey, the Support Vector Machine algorithms are found to be used most often and to provide good accuracy. Moreover, SVM can be combined with other algorithms to improve efficiency and accuracy [4]. Breast cancer classification usually involves only two categories, benign and malignant, so it is more efficient to use a binary SVM classifier than to adopt other classifiers. In order to achieve improved results, proper pre-processing techniques can be followed.
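To make the comparison above concrete, the following sketch evaluates the surveyed classifiers side by side with 10-fold cross-validation (the protocol used in [8]); the dataset, the fixed hyperparameters and the use of scikit-learn are illustrative assumptions rather than the exact experimental setups of the cited works.

```python
# A minimal sketch: 10-fold cross-validation of the surveyed classifiers on an assumed dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters are illustrative assumptions, not values from the cited papers.
models = {
    "Naive Bayes": GaussianNB(),
    "KNN (k=3)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean accuracy and spread over 10 folds, mirroring the evaluation protocol of [8].
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```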
CONCLUSION

This paper is a survey of the various algorithms available in machine learning to classify breast cancer. The survey was conducted from a collection of standard papers published by researchers and content studied from authorized websites. It provides an insight into selecting a suitable algorithm for the classification of breast mammogram images. With regard to the survey, SVM is found to be best suited for binary classification of breast cancer.

REFERENCES

[1] T. Jinshan, R. R., X. Jun, I. El Naqa, Y. Yongyi, Computer-Aided Detection and Diagnosis of Breast Cancer With Mammography: Recent Advances, IEEE Transactions on Information Technology in Biomedicine, Vol. 13, pp. 236-251, 2009.

[2] O. Chapelle, B. Scholkopf, and A. Zien, Eds., Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book reviews], IEEE Transactions on Neural Networks, Vol. 20, No. 3, p. 542, 2009.

[3] Li Rong, Sun Yuvan, Diagnosis of Breast Tumour Using SVM-KNN Classifier, 2010 Second WRI Global Congress on Intelligent Systems, 2010.

[4] A. M. Krishnan, R. Banerjee, S. Chakraborty and C. Chakraborty, Statistical analysis of mammographic features and its classification using support vector machine, Expert Systems with Applications, Vol. 37, pp. 470-478, 2010.

[5] N. H. Sweliam, A. A. Tharwat, N. K. Moniem, Support vector machine for diagnosis cancer disease: A comparative study, Egyptian Informatics Journal, Vol. 11, pp. 81-92, 2010.

[6] Woo Kyung Moon, Yi-Wei Shen, Min Sun Bae, Chiun-Sheng Huang, Jeon-Hor Chen, and Ruey-Feng Chang, Computer-aided Tumor Detection Based on Multi-scale Blob Detection Algorithm in Automated Breast Ultrasound Images, 2011.

[7] Ahmet Mert, Niyazi Kilic, Aydin Akan, Breast Cancer Classification by Using Support Vector Machines with Reduced Dimension, IEEE Proceedings ELMAR-2011.

[8] M. Hall, E. Frank, I. Witten, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2011.

[9] S. Aruna, S. P. Rajagopalan, L. V. Nandakishore, Knowledge based analysis of various statistical tools in detecting breast cancer, Computer Science & Information Technology, Vol. 2, pp. 37-45, 2011.

[10] Evanthia E. Tripoliti et al., Automated Diagnosis of Diseases Based on Classification: Dynamic Determination of the Number of Trees in Random Forests Algorithm, IEEE Transactions on Information Technology in Biomedicine, Vol. 16, No. 4, July 2012.

[11] Xiufeng Yang, Hui Peng, Mingrui Shi, SVM with Multiple Kernels based on Manifold Learning for Breast Cancer Diagnosis, Proceedings of the IEEE International Conference on Information and Automation, Yinchuan, China, August 2013.

[12] Mitko Veta, Josien P. W. Pluim, Paul J. van Diest, Max A. Viergever, Breast Cancer Histopathology Image Analysis: A Review, IEEE Transactions on Biomedical Engineering, Vol. 61, No. 5, May 2014.

[13] A. Alarabeyyat, A. M., Breast Cancer Detection Using K-Nearest Neighbor Machine Learning Algorithm, 9th International Conference on Developments in eSystems Engineering (DeSE), IEEE, pp. 35-39, 2016.

[14] M. H. Asri, H. A. Moatassime, Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis, Procedia Computer Science, Vol. 83, pp. 1064-1073, 2016.

[15] S. Kanta Sarkar, A. N., Identifying patients at risk of breast cancer through decision trees, International Journal of Advanced Research in Computer Science, Vol. 8, pp. 88-96, 2017.

[16] P. Hamsagayathri, P. Sampath, Priority Based Decision Tree Classifier for Breast Cancer Detection, 2017 International Conference on Advanced Computing and Communication Systems (ICACCS-2017), Jan. 06-07, 2017, Coimbatore, India.

[17] Meriem Armane, Ikram Gagaoua, Breast Cancer Classification Using Machine Learning, 2018 IEEE Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), 2018.


[18] Youness Khourdifi, Mohamed Bahaj, Applying Best Machine Learning Approaches for Breast Cancer Prediction and Classification, IEEE Conference on Electronics, Control, Optimization and Computer Science, 2018.

[19] Anusha Bharat, Pooja N., Anishka Reddy, Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis, 2018 IEEE Third International Conference on Circuits, Control, Communication and Computing, 2018.

[20] National Breast Cancer Foundation Inc., http://www.nationalbreastcancer.org/about-breast-cancer.

