


2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS)

Early Stage Detection of Malignant Cells: A Step Towards Better Life

Jasmine Awatramani
Amity University, Noida, Uttar Pradesh
jasmineawatramani@gmail.com

Nitasha Hasteer
Amity University, Noida, Uttar Pradesh
nitasha78@gmail.com

Abstract- Cancer is a collection of diseases driven by changes in the cells of the body that push their growth beyond normal control. Its prevalence is increasing year by year, and research is advancing accordingly to counter these occurrences and provide solutions. Breast cancer is considered a deadly disease and is one of the leading causes of death among women globally. Early detection of breast cancer increases the probability of successful treatment and survival. Research has mostly been done on mammogram images. However, these images are sometimes inaccurate and may lead to false detections, which can put the patient’s well-being at risk. It is therefore important to find alternatives that are simple, economical, and safe, and that can generate more reliable predictions. Presently, machine learning approaches are widely used in breast cancer detection. Machine learning enables a system to learn from past cases and to make decisions using a variety of statistical and probabilistic techniques with minimal human intervention. This research work showcases the use of five machine learning methods: SVM (Support Vector Machine), KNN (K-Nearest Neighbor), K-SVM (Kernel Support Vector Machine), Random Forest Tree, and Decision Tree. The accuracies achieved for breast cancer detection were 97.20%, 95.10%, 96.50%, 98.60%, and 95.80%, respectively. As per the results, Random Forest Tree offers the highest accuracy among these algorithms when applied to the Wisconsin breast cancer dataset, which was taken from a machine learning repository.

Keywords- Breast Cancer, Machine Learning, Random Forest Tree, Breast Cancer Detection, Accuracy.

I. INTRODUCTION

Cancer is described as a malignant tumor made up of a collection of cells that grow uncontrollably. Cancer is a life-threatening disease and ranks second among the causes of death globally[10]. It was responsible for around 9.6 million deaths in 2018. The most commonly occurring cancers are colorectal, breast, lung, skin, stomach, and prostate cancer[2]. Cancer tends to leave long-term and devastating effects on the patient as well as on the patient's family.

Breast cancer arises in the cells of the breast tissue (carcinoma) or in the connective tissue of the breast (sarcoma). When the cells in the breast start growing abnormally, the resulting mass is commonly known as a tumor. There are two types of tumors: benign (non-cancerous) and malignant (cancerous)[1]. According to the World Health Organization (WHO), breast cancer is the most frequent cancer among women worldwide[2]. In 2018, the organization Breast Cancer - India Against Cancer reported that breast cancer accounts for 14% of all cancers in women in India[3]. The incidence rate of India is the lowest but its mortality rate is the highest. This indicates that the gap between mortality and incidence is wide. Therefore, advances in current methods are needed to detect breast cancer at an early stage[6]. Early-stage detection can narrow the gap between mortality and incidence, and it increases the chances of successful treatment and survival.

The signs of breast cancer may include the formation of a lump, pain in the breast, blood discharge from the nipple, changes in the size and shape of the nipple, and others[4,7]. However, many women do not experience any of these signs, so screening is the only option left in this situation. At present, the mammogram is used to detect a malignant tumor in the breast[18]. A mammogram is a widely used method for the diagnosis of breast cancer that produces 2-dimensional breast images. However, it often cannot diagnose even a benign tumor accurately, as the size of both benign and malignant tumors at

ISBN: 978-1-7281-4826-7/19/$31.00 ©2019 IEEE

Authorized licensed use limited to: AMITY University. Downloaded on June 29,2021 at 11:38:38 UTC from IEEE Xplore. Restrictions apply.


an early stage is small. Mammograms also sometimes produce false-positive results, which often lead to unwanted surgeries, biopsies, and exposure to harmful radiation[8,9]. Therefore, early detection through advances in computer science is an important area of research[10]. To overcome the challenges that patients face from harmful radiation, various methods have been developed, such as breast MRI scans and Fine Needle Aspiration (FNA), in which a small amount of tissue is extracted from the suspicious area and checked for cancerous cells[18]. These methods deliver limited results in terms of precision and validity. The techniques discussed above require frequent human intervention and have limited capabilities[9]. Thus, it is vital to apply the advances of machine learning to early detection. It is one of the best ways to save lives and take appropriate action against the disease.

Artificial Intelligence is an area that has proven able to deliver successful results. It assists in the processing of data to obtain reliable and faster results[11]. Machine learning is a collection of tools within Artificial Intelligence used for producing and analyzing algorithms that simplify pattern recognition, prediction, and classification. Machine learning involves a four-step procedure: collection of data, selection of the model, model training, and model testing. Machine learning can therefore result in more accurate decision making and prediction[8]. It has contributed, and continues to contribute, a great deal to healthcare in areas such as personalized medicine, smart health records, disease diagnosis, and many more[20]. In cancer research, machine learning methods can be used to recognize patterns in a data set and accordingly predict the type of tumor present in the patient's breast, that is, benign or malignant.

Fig.1: Four-step procedure of machine learning

Machine learning algorithms have become a conventional tool for medical researchers. Breast cancer produces no symptoms when the lump is small. Therefore, advanced techniques are needed to treat cancer at an early stage. We have used five machine learning classifiers in our study: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Kernel Support Vector Machine (K-SVM), Random Forest Tree, and Decision Tree. The aim is to determine which of these five algorithms gives the best accuracy. These algorithms have been implemented in Jupyter Notebook, an open-source web application[19], along with various libraries such as NumPy, Pandas, Cufflinks, and others. This paper presents the results of our experimentation. The rest of the paper is structured as follows: Section II gives an overview of the machine learning algorithms used in the study. Section III presents the related work in the subject area of the study. Section IV explains the methodology used in the work. Section V presents experimental results and discussion. Finally, we conclude and present future directions of work in Section VI.

II. MACHINE LEARNING ALGORITHMS

Machine learning is commonly used for pattern recognition and prediction. It is mainly divided into three categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Supervised Learning is applied when the inputs have a known structure and the corresponding outputs have already been collected, that is, the data is labeled. Unsupervised Learning is applied when the pattern is not known to the system and the data has no known structure, that is, the data is not labeled. Reinforcement Learning is applied when the system interacts with a dynamic environment[13]. The algorithms used in the study are discussed below:

• SVM
Support Vector Machine is a supervised machine learning algorithm. It classifies the assembled data by finding a separating hyperplane, chosen so that the smallest distance from the hyperplane to the data points (the margin) is as large as possible. SVM can separate two or more classes of data. The kind of SVM used here is linear SVM. SVM comprises single or combined models such as SSVM (Smooth Support Vector Machine), Standard SVM, and others[8,9].

• KNN
KNN is commonly used for detecting cancer. It is suggested to choose a larger dataset for training along with an odd-numbered K value. KNN does not make any assumptions about the underlying distribution of the data. It performs well in pattern recognition and predictive analysis. KNN collects the data points that are closest to a new data point. Attributes that vary on a broader scale may strongly affect the distance between data points. The closest points are sorted by their distance from the incoming data point. Usually, Euclidean distance is recommended to measure the distance[8,9].

• K-SVM
It is very similar to SVM and is likewise a supervised classification algorithm. The major


difference between the two is that SVM uses a linear separating hyperplane, whereas kernel functions are non-linear (for example, polynomial). Using a kernel provides better performance because many datasets have a non-linear decision boundary[5].

• Random Forest Tree
The Random Forest Tree method is used at the point of standardization, where the quality of the model is the highest, and it strikes a compromise between bias and variance. Random Forest Tree builds a large number of Decision Trees, each trained on a random sample drawn with replacement, to overcome the weaknesses of individual Decision Trees. Each tree makes its own prediction, and the decision with the most votes is selected. It is mostly used in unsupervised mode for evaluating the closeness among data points[8].

• Decision Tree
Decision Tree is a data mining method commonly used to diagnose breast cancer at an early stage. It is a model that represents classifications or regressions as a tree. A Decision Tree breaks the dataset into small subsets, and then into even smaller ones. As a result, a tree is grown, and the result is given at the terminal level. In the tree structure, leaves represent class labels and branches represent combinations of attributes leading to those class labels. The Decision Tree is also described as not being noise-sensitive[8].

III. LITERATURE REVIEW

The IEEE Digital Library was searched to identify the relevant studies in this area. We carried out a systematic literature review with the objective of identifying studies that had breast cancer detection as their prime focus and used the Wisconsin dataset for their research.

A. Sharma et al.[13] conducted research with the objective of diagnosing breast cancer as benign (B) or malignant (M) using three machine learning methods: logistic regression, KNN, and SVM. The authors concluded that logistic regression gives the highest accuracy of the three. Research by D. Bazazeh et al.[12] compares machine learning techniques for detecting breast cancer. In that paper, an overview of the most common algorithms, such as SVM, random forest, and Bayesian networks, is presented and the algorithms are compared. M. Amrane et al.[11] compared two techniques: Naïve Bayes and KNN. Their paper concludes that KNN gives higher accuracy than Naïve Bayes when evaluated with cross-validation. In the research work by B.M. Gayathri et al.[16], Relevance Vector Machine (RVM) is compared with other algorithms and is shown to have a low computational cost, which the authors argue makes it better than the other techniques. The study by S. Kharya et al.[14] reveals that artificial intelligence and neural networks are the most commonly used methods to predict breast cancer. The study brings out the pros and cons of algorithms such as Decision Trees, Naïve Bayes, and neural networks. The study by S. Chaurasia et al.[17] explains that, out of a Bayes learner, a decision tree, and a neural net, the neural net proved to have the greatest accuracy and precision. In the study by Y. Khourdifi et al.[15], Random Forest, SVM, KNN, and Naïve Bayes were compared, and SVM obtained the greatest accuracy. From the review, we infer that breast cancer detection has attracted the interest of researchers and practitioners.

| Authors | Focus Area of Study | Results |
|---|---|---|
| A. Sharma, S. Kulshreshtha, S. Daniel[13] | The objective of the research was to obtain the link between precision, recall, and the number of features in the dataset, along with the probability of determining cancer in the affected patients. | The results of the study showed the application of Logistic Regression with accuracy 96.89%. |
| D. Bazazeh, R. Shubair[12] | The objective of this work was to compare 3 machine learning algorithms, where the major criteria were area under the ROC (receiver operating characteristic) curve, precision, recall, and accuracy. | This research showed the application of SVM with accuracy 96.60%. |
| M. Amrane, S. Oukid, I. Gagaoua, T. Ensari[11] | The aim of the research was to compare 2 machine learning techniques and calculate their accuracy with the help of cross-validation. | The authors concluded that KNN showed the best accuracy, which was 97.51%. |
| B.M. Gayathri, Dr. C.P. Sumathi[16] | The study aimed at a comparison of RVM with other machine learning techniques for detection of breast cancer. | This research showed the application of RVM with accuracy 97.00%. |
| S. Kharya, Sunita Soni[14] | The objective of this paper was to examine performance on the basis of a machine learning tool along with the use of a weighted concept, which improves the performance of the algorithm. | The results of the study showed the application of Naïve Bayes with accuracy 92.00%. |
| S. Chaurasia, N. Chourasia, P. Chakrabarthi[17] | The aim of this paper was to implement and compare 3 machine learning techniques and to identify the algorithm with the greatest accuracy and precision for diagnosing breast cancer at an early stage. | The results of the paper showed the application of Neural Networks with accuracy 96.14%. |
| Y. Khourdifi, M. Bahaj[15] | The objective of the research was to reduce the mortality rate due to breast cancer by detecting and preventing it at an early stage with the help of machine learning techniques. | This research showed the application of SVM with accuracy 97.90%. |

Table.1: Literature Studies

IV. METHODOLOGY

A. Attributes and Data Set

We have used a publicly available dataset known as the Wisconsin Breast Cancer (Diagnostic) dataset for our study[14,20]. It is hosted in the machine learning repository of the University of California, Irvine (UCI). This dataset consists of 10 attributes for each cell nucleus, specified as follows:

Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave points, Symmetry, Fractal Dimension

In addition to the above, the id and diagnosis were available in the dataset, of which Diagnosis was considered for the study. The corresponding representation was Benign (B:0) and Malignant (M:1). The dataset was already labelled.

The mean, standard error, and largest (worst) value of each of these features were computed for every image, which resulted in 30 features. The entire data set consists of 569 clinical cases. The distribution of benign and malignant cases is illustrated in Fig.2.

Fig.2: Benign and Malignant Cases

A. Preprocessing of Data

Preprocessing is the initial step before data modeling. We found that the data obtained for modeling was incomplete, that is, some values were missing, and there was a possibility of inconsistency. Therefore, preprocessing is done to remove errors, smooth noisy data, and reduce the data. This is done by feature scaling, as the entries in our data are mostly numeric.

B. Visualization of Data

Data visualization is the process of examining data for patterns and displaying it in the form of graphs or pictures. It helps to select the correct model for data modeling. Data cardinality and the size of the data play an important role in data visualization. The greater the cardinality, the larger the percentage of unique values, and vice versa.
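The feature scaling and cardinality ideas described in this section can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code. It assumes that feature scaling means standardization (zero mean, unit variance), which the paper does not specify. The sample values and the helper names `standardize` and `cardinality_ratio` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def cardinality_ratio(values):
    """Fraction of entries in a column that are distinct (1.0 = all unique)."""
    return len(set(values)) / len(values)


# Hypothetical measurements on very different scales (illustration only).
radius = [12.0, 14.0, 16.0, 18.0]
area = [400.0, 600.0, 800.0, 1000.0]
diagnosis = [0, 1, 0, 1]  # B:0, M:1, as encoded in the study

radius_scaled = standardize(radius)
area_scaled = standardize(area)

print(radius_scaled)
print(area_scaled)
print(cardinality_ratio(radius), cardinality_ratio(diagnosis))
```

In this toy example the two columns differ only by scale and offset, so they map to the same standardized values (up to floating-point rounding). Neither column then dominates a distance-based classifier such as KNN. The continuous column has a higher cardinality ratio than the two-valued diagnosis column.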


Fig.3: Data Visualization

Data can be visualized in multiple ways, such as heat maps, pie charts, box plots, etc. The visualizations for this research were produced with the Python libraries matplotlib, cufflinks, and plotly (Fig.3). Cufflinks and Plotly create interactive visualizations that respond to the user.

Other important libraries used in the model were NumPy, Pandas, and Scikit-Learn.

C. Selection and Implementation of Model

The selection of the most suitable model varies; it commonly depends on the kind of data used. After feature selection and extraction, machine learning algorithms can be applied to the obtained data. As discussed previously, the machine learning algorithms applied are KNN, SVM, K-SVM, Decision Tree, and Random Forest Tree.

We developed a model on the dataset of benign and malignant cases. From the viewpoint of automated learning, the detection of breast cancer can be treated as either a clustering problem or a classification problem; we treat it as a classification problem.

The sizes of the training and testing sets play a major role. The dataset is divided into two parts: a training set and a testing set. The training split (data modeling) contains 75% of the data and the testing split (prediction and analysis) contains 25%[13].

A user-friendly interface was designed to enable end-users and practitioners to analyze various algorithms using different datasets. A snapshot of the designed interface is given in Fig.4.

Fig.4: Designed Interface

V. EXPERIMENTAL RESULTS AND DISCUSSION

The experiments were carried out using Jupyter Notebook. The computer was configured with an Intel Core i5 processor with 128 MB eDRAM. During the research, 5 algorithms were implemented, compared, and analyzed on the Breast Cancer dataset. The comparative study was done in terms of accuracy.

The best accuracy was achieved by Random Forest Tree, as depicted in Table.2. It is inferred that Random Forest can differentiate between malignant and benign tumors with higher accuracy and that it outperformed the other algorithms, as shown in Fig.5.

Fig.5: Random Forest Classifier Accuracy
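A minimal sketch of how such a comparison could be set up with Scikit-Learn is given below. It is not the authors' implementation. It assumes the copy of the Wisconsin Diagnostic dataset bundled with Scikit-Learn, standardization as the feature scaling step, an RBF kernel for K-SVM, K = 5 for KNN, 100 trees for the forest, a stratified 75%/25% split, and fixed random seeds, none of which are specified in the paper. The resulting accuracies depend on these choices and need not match Table.2.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the 569 cases with 30 features each.
data = load_breast_cancer()
X = data.data
# Scikit-Learn encodes malignant as 0 and benign as 1; flip the labels
# to match the encoding used in the study (B:0, M:1).
y = 1 - data.target

# 75% training / 25% testing split (stratification and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters below are illustrative assumptions, not the paper's settings.
models = {
    "SVM (linear)": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),  # Euclidean distance by default
    "K-SVM (RBF kernel)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    # Fit the scaler on the training split only, then train the classifier.
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2%}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics from leaking out of the test split. Under these assumptions the test split holds 143 of the 569 cases.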

| S.No. | Algorithm Used | Accuracy |
|---|---|---|
| 1 | SVM | 97.20% |
| 2 | K-SVM | 96.50% |
| 3 | KNN | 95.10% |
| 4 | Decision Tree | 95.80% |
| 5 | Random Forest Tree | 98.60% |

Table.2: Comparison of Accuracy

VI. CONCLUSION

In this study, we have implemented different machine learning algorithms to detect breast cancer and reported their accuracy. The major objective of the research is to improve the accuracy of breast cancer detection, because at an early stage even mammography sometimes fails to recognize the presence of small lumps. In this paper, Random Forest Tree was the most effective of the five algorithms used to detect breast cancer (98.60% accuracy in the training phase).

In future, we plan to do in-depth research on these datasets by including Deep Learning models to obtain even better performance and greater flexibility. Furthermore, we plan to test these machine learning algorithms on even larger datasets for better accuracy and to adopt these machine learning methods for constrained applications in E-health.

REFERENCES

[1] National Breast Cancer, ‘What is Cancer?’, 2016. [Online]. Available: https://www.nationalbreastcancer.org/what-is-cancer. [Accessed: May, 2019].
[2] World Health Organization, ‘Breast Cancer’, 2019. [Online]. Available: https://www.who.int/cancer/prevention/diagnosis-screening/breast-cancer/en/. [Accessed: May, 2019].
[3] Cancer India, ‘Cancer Statistics’, 2019. [Online]. Available: http://cancerindia.org.in/cancer-statistics/. [Accessed: May, 2019].
[4] J.R. Balentine, ‘Breast Cancer Causes, Signs and Symptoms’, 2019. [Online]. Available: https://www.medicinenet.com/breast_cancer_facts_stages/article.htm. [Accessed: May, 2019].
[5] M. Gupta and B. Gupta, "A Comparative Study of Breast Cancer Diagnosis Using Supervised Machine Learning Techniques," 2018 Second International Conference on Computing Methodologies and Communication (ICCMC), Erode, 2018, pp. 997-1002.
[6] Ram, Shri. (2017). Indian contribution to breast cancer research: A bibliometric analysis. Annals of Library and Information Studies. 64. 99-105.
[7] M. M. Islam, H. Iqbal, M. R. Haque and M. K. Hasan, "Prediction of breast cancer using support vector machine and K-Nearest neighbors," 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), Dhaka, 2017, pp. 226-229.
[8] Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M, Razavi AR (2013) Using Three Machine Learning Techniques for Predicting Breast Cancer Recurrence. J Health Med Inform 4:124.
[9] E. Halim, P. P. Halim and M. Hebrard, "Artificial Intelligent Models for Breast Cancer Early Detection," 2018 International Conference on Information Management and Technology (ICIMTech), Jakarta, 2018, pp. 517-521.
[10] M. Amrane, S. Oukid, I. Gagaoua and T. Ensarİ, "Breast cancer classification using machine learning," 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), Istanbul, 2018, pp. 1-4.
[11] D. Bazazeh and R. Shubair, "Comparative study of machine learning algorithms for breast cancer detection and diagnosis," 2016 5th International Conference on Electronic Devices, Systems and Applications (ICEDSA), Ras Al Khaimah, 2016, pp. 1-4.
[12] A. Sharma, S. Kulshrestha and S. Daniel, "Machine learning approaches for breast cancer diagnosis and prognosis," 2017 International Conference on Soft Computing and its Engineering Applications (icSoftComp), Changa, 2017, pp. 1-5.
[13] Kharya, Shweta & Soni, Sunita. (2016). Weighted Naive Bayes Classifier: A Predictive Model for Breast Cancer Detection. International Journal of Computer Applications. 133. 32-37.
[14] Y. Khourdifi and M. Bahaj, "Applying Best Machine Learning Algorithms for Breast Cancer Prediction and Classification," 2018 International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS), Kenitra, 2018, pp. 1-5.
[15] B. M. Gayathri and C. P. Sumathi, "Comparative study of relevance vector machine with various machine learning techniques used for detecting breast cancer," 2016 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), Chennai, 2016, pp. 1-5.
[16] Chaurasia, S., Chakrabarti, P. and Chourasia, N. (2014) Prediction of Breast Cancer Biopsy Outcomes - An Approach Using Machine Leaning Perspectives. International Journal of Computer Applications, 100, No. 9.
[17] American Cancer Society, ‘Fine Needle Aspiration Biopsy of the Breast’, 2017. [Online]. Available: https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-breast.html [Accessed: May, 2019].
[18] ‘Jupyter’, 2019. [Online]. Available: https://jupyter.org/ [Accessed: May, 2019].
[19] D. Fagella, ‘7 Applications of Machine Learning in Pharma and Medicine’, 2019. [Online]. Available: https://emerj.com/ai-sector-overviews/machine-learning-in-pharma-medicine/. [Accessed: May, 2019].
