Abstract— During their lifetime, about 8% of women are diagnosed with breast cancer (BC). After lung cancer, BC is the second most common cause of death in both developed and undeveloped worlds. BC is characterized by the mutation of genes, constant pain, and changes in the size, color (redness), and skin texture of the breasts. Classification of breast cancer helps pathologists reach a systematic and objective prognosis; the most frequent classification is binary (benign cancer/malignant cancer). Today, Machine Learning (ML) techniques are broadly used in the breast cancer classification problem. They provide high classification accuracy and effective diagnostic capabilities. In this paper, we present two different classifiers for breast cancer classification: the Naive Bayes (NB) classifier and k-nearest neighbor (KNN). We propose a comparison between the two new implementations and evaluate their accuracy using cross validation. Results show that KNN gives the highest accuracy (97.51%), with a lower error rate than the NB classifier (96.19%).

Keywords— Breast cancer classification; Bayesian classifier component; K-nearest neighbor

I. INTRODUCTION

The causes of breast cancer are multifactorial and involve family history, obesity, hormones, radiation therapy, and even reproductive factors. Every year, one million women are newly diagnosed with breast cancer. According to the report of the World Health Organization, half of them would die, because it is usually late when doctors detect the cancer [1]. Breast cancer is caused by a typo or mutation in a single cell, which can either be shut down by the system or cause reckless cell division. If the problem is not fixed after a few months, masses are formed from cells containing wrong instructions.

Malignant tumors expand to the neighboring cells, which can lead them to metastasize or reach other parts of the body, whereas benign masses cannot expand to other tissues; their expansion is limited to the benign mass itself [1] [2]. Detection of BC may be hard at the beginning of the disease, due to the absence of symptoms. After some clinical tests, an accurate diagnosis should be able to differentiate between benign and malignant tumors. A good detection provides a low false positive (FP) rate and a low false negative (FN) rate [3].

Machine learning is a set of tools utilized for the creation and evaluation of algorithms that facilitate prediction, pattern recognition, and classification. ML is based on four steps: collecting data, picking the model, training the model, and testing the model [4]. The relation between BC and ML is not recent: ML has been used for decades to classify tumors and other malignancies, predict sequences of genes responsible for cancer, and determine the prognosis [5, 6]. The aim of classification is to put each observation in the category it belongs to. In this study, we used two machine learning classifiers, the Naïve Bayesian Classifier and k-nearest neighbor. The purpose is to determine whether a patient has a benign or malignant tumor. We customize these two machine learning techniques for the classification of breast cancer, using the Wisconsin breast cancer database. The purpose of this article is to develop effective machine learning approaches for cancer classification using two classifiers on a data set. The performance of each classifier will be evaluated in terms of accuracy, training process and testing process.

II. BACKGROUND

In this section, we first introduce breast cancer classification, then the different machine learning techniques used in our cancer classification.

A. Breast cancer classification (BCC)

BCC aims to determine the suitable treatment, which can be aggressive or less aggressive, depending on the class of the cancer. To make a good prognosis, breast cancer classification needs nine characteristics, which are:
1. Determine the layered structures (Clump Thickness);
2. Evaluate the sample size and its consistency (Uniformity of Cell Size);
3. Estimate the equality of cell shapes and identify marginal variances, because cancer cells tend to vary in shape (Uniformity of Cell Shape);
4. Cancer cells spread all over the organ, whereas normal cells are connected to each other (Marginal Adhesion);
5. Measure of the uniformity; enlarged epithelial cells are a sign of malignancy (Single Epithelial Cell Size);
6. In benign tumors, nuclei are not surrounded by cytoplasm (Bare Nuclei);
7. Describes the nucleus texture, which is uniform in benign cells; the chromatin tends to be coarser in tumors (Bland Chromatin);
8. In normal cells, the nucleolus is usually invisible and very small; in cancer cells, there is more than one nucleolus and they become much more prominent (Normal Nucleoli);
9. Estimate of the number of mitoses that have taken place; the larger the value, the greater the chance of malignancy (Mitoses) [7].

In order to classify BC, pathologists assign each of these
78-1-5386-5135-3/18/$31.00 ©2018 IEEE
characteristics a number from 1 to 10. Assessing the likelihood of malignancy requires all nine criteria, even if one of them is very large.

B. Machine learning approaches

Machine learning is a branch of artificial intelligence. ML methods can employ statistics, probabilities, absolute conditionality, Boolean logic, and unconventional optimization strategies to classify patterns or to build prediction models [8]. Machine learning can be divided into two categories, supervised learning (classification) and unsupervised learning, depending on the data used and their availability [9]. In this section, we will see two supervised learning classifiers.

1) Naïve Bayesian Classifier (NBC)

The Bayesian method is a basic result in probability and statistics; it can be defined as a framework to model decisions. In NBC, variables are assumed to be conditionally independent; NBC can be used on data that directly influence each other to determine a model. Given a representation B, the conditional probability distributions P(B | D) and P(B | H) are estimated from known training compounds, active (D) and inactive (H), respectively. Bayesian classifiers are also well suited to ranking compound databases by probability of activity [10].

Bayesian classifiers use Bayes' theorem, which is:

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)} \qquad (1)$$

In Eq. 1, P(h) is the prior probability that event h will occur. P(d) is the prior probability of the training data d. P(d | h) is the conditional probability of d given h, that is, the probability of generating instance d given class h. P(h | d) is the conditional probability of h given the training data d. The Bayesian decision theorem is used to determine whether a given xi belongs to Si, where Si represents a class [11]:

(2)

(3)

C. Cross validation

Cross-validation is a statistical technique; it is generally used to check and evaluate learning algorithms or models by partitioning data into a learning set to train the model and a testing set to evaluate it.

In cross-validation, the data are randomly divided into training and testing partitions (60% of the data in training sets and 40% in testing sets) and go through successive crossover rounds so that each instance is used for testing. K-fold cross-validation is the basic form: in each round, one of the K partitions is used as the validation set. There are more complicated forms of cross-validation that use k-fold as a base [13].

III. RELATED WORKS

Many studies have been done in the field of BCC and ML. Some of them used mammography images; the issue is that images can miss about 15% of breast cancers [14]. Other techniques are more specific and used the genome or phenotypes to perform classification [15, 16]. Breast cancer has been classified with several techniques, such as the Softmax Discriminant Classifier (SDC), Linear Discriminant Analysis (LDA) [17], and Fuzzy C Means Clustering [18]. The k-nearest neighbors algorithm is one of the most used algorithms in machine learning [19, 20]. Before classifying a new element, we must compare it to other elements using a similarity measure [14]. In cancer classification, KNN can be used to measure performance in terms of false positive rates [21, 22]. Naïve Bayesian classifiers are generally used to predict biological, chemical and physiological properties. In cancer classification, NBC is sometimes combined with other classifiers, such as decision trees, to determine prognoses or classification models.

Different classification techniques have been developed for breast cancer diagnosis, and the accuracy of many of them was evaluated using the dataset taken from the Wisconsin breast cancer database [23]. For example, in [24] the optimized learning vector method's performance was 96.7%, the big LVQ method reached [...], and the accuracy of SVM for cancer diagnosis, 97.13%, is the highest in the literature.

(5)
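As a small illustration of two ideas from the Background section, the sketch below evaluates Bayes' theorem as stated in Eq. 1 and builds the partitions of a K-fold cross-validation. It is a minimal sketch, not the implementation used in this study: the function names `posterior` and `k_fold_indices`, the `seed` parameter, and every probability value are assumptions chosen for illustration, and the folds are equal-sized K-fold partitions rather than the 60%/40% division mentioned above.

```python
import random
from typing import List, Tuple


def posterior(prior_h: float, likelihood_d_given_h: float, evidence_d: float) -> float:
    """Bayes' theorem (Eq. 1): P(h | d) = P(d | h) * P(h) / P(d)."""
    if evidence_d <= 0:
        raise ValueError("P(d) must be positive")
    return likelihood_d_given_h * prior_h / evidence_d


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) index pairs.

    Indices are shuffled once, then cut into k folds; each fold serves as
    the validation set exactly once, the remaining folds as training set.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partitioning, reproducible via seed
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits


# Hypothetical probabilities, for illustration only (not taken from the paper).
p_malignant = 0.35                 # P(h)
p_feature_given_malignant = 0.8    # P(d | h)
p_feature_given_benign = 0.1       # P(d | not h)
# P(d) by the law of total probability over the two classes.
p_feature = (p_feature_given_malignant * p_malignant
             + p_feature_given_benign * (1 - p_malignant))
p_malignant_given_feature = posterior(p_malignant, p_feature_given_malignant, p_feature)

splits = k_fold_indices(n_samples=10, k=5)
```

In an actual evaluation, each `(training, validation)` pair would be used to fit a classifier on the training indices and score it on the validation indices, and the K scores would then be averaged.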