Abstract: Breast cancer is one of the deadliest diseases in the world. Two million new cases were registered in the year 2018. In this scenario, detecting and classifying breast cancer at the right time is crucial for diagnosis. Several approaches and tools exist to detect and classify breast cancer, and their improvement is still ongoing. In our approach, we classify breast cancer and compare our approach with others implemented in different tools. Here, we find that our approach performs better than them.

Keywords- Breast cancer, Support vector classifier, K-NN classifier.

1. Introduction
Our body is made up of cells, which grow and divide every microsecond. Sometimes uncontrollable cells in the body create an extra, unwanted mass. The mass formed by these unwanted extra cells leads to a tumour, which may be malignant (cancerous) or benign (non-cancerous). A benign tumour grows in the body but does not spread to other parts of the body, whereas a malignant tumour grows rapidly.

Breast cancer is the second most dangerous disease leading to death [1]. It is most commonly found in women. According to data provided by the World Cancer Research Fund International (WCRF), breast cancer mostly occurs after menopause, and in 2012 about 1.7 million new cases were observed [2]. There are many factors that can lead to breast cancer, such as hormonal imbalance, early menstruation, late menopause, first pregnancy after the age of 30, and miscarriages. The highest incidence of breast cancer in the world is found in Belgium, followed by Denmark. The disease also causes other health issues such as depression, mood swings, anger and anxiety [3]. Many methods have been used so far for the detection of breast cancer, such as removing a sample of breast cells for testing (biopsy), mammography [4], and breast magnetic resonance imaging (MRI) [5]. These techniques do not provide fully accurate results in the detection of breast cancer.

Breast cancer is characterized mainly by tumour size and is divided into four stages. Doctors determine the stage from the patient's tumour size and provide treatment accordingly; if the tumour is larger than 2 cm, chemotherapy is given. In stage 1 the tumour is usually small and contained within the breast. In stage 2 the tumour grows but does not spread to other parts of the body, although in some cases the cancerous cells have spread into the lymph nodes. In stage 3 the cancerous cells have started to spread and the tumour is becoming larger. In stage 4 the tumour has spread to other parts of the body and there is little chance of saving the patient. This stage is also known as secondary or metastatic cancer [6]. All the stages are shown in Figure 1.

There are many types of breast cancer, of which the most common are ductal carcinoma in situ, invasive ductal carcinoma and invasive lobular carcinoma. If the cancer cells are in situ, they do not spread anywhere, but if the cancer is invasive, there is a high chance of it spreading into the surrounding breast tissue.

As is well known, early detection of breast cancer is very helpful to control and cure the disease; thus, many machine learning approaches have been used for the early detection of breast cancer. Automatic detection of cancer without any medical expert can be done with the help of classifiers. There are many types of classifiers that can be used to classify breast cancer as malignant or benign. Classifiers such as the Support Vector Machine (SVM) [19] and the K-Nearest Neighbour (KNN) classifier are used in this paper.
3. Classifiers
SVM is also known as a non-probabilistic binary linear classifier, as it predicts new instances into one of two classes. Its decision function has the form

f(x) = wᵀɸ(x) + b

where w is the coefficient vector, b is a constant, and ɸ(x) is the mapping function of the input data, which is given by k(xᵢ, xⱼ) = ɸ(xᵢ) · ɸ(xⱼ). Additionally, k(xᵢ, xⱼ) is known as the kernel function. The optimal equation is given as

Max [ ∑ᵢ₌₁ⁿ a(i) − (1/2) ∑ᵢ,ⱼ₌₁ⁿ a(i) a(j) y(i) y(j) k(x(i), x(j)) ]   (3)

The kernel function can be RBF or polynomial.

KNN is a simple algorithm which falls into the category of supervised learning. It is a lazy and non-parametric method used in classification and regression. It stores all the available cases during training and classifies a new observation by the majority vote of its k nearest neighbours. In the training process, the algorithm stores the feature vectors and the labels of the dataset, while in the testing process new observations are classified by calculating similarity measures such as Euclidean distance, cosine similarity, etc. The selection of the value of k is the most important factor in KNN classification, as it determines how well the data can be used to generalize the result. Informally, supervised learning means that we are given a labelled dataset of training observations (x1, y1), (x2, y2), … and would like to capture the relationship between the x's and y's. More formally, our goal is to learn a function h: X → Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y. The KNN classifier is also a non-parametric and instance-based learning algorithm.
• Non-parametric means it makes no explicit assumptions about the functional form of h, avoiding the dangers of mismodelling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.

4. Implementation
The breast cancer dataset is collected from the UCI machine learning repository and is named Wisconsin Breast Cancer (WBC) [17]. This dataset contains 699 instances, each consisting of 10 attributes. It contains one more attribute, named "class", which indicates whether a patient has a cancerous tumour: it is given as 2 for benign and 4 for malignant. There are 458 benign and 241 malignant instances. We removed the 16 instances with missing values, so our cleaned dataset contains 683 instances.

We can measure the performance of machine learning algorithms by calculating performance indices. These indices are calculated on the basis of the confusion matrix, which consists of 4 parameters: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). The confusion matrix is a table that describes the performance of a classifier in terms of the actual and predicted classes [18]. It represents the classification for the two classifiers as given in Table 1, where

TP = the prediction is positive and the actual class is positive (correctly recognized).

TN = the prediction is negative and the actual class is negative (correctly rejected).

FP = the prediction is positive but the actual class is negative (incorrectly recognized).

FN = the prediction is negative but the actual class is positive (incorrectly rejected).
Specificity shows how many negatives are correctly identified and is given as

Specificity = TN / (TN + FP)   (5)

and sensitivity shows how many positives are correctly identified:

Sensitivity = TP / (TP + FN)   (6)

Table 3. Performance measure indices
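These indices can be illustrated with a short sketch that counts the four confusion-matrix entries directly; the label vectors below are hypothetical and serve only to demonstrate Eqs. (5) and (6):

```python
# Hypothetical labels for illustration: 1 = malignant, 0 = benign.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

# Confusion-matrix counts (sum of booleans counts matching pairs).
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

specificity = tn / (tn + fp)   # Eq. (5)
sensitivity = tp / (tp + fn)   # Eq. (6)
print(tp, tn, fp, fn, round(specificity, 3), round(sensitivity, 3))
# → 3 5 1 1 0.833 0.75
```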
5. R Language
The experiments were performed on the WBC dataset for classification of breast cancer as malignant or benign. We used 10-fold cross validation with 70:30 and 80:20 training-testing partitions of the dataset. Among 204 test instances, 138 are correctly identified in both cases (SVM and KNN) with the 70:30 partition, and among 136 test instances, 96 are correctly identified in both cases (SVM and KNN) with the 80:20 partition; the confusion matrices are shown in Table 2. Here we can observe that as the size of the training set increases, the accuracy of both classifiers decreases. The evaluated parameters, namely specificity, precision and sensitivity, are shown in Table 3. It shows that SVM performance is much better than that of KNN, as the specificity, precision and sensitivity obtained by SVM are 98.41%, 97.87% and 97.87% respectively in the case of the 70:30 partition.
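The 70:30 training-testing setup for the two classifiers can be sketched in a few lines of scikit-learn. Note that this sketch uses scikit-learn's bundled breast cancer data (the related WDBC dataset, not the WBC file from the UCI repository), so its scores will differ from those in Table 3:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Bundled WDBC data (569 instances), standing in for the WBC file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)  # 70:30 partition

svm = SVC(kernel="rbf").fit(X_train, y_train)            # RBF-kernel SVM
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("SVM accuracy:", svm.score(X_test, y_test))
print("KNN accuracy:", knn.score(X_test, y_test))
```

Feature scaling (omitted here for brevity) usually improves both classifiers, since KNN distances and the RBF kernel are sensitive to attribute ranges.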
Finally, we compare previous results with our result obtained using the R language, as shown in Table 4. In the comparison we found that the authors of [15] used a 4-fold technique in MATLAB 7.0, whereas the authors of [20] used a 10-fold technique in WEKA. In our approach we use a 10-fold technique, i.e. the dataset is divided into 10 portions, repeated 3 times, using the R tool. Here we observe that we obtain higher accuracy than [15] and [20] on the same dataset using a different tool.
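The evaluation protocol described above, 10 folds repeated 3 times, can be sketched as follows (in Python rather than R, and again on scikit-learn's bundled breast cancer data, so the scores are only illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 10 folds, repeated 3 times -> 30 accuracy estimates per model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)

print(len(scores), round(scores.mean(), 3))
```

Averaging over all 30 folds gives a more stable accuracy estimate than a single train-test split.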