
BREAST CANCER DETECTION USING ML

MINI PROJECT REPORT

Submitted to -
Dr. Ruchika Malhotra
Associate Dean (IRD), DTU
A PROJECT BY

o ANISHA GUPTA: 2K17/EC/23


o DIWAKAR ARORA: 2K17/EC/61

INDEX
I. Timeline and progressive itinerary
II. Introduction
III. Different proposed methods of detection
i. Mammography
ii. Supervised learning techniques
iii. Deep Learning for image classification (CNN)
IV. PHASE 1
i. Mammographic Research
ii. Prior developments in mammography for Breast cancer detection
iii. Why we didn’t proceed with Mammography for Breast Cancer
detection?

V. PHASE 2
i. Supervised Machine Learning Techniques; A brief overview and
chosen method of progression
ii. Proposed Supervised ML algorithms
iii. Google Colaboratory link to the implemented code

VI. PHASE 3
i. Convolution Neural Network for Breast Cancer Detection.
ii. Database Elucidated - BreakHis
iii. Why Histopathological Images are used in place of
Mammographic Images?
iv. Convolutional Neural Networks
v. Google Colaboratory link to the implemented code
vi. Experimental results
vii. Defining parameters for our model
VII. CONCLUSION AND FUTURE SCOPE

TIMELINE AND PROGRESSIVE ITINERARY

o Machine Learning has become a vital part of Medical Imaging research.
ML methods have evolved over the years from manual seeded inputs
to automatic initialisations.
o Since the abundance of varied Machine Learning algorithms seems a bit
overwhelming, we started our research on the topic in a rather
systematic manner.

o We began our research on Breast Cancer by working with
Mammographic Imagery. Mammography is currently one of the most
important methods for detecting breast cancer early.

A mass can be either benign or malignant. Benign tumours have round or
oval shapes, while malignant tumours have a partially rounded shape with an
irregular outline. In addition, a malignant mass appears whiter than the
tissue surrounding it.

o Even though modern Mammography uses a low dose of X-rays, it still
exposes the patient to ionising radiation, which may lead to
complications. To avoid this drawback, we decided to drop the use of
mammographic images for the detection of Breast cancer and began the
search for safer alternatives.

Next, we used a publicly available dataset created by Dr. William H. Wolberg,
physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA.
Fluid samples were taken from patients with solid breast masses and analysed
with an easy-to-use graphical computer program called Xcyt, which is capable
of analysing cytological features based on a digital scan.

o The program uses a curve-fitting algorithm to compute ten features
from each of the cells in the sample. It then calculates the mean
value, extreme value and standard error of each feature for the image,
returning a 30-dimensional real-valued vector.

Once our dataset was available, we used data visualisation techniques to
sort and examine the data, then applied classification algorithms to the
processed data and measured the accuracy of each.
(The classification algorithms and the corresponding code are elaborated
upon in the following sections.)

o Finally, following the success of our previous work with the
classification algorithms, we moved on to a more concise and
pragmatic approach: the use of deep learning to classify our data
and hence predict the outcome, i.e. whether the tumour is benign or
malignant.

o For this we use the publicly available BreakHis dataset,
which is explained further in the following sections of the report.

After a careful investigation of the problem at hand, we decided to apply
image classification using a CNN (Convolutional Neural Network) for the best
accuracy.

Lastly, we worked on creating an API that can be used publicly for the
classification of tumour cells, provided the requisite information is supplied.

INTRODUCTION
Breast cancer affects one out of eight females worldwide. It is diagnosed by
detecting the malignancy of the cells of breast tissue. Modern medical image
processing techniques work on histopathology images captured by a
microscope, and then analyse them by using different algorithms and
methods. Machine learning algorithms are now being used for processing
medical imagery and pathological tools.

Breast cancer (BC) is one of the most common cancers among women
worldwide, accounting for a large share of new cancer cases and cancer-related
deaths among women according to global statistics, making it a significant
public health problem in today’s society.

o The early diagnosis of Breast Cancer can improve the prognosis and
chance of survival significantly, as it can promote timely clinical
treatment to patients. Furthermore, accurate classification of benign
tumours can prevent patients undergoing unnecessary treatments.

o Thus, the correct diagnosis of Breast Cancer and classification of
patients into malignant or benign groups is the subject of much
research. Because of its unique advantages in detecting critical features
in complex BC datasets, machine learning (ML) is widely recognized
as the methodology of choice in BC pattern classification and forecast
modelling.

o The drawback of MRI is that the patient could develop an allergic
reaction to the contrast agent, or that a skin infection could develop
at the place of injection. It may also cause claustrophobia. Masses and
microcalcifications (MCs) are two important early signs of the disease.

Classification and data mining methods are an effective way to classify data,
especially in the medical field, where they are widely used in diagnosis
and analysis to support decision-making.

o In the last few decades, several data mining and machine learning
techniques have been developed for breast cancer detection and
classification, which can be divided into three main stages: pre-
processing, feature extraction, and classification.

o To facilitate interpretation and analysis, the pre-processing of
mammography films helps improve the visibility of peripheral areas and
intensity distribution, and several methods have been reported to assist
in this process.

o Feature extraction is an important step in breast cancer detection
because it helps discriminate between benign and malignant tumours.
After segmentation, image properties such as smoothness, coarseness,
depth, and regularity are extracted.

In this report, we work step by step through different machine learning
techniques for classifying tumours as either benign or malignant, analysing
the accuracy of each, in order to arrive at an approach where risks are
minimal and the corresponding accuracy is highest.

Deep Learning to Improve Breast Cancer Detection on
Screening Mammography

To visualize the internal breast structures, a low-dose x-ray of the breasts is
performed; this procedure is known as mammography in medical terms. It is
one of the most suitable techniques to detect breast cancer. Modern mammograms
expose the breast to much lower doses of radiation than the devices
used in the past.
The mammograms are acquired in two different views for each breast:
craniocaudal (CC) view and mediolateral oblique (MLO) view.

Detection of subclinical breast cancer on screening mammography is
challenging as an image classification task because the tumours themselves
occupy only a small portion of the image of the entire breast. For example, a
full-field digital mammography (FFDM) image is typically 4000 × 3000 pixels
while a potentially cancerous region of interest (ROI) can be as small as
100 × 100 pixels.
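To put these sizes in perspective, the short calculation below works out what share of the image such an ROI covers. It uses only the example dimensions quoted above and is meant as an illustration, not a measurement.

```python
# Share of a full-field mammogram covered by a small region of interest (ROI).
# The sizes are the example figures quoted in the text.
image_pixels = 4000 * 3000   # typical FFDM image
roi_pixels = 100 * 100       # small potentially cancerous ROI

roi_fraction = roi_pixels / image_pixels
print(f"ROI covers {roi_fraction:.4%} of the image")
```

The ROI is one part in 1200 of the image, i.e. less than a tenth of one percent of the pixels.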

Various steps are involved in a Computer aided diagnosis (CAD) system using
a conventional workflow.

THE WORKFLOW CAN BE DEFINED IN A FLOWCHART AS FOLLOWS:

Image enhancement
Image enhancement is processing the mammogram images to increase
contrast and suppress noise in order to aid radiologists in detecting the
abnormalities.
The CLAHE (Contrast Limited Adaptive Histogram Equalization) algorithm can be
used for image enhancement and can be defined as follows:
1. Divide the original image into contextual regions of equal size,
2. Apply the histogram equalization on each region,
3. Limit this histogram by the clip level,
4. Redistribute the clipped amount among the histogram, and
5. Obtain the enhanced pixel value by the histogram integration.
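
Steps 2 to 5 can be sketched for a single contextual region as below. This is a simplified illustration in plain Python, not the implementation used in any particular library: the function name `clipped_equalize` and the parameter `clip_limit` are our own assumptions, and the division into regions (step 1) as well as the interpolation between neighbouring regions used by full CLAHE are left out.

```python
def clipped_equalize(tile, clip_limit, levels=256):
    """Contrast-limited histogram equalization of one region.

    tile is a list of rows of integer grey levels in [0, levels - 1];
    clip_limit is a hypothetical maximum count allowed per histogram bin.
    """
    pixels = [p for row in tile for p in row]

    # Step 2: build the histogram of the region.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Step 3: limit the histogram by the clip level, collecting the excess.
    excess = sum(max(h - clip_limit, 0) for h in hist)
    hist = [min(h, clip_limit) for h in hist]

    # Step 4: redistribute the clipped amount evenly over all bins.
    share = excess / levels
    hist = [h + share for h in hist]

    # Step 5: integrate the histogram (cumulative sum) to get new pixel values.
    total = sum(hist)
    mapping, running = [], 0.0
    for h in hist:
        running += h
        mapping.append(round((levels - 1) * running / total))

    return [[mapping[p] for p in row] for row in tile]


enhanced = clipped_equalize([[0, 0], [0, 255]], clip_limit=1)
```

A lower clip level flattens the histogram more strongly before integration, which limits how much noise in near-uniform regions is amplified.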

Image segmentation
Image segmentation is used to divide an image into parts having similar
features and properties. The main aim of segmentation is to simplify the
image by presenting it in an easily analysable way. Some of the most popular
image segmentation methodologies are edge-based, fuzzy theory, partial
differential equation (PDE), artificial neural network (ANN), threshold, and
region-based segmentation.

Feature extraction
A Deep Convolutional Neural Network is used to perform feature
extraction. Feature extraction is a process of dimensionality reduction by
which an initial set of raw data is reduced to more manageable groups for
processing. A characteristic of these large data sets is a large number of
variables that require a lot of computing resources to process.

Classification
In this step, the ROI is classified as either benign or malignant according to
the features. There are many classification techniques, such as linear
discriminant analysis (LDA), artificial neural networks (ANN), binary decision
trees, and support vector machines (SVM). Since the problem at hand is a
binary classification problem, we concluded that SVM should be used because it
has achieved high classification rates in the breast cancer classification problem.

o After an extensive read-through of research papers, we identified a few
important studies; a few excerpts relevant to our project
are shown below.

Why we didn’t proceed with Mammographic
Images?

o While CAD is now being used in radiology in conjunction with a
wide range of body regions and a variety of imaging modalities,
the preponderant question has been: can CAD enable disease
detection?

o For instance, in mammography, CAD methods have been
developed to automatically identify or classify mammographic
lesions. In histopathology, on the other hand, simply identifying
presence or absence of cancer or even the precise spatial extent
of cancer may not hold as much interest as more sophisticated
questions such as: what is the grade of cancer?

o Further, at the histological (microscopic) scale one can begin to
distinguish between different histological subtypes of cancer,
which is very difficult, if not impossible, at the
coarser radiological scale.

o Moreover, histopathology involves no exposure to X-rays,
whereas Mammography does expose the patient to
radiation. Prolonged exposure to radiation can have adverse side
effects, depending on the response of the immune system and the
individual's bodily functioning. As a result, histopathology is deemed
much safer than X-ray mammography, and hence we decided to
proceed with the former.

PHASE 2
Herein we have applied several independent supervised learning algorithms to
our (publicly available) dataset and computed accuracies on the test set
for each of the proposed models. The details of our work
and the supporting code are elaborated upon in the section that
follows.

CLASSIFICATION USING DIVERSIFIED
SUPERVISED LEARNING ALGORITHMS:

We used the breast cancer dataset from the UCI Machine Learning
Repository.

o To create the dataset, fluid samples were taken from
patients with solid breast masses and analysed with an
easy-to-use graphical computer program called Xcyt, which
is capable of analysing cytological features based on a
digital scan.
o The program uses a curve-fitting algorithm to compute
ten features from each one of the cells in the sample.
o The mean, standard error and “worst” or largest (mean of
the three largest values) of these features are then computed
for each image, returning a 30-dimensional real-valued
feature vector.

Attribute Information:
1. ID number
2. Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each
cell nucleus:

1. Radius (mean of distances from center to points on the
perimeter)
2. Texture (standard deviation of gray-scale values)
3. Perimeter
4. Area
5. Smoothness (local variation in radius lengths)
6. Compactness (perimeter² / area - 1.0)
7. Concavity (severity of concave portions of the contour)
8. Concave points (number of concave portions of the
contour)
9. Symmetry
10. Fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the
three largest values) of these features were computed for each
image, resulting in 30 features.
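
The way ten per-nucleus features become a 30-value vector can be sketched as follows. This is our own illustration, not the Xcyt code: the function names are hypothetical, and we assume that "standard error" means the standard error of the mean (sample standard deviation divided by the square root of the number of nuclei).

```python
import math
import statistics


def summarise_feature(values):
    """Return (mean, standard error, worst) of one feature over all nuclei."""
    mean = statistics.mean(values)
    # Assumption: standard error of the mean = sample std. dev. / sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst" value: mean of the three largest values.
    worst = statistics.mean(sorted(values)[-3:])
    return mean, std_err, worst


def image_feature_vector(per_nucleus):
    """per_nucleus maps a feature name to its list of per-nucleus values.

    Returns all means, then all standard errors, then all worst values.
    """
    summaries = [summarise_feature(v) for v in per_nucleus.values()]
    means = [s[0] for s in summaries]
    errors = [s[1] for s in summaries]
    worsts = [s[2] for s in summaries]
    return means + errors + worsts


# Ten made-up features measured on four nuclei each give 3 x 10 = 30 values.
example = {f"feature_{i}": [1.0, 2.0, 3.0, 4.0] for i in range(10)}
vector = image_feature_vector(example)
```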

OBJECTIVE

Our objective was to analyse which features are most helpful in
predicting malignant or benign cancer and to see general trends that
may aid us in model selection and hyperparameter selection.
The goal is to classify whether the breast cancer is benign or malignant.
To achieve this, we have used machine learning classification methods
to fit a function that can predict the discrete class of new input.

THE SEQUENCE OF STEPS WE FOLLOWED TO REACH OUR DESIRED RESULT
IS LISTED BELOW AND IMPLEMENTED IN THE GOOGLE
COLABORATORY NOTEBOOK LINKED ON THE FOLLOWING PAGE:

1. Data Preparation
2. Encoding Categorical Data
3. Feature Scaling
4. Model Selection
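
Steps 2 and 3 can be sketched in a few lines. This is an illustration under our own assumptions, not the code from the linked notebook: the function names are hypothetical, and we assume that feature scaling means standardisation to zero mean and unit variance.

```python
import statistics


def encode_diagnosis(labels):
    """Encode the categorical diagnosis: 'M' (malignant) -> 1, 'B' (benign) -> 0."""
    return [1 if label == "M" else 0 for label in labels]


def standardise(column):
    """Scale one feature column to zero mean and unit variance."""
    mean = statistics.mean(column)
    # Fall back to 1.0 so that a constant column does not divide by zero.
    std = statistics.pstdev(column) or 1.0
    return [(x - mean) / std for x in column]


encoded = encode_diagnosis(["M", "B", "B"])
scaled = standardise([1.0, 2.0, 3.0])
```

In practice the mean and standard deviation would normally be computed on the training set only and then reused to scale the test set, so that no information from the test set leaks into training.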

CLASSIFICATION ALGORITHMS USED:

1. Logistic Regression
2. Nearest Neighbour
3. Support Vector Machines
4. Kernel SVM
5. Naïve Bayes
6. Decision Tree Algorithm
7. Random Forest Classification

MEASURING THE ACCURACY
We use the Classification Accuracy method to find the accuracy of our
models. Classification Accuracy is what we usually mean when we use the
term accuracy.
It is the ratio of the number of correct predictions to the total number of
input samples.

CONFUSION MATRIX
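
A confusion matrix breaks the predictions on the test set down into true positives, false positives, true negatives and false negatives; classification accuracy is the share of predictions that fall on the correct side. The sketch below is a minimal illustration with made-up labels (1 = malignant, 0 = benign), and the function names are our own.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up example: five samples, three of them predicted correctly.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```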

After applying the different classification models, we obtained the
following accuracies:

1. Logistic Regression: 95.8%
2. Nearest Neighbour: 95.1%
3. Support Vector Machines: 97.2%
4. Decision Tree Algorithm: 95.8%
5. Random Forest Classification: 98.6%
6. Kernel SVM: 96.5%
7. Naive Bayes: 91.6%

THE CODE IS IMPLEMENTED ON GOOGLE
COLABORATORY AND THE
LINK FOR THE SAME IS PROVIDED BELOW:

LINK:
https://colab.research.google.com/drive/1dxNE5P2x79gmpsEsyRaqFOlADP5EvgKo

PHASE 3
HEREIN WE USED HISTOPATHOLOGICAL IMAGES OF THE
BREAST AS DATASET TO CLASSIFY TUMOR AS BENIGN OR
MALIGNANT AND HENCE CONFIRM PRESENCE OF BREAST
CANCER.

The goal is to identify whether a tumour is benign or
malignant in nature, as malignant tumours are cancerous and
should be treated as soon as possible to reduce and prevent
further complications. In short, it is a binary classification
problem.

In deep learning algorithms, a series of tasks are
implemented.

o The first step is image pre-processing, which is required to
convert data into a format in which it can be input directly
to the network. This step involves multiple channelling of
the images, followed by segmentation. At this
stage, data is ready to be used in training, either in a
supervised or an unsupervised manner.

o The next step is feature extraction. Features represent the
visual content of the histopathology image. In supervised
feature extraction, the features are known and
different strategies are applied to find them, whereas in
unsupervised feature extraction the features are not
known in advance and are learned implicitly
by the Convolutional Neural Network (CNN).

o The last step is classification, which places an image into
the respective class (benign or malignant) and can be done
using SVM (support vector machine) or with a fully
connected layer using an activation function such as
Softmax.

Introduction to the proposed Approach for Image
classification using CNN

The performance of most conventional classification systems relies on
appropriate data representation, and much of the effort is dedicated to
feature engineering, a difficult and time-consuming process that uses prior
expert domain knowledge of the data to create useful features. On the other
hand, deep learning can extract and organize the discriminative information
from the data, not requiring the design of feature extractors by a domain
expert.

In this report, we present our preliminary experiments using the deep
learning approach to classify breast cancer histopathological images
from BreakHis, a publicly available dataset.

In our experimental setup, we considered a method based on the
extraction of image patches for training the CNN and the combination of
these patches for final classification. This process allows the
high-resolution histopathological images from BreakHis to be used as input
to an existing CNN, avoiding adaptations of the model that would lead to a
more complex and computationally costly architecture.
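
The patch idea can be sketched as follows. This is only an illustration under our own assumptions: the 64-pixel patch size and stride are hypothetical values, and averaging the per-patch class probabilities is one simple way to combine patches, not necessarily the rule used in our experiments.

```python
def patch_origins(height, width, patch, stride):
    """Top-left (row, col) corners of all full patches in a sliding window."""
    return [(r, c)
            for r in range(0, height - patch + 1, stride)
            for c in range(0, width - patch + 1, stride)]


def combine_patch_predictions(patch_probs):
    """Average per-patch class probabilities; return (mean_probs, class index)."""
    n = len(patch_probs)
    n_classes = len(patch_probs[0])
    mean_probs = [sum(p[k] for p in patch_probs) / n for k in range(n_classes)]
    return mean_probs, mean_probs.index(max(mean_probs))


# A 700x460 BreakHis image cut into non-overlapping 64x64 patches.
origins = patch_origins(height=460, width=700, patch=64, stride=64)

# Made-up probabilities [benign, malignant] for three patches of one image.
mean_probs, predicted = combine_patch_predictions(
    [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
)
```

Each patch is classified by the CNN on its own, and the image-level decision is taken from the combined patch outputs.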

BreakHis Database

The BreakHis database contains microscopic biopsy images of benign and
malignant breast tumours. Breast tissues are taken as samples by the
procedure of surgical (open) biopsy (SOB).

A SLIDE OF BREAST MALIGNANT TUMOR SEEN IN DIFFERENT MAGNIFICATIONS

o The samples are generated from breast tissue biopsy slides, stained
with haematoxylin and eosin (HE). The samples are collected by
surgical open biopsy (SOB) and prepared for histological study.

o The preparation procedure used in this work is the standard paraffin
process. The main goal is to preserve the original tissue structure and
molecular composition, allowing it to be observed under a light microscope.
The complete preparation process includes steps such as fixation,
dehydration, clearing, infiltration, embedding and trimming.

o Samples are stained with haematoxylin and eosin and produced by a
standard paraffin process in which specimen infiltration and
embedment are done in paraffin. Images are taken by a Samsung
high-resolution device (SCC-131AN) coupled with an Olympus BX-50
microscopic system equipped with a relay lens with a magnification
of 3.3×. These histopathology images use an RGB (three-channel)
TrueColor (8 bits Red, 8 bits Green, 8 bits Blue) colour coding
scheme.

Magnification   Benign   Malignant   Total
40x             1250     2740        3990
100x            1288     2874        4162
200x            1246     2780        4026
400x            1176     2464        3640
Total           4960     10858       15818

This database contains a total of 15818 images of 700×460 pixel resolution.
Images are captured in four different magnification levels: 40X, 100X, 200X
and 400X.

CONVOLUTIONAL NEURAL NETWORK

A CNN is a variety of deep neural network that exploits the
correlation of neighbouring pixels. It starts with randomly initialised
filters (patches) and modifies them during the training process. Once
training is done, the network uses these learned filters to predict and
validate the result in the testing and validation process.

The CNN architecture has two main types of transformation. The first is
convolution, in which pixels are convolved with a filter or kernel. This step
provides the dot product between image patch and kernel. The width and
height of filters can be set according to the network, and the depth of the
filter is the same as the depth of the input.
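
The dot product between image patch and kernel can be made concrete with a small single-channel sketch. It is a simplification for illustration only: it uses no padding and a stride of 1, handles one channel rather than the full input depth, and the function name is our own.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the dot product at each position.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel)   # [[-4, -4], [-4, -4]]
```

For an RGB input the same sum additionally runs over the three channels, which is why the depth of the filter has to match the depth of the input.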

A second important transformation is subsampling, which can be of many
types (max_pooling, min_pooling and average_pooling), used as
required. The pooling layer is responsible for lowering the dimensionality of
the data, and is quite useful for reducing overfitting. After using a
combination of convolution and pooling layers, the output can be fed to a
fully connected layer for efficient classification.
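
Max pooling, the variant used later in our model, can be sketched as below. The function name is our own, and we assume non-overlapping windows (stride equal to the pool size), with incomplete windows at the border dropped.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


pooled = max_pool2d([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12],
                     [13, 14, 15, 16]])   # [[6, 8], [14, 16]]
```

A 4x4 map shrinks to 2x2, so every later layer has a quarter as many values to process.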

PROPOSED OUTLINE AND WORKFLOW PROCESS

CNN ARCHITECTURE:

o Input Layer: It loads the input, image data in our case, and produces
the output to feed to the convolutional layers. In our case, the dimension
of an image is (92x140) and the number of channels is 3 (RGB).

o Convolutional Layer: A convolutional layer applies a filter to the input
to create a feature map that summarises the presence of detected
features in the input. There are three convolutional layers in our
model. The kernels are of size (3x3), used with the “ReLU” activation
function. In our case, the spatial size of the convolved feature is
kept the same as that of the input. This is
achieved by using the “same” padding. The first two convolutional
layers learn 32 filters each while the last layer learns 64 filters.

o Pooling Layer: The pooling layer is responsible for reducing the
spatial size of the convolved feature and extracting dominant features.
This is done to decrease the computational power required to process
the data. The pooling layer is generally used between two
convolutional layers. In our case, each convolutional layer is followed
by the Max-Pooling layer, defined by the pool size of (2,2).

Layer Attribute   L1     L2     L3     L4     L5     L6
Type              Conv   Pool   Conv   Pool   Conv   Pool
Channel           32     -      32     -      64     -
Filter Size       3x3    -      3x3    -      3x3    -
Conv. Stride      1x1    -      1x1    -      1x1    -
Pooling Size      -      2x2    -      2x2    -      2x2
Padding Size      same   none   same   none   same   none
Activation        ReLU   -      ReLU   -      ReLU   -
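
The effect of these six layers on the data shape can be traced with a short calculation. It is a sketch under stated assumptions: the input is taken as 92 rows by 140 columns with 3 channels, "same" padding with a 1x1 stride keeps height and width unchanged, and each 2x2 pooling layer without padding halves both dimensions, rounding down (pooling stride assumed equal to the pool size).

```python
def conv_output(h, w, filters):
    # "same" padding with stride 1x1 keeps height and width unchanged.
    return h, w, filters


def pool_output(h, w, channels, pool=2):
    # 2x2 pooling without padding halves each spatial dimension, rounding down.
    return h // pool, w // pool, channels


def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block plus one bias per filter.
    return kernel * kernel * in_channels * filters + filters


shape = (92, 140, 3)              # assumed (height, width, channels)
conv_parameter_total = 0
for filters in (32, 32, 64):      # L1, L3 and L5 in the table
    conv_parameter_total += conv_params(3, shape[2], filters)
    shape = conv_output(shape[0], shape[1], filters)
    shape = pool_output(*shape)   # L2, L4 and L6

flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened, conv_parameter_total)
```

Under these assumptions the last pooling layer outputs an 11 x 17 x 64 volume, which flattens to 11,968 values, and the three convolutional layers together hold 28,640 trainable parameters.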

o Dropout Layers: Dropout layers are a way to prevent over-fitting
by reducing interdependent learning amongst the neurons. Our
dataset has a difference in the number of Malignant and Benign
Images even after data augmentation. To improve the accuracy of our
model, we have used two Dropout layers. The first dropout layer
follows the last pooling layer with a dropout rate set to 0.5 and
the second dropout layer is used after the first dense layer with a
dropout rate set to 0.2.
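
What a dropout layer does during training can be sketched as below. This is the common "inverted dropout" formulation, written out by hand as an illustration; the function name and the `rng` parameter are our own.

```python
import random


def dropout(activations, rate, rng=random):
    """Zero each value with probability `rate` and rescale the survivors.

    Dividing the kept values by (1 - rate) keeps the expected sum unchanged.
    Dropout is only applied during training, not at prediction time.
    """
    if rate == 0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


# With rate 0.5, roughly half of the values are dropped, the rest are doubled.
dropped = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
```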

o Flatten Layer: After extracting the pooled feature map and passing
it through the dropout layer, flattening is done. It converts the pooled
feature map matrix into a single column, which in turn is fed to the
dense layers for processing.

o Dense Layer: A Dense Layer is also called a fully connected layer. It is
like a hidden layer in which every neuron is connected to every neuron
of the next layer. After the flattening process, the feature map is passed
through the Dense Layers. Our model makes use of two dense layers: one
after flattening, with the “ReLU” activation function, and the other as
the output layer (where we get the predicted classes), with the “SoftMax”
activation function.

Final figures produced by the neural network should lie between zero
and one, representing the probability of each class. This is achieved by
using the SoftMax activation function in the output layer.
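
A minimal sketch of the SoftMax function shows why the outputs behave like class probabilities. The input scores are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into positive values that sum to one."""
    shift = max(logits)   # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two made-up output scores, e.g. [benign, malignant].
probabilities = softmax([2.0, 0.0])
```

The larger score receives the larger probability, and the two values always add up to one.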

Compiling and evaluating CNN:


Our CNN model is compiled with the ‘categorical cross-entropy’ loss function,
the ‘Adam’ optimizer as the optimization algorithm and ‘accuracy’ as the
metric. The choice of categorical cross-entropy as the loss function is
straightforward: our model assigns each image to either the malignant
or the benign class, making it a single-label classification problem.
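
For a single image, categorical cross-entropy can be written out by hand as below. This is an illustration of the formula, not the Keras implementation; the small constant `eps` is our own addition to avoid taking the logarithm of zero.

```python
import math


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for one sample: y_true is one-hot, y_prob are predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


# True class is the second one; the model gives it a probability of 0.8.
loss = categorical_cross_entropy([0, 1], [0.2, 0.8])
```

Because the label is one-hot, only the probability assigned to the true class contributes, so the loss here is -ln(0.8) and it shrinks towards zero as that probability approaches one.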

EXPERIMENTAL RESULTS

o The experimental results have been computed using the BreakHis Dataset,
which consisted of a total of 15,818 images, comprising 4,960 benign
images and 10,858 malignant images. The data was further
pre-processed and augmented to improve the quality of images and for
better training of the model.

o After data augmentation, the model was trained on 16,662 images
and validated on 4,156 samples. Our dataset was split into an 80%
training set and a 20% testing set. The CNN model was trained on a TPU
backend using the Keras framework.
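
The split can be checked with a quick calculation that uses only the two image counts quoted above.

```python
train_images = 16_662
validation_images = 4_156

total_images = train_images + validation_images
validation_share = validation_images / total_images
print(f"{total_images} images in total, {validation_share:.1%} held out")
```

The two counts add up to 20,818 augmented images, of which almost exactly one fifth is held out, matching the stated 80/20 split.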

o The CNN model achieved a maximum accuracy of 94.13%. The analysis
of training loss vs validation loss and training accuracy vs validation
accuracy is depicted using the graph below.

THE CODE IS IMPLEMENTED ON GOOGLE
COLABORATORY AND THE
LINK FOR THE SAME IS PROVIDED BELOW:

LINK:
https://colab.research.google.com/drive/1DfcOSUq6qEj5LU2wo9k_SnltP7y_sFaU

DEFINING PARAMETERS FOR OUR MODEL

The Area under the ROC curve (AUC)


The AUC is used in medical diagnosis systems and provides an
approach for evaluating models based on the average of each point on
the ROC curve. The AUC score of a classifier always lies
between 0 and 1; a model with a higher AUC value gives
better classifier performance.
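
One way to compute the AUC without drawing the curve is sketched below. It relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The labels and scores are made up, and the function name is our own.

```python
def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.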

Precision
Precision is the ratio of correctly predicted positive observations to the
total predicted positive observations. High precision relates to a low
false positive rate (FPR). The precision is calculated using the following
equation, where TP and FP are the numbers of true and false positives.
Precision = TP / (TP + FP)

F1 score
F1 score is the harmonic mean of precision and recall. It is used as a
statistical measure to rate the performance of the classifier. Because it
combines precision and recall, this score takes both false positives and
false negatives into account.
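
Precision, recall and the F1 score can all be computed from the confusion-matrix counts, as in the sketch below. Recall, which is not defined above, is taken in its standard form TP / (TP + FN), and F1 is computed as the harmonic mean 2 x precision x recall / (precision + recall). The counts in the example are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# 6 true positives, 2 false positives and 6 false negatives.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=6)
```

Here precision is 0.75 but recall is only 0.5, and the F1 score of 0.6 sits closer to the weaker of the two, which is the point of using a harmonic mean.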

CONCLUSION AND FUTURE SCOPE

We have approached the problem of Breast Cancer Detection
using Machine Learning through a varied set of
approaches:

Starting with Mammographic images as our dataset, we
examined the limitations through a careful study
of the problem via previously published research papers and
online material. Though the process was feasible, recent studies
suggested drawbacks in the use of mammography for detecting
breast cancer. Thus, in view of current developments and
future trends, we moved to a more
robust method of diagnosis and classification: the use
of Histopathological images for conducting our classification.

Histopathological analysis remains the most widely accepted
and used method for Breast cancer diagnosis in the present day.
A deep learning algorithm was applied for
image classification, with several image segmentation processes
used for finer results. Once our model was trained and
tested on the test set, it provided an accuracy of 94.13%, which
we feel at the moment is fairly acceptable. Though we
continue to try to improve accuracy, our sustained efforts to
improve it are held back by the limited computational power
and backend capabilities available to us at the moment.
We are therefore compelled to limit our experimental
investigation here.

We also applied seven supervised Machine Learning algorithms
and calculated accuracies for each model trained on
a publicly available dataset. We were successful in
obtaining a Test Set accuracy close to 96.5% with the kernel SVM
classifier, which we feel is an excellent result given the
scope of our experimental setup and the amount of data used.

Though we are done with the mini-project, we plan to keep
working on this topic for as long as we can and
aim to devise new strategies to improve the Test
accuracies.

As of now we are looking into the YOLO (You Only Look
Once) framework for image segmentation and classification on
top of the CNN architecture, which is currently outside the scope
of our research and understanding and would require a
substantial amount of time to understand and work upon.

We are sincerely thankful for your persevering guidance and
goodwill until now and will continue to work towards refinement
to preserve it.

THANK YOU

A Project By :
o Anisha Gupta
o Diwakar Arora
