3 SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
3.1.1 Disadvantages
3.2 PROPOSED SYSTEM
3.2.1 Advantages
3.3 SYSTEM REQUIREMENTS
3.3.1 Hardware Requirement
3.3.2 Software Requirement
4 SYSTEM ARCHITECTURE
4.1 ARCHITECTURE DESCRIPTION
5 SYSTEM IMPLEMENTATION
5.1 LIST OF MODULES
5.1.1 Dataset
5.1.2 Data preprocessing
5.1.3 Classification
5.1.4 Comparative performance analysis
REFERENCES
ABSTRACT
CHAPTER 1
INTRODUCTION
1.1 SYSTEM OVERVIEW
ARTIFICIAL INTELLIGENCE
“Artificial Intelligence (AI) is the part of computer science concerned with
designing intelligent computer systems, that is, systems that exhibit
characteristics we associate with intelligence in human behaviour –
understanding language, learning, reasoning, solving problems, and so on.”
Scientific Goal: To determine which ideas about knowledge representation,
learning, rule systems, search, and so on, explain various sorts of real
intelligence.
Engineering Goal: To solve real world problems using AI techniques such as
knowledge representation, learning, rule systems, search, and so on.
Traditionally, computer scientists and engineers have been more interested
in the engineering goal, while psychologists, philosophers and cognitive
scientists have been more interested in the scientific goal.
The Roots - Artificial Intelligence has identifiable roots in a number of older
disciplines, particularly:
Philosophy
Logic/Mathematics
Computation
Psychology/Cognitive Science
Biology/Neuroscience
Evolution
1.2 MACHINE LEARNING
• Reinforcement Learning: learning based on feedback or reward
• Example: learn to play chess by winning or losing
1.3 CONVOLUTIONAL NEURAL NETWORK
CNNs use a variation of multilayer perceptrons designed to require minimal
preprocessing. They are also known as Shift Invariant or Space Invariant
Artificial Neural Networks (SIANN), based on their shared-weights architecture
and translation invariance characteristics.
Convolutional networks were inspired by biological processes in that the
connectivity pattern between neurons resembles the organization of the animal
visual cortex. Individual cortical neurons respond to stimuli only in a restricted
region of the visual field known as the receptive field. The receptive fields of
different neurons partially overlap such that they cover the entire visual field.
CNNs use relatively little pre-processing compared to other image classification
algorithms. This means that the network learns the filters that in traditional
algorithms were hand-engineered. This independence from prior knowledge and
human effort in feature design is a major advantage.
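To illustrate the shared-weights idea described above, the following minimal sketch slides one small filter over an image using NumPy. The function name `conv2d_valid`, the toy image and the hand-picked kernel are assumptions made for illustration only; in a real CNN the kernel values are learned during training, and this is not the implementation used in this project.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2-D image (no padding, stride 1).

    As in most CNN libraries, this is technically a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights are reused at every position (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# The same image with the edge moved one pixel to the right.
shifted = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hand-picked horizontal-difference filter (a learned filter would replace it).
kernel = np.array([[-1.0, 1.0]])

response = conv2d_valid(image, kernel)
shifted_response = conv2d_valid(shifted, kernel)
print(response[0])          # the filter fires at the edge position
print(shifted_response[0])  # the response moves with the edge
```

Because the same kernel is applied everywhere, moving the edge in the input simply moves the peak in the output, which is the translation behaviour the text refers to.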
Over the past decades, cancer research has evolved continuously. Scientists have
applied different methods, such as early-stage screening, in order to find types of
cancer before they cause symptoms. Moreover, they have
developed new strategies for the early prediction of cancer treatment outcome.
With the advent of new technologies in the field of medicine, large amounts of
cancer data have been collected and are available to the medical research
community. However, the accurate prediction of a disease outcome is one of the
most interesting and challenging tasks for physicians. As a result, ML methods
have become a popular tool for medical researchers. These techniques can discover
and identify patterns and relationships between them, from complex datasets, while
they are able to effectively predict future outcomes of a cancer type. Given the
significance of personalized medicine and the growing trend on the application of
ML techniques, we here present a review of studies that make use of these methods
regarding the cancer prediction and prognosis. In these studies prognostic and
predictive features are considered which may be independent of a certain treatment
or are integrated in order to guide therapy for cancer patients, respectively [2]. In
addition, we discuss the types of ML methods being used, the types of data they
integrate, the overall performance of each proposed scheme while we also discuss
their pros and cons. An obvious trend in the proposed works includes the
integration of mixed data, such as clinical and genomic. However, a common
problem that we noticed in several works is the lack of external validation or
testing regarding the predictive performance of their models. It is clear that the
application of ML methods could improve the accuracy of cancer susceptibility,
recurrence and survival prediction. According to [3], the accuracy of cancer outcome
prediction has significantly improved by 15%–20% in recent years with the
application of ML techniques. Several studies have been reported in the literature
and are based on different strategies that could enable the early cancer diagnosis
and prognosis [4–7]. Specifically, these studies describe approaches related to the
profiling of circulating miRNAs that have been proven a promising class for
cancer detection and identification. However, these methods suffer from low
sensitivity regarding their use in screening at early stages and their difficulty to
discriminate benign from malignant tumors. Various aspects regarding the
prediction of cancer outcome based on gene expression signatures are discussed in
[8,9]. These studies list the potential as well as the limitations of microarrays for
the prediction of cancer outcome. Even though gene signatures could significantly
improve our ability for prognosis in cancer patients, poor progress has been made
for their application in the clinics. However, before gene expression profiling can
be used in clinical practice, studies with larger data samples and more adequate
validation are needed. In the present work only studies that employed ML
techniques for modeling cancer diagnosis and prognosis are presented.
ML TECHNIQUES
ML, a branch of Artificial Intelligence, relates the problem of learning
from data samples to the general concept of inference [10–12]. Every learning
process consists of two phases: (i) estimation of unknown dependencies in a
system from a given dataset and (ii) use of estimated dependencies to predict new
outputs of the system. ML has also been proven an interesting area in biomedical
research with many applications, where an acceptable generalization is obtained by
searching through an n-dimensional space for a given set of biological samples,
using different techniques and algorithms [13]. There are two main common types
of ML methods known as (i) supervised learning and (ii) unsupervised learning. In
supervised learning a labeled set of training data is used to estimate or map the
input data to the desired output. In contrast, under the unsupervised learning
methods no labeled examples are provided and there is no notion of the output
during the learning process. As a result, it is up to the learning scheme/model to
find patterns or discover the groups of the input data. In supervised learning this
procedure can be thought of as a classification problem. The task of classification
refers to a learning process that categorizes the data into a set of finite classes. Two
other common ML tasks are regression and clustering. In the case of regression
problems, a learning function maps the data into a real-value variable.
Subsequently, for each new sample the value of a predictive variable can be
estimated, based on this process. Clustering is a common unsupervised task in
which one tries to find the categories or clusters in order to describe the data items.
Based on this process each new sample can be assigned to one of the identified
clusters concerning the similar characteristics that they share. Suppose for example
that we have collected medical records relevant to breast cancer and we try to
predict if a tumor is malignant or benign based on its size. The ML question would
then be to estimate the probability that the tumor is malignant or not (1
= Yes, 0 = No). Fig. 1 depicts the classification process of a tumor being malignant
or not. The circled records depict any misclassification of the type of a tumor
produced by the procedure. Another type of ML method that has been widely
applied is semi-supervised learning, which is a combination of supervised and
unsupervised learning. It combines labeled and unlabeled data in order to construct
an accurate learning model. Usually, this type of learning is used when there are
more unlabeled data than labeled data. When applying an ML method, data samples
constitute the basic components. Every sample is described with several features
and every feature consists of different types of values. Furthermore, knowing in
advance the specific type of data being used allows the right selection of tools and
techniques that can be used for their analysis. Some data-related issues refer to the
quality of the data and the preprocessing steps to make them more suitable for ML.
Data quality issues include the presence of noise, outliers, missing or duplicate
data and data that are biased or unrepresentative. Improving the data quality
typically also improves the quality of the resulting analysis. In addition, in order
to make the raw data more suitable for further analysis, preprocessing steps should
be applied that focus on the modification of the data. A number of different
techniques and strategies exist, relevant to data preprocessing that focus on
modifying the data for better fitting in a specific ML method. Among these
techniques some of the most important approaches include (i) dimensionality
reduction (ii) feature selection and (iii) feature extraction. There are many benefits
regarding the dimensionality reduction when the datasets have a large number of
features. ML algorithms work better when the dimensionality is lower [14].
Additionally, the reduction of dimensionality can eliminate irrelevant features,
reduce noise and can produce more robust learning models due to the involvement
of fewer features. In general, the dimensionality reduction by selecting new
features which are a subset of the old ones is known as feature selection. Three
main approaches exist for feature selection, namely the embedded, filter and wrapper
approaches.
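As an illustration of the filter approach to feature selection mentioned above, the sketch below scores each feature by its absolute Pearson correlation with the class label and keeps the top k. The function name `select_top_k_features`, the choice of correlation as the scoring rule and the synthetic data are all assumptions for illustration; the source does not specify which filter criterion, if any, was used.

```python
import numpy as np

def select_top_k_features(X, y, k):
    """Filter-style selection: rank features by |Pearson correlation| with y.

    X is an (n_samples, n_features) array, y an (n_samples,) label vector.
    Returns the sorted indices of the k best features and all scores.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features get a score of 0 instead of a division by zero.
    safe = np.where(denom == 0, 1.0, denom)
    scores = np.where(denom == 0, 0.0, np.abs(Xc.T @ yc) / safe)
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep, scores

# Synthetic data: 200 samples, 5 noise features, binary labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = rng.normal(size=(200, 5))
X[:, 2] += 3.0 * y  # feature 2 is made informative on purpose

keep, scores = select_top_k_features(X, y, k=1)
X_reduced = X[:, keep]  # dataset restricted to the selected features
print(keep, X_reduced.shape)
```

A filter method such as this one scores features independently of any classifier, whereas wrapper and embedded methods involve the learning algorithm itself in the selection.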
CHAPTER 2
LITERATURE SURVEY
Classification of Breast Cancer Based on Histology Images using
Convolutional Neural Networks
This work was contributed by Dalal Bardou et al. In recent years, the classification of
breast cancer has been the topic of interest in the field of Healthcare informatics,
because it is the second main cause of cancer-related deaths in women. Breast
cancer can be identified using a biopsy where tissue is removed and studied under
microscope. The diagnosis is based on the qualification of the histopathologist,
who will look for abnormal cells. However, if the histopathologist is not well-
trained, this may lead to wrong diagnosis. With the recent advances in image
processing and machine learning, there is an interest in attempting to develop
reliable pattern recognition based systems to improve the quality of diagnosis. In
this paper, we compare two machine learning approaches for the automatic
classification of breast cancer histology images into benign and malignant and into
benign and malignant sub-classes. The first approach is based on the extraction of
a set of handcrafted features encoded by two coding models (bag of words and
locality constrained linear coding) and trained by support vector machines, while
the second approach is based on the design of convolutional neural network. We
have also experimentally tested dataset augmentation techniques to enhance the
accuracy of the convolutional neural network as well as “handcrafted features +
convolutional neural network” and
“convolutional neural network features + classifier” configurations. The results
show convolutional neural networks outperformed the handcrafted feature based
classifier, where we achieved accuracy between 96.15% and 98.33% for the binary
classification and 83.31% and 88.23% for the multi-class classification.
Bacterial colony counting with Convolutional Neural Networks in Digital
Microbiology Imaging
This work was contributed by Alessandro Ferrari et al.
With this work we explore the possibility to find effective
solutions to the above issue by designing and testing two different machine
learning approaches. The first one is based on the extraction of a complete set of
handcrafted morphometric and radiometric features used within a Support Vector
Machines solution. The second one is based on the design and configuration of a
Convolutional Neural Networks deep learning architecture. To validate, in a real
and challenging clinical scenario, the proposed bacterial load estimation
techniques, we built and publicly released a fully labeled large and representative
database of both single and aggregated bacterial colonies extracted from routine
clinical laboratory culture plates. Dataset enhancement approaches have also been
experimentally tested for performance optimization. The adopted deep learning
approach outperformed the handcrafted feature based one, and also a conventional
reference technique, by a large margin, becoming a preferable solution for the
addressed Digital Microbiology Imaging quantification task, especially in the
emerging context of Full Laboratory Automation systems.
classification of these images in two classes, which would be a valuable computer-
aided diagnosis tool for the clinician. In order to assess the difficulty of this task,
we show some preliminary results obtained with state-of-the-art image
classification systems. The accuracy ranges from 80% to 85%, showing that room for
improvement remains. By providing this dataset and a standardized evaluation
protocol to the scientific community, we hope to gather researchers in both the
medical and the machine learning field to advance toward this clinical application.
(average 93.2% accuracy) on a large-scale dataset, which demonstrates the strength
of our method in providing an efficient tool for breast cancer multi-classification in
clinical settings.
This work was contributed by Stephen J. McKenna et al. We investigate glandular
structure segmentation in colon histology images as a window-based classification
problem. We compare and combine methods based on fine-tuned convolutional
neural networks (CNN) and hand-crafted features with support vector machines
(HC-SVM). On 85 images of H&E-stained tissue, we find that fine-tuned CNN
outperforms HC-SVM in gland segmentation measured by pixel-wise Jaccard and
Dice indices. For HC-SVM we further observe that training a second-level window
classifier on the posterior probabilities - as an output refinement - can substantially
improve the segmentation performance. The final performance of HC-SVM with
refinement is comparable to that of CNN. Furthermore, we show that by
combining and refining the posterior probability outputs of CNN and HC-SVM
together, a further performance boost is obtained.
enables learning of fine-grained (cellular) details and global tissue structures. Our
system is trained and evaluated on a dataset containing 221 WSIs of hematoxylin
and eosin stained breast tissue specimens. The system achieves an AUC of 0.962
for the binary classification of nonmalignant and malignant slides and obtains a
three-class accuracy of 81.3% for classification of WSIs into normal/benign, DCIS,
and IDC, demonstrating its potential for routine diagnostics.
CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
The existing system refers to work that has already been implemented successfully.
Techniques used in the existing system are described below.
In recent years, the classification of breast cancer has been the topic of interest in
the field of Healthcare informatics, because it is the second main cause of cancer-
related deaths in women. Breast cancer can be identified using a biopsy where
tissue is removed and studied under microscope. The diagnosis is based on the
qualification of the histopathologist, who will look for abnormal cells. However, if
the histopathologist is not well-trained, this may lead to wrong diagnosis. With the
recent advances in image processing and machine learning, there is an interest in
attempting to develop reliable pattern recognition based systems to improve the
quality of diagnosis. The existing work compares two machine learning approaches
for the automatic classification of breast cancer histology images into benign and
malignant and into benign and malignant sub-classes. The first approach is based
on the extraction of a set of handcrafted features encoded by two coding models
(bag of words and locality constrained linear coding) and trained by support vector
machines, while the second approach is based on the design of convolutional
neural networks. The authors also experimentally tested dataset augmentation
techniques to enhance the accuracy of the convolutional neural network, as well as
“handcrafted features + convolutional neural network” and “convolutional neural
network features + classifier” configurations. The results show that convolutional
neural networks outperformed the handcrafted feature based classifier, achieving
accuracy between 96.15% and 98.33% for the binary classification and
between 83.31% and 88.23% for the multi-class classification.
3.1.1 DISADVANTAGES
The designed CNN topology worked well on both binary and multi-class
classification tasks. However, the performance of the multi-class
classification was lower when compared to the one of the binary
classification due to the number of handled classes and also due to the
similarities between the sub-classes.
The results show convolutional neural networks outperformed the
handcrafted feature based classifier, where they achieved accuracy between
96.15% and 98.33% for the binary classification and 83.31% and 88.23% for
the multi-class classification.
Hardware specifications are a technical description of the computer's
components and capabilities, such as processor speed, model and manufacturer.
The hardware components required for the proposed system are:
Processor : Intel Core i5
Hard disk : 1 TB
Speed : 1.80 GHz
Memory : 4 GB
CHAPTER 4
SYSTEM ARCHITECTURE
CHAPTER 5
SYSTEM IMPLEMENTATION
SYSTEM DESCRIPTION
The proposed system is organized into modules, each of which applies particular
algorithms and techniques. Together, these modules perform breast cancer
classification based on histological images and present the results as graphs.
5.1.1 DATASET
tables, corresponding to a particular experiment or event. The collected data is stored
in the data warehouse.
TYPES OF PREPROCESSING:
RGB image
Binary image
The RGB color model is an additive color model in which red, green and
blue light are added together in various ways to reproduce a broad array of colors.
The main purpose of the RGB color model is for the sensing, representation and
display of images in electronic systems, such as televisions and computers, though
it has also been used in conventional photography.
BINARY IMAGE MODEL
A binary image is a digital image that has only two possible values for each
pixel. Typically, the two colors used for a binary image are black and white. The
color used for the object in the image is the foreground color while the rest of the
image is the background color. In the document-scanning industry, this is often
referred to as "bi-tonal".
Binary images are also called bi-level or two-level. This means that each pixel is
stored as a single bit (i.e., a 0 or 1). The names black-and-white, B&W,
monochrome or monochromatic are often used for this concept, but may also designate any
images that have only one sample per pixel, such as grayscale images.
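The two image models above can be connected by a short conversion sketch: an RGB image is reduced to one gray sample per pixel and then thresholded to a single bit per pixel. The function names, the ITU-R BT.601 luma weights and the threshold of 128 are assumptions chosen for illustration; the source does not state which conversion or threshold the project uses.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an (H, W, 3) RGB image to grayscale.

    Uses the ITU-R BT.601 luma weights (an assumed, commonly used choice).
    """
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(float) @ weights

def gray_to_binary(gray, threshold=128):
    """Pixels at or above the threshold become 1 (foreground), others 0."""
    return (gray >= threshold).astype(np.uint8)

# Hypothetical 2x2 RGB image: white, black, dark red, bright green.
rgb = np.array(
    [[[255, 255, 255], [0, 0, 0]],
     [[200, 30, 30], [10, 240, 10]]],
    dtype=np.uint8,
)

gray = rgb_to_gray(rgb)        # one sample per pixel
binary = gray_to_binary(gray)  # one bit of information per pixel
print(binary)
```

A fixed threshold is the simplest option; histology images with uneven staining may call for an adaptive threshold instead.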
PROCESSED DATA
5.1.3 CLASSIFICATION
POOLING LAYERS
Convolutional networks may include local or global pooling layers, which
combine the outputs of neuron clusters at one layer into a single neuron in the next
layer. For example, max pooling uses the maximum value from each of a cluster of
neurons at the prior layer. Another example is average pooling, which uses the
average value from each of a cluster of neurons at the prior layer.
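The max and average pooling operations described above can be sketched with NumPy as follows. The function name `pool2d`, the non-overlapping 2x2 window and the example feature map are illustrative assumptions, not the configuration used in this project.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows of a 2-D feature map."""
    h, w = x.shape
    # Drop trailing rows/columns that do not fill a complete window.
    h, w = h - h % size, w - w % size
    windows = x[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

# Hypothetical 4x4 feature map.
fmap = np.array(
    [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [9, 10, 13, 14],
     [11, 12, 15, 16]],
    dtype=float,
)

max_out = pool2d(fmap, mode="max")      # keeps the largest value per window
avg_out = pool2d(fmap, mode="average")  # keeps the mean value per window
print(max_out)
print(avg_out)
```

Each 2x2 cluster of the input collapses into one output value, so the 4x4 map becomes 2x2 in both modes.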
Finally, after several convolutional and max pooling layers, the high-level
reasoning in the neural network is done via fully connected layers. Neurons in a
fully connected layer have connections to all activations in the previous layer, as
seen in regular neural networks. Their activations can hence be computed with a
matrix multiplication followed by a bias offset.
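The statement that a fully connected layer is a matrix multiplication followed by a bias offset can be written out directly. The function name `fully_connected` and the numeric values below are made up for illustration; in a trained network the weights and biases are learned.

```python
import numpy as np

def fully_connected(activations, weights, bias):
    """Every output neuron sees every input activation: y = W x + b."""
    return weights @ activations + bias

# Hypothetical flattened activations from the previous layer (3 values).
x = np.array([1.0, 2.0, 3.0])
# One row of weights per output neuron (2 neurons, 3 inputs each).
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
b = np.array([0.5, -1.0])

out = fully_connected(x, W, b)
print(out)
```

In practice a non-linear activation is usually applied to this result; that step is omitted here to keep the correspondence with the sentence above exact.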
Precision = TP / (TP + FP)
Precision and recall are sometimes used together in
the F1 Score (or F-measure) to provide a single measurement for a system.
The usage of "precision" in the field of information retrieval differs from the
definition of accuracy and precision within other branches of science and
technology.
Recall = TP / (TP + FN)
For example, for a text search on a set of documents, recall is the number of
correct results divided by the number of results that should have been
returned. It can be viewed as the probability that a relevant document is
retrieved by the query.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
This measure is approximately the average of the two when they are close,
and is more generally the harmonic mean, which, for the case of two
numbers, coincides with the square of the geometric mean divided by
the arithmetic mean.
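The relationship stated above can be checked in a few lines. Writing P for precision and R for recall (symbols introduced here only for brevity), the harmonic mean, the geometric mean G and the arithmetic mean A are related as follows:

```latex
\begin{align*}
F_1 &= \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}
      && \text{(harmonic mean of } P \text{ and } R\text{)} \\
G &= \sqrt{PR}, \qquad A = \frac{P + R}{2} \\
\frac{G^2}{A} &= \frac{PR}{(P + R)/2} = \frac{2PR}{P + R} = F_1
\end{align*}
```

As a worked example with assumed values P = 0.8 and R = 0.6, the F1 Score is 2(0.48)/1.4 ≈ 0.686, slightly below the arithmetic mean of 0.7, as expected when the two values are close.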
Accuracy = (TP + TN) / (TP + FP + FN + TN)
(i)True Positives (TP) - These are the correctly predicted positive values which
means that the value of actual class is yes and the value of predicted class is also
yes. E.g. if actual class value indicates that this passenger survived and predicted
class tells you the same thing.
(ii)True Negatives (TN) - These are the correctly predicted negative values
which means that the value of actual class is no and value of predicted class is
also no. E.g. if actual class says this passenger did not survive and predicted class
tells you the same thing.
False positives and false negatives occur when the actual class
contradicts the predicted class.
(iii)False Positives (FP) – When actual class is no and predicted class is yes. E.g.
if actual class says this passenger did not survive but predicted class tells you that
this passenger will survive.
(iv)False Negatives (FN) – When actual class is yes but predicted class is no. E.g.
if actual class value indicates that this passenger survived and predicted class tells
you that passenger will die.
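The four counts defined above are enough to compute every metric in this section. The sketch below is a minimal pure-Python version; the function names and the example labels (1 = yes, 0 = no) are assumptions for illustration and are not results from this project.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = yes, 0 = no)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Made-up labels for ten samples.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Returning 0.0 when a denominator is empty is one possible convention; some libraries raise a warning or leave the value undefined instead.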
Observation of classification
[Figures: data flow diagrams DFD 1, DFD 2, DFD 4, DFD 5, DFD 2.1, DFD 2.3 and DFD 2.5. Stages shown: load breast cancer images; preprocessing (cleaning, selection and transformation of acquired data; conversion of RGB images to gray scale and binary images); CNN feature extraction (select and extract feature nodes); binary classification (benign or malignant); prediction (test vs. train accuracy on test images).]
REFERENCES
T. Araújo et al., “Classification of breast cancer histology images using
convolutional neural networks,” PLoS ONE, vol. 12, no. 6, p. e0177544,
2017.