
MULUNGUSHI UNIVERSITY

SCHOOL OF SCIENCE, ENGINEERING AND TECHNOLOGY

PROJECT PROPOSAL REPORT

Title: Breast Cancer Diagnosis with Machine Learning

Student Name : Lawrence Kasalwe


Student ID : 201507305
Programme : Bachelor of Computer Science
Supervisor : Dr. M. J. Simfukwe
Table of Contents
DECLARATION
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACRONYMS AND ABBREVIATIONS
CHAPTER 1 – INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Aim
1.4 Objectives
1.5 Project Scope
1.6 Project Justification
1.7 Summary
CHAPTER 2 – LITERATURE REVIEW
2.1 Introduction
2.1.1 Mortality Rate
2.1.2 Causes of Breast Cancer
2.2 Related literature
2.2.1 Breast cancer diagnostic/screening techniques
2.2.2 Screening and diagnosis of breast cancer in Zambia
2.3 Review of existing/current system
2.3.1 Computer based diagnosis of breast cancer
2.4 Proposed system
2.4.1 Gaps in existing literature
2.4.2 Dataset
2.4.3 Training and Testing Model
2.5 Summary
CHAPTER 3 – RESEARCH METHODOLOGY
3.1 Introduction
3.2 Review of Methodologies
3.3 Summary of the Waterfall Model
3.4 The Waterfall Model
3.4.1 Requirements Definition
3.4.2 System and Software Design
3.4.3 Implementation and Unit Testing
3.4.4 Integration and System Testing
3.4.5 Operation and Maintenance
3.5 Justification of Selected Methodology
3.6 Technologies and framework
3.7 Summary
CHAPTER 4 – PROJECT MANAGEMENT
4.1 Introduction
4.2 Risk and Quality Management
4.3 Risk Analysis/Risk Register
4.3.1 Risk Register
4.4 Effort Costing Model
4.5 Effort Calculations for Project
4.6 Scheduling and Work plan
4.7 Summary
CHAPTER 5 – CONCLUSION
References

DECLARATION
I hereby declare that the project proposal entitled “Breast Cancer Diagnosis with
Machine Learning”, submitted for the course “ICT 431”, is my original work, and the project has
not formed the basis for the award of any degree, associateship, fellowship or any other similar
title. I recognize that failure to acknowledge material acquired from other sources may be
considered plagiarism.

Student Supervisor

Date: ____________________ Date: ___________________

Sign: ____________________ Sign: ___________________

ABSTRACT
Breast cancer is one of the biggest and most challenging diseases worldwide. An
estimated 21% of all breast cancer deaths are attributable to alcohol use, overweight and
obesity, and physical inactivity. The proportion is higher in high-income countries (27%), where
the most important contributor is overweight and obesity; in low- and middle-income countries,
the proportion of breast cancers attributable to these risk factors is 18%, with physical inactivity
the most important determinant (10%). Breast cancer is far more common in women than in
men, and it needs an easier and faster way to be diagnosed. The manual techniques of
diagnosing breast cancer are effective but slow. A diagnosis system capable of giving an
accurate diagnosis is a key to achieving the goal of easier and faster breast cancer diagnosis.

In this paper we propose a breast cancer diagnosis system that will make diagnosis easier and
faster.

LIST OF FIGURES
Figure 1 Multi-layer feed forward NN (Daniel Graupe, 2013)
Figure 2 A processing neuron
Figure 3 Waterfall lifecycle consists of several non-overlapping stages
Figure 4 Waterfall lifecycle consists of several non-overlapping stages
Figure 5 Sub-activities of system development
Figure 6 The risk management life cycle
Figure 7 Information Domain Value (umd, 2018)
Figure 8 Functional point (umd, 2018)
Figure 9 Lines of Code (umd, 2018)
Figure 10 Effort cost and Duration (umd, 2018)
Figure 11 Effort costing using COCOMO II calculation (csse, 2018)
Figure 12 Result from the COCOMO calculation (csse, 2018)
Figure 13 Proposal Gantt Chart
Figure 14 Project Gantt Chart

LIST OF TABLES
Table 1 Attribute Information
Table 2 The risk register
Table 3 Breakdown of costs/expenditure

ACRONYMS AND ABBREVIATIONS
BC Breast Cancer
CBE Clinical Breast Examination
CDH Cancer Disease Hospital
COCOMO Constructive Cost Model
DLL Dynamic-link libraries
GUI Graphical User Interface
KNN K Nearest Neighbor
ML Machine Learning
MRI Magnetic Resonance Imaging
NB Naïve Bayes
NN Neural Networks
PET Positron-Emission Tomography
RAD Rapid Application Development
RS_SVM Rough Set based Support Vector Machine
UTH University Teaching Hospital
USPSTF United States Preventive Services Task Force
WBCD Wisconsin Breast Cancer Dataset

CHAPTER 1 – INTRODUCTION
1.1 Introduction
Breast cancer is the most common invasive cancer among women and the second
leading cause of cancer death in women after lung cancer. Advances in screening and treatment
have improved survival rates dramatically since 1989.

Being aware of the symptoms of breast cancer and undergoing early screening allow
breast cancer to be detected and treated early.

Alongside the common screening tools / techniques for detecting breast cancer, scientists
are looking to enhance the screening techniques with the help of computer aided diagnosis. A
number of methods and algorithms for detecting breast cancer are being developed and used.

Several risk factors for breast cancer have been well documented. However, for the
majority of women with breast cancer, it is not possible to identify specific risk factors
((IARC, 2008); (Lacey JV Jr et al, 2009)).

A familial history of breast cancer increases the risk by a factor of two or three. Some
mutations, particularly in BRCA1, BRCA2 and p53 result in a very high risk for breast cancer.
However, these mutations are rare and account for a small portion of the total breast cancer burden.

Reproductive factors associated with prolonged exposure to endogenous estrogens, such
as early menarche, late menopause, and late age at first childbirth, are among the most important
risk factors for breast cancer. Exogenous hormones also carry a higher risk for breast cancer:
oral contraceptive and hormone replacement therapy users are at a much higher risk than non-users.
Breastfeeding has a protective effect ((IARC, 2008); (Lacey JV Jr et al, 2009)).

The contribution of various modifiable risk factors, excluding reproductive factors, to the
overall breast cancer burden has been calculated by Danaei et al. (Danaei G et al, 2005).

The overall aim of this project is to quantify existing service delivery capacity and to
identify gaps, challenges, and priority areas for building a setting-appropriate and sustainable
breast cancer control service system in Zambia.

1.2 Problem Statement
In this project we will quantify existing service delivery capacity and identify gaps,
challenges, and priority areas for building a setting-appropriate and sustainable breast cancer
control service system in Zambia.

1.3 Aim
The overall aim is to quantify existing service delivery capacity and to identify gaps,
challenges, and priority areas for building a setting-appropriate and sustainable breast cancer
control service system in Zambia.

1.4 Objectives
 Research the best Machine Learning model to use for breast cancer diagnosis
 Build a diagnosis system based on the Machine Learning model
 Test the Machine Learning based application with its user interface

1.5 Project Scope


This project will have a trained Machine Learning model at its core for easier and
quicker diagnosis. The project will be completed by May 2019. Modules of this system will
include a symptom input section, the Machine Learning model and an output section for the
diagnosis results, all wrapped in a Graphical User Interface (GUI).

1.6 Project Justification


The project will quantify existing service delivery capacity and identify gaps, challenges,
and priority areas for building a setting-appropriate and sustainable breast cancer control service
system in Zambia.

1.7 Summary
Breast cancer is the most common invasive cancer among women and the second
leading cause of cancer death after lung cancer. Advances in screening and treatment have
improved survival rates dramatically since 1989. The overall aim of this project is to quantify
existing service delivery capacity and to identify gaps, challenges, and priority areas for building
a setting-appropriate and sustainable breast cancer control service system in Zambia. The project
will have a trained Machine Learning model at its core for easier and quicker diagnosis, will be
completed by May 2019, and will have a GUI for easy usability.

CHAPTER 2 – LITERATURE REVIEW
2.1 Introduction
2.1.1 Mortality Rate
Worldwide, 21% of all breast cancer deaths are attributable to alcohol use, overweight and
obesity, and physical inactivity. This proportion was higher in high-income countries (27%), and
the most important contributor was overweight and obesity. In low- and middle-income countries,
the proportion of breast cancers attributable to these risk factors was 18%, and physical inactivity
was the most important determinant (10%).

The differences in breast cancer incidence between developed and developing countries
can partly be explained by dietary effects combined with later first childbirth, lower parity, and
shorter breastfeeding (Peto J, 2001). The increasing adoption of western life-style in low- and
middle-income countries is an important determinant in the increase of breast cancer incidence in
these countries.

The global burden of cancer is growing steadily, with much of this burden falling on
developing countries, where nearly 80% of disability adjusted life years lost to cancer occurs.
Although it is rising, breast cancer incidence in developing nations is much lower than that in
developed nations. Death rates, however, remain the same.

System level barriers to breast cancer care in these environments have been well
documented and are primarily centered around the lack of accessible and affordable screening,
early detection, diagnostic, and treatment facilities.

Other barriers include lack of awareness of the early signs and symptoms of breast cancer,
the belief that cancer has a supernatural origin and is always fatal, the use of traditional therapies
before or in lieu of seeking more modern treatment, and fear of spousal abandonment following
mastectomy.

2.1.1.1 Mortality rate in Zambia


In 2010, there were an estimated 1,007 new breast cancer cases and 359 breast cancer
deaths in Zambia. Although scarce, Zambia-specific data indicates that breast cancer incidence has
been rising (Carla Chibwesha, 2015).

2.1.2 Causes of Breast Cancer
i. Exogenous hormones
Ovarian hormones are commonly taken exogenously, either for contraception, or as
‘replacement’ therapy for symptoms believed to be due to low levels of the natural
products, usually during or after menopause. When oral contraceptives were introduced
in the early 1960s there was considerable speculation, based on experimental work, that
they might increase the risk of breast cancer.
Replacement hormones are another matter. In 1976, Hoover et al published the first
evidence of increased risk among women taking replacement estrogens. In a large
gynecologic practice, there was a 30% excess of breast cancer among women taking
Premarin, a kind of estrogen stew derived from the urine of pregnant mares, and among
those taking the medication for 15 years or more, the risk was doubled.
ii. Ionizing Radiation
Mammary tissue is quite susceptible to malignant transformation by ionizing radiation.
Excess breast cancer has been observed in patients given multiple fluoroscopies or
radiotherapy for ankylosing spondylitis or enlargement of the thymus gland, and in
survivors of the atomic bombings, painters of radium watch faces and X-ray technicians.
iii. Alcohol
The findings on beverage alcohol are summarized in a joint analysis by the Oxford
Group of data from 53 epidemiologic studies. Women with an average consumption of
four or more drinks a day had a 50% higher breast cancer risk than those who did not
drink alcohol.

2.2 Related literature


2.2.1 Breast cancer diagnostic/screening techniques
Mammography

Several randomized controlled trials have evaluated mammography as a screening test.


Most of these studies, begun between 1963 and 1980, reported a decreased risk of breast cancer
death in women who were randomized to receive screening, particularly among women between
50 and 69 years of age. However, a meta-analysis questioned the value of mammography as a
screening test. The authors excluded trials they felt were flawed and found no reduction in

mortality with mammography; they concluded that screening for breast cancer with mammography
is unjustified. The USPSTF performed a meta-analysis using data from the same trials. The
researchers concluded that the flaws in some of the studies did not significantly influence
outcomes; therefore, they included pooled effects from seven valid studies. The resulting
recommendation was for screening mammography every one to two years for women 40 years and
older (Knutson & Steiner, 2007).

Ultrasonography

Because mammography is less sensitive and breast tissue is denser in younger women,
ultrasonography has been considered as a screening tool for younger women who are at high risk
for breast cancer. A consensus statement published by the European Group for Breast Cancer
Screening concluded that there is no evidence to support the use of ultrasonography for screening
at any age (Knutson & Steiner, 2007).

Magnetic resonance imaging

The use of MRI as a screening test for breast cancer was first reported in the 1980s, and
studies have demonstrated its benefits and limitations. Studies using MRI in high-risk women
report that MRI is significantly more sensitive than mammography, and mammographic screening
with or without ultrasonography is probably an insufficient screen for persons with a known
genetic predisposition for breast cancer (Knutson & Steiner, 2007).

Scintimammography

Clinical studies have been conducted using technetium-99m sestamibi
scintimammography to evaluate some breast abnormalities. In a meta-analysis summarizing
studies from more than 5,000 patients, the sensitivity and specificity for detecting nonpalpable
lesions were found to be 67 and 87 percent, respectively. Clinically, this has been used most
often to evaluate patients with a palpable breast lesion and a negative mammogram (Knutson &
Steiner, 2007).

Positron-emission tomography

Positron-emission tomography (PET) scanning is based on increased glucose utilization by
malignant cells. In the evaluation of suspicious lesions, PET scanning has been found to be

reasonably sensitive and specific, but it is limited in detecting some breast tumors based on size,
metabolic activity, and histologic subtype. There is no evidence demonstrating a clear advantage
over other adjuvant imaging studies, and the high cost has limited its use as a routine diagnostic
tool (Knutson & Steiner, 2007).

2.2.2 Screening and diagnosis of breast cancer in Zambia


Breast cancer screening (mammography), early detection (clinical breast examination,
diagnostic ultrasound), and biopsy services also exist at the provincial level, albeit on a much
smaller scale. While incisional and excisional wedge resections and mastectomy can be performed
in provinces where general surgeons are located, breast conserving (lumpectomy and sentinel
lymph node mapping and sampling) and reconstructive surgery is not available. Similar to cervical
cancer, radiation and chemotherapy treatment for breast cancer are only available at CDH and
hormone therapy at CDH and UTH. Much remains to be done to ensure that all women in Zambia
are aware of and have routine access to cancer prevention, early detection and treatment services
(Carla Chibwesha, 2015).

2.3 Review of existing/ current system


2.3.1 Computer based diagnosis of breast cancer
i. A Rough Set based Support Vector Machine classifier has been proposed for breast
cancer diagnosis. A Rough Set reduction algorithm is employed as a feature selection
tool to remove redundant features and further improve the diagnostic accuracy of the
Support Vector Machine. The effectiveness of the RS_SVM is examined on the
Wisconsin Breast Cancer Dataset (WBCD) using classification accuracy, sensitivity,
specificity, confusion matrices and receiver operating characteristic curves. Experimental
results demonstrate that the proposed RS_SVM can not only achieve very high
classification accuracy but also detect a combination of five informative features, which
can give an important clue to physicians diagnosing breast cancer. (Chen, 2011)
ii. A fuzzy-neural and feature extraction technique has been proposed for detecting and
diagnosing microcalcification patterns in digital mammograms. After an investigation
and analysis of a number of feature extraction techniques, it was found that a
combination of three features (entropy, standard deviation and number of pixels) is the
best combination to distinguish a benign microcalcification pattern from one that is
malignant. A fuzzy technique in conjunction with these three features was used to detect
a microcalcification pattern, and a neural network was used to classify it as
benign/malignant. The system was developed on a Microsoft Windows platform. It is an
easy-to-use intelligent system that gives the user options to diagnose, detect, enlarge,
zoom and measure distances of areas in digital mammograms.

2.4 Proposed system


2.4.1 Gaps in existing literature
Breast cancer awareness efforts in Zambia consist of public awareness campaigns led by
community-based organizations, mammography where feasible, and clinical breast examination
(CBE) by nurses. However, limited funding and the lack of an efficient delivery and
management system preclude the routine availability of these services, except on a very small
scale. In Zambia, there are currently no national level data that map women’s cancer control
services in the country. Such data are needed to inform next steps in building capacity for cancer
prevention and care. This assessment is designed to provide baseline information that is
necessary for the development of a framework for women’s cancer control in Zambia. There is
also a need for a computer-based diagnosis system that will help in quick detection of breast
cancer and in data gathering for future cases and case studies.

2.4.2 Dataset
This breast cancer database was obtained from the University of Wisconsin Hospitals,
Madison from Dr. William H. Wolberg.

Title: Wisconsin Breast Cancer Database (January 8, 1991).

Sources:

 Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison,


Wisconsin USA
 Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu) Received by David W. Aha
(aha@cs.jhu.edu)
 Date: 15 July 1992

Past Usage:

Attributes 2 through 10 have been used to represent instances.

Each instance has one of 2 possible classes: benign or malignant.

Relevant Information:

Samples arrive periodically as Dr. Wolberg reports his clinical cases.

The database therefore reflects this chronological grouping of the data.

This grouping information appears immediately below, having been removed from the data itself:

Group 1: 367 instances (January 1989)

Group 2: 70 instances (October 1989)

Group 3: 31 instances (February 1990)

Group 4: 17 instances (April 1990)

Group 5: 48 instances (August 1990)

Group 6: 49 instances (Updated January 1991)

Group 7: 31 instances (June 1991)

Group 8: 86 instances (November 1991)

Total: 699 points (as of the donated database on 15 July 1992)

Number of Instances: 699 (as of 15 July 1992)

Number of Attributes: 10 plus the class attribute

Attribute Information: (class attribute has been moved to last column)

Table 1 Attribute Information

# Attribute Domain
1 Sample code number id number
2 Clump Thickness 1 - 10
3 Uniformity of Cell Size 1 - 10
4 Uniformity of Cell Shape 1 - 10
5 Marginal Adhesion 1 - 10
6 Single Epithelial Cell Size 1 - 10
7 Bare Nuclei 1 - 10
8 Bland Chromatin 1 - 10
9 Normal Nucleoli 1 - 10
10 Mitoses 1 - 10
11 Class (2 for benign, 4 for malignant)
Class distribution:

Benign: 458 (65.5%)

Malignant: 241 (34.5%)
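Before training, the raw WBCD file needs light cleaning: the Bare Nuclei attribute contains missing values marked with "?", and the sample code number is an identifier rather than a predictive attribute. A minimal loading sketch (assuming pandas is available; the function and column names below are our own, derived from Table 1):

```python
import pandas as pd

# Column names taken from Table 1; the raw file has no header row.
COLUMNS = [
    "sample_code_number", "clump_thickness", "uniformity_of_cell_size",
    "uniformity_of_cell_shape", "marginal_adhesion",
    "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]

def load_wbcd(path_or_buffer):
    """Load the raw WBCD file; '?' marks missing attribute values."""
    df = pd.read_csv(path_or_buffer, header=None, names=COLUMNS, na_values="?")
    # Drop the identifier column and any rows with missing attribute values.
    df = df.drop(columns="sample_code_number").dropna()
    # Map the class labels (2 = benign, 4 = malignant) to 0/1.
    df["class"] = df["class"].map({2: 0, 4: 1}).astype(int)
    return df
```

Dropping rows with missing values is the simplest option; imputation would be an alternative if every instance must be kept.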

2.4.3 Training and Testing Model


Three models or classifiers, namely Neural Networks, K-Nearest Neighbor, and
Naïve Bayes, will be used to train and test the system.
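As a sketch of how the three classifiers might be trained and compared on a common split (assuming scikit-learn is available; the hyperparameters shown are placeholders, not tuned values):

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def compare_classifiers(X, y, test_size=0.3, seed=42):
    """Train NN, KNN and Naive Bayes on one split; return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    models = {
        "Neural Network": MLPClassifier(hidden_layer_sizes=(10,),
                                        max_iter=1000, random_state=seed),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Naive Bayes": GaussianNB(),
    }
    # Each model is fitted on the same training data and scored on the
    # same held-out test data, so the accuracies are directly comparable.
    return {name: m.fit(X_train, y_train).score(X_test, y_test)
            for name, m in models.items()}
```

For the WBCD, X would be the nine cell-level attributes and y the benign/malignant class column.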

2.4.3.1 Neural Networks


Neural Networks are computing systems vaguely inspired by the biological neural
networks that constitute animal brains. The neural network itself isn't an algorithm, but rather a
framework for many different machine learning algorithms to work together and process complex
data inputs. Such systems "learn" to perform tasks by considering examples, generally without
being programmed with any task-specific rules. For example, in image recognition, they might
learn to identify images that contain cats by analyzing example images that have been manually
labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this
without any prior knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces.

Instead, they automatically generate identifying characteristics from the learning material that they
process.

Neural networks are based on a collection of connected units or nodes called artificial
neurons, which loosely model actual biological neurons. Each connection can transmit a signal
from one neuron to another. Warren McCulloch and Walter Pitts (1943) created a computational model for neural
networks based on mathematics and algorithms called threshold logic. This model paved the way
for neural network research to split into two approaches. One approach focused on biological
processes in the brain while the other focused on the application of neural networks to artificial
intelligence. This work led to work on nerve networks and their link to finite automata.

Mathematically, a neuron's network function f(x) is defined as a composition of other
functions

Equation 1: g_i(x)

that can further be decomposed into other functions. This can be conveniently represented as a
network structure, with arrows depicting the dependencies between functions. A widely used type
of composition is the nonlinear weighted sum, where

Equation 2: f(x) = K(Σ_i w_i g_i(x))


where K (commonly referred to as the activation function (Wilson, 2012)) is some predefined
function, such as the hyperbolic tangent, sigmoid, softmax or rectifier function.

Figure 1 presents a feed forward NN, with one hidden layer. Except for the input layer
neurons, every neuron is a computational element with an activation function. The principle
mechanism of the NN is that when data is presented to the input layer, the network neurons run
computations in the subsequent layers until an output value is yielded at each of the neurons in
the output layer.

Figure 1 Multi-layer feed forward NN (Daniel Graupe, 2013)


Each neuron in a particular layer, except for the output layer neurons, feeds its output as
input to the neurons in the next layer. The neurons in the processing layers (i.e. hidden and
output layers) compute weighted sums of their inputs and add a threshold. The resulting sums
are then used to compute the activation levels of the neurons by applying an activation function
(e.g. the sigmoid function). The process can be defined as follows.

Equation 3: a_j = Σ_{i=1}^{p} w_{ji} x_i + θ_j ,  y_j = f_j(a_j)

where a_j is the activation of neuron j, equal to the weighted sum of the inputs
x_1, x_2, …, x_p plus the threshold θ_j; w_{ji} is the connection weight from neuron i to neuron j,
f_j is the activation function for the jth neuron and y_j is the output. Figure 2 shows a graphical
representation of how a neuron processes information.

Figure 2 A processing neuron
The sigmoid function is popularly used as the activation function and is defined as:

Equation 4: f(t) = 1 / (1 + e^(−t))

A single neuron in a multi-layer NN is able to linearly separate the input space into
subspaces by means of a hyperplane defined by the weights and the threshold, where the weights
define the direction of the hyperplane and the threshold offsets it from the origin (David Gil,
2012).
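The forward computation described by Equations 3 and 4 can be sketched in a few lines of NumPy (an illustrative sketch only; in practice the weights and thresholds would come from training):

```python
import numpy as np

def sigmoid(t):
    """Activation function of Equation 4: f(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def neuron_output(x, w, theta):
    """Equation 3: weighted sum of the inputs plus the threshold, then activation."""
    a = np.dot(w, x) + theta   # a_j = sum_i w_ji * x_i + theta_j
    return sigmoid(a)          # y_j = f_j(a_j)

def forward_layer(x, W, thetas):
    """One processing layer: every neuron sees the same input vector x."""
    return np.array([neuron_output(x, w, th) for w, th in zip(W, thetas)])
```

Each output of `forward_layer` would in turn be fed as input to the next layer, exactly as described for the feed-forward network in Figure 1.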

2.4.3.2 K Nearest Neighbor


KNN defers the decision to generalize beyond the training examples until a new query is
encountered. Whenever we have a new point to classify, we find its K nearest neighbors from
the training data.

The distance is calculated using one of the following measures

 Euclidean Distance
 Minkowski Distance
 Mahalanobis Distance
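For illustration, the three distance measures could be implemented as follows (assuming NumPy; Mahalanobis additionally needs the covariance matrix of the training data):

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance; the special case of Minkowski with p = 2."""
    return np.sqrt(np.sum((a - b) ** 2))

def minkowski(a, b, p):
    """Generalized distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def mahalanobis(a, b, cov):
    """Distance scaled by the inverse covariance of the data, so correlated
    and differently-scaled attributes are weighted appropriately."""
    d = a - b
    return np.sqrt(d @ np.linalg.inv(cov) @ d)
```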

KNN creates local models (or neighborhoods) across the feature space with each space defined
by a subset of the training data. Implicitly a ‘global’ decision space is created with boundaries
between the training data. One advantage of KNN is that updating the decision space is easy. KNN
is a nearest neighbor algorithm that creates an implicit global classification model by aggregating
local models, or neighborhoods.

Outliers can create individual spaces which belong to a class but are separated; this mostly
relates to noise in the data.

The solution is to dilute the algorithm's dependency on individual (possibly noisy) instances.

Once we have obtained the K nearest neighbors using the distance function, the neighbors
vote in order to predict the query's class. Majority voting assumes that all votes are equal: for
each class l ∈ L we count the number of the k neighbors that have that class, and we return the
class with the most votes.

More formally, the algorithm returns the majority vote within the set of k nearest
neighbors to a query q. M_k(q) is the prediction of the model M for query q given the parameter
k of the model.

levels(t) is the set of levels (classes) in the domain of the target feature and l is an element
of this set. i iterates over the instances d_i in order of increasing distance from the query q.

t_i is the value of the target feature for instance d_i.

δ(t_i, l) is the Kronecker delta function, which takes two parameters and returns 1 if they
are equal or 0 if not.

Equation 5:  M_k(q) = arg max_{l ∈ levels(t)} Σ_{i=1}^{k} δ(t_i, l)
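The distance computation and majority vote above can be sketched as follows (in Python for illustration; the toy two-feature training points are invented):

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3):
    # Sort training examples by Euclidean distance to the query
    # and keep the k nearest (the local neighborhood).
    neighbors = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    # Majority vote: count the neighbors carrying each class label
    # (the sum of Kronecker deltas in Equation 5) and take the arg max.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data
train = [([0.0, 0.0], "benign"), ([0.1, 0.2], "benign"),
         ([1.0, 1.0], "malignant"), ([0.9, 1.1], "malignant")]
```

Swapping `math.dist` for a Minkowski or Mahalanobis distance changes only the sort key.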

2.4.3.3 Naïve Bayes


Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given sample belongs to a particular class. A Bayesian
classifier is based on Bayes' theorem. Naive Bayesian classifiers assume that the effect of an
attribute value on a given class is independent of the values of the other attributes. This assumption
is called class conditional independence. It is made to simplify the computation involved and, in
this sense, is considered “naïve”.

The naive Bayesian classifier works as follows:

1. Let T be a training set of samples, each with their class labels. There are k classes, C1, C2,
. . ., Ck. Each sample is represented by an n-dimensional vector, X = {x1, x2, . . .,xn},
depicting n measured values of the n attributes, A1, A2, . . . , An, respectively.

2. Given a sample X, the classifier will predict that X belongs to the class having the highest
a posteriori probability, conditioned on X. That is, X is predicted to belong to the class C_i if
and only if P(C_i|X) > P(C_j|X) for 1 ≤ j ≤ k, j ≠ i. Thus we find the class that maximizes
P(C_i|X). The class C_i for which P(C_i|X) is maximized is called the maximum a posteriori
hypothesis. By Bayes' theorem,

Equation 6:  P(C_i|X) = P(X|C_i) P(C_i) / P(X)

3. As P(X) is the same for all classes, only P(X|𝐶𝑖 )P(𝐶𝑖 ) need be maximized. If the class a
priori probabilities, P(𝐶𝑖 ), are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = . . . = P(Ck), and we would therefore maximize
P(X|𝐶𝑖 ). Otherwise we maximize P(X|𝐶𝑖 )P(Ci). Note that the class a priori probabilities
may be estimated by P(𝐶𝑖 ) = freq(Ci, T)/|T|.

4. Given data sets with many attributes, it would be computationally expensive to compute
P(X|𝐶𝑖 ). In order to reduce computation in evaluating P(X|𝐶𝑖 ) P(𝐶𝑖 ), the naive assumption
of class conditional independence is made. This presumes that the values of the attributes
are conditionally independent of one another, given the class label of the sample.
Mathematically this means that

Equation 7:  P(X|C_i) ≈ Π_{k=1}^{n} P(x_k|C_i)

The probabilities P(x1|𝐶𝑖 ), P(x2|𝐶𝑖 ), . . . , P(xn|𝐶𝑖 ) can easily be estimated from the training
set. Recall that here xk refers to the value of attribute Ak for sample X.

a. If A_k is categorical, then P(x_k|C_i) is the number of samples of class C_i in T having
the value x_k for attribute A_k, divided by freq(C_i, T), the number of samples of class
C_i in T.

b. If Ak is continuous-valued, then we typically assume that the values have a Gaussian
distribution with a mean µ and standard deviation σ defined by

Equation 8:  g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

so that

Equation 9:  p(x_k|C_i) = g(x_k, μ_{C_i}, σ_{C_i})

We need to compute µ𝐶𝑖 and σ𝐶𝑖 , which are the mean and standard deviation of values of
attribute Ak for training samples of class 𝐶𝑖 .

In order to predict the class label of X, P(X|𝐶𝑖 )P(𝐶𝑖 ) is evaluated for each class 𝐶𝑖 . The
classifier predicts that the class label of X is 𝐶𝑖 if and only if it is the class that maximizes
P(X|𝐶𝑖 )P(𝐶𝑖 ).
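Steps 1 to 4 can be sketched for continuous attributes as follows (a Python illustration with invented one-attribute data, not the project's MATLAB implementation):

```python
import math
from collections import defaultdict

def gaussian(x, mu, sigma):
    # Equation 8: Gaussian density for a continuous attribute
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def nb_train(samples):
    # Step 1: estimate P(C_i) and per-attribute (mu, sigma) for each class
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(samples)  # P(C_i) = freq(C_i, T) / |T|
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / len(column))
            stats.append((mu, sigma))
        model[label] = (prior, stats)
    return model

def nb_predict(model, x):
    # Steps 2-4: pick the class maximizing P(X|C_i) P(C_i), using the
    # naive independence assumption of Equation 7
    def score(label):
        prior, stats = model[label]
        p = prior
        for v, (mu, sigma) in zip(x, stats):
            p *= gaussian(v, mu, sigma)
        return p
    return max(model, key=score)

# Invented one-attribute training data with two classes
samples = [([1.0], "benign"), ([1.2], "benign"), ([0.8], "benign"),
           ([5.0], "malignant"), ([5.2], "malignant"), ([4.8], "malignant")]
model = nb_train(samples)
```

A query near 1.0 falls under the "benign" Gaussian and one near 5.0 under the "malignant" Gaussian, so the posteriors pick the expected class.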
2.4.4 Research on Machine Learning Techniques for Breast Cancer Diagnosis

This section outlines how the best model for Breast Cancer diagnosis will be determined.
The models, namely Neural Networks, K-NN (K-Nearest Neighbor) and Naïve Bayes, will be
compared in this research.

Research will be done in MATLAB, using the Wisconsin Breast Cancer Database. The
Wisconsin Breast Cancer Database will be divided into three parts:

i. Training where 70% of the Wisconsin Breast Cancer Database will be used.
ii. Validation where 15% of the Wisconsin Breast Cancer Database will be used.
iii. Testing where 15% of the Wisconsin Breast Cancer Database will be used.
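The 70/15/15 split can be sketched as follows (in Python for illustration; the fixed seed is an assumption, and the sample count of 699 for the original Wisconsin database is used only as an example):

```python
import random

def split_indices(n, train=0.70, val=0.15, seed=0):
    # Shuffle sample indices, then cut into train/validation/test parts.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# The original Wisconsin Breast Cancer Database has 699 samples
train_idx, val_idx, test_idx = split_indices(699)
```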
The models' ability to produce accurate results will be determined using the following metrics for
evaluating classification models:

2.4.4.1 Accuracy Rate


Accuracy Rate is the fraction of predictions our model got right. Formally, accuracy has the
following definition:

Equation 10:  Accuracy = Number of Correct Predictions / Total Number of Predictions

For binary classification, accuracy can also be calculated in terms of positives and negatives as
follows:

Equation 11:  Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
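A one-line illustration of Equation 11 with made-up counts:

```python
def accuracy(tp, tn, fp, fn):
    # Equation 11: correct predictions over all predictions
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 50 true positives, 40 true negatives, 5 of each error type
acc = accuracy(50, 40, 5, 5)  # 90 correct out of 100
```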

2.4.4.2 Precision and Recall


Precision (also called positive predictive value) is the fraction of relevant instances among
the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances
that have been retrieved over the total amount of relevant instances. Both precision and recall are
based on an understanding and measure of relevance.

For example, suppose a classifier must recognize cats in a collection of pictures that
contains 12 cat pictures among others. The classifier returns 8 positive results, and in only 5 of
these does the picture actually contain a cat. The 5 are true positives, while the remaining 3 are
false positives; the 7 cat pictures the classifier missed are false negatives. In this case the
precision of the classifier is 5/8 and the recall is 5/12.

Precision

In the field of information retrieval, precision is the fraction of retrieved documents that are
relevant to the query:

Equation 12:  precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

Recall

In information retrieval, recall is the fraction of the relevant documents that are successfully
retrieved:

Equation 13:  recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

For classification tasks, the terms true positives, true negatives, false positives, and false
negatives compare the results of the classifier under test with trusted external judgments. The terms
positive and negative refer to the classifier's prediction (sometimes known as the expectation), and
the terms true and false refer to whether that prediction corresponds to the external judgment
(sometimes known as the observation).

Precision and recall are then defined as:

Equation 14:  precision = TP / (TP + FP)

Equation 15:  recall = TP / (TP + FN)
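Using the counts from the cats-and-dogs example above (8 pictures flagged, 5 correctly, 12 relevant in total), Equations 14 and 15 give:

```python
from fractions import Fraction

def precision(tp, fp):
    # Equation 14: correct positives over all positive predictions
    return Fraction(tp, tp + fp)

def recall(tp, fn):
    # Equation 15: correct positives over all actually relevant items
    return Fraction(tp, tp + fn)

# 8 pictures flagged as cats, 5 correctly; 12 cat pictures exist in total
p = precision(tp=5, fp=3)  # 5/8
r = recall(tp=5, fn=7)     # 5/12
```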

2.4.4.3 Cohen’s Kappa Statistics


Cohen’s Kappa statistic is a very useful, but under-utilized, metric. Sometimes in machine
learning we are faced with a multi-class classification problem. In those cases, measures such as
the accuracy, or precision/recall do not provide the complete picture of the performance of our
classifier.

In other cases we might face a problem with imbalanced classes: for example, two
classes, say A and B, where A shows up only 5% of the time. Accuracy can then be misleading,
so we turn to measures such as precision and recall, and to ways of combining the two.

Cohen’s kappa statistic is a very good measure that can handle very well both multi-class and
imbalanced class problems.

Cohen’s kappa is defined as:

Equation 16:  k = (P_o − P_e) / (1 − P_e) = 1 − (1 − P_o) / (1 − P_e)

where P_o is the observed agreement among raters (identical to accuracy) and P_e is the expected
agreement: the hypothetical probability of chance agreement, computed using the observed data to
estimate the probability of each rater randomly choosing each category. If the raters are in complete
agreement, then k = 1. If there is no agreement among the raters other than what would be expected
by chance (as given by P_e), k = 0. A negative statistic implies that the agreement between the two
raters is worse than random.
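Equation 16 in code, with an invented imbalanced-class example in which a seemingly high observed agreement yields only a modest kappa:

```python
def cohens_kappa(p_o, p_e):
    # Equation 16: agreement corrected for chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical: 95% observed agreement, but chance agreement is already 90.5%
k = cohens_kappa(p_o=0.95, p_e=0.905)  # roughly 0.47
```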

2.5 Summary
This chapter reviewed how breast cancer is diagnosed and the existing computer-based
systems around the world, and found that there is no known existing system in Zambia. We also
discussed the proposed system, the Machine Learning models to be trained and tested, and how
the accuracy of those models will be determined.

CHAPTER 3 - RESEARCH METHODOLOGY
3.1 Introduction
There are many software process models for developing and implementing software.
These models have stages or phases that are used as a guide to developing the software. Any of
them can be used to develop software, but the appropriate model to use depends on the type of
software being developed.

3.2 Review of Methodologies


A software process model is an abstract representation of a process. It presents a description of a
process from some particular perspective, such as:

1. Specification.
2. Design.
3. Validation.
4. Evolution.

General Software Process Models are

1. Waterfall model: Separate and distinct phases of specification and development.


2. Prototype model.
3. Rapid application development model (RAD).
4. Evolutionary development: Specification, development and validation are interleaved.
5. Incremental model.
6. Iterative model.
7. Spiral model.
8. Component-based software engineering: The system is assembled from existing
components.

There are a number of general models for software processes, such as the waterfall model,
evolutionary development, formal systems development and reuse-based development. Below is
a brief review of the following five models:

I. Spiral Model
The spiral model is similar to the incremental model, with more emphasis placed
on risk analysis. The spiral model has four phases: Planning, Risk Analysis, Engineering
and Evaluation. A software project repeatedly passes through these phases in iterations
(called spirals in this model). In the baseline spiral, starting in the planning phase,
requirements are gathered and risk is assessed. Each subsequent spiral builds on the
baseline spiral.

The disadvantages of the spiral model are:

i. It can be a costly model to use.


ii. Risk analysis requires highly specific expertise.
iii. Project’s success is highly dependent on the risk analysis phase.
iv. It does not work well for smaller projects.
II. Incremental Model
The incremental build model is a method of software development where the model
is designed, implemented and tested incrementally, with more added to the software each
time until the product is finished. The program is defined as finished when it satisfies all
of the requirements.
III. Iterative Model
In the iterative model, the process starts with a simple implementation of a
small set of the software requirements and iteratively enhances the evolving versions until
the complete system is implemented and ready to be deployed.
An iterative life cycle model does not attempt to start with a full specification of
requirements. Instead, development begins by specifying and implementing just part of the
software, which is then reviewed to identify further requirements.
IV. Prototype Model
In the prototype model, a prototype (an early approximation of the final system) is
built, tested and then reworked as necessary until an acceptable prototype is achieved,
from which the complete system or product can then be developed. This model works best
in scenarios where not all of the project requirements are known in detail ahead of time. It
is an iterative, trial-and-error process that takes place between the developer and the
user(s).
V. Rapid application development model (RAD)
This model is based on prototyping and iterative development with no specific
planning involved. The process of writing the software itself involves the planning required
for developing the product.
Rapid application development focuses on gathering the requirements through
workshops or focus groups, early testing of the prototypes by the customers using the
iterative concept and rapid delivery.

3.3 Summary of the Waterfall Model


Development of the application will be done using the waterfall model. The application for
Breast Cancer Diagnosis will be developed based on the best Machine Learning Model from the
research.

The waterfall model is the classical model of software engineering. It is one of
the oldest models and is widely used in government projects and in many major companies.
Because it emphasizes planning in the early stages, it helps catch design flaws before they
develop. In addition, its intensive documentation and planning make it work well for projects in
which quality control is a major concern.

3.4 The Waterfall Model


The pure waterfall lifecycle consists of several non-overlapping stages, as shown in the
following figure 3 and figure 4. The model begins with establishing system requirements and
software requirements and continues with architectural design, detailed design, coding, testing,
and maintenance. The waterfall model serves as a baseline for many other lifecycle models.

Figure 3 waterfall lifecycle consists of several non-overlapping stages (Munassar & Govardhan
A, 2010)

Figure 4 waterfall lifecycle consists of several non-overlapping stages (Munassar & Govardhan
A, 2010)
The following list details the steps for using the waterfall model:

1. System requirements: Establishes the components for building the system, including the
hardware requirements, software tools, and other necessary components. Examples include
decisions on hardware, such as plug-in boards (number of channels, acquisition speed, and
so on), and decisions on external pieces of software, such as databases or libraries.
2. Software requirements: Establishes the expectations for software functionality and
identifies which system requirements the software affects. Requirements analysis includes

determining interaction needed with other applications and databases, performance
requirements, user interface requirements, and so on.
3. Architectural design: Determines the software framework of a system to meet the specific
requirements. This design defines the major components and the interaction of those
components, but it does not define the structure of each component. The external interfaces
and tools used in the project can be determined by the designer.
4. Detailed design: Examines the software components defined in the architectural design
stage and produces a specification for how each component is implemented.
5. Coding: Implements the detailed design specification.
6. Testing: Determines whether the software meets the specified requirements and finds any
errors present in the code.
7. Maintenance: Addresses problems and enhancement requests after the software is released.

3.4.1 Requirements Definition


To determine the requirements for the system, the following requirements-gathering techniques will be used:

Document Analysis

This involves evaluating the documentation of a present or existing system. It can
assist in producing AS-IS process documentation and in driving the gap analysis for scoping the
project. It also helps establish the requirements that drove the creation of the existing system,
which can be the starting point for documenting all current requirements.

Interface Analysis

Integration with external devices and other systems is another kind of interface. User-centric
design approaches are effective in ensuring usable software, and interface analysis is vital to
ensure that no requirements that are not immediately visible to users are overlooked.

3.4.2 System and Software Design


System design is the phase that bridges the gap between the problem domain and the existing
system in a manageable way. This phase focuses on the solution domain, i.e. “how to implement?”.

In this phase, the complex activity of system development is divided into several smaller
sub-activities which coordinate with each other to achieve the main objective of system
development, as shown in the figure below: identifying design goals, system decomposition,
identification of concurrency, hardware allocation, data management, software control
implementation, and boundary conditions.

Figure 5 Sub-activities of system development


This phase includes the design of the application, user interface and system interface. The
system requirements specifications will be transformed into logical structure, which contains
detailed and complete set of specifications that can be implemented in the programming language.

This phase also involves creating contingency, training, maintenance and operation plans,
and reviewing the proposed design to ensure that the final design meets the requirements stated
in the SRS document. Finally, a design document is prepared, to be used during the subsequent phases.

3.4.3 Implementation and Unit Testing


Implementation is a process of ensuring that the information system is operational. It involves

 Constructing a new system from scratch

 Constructing a new system from the existing one.

With inputs from the system design, the system will be first developed in small programs called
units, which will be integrated in the next phase. Each unit will be developed and tested for its
functionality, which is referred to as Unit Testing.

The design will be implemented as source code through coding. All the modules are then
combined in a testing environment that detects errors and defects. A test report containing the
errors found is prepared from a test plan that includes test-related tasks such as test case
generation, testing criteria, and resource allocation for testing.

3.4.4 Integration and System Testing


All the units developed in the implementation phase are integrated into a system after
testing of each unit. Post integration the entire system is tested for any faults and failures.

3.4.5 Operation and Maintenance


Maintenance means restoring something to its original conditions. Enhancement means
adding, modifying the code to support the changes in the user specification. System maintenance
conforms the system to its original requirements and enhancement adds to system capability by
incorporating new requirements.

Thus, maintenance changes the existing system, enhancement adds features to the existing
system, and development replaces the existing system. Maintenance is an important part of system
development that includes the activities which correct errors in system design and
implementation, update the documents, and test the data.

3.5 Justification of Selected Methodology


As noted above, the waterfall model emphasizes planning in the early stages, which helps
catch design flaws before they develop, and its intensive documentation and planning make it
work well for projects in which quality control is a major concern.

The model is easy to manage due to its rigidity: each phase has specific deliverables and
a review process, and the phases are processed and completed one at a time.

It allows for departmentalization and control. A schedule can be set with deadlines for each
stage of development and a product can proceed through the development process model phases
one by one.

Development moves from concept, through design, implementation, testing, installation,
and troubleshooting, and ends up at operation and maintenance. Each phase of development proceeds
in strict order.

Some of the major advantages of the Waterfall Model are as follows

 Simple and easy to understand and use

 Easy to manage due to the rigidity of the model. Each phase has specific deliverables and
a review process.

 Phases are processed and completed one at a time.

 Works well for projects where requirements are very well understood.

 Clearly defined stages.

 Well understood milestones.

 Easy to arrange tasks.

 Process and results are well documented.

3.6 Technologies and Framework


The entire project will be developed in MATLAB.

MATLAB is a programming platform designed specifically for engineers and scientists.


The heart of MATLAB is the MATLAB language, a matrix-based language allowing the most
natural expression of computational mathematics.

With MATLAB, you can:

 Analyze data

 Develop algorithms

 Create models and applications

The language, apps, and built-in math functions enable you to quickly explore multiple
approaches to arrive at a solution. MATLAB lets you take your ideas from research to production
by deploying to enterprise applications and embedded devices, as well as integrating with
Simulink and Model-Based Design.

MATLAB does not require a compiler to execute in the way C or C++ does; it interprets each
statement as it is written, which increases productivity and coding efficiency. It is a higher-level
language. Using MATLAB Coder, code written in MATLAB can be converted to C++, Java, Python,
.NET and so on, which makes the language more versatile: scientific work can be carried over to
other languages, and the generated library files, or dynamic-link libraries (DLLs), can be used
directly from other languages.

MATLAB has a rich set of built-in toolboxes for neural networks, fuzzy logic, Simulink, power
systems, hydraulics, electrical systems, communications, electromagnetics and more, which makes
developing scientific simulations straightforward.

3.7 Summary
There are many software process models for developing and implementing software.
These models have stages or phases that are used as a guide to developing the software.

A software process model is an abstract representation of a process. It presents a description of a
process from some particular perspective, such as:

1. Specification.
2. Design.
3. Validation.
4. Evolution.

General Software Process Models are

1. Waterfall model
2. Prototype model.
3. Rapid application development model (RAD).
4. Evolutionary development: Specification, development and validation are interleaved.

5. Incremental model.
6. Iterative model.
7. Spiral model.
8. Component-based software engineering: The system is assembled from existing
components.

The waterfall model, which will be adopted for the project, is the classical model of software
engineering. This model is one of the oldest and is widely used in government projects and
in many major companies. Because it emphasizes planning in the early stages, it helps catch
design flaws before they develop.

The waterfall model is easy to manage due to its rigidity: each phase has specific
deliverables and a review process, and the phases are processed and completed one at a time.

The framework to be used for the project is MATLAB, a programming platform
designed specifically for engineers and scientists. The platform is well suited to research and
app development because of its built-in models and its integration with other programming
languages.

CHAPTER 4 – PROJECT MANAGEMENT
4.1 Introduction
Risk in project management refers to the range of probabilities that an adverse event
occurs, and to the consequences of that event. Risks in project management can be identified,
estimated, assessed and controlled. Project risk management can be described as a complex
process of planning, identification, analysis, evaluation and control of project risks.

4.2 Risk and Quality Management


This phase of the project involves the formulation of management responses to the main
risks. Risk management can begin during the quantitative analysis phase, as the response to a
risk may be urgent and the solution fairly obvious.

Risk management includes:

 Identifying preventive measures to avoid a risk or to reduce its effect


 Establishing contingency plans to deal with risks if they should occur
 Initiating further investigations to reduce uncertainty through better information
 Considering risk transfer to insurers
 Considering risk allocation
 Setting contingencies in cost estimates, float in programmes and tolerances in performance
specifications.

Risk Management Life Cycle

The risk management life cycle consists of the steps or phases taken to manage risk in a project.
It includes:

1. Identifying risks
2. Evaluating risks and their impact

3. Identifying responses
4. Selecting responses

Figure 6 shows the risk management life cycle as a cycle: identify risks, evaluate risks and
their impact, identify responses, select responses, and back to identifying risks.

Figure 6 The risk management life cycle

4.3 Risk Analysis/ Risk Register


This stage of the process is generally split into two sub-stages: a qualitative analysis that
focuses on identification and subjective assessment of risks, and a quantitative analysis that
focuses on an objective assessment of the risks.

4.3.1 Risk Register


The risk register, also known as the risk log, is a document or log table created to help
manage the problems and issues that arise during the project. The table below shows the risk
register or risk log for this project.

Table 2 The risk register

Ref | Risk Identification | Risk Category | Probability | Impact | Score | Risk Result | Risk Owner | Risk Response
1 | User Resistance | Personnel Risk | 2 | 2 | 4 | Medium | Owner | End user sensitization of the system prior to its deployment
2 | Scope Estimates are Inaccurate | Scope Risk | 2 | 3 | 6 | High | Owner | Adjust course and reallocate new periods to the project
3 | Design lacks flexibility | Design Risk | 1 | 2 | 2 | Low | Owner | Design a user friendly and appealing interface for the project
4 | System outages | Technical Risk | 1 | 3 | 3 | Medium | Owner | Handle errors and create an error log for every possible error
5 | Requirements are incomplete | Requirements Risk | 2 | 3 | 6 | High | Owner | Develop the project incrementally
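Each score in Table 2 is probability multiplied by impact; the Low/Medium/High bands in the sketch below are inferred from the table's entries and are an assumption, not a standard:

```python
def risk_score(probability, impact):
    # Score = probability x impact, banded into a result rating.
    # Thresholds inferred from Table 2 (assumption): <3 Low, 3-5 Medium, >=6 High.
    score = probability * impact
    if score >= 6:
        result = "High"
    elif score >= 3:
        result = "Medium"
    else:
        result = "Low"
    return score, result
```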

4.4 Effort Costing Model


The key features that define project success are twofold: managing costs to achieve
efficiencies, and creating and enhancing value. These two elements enable project stakeholders to
understand the activities and resources required to meet project goals, as well as the expenditures
necessary to complete the project to the satisfaction of the customer (Venkataraman Ray R, 2018).

Function point

The function point for the project is calculated as shown in the figures below. The
calculations were done using basic COCOMO.

Figure 7 Information Domain Value (umd, 2018)

Figure 8 Function point (umd, 2018)

4.5 Effort Calculations for Project


Taking into consideration that the proposed project language is a fourth generation
language, we used basic COCOMO to calculate the lines of code for the project based on the
function point obtained above. We then calculated the effort and duration using the same basic
COCOMO.
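For reference, the basic COCOMO effort and duration formulas can be sketched as follows; the organic-mode coefficients (Boehm's published values) are an assumption here, since the report's own figures come from the online calculator:

```python
def basic_cocomo(kloc):
    # Basic COCOMO, organic mode: effort in person-months, duration in months.
    effort = 2.4 * kloc ** 1.05
    duration = 2.5 * effort ** 0.38
    return effort, duration

# e.g. a hypothetical 2 KLOC project: about 5 person-months over roughly 4.6 months
effort, duration = basic_cocomo(2.0)
```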

The figure below shows the Lines of Code for this project

Figure 9 Lines of Code (umd, 2018)

The figure below shows Effort and Duration of the project.

Figure 10 Effort cost and Duration (umd, 2018)


We obtained an estimate of the effort cost with COCOMO II, calculated as shown in the
figure below. Considering that this project will be developed using a fourth generation language,
and COCOMO II does not have provision for fourth generation languages, the estimate is not as
accurate as it could be, but it is within reasonable bounds for the cost and duration of completion.

Figure 11 effort costing using COCOMO II calculation (csse, 2018)
The results from the COCOMO II calculation shown in the figure below are an estimate of how
long the project will take and how much it will cost to complete.

Breakdown of Costs

Table 3 Breakdown of costs/expenditure

Cost For | Amount in Dollars | Approximate Amount in Kwacha
Hardware | $350 | K3500
MATLAB License | $150 | K1500

Figure 12 result from the COCOMO calculation (csse, 2018)

4.6 Scheduling and Work plan


Gantt Chart for proposal

This chart shows the stages taken to come up with the proposal for this project. Each task was
started after the task before it was completed.

Figure 13 proposal Gantt Chart

Gantt Chart for System Development

This chart shows the phases and the duration of each phase. The phases have sub-tasks,
and each task has a duration. A task can only start once the previous task has been completed;
once all tasks in a phase are completed, the next phase can start.

Figure 14 project Gantt Chart

4.7 Summary
In this chapter we determined the risks, their impact on the project, and how to manage
them. We also calculated the effort cost for this project using basic COCOMO, obtaining the
function point, the lines of code, and the duration of the project. We then used COCOMO II to
estimate how much the project will cost, and produced a Gantt chart showing when each phase
of developing the system will start and how long each phase will take to finish.

CHAPTER 5 – CONCLUSION
The techniques for diagnosing breast cancer in Zambia are still manual. From the
literature, we discovered that there is no known existing computer-based system for breast cancer
diagnosis in Zambia. This project is a breast cancer diagnosis system whose core is a Machine
Learning model, to be determined through research on which ML model is best suited for the
project. The research will involve training and testing three ML models: Neural Networks, Naïve
Bayes and K Nearest Neighbor. Training and testing will be done using the dataset obtained from
the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, which has ten (10)
attributes and two (2) classes: benign and malignant. The ML model will be selected based on how
accurate it is during testing. Accuracy will be assessed using the following metrics: Accuracy Rate,
Precision and Recall, and Cohen's Kappa statistic. Effort costs and duration for the project have
been estimated using basic COCOMO and COCOMO II. The project will then be developed using
the waterfall model.

References

Carla Chibwesha. (2015). A Comprehensive Assessment of Breast and Cervical Cancer Control in
Zambia.

Chen, H.-L. (2011). A support vector machine classifier with rough set-based feature selection for
breast cancer diagnosis. Expert Systems with Applications, 9014-9022.

csse. (2018, November 29). COCOMO II - Constructive Cost Model. Retrieved from COCOMO
II: http://csse.usc.edu/tools/COCOMOII.php

Daniel Graupe. (2013). Principles of Artificial Neural Networks. World Scientific.

Danaei G et al. (2005). Comparative risk assessment of nine behavioural and environmental risk
factors. Causes of cancer in the world, 1784-1793.

David Gil, J. L.-T. (2012). Predicting seminal quality with artificial intelligence methods. Expert
Systems with Applications.

David Kriesel. (2005). A Brief Introduction to Neural Networks. Bonn: www.dkriesel.com.

IARC. (2008). World cancer report. Lyon: International Agency for Research on Cancer.

Kevin Koidl. (n.d.). The Nearest Neighbor Algorithm. Example KNN.

KNUTSON, D., & STEINER, E. (2007). Screening for Breast Cancer. Current Recommendations and
Future Directions, 1-7.

Lacey JV Jr et al. (2009). Breast cancer epidemiology according to recognized breast cancerrisk
factors in the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial
Cohort. BMC Cancer.

MacMahon, B. (2006). Epidemiology and the causes of breast cancer.

Ming Leung K. (2007). Naive Bayesian Classifier.

Munassar, N. M., & Govardhan A. (2010). A Comparison Between Five Models Of Software
Engineering. In IJCSI International Journal of Computer Science Issues (pp. 94-101).
www.IJCSI.org .

Peto J. (2001). Cancer epidemiology in the last century and the next decade. 390-395.

Tutorials Point. (n.d.). SDLC Waterfall Model. Retrieved from Tutorials Point:
http://www.tutorialspoint.com/sdlc/sdlc_waterfall_model.htm

umd. (2018, November 29). Basic COCOMO model. Retrieved from umd.umich.edu:
http://groups.umd.umich.edu/cis/course.des/cis525/js/f00/gamel/cocomo.html

U.S. Preventive Services Task Force. (2009). Screening for Breast Cancer. U.S. Preventive
Services Task Force Recommendation Statement, 716-726.

Venkataraman Ray R, J. P. (2018). Cost and value management in projects. New Jersey: John
Wiley & Sons Inc.

What is MATLAB. (n.d.). Retrieved from MathWorks: https://www.mathworks.com/discovery/what-is-matlab.html

WHO. (n.d.). World Health Organisation. Retrieved from
http://www.who.int/cancer/detection/breastcancer/en/index2.html

Wilson, B. (2012, June 25). The Machine Learning Dictionary. Retrieved from
http://www.cse.unsw.edu.au/~billw/mldict.html#activnfn
