PROJECT TOPIC
ON
BY
CSC407
LECTURER’S NAME
DR EMEKA OGBUJU
GROUP MEMBERS
TABLE OF CONTENTS
CHAPTER ONE
1.0 INTRODUCTION
CHAPTER TWO
2.0 LITERATURE REVIEW
2.1 Historical Review of the Study
2.1.1 Definition of the Specific Research Concept
2.2 Review of Existing Relevant Literature
CHAPTER THREE
3.0 METHODOLOGY DESIGN
3.1 A Brief Introduction to the Methodology
CHAPTER FOUR
4.0 SYSTEM/MODEL DESIGN AND IMPLEMENTATION
4.1 Data Cleaning
4.2 Data Selection/Transformation
4.3 Data Mining
4.4 Pattern Evaluation
4.5 Knowledge Representation
CHAPTER FIVE
5.0 SUMMARY AND CONCLUSION
5.1 Summary
5.2 Conclusion
REFERENCES
CHAPTER ONE
1.0. INTRODUCTION
1.1. Background of the Study
Breast cancer is the most common cancer among women, according to the World Health
Organization, with approximately 1.7 million new cases in 2012 (GLOBOCAN 2012). Incidence
and survival rates vary considerably across the globe: survival is around 80% in North
America and around 60% in middle-income countries, but it is below 40% in low-income
countries, largely because of late-stage diagnosis [1, 2].
Breast cancer is a disease in which a malignant tumor develops in the cells of the breast.
A tumor is a growth of abnormal tissue in the body. Tumors may be malignant (cancerous) or
benign (non-cancerous). Tumors develop when cells in the body divide and multiply
excessively. The body normally regulates cell division and growth. New cells are created to
replace old cells or to perform new functions. Damaged or no longer needed cells die to make
way for healthy replacement cells.
If the balance of cell division and death is disturbed, a tumor may form that can often be
seen on an x-ray or felt as a lump. Breast tumors that aren't cancerous are abnormal growths
that don't spread outside of the breast. Although benign breast lumps are not life
threatening, some can increase a woman's risk of developing breast cancer. Any lump or change
in the breast should be examined by a health care professional to determine whether it is
benign or malignant (cancer) and whether it might affect future cancer risk. Breast cancer
can be invasive or non-invasive, and can occur in both men and women, although it is about a
hundred times less common in men than in women.
Breast cancers can start from different parts of the breast.
Most breast cancers begin in the ducts that carry milk to the nipple (ductal cancers).
Some start in the glands that make breast milk (lobular cancers).
There are also other, less common types of breast cancer, such as phyllodes tumor and
angiosarcoma.
A small number of cancers start in other tissues in the breast. These cancers are called
sarcomas and lymphomas and are not really thought of as breast cancers.
Many kinds of breast cancer can cause a lump in the breast, but not all of them do.
Many breast cancers are also discovered on screening mammograms, which can detect
cancers at an earlier stage, often before they can be felt or symptoms appear.
There are many different types of breast cancer. Common ones include ductal
carcinoma in situ (DCIS) and invasive carcinoma. Others, like phyllodes tumors
and angiosarcoma, are less common.
Breast cancer cells are tested for proteins called estrogen receptors, progesterone
receptors, and HER2 after a biopsy. In the lab, the tumor cells are also examined closely
to determine the grade. Treatment options can be influenced by the particular proteins
found and the tumor grade.
Breast cancer can spread when the cancer cells get into the blood or lymph system and
are carried to other parts of the body.
The lymph system is a system of lymph (or lymphatic) vessels that connect lymph nodes
(small bean-shaped collections of immune system cells) throughout the body.
The lymph vessels carry lymph, a clear fluid that contains tissue by-products and waste
material as well as immune system cells.
Lymph fluid is carried away from the breast by lymph vessels.
In the case of breast cancer, cancer cells may enter those lymph vessels and begin to
grow in lymph nodes.
Most of the lymph vessels of the breast drain into:
Lymph nodes under the arm (axillary nodes)
Lymph nodes around the collar bone (supraclavicular [above the collar bone] and
infraclavicular [below the collar bone] lymph nodes)
Lymph nodes inside the chest near the breast bone (internal mammary lymph nodes)
1.2. Statement of The Problem
The problem addressed by this study is that women with a benign (non-cancerous) tumor often
mistake it for a malignant (cancerous) tumor, which causes fear and panic that they have
breast cancer. This results in stress and depression and can lead to further health issues
and complications.
1.3.1. Aim
The aim of this study is to develop a model that classifies breast tumors into the two main
types: malignant (cancerous) and benign (non-cancerous).
1.3.2. Objectives
To retrieve a dataset from an online source, namely Kaggle.
To train the model on the available dataset.
To evaluate the model and report the accuracy obtained.
CHAPTER TWO
2.0. LITERATURE REVIEW
The modern approach to breast cancer treatment and research started forming in the 19th
century. Below are the milestones:
1882: William Halsted performs the first radical mastectomy. This surgery remains the
standard operation to treat breast cancer well into the 20th century.
1895: The first X-ray is taken. Eventually, low-dose X-rays called mammograms will be
used to detect breast cancer.
1898: Marie and Pierre Curie discover the radioactive elements radium and polonium.
Shortly after, radium is used in cancer treatment.
1932: A new approach to mastectomy is developed. The surgical procedure is not as
disfiguring, and becomes the new standard.
1937: Radiation therapy is used in addition to surgery to spare the breast. After removing
the tumor, needles with radium are placed in the breast and near lymph nodes.
1978: Tamoxifen (Nolvadex, Soltamox) is approved by the Food and Drug Administration
(FDA) for use in breast cancer treatment. This antiestrogen drug is the first in a new class
of drugs called selective estrogen receptor modulators (SERMs).
1984: Researchers discover a new gene in rats. The human version, HER2, is found to be
linked with more aggressive breast cancer when overexpressed. Called HER2-positive
breast cancer, it isn’t as responsive to treatments.
1985: Researchers discover that women with early-stage breast cancer who were treated
with a lumpectomy and radiation have similar survival rates to women treated with only a
mastectomy.
1986: Scientists figure out how to clone the HER2 gene.
1995: Scientists clone the tumor suppressor genes BRCA1 and BRCA2. Inherited
mutations in these genes can predict an increased risk of breast cancer.
1996: The FDA approves anastrozole (Arimidex) as a treatment for breast cancer. This drug
blocks the production of estrogen.
1998: Tamoxifen is found to decrease the risk of developing breast cancer in at-risk
women by 50 percent. It’s now approved by the FDA for use as a preventive therapy.
1998: Trastuzumab (Herceptin), a drug targeting cancer cells that are over-producing
HER2, is also approved by the FDA.
2006: The SERM drug raloxifene (Evista) is found to reduce breast cancer risk for
postmenopausal women who have higher risk. It has a lower chance of serious side
effects than tamoxifen.
2011: A large meta-analysis finds that radiation therapy significantly reduces the risk of
breast cancer recurrence and mortality.
2013: The four major subtypes of breast cancer are defined as HR+/HER2- (“luminal A”),
HR-/HER2- (“triple negative”), HR+/HER2+ (“luminal B”), and HR-/HER2+
(“HER2-enriched”).
2017: The first biosimilar drug, Ogivri (trastuzumab-dkst), is approved by the FDA for
breast cancer treatment. Unlike generics, biosimilars are copies of biologic drugs and
cost less than branded drugs.
2018: A clinical trial suggests that chemotherapy after surgery doesn’t benefit 70 percent
of women with early-stage breast cancer.
2019: Enhertu is approved by the FDA, and this drug proves to be very effective in
treating HER2-positive breast cancer that’s metastasized or can’t be removed with
surgery.
2020: The drug Trodelvy is approved by the FDA for treating metastatic triple-negative
breast cancer for people who haven’t responded to at least two other treatments.
2.1.1. Definition of The Specific Research Concept
1. Breast Cancer
Breast cancer is a disease in which cells of the breast grow out of control.
There are various types of breast cancer. The type is determined by which cells in the
breast become cancerous.
2. Tumor
A tumor is a mass or lump of tissue that may resemble swelling. The National Cancer
Institute defines a tumor as “an abnormal mass of tissue that results when cells divide
more than they should or do not die when they should.”
3. Malignant Tumor
A malignant tumor is a lump of cancer cells that can invade and destroy nearby tissue
and spread to other parts of the body. This type of tumor is said to be cancerous.
4. Benign Tumor
A benign tumor, like all tumors, is a collection of abnormal cells. Unlike malignant
(cancerous) tumors, benign tumors cannot invade adjacent tissue or spread to other parts
of the body. They are often encased in a protective sac that makes them simple to remove.
5. Lymph Nodes
Lymph nodes (or lymph glands) are small lumps of tissue that contain white blood cells,
which fight infection. They are part of the body’s immune system and filter lymph fluid,
which is composed of fluid and waste products from body tissues. In doing so, they filter
harmful substances such as bacteria and cancer cells from the body and help fight
infections. They also play an important role in cancer diagnosis, treatment and
prognosis.
2.2. Review of Existing Relevant Literature
(Raul et al., 2020) proposed techniques that support the effective medical diagnosis of breast
cancer, which has become a priority for governments, health institutions and civil society in
general. In this paper, an associative pattern classifier (APC) was used for the diagnosis of
breast cancer. The efficiency rate obtained on the Wisconsin breast cancer database was 97.31%.
The APC’s performance was compared with the performance of a support vector machine (SVM)
model, back-propagation neural networks, C4.5, naive Bayes, k-nearest neighbor (k-NN) and
minimum distance classifiers. According to the results, the APC performed best. The APC
algorithm, as well as the experiments and the comparison between algorithms, was written and
executed on a Java platform.
(Abdullah et al., 2019) proposed that artificial intelligence (AI) and machine learning (ML)
approaches, in combination with Raman spectroscopy (RS), offer a way forward for accurate
medical diagnosis and decision making, both for understanding the chemical pathway to the
progression of disease and for tailor-made personalised medicine. These processes handle
unwanted effects in the spectra, such as noise and fluorescence, as well as normalization, and
help optimize spectral data by employing chemometrics.
Research design and materials: In this study, breast cancer tissues were analysed by RS in
conjunction with principal component analysis (PCA) and linear discriminant analysis (LDA).
Tissue microarray (TMA) breast biopsies were investigated using RS and chemometric methods
and classified into luminal A, luminal B, HER2 and triple negative subtypes.
Results: Supervised and unsupervised algorithms were applied to the biopsy data to explore
intra- and inter-dataset biochemical changes associated with lipid, collagen and nucleic acid
content. The LDA-predicted specificity accuracies for the luminal A, luminal B, HER2 and
triple negative subtypes were 70%, 100%, 90% and 96.7%, respectively.
Conclusion: It is envisaged that a combination of RS with AI and ML may create a precise and
accurate real-time methodology for cancer diagnosis and monitoring.
(Chocholova et al., 2018) presented two key novel components of a potential new breast cancer
(BCa) diagnostic approach: 1. application of photoimmobilizable zwitterionic hydrogels that
resist nonspecific protein adsorption for preparation of the biosensor interfaces, and
2. integration of lectins (carbohydrate-recognizing proteins) within biosensors to evaluate
changes in the glycan profile of the HER2 protein at the molecular level. A disposable
electrochemical biosensor based on screen-printed carbon electrodes (SPCE) with a deposited
hydrogel layer was applied for covalent attachment of antibodies for a specific interaction
with HER2. In the subsequent step, HER2 molecules were glycoprofiled in situ using lectins.
The impedimetric immunosensor was able to detect HER2 down to 5 pg/mL (≈ 77 fM) with minimal
nonspecific protein adsorption. The biosensor was then combined with lectins to glycoprofile
HER2 in two serum samples (one from a healthy woman at high risk of BCa and the other from a
woman with stage 2 BCa). The results obtained by glycoprofiling via the impedimetric
biosensor were successfully verified by independent lectin-based enzyme-linked immunosorbent
assays (ELISA). According to the authors, it is the first biosensor to apply a zwitterionic
polymeric hydrogel-modified interface for glycoprofiling of a cancer biomarker.
CHAPTER THREE
3.0. METHODOLOGY DESIGN
In this study, the knowledge discovery in databases (KDD) methodology was used.
The main reason for using the KDD methodology is our objective of extracting information or
knowledge from data in the context of large databases or datasets.
According to GeeksforGeeks, the following are the steps involved in the KDD process:
i. Data cleaning:
The term "data cleaning" refers to the process of removing noisy and irrelevant data
from a collection.
Cleaning in case of missing values.
Cleaning noisy data, where noise is random error or variance.
Cleaning with data discrepancy detection and data transformation tools.
iii. Data selection:
Data selection is defined as the process in which data relevant to the analysis is
identified and retrieved from the data collection.
Data selection using neural networks.
Data selection using decision trees.
Data selection using Naive Bayes.
Data selection using clustering, regression, etc.
v. Data mining:
Data mining is defined as the application of clever techniques to extract potentially
useful patterns.
Transforms task-relevant data into patterns.
Decides the purpose of the model using classification or characterization.
vii. Knowledge representation:
Knowledge representation is defined as a technique that uses visualization tools to
represent data mining results.
Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.
Fig. 3.0. Diagram to Show the Iterative Steps Involved in KDD Process
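The cleaning step (i) listed above can be illustrated with a minimal sketch in plain Python. It assumes that each record is a dictionary and that a missing value appears as an empty string or None; the helper name clean_rows, the column names and the values are hypothetical and are not taken from the project's own code.

```python
def clean_rows(rows, required):
    """Drop rows that have a missing (empty or None) value in any required column."""
    cleaned = []
    for row in rows:
        if all(row.get(col) not in (None, "") for col in required):
            cleaned.append(row)
    return cleaned


# Hypothetical records; the second one has a missing measurement.
rows = [
    {"id": "1", "diagnosis": "M", "radius_mean": "17.99"},
    {"id": "2", "diagnosis": "B", "radius_mean": ""},
    {"id": "3", "diagnosis": "B", "radius_mean": "13.54"},
]

cleaned = clean_rows(rows, required=["diagnosis", "radius_mean"])
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

Dropping incomplete rows is only one option; replacing a missing value with the column mean is a common alternative when the dataset is small.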
CHAPTER FOUR
4.0. SYSTEM/MODEL DESIGN AND IMPLEMENTATION
The dataset for this project was obtained from an online source, Kaggle.
Fig. 4.0. Screenshot to show first 5 and last 5 rows of the dataset
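As a rough illustration of what inspecting the first and last five rows of a CSV dataset involves, the sketch below uses only Python's standard library. The CSV text, the column names and the helper load_rows are hypothetical stand-ins and do not reproduce the Kaggle file or the project's code.

```python
import csv
import io

# Hypothetical stand-in for the downloaded CSV file (assumed column names).
sample_csv = "id,diagnosis,radius_mean\n" + "\n".join(
    f"{i},{'M' if i % 3 == 0 else 'B'},{10 + i * 0.5:.2f}" for i in range(1, 13)
)


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(sample_csv)
head, tail = rows[:5], rows[-5:]  # first five and last five rows

for row in head + tail:
    print(row["id"], row["diagnosis"], row["radius_mean"])
```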
The model for breast cancer diagnostics was created and evaluated using the steps of the
knowledge discovery in databases (KDD) methodology:
4.2. Data Selection/Transformation
From the cleaned dataset, we selected the ‘diagnosis’ column as the data relevant to the
analysis, and then mapped M and B to 1 and 0 respectively.
N.B.: M denotes a malignant tumor, while B denotes a benign tumor.
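The mapping of M and B to 1 and 0 can be sketched as follows. This is a minimal plain-Python illustration; the names LABEL_MAP and encode_diagnosis and the sample labels are assumptions made for the example, not the project's own code.

```python
LABEL_MAP = {"M": 1, "B": 0}  # M = malignant, B = benign


def encode_diagnosis(labels):
    """Map diagnosis letters to integers; raise on unexpected values."""
    encoded = []
    for label in labels:
        if label not in LABEL_MAP:
            raise ValueError(f"unexpected diagnosis label: {label!r}")
        encoded.append(LABEL_MAP[label])
    return encoded


diagnosis = ["M", "B", "B", "M", "B"]  # hypothetical column values
y = encode_diagnosis(diagnosis)
print(y)  # [1, 0, 0, 1, 0]
```

Raising an error on an unknown label, rather than silently skipping it, makes data-entry problems visible before training starts.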
4.3. Data Mining
Several techniques were used to extract useful patterns from the data. The techniques used in
this phase include:
Generating and visualizing the correlation matrix
Fig. 4.0.4 Screenshot for the code for the correlation matrix
Fig. 4.0.5. Screenshot to show the heat map of the correlation matrix
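To make the idea of a correlation matrix concrete, the sketch below computes Pearson correlation coefficients between every pair of columns using plain Python. The feature names and values are hypothetical, and the helper functions pearson and correlation_matrix are written for this illustration only; a heat map is simply a colored rendering of the resulting table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical feature values for five tumors.
columns = {
    "radius_mean": [12.0, 14.0, 16.0, 18.0, 20.0],
    "perimeter_mean": [78.0, 91.0, 104.0, 117.0, 130.0],  # proportional to radius
    "smoothness_mean": [0.12, 0.09, 0.11, 0.08, 0.10],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

Pairs of features with a coefficient close to 1 or -1 carry largely the same information, which is the usual motivation for dropping one of them.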
Dropping all the worst columns
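A minimal sketch of this step is given below. It assumes that the "worst" columns are those whose names end in the suffix "_worst"; that naming, the helper drop_columns_with_suffix and the sample values are assumptions made for illustration.

```python
def drop_columns_with_suffix(rows, suffix="_worst"):
    """Return copies of the rows without any column whose name ends in `suffix`."""
    return [
        {name: value for name, value in row.items() if not name.endswith(suffix)}
        for row in rows
    ]


# Hypothetical records with one "mean" and one "worst" measurement each.
rows = [
    {"diagnosis": 1, "radius_mean": 17.99, "radius_worst": 25.38},
    {"diagnosis": 0, "radius_mean": 13.54, "radius_worst": 15.11},
]

reduced = drop_columns_with_suffix(rows)
print(list(reduced[0]))  # ['diagnosis', 'radius_mean']
```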
Fig. 4.0.7. Screenshot of code to draw the correlation matrix heat map
Visualizing the correlation matrix heat map
Fig. 4.0.8. Screenshot to show the visualized correlation matrix heat map
4.4. Pattern Evaluation
Based on the clean and prepared dataset, the model was built and tested.
Fig. 4.0.9. Screenshot to show the code for building the model
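Building and testing a model normally starts by splitting the prepared data into a training part and a test part. The sketch below shows one way to do this in plain Python. The 30% test fraction, the fixed seed, the helper train_test_split and the sample data are all assumptions made for illustration; the report does not state which split was used.

```python
import random


def train_test_split(features, labels, test_fraction=0.3, seed=0):
    """Shuffle indices with a fixed seed and split into train and test parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical data: ten samples with one feature each.
X = [[float(i)] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), "training samples,", len(X_test), "test samples")
```

Fixing the seed keeps the split reproducible, so that accuracies from different algorithms are measured on the same test samples.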
The following algorithms were used to train and test the model.
i. Logistic regression
Fig. 4.1.0. Screenshot for building the model with Logistic regression
ii. Decision tree
Fig. 4.1.1. Screenshot for building the model with Decision tree
Fig. 4.1.2. Screenshot for building the model with Random forest
Fig. 4.1.4. Screenshot for building the model with SVM
Fig. 4.1.5. Screenshot for building the model with Naïve Bayes
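Of the algorithms used in this project, k-nearest neighbor is the simplest to sketch without external libraries, so it serves here as an illustration of what training and predicting mean in practice. The function knn_predict, the choice k=3 and the two-feature sample points are assumptions made for this example and are not the project's code.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples: 0 = benign, 1 = malignant.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [1.1, 1.0]))  # query near the benign cluster
print(knn_predict(train_X, train_y, [3.0, 3.0]))  # query near the malignant cluster
```

An odd value of k avoids tied votes when there are only two classes.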
Fig. 4.1.6. Screenshot for the classification report for Logistic regression
Fig. 4.1.7. Screenshot for the classification report for Decision Tree
Fig. 4.1.8. Screenshot for the classification report for Random Forest
Fig. 4.1.9. Screenshot for the classification report for KNN
Fig. 4.2.0. Screenshot for the classification report for SVM
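A classification report typically lists accuracy, precision, recall and F1-score. The sketch below shows how these figures follow from true and predicted labels. The function classification_metrics and the sample labels are hypothetical and do not reproduce the reports in the figures above.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true and predicted labels (1 = malignant, 0 = benign).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

report = classification_metrics(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

For a diagnostic task, recall on the malignant class matters as much as overall accuracy, because a missed malignant tumor is the costlier error.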
CHAPTER FIVE
5.0. SUMMARY AND CONCLUSION
5.1. SUMMARY
Breast cancer diagnostics, as treated in this project, is the use of machine learning
algorithms and techniques to classify breast tumors as malignant (cancerous) or benign
(non-cancerous). In this research, the dataset was obtained from an online source, Kaggle.
The research uses the Knowledge Discovery in Databases (KDD) methodology to create and
evaluate the model based on the retrieved dataset.
5.2. CONCLUSION
After creating the model, we used six (6) algorithms to distinguish the two types of breast
tumor and evaluated the accuracy of each.
The Logistic Regression algorithm gave 95.90% accuracy, the Decision Tree algorithm gave
90.05% accuracy, the Random Forest algorithm gave 94.73% accuracy, the Naïve Bayes
algorithm gave 92.98% accuracy, the K-Nearest Neighbor algorithm gave 96.49% accuracy,
and the Support Vector Machine algorithm gave 96.49% accuracy.
Comparing the algorithms by accuracy, we found that the K-Nearest Neighbor and Support
Vector Machine algorithms have the highest accuracy and are therefore the best of the six
algorithms for breast cancer diagnostics on this dataset.
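The comparison can be reproduced with a few lines of Python. The accuracy figures are the ones reported in this section; the variable names are chosen for illustration.

```python
# Accuracies as reported in the conclusion (percent).
accuracies = {
    "Logistic Regression": 95.90,
    "Decision Tree": 90.05,
    "Random Forest": 94.73,
    "Naive Bayes": 92.98,
    "K-Nearest Neighbor": 96.49,
    "Support Vector Machine": 96.49,
}

best_score = max(accuracies.values())
best_models = sorted(name for name, acc in accuracies.items() if acc == best_score)

# Print the algorithms from most to least accurate, then the tied winners.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}%")
print("Highest accuracy:", ", ".join(best_models))
```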
REFERENCES
Abdullah, C., Shazza, R., Ihtesham, R. (2019). Advancing cancer diagnostics with artificial
intelligence and spectroscopy: Identifying chemical changes associated with breast
cancer. Expert Review of Molecular Diagnostics, 1744-8352.
Mei, L., Xiaocheng, Y., Zhu, C., Tong, Y., Dandan, Y., Qianqian, L., Keke, D., Bo, L.,
Zhifei, W., Song, L., Yan, D., Nongyue, H. (2017). Aptamer selection and applications
for breast cancer diagnostics and therapy. Journal of Nanobiotechnology, 15-81.
Solin, L., Gray, R., Baehner, F., Butler, M., Hughes, L. (2013). A multigene expression
assay to predict local recurrence risk for ductal carcinoma in situ of the breast. J Natl
Cancer Inst, 105: 701-710.