
FEDERAL UNIVERSITY LOKOJA

COMPUTER SCIENCE DEPARTMENT

PROJECT TOPIC

ON

HOME LOAN PREDICTION


PRESENTED

BY

GROUP TWO (2)

CSC407

LECTURER’S NAME

DR EMEKA OGBUJU

GROUP MEMBERS

S/NO NAMES MATRIC NO


1 OBAJE PHOEBE UNEKWUOJO SCI16CSC100
2 STEPHEN EXCEL OLUWASEGUN SCI16CSC138
3 ONOJA JESSE UNEKWUOJO SCI16CSC120
4 FARIDA ORAHACHI MOHAMMED SCI16CSC093
5 YUNUSA SANUSI MOHAMMED SCI16CSC155
6 YUSUF MANIR SCI16CSC087
7 OBI ABRAHAM IKECHUKWU SCI15CSC048
8 DOMINIC VICTOR ONU SCI16CSC051
9 AJOGWU ANTHONY SCI16CSC023
10 DANIEL OLUWASHINA OSEYA SCI16CSC124
11 GALLA ABDULSHAKUR ABUBAKAR SCI16CSC061
12 MONDAY ENYOJO MARY SCI16CSC095
13 SAMUEL VICTOR ENEJI SCI16CSC134
14 JAMES ILEMONA AKOR SCI16CSC027
15 OYEKANMI OLAYINKA JOSEPH SCI16CSC127
16 IFEANYI ISREAL SCI16CSC070
17 FEMI RAPHEAL EMILI SCI15CSC030
18 PETER JOY OZOHU SCI16CSC130
19 MUBARAK HARUNA SCI16CSC064
20 ISAH ABDULLAHI ABUBAKAR SCI16CSC007
21 LAWAL EMMANUEL TAIWO SCI16CSC086
22 KADRI M. OLUWASEUN SCI16CSC084
23 IGBENIDION MICHEAL SCI16CSC071
24 EMEKA KEVIN ONYEJI SCI16CSC122
25 EMMANUEL AKOGWU SCI15CSC011
26 SHUAIBU RIDWAN AMOTO SCI15CSC071

TABLE OF CONTENTS

CHAPTER ONE
1.0 INTRODUCTION
1.1 Background of the Study
1.2 Statement of the Problem
1.3 Aim and Objectives
1.4 Significance of the Study
1.5 Scope of the Study

CHAPTER TWO
2.0 LITERATURE REVIEW
2.1 Historical Review of the Study
2.1.1 Definition of the Specific Research Concept
2.2 Review of Existing Relevant Literature

CHAPTER THREE
3.0 METHODOLOGY DESIGN
3.1 A Brief Introduction to the Methodology

CHAPTER FOUR
4.0 SYSTEM/MODEL DESIGN AND IMPLEMENTATION
4.1 Data Cleaning
4.2 Data Selection/Transformation
4.3 Data Mining
4.4 Pattern Evaluation
4.5 Knowledge Representation

CHAPTER FIVE
5.0 SUMMARY AND CONCLUSION
5.1 Summary
5.2 Conclusion

REFERENCES

LIST OF FIGURES

Figure

3.0 Diagram to show the iterative steps in KDD process
4.0 Screenshot to show first 5 and last 5 rows of the dataset
4.0.1 Screenshot to show the data cleaning phase
4.0.2 Screenshot to show the data selection/transformation phase
4.0.3 Screenshot to show the scatter plot matrix
4.0.4 Screenshot for the code for the correlation matrix
4.0.5 Screenshot to show the heat map of the correlation matrix
4.0.6 Screenshot of code to drop all worst columns
4.0.7 Screenshot of code to draw the correlated matrix heat map
4.0.8 Screenshot to show the visualized correlated matrix
4.0.9 Screenshot to show the code for building the model
4.1.0 Screenshot for building the model with Logistic Regression
4.1.1 Screenshot for building the model with Decision Tree
4.1.2 Screenshot for building the model with Random Forest
4.1.3 Screenshot for building the model with KNN
4.1.4 Screenshot for building the model with SVM
4.1.5 Screenshot for building the model with Naïve Bayes
4.1.6 Screenshot for the classification report for Logistic Regression
4.1.7 Screenshot for the classification report for Decision Tree
4.1.8 Screenshot for the classification report for Random Forest
4.1.9 Screenshot for the classification report for KNN
4.2.0 Screenshot for the classification report for SVM

CHAPTER ONE

1.0. INTRODUCTION
1.1. Background of the Study
Breast cancer is the most common cancer among women, according to the World Health
Organization, with approximately 1.7 million new cases in 2012 (GLOBOCAN 2012) and
incidence and survival rates that vary considerably across the globe. Survival rates are
about 80% in North America and around 60% in middle-income countries, but fall below 40%
in low-income countries, largely because of late-stage diagnosis [1, 2].
Breast cancer is a disease in which a highly malignant type of tumor develops in the cells of
the breast. A tumor is a growth of abnormal tissue in the body. Tumors may be malignant
(cancerous) or benign (non-cancerous). Tumors develop when cells in the body divide and
multiply excessively. The body normally regulates cell division and growth. New cells are
created to replace old cells or to perform new functions, and damaged or no longer needed
cells die to make way for healthy replacement cells.
If the balance of cell division and death is disturbed, a tumor may form that can often be seen
on an x-ray or felt as a lump. Breast tumors that aren't cancerous are abnormal growths that
don't spread outside of the breast. Although benign breast lumps are not life threatening, they
can increase a woman's risk of developing breast cancer. Any lump or change in the breast
should be examined by a health care professional to see whether it is benign or malignant
(cancer) and whether it will affect the cancer risk in the future. Breast cancer can be of the
invasive or non-invasive type, and can occur in both men and women, although in men it is a
hundred times less common than in women.
Breast cancers can start from different parts of the breast.
 Most breast cancers begin in the ducts that carry milk to the nipple (ductal cancers)
 Some start in the glands that make breast milk (lobular cancers)
 There are also other types of breast cancer that are less common like phyllodes tumor and
angiosarcoma
 A small number of cancers start in other tissues in the breast. These cancers are called
sarcomas and lymphomas and are not really thought of as breast cancers.

Many kinds of breast cancer can cause a lump in the breast, but not all of them do.
Many breast cancers are also discovered on screening mammograms, which can detect
cancers at an earlier stage, often before they can be felt or symptoms appear.

There are many different types of breast cancer; common ones include ductal
carcinoma in situ (DCIS) and invasive carcinoma. Others, like phyllodes tumors
and angiosarcoma, are less common.
After a biopsy, breast cancer cells are tested for proteins called estrogen receptors,
progesterone receptors, and HER2. In the lab, the tumor cells are also examined closely
to determine the grade. Treatment options can be influenced by the particular proteins
found and the tumor grade.

Breast cancer can spread when the cancer cells get into the blood or lymph system and
are carried to other parts of the body.
The lymph system is a system of lymph (or lymphatic) vessels that connect lymph nodes
(small bean-shaped collections of immune system cells) throughout the body.
Lymph is a clear fluid that includes tissue byproducts and waste material, as well as
immune system cells, within lymph vessels.
Lymph fluid is carried away from the breast by lymph vessels.
Cancer cells may enter those lymph vessels and begin to develop in lymph nodes in the
case of breast cancer.
Most of the lymph vessels of the breast drain into:
 Lymph nodes under the arm (axillary nodes)
 Lymph nodes around the collar bone (supraclavicular [above the collar bone] and
infraclavicular [below the collar bone] lymph nodes)
 Lymph nodes inside the chest near the breast bone (internal mammary lymph nodes)

1.2. Statement of The Problem
The problem addressed by this study is that women with a benign (non-cancerous) tumor
often mistake it for a malignant (cancerous) tumor, which causes fear and panic that they
have breast cancer. This results in stress and depression, and can lead to several health
issues and complications.

1.3. Aim and Objectives

1.3.1. Aim
The aim of this study is to develop a model that classifies breast tumors into the two main
types: malignant (cancerous) and benign (non-cancerous).

1.3.2. Objectives
 To retrieve a dataset from an online source, i.e. from Kaggle.
 To train the model on the available dataset.
 To evaluate the model and report the accuracy it obtains.

1.4. Significance of The Study


Breast cancer diagnostics helps patients, especially women, to know which type of tumor
they have: benign or malignant.

1.5. Scope of The Study


There are several studies and research efforts on breast cancer, but this study covers only
breast cancer diagnostics, that is, classifying a breast tumor as benign or malignant.

CHAPTER TWO
2.0. LITERATURE REVIEW

2.1. Historical Review of the Study

The modern approach to breast cancer treatment and research started forming in the 19th
century. Below are the milestones:

 1882: William Halsted performs the first radical mastectomy. This surgery remains
the standard operation to treat breast cancer well into the 20th century.
 1895: The first X-ray is taken. Eventually, low-dose X-rays called mammograms will be
used to detect breast cancer.
 1898: Marie and Pierre Curie discover the radioactive elements radium and polonium.
Shortly after, radium is used in cancer treatment.
 1932: A new approach to mastectomy is developed. The surgical procedure is not as
disfiguring, and becomes the new standard.
 1937: Radiation therapy is used in addition to surgery to spare the breast. After removing
the tumor, needles with radium are placed in the breast and near lymph nodes.
 1978: Tamoxifen (Nolvadex, Soltamox) is approved by the Food and Drug Administration
(FDA) for use in breast cancer treatment. This antiestrogen drug is the first in a new class
of drugs called selective estrogen receptor modulators (SERMs).
 1984: Researchers discover a new gene in rats. The human version, HER2, is found to be
linked with more aggressive breast cancer when overexpressed. Called HER2-positive
breast cancer, it isn’t as responsive to treatments.
 1985: Researchers discover that women with early-stage breast cancer who were treated
with a lumpectomy and radiation have similar survival rates to women treated with only a
mastectomy.
 1986: Scientists figure out how to clone the HER2 gene.

 1995: Scientists can clone the tumor suppressor genes BRCA1and BRCA2. Inherited
mutations in these genes can predict an increased risk of breast cancer.
 1996: FDA approves anastrozole(Arimidex) as a treatment for breast cancer. This drug
blocks the production of estrogen.
 1998: Tamoxifen is found to decrease the risk of developing breast cancer in at-risk
women by 50percentTrustedSource. It’s now approved by the FDA for use as a
preventive therapy.
 1998:Trastuzumab(Herceptin), a drug targeting cancer cells that are over-producing
HER2, is also approved by the FDA.
 2006: The SERM drug raloxifene(Evista) is found to reduce breast cancer risk for
postmenopausal women who have higher risk. It has a lower chance of serious side
effects than tamoxifen.
 2011: A large meta-analysisTrusted Sourcefinds that radiation therapy significantly
reduces the risk of breast cancer reccurrence and mortality.
 2013: The four major subtypesof breast cancer are defined as HR+/HER2 (“luminal A”),
HR-/HER2 (“triple negative”), HR+/HER2+ (“luminal B”), and HR-/HER2+ (“HER2-
enriched”).
 2017: The first biosimilar drug, OgivriTrusted Source(trastuzumab-dkst), is approved by
the FDA for breast cancer treatment. Unlike generics, biosimilars are copies of biologic
drugs and cost less than branded drugs.
 2018: A clinical trialsuggests that chemotherapy after surgery doesn’t benefit 70 percent
of women with early-stage breast cancer.
 2019:EnhertuTrusted Sourceis approved by the FDA, and this drug proves to be very
effective in treating HER2-positive breast cancer that’s metastasized or can’t be removed
with surgery.
 2020: The drug Trodelvyis approved by the FDA for treating metastatic triple-negative
breast cancer for people who haven’t responded to at least two other treatments.

2.1.1. Definition of The Specific Research Concept
1. Breast Cancer
Breast cancer is a disease in which the cells of the breast grow out of control.
There are various types of breast cancer. The type of breast cancer is determined by
which cells in the breast become cancerous.

2. Tumor
A tumor is a tissue mass or lump that resembles swelling. The National Cancer Institute
defines a tumor as “an abnormal mass of tissue that results when cells divide more than
they should or do not die when they should.”

3. Malignant Tumor
A malignant tumor is a lump of cancer cells that can invade and destroy nearby tissue
and spread to other parts of the body. This type of tumor is said to be cancerous.

4. Benign Tumor
A benign tumor, like all tumors, is a collection of abnormal cells. Unlike malignant
(cancerous) tumors, benign tumors can't move into adjacent tissue or spread to other
parts of the body. They're often encased in a protective sac that makes them simple to
remove.

5. Lymph Nodes
Lymph nodes (or lymph glands) are small lumps of tissue that contain white blood cells,
which fight infection. They filter lymph fluid, which is composed of fluid and waste
products from our body tissues.
Lymph nodes are part of the body’s immune system. They filter harmful substances like
bacteria and cancer cells from the body, and help fight infections. They also play an
important role in cancer diagnosis, treatment and prognosis.

2.2. Review of Existing Relevant Literature

Raul et al. (2020) proposed a technique to support the effective medical diagnosis of breast
cancer, which has become a priority for government, for health institutions and for civil
society in general. In their paper, an associative pattern classifier (APC) was used for the
diagnosis of breast cancer. The classification rate obtained on the Wisconsin breast cancer
database was 97.31%. The APC’s performance was compared with the performance of a support
vector machine (SVM) model, back-propagation neural networks, C4.5, naive Bayes, k-nearest
neighbor (k-NN) and minimum distance classifiers. According to the results, the APC performed
best. The APC algorithm, as well as the experiments and the comparisons between algorithms,
was written and executed on the Java platform.

Abdullah et al. (2019) proposed combining artificial intelligence (AI) and machine learning (ML)
approaches with Raman spectroscopy (RS) to obtain accurate medical diagnoses and support
decision making, as a way forward not only for understanding the chemical pathway to the
progression of disease but also for tailor-made personalised medicine. These approaches remove
unwanted effects in the spectra, such as noise and fluorescence, apply normalization, and help
optimize the spectral data by employing chemometrics.
Research design and materials: In this study, breast cancer tissues were analysed by RS in
conjunction with principal component analysis (PCA) and linear discriminant analysis (LDA).
Tissue microarray (TMA) breast biopsies were investigated using RS and chemometric methods,
and the biopsies were classified into luminal A, luminal B, HER2 and triple negative subtypes.
Results: Supervised and unsupervised algorithms were applied to the biopsy data to explore
intra- and inter-dataset biochemical changes associated with lipid, collagen and nucleic acid
content. The LDA-predicted specificity accuracies for the luminal A, luminal B, HER2 and triple
negative subtypes were 70%, 100%, 90% and 96.7%, respectively.
Conclusion: It is envisaged that a combination of RS with AI and ML may create a precise and
accurate real-time methodology for cancer diagnosis and monitoring.

Chocholova et al. (2018) presented two key novel components of a potential new breast cancer
(BCa) diagnostic approach: (1) the application of photoimmobilizable zwitterionic hydrogels
that resist nonspecific protein adsorption for the preparation of biosensor interfaces, and
(2) the integration of lectins (carbohydrate-recognizing proteins) within biosensors to evaluate
changes in the glycan profile of the HER2 protein at the molecular level. A disposable
electrochemical biosensor based on screen-printed carbon electrodes (SPCE) with a deposited
hydrogel layer was used for the covalent attachment of antibodies for a specific interaction
with HER2. In the subsequent step, HER2 molecules were glycoprofiled in situ using lectins.
The impedimetric immunosensor was able to detect HER2 down to 5 pg mL-1 (≈ 77 fM) with
minimal nonspecific protein adsorption. The biosensor was then combined with lectins to
glycoprofile HER2 in two serum samples (one from a healthy woman at high risk of BCa and the
other from a woman with stage 2 BCa). The results obtained by glycoprofiling via the
impedimetric biosensor were successfully verified by independent lectin-based enzyme-linked
immunosorbent assays (ELISA). According to the authors, this is the first biosensor to apply a
zwitterionic polymeric hydrogel-modified interface to the glycoprofiling of a cancer biomarker.

Mei et al. (2017) reviewed aptamers, which are short, non-coding, single-stranded
oligonucleotides (RNA or DNA) developed in vitro through Systematic Evolution of Ligands by
Exponential enrichment (SELEX). Similar to antibodies, aptamers can bind to specific
targets with high affinity, and they are considered promising therapeutic agents because they
have several advantages over antibodies, including high specificity, stability, and
non-immunogenicity. The review first presents a systematic survey of aptamer selection
methods and then describes various aptamer-based diagnostic and therapeutic strategies for
breast cancer.

CHAPTER THREE
3.0. METHODOLOGY DESIGN
In this study, the Knowledge Discovery in Databases (KDD) methodology was used.
It was chosen because our objective is to extract information or knowledge from data in the
context of large databases or datasets.

3.1. A Brief Introduction to The Methodology


KDD stands for Knowledge Discovery in Databases. It refers to the broad process of
uncovering knowledge from data, with an emphasis on the high-level application of specific
data mining methods.
The KDD process has one overarching goal: to extract knowledge from data in the context of
large databases or datasets.
It accomplishes this by employing data mining methods (algorithms) to extract (identify)
what is deemed knowledge, according to the specified measures and thresholds, from a
database, along with any required preprocessing, subsampling, and transformation of that
database.

According to GeeksforGeeks, the following are the steps involved in the KDD process:
i. Data cleaning:
The term "data cleaning" refers to the process of removing noisy and irrelevant data
from a collection.
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.

ii. Data integration:
Data integration is defined as heterogeneous data from multiple sources combined
in a common source (data warehouse).
 Data integration using data migration tools.
 Data integration using data synchronization tools.
 Data integration using the ETL (Extract-Transform-Load) process.

iii. Data selection:
Data selection is defined as the process where data relevant to the analysis is decided and
retrieved from the data collection.
 Data selection using Neural network.
 Data selection using Decision Trees.
 Data selection using Naive Bayes.
 Data selection using Clustering, Regression, etc.

iv. Data transformation:
Data transformation is defined as the process of transforming data into the appropriate form
required by the mining procedure.
Data transformation is a two-step process:
 Data mapping: Assigning elements from source base to destination to capture
transformations.
 Code generation: Creation of the actual transformation program.

v. Data mining:
Data mining is defined as the application of intelligent techniques to extract potentially
useful patterns.
 Transforms task-relevant data into patterns.
 Decides purpose of model using classification or characterization.

vi. Pattern evaluation:
Pattern evaluation is defined as identifying the truly interesting patterns that represent
knowledge, based on given measures.
 Find interestingness score of each pattern.
 Uses summarization and visualization to make data understandable by the user.

vii. Knowledge representation:
Knowledge representation is defined as a technique which utilizes visualization tools to
represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.

Fig. 3.0. Diagram to Show the Iterative Steps Involved in KDD Process
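
The first two KDD steps listed above (data cleaning and data integration) can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the two small tables, the 'patient' key and all values are hypothetical, and filling a missing measurement with the median is one of several possible cleaning choices, not a step taken from this project.

```python
import pandas as pd

# Two hypothetical sources that share a patient key (all values are made up).
measurements = pd.DataFrame({
    "patient": [1, 2, 3, 4],
    "radius_mean": [14.2, None, 11.8, 20.1],  # one missing measurement
})
labels = pd.DataFrame({
    "patient": [1, 2, 3, 4],
    "diagnosis": ["B", "M", None, "M"],       # one missing label
})

# Data integration: combine the two sources into one table.
merged = measurements.merge(labels, on="patient", how="inner")

# Data cleaning: drop rows that have no label, then fill the remaining
# missing measurement with the column median (an assumed strategy).
cleaned = merged.dropna(subset=["diagnosis"]).copy()
cleaned["radius_mean"] = cleaned["radius_mean"].fillna(cleaned["radius_mean"].median())

print(cleaned)
```

Dropping unlabeled rows before filling means the median is computed only from rows that will actually be used.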

CHAPTER FOUR
4.0. SYSTEM/MODEL DESIGN AND IMPLEMENTATION

The dataset for this project was obtained from an online source, Kaggle.

Fig. 4.0. Screenshot to show first 5 and last 5 rows of the dataset

The steps of the Knowledge Discovery in Databases (KDD) methodology were followed to create
and evaluate the model for breast cancer diagnostics:

4.1. Data Cleaning


From the dataset, we removed the ‘Unnamed: 32’ column and the ‘id’ column, since they are
irrelevant to the project. We used the pandas method ‘df.drop()’ to remove these
columns.

Fig. 4.0.1 Screenshot to show the data cleaning phase
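
Since the screenshot is not reproduced here, the cleaning step can be sketched as follows. This is a minimal illustration, not the project's actual code: the three-row table and its values are hypothetical stand-ins for the Kaggle file, and only the column names 'id', 'diagnosis' and 'Unnamed: 32' come from the text.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the layout of the dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],                       # identifier, carries no diagnostic signal
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [18.0, 20.6, 13.5],     # made-up measurements
    "Unnamed: 32": [None, None, None],     # empty trailing column
})

# Remove the two columns that are irrelevant to the analysis.
df = df.drop(columns=["id", "Unnamed: 32"])

print(df.columns.tolist())
```

Passing columns=[...] to drop() makes it explicit that columns, not rows, are being removed.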

4.2. Data Selection/Transformation
From the cleaned dataset, we selected the ‘diagnosis’ column as the target of the analysis,
and then mapped its values M and B to 1 and 0 respectively.

N.B: M denotes a malignant tumor, while B denotes a benign tumor.

Fig. 4.0.2. Screenshot to show data selection/transformation phase
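
The mapping described above can be illustrated with pandas' Series.map. The four-row table is a hypothetical example; only the column name 'diagnosis' and the M/B to 1/0 encoding come from the text.

```python
import pandas as pd

# Hypothetical sample of the target column.
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Encode malignant (M) as 1 and benign (B) as 0.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

print(df["diagnosis"].tolist())
```

Any value other than M or B would become NaN under map, which makes unexpected labels easy to spot.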

4.3. Data Mining

Several techniques were used to extract useful patterns. The techniques used in this phase
include:

 Using the "mean" columns to generate a scatter plot matrix

Fig. 4.0.3. Screenshot to show the scatter plot matrix
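
A possible way to build such a scatter plot matrix is sketched below. It assumes that the "mean" columns are the ones whose names end in '_mean' (as in 'radius_mean'); the helper names mean_columns and plot_scatter_matrix and the sample values are hypothetical. The plotting helper needs matplotlib and is only defined here, not called.

```python
import pandas as pd


def mean_columns(df):
    """Return the names of the columns that end in '_mean' (assumed naming)."""
    return [c for c in df.columns if c.endswith("_mean")]


def plot_scatter_matrix(df):
    """Draw a scatter plot matrix of the '_mean' columns (requires matplotlib)."""
    from pandas.plotting import scatter_matrix
    return scatter_matrix(df[mean_columns(df)], figsize=(10, 10), diagonal="hist")


# Hypothetical sample with made-up values.
df = pd.DataFrame({
    "diagnosis": [1, 0, 0, 1],
    "radius_mean": [18.0, 12.1, 13.5, 20.6],
    "texture_mean": [10.4, 17.8, 14.4, 21.3],
    "radius_worst": [25.4, 13.9, 15.1, 24.9],
})

print(mean_columns(df))
```

Calling plot_scatter_matrix(df) on the full dataset would produce one panel per pair of "mean" features, with histograms on the diagonal.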

 Generating and visualizing the correlation matrix

Fig. 4.0.4 Screenshot for the code for the correlation matrix

Fig. 4.0.5. Screenshot to show the heat map of the correlation matrix
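
As a rough sketch of this step, the correlation matrix can be computed with DataFrame.corr() (Pearson correlation by default) and drawn as a heat map. The tiny table below is hypothetical and constructed so that the correlations are easy to check by eye; the helper plot_heatmap is an illustrative name, needs matplotlib, and is only defined here, not called. The project's own plotting code may have used a different library.

```python
import pandas as pd

# Hypothetical data: perimeter is exactly twice the radius, and smoothness
# falls linearly as the radius grows.
df = pd.DataFrame({
    "radius_mean": [1.0, 2.0, 3.0, 4.0],
    "perimeter_mean": [2.0, 4.0, 6.0, 8.0],
    "smoothness_mean": [4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()


def plot_heatmap(corr):
    """Draw the correlation matrix as a heat map (requires matplotlib)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    return fig


print(corr.round(2))
```

Pairs of features with a correlation close to 1 or -1 carry largely the same information, which is the usual motivation for dropping some of them.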

 Dropping all the ‘worst’ columns

Fig. 4.0.6. Screenshot of code to drop all worst columns
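
One way to drop these columns is sketched below, assuming that the ‘worst’ columns are those whose names end in '_worst' (for example 'radius_worst'). The sample table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with made-up values.
df = pd.DataFrame({
    "diagnosis": [1, 0],
    "radius_mean": [18.0, 12.1],
    "radius_worst": [25.4, 13.9],
    "texture_worst": [17.3, 23.4],
})

# Collect every column whose name ends in '_worst' and drop them in one call.
worst_cols = [c for c in df.columns if c.endswith("_worst")]
df = df.drop(columns=worst_cols)

print(df.columns.tolist())
```

Selecting the columns by suffix avoids typing each of the column names by hand.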

Fig. 4.0.7. Screenshot of code to draw the correlated matrix heat map

 Visualizing the correlated matrix heat map

Fig. 4.0.8. Screenshot to show the visualized correlated matrix heat map

4.4. Pattern Evaluation
Based on the clean and prepared dataset, the model was built and tested.

Fig. 4.0.9. Screenshot to show the code for building the model

The following algorithms were used to train and test the model.
i. Logistic regression

Fig. 4.1.0. Screenshot for building the model with Logistic regression
ii. Decision tree

Fig. 4.1.1. Screenshot for building the model with Decision tree

iii. Random forest

Fig. 4.1.2. Screenshot for building the model with Random forest

iv. K-Nearest Neighbor (KNN)

Fig. 4.1.3. Screenshot for building the model with KNN


v. Support Vector Machine (SVM)

Fig. 4.1.4. Screenshot for building the model with SVM

vi. Naïve Bayes

Fig. 4.1.5. Screenshot for building the model with Naïve Bayes
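
The six algorithms listed above can be trained and scored with scikit-learn roughly as follows. This is a sketch under several assumptions and not the project's actual code: it uses scikit-learn's bundled Wisconsin breast cancer data as a stand-in for the Kaggle file, a 30% test split, default hyperparameters, feature scaling for the distance- and margin-based models, and Gaussian Naive Bayes as the Naive Bayes variant. The accuracies it prints will therefore not necessarily match those reported in Chapter Five.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's copy of the Wisconsin dataset.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # scikit-learn codes malignant as 0; flip so that 1 = malignant

# Hold out part of the data for testing (split size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One entry per algorithm named in the text; settings are assumptions.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbor": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.4f}")
```

Keeping the models in a dictionary lets all six be trained and compared with the same split and the same evaluation code.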

4.5. Knowledge Representation


The classification report was used to represent the data mining results.
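
A classification report lists, for each class, the precision, recall, F1-score and support. The sketch below computes these quantities by hand for a small hypothetical set of true and predicted labels (1 = malignant, 0 = benign); the function name class_metrics and the label lists are illustrative and are not taken from the project.

```python
def class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


# Hypothetical labels for eight tumors.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

for label in (0, 1):
    precision, recall, f1, support = class_metrics(y_true, y_pred, label)
    print(label, precision, recall, f1, support)
print("accuracy", accuracy)
```

For diagnosis, the recall of the malignant class is worth reading alongside accuracy, since it measures how many malignant tumors the model actually catches.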

Fig. 4.1.6. Screenshot for the classification report for Logistic regression

Fig. 4.1.7. Screenshot for the classification report for Decision Tree

Fig. 4.1.8. Screenshot for the classification report for Random Forest

Fig. 4.1.9. Screenshot for the classification report for KNN

Fig. 4.2.0. Screenshot for the classification report for SVM

CHAPTER FIVE
5.0. SUMMARY AND CONCLUSION

5.1. SUMMARY
Breast cancer diagnostics, as studied here, is the use of machine learning algorithms and
techniques to classify breast tumors as malignant (cancerous) or benign (non-cancerous).
In this research, the dataset was obtained from an online source, Kaggle.
The research follows the Knowledge Discovery in Databases (KDD) methodology for the
creation and evaluation of the model based on the retrieved dataset.

5.2. CONCLUSION
Six (6) algorithms were trained and evaluated for classifying the two types of breast
tumor, and each was assessed by its accuracy.
The Logistic Regression algorithm gave 95.90% accuracy, the Decision Tree algorithm gave
90.05% accuracy, the Random Forest algorithm gave 94.73% accuracy, the Naïve Bayes
algorithm gave 92.98% accuracy, the K-Nearest Neighbor algorithm gave 96.49% accuracy,
and the Support Vector Machine algorithm gave 96.49% accuracy.
Comparing the algorithms by accuracy, we found that the K-Nearest Neighbor and the
Support Vector Machine algorithms had the highest accuracy and therefore performed best
among the algorithms tested for breast cancer diagnostics on this dataset.
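
The comparison can be reproduced directly from the accuracies reported above. The snippet below only ranks those reported figures; it does not retrain any model.

```python
# Accuracies (in percent) as reported in this section.
reported = {
    "Logistic Regression": 95.90,
    "Decision Tree": 90.05,
    "Random Forest": 94.73,
    "Naive Bayes": 92.98,
    "K-Nearest Neighbor": 96.49,
    "Support Vector Machine": 96.49,
}

# Rank the algorithms from highest to lowest accuracy.
ranking = sorted(reported.items(), key=lambda item: item[1], reverse=True)

# Collect every algorithm that ties for the highest accuracy.
best_accuracy = max(reported.values())
best = sorted(name for name, acc in reported.items() if acc == best_accuracy)

for name, acc in ranking:
    print(f"{name}: {acc:.2f}%")
print("Best:", best)
```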

REFERENCES

Abdullah, C., Shazza, R., Ihtesham, R. (2019). Advancing cancer diagnostics with artificial
intelligence and spectroscopy: Identifying chemical changes associated with breast
cancer. Expert Review of Molecular Diagnostics, 1744-8352.

Bodai, B. (2015). Breast cancer survivorship: a comprehensive review of long-term medical
issues and lifestyle recommendations. Perm. J. 19, 48–79.

Brożek-Płuska, B., Placek, I., Kurczewski, K., Morawiec, Z., Tazbir, M., Abramczyk, H.
(2008). Breast cancer diagnostics by Raman spectroscopy. Journal of Molecular
Liquids, 141, 145–148.

Mei, L., Xiaocheng, Y., Zhu, C., Tong, Y., Dandan, Y., Qianqian, L., Keke, D., Bo, L.,
Zhifei, W., Song, L., Yan, D., Nongyue, H. (2017). Aptamer selection and applications
for breast cancer diagnostics and therapy. Journal of Nanobiotechnology, 15, 81.

Solin, L., Gray, R., Baehner, F., Butler, M., Hughes, L. (2013). A multigene expression
assay to predict local recurrence risk for ductal carcinoma in situ of the breast. J Natl
Cancer Inst 105: 701-710.
