
Specialty: Materials and Energy, MME-S2

Project of Python

OUARZAZATE: 2022A

Breast cancer prediction using logistic regression


El Ouardi Otmane1; Ait Said Abdessamade2
Department of Physics-Chemistry, Polydisciplinary Faculty Ouarzazate, Morocco
Email: otmane.elouardi@edu.uiz.ac.ma
May 20, 2022

ABSTRACT: Breast cancer (BC) is one of the most common cancers among women worldwide, representing
the majority of new cancer cases and cancer-related deaths according to global statistics, making it a significant
public health problem in today’s society. Early diagnosis of BC can significantly improve the prognosis and chance
of survival, as it promotes timely clinical treatment of patients. Furthermore, accurate classification of
benign tumors can prevent patients from undergoing unnecessary treatments. Thus, the correct diagnosis of BC and
the classification of patients into malignant or benign groups is the subject of much research. This analysis aims
to observe which features are most helpful in predicting malignant or benign cancer and to see general trends
that may aid us in model selection and hyperparameter selection. The goal is to classify whether the breast
cancer is benign or malignant. The dataset used in this work is publicly available and was created by Dr. William
H. Wolberg, physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA. The results of this
work show that the logistic regression model has an accuracy of 94.02%, while the decision tree
and random forest models have accuracies of 92.98% and 95.61%, respectively.

Keywords: Breast cancer, Machine learning, Logistic regression, Python programming language.

1 Introduction
Breast cancer develops in breast tissue when abnormal cells grow, change and multiply out of control. It is the most common type of cancer among women in both developed and less developed nations, with an estimated 508,000 deaths among women in 2011 alone [1], and it accounted for 25% of all cancer cases and 15% of all cancer deaths among females in the 2012 estimates [2]. Cancer constitutes an enormous burden on society in more and less economically developed countries alike. Cancer cases are becoming more common due to the growth and aging of the population, as well as a widespread rise in established risk factors such as smoking, overweight and physical inactivity. Early detection of cancer significantly increases the probability of recovery through successful treatment. Delays in diagnosis result in late-stage presentation, with consequences of lower likelihood of survival, higher treatment costs and even death.

The most common techniques used for cancer detection are X-ray mammography and magnetic resonance imaging (MRI). However, these techniques have a few downsides: they are very costly, require bulky equipment and are only affordable in large hospital facilities. They may also have some side effects and produce false positives. With the volume of biomedical data growing extremely fast and with the advancement of technology, machine learning techniques offer promising results. Machine learning helps to extract information and knowledge from past experience and to detect hard-to-perceive patterns in large and noisy datasets, giving accurate results within a short period of time. The application of machine learning in the medical domain is growing rapidly due to its effectiveness in prediction and classification, especially in medical diagnosis such as breast cancer prediction, and it is now widely applied in biomedical research.

Numerous modern techniques for the prediction of breast cancer have evolved with technology. The work related to this field is outlined briefly as follows.

Some studies [3], [4] presented work on the prediction and diagnosis of diseases using machine learning techniques such as decision trees for the detection of cancer.

Liu Lei [5] proposed a model that uses machine learning for cancer detection. In that research, the Logistic Regression algorithm of the Sklearn machine learning library was used to classify breast cancer datasets. Two features, maximum texture and minimum perimeter, were selected, and the classification accuracy stood at 96.5%.

Bellaachia and Guven (2006) [6] looked into the use of Naïve Bayes, the back-propagated neural network and the C4.5 decision tree algorithms on the SEER dataset, which contained 16 attributes and 482,052 records.

There has been research in the recent past, and there is ongoing research, that aims to identify the features that are most helpful in predicting malignant or benign cancer and to see general trends that might help in selecting particular models and hyperparameters. The aim of almost all of this research has been to reach the highest possible accuracy in the shortest time.

Machine learning (ML) is one of the fields of Artificial Intelligence (AI) where statistical techniques are used to give computer systems the capability to “learn” and improve progressively without being explicitly programmed. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data based on different teaching mechanisms. The term ‘machine learning’ was coined by Arthur Samuel in 1959 [7]. There are three important categories of machine learning: supervised learning, unsupervised learning and reinforcement learning.

The objective of this paper is to describe logistic regression as a predictive model built using machine learning algorithms for the early detection of breast cancer, in order to improve the prognosis and chances of survival through timely clinical treatment of patients.

2 Proposed Model
2.1 Dataset
The dataset chosen for this research is the Wisconsin Breast Cancer (Diagnostic) Data Set (WBCD) [8]. The dataset is publicly available in the UCI Machine Learning Repository. WBCD was created by Dr. William H. Wolberg, physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA. Dr. Wolberg used Xcyt to analyze fluid samples taken from patients with solid breast masses. The features in the dataset describe characteristics of the cell nuclei and are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. As described in the UCI Machine Learning Repository, the attribute information is:

1. ID number
2. Diagnosis (M = malignant, B = benign)
3-32. Ten real-valued features are computed for each cell nucleus:

• a) radius (mean of distances from center to points on the perimeter)
• b) texture (standard deviation of gray-scale values)
• c) perimeter
• d) area
• e) smoothness (local variation in radius lengths)
• f) compactness (perimeter^2 / area − 1.0)
• g) concavity (severity of concave portions of the contour)
• h) concave points (number of concave portions of the contour)
• i) symmetry
• j) fractal dimension (“coastline approximation” − 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

2.2 Data Visualization
2.2.1 Histogram
A histogram is a graphical representation of data using bars of different heights, where each bar groups numbers into ranges. Taller bars show that more data falls in that range. The shape and spread of continuous sample data can be shown using a histogram. Figure 1 shows the class distribution of diagnosed malignant (M) and benign (B) tumors. There are 212 malignant tumors, approximately 37% of the samples, and 357 benign tumors, making up the remaining 63% of the predictive class.
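The class counts quoted above can be checked with a short sketch. It assumes that the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) matches the UCI file used in this work; the helper name `class_distribution` is hypothetical and not part of the original project.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer


def class_distribution():
    """Return {class name: (count, percentage)} for the bundled WDBC data."""
    data = load_breast_cancer()
    # data.target holds integer codes; data.target_names maps them to labels.
    names = [str(data.target_names[t]) for t in data.target]
    counts = Counter(names)
    total = len(names)
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}


distribution = class_distribution()
for name, (n, pct) in distribution.items():
    print(f"{name}: {n} samples ({pct}%)")
```

The same counts could be passed to a bar plot to reproduce a class-distribution figure.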


Figure 1: Class Distribution.

2.2.2 Heatmap
A heatmap is a two-dimensional representation that uses colors to visualize both simple and complex information. A heatmap is an extremely useful way to see which intersections of the values have a higher concentration of the data compared to the others. Figure 2 represents a correlation matrix as a heatmap. It is used to show the correlation among all 30 features in this dataset.

N.B.: In the figure below we used Seaborn to create a heatmap of the correlations between the features.

Figure 2: Correlation between all variables.

We can also see how malignant and benign tumor cells can have (or not) different values for the features by plotting the distribution of each type of diagnosis for each of the mean features.

Figure 3: Nucleus Features vs Diagnosis.

2.3 Data Preprocessing
2.3.1 Categorical Variable Conversion
The dataset includes both numerical and categorical features. The ‘Diagnosis’ column is categorical: it indicates whether the cancer is M = malignant or B = benign. The rest of the features are numerical. Most algorithms produce better results with numerical variables. In Python, the “sklearn” library requires features in numerical arrays, and categorical variables cannot be fitted into a regression equation in their raw form. Hence, Label Encoder was used to transform the non-numerical labels into numerical labels.

Figure 4: Diagnosis Data after Encoding.

2.3.2 Splitting the dataset
The data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. We keep the test dataset (or subset) in order to test our model’s predictions on this subset. We did this with the scikit-learn library in Python, using the train_test_split method.

2.3.3 Feature Scaling
Most of the time, the dataset will contain features that vary highly in magnitude, units and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, for example by fitting the data within a specific range such as 0–100 or 0–1, or by standardizing each feature to zero mean and unit variance. In this work, the scikit-learn module sklearn.preprocessing.StandardScaler is used to implement standardization in Python.

2.4 Logistic Regression
Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The target or dependent variable is dichotomous, which means there are only two possible classes. In simple words, the dependent variable is binary, with data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction and cancer detection [9].

2.4.1 Types of Logistic Regression
Generally, logistic regression means binary logistic regression with binary target variables, but there are two more categories of target variables that it can predict. Based on the number of categories, logistic regression can be divided into the following types:

• Binary or Binomial: In this kind of classification, the dependent variable has only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.

• Multinomial: In this kind of classification, the dependent variable can have 3 or more possible unordered types, i.e. types having no quantitative significance. For example, these variables may represent “Type A”, “Type B” or “Type C”.

• Ordinal: In this kind of classification, the dependent variable can have 3 or more possible ordered types, i.e. types having a quantitative significance. For example, these variables may represent “poor”, “good”, “very good” or “excellent”, and each category can have a score like 0, 1, 2, 3.

2.4.2 Logistic Regression Assumptions
Before diving into the implementation of logistic regression, we must be aware of the following assumptions:

▷ In binary logistic regression, the target variable must always be binary, and the desired outcome is represented by factor level 1.

▷ There should not be any multicollinearity in the model, which means the independent variables must be independent of each other.

▷ We must include meaningful variables in our model.

▷ We should choose a large sample size for logistic regression.

2.4.3 Regression Models
1. Binary Logistic Regression Model
The simplest form of logistic regression is binary or binomial logistic regression, in which the target or dependent variable can have only 2 possible types, either 1 or 0. It allows us to model a relationship between multiple predictor variables and a binary/binomial target variable. In logistic regression, the linear function is used as the input to another function g, as in the following relation [10]:

step 1:
$h_\theta(x) = g(\theta^T x)$ (1)
where $0 \le h_\theta(x) \le 1$

step 2:
$g(z) = \frac{1}{1 + e^{-z}}$ (2)
where $z = \theta^T x$ and g is the logistic or sigmoid function.
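As a minimal illustration of equations (1) and (2), the hypothesis can be sketched in plain Python. The function names `sigmoid` and `hypothesis` and the numeric values of `theta` and `x` are hypothetical, chosen only so that the example is easy to follow; they are not taken from the project code.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), as in equation (2)."""
    # Branch on the sign of z so that exp() is only called on a
    # non-positive argument, which avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), as in equation (1)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


# Hypothetical weights and feature vector (first entry acts as the bias term).
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.75]
p = hypothesis(theta, x)  # theta^T x = 0.5 - 2.0 + 1.5 = 0
print(p)
```

Because the output always lies between 0 and 1, it can be read as the predicted probability of class 1 and thresholded (commonly at 0.5) to obtain a class label.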


step 3: We also need to define a loss function, the binary cross-entropy cost function, to measure how well the algorithm performs with the weights, represented by theta, as follows:
$J(\theta) = \frac{1}{m}\left(-y^T \log(h) - (1 - y)^T \log(1 - h)\right)$ (3)
where $h = g(X\theta)$ and m is the number of training examples.

step 4: Minimize $J(\theta)$ using an optimization algorithm.

Gradient descent:
$\theta_j := \theta_j - \gamma \frac{d}{d\theta_j} J(\theta)$ (4)
where $\gamma$ is the learning rate.

The following gradient equation tells us how the loss would change if we modified the parameters:
$\frac{dJ(\theta)}{d\theta_j} = \frac{1}{m} \left[ X^T (g(X\theta) - y) \right]_j$ (5)

2. Multinomial Logistic Regression Model
Another useful form of logistic regression is multinomial logistic regression, in which the target or dependent variable can have 3 or more possible unordered types, i.e. types having no quantitative significance.

2.5 Other Algorithms
2.5.1 Random Forest
Random Forest is one of the most popular and powerful machine learning algorithms. It consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. Decision trees are built on sample sets drawn from the data, and the many resulting trees form a forest. It is suited to classification problems like this one and to other tasks such as regression: a multitude of trees is built at training time, and the output is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The numerous deep decision trees are trained on different subsets of the same dataset and averaged with the aim of decreasing the variance [11].

2.5.2 Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome. In a decision tree there are two types of nodes: the decision node and the leaf node. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches [12].
The critical difference between the random forest algorithm and the decision tree is that a decision tree is a graph that illustrates all possible outcomes of a decision using a branching approach. In contrast, the random forest algorithm builds a set of decision trees and combines their outputs.

3 Result Analysis
The performance of the different models on the test portion of the dataset was analyzed and compared across various performance metrics.

3.1 Performance Metrics
3.1.1 Confusion matrix
The confusion matrix is one of the most intuitive and easiest metrics used for assessing the correctness and accuracy of a model. It is used for classification problems where the output can be of two or more classes, which makes it well suited to this paper [13]. Terms associated with the confusion matrix:

Figure 6: Confusion Matrix.

1. True Positives (TP): True positives are the cases when the actual class of the data point was True (1) and the predicted class is also True (1).

2. True Negatives (TN): True negatives are the cases when the actual class of the data point was False (0) and the predicted class is also False (0).

3. False Positives (FP): False positives are the cases when the actual class of the data point was False (0) and the predicted class is True (1).

4. False Negatives (FN): False negatives are the cases when the actual class of the data point was True (1) and the predicted class is False (0).

3.1.2 Accuracy
Accuracy in classification problems is the number of correct predictions made by the model over the total number of predictions made:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (6)

3.1.3 Precision
Precision is the ratio of True Positives to the sum of True Positives and False Positives:
$\text{Precision} = \frac{TP}{TP + FP}$ (7)

3.1.4 Recall
Recall is the proportion of patients who actually had a malignant tumor that the algorithm diagnosed as having a malignant tumor:
$\text{Recall} = \frac{TP}{TP + FN}$ (8)

3.1.5 F1 Score
F1 Score is the harmonic mean of precision and recall. The range of the F1 Score is [0,1]. It shows how precise the classifier is and how robust it is at the same time:
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (9)

4 Model Performances
4.1 Logistic Regression (LR)
Figure 7 illustrates the confusion matrix of the Logistic Regression model [14]. The results show that Logistic Regression performs well for this problem, with a Precision of 97.02% and a Recall score of 96.54%.

Figure 7: Performance of Logistic Regression.

4.2 Random Forest (RF)
The accuracy of the Random Forest model was 94.73%. Figure 8 shows the confusion matrix of the random forest model.

Figure 8: Performance of random forest.

4.3 Decision Tree
The accuracy of this model is 91.22%, while the Precision and F1 scores are 94.02% and 92.48%, respectively. Figure 9 shows the confusion matrix of the decision tree model.

Figure 9: Performance of decision tree.
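The workflow described in Sections 2.3 to 4.3 (splitting, standardization, fitting the three classifiers and computing metrics from the confusion matrix) can be sketched as follows. This is an illustrative sketch, not the code used to produce the figures above: the 80/20 split, the random seed, the default hyperparameters and the helper names `metrics_from_confusion` and `evaluate_models` are all assumptions, and it uses the dataset copy bundled with scikit-learn, so the scores it prints will generally differ from the reported ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def metrics_from_confusion(y_true, y_pred):
    """Accuracy, precision, recall and F1 from TP, TN, FP and FN."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def evaluate_models(test_size=0.2, seed=0):
    """Fit three classifiers on standardized features and score them."""
    data = load_breast_cancer()
    X = data.data
    # scikit-learn codes malignant as 0; flip so that malignant = 1,
    # matching the encoding M = 1, B = 0 (assumption about the paper).
    y = 1 - data.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    # Fit the scaler on the training set only, then apply it to both sets.
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)
    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(n_estimators=100,
                                                random_state=seed),
        "decision tree": DecisionTreeClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train_s, y_train)
        results[name] = metrics_from_confusion(y_test,
                                               model.predict(X_test_s))
    return results


results = evaluate_models()
for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

Fitting the scaler on the training portion only keeps information from the test set out of the preprocessing step, which is one common way to keep the reported scores honest.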


4.4 Comparison between different models
After completing the implementation of all algorithms for detecting breast cancer from the dataset, the results can be compared in Table 1 using the performance metrics discussed previously.

Models               Accuracy  Precision  Recall  F1 Score
Logistic regression  94.0%     89.6%      94.05%  91.7%
Random forest        94.7%     95.0%      94.0%   94.0%
Decision tree        91.2%     90.0%      91.0%   91.0%

Table 1: Comparison of Scores of Various Models

The table above shows that accuracy is highest for the Random Forest model, followed by logistic regression and finally the decision tree. Notably, the Random Forest classification algorithm gives the best results for our dataset, although this will not necessarily hold for every dataset.

5 Conclusion
Breast cancer is the most common cancer among women, and its being responsible for 69% of cancer-related deaths among women gives a glimpse of the magnitude of the problem. Early detection of breast cancer can improve the chances of survival of a large number of these people by allowing them to receive clinical treatment on time. This paper presents a comparative analysis of the logistic regression algorithm with random forest and decision tree in detecting breast cancer. There are other types of classification algorithms in machine learning, such as Nearest Neighbor, Support Vector Machines and Naïve Bayes, that we can use in order to try to achieve higher performance scores.

References
[1] WHO (2016). “Breast cancer: prevention and control.”

[2] Siegel et al. (2015). “Global cancer statistics, 2012.” CA: A Cancer Journal for Clinicians, 65(2):87–108.

[3] Z.-H., Jiang, Y. et al. (2003). “Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble.” IEEE Transactions on Information Technology in Biomedicine, 7(1):37–42.

[4] D., Walker, G., and Kadam et al. A. (2005). “Predicting breast cancer survivability: a comparison of three data mining methods.” Artificial Intelligence in Medicine, 34(2):113–127.

[5] L. (2018). “Research on logistic regression algorithm of breast cancer diagnose data by machine learning.” In 2018 International Conference on Robots Intelligent System (ICRIS), pages 157–160. IEEE.

[6] A. and Guven, E. (2006). “Predicting breast cancer survivability using data mining techniques.” Age, 58(13):10–110.

[7] A. L. (1959). “Some studies in machine learning using the game of checkers.” IBM Journal of Research and Development, 3(3):210–229.

[8] https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

[9] Hilbe (2015). “Practical Guide to Logistic Regression.”

[10] Hastie, T., and Tibshirani, R. (2001). “The Elements of Statistical Learning, volume 1.” Springer Series in Statistics, New York, NY, USA.

[11] D. W., Lemeshow, S., and Sturdivant, R. X. (2013). “Applied Logistic Regression, volume 398.” John Wiley & Sons.

[12] https://www.analyticsvidhya.com/blog/2021/02/machine-learning-101-decision-tree-algorithm

[13] D. M. (2011). “Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.”

[14] Robert (2019). “Numerical Python: Scientific Computing and Data Science Applications with Numpy, SciPy and Matplotlib, Second Edition.”
