In this article, a continuation of the one dedicated to Linear Regression, we will explain Logistic Regression, another Supervised Learning model, used for predicting discrete-valued outputs.
We will introduce the mathematical concepts underlying Logistic Regression and, step by step in Python, we will build a predictor for malignancy in breast cancer.
We will use the “Breast Cancer Wisconsin (Diagnostic)” (WBCD) dataset, provided by the University of Wisconsin and hosted by the UCI Machine Learning Repository.
Drs. Mangasarian, Street, and Wolberg, the creators of the database, intended to use 30 characteristics of individual breast cancer cells, obtained from a minimally invasive fine needle aspirate (FNA), to discriminate benign from malignant lumps of a breast mass using Machine Learning [1].
Using image analysis software called Xcyt, based on a curve-fitting algorithm, the authors were able to determine the boundaries of the cell nuclei from a digitized 640×400, 8-bit-per-pixel grayscale image of the FNA.
Figure 1: A magnified image of a malignant breast FNA. A curve-fitting algorithm was used to outline the cell
nuclei. (Figure from Mangasarian OL., Street WN., Wolberg. WH. Breast Cancer Diagnosis and Prognosis via
Linear Programming. Mathematical Programming Technical Report 94–10. 1994 Dec)
The 30 features describe characteristics of the cell nuclei present in the scanned
images.
As reported in their paper, the authors used machine learning and image processing techniques to accomplish malignancy outcome prediction. Still, an in-depth look at their computational approaches goes beyond the aim of this article.
In this post, we will not remake the work of Mangasarian, Street, and Wolberg. We will use their dataset to implement, in Python, a Logistic Regression predictor based on some of the 30 features of the WBCD. We will use the Benign/Malignant outcome to predict whether a new patient has a probability of developing malignancy, based on the FNA data. Furthermore, our predictor will be a good occasion for exposing some basic concepts of Logistic Regression and for implementing code around a biomedical problem: which features are essential in predicting malignant outcomes.
The first column of the dataset corresponds to the patient ID, while the last column
represents the diagnosis (the outcome can be “Benign” or “Malignant” based on the
type of diagnosis reported). The resulting dataset consists of 569 patients: 212
(37.2%) have an outcome of Malignancy, and 357 (62.7%) are Benign. Figure 2
describes the dataset structure:
In detail, the dataset consists of ten real-valued features computed for each cell
nucleus. They are 1) Radius (mean of distances from center to points on the
perimeter), 2) Texture (standard deviation of gray-scale values), 3) Perimeter, 4)
Area, 5) Smoothness (local variation in radius lengths), 6) Compactness (perimeter² /
area — 1.0), 7) Concavity (severity of concave portions of the contour), 8) Concave
points (number of concave portions of the contour), 9) Symmetry and 10) Fractal
Dimension (“coastline approximation” — 1). The ten real-valued features correspond
to the Mean, (values from columns 2 to column 11), to the Standard Errors (values
from columns 12 to 21), and the Worst or largest (mean of the three largest values),
(columns from 22 to 31). Column 32 contains the Benign/Malignant outcome.
For simplicity, I have formatted the WBCD as a comma-separated-values file. You can download the formatted version following this link (filename: wdbc.data.csv).
Before starting, I suggest readers follow the Machine Learning course at Coursera by Andrew Ng [3]. The course provides an excellent explanation of all the arguments treated in this post.
All the code presented in this article is written in Python 2.7. For the implementation
environment, I recommend the use of the Jupyter Notebook.
Step 1: Import packages from SciPy
Import all the packages required for the Python code of this article: Pandas, NumPy,
matplotlib, and SciPy. These packages belong to SciPy.org, which is a Python-based
ecosystem of open-source software for mathematics, science, and engineering. Also,
we will import seaborn, which is a Python data visualization library based on
matplotlib. Moreover, an object op from the scipy.optimize package will be created to optimize the Gradient.
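A minimal version of such an import cell could look like the following sketch (the alias op for scipy.optimize is the one used later for the Gradient optimization):

```python
# Imports from the SciPy.org ecosystem, plus seaborn for visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.optimize as op
```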
With pandas, we can visualize the first lines of the DataFrame content using the .head() method:
Table 1: Typing df.head(), will display the DataFrame content (partial output)
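Since the loading step itself is not reproduced here, a sketch of it might look like the following. In the article the DataFrame comes from the formatted csv (df = pd.read_csv("wdbc.data.csv")); here a tiny stand-in frame, whose column names and diagnosis mapping are assumptions for illustration, shows the same inspection step:

```python
import pandas as pd

# Stand-in for reading the formatted WBCD file with pd.read_csv;
# column names and values are illustrative assumptions
df = pd.DataFrame({
    "ID": [842302, 842517],
    "Radius": [17.99, 20.57],
    "Texture": [10.38, 17.77],
    "Diagnosis": ["M", "M"],
})

# Map the textual outcome to binary values (0 = Benign, 1 = Malignant)
df["Diagnosis"] = df["Diagnosis"].map({"B": 0, "M": 1})

# Display the first rows of the DataFrame
print(df.head())
```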
Step 3: Data Visualization and Bivariate Analysis
Now that we have a DataFrame containing our data, we want to identify which of the
30 features are important for our prediction model. Selecting the best features of a
dataset is a critical step in Machine Learning, to get a useful classification and to avoid a predictive bias.
One method is Visualization Per Pairs. As suggested by Ahmed Qassim in his article, an elegant display of the data is obtainable using the pairplot set of functions provided by seaborn. The Python code that produces the complete feature-combination plot is the following:
Here we have created the seaborn object sns; then, we have used the pandas DataFrame (df) produced in Step 2 as an argument for the sns pairplot method. Note that, by specifying the argument hue="Diagnosis", the pairplot method has access to the df column containing the Diagnosis values (0, 1). Moreover, pairplot will use the argument palette to color points blue (b) or red (r) based on the Diagnosis values.
To simplify the visualization, the map resulting from Code 3 shows the combinations of the first five pairs of features. They represent the mean values of the parameters Radius, Texture, Perimeter, Area, and Smoothness. Depending on your computer's capabilities, producing the complete feature map by running Code 3 could take some time.
Each tile represents a scatter plot of a pair of parameters. This visualization makes it easier to identify the elements that are essential for the classification. Some of these pairs, like Radius vs. Texture or Perimeter vs. Smoothness, show a good separation level with respect to the Diagnosis (blue points = Benign; red points = Malignant). Visualizing features in plots of pairs represents an excellent tool, primarily because of its immediacy.
Bivariate Analysis
# Bivariate Analysis
# Make a Features Correlation Matrix of the WBCD features
# Readapted from AN6U5

def features_correlation_matrix(df):
    from matplotlib import pyplot as plt
    from matplotlib import cm as cm
    fig = plt.figure()
    fig.set_size_inches(18.5, 10.5)
    ax1 = fig.add_subplot(111)

    cmap = cm.get_cmap('jet', 100)

    # interpolation='nearest' simply displays an image without
    # trying to interpolate between pixels if the display
    # resolution is not the same as the image resolution
    # The correlation is returned in absolute values:
    cax = ax1.imshow(df.corr().abs(), interpolation="nearest", \
                     cmap=cmap)
    ax1.grid(True)
    plt.title('Correlation Matrix of the WBCD features', fontsize=20)
    labels = list(df.columns)

    ax1.set_xticks(np.arange(len(labels)))
    ax1.set_yticks(np.arange(len(labels)))

    ax1.set_xticklabels(labels, fontsize=15, \
                        horizontalalignment="left", rotation='vertical')

    ax1.set_yticklabels(labels, fontsize=15)

    # Add a colorbar
    fig.colorbar(cax, ticks=[-1.0, -0.8, -0.6, -0.4, -0.2, 0, \
                             0.2, 0.4, 0.6, 0.8, 1])
    plt.show()


# Drop the Outcome column from df and copy it into df_features
df_features = df.drop(df.columns[-1], axis=1)
# Run the correlation_matrix function, using df_features as argument
features_correlation_matrix(df_features)
Take a glance at the resulting plot in Figure 4: we aim to understand how the 30 characteristics of the dataset relate to each other. Calculating the correlations between variables, we can observe that some of them are notably correlated (values greater than 0.9). The general assumption of the Bivariate Analysis is that features that are highly associated provide redundant information: for this reason, we want to eliminate them, avoiding a predictive bias.
As suggested in this post by Kalshtein Yael, dedicated to the WBCD analysis with R, we
should remove all features with a correlation higher than 0.9, keeping those with the
lower mean. Code 5 is a readaptation of a Python code example from the Chris Albon book [4]:
['Perimeter',
'Area',
'Concave_points',
'Perimeter_SE',
'Area_SE',
'W_Radius',
'W_Texture',
'W_Perimeter',
'W_Area',
'W_Concave_points']
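A selection in this spirit can be sketched with the upper-triangle recipe; the helper name and the tiny demo frame below are mine, and the article's Code 5 additionally keeps, from each correlated pair, the feature with the lower mean:

```python
import numpy as np
import pandas as pd

def redundant_features(df, threshold=0.9):
    """Names of columns whose absolute correlation with an
    earlier column exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle, so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

# Tiny demo: b is a rescaled copy of a, c is only weakly related to both
demo = pd.DataFrame({"a": [1, 2, 3, 4],
                     "b": [2, 4, 6, 8],
                     "c": [4, 1, 3, 2]})
print(redundant_features(demo))  # ['b']
```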
Now we want to eliminate these features from our DataFrame and re-plot the df with
the remaining 20:
Code 6: Delete the redundant features from the DataFrame and re-plot only the non-redundant ones
We have now removed the highly correlated feature pairs, so only features with low correlation, like Radius and Texture, are kept. We can directly access the correlation matrix values by typing in a new notebook cell:
Thanks to the Visualization Per Pairs and the Bivariate Analysis, we now know which features deserve to be considered in our analysis, so let's go ahead with the construction of our Logistic Regression outcome predictor. In the next Steps, we will write code to predict the diagnosis outcome for a pair of features chosen on the basis of their non-redundancy.
Figure 6. A: Example of binary classification of malignancy prediction in breast cancer. B: The Logistic
Regression Hypothesis is a non-linear function.
The plot in Figure 6A explains why we cannot apply the linear Hypothesis to binary
classification. Imagine that we want to plot our samples with an outcome that can be
Benign or Malignant (red circles). We could apply the Linear Hypothesis model to
separate the samples into two distinct groups. In attempting to classify binary values as 0 and 1, Linear Regression tries to predict all values greater than 0.5 as “1” and all values less than 0.5 as “0”, because the threshold classifier output for hθ(x) is at 0.5.
Theoretically, the Linear Regression model that we have chosen could also work for the binary classification, as the blue line demonstrates. But let's see what happens if we insert another sample with a Malignant diagnosis (the green circle in Figure 6A). The Linear Regression Hypothesis adapts the line to include the new sample (the magenta line).
Still, this Hypothesis model couldn't work correctly for all the new samples that we are going to add, because the Linear Hypothesis seems not to add further information to our predictions. That happens because classification is not a linear function. What we need is a new Hypothesis hθ(x) that can calculate the probability that the Diagnosis output is 0 or 1: this new hypothesis model is the Logistic Regression.
The Logistic Regression Hypothesis model in Equation 1 looks similar to that of the Linear Regression. But the real difference is in the g function, which uses the product of the transposed θ vector with the x vector (we will call this product z) as its argument. The g function is defined as in Equation 2:
The Python Code for the implementation of the Logistic Function is the following:
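A vectorized NumPy sketch of g, under the name sigmoid that reappears in the list of functions in Step 8, could be:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), for scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                    # 0.5
print(sigmoid(np.array([-10, 10])))  # close to [0, 1]
```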
Again, we could use the same Cost Function that we used for Linear Regression; it would look like the Cost Function of Equation 4 in the article dedicated to Linear Regression, except for the sigmoid function that characterizes the Hypothesis model hθ(x).
The y vector, as usual, represents the output, with the difference that here y is a vector of binary outcomes (0/1, or Benign/Malignant), not a vector of continuous-valued outputs:
Concretely, we cannot use the Linear Regression Cost Function for the Logistic. The
non-linearity of the sigmoid function, which handles hθ(x), leads to a J(θ) having a
non-convex pattern, and it will look like the curve in the graph of Figure 7A:
This non-convex J(θ) is a function with many local optima. It’s not guaranteed that the
Gradient descent will converge to the global minimum. What we want is a convex J(θ)
like that in Figure 7B, which is a function that converges to the global minimum. So,
we have to write the Cost Function in a way that guarantees a convex J(θ):
Implementing Equation 4, the Cost Function for y = 1, calculated as −log(hθ(x)), looks like the red curve of Figure 8:
Because we have two binary conditions for the Benign or Malignant outcome (y), the Cost Function in Equation 4 states what the cost of our Hypothesis prediction is with respect to y.
If y = 1 and we predict hθ(x) = 1, the cost is going to be 0, because our Hypothesis matches y, while if our prediction is hθ(x) = 0, we end up paying a very large cost.
In the case of y = 0, we have the opposite: if y = 0 and we predict hθ(x) = 0, the cost is going to be 0, because our Hypothesis matches y, while if our prediction is hθ(x) = 1, we end up paying a very large cost.
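The two branches combine into the single convex expression J(θ) = −(1/m) Σ [y·log(hθ(x)) + (1 − y)·log(1 − hθ(x))]; a vectorized sketch, using the calcCostFunction name listed in Step 8, could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calcCostFunction(theta, X, y):
    """Convex Logistic Regression cost J(theta), vectorized."""
    m = y.size
    h = sigmoid(X.dot(theta))
    # y picks the -log(h) branch; (1 - y) picks the -log(1 - h) branch
    return (-1.0 / m) * (y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h)))

# With theta = 0 every prediction is 0.5, so the cost is log(2) ~ 0.693
X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
print(calcCostFunction(np.zeros(2), X, y))
```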
But in the Machine Learning scenario, Gradient Descent is not the only algorithm for minimizing the Cost Function: Conjugate Gradient, BFGS, L-BFGS, and TNC, for example, are some of the more sophisticated algorithms for minimizing the Cost Function and automating the search for θ.
Based on different statistics, these algorithms try to optimize the Gradient Function, which is computed from the difference between the actual vector y of the dataset and the h vector (the prediction), to learn how to find the minimum of J. Generalizing, an optimization algorithm will repeat its updates until it converges. Importantly, the updating of θ always has to be simultaneous.
Optimization algorithms belong to the field of Numerical Optimization, and an in-depth study of their use goes beyond this article. The Gradient function that we are going to implement looks identical to that used for Linear Regression. But here too, as for the Cost Function, the difference is in the definition of hθ(x), which needs the sigmoid function:
The equation for the vectorized implementation of the Gradient Function is:
and the Python code for calculating the Gradient is the following:
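Following the vectorized equation, ∂J/∂θ = (1/m) Xᵀ(h − y), a sketch under the calcGradient name from Step 8 (with sigmoid as defined earlier) could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calcGradient(theta, X, y):
    """Vectorized Gradient of J(theta): (1/m) * X^T (h - y)."""
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * X.T.dot(h - y)
```

With θ = 0 the prediction h is 0.5 everywhere, so the gradient reduces to (1/m) Xᵀ(0.5 − y), a handy sanity check.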
In Step 9, we will see how to optimize the Gradient Function, using one of the
algorithms provided by scipy.optimize.
sigmoid(z)
calcCostFunction(theta, X, y)
calcGradient(theta, X, y)
FeatureScalingNormalization(X)
CalcAccuracy(theta, X)
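Of the helpers listed above, CalcAccuracy is the only one whose body has not been discussed yet; a plausible sketch (an assumption, not the article's exact code) classifies each example with the usual 0.5 threshold on hθ(x):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def CalcAccuracy(theta, X):
    """Predict 1 where h_theta(x) >= 0.5, else 0."""
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)
```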
Now we will write the code that wraps all of these functions together, to predict outcomes of Malignancy based on two of the 20 non-redundant features of our dataset. For the choice of the features, we can refer to one of the pairs that we found in Step 3, Radius and Texture, which have a correlation score of 0.32. The following code produces the features NumPy array X and the output NumPy vector y from the DataFrame df:
The yellow dots represent the Benign samples, the black dots the Malignant ones.
Now let's Normalize and Scale our data. We also need to collect mu, the average value of X in our training set, and sigma, the Standard Deviation. In a new notebook cell, let's type:
Now, we have to update the array X by adding the intercept term, a vector of ones, with np.vstack:
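For the features-in-rows layout used by the normalization code in this post, np.vstack prepends the intercept entries as a row of ones of length m (the sample values below are illustrative):

```python
import numpy as np

# Two normalized features stored as rows; m = 4 training examples
X = np.array([[0.1, -0.2, 0.3, 0.4],
              [1.2,  0.5, -0.7, 0.0]])

# Prepend the intercept term: a row of ones of length m
X = np.vstack((np.ones(X.shape[1]), X))
print(X.shape)  # (3, 4)
```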
Testing
Let's do some tests: in order to test our code, let's try to calculate the Cost Function and the Gradient, starting with θ = [0, 0, 0]:
Code 17: 1st test; compute Cost Function and Gradient starting with initial θ = 0
Code 18: 2nd test; compute Cost Function and Gradient starting with initial not-zero θ
The optimization algorithm that we will use for finding θ is BFGS, which is based on the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno [5]. Code 19 will call the SciPy minimize function, which internally will use the BFGS method:
If we don't specify the method we want to use in the argument method of the .minimize function, the BFGS algorithm is used as the default. Another method, TNC, uses a truncated Newton algorithm for minimizing a function with variables subject to bounds. Users can experiment with the various optimization algorithms available through the .minimize function in SciPy. See the SciPy documentation page for more information about all the optimization methods available with .minimize. The output produced by Code 19 is the following:
fun: 0.2558201286363281
hess_inv: array([[12.66527269, -1.22908954, -2.82539649],
[-1.22908954, 71.16261306, 7.07658929],
[-2.82539649, 7.07658929, 13.39777084]])
jac: array([7.36409258e-07, 3.24760454e-08, 9.55291040e-07])
message: 'Optimization terminated successfully.'
nfev: 20
nit: 19
njev: 20
status: 0
success: True
x: array([-0.70755981, 3.72528774, 0.93824469])
Decision Boundary
I think that it’s important to highlight that, surprisingly, in spite of Logistic Regression
uses a non-linear (sigmoid) function into its Hypothesis model, the Decision Boundary
is linear!
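The boundary is the set of points where hθ(x) = 0.5, i.e. where θᵀx = 0, which for two features is the straight line x₂ = −(θ₀ + θ₁x₁)/θ₂. A quick check, using the θ printed in the optimizer output above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta from the BFGS output shown earlier in the post
theta = np.array([-0.70755981, 3.72528774, 0.93824469])

# On the boundary theta0 + theta1*x1 + theta2*x2 = 0, hence:
x1 = np.linspace(-2, 2, 5)                   # normalized Radius values
x2 = -(theta[0] + theta[1] * x1) / theta[2]  # normalized Texture on the line

# Every point on the line maps to probability ~0.5
probs = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)
print(probs)
```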
Now we want to calculate the accuracy of our algorithm. The function CalcAccuracy
described in Step 8 will do this job:
# Calculate accuracy
p = CalcAccuracy(theta, X)
p = (p == y) * 100
print ("Train Accuracy:", p.mean())
Make a prediction
Now that we have tested our algorithm and evaluated its accuracy, we want to make
predictions. The following code represents a possible example of a Query: we want to
know what is the outcome for a Radius = 18.00 and a Texture = 10.12:
# Perform a Query:
# Predict the risk of malignancy for Radius = 18.00 and Texture = 10.12

query = np.asarray([1, 18.00, 10.12])

# Scale and Normalize the query
query_Normalized = \
    np.asarray([1, ((query[1]-float(mu[0]))/float(sigma[0])),\
                ((query[2]-float(mu[1]))/float(sigma[1]))])

prediction = sigmoid(query_Normalized.dot(theta))
prediction
Code 22: Example of Query; predict the risk of malignancy for Radius = 18.00 and Texture = 10.12
Note that we have to normalize the Query with mu and sigma for the Scaling and Normalization. The predicted outcome is 0.79, which means that for Radius = 18 and Texture = 10.12, the risk of malignancy is close to 1.
def FeatureScalingNormalizationMultipleVariables(X):
    # N.B.: this code is adapted for multiple variables

    # Initialize the following variables:
    # Make a copy of the X vector and call it X_norm
    X_norm = X

    # mu: it will contain the average
    # value of X in the training set
    mu = np.zeros(X.shape[1])

    # sigma: it will contain the Standard Deviation of X
    sigma = np.zeros(X.shape[1])

    mu = np.vstack((X[0].mean(), \
                    X[1].mean(), \
                    X[2].mean()))
    # The Standard Deviation calculation with NumPy
    # requires the argument "degrees of freedom" = 1
    sigma = np.vstack((X[0].std(ddof=1),\
                       X[1].std(ddof=1),\
                       X[2].std(ddof=1)))

    # number of training examples
    m = X.shape[1]

    # Make a vector of size m with the mu values
    mu_matrix = np.multiply(np.ones(m), mu).T

    # Make a vector of size m with the sigma values
    sigma_matrix = np.multiply(np.ones(m), sigma).T

    # Apply the Feature Scaling Normalization formula
    X_norm = np.subtract(X, mu).T
    X_norm = X_norm / sigma.T

    return [X_norm, mu, sigma]
Code 23: The Feature Scaling and Normalization function for multiple variables
Other minor revisions concern the code for uploading the X vector and the code for the Query. The following code reassembles what we have done until now, extending all the functions to Logistic Regression with multiple variables. Copy and paste the following code into a new Jupyter Notebook cell:
Code 24: The complete Logistic Regression code for multiple variables
Code 24 will predict the risk of malignancy for Radius = 5.00, Texture = 1.10, and W_Concave_points = 0.4. After all the calculations, it will produce the following output:
The predicted probability of malignancy for these input values is close to 1. By modifying the Query, you can experiment for yourself with how the probability changes.
Moreover, the Machine Learning scenario with Python is enriched by the presence of many powerful packages (e.g., TensorFlow, Scikit-learn, and others that, for reasons of space, we haven't mentioned in this post), which provide excellently optimized classifications and predictions on data.
While Machine Learning, equipped with Python and all its accessories, can represent
the path toward the future of preventive and diagnostic Medicine, limitations in
comprehending biological variables could make this path an awkward road.
The Wisconsin Breast Cancer (Diagnostic) Data Set, with its 569 patients and 30 features, offers a rich assortment of parameters for classification and for this reason represents a perfect example for Machine Learning applications. However, many of these features seem to be redundant, and the definite impact of some of them on classification and prediction still remains unknown.
We have introduced a Bivariate Analysis in Step 3 to reduce the number of redundant features. An in-depth discussion of the role of these features in the prediction deserves a dedicated article.
References:
1. Mangasarian, Olvi & Street, Nick & Wolberg, William. (1995). Breast Cancer Diagnosis and Prognosis Via Linear Programming. Operations Research. 43. 10.1287/opre.43.4.570.
2. https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-
detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-
breast.html#references
3. Andrew Ng, Machine Learning, Coursera.
4. Chris Albon, Machine Learning with Python Cookbook, O'Reilly, ISBN-13: 978-1491989388.