In this article, a continuation of the one dedicated to Linear Regression, we will explain Logistic Regression, another Supervised Learning model, used for predicting discrete-valued outputs.
We will introduce the mathematical concepts underlying Logistic Regression and, step by step in Python, we will build a predictor for malignancy in breast cancer.
We will use the “Breast Cancer Wisconsin (Diagnostic)” (WBCD) dataset, provided by the University of Wisconsin and hosted by the UCI Machine Learning Repository.
Drs. Mangasarian, Street, and Wolberg, the creators of the database, intended to use 30 characteristics of individual breast cancer cells, obtained from a minimally invasive fine needle aspirate (FNA), to discriminate benign from malignant lumps of a breast mass using Machine Learning [1].
Using image analysis software called Xcyt, based on a curve-fitting algorithm, the authors were able to determine the boundaries of the cell nuclei from a digitized 640×400, 8-bit-per-pixel grayscale image of the FNA.
Figure 1: A magnified image of a malignant breast FNA. A curve-fitting algorithm was used to outline the cell
nuclei. (Figure from Mangasarian OL., Street WN., Wolberg. WH. Breast Cancer Diagnosis and Prognosis via
Linear Programming. Mathematical Programming Technical Report 94–10. 1994 Dec)
The 30 features describe characteristics of the cell nuclei present in the scanned
images.
As reported in their paper, the authors used machine learning and image processing techniques to accomplish malignancy outcome prediction. Still, an in-depth look at their computational approaches goes beyond the aim of this article.
In this post, we will not remake the work of Mangasarian, Street, and Wolberg. We will use their dataset to implement, in Python, a Logistic Regression predictor based on some of the 30 features of the WBCD. We will use the Benign/Malignant outcome to predict whether a new patient has a probability of developing malignancy, based on the FNA data. Furthermore, our predictor will be a good occasion for exposing some basic concepts of Logistic Regression and for implementing code around a biomedical problem: which features are essential in predicting malignant outcomes.
The first column of the dataset corresponds to the patient ID, while the last column
represents the diagnosis (the outcome can be “Benign” or “Malignant” based on the
type of diagnosis reported). The resulting dataset consists of 569 patients: 212
(37.2%) have an outcome of Malignancy, and 357 (62.7%) are Benign. Figure 2
describes the dataset structure:
In detail, the dataset consists of ten real-valued features computed for each cell
nucleus. They are 1) Radius (mean of distances from center to points on the
perimeter), 2) Texture (standard deviation of gray-scale values), 3) Perimeter, 4)
Area, 5) Smoothness (local variation in radius lengths), 6) Compactness (perimeter² /
area — 1.0), 7) Concavity (severity of concave portions of the contour), 8) Concave
points (number of concave portions of the contour), 9) Symmetry and 10) Fractal
Dimension (“coastline approximation” — 1). The ten real-valued features correspond
to the Mean, (values from columns 2 to column 11), to the Standard Errors (values
from columns 12 to 21), and the Worst or largest (mean of the three largest values),
(columns from 22 to 31). Column 32 contains the Benign/Malignant outcome.
For simplicity, I have formatted the WBCD as a comma-separated-values file. You can download the formatted version following this link (filename: wdbc.data.csv).
Before starting, I suggest readers follow the Machine Learning course at Coursera by Andrew Ng [3]. The course provides an excellent explanation of all the arguments treated in this post.
All the code presented in this article is written in Python 2.7. For the implementation
environment, I recommend the use of the Jupyter Notebook.
Step 1: Import packages from SciPy
Import all the packages required for the Python code of this article: Pandas, NumPy,
matplotlib, and SciPy. These packages belong to SciPy.org, which is a Python-based
ecosystem of open-source software for mathematics, science, and engineering. Also,
we will import seaborn, which is a Python data visualization library based on
matplotlib. Moreover, an object op from the scipy.optimize package will be created to optimize the Gradient.
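A minimal version of such an import cell could look like the following sketch (the alias op for scipy.optimize is the one used later for the Gradient optimization):

```python
# Imports from the SciPy.org ecosystem, plus seaborn for visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.optimize as op
```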
With pandas, we can visualize the first lines of the DataFrame content using the .head() method:
Table 1: Typing df.head(), will display the DataFrame content (partial output)
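Since the loading step itself is not reproduced here, a sketch of it might look like the following. In the article the DataFrame comes from the formatted csv (df = pd.read_csv("wdbc.data.csv")); here a tiny stand-in frame, whose column names and diagnosis mapping are assumptions for illustration, shows the same inspection step:

```python
import pandas as pd

# Stand-in for reading the formatted WBCD file with pd.read_csv;
# column names and values are illustrative assumptions
df = pd.DataFrame({
    "ID": [842302, 842517],
    "Radius": [17.99, 20.57],
    "Texture": [10.38, 17.77],
    "Diagnosis": ["M", "M"],
})

# Map the textual outcome to binary values (0 = Benign, 1 = Malignant)
df["Diagnosis"] = df["Diagnosis"].map({"B": 0, "M": 1})

# Display the first rows of the DataFrame
print(df.head())
```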
Step 3: Data Visualization and Bivariate Analysis
Now that we have a DataFrame containing our data, we want to identify which of the
30 features are important for our prediction model. Selecting the best features of a
dataset is a critical step in Machine Learning, to get a useful classification and to avoid a predictive bias.
One method is Visualization Per Pairs. As suggested by Ahmed Qassim in his article, an elegant display of the data is obtainable using the pairplot set of functions provided by seaborn. The Python code that produces the complete feature-combination plot is the following:
Here we have created the seaborn object sns; then, we have used the pandas DataFrame (df) produced in Step 2 as an argument for the sns pairplot method. Note that, by specifying the argument hue="Diagnosis", the pairplot method has access to the df column containing the Diagnosis values (0, 1). Moreover, pairplot will use the argument palette to color points blue (b) or red (r) based on the Diagnosis values.
To simplify the visualization, the map resulting from Code 3 shows the combinations of the first five pairs of features. They represent the mean values of the parameters Radius, Texture, Perimeter, Area, and Smoothness. Depending on your computer's capabilities, producing the complete feature map by running Code 3 could take some time.
Each tile represents a scatter plot of a pair of parameters. This visualization makes it easier to identify the elements that are essential for the classification. Some of these pairs, like Radius vs. Texture or Perimeter vs. Smoothness, show a good separation level with respect to the Diagnosis (blue points = Benign; red points = Malignant). Visualizing features in plots of pairs represents an excellent tool, primarily because of its immediacy.
Bivariate Analysis
# Bivariate Analysis
# Make a Features Correlation Matrix of the WBCD features
# Readapted from AN6U5

def features_correlation_matrix(df):
    from matplotlib import pyplot as plt
    from matplotlib import cm as cm
    fig = plt.figure()
    fig.set_size_inches(18.5, 10.5)
    ax1 = fig.add_subplot(111)

    cmap = cm.get_cmap('jet', 100)

    # interpolation='nearest' simply displays an image without
    # trying to interpolate between pixels if the display
    # resolution is not the same as the image resolution
    # The correlation is returned in absolute values:
    cax = ax1.imshow(df.corr().abs(), interpolation="nearest", \
                     cmap=cmap)
    ax1.grid(True)
    plt.title('Correlation Matrix of the WBCD features', fontsize=20)
    labels = list(df.columns)

    ax1.set_xticks(np.arange(len(labels)))
    ax1.set_yticks(np.arange(len(labels)))

    ax1.set_xticklabels(labels, fontsize=15, \
                        horizontalalignment="left", rotation='vertical')

    ax1.set_yticklabels(labels, fontsize=15)

    # Add a colorbar
    fig.colorbar(cax, ticks=[-1.0, -0.8, -0.6, -0.4, -0.2, 0, \
                             0.2, 0.4, 0.6, 0.8, 1])
    plt.show()


# Drop the Outcome column from df and copy it into df_features
df_features = df.drop(df.columns[-1], axis=1)
# Run the correlation_matrix function, using df_features as argument
features_correlation_matrix(df_features)
Take a glance at the resulting plot in Figure 4: we aim to understand how the 30 characteristics of the dataset relate to each other. Calculating the correlations between variables, we can observe that some of them are notably correlated (values greater than 0.9). The general assumption of the Bivariate Analysis is that features that are highly associated provide redundant information: for this reason, we want to eliminate them, avoiding a predictive bias.
As suggested in this post by Kalshtein Yael, dedicated to the WBCD analysis with R, we
should remove all features with a correlation higher than 0.9, keeping those with the
lower mean. Code 5 is a readaptation of a Python code example from the Chris Albon book [4]:
['Perimeter',
'Area',
'Concave_points',
'Perimeter_SE',
'Area_SE',
'W_Radius',
'W_Texture',
'W_Perimeter',
'W_Area',
'W_Concave_points']
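A selection in this spirit can be sketched with the upper-triangle recipe; the helper name and the tiny demo frame below are mine, and the article's Code 5 additionally keeps, from each correlated pair, the feature with the lower mean:

```python
import numpy as np
import pandas as pd

def redundant_features(df, threshold=0.9):
    """Names of columns whose absolute correlation with an
    earlier column exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle, so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

# Tiny demo: b is a rescaled copy of a, c is only weakly related to both
demo = pd.DataFrame({"a": [1, 2, 3, 4],
                     "b": [2, 4, 6, 8],
                     "c": [4, 1, 3, 2]})
print(redundant_features(demo))  # ['b']
```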
Now we want to eliminate these features from our DataFrame and re-plot the df with
the remaining 20:
Code 6: Delete the redundant features from the DataFrame and re-plot only the non-redundant ones
We have now removed the highly correlated feature pairs, so only features with low correlation, like Radius and Texture, are kept. We can directly access the correlation matrix values by typing in a new notebook cell:
Thanks to the Visualization Per Pairs and the Bivariate Analysis, we now know which features deserve to be considered in our analysis, so let's go ahead with the construction of our Logistic Regression outcome predictor. In the next Steps, we will write code to predict the diagnosis outcome for a pair of features chosen on the basis of their non-redundancy.
Figure 6. A: Example of binary classification of malignancy prediction in breast cancer. B: The Logistic
Regression Hypothesis is a non-linear function.
The plot in Figure 6A explains why we cannot apply the linear Hypothesis to binary
classification. Imagine that we want to plot our samples with an outcome that can be
Benign or Malignant (red circles). We could apply the Linear Hypothesis model to
separate the samples into two distinct groups. In attempting to classify binary values as 0 and 1, Linear Regression tries to predict all values greater than 0.5 as “1” and all values less than 0.5 as “0”, because the threshold classifier output for hθ(x) is at 0.5.
Theoretically, the Linear Regression model that we have chosen could also work for the binary classification, as the blue line demonstrates. But let's see what happens if we insert another sample with a Malignant diagnosis (the green circle in Figure 6A). The Linear Regression Hypothesis adapts the line to include the new sample (the magenta line).
Still, this Hypothesis model couldn't work correctly for all the new samples that we are going to add, because the Linear Hypothesis seems not to add further information to our predictions. That happens because classification is not a linear function. What we need is a new Hypothesis hθ(x) that can calculate the probability that the Diagnosis output is 0 or 1: this new hypothesis model is the Logistic Regression.
The Logistic Regression Hypothesis model in Equation 1 looks similar to that of the Linear Regression. But the real difference is in the g function, which uses the product of the transposed θ vector with the x vector (we will call this product z) as its argument. The g function is defined as in Equation 2:
The Python Code for the implementation of the Logistic Function is the following:
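A vectorized NumPy sketch of g, under the name sigmoid that reappears in the list of functions in Step 8, could be:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), for scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                    # 0.5
print(sigmoid(np.array([-10, 10])))  # close to [0, 1]
```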
Again, we could use the same Cost Function that we used for Linear Regression; it would look like the Cost Function of Equation 4 in the article dedicated to Linear Regression, except for the sigmoid function that characterizes the Hypothesis model hθ(x).
The y vector, as usual, represents the output, with the difference that here y is a vector of binary outcomes (0/1, or Benign/Malignant), not a vector of continuous-valued outputs:
Concretely, we cannot use the Linear Regression Cost Function for the Logistic. The
non-linearity of the sigmoid function, which handles hθ(x), leads to a J(θ) having a
non-convex pattern, and it will look like the curve in the graph of Figure 7A:
This non-convex J(θ) is a function with many local optima. It’s not guaranteed that the
Gradient descent will converge to the global minimum. What we want is a convex J(θ)
like that in Figure 7B, which is a function that converges to the global minimum. So,
we have to write the Cost Function in a way that guarantees a convex J(θ):
Implementing Equation 4, the Cost Function for y = 1, calculated as −log(hθ(x)), looks like the red curve of Figure 8:
Because we have two binary conditions for the Benign or Malignant outcome (y), the Cost Function in Equation 4 states what the cost of our Hypothesis prediction is with respect to y.
If y = 1 and we predict hθ(x) = 1, the cost is going to be 0, because our Hypothesis matches y, while if our prediction is hθ(x) = 0, we end up paying a very large cost.
In the case of y = 0, we have the opposite: if y = 0 and we predict hθ(x) = 0, the cost is going to be 0, because our Hypothesis matches y, while if our prediction is hθ(x) = 1, we end up paying a very large cost.
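The two branches combine into the single convex expression J(θ) = −(1/m) Σ [y·log(hθ(x)) + (1 − y)·log(1 − hθ(x))]; a vectorized sketch, using the calcCostFunction name listed in Step 8, could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calcCostFunction(theta, X, y):
    """Convex Logistic Regression cost J(theta), vectorized."""
    m = y.size
    h = sigmoid(X.dot(theta))
    # y picks the -log(h) branch; (1 - y) picks the -log(1 - h) branch
    return (-1.0 / m) * (y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h)))

# With theta = 0 every prediction is 0.5, so the cost is log(2) ~ 0.693
X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
print(calcCostFunction(np.zeros(2), X, y))
```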
But in the Machine Learning scenario, Gradient Descent is not the only algorithm for minimizing the Cost Function: Conjugate Gradient, BFGS, L-BFGS, and TNC, for example, are some of the more sophisticated algorithms for minimizing the Cost Function and automating the search for θ.
Based on different statistics, these algorithms try to optimize the Gradient Function, which is computed from the difference between the actual vector y of the dataset and the h vector (the prediction), to learn how to find the minimum of J. Generalizing, an optimization algorithm will repeat its updates until it converges. Importantly, the updating of θ always has to be simultaneous.
Optimization algorithms belong to the field of Numerical Optimization, and an in-depth study of their use goes beyond this article. The Gradient function that we are going to implement looks identical to that used for Linear Regression. But here too, as for the Cost Function, the difference is in the definition of hθ(x), which needs the sigmoid function:
The equation for the vectorized implementation of the Gradient Function is:
and the Python code for calculating the Gradient is the following:
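Following the vectorized equation, ∂J/∂θ = (1/m) Xᵀ(h − y), a sketch under the calcGradient name from Step 8 (with sigmoid as defined earlier) could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calcGradient(theta, X, y):
    """Vectorized Gradient of J(theta): (1/m) * X^T (h - y)."""
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * X.T.dot(h - y)
```

With θ = 0 the prediction h is 0.5 everywhere, so the gradient reduces to (1/m) Xᵀ(0.5 − y), a handy sanity check.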
In Step 9, we will see how to optimize the Gradient Function, using one of the
algorithms provided by scipy.optimize.
sigmoid(z)
calcCostFunction(theta, X, y)
calcGradient(theta, X, y)
FeatureScalingNormalization(X)
CalcAccuracy(theta, X)
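Of the helpers listed above, CalcAccuracy is the only one whose body has not been discussed yet; a plausible sketch (an assumption, not the article's exact code) classifies each example with the usual 0.5 threshold on hθ(x):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def CalcAccuracy(theta, X):
    """Predict 1 where h_theta(x) >= 0.5, else 0."""
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)
```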
Now we will write the code that wraps all of these functions together, to predict outcomes of Malignancy based on two of the 20 non-redundant features of our dataset. For the choice of the features, we can refer to one of the pairs that we found in Step 3, Radius and Texture, which have a correlation score of 0.32. The following code produces the features NumPy array X and the output NumPy vector y from the DataFrame df:
The yellow dots represent the Benign samples, the black dots the Malignant ones.
Now let's Normalize and Scale our data. We also need to collect mu, the average value of X in our training set, and sigma, the Standard Deviation. In a new notebook cell, let's type:
Now, we have to update the array X by adding the intercept term, a vector of ones, with np.vstack:
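For the features-in-rows layout used by the normalization code in this post, np.vstack prepends the intercept entries as a row of ones of length m (the sample values below are illustrative):

```python
import numpy as np

# Two normalized features stored as rows; m = 4 training examples
X = np.array([[0.1, -0.2, 0.3, 0.4],
              [1.2,  0.5, -0.7, 0.0]])

# Prepend the intercept term: a row of ones of length m
X = np.vstack((np.ones(X.shape[1]), X))
print(X.shape)  # (3, 4)
```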
Testing
Let's do some tests: in order to test our code, let's try to calculate the Cost Function and the Gradient, starting with θ = [0, 0, 0]:
Code 17: 1st test; compute Cost Function and Gradient starting with initial θ = 0
Code 18: 2nd test; compute Cost Function and Gradient starting with initial not-zero θ
The optimization algorithm that we will use for finding θ is BFGS, which is based on the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno [5]. Code 19 will call the SciPy minimize function, which internally will use the BFGS method:
If we don't specify the method we want to use in the argument method of the .minimize function, the BFGS algorithm is used as the default. Another method, TNC, uses a truncated Newton algorithm for minimizing a function with variables subject to bounds. Users can experiment with the various optimization algorithms available through the .minimize function in SciPy. See the SciPy documentation page for more information about all the optimization methods available with .minimize. The output produced by Code 19 is the following:
fun: 0.2558201286363281
hess_inv: array([[12.66527269, -1.22908954, -2.82539649],
[-1.22908954, 71.16261306, 7.07658929],
[-2.82539649, 7.07658929, 13.39777084]])
jac: array([7.36409258e-07, 3.24760454e-08, 9.55291040e-07])
message: 'Optimization terminated successfully.'
nfev: 20
nit: 19
njev: 20
status: 0
success: True
x: array([-0.70755981, 3.72528774, 0.93824469])
Decision Boundary
I think that it’s important to highlight that, surprisingly, in spite of Logistic Regression
uses a non-linear (sigmoid) function into its Hypothesis model, the Decision Boundary
is linear!
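The boundary is the set of points where hθ(x) = 0.5, i.e. where θᵀx = 0, which for two features is the straight line x₂ = −(θ₀ + θ₁x₁)/θ₂. A quick check, using the θ printed in the optimizer output above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta from the BFGS output shown earlier in the post
theta = np.array([-0.70755981, 3.72528774, 0.93824469])

# On the boundary theta0 + theta1*x1 + theta2*x2 = 0, hence:
x1 = np.linspace(-2, 2, 5)                   # normalized Radius values
x2 = -(theta[0] + theta[1] * x1) / theta[2]  # normalized Texture on the line

# Every point on the line maps to probability ~0.5
probs = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)
print(probs)
```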
Now we want to calculate the accuracy of our algorithm. The function CalcAccuracy
described in Step 8 will do this job:
# Calculate accuracy
p = CalcAccuracy(theta, X)
p = (p == y) * 100
print ("Train Accuracy:", p.mean())
Make a prediction
Now that we have tested our algorithm and evaluated its accuracy, we want to make
predictions. The following code represents a possible example of a Query: we want to
know what is the outcome for a Radius = 18.00 and a Texture = 10.12:
# Perform a Query:
# Predict the risk of malignancy for Radius = 18.00 and Texture = 10.12

query = np.asarray([1, 18.00, 10.12])

# Scale and Normalize the query
query_Normalized = \
    np.asarray([1, ((query[1]-float(mu[0]))/float(sigma[0])),\
                ((query[2]-float(mu[1]))/float(sigma[1]))])

prediction = sigmoid(query_Normalized.dot(theta))
prediction
Code 22: Example of Query; predict the risk of malignancy for Radius = 18.00 and Texture = 10.12
Note that we have to normalize the Query with mu and sigma for the Scaling and Normalization. The predicted outcome is 0.79, which means that for Radius = 18 and Texture = 10.12, the risk of malignancy is close to 1.
def FeatureScalingNormalizationMultipleVariables(X):
    # N.B.: this code is adapted for multiple variables

    # Initialize the following variables:
    # Make a copy of the X vector and call it X_norm
    X_norm = X

    # mu: it will contain the average
    # value of X in the training set
    mu = np.zeros(X.shape[1])

    # sigma: it will contain the Standard Deviation of X
    sigma = np.zeros(X.shape[1])

    mu = np.vstack((X[0].mean(), \
                    X[1].mean(), \
                    X[2].mean()))
    # The Standard Deviation calculation with NumPy
    # requires the argument "degrees of freedom" = 1
    sigma = np.vstack((X[0].std(ddof=1),\
                       X[1].std(ddof=1),\
                       X[2].std(ddof=1)))

    # number of training examples
    m = X.shape[1]

    # Make a vector of size m with the mu values
    mu_matrix = np.multiply(np.ones(m), mu).T

    # Make a vector of size m with the sigma values
    sigma_matrix = np.multiply(np.ones(m), sigma).T

    # Apply the Feature Scaling Normalization formula
    X_norm = np.subtract(X, mu).T
    X_norm = X_norm / sigma.T

    return [X_norm, mu, sigma]
Code 23: The Feature Scaling and Normalization function for multiple variables
Other minor revisions concern the code for uploading the X vector and the code for the Query. The following code reassembles what we have done until now, extending all the functions to Logistic Regression with multiple variables. Copy and paste the following code into a new Jupyter Notebook cell:
Code 24: The complete Logistic Regression code for multiple variables
Code 24 will predict the risk of malignancy for Radius = 5.00, Texture = 1.10, and W_Concave_points = 0.4. After all the calculations, it will produce the following output:
The predicted probability of malignancy for these input values is close to 1. By modifying the Query, you can experiment for yourself with how the probability changes.
Moreover, the Machine Learning scenario with Python is enriched by the presence of many powerful packages (e.g., TensorFlow, Scikit-learn, and others that, for reasons of space, we haven't mentioned in this post), which provide excellently optimized classifications and predictions on data.
While Machine Learning, equipped with Python and all its accessories, can represent
the path toward the future of preventive and diagnostic Medicine, limitations in
comprehending biological variables could make this path an awkward road.
The Wisconsin Breast Cancer (Diagnostic) Data Set, with its 569 patients and 30 features, offers a rich assortment of parameters for classification and for this reason represents a perfect example for Machine Learning applications. However, many of these features seem to be redundant, and the definite impact of some of them on classification and prediction still remains unknown.
We have introduced a Bivariate Analysis in Step 3 to reduce the number of redundant features. An in-depth discussion of the role of these features in the prediction deserves a dedicated article.
References:
1. Mangasarian, Olvi & Street, Nick & Wolberg, William. (1995). Breast Cancer Diagnosis and Prognosis Via Linear Programming. Operations Research. 43. 10.1287/opre.43.4.570.
2. https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-
detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-
breast.html#references
3. Andrew Ng, Machine Learning, Coursera.
4. Chris Albon, Machine Learning with Python Cookbook, O'Reilly, ISBN-13: 978-1491989388.