
Breast Cancer Detection Using Machine

Learning Techniques

PRESENTED BY:
Madhushree M
PROBLEM STATEMENT
Breast cancer is one of the most feared diseases in the world, with a high fatality rate. It is the second most common cause of cancer-related death after lung cancer.
The health care sector in India faces several challenges, such as an inadequate number of doctors and a lack of standardized procedures/methods for disease diagnosis.
Another major challenge is the huge amount of patient data generated through various types of scans, and the need for automated, cost-effective and fast processing for accurate diagnosis.
A further issue is manual intervention in diagnosis, which may lead to inaccurate diagnoses and delays in treatment.
Early diagnosis is needed to provide proper treatment and reduce the mortality rate.
Our aim, therefore, is to create a standardized, customizable and affordable breast cancer detection/classification system using Machine Learning techniques.
METHODOLOGY

Wisconsin breast cancer dataset → Data pre-processing → Feature selection → Data partition → Classification prediction model → Benign / Malignant

BLOCK DIAGRAM OF BREAST CANCER DETECTION MODEL
 Dataset used: Wisconsin breast cancer dataset from the UCI Machine Learning Repository.
 It is an open-source dataset available in CSV format.
 The dataset includes 569 examples of cancer biopsies, each with 32 attributes (an ID, the diagnosis and 30 real-valued features).
 After data pre-processing, 15 unique features are retained.

 Programming Environment: Google Colaboratory


 Programming Language: Python

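For concreteness, a minimal sketch of this step, assuming scikit-learn's bundled copy of the same dataset (the original work loads the UCI CSV instead, which carries the ID and diagnosis columns on top of the 30 numeric features). The random seed and stratification are assumptions, and the deck's own feature selection down to 15 features is not reproduced here.

# Minimal sketch: load the Wisconsin breast cancer data and create the 455/114 split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 numeric features
print(X.shape)                               # (569, 30)

# An 80/20 split yields 455 training and 114 testing samples.
# random_state and stratify are assumptions, not stated in the deck.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), len(X_test))             # 455 114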
EVALUATION METRICS
 The performance of the models is measured with respect to the following metrics.
 Confusion matrix: a confusion matrix summarizes the performance of a classification algorithm by counting true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
 F1 score: the harmonic mean of precision and recall; it gives a better measure of the incorrectly classified cases than accuracy alone.
 We use the F1 score to evaluate the classification models, since it takes both false positives and false negatives into account.
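As an illustration only (the labels below are made up, not taken from the dataset), both metrics are available in scikit-learn:

from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels purely for illustration (1 = positive class, e.g. malignant).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# Rows are the actual classes, columns the predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))

# F1 = 2 * precision * recall / (precision + recall)
print(f1_score(y_true, y_pred))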
IMPLEMENTATION
• We have used three machine learning algorithms for breast cancer detection/classification.
• Algorithms used: K-Nearest Neighbours (KNN), Support Vector Machine (SVM) and Logistic Regression.

RESULTS AND ANALYSIS

MODEL 1: KNN

Case 1 parameters: n_neighbors = 5, metric = minkowski, p = 1
F1 score: 0.94

Case 2 parameters: n_neighbors = 5, metric = minkowski, p = 2
F1 score: 0.95

(The confusion matrix for each case is given in the results summary.)
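The two KNN cases above can be reproduced along the following lines; the split, random seed and lack of feature selection are assumptions, so the scores will not match the deck exactly. Under the Minkowski metric, p = 1 is the Manhattan distance and p = 2 the Euclidean distance.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for p in (1, 2):  # Case 1: p = 1, Case 2: p = 2
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    knn.fit(X_train, y_train)
    print(f"p = {p}: F1 = {f1_score(y_test, knn.predict(X_test)):.2f}")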
Contd....
Model 2: Support Vector Machine

Case 1 parameters: kernel = rbf
F1 score: 0.96

Case 2 parameters: kernel = linear
F1 score: 0.97

(The confusion matrix for each case is given in the results summary.)
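A corresponding sketch for the two SVM cases, under the same assumptions about the data split as in the KNN sketch above:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for kernel in ("rbf", "linear"):  # Case 1: rbf, Case 2: linear
    svm = SVC(kernel=kernel)
    svm.fit(X_train, y_train)
    print(f"kernel = {kernel}: F1 = {f1_score(y_test, svm.predict(X_test)):.2f}")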
Contd....
Model 3: Logistic Regression

Case 1 parameters: C (inverse regularization strength) = 0.1
F1 score: 0.95

Case 2 parameters: C (inverse regularization strength) = 1
F1 score: 0.97

(The confusion matrix for each case is given in the results summary.)
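And the same pattern for the two logistic regression cases (a smaller C means stronger regularization); raising max_iter so the solver converges on the unscaled features is an extra assumption, not something stated in the deck:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for C in (0.1, 1.0):  # Case 1: C = 0.1, Case 2: C = 1
    logreg = LogisticRegression(C=C, max_iter=5000)
    logreg.fit(X_train, y_train)
    print(f"C = {C}: F1 = {f1_score(y_test, logreg.predict(X_test)):.2f}")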
RESULTS SUMMARY:

Total dataset: 569 samples; Training dataset: 455; Testing dataset: 114 (same split for all three models).

Model 1: KNN
Case 1 (n_neighbors = 5, metric = minkowski, p = 1): F1 = 0.94; TP = 44, TN = 65, FP = 2, FN = 3
Case 2 (n_neighbors = 5, metric = minkowski, p = 2): F1 = 0.95; TP = 44, TN = 66, FP = 1, FN = 3

Model 2: SVM
Case 1 (kernel = rbf): F1 = 0.96; TP = 45, TN = 66, FP = 2, FN = 1
Case 2 (kernel = linear): F1 = 0.97; TP = 46, TN = 66, FP = 1, FN = 1

Model 3: Logistic Regression
Case 1 (C = 0.1): F1 = 0.95; TP = 44, TN = 66, FP = 1, FN = 3
Case 2 (C = 1): F1 = 0.97; TP = 46, TN = 66, FP = 1, FN = 1
CONCLUSION

• Breast cancer is considered to be one of the significant causes of death in women.
• Breast cancer detection can be done with the help of modern machine learning algorithms.
• In our work, three classification algorithms, KNN, SVM and Logistic Regression, were applied to the Wisconsin Breast Cancer dataset.
• The results show that the pre-processing and feature-selection phases enhance the classifiers' performance.
• The proposed model can be further improved by using Deep Learning techniques.
LOGISTIC REGRESSION
What is logistic regression?
Logistic regression is a statistical method for classifying objects.
We can solve binary classification problems using the logistic regression technique.
Logistic regression is a regression model where the dependent variable is categorical, for example:
 A doctor classifies a tumour as malignant or benign.
 A bank transaction may be fraudulent or genuine.
 Every incoming mail is spam or not spam.
Logistic regression is just one of the machine learning methods used for solving this kind of binary classification problem.
Contd...
Why only logistic regression?
• Because it produces results in a binary format, which is used to predict the outcome of a categorical dependent variable, for example whether a given animal is a cat or a rat. The outcome is discrete/categorical, such as 0/1, yes/no, true/false.
Why can't we use linear regression?
• Example: predicting salary from experience is a continuous problem suited to linear regression, whereas tumour detection requires a discrete class label.
Contd...
We use linear regression when the output takes values in a continuous range, but in this case the output is discrete, i.e. either 0 or 1.

We need to formulate this into an equation; once it is formulated into an equation, it takes the form of a SIGMOID CURVE.

How do we decide whether the value is 0 or 1?
To map a predicted value between 0 and 1 to a class, we use the concept of a threshold. The threshold divides the output range: predictions above it are assigned to one class (e.g. 1) and predictions below it to the other (e.g. 0).
LOGISTIC REGRESSION MODEL

Equation of a straight line: h(x) = θᵀx + c, with -∞ ≤ h(x) ≤ +∞
where h(x) is the predicted output, θ is the vector of coefficients (the slope), x is the data point and c is the intercept.

To obtain outputs limited to the range 0 to 1, we transform the equation:
h(x) = g(θᵀx) = g(z), with 0 ≤ h(x) ≤ 1

Sigmoid function (logistic function): g(z) = 1 / (1 + e^(-z)), where z = θᵀx
Therefore, h(x) = 1 / (1 + e^(-θᵀx))

(Figure: sigmoid curve of g(z) against z; the curve passes through g(z) = 0.5 at z = 0.)
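As an illustrative sketch (not part of the deck), the hypothesis above translates directly into NumPy:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)): squashes any real value into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h(x) = g(theta^T x)
    return sigmoid(np.dot(theta, x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))                       # approx [0.12, 0.5, 0.88]; g(0) = 0.5
print(hypothesis(np.array([1.0, -0.5]), np.array([2.0, 2.0])))   # z = 1.0, so h(x) ≈ 0.73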
LOGISTIC REGRESSION DECISION BOUNDARY
Predict "y = 1" if h(x) ≥ 0.5: since g(z) ≥ 0.5 when z ≥ 0, h(x) = g(θᵀx) ≥ 0.5 whenever θᵀx ≥ 0.
Predict "y = 0" if h(x) < 0.5: since g(z) < 0.5 when z < 0, this happens whenever θᵀx < 0.

Example: h(x) = g(θ0 + θ1·x1 + θ2·x2)
Take θ0 = -3, θ1 = 1, θ2 = 1.

Predict "y = 1" if h(x) ≥ 0.5, i.e. -3 + x1 + x2 ≥ 0; then x1 + x2 ≥ 3.
Predict "y = 0" if h(x) < 0.5, i.e. -3 + x1 + x2 < 0; then x1 + x2 < 3.
If h(x) = 0.5, then x1 + x2 = 3: this line is the decision boundary.

(Figure: x1–x2 plane with the line x1 + x2 = 3 separating the y = 1 region from the y = 0 region.)
Contd....

• Now that we know the expression of the logistic regression hypothesis, we need to define a cost function in order to evaluate the errors a logistic model is going to make. Recalling the cost function for linear regression:
  J(θ) = (1/2m) Σ_i ( h(x_i) - y_i )²
• If we minimized this function with our new hypothesis h(x) = g(θᵀx), we could not be sure of converging to the global minimum of the cost function: since g is not linear, the squared-error cost is no longer convex and we might end up in a local minimum.
Therefore, our new cost function will be:
  Cost(h(x), y) = -log(h(x))        if y = 1
  Cost(h(x), y) = -log(1 - h(x))    if y = 0
Contd....

• As we are dealing with a binary classification problem and y can only be 0 or 1, the cost function can be simplified to the following expression:
  J(θ) = -(1/m) Σ_i [ y_i·log(h(x_i)) + (1 - y_i)·log(1 - h(x_i)) ]

• Choosing the θ parameters: using gradient descent.
Gradient descent is an iterative process, and the version used in logistic regression is exactly the same as the one used for linear regression; the only difference between the two is the input hypothesis. Therefore, the gradient descent algorithm is again:
  repeat { θ_j := θ_j - α·∂J(θ)/∂θ_j = θ_j - (α/m) Σ_i ( h(x_i) - y_i )·x_ij }   (simultaneously for all j, with learning rate α)
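A compact NumPy sketch of this procedure, using a made-up toy dataset and learning rate purely for illustration (not the implementation behind the results above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) sum[ y*log(h) + (1 - y)*log(1 - h) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))   # simultaneous update of all theta_j
    return theta

# Toy data: a bias column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))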
ADVANTAGES AND DISADVANTAGES
Advantages:
• Easy to understand and interpret.
• Requires relatively little training time.
• Good accuracy for many simple datasets, and it performs well when the dataset is linearly separable.
Disadvantages:
• If the independent features are correlated, performance may be affected.
• It is often prone to noise and overfitting.
• Non-linear problems cannot be solved directly, as it has a linear decision surface.
Thank you
