5 - Logistic Regression - Lemai Nguyen 2022
Logistic Regression
Associate Professor Lemai Nguyen
Information Systems and Business Analytics
Email: lemai.nguyen@deakin.edu.au
UPCOMING EVENTS:
• DATA ANALYTICS IN AUSTRALIAN ORGANISATIONS
• EV DETECTION CHALLENGE
Deakin University CRICOS Provider Code: 00113B - A/Prof Lemai Nguyen Slide 2
Tell me about you…!
Kotu V., Deshpande B. Data Science: Concepts and Practice, chapters 1 and 4. Second edition. Morgan Kaufmann Publishers; 2019.
Google Colab
https://colab.research.google.com https://jupyter.org/
https://rapidminer.com
Predictive Machine Learning with Logistic Regression
• Logistic Regression – Key concepts
• Illustrative example
• Exercises in Python
Logistic Regression – Key concepts
Supervised Machine Learning
Deakin University CRICOS Provider Code: 00113B - A/Prof Lemai Nguyen MIS716 AI for Business Slide 7
Linear Regression – revision (1)
What it is and how it works
Simple linear regression (one predictor): y' = f(x) = b_0 + b_1 x
b_0 is the intercept of the line
x = independent variable/predictor
y = dependent variable/label
Multiple linear regression (n predictors): y' = b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n
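Logistic regression
• Target is categorical
• Predictors can be continuous or categorical
• The sigmoid function maps a continuous input x, ranging from -∞ to +∞, to a value between 0 and 1, which is then converted into a class y ∈ {0,1}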
https://en.wikipedia.org/wiki/Sigmoid_function
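The sigmoid mapping can be sketched in a few lines of Python. This is a minimal illustration only: the coefficient values b0 and b1, the single predictor, and the 0.5 decision threshold are all assumptions chosen for the example, not values from the slides.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, b0, b1):
    """Estimated probability that y = 1 for a single predictor x."""
    return sigmoid(b0 + b1 * x)

def predict_class(x, b0, b1, threshold=0.5):
    """Convert the estimated probability into a class label in {0, 1}."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 1.0
for x in (0.0, 4.0, 8.0):
    print(x, round(predict_probability(x, b0, b1), 3), predict_class(x, b0, b1))
```

Larger values of b0 + b1 x push the probability towards 1, smaller values push it towards 0, and the threshold decides where the class label flips.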
Training searches for the coefficients b_i that maximise the likelihood of the estimations across all datapoints, using a simplified likelihood function.
y = original target data (training dataset)
p = estimated probability
• The cost function is the sum of the likelihood values of all datapoints.
• Gradient descent can be utilised to search for the coefficients that maximise the likelihood of correct estimations.
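The training search described above can be sketched with NumPy. This sketch assumes the simplified likelihood is the standard log-likelihood, y·log(p) + (1 − y)·log(1 − p), summed over all datapoints; the toy data, learning rate and iteration count are hypothetical choices for illustration, not the lecture's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(b, X, y):
    """Sum over all datapoints of y*log(p) + (1 - y)*log(1 - p)."""
    p = sigmoid(X @ b)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_logistic(X, y, learning_rate=0.1, n_iter=5000):
    """Gradient ascent on the log-likelihood (equivalent to descent on its negative)."""
    X = np.column_stack([np.ones(len(X)), X])  # column of ones for the intercept b0
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ b)
        gradient = X.T @ (y - p) / len(y)  # average gradient of the log-likelihood
        b += learning_rate * gradient      # step towards higher likelihood
    return b, X

# Toy data: one predictor, binary target (hypothetical values)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

b, X_design = fit_logistic(x.reshape(-1, 1), y)
print("coefficients:", b)
print("log-likelihood:", log_likelihood(b, X_design, y))
```

Library implementations such as scikit-learn use more robust solvers and add regularisation by default, so their coefficients will generally differ from this plain gradient search.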
Predictive Machine Learning with Logistic Regression
Illustrative example
Business problem framing – business needs and application
Business needs
• Long delay in returning pathology results
• Novice pathologists need training
Data
• Past biopsy data and results
• To predict cancer diagnosis
Analytics (ML)
• Predictive analytics to classify diagnosis
Application
• Assist pathologists in interpreting test data -> reduced time and improved accuracy
• Training novice pathologists
ML Problem Framing: Classification
Predict if a datapoint belongs to one of the predefined classes, based on learning from a labelled
dataset
Data Preparation & Exploration -> Model Training -> Model Evaluation
Data:
V1, V2, V7-V9: biological variables
Diagnosis: healthy or cancerous
Loading and Exploring the Dataset
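# load dataset
import pandas as pd
records = pd.read_csv("/content/drive/MyDrive/VNU2022/biopsy_ln.csv")
records.info()      # column names, data types and non-null counts
records.describe()  # summary statistics for the numeric columns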
Exploration
Deakin University CRICOS Provider Code: 00113B - A/Prof Lemai Nguyen Slide 23
Data preparation
records['Diagnosis'] = records['class'].apply(coding_diagnosis)
records.head(10)
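The helper coding_diagnosis is not shown in the extracted slide text. A minimal sketch of what such a helper could look like follows, assuming the class column holds the strings "benign" and "malignant"; both the label values and the tiny stand-in DataFrame are hypothetical.

```python
import pandas as pd

def coding_diagnosis(label):
    """Encode the text class as 1 (cancerous) or 0 (healthy).

    Assumes the labels are 'benign' and 'malignant'; adjust to the actual file.
    """
    return 1 if label == "malignant" else 0

# Tiny stand-in for the biopsy records; the real data is loaded with pd.read_csv
records = pd.DataFrame({"class": ["benign", "malignant", "benign"]})
records["Diagnosis"] = records["class"].apply(coding_diagnosis)
print(records)
```

A numeric 0/1 target is what lets the exploration plots and the classifier treat the diagnosis as a binary outcome.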
Exploration
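import matplotlib.pyplot as plt
import seaborn as sns
# plot a fitted logistic curve of Diagnosis against each column at positions 1 to 4 (needs statsmodels)
for i in records.iloc[:, 1:5]:
    sns.regplot(x=records[i], y=records['Diagnosis'], logistic=True, ci=None)
    plt.title(i)
    plt.show()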
Data Preparation
#Selecting predictors
print(X.head())
print(y.head())
Data Splitting
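The splitting code is not present in the extracted text. The sketch below shows one common way to select predictors and hold out a test set with scikit-learn's train_test_split. The synthetic records, the choice of V1 and V2 as predictors, the 30% test size and the random seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the biopsy records (the real data comes from the CSV file)
rng = np.random.default_rng(2022)
records = pd.DataFrame({
    "V1": rng.integers(1, 11, size=100),
    "V2": rng.integers(1, 11, size=100),
    "Diagnosis": rng.integers(0, 2, size=100),
})

X = records[["V1", "V2"]]  # predictors (hypothetical selection)
y = records["Diagnosis"]   # target

# Hold out 30% of the rows for testing; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2022, stratify=y
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when comparing models across runs.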
Model Training
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
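A minimal training sketch with scikit-learn's LogisticRegression is shown below. It uses synthetic data from make_classification in place of the biopsy dataset, and the max_iter value and split settings are assumptions; the name logreg follows the later slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the biopsy records
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=2022
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2022
)

# Fit the model: this searches for the coefficients b0..bn on the training set
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

print("intercept b0:", logreg.intercept_)
print("coefficients b1..bn:", logreg.coef_)
print("class probabilities for 3 test rows:")
print(logreg.predict_proba(X_test[:3]))
```

The fitted intercept_ and coef_ correspond to b0 and b1..bn in the model equation, and predict_proba returns the estimated probability of each class per row.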
Model Testing
#inspection: compare actual and predicted labels on the test set
y_pred = logreg.predict(X_test)
inspection=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
inspection.head(20)
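Beyond inspecting individual rows, the test predictions can be summarised with standard metrics. The sketch below is an assumption about how this step might look, using scikit-learn's accuracy_score, confusion_matrix and classification_report on synthetic data rather than the biopsy dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted model
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=2022
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2022
)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on the held-out test set and summarise the results
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class

print("Accuracy: %.3f" % accuracy)
print(cm)
print(classification_report(y_test, y_pred))
```

In a diagnostic setting the per-class recall in the report matters as much as overall accuracy, since a missed cancerous case and a false alarm carry different costs.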
Plot ROC curve and Confusion Matrix
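from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
import matplotlib.pyplot as plt
RocCurveDisplay.from_estimator(logreg, X_test, y_test)
ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test)
plt.show()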
Cross Validation
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
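from sklearn.linear_model import LogisticRegressionCV
logreg2=LogisticRegressionCV(cv=10, random_state=2022).fit(X, y)
# note: score(X, y) reports accuracy on the same data that was used for fitting
print("Accuracy: %.3f" % logreg2.score(X,y))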
Accuracy: 0.930
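LogisticRegressionCV uses the folds to choose the regularisation strength and then refits on all of the data, so the score printed above is measured on data the model has already seen. A hedged sketch of estimating out-of-sample accuracy with cross_val_score follows; the synthetic data and the max_iter setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the biopsy predictors and target
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=2022
)

# 10-fold cross-validation: each fold is held out once while the model is fitted on the rest
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy"
)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how much the accuracy estimate depends on which rows end up in the held-out part.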
Recap
RapidMiner
Tuning the model parameters
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
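One common way to tune LogisticRegression parameters is a grid search over the inverse regularisation strength C. The sketch below is an assumption about how tuning could be done, not the lecture's own procedure; the candidate values of C, the 5-fold setting and the synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the biopsy predictors and target
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=2022
)

# Candidate values for C (smaller C means stronger regularisation)
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```

After fitting, search.best_estimator_ holds a model refitted on all of the data with the selected value of C.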
Assumptions
Pros
• Explainable: easy to interpret
• Visual representation
• No normality assumption on the data distribution; linearity is assumed only between the predictors and the log-odds of the target
• Less effort for data preparation, no need for normalisation
• Works for both numerical and categorical predictors
Cons
• Target must be categorical, best with binary (dichotomous)
• Works best if the classes are linearly separable by the predictors
• Requires large datasets; overfits if datasets are small
• Complex when having multi-class targets
Predictive Machine Learning with Logistic Regression
Exercises in Python
Google Colab
https://colab.research.google.com https://jupyter.org/
Additional resources
• Molnar, C., (2022) Interpretable Machine Learning - A Guide for Making Black Box Models Explainable,
https://christophm.github.io/interpretable-ml-book/logistic.html