Week 3 Lecture Slides BUS265 2023
Digital Technology
Lecture 3: Data Design and Predictive Modelling
• Business Understanding
• Data Understanding
• Data Preparation
• Modelling
• Evaluation
• Deployment
Supervised Data Mining / Predictive Modeling
Key: is there a specific, quantifiable target that we
are interested in or trying to predict?
Examples:
• What will the IBM stock price be tomorrow? (e.g., $200)
What would you do if you could predict this?
Age Income
35 75K
68 83K
43 61K
71 56K
… …
Example: Marketing Life Insurance
• Send a letter to some prospects in a mailing list
• Wait for a response …
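[Figure: prospects plotted by Age and Income, each marked "Bought life insurance" or "Did not buy life insurance", with thresholds at Age = 45 and Income = 50K. A split over age (<45 vs >=45) is shown with leaves NO and YES; for the prospect marked "?" the question is "Interested in LI?"]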
A different sort of supervised segmentation for
our Life Insurance product
Logistic Regression
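[Figure: prospects plotted by Age and Income, with thresholds marked at Age = 45 and Income = 50K. The plot is annotated with the model form p(+|x), the coefficients β0 = 123 and β1 = -1.3, and, for the prospect marked "?", the estimate p(LI|x) = 0.48]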
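As a rough illustration of how a logistic regression turns attributes into a probability, the sketch below uses the standard logistic form p(+|x) = 1 / (1 + e^(-f(x))) with a linear score f(x). The function names and the three coefficients are hypothetical and chosen only for illustration; they are not the slide's values, because the full set of coefficients is not recoverable from the figure.

```python
import math


def logistic(z: float) -> float:
    """Map a linear score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def p_buys_li(age: float, income_k: float,
              b0: float, b_age: float, b_income: float) -> float:
    """Estimated p(LI|x) for a prospect with the given age and income (in $K)."""
    # Linear score f(x) = b0 + b_age * age + b_income * income
    return logistic(b0 + b_age * age + b_income * income_k)


# Hypothetical coefficients (assumption, not taken from the lecture).
B0, B_AGE, B_INCOME = -6.0, 0.08, 0.04

p_example = p_buys_li(age=45, income_k=50, b0=B0, b_age=B_AGE, b_income=B_INCOME)
print(f"p(LI | age=45, income=50K) = {p_example:.2f}")
```

Unlike the tree, which returns a hard YES or NO per segment, this kind of model returns a probability for every prospect, so prospects can be ranked before deciding whom to mail.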
What might a data mining model look like?
What is the model?
Classification tree
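Split over income (at 50K): one branch is a NO leaf, the other leads to a split over age
Split over age (at 45): <45 → NO, >=45 → YES
[Figure: prospects plotted by Age and Income, each marked "Bought life insurance" or "Did not buy life insurance"]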
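A classification tree like this one is just a set of nested rules, which can be sketched as a small function. The function name is hypothetical, and the direction of the income split (below 50K leads to NO) is an assumption, since the flattened figure does not show which branch carries the NO leaf. The four prospects are the Age/Income rows listed earlier in the slides.

```python
def predict_li(age: float, income: float) -> str:
    """Predict whether a prospect buys life insurance ("YES" or "NO")."""
    # Assumption: the low-income branch is the NO leaf.
    if income < 50_000:
        return "NO"
    # Split over age at 45, as on the slide: <45 -> NO, >=45 -> YES.
    if age < 45:
        return "NO"
    return "YES"


# (age, income) rows from the example table: 35/75K, 68/83K, 43/61K, 71/56K.
prospects = [(35, 75_000), (68, 83_000), (43, 61_000), (71, 56_000)]
predictions = [predict_li(age, income) for age, income in prospects]
print(predictions)
```

Each path from the root to a leaf defines one segment of prospects, and every prospect in a segment receives the same prediction.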
Within supervised learning:
classification vs. regression?
Supervised Data Mining / Predictive Modeling
Recall: Key: is there a specific, quantifiable target
that we are interested in or trying to predict?
Example: Life Insurance Marketing
Classification vs. Regression?
Think:
• What is the target variable?
• What values can it take in your data?
Pattern 3:
Young people are more likely to default
Pattern 4:
If Balance >= 50K and Age > 45
Then Default = ‘no’
Else Default = ‘yes’
Pattern 5:
If Name ends with ‘e’
Then Balance > 100000
Else Balance < 100000
Good vs bad patterns?
Supervised Data Mining: Terminology
Example (instance): a fact; a data point. One example is typically described by a set of attributes (fields, variables, features) and a target variable (label).
Name   Balance   Age   Default
Mike   123,000   30    yes
Mary    51,100   40    yes
Bill    68,000   55    no
Jim     74,000   46    no
Mark    23,000   47    yes
Anne   100,000   49    no
Attributes: Name, Balance, Age
Target: Default
Equivalent statistical terminology:
Attributes: independent variables
Target: dependent variable
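To make the terminology concrete, the sketch below stores the six examples from the table, separates attributes from the target, and counts how many rows Pattern 4 and Pattern 5 agree with. The variable and function names are hypothetical; the data and the two rules come from the slides.

```python
# The six examples from the terminology table.
examples = [
    {"Name": "Mike", "Balance": 123_000, "Age": 30, "Default": "yes"},
    {"Name": "Mary", "Balance": 51_100, "Age": 40, "Default": "yes"},
    {"Name": "Bill", "Balance": 68_000, "Age": 55, "Default": "no"},
    {"Name": "Jim", "Balance": 74_000, "Age": 46, "Default": "no"},
    {"Name": "Mark", "Balance": 23_000, "Age": 47, "Default": "yes"},
    {"Name": "Anne", "Balance": 100_000, "Age": 49, "Default": "no"},
]

# Attributes (independent variables) and target (dependent variable).
attributes = [{k: v for k, v in row.items() if k != "Default"} for row in examples]
target = [row["Default"] for row in examples]


def pattern_4(row: dict) -> str:
    """If Balance >= 50K and Age > 45 then Default = 'no', else 'yes'."""
    return "no" if row["Balance"] >= 50_000 and row["Age"] > 45 else "yes"


def pattern_5_holds(row: dict) -> bool:
    """If Name ends with 'e' then Balance > 100000, else Balance < 100000."""
    if row["Name"].endswith("e"):
        return row["Balance"] > 100_000
    return row["Balance"] < 100_000


pattern_4_hits = sum(pattern_4(row) == row["Default"] for row in examples)
pattern_5_hits = sum(pattern_5_holds(row) for row in examples)
print(pattern_4_hits, pattern_5_hits)  # 6 5
```

Pattern 5 agrees with five of the six rows even though a name's last letter has no plausible link to an account balance, so fitting a small table well does not by itself make a pattern a good one.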
What Can Go Wrong With ML?
Leakage
• Digression on features: It is all about the timing in use!
Definition
Leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.
https://en.wikipedia.org/wiki/Leakage_(machine_learning)
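One simple guard against leakage is to record when each feature becomes known and keep only those available at prediction time. The sketch below is a minimal illustration of that idea; the stage names, the feature catalogue, and the two "after outcome" features are invented for the example and do not come from the lecture.

```python
# Hypothetical stages in the life of a loan, in chronological order.
STAGES = ["application", "after_outcome"]

# Hypothetical catalogue: the stage at which each feature becomes known.
FEATURE_KNOWN_AT = {
    "Balance": "application",
    "Age": "application",
    "missed_payments_next_year": "after_outcome",
    "account_sent_to_collections": "after_outcome",
}


def split_by_availability(feature_known_at: dict, prediction_stage: str):
    """Return (safe, leaky) feature names for a model used at prediction_stage."""
    cutoff = STAGES.index(prediction_stage)
    safe = [f for f, s in feature_known_at.items() if STAGES.index(s) <= cutoff]
    leaky = [f for f, s in feature_known_at.items() if STAGES.index(s) > cutoff]
    return safe, leaky


safe, leaky = split_by_availability(FEATURE_KNOWN_AT, "application")
print("safe:", safe)
print("leaky:", leaky)
```

A model trained with the leaky features would look very accurate on historical data, but those values would not exist yet when a new application has to be scored.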
Example: predicting age from income
Underfitting
Overfitting
Just right
• A model that
generalises beyond
the dataset and that
isn’t influenced by the
noise in the dataset.
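The three cases can be compared on a toy version of the age-from-income example. In the sketch below, a constant prediction stands in for underfitting, a least-squares line for "just right", and a model that memorises the training rows (1-nearest neighbour) for overfitting. The data points are invented for illustration, and the choice of these three models is an assumption, not necessarily what the slides plotted.

```python
# Hypothetical (income in $K, age) pairs, invented for illustration.
train = [(30, 25), (40, 33), (50, 35), (60, 46), (70, 48), (80, 58)]
test = [(34, 30), (56, 41), (76, 52)]


def mse(model, data):
    """Mean squared error of model(income) against the true ages."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_mean(data):
    """Underfit: ignore income and always predict the average age."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Least-squares straight line: age = intercept + slope * income."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sxy / sxx
    return lambda x: (my - slope * mx) + slope * x


def fit_memoriser(data):
    """Overfit: return the age of the closest training income, noise included."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


models = {"underfit": fit_mean(train), "line": fit_line(train),
          "overfit": fit_memoriser(train)}
errors = {name: (mse(m, train), mse(m, test)) for name, m in models.items()}
for name, (train_err, test_err) in errors.items():
    print(f"{name:9s} train MSE = {train_err:6.2f}   test MSE = {test_err:6.2f}")
```

On this toy data the memoriser has zero training error but a much larger test error than the line, while the constant model is poor on both sets; the gap between training and test error is the signal to watch for.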
Acknowledgements
Thank you