Professional Documents
Culture Documents
Vinh Vo (Ph.D)
1
Materials used: “Fundamentals of Machine Learning for Predictive Data Analytics”
by J.D. Kelleher, B. Mac Namee and A. D’Arcy (MIT, 2015)
Vinh Vo (EM@HUB) Lecture 0 - Overview of Data Science March 31, 2023 1 / 50
Outline
7 Summary
Outline
7 Summary
Example Applications:
❖ Price Prediction:
➢ Businesses such as hotel chains, airlines, and online retailers need to
constantly adjust their prices to maximize returns based on factors such
as seasonal changes, shifting customer demand, and the occurrence of
special events. Predictive analytics models can be trained to predict
optimal prices based on historical sales records. Businesses can then
use these predictions as an input into their pricing strategy decisions.
❖ Dosage Prediction:
➢ Doctors and scientists frequently decide how much of a medicine or
other chemical to include in a treatment. Predictive analytics models
can be used to assist this decision making by predicting optimal
dosages based on data about past dosages and associated outcomes.
Example Applications:
❖ Risk Assessment:
➢ Risk is one of the key influencers in almost every decision an
organization makes. Predictive analytics models can be used to predict
the risk associated with decisions such as issuing a loan or underwriting
an insurance policy. These models are trained using historical data
from which they extract the key indicators of risk. Output from risk
prediction models can be used by organizations to make better risk
judgements.
❖ Propensity Modeling:
➢ Most business decision making would be made much easier if we could
predict the likelihood, or propensity, of individual customers to take
different actions. Predictive analytics models can be used to build
models that predict future customer actions based on historical
behavior.
Example Applications:
❖ Diagnosis:
➢ Doctors, engineers, and scientists regularly make diagnoses as part of
their work. Typically, these diagnoses are based on their extensive
training, expertise, and experience. Predictive analytics models can
help professionals make better diagnoses by leveraging large collections
of historical examples at a scale beyond anything one individual would
see over his or her career. These diagnoses made by predictive analytics
models usually become an input into the professional’s existing
diagnosis process.
❖ Document Classification:
➢ Predictive analytics models can be used to automatically classify
documents into different categories. Examples include email spam
filtering, new sentiment analysis, customer complaint redirection, and
medical decision making. The definition of a document can be
expanded to include images, sounds, and videos, all of which can be
classified using predictive data analytics models.
Outline
7 Summary
Data Mining
“Data-driven discovery of models and patterns from massive
observational data sets”.
Machine Learning
“(Supervised) Machine Learning techniques automatically learn a model
of the relationship between a set of descriptive features and a target
feature from a set of historical examples”.
Data Science
“Data Science is a set of fundamental principles that support and guide
the principled extraction of information and knowledge from data”.
Outline
7 Summary
Figure: Using machine learning to induce a prediction model from a training dataset.
Figure: Using the model to make predictions for new query instances.
Loan-Salary
ID Occupation Age Ratio Outcome
1 industrial 34 2.96 repaid
2 professional 41 4.64 default
3 professional 36 3.22 default
4 professional 41 3.11 default
5 industrial 48 3.80 default
6 industrial 61 2.52 repaid
7 professional 37 1.50 repaid
8 professional 40 1.93 repaid
9 industrial 33 5.25 default
10 industrial 32 4.15 default
➢ Notice that this model does not use all the features and the feature
that it uses is a derived feature (in this case a ratio): feature design
and feature selection are two important topics in ML.
Outline
7 Summary
Table: A full set of potential prediction models before any training data becomes available.
Table: A sample of the models that are consistent with the training data
Table: A sample of the models that are consistent with the training data
➢ Notice that there is more than one candidate model left! It is because
a single consistent model cannot be found based on a sample training
dataset that ML is ill-posed.
❖ Goal: a model that generalises beyond the dataset and that isn’t
influenced by the noise in the dataset.
❖ Inductive bias the set of assumptions that define the model selection
criteria of an ML algorithm.
❖ There are two types of bias that we can use:
1 restriction bias
2 preference bias
Outline
7 Summary
ID Age Income
1 21 24,000
2 32 48,000
3 62 83,000
4 72 61,000
5 84 52,000
Figure: Striking a balance between overfitting and underfitting when trying to predict age from
income.
Outline
7 Summary
Business Understanding
During this initial phase of any analytics project, the primary goal of the
data analyst is to fully understand the project objectives and requirements
from a business perspective, and then to design a data analytics solution
to achieve the objectives.
Data Understanding
Once the manner in which predictive data analytics will be used to address
a business problem has been decided, it is important that the data analyst
fully understand the different data sources available within an organization
and the different kinds of data contained in these sources.
Data Preparation
Modeling
Evaluation
Before models can be deployed for use, it is important that they are fully
evaluated and proved to be fit for the purpose. This phase of CRISP-DM
covers all the evaluation tasks required to show that a prediction model
will be able to make accurate predictions after being deployed and that
does not suffer from overfitting or underfitting.
Deployment
Outline
7 Summary
Summary
B B B B B
B B B B B
B B
Q&A B B
B B B B B
B B B B B