You are on page 1of 21

Data Science and

Machine Learning

Terminology and Concepts

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation
Agenda
• Reality, Abstraction and Modeling
• Artificial Intelligence (AI)
• Machine Learning (ML)
• Supervised, Unsupervised and Reinforced Learning
• Training, Testing, Validation
• Cross-validation

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 2
Abstraction and human mind
• Complexity of reality –> Simplification
• The human mind continuously re-works reality by applying cognitive
processes
• Abstraction: capability of finding the commonality in many different
observations:
• Generalize specific features of real objects (generalization)
• Classify the objects into coherent clusters (classification)
• Aggregate objects into more complex ones (aggregation)

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 3
Models
Model: a simplified or partial representation of reality, defined
in order to accomplish a task or to reach an agreement

Model represents Reality

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation
The Role of Models

Data Science

AI MODELS ML

Big Data
Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 5
Artificial Intelligence
• Many Interpretations
• Artificial General Intelligence (aka. Strong AI or Full AI)
• General intelligent actions
• Discerning problems
• Acting as humans would
• … up to self-consciousness
• Restricted AI (aka. Weak or Applied AI or Narrow AI)
• Implementing intelligence in specific applications or problem solving tasks
• No full cognitive abilities

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 6
Machine Learning
“The field of machine learning is concerned with the question of
how to construct computer programs
that automatically improve with experience.”
Tom Mitchell (1997)

• A program that “learns” from experience with respect to some task


• Performance of this learning process is measurable
• Meaning of learning?
• Supervised vs. unsupervised vs. reinforced learning process

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 7
Supervised Learning
• Task to be learnt: to extract a description / labeling or pattern from
the data, based on the training
• Training examples labeled by a (human) supervisor
• Use it to predict the output for further examples
• Performance measured as how accurate the output is
• Example applications
• Credit approval
• Medical diagnosis
• Fraud detection
• Text and image labeling or classification

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 8
Unsupervised Learning
• Task to be learnt: finding interesting patterns/ groups/ categories in
the data based on evidence
• No pre-labeled data à detection of facts from raw data
• Performance measured as how good / meaningful the groups /
patterns are
• Example applications
• Customer segmentation
• User behavior categorization
• Grouping of items by similarity

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 9
Reinforcement Learning
• The machine learns through trial-and-error interactions
• The goal is to maximize the amount of reward received from the
environment
• Iterative process
• Trained through interactions with the environment
• Rewards assigned upon success
• Performance measured as amount of rewards collected
• Example applications
• Robot learning
• Games

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 10
Training, Validation, and Test
Split process into: training, validation, and test

Training Validation Testing


Training
Learn the model over the training set

MODEL
Training
Validation
Optimize the predictive capability of the model using the validation set

MODEL
Training Validation
Testing
See how well the model works on the test set (unseen before)
MODEL

Testing

The error gives an unbiased estimate of the predictive power of a model


Training, Validation, and Test
Split data into three sets: training, validation, and test

Dataset

Training Validation Test


set set set
Training, Validation, and Test
Split data into three sets: training, validation, and test

Dataset

Training Validation Test


set set set

70% 15% 15%


Cross-Validation
Repeat the iteration on training + validation multiple times

Dataset

Training Validation Test

MODEL
set set set

10-fold cross-validation: pick 10 times random subsets as training and


validation, and average the quality of the results
Wrapping up
• Learning is
• An iterative process
• Learns from statistics of past phenomena / known data
• Understand future data

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 18
Overfitting
• Learns «too much» from
existing data
• Does not generalize well
on new data

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation
Overfitting
• Learns «too much» from Training Data
existing data Test Data

• Does not generalize well


on new data

Error
Cycles / Data
Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 20
Bias
• Learning bias
• Algorithmic bias

• Learns past statistical distributions

• Biased data implies biased models

• Risk on people evaluation

Marco Brambilla, Emanuele Della Valle - Data Science for Business Innovation 21

You might also like