Professional Documents
Culture Documents
Learning Objectives
Outline the data science Explain the commonly List the algorithms mostly
cycle and machine used feature selection used in supervised and
learning process and feature engineering unsupervised learning
methods
Business / Domain
Expertise
Data science can be used to answer any business question and drive business decisions.
Identify project Collect and Select and clean Manipulate data Evaluate model Launch the model
objectives review data data using different and draw into production
methods conclusions
Machine Learning
Modeling Process
Machine learning uses computer algorithms to make predictions from input data.
Unsupervised
Learning
Supervised
Learning
Machine Learning
Collect Data
Exploratory Data
• Internal sources
Analysis
• Outside sources
Exploratory Data Analysis: a first glance on the data to see any trends or patterns.
$56,000 755 43 No
$38,000 682 22 Yes
$120,000 731 38 No
$65,000 595 54 Yes
$52,00 784 68 No
Basic descriptive statistics are used here along with plotting of different variables within the dataset.
Passenger ID 0.9
Survived
0.6
Ticket Class
Age 0.3
Siblings/Spouse
0.0
Parents/Children
Fare -0.3
Family
-0.6
This step is to set up the data and preparing it for machine learning modeling.
Feature selection: select the related features from the dataset and remove the irrelevant ones.
Irrelevant features can negatively impact the performance of a machine learning model.
Key Features
All Features
• Reduce the features that have a high • Leverage decision tree algorithms to
correlation with each other determine which features are more
important towards the output
• Keep only the principal components if
there are too many starting features • Remove irrelevant features
Feature engineering is the process to set up your data for better model performance.
Mean = 60 Mean = 0
SD = 10 SD = 1
30 40 50 60 70 80 90 -3
30 -2
40 -1
50 600 701 2
80 3
90
Feature engineering is the process to set up your data for better model performance.
Feature engineering is the process to set up your data for better model performance.
Red 1 0 0
Red 1 0 0
Yellow 0 1 0
Green 0 0 1
Yellow 0 1 0
This step is to select machine learning algorithms that will contribute to the prediction of the results.
The algorithms are the key pieces that allow the machine to learn from input data and improve from
experience.
• Hierarchical clustering
K-means clustering: a popular type of clustering algorithm to identify groups and trends.
Executives
Income
Income
Soccer Moms
Students
Retirees
Age Age
Regression algorithms: predict an output given Classification algorithms: predict the output of a
input in the form of a numeric value given input in the form of categorical value.
Regression Classification
Y y = mx + b
• x: Input
• y: Output
• m: Coefficient value
• b: Intercept
The linear regression algorithm helps us figure out the values of m and b so we can make predictions.
Y Y = 1 (positive)
Y = 0 (negative)
The decision tree algorithm can be used to predict both categorical or numeric outcomes.
The decision tree algorithm splits the dataset into hierarchical branches until it reaches the results
to answer the question.
50%
$13,000
Total return:
Buy 10 stocks
$10,900
$9,000
Invest 50%
$10,000
Algorithm Type
Live Data +
Testing Data
Training Data
We want the model to perform well on the training data as well as the testing data.
This validation process is unique to supervised learning.
Dataset
Input & Output
Train the
model
Test the
model
Predictions
Dataset
Dataset
Train the
model
Predictions
Check
performance
Model evaluation is the step where we compare the model’s performance on the training data with its
performance on the testing data.
• R2 • Accuracy
• MAE • AUC
• MSE
• RMSE
The purpose of machine learning is to find the real relationship between inputs and outputs.
Over generalizes the data Generalize enough to predict Does not generalize enough
future patterns
Error
reduced from Complexity resolved by reducing
regularization. complexity.
Variance
Bias2
Coefficient of Determination (R2) is one of the most used metrics to evaluate regression models.
R2 measures how close the data are to the fitted regression line.
R2 = 0 R2 = 1 R2 = 0.86
MAE, MSE and RMSE measure how far the predicted values deviate from the actual values.
Area Under Curve (AUC) evaluate both true positives AUC value ranges from 0.5 to 1.
and true negatives.
Prediction
True False
Negative Positive
Actual
False True
Negative Positive
Look at the evaluation metrics on both training data and testing data.
Example:
• R2 is high on the training data and high on the testing data – Good fit
• R2 is low on the training data and low on the testing data – Underfit
• R2 is high on the training data but low on the testing data – Overfit
Underfit Overfit