
Week 2: Machine Learning Intro

Instructor: Ting Sun
Clarification
• This is not an R programming course: we will not spend much time on R syntax and programming.
• This is not a machine learning course, but basic knowledge of what machine learning is will be briefly introduced.
• This is not an algorithms course, but the basic ideas behind some machine learning algorithms will be briefly introduced.

There is no need to become highly proficient in R programming and R syntax.

There is no need to deeply study the underlying theory and parameters of the machine learning algorithms in R.
Statistical modeling vs. machine learning
• Statistical modeling
The formalization of relationships between variables in the form of mathematical equations, using traditional statistical tools such as significance, inference, and confidence.
Statistical modeling vs. machine learning (2)
• Machine learning
Using machine learning algorithms to learn patterns and trends directly from data, without relying on any prior hypotheses.
A comparison table between statistics and machine learning

                            Statistical modeling                 Machine learning
Need pre-determined         Yes                                  No
hypotheses?
Objective                   Testing your hypotheses about the    Developing models that make the most
                            relationship between variables       accurate predictions
                            using the data
Focus                       Explaining how each independent      Making accurate/repeatable predictions
                            variable affects the dependent       by combining new data with patterns
                            variable (explanation focused)       identified in past data (prediction
                                                                 focused)
Timeline                    Backward looking                     Forward looking
Data size                   Smaller                              Bigger
Data structure              Structured                           Structured and unstructured
Number of variables         Few                                  Many
Extent of assumptions       Many                                 Few
Predictive power            Weaker                               Stronger
Machine learning basics
Basic problems
Classification problems:
• The output/dependent variables are often called labels or categories.
• The task is to predict the class or category for a given observation.
• A classification problem requires that examples be classified into one of two or more classes.
• A problem with two classes is often called a two-class or binary classification problem (e.g., spam or not spam).
• A problem with more than two classes is often called a multi-class classification problem (e.g., letter grades such as A, A-, B+).
• A predictive model established to solve a classification problem is called a classification model.
Basic problems

Regression problems:
• The output/dependent variable is continuous (a real value).
• The task is to predict a quantity for a given observation, such as an amount or a size (e.g., the dollar value of the selling price of a house).
• A predictive model established to solve a regression problem is called a regression model.
Machine learning algorithms
• Machine learning uses a variety of algorithms to turn a data set into a model.
• Algorithms are the engines of machine learning: they tell the machine what to do.
• Algorithms instruct the machine how to receive and analyze input data to predict output values within an acceptable range.
• Learning tasks
  • may include learning the function that maps the input to the output, learning the hidden structure in unlabeled data, etc.
• Which kind of algorithm works best?
  It depends on the kind of problem you're solving, the computing resources available, and the nature of the data.
A tour of machine learning algorithms

Regression algorithms:
• Linear Regression
• Artificial Neural Networks
  Comprise 'units' arranged in a series of layers, each of which connects to the layers on either side. ANNs are inspired by biological systems, such as the brain, and how they process information.
• Decision Trees
  A flow-chart-like tree structure that uses a branching method to illustrate every possible outcome of a decision. Each node within the tree represents a test on a specific variable, and each branch is the outcome of that test.
• Random Forests
  An ensemble learning method, combining multiple algorithms to generate better results for classification, regression, and other tasks.

Classification algorithms:
• Logistic Regression
• Naïve Bayes
  Based on Bayes' theorem; classifies every value as independent of any other value. It allows us to predict a class/category, based on a given set of features, using probability.
• Support Vector Machine
  Filters data into categories, which is achieved by providing a set of training examples, each marked as belonging to one or the other of two categories. The algorithm then works to build a model that assigns new values to one category or the other.
• Artificial Neural Networks
• Decision Trees
• Random Forests
Algorithms Mind Map

Adapted from https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Evaluation metrics
• Purpose
To evaluate the predictive performance of the established model
• Categories
Metrics for regression models
Metrics for classification models
Metrics for classification models
1. True Positives (TP):
Cases where the actual class of the data point was 1 (True) and the predicted class is also 1 (True).
2. True Negatives (TN):
Cases where the actual class of the data point was 0 (False) and the predicted class is also 0 (False).
3. False Positives (FP):
Cases where the actual class of the data point was 0 (False) but the predicted class is 1 (True). "False" because the model predicted incorrectly, and "positive" because the predicted class was the positive one (1).
4. False Negatives (FN):
Cases where the actual class of the data point was 1 (True) but the predicted class is 0 (False). "False" because the model predicted incorrectly, and "negative" because the predicted class was the negative one (0).
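The four counts above can be tallied directly from the actual and predicted labels. Below is a minimal sketch in Python (the course itself uses R; the label vectors here are made-up example data):

```python
# Count TP, TN, FP, FN from actual vs. predicted labels (1 = positive, 0 = negative).
# The labels below are hypothetical example data.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # actual 1, predicted 1
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # actual 0, predicted 0
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # actual 0, predicted 1
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # actual 1, predicted 0

print(tp, tn, fp, fn)  # 3 3 1 1
```

Every observation falls into exactly one of the four cells, so tp + tn + fp + fn always equals the number of observations.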
Metrics for classification models
5. Recall (sensitivity)
6. Precision
7. Specificity
8. F score
9. Overall Accuracy
10. AUC
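Metrics 5–9 are all simple ratios of the TP/TN/FP/FN counts. A short Python sketch, using hypothetical counts for illustration (AUC is omitted, since it requires the model's predicted probabilities rather than the counts alone):

```python
# Classification metrics computed from a confusion matrix.
# The counts below are hypothetical example values.
tp, tn, fp, fn = 40, 45, 5, 10

recall      = tp / (tp + fn)            # sensitivity: share of actual positives caught
precision   = tp / (tp + fp)            # share of predicted positives that are correct
specificity = tn / (tn + fp)            # share of actual negatives caught
f_score     = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy    = (tp + tn) / (tp + tn + fp + fn)                 # overall accuracy

print(recall, precision, specificity, f_score, accuracy)
```

With these counts, recall is 0.8, specificity is 0.9, and overall accuracy is 0.85.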
Metrics for regression models
• MAE
Mean Absolute Error
• RMSE
Root Mean Square Error
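Both metrics average the prediction errors, but RMSE squares them first, so it penalizes large errors more heavily. A minimal Python sketch with made-up house-price values:

```python
import math

# MAE and RMSE for a regression model, from actual vs. predicted values.
# The numbers below are hypothetical example data (house prices in $1,000s).
actual    = [200.0, 150.0, 320.0]
predicted = [210.0, 140.0, 300.0]

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)            # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean square error

print(mae, rmse)
```

Here MAE is about 13.3 and RMSE is about 14.1; RMSE is larger because the single 20-unit error weighs more once squared.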
Data splitting (data resampling)
• When you are building a predictive model, you need to evaluate the capability of
the model on unseen data (estimating model accuracy).
• This is typically done by estimating accuracy using data that was not used to train
the model.
• Data splitting involves partitioning the data into:
1. an explicit training dataset used to prepare the model:
   use the training set to develop the model based on the data patterns learned by the algorithm.
2. an (unseen) test dataset used to evaluate the model's performance:
   use the developed model (from the previous step) to make predictions for the target variable on the test dataset.
We evaluate the model's performance by comparing the predicted value with the actual value of our target variable.
For example: the iris flower dataset (150 observations)

Split into two subsets:
• Training set (80%, 120 obs): develop the model Y = F(X1, X2, X3, X4), where
  Y: the class of iris plant (three classes: Iris setosa, Iris virginica, and Iris versicolor)
  X1: sepal.length
  X2: sepal.width
  X3: petal.length
  X4: petal.width
• Test set (20%, 30 obs): compare the predicted Y vs. the actual Y (the prediction error and other evaluation metrics) to evaluate the performance of our model.
Data splitting techniques
• 80% as training set; 20% as test set
• K-fold Cross validation
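Both techniques can be sketched in a few lines. The example below is plain Python for illustration only (the course uses R); it mimics the iris setup with 150 stand-in observations:

```python
import random

# 80/20 train/test split on 150 stand-in observations (like iris).
data = list(range(150))
random.seed(42)        # fix the shuffle so the split is reproducible
random.shuffle(data)

cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]   # 120 training obs, 30 test obs

# K-fold cross-validation: partition the data into k folds; each fold serves
# once as the test set while the remaining k-1 folds train the model.
k = 5
folds = [data[i::k] for i in range(k)]
for i in range(k):
    test_fold = folds[i]
    train_folds = [x for j in range(k) if j != i for x in folds[j]]
    # fit the model on train_folds and evaluate on test_fold here
    assert len(test_fold) + len(train_folds) == len(data)
```

K-fold cross-validation uses every observation for both training and testing (in different rounds), which gives a more stable accuracy estimate than a single split, at the cost of fitting the model k times.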
A summary
• The differences between statistics and machine learning
• The basics of machine learning, including
  • classification/regression problems
• Machine learning algorithms
  • for classification problems
  • for regression problems
• Evaluation metrics
  • for classification models: 10 metrics
  • for regression models: 2 metrics
• Data splitting/resampling
Discussion questions for week 2
1. Provide an example of statistics and an example of machine learning and
compare them.
2. Statistics vs. machine learning: which is better? Why?
3. Why do machine learning models typically provide more accurate predictions than
traditional statistical modeling?
4. Provide an example of a regression problem and an example of classification
problem, and explain both.
5. Why is it necessary to separate the data into training and test sets? Can't we
just test the model with the training set?
6. What are evaluation metrics used for? Do we use different metrics for
regression and classification problems? Please explain.