Regression problems:
• The output/dependent variable is continuous (a real value), such as a price or a temperature.
• The task is to predict a quantity for a given observation, such as an amount or size
(e.g., the dollar value of the selling price of a house).
• A predictive model built to solve a regression problem is called a
regression model.
Machine learning algorithms
• Machine learning uses a variety of algorithms to turn a data set into a
model.
• They are the engines of machine learning: they tell the machine what to do.
• Algorithms instruct the machine how to receive and analyze input data to
predict output values within an acceptable range.
• Learning tasks
• may include learning the function that maps the input to the output, learning
the hidden structure in unlabeled data, etc.
• Which kind of algorithm works best?
It depends on the kind of problem you’re solving, the computing resources
available, and the nature of the data.
A tour of machine learning algorithms
Regression algorithms:
• Linear Regression
• Artificial Neural Networks
comprise ‘units’ arranged in a series of layers, each of which
connects to the layers on either side. ANNs are inspired by biological
systems, such as the brain, and how they process information.
• Decision Trees
a flow-chart-like tree structure that uses a branching method to
illustrate every possible outcome of a decision. Each node within
the tree represents a test on a specific variable, and each branch is
the outcome of that test.
• Random Forests
an ensemble learning method, combining multiple algorithms to
generate better results for classification, regression and other tasks.
Classification algorithms:
• Logistic Regression
• Naïve Bayes
based on Bayes’ theorem; classifies every value as independent
of any other value. It allows us to predict a class/category, based on a
given set of features, using probability.
• Support Vector Machine
filters data into categories by providing a set of
training examples, each marked as belonging to one or the other
of two categories. The algorithm then builds a model
that assigns new values to one category or the other.
• Artificial Neural Networks
• Decision Trees
• Random Forests
Algorithms Mind Map
Adapted from
https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Evaluation metrics
• Purpose
To evaluate the predictive performance of the established model
• Categories
Metrics for regression models
Metrics for classification models
Metrics for classification models
1. True Positives (TP):
cases where the actual class of the data point was 1 (True) and the predicted class is also
1 (True)
2. True Negatives (TN):
cases where the actual class was 0 (False) and the predicted class is also
0 (False)
3. False Positives (FP):
cases where the actual class was 0 (False) but the predicted class is
1 (True). “False” because the model predicted incorrectly, and “positive” because the class
predicted was the positive one (1).
4. False Negatives (FN):
cases where the actual class was 1 (True) but the predicted class is
0 (False). “False” because the model predicted incorrectly, and “negative” because the class
predicted was the negative one (0).
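The four counts above can be tallied directly from paired actual/predicted labels. A minimal sketch (the example label lists are made up for illustration):

```python
# Count TP, TN, FP, FN from actual vs. predicted binary labels.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # -> (3, 3, 1, 1)
```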
Metrics for classification models
5. Recall (sensitivity)
6. Precision
7. Specificity
8. F score
9. Overall Accuracy
10. AUC (area under the ROC curve)
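Metrics 5–9 are all derived from the four confusion-matrix counts; AUC is different in kind, since it requires predicted scores rather than hard labels, so it is omitted here. A sketch using the standard definitions (the example counts are made up for illustration):

```python
# Compute recall, precision, specificity, F score, and overall accuracy
# from confusion-matrix counts (TP, TN, FP, FN).
def classification_metrics(tp, tn, fp, fn):
    recall      = tp / (tp + fn)               # a.k.a. sensitivity
    precision   = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_score     = 2 * precision * recall / (precision + recall)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, specificity, f_score, accuracy

print(classification_metrics(tp=3, tn=3, fp=1, fn=1))
```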
Metrics for regression models
• MAE
Mean Absolute Error
• RMSE
Root Mean Square Error
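Both regression metrics compare predicted values with actual values; MAE averages the absolute errors, while RMSE penalizes large errors more heavily by squaring them before averaging. A minimal sketch (the example numbers are made up for illustration):

```python
import math

# MAE: average of absolute errors; RMSE: square root of the average
# squared error.
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
print(mae(actual, predicted))   # -> 0.875
print(rmse(actual, predicted))
```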
Data splitting (data resampling)
• When you are building a predictive model, you need to evaluate the capability of
the model on unseen data (estimating model accuracy).
• This is typically done by estimating accuracy using data that was not used to train
the model.
• Data splitting involves partitioning the data into
1. an explicit training dataset used to prepare the model:
use the training set to develop the model based on the data pattern learned by the algorithm
2. an (unseen) test dataset used to evaluate the model's performance on the data.
use the developed model (from the last step) to make predictions for the target variable on the
test dataset.
We evaluate the model’s performance by comparing the predicted value with the actual value
for our target variable.
For example: the iris flower dataset
• 150 observations, split into two subsets:
Training set (80%): 120 obs
Test set (20%): 30 obs
• Variables: Y = F(X1, X2, X3, X4), where
Y: the class of iris plant (three classes: Iris setosa, Iris virginica, and Iris versicolor)
X1: sepal.length
X2: sepal.width
X3: petal.length
X4: petal.width
• For the training set (80%): develop the model Y = F(X1, X2, X3, X4)
• For the test set (20%): compare the predicted Y with the actual Y (Error = actual − predicted)
and compute other evaluation metrics to evaluate the performance of our model
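The 80/20 split above can be sketched in a few lines; here indices stand in for the 150 iris rows, and the random seed is an arbitrary choice for reproducibility:

```python
import random

# Shuffle 150 observation indices, then take the first 80% as the
# training set and the remaining 20% as the test set.
random.seed(42)
indices = list(range(150))
random.shuffle(indices)           # randomize before splitting
split = int(0.8 * len(indices))   # 120
train_idx, test_idx = indices[:split], indices[split:]
print(len(train_idx), len(test_idx))  # -> 120 30
```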
Data splitting techniques
• 80% as training set; 20% as test set
• K-fold Cross validation
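In k-fold cross validation, the data is divided into k folds; each fold serves once as the test set while the remaining folds form the training set, so every observation is used for both training and testing. A sketch over indices, assuming n divides evenly by k (as with 150 observations and 5 folds):

```python
# Yield (train, test) index lists for each of the k folds.
def k_fold_indices(n, k):
    fold_size = n // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        test_set = set(test)
        train = [j for j in range(n) if j not in test_set]
        yield train, test

for train, test in k_fold_indices(n=150, k=5):
    print(len(train), len(test))  # each fold: 120 train, 30 test
```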
A summary
• The differences between statistics and machine learning
• The basics of machine learning, including
• classification/regression problems
• Machine learning algorithms
• for classification problems
• for regression problems
• Evaluation metrics
• for classification models: 10 metrics
• for regression models: 2 metrics
• Data splitting / resampling
Discussion questions for week 2
1. Provide an example of statistics and an example of machine learning, and
compare them.
2. Statistics vs. machine learning: which is better? Why?
3. Why do machine learning models typically provide more accurate predictions than
traditional statistical modeling?
4. Provide an example of a regression problem and an example of a classification
problem, and explain both.
5. Why is it necessary to separate the data into training and test sets? Can’t we
just test the model with the training set?
6. What are evaluation metrics used for? Do we use different metrics for
regression and classification problems? Please explain.