
PROG 8510: Programming Statistics for Business

Statistical Regression and Prediction

Week 9
This class
• Discuss ML algorithm categories.
• Perform simple and multiple linear regression on rectangular datasets.
• Interpret the results of a linear regression.
Core Branches of Artificial Intelligence

Artificial Intelligence is an overarching term that encompasses the concept of
making intelligent machines. Such machines can learn over time and can improve
their performance on the next cycle with limited to no human intervention.

AI has been categorized in many ways, but the most common subfields of AI are:
• Machine Learning
• Natural Language Processing
• Deep Learning
Machine Learning
As the name suggests, machine learning algorithms, when presented with data,
can automatically learn useful patterns in the data. Such algorithms require
historical datasets to be fed in, in order to tune their model parameters.
Once these parameters are tuned, the models can be used to predict future
behavior. The most common types of ML algorithms are categorized into
supervised and unsupervised learning algorithms.
Two subcategories of machine learning

• Supervised machine learning
o Refers to models that are trained with labeled data sets, which allow the models to learn and grow
more accurate over time. For example, an algorithm would be trained with pictures of dogs and
other things, all labeled by humans, and the machine would learn ways to identify pictures of dogs
on its own. Supervised machine learning is the most common type used today.

• Unsupervised machine learning
o Refers to models that look for patterns in unlabeled data. Unsupervised machine learning can find
patterns or trends that people aren’t explicitly looking for. For example, an unsupervised machine
learning program could look through online sales data and identify different types of clients making
purchases.
Supervised machine learning contd.

Supervised learning, also known as supervised machine learning, is defined by
its use of labeled datasets to train algorithms to classify data or predict
outcomes accurately. As input data is fed into the model, the model adjusts
its weights until it has been fitted appropriately.
Supervised learning helps organizations solve a variety of real-world problems
at scale, such as filtering spam into a separate folder, away from your inbox.
Some methods used in supervised learning include linear regression, logistic
regression, random forest, and support vector machines (SVM).
Example of Supervised Learning
Example supervised prediction question: if a model is trained on the dataset
below, it can help answer questions like: will a 42-year-old person with a
post-graduate education and an annual income of around $78,000 be able to
service their loan? (A sketch of this idea in code follows the table.)

The first three columns are the predictors / features / variables (X); Loan
Default is the target variable (y).

| Age | Education   | Annual Income | Loan Default |
|-----|-------------|---------------|--------------|
| 31  | High School | $43,000       | No           |
| 45  | High School | $65,000       | Yes          |
| 33  | Bachelors   | $70,000       | No           |
| 76  | Masters     | $83,000       | No           |
| 80  | Diploma     | $81,000       | Yes          |
| 24  | High School | $33,000       | No           |
| 22  | Doctorate   | $100,230      | No           |
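
Although scikit-learn is only introduced later in this deck, here is a minimal
sketch of how this prediction question could be posed in code. The column
names, the one-hot encoding of Education, and the mapping of "Post Grad" to
"Masters" are illustrative assumptions, and logistic regression is just one
possible classifier for a yes/no target.

```python
# A minimal sketch: train a classifier on the toy loan table above, then ask
# the slide's question. Column names and encodings are illustrative choices.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "age":       [31, 45, 33, 76, 80, 24, 22],
    "education": ["High School", "High School", "Bachelors", "Masters",
                  "Diploma", "High School", "Doctorate"],
    "income":    [43000, 65000, 70000, 83000, 81000, 33000, 100230],
    "default":   ["No", "Yes", "No", "No", "Yes", "No", "No"],
})

# Encode the categorical Education column as one-hot (dummy) predictors
X = pd.get_dummies(data[["age", "education", "income"]])
y = data["default"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# The question from the slide: a 42-year-old with a post-graduate education
# and ~$78,000 annual income ("Post Grad" is mapped to "Masters" here)
new_person = pd.DataFrame({"age": [42], "education": ["Masters"],
                           "income": [78000]})
new_X = pd.get_dummies(new_person).reindex(columns=X.columns, fill_value=0)
print(model.predict(new_X))   # predicted Loan Default label, e.g. ['No']
```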
Illustration of a Supervised Learning Algorithm
[Figure: a supervised learning algorithm is trained on a labelled dataset and
is then asked a new question: what is the predicted salary of a developer with
5.2 years of experience, Skill = 2, based in NY, USA?]
Types of Supervised Learning

Supervised learning can be separated into two types of problems: classification
and regression.

Classification uses an algorithm to assign test data to specific categories. It
recognizes specific entities within the dataset and attempts to draw
conclusions about how those entities should be labeled or defined. Common
classification algorithms are linear classifiers, support vector machines
(SVM), decision trees, k-nearest neighbors, and random forests.

Regression is used to understand the relationship between dependent and
independent variables. It is commonly used to make projections, such as
projecting sales revenue for a given business. Linear regression, logistic
regression, and polynomial regression are popular regression algorithms.

The next slides discuss Regression.


Regression Introduction

• Managerial decisions are often based on the relationship between two or
  more variables:
o Example: After considering the relationship between advertising
expenditures and sales, a marketing manager might attempt to predict
sales for a given level of advertising expenditures.
• Sometimes a manager will rely on intuition to judge how two variables are
related.
• If data can be obtained, a statistical procedure called regression analysis
can be used to develop an equation showing how the variables are related.
Regression

The process of training a model on data where the outcome is known, for
subsequent application to data where the outcome is not known, is termed
supervised learning.
Regression is also a supervised learning algorithm.
Simple linear regression provides a model of the relationship between the
magnitude of one variable and that of a second: for example, as X increases, Y
also increases, or as X increases, Y decreases.
Key Terms for Simple Linear Regression
Response: The variable we are trying to predict; also called the dependent
variable, Y variable, target, or outcome.

Independent variable: The variable used to predict the response; also called
the X variable, feature, attribute, or predictor.

Record: The vector of predictor and outcome values for a specific individual
or case; also called a row, case, instance, or example.

Intercept: The intercept of the regression line, that is, the predicted value
when X = 0.

Regression coefficient: The slope of the regression line.

Predicted values: The estimates Ŷᵢ obtained from the regression line; also
called fitted values.

Residuals: The differences between the observed values and the fitted values,
eᵢ = Yᵢ − Ŷᵢ; also called errors.
Key Regression Terms continued

• Simple linear regression: A regression analysis in which any one-unit change
  in the independent variable, x, is assumed to result in the same change in
  the dependent variable, y.
• Multiple linear regression: A regression analysis involving two or more
independent variables.
Regression continued

• Simple linear regression estimates how much Y will change when X changes by
  a certain amount. With the correlation coefficient, the variables X and Y
  are interchangeable. With regression, we are trying to predict the Y
  variable from X using a linear relationship (i.e., a line):

  Y = b₀ + b₁X

  where b₀ is the intercept and b₁ is the slope (the regression coefficient).
Multiple Linear Regression

• When there are multiple predictors, the equation is simply extended to
  accommodate them:

  Y = b₀ + b₁X₁ + b₂X₂ + … + bₚXₚ

• Instead of a line, we now have a linear model: the relationship between each
  coefficient and its variable (feature) is linear.
Interpret the results of Regression Equation

The blue line between the variables Exposure and PEFR shows the regression
equation. As we can see, the predicted values (shown by the line) are not
perfectly aligned with the actual data points. The differences between the
predicted and actual values are called regression residuals, or errors.
[Diagram showing the regression line across the data]
How does regression fit a line?

• When there is a clear relationship, you could imagine fitting the line by
  hand. In practice, the regression line is the estimate that minimizes the
  sum of squared residual values, also called the residual sum of squares, or
  RSS:

  RSS = Σᵢ (Yᵢ − Ŷᵢ)² = Σᵢ (Yᵢ − (b₀ + b₁Xᵢ))²

• The method of minimizing the sum of the squared residuals is termed least
  squares regression, or ordinary least squares (OLS) regression.
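
To make the least squares idea concrete, here is a minimal sketch that fits a
line to a small made-up dataset with NumPy and computes the RSS by hand; the
data values are invented for illustration.

```python
# A minimal sketch of least squares: fit a line to made-up data with NumPy and
# compute the residual sum of squares (RSS) by hand.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with deg=1 performs ordinary least squares for a straight line;
# it returns the coefficients from highest degree down: [slope, intercept]
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x            # fitted (predicted) values on the line
residuals = y - y_hat          # observed minus fitted
rss = np.sum(residuals ** 2)   # the quantity that OLS minimizes

print(f"intercept={b0:.3f}, slope={b1:.3f}, RSS={rss:.4f}")
```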
Prediction Versus Explanation (Profiling)

Explanation
• Historically, a primary use of regression was to illuminate a supposed
  linear relationship between predictor variables and an outcome variable. The
  goal has been to understand a relationship and explain it using the data
  that the regression was fit to, i.e., primarily focusing on the model
  coefficients.
• Economists want to know the relationship between consumer spending and
GDP growth. Public health officials might want to understand whether a
public information campaign is effective. In such cases, the focus is not on
predicting individual cases but rather on understanding the overall
relationship among variables.
Prediction Versus Explanation (Profiling) cont.

Prediction
• With the advent of big data, regression is widely used to form a model to
  predict individual outcomes for new data (i.e., a predictive model) rather
  than to explain data in hand. In this instance, the main items of interest
  are the fitted values Ŷ.
• In marketing, regression can be used to predict the change in revenue in
response to the size of an ad campaign. Universities use regression to
predict students’ GPA based on their SAT scores.
Linear Regression in Python
Scikit-Learn

• In order to understand how linear regression is done in Python, we first
  have to understand how the popular Python machine learning package
  scikit-learn works.
• Scikit-Learn is characterized by a clean, uniform, and streamlined API, as
  well as by very useful and complete online documentation. A benefit of this
  uniformity is that once you understand the basic use and syntax of
  Scikit-Learn for one type of model, switching to a new model or algorithm is
  very straightforward.
Scikit-Learn Continued

• In this section, we will look at how to work with the scikit-learn library
  for performing machine learning analysis in Python.

Data Representation in Scikit-Learn:

o Feature Matrix: This can be thought of as a two-dimensional numerical array
  or matrix, called the features matrix. By convention, this features matrix
  is denoted by the variable name X. It is most often contained in a NumPy
  array or a Pandas DataFrame.

o Target Array: In addition to the feature matrix X, ML tasks need a label or
  target array, which by convention is denoted by the variable name y. The
  target array is usually one-dimensional, with length n_samples, and is
  generally contained in a NumPy array or a Pandas Series.

o In the next slide, we will see an example in action.


Scikit-Learn Data Representation

In the code example here, a dataset containing information about total bills,
table sizes, and the corresponding tip amounts is used to predict the tip
dollar amount for a new table, given its attributes.

Here, X holds the attributes (the feature matrix), whereas the target variable
y is the Series named tip.
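
The slide's code example is an image; below is a sketch of what it likely
contained, using the tips dataset that ships with seaborn (the column names
total_bill, size, and tip come from that dataset).

```python
# A minimal sketch of the X / y data representation described above
import seaborn as sns

tips = sns.load_dataset("tips")

# Feature matrix X: a two-dimensional structure of predictor columns
X = tips[["total_bill", "size"]]

# Target array y: a one-dimensional Series holding the outcome, tip
y = tips["tip"]

print(X.shape)   # (n_samples, n_features), e.g. (244, 2)
print(y.shape)   # (n_samples,), e.g. (244,)
```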
Scikit-learn steps continued
Once datasets are represented in X and y format, the next steps are as follows.

Most commonly, the steps in using the Scikit-Learn estimator API are as
follows (we will step through a detailed example in what follows, using a
linear regression model; a minimal sketch of steps 1-3 appears after this
list).

1. Choose a class of model by importing the appropriate estimator class from
   Scikit-Learn.

2. Choose model hyperparameters by instantiating this class with the desired
   values. The hyperparameters may differ depending on the type of model
   chosen.

3. Fit the model to your data by calling the fit() method of the model
   instance. All model fitting takes place at this step.
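
Here is a minimal sketch of steps 1 through 3 on the tips data. The choice of
features (total_bill and size) follows the earlier slide; fit_intercept=True
is shown only to illustrate setting a hyperparameter (it is also the default).

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression   # 1. choose a model class

tips = sns.load_dataset("tips")
X = tips[["total_bill", "size"]]   # feature matrix
y = tips["tip"]                    # target array

model = LinearRegression(fit_intercept=True)   # 2. choose hyperparameters
model.fit(X, y)                                # 3. fit the model to the data

print(model.intercept_, model.coef_)   # learned parameters: b0 and [b1, b2]
```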
Scikit-learn steps continued
4. Apply the model to new data:

   1. For supervised learning, we often predict labels for unknown data using
      the predict() method. Here, for a new scenario, the model predicted a
      value of approximately $11.

   2. For unsupervised learning, we often transform or infer properties of the
      data using the transform() or predict() method.
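
Continuing the same sketch, step 4 applies the fitted model to a new scenario
with predict(). The specific inputs behind the slide's ~$11 prediction are not
shown, so the bill and party size below are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

tips = sns.load_dataset("tips")
model = LinearRegression().fit(tips[["total_bill", "size"]], tips["tip"])

# A new scenario: an $80 bill for a party of 4 (illustrative values)
new_table = pd.DataFrame({"total_bill": [80.0], "size": [4]})
print(model.predict(new_table))   # array with the predicted tip in dollars
```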
Example: Predicting Car Mileage
Question:
Use the mpg dataset from the seaborn package and perform the following tasks:
1) Load the mpg dataset from seaborn under the variable name cars.
2) Create a feature matrix under the variable name X. The feature matrix
   should contain all numerical columns other than mpg.
3) Create a target array from the mpg column under the variable name y.
4) Create a sklearn linear regression model to predict the mileage (mpg) of a
   car once all the other numerical columns are given.
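
One possible solution sketch appears below. Dropping rows with missing values
(the horsepower column has a few) is a choice the question does not specify.

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression

# 1) Load the mpg dataset under the variable name cars
cars = sns.load_dataset("mpg").dropna()

# 2) Feature matrix X: all numerical columns other than mpg
X = cars.select_dtypes(include="number").drop(columns="mpg")

# 3) Target array y: the mpg column
y = cars["mpg"]

# 4) Fit a linear regression model to predict mileage from the other columns
model = LinearRegression().fit(X, y)
print(model.score(X, y))   # R^2 of the fit on the training data
```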
Next Class:
• Regression Model Evaluation
