
Unit – IV: Model Development

Simple and Multiple Regression

Linear regression examines the relationship between one predictor and an outcome, while
multiple regression delves into how several predictors influence that outcome. Both are essential
tools in predictive analytics, but knowing their differences ensures effective and accurate
modelling.

Difference Between Linear Regression and Multiple Regression

| Parameter | Linear (Simple) Regression | Multiple Regression |
|---|---|---|
| Definition | Models the relationship between one dependent and one independent variable. | Models the relationship between one dependent and two or more independent variables. |
| Equation | Y = C0 + C1X + e | Y = C0 + C1X1 + C2X2 + C3X3 + ... + CnXn + e |
| Complexity | Simpler, dealing with one relationship. | More complex due to multiple relationships. |
| Use Cases | Suitable when there is one clear predictor. | Suitable when multiple factors affect the outcome. |
| Assumptions | Linearity, Independence, Homoscedasticity, Normality. | Same as linear regression, with the added concern of multicollinearity. |
| Visualization | Typically visualized with a 2D scatter plot and a line of best fit. | Requires 3D or multi-dimensional space, often represented using partial regression plots. |
| Risk of Overfitting | Lower, as it deals with only one predictor. | Higher, especially if too many predictors are used without adequate data. |
| Multicollinearity Concern | Not applicable, as there is only one predictor. | A primary concern; correlated predictors can affect the model's accuracy and interpretation. |
| Applications | Basic research, simple predictions, understanding a singular relationship. | Complex research, multifactorial predictions, studying interrelated systems. |

What is Linear Regression?


Linear regression is a statistical method used to model the relationship between a dependent
variable and one independent variable. It aims to establish a linear relationship between these
variables and can be used for both prediction and understanding the nature of the relationship.
Mathematical Equation
The mathematical representation of simple linear regression is:

Y = C0 + C1X + e

where,

● Y: Dependent Variable (target variable)

● X: Independent Variable (input variable)

● C0: Intercept (value of Y when X=0)

● C1: Slope of the line

● e: Error term
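As a minimal sketch, assuming a small made-up dataset, the intercept C0 and slope C1 can be estimated with ordinary least squares in NumPy:

```python
import numpy as np

# Hypothetical data: X = years of experience, Y = salary in $1000s (made-up values)
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([30, 35, 42, 48, 55], dtype=float)

# np.polyfit with degree 1 fits Y = C1*X + C0 by least squares;
# it returns coefficients from highest degree down, so slope comes first
C1, C0 = np.polyfit(X, Y, 1)

# e: the residuals, i.e., the part of Y the fitted line does not explain
e = Y - (C0 + C1 * X)

print(f"Intercept C0 = {C0:.2f}, Slope C1 = {C1:.2f}")
```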

Assumptions of Linear Regression

Here are some assumptions that must be satisfied for the linear regression model to be valid.

● Linearity: The relationship between the independent and dependent variables should be
linear.

● Independence: Observations should be independent of each other.

● Homoscedasticity: The variance of the errors should be the same across all levels of the
independent variables.

● Normality: The dependent variable is normally distributed for a fixed value of the
independent variable.

● No Multicollinearity: This is more pertinent for multiple regression, where the independent variables should not be highly correlated with each other.

Limitations of Linear Regression

● Outliers: Outliers can significantly impact the slope and intercept of the regression line.

● Non-linearity: Linear regression assumes a linear relationship, but this assumption may
not hold in some cases.

● Correlation ≠ Causation: Just because two variables have a linear relationship doesn’t
mean changes in one cause changes in the other.

What is Multiple Regression?


Multiple regression is an extension of simple linear regression. It’s used to model the relationship
between one dependent variable and two or more independent variables. The primary purpose is
to understand how the dependent variable changes as the independent variables change.

Mathematical Equation

The mathematical representation of multiple regression is:

Y = C0 + C1X1 + C2X2 + C3X3 + ... + CnXn + e

where,

● Y: Dependent Variable (target variable)

● X1, X2, X3, ..., Xn: Independent Variables (input variables)

● C0: Intercept (value of Y when all Xi = 0)

● C1, C2, C3, ..., Cn: Coefficients (slopes) of the independent variables

● e: Error term

Assumptions of Multiple Regression

● Linearity: A linear relationship exists between the dependent and independent variables.

● Independence: Observations are independent of each other.

● No multicollinearity: Independent variables aren’t too highly correlated with each other.

● Homoscedasticity: Constant variance of the errors.

● No Autocorrelation: The residuals (errors) are independent.

● Normality: The dependent variable is normally distributed for any fixed value of the
independent variables.

Limitations of Multiple Regression

● Overfitting: Including too many independent variables can lead to a model that fits the
training data too closely.

● Omitted Variable Bias: Leaving out a significant independent variable can bias the
coefficients of other variables.

● Endogeneity: This occurs when an independent variable is correlated with the error term, leading to biased coefficient estimates.

By now, you should clearly understand what linear and multiple regression are, along with their mathematical equations, assumptions, and limitations. You also have a better understanding of how linear regression and multiple regression differ from each other. Now it's time for an example that will give you an idea of how to fit linear and multiple regression models using Python.

Example of Linear and Multiple Regression

Problem Statement: Suppose we have data for a retail company. The company wants to
understand how their advertising expenses in various channels (e.g., TV, Radio) impact
sales.

● Linear Regression: Predict sales using only TV advertising expenses.

● Multiple Regression: Predict sales using both TV and Radio advertising expenses (a minimal sketch follows below).
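Below is a minimal sketch of both models using scikit-learn. The advertising and sales figures are made-up illustrative values, not real company data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertising spend (in $1000s) and resulting sales (illustrative values)
tv    = np.array([230, 44, 17, 151, 180, 8, 57, 120, 199, 66], dtype=float)
radio = np.array([37, 39, 45, 41, 10, 2, 32, 19, 3, 5], dtype=float)
sales = np.array([22, 10, 9, 18, 15, 5, 11, 13, 14, 9], dtype=float)

# Linear (simple) regression: sales ~ TV only
simple = LinearRegression().fit(tv.reshape(-1, 1), sales)
print("Simple:   C0 =", round(simple.intercept_, 2), " C1 =", round(simple.coef_[0], 3))

# Multiple regression: sales ~ TV + Radio
X = np.column_stack([tv, radio])
multi = LinearRegression().fit(X, sales)
print("Multiple: C0 =", round(multi.intercept_, 2), " C1, C2 =", np.round(multi.coef_, 3))

# Predict sales for a hypothetical new budget: TV = 100, Radio = 20
print("Predicted sales:", multi.predict([[100, 20]]))
```

With real data, you would also hold out a test set and check metrics such as R-squared or mean squared error before trusting the coefficients.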

Model Evaluation using Visualization

What is Data Visualization?

In simple terms, data visualization in data science refers to the process of generating graphical
representations of information. These graphical depictions, often known as plots or charts, are
pivotal in the realm of data science for effective analysis and interpretation. Understanding the
various types of data visualization in data science is crucial to select the appropriate visual
method for the dataset at hand. Different types serve different analytical needs, from
understanding distributions with histograms to spotting trends with line charts. As one delves
deeper into the data science field, the importance of mastering these visualization types becomes
even more apparent.

Why is Data Visualization Important in Data Science?

There are many reasons for data visualization in data science. Its benefits include communicating your results or findings, monitoring the model's performance at the evaluation stage, guiding hyperparameter tuning, identifying trends, patterns and correlations between dataset features, supporting data cleaning tasks such as outlier detection, and validating model assumptions.

Examples of Data Visualization in Data Science

Here are some popular data visualization examples.

1. Weather reports: Maps and other plot types are commonly used in weather reports.

2. Internet websites: Social media analytics websites such as Social Blade and Google
Analytics use data visualization techniques to analyze and compare the performance of
websites.
3. Astronomy: NASA uses advanced data visualization techniques in its reports and
presentations.

4. Geography

5. Gaming industry

What Makes Data Visualization Effective?

To get the most out of data visualization, you should consider the following things. These are the
fundamentals of data visualization.

● Clarity: Data should be visualized in a way that everyone can understand.


● Problem domain: When presenting data, the visualizations should be related to the
business problem.
● Interactivity: Interactive plots are useful to compare and highlight certain things within
the plot.
● Comparability: Good plots make it easy to compare things.
● Aesthetics: Quality plots are visually aesthetic.
● Informative: A good plot summarizes all relevant information.
Importance of Data Visualization in Data Science

1. Data cleaning

● Data visualization plays an important role in data cleaning. Good examples are detecting outliers and removing multicollinearity: we can create scatter plots to detect outliers and generate heatmaps to check for multicollinearity (the sketch after this list illustrates both).

2. Data Exploration

● Before building any model, we need to do some exploratory data analysis to identify dataset characteristics. For example, we can create histograms for continuous variables to check for normality in the data. We can create scatter plots between two features to check whether they are correlated. Likewise, we can create a bar chart for a label column with two or more classes to identify class imbalance.

3. Evaluation of modeling outputs

● We can create a confusion matrix and learning curve to measure the performance of a model during training. Plots are also useful in validating model assumptions. For example, we can create a residuals plot and a histogram of the distribution of residuals to validate the assumptions of a linear regression model.

4. Identifying trends

● Time and seasonal plots are useful in time series analysis to identify certain trends over time.

5. Presenting results

● As a data scientist, you need to present your findings to the company or other stakeholders who may not have deep knowledge of the subject domain. So you need to explain everything in plain English, using informative plots that summarize your findings.
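As a hedged sketch of the data cleaning and exploration uses above (synthetic data; pandas, matplotlib and seaborn assumed available):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical dataset with two correlated features and one injected outlier
rng = np.random.default_rng(42)
x = rng.normal(50, 10, 200)
df = pd.DataFrame({"x": x,
                   "y": 2 * x + rng.normal(0, 5, 200),
                   "z": rng.normal(0, 1, 200)})
df.loc[0, "y"] = 500  # inject an outlier

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Data cleaning: a scatter plot makes the outlier easy to spot
axes[0].scatter(df["x"], df["y"])
axes[0].set_title("Outlier detection")

# Multicollinearity: a heatmap of pairwise correlations
sns.heatmap(df.corr(), annot=True, ax=axes[1])
axes[1].set_title("Correlation heatmap")

# Exploration: a histogram to check a feature's distribution for normality
axes[2].hist(df["x"], bins=20)
axes[2].set_title("Distribution of x")

plt.tight_layout()
plt.show()
```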

Different Types of Data Visualization in Data Science

There are many data visualization types. The following are the commonly used data visualization charts.

1. Distribution plot

● A distribution plot is used to visualize data distribution, for example, a probability distribution plot or density curve.

2. Box and whisker plot

This plot is used to visualize the variation of the values of a numerical feature. It shows the minimum, maximum, median, and lower and upper quartiles of the values.
3. Violin plot

Similar to the box and whisker plot, the violin plot is used to plot the variation of a numerical
feature. But it contains a kernel density curve in addition to the box plot. The kernel density
curve estimates the underlying distribution of data.

4. Line plot

A line plot is created by connecting a series of data points with straight lines. Time periods or another ordered variable are typically placed on the x-axis.

5. Bar plot

A bar plot is used to plot the frequency of categories in categorical data. Each category is represented by a bar. The bars can be drawn vertically or horizontally, and their heights or lengths are proportional to the values they represent.
6. Scatter plot

Scatter plots are created to see whether there is a relationship (linear or non-linear and positive or
negative) between two numerical variables. They are commonly used in regression analysis.

7. Histogram

A histogram represents the distribution of numerical data. Looking at a histogram, we can decide whether the values are normally distributed (a bell-shaped curve), skewed to the right, or skewed to the left. A histogram of residuals is useful for validating important assumptions in regression analysis.

8. Pie chart

A pie chart represents a categorical variable as a circular graph divided into slices, one per category, with each slice's size proportional to the quantity it represents.
For creating these charts, R provides open-source libraries such as:

● ggplot2

● Lattice
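In Python, the same kinds of charts are commonly drawn with matplotlib and seaborn. Here is a minimal sketch using synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
data = rng.normal(100, 15, 300)   # synthetic numerical feature
cats = ["A", "B", "C"]            # synthetic categories
counts = [40, 25, 35]             # synthetic category frequencies

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.boxplot(x=data, ax=axes[0, 0])
axes[0, 0].set_title("Box and whisker plot")
sns.violinplot(x=data, ax=axes[0, 1])
axes[0, 1].set_title("Violin plot")
axes[1, 0].bar(cats, counts)
axes[1, 0].set_title("Bar plot")
axes[1, 1].pie(counts, labels=cats)
axes[1, 1].set_title("Pie chart")
plt.tight_layout()
plt.show()
```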

Data Visualization Techniques in Data Science

Some of the main data visualization techniques in data science are univariate analysis,
bivariate analysis and multivariate analysis.

1. Univariate Analysis

In univariate analysis, as the name suggests, we analyze only one variable at a time. In other
words, we analyze each variable separately. Bar charts, pie charts, box plots and histograms
are common examples of univariate data visualization. Bar charts and pie charts are created
for categorical variables, while box plots and histograms are created for numerical variables.

2. Bivariate Analysis

In bivariate analysis, we analyze two variables at a time. Often, we see whether there is a
relationship between the two variables. The scatter plot is a classic example of bivariate data
visualization.

3. Multivariate Analysis

In multivariate analysis, we analyze more than two variables simultaneously. The heatmap is
a classic example of multivariate data visualization. Other examples are cluster analysis and
principal component analysis (PCA).
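As a minimal multivariate sketch, assuming scikit-learn is available and using synthetic data, PCA compresses several correlated features into two components that can then be plotted directly:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic dataset: 200 samples, 5 correlated features built from 2 latent factors
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(200, 5))

# Reduce the 5 dimensions to 2 principal components for plotting
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA projection of a 5-feature dataset")
plt.show()
```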

Advantages and Disadvantages of Data Visualization

Advantages

There are many advantages of data visualization. Data visualization is used to:
● Communicate your results or findings with your audience
● Tune hyperparameters
● Identify trends, patterns and correlations between variables
● Monitor the model’s performance
● Clean data
● Validate the model’s assumptions
Disadvantages

There are also some disadvantages of data visualization.


● We need to download, install and configure software and open-source libraries. The process can be difficult and time-consuming for beginners.
● Some data visualization tools are not free, so we need to pay for them.
● When we summarize the data, we lose some of the exact information.
Data Visualization Best Practices

1. Set the context

We need to develop a research question that could be solved with a data-driven approach.

2. Know your audience

This is very important, as the visualizations depend on the type of audience you have. To present your findings to a business audience, you need to create visualizations closely related to money, profits, and revenue: the terms business people are familiar with.

3. Choose an effective visual

You need to create the right plot for your requirement. To see the correlations between multiple variables, you could create a scatter plot for each pair of variables, but that is not very effective. Instead, you can create a heatmap, which is an effective way of visualizing correlations. Likewise, when you have many categories, a pie chart is not suitable; you can create a bar chart instead. These are some examples of choosing an effective visual for your requirements.

4. Keep it simple

Simple plots are easily readable. We can remove unnecessary backgrounds to make things stand out, and we should not include too much content in the plot: a title, axis names, scales, and legends are enough.

FAQ

1. What are the three main goals of data visualization?

● Communicating your results or findings with your audience


● Exploring (knowing) your data
● Identify trends, patterns and correlations between variables
2. How is data visualization used in data science?

Data visualization is used in every aspect of data science:

● Tuning hyperparameters
● Monitoring the model’s performance
● Cleaning data
● Validating the model’s assumptions
3. What are the major challenges of data visualization?
● Choosing the right plot type
● Identifying the needs of your audience
● Developing the research question and converting it into a data science question
● Collecting data
4. What are the benefits of data visualization?

Common use cases of data visualization include:


● Communicate your results or findings with your audience
● Tune hyperparameters
● Identify trends, patterns and correlations between variables
● Monitor the model’s performance
● Clean data
● Validate the model’s assumptions
Residual Plot – Distribution Plot
Residual analysis is a technique used to assess a regression model's validity by examining the differences between the observed values and the values predicted by the model.

What are Residuals?

● In statistics, a residual is the difference between a variable's observed value and its predicted value based on a statistical or ML model. In other words, in regression models, a residual measures how far away a point is from the regression line.

● In a residual analysis, residuals are used to assess the validity of a statistical or ML model. The model is considered a good fit if the residuals are randomly distributed. If there are patterns in the residuals, the model is not accurately capturing the relationship between the variables; it may need to be improved, or another model may need to be selected.
Residual Plot

● A residual plot is a scatter plot in which the X-axis represents the independent variable (or the model's predicted values), and the Y-axis represents the residuals from the ML model.

● A residual plot is used to identify underlying patterns in the residual values. We can assess the ML model's validity based on the observed patterns; a minimal sketch follows below.
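Here is a minimal sketch of a residual plot, assuming scikit-learn and matplotlib and using synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic linear data with random noise
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals vs. predicted values; a good fit shows no visible pattern around zero
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.show()
```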

Types of Residual Plots

Based on the patterns observed in residual values, there are several types of residual plots, as described below:

Random Pattern

● In this category of residual plots, residual values are randomly distributed, and there is no
visible pattern in the values. In this case, the developed ML model is considered a good
fit.

U-Shaped Pattern

● In this category, the residual plot follows a U-shaped curve. In this case, the model is not considered a good fit, and a non-linear model might be required.

Assumptions Regarding Residuals in Linear Regression


Before evaluating linear regression models using residual plot analysis, let's first understand three basic assumptions of linear regression models regarding residuals.
Independence
● The linear regression model assumes that residuals or error terms are independent and
that no visible pattern exists. It means that their pairwise covariance is zero.
● If the error terms are not independent, then the uniqueness of the least squares solution is lost, and the model is not considered a good fit.

Normality

● In this assumption, it is assumed that residuals are normally distributed. If the residuals
are not normally distributed, then it implies that the model is not able to explain the
relationships among the features in the data.

Homoscedasticity

● This is called the constant variance assumption. It assumes that the variance of the error terms or residuals is constant across the predicted values of the target variable, i.e., the residuals show the same spread across the range of predictions.
Residual Plot Analysis

● Residual plot analysis is used to assess the validity of linear regression models by
plotting the residuals and checking whether the assumptions of linear regression models
are met. The most important assumption of a linear regression model is that the error
terms or residuals are independent and normally distributed.

● A linear regression model can be considered a combination of deterministic and stochastic terms. Using the linear equation, we try to predict the deterministic part, and the remaining part is treated as the errors or residuals. These error terms or residuals must be independent and normally distributed, i.e., stochastic. This is what we look for in a residual plot for a model; the sketch below shows quick numerical checks for these assumptions.
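As a hedged sketch of quick numerical checks for these assumptions (statsmodels and scipy assumed available; the thresholds mentioned are common rules of thumb, not hard cutoffs):

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson

# Synthetic residuals from a well-behaved model: mean 0, constant variance
rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, 200)

# Independence: a Durbin-Watson statistic near 2 suggests no autocorrelation
dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw:.2f} (values near 2 suggest independent residuals)")

# Normality: Shapiro-Wilk test; a large p-value is consistent with normality
stat, p = shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.3f} (p > 0.05 is consistent with normality)")
```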

Characteristics of a Good Residual Plot

● An excellent residual plot should have the characteristics mentioned below:

o A high density of points near the X-axis, i.e., points should be more concentrated
near the horizontal axis and less dense away from the horizontal axis.

o It should be symmetric around the X-axis.

In a good residual plot, if the residuals are projected onto the vertical axis, they follow a normal distribution. In this case, the model is considered a good fit.
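A small sketch of this visual check, plotting a histogram of synthetic, well-behaved residuals:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic residuals from a good fit: centered at zero and symmetric
rng = np.random.default_rng(5)
residuals = rng.normal(0, 1, 500)

# Projected onto the vertical axis, the residuals of a good model form a
# roughly bell-shaped, symmetric histogram centered at zero
plt.hist(residuals, bins=30)
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.title("Distribution of residuals")
plt.show()
```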

● A bad residual plot is one in which the error terms follow a skewed distribution. In this case, the model is not considered a good fit.
FAQ

1. Define skewness and explain how it indicates asymmetry in a distribution. Can skewness be negative, positive, or zero? Provide examples for each.
2. What is kurtosis, and how does it describe the shape of a distribution?
Differentiate between leptokurtic, mesokurtic, and platykurtic distributions.
3. Define simple regression and provide an example of its application. What is the
role of the regression line in simple regression analysis?
4. Explain the concept of multiple regression and how it differs from simple
regression. When is multiple regression more appropriate than simple
regression?
5. Explain the concept of polynomial regression and when it is preferred over
linear regression.
6. Describe the concept of mean squared error and its role in evaluating model
performance. How is MSE affected by outliers in the dataset?
