Linear regression examines the relationship between one predictor and an outcome, while
multiple regression delves into how several predictors influence that outcome. Both are essential
tools in predictive analytics, but knowing their differences ensures effective and accurate
modelling.
Y = C0 + C1X + e
where,
● Y: Dependent variable (outcome)
● X: Independent variable (predictor)
● C0: Intercept of the regression line
● C1: Slope of the regression line
● e: Error term
Here are some assumptions that must be satisfied for a linear regression model to be valid.
● Linearity: The relationship between the independent and dependent variables should be
linear.
● Homoscedasticity: The variance of the errors should be the same across all levels of the
independent variables.
● Normality: The dependent variable is normally distributed for a fixed value of the
independent variable.
Linear regression also has some limitations:
● Outliers: Extreme values can significantly impact the slope and intercept of the regression line.
● Non-linearity: Linear regression assumes a linear relationship, but this assumption may
not hold in some cases.
● Correlation ≠ Causation: Just because two variables have a linear relationship doesn’t
mean changes in one cause changes in the other.
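To make the equation concrete, here is a minimal sketch with made-up numbers showing that a least-squares fit recovers the intercept C0 and slope C1 (the data values are illustrative assumptions, not from the article):

```python
import numpy as np

# Made-up data generated from Y = C0 + C1*X with C0 = 2, C1 = 3 and no error term.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 + 3.0 * X

# A degree-1 least-squares fit returns (slope, intercept).
C1, C0 = np.polyfit(X, Y, deg=1)
print(round(C0, 2), round(C1, 2))  # 2.0 3.0
```

Because the data were generated without an error term, the fit recovers the coefficients exactly; with real, noisy data the estimates would only approximate them.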
Mathematical Equation
Y = C0 + C1X1 + C2X2 + ... + CnXn + e
where,
● Y: Dependent variable (outcome)
● X1, X2, ..., Xn: Independent variables (predictors)
● C0: Intercept
● C1, C2, ..., Cn: Coefficients of the independent variables
● e: Error term
The following assumptions must be satisfied for a multiple regression model to be valid:
● Linearity: A linear relationship exists between the dependent and independent variables.
● No multicollinearity: Independent variables aren’t too highly correlated with each other.
● Normality: The dependent variable is normally distributed for any fixed value of the
independent variables.
Multiple regression also has some limitations:
● Overfitting: Including too many independent variables can lead to a model that fits the
training data too closely and generalizes poorly to new data.
● Omitted Variable Bias: Leaving out a significant independent variable can bias the
coefficients of other variables.
● Endogeneity: Occurs when an independent variable is correlated with the error term,
leading to biased coefficient estimates.
By now, you should clearly understand what linear and multiple regression are, their
mathematical equations, assumptions, and limitations, as well as how the two differ from each
other. Now it's time for an example that shows how to fit linear and multiple regression models
using Python.
Problem Statement: Suppose we have data for a retail company. The company wants to
understand how their advertising expenses in various channels (e.g., TV, Radio) impact
sales.
● Linear Regression: Predict sales using TV advertising expenses alone
● Multiple Regression: Predict sales using both TV and Radio advertising expenses
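The dataset itself is not included here, so the sketch below fits both models on made-up advertising figures using scikit-learn; the numbers and the coefficients used to generate the synthetic sales are illustrative assumptions, not real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertising spend (the article supplies no dataset).
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2])
radio = np.array([37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6])
# Synthetic, noise-free target: sales = 3.0 + 0.045*TV + 0.18*Radio
sales = 3.0 + 0.045 * tv + 0.18 * radio

# Linear regression: sales ~ TV (a single predictor).
simple = LinearRegression().fit(tv.reshape(-1, 1), sales)

# Multiple regression: sales ~ TV + Radio (two predictors).
X = np.column_stack([tv, radio])
multiple = LinearRegression().fit(X, sales)

print(multiple.intercept_, multiple.coef_)   # recovers 3.0 and [0.045, 0.18]
pred = multiple.predict([[100.0, 30.0]])     # forecast for TV=100, Radio=30
print(round(pred[0], 2))                     # 12.9
```

Because the synthetic target has no noise, the multiple regression recovers the generating coefficients exactly; with real sales data you would instead interpret the estimated coefficients as each channel's marginal effect on sales.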
Data Visualization in Data Science
In simple terms, data visualization in data science refers to the process of generating graphical
representations of information. These graphical depictions, often known as plots or charts, are
pivotal in the realm of data science for effective analysis and interpretation. Understanding the
various types of data visualization in data science is crucial to select the appropriate visual
method for the dataset at hand. Different types serve different analytical needs, from
understanding distributions with histograms to spotting trends with line charts. As one delves
deeper into the data science field, the importance of mastering these visualization types becomes
even more apparent.
There are many reasons for data visualization in data science. Data visualization benefits include
communicating your results or findings, monitoring the model’s performance at the evaluation
stage, hyperparameter tuning, identifying trends, patterns and correlation between dataset
features, data cleaning such as outlier detection, and validating model assumptions.
Here are some real-world applications of data visualization:
1. Weather reports: Maps and other plot types are commonly used in weather reports.
2. Internet websites: Social media analytics websites such as Social Blade and Google
Analytics use data visualization techniques to analyze and compare the performance of
websites.
3. Astronomy: NASA uses advanced data visualization techniques in its reports and
presentations.
4. Geography
5. Gaming industry
To get the most out of data visualization, you should consider the following things. These are the
fundamentals of data visualization.
1. Data cleaning
● Data visualization plays an important role in data cleaning. Good examples are detecting
outliers and checking multicollinearity. We can create scatterplots to detect outliers and
generate heatmaps to check multicollinearity.
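As a sketch of the two cleaning checks just mentioned (with made-up numbers), the snippet below flags outliers with the 1.5×IQR rule that a scatterplot would reveal visually, and computes the correlation matrix a heatmap would display:

```python
import numpy as np

# Hypothetical feature values containing one obvious outlier (95).
values = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, 13], dtype=float)

# The 1.5*IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # [95.]

# Multicollinearity check: the correlation matrix behind a heatmap.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1  # perfectly collinear with a
corr = np.corrcoef(a, b)
print(round(corr[0, 1], 6))  # correlation of 1 signals multicollinearity
```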
2. Data Exploration
● Before building any model, we need to do some exploratory data analysis to identify
dataset characteristics. For example, we can create histograms for continuous variables to
check for normality in the data. We can create scatterplots between two features to check
whether they are correlated. Likewise, we can create a bar chart for the label column with
two or more classes to identify class imbalance.
3. Measuring model performance
● We can create a confusion matrix and learning curve to measure the performance of a
model during training. Plots are also useful in validating model assumptions. For
example, we can create a residuals plot and a histogram of the distribution of residuals to
validate the assumptions of a linear regression model.
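As a small illustration of measuring classification performance, here is a confusion matrix computed with scikit-learn on made-up labels (the labels are illustrative assumptions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]
```

The diagonal counts correct predictions per class; the off-diagonal cells reveal which class the model confuses, which a plain accuracy score would hide.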
4. Identifying trends
● Time and seasonal plots are useful in time series analysis to identify certain trends over
time.
5. Presenting results
● As a data scientist, you need to present your findings to the company or other
stakeholders who may not have deep knowledge of the subject domain. So, you need to
explain everything in plain English, using informative plots that summarize your findings.
There are many data visualization types. The following are the most commonly used data
visualization charts.
1. Distribution plot
This plot is used to visualize the overall distribution of a numerical feature's values.
2. Box and whisker plot
This plot is used to plot the variation of the values of a numerical feature. You can read the
values' minimum, maximum, median, and lower and upper quartiles from it.
3. Violin plot
Similar to the box and whisker plot, the violin plot is used to plot the variation of a numerical
feature. But it contains a kernel density curve in addition to the box plot. The kernel density
curve estimates the underlying distribution of data.
4. Line plot
A line plot is created by connecting a series of data points with straight lines. It is often used
for time series data, with the time period on the x-axis.
5. Bar plot
A bar plot is used to plot the frequencies of categories in categorical data. Each category is
represented by a bar. The bars can be drawn vertically or horizontally, and their heights or
lengths are proportional to the values they represent.
6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or non-linear and positive or
negative) between two numerical variables. They are commonly used in regression analysis.
7. Histogram
A histogram represents the distribution of numerical data. Looking at a histogram, we can decide
whether the values are normally distributed (a bell-shaped curve), skewed to the right, or skewed
to the left. A histogram of residuals is useful for validating important assumptions in regression analysis.
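For instance, binning values with NumPy gives the counts a histogram would draw; the sketch below uses made-up, approximately normal data, which should produce a roughly bell-shaped count profile:

```python
import numpy as np

# Made-up, approximately normal data (seeded for reproducibility).
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=1000)

# np.histogram returns the per-bin counts and the bin edges.
counts, bin_edges = np.histogram(residuals, bins=10)
print(counts.sum())  # 1000: every observation falls in exactly one bin
```

For bell-shaped data the central bins hold the most observations, while skewed data would pile the counts toward one end.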
8. Pie chart
A pie chart represents a categorical variable as a circular graph divided into slices, one per
category, with each slice's size proportional to the quantity it represents.
R provides open-source data visualization libraries such as
● ggplot2
● lattice
Some of the main data visualization techniques in data science are univariate analysis,
bivariate analysis and multivariate analysis.
1. Univariate Analysis
In univariate analysis, as the name suggests, we analyze only one variable at a time. In other
words, we analyze each variable separately. Bar charts, pie charts, box plots and histograms
are common examples of univariate data visualization. Bar charts and pie charts are created
for categorical variables, while box plots and histograms are created for numerical variables.
2. Bivariate Analysis
In bivariate analysis, we analyze two variables at a time. Often, we see whether there is a
relationship between the two variables. The scatter plot is a classic example of bivariate data
visualization.
3. Multivariate Analysis
In multivariate analysis, we analyze more than two variables simultaneously. The heatmap is
a classic example of multivariate data visualization. Other examples are cluster analysis and
principal component analysis (PCA).
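As a sketch of multivariate analysis, the snippet below reduces three made-up, partly correlated features to two principal components with scikit-learn (the data-generating choices are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: the second feature is almost a scaled copy of the first.
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Project the three correlated features onto two principal components.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)  # (100, 2)
```

Because two of the three features are nearly collinear, two components retain almost all of the variance, which is exactly the kind of structure multivariate analysis is meant to uncover.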
Advantages
There are many advantages of data visualization. Data visualization is used to:
● Communicate your results or findings with your audience
● Tune hyperparameters
● Identify trends, patterns and correlations between variables
● Monitor the model’s performance
● Clean data
● Validate the model’s assumptions
Challenges
1. Developing a research question
We need to develop a research question that can be solved with a data-driven approach.
2. Identifying the needs of your audience
This is very important, as the visualizations depend on the type of audience you have. To
present your findings to a business audience, you need to create visualizations closely
related to money, profits and revenue: the terms that business people are familiar with!
3. Choosing the right plot
You need to create the right plot that addresses your requirement. To see the correlations
between multiple variables, you could create scatterplots for each pair of variables, but that is
not very effective. Instead, you can create a heatmap, which is an effective way of visualizing
correlations. When you have many categories, a pie chart is not suitable; instead, you can
create a bar chart. These are some examples of choosing an effective visual for your
requirements.
4. Keep it simple
Simple plots are easy to read. We can remove unnecessary backgrounds to make things
stand out, and we should not include too much content in the plot. A title, axis names, scales
and legends are usually enough.
FAQ
● Tuning hyperparameters
● Monitoring the model’s performance
● Cleaning data
● Validating the model’s assumptions
3. What are the major challenges of data visualization?
● Choosing the right plot type
● Identifying the needs of your audience
● Developing the research question and converting it into a data science question
● Collecting data
4. What are the benefits of data visualization?
Residuals and Residual Plots
● In statistics, a residual is the difference between a variable's observed value and the
variable's predicted value based on a statistical or ML model. In other words, in
regression models, a residual measures how far away a point is from the regression line.
● A residual plot is used to identify the underlying patterns in the residual values. We can
assess the ML model's validity based on the observed patterns.
Based on the patterns observed in residual values, there are several types of residual plots, as
described below:
Random Pattern
● In this category of residual plots, residual values are randomly distributed, and there is no
visible pattern in the values. In this case, the developed ML model is considered a good
fit.
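A quick way to see the random pattern numerically, using made-up data: residuals from a least-squares fit that includes an intercept average to zero and scatter randomly around the fitted line.

```python
import numpy as np

# Made-up linear data with random noise (seeded for reproducibility).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=50)

# Fit a line and compute the residuals (observed minus predicted).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# For a least-squares fit with an intercept, the residuals average to zero.
print(abs(residuals.mean()) < 1e-8)  # True
```

Plotting these residuals against x would show no visible pattern, which is the signature of a good fit.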
U-Shaped Pattern
● In this category, the residual plot follows a U-shaped curve. In this case, the model is not
considered a good fit, and a non-linear model might be required.
Normality
● The residuals are assumed to be normally distributed. If the residuals are not normally
distributed, the model may not fully capture the relationships among the features in the
data.
Homoscedasticity
● This is also called the constant variance assumption: the variance of the error term or
residuals is assumed to be constant across values of the target variable, i.e., the residuals
show the same spread across the target variable's range.
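One rough, informal check of homoscedasticity (an illustration with made-up residuals, not a formal test such as Breusch-Pagan) is to compare the residual spread across the range of fitted values:

```python
import numpy as np

# Made-up fitted values and constant-variance residuals (seeded).
rng = np.random.default_rng(7)
fitted = np.linspace(0.0, 10.0, 200)
residuals = rng.normal(scale=1.0, size=200)

# Compare residual spread in the lower and upper halves of the fitted values.
spread_low = residuals[fitted < 5].std()
spread_high = residuals[fitted >= 5].std()
ratio = max(spread_low, spread_high) / min(spread_low, spread_high)
print(ratio)  # close to 1 for homoscedastic residuals
```

A ratio far above 1 would suggest the spread grows or shrinks with the fitted values, i.e., heteroscedasticity, which would show up as a funnel shape in the residual plot.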
Residual Plot Analysis
● Residual plot analysis is used to assess the validity of linear regression models by
plotting the residuals and checking whether the assumptions of linear regression models
are met. The most important assumption of a linear regression model is that the error
terms or residuals are independent and normally distributed.
● A good residual plot shows a high density of points near the X-axis: points are
concentrated near the horizontal axis and become less dense away from it.
In a good residual plot, if the residuals are projected onto the vertical axis, they follow a normal
distribution; in this case, the model is considered a good fit. In a bad residual plot, the error
terms follow a skewed distribution; in this case, the model is not considered a good fit.