
Business Statistics-2

Covariance
• Covariance is a measure of the direction of the linear
relationship between two variables.
• A positive covariance indicates a positive relationship
between the two variables, while a negative covariance
indicates a negative relationship between the two
variables.
• A covariance of 0 indicates no linear relationship between
the two variables.
Illustration
• The details of the number of commercials and sales
(Rs. Million) are given in the following table. Find the
covariance of the two variables.
• Week No. Of Commercials Sales (Rs. Million)
1 2 50
2 5 57
3 1 41
4 3 54
5 4 54
6 1 38
7 5 63
8 3 48
9 4 59
10 2 46
Calculations of Covariance
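• The calculation can be verified with a short Python sketch (not part of the original slides), using the data from the table:

```python
# Number of commercials (x) and weekly sales in Rs. million (y) from the table above
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: divide the sum of cross-deviations by (n - 1)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
print(cov_xy)  # 11.0 with the (n - 1) divisor; positive, so the relationship is positive
```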
Correlation coefficient
• The correlation coefficient is similar to covariance; the
difference is that the correlation coefficient describes not
only the direction of the linear relationship between
two variables but also the strength of that
relationship.
• The correlation coefficient ranges from −1 to
+1. Values close to −1 or +1 indicate a strong
linear relationship. The closer the correlation
is to zero, the weaker the relationship.
Correlation coefficient -
Formula
• Pearson's correlation coefficient divides the covariance of X and Y by the
product of their standard deviations:

r = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² · Σ(y − ȳ)² ]
Correlation coefficient
• The formula returns a value between −1 and +1,
where:
• +1 indicates a perfect positive linear relationship.
• −1 indicates a perfect negative linear relationship.
• 0 indicates no linear relationship at all.
Illustration
• Calculate the correlation co-efficient from the
following table:
Date Stock price (X) Stock price (Y)
6/7/2021 1103.99 345.50
6/8/2021 1086.78 352.75
6/9/2021 1076.90 343.35
6/10/2021 1090.06 344.75
6/11/2021 1133.00 350.75
6/14/2021 1139.30 355.95
6/15/2021 1148.60 352.70
6/16/2021 1117.15 349.35
6/17/2021 1103.75 345.65
6/18/2021 1092.30 337.40
6/21/2021 1106.15 334.30
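• A sketch (not part of the original slides) of how the correlation coefficient for these two price series can be computed in Python using the Pearson formula above:

```python
import math

# Daily closing prices of the two stocks from the table above
x = [1103.99, 1086.78, 1076.90, 1090.06, 1133.00, 1139.30,
     1148.60, 1117.15, 1103.75, 1092.30, 1106.15]
y = [345.50, 352.75, 343.35, 344.75, 350.75, 355.95,
     352.70, 349.35, 345.65, 337.40, 334.30]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Pearson correlation: cross-deviations divided by the product of the deviation sums
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # a value between -1 and +1
```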
Normal distribution
• It is a distribution of a continuous random variable.
• Owing to the major contributions of the German
mathematician and astronomer Carl Friedrich Gauss, it is also
known as the Gaussian distribution.
• It is a very important distribution in statistics because it
comes close to fitting the actual observed frequency
distributions of many phenomena, including human
characteristics (weights, heights, and IQs), test scores,
scientific measurements, amounts of rainfall, and other
similar values.
• The form, or shape, of the normal distribution is
illustrated by the bell-shaped normal curve.
Characteristics of the Normal
Probability Distribution
• The curve has a single peak;
it is unimodal.
• The mean lies at the centre
of the curve.
• The mean, median, and mode
all have the same value.
• The two tails of the curve never
touch the horizontal axis.
Normal Probability Distribution
The Standard Normal
Distribution

• The standard normal distribution is a
normal distribution with mean = 0 and standard
deviation = 1.
• The variable in the standard normal distribution is
denoted z.
• For standard normal distribution, areas under
the normal curve have been computed and
are available in tables that can be used to
compute probabilities.
Standard scores or z score

• Standard scores are expressed in standard
deviation units, making it much easier to
compare variables measured on different scales.
• A standard score, or z score, tells you how many
standard deviations a value is away from the mean:

z = (x − μ) / σ

• If the z-score is 0, the value equals the mean.
• If the z-score is positive, the value lies above the mean.
• If the z-score is negative, the value lies below the
mean.
Illustration
• MRF Tire Company has developed a new steel-belted
radial tire that it wants to sell nationwide. Before
launching it in the market, the manager wants to
know the probability that a tire gives a
mileage of more than 40,000 miles.
• From actual road tests with the tires, MRF's
engineering group estimated that the mean tire
mileage is μ = 36,500 miles and that the standard
deviation is σ = 5,000 miles. In addition, the data collected
indicate that a normal distribution is a reasonable
assumption.
Illustration
• Solution:
• z = (40,000 − 36,500) / 5,000 = 0.70. From the standard normal table,
P(Z ≤ 0.70) = 0.7580, so the probability that a tyre gives a mileage of more than
40,000 miles is 1 − 0.7580 = 0.2420. This
means about 24.2% of tyres will give more than 40,000 miles of mileage.
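• The same answer can be checked numerically; a minimal sketch (not from the slides, assuming scipy is installed):

```python
from scipy.stats import norm

mu, sigma = 36_500, 5_000   # mean and standard deviation of tire mileage (miles)
x = 40_000

z = (x - mu) / sigma        # z-score: standard deviations above the mean
p_more = 1 - norm.cdf(z)    # P(mileage > 40,000) = 1 - P(Z <= 0.70)

print(z, p_more)            # 0.7 and about 0.2420 (roughly 24.2% of tyres)
```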
Chebyshev's theorem

• Chebyshev's theorem holds for any distribution; for the bell-shaped
normal curve, a more detailed analysis gives a more precise result,
the empirical rule. According to this:
• About 68 percent of the values in the population will
fall within ±1 standard deviation of the mean.
• About 95 percent of the values will lie within ±2
standard deviations of the mean.
• About 99.7 percent of the values will be in an interval
ranging from 3 standard deviations below the mean to
3 standard deviations above the mean.
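• These percentages can be verified from the standard normal distribution; a minimal sketch (not part of the original slides, assuming scipy is installed):

```python
from scipy.stats import norm

# Probability of falling within +/- k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(k, round(p * 100, 1))  # about 68.3%, 95.4%, 99.7%
```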
Introduction to linear
regression
What is regression?

• Regression is a statistical method used in finance,
investing, and other disciplines that attempts to
determine the strength and character of the
relationship between one dependent variable
(usually denoted by Y) and a series of other
variables (known as independent variables, such
as X1, X2, …, Xn).
• Regression is a statistical procedure that
determines the equation for the straight line that
best fits a specific set of data.
Regression Analysis

• How well a set of data points fits a straight
line can be measured by calculating the
distance between the data points and the line.
• The total error between the data points and
the line is obtained by squaring each distance
and then summing the squared values.
• The regression equation is designed to
produce the minimum sum of squared errors.
Regression Analysis

• Regression analysis is based on the relationship,
or association, between two (or more)
variables.
• The variable that is being predicted is called the
dependent variable.
• The variable (or variables) being used
to predict the value of the dependent variable
is called the independent variable(s).
Regression Line
• Regression line between the scores and grade point.
Types of Relationships
Regression Equation
• In regression analysis, we develop an estimating
equation— that is, a mathematical formula that relates
the known variables to the unknown variable:

Ŷ = a + bX

• The a is called the Y-intercept because its value is the
point at which the regression line crosses the Y-axis.
• b, the slope, represents how much each unit change of
the independent variable X changes the dependent
variable Y.
Application of Regression
• Prediction of target variable for example to
predict the stock price, sales volume, profitability,
scores etc.
• Modeling the relationship between the
dependent and independent variables. For
example finding the relationship between
employee productivity and profitability.
• Review and understand how different variables
impact all of these things.
• Testing of hypothesis
Fitting a linear model to data

• To a statistician, the line has a “good fit” if it
minimizes the error between the estimated points
on the line and the actual observed points
that were used to draw it.
• From now on, we use Ŷ (Y hat) to symbolize the
individual values of the estimated points—
that is, the points that lie on the estimating
line.
Fitting a linear model to data
Least Square Method
• The best way to minimize the errors is to apply the
least-squares method.
• The estimating line that minimizes the sum of the
squares of the errors is found by what we call the least-squares
method.
• The "least squares" method is a form of mathematical
regression analysis used to determine the line of best
fit for a set of data by minimizing the sum of the
squares of the errors.
• Each point of data represents the relationship between
a known independent variable and an unknown
dependent variable.
Least Square Regression
Equation
• Statisticians have developed two equations to
identify the slope (b) and the Y-intercept (a) of the
best-fitting regression line. The first formula
calculates the slope:

b = (ΣXY − n·X̄·Ȳ) / (ΣX² − n·X̄²)
Least Square Regression
Equation
• The second formula calculates the Y-intercept:

a = Ȳ − b·X̄
Illustration
• Suppose a municipality wants to know the
relationship and the effect of the age of a truck on its
repair expense, based on the following collected
data.

Truck Number   Age of Truck (Years)   Repair Expense during Last Year (Rs. Hundreds)
101            5                      7
102            3                      7
103            3                      6
104            1                      4
Solution
• First identify the means and sums: X̄ = 12/4 = 3, Ȳ = 24/4 = 6,
ΣXY = 78, and ΣX² = 44.

b = (78 − 4·3·6) / (44 − 4·3²)
  = (78 − 72) / (44 − 36)
  = 6 / 8
  = 0.75

a = Ȳ − b·X̄ = 6 − 0.75 × 3 = 3.75

Thus the equation of the estimating line (see the code sketch below) is

Ŷ = 3.75 + 0.75X
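• A minimal Python sketch (not part of the original slides) that reproduces this hand calculation for the truck data:

```python
# Least-squares fit of repair expense (Y, Rs. hundreds) on truck age (X, years)
x = [5, 3, 3, 1]
y = [7, 7, 6, 4]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Slope: b = (sum XY - n*X-bar*Y-bar) / (sum X^2 - n*X-bar^2)
b = (sum(xi * yi for xi, yi in zip(x, y)) - n * mean_x * mean_y) / \
    (sum(xi ** 2 for xi in x) - n * mean_x ** 2)
# Intercept: a = Y-bar - b*X-bar
a = mean_y - b * mean_x

print(a, b)  # expected: 3.75 and 0.75, i.e. Y-hat = 3.75 + 0.75X
```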
Standard error of estimates
• To measure the reliability of the estimating equation,
statisticians have developed the standard error of estimate.
• This standard error is symbolized se and is similar to the
standard deviation.
• The standard deviation is used to measure the dispersion of a
set of observations about the mean.
• The standard error of estimate, on the other hand, measures
the variability, or scatter, of the observed values around the
regression line.

Se = √[ Σ(Y − Ŷ)² / (n − 2) ]
Standard error of estimates

• The larger the standard error of estimate, the
greater the scattering (or dispersion) of points
around the regression line.
• Conversely, if Se = 0, we expect the estimating
equation to be a “perfect” estimator of the
dependent variable.
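• Continuing the truck illustration, a short sketch (not from the slides) that computes Se from the residuals of the fitted line Ŷ = 3.75 + 0.75X:

```python
import math

# Truck data and the fitted line Y-hat = 3.75 + 0.75X from the earlier illustration
x = [5, 3, 3, 1]
y = [7, 7, 6, 4]
a, b = 3.75, 0.75

y_hat = [a + b * xi for xi in x]                       # estimated values on the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squared errors

n = len(y)
se = math.sqrt(sse / (n - 2))                          # standard error of estimate
print(round(se, 3))                                    # about 0.866 for this data
```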
Standard error of estimates
Coefficient of Determination
(r square)
• The coefficient of determination measures the
proportion of variation in Y that is explained by
the variation in the independent variable X in the
regression model. The range of r square is from 0
to 1 and the greater the value, the more the
variation in Y in the regression model can be
explained by the variation in X.
• The coefficient of determination is equal to the
regression sum of squares (i.e., explained
variation) divided by the total sum of squares
(i.e., total variation).
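• As a sketch (not part of the slides), r square for the truck illustration can be computed as the explained variation divided by the total variation:

```python
# Coefficient of determination for the truck regression (Y-hat = 3.75 + 0.75X)
x = [5, 3, 3, 1]
y = [7, 7, 6, 4]
a, b = 3.75, 0.75

mean_y = sum(y) / len(y)
y_hat = [a + b * xi for xi in x]

ss_regression = sum((yh - mean_y) ** 2 for yh in y_hat)  # explained variation
ss_total = sum((yi - mean_y) ** 2 for yi in y)           # total variation

print(ss_regression / ss_total)  # r squared = 0.75 for this data
```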
Benefits of Regression
• Operation efficiency: Companies use this application to optimize
the business process.
• Supporting decisions: Many companies and their top managers
today are using regression analysis (and other kinds of data
analytics) to make an informed business decision and eliminate
guesswork and gut intuition.
• Correcting errors: Even the most informed and careful managers
do make mistakes in judgment. Regression analysis helps managers,
and businesses in general, recognize and correct errors.
• New Insights: Looking at the data can provide new and fresh
insights. Many businesses gather lots of data about their customers.
But that data is meaningless without proper regression analysis,
which can help find the relationship between different variables to
uncover patterns.
Correlation Vs. Regression

Correlation
• Signifies the degree of relationship between the two variables.
• It is limited to two variables only.
• The variables are interchangeable; the relationship is symmetrical.
• Cannot be used for prediction.

Regression
• Indicates the causal relationship between variables.
• It can involve more than two variables.
• The independent and dependent variables cannot be interchanged.
• Used for prediction.
Business Statistics
Books for business statistics

• Levin, D., Stephan, D., & Szabat, K.; “Statistics for
Managers”: Pearson Publication
• Anderson, Sweeney and Williams; “Statistics for
Business and Economics”: Cengage Learning,
2001(11e).
• Ken Black; “Business Statistics: For Contemporary
Decision Making”, Wiley Publication
• Srivastava, T. & Rego, S; “Statistics for
Management”: McGraw Hill Education
Introduction
• What is business statistics?

• Why business statistics?

• What are the features of business statistics?

• What are the functions of business statistics?


What is business statistics?

• It is the process of analyzing, categorizing,
interpreting and compiling data.
• Business administrative professionals make
inferences from these data sets regarding
products, markets and consumers to help
organizations make informed decisions.
• Various statistical models can then be derived
from this data to interpret trends and gain
insights.
Why business statistics?

To understand and write:
• Business articles in newspapers and magazines
• Television reports on economy, industry and
business affairs
• Business research
• Research papers in journals
• Project reports
Uses of business statistics

• To summarize the business data

• To draw conclusions from business data

• To improve business process

• To make reliable forecasts about business
activities
Features of business
statistics

• Numerical expression

• Capable of comparison and connection

• Systematic arrangement of figures

• Definite purpose

• Reasonable standard of accuracy


Functions of business
statistics

• Simplification of data and figures


• Facilitates evaluation
• Helps in prediction
• Formulation and testing of hypothesis
• Formulation of suitable policies
• Trend market and investment behavior
Application of statistics

• Accounting: Sample testing makes the audit very
easy and less time-consuming.
• Finance: Data analysis of stocks helps in
investment decision making.
• Marketing: Consumer purchasing data can reveal
consumer behaviour, which can be used for
marketing of products.
• Production: Demand data are used to predict
future demand, which helps in production
planning.
Basic Vocabulary of
Business Statistics

• Variables

• Data

• Population

• Sample

• Parameter
Variables

• A variable is a characteristic or condition that
can change or take on different values.

• Most research begins with a general question
about the relationship between two variables
for a specific group of individuals.
Population

• The entire group of individuals is called the
population.
• For example, a researcher may be interested
in the relation between class size (variable 1)
and academic performance (variable 2) for the
population of third standard children.
Sample

• Usually, populations are so large that a
researcher cannot examine the entire
group. Therefore, a sample is selected to
represent the population in a research
study. The goal is to use the results
obtained from the sample to help answer
questions about the population.
Population & Sample

• Selection of the sample is very important because, from the
analysis of the sample data, hypotheses are tested and
statistical inferences are drawn for the whole population.
• For example, a COVID vaccine is tested on selected
people (the sample), and based on their results an
inference is drawn about the impact of the vaccine on the whole
population.
• Similarly, in election exit polls, only the
choices of a few people are asked, and based on that an
inference is drawn about which political party is going to
take responsibility of the state or nation.
Population Vs. Sample
Population - Sample
Branches of business
statistics

Descriptive Statistics
• Collection of data
• Summarizing of data
• Presentation of data
• Analysis of data

Inferential Statistics
• Drawing a conclusion
• Collecting data from a small group to draw conclusions
about a larger group
• Making an estimation
Branches of business statistics

• Descriptive statistics: the representation of
the data in the form of tables, graphs, charts,
etc., that makes understanding of the data
easier.
• Descriptive statistics reveals the important
characteristics of the data, such as central
tendency, measures of variability, and shape.
Branches of business statistics

• Inferential statistics: statistical inference involves
generalizations and statements about the probability of
their validity.
• The methods and techniques of statistical inference
can also be used in a branch of statistics called decision
theory.
• Knowledge of decision theory is very helpful for
managers because it is used to make decisions under
conditions of uncertainty.
• For example, a manufacturer of stereo sets cannot
specify precisely the demand for its products.
Central Tendency-Mean
• Mean: The mean of a set of observations is
their average. It is equal to the sum of all
observations divided by the number of
observations in the set, which is popularly
known as arithmetic mean.
Central Tendency-Median
• The median is the middle number in a sorted,
ascending or descending, list of numbers and
can be more descriptive of that data set than
the average. The median is sometimes used as
opposed to the mean when there are outliers
in the sequence.
• Extreme values do not affect the median,
making the median a good alternative to the
mean when such values exist in the data.
Mean vs. Median
• Using the collected data, you compute the mean to discover the
“typical” time it takes for you to get ready. For these data:
Day 1 2 3 4 5 6 7 8 9 10
Time (Minutes) 39 29 43 52 39 44 40 31 44 35

• Mean = 396/10=39.6
• The original mean, 39.6 minutes, had a middle, or central, position
among the data values: 5 of the times were less than that mean
and 5 were greater than that mean.
Day 1 2 3 4 5 6 7 8 9 10
Time (Minutes) 39 29 103 52 39 44 40 31 44 35
• New Mean = 456/10 = 45.6
• In contrast, the mean computed using the extreme value is greater than 9 of
the 10 times, making the new mean a poor measure of central
tendency.
Median

You compute the median by following one of two rules:


Rule 1: If the data set contains an odd number of values,
the median is the measurement associated with the
middle-ranked value.
Rule 2: If the data set contains an even number of
values, the median is the measurement associated with
the average of the two middle-ranked values.
Median - Illustration
• Nutritional data about a sample of seven breakfast
cereals (stored in Cereals ) includes the number of
calories per serving. Compute the median number of
calories in breakfast cereals.
Cereal Calories
Kellogg’s All Bran 80
Kellogg’s Corn Flakes 150
Wheaties 100
Nature’s Path Organic Multigrain Flakes 170
Kellogg’s Rice Krispies 110
Post Shredded Wheat Vanilla Almond 160
Kellogg’s Mini Wheats 180

• Median position = (7 + 1)/2 = 4th ranked value, so the median = 150 calories.


Median - Illustration
• Find the median of the ten get-ready times from the earlier example:

• Median position = (10 + 1)/2 = 5.5th, i.e. the average of the
5th and 6th ranked values = (39 + 40)/2 = 39.5 minutes
Median - Illustration
• Using the collected data that include the extreme value (103 minutes),
you compute the mean and the median of the time it takes to get
ready. For these data:

• Mean = 45.6
• Median position = (10 + 1)/2 = 5.5th, giving a median of (39 + 40)/2 = 39.5,
which is unaffected by the extreme value
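• A quick sketch (not part of the original slides) confirming how the outlier shifts the mean but not the median:

```python
import statistics

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]            # original ten times
times_outlier = [39, 29, 103, 52, 39, 44, 40, 31, 44, 35]   # day 3 replaced by 103

print(statistics.mean(times), statistics.median(times))                  # 39.6 39.5
print(statistics.mean(times_outlier), statistics.median(times_outlier))  # 45.6 39.5
```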
Mode
• The mode is the value that appears most frequently. Like the
median and unlike the mean, extreme values do not affect the
mode. For a particular variable, there can be several modes or no
mode at all.
• For example, for the sample of 10 times to get ready in the
morning:
29 31 35 39 39 40 43 44 44 52
• There are two modes, 39 minutes and 44 minutes, because each of
these values occurs twice.
• However, for this sample of 14 smartphone prices offered by a
cellphone provider (stored in Smartphones ):
56 71 73 74 90 179 213 217 219 225 240 250 500 513
• There is no mode. None of the values is “most typical” because
each value appears the same number of times (once) in the data
set.
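• As a sketch (not from the slides), statistics.multimode in Python 3.8+ reproduces this behaviour:

```python
import statistics

times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
prices = [56, 71, 73, 74, 90, 179, 213, 217, 219, 225, 240, 250, 500, 513]

print(statistics.multimode(times))   # [39, 44] -- two modes
print(statistics.multimode(prices))  # every value appears once, so all are returned
```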
Geometric Mean
• The Geometric Mean (GM) is the average value or
mean which signifies the central tendency of the set of
numbers by finding the product of their values.
Basically, we multiply the numbers altogether and take
the nth root of the multiplied numbers, where n is the
total number of data values. For example: for a given
set of two numbers such as 3 and 1, the geometric
mean is equal to √(3×1) = √3 = 1.732.

• In other words, the geometric mean is defined as the
nth root of the product of n numbers.
Illustration
• What is the geometric mean of 2, 3, and 6?
First, multiply the numbers together and then
take the cube root (because there are three
numbers): (2 × 3 × 6)^(1/3) = 36^(1/3) ≈ 3.30.
• Note: The power of (1/3) is the same as the
cube root ∛. To convert an nth root to this
notation, just change the denominator in the
fraction to whatever “n” you have.
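• A minimal sketch of the same calculations (statistics.geometric_mean is available in Python 3.8+):

```python
import statistics

print(statistics.geometric_mean([3, 1]))     # sqrt(3), about 1.732
print(statistics.geometric_mean([2, 3, 6]))  # 36 ** (1/3), about 3.30
print((2 * 3 * 6) ** (1 / 3))                # the same value computed directly
```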
Variance
• It shows the dispersion of the data from mean.
• Every population has a variance, which is
symbolized by σ² (sigma squared).
• Population variance is calculated by dividing the
sum of the squared distances between the mean
and each item in the population by the total
number of items in the population.
• By squaring each distance, we make each number
positive and, at the same time, assign more
weight to the larger deviations (deviation is the
distance between the mean and a value).
Variance
• Formula for Variance
• The variance (σ²) is defined as the sum of the
squared distances of each term in the
distribution from the mean (μ), divided by the
number of terms in the distribution (N):

σ² = Σ(x − μ)² / N
Standard Deviation
• The standard deviation is a measure of the amount
of variation or dispersion of a set of values.
• A low standard deviation indicates that the values
tend to be close to the mean (also called the
expected value) of the set, while a high standard
deviation indicates that the values are spread out
over a wider range.
• Population SD: σ = √[ Σ(x − μ)² / N ]          Sample SD: s = √[ Σ(x − x̄)² / (n − 1) ]
Illustration
• From the below given stock price of the Mahindra
& Mahindra Ltd, find the average stock price,
average stock return and deviation.
Date Adj Close Date Adj Close
5/23/2021 796 6/7/2021 804
5/24/2021 810 6/8/2021 808
5/25/2021 811 6/9/2021 804
5/26/2021 823 6/10/2021 807
5/27/2021 829 6/11/2021 809
5/28/2021 846 6/14/2021 807
5/31/2021 808 6/15/2021 809
6/1/2021 806 6/16/2021 807
6/2/2021 806 6/17/2021 805
6/3/2021 802 6/18/2021 788
6/4/2021 804
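• A sketch (not part of the original slides) of how these quantities could be computed; statistics.stdev implements the sample formula above:

```python
import statistics

# Adjusted closing prices of Mahindra & Mahindra Ltd from the table above
prices = [796, 810, 811, 823, 829, 846, 808, 806, 806, 802,
          804, 804, 808, 804, 807, 809, 807, 809, 807, 805, 788]

# Daily returns: percentage change from one day to the next
returns = [(b - a) / a for a, b in zip(prices, prices[1:])]

print(statistics.mean(prices))    # average stock price
print(statistics.mean(returns))   # average daily return
print(statistics.stdev(prices))   # sample standard deviation of prices
print(statistics.stdev(returns))  # sample standard deviation (risk) of returns
```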
Illustration
• From the below given stock price of the SUN
Pharma and Dr. Reddy Lab, find their average
stock return and deviation.
Date SUN Pharma-SP Date Dr. Reddy SP
6/4/2021 674.20 6/4/2021 5255.00
6/7/2021 675.05 6/7/2021 5219.20
6/8/2021 678.85 6/8/2021 5274.75
6/9/2021 672.80 6/9/2021 5222.80
6/10/2021 676.10 6/10/2021 5292.05
6/11/2021 681.25 6/11/2021 5453.00
6/14/2021 677.25 6/14/2021 5461.35
6/15/2021 673.20 6/15/2021 5410.85
6/16/2021 669.10 6/16/2021 5406.10
6/17/2021 664.70 6/17/2021 5286.50
6/18/2021 668.40 6/18/2021 5283.90
Data representation
• Data can be classified as either categorical or
quantitative.
• Categorical data use labels or names to identify
categories of like items.
• Quantitative data are numerical values that indicate
how much or how many.
• Tabular and graphical displays can be found in annual
reports, newspaper articles, and research studies.
• Data visualization is a term often used to describe the
use of graphical displays to summarize and present
information about a data set.
Frequency Distribution

• A frequency distribution is a tabular summary of data
showing the number (frequency) of observations in
each of several non-overlapping categories or
classes.
• Example: Data from a sample of 50 soft drink
purchases.
Frequency Distribution

• A relative frequency distribution gives a tabular summary of
data showing the relative frequency for each class.
• Percent frequency distribution summarizes the percent
frequency of the data for each class.
Frequency Distribution
Bar Charts
• A bar chart is a graphical display for depicting
categorical data summarized in a frequency, relative
frequency, or percent frequency distribution.
Pie Charts
• Pie chart provides another graphical display
for presenting relative frequency and percent
frequency distributions for categorical data.
Frequency distribution for
quantitative data
• To construct a frequency table, we divide the
observations into classes or categories. The
number of observations in each category is called
the frequency of that category.
• When dealing with Quantitative data (data that is
numerical in nature), the categories into which
we group the data may be defined as a range or
an interval of numbers, such as 0 − 10 or they
may be single outcomes (depending on the
nature of the data).
Frequency distribution for
quantitative data
• Number of classes/categories: Classes are formed
by specifying ranges that will be used to group
the data.
• Width of the classes: approximately (largest data value − smallest data value) ÷ number of classes.
• Class limits: Class limits must be chosen so that
each data item belongs to one and only one class.
• The lower class limit identifies the smallest
possible data value assigned to the class. The
upper class limit identifies the largest possible
data value assigned to the class.
Illustration
• Total number of days spent in audit in each
year-end is given in the table. Assume a class
size of 5. Make a frequency table and also
show dot plot and histogram.
Illustration
• Number of classes = 5
• Class width = (33−12)/5 = 21/5 = 4.2 ≈5
• Class intervals: 10-14; 15-19; 20-24; 25-29; 30-
34
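• The audit-time table itself is not reproduced here, so the sketch below (not from the slides) uses hypothetical placeholder values only to show how the frequencies for these classes could be tallied with numpy:

```python
import numpy as np

# Hypothetical audit times (days) -- placeholder values, NOT the original table
audit_days = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
              22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Class intervals 10-14, 15-19, 20-24, 25-29, 30-34 (class width 5)
bins = [10, 15, 20, 25, 30, 35]
freq, edges = np.histogram(audit_days, bins=bins)

for lo, hi, f in zip(edges[:-1], edges[1:], freq):
    print(f"{lo}-{hi - 1}: {f}")  # frequency of each class
```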
Illustration
• One of the simplest graphical summaries of
data is a dot plot.
• A horizontal axis shows the range for the data.
Each data value is represented by a dot placed
above the axis.
• Example, Audit time data
Histogram
• Histogram can be prepared for data previously
summarized in either a frequency, relative
frequency, or percent frequency distribution.
Scatter Diagram and
Trend line
• A scatter diagram is a graphical display of the
relationship between two quantitative variables.
• A trend line is a line that provides an approximation of
the relationship.
• Example: Consider the advertising/sales relationship
for a stereo and sound equipment store in San
Francisco. On 10 occasions during the past three
months, the store used weekend television
commercials to promote sales at its stores. The
managers want to investigate whether a relationship
exists between the number of commercials shown and
sales at the store during the following week.
Scatter Diagram and
Trend line

TYPES OF RELATIONSHIPS DEPICTED BY
SCATTER DIAGRAMS
Multiple regression
Introduction
• In multiple regression and correlation we use more than one independent
variable to investigate the dependent variable.
• In multiple regression and correlation analysis, the process consists of
three steps:
• Describe the multiple-regression equation.
• Examine the multiple-regression standard error of estimate.
• Use multiple-correlation analysis to determine how well the regression
equation describes the observed data.
• In addition, in multiple regression, we can look at each individual
independent variable and test whether it contributes significantly to the
way the regression describes the data.
MULTIPLE-REGRESSION EQUATION
Here, we have more than one independent
variable, so we use X1 and X2 to represent the
variables.
Thus the equation of the line of estimation
will be

Ŷ = a + b1X1 + b2X2
Illustration
• The Internal Revenue Service (IRS) is trying to estimate the
monthly amount of unpaid taxes discovered by its auditing
division. In the past, the IRS estimated this figure on the
basis of the expected number of field-audit labour hours. In
recent years, however, field-audit labour hours have
become an unreliable predictor of the actual unpaid taxes.
As a result, the IRS is looking for another factor with which
it can improve the estimating equation.
• The auditing division does keep a record of the number of
hours its computers are used to detect unpaid taxes. Could
we combine this information with the data on field-audit
labour hours and come up with a more accurate estimating
equation for the unpaid taxes discovered each month?
Illustration
Month       Field-Audit Labor Hours   Computer Hours   Actual Unpaid Taxes Discovered
January     45                        16               29
February    42                        14               24
March       44                        15               27
April       45                        13               25
May         43                        13               26
June        46                        14               28
July        44                        16               30
August      45                        16               28
September   44                        15               28
October     43                        15               27
• The auditing division can use the fitted regression equation each month to
estimate the amount of unpaid taxes it will discover.
Interpretation of R squared
• After fitting a linear regression model, you need
to determine how well the model fits the data.
Does it do a good job of explaining changes in the
dependent variable?
• R-squared is a goodness-of-fit measure for linear
regression models.
• R-squared measures the strength of the
relationship between your model and the
dependent variable on a convenient 0 – 100%
scale.
Interpretation of R squared
• R-squared evaluates the scatter of the data points
around the fitted regression line. It is also called
the coefficient of determination, or the
coefficient of multiple determination for multiple
regression.
• For the same data set, higher R-squared values
represent smaller differences between the
observed data and the fitted values.
• R-squared is the percentage of the dependent
variable variation that a linear model explains.
Interpretation of R squared
• R-squared is always between 0 and 100%:
• 0% represents a model that does not explain any of the
variation in the response variable around its mean. The
mean of the dependent variable predicts the
dependent variable as well as the regression model.
• 100% represents a model that explains all the variation
in the response variable around its mean.
• Usually, the larger the R2, the better the regression
model fits your observations. However, this guideline
has important caveats, discussed below.
Interpretation of R squared

Plot-A and Plot-B: two scatter plots with fitted regression lines.
• The R-squared for the regression model on the
left is 15% (Plot-A), and for the model on the
right it is 85% (Plot-B).
Are Low R-squared Values
Always a Problem?
• No! Regression models with low R-squared values
can be perfectly good models for several reasons.
• Some fields of study have an inherently greater
amount of unexplainable variation.
• Fortunately, if you have a low R-squared value
but the independent variables are statistically
significant, you can still draw important
conclusions about the relationships between the
variables.
P-Values
• P-values and coefficients in regression analysis
work together to tell you which relationships
in your model are statistically significant and
the nature of those relationships.
• The coefficients describe the mathematical
relationship between each independent
variable and the dependent variable.
• The p-values for the coefficients indicate
whether these relationships are statistically
significant or not.
Standard Error of Estimates
• The general form of the multiple-regression equation is

Ŷ = a + b1X1 + b2X2 + … + bkXk

• Standard Error of Estimate:

Se = √[ Σ(Y − Ŷ)² / (n − k − 1) ]

• Where
• Y = sample values of the dependent variable
• Ŷ = corresponding estimated values from the
regression equation
• n = number of data points
• k = number of independent variables
Standard Error of Estimates

• The denominator of this equation indicates that in
multiple regression with k independent variables, the
standard error has n − k − 1 degrees of freedom.
• This occurs because the degrees of freedom are
reduced from n by the k + 1 numerical constants, a, b1,
b2, . . . , bk, that have all been estimated from the same
sample.
Calculation of Standard
Error of Estimates
Month       Field-Audit Labor Hours   Computer Hours   Actual Unpaid Taxes Discovered
January     45                        16               29
February    42                        14               24
March       44                        15               27
April       45                        13               25
May         43                        13               26
June        46                        14               28
July        44                        16               30
August      45                        16               28
September   44                        15               28
October     43                        15               27
Calculation of SEE
Y x1 x2 ŷ (Y-ŷ) (Y-ŷ)^2
29 45 16 29.136 -0.136 0.02
24 42 14 25.246 -1.246 1.55
27 44 15 27.473 -0.473 0.22
25 45 13 25.839 -0.839 0.70
26 43 13 24.711 1.289 1.66
28 46 14 27.502 0.498 0.25
30 44 16 28.572 1.428 2.04
28 45 16 29.136 -1.136 1.29
28 44 15 27.473 0.527 0.28
27 43 15 26.909 0.091 0.01
Here the SEE is 1.071.
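• A sketch (not from the slides, assuming numpy is installed) that fits the same two-variable regression by least squares and recomputes the standard error of estimate; Se should come out close to the 1.071 shown above:

```python
import numpy as np

# IRS data: x1 = field-audit labour hours, x2 = computer hours, y = unpaid taxes discovered
x1 = np.array([45, 42, 44, 45, 43, 46, 44, 45, 44, 43], dtype=float)
x2 = np.array([16, 14, 15, 13, 13, 14, 16, 16, 15, 15], dtype=float)
y  = np.array([29, 24, 27, 25, 26, 28, 30, 28, 28, 27], dtype=float)

# Design matrix with a column of ones for the intercept a
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)  # [a, b1, b2]

y_hat = X @ coeffs
sse = np.sum((y - y_hat) ** 2)
n, k = len(y), 2
se = np.sqrt(sse / (n - k - 1))                 # standard error of estimate

print(coeffs)        # intercept and slopes of the fitted equation
print(round(se, 3))  # should be close to 1.07, as in the table
```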
Principle of parsimony
• The principle of parsimony is attributed to the early
14th-century English nominalist philosopher, William of
Occam, who insisted that, given a set of equally good
explanations for a given phenomenon, the correct
explanation is the simplest explanation.
• It is called Occam's razor because he ‘shaved’ his
explanations down to the bare minimum: his point was
that in explaining something, assumptions must not be
needlessly multiplied.
• In particular, for the purposes of explanation, things not
known to exist should not, unless it is absolutely
necessary, be postulated as existing.
Principle of parsimony
• For statistical modeling, the principle of parsimony
means that:
• Models should have as few parameters as possible.
• Linear models should be preferred to non-linear
models.
• Experiments relying on few assumptions should be
preferred to those relying on many.
• Models should be pared down until they are minimal
adequate.
• Simple explanations should be preferred to complex
explanations.
