Introduction
Model Building:
This term paper is an empirical analysis of car average mileage. It presents the results of an
analysis of the factors that affect average miles per gallon. The model is built on a research
hypothesis, and the research hypothesis is drawn from the data. The data set on car average
mileage was provided to us by our course instructor.
Table: 01
Data Set: Car Mileage
Observation  Average miles per gallon (Y)  Cubic feet of cab space (x1)  Top speed, miles per hour (x2)  Vehicle weight, hundreds of pounds (x3)
1 65.4 89 96 17.5
2 56 92 97 20
3 55.9 92 97 20
4 49 92 105 20
5 46.5 92 96 20
6 46.2 89 105 20
7 45.4 92 97 20
8 59.2 50 98 22.5
9 53.3 50 98 22.5
10 43.4 94 107 22.5
11 41.1 89 103 22.5
12 40.9 50 113 22.5
13 40.9 99 113 22.5
14 40.4 89 103 22.5
15 39.6 89 100 22.5
16 39.3 89 103 22.5
17 38.9 91 106 22.5
18 38.8 50 113 22.5
19 38.2 91 106 22.5
20 42.2 103 109 25
21 40.9 99 110 25
22 40.7 107 101 25
23 40 101 111 25
24 39.3 96 105 25
25 38.8 89 111 25
26 38.4 50 110 25
27 38.4 117 110 25
28 38.4 99 110 25
29 46.9 104 90 27.5
30 36.3 107 112 27.5
31 36.1 114 103 27.5
32 36.1 101 103 27.5
33 35.4 97 111 27.5
34 35.3 113 111 27.5
35 35.1 101 102 27.5
36 35.1 98 106 27.5
37 35 88 106 27.5
38 33.2 86 109 30
39 32.9 86 109 30
40 32.3 92 120 30
41 32.2 113 106 30
The independent variables are cubic feet of cab space (X₁), top speed (X₂) and vehicle weight
(X₃).
Descriptive statistics
Table: 02
Statistic            Average miles   Cubic feet of   Top speed,       Vehicle weight,
                     per gallon      cab space       miles per hour   hundreds of pounds
Mean                 41.40           90.98           105.39           24.39
Standard Error       1.18            2.72            0.96             0.51
Median               39.30           92.00           106.00           25.00
Mode                 40.90           89.00           103.00           22.50
Standard Deviation   7.54            17.40           6.14             3.25
Sample Variance      56.79           302.67          37.69            10.56
Kurtosis             2.00            1.69            -0.01            -0.78
Skewness             1.46            -1.38           -0.26            0.05
Range                33.20           67.00           30.00            12.50
Minimum              32.20           50.00           90.00            17.50
Maximum              65.40           117.00          120.00           30.00
Sum                  1697.40         3730.00         4321.00          1000.00
Sample size          41.00           41.00           41.00            41.00
Here the sample size is 41, which means that 41 observations were recorded. The data set is based
on these 41 observations.
The mean provides a measure of central location for the data. Here the mean is 41.40 for average
miles per gallon (y), 90.98 for cubic feet of cab space (x₁), 105.39 for top speed (x₂) and 24.39
for vehicle weight (x₃).
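The central-location figures can be checked directly from Table 01. The following is a minimal sketch using only Python's standard library; the variable names (`mpg`, `cab_space`, `top_speed`, `weight`) and the helper `describe` are illustrative choices, not part of the original analysis, which appears to come from spreadsheet output.

```python
import statistics

# Values copied from Table 01 (41 observations per variable).
mpg = [65.4, 56, 55.9, 49, 46.5, 46.2, 45.4, 59.2, 53.3, 43.4, 41.1, 40.9, 40.9, 40.4,
       39.6, 39.3, 38.9, 38.8, 38.2, 42.2, 40.9, 40.7, 40, 39.3, 38.8, 38.4, 38.4, 38.4,
       46.9, 36.3, 36.1, 36.1, 35.4, 35.3, 35.1, 35.1, 35, 33.2, 32.9, 32.3, 32.2]
cab_space = [89, 92, 92, 92, 92, 89, 92, 50, 50, 94, 89, 50, 99, 89,
             89, 89, 91, 50, 91, 103, 99, 107, 101, 96, 89, 50, 117, 99,
             104, 107, 114, 101, 97, 113, 101, 98, 88, 86, 86, 92, 113]
top_speed = [96, 97, 97, 105, 96, 105, 97, 98, 98, 107, 103, 113, 113, 103,
             100, 103, 106, 113, 106, 109, 110, 101, 111, 105, 111, 110, 110, 110,
             90, 112, 103, 103, 111, 111, 102, 106, 106, 109, 109, 120, 106]
# Weights appear in Table 01 in sorted blocks, so they are written compactly here.
weight = [17.5] + [20] * 6 + [22.5] * 12 + [25] * 9 + [27.5] * 9 + [30] * 4


def describe(values):
    """Return sample size, mean, median and sample standard deviation (n - 1 divisor)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


summary = {
    name: describe(values)
    for name, values in [("mpg", mpg), ("cab_space", cab_space),
                         ("top_speed", top_speed), ("weight", weight)]
}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed means and medians can be compared line by line with Table 02.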
The standard error in Table 02 is the standard error of the mean, that is, the sample standard
deviation divided by the square root of the sample size. It measures how precisely the sample
mean estimates the population mean. It is not the distance of the observed values from the
regression line; that quantity is the standard error of the estimate reported in Table 04. Here,
the standard error of average miles per gallon (y) is 1.18, the standard error of cubic feet of
cab space (x₁) is 2.72, the standard error of top speed (x₂) is 0.96, and the standard error of
vehicle weight (x₃) is 0.51.
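The standard errors in Table 02 are consistent with the usual formula for the standard error of the mean, s / √n. A short sketch, using the standard deviations and sample size reported in Table 02 (the dictionary keys and function name are illustrative):

```python
import math

n = 41  # sample size from Table 02
# Sample standard deviations reported in Table 02.
std_dev = {"mpg": 7.54, "cab_space": 17.40, "top_speed": 6.14, "weight": 3.25}


def standard_error_of_mean(s, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return s / math.sqrt(n)


std_err = {name: round(standard_error_of_mean(s, n), 2) for name, s in std_dev.items()}
print(std_err)  # compare with the Standard Error row of Table 02
```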
The standard deviation measures the typical distance between a single observation and the
mean. For average miles per gallon (y) the standard deviation is 7.54, about 18 percent of the
mean, so mileage varies considerably across the sample.
For cubic feet of cab space (x₁) the standard deviation is 17.40, the largest of the four in
absolute terms and about 19 percent of the mean.
The standard deviation of top speed (x₂) is 6.14, about 6 percent of the mean, so top speed
varies relatively little.
For vehicle weight (x₃) the standard deviation is 3.25, the smallest in absolute terms and about
13 percent of the mean.
Because the variables are measured in different units, their standard deviations are not directly
comparable. Moreover, the spread of a variable does not by itself show whether homoscedasticity
holds: that property concerns the variance of the regression residuals, which is examined in the
heteroscedasticity test.
Skewness measures the direction and degree of asymmetry in the data distribution. The
skewness of average miles per gallon (y) is 1.46, which is positive. The mean (41.40) is greater
than the median (39.30), so the data are skewed to the right.
The skewness of cubic feet of cab space (x₁) is -1.38, which is negative. The mean (90.98) is
less than the median (92.00), so the data are skewed to the left.
The skewness of top speed (x₂) is -0.26. The mean (105.39) is slightly less than the median
(106.00), so the data are only slightly skewed to the left.
The skewness of vehicle weight (x₃) is 0.05, which is close to zero, so the distribution is
approximately symmetric. Here the mean (24.39) is in fact slightly below the median (25.00).
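As an illustration of how a skewness figure of this kind is obtained, the sketch below applies the adjusted sample skewness formula, n / ((n - 1)(n - 2)) · Σ((x - x̄) / s)³, to the mileage column of Table 01. That Table 02 was produced with this particular formula is an assumption (it is the one commonly used by spreadsheet software); the function name `sample_skewness` is illustrative.

```python
import statistics

# Average miles per gallon, copied from Table 01.
mpg = [65.4, 56, 55.9, 49, 46.5, 46.2, 45.4, 59.2, 53.3, 43.4, 41.1, 40.9, 40.9, 40.4,
       39.6, 39.3, 38.9, 38.8, 38.2, 42.2, 40.9, 40.7, 40, 39.3, 38.8, 38.4, 38.4, 38.4,
       46.9, 36.3, 36.1, 36.1, 35.4, 35.3, 35.1, 35.1, 35, 33.2, 32.9, 32.3, 32.2]


def sample_skewness(values):
    """Adjusted sample skewness (assumed to match the spreadsheet definition)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation, n - 1 divisor
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


skew = sample_skewness(mpg)
mean_above_median = statistics.mean(mpg) > statistics.median(mpg)
print(round(skew, 2), mean_above_median)
```

A positive result together with a mean above the median is what a right-skewed variable is expected to show.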
Interpretation of coefficients
Table: 03
Coefficients
Intercept (bₒ) 129.7168027
Cubic feet of cab space -0.060212413
Top speed, miles per hour -0.491517252
Vehicle weight, hundreds of pounds -1.272550564
Here, the dependent variable is average miles per gallon (y). The independent variables are cubic
feet of cab space (x₁), top speed (x₂) and vehicle weight (x₃). The values of bₒ, b₁, b₂ and b₃
have been found by ordinary least squares (OLS), a linear least squares method that estimates the
unknown parameters of a linear regression model by minimizing the sum of squared residuals. The
assumption that the errors are normally distributed is not needed for the estimates themselves;
it is needed for the t and F tests that follow.
bₒ = 129.72 is the y-intercept. It is the predicted value of average miles per gallon when cubic
feet of cab space (x₁), top speed (x₂) and vehicle weight (x₃) are all equal to zero, that is,
ŷ = 129.72. No observation in the sample is close to these values, so the intercept has no
practical interpretation on its own.
b₁ = -0.06 indicates that if cubic feet of cab space increases by 1 cubic foot, average miles per
gallon decreases by 0.06, on average, holding the other independent variables (x₂ and x₃)
constant. There is a negative relationship between average miles per gallon and cubic feet of cab
space (x₁).
b₂ = -0.49 indicates that if top speed increases by 1 mile per hour, average miles per gallon
decreases by 0.49, on average, holding the other independent variables (x₁ and x₃) constant.
Average miles per gallon and top speed are inversely related.
b₃ = -1.27 means that if vehicle weight increases by 1 unit (100 pounds), average miles per
gallon decreases by 1.27, on average, holding the other independent variables (x₁ and x₂)
constant. Average miles per gallon is negatively related to vehicle weight.
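To make the interpretation concrete, the estimated equation ŷ = bₒ + b₁x₁ + b₂x₂ + b₃x₃ can be evaluated for a given car. The sketch below uses the coefficients from Table 03 and the first observation from Table 01; the function name `predict_mpg` is illustrative.

```python
# Coefficients from Table 03.
B0 = 129.7168027    # intercept
B1 = -0.060212413   # cubic feet of cab space
B2 = -0.491517252   # top speed, miles per hour
B3 = -1.272550564   # vehicle weight, hundreds of pounds


def predict_mpg(cab_space, top_speed, weight):
    """Predicted average miles per gallon from the fitted equation."""
    return B0 + B1 * cab_space + B2 * top_speed + B3 * weight


# Observation 1 in Table 01: x1 = 89, x2 = 96, x3 = 17.5 (observed y = 65.4).
first = predict_mpg(89, 96, 17.5)
# Same car, 100 pounds heavier, everything else held constant.
heavier = predict_mpg(89, 96, 18.5)

print(round(first, 2))            # predicted mileage, about 54.90
print(round(heavier - first, 2))  # change equals b3, about -1.27
```

The second printed value shows the "holding other variables constant" reading of b₃: only the weight changes, and the prediction moves by exactly the weight coefficient.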
Figure: Line fit plots of average miles per gallon and predicted average miles per gallon against cubic feet of cab space, top speed (miles per hour) and vehicle weight (hundreds of pounds).
Table: 04
Regression Statistics
Multiple R 0.85
R Square 0.72
Adjusted R Square 0.70
Standard Error 4.11
Observations 41.00
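R Square is 0.72, so about 72 percent of the variation in average miles per gallon is explained
by the three independent variables (70 percent after adjusting for the number of variables). The
line fit plots also show that the actual values of y and the predicted values of y are reasonably
close, which indicates that this model is a good fit.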
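The figures in Table 04 follow from the sums of squares in the ANOVA table (Table 06). A brief sketch of the standard formulas, with n = 41 observations and p = 3 independent variables:

```python
import math

n, p = 41, 3                         # observations and independent variables
ssr, sse = 1645.324523, 626.275477   # regression and residual sums of squares (Table 06)
sst = ssr + sse                      # total sum of squares

r2 = ssr / sst                                   # R Square
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # Adjusted R Square
multiple_r = math.sqrt(r2)                       # Multiple R
std_error = math.sqrt(sse / (n - p - 1))         # standard error of the estimate

print(round(multiple_r, 2), round(r2, 2), round(adj_r2, 2), round(std_error, 2))
```

The four printed values correspond to the Multiple R, R Square, Adjusted R Square and Standard Error rows of Table 04.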
Hypothesis test
T test:
Table: 05
                                     Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept                            129.72         11.75            11.04    0.00      105.91      153.52
Cubic feet of cab space              -0.06          0.04             -1.51    0.14      -0.14       0.02
Top speed, miles per hour            -0.49          0.12             -4.11    0.00      -0.73       -0.25
Vehicle weight, hundreds of pounds   -1.27          0.24             -5.33    0.00      -1.76       -0.79
1. We want to test whether the relationship between average miles per gallon (y) and cubic feet
of cab space (x₁) is statistically significant.
Hypothesis:
Hₒ : B₁ = 0 [ There is no significant relationship between average miles per gallon (y)
and cubic feet of cab space (x₁) ]
Ha : B₁ ≠ 0
t₁ = b₁ / Sb₁
= -0.06 / 0.04
= -1.5 (Table 05 reports -1.51, computed from the unrounded values)
Critical value approach:
df = n - p - 1 = 41 - 3 - 1 = 37, α = 0.01, tα/₂ = 2.715
Rejection rule: reject Hₒ if t ≤ -tα/₂ or t ≥ tα/₂.
Here -2.715 < -1.5 < 2.715, so the t statistic does not fall in the rejection region and the
null hypothesis cannot be rejected at the 1 percent level of significance.
p-value approach:
p-value = 0.14, α = 0.01
As the p-value is greater than α, the null hypothesis cannot be rejected at the 1 percent level
of significance.
As a result, we cannot conclude that B₁ differs from 0: there is no statistically significant
relationship between average miles per gallon and cubic feet of cab space at the 1 percent level.
2. We want to test whether the relationship between average miles per gallon (y) and top speed,
miles per hour (x₂), is statistically significant.
Hypothesis:
Hₒ : B₂ = 0 [ There is no significant relationship between average miles per gallon (y) and
top speed (x₂) ]
Ha : B₂ ≠ 0
t₂ = b₂ / Sb₂
= -0.49 / 0.12
= -4.08 (Table 05 reports -4.11, computed from the unrounded values)
Critical value approach:
df = 37, α = 0.01, tα/₂ = 2.715
Rejection rule: reject Hₒ if t ≤ -tα/₂ or t ≥ tα/₂.
Here -4.08 ≤ -2.715, so the t statistic falls in the rejection region and the null hypothesis is
rejected at the 1 percent level of significance.
p-value approach:
p-value < 0.01 (reported as 0.00 after rounding), α = 0.01
As the p-value is less than α, the null hypothesis is rejected at the 1 percent level of
significance.
As a result, we can conclude that B₂ ≠ 0, so the relationship between average miles per gallon
(y) and top speed (x₂) is statistically significant.
3. We want to test whether the relationship between average miles per gallon (y) and vehicle
weight, hundreds of pounds (x₃), is statistically significant.
Hypothesis:
Hₒ : B₃ = 0 [ There is no significant relationship between average miles per gallon (y) and
vehicle weight (x₃) ]
Ha : B₃ ≠ 0
t₃ = b₃ / Sb₃
= -1.27 / 0.24
= -5.29 (Table 05 reports -5.33, computed from the unrounded values)
Critical value approach:
df = 37, α = 0.01, tα/₂ = 2.715
Rejection rule: reject Hₒ if t ≤ -tα/₂ or t ≥ tα/₂.
Here -5.29 ≤ -2.715, so the t statistic falls in the rejection region and the null hypothesis is
rejected at the 1 percent level of significance.
p-value approach:
p-value < 0.01 (reported as 0.00 after rounding), α = 0.01
As the p-value is less than α, the null hypothesis is rejected at the 1 percent level of
significance.
As a result, we can conclude that B₃ ≠ 0, so there is a statistically significant relationship
between average miles per gallon (y) and vehicle weight (x₃).
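The three tests share the same mechanics: divide the coefficient by its standard error and compare the absolute value with the two-tailed critical value. A minimal sketch using the rounded figures from Table 05 and the critical value 2.715 quoted in the text (the names `estimates` and `t_test` are illustrative):

```python
# Two-tailed critical value for df = 37 and alpha = 0.01, as quoted in the text.
T_CRITICAL = 2.715

# (coefficient, standard error) pairs as rounded in Table 05.
estimates = {
    "cab_space": (-0.06, 0.04),
    "top_speed": (-0.49, 0.12),
    "weight": (-1.27, 0.24),
}


def t_test(coef, std_err, t_crit=T_CRITICAL):
    """Return the t statistic and whether H0: B = 0 is rejected (two-tailed)."""
    t_stat = coef / std_err
    return t_stat, abs(t_stat) >= t_crit


results = {name: t_test(b, se) for name, (b, se) in estimates.items()}
for name, (t_stat, reject) in results.items():
    print(f"{name}: t = {t_stat:.2f}, reject H0 = {reject}")
```

Because the inputs are rounded to two decimals, the t statistics differ slightly from the t Stat column of Table 05, but the decisions are the same.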
F test
Table: 06
ANOVA
             df   SS            MS            F             Significance F
Regression   3    1645.324523   548.4415077   32.40161324   0.00
Residual     37   626.275477    16.92636424
Total        40   2271.6
We want to test whether a significant relationship exists between average miles per gallon (y)
and the set of independent variables: cubic feet of cab space (x₁), top speed, miles per hour (x₂)
and vehicle weight, hundreds of pounds (x₃).
Hypothesis:
Hₒ : B₁ = B₂ = B₃ = 0 [ Average miles per gallon is not explained by any of the
independent variables (x₁, x₂ and x₃) ]
Ha : At least one of B₁, B₂ and B₃ is not equal to 0
F = MSR / MSE
= 548.44 / 16.93
= 32.4
Critical value approach:
Rejection rule: reject Hₒ if F ≥ Fα, with 3 and 37 degrees of freedom and α = 0.01.
Here F ≥ Fα. As the F statistic is greater than the F critical value, we can reject the null
hypothesis.
p-value approach:
p-value < 0.01 (reported as 0.00 after rounding), α = 0.01
p-value ≤ α
As the p-value is smaller than α, we can reject the null hypothesis.
As a result, we can conclude that a significant relationship is present between average miles per
gallon (y) and cubic feet of cab space (x₁), top speed, miles per hour (x₂) and vehicle weight,
hundreds of pounds (x₃). This indicates that the model is significant overall.
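The F statistic can be rebuilt from the sums of squares and degrees of freedom in Table 06. A short sketch of that arithmetic:

```python
n, p = 41, 3                         # observations and independent variables
ssr, sse = 1645.324523, 626.275477   # sums of squares from Table 06

msr = ssr / p             # mean square regression, df = p
mse = sse / (n - p - 1)   # mean square error, df = n - p - 1
f_stat = msr / mse

print(round(msr, 2), round(mse, 2), round(f_stat, 2))
```

The three values correspond to the MS and F columns of the ANOVA table.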
Multicollinearity
Table: 07
Pairwise correlation among variables
                            Average miles   Cubic feet of   Top speed,       Vehicle weight,
                            per gallon      cab space       miles per hour   hundreds of pounds
Average miles per gallon    1
Cubic feet of cab space     -0.32           1
Top speed, miles per hour   -0.64           0.01            1
Here, we check the pairwise correlation coefficients among the independent variables to see
whether the model has a multicollinearity problem.
According to the rule of thumb, a sample correlation coefficient greater than +0.7 or less than
-0.7 for two independent variables is a warning of potential problems with multicollinearity,
because if two independent variables are highly correlated, it is not possible to determine the
separate effect of either one on the dependent variable.
The correlation coefficient between cubic feet of cab space (x₁) and top speed (x₂), Rx₁x₂, is
0.01, far below 0.7. The two variables are practically uncorrelated, so there is no
multicollinearity problem between cubic feet of cab space and top speed.
The correlation coefficient between top speed (x₂) and vehicle weight (x₃), Rx₂x₃, is 0.44, which
is less than 0.7. They are correlated, but not excessively. As a result, there is no
multicollinearity problem between top speed and vehicle weight.
The correlation coefficient between cubic feet of cab space (x₁) and vehicle weight (x₃), Rx₁x₃,
is 0.32, which is less than 0.7. They are correlated, but not excessively. As a result, there is
no multicollinearity problem between cubic feet of cab space and vehicle weight.
As no pair of independent variables exceeds the ±0.7 limit, the pairwise check indicates no
multicollinearity problem in this model.
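The pairwise check can be sketched as follows: compute the Pearson correlation for each pair of independent variables from Table 01 and flag any pair beyond the ±0.7 rule of thumb. The helper `pearson_r` and the variable names are illustrative; the 0.7 threshold is the one stated in the text.

```python
import math
from itertools import combinations

# Independent variables copied from Table 01.
data = {
    "cab_space": [89, 92, 92, 92, 92, 89, 92, 50, 50, 94, 89, 50, 99, 89,
                  89, 89, 91, 50, 91, 103, 99, 107, 101, 96, 89, 50, 117, 99,
                  104, 107, 114, 101, 97, 113, 101, 98, 88, 86, 86, 92, 113],
    "top_speed": [96, 97, 97, 105, 96, 105, 97, 98, 98, 107, 103, 113, 113, 103,
                  100, 103, 106, 113, 106, 109, 110, 101, 111, 105, 111, 110, 110, 110,
                  90, 112, 103, 103, 111, 111, 102, 106, 106, 109, 109, 120, 106],
    # Weights appear in Table 01 in sorted blocks, so they are written compactly.
    "weight": [17.5] + [20] * 6 + [22.5] * 12 + [25] * 9 + [27.5] * 9 + [30] * 4,
}

THRESHOLD = 0.7  # rule-of-thumb limit used in the text


def pearson_r(x, y):
    """Pearson sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


correlations = {(a, b): pearson_r(data[a], data[b]) for a, b in combinations(data, 2)}
for (a, b), r in correlations.items():
    flag = "warning" if abs(r) > THRESHOLD else "ok"
    print(f"{a} vs {b}: r = {r:.2f} ({flag})")
```

A pairwise check of this kind only detects collinearity between two variables at a time; it is shown here because it is the method the text uses.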
Heteroscedasticity test
Figure: Residual plot against average miles per gallon.
Heteroscedasticity refers to unequal variability: the variance of the error term is not constant
across observations.
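One rough way to look at this, sketched below, is to compute the residuals from the fitted equation and compare their average squared size in the lower and upper halves of the fitted values. This is an informal comparison under stated assumptions (data from Table 01, coefficients from Table 03, an arbitrary split into halves), not a formal test and not necessarily how the residual plot above was produced.

```python
# Data copied from Table 01.
mpg = [65.4, 56, 55.9, 49, 46.5, 46.2, 45.4, 59.2, 53.3, 43.4, 41.1, 40.9, 40.9, 40.4,
       39.6, 39.3, 38.9, 38.8, 38.2, 42.2, 40.9, 40.7, 40, 39.3, 38.8, 38.4, 38.4, 38.4,
       46.9, 36.3, 36.1, 36.1, 35.4, 35.3, 35.1, 35.1, 35, 33.2, 32.9, 32.3, 32.2]
cab_space = [89, 92, 92, 92, 92, 89, 92, 50, 50, 94, 89, 50, 99, 89,
             89, 89, 91, 50, 91, 103, 99, 107, 101, 96, 89, 50, 117, 99,
             104, 107, 114, 101, 97, 113, 101, 98, 88, 86, 86, 92, 113]
top_speed = [96, 97, 97, 105, 96, 105, 97, 98, 98, 107, 103, 113, 113, 103,
             100, 103, 106, 113, 106, 109, 110, 101, 111, 105, 111, 110, 110, 110,
             90, 112, 103, 103, 111, 111, 102, 106, 106, 109, 109, 120, 106]
weight = [17.5] + [20] * 6 + [22.5] * 12 + [25] * 9 + [27.5] * 9 + [30] * 4

# Coefficients from Table 03.
B0, B1, B2, B3 = 129.7168027, -0.060212413, -0.491517252, -1.272550564

fitted = [B0 + B1 * x1 + B2 * x2 + B3 * x3
          for x1, x2, x3 in zip(cab_space, top_speed, weight)]
residuals = [y - f for y, f in zip(mpg, fitted)]

# Order the residuals by fitted value and split them into two halves
# (the middle observation is left out because n is odd).
ordered = [r for _, r in sorted(zip(fitted, residuals))]
half = len(ordered) // 2
low, high = ordered[:half], ordered[-half:]


def mean_square(values):
    """Average squared residual, used here as a rough measure of spread."""
    return sum(v * v for v in values) / len(values)


ratio = mean_square(high) / mean_square(low)
print(round(sum(residuals), 4))  # close to zero for an OLS fit with an intercept
print(round(ratio, 2))           # far from 1 suggests unequal residual spread
```

A ratio close to 1 would be consistent with constant variance; a ratio well above or below 1 points toward heteroscedasticity and would justify a formal test.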
Normality assumption
Figure: Normal probability plot (horizontal axis: sample percentile).
Figure: Residual plot (horizontal axis: average miles per gallon).
If the errors are normally distributed, about 95 percent of the standardized residuals should lie
between -2 and +2. Here more than 80 percent of the standardized residuals lie in that range.
Therefore, on the basis of the standardized residuals, this plot gives us no strong reason to
question the assumption that the error term e has a normal distribution, and we treat the
normality assumption as approximately satisfied.
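The share of standardized residuals between -2 and +2 can be computed as in the sketch below. It uses a simplified standardization, dividing each residual by the standard error of the estimate from Table 04 (4.11) and ignoring leverage, so spreadsheet or statistical-package output may differ slightly. Data come from Table 01 and coefficients from Table 03; variable names are illustrative.

```python
# Data copied from Table 01.
mpg = [65.4, 56, 55.9, 49, 46.5, 46.2, 45.4, 59.2, 53.3, 43.4, 41.1, 40.9, 40.9, 40.4,
       39.6, 39.3, 38.9, 38.8, 38.2, 42.2, 40.9, 40.7, 40, 39.3, 38.8, 38.4, 38.4, 38.4,
       46.9, 36.3, 36.1, 36.1, 35.4, 35.3, 35.1, 35.1, 35, 33.2, 32.9, 32.3, 32.2]
cab_space = [89, 92, 92, 92, 92, 89, 92, 50, 50, 94, 89, 50, 99, 89,
             89, 89, 91, 50, 91, 103, 99, 107, 101, 96, 89, 50, 117, 99,
             104, 107, 114, 101, 97, 113, 101, 98, 88, 86, 86, 92, 113]
top_speed = [96, 97, 97, 105, 96, 105, 97, 98, 98, 107, 103, 113, 113, 103,
             100, 103, 106, 113, 106, 109, 110, 101, 111, 105, 111, 110, 110, 110,
             90, 112, 103, 103, 111, 111, 102, 106, 106, 109, 109, 120, 106]
weight = [17.5] + [20] * 6 + [22.5] * 12 + [25] * 9 + [27.5] * 9 + [30] * 4

# Coefficients from Table 03 and standard error of the estimate from Table 04.
B0, B1, B2, B3 = 129.7168027, -0.060212413, -0.491517252, -1.272550564
STD_ERROR = 4.11

residuals = [y - (B0 + B1 * x1 + B2 * x2 + B3 * x3)
             for y, x1, x2, x3 in zip(mpg, cab_space, top_speed, weight)]

# Simplified standardized residuals: residual / standard error of the estimate.
standardized = [r / STD_ERROR for r in residuals]

within = sum(1 for z in standardized if -2 <= z <= 2)
share = within / len(standardized)
print(f"{within} of {len(standardized)} standardized residuals lie in [-2, 2] ({share:.0%})")
```

The printed share can be set against the roughly 95 percent expected under normally distributed errors.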
Conclusion
After evaluating the regression result, we found the following:
1. Two of the three independent variables, top speed and vehicle weight, are statistically
significant according to the t test; cubic feet of cab space is not.
2. The model is a good fit according to the R² value.
3. According to the F test, the model is significant overall.
4. The model does not have a multicollinearity problem.
5. The normality assumption is treated as satisfied, as more than 80 percent of the standardized
residuals lie between -2 and +2.
6. The model has a heteroscedasticity problem.
We cannot say that the model is completely reasonable, because it does not satisfy the constant
variance assumption; the main problem in this model is heteroscedasticity. We have to find a
remedy for it. Without correction, the standard errors, and therefore the t and F tests, are not
reliable. If we cannot correct the problem, we have to use another model, for example by adding
more variables, and check whether that model meets all the conditions of a reasonable model.
All the other properties of a reasonable model are fulfilled, but the constant variance property
is not. Under heteroscedasticity the OLS estimator remains unbiased but is no longer efficient,
and its usual standard errors are biased, so inference based on it is not reliable. That is why
we have to look for a better model.