
MKTG 4110 Class 6

DESCRIPTIVE STATISTICS ANALYSIS


 Summary statistics
 Correlations (t-test)
 One sample mean (t-test)
 Two sample means (t-test)
 Cross-tabulation (Chi-square test)
 Plots

LIMITATIONS OF DESCRIPTIVE STATISTICS
 Correlation coefficient can give misleading information
 Difficult to control for other factors
 Hard to make predictions

Example: SALES, PRICE, AND ADVERTISING


 A manager wants to know the effect of pricing on sales
 A firm's sales would be influenced by its pricing and advertising decisions
 Oftentimes, when the firm lowers the price, it also advertises heavily
 Are the sales driven by pricing, ads, or both? What is the accurate size of each effect?
 REGRESSION ANALYSIS
 A model can provide a better understanding of the world & allow you to make better decisions

Sales = a + b*Price + c*Advertising + e

 Relationship between the dependent variable (Y: Sales) and the independent variables (X: Price & Advertising)
 Predict what the outcome of interest (Y: Sales) would be
 Control for confounding factors (e.g., Advertising),
o isolating the effect of interest (e.g., Price), as the sketch below illustrates
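A minimal R sketch of this idea, using simulated data (all numbers and variable names are invented for illustration): when price cuts and heavy advertising happen together, a regression on price alone mixes the two effects, while a multiple regression recovers each one.

set.seed(123)
n     <- 200
price <- runif(n, 5, 15)
ad    <- 100 - 4 * price + rnorm(n, sd = 5)             # ads rise when price falls
sales <- 500 - 20 * price + 3 * ad + rnorm(n, sd = 10)  # true effects: -20 and 3

coef(lm(sales ~ price))       # price coefficient is distorted (absorbs the ad effect)
coef(lm(sales ~ price + ad))  # close to the true effects, -20 and 3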

Simple Linear Regression

 Y = β0 + β1*X + ε (the familiar "y = mx + b" line, plus an error term)

How do you estimate the coefficients?

 We want to find the 𝛽0 and 𝛽1 that
minimize the sum of squared errors
 Finding the best line that fits the data
(minimizing the sum of squared errors)
 The estimation will be done by
statistical programs (RStudio), as sketched below
 Never do it by hand
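A small sketch of what the software is doing under the hood (simulated data; the true coefficients 2 and 3 are invented): optim() searches numerically for the intercept and slope that minimize the sum of squared errors, and lands on the same answer lm() computes exactly.

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

sse <- function(b) sum((y - b[1] - b[2] * x)^2)  # sum of squared errors for a candidate line
optim(c(0, 0), sse)$par                          # numerical minimizer: approx. (2, 3)
coef(lm(y ~ x))                                  # lm() solves the same problem exactly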

Intuition of a coefficient estimation in a simple regression

 You are trying to see how X and Y move together
 As X changes, how does Y change?
 In a simple regression, the slope is β1 = Cov(X, Y) / Var(X): the co-movement of X and Y, scaled by how much X moves on its own (see the sketch below)
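Continuing the same simulated example, a sketch of this intuition:

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

b1 <- cov(x, y) / var(x)      # slope: how X and Y move together, per unit of X's own variation
b0 <- mean(y) - b1 * mean(x)  # intercept: forces the line through (mean of x, mean of y)
c(b0, b1)                     # identical to coef(lm(y ~ x))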

R Square
 We use R-squared to measure how well the model performs
 That is, how much of the variance in the dependent variable is explained by the independent variables

Total Sum of Squares (TSS)

 The total variation of Y around its mean
 Take each Y, subtract out the mean, square the difference,
and sum it up: TSS = Σ(Yi − Ȳ)²

Sum of Squared Errors (SSE)
 The part of the variation the model leaves unexplained
 Sum the squared error terms: SSE = Σ(Yi − Ŷi)²

How much (%) of the variance in Y is explained by the independent variables?
 R² = 1 − SSE / TSS

R-squared is a measure between 0 and 1 (0% – 100%)

 e.g., 8% of the variance explained by the independent variables vs. 90% (see the sketch below)
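A sketch computing R-squared by hand on the simulated data from before, to make the formula concrete:

set.seed(1)
x   <- rnorm(100)
y   <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)   # Total Sum of Squares: variation of Y around its mean
sse <- sum(residuals(fit)^2)  # Sum of Squared Errors: variation the model leaves unexplained
1 - sse / tss                 # R-squared
summary(fit)$r.squared        # matches R's reported value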

REGRESSIONS AND R-SQUARED


 How much emphasis to place on the R-squared mostly depends on your task
 If your task is all about prediction, a higher R-squared would be great
 If your task is mostly about inference (e.g., estimating the coefficients accurately),
o the R-squared could be low but your model could still be fine

Example: YEARLY FUEL COST (YFC) AND VEHICLE WEIGHT

YFC = 370 + 0.675*Weight


 For every one-unit increase in weight, there is
an increase of $0.675 in yearly fuel cost
 If weight is 0, the predicted yearly cost is still 370
 This is illogical, but that is a limitation of the model outside the range of the data


NULL AND ALTERNATIVE HYPOTHESES


 We perform hypothesis testing for each coefficient separately

 Null hypothesis: There is no effect (i.e., coefficient is zero)

 Alternative hypothesis: There is an effect (i.e., coefficient is NOT zero)

Just as we calculated the standard error of a mean and t-statistics for descriptive analysis,
we can calculate standard errors of the regression coefficients

YearlyFuelCost = 370 + 0.675*Weight

 Coefficient of interest: b = 0.675
 Standard error of this coefficient: se = 0.20
 se(b) = √( σ² / Σ(xi − x̄)² )
 Sigma squared (σ²), estimated from the sum of squared
errors, is the numerator: what is not explained in the model
 The denominator is the variation of x across the n observations
 The larger the n and the variance of x, the smaller the standard
error becomes (checked in the sketch below)
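A sketch verifying this on the simulated data from before (the data are invented; the formula is the standard one for a simple regression):

set.seed(1)
x   <- rnorm(100)
y   <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

sigma2 <- sum(residuals(fit)^2) / df.residual(fit)  # error variance: the unexplained part
se_b1  <- sqrt(sigma2 / sum((x - mean(x))^2))       # shrinks as n or the spread of x grows
se_b1
summary(fit)$coefficients["x", "Std. Error"]        # matches R's reported standard error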

Intuition of SE of a coefficient
 The more variance you have in your independent variable, the smaller the standard error of a coefficient becomes.

 t-stat = (estimate − null) / std error = (0.675 − 0) / 0.20 ≈ 3.4

 The t-stat is larger than 2
 If the null were true (b = 0), it would be very unlikely to observe b = 0.675
 So the null hypothesis is not plausible (we reject the
null hypothesis)
 Thus, b is statistically significant
Three Ways to Reject the Null
1. Critical Values (reject if |t-stat| exceeds the critical value)
2. P Values (reject if the p-value is smaller than 0.05)
3. Confidence Intervals (reject if 0 is outside the confidence interval)

P Value
 At the 5% level, the "critical value" is |t-stat| = 1.96 (or simply |2|), which corresponds to p-value = 0.05
 P-VALUE: the probability of observing an estimate as extreme as 𝜷 if the null hypothesis is true

Confidence Intervals
 We can form 95% confidence intervals using the standard errors and see whether 0 is included
 If 0 is outside the confidence interval, reject the null hypothesis (see the sketch below)
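All three ways can be read off standard R output; a sketch with the simulated data from before:

set.seed(1)
x   <- rnorm(100)
y   <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

summary(fit)$coefficients  # "t value" column (compare with ~2); "Pr(>|t|)" column (compare with 0.05)
confint(fit, level = 0.95) # reject the null for a coefficient if 0 lies outside its interval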

Coefficients:
These are the estimates for a and b.
Coefficients measure the amount the dependent variable increases for a one-unit increase in the
independent variable, holding constant all other independent variables

MULTIPLE REGRESSION Interpretation


 It is straightforward to add additional continuous variables to the regression.
 What differs is the interpretation

Sales = a + b*Price + c*AdDollars

 When many things vary at the same time,
it is hard to see the impact of each factor using descriptive statistics or visualization
 Multiple regression lets you look at the isolated effect of one variable
 A price increase of 1 unit changes sales by "b" while controlling for the advertising dollars
Categorical Variables in Regression
 Categorical variables with two levels
 Store Type A (outlet) vs. B (retail store)
 These will be coded as 0 or 1 in the data
 These are called binary (dummy) variables
Regression results example
 Sales = 5.75 + 2.5*StoreTypeA (outlet)
 Store type A (outlet) sales are estimated to be 2.5 units
more than store type B (retail)

 Sales = 8.25 - 2.5*StoreTypeB (retail)


 Store type B (retail) sales are estimated to be 2.5 units
less than store type A (outlet)

For a categorical variable of two levels, you can only include one dummy variable

Example 1: Store type A (outlet) and B (retail store)
 Can only include one dummy indicating outlet (A) or retail store (B)
 Model: Sales = 5.75 + 2.5*StoreTypeA

Example 2: Male and Female
 Can only include one dummy indicating male or female
 Model: Spending = 125 + 30*Female
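A sketch of the two store-type codings above in R (the data are simulated to match the example numbers; "store" is an invented variable): they are the same model with a different left-out baseline, so the fitted means agree.

set.seed(2)
store <- factor(rep(c("A_outlet", "B_retail"), each = 50))
sales <- 8.25 - 2.5 * (store == "B_retail") + rnorm(100, sd = 0.5)

coef(lm(sales ~ store))                             # baseline A: intercept ~8.25, B dummy ~ -2.5
coef(lm(sales ~ relevel(store, ref = "B_retail")))  # baseline B: intercept ~5.75, A dummy ~ +2.5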

Perfect multicollinearity issue

- You can't include all the dummies in the regression
- If you have L dummies for L categories, including a constant term in the regression
guarantees perfect multicollinearity
- Since the dummies always sum to one (they move together), we cannot estimate a coefficient for each dummy
- One should be left out -> the interpretation will be relative to this left-out dummy
- If you have a variable with L categories, use L-1 dummy variables (see the sketch below)
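A sketch with an invented three-level factor: R builds the design matrix with L-1 = 2 dummies automatically, leaving one category out as the baseline.

region <- factor(c("East", "West", "South", "East", "West", "South"))
model.matrix(~ region)  # intercept plus 2 dummies; "East" (first level) is the left-out baseline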

Dummy Variable
 You will interpret the dummy variables based on what is left out of the model
 The interpretation is relative to the left-out category
 When we include a store type A dummy (e.g., 1 if store A; 0 if store B),
o the coefficient is interpreted relative to store type B
 When we include a male dummy (e.g., 1 if male; 0 if female),
o the coefficient is interpreted relative to female
EXAMPLE: DO FEMALE EXECUTIVES EARN MORE SALARY?

Salary = a + f * Female + b * Experience

Interpretation Questions:

 Interpretation of b? Does experience matter? What is the 95% confidence interval?
o For every unit increase in experience, the salary increases by 5154
o The standard error of 166 is small relative to the estimate, so the estimate is precise
o The 95% interval is 4827 to 5481; statistically significant, as 0 is not included
 Interpretation of f? Does gender matter? What is the 95% confidence interval?
o On average, female executives earn 34936 less than males, holding experience constant
o The standard error of 2904 is small relative to the estimate, so the estimate is precise
o The 95% interval is -40636.92 to -29236; statistically significant, as 0 is not included
 Interpretation of a?
o The intercept (constant): the expected salary of a male executive (the left-out category) with zero years of experience

Prediction Questions:
 How much salary do male execs earn with 10 years of experience?
o 144075 + 5154 * 10 + 0 * (-34936) = 195615
 How much salary do female execs earn with 2 years of experience?
o 144075 + 5154 * 2 + 1 * (-34936) = 119447 (computed in the sketch below)
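The same plug-in arithmetic as a small R sketch (the coefficients are the estimates from the salary regression above; the helper function is just for illustration):

# fitted equation: Salary = 144075 + 5154*Experience - 34936*Female
salary_hat <- function(female, experience) 144075 + 5154 * experience - 34936 * female

salary_hat(female = 0, experience = 10)  # male, 10 years:  195615
salary_hat(female = 1, experience = 2)   # female, 2 years: 119447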

If you were to conduct a hypothesis test of the "experience" coefficient using the 95% confidence
interval, is it significant? Provide your testing logic briefly in words

 The t-value for the "experience" coefficient is 30.93, which indicates that the coefficient is
highly significant, meaning that it is unlikely to have occurred by chance.
 Since the 95% confidence interval for the "experience" coefficient does not include zero, we can reject
the null hypothesis.

Briefly describe in words the difference between R-squared and adjusted R-squared. No
need for any calculations—just provide your intuition

 R-squared measures how much of the variability in the data can be attributed to the model, rather
than to random error.
 Adjusted R-squared additionally adjusts for the number of independent variables in the model, which
gives a more accurate measure of how well the model fits as variables are added (see the sketch below).
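A sketch of the adjustment on the simulated data from before (n = 100 observations, k = 1 predictor; the formula is the standard one): the penalty grows with the number of predictors, so adding useless variables can raise R-squared but lower adjusted R-squared.

set.seed(1)
n <- 100; k <- 1
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

r2     <- summary(fit)$r.squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalize R-squared for the k predictors
c(adj_r2, summary(fit)$adj.r.squared)           # matches R's adjusted R-squared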
