Regression Analysis

Scatter Plots and Correlation

– A scatter plot (or scatter diagram) is used to show
the relationship between two variables
– Correlation analysis is used to measure the strength of
the linear association between two variables
• Only concerned with strength of the relationship
• No causal effect is implied

Examples

• Salary and number of years of experience
• Household income and expenditure
• Price and supply of commodities
• Amount of rainfall and yield of crops
• Price and demand of goods
• Weight and blood pressure
• Sales and GDP

Scatter Plot Examples

[Figure: scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship.]

Correlation Coefficient

• The population correlation coefficient (ρ) measures the strength of the linear association between the variables
• The sample correlation coefficient (r) is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations

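Calculating the Sample Correlation Coefficient

$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}$

where

$\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$
$s_x = \sqrt{\frac{1}{n-1} \sum_i (x_i - \bar{x})^2}$
$s_y = \sqrt{\frac{1}{n-1} \sum_i (y_i - \bar{y})^2}$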
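The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the n − 1 divisor for both the covariance and the standard deviations; the function name and the five data points are hypothetical.

```python
import math

def sample_correlation(x, y):
    """Sample correlation r_xy = cov(x, y) / (s_x * s_y), using n - 1 divisors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov_xy / (s_x * s_y)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
print(round(r, 4))  # 0.7746
```

Because the same divisor appears in the numerator and the denominator, r does not depend on whether n or n − 1 is used, which is one way to see that r is unit free.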

Features of correlation coefficient

– Unit free
– Range between -1.00 and 1.00
– -1≤ρ<0 implies that as X ↑ (↓), Y ↓ (↑)
– 0<ρ≤1 implies that as X ↑ (↓), Y ↑ (↓)
– The closer to -1.00, the stronger the negative linear relationship
– The closer to 1.00, the stronger the positive linear relationship
– The closer to 0.00, the weaker the linear relationship
– ρ=0 implies that X and Y are not linearly associated

Significance Test for Correlation

• Hypotheses
H0: ρ = 0 (no linear correlation)
H1: ρ ≠ 0 (linear correlation)

• Test statistic, under H0:
$t_{obs} = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$

• Critical region:
$\{t_{obs} > t_{\alpha, n-2}\}$ for a right-sided alternative
$\{t_{obs} < -t_{\alpha, n-2}\}$ for a left-sided alternative
$\{|t_{obs}| > t_{\alpha/2, n-2}\}$ for the two-sided alternative H1: ρ ≠ 0

What is Regression

• Regression is a tool for finding the existence of an association relationship between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xk) in a study.
• The relationship can be linear or non-linear.

Mathematical vs Statistical Relationship

• A mathematical relationship is exact: $y = \beta_0 + \beta_1 x$
• A statistical relationship is not exact: $y = \beta_0 + \beta_1 x + \varepsilon$

Linear Regression Assumptions

• Distribution of error: $\varepsilon_i \sim N(0, \sigma_e^2)$
• Error values (ε) are statistically independent: $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for i ≠ j
• Error values are normally distributed for any given value of x
• $E(\varepsilon_i) = 0$
• The probability distribution of the errors has constant variance: $\mathrm{var}(\varepsilon_i) = \sigma_e^2$

Nomenclature in Regression

• A dependent variable (response variable) measures an outcome of a study (also called outcome variable).
• An independent variable (explanatory variable) explains changes in a response variable.
• In regression, values of the explanatory variable are often set to see how they affect the response variable (that is, to predict the response variable).

Caution

• A regression model establishes the existence of an association between two variables, but not causation.
• The terms dependent and independent do not necessarily imply a causal relationship between two variables.
• The purpose of regression is to predict the value of the dependent variable given the value(s) of the independent variable(s).

Steps in regression analysis

• Statement of the problem under consideration
• Identify the explanatory variables
• Specify the nature of the relationship between the dependent variable and the explanatory variables
• Collection of data on relevant variables
• Choice of method for fitting the data
• Fitting of model
• Model validation and criticism
• Using the chosen model(s) for the solution of the posed problem and forecasting

Population Linear Regression

The population regression model:

$y = \beta_0 + \beta_1 x + \varepsilon$

where
• y is the dependent variable
• β0 is the population y intercept
• β1 is the population slope coefficient
• x is the independent variable
• ε is the random error term, or residual
• β0 + β1x is the linear (systematic) component
• ε is the random error component

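Population Linear Regression

[Figure: the population regression line y = β0 + β1x + ε plotted against x, with intercept β0 and slope β1. For a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line.]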

Estimated Regression Model

The sample regression line provides an estimate of the population regression line:

$\hat{y}_i = b_0 + b_1 x$

where
• ŷi is the estimated (or predicted) y value
• b0 is the estimate of the regression intercept
• b1 is the estimate of the regression slope
• x is the independent variable

Estimation of parameters

• Least squares method of estimation
• Confidence interval
• Prediction interval
• p-value

Interpretation of the Slope and the Intercept

• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x
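The slides name the least squares method without writing out the estimates. A minimal sketch, assuming the standard closed-form solutions for one predictor, b1 = SS_xy / SS_x and b0 = ȳ − b1 x̄; the function name and the data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Least squares estimates (b0, b1) for the line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x          # estimated slope
    b0 = y_bar - b1 * x_bar    # estimated intercept
    return b0, b1

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_simple_linear_regression(x, y)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

Read against the interpretation above: the estimated average of y is about 2.2 when x is zero, and it rises by about 0.6 for each one-unit increase in x.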

Assessing Model Accuracy

• R²
• Residual Standard Error (interpretation?)
• F Statistic

Coefficient of Determination

• Relationship Among SST, SSR, SSE

SST = SSR + SSE

$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

Goodness of fit of regression

Coefficient of Determination

A fitted model can be said to be good when the residuals are small. Since SSE is based on the residuals, a measure of the quality of the fitted model can be based on SSE. R² is a measure of relative fit based on a comparison of SSR and SST:

$R^2 = r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

where:
SSR = sum of squares due to regression
SST = total sum of squares

(In simple linear regression, R² equals the square of the sample correlation coefficient r.)

$0 \le R^2 \le 1$: a value of R² closer to 1 indicates a better fit, and a value of R² closer to zero indicates a poor fit.

Coefficient of Determination (example)

R² = SSR/SST = 2.963146 / 13.45859 = 0.220168

22.02% of the variability in the demand for the item can be explained by the linear model between the demand and the price. The regression relationship is weak.
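The decomposition SST = SSR + SSE and the ratio R² = SSR/SST can be checked numerically. The sketch below fits a least squares line to hypothetical data (not the demand and price data of the example) and computes the three sums of squares; the function name is an assumption made for illustration.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)             # due to regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # due to error
    return sst, ssr, sse

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r_squared, 2))  # 6.0 3.6 2.4 0.6
```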

Residual Standard Error

$RSE = \sqrt{\frac{SSE}{n-2}}$

where SSE (also written RSS) is the sum of squares due to error.

Interpretation: Suppose RSE is 3.26. In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average.
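A minimal sketch of the RSE calculation for a simple linear regression, assuming the n − 2 divisor (two estimated parameters). The function name and the data are hypothetical and unrelated to the Advertising data.

```python
import math

def residual_standard_error(x, y):
    """RSE = sqrt(SSE / (n - 2)) for the least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
rse = residual_standard_error(x, y)
print(round(rse, 4))  # 0.8944
```

The RSE is expressed in the units of y, so whether a given value is large or small has to be judged against the scale of the response.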

Advertising data

The Advertising data set consists of the sales (in thousands of units) of a particular product in 200 different markets, along with advertising budgets (in thousands of dollars) for the product in each of those markets for three different media: TV, radio, and newspaper. Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation? Let us first check how sales are related to ad expenditure.

Simple Linear Regression

• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and sales?

• Scatter Diagram
• Correlation coefficient
• Test of correlation coefficient
• Interpretation of regression coefficients and corresponding s.e.
• Confidence interval of parameters
• p-value of t-tests

Test for Significance

To test for a significant regression relationship, we test the intercept parameter β0, the slope parameter β1, and the predicted y. The test commonly used is the t test.

The t test requires an estimate of σe², the variance of error in the regression model.

Testing for Significance

• An Estimate of σe²

The mean square error (MSE) provides the unbiased estimator of σe², given as

$s^2 = MSE = \frac{SSE}{n-2}$

where:

$SSE = \sum_i (y_i - \hat{y}_i)^2 = SS_y - \frac{(SS_{xy})^2}{SS_x} = SS_y - b_1 SS_{xy}$

Testing for slope parameter

• Hypotheses
$H_0: \beta_1 = \beta_{10}$
$H_1: \beta_1 \ne \beta_{10}$

• Test statistic, under H0:
$t_{obs} = \frac{b_1 - \beta_{10}}{s_{b_1}}$, where $s_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}$

Testing for intercept parameter

• Hypotheses
$H_0: \beta_0 = \beta_{00}$
$H_1: \beta_0 \ne \beta_{00}$

• Test statistic, under H0:
$t_{obs} = \frac{b_0 - \beta_{00}}{s_{b_0}}$, where $s_{b_0} = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}$

Testing for Significance: t Test

Critical Region

Reject H0 if p-value < α, or $t_{obs} < -t_{\alpha/2, n-2}$, or $t_{obs} > t_{\alpha/2, n-2}$

where $t_{\alpha/2, n-2}$ is based on a t distribution with n − 2 degrees of freedom
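The standard errors and t statistics for the slope and intercept can be sketched as follows, testing H0: β1 = 0 and H0: β0 = 0 (that is, assuming β10 = β00 = 0). The function name, the returned dictionary keys, and the data are hypothetical.

```python
import math

def coefficient_t_tests(x, y):
    """Estimates, standard errors, and t statistics for b0 and b1 (null value 0)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                       # s^2 = MSE = SSE / (n - 2)
    s_b1 = s / math.sqrt(ss_x)
    s_b0 = s * math.sqrt(1 / n + x_bar ** 2 / ss_x)
    return {"b0": b0, "b1": b1, "s_b0": s_b0, "s_b1": s_b1,
            "t_b0": b0 / s_b0, "t_b1": b1 / s_b1}

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
result = coefficient_t_tests(x, y)
print({k: round(v, 4) for k, v in result.items()})
```

Each t statistic is then compared with the critical value for n − 2 degrees of freedom, as in the rejection rule above.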

Testing for Significance: Example

1. Determine the hypotheses.
$H_0: \beta_1 = 0$
$H_a: \beta_1 \ne 0$

2. Specify the level of significance.
α = .05

3. Select the test statistic.
$t = \frac{b_1}{s_{b_1}}$

4. State the rejection rule.
Reject H0 if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)

Testing for Significance: t Test

5. Compute the value of the test statistic.
$t = \frac{b_1}{s_{b_1}} = \frac{5}{1.08} = 4.63$

6. Determine whether to reject H0.
t = 4.541 provides an area of .01 in the upper tail. Hence, the p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.

Confidence Interval for β1

• The form of a confidence interval for β1 is:

$b_1 \pm t_{\alpha/2} s_{b_1}$

where b1 is the point estimator, $t_{\alpha/2} s_{b_1}$ is the margin of error, and $t_{\alpha/2}$ is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom

Confidence Interval for β0

• The form of a confidence interval for β0 is:

$b_0 \pm t_{\alpha/2} s_{b_0}$

where b0 is the point estimator, $t_{\alpha/2} s_{b_0}$ is the margin of error, and $t_{\alpha/2}$ is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom
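As a worked illustration, the sketch below reuses the numbers from the significance testing example (b1 = 5, standard error 1.08, critical value 3.182 with 3 degrees of freedom) to form a 95% confidence interval for β1. The helper name is hypothetical.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Interval estimate +/- t_crit * std_error, returned as (lower, upper)."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Values taken from the worked t test example; 3.182 is the upper 0.025
# point of the t distribution with 3 degrees of freedom.
b1, s_b1, t_crit = 5.0, 1.08, 3.182
lower, upper = confidence_interval(b1, s_b1, t_crit)
print(round(lower, 3), round(upper, 3))  # 1.563 8.437
```

The interval does not contain zero, which agrees with the rejection of H0: β1 = 0 at the .05 level in the example.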

Multiple Regression

Example

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

• Response or dependent variable?
• Predictors or independent variable(s)?

Common questions in regression

• Which predictors are associated with the response?
• What is the relationship between the response and each predictor?
• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Advertising data

One may be interested in answering questions such as:
• Which media contribute to sales?
• Which media generate the biggest boost in sales?
• How much increase in sales is associated with a given increase in TV advertising?

Multiple Regression

The model is

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

• Scatter matrix
• Correlation matrix
• Test of correlation coefficients
• Interpretation of regression coefficients and corresponding s.e.
• Confidence interval of parameters
• p-value of t-tests

Assessing Model Accuracy

• R²
• Adjusted R²
• Residual Standard Error
• F Statistic

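Residual Standard Error

$RSE = \sqrt{\frac{SSE}{n-p}}$

where SSE (also written RSS) is the sum of squares due to error and p is the number of regression parameters, including the intercept.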

Adjusted R²

• If more explanatory variables are added to the model, then R² increases. In case the variables are irrelevant, R² will still increase and give an overly optimistic picture.
• To correct this overly optimistic picture, the adjusted R², denoted as $R^2_{adj}$ or adj R², is used. It is defined as

$R^2_{adj} = 1 - \frac{n-1}{n-p} (1 - R^2)$

Note: R² will never decrease when a variable is added, but the same is not true for adj R². If the model fits poorly, adj R² can even be negative.
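A short sketch of the adjustment, where p is taken to be the number of regression parameters including the intercept. The function name and the input values are hypothetical; the second call shows how a poorly fitting model can give a negative value.

```python
def adjusted_r_squared(r_squared, n, p):
    """adj R^2 = 1 - ((n - 1) / (n - p)) * (1 - R^2)."""
    return 1 - (n - 1) / (n - p) * (1 - r_squared)

# Hypothetical values, for illustration only
adj_moderate = adjusted_r_squared(0.6, n=5, p=2)   # 1 - (4/3) * 0.4
adj_poor = adjusted_r_squared(0.1, n=5, p=3)       # 1 - (4/2) * 0.9
print(round(adj_moderate, 4), round(adj_poor, 4))  # 0.4667 -0.8
```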

Testing For Overall Significance of Model – F Test

Is There a Relationship Between the Response and Predictors?

$H_0: \beta_1 = \beta_2 = \dots = \beta_{p-1} = 0$
$H_1: \beta_j \ne 0$ for at least one j

Test statistic, under H0:

$F = \frac{SSR / (p-1)}{SSE / (n-p)} \sim F_{p-1, n-p}$

where SSE is also written RSS.

Relation between R² and F

• $F = \frac{SSR / (p-1)}{SSE / (n-p)} = \frac{n-p}{p-1} \cdot \frac{SSR}{SSE} = \frac{n-p}{p-1} \cdot \frac{SSR}{SST - SSR} = \frac{n-p}{p-1} \cdot \frac{R^2}{1 - R^2}$
• When R² = 0, F = 0
• When R² = 1, F = ∞. So F and R² vary directly: a larger R² implies a greater F value
• That is why the F test under analysis of variance is termed the measure of overall significance of the estimated regression
• It is also a test of significance of R². If F is highly significant, it implies that we can reject H0, i.e. y is linearly related to the X's.
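The identity between the two forms of F can be checked numerically. The sketch below uses hypothetical sums of squares (SSR = 3.6, SSE = 2.4, n = 5, p = 2) and computes F both from the sums of squares and from R²; the function names are assumptions made for illustration.

```python
def f_from_sums_of_squares(ssr, sse, n, p):
    """F = (SSR / (p - 1)) / (SSE / (n - p))."""
    return (ssr / (p - 1)) / (sse / (n - p))

def f_from_r_squared(r_squared, n, p):
    """F = ((n - p) / (p - 1)) * R^2 / (1 - R^2)."""
    return (n - p) / (p - 1) * r_squared / (1 - r_squared)

# Hypothetical values, for illustration only
ssr, sse, n, p = 3.6, 2.4, 5, 2
r_squared = ssr / (ssr + sse)           # SST = SSR + SSE
f_ss = f_from_sums_of_squares(ssr, sse, n, p)
f_r2 = f_from_r_squared(r_squared, n, p)
print(round(f_ss, 4), round(f_r2, 4))   # 4.5 4.5
```

With a single predictor (p = 2), F equals the square of the slope's t statistic, so the overall F test and the two-sided t test for the slope lead to the same conclusion.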

Model Adequacy Checking

The fitting of the linear regression model, the estimation of parameters, the testing of hypotheses, and the properties of the estimators are based on the following major assumptions:

• The relationship between the study variable and the explanatory variables is linear, at least approximately.
• The errors are normally distributed.
• The error term has constant variance.
• There is no multicollinearity (no perfect linear relationship among the explanatory variables).

Residuals Plot

• The graphical analysis of residuals is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions.
• Typically, the residuals are plotted against the fitted values.
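A minimal sketch of how the points for such a plot can be produced: fit the least squares line, then pair each fitted value with its residual. The function name and the data are hypothetical, and the plotting step itself is left out so that the example needs only the standard library.

```python
def fitted_values_and_residuals(x, y):
    """Return (fitted, residuals) for the least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted, residuals = fitted_values_and_residuals(x, y)

# Each (fitted value, residual) pair is one point of the residual plot
for f, e in zip(fitted, residuals):
    print(round(f, 2), round(e, 2))
```

For a least squares fit with an intercept the residuals sum to zero, so the plot is always centred on the horizontal axis; what matters is the shape of the scatter around it.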

Plots of residuals against the fitted values (heteroscedasticity)

• If the plot is such that the residuals can be contained in a horizontal band (and the residuals fluctuate more or less randomly inside the band), then there are no obvious model defects.
• If the plot is such that the residuals can be contained in an outward opening funnel, then the pattern indicates that the variance of the errors is not constant (presence of heteroscedasticity).

Plots of residuals against the fitted values (heteroscedasticity)

• If the plot is such that the residuals can be accommodated in an inward opening funnel, then the pattern indicates that the variance of the errors is not constant (presence of heteroscedasticity).
• If the plot is such that the residuals are contained inside a curved band, then it indicates nonlinearity: the assumed relationship between y and the X's is non-linear. This could also mean that some other explanatory variables are needed in the model. For example, a squared term in an explanatory variable may be necessary. Transformations of the explanatory variables and/or the study variable may also be helpful in these cases.