Descriptive Analysis
Descriptive analysis is the elementary transformation of data in a way that describes basic characteristics such as central tendency, distribution, and variability. For example, consider the business researcher who takes responses from 1,000 American consumers and tabulates their favorite soft drink brand and the price they expect to pay for a six-pack of that product. The mode for favorite soft drink brand and the mean, median, and mode of expected price across all 1,000 consumers would be descriptive statistics that describe central tendency. Means, medians, modes, variance, range, and standard deviation typify widely applied descriptive statistics.
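The soft-drink example above can be sketched in a few lines of Python. The brands and prices below are invented for illustration, not the actual survey responses:

```python
# Descriptive statistics for a (hypothetical) soft-drink survey.
import statistics

brands = ["Coke", "Pepsi", "Coke", "Sprite", "Coke", "Pepsi"]
prices = [3.99, 4.25, 3.75, 4.25, 3.50, 4.10]  # expected price per six-pack

mode_brand = statistics.mode(brands)      # central tendency for a nominal variable
mean_price = statistics.mean(prices)      # central tendency for a ratio variable
median_price = statistics.median(prices)
price_range = max(prices) - min(prices)   # variability
price_sd = statistics.stdev(prices)       # sample standard deviation
```

Note that only the mode is meaningful for the nominal brand variable, while the ratio-scaled price supports all three measures of central tendency plus the variability measures.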
Cross Tabulation
As long as a question deals with only one categorical variable, simple tabulation is probably the best approach. Cross-tabulation is the appropriate technique for addressing research questions involving relationships among multiple less-than-interval (nominal or ordinal) variables. Cross-tabs allow the inspection and comparison of differences among groups based on nominal or ordinal categories. In a cross-tab, the frequency table displays one variable in rows and another in columns. Example: the following cross-tab summarizes several cross-tabulations from responses to a questionnaire on bonuses paid to American International Group (AIG) executives and federal government bailouts in general. Panel A presents results regarding how closely the respondents followed the news stories about AIG executives receiving bonuses from the 2009 federal government bailout money. The cross-tab suggests this may vary with basic demographic variables.
From the results, we can see that more men (60 percent) than women (51 percent) reported that they followed these news reports very closely. Further, it appears that how closely one followed these news stories increases with age (from 41 percent of those 18–29 to 68 percent of those over 65). Panel B provides another example of a cross-tabulation table. The question asks whether respondents feel that most of the bailout money is going to those who created the crisis. In this case, we see very little difference between men (68 percent agree) and women (69 percent agree).
However, before reaching any conclusions based on this survey, one must carefully scrutinize this finding for possible extraneous variables.
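A cross-tabulation of this kind can be built with the standard library alone. The responses below are invented for illustration; they are not the AIG survey data:

```python
# A minimal cross-tab sketch: gender (rows) by "followed the news closely?"
# (columns), with hypothetical responses.
from collections import Counter

responses = [
    ("male", "very closely"), ("male", "somewhat"), ("male", "very closely"),
    ("female", "somewhat"), ("female", "very closely"), ("female", "somewhat"),
]

crosstab = Counter(responses)                 # counts each (row, column) cell
row_totals = Counter(g for g, _ in responses)

# Percentage of each gender that followed the news very closely
pct_male = 100 * crosstab[("male", "very closely")] / row_totals["male"]
pct_female = 100 * crosstab[("female", "very closely")] / row_totals["female"]
```

Computing row percentages, as here, is what makes the group comparison (men versus women) meaningful when the groups differ in size.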
χ² = Σi (Oi - Ei)² / Ei, where χ² = chi-square statistic; Oi = observed frequency in the ith cell; Ei = expected frequency in the ith cell
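The chi-square statistic is a straightforward sum over cells of (observed − expected)² / expected. A minimal sketch with hypothetical frequencies:

```python
# Chi-square statistic over four cells, assuming equal expected frequencies
# under the null hypothesis. The counts are illustrative.
observed = [30, 20, 25, 25]
expected = [25, 25, 25, 25]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# (30-25)^2/25 + (20-25)^2/25 + 0 + 0 = 2.0
```

The resulting value is compared against a chi-square distribution with the appropriate degrees of freedom to judge significance.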
Measure of Association
Measure of association is a general term that refers to a number of bivariate statistical techniques used to measure the strength of a relationship between two variables. Correlation analysis is most appropriate for interval or ratio variables. Regression can accommodate either less-than-interval or interval independent variables, but the dependent variable must be continuous. The chi-square (χ²) test provides information about whether two or more less-than-interval variables are interrelated.
If the value of r = +1.0, a perfect positive relationship exists. Perhaps the two variables are one and the same!
If the value of r = -1.0, a perfect negative relationship exists. The implication is that one variable is a mirror image of the other: as one goes up, the other goes down in proportion, and vice versa. No correlation is indicated if r = 0. A correlation coefficient indicates both the magnitude of the linear relationship and its direction. For example, if we find that r = -0.92, we know we have a very strong inverse relationship; that is, the greater the value measured by variable X, the lower the value measured by variable Y.
The coefficient of determination, R², measures the part of the total variance of Y that is accounted for by knowing the value of X. If the correlation between unemployment and hours worked is r = -0.635, then R² = 0.403. About 40 percent of the variance in unemployment can be explained by the variance in hours worked, and vice versa. Thus, R-squared really is just r squared!
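Both quantities can be computed from scratch. The unemployment and hours-worked figures below are made up purely to illustrate an inverse relationship:

```python
# Pearson correlation r and coefficient of determination R^2 = r^2,
# on illustrative data showing an inverse relationship.
import math

x = [2.0, 3.0, 5.0, 7.0, 9.0]   # hours worked (illustrative)
y = [8.0, 7.5, 6.0, 5.5, 4.0]   # unemployment rate (illustrative)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y))

r = cov / (sx * sy)    # negative sign: inverse relationship
r_squared = r ** 2     # share of variance explained
```

A strongly negative r still yields a large positive R², which is why R² alone says nothing about the direction of the relationship.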
Correlation Matrix
A correlation matrix is the standard form for reporting observed correlations among multiple variables. Each entry represents the bivariate relationship between a pair of variables.
The main diagonal consists of correlations of 1.00. Also remember that correlations should always be considered together with their significance levels (p-values); if a correlation is not significant, it is of little use.
Regression Analysis
Regression analysis is a technique for measuring the linear association between a dependent and an independent variable. Regression is a dependence technique where correlation is an interdependence technique. A dependence technique makes a distinction between dependent and independent variables. An interdependence technique does not make this distinction and simply is concerned with how variables relate to one another.
With simple regression, a dependent (or criterion) variable, Y, is linked to an independent (or predictor) variable, X. Regression analysis attempts to predict the values of a continuous, interval-scaled dependent variable from specific values of the independent variable.
If we have n observations on y and x, then we can write the simple linear regression as: yi = α + βxi + ui, where i = 1, 2, 3, ..., n. The objective is to obtain estimated values for the unknown parameters α and β, given that there are n observations on y and x. In this equation, α represents the Y intercept (where the line crosses the Y-axis) and β is the slope coefficient. The slope is the change in Y associated with a change of one unit in X.
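The two parameters have well-known closed-form least-squares estimators: β̂ = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)², and α̂ = ȳ - β̂x̄. A minimal sketch on illustrative data:

```python
# Closed-form OLS estimates for y_i = alpha + beta*x_i + u_i.
# The data are illustrative (roughly y = 2x with small noise).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
       sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar
```

Because the noise is small, the estimated slope lands close to 2 and the intercept close to 0, matching how the data were constructed.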
Assumptions of Regression
1) Zero mean. E(ui) = 0 for all i.
2) Common variance. Var(ui) = σ² for all i.
3) Independence. ui and uj are independent for all i ≠ j.
4) Independence of xj. ui and xj are independent for all i and j.
5) Normality. ui are normally distributed for all i.
Assumptions 1, 2, and 3 are in combination written as ui ~ IN(0, σ²).
In most business research, the estimate of β is most important. The explanatory power of regression rests with β, because this is where the direction and strength of the relationship between the independent and dependent variable is expressed.
The intercept term (α) is sometimes referred to as a constant because α represents a fixed point. Parameter estimates can be presented in either raw or standardized form.
One potential problem with raw parameter estimates is that they reflect the measurement scale range. If a simple regression involved distance measured in miles, very small parameter estimates may indicate a strong relationship; if the very same distance were measured in centimeters, a very large parameter estimate would be needed to indicate a strong relationship.
Standardized Coefficient
Standardized coefficients are estimated coefficients indicating the strength of a relationship expressed on a standardized scale, where higher absolute values indicate stronger relationships (the range is from -1 to +1). A standardized regression coefficient provides a common metric, allowing regression results to be compared to one another no matter what the original scale ranges may have been. Raw regression weights have the advantage of retaining the scale metric. Standardized coefficients should be reported when the researcher is testing an explanation; unstandardized weights are used when the researcher is making predictions.
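One common way to obtain a standardized coefficient from a raw one is to rescale it by the ratio of standard deviations, βstd = braw · (sx / sy). A sketch with illustrative numbers (the raw slope and standard deviations below are assumed, not taken from any regression in the text):

```python
# Converting a raw simple-regression slope to a standardized coefficient.
# All three input values are hypothetical.
b_raw = 1.98     # raw slope (in units of Y per unit of X)
s_x = 1.5811     # sample standard deviation of X
s_y = 3.1623     # sample standard deviation of Y

beta_std = b_raw * (s_x / s_y)   # unitless; comparable across predictors
```

In simple regression the standardized coefficient equals the correlation r, which is why it must lie between -1 and +1.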
Multiple Regression
Y = β0 + β1X1 + β2X2 + . . . + βnXn + u
Example: Y = β0 + 0.326X1 + 0.612X2 + u
Standard Error of the Estimate (SEE) = a measure of the accuracy of the regression predictions. It estimates the variation of the dependent variable values around the regression line. It should get smaller as we add more independent variables, if they predict well. SEE is simply the standard deviation of the Y values about the estimated regression line and is often used as a summary measure of the goodness of fit of the estimated regression line.
Standard error is simply the standard deviation of the sampling distribution of the estimator; the sampling distribution of an estimator is a probability or frequency distribution of the estimator, that is, a distribution of the values of the estimator obtained from all possible samples of the same size from a given population.
Total Sum of Squares (SST) = total amount of variation that exists to be explained by the independent variables. SST = the sum of SSR and SSE. Sum of Squared Errors (SSE) = the variance in the dependent variable not accounted for by the regression model = residual variation. The objective is to obtain the smallest possible sum of squared errors as a measure of prediction accuracy. Sum of Squares Regression (SSR) = the amount of improvement in explanation of the dependent variable attributable to the independent variables. Outliers are observations that have large residual values and can be identified only with respect to a specific regression model.
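The decomposition SST = SSR + SSE, along with SEE, can be verified numerically. The data and fitted line below are illustrative (the line yhat = 0.06 + 1.98x is an assumed OLS fit for these points):

```python
# Variance decomposition for simple regression: SST = SSR + SSE,
# SEE = sqrt(SSE / (n - 2)). Data and fitted line are illustrative.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.0, 8.1, 9.9]
yhat = [0.06 + 1.98 * xi for xi in x]   # fitted values from assumed OLS line

n = len(y)
ybar = sum(y) / n
sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual variation
see = math.sqrt(sse / (n - 2))                        # standard error of estimate
r_squared = ssr / sst
```

The identity SST = SSR + SSE holds exactly only for the least-squares fit; for any other line, the cross-product terms do not cancel.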
[Figure: decomposition of a Y value's total deviation from the mean Ȳ into the deviation explained by the regression line and the residual deviation.]
If the R² is statistically significant, we then evaluate the strength of the linear association between the dependent variable and the several independent variables. A large R² indicates the straight line works well, while a small R² indicates it does not work well.
Even though an R² is statistically significant, it does not mean it is practically significant. We also must ask whether the results are meaningful. For example, is the value of knowing you have explained 4 percent of the variation worth the cost of collecting and analyzing the data?
Violations of Regression Assumptions
Heteroskedasticity
The second assumption of regression is common variance, i.e., Var(ui) = σ² for all i. This assumption is also known as homoskedasticity [equal (homo) spread (skedasticity)].
Violation of this second assumption of regression is called heteroskedasticity, i.e., the errors do not have a constant (common) variance.
In Fig. 1, the conditional variance of Yi (which equals that of ui), conditional upon the given Xi, remains the same regardless of the values taken by the variable X. In contrast, Fig. 2 shows the conditional variance of Yi increasing as X increases; here the variances of Yi are not the same. Hence, there is heteroskedasticity.
Sources of Heteroskedasticity
Error-learning models: as people learn, their errors of behavior become smaller over time (σi² is expected to decrease). Income growth: as incomes grow, people have more disposable income and hence more scope for choice about the disposition of their income, so σi² is likely to increase with income.
Consequences of Heteroskedasticity
Heteroskedasticity does not result in biased parameter estimates; i.e., beta is not biased. However, ordinary least squares estimates are no longer BLUE (best linear unbiased estimates): among all linear unbiased estimators, OLS no longer provides the estimate with the smallest variance, i.e., the standard error of beta is biased. Bias in the standard errors under heteroskedasticity means that test statistics such as t, F, and χ² may no longer be valid.
Removing Heteroskedasticity
Re-specify the model or transform the variables.
Use robust standard errors: OLS assumes that errors are both independent and identically distributed; robust standard errors relax these assumptions. Alternatively, use weighted least squares.
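Weighted least squares can be sketched in closed form for simple regression: each observation gets weight wi = 1/σi², so noisier observations count less. The data and weights below are illustrative, and the weights are assumed known, which is rarely true in practice (they usually must be estimated):

```python
# Weighted least squares for a simple regression, with weights
# w_i = 1 / sigma_i^2 assumed known. Data and weights are illustrative;
# the weights shrink as x grows, mimicking variance that rises with x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.8, 8.4]
w = [4.0, 1.0, 0.25, 0.11]

sw = sum(w)
xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
yw = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
beta = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y)) / \
       sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
alpha = yw - beta * xw
```

The formulas are the ordinary OLS estimators with every sum weighted; setting all weights equal recovers plain OLS.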
Autocorrelation
The term autocorrelation may be defined as correlation between members of a series of observations ordered in time [as in time series data] or space [as in cross-sectional data]. The third assumption of regression implies independence, i.e., ui and uj are independent for all i ≠ j, i.e., E(uiuj) = 0 for all i ≠ j.
Sources of Autocorrelation
Inertia- A salient feature of most economic time series is inertia, or sluggishness. Example- Time series such as GNP, price indexes, production, employment, and unemployment exhibit (business) cycles. Starting at the bottom of the recession, when economic recovery starts, most of these series start moving upward. In this upswing, the value of a series at one point in time is greater than its previous value. There is a momentum built into them, and it continues until something happens (e.g., increase in interest rate or taxes or both) to slow them down. Therefore, in regressions involving time series data, successive observations are likely to be interdependent.
Specification Bias: Excluded Variables Case- In empirical analysis the researcher often starts with a plausible regression model that may not be the most perfect one. After the regression analysis, the researcher does the postmortem to find out whether the results accord with a priori expectations. The residuals may suggest that some variables that were originally candidates but were not included in the model for a variety of reasons should have been included. This is the case of excluded variable specification bias. Specification Bias: Incorrect Functional Form.
Lags: in a time series regression of consumption expenditure on income, it is not uncommon to find that consumption expenditure in the current period depends, among other things, on consumption expenditure of the previous period. Such regression models, which incorporate lagged values of the dependent variable, are called autoregressive models. The rationale for such a model is simple: consumers do not change their consumption habits readily, for psychological, technological, or institutional reasons. If we neglect the lagged term, the resulting error term will reflect a systematic pattern due to the influence of lagged consumption on current consumption.
Data Transformation- like taking lagged values (level form) and first difference operators (first difference form) may lead to autocorrelation.
Figures a to d show that there is a discernible pattern among the residuals ui: Figure a shows a cyclical pattern; Figures b and c suggest an upward or downward linear trend in the disturbances; and Figure d indicates that both linear and quadratic trend terms are present in the disturbances. Only Figure e shows no systematic pattern, supporting the assumption of no autocorrelation.
Consequences of Autocorrelation
The estimated variance of the error term is likely to underestimate the true σ². We are likely to overestimate R². The usual t and F tests of significance are no longer valid and, if applied, are likely to give seriously misleading conclusions about the statistical significance of the estimated regression coefficients.
Detection of Autocorrelation
The most celebrated test for detecting autocorrelation (serial correlation) is popularly known as the Durbin-Watson d statistic.
In large samples, we can use the Newey-West method to obtain standard errors of OLS estimators that are corrected for autocorrelation.
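The Durbin-Watson statistic itself is easy to compute from a series of OLS residuals: d = Σt (et - et-1)² / Σt et². Values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. A sketch on illustrative residuals:

```python
# Durbin-Watson d statistic on a list of (illustrative) OLS residuals.
e = [0.5, 0.4, 0.6, -0.3, -0.5, 0.2]

d = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / \
    sum(et ** 2 for et in e)
# d is bounded between 0 and 4 by construction
```

In practice the computed d is compared against tabulated lower and upper critical bounds, which depend on the sample size and number of regressors.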
Multicollinearity
The situation where the explanatory variables are highly intercorrelated is referred to as multicollinearity. When the explanatory variables are highly correlated, it becomes difficult to disentangle the separate effects of each explanatory variable on the explained variable.
If multicollinearity is perfect , the regression coefficients of the X variables are indeterminate and their standard errors are infinite.
If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.
In Figure a there is no overlap between X2 and X3, and hence no collinearity. In Figures b through e there is a low to high degree of collinearity: the greater the overlap between X2 and X3 (i.e., the larger the shaded area), the higher the degree of collinearity. In the extreme, if X2 and X3 were to overlap completely (or if X2 were completely inside X3, or vice versa), collinearity would be perfect.
Sources of Multicollinearity
The data collection method employed, for example, sampling over a limited range of the values. Constraints on the model or in the population being sampled. For example, in the regression of electricity consumption on income (X2) and house size (X3) there is a physical constraint in the population in that families with higher incomes generally have larger homes than families with lower incomes. Model specification error- adding polynomial terms to a regression model, especially when the range of the X variable is small.
An over-determined model. This happens when the model has more explanatory variables than the number of observations. This could happen in medical research where there may be a small number of patients about whom information is collected on a large number of variables.
Consequences of Multicollinearity
Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult. Because of this, confidence intervals tend to be much wider, leading to readier acceptance of the zero null hypothesis (i.e., that the true population coefficient is zero), and the t ratio of one or more coefficients tends to be statistically insignificant.
Although the t ratio of one or more coefficients is statistically insignificant, R2, the overall measure of goodness of fit, can be very high. The OLS estimators and their standard errors can be sensitive to small changes in the data.
Multicollinearity Diagnostic
Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated by multicollinearity. A VIF of 1 indicates no correlation between the independent measures; values somewhat above 1 indicate some association between predictor variables, but generally not enough to cause problems. A common maximum acceptable VIF value is 10; anything higher indicates a problem with multicollinearity. Tolerance is the amount of variance in an independent variable that is not explained by the other independent variables. If the other variables explain a lot of the variance of a particular independent variable, we have a problem with multicollinearity; thus, small values of tolerance indicate multicollinearity problems. The usual cutoff for tolerance is .10: a tolerance value smaller than .10 indicates a problem of multicollinearity.
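The two diagnostics are linked by VIF = 1 / (1 - Rj²) and tolerance = 1 / VIF, where Rj² comes from regressing predictor j on the other predictors. With only two predictors, Rj² reduces to their squared correlation, so a minimal sketch (on made-up, deliberately collinear data) is:

```python
# VIF and tolerance for one of two (illustrative, highly correlated)
# predictors. With two predictors, R_j^2 is the squared correlation.
import math

x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [1.1, 2.3, 2.9, 4.2, 4.8]   # tracks x2 closely

n = len(x2)
m2, m3 = sum(x2) / n, sum(x3) / n
cov = sum((a - m2) * (b - m3) for a, b in zip(x2, x3))
s2 = math.sqrt(sum((a - m2) ** 2 for a in x2))
s3 = math.sqrt(sum((b - m3) ** 2 for b in x3))
r_sq = (cov / (s2 * s3)) ** 2    # R_j^2 for predictor x2

vif = 1 / (1 - r_sq)             # >= 1 by construction
tolerance = 1 - r_sq             # equivalently 1 / vif
```

Because these predictors are nearly collinear, the VIF lands far above the conventional cutoff of 10, exactly the situation the diagnostic is meant to flag.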