A brief guide to statistical analysis
1 Summary Statistics
2 Normality
3 The Correlation Matrix
4 Regression Analysis: The Basics
5 Regression Analysis: Hypothesis Testing and Modelling Issues
5.1 Significance of individual parameters
5.2 Explanatory power: adjR2
5.3 Significance of the equation: The F-test
5.4 Dummy variables
5.5 The standard error
5.6 Collinearity
5.7 The general-to-specific modelling methodology
Statistical analysis begins with the collection of a sample of data. These data can be subjected to various kinds of statistical analysis. The purpose of the analysis is to draw inferences regarding the population from which the sample is drawn.
1 Summary Statistics
The mean is the arithmetic average.¹ The standard deviation and variance measure the variability of the observations around the mean. The greater the variability, the greater the standard deviation. The variance is the standard deviation squared. The minimum (Min) is the smallest observation. The maximum (Max) is the largest observation. The mode is the most common observation. The median (Med) is the middle observation.²
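As a minimal sketch, the summary statistics described above can be computed with Python's standard `statistics` module. The observations are invented for illustration, and the sketch assumes the sample versions of the variance and standard deviation (dividing by n - 1), which the text does not specify.

```python
import statistics

# Hypothetical sample, invented purely for illustration.
observations = [2, 4, 4, 6, 9]

summary = {
    "mean": statistics.mean(observations),          # arithmetic average
    "variance": statistics.variance(observations),  # sample variance (divides by n - 1)
    "sd": statistics.stdev(observations),           # square root of the sample variance
    "min": min(observations),                       # smallest observation
    "max": max(observations),                       # largest observation
    "mode": statistics.mode(observations),          # most common observation
    "median": statistics.median(observations),      # middle observation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For these observations the mean is 5, the variance is 7, and the standard deviation is the square root of 7 (about 2.65), which illustrates that the variance is the standard deviation squared.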
2 Normality³
All basic methods of statistical analysis assume that the observations follow an approximately normal distribution. A normal distribution can be understood as a bell-shaped frequency distribution around the mean. There are two elementary tests of normality, that is, tests of whether the approximate normality assumption is satisfied: skewness and kurtosis.

© 2002-2006 Shane Bonetti
¹ The sum of the observations divided by the number of observations.
² E.g. if the observations are 2, 6, 8, 10, 12 the median is 8.
³ (1) These tests of normality should only be applied to continuous variables, not to dummy variables. (2) If there is serious skewness or kurtosis, it becomes necessary to use statistical methods which do not involve any assumption regarding the underlying distribution of the variables. These are called non-parametric statistical methods. One alternative is to transform the variable (e.g. take the natural log, or the square or the square root, or some other transformation). Often a variable which is not normally distributed can be transformed into a normally distributed variable in this way.
Skewness is the absence of symmetry. It involves the hump of the distribution being away from the mean. If there is no skewness then the statistical measure of skewness (s) is zero. There are tests available to show whether s differs significantly from zero. As a rule of thumb, |s| > 1 indicates potentially serious non-normality.
Kurtosis involves insufficient or excessive "weight in the tails" of the distribution. If there is no kurtosis (if the variable is not "kurtotic") then the statistical measure of kurtosis (k) is zero; k here is excess kurtosis, measured relative to the normal distribution. There are tests available to show whether k differs significantly from zero. As a rule of thumb, |k| > 3 indicates potentially serious non-normality.
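The two measures can be sketched from the sample moments. This is one common moment-based definition, assumed here for illustration; statistics packages often apply small-sample corrections, so their values may differ slightly. The function names and data are hypothetical.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    """Moment-based skewness s: zero for a symmetric distribution."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis k, shifted so a normal distribution gives zero."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]   # hypothetical data, symmetric around 3
lopsided = [1, 1, 1, 1, 10]   # hypothetical data with one large value

for label, data in [("symmetric", symmetric), ("lopsided", lopsided)]:
    s, k = skewness(data), excess_kurtosis(data)
    # Rules of thumb from the text: |s| > 1 or |k| > 3 suggests trouble.
    flagged = abs(s) > 1 or abs(k) > 3
    print(label, round(s, 3), round(k, 3), "check normality" if flagged else "ok")
```

The symmetric sample has s = 0, while the lopsided sample has s = 1.5 and so is flagged by the |s| > 1 rule of thumb.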
3 The Correlation Matrix
The correlation coefficient (r) indicates the strength of the linear relationship between two variables. r can take values –1 ≤ r ≤ 1:
r = 0 implies no linear correlation
r > 0 implies a positive correlation
r < 0 implies a negative correlation
r = 1 means there is perfect positive correlation
r = –1 means there is perfect negative correlation.
The correlation matrix is a table containing all of the correlation coefficients between a list of variables. It has two uses. First, it provides a quick (though ultimately unsatisfactory) method of checking the patterns which the data reveal. If theory states that x1 and x2 are positively related, this is confirmed if r > 0, and disconfirmed if r ≤ 0. Second, the correlation matrix provides a collinearity check (see 5.6).
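A correlation matrix can be sketched in a few lines using the usual Pearson formula for r. The variable names and values below are hypothetical, and the matrix is stored as a nested dictionary simply to keep the example free of external libraries.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(variables):
    """Table of r for every pair of variables, as a dict of dicts."""
    names = list(variables)
    return {
        row: {col: pearson_r(variables[row], variables[col]) for col in names}
        for row in names
    }

# Hypothetical data for three variables.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 4, 5],   # tends to rise with x1
    "x3": [5, 4, 3, 2, 1],   # falls exactly as x1 rises
}

matrix = correlation_matrix(data)
for row, cols in matrix.items():
    print(row, [round(cols[col], 2) for col in cols])
```

Here r between x1 and x3 is –1 (perfect negative correlation), and r between x1 and x2 is about 0.77.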
4 Regression Analysis: The Basics
Ordinary Least Squares (OLS) regression finds a "line of best fit" for any given data, between a dependent variable (the variable which is to be explained, typically denoted y) and one or more independent variables (the explanatory variables, typically denoted x). Statistics or econometrics packages deliver an estimate of the constant term and the slope terms.
For a regression equation with one independent variable, the regression equation is:
y = β0 + β1 x
β1 is a "regression coefficient". It measures the average effect of x on y. β0 is the constant term. It measures the average value of y when x = 0.
For a regression equation with multiple independent variables, the regression equation is:
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + ...
β1, β2, β3, β4, β5, ... are "regression coefficients". They measure the average effect of the relevant x on y, holding all of the other x's constant. This provides a better test of theoretical hypotheses than correlation coefficients. β0 is the constant term. It measures the average value of y when all of the x's are zero.
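For the one-variable case, the least-squares estimates have a simple closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the constant makes the line pass through the two means. The sketch below assumes that textbook formula; the data and the function name `ols_simple` are hypothetical. Equations with several x's need matrix methods, which a statistics package would normally handle.

```python
def ols_simple(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope: average effect of x on y
    b0 = mean_y - b1 * mean_x    # constant: predicted y when x = 0
    return b0, b1

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = ols_simple(x, y)
print(f"y = {b0:.2f} + {b1:.2f} x")
```

For these observations the fitted line is y = 2.20 + 0.60 x.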
5 Regression Analysis: Hypothesis Testing and Modelling Issues
5.1 Significance of individual parameters
The β's can be positive, negative or zero. Typically, theory suggests a hypothesis for particular β's: that β > 0 or β < 0. But the estimated β's are based on a sample. If the sample indicates, for instance, that β > 0, we require some measure of confidence that β > 0 rather than β = 0 in the population from which the sample is taken. This is revealed by a significance test. If β is "significant at the x% level", then an estimate this far from zero would arise in at most x% of samples if the true value of β were zero. This is conventionally read as (100–x)% confidence that β ≠ 0. The "gold standard" for scientific work is to require significance at the 5% level, though this choice is arbitrary.
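Significance tests of this kind are usually based on the t-statistic: the estimated coefficient divided by its standard error. The text does not name the statistic, so treat this as a sketch of the standard approach for a one-variable regression, with hypothetical data and function names. Converting t into a significance level requires the t distribution with n - 2 degrees of freedom, which is not in Python's standard library, so the sketch stops at t.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, standard error of b1, t) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Sum of squared residuals around the fitted line.
    ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = ss_resid / (n - 2)   # two parameters were estimated
    se_b1 = math.sqrt(residual_variance / sxx)
    return b1, se_b1, b1 / se_b1

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, se_b1, t = slope_t_statistic(x, y)
print(round(b1, 3), round(se_b1, 3), round(t, 3))
```

With only five observations, t is about 2.12. The two-sided 5% critical value for 3 degrees of freedom is about 3.18, so this slope would not count as significant at the 5% level even though the estimate is positive.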
5.2 Explanatory power: adjR2
One way of assessing regression equations and comparing different regression equations is to ask "what proportion of the variation in the dependent variable (y) is explained by the independent variables (x's)?" Other things being equal, the more that is explained the better. adjR2 ("adjusted R squared", or R-bar squared as it is sometimes called) measures this, with an adjustment for the number of independent variables in the equation. It normally varies between 0 (none of the variation in the dependent variable can be accounted for by variations in the independent variables) and 1 (all of the variation in the dependent variable can be accounted for by variations in the independent variables), though it can fall slightly below 0 when the independent variables explain almost nothing. Other things being equal, the larger adjR2 is, the better the regression equation.
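The usual formula is adjR2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where R2 is the unadjusted proportion of variation explained, n is the number of observations and k is the number of independent variables. The sketch below assumes that standard formula; the data, fitted values and function names are hypothetical.

```python
def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_resid / ss_total

def adjusted_r_squared(r2, n, k):
    """Penalise R2 for the number of independent variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observations and fitted values from the line y = 2.2 + 0.6 x.
y = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, fitted)
adj_r2 = adjusted_r_squared(r2, n=len(y), k=1)
print(round(r2, 3), round(adj_r2, 3))
```

Here R2 is 0.6 and adjR2 is about 0.467. Because of the penalty, adding a variable that explains very little can lower adjR2, and a very weak equation can produce a value below zero.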
5.3 Significance of the equation: The F-test
The F-test indicates the degree of confidence that there exists a relationship between the independent and dependent variables for the population. That is, the F-test shows the significance (see 5.1) of the explanatory power (see 5.2) of the equation. If the F-test is significant at the x% level, then explanatory power this large would arise in at most x% of samples if there were no relationship between the independent variables and the dependent variable in the population.
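The F-statistic for the whole equation can be written in terms of the unadjusted R2: F = (R2 / k) / ((1 - R2) / (n - k - 1)), with n observations and k independent variables. The sketch below assumes that standard formula and uses hypothetical numbers; the significance level itself comes from F tables or a statistics package.

```python
def f_statistic(r2, n, k):
    """F-statistic for the hypothesis that all k slope coefficients are zero."""
    explained_per_variable = r2 / k
    unexplained_per_degree = (1 - r2) / (n - k - 1)
    return explained_per_variable / unexplained_per_degree

# Hypothetical equation: R2 = 0.6 from 5 observations and 1 independent variable.
f_value = f_statistic(0.6, n=5, k=1)
print(round(f_value, 3))
```

For this example F is 4.5 with (1, 3) degrees of freedom. The 5% critical value for those degrees of freedom is about 10.1, so the equation as a whole would not be significant at the 5% level.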
5.4 Dummy variables
Dummy variables (DV's) classify subjects or observations into two mutually exclusive and exhaustive classes. A dummy variable takes a value 0 or 1: 1 for membership of a class, 0 for non-membership. The regression coefficient (β) on a dummy variable measures the average effect on the dependent variable (y) of membership of the class, by comparison with non-members.
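When a dummy variable is the only independent variable, its coefficient is simply the difference between the average y of members and the average y of non-members, and the constant is the non-members' average. The sketch below checks this with hypothetical data, using the closed-form one-variable least-squares estimates.

```python
def ols_simple(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical data: 1 = member of the class, 0 = non-member.
dummy = [0, 0, 0, 1, 1, 1]
y = [2, 3, 4, 5, 7, 9]

b0, b1 = ols_simple(dummy, y)

members = [yi for d, yi in zip(dummy, y) if d == 1]
non_members = [yi for d, yi in zip(dummy, y) if d == 0]
mean_gap = sum(members) / len(members) - sum(non_members) / len(non_members)

print(b0, b1, mean_gap)
```

Here the constant is 3 (the non-members' average) and the coefficient on the dummy is 4, which equals the gap between the two group averages (7 versus 3).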
5.5 The standard error
The standard error of the regression equation (SE) measures the precision with which y can be predicted. For particular values of the x's, the regression equation based on a particular sample can be used to calculate the predicted value of y. Call this predicted value µy. Then the standard error (SE) of the estimate can be used to make the following statement: we are approximately 95% confident that the true population value of y for those values of the x's is in the range µy ± 2 SE.
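A minimal sketch of this calculation follows. It assumes the usual definition of the standard error of the regression (the square root of the sum of squared residuals divided by n - k - 1, with k independent variables) and applies the ± 2 SE rule from the text. The data and coefficient values are hypothetical.

```python
import math

def regression_standard_error(y, fitted, k):
    """Standard error of the regression with k independent variables."""
    n = len(y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.sqrt(ss_resid / (n - k - 1))

# Hypothetical observations and hypothetical estimates for y = b0 + b1 * x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

fitted = [b0 + b1 * xi for xi in x]
se = regression_standard_error(y, fitted, k=1)

# Predicted y at x = 3, with the rough 95% range from the text.
predicted = b0 + b1 * 3
low, high = predicted - 2 * se, predicted + 2 * se
print(round(se, 3), round(predicted, 3), round(low, 3), round(high, 3))
```

For these numbers SE is about 0.89, so the rough 95% range around the predicted value of 4.0 runs from about 2.21 to about 5.79.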
5.6 Collinearity
One of the assumptions underlying OLS regression methods is that the explanatory variables are "linearly independent". If two independent variables are highly correlated they should not both appear in a regression equation, because they do not differ sufficiently from each other. As a rule of thumb, if |r| ≥ 0.7 between two variables, the variables should not both appear as independent variables in a regression equation, because they lack independent explanatory power.
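The rule of thumb can be applied mechanically by computing r for every pair of candidate independent variables and listing the pairs with |r| ≥ 0.7. The sketch below does this with hypothetical data; the function names are illustrative only.

```python
import math
from itertools import combinations

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def collinear_pairs(variables, threshold=0.7):
    """List (name, name, r) for every pair with |r| at or above the threshold."""
    flagged = []
    for name_a, name_b in combinations(variables, 2):
        r = pearson_r(variables[name_a], variables[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical candidate independent variables.
candidates = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 4, 5],
    "x3": [3, 1, 4, 1, 5],
}

flagged = collinear_pairs(candidates)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r:.2f}")
```

Only x1 and x2 are flagged here (r is about 0.77), so under the rule of thumb they should not both appear as independent variables in the same equation.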
5.7 The general-to-specific modelling methodology
Consider a theory which implies that y is a function of x1, x2, x3, x4, x5:
y = f(x1, x2, x3, x4, x5)
This is the general model of y. Data are collected on y and the x's and, ignoring the complexities of "non-linear" relationships, a regression equation is estimated:
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5
This is the first step in making the model more specific: moving from a general model to a general linear model. Imagine that significance levels are: β1 70%, β2 60%, β3 5%, β4 4%, β5 1%. Then the least significant variable (x1) would be dropped from the equation, and the equation re-estimated:
y = β0 + β2 x2 + β3 x3 + β4 x4 + β5 x5
This process of identifying and dropping the least significant independent variable is repeated until all regression coefficients reach some specified required level of significance (1%, 5% or 10%, according to context). This is a very rough version of the general-to-specific modelling methodology.
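The drop-and-re-estimate loop can be sketched as follows. The function `general_to_specific` and its `estimate` argument are hypothetical: `estimate` stands for whatever routine fits the regression and reports a significance level (in %) for each coefficient. The stand-in used here just returns the imagined levels from the text and pretends they do not change on re-estimation, which a real regression would not guarantee.

```python
def general_to_specific(variables, estimate, required_level):
    """Drop the least significant variable and re-estimate until all pass.

    `estimate` takes a list of variable names and returns a dict mapping
    each name to the significance level (in %) of its coefficient.
    """
    kept = list(variables)
    while kept:
        levels = estimate(kept)
        # The least significant variable has the largest significance level.
        worst = max(kept, key=lambda name: levels[name])
        if levels[worst] <= required_level:
            break               # every remaining coefficient passes
        kept.remove(worst)      # drop it; the loop then re-estimates
    return kept

# Stand-in for a real regression: fixed, imagined significance levels.
IMAGINED_LEVELS = {"x1": 70, "x2": 60, "x3": 5, "x4": 4, "x5": 1}

def fake_estimate(names):
    return {name: IMAGINED_LEVELS[name] for name in names}

final = general_to_specific(
    ["x1", "x2", "x3", "x4", "x5"], fake_estimate, required_level=5
)
print(final)
```

With a required level of 5%, x1 is dropped first, then x2, leaving x3, x4 and x5. In practice the significance levels shift each time a variable is removed, which is why the equation is re-estimated after every drop.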
Other things being equal, a simpler explanation is preferred. This translates to a preference for a regression equation with fewer explanatory variables. The general-to-specific modelling methodology captures the desire for explanatory parsimony.⁴

⁴ Parsimony: excessive carefulness in spending money or using resources; meanness. Occam's Razor: the principle that, all other things being equal, a simpler explanation is better.