
# A brief guide to statistical analysis ς

1 Summary Statistics
2 Normality
3 The Correlation Matrix
4 Regression Analysis: The Basics
5 Regression Analysis: Hypothesis Testing and Modelling Issues
5.1 Significance of individual parameters
5.2 Explanatory power of the equation
5.3 Significance of the equation: The F-test
5.4 Dummy variables
5.5 The standard error
5.6 Collinearity
5.7 The general-to-specific modelling methodology

Statistical analysis begins with the collection of a sample of data. These data can be
subjected to various kinds of statistical analysis. The purpose of the analysis is to draw
inferences regarding the population from which the sample is drawn.

## 1 Summary Statistics

The mean is the arithmetic average.¹ The standard deviation and variance measure the
variability of the observations around the mean. The greater the variability, the larger
the standard deviation. The variance is the standard deviation squared. The minimum
(Min) is the smallest observation. The maximum (Max) is the largest observation. The
mode is the most common observation. The median (Med) is the middle observation.²
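
As a minimal sketch, the summary statistics above can be computed with Python's standard `statistics` module. The sample values are hypothetical, and the sketch assumes the sample (n - 1) versions of the variance and standard deviation, a choice the text does not specify.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2, 6, 8, 8, 12]

summary = {
    "mean": statistics.mean(sample),
    "variance": statistics.variance(sample),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(sample),      # square root of the sample variance
    "min": min(sample),
    "max": max(sample),
    "mode": statistics.mode(sample),
    "median": statistics.median(sample),
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

Squaring `summary["std_dev"]` recovers `summary["variance"]`, which is the relationship between the two measures stated above.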

## 2 Normality³

All basic methods of statistical analysis assume that the observations follow an
approximately normal distribution. A normal distribution can be understood as a
bell-shaped frequency distribution around the mean. There are two elementary tests of
normality, that is, tests of whether the approximate normality assumption is satisfied:
skewness and kurtosis.

ς © 2002-2006 Shane Bonetti

¹ The sum of the observations divided by the number of observations.

² E.g. if the observations are 2, 6, 8, 10, 12 the median is 8.

³ (1) These tests of normality should only be applied to continuous variables, not to dummy
variables. (2) If there is serious skewness or kurtosis, it becomes necessary to use statistical
methods which do not involve any assumption regarding the underlying distribution of the
variables. These are called non-parametric statistical methods. One alternative is to transform
the variable (e.g. take the natural log, or the square or the square root, or some other
transformation). Often a variable which is not normally distributed can be transformed into a
normally distributed variable in this way.

Skewness is the absence of symmetry. It involves the hump of the distribution being away
from the mean. If there is no skewness then the statistical measure of skewness (s) is zero.
There are tests available to show whether s differs significantly from zero. As a rule of
thumb, |s| > 1 indicates potentially serious non-normality.

Kurtosis involves excessive flatness or steepness of the distribution, or
insufficient/excessive "weight in the tails". If there is no kurtosis (if the variable is not
"kurtotic") then the statistical measure of kurtosis (k) is zero. There are tests available to
show whether k differs significantly from zero. As a rule of thumb, |k| > 3 indicates
potentially serious non-normality.
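
The two rules of thumb can be sketched as follows. This is one possible implementation, assuming simple moment-based estimators; k is computed as "excess" kurtosis (the fourth standardised moment minus 3) so that it is zero for a normal distribution, as the text requires. Statistical packages often apply small-sample corrections, so their values may differ slightly. The data are hypothetical.

```python
def central_moment(data, order):
    """Average of (observation - mean) ** order over the sample."""
    mean = sum(data) / len(data)
    return sum((v - mean) ** order for v in data) / len(data)

def skewness(data):
    """Moment-based skewness s: zero for a perfectly symmetric sample."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution gives zero."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]   # hypothetical symmetric data
skewed = [1, 1, 1, 2, 10]     # hypothetical data with a long right tail

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    s, k = skewness(data), excess_kurtosis(data)
    # Apply the rules of thumb |s| > 1 and |k| > 3 from the text.
    verdict = "potentially serious" if abs(s) > 1 or abs(k) > 3 else "no warning"
    print(f"{name}: s = {s:.3f}, k = {k:.3f} -> {verdict}")
```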

## 3 The Correlation Matrix

The correlation coefficient (r) indicates the strength of the linear relationship between two
variables. r can take values –1 ≤ r ≤ 1:
- r = 0 implies no linear correlation
- r > 0 implies a positive correlation
- r < 0 implies a negative correlation
- r = 1 means there is perfect positive correlation
- r = –1 means there is perfect negative correlation.

The correlation matrix is a table containing all of the correlation coefficients between a
list of variables. It has two uses. First, it provides a quick (though ultimately
unsatisfactory) method of checking the patterns which the data reveal. If theory states that
x1 and x2 are positively related, this is supported if r > 0, and contradicted if r ≤ 0.
Second, the correlation matrix provides a collinearity check (see 5.6).
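
A correlation matrix can be built from pairwise correlation coefficients. The sketch below assumes the usual Pearson formula for r; the function names and the data are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between variables a and b."""
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
            for a in columns}

# Hypothetical observations on three variables.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(data)

# Print the matrix as a table, one row per variable.
print(" " * 6 + "".join(f"{name:>8}" for name in data))
for a in data:
    print(f"{a:<6}" + "".join(f"{matrix[a][b]:8.2f}" for b in data))
```

The matrix is symmetric with ones on the diagonal, so in practice only the entries below (or above) the diagonal need to be read.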

## 4 Regression Analysis: The Basics

Ordinary Least Squares (OLS) regression finds a "line of best fit" for any given data,
between a dependent variable (the variable which is to be explained, typically denoted y)
and one or more independent variables (the explanatory variables, typically denoted x).
Statistics or econometrics packages deliver an estimate of the constant term and the slope
terms. For a regression equation with one independent variable, the regression equation
is:

y = β0 + β1 x

β1 is a "regression coefficient". It measures the average change in y associated with a
one-unit increase in x. β0 is the constant term. It measures the average value of y when
x = 0. For a regression equation with multiple independent variables, the regression
equation is:

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + ...

β1, β2, β3, β4, β5 ... are "regression coefficients". Each measures the average change in y
associated with a one-unit increase in the relevant x, holding all of the other x's constant.
This provides a better test of theoretical hypotheses than correlation coefficients. β0 is
the constant term. It measures the average value of y when all of the x's are zero.
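
To illustrate what a statistics package does behind the scenes, the sketch below estimates β0, β1, β2, ... by solving the least-squares "normal equations" with plain Gaussian elimination. It is a teaching sketch only: the function name `ols` and the data are hypothetical, and real packages use more numerically careful methods.

```python
def ols(xs, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals.

    xs is a list of k columns (one per independent variable), each holding
    n observations; y holds the n observations on the dependent variable.
    """
    n, p = len(y), len(xs) + 1
    # Each row of the design matrix: 1 for the constant term, then the x values.
    rows = [[1.0] + [col[i] for col in xs] for i in range(n)]
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [
        [sum(row[i] * row[j] for row in rows) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]

# Hypothetical data, constructed so that y = 1 + 2*x1 - 3*x2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [-3, 2, -5, 0, -4]

coefficients = ols([x1, x2], y)
print("b0, b1, b2 =", ", ".join(f"{b:.3f}" for b in coefficients))
```

Because the data were built without any noise, the estimates recover the constant term and the two slopes used to construct y.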

## 5 Regression Analysis: Hypothesis Testing and Modelling Issues

## 5.1 Significance of individual parameters

The β's can be positive, negative or zero. Typically, theory suggests a hypothesis for
particular β's: that β > 0 or β < 0. But the estimated β's are based on a sample. If the
sample indicates, for instance, that β > 0, we require some measure of confidence that β > 0
rather than β = 0, in the population from which the sample is taken. This is revealed by a
significance test. If β is "significant at the x% level", then, were the true value of β
zero, there would be at most an x% chance of obtaining an estimate this far from zero purely
through sampling variation. Loosely, this is read as (100–x)% confidence that β ≠ 0. The
"gold standard" for scientific work is to require significance at the 5% level, though this
choice is arbitrary.
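
One common way of carrying out such a test is the t-ratio: the estimated coefficient divided by its standard error, compared with a tabulated critical value. The sketch below does this for a regression with one independent variable. The function name, the data, and the use of the two-sided 5% critical value for 8 degrees of freedom (2.306, from standard t tables) are assumptions made for illustration.

```python
import math

def slope_t_ratio(x, y):
    """Return (b1, standard error of b1, t-ratio) for y = b0 + b1 x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    # Sum of squared residuals around the fitted line.
    ssr = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    se_b1 = math.sqrt(ssr / (n - 2) / sxx)
    return b1, se_b1, b1 / se_b1

# Two-sided 5% critical value of Student's t with n - 2 = 8 degrees of freedom.
T_CRITICAL_5PCT_DF8 = 2.306

x = list(range(1, 11))  # hypothetical data, n = 10
y_related = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.1, 18.3, 19.7]
y_unrelated = [5, 3, 6, 4, 5, 6, 3, 4, 6, 4]

for label, y in (("related", y_related), ("unrelated", y_unrelated)):
    b1, se_b1, t = slope_t_ratio(x, y)
    significant = abs(t) > T_CRITICAL_5PCT_DF8
    print(f"{label}: b1 = {b1:.3f}, t = {t:.2f}, significant at 5%: {significant}")
```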

## 5.2 Explanatory power of the equation

One way of assessing regression equations and comparing different regression equations
is to ask "what proportion of the variation in the dependent variable (y) is explained by
the independent variables (x's)?" Other things being equal, the more that is explained the
better. adjR2 ("adjusted R squared", or "R-bar squared" as it is sometimes called) measures
this, with an adjustment for the number of independent variables in the equation. It is
close to 0 when almost none of the variation in the dependent variable can be accounted for
by variations in the independent variables (and can even be slightly negative when the fit
is very poor). It approaches 1 when almost all of the variation in the dependent variable
can be accounted for by variations in the independent variables. Other things being equal,
the larger is adjR2 the better is the regression equation.
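
As a rough sketch, adjR2 can be computed from the ordinary R squared using the standard adjustment for the number of observations (n) and the number of independent variables (k). The function names and the observed and fitted values below are hypothetical.

```python
def r_squared(y, fitted):
    """Share of the variation in y around its mean accounted for by the fit."""
    mean_y = sum(y) / len(y)
    total = sum((v - mean_y) ** 2 for v in y)
    residual = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - residual / total

def adjusted_r_squared(r2, n, k):
    """Adjust R squared for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed values and fitted values from a regression with k = 1.
y = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, fitted)
adj_r2 = adjusted_r_squared(r2, n=len(y), k=1)
print(f"R2 = {r2:.3f}, adjR2 = {adj_r2:.3f}")
```

For a given R squared and sample size, adding independent variables lowers adjR2, which is why it is the more useful measure when comparing equations with different numbers of x's.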
## 5.3 Significance of the equation: The F-test

The F-test indicates the degree of confidence that there exists a relationship between the
independent and dependent variables for the population. That is, the F-test shows the
significance (see 5.1) of the explanatory power (see 5.2) of the equation. If the F-test is
significant at the x% level, then, were there no relationship between the independent
variables and the dependent variable in the population, there would be at most an x% chance
of observing explanatory power this large in a sample.
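
The F statistic for the equation as a whole can be written in terms of the (unadjusted) R squared, the number of observations n and the number of independent variables k. The sketch below assumes that standard formula; the input values are hypothetical, and the 5% critical value of 3.01 for (4, 16) degrees of freedom is taken from standard F tables.

```python
def f_statistic(r2, n, k):
    """F statistic for the joint significance of k independent variables."""
    explained_per_variable = r2 / k
    unexplained_per_df = (1.0 - r2) / (n - k - 1)
    return explained_per_variable / unexplained_per_df

# 5% critical value of F with (k, n - k - 1) = (4, 16) degrees of freedom.
F_CRITICAL_5PCT_4_16 = 3.01

# Hypothetical regression: R squared of 0.5 from 21 observations and 4 x's.
f = f_statistic(r2=0.5, n=21, k=4)
print(f"F = {f:.2f}, significant at 5%: {f > F_CRITICAL_5PCT_4_16}")
```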

## 5.4 Dummy variables

Dummy variables (DV's) classify subjects or observations into two mutually exclusive
and exhaustive classes. A dummy variable takes a value 0 or 1, 1 for membership of a
class, 0 for non-membership. The regression coefficient (β) on a dummy variable
measures the average effect on the dependent variable (y) of membership of the class, by
comparison with non-members.
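
A small sketch may make this concrete. When the only independent variable is a dummy, the estimated constant term equals the average y of non-members and the coefficient on the dummy equals the gap between the two group averages. The function name and the data are hypothetical.

```python
def ols_simple(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical data: member = 1 for membership of the class, 0 otherwise.
member = [0, 0, 0, 1, 1, 1]
y = [10, 12, 11, 15, 17, 16]

b0, b1 = ols_simple(member, y)

# Group averages, computed directly for comparison.
mean_non_members = sum(v for d, v in zip(member, y) if d == 0) / member.count(0)
mean_members = sum(v for d, v in zip(member, y) if d == 1) / member.count(1)

print(f"b0 = {b0:.2f} (non-member average = {mean_non_members:.2f})")
print(f"b1 = {b1:.2f} (difference in averages = {mean_members - mean_non_members:.2f})")
```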

## 5.5 The standard error

The standard error of the regression equation (SE) measures the precision with which y
can be predicted. For particular values of the x's, the regression equation based on a
particular sample can be used to calculate the predicted value of y. Call this predicted
value µy. Then the standard error (SE) of the estimate can be used to make the following
statement: we are approximately 95% confident that the true population value of y for those
values of x is in the range µy ± 2 SE.
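
The range can be sketched for a regression with one independent variable as follows. The sketch assumes the usual definition of the standard error of the regression (the square root of the sum of squared residuals divided by n - 2); the function name and data are hypothetical. The ± 2 SE range is a rough guide only, since it ignores the uncertainty in the estimated coefficients themselves.

```python
import math

def fit_with_standard_error(x, y):
    """Return (b0, b1, SE) where SE is the standard error of the regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    # Sum of squared residuals, divided by n - 2 degrees of freedom.
    ssr = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return b0, b1, math.sqrt(ssr / (n - 2))

# Hypothetical sample.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1, se = fit_with_standard_error(x, y)

x_new = 6                       # value of x at which y is to be predicted
predicted = b0 + b1 * x_new     # the predicted value of y
lower, upper = predicted - 2 * se, predicted + 2 * se
print(f"predicted y = {predicted:.2f}, rough 95% range = [{lower:.2f}, {upper:.2f}]")
```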

## 5.6 Collinearity

One of the assumptions underlying OLS regression methods is that the explanatory
variables are "linearly independent". If two independent variables are highly correlated
they should not both appear in a regression equation, because they do not differ
sufficiently from each other. As a rule of thumb, if |r| ≥ 0.7 between two variables, the
variables should not both appear as independent variables in a regression equation,
because they lack independent explanatory power.
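
The rule of thumb lends itself to a simple screening step before estimation. The sketch below flags every pair of candidate independent variables with |r| ≥ 0.7; the function names, the default threshold argument and the data are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(columns, threshold=0.7):
    """List (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged

# Hypothetical candidate independent variables.
candidates = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [4, 2, 1, 5, 3],
}

flagged = collinear_pairs(candidates)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.2f}, should not both appear in the equation")
```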

## 5.7 The general-to-specific modelling methodology

Consider a theory which implies that y is a function of x1, x2, x3, x4, x5:

y = f(x1, x2, x3, x4, x5)

This is the general model of y. Data are collected on y and the x's, and ignoring the
complexities of "non-linear" relationships, a regression equation is estimated:

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5

This is the first step in making the model more specific: moving from a general model to
a general linear model. Imagine that significance levels are: β1 70%, β2 60%, β3 5%, β4
4%, β5 1%. Then the least significant variable (x1) would be dropped from the equation,
and the equation re-estimated:

y = β0 + β2 x2 + β3 x3 + β4 x4 + β5 x5

This process of identifying and dropping the least significant independent variable is
repeated until all regression coefficients reach some specified required level of
significance (1%, 5% or 10%, according to context). This is a (very rough) version of
the general-to-specific modelling methodology.
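
The elimination loop can be sketched as below. The function `general_to_specific` and the `fit` argument are hypothetical: `fit` stands for whatever routine re-estimates the equation and returns each remaining variable's significance level in percent. The stand-in used here simply looks up the illustrative levels from the text and does not re-estimate anything; in a real analysis the levels would change each time a variable is dropped.

```python
def general_to_specific(variables, fit, required_level=5.0):
    """Drop the least significant variable until all reach required_level (%).

    fit(names) must return a dict mapping each name to its significance
    level in percent (a smaller number means more significant).
    """
    kept = list(variables)
    while kept:
        levels = fit(kept)
        # The least significant variable has the largest significance level.
        worst = max(kept, key=lambda name: levels[name])
        if levels[worst] <= required_level:
            break  # every remaining variable meets the required level
        kept.remove(worst)
    return kept

# Illustrative significance levels from the text (in percent).
ILLUSTRATIVE_LEVELS = {"x1": 70.0, "x2": 60.0, "x3": 5.0, "x4": 4.0, "x5": 1.0}

def stand_in_fit(names):
    """Hypothetical stand-in for re-estimation: returns fixed levels."""
    return {name: ILLUSTRATIVE_LEVELS[name] for name in names}

final_model = general_to_specific(["x1", "x2", "x3", "x4", "x5"], stand_in_fit)
print("Variables retained at the 5% level:", final_model)
```

With these fixed levels, x1 is dropped first and x2 second, after which every remaining variable is significant at the 5% level.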

Other things being equal, a simpler explanation is preferred. This translates to a
preference for a regression equation with fewer explanatory variables, other things being
equal. The general-to-specific modelling methodology captures the desire for explanatory
parsimony.⁴

⁴ Parsimony: excessive carefulness in spending money or using resources; meanness. Occam's
Razor: the principle that, all other things being equal, a simpler explanation is better.