
1 Summary Statistics

2 Normality

3 The Correlation Matrix

4 Regression Analysis: The Basics

5 Regression Analysis: Hypothesis Testing and Modelling Issues

5.1 Significance of individual parameters

5.2 Explanatory power: adjR2

5.3 Significance of the equation: The F-test

5.4 Dummy variables

5.5 The standard error

5.6 Collinearity

5.7 The general-to-specific modelling methodology

Statistical analysis begins with the collection of a sample of data. These data can be

subjected to various kinds of statistical analysis. The purpose of the analysis is to draw

inferences regarding the population from which the sample is drawn.

1 Summary Statistics

The mean is the arithmetic average.1 The standard deviation and variance measure

variability of the observations around the mean. The larger is the variability, the greater is

the standard deviation. The variance is the standard deviation squared. The minimum

(Min) is the smallest observation. The maximum (Max) is the largest observation. The

mode is the most common observation. The median (Med) is the middle observation.2
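As an illustration (not part of the original text), these summary statistics can be computed directly in Python with the standard library's statistics module; the sample below is made up:

import statistics as st

x = [2, 6, 8, 10, 12, 12]            # a small, made-up sample

print("mean:    ", st.mean(x))       # arithmetic average
print("st. dev.:", st.stdev(x))      # sample standard deviation
print("variance:", st.variance(x))   # the standard deviation squared
print("min:     ", min(x))           # smallest observation
print("max:     ", max(x))           # largest observation
print("median:  ", st.median(x))     # middle observation
print("mode:    ", st.mode(x))       # most common observation (12 here)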

2 Normality3

All basic methods of statistical analysis assume that the observations follow an approximately normal distribution. A normal distribution can be understood as a bell-shaped frequency distribution around the mean. There are two elementary tests of normality, that is, tests of whether the approximate normality assumption is satisfied: skewness and kurtosis.

1 The sum of the observations divided by the number of observations.

2 E.g. if the observations are 2, 6, 8, 10, 12, the median is 8.

3 (1) These tests of normality should only be applied to continuous variables, not to dummy variables. (2) If there is serious skewness or kurtosis, it becomes necessary to use statistical methods which do not involve any assumption regarding the underlying distribution of the variables. These are called non-parametric statistical methods. One alternative is to transform the variable (e.g. take the natural log, or the square, or the square root, or some other transformation). Often a variable which is not normally distributed can be transformed into a normally distributed variable in this way.

Skewness is the absence of symmetry. It involves the hump of the distribution being away

from the mean. If there is no skewness then the statistical measure of skewness (s) is zero.

There are tests available to show whether s differs significantly from zero. As a rule of

thumb, |s| > 1 indicates potentially serious non-normality.

Kurtosis is insufficient/excessive "weight in the tails". If there is no kurtosis (if the variable is not

"kurtotic") then the statistical measure of kurtosis (k) is zero. There are tests available to

show whether k differs significantly from zero. As a rule of thumb, |k| > 3 indicates

potentially serious non-normality.
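As a sketch (not from the original; the data are synthetic), the skewness and kurtosis statistics can be computed with scipy.stats:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)          # a deliberately right-skewed synthetic variable

s = stats.skew(x)                    # 0 for a perfectly symmetric distribution
k = stats.kurtosis(x)                # excess kurtosis; 0 for a normal distribution

print("skewness s =", round(s, 2))   # |s| > 1 suggests potentially serious non-normality
print("kurtosis k =", round(k, 2))   # |k| > 3 suggests potentially serious non-normality

# One remedy mentioned in footnote 3: transform the variable, e.g. take the natural log.
print("skewness of log(x) =", round(stats.skew(np.log(x)), 2))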

3 The Correlation Matrix

The correlation coefficient (r) indicates the strength of the linear relationship between two

variables. r can take values –1 ≤ r ≤ 1:

r = 0 implies no linear correlation

r > 0 implies a positive correlation

r < 0 implies a negative correlation

r = 1 means there is perfect positive correlation

r = –1 means there is perfect negative correlation.

The correlation matrix is a table containing all of the correlation coefficients between a

list of variables. It has two uses. First, it provides a quick (though ultimately

unsatisfactory) method of checking the patterns which the data reveal. If theory states that

x1 and x2 are positively related, this is confirmed if r > 0, and disconfirmed if r ≤ 0.

Second, the correlation matrix provides a collinearity check (See 5.6).
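To illustrate (a sketch with made-up data, not from the original), a correlation matrix can be produced with pandas:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)    # built to be correlated with x1
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
print(df.corr())    # table of r for every pair of variables

Each off-diagonal entry is the r between one pair of variables; the diagonal entries are always 1.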

4 Regression Analysis: The Basics

Ordinary Least Squares (OLS) regression finds a "line of best fit" for any given data,

between a dependent variable (the variable which is to be explained, typically denoted y)

and one or more independent variables (the explanatory variables, typically denoted x).

Statistics or econometrics packages deliver an estimate of the constant term and the slope

terms. For a regression equation with one independent variable, the regression equation

is:

y = β0 + β1 x

β1 is a "regression coefficient". It measures the average effect of x on y. β0 is the constant

term. It measures the average value of y when x = 0. For a regression equation with

multiple independent variables, the regression equation is:

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + ...

β1, β2, β3, β4, β5 ... are "regression coefficients". They measure the average effect of the

relevant x on y, holding all of the other x's constant. This provides a better test of

theoretical hypotheses than correlation coefficients. β0 is the constant term. It measures

the average value of y when all of the x's are zero.
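A minimal sketch of how such an equation might be estimated in Python with the statsmodels package (the data and variable names are invented for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)       # true β0 = 1, β1 = 2, β2 = –0.5

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))   # add_constant supplies the constant term β0
results = sm.OLS(y, X).fit()                              # ordinary least squares fit
print(results.params)                                     # estimated β0, β1, β2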

5 Regression Analysis: Hypothesis Testing and Modelling Issues

5.1 Significance of individual parameters

The β's can be positive, negative or zero. Typically, theory suggests a hypothesis for

particular β's – that β > 0 or β < 0. But the β's are based on a sample. If the sample

indicates, for instance, that β > 0, we require some measure of confidence that β > 0

rather than β = 0, in the population from which the sample is taken. This is revealed by a

significance test. If β is "significant at the x% level", there is an x% chance that the true value of β is zero, and (100–x)% confidence that β ≠ 0. The "gold standard" for scientific

work is to require significance at the 5% level, though this choice is arbitrary.
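Continuing the same kind of sketch (synthetic data, statsmodels assumed), the p-value of each coefficient gives its significance level directly:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)      # true β1 = 0.5, β0 = 0

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.pvalues)           # p-value for the constant and for β1
print(results.pvalues < 0.05)    # True where the coefficient is significant at the 5% level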

5.2 Explanatory power: adjR2

One way of assessing regression equations and comparing different regression equations

is to ask "what proportion of the variation in the dependent variable (y) is explained by

the independent variables (x's)?" Other things being equal, the more that is explained the

better. adjR2 ("adjusted R squared" or R-bar squared as it is sometimes called) measures

this. It varies between 0 (none of the variation in the dependent variable can be accounted

for by variations in the independent variables) and 1 (all of the variation in the dependent

variable can be accounted for by variations in the independent variables). Other things

being equal, the larger is adjR2 the better is the regression equation.
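In the same statsmodels sketch (invented data), adjusted R squared is reported alongside the ordinary R squared:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=150)

results = sm.OLS(y, sm.add_constant(X)).fit()
print("R squared:         ", round(results.rsquared, 3))
print("adjusted R squared:", round(results.rsquared_adj, 3))   # penalised for extra regressors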

5.3 Significance of the equation: The F-test

The F-test indicates the degree of confidence that there exists a relationship between the

independent and dependent variables for the population. That is, the F-test shows the

significance (see 5.1) of the explanatory power (see 5.2) of the equation. If the F-test is

significant at the x% level, there is an x% chance that there is no relationship between the independent variables and the dependent variable.
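A minimal sketch (synthetic data, statsmodels assumed) of reading the F statistic and its p-value:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3))
y = 0.8 * X[:, 0] + rng.normal(size=150)    # only the first regressor matters here

results = sm.OLS(y, sm.add_constant(X)).fit()
print("F statistic:", round(results.fvalue, 2))
print("F p-value:  ", results.f_pvalue)     # a small value means the equation as a whole is significant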

5.4 Dummy variables

Dummy variables (DV's) classify subjects or observations into two mutually exclusive

and exhaustive classes. A dummy variable takes a value 0 or 1, 1 for membership of a

class, 0 for non-membership. The regression coefficient (β) on a dummy variable

measures the average effect on the dependent variable (y) of membership of the class, by

comparison with non-members.
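As an illustration (the variables and data are invented), a 0/1 dummy simply enters the regression like any other column:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 120
member = rng.integers(0, 2, size=n)                   # dummy: 1 = member of the class, 0 = non-member
x = rng.normal(10, 3, size=n)                         # some continuous regressor
y = 12 + 0.8 * x - 2.0 * member + rng.normal(size=n)

df = pd.DataFrame({"y": y, "x": x, "member": member})
results = sm.OLS(df["y"], sm.add_constant(df[["x", "member"]])).fit()
print(results.params["member"])   # average effect on y of membership, relative to non-members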

5.5 The standard error

The standard error of the regression equation (SE) measures the precision with which y

can be predicted. For particular values of the x's, the regression equation based on a

particular sample can be used to calculate the predicted value of y. Call this predicted

value µy. Then the standard error (SE) of the estimate can be used to make the following

statement: we are 95% confident that the true population value of y for those values of x

is in the range µy ± 2 SE.
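A rough sketch of the µy ± 2 SE statement in Python (synthetic data; SE is taken here as the residual standard error reported by statsmodels, and parameter uncertainty is ignored, as in the text):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 3.0 + 1.2 * x + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(x)).fit()

se = np.sqrt(results.scale)                              # standard error of the regression
x_new = 1.5
mu_y = results.params[0] + results.params[1] * x_new     # predicted y for x = 1.5

print("predicted y:", round(mu_y, 2))
print("rough 95% range:", round(mu_y - 2 * se, 2), "to", round(mu_y + 2 * se, 2))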

5.6 Collinearity

One of the assumptions underlying OLS regression methods is that the explanatory

variables are "linearly independent". If two independent variables are highly correlated

they should not both appear in a regression equation, because they do not differ

sufficiently from each other. As a rule of thumb, if |r| ≥ 0.7 between two variables, the

variables should not both appear as independent variables in a regression equation,

because they lack independent explanatory power.
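The |r| ≥ 0.7 rule of thumb can be checked mechanically from the correlation matrix; a sketch with made-up data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)

corr = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}).corr()
print(corr)

# Flag pairs of candidate regressors with |r| >= 0.7
for i in corr.index:
    for j in corr.columns:
        if i < j and abs(corr.loc[i, j]) >= 0.7:
            print("collinearity warning:", i, "and", j, "r =", round(corr.loc[i, j], 2))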

5.7 The general-to-specific modelling methodology

Consider a theory which implies that y is a function of x1, x2, x3, x4, x5:

y = f(x1, x2, x3, x4, x5)

This is the general model of y. Data are collected on y and the x's, and ignoring the complexities of "non-linear" relationships, a regression equation is estimated:

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5

This is the first step in making the model more specific – moving from a general model to

a general linear model. Imagine that significance levels are: β1 70%, β2 60%, β3 5%, β4 4%, β5 1%. Then the least significant variable (x1) would be dropped from the equation,

and the equation re-estimated:

y = β0 + β2 x2 + β3 x3 + β4 x4 + β5 x5

This process of identifying and dropping the least significant independent variables is

repeated until all regression coefficients reach some specified required level of

significance (1%, 5% or 10%) according to context. This is a (very very rough) version of

the general-to-specific modelling methodology.
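A very rough sketch of this procedure in Python (statsmodels assumed, data invented): start from the general linear model and repeatedly drop the least significant regressor until every remaining coefficient meets the chosen significance level.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
y = 1.0 + 0.5 * X["x3"] + 0.7 * X["x4"] + 1.0 * X["x5"] + rng.normal(size=n)   # x1, x2 are irrelevant here

keep = list(X.columns)      # start from the general model with all regressors
threshold = 0.05            # required significance level (5% here)
while keep:
    results = sm.OLS(y, sm.add_constant(X[keep])).fit()
    pvals = results.pvalues.drop("const")    # ignore the constant term
    worst = pvals.idxmax()                   # least significant remaining variable
    if pvals[worst] <= threshold:
        break                                # every remaining coefficient is significant
    keep.remove(worst)                       # drop it and re-estimate

print("retained variables:", keep)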

Other things being equal, there is a preference for a regression equation with fewer explanatory variables. The general-to-specific modelling methodology captures the desire for explanatory parsimony.4

4 Occam's Razor: The principle that, all other things being equal, a simpler explanation is better.
