
Published by edgardoking_ca


A brief guide to statistical analysis


1 Summary Statistics
2 Normality
3 The Correlation Matrix
4 Regression Analysis: The Basics
5 Regression Analysis: 7 Hypothesis Testing and Modelling Issues
5.1 Significance of individual parameters
5.2 Explanatory power: adjR²
5.3 Significance of the equation: The F-test
5.4 Dummy variables
5.5 The standard error
5.6 Collinearity
5.7 The general-to-specific modelling methodology

Statistical analysis begins with the collection of a sample of data. These data can be subjected to various kinds of statistical analysis. The purpose of the analysis is to draw inferences regarding the population from which the sample is drawn.

1 Summary Statistics

The mean is the arithmetic average. [1] The standard deviation and variance measure the variability of the observations around the mean. The larger the variability, the greater the standard deviation. The variance is the standard deviation squared. The minimum (Min) is the smallest observation. The maximum (Max) is the largest observation. The mode is the most common observation. The median (Med) is the middle observation when the observations are sorted in order. [2]
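As a minimal sketch of the summary statistics listed above, the following uses Python's standard `statistics` module. The sample values are hypothetical (the numbers from the median example in note 2, with one extra 8 so that the mode is unambiguous), and the guide does not say whether the sample or population formula is meant, so the sample version (n - 1 divisor) is an assumption here.

```python
import statistics


def summarize(observations):
    """Return the summary statistics described above for a list of numbers."""
    return {
        "mean": statistics.mean(observations),
        # Sample standard deviation and variance (n - 1 divisor); an assumption.
        "sd": statistics.stdev(observations),
        "variance": statistics.variance(observations),
        "min": min(observations),
        "max": max(observations),
        "mode": statistics.mode(observations),
        "median": statistics.median(observations),
    }


# Hypothetical sample: the values from note 2 plus one repeated 8.
sample = [2, 6, 8, 8, 10, 12]
summary = summarize(sample)

for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

The variance printed is the square of the standard deviation, as stated in the text.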

2 Normality [3]

Many basic methods of statistical analysis assume that the observations follow an approximately normal distribution. A normal distribution can be understood as a bell-shaped frequency distribution around the mean. There are two elementary tests of normality, that is, tests of whether the approximate normality assumption is satisfied: skewness and kurtosis.

Skewness is the absence of symmetry. It involves the hump of the distribution being away from the mean. If there is no skewness then the statistical measure of skewness (s) is zero. There are tests available to show whether s differs significantly from zero. As a rule of thumb, |s| > 1 indicates potentially serious non-normality.

Kurtosis involves excessive flatness or steepness of the distribution, or insufficient/excessive "weight in the tails". If there is no kurtosis (if the variable is not "kurtotic") then the statistical measure of kurtosis (k), reported as excess kurtosis relative to the normal distribution, is zero. There are tests available to show whether k differs significantly from zero. As a rule of thumb, |k| > 3 indicates potentially serious non-normality.

© 2002-2006 Shane Bonetti

[1] The sum of the observations divided by the number of observations.

[2] E.g. if the observations are 2, 6, 8, 10, 12 the median is 8.

[3] (1) These tests of normality should only be applied to continuous variables, not to dummy variables. (2) If there is serious skewness or kurtosis, it may become necessary to use statistical methods which do not involve any assumption regarding the underlying distribution of the variables. These are called non-parametric statistical methods. One alternative is to transform the variable (e.g. take the natural log, or the square, or the square root, or some other transformation). Often a variable which is not normally distributed can be transformed into an approximately normally distributed variable in this way.
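The skewness and kurtosis measures, and the log transformation mentioned in note 3, can be sketched as follows. This assumes the simple moment-based definitions (population divisor n, with 3 subtracted from kurtosis so that a normal distribution gives zero); statistics packages may apply small-sample corrections and report slightly different values. The data are hypothetical.

```python
import math


def central_moment(values, order):
    """Average of (value - mean) ** order over the sample."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness s: zero for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis k, minus 3 so a normal distribution gives zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical right-skewed sample with a long upper tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
# Natural-log transformation, one of the options named in note 3.
logged = [math.log(v) for v in raw]

for label, values in (("raw", raw), ("logged", logged)):
    s = skewness(values)
    k = excess_kurtosis(values)
    flag = "potentially serious" if abs(s) > 1 or abs(k) > 3 else "acceptable"
    print(f"{label:>6}: s = {s:.2f}, k = {k:.2f} ({flag} by the rules of thumb)")
```

For this sample the log transformation pulls in the long upper tail, so the skewness of the transformed values is smaller than that of the raw values.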

3 The Correlation Matrix

The correlation coefficient (r) indicates the strength of the linear relationship between two variables. r can take values -1 ≤ r ≤ 1:

r = 0 implies no linear correlation
r > 0 implies a positive correlation
r < 0 implies a negative correlation
r = 1 means there is perfect positive correlation
r = -1 means there is perfect negative correlation.

The correlation matrix is a table containing all of the correlation coefficients between a list of variables. It has two uses. First, it provides a quick (though ultimately unsatisfactory) method of checking the patterns which the data reveal. If theory states that x₁ and x₂ are positively related, this is confirmed if r > 0, and disconfirmed if r ≤ 0. Second, the correlation matrix provides a collinearity check (see 5.6).
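A correlation matrix of the kind described above can be sketched in plain Python. The function names and the data are hypothetical, chosen so that the table contains a perfect positive correlation, a perfect negative correlation and a weak one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named variables."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical data set.
data = {
    "y": [1, 3, 5, 7, 9],
    "x1": [2, 4, 6, 8, 10],  # rises exactly in step with y
    "x2": [10, 8, 6, 4, 2],  # falls exactly as y rises
    "x3": [3, 1, 4, 1, 5],   # only loosely related to y
}
matrix = correlation_matrix(data)

names = list(data)
print(" " * 6 + "".join(f"{name:>8}" for name in names))
for row in names:
    print(f"{row:<6}" + "".join(f"{matrix[row][col]:8.2f}" for col in names))
```

Reading across the rows for the independent variables gives the quick collinearity check: here x1 and x2 are perfectly (negatively) correlated with each other.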

4 Regression Analysis: The Basics

Ordinary Least Squares (OLS) regression finds a "line of best fit" for any given data, between a dependent variable (the variable which is to be explained, typically denoted y) and one or more independent variables (the explanatory variables, typically denoted x). Statistics or econometrics packages deliver an estimate of the constant term and the slope terms. For a regression equation with one independent variable, the regression equation is:

y = β₀ + β₁x

β₁ is a "regression coefficient". It measures the average effect of x on y. β₀ is the constant term. It measures the average value of y when x = 0. For a regression equation with multiple independent variables, the regression equation is:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + β₅x₅ + ...

β₁, β₂, β₃, β₄, β₅, ... are "regression coefficients". They measure the average effect of the relevant x on y, holding all of the other x's constant. This provides a better test of theoretical hypotheses than correlation coefficients. β₀ is the constant term. It measures the average value of y when all of the x's are zero.
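For the one-variable case, the OLS estimates have a simple closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the constant makes the line pass through the point of means. The sketch below illustrates this with hypothetical data; it does not cover the multiple-variable equation, for which a statistics package would normally be used.

```python
def ols_simple(xs, ys):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                # slope: average effect of x on y
    beta0 = mean_y - beta1 * mean_x  # constant: fitted value of y when x = 0
    return beta0, beta1


# Hypothetical observations that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0, beta1 = ols_simple(xs, ys)
print(f"y = {beta0:.2f} + {beta1:.2f} x")
```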

5 Regression Analysis: 7 Hypothesis Testing and Modelling Issues

5.1 Significance of individual parameters

The β's can be positive, negative or zero. Typically, theory suggests a hypothesis for particular β's: that β > 0 or β < 0. But the β's are based on a sample. If the sample indicates, for instance, that β > 0, we require some measure of confidence that β > 0 rather than β = 0 in the population from which the sample is taken. This is revealed by a significance test. If β is "significant at the x% level" then, were the true value of β zero, an estimate at least this far from zero would arise in no more than x% of samples. Loosely, this is read as (100 - x)% confidence that β ≠ 0. The "gold standard" for scientific work is to require significance at the 5% level, though this choice is arbitrary.

5.2 Explanatory power: adjR²

One way of assessing regression equations and comparing different regression equations is to ask "what proportion of the variation in the dependent variable (y) is explained by the independent variables (x's)?" Other things being equal, the more that is explained the better. adjR² ("adjusted R squared", or R-bar squared as it is sometimes called) measures this, with an adjustment for the number of independent variables in the equation. It equals 1 when all of the variation in the dependent variable can be accounted for by variations in the independent variables, and is close to 0 when none of it can be (because of the adjustment, it can fall slightly below 0 when the independent variables explain almost nothing). Other things being equal, the larger adjR² is, the better the regression equation.
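The ideas in 5.1 and 5.2 can be sketched together for a one-variable regression. The guide does not give formulas, so the ones used here are the standard textbook ones and should be read as assumptions: the slope is tested with a t statistic (estimate divided by its standard error), and adjR² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The data and function names are hypothetical, and the 5% critical value is taken from standard t tables.

```python
import math


def fit_line(xs, ys):
    """Return (beta0, beta1, sxx) for the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    return mean_y - beta1 * mean_x, beta1, sxx


def slope_t_and_adj_r2(xs, ys):
    """Return (t statistic of the slope, R squared, adjusted R squared)."""
    n, k = len(xs), 1  # k = number of independent variables
    beta0, beta1, sxx = fit_line(xs, ys)
    mean_y = sum(ys) / n
    ss_res = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # Standard error of the slope: residual variance divided by sxx, square-rooted.
    se_beta1 = math.sqrt(ss_res / (n - k - 1) / sxx)
    t_stat = beta1 / se_beta1
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return t_stat, r2, adj_r2


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
t_stat, r2, adj_r2 = slope_t_and_adj_r2(xs, ys)

# Two-sided 5% critical value of Student's t with n - 2 = 3 degrees of freedom.
T_CRITICAL_5PCT_DF3 = 3.182
significant = abs(t_stat) > T_CRITICAL_5PCT_DF3

print(f"t = {t_stat:.2f}, significant at the 5% level: {significant}")
print(f"R2 = {r2:.4f}, adjR2 = {adj_r2:.4f}")
```

Because the adjustment penalises each additional independent variable, adjR² is never larger than the unadjusted R² for the same equation.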
