
Chapter 3

Applied Statistics
Introduction to Statistical Analysis
Why is statistics important?
It is part of the quantitative approach to knowledge: in physical science, the first essential
step towards learning any subject is to find principles of numerical reckoning and
practicable methods for measuring some quality connected with it.
According to Davis, statistics is "the determination of the probable from the possible".
What is statistics?
Two common uses of the word:
1. Descriptive statistics: numerical summaries of samples (what was observed).
In descriptive statistics we want to summarize some data in a shorter form.
Example: The adjustments of 14 GPS control points for this orthorectification ranged from 3.63
to 8.36 m, with an arithmetic mean of 5.145 m.
2. Inferential statistics: from samples to populations (what could have been, or will be, observed).
In inferential statistics we are trying to understand some process and possibly predict based on this
understanding.
Example: The mean adjustment for any set of GPS points used for orthorectification is no less
than 4.3 m and no more than 6.1 m; this statement has a 5% probability of being wrong.
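A rough Python sketch of the two uses, computing a descriptive summary and a 95% confidence interval for the mean; the adjustment values below are hypothetical, not the data behind the example above.

import numpy as np
from scipy import stats

# Hypothetical GPS adjustment values (metres)
adjustments = np.array([3.63, 4.10, 4.55, 4.72, 4.90, 5.05, 5.12,
                        5.20, 5.35, 5.60, 5.88, 6.10, 7.45, 8.36])

# Descriptive statistics: summarise what was observed
print("min/max:", adjustments.min(), adjustments.max())
print("mean:   ", adjustments.mean())

# Inferential statistics: a 95% confidence interval for the population mean,
# i.e. a statement about the population with a 5% chance of being wrong
n = len(adjustments)
ci = stats.t.interval(0.95, df=n - 1,
                      loc=adjustments.mean(),
                      scale=stats.sem(adjustments))
print("95% CI for the mean:", ci)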
Why use statistical analysis?
We want to understand a process from what it produces, so we need to model it, i.e. make a
conceptual or mathematical representation, from which we infer the process.
But how do we know if the model is correct?
* Are we imagining relations where there are none?
* Are there true relations we haven't found?
Statistical analysis gives us a way to quantify the confidence we can have in our inferences.

What is statistical analysis?


This term refers to a wide range of techniques to...
1. Describe 2. Explore 3. Understand 4. Prove 5. Predict
...based on sample datasets collected from populations, using some sampling strategy.
What is a statistical model?
A mathematical representation of a process or its outcome, with a computable level of
uncertainty, according to assumptions (more or less plausible or provable).
Such a model is an empirical model: it may imply the underlying process, but need not. It
might be useful for prediction, even if it is a black box.
Assumptions are not part of the model, but must be true for the model to be correct.
(Note: a process model, by contrast, explicitly represents the underlying process and tries to
simulate it.)
Statistical Inference
One of the main uses of statistics is to infer from a sample to a population, e.g.
* the true value of some parameter of interest (e.g. the mean)
* the degree of support for or against a hypothesis
This is a contentious subject; here we use simple frequentist notions.
Data analysis strategy
1. Posing the research questions
2. Examining data items and their support
3. Exploratory non-spatial data analysis
4. Non-spatial modelling
5. Exploratory spatial data analysis
6. Spatial modelling
7. Prediction
8. Answering the research questions

Regression
Regression is a statistical technique that helps us explore the relation between two
variables.
For example, consider a chemical distillation process.
The figure below shows a scatter diagram of hydrocarbon percentage against oxygen purity
percentage.

We represent hydrocarbon percentage with the variable x and oxygen purity percentage
with the variable y. Close inspection of the scatter diagram shows that although there is no single
curve passing through all the points, there is strong motivation to state that the points in the scatter
diagram are randomly distributed around a straight line.
It is therefore reasonable to assume that the mean of the random variable y is linearly related to x as
follows:

E(y | x) = β₀ + β₁x

The slope β₁ and intercept β₀ are called the regression coefficients of the linear regression model.
While the mean of y, E(y | x), is assumed to be a linear function of x, the actual value of y for a given
value of x will differ from this mean and is given as

y = β₀ + β₁x + e

where e represents the random error term in the regression.
Suppose we fix a value of x; then the randomness in y is on account of the random error
e, as the remaining terms are fixed by the linear relationship.
Suppose e is a random variable with zero mean and variance σ². Then

E(y | x) = β₀ + β₁x  and  Var(y | x) = Var(β₀ + β₁x + e) = σ²
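A minimal sketch of fitting this model by ordinary least squares in Python; the hydrocarbon (x) and oxygen-purity (y) values below are made up for illustration, not the data of the figure.

import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40])
y = np.array([90.0, 89.1, 91.4, 93.7, 96.7, 94.4, 87.6, 91.8, 99.4, 93.7])

# Least-squares estimates of the regression coefficients:
# b1 = Sxy / Sxx,  b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)                        # estimates of the error term e
sigma2_hat = np.sum(residuals ** 2) / (len(x) - 2)   # estimate of Var(e) = sigma^2

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}, sigma^2 = {sigma2_hat:.3f}")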

Multivariate Analysis
Many statistical techniques focus on just one or two variables. Multivariate analysis (MVA)
techniques allow more than two variables to be analysed at once. Multiple regression is not
typically included under this heading, but can be thought of as a multivariate analysis.
Multivariate Analysis Methods

Two general types of MVA technique:

1. Analysis of dependence
One (or more) variables are dependent variables, to be explained or predicted by others.
E.g. multiple regression, PLS, MDA.

2. Analysis of interdependence
No variables are thought of as dependent; we look at the relationships among variables, objects or cases.
E.g. cluster analysis, factor analysis.

Principal components
Principal components identify the underlying dimensions of a distribution and help understand the
joint or common variation among a set of variables. This is probably the most commonly used method
of deriving factors in factor analysis (before rotation).
The first principal component is identified as the vector (or, equivalently, the linear combination of
variables) onto which the most data variation can be projected.
The 2nd principal component is a vector perpendicular to the first, chosen so that it contains as much
of the remaining variation as possible.
And so on for the 3rd principal component, the 4th, the 5th, etc.

Multivariate Normal Distribution

Generalisation of the univariate normal

Determined by the mean (vector) and covariance matrix


X ~ N(μ, Σ)

E.g. the standard bivariate normal:

X ~ N((0, 0)ᵀ, I₂), with density

p(x, y) = (1 / 2π) · exp( −(x² + y²) / 2 )
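A small Python sketch of the standard bivariate normal: evaluating the density above and drawing samples with NumPy.

import numpy as np

def std_bivariate_normal_pdf(x, y):
    # Density of X ~ N((0,0)', I_2)
    return np.exp(-(x**2 + y**2) / 2.0) / (2.0 * np.pi)

print(std_bivariate_normal_pdf(0.0, 0.0))    # peak density, 1/(2*pi) ~ 0.159

# Sampling from a general multivariate normal N(mu, Sigma):
rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)                             # identity covariance = standard case
samples = rng.multivariate_normal(mu, Sigma, size=1000)
print(samples.mean(axis=0))                   # should be near (0, 0)
print(np.cov(samples, rowvar=False))          # should be near I_2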

Cluster Analysis

Techniques for identifying separate groups of similar cases.
Similarity of cases is either specified directly in a distance matrix, or defined in terms of some
distance function.
Cluster analysis is also used to summarise data by defining segments of similar cases in the data;
this use of cluster analysis is known as dissection.

Clustering Techniques

Two main types of cluster analysis methods:

1. Hierarchical cluster analysis
Each cluster (starting with the whole dataset) is divided into two, then divided again, and so on.

2. Iterative methods
E.g. k-means clustering (PROC FASTCLUS) and the analogous non-parametric density estimation methods.

There are also other methods, e.g. overlapping clusters and fuzzy clusters.
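A minimal k-means sketch (an iterative method, analogous to PROC FASTCLUS) using scikit-learn; the three artificial groups below are purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three artificial groups of similar cases in two dimensions
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("cluster sizes:  ", np.bincount(km.labels_))
print("cluster centres:\n", km.cluster_centers_)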

Applications

Market segmentation is usually conducted using some form of cluster analysis to divide
people into segments.

Other methods, such as latent class models or archetypal analysis, are sometimes used instead.

It is also possible to cluster other items, such as products/SKUs, image attributes, or brands.
Correspondence Analysis

Provides a graphical summary of the interactions in a table

Also known as a perceptual map

Can be very useful

But so are many other charts

E.g. to provide overview of cluster results

However the correct interpretation is less than intuitive, and this leads many researchers
astray

Software for Correspondence Analysis

The earlier chart was created using a specialised package called BRANDMAP

Can also do correspondence analysis in most major statistical packages

For example, using PROC CORRESP in SAS:


*---Perform Simple Correspondence Analysis (Example 1 in SAS OnlineDoc)---;
proc corresp all data=Cars outc=Coor;
run;

Partial Least Squares (PLS)

Multivariate generalisation of regression

We have a model of the form Y = XB + E

Also extract factors underlying the predictors

These are chosen to explain both the response variation and the variation among
predictors

Results are often more powerful than principal components regression

PLS also refers to a more general technique for fitting general path models, not discussed
here
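A brief sketch of PLS regression using scikit-learn's PLSRegression; the data, the response construction, and the choice of two components are illustrative only.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))                        # six predictors
# Response driven by the first two predictors plus noise (purely illustrative)
Y = X[:, :2] @ np.array([[2.0], [-1.0]]) + rng.normal(scale=0.5, size=(100, 1))

pls = PLSRegression(n_components=2)                  # extract two underlying factors
pls.fit(X, Y)

print("R^2 on the training data:", pls.score(X, Y))
print("X-scores (factor scores) shape:", pls.transform(X).shape)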

Structural Equation Modeling (SEM)

General method for fitting and testing path analysis models, based on covariances

Also known as LISREL modelling, after one of the first software packages for it

Implemented in SAS in PROC CALIS

Fits specified causal structures (path models) that usually involve factors or latent
variables

Confirmatory analysis

Broader MVA Issues

Preliminaries

EDA is usually very worthwhile

Univariate summaries, e.g. histograms

Scatterplot matrix

Multivariate profiles, spider-web plots

Missing data

Establish the amount (by variable, and overall) and the pattern (across individuals)

Think about reasons for missing data

Treat missing data appropriately, e.g. impute, or build it into the model fitting

MVA Issues

Preliminaries (continued)

Check for outliers

Large values of Mahalanobis D²

Testing results

Some methods provide statistical tests, but others do not.
Cross-validation gives a useful check on the results:
Leave-1-out cross-validation
Split-sample training and test datasets
Sometimes 3 groups are needed, for model building, training and testing
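A small Python sketch of the two checks above, a split-sample (training/test) check and leave-one-out cross-validation, using scikit-learn on made-up data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

# Split-sample: fit on a training set, check on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test-set R^2:", model.score(X_test, y_test))

# Leave-one-out cross-validation: each case is predicted from all the others
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOO mean squared error:", -scores.mean())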

Principal Component Analysis


PCA is a useful statistical technique that has found application in fields such as face
recognition and image compression, and is a common technique for finding patterns in data of
high dimension.
The main idea behind principal component analysis is to represent multidimensional
data with a smaller number of variables while retaining the main features of the data. It is inevitable
that by reducing dimensionality some features of the data will be lost. It is hoped that these lost
features are comparable with the noise and do not tell us much about the underlying
population.
PCA tries to project multidimensional data onto a lower-dimensional space
while retaining as much of the variability of the data as possible.
PCA is thus a way of identifying patterns in data, and of expressing the data in such a way as to
highlight their similarities and differences. Since patterns can be hard to find in data of
high dimension, where the luxury of graphical representation is not available, PCA is
a powerful tool for analysing data.

The other main advantage of PCA is that once you have found these patterns in
the data, you can compress the data, i.e. reduce the number of dimensions,
without much loss of information. This technique is used in image compression, as we
will see in a later section.
This chapter will take you through the steps needed to perform a Principal
Components Analysis on a set of data. I am not going to describe exactly why the
technique works, but I will try to provide an explanation of what is happening at each
point so that you can make informed decisions when you try to use this technique
yourself.
3.1 Method
Step 1: Get some data
In my simple example, I am going to use my own made-up data set. It has only 2
dimensions, and the reason I have chosen this is so that I can provide plots of the data to
show what the PCA analysis is doing at each step.

The data I have used is found in Figure 3.1, along with a plot of that data.

Step 2: Subtract the mean


For PCA to work properly, you have to subtract the mean from each of the data
dimensions.
The mean subtracted is the average across each dimension. So, all the x values
have x̄ (the mean of the x values of all the data points) subtracted, and all the y values
have ȳ subtracted from them. This produces a data set whose mean is zero.
Step 3: Calculate the covariance matrix
This is done in exactly the same way as was discussed in section 2.1.4. Since the data
is 2-dimensional, the covariance matrix will be 2 × 2. There are no surprises here, so I
will just give you the result:
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
Since the covariance matrix is square, we can calculate the eigenvectors and
eigenvalues for this matrix. These are rather important, as they tell us useful
information about our data. I will show you why soon. In the meantime, here are the
eigenvectors and eigenvalues:
It is important to notice that these eigenvectors are both unit eigenvectors, i.e. their
lengths are both 1. This is very important for PCA, but luckily, most maths packages, when
asked for eigenvectors, will give you unit eigenvectors.
So what do they mean? If you look at the plot of the data in Figure 3.2 then you can see
how the data has quite a strong pattern. As expected from the covariance matrix, the two
variables do indeed increase together. On top of the data I have plotted both the eigenvectors
as well. They appear as diagonal dotted lines on the plot. As stated in the eigenvector
section, they are perpendicular to each other. But, more importantly, they provide us with
information about the patterns in the data. See how one of the eigenvectors goes through the
middle of the points, like drawing a line of best fit? That eigenvector is showing us how these
two data sets are related along that line. The second eigenvector gives us the other, less
important, pattern in the data: that all the points follow the main line, but are off to the side of
the main line by some amount.
So, by this process of taking the eigenvectors of the covariance matrix, we have been able
to extract lines that characterise the data. The rest of the steps involve transforming the data
so that it is expressed in terms of these lines.

Step 5: Choosing components and forming a feature vector


Here is where the notion of data compression and reduced dimensionality comes into it.
If you look at the eigenvectors and eigenvalues from the previous section, you will notice
that the eigenvalues are quite different values. In fact, it turns out that the eigenvector with
the highest eigenvalue is the principal component of the data set. In our example, the
eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.
It is the most significant relationship between the data dimensions.
In general, once eigenvectors are found from the covariance matrix, the next step is to
order them by eigenvalue, highest to lowest. This gives you the components in order of
significance. Now, if you like, you can decide to ignore the components of lesser
significance. You do lose some information, but if the eigenvalues are small, you don't lose
much. If you leave out some components, the final data set will have fewer dimensions than the
original. To be precise, if you originally have n dimensions in your data, and so you calculate
n eigenvectors and eigenvalues, and then you choose only the first p eigenvectors, then the
final data set has only p dimensions.
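A minimal NumPy sketch of steps 2 to 5; the two-dimensional data values below are illustrative stand-ins, not the actual values of Figure 3.1.

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Step 2: subtract the mean of each dimension
centred = data - data.mean(axis=0)

# Step 3: calculate the covariance matrix (2 x 2 for two dimensions)
cov = np.cov(centred, rowvar=False)

# Step 4: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix, unit eigenvectors

# Step 5: order components by eigenvalue (highest first) and keep the first p
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
p = 1                                          # keep only the principal component
feature_vector = eigvecs[:, :p]

# Express the data in terms of the chosen components (project onto them)
final_data = centred @ feature_vector
print("eigenvalues:", eigvals)
print("reduced data shape:", final_data.shape)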
This technique is widely used in many areas of applied statistics. It is natural, since
interpretation and visualisation in a lower-dimensional space are easier than in a
high-dimensional space. In particular, if we can reduce the dimensionality to two or three, then we
can use various plots and try to find structure in the data.
Principal components can also be used as part of other analyses.
Its simplicity makes PCA very popular, but care should be taken in applications. First, it
should be checked whether the technique can be applied: for example, if the data are circular, then it
might not be wise to use PCA, and a transformation of the data might be necessary before
applying it.
PCA is one of the techniques used for dimension reduction.
Selection of Dimension for PCA
There are many recommendations for the selection of dimension. A few of them are:
1. The proportion of variance. If the first two components account for 70%-90% or more of
the total variance, then further components might be irrelevant (there can be problems with scaling).
2. Components below a certain level can be rejected. If components have been calculated
using the correlation matrix, often those components with variance less than 1 are rejected. This
can be dangerous: in particular, if one variable is nearly independent of the others, it may give
rise to a component with variance less than 1, which does not mean that it is
uninformative.
3. If the accuracy of the observations is known, then components with variances less than
that accuracy can certainly be rejected.
4. Scree plot. If the scree plot shows an elbow, then components with variances below this elbow
can be rejected.
5. Cross-validation. One observed value (x_ij) is removed, then predicted using the principal
components; this is done for all data points. If adding a component does not improve the
prediction, then that component can be rejected. This technique is computer-intensive.
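A short NumPy sketch of recommendations 1 and 4: compute each component's proportion of the total variance and the eigenvalues a scree plot would display. The data here are simulated purely for illustration, with some redundancy built in.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)             # redundant variable
X[:, 4] = X[:, 1] - X[:, 0] + 0.1 * rng.normal(size=200)   # another near-linear combination

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
explained = eigvals / eigvals.sum()

print("proportion of variance:", np.round(explained, 3))
print("cumulative proportion: ", np.round(np.cumsum(explained), 3))
# A scree plot is simply these eigenvalues plotted against the component number:
# import matplotlib.pyplot as plt
# plt.plot(range(1, len(eigvals) + 1), eigvals, marker="o"); plt.show()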
Principal components as linear combinations of the original variables
Let us assume that we have a random vector x with p elements. We want to find a linear
combination of these variables so that the variance of the new variable is large, i.e. we want to
find a new variable

y = aᵀx = a₁x₁ + ... + aₚxₚ

so that it has the maximum possible variance. This means that this variable contains the maximum
possible variability of the original variables. Without loss of generality we can assume that the
mean values of the original variables are 0. Then for the variance of y we can write

Var(y) = aᵀΣa, where Σ is the covariance matrix of x.

The problem reduces to finding the maximum of this quadratic form.
If found, this new variable will be the first principal component.
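A standard sketch of this maximisation, adding the usual normalisation constraint aᵀa = 1 (otherwise the variance could be made arbitrarily large simply by rescaling a):

maximise Var(y) = aᵀΣa subject to aᵀa = 1.
Introduce a Lagrange multiplier λ: L(a, λ) = aᵀΣa − λ(aᵀa − 1).
Setting ∂L/∂a = 2Σa − 2λa = 0 gives Σa = λa,
so a must be an eigenvector of Σ, with Var(y) = aᵀΣa = λ.

The variance is therefore largest for the eigenvector with the largest eigenvalue; that eigenvector defines the first principal component, in agreement with the eigenvector-based construction used earlier in the chapter.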
