Applied Statistics
Introduction to Statistical Analysis
Why is statistics important?
It is part of the quantitative approach to knowledge: in physical science, the first essential
step in the direction of learning any subject is to find principles of numerical reckoning and
practicable methods for measuring some quality connected with it.
According to Davis, statistics is "the determination of the probable from the possible".
What is statistics?
Two common uses of the word:
1. Descriptive statistics: numerical summaries of samples (what was observed).
In descriptive statistics we want to summarize some data in a shorter form.
Example: The adjustments of 14 GPS control points for this orthorectification ranged from 3.63
to 8.36 m with an arithmetic mean of 5.145
2. Inferential statistics: from samples to populations (what could have been or will be
observed). In inferential statistics we are trying to understand some process and possibly
predict based on this understanding.
Example: The mean adjustment for any set of GPS points used for orthorectification is no less
than 4.3 and no more than 6.1 m; this statement has a 5% probability of being wrong.
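Both kinds of statement can be sketched in code. The 14 adjustment values below are invented: only their range (3.63 to 8.36 m) and mean (5.145 m) come from the text, and the t critical value is quoted from standard tables.

```python
import math

# Hypothetical GPS control-point adjustments in metres. The individual
# values are made up; only their range (3.63-8.36 m) and their mean
# (5.145 m) match the example in the text.
adjustments = [3.63, 4.0, 4.2, 4.4, 4.6, 4.8, 5.0,
               5.0, 5.2, 5.4, 5.6, 5.8, 6.04, 8.36]

n = len(adjustments)
mean = sum(adjustments) / n                  # descriptive: what was observed

# Inferential: a confidence interval for the mean of *any* such set of
# points. 2.160 is the two-sided 95% t critical value for 13 df.
var = sum((x - mean) ** 2 for x in adjustments) / (n - 1)
se = math.sqrt(var / n)
lo, hi = mean - 2.160 * se, mean + 2.160 * se
print(f"mean = {mean:.3f} m, 95% CI = ({lo:.2f} m, {hi:.2f} m)")
```

The first printed number is a pure summary of the sample; the interval is an inference about the population, and (like the statement in the text) carries a 5% chance of being wrong.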
Why use statistical analysis?
So we need to model the process, i.e. make a conceptual or mathematical representation,
from which we infer how it works.
But how do we know if the model is correct?
* Are we imagining relations where there are none?
* Are there true relations we haven't found?
Statistical analysis gives us a way to quantify the confidence we can have in our inferences.
Regression
Regression is a statistical technique that helps us explore the relation between two
variables.
For example, consider a chemical distillation process.
The figure below shows a scatter diagram of hydrocarbon percentage against oxygen purity
percentage.
y = β0 + β1 x + e
where e represents the random error term in the regression.
Suppose we fix a value of x; then the randomness in y is entirely due to the random error
e, since the remaining terms are fixed by the linear relationship.
Suppose e is a random variable with zero mean and variance σ²; then E(y | x) = β0 + β1 x
and Var(y | x) = σ².
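A sketch of fitting this model by least squares. The distillation measurements from the figure are not reproduced in the text, so the x (hydrocarbon %) and y (oxygen purity %) values below are simulated with an arbitrary "true" intercept and slope:

```python
import numpy as np

# Simulate data from the model y = b0 + b1*x + e with made-up
# parameters, then recover b0 and b1 by least squares.
rng = np.random.default_rng(0)
x = np.linspace(0.9, 1.5, 20)                        # hydrocarbon %
y = 74.3 + 15.0 * x + rng.normal(0.0, 1.0, x.size)   # purity % = trend + e

# Least-squares estimates of intercept b0 and slope b1.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

With an intercept in the model the residuals always average to exactly zero; the scatter left in them is the estimate of the random error e.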
Multivariate Analysis
Many statistical techniques focus on just one or two variables; multivariate analysis (MVA)
techniques allow more than two variables to be analysed at once. Multiple regression is not
typically included under this heading, but can be thought of as a multivariate analysis.
Multivariate Analysis Methods
Analysis of dependence
Analysis of interdependence
Principal Components Analysis (PCA)
Probably the most commonly used method of deriving factors in factor analysis (before
rotation)
The first principal component is identified as the vector (or equivalently the linear
combination of variables) on which the most data variation can be projected
The 2nd principal component is a vector perpendicular to the first, chosen so that it
contains as much of the remaining variation as possible
And so on for the 3rd principal component, the 4th, the 5th etc.
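The construction above (orthogonal directions, ordered by the variance they capture) corresponds to the eigendecomposition of the sample covariance matrix. A sketch on made-up three-dimensional data:

```python
import numpy as np

# The principal components are the eigenvectors of the sample
# covariance matrix, ordered by eigenvalue (= variance captured).
# The data below are made up purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.5, 0.1],
                                          [0.0, 1.5, 0.2],
                                          [0.0, 0.0, 0.5]])
Xc = X - X.mean(axis=0)                      # centre the data first
cov = np.cov(Xc, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)       # returned in ascending order
order = np.argsort(eigvals)[::-1]            # re-sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 1st column = direction of most variation; the 2nd is perpendicular to
# it and captures the most remaining variation; and so on.
print(np.round(eigvals, 3))
```

The eigenvectors come out mutually orthogonal, matching the "perpendicular to the first" description in the text.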
For example, if X ~ N((0, 0), I2), the density is
p(x, y) = (1/(2π)) exp(−(x² + y²)/2)
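The standard bivariate normal density can be checked numerically; this sketch just verifies that it integrates to approximately 1 over the plane:

```python
import numpy as np

# Numerical sanity check of the standard bivariate normal density
# p(x, y) = (1/(2*pi)) * exp(-(x**2 + y**2)/2): it should integrate
# to (approximately) 1 over the plane.
def p(x, y):
    return np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)

g = np.linspace(-6.0, 6.0, 601)              # [-6, 6] holds almost all mass
dx = g[1] - g[0]
xx, yy = np.meshgrid(g, g)
total = p(xx, yy).sum() * dx * dx            # Riemann-sum approximation
print(round(total, 4))
```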
Cluster Analysis
Also used to summarise data by defining segments of similar cases in the data
Clustering Techniques
Divisive hierarchical methods: each cluster (starting with the whole dataset) is divided
into two, then divided again, and so on
Iterative methods
Overlapping clusters
Fuzzy clusters
Applications
Market segmentation is usually conducted using some form of cluster analysis to divide
people into segments
Other methods such as latent class models or archetypal analysis are sometimes
used instead
It is also possible to cluster other items such as products/SKUs, image attributes, brands
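A minimal sketch of cluster-based segmentation using a hand-rolled k-means. The two-dimensional data and the three "segments" are invented; real segmentation uses many survey or behavioural variables, and often other algorithms, as noted above:

```python
import numpy as np

# Made-up 2-D data: 150 cases drawn around three separated centres.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [4, 0], [2, 4])])

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]  # start at data points
    for _ in range(iters):
        # Assign each case to its nearest centre, then recompute centres.
        d = np.linalg.norm(X[:, None] - centres[None], axis=2)
        labels = d.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centres[j]
                            for j in range(k)])        # guard empty clusters
    return labels, centres

labels, centres = kmeans(X, 3)
print(np.bincount(labels, minlength=3))                # segment sizes
```

Each case ends up with a segment label; in a market-segmentation setting those labels are the segments people are divided into.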
Correspondence Analysis
However, the correct interpretation is less intuitive than it appears, and this leads many
researchers astray
Partial Least Squares (PLS)
The PLS components are chosen to explain both the response variation and the variation
among the predictors
PLS also refers to a more general technique for fitting general path models, not discussed
here
General method for fitting and testing path analysis models, based on covariances
Fits specified causal structures (path models) that usually involve factors or latent
variables
Confirmatory analysis
Preliminaries
Scatterplot matrix
Missing data
Treat missing data appropriately, e.g. impute them, or build the missingness into the model fitting
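As a minimal illustration of the "impute" option: fill each missing value with its column mean. This is only the simplest baseline (nothing in the text prescribes it); model-based imputation is usually preferable:

```python
import numpy as np

# Made-up data matrix with two missing entries (NaN).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

col_means = np.nanmean(X, axis=0)        # per-column means, ignoring NaNs
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]          # fill each gap from its own column
print(X)
```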
MVA Issues
Preliminaries (continued)
Testing results
The other main advantage of PCA is that once you have found these patterns in
the data, you can compress the data, i.e. reduce the number of dimensions,
without much loss of information. This technique is used in image compression, as we
will see in a later section.
This chapter will take you through the steps needed to perform a Principal
Components Analysis on a set of data. I am not going to describe exactly why the
technique works, but I will try to provide an explanation of what is happening at each
point so that you can make informed decisions when you try to use this technique
yourself.
3.1 Method
Step 1: Get some data
In my simple example, I am going to use my own made-up data set. It's only got 2
dimensions, and the reason I have chosen this is so that I can provide plots of the data to
show what the PCA analysis is doing at each step.
The data I have used is found in Figure 3.1, along with a plot of that data.
3. If the accuracy of the observations is known, then components with variances less than
the observation accuracy can certainly be rejected.
4. Scree plot: if the scree plot shows an elbow, then components with variances below the
elbow can be rejected.
5. Cross-validation: one observation value (xij) is removed, then predicted using the
principal components, and this is done for all data points. If adding a component does
not improve the prediction, that component can be rejected. This technique is
computationally intensive.
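Criteria 3 and 4 can be illustrated numerically. The rank-2 dataset, the noise level, and the "small multiple of the noise variance" threshold are all assumptions for illustration:

```python
import numpy as np

# A made-up dataset: a rank-2 signal in 6 dimensions plus observation
# noise of known standard deviation.
rng = np.random.default_rng(3)
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6))  # rank-2 signal
noise_sd = 0.1                                 # "accuracy of the observations"
X = signal + rng.normal(0.0, noise_sd, size=signal.shape)

cov = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]        # component variances, descending
explained = eigvals / eigvals.sum()            # the numbers behind a scree plot

# Criterion 3: keep components whose variance clearly exceeds the noise
# variance (a small multiple allows for sampling fluctuation).
keep = eigvals > 4 * noise_sd ** 2
print(keep.sum(), np.round(explained, 3))
```

Here only the two signal components survive the threshold, and the `explained` proportions show the sharp elbow a scree plot would display after the second component.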
Principal components as linear combinations of the original parameters
Let us assume that we have a random vector x with p elements. We want to find a linear
combination of these variables so that the variance of the new variable is as large as
possible, i.e. we want to find a new variable
y = a^T x
so that it has the maximum possible variance. This means that this variable contains the
maximum possible variability of the original variables. Without loss of generality we can
assume that the mean values of the original variables are 0. Then for the variance of y we
can write
Var(y) = E((a^T x)(x^T a)) = a^T S a,
where S is the covariance matrix of x.
The problem reduces to finding the maximum of this quadratic form, subject to the
constraint a^T a = 1 (without a constraint on a, the variance could be made arbitrarily
large). If found, this new variable will be the first principal component.
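The maximisation can be sketched numerically. By a standard result (not derived in the text), the maximiser of the quadratic form over unit vectors is the leading eigenvector of the covariance matrix, which power iteration finds; the matrix S below is made up:

```python
import numpy as np

# Made-up 3x3 covariance matrix S.
S = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.0]])

a = np.ones(3) / np.sqrt(3)              # any unit-length starting vector
for _ in range(200):                     # power iteration
    a = S @ a
    a /= np.linalg.norm(a)               # keep the constraint a^T a = 1

variance = a @ S @ a                     # maximised variance of y = a^T x
top = np.linalg.eigvalsh(S)[-1]          # largest eigenvalue, for comparison
print(round(float(variance), 6))
```

The maximised variance coincides with the largest eigenvalue of S, and the vector a it converges to is the first principal component.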