Data Exploration

# Data Exploration

Data exploration can be attempted with almost any type of data, although in practice it is most useful where there are a large number of variables and observations. The main aim of data-exploration techniques is to synthesize and process the interrelationships between observations in such a way as to make the patterns obvious to the experimenter.
These techniques are less concerned with P-values and should be treated as ways to generate hypotheses rather than test them. Several of the more commonly employed techniques are considered here. This is certainly not an exhaustive list, but it does provide a flavor of the sort of techniques that are available.

The simplest and perhaps most obvious way to generate new hypotheses or to explore relationships between variables is to plot them. Many statistical packages, including SPSS and MINITAB, will produce a matrix of scatterplots, with each cell of the matrix showing a different pair of variables plotted against each other. This sort of visual aid gives a general feel for which variables are related to which, as well as for the ‘shape’ of the data. Don’t be afraid to experiment with different types of plot before moving on to more standard methods.

Principal component analysis (PCA, also called principal axes) and factor analysis are two very similar techniques that weight all the available variables to provide the maximum discrimination between individuals. The idea behind PCA is similar, in many ways, to correlation and regression. The technique can be applied to any data set that has two or more observations for each individual (e.g. several different morphometric measurements from the same specimen). There are assumptions about the data (that they are continuous and normally distributed), but these can be overlooked if the purpose of the analysis is to generate further hypotheses.

The technique can be visualized well when there are only two observations for each individual. First imagine a scatterplot of two variables that are correlated such that the points fall within an oval cloud. PCA will determine the line through the points that passes through the long axis of the cloud and will use that as the first principal axis or ‘principal component’. A line through the cloud of points at right angles to the first axis will generate a second principal component. Of course, this process occurs in multidimensional space within the computer, with one dimension for each of the variables included in the analysis and the ‘lines’ through the clouds of points being formed by weighting each of the variables appropriately.

In this way PCA synthesizes the data from a mass of variables into a set of compound axes. The first axis will explain the most variation, then the second, and so on. Therefore inspection of the weightings of the first few axes will show which variables contribute most to the differences between individuals. In morphometric analysis it is usually the case that individual specimens will vary in size. The first principal axis will nearly always account for size, and it is often employed as a method for removing size from the analysis, leaving aspects of ‘shape’ for the second and subsequent axes.
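
The geometric description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by SPSS or MINITAB: NumPy is assumed to be available, the data are synthetic, and the names `pca`, `size`, `shape` and `measurements` are hypothetical. The sketch finds the principal axes as eigenvectors of the covariance matrix, then shows that the first axis of a made-up morphometric data set mostly tracks overall size.

```python
import numpy as np


def pca(data):
    """PCA of a (individuals x variables) array via the covariance matrix.

    Returns the variable weightings (one column per axis), the proportion
    of variance explained by each axis, and each individual's scores.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    scores = centered @ eigvecs  # coordinates on the compound axes
    return eigvecs, explained, scores


# Synthetic (assumed) data: 200 specimens, three log-scale measurements.
rng = np.random.default_rng(0)
size = rng.normal(0.0, 1.0, 200)   # overall size of each specimen
shape = rng.normal(0.0, 0.3, 200)  # elongation: longer but narrower
noise = rng.normal(0.0, 0.05, (200, 3))
measurements = np.column_stack([
    size + shape,  # hypothetical log length
    size - shape,  # hypothetical log width
    size,          # hypothetical log depth
]) + noise

weights, explained, scores = pca(measurements)

print("Proportion of variance per axis:", np.round(explained, 3))
print("Weightings on axis 1:", np.round(weights[:, 0], 3))
print("Weightings on axis 2:", np.round(weights[:, 1], 3))
```

In this made-up example the weightings on the first axis all share the same sign, which is the usual signature of a size axis, while the second axis contrasts length with width. Dropping the first column of `scores` is one simple way to set size aside and look at shape alone.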