
UNIVERSITY OF SYDNEY

SCHOOL OF GEOSCIENCES
DIVISION OF GEOLOGY AND GEOPHYSICS


INTRODUCTORY STATISTICAL
DATA ANALYSIS
FOR GEOSCIENTISTS
USING MATLAB/OCTAVE

R. Dietmar Müller
TABLE OF CONTENTS
1 Introduction...................................................................................................................1
2 Introduction to MatLab .................................................................................................1
2.1 What is MATLAB? ......................................................................................................... 1
2.2 Getting Started ................................................................................................................ 1
2.3 Help.................................................................................................................................. 1
2.4 Colons & brackets .......................................................................................... 2
2.5 Plotting/Graphs: online help ........................................................................................... 3
2.6.1 Formats ..................................................................................................................................... 4
2.6.2 Saving and loading the workspace.............................................................................................. 5
2.7 Introduction to Plotting ................................................................................................... 6
2.8 Overview, colourmaps ..................................................................................... 7
2.8.1 Important steps for making 2D Graphs ....................................................................................... 7
2.8.2 Important steps for making 3D Graphs ....................................................................................... 8
2.8.3 Functions that create and plot continuous surfaces ...................................................................... 9
2.9 COLOUR....................................................................................................................... 10
3. Types of Geoscience Data and Data Quality................................................................12
3.1 Types of data ........................................................................................................................ 12
3.2 Dependent Versus Independent Variables .......................................................................... 13
3.3 Data quality ................................................................................................................... 13
4. Statistical data analysis................................................................................................13
4.1 Introduction.......................................................................................................................... 13
4.2 Plotting data as line/symbol and stem diagrams........................................................... 14
4.3 Plotting data as histograms ........................................................................................... 14
4.4 Plotting data as scatterplots .......................................................................................... 16
5. Probability ...................................................................................................................16
5.1. Random variables.......................................................................................................... 16
5.2 What is statistical significance?..................................................................................... 18
5.3 How to determine that a result is "really" significant.................................................. 18
5.4 Gaussian distribution .................................................................................................... 19
5.5 Random Data Sample Statistics .................................................................................... 21
5.5.1 Mean ....................................................................................................................................... 21
5.5.2 Standard deviation ................................................................................................................... 21
5.5.3 Standardization to unit standard deviation................................................................................. 22
5.5.4 Geometric mean....................................................................................................................... 22
5.6 Other distributions and their applications ................................................................... 22
5.6.1 Binomial.................................................................................................................................. 22
5.6.2 Multinomial............................................................................................................................. 23
5.6.3 Poisson .................................................................................................................................... 23
5.6.4 Exponential.............................................................................................................................. 24
5.6.5 Gamma.................................................................................................................................... 25
5.6.6 Beta......................................................................................................................................... 25

5.6.7 Negative Binomial ................................................................................................................... 26
5.6.8 Log-normal.............................................................................................................................. 26
5.6.9 Rayleigh .................................................................................................................................. 26
5.6.10 Weibull ............................................................................................................................... 26
6 Significance Tests ........................................................................................................27
6.1 Introduction................................................................................................................... 27
6.2 Testing whether data are normally distributed............................................................ 27
6.3 Outliers .......................................................................................................................... 30
6.4 Distribution of the sample mean.................................................................... 33
6.5 Student t-distribution.................................................................................................... 35
6.6 Student t-Test for Independent Samples ...................................................................... 36
6.7 Student t-test for dependent samples ............................................................................ 36
6.7.1 Within-group Variation ............................................................................................................ 36
6.7.2 Purpose.................................................................................................................................... 37
6.7.3 Assumptions............................................................................................................................ 37
6.7.4 One Sample Student's t test ...................................................................................................... 37
6.7.5 Two Independent Samples Student's t test................................................................................ 38
6.8 Chi-square distribution................................................................................................. 41
6.9 Pearson Chi-Square test ................................................................................ 42
6.10 Fisher F distribution...................................................................................................... 45
7 ANOVA: Analysis of Variance ....................................................................................46
8 Statistics between two or more variables......................................................................47
8.1 Correlations between two or more variables ................................................................ 47
8.1.1 Introduction ............................................................................................................................. 47
8.1.2 Significance of Correlations .................................................................................................... 47
8.1.3 Nonlinear Relations between Variables ................................................................................... 48
8.1.4 Measuring Nonlinear Relations ............................................................................................... 48
9 When to Use Nonparametric Techniques ....................................................................50
10 Directional and Oriented data..................................................................................52
10.1 Introduction........................................................................................................................ 52
10.2 Rose Plots............................................................................................................................ 52
10.3 Plotting and contouring oriented data on stereonets......................................................... 54
10.3.1 Types of stereonets.............................................................................................................. 54
10.3.2 Which net to use for which purpose?.................................................................................... 54
10.4 Tests of significance of mean direction .............................................................. 55
11 Spatial data analysis: Contouring unequally spaced data........................................56
12 Overview of Computer Intensive Statistical Inference Procedures ..........................58
12.1 Introduction................................................................................................................... 58
12.2 Monte Carlo Methods ................................................................................................... 58
12.2.1 Introduction......................................................................................................................... 58
12.2.2 Monte Carlo Estimation....................................................................................................... 59
12.2.3 Bootstrapping...................................................................................................................... 60
12.2.4 The "Jackknife" ................................................................................................................... 61
12.2.5 Markov Chain Monte Carlo Estimation................................................................................ 61

12.2.6 Meta-Analysis ..................................................................................................................... 62
12.2.7 Multivariate Modelling........................................................................................................ 62
12.2.8 Overall Assessment of Strengths and Weaknesses................................................................ 63
13 References................................................................................................................65
13.1 Geosciences .................................................................................................................... 65
13.2 General .......................................................................................................................... 65
14 Appendix..................................................................................................................67
14.1 Tables............................................................................................................................. 67
14.1.1 Critical values of R for Rayleigh's test ................................................................................. 67
14.1.2 Values of concentration parameter K from R for Rayleigh's test ........................................... 68
14.1.3 Critical values of Spearman's rank correlation coefficient..................................................... 69
14.1.4 Critical values of T for Mann-Whitney Test (α=5%)........................................................... 69
14.2 STIXBOX Contents....................................................................................................... 70
14.3 Computational Tools and Demos on the Internet......................................... 71

1 INTRODUCTION
This course module is designed to convey the principles of statistics applied to earth
science data. Problem solving is illustrated using Matlab and its free counterpart, Octave,
together with two free sets of toolboxes, which work with both Matlab (www.mathworks.com) and
Octave (www.octave.org): the "stixbox" (www.maths.lth.se/matstat/stixbox/), which includes
the most popular forward and inverse distribution functions, various hypothesis tests and graphics
functions, and an earth science toolbox written by G. Middleton (1999).

2 INTRODUCTION TO MATLAB
2.1 WHAT IS MATLAB?
Definition. MATLAB is a high-performance language for technical computing. After
mastering the basics, you will see how powerfully it combines computational capabilities
with graphics capabilities. MATLAB stands for "matrix laboratory", which reflects its original
focus on matrix computation. With MATLAB, you have the following capabilities:

• math and computation


• algorithm development
• modeling, simulation, and prototyping
• data analysis, exploration, and visualization
• scientific and engineering graphics
• application development, including Graphical User Interface (GUI) building

2.2 GETTING STARTED


Running MATLAB. Simply type "matlab &" in a window and hit RETURN. You are in the
program.

2.3 HELP
Your first resource should be the "help" function that is part of the MATLAB program.
There is a "Help" icon which will bring up a searchable help data base. It contains tutorials
for basic MATLAB functions. If you are not familiar with MATLAB I strongly
recommend to go through the help pages on all basic MATLAB functions.

Another nice tool is the "lookfor keyword" command (type "help lookfor" for information
on it). It looks through all the help pages and gives you back the commands that have the
keyword in the first line of their help page. For example, to learn of all the
commands that relate to "meshes" in MATLAB, type:

lookfor mesh

You can then follow this up by doing a "help" on any of these programs/tools.

2.4 COLONS & BRACKETS
The colon operator is very useful and important for array definitions, and number increment
size. For instance, type the expression:

20:30

You'll get a row vector with the integers:

ans =

20 21 22 23 24 25 26 27 28 29 30

You can easily change the default increment, in this case "1", to anything else, if you simply
add the incrementation size between the end values. For example, type:

20:2:30

You'll get a row vector with the integers:

ans =

20 22 24 26 28 30

Pretty straight forward stuff. Using the left and right brackets "[" and "]", you can define
matrices, where the rows are separated by a semicolon ";". For example, to define a 3 by 3
matrix with the numbers 1 through 9, type:

[1 2 3;4 5 6;7 8 9]

You'll get a matrix back as:

ans =

1 2 3
4 5 6
7 8 9

To give this matrix a name in MATLAB's memory, such as Ed, then type:

Ed = [1 2 3;4 5 6;7 8 9]

You'll get a matrix back as:

Ed =

1 2 3
4 5 6
7 8 9

If you don't assign the result to a name, it is stored in the default variable "ans", which
stands for "answer". Let's use the colon and brackets to define a 5x5 matrix:

junk = [1:5; 10:10:50; 5:9; 1000:-100:600; 3 5 2 9 44]

You'll get a matrix back as:

junk =

1 2 3 4 5
10 20 30 40 50
5 6 7 8 9
1000 900 800 700 600
3 5 2 9 44
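
The colon is also handy for picking out rows, columns, or sub-blocks of an existing matrix. A
minimal sketch, using the "junk" matrix defined above:

junk(2,:)          % row 2, all columns: 10 20 30 40 50
junk(:,3)          % all rows, column 3
junk(1:2,4:5)      % the 2x2 sub-matrix from rows 1-2 and columns 4-5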

2.5 PLOTTING/GRAPHS: ONLINE HELP


One at a time, let's go through some online resources -- various tutorial pages people have set
up, to learn some graphics and plotting basics. Where indicated, cut and paste the text from
the resource page into your window that is running MATLAB, to try the example.

Online Reference MATLAB manual, and a very nice Frequently Asked Questions page
(Univ. Texas, Austin):
http://www-math.cc.utexas.edu/math/Matlab/Manual/ReferenceTOC.html

Let's look at an example from the above site. Here's a "plotting lines in 3D" page :
http://www-math.cc.utexas.edu/math/Matlab/Manual/plot3.html Do the example

Here's another example -- animating your graph:


http://www-math.cc.utexas.edu/math/Matlab/Manual/comet3.html Do the example

Meshes for surfaces


http://www-math.cc.utexas.edu/math/Matlab/Manual/mesh.html Do the example

Surfaces where meshes are filled:


http://www-math.cc.utexas.edu/math/Matlab/Manual/surf.html Do the example

Colourmaps:
http://www-math.cc.utexas.edu/math/Matlab/Manual/colourmap.html

Some prefab shapes: cylinder:


http://www-math.cc.utexas.edu/math/Matlab/Manual/cylinder.html Do the example

Prefab shapes: sphere:
http://www-math.cc.utexas.edu/math/Matlab/Manual/sphere.html Do the example (and
launch the MATLAB window, go to the 3D shapes page in Visualizations, and experiment
with changing the surfacing, shading, colourmap.)

Operators and special characters:


http://www-math.cc.utexas.edu/math/Matlab/Manual/Operators.html

A UTAH PAGE: (nice tutorial) note, an important point about array math:
http://www.mines.utah.edu/gg_computer_seminar/matlab/tut3.html

General intro plotting info from the above UTAH site:


http://www.mines.utah.edu/gg_computer_seminar/matlab/tut18.html Do the examples

More good graphics basics:


http://www-math.bgsu.edu/~gwade/matlabprimer/graphics.html Do the examples

2.6 FORMATS, SAVING AND LOADING FILES

2.6.1 Formats
MATLAB has different formating options for how we view the variables in the workspace.
Common formats are listed below (do a help on these to learn more):

format short
format short e
format long
format long e
format bank

These produce the following for a vector x = [4/3 1.2345e-6 ]:

x = [4/3 1.2345e-6]

x =

1.3333 0.0000

format short
x

x =

1.3333 0.0000

format short e
x

x =

1.3333e+00 1.2345e-06

format long
x

x =

1.33333333333333 0.00000123450000

format long e
x

x =

1.333333333333333e+00 1.234500000000000e-06

format bank
x

x =

1.33 0.00

Suppressing output: By default, MATLAB displays the result of every command you
type on the screen. This is not always desirable, especially if you define a rather large array
of numbers. To suppress the output from being displayed on the screen, simply put
a semicolon ";" at the end of the line.

2.6.2 Saving and loading the workspace


A nice feature of MATLAB is that you can save everything in your workspace. Simply
typing:

save

will create a file called matlab.mat in your present working directory. To access this file
during another MATLAB session, just type:

load

which loads the matlab.mat file. You can give the file its own name with:

save filename

then to load it into memory at a later time (as you might imagine), type:

load filename

If you are only interested in saving a particular variable, no problem: let's say we just want to
save a variable called A, and want to put it into a file called A_nov21.mat. Then type:

save A_nov21 A

The syntax in general for this is:

save filename var1 var2 var3 ...

2.6.3 Loading ASCII Data Files
Arrays of numbers can be stored in files on disk. For instance, open Text Editor and create a
file that has an array of numbers:

1 3 5 7
2 2 2 2
3 4 4 4
1 1 1 1

NOTE: don't worry about copying my numbers, any 4x4 matrix will do. Then save your file.
Ideally, save it as filename.dat, so you will know it is a data file that you use with
MATLAB (however, the ".dat" extension is not necessary). If I called my file "crap.dat", then
to load it into MATLAB, I type "load crap.dat". Now the variable "crap" (without the ".dat")
holds the data that was in that file. To be sure it is properly loaded, simply type the variable
name in MATLAB and hit return.
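
As a quick sketch of that check, using the example file name from above:

load crap.dat      % creates a variable named "crap" from the file contents
crap               % display the array to confirm it loaded
size(crap)         % should return 4 4 for the matrix above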

2.7 INTRODUCTION TO PLOTTING


Copy the file "dewijs.dat" from /local/matlabr12/local/middleton to your working directory

We will use this file for our simple demonstration of 2D graphs. This is a classic data set of
118 measurements of zinc (ZN) made at 2 meter intervals along a single sphalerite quartz vein
in the Pulacayo Mine in Chile, by De Wijs (1951). So, these data are measured in equal
intervals in space. Load the data into MATLAB after you copy it over, with:

load dewijs.dat

If you now type "dewijs", the array will spew back to you (recall, to view a screen at a time,
type "more on"). Make a simple XY plot with circles for symbols:

plot(dewijs, 'o')

You can see that the Y ordinate is plotted at the value of the measurement, but the X ordinate is
simply the index of the entry (entry 1, entry 2, etc.), and has nothing to do with
the 2 meter sampling interval. We can address this in a number of ways... 118 samples at 2 m spacing:
so let's just make a new array of numbers: 0, 2, 4, ... up to 234 (not 236, because we start counting from zero):

x = [0:2:234]

Now, let's plot both arrays, so that the Y values are now properly spaced in X:

plot(x, dewijs, 'o')

This literally says: plot a 'o' at [x(1),dewijs(1)], plot a 'o' at [x(2),dewijs(2)], and so on. Now
look at the X-axis. Things are shaping up. Let's add labels:

title('Zn measurements in a Chilean Qtz vein');


xlabel('Position (2m units)');

Let's add a solid line to the plot, since it is difficult to identify any kind of trend. We can add
as many data sets as we like to a plot with:

plot(A,B,attributes, C,D,attributes, etc )

Thus, we will plot the same data set twice, once with "o"s, and once as a solid line (the
default, so we don't need to give it attributes for that):

plot(x, dewijs, 'o',x, dewijs)

An alternative way to plot data is with the stem function. Try:

stem(dewijs)

So, there don't appear to be any strong systematic trends in this data set, so let's do some
statistics with it. From inspection of our last plot, we see our Y values ranging from a little
under 5 to almost 40. Let's make a histogram of the data. First, we need to define bins for the
data. For this data set, let's make our bins 5 units wide, with centres at 2.5, 7.5, 12.5, ... up to
37.5 (hist treats the vector we pass as bin centres). In one line:

x = [2.5:5:37.5]

Now, we can plot the histogram

hist(dewijs,x)

Look at the "help hist" details. We can easily swap axes.

2.8 OVERVIEW, COLOURMAPS

2.8.1 Important steps for making 2D Graphs


One of the books that comes with MATLAB is primarily focused on graphics. It has
some nice tables on building 2D and 3D graphs. I reproduce them below to emphasise the
proper train of thought in organising your information so that you can produce a graphic
most efficiently. The table below shows seven essential steps. The examples in the right
column are simply just that -- examples (more info on any of these is available with
MATLAB's help function).

2.8.2 Important steps for making 3D Graphs
The typical steps in making a 3D graph are similar to the 2D case, except that now we call a
3D graphing function, which typically has far more options, such as lighting and viewpoint,
etc. These are really just attributes of the projection and how we "foof" it up. They are
important functions to know about if you are going to continue on in MATLAB.

2.8.3 Functions that create and plot continuous surfaces
MATLAB defines a surface by the z-coordinates of points above a rectangular grid in the
x-y plane. The plot is formed by joining adjacent points with straight lines. Surface plots are
useful for visualizing matrices that are too large to display in numerical form, and for
graphing functions of two variables.

For us in the Earth sciences, the importance of plotting surfaces is obvious: so much of
our data is spatially oriented, such as topography, gravity, heat, magnetism, yada yada yada.
The table below lists functions that make surfaces from your input matrices. And all of our
spatial data sets can be made into matrices (since we have x=longitude, y=latitude, and
z=measurement).
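
As a minimal sketch (an arbitrary example function, not one of our data sets), a continuous
surface can be built and plotted like this:

[x,y] = meshgrid(-3:0.1:3, -3:0.1:3);    % rectangular grid in the x-y plane
z = exp(-(x.^2 + y.^2));                 % z-coordinates above the grid points
surf(x,y,z)                              % filled surface; mesh(x,y,z) gives a wireframe
xlabel('x'); ylabel('y'); zlabel('z')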

Some important graph features

2.8.4 Figure windows


MATLAB directs its graphics output to a window called the figure window. If no figure
window is currently open, MATLAB will create one. If a figure window exists, MATLAB
uses that one. If multiple figure windows are open, MATLAB uses the most recently
used/created figure window. The figure function creates figure windows. For example,

figure

creates a new window and makes it the current destination for graphics output. You can make
any figure window the current/active one by clicking it with the mouse, or look at the title bar
of your figure windows -- they are numbered. To make the nth window active, just type:
figure(n)

2.8.5 Subplots
You can display multiple plots in the same figure window. The function

subplot(m,n,i)

breaks the figure window into an m-by-n matrix of small subplots (m rows and n columns),
and selects the ith subplot for the current plot. For example, if m=3 and n=4, then we are

dividing the figure window into 12 subplots: 3 rows and 4 columns.

Let's do one with 2 rows and 2 columns. Then we have "subplot(2,2,i)". We would designate
i=1 for the 1st plot -- they are ordered from left to right in row one, then row two, and so on.
Here's our example... let's plot a bunch of relationships between the sine and cosine function,
all on the same page:

t=0:pi/20:2*pi;
[x,y]=meshgrid(t);

subplot(2,2,1)
plot(sin(t),cos(t))
axis equal

subplot(2,2,2)
z = sin(x) + cos(y);
plot(t,z)
axis([0 2*pi -2 2])

subplot(2,2,3)
z = sin(x).*cos(y);
plot(t,z)
axis([0 2*pi -1 1])

subplot(2,2,4)
z = (sin(x).^2)-(cos(y).^2);
plot(t,z)
axis([0 2*pi -1 1])

2.9 COLOUR
2.9.1 Defaults
As you noticed, your line colours in your last plot were cycled through a range of colours.
This is a default feature, which can all be changed. Look at the help page for "plot" to see how
you can easily hardwire different colours to different lines.

You can easily change the background colour of your figures, which is white by default. Try

colordef black

to change it to black (use the UP arrow to find your command for running the subplot_ex
example, and run it again).

colordef white

changes it back to the white background.

2.9.2 RGB colour


MATLAB uses RGB for colour definitions. However, instead of going from 0 (completely
off) to 255 (completely saturated), MATLAB goes from zero to one. Here is a table of
common colours in this system:

Red Green Blue Colour
0 0 0 black
1 1 1 white
1 0 0 red
0 1 0 green
0 0 1 blue
1 1 0 yellow
1 0 1 magenta
0 1 1 cyan
.5 .5 .5 gray
.5 0 0 dark red
1 .62 .40 copper
.49 1 .83 aquamarine
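
A small sketch of how such an RGB triple can be used, for instance to set the colour of a line
via the standard 'Color' plot property:

t = 0:pi/20:2*pi;
plot(t, sin(t), 'Color', [0.5 0 0])      % a dark red line defined by an RGB triple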

2.9.3 Colourmaps
Each MATLAB window has a "colourmap" associated with it. A colourmap is simply a 3
column matrix whose length (# of rows) is equal to the number of colours it defines.
MATLAB's default colourmap is "jet" -- actually, it is "jet(64)", a 64-colour rendering of jet.
Type "colormap" and MATLAB will spew out its 3-column matrix for jet(64) -- remember, these
are just RGB values.

You can see the colour scale by looking at a colourbar: open a new figure window (type
"figure") then type "colorbar", and a colour definition scale bar will appear on your map.
Look at some of the default maps: try

colormap(pink)
colormap(copper)
colormap(flag)

Flag is obviously not well suited for our purposes, but type "colormap" so MATLAB spews
out "flag's" RGB matrix. You can see how it is composed:

ans =

1 0 0 (red)
1 1 1 (white)
0 0 1 (blue)
0 0 0 (black)
1 0 0
1 1 1
0 0 1
0 0 0
1 0 0
...etc...
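
Since a colourmap is just an n-by-3 matrix of RGB values, you can also define your own. A
minimal sketch:

mymap = [0 0 0.5; 0 0 1; 1 1 1; 1 0 0; 0.5 0 0];   % dark blue - blue - white - red - dark red
colormap(mymap)                                     % apply it to the current figure
colorbar                                            % show the resulting colour scale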

3. TYPES OF GEOSCIENCE DATA AND DATA QUALITY
3.1 TYPES OF DATA
The data types we analyse in statistics are also called "variables" in a statistical context.
Variables can be things that we measure, control, or manipulate in research. In our case, we
will only be concerned with data or variables that we measure.

Data differ in "how well" they can be measured, i.e., in how much measurable
information their measurement scale can provide. There is obviously some measurement error
involved in every measurement, which determines the "amount of information" that we can
obtain. Another factor that determines the amount of information that can be provided by a
variable is its "type of measurement scale." Specifically data are classified as follows.
a. Nominal (or categorical) data allow for only qualitative classification. That is, they can
be measured only in terms of whether the individual items belong to some distinctively
different categories, but we cannot quantify or even rank order those categories. For
example, information may be given as a list of names, descriptions etc (e.g. sediment type
given as sand, silt, clay, ooze, …)
b. Ordinal data allow us to rank order the items we measure in terms of which has less and
which has more of the quality represented by the variable, but still they do not allow us to
say "how much more." A typical example of an ordinal variable in geology is Moh's
hardness scale or Richter's earthquake scale.

c. Interval data allow us not only to rank order the items that are measured, but also to
quantify and compare the sizes of differences between them. For example, temperature,
as measured in degrees Celsius, constitutes an interval scale. We can say that a
temperature of 40 degrees is higher than a temperature of 30 degrees, and that an increase
from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
d. Ratio data are very similar to interval variables; in addition to all the properties of
interval variables, they feature an identifiable absolute zero point, thus they allow for
statements such as x is two times more than y. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio scale,
not only can we say that a temperature of 200 degrees is higher than one of 100 degrees,
we can correctly state that it is twice as high. Interval scales do not have the ratio
property. Most statistical data analysis procedures do not distinguish between the interval
and ratio properties of the measurement scales.
e. Discrete data: These data can only assume specific (usually integer) values (e.g. counts
of objects).
f. Closed data: percentages, ppm, etc. These are very common in geochemistry and
petrology.

g. Directional data: These data are given as angles, and are extremely important in
geoscience, as dips and strikes of structures measured in the field or in cores are
expressed this way. Most standard statistics textbooks do not include a treatment of
directional data.

3.2 DEPENDENT VERSUS INDEPENDENT VARIABLES
Independent variables are those that are manipulated whereas dependent variables are
only measured or registered. This distinction appears terminologically confusing to many
because, we may say, "all variables depend on something." However, once you get used to
this distinction, it becomes indispensable. For example, if we collect geological data along a
creek-bed, or geophysical data on a paddock or at sea, distance would be the independent
variable, whereas rock type, type of fossils, or variation in the magnetic field etc. would be
dependent variables.

3.3 DATA QUALITY


Accuracy refers to the closeness of the measurements to the "actual" or "real" value of
the physical quantity, whereas the term precision is used to indicate the closeness with which
the measurements agree with one another quite independently of any systematic error
involved. Therefore, an "accurate" estimate has small bias, whereas a "precise" estimate has
small variance. Quality is proportional to the inverse of the variance.

If repeated measurements are similar, they are called precise. If the same systematic
error is made for a set of repeat measurements, they would be precise but entirely inaccurate
at the same time. The robustness of a procedure is the extent to which its properties do not
depend on those assumptions which you do not wish to make.

4. STATISTICAL DATA ANALYSIS


4.1 INTRODUCTION
Studying a problem through the use of statistical data analysis usually involves four basic
steps.
1. Defining the problem
2. Collecting the data
3. Analyzing the data
4. Reporting the results

This class will be concerned mostly with data analysis/hypothesis testing and decision
making. For the purpose of statistical data analysis, distinguishing between cross-sectional
and time series data is important. Cross-sectional data are data collected at the same or
approximately the same point in time. Time series data are data collected over several time
periods. A Meta-analysis deals with a set of results to give an overall result that is
comprehensive and valid. Principal component analysis and factor analysis are used to
reduce the dimensionality of multivariate data. In these techniques correlations and
interactions among the variables are summarized in terms of a small number of underlying
factors. The methods rapidly identify key variables or groups of variables that control the
system under study. The resulting dimension reduction also permits graphical representation
of the data so that significant relationships among observations or samples can be identified.

Other related techniques include Multidimensional Scaling, Cluster Analysis, and


Correspondence Analysis. Multivariate analysis is a branch of statistics involving the

consideration of objects on each of which are observed the values of a number of variables. A
wide range of methods is used for the analysis of multivariate data, and this course will give a
view of the variety of methods available, as well as going into some of them in detail.

4.2 PLOTTING DATA AS LINE/SYMBOL AND STEM DIAGRAMS


[Figure: "Zn assays in a Chilean vein" -- plot of the zinc data points from the quartz vein,
marked as circles and connected by lines (x-axis: Distance (m); y-axis: Zn (%)). A very simple
diagram to create using the MATLAB plot function.]

[Figure: "Stem Diagram of De Wijs data" -- a stem diagram of the same data points marked by
circles (x-axis: Position (2m units); y-axis: Zn (%)). Produced using the MATLAB function stem.]

4.3 PLOTTING DATA AS HISTOGRAMS


The purpose of a histogram is to graphically summarize the distribution of a univariate
data set. The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and

5. presence of multiple modes in the data.


These features provide strong indications of the proper distributional model for the data.

[Figure: "Histogram of De Wijs data" -- histogram of the zinc data produced using the MATLAB
bar function (x-axis: Zn (%); y-axis: frequency).]

Before we can construct a histogram we must determine how many classes we should
use. This is purely arbitrary, but too few classes or too many classes will not provide as clear
a picture as can be obtained with some more nearly optimum number. An empirical
relationship (known as Sturges' rule) which seems to hold and which may be used as a guide
to the number of classes (k) is given by

k = the smallest integer greater than or equal to 1 + log(n)/log(2) = 1 + 3.322 log10(n)

To have an 'optimum' you need some measure of quality - presumably in this case, the
'best' way to display whatever information is available in the data. The sample size contributes
to this, so the usual guidelines are to use between 5 and 15 classes, with more classes possible
if you have a larger sample. You take into account a preference for tidy class widths,
preferably a multiple of 5 or 10, because this makes it easier to appreciate the scale.

Beyond this it becomes a matter of judgement - try out a range of class widths and choose
the one that works best. (This assumes you have a computer and can generate alternative
histograms fairly readily). There are often management issues that come into it as well. For
example, if your data is to be compared to similar data - such as prior studies, or from other
countries - you are restricted to the intervals used therein.

If the histogram is very skewed, then unequal classes should be considered. Use narrow
classes where the class frequencies are high, wide classes where they are low. The following
approach is common:

1. Find the range (highest value - lowest value).


2. Divide the range by a reasonable interval size: 2, 3, 5, 10, or a multiple of 10.
3. Aim for no fewer than 5 intervals and no more than 15.
___________________________________________________________________________
Worked Matlab Example: Histogram plotting
We have measured the diameters of 40 ammonites:

3.2 3.7 2.9 3.9 3.4 3.1 3.1


3.9 3.5 3.3 3.6 3.8 3.7 3.0
3.5 3.2 3.5 3.7 3.9 3.6 3.4

2.9 3.2 3.4 2.9 3.6 3.7 3.3
3.4 4.0 3.8 3.7 3.3 2.9 3.1
3.2 3.6 3.5 3.3 3.4

% Plotting histograms

load ex_2_1.dat
help histo

% Sturges' rule:
% 1 + log(40)/log(2)=6.3 suggests 6 bins

% Explore difference in using odd or even placing of bins

subplot(1,2,1)
histo(ex_2_1,6,1)
subplot(1,2,2)
histo(ex_2_1,6,2)

% Alternatively specify bins explicitly

mx = max(ex_2_1)
mn = min(ex_2_1)
range = mx - mn
mids = [2.95 3.15 3.35 3.55 3.75 3.95]
[n,x] = hist(ex_2_1, mids)
bar(x,n,1,'w')
___________________________________________________________________________

4.4 PLOTTING DATA AS SCATTERPLOTS


A scatter plot reveals relationships or associations between two variables. Such
relationships manifest themselves by any non-random structure in the plot.

[Figure: Scatter plot of oxygen versus carbon isotope ratios, 0-230 m, Queensland Plateau
(x-axis: Delta O18, PDB; y-axis: Delta C13, PDB).]
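
A plot like the one above can be produced with the basic plot command. A minimal sketch (the
file and variable names here are hypothetical):

load isotopes.dat                 % assumed two-column ASCII file: delta O18, delta C13
d18O = isotopes(:,1);
d13C = isotopes(:,2);
plot(d18O, d13C, 'o')             % symbols only, no connecting line
xlabel('Delta O18, PDB'); ylabel('Delta C13, PDB')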

5. PROBABILITY
5.1. RANDOM VARIABLES
Turning geoscience data into knowledge requires an ability to test hypotheses and to
analyse errors. This in turn requires some understanding of probability and certain statistical

principles. An underlying concept in statistics is that of a random variable. Random
variables may be thought of as physical quantities, which are yet to be known. Since we
cannot predict their values, we may say that they depend on "chance". Examples for random
variables include quantities in the future (like future earthquakes or floods), quantities in the
past (like the past state of the Earth's magnetic field), or present properties of the Earth,
which we attempt to infer from geological/geophysical measurements (like the motion of a
particular tectonic plate that is measured by GPS data). More traditional types of random
variables include the outcome of future experiments, like the toss of a coin, or the content X
of a can of coke drawn from a vending machine and selected at random, or the outcome of a
physical experiment.

After collecting a series of data, these data are regarded as a set in probability theory,
defined as a collection of objects (also called points or elements) about which it is possible to
determine whether any particular object is a member of the set. In particular, the possible
result of a series of measurements (or experiments) represent a set of points called the sample
space. These points may be grouped together in various ways called events, and under
suitable conditions probability functions may be assigned to each. The probabilities always
lie between zero and one, such that an impossible event has the probability of zero, and the
probability of a certain event is one.

If we consider a sample space of points representing the possible outcomes of a particular


series of measurements, a random variable x(j) is a set function defined for points j from the
sample space. A random variable X can assume values x(j), which can be real numbers
between -∞ and +∞, associated with each sample point that might occur. In other words, the
random outcome of an experiment, indexed by j, can be represented by a discrete distribution
of real numbers x(j), which are the possible values of X. A random variable is described by a
function called the probability density function (PDF). The PDF is a measure of the density
of probability of the random variable plotted on a horizontal axis, which is the domain of
possible values of the random variable. Thus if X is a random variable, the PDF f(x) has a
graph whose area is 1, since it is certain that x will have some value within its domain.

Each x(j) has a probability p(j). The discrete probability function f(x) is:

f(x) = p_j   if x = x_j (j = 1, 2, ...)
f(x) = 0     otherwise

We obtain the (cumulative) probability distribution function F(x) by summing over all x_j ≤ x:

F(x) = Σ_{xj ≤ x} f(x_j) = Σ_{xj ≤ x} p_j

The first moment of a probability distribution is the mean, and the first central moment
about the mean is zero. The second moment about the mean is called the variance, σx², and
its square root, σx, is called the standard deviation. The third moment about the mean is
called skewness γ, and is zero for PDF's which are symmetric about the mean. The fourth
moment about the mean is the kurtosis. It measures "peakedness" of the distribution.

First central moment: E(x − µ) = 0.

Second central moment (variance): σ² = E(x²) − µ².

Third central moment (skewness): γ = (1/σ³) E[(x − µ)³].

The median value divides the probability density distribution in two halves such that
there is a 50% chance for x to be less than the median and a 50% chance for it to be greater
than the median. The next figure shows measures of central tendency of an arbitrary probability
density distribution.

5.2 WHAT IS STATISTICAL SIGNIFICANCE?


The statistical significance (p-level) of a result is an estimated measure of the degree
to which it is "true" (in the sense of "representative of the population"). More technically,
the value of the p-level represents a decreasing index of the reliability of a result. The higher
the p-level, the less we can believe that the observed relation between variables in the sample
is a reliable indicator of the relation between the respective variables in the population.
Specifically, the p-level represents the probability of error that is involved in accepting our
observed result as valid, that is, as "representative of the population."

For example, a p-level of .05 (i.e.,1/20) indicates that there is a 5% probability that the
relation between the variables found in our sample is a "fluke." In other words, assuming that
in the population there was no relation between those variables whatsoever, and we were
repeating experiments like ours one after another, we could expect that approximately in
every 20 replications of the experiment there would be one in which the relation between the
variables in question would be equal or stronger than in ours. In many areas of research, the p-
level of .05 is customarily treated as a "border-line acceptable" error level.

5.3 HOW TO DETERMINE THAT A RESULT IS "REALLY"


SIGNIFICANT
There is no way to avoid arbitrariness in the final decision as to what level of significance
will be treated as really "significant." That is, the selection of some level of significance, up to
which the results will be rejected as invalid, is arbitrary. In practice, the final decision usually
depends on whether the outcome was predicted a priori or only found post hoc in the course
of many analyses and comparisons performed on the data set, on the total amount of

consistent supportive evidence in the entire data set, and on "traditions" existing in the
particular area of research. Typically, in many sciences, results that yield p=0.05 are
considered borderline statistically significant but remember that this level of significance still
involves a pretty high probability of error (5%). Results that are significant at the p=0.01 level
are commonly considered statistically significant, and p=0.005 or p=0.001 levels are often
called "highly" significant. But remember that those classifications represent nothing else but
arbitrary conventions that are only informally based on general research experience.

5.4 GAUSSIAN DISTRIBUTION


The Gaussian or Normal distribution is important because many natural processes result
in data that are normally or log-normally distributed. The distribution of many test statistics is
normal or follows some form that can be derived from the normal distribution. In this sense,
philosophically speaking, the Normal distribution represents one of the empirically verified
elementary "truths about the general nature of reality," and its status can be compared to the
one of fundamental laws of natural sciences. The exact shape of the normal distribution (the
characteristic "bell curve") is defined by a function, which has only two parameters: mean and
standard deviation.

A random variable is said to follow a Gaussian (or normal) distribution, if its probability
density function is given by:

P(x) = (2πσ²)^(-1/2) e^(-0.5 (x - µx)² / σ²)

The Gaussian PDF is completely specified by the mean µx and standard deviation σ. The
shape of the Gaussian distribution is a bell-shaped curve, symmetric about the mean, with
68% of its area within one standard deviation, and 95% within two standard deviations.
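
A minimal sketch of this formula in MATLAB (standard normal, i.e. µx = 0 and σ = 1):

mu = 0; sigma = 1;
x = -4:0.1:4;
p = (2*pi*sigma^2)^(-0.5) * exp(-0.5*((x - mu)/sigma).^2);   % the Gaussian PDF defined above
plot(x, p)
xlabel('x'); ylabel('P(x)')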

In a Normal distribution, observations that have a standardized value of less than -2


or more than +2 have a relative frequency of 5% or less. (Standardized value means that a
value is expressed in terms of its difference from the mean, divided by the standard
deviation.)

[Figure: Normal probability density and distribution functions]

The integral of the Gaussian distribution is the error function, which is important for
solving thermal heat conduction problems. The Gaussian distribution is quite frequently used
in science, because of a result known as the central limit theorem, which states that the sum
of many independent random variables tends to behave as a Gaussian random variable, no
matter what their distribution, as long as it is the same for all. This result implies that any
physical process which is the sum of random events is Gaussian in its distribution.
Unfortunately this assumption often does not hold for some distributions of "real" data we
have to deal with.

Most computers contain a random number generator, which produces numbers between 0
and 1 with an approximately uniform distribution. One may use the random number
generator, and the central limit theorem, to compute random numbers, which are
approximately Gaussian with zero mean and unit variance. With a computer program to
generate uniformly distributed random numbers on the interval (0,1), one may compute the
sum of 12 of them, which by the Central Limit Theorem is approximately Gaussian, subtract
the mean value (6), and obtain approximately Gaussian numbers with unit variance.
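
A minimal sketch of that recipe:

n = 10000;              % how many approximately Gaussian numbers to generate
u = rand(12, n);        % 12 uniform (0,1) random numbers per column
g = sum(u) - 6;         % each column sum has mean 6 and variance 1, so g is approximately N(0,1)
mean(g), std(g)         % should be close to 0 and 1
hist(g, 30)             % roughly bell-shaped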

Application: Many applications arise from the central limit theorem (the average of n
observations approaches a normal distribution, irrespective of the form of the original distribution,
under quite general conditions). Consequently, the normal distribution is an appropriate
model for many, but not all, physical phenomena. Examples: Distribution of physical
measurements on fossils, abundance of elements in rocks, average temperatures, etc. Many
methods of statistical analysis presume normal distribution.
___________________________________________________________________________
Worked Matlab Example: Normal distribution

Grades of chip samples from a body of ore have a normal distribution with a mean of 12%
and a standard deviation of 1.6%. Find the probability that the grade of a chip sample taken at
random will have a grade of:

1) 15% or less
2) 14% or more
3) 8% or less
4) between 8% and 15%

% mean 12%, sd 1.6%


m1=12
sd1=1.6

%prob <15%

%first standardise
z1=sdiz(15,m1,sd1)
%then use (just created) cumulative prob fn
prob1=cump(z1)

%Now prob>14%
z2=sdiz(14,m1,sd1)
prob2=1-cump(z2)

%prob<8%, no problem here if z is negative


z3=sdiz(8,m1,sd1)
prob3=cump(z3)

% 8%<prob<15%
%already standardised 8 & 15% (ie z3 and z1)
prob4=cump(z1)-cump(z3)
___________________________________________________________________________

5.5 RANDOM DATA SAMPLE STATISTICS

5.5.1 Mean
The mean value of a data series g_t is given by:

ḡ = (1/N) Σ_{t=1..N} g_t

where N is the number of data samples. The quantity ḡ calculated here is an unbiased
estimate of the "true" mean value of the continuous function g(t). For many data analysis
procedures it is necessary to remove the mean value. For example, in Fourier transformed
data the presence of a mean value different from zero would result in a spurious large
zero-frequency component in the amplitude spectrum. Hence we form a new time series given by:

x_t = g_t − ḡ,   t = 1, 2, ..., N.

When a distribution is skewed (i.e. not normal), the mean is not representative and the
median should rather be used. The median, which splits the distribution into two halves, is
often very similar to the mode, the most commonly occurring value, but the median is much
easier to compute than the mode. Therefore, the median is often used as a simple substitute
for the mode.

5.5.2 Standard deviation


The standard deviation of the sample is computed by:

s = [ (1/(N−1)) Σ_{t=1..N} x_t² ]^(1/2)

Note that here xt has zero mean. Both s and s2 are unbiased estimates of the standard
deviation σ and the variance σ2 (for details see Bendat and Piersol, 1986, Section 4.1).

5.5.3 Standardization to unit standard deviation
For some computer operations which require fixed point rather than floating point
calculations, it is desirable to standardize the time series to unit standard deviation. This can
be achieved by multiplying the transformed values xt by 1/s:

z_t = x_t / s,   t = 1, 2, ..., N.

5.5.4 Geometric mean


Sometimes a population is skewed to the right but the frequency density of logarithms of
the values is symmetrical (geological data are often log-normally distributed, such as
concentrations of gold, cadmium and uranium in ore deposits). In these cases the arithmetic
mean is unrepresentative and it is common to compute the logarithm of the values and take
the antilogarithm of the resulting mean. This measure is also called the geometric mean. It
can be shown that the same result is obtained from multiplying all the values together and
taking the Nth root of the product:

x_geo = (x_1 · x_2 · ... · x_N)^(1/N)

___________________________________________________________________________
Worked Matlab Example: Basic sample statistics
The following data are diameters (in mm) of clasts from a conglomerate:

23 24 27 29 29 30 33 33 34 38 45 60 60 88 126 221 256

load ex_2_2.dat %loads file into array called ex_2_2

median1=median(ex_2_2)
mean1=mean(ex_2_2)

%Now to get geometric mean


l=1.0;
for i=1:length(ex_2_2)
x=ex_2_2(i);
t=x*l;
l=t
end
geom=l^(1/length(ex_2_2))
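
% An equivalent, more compact alternative (a sketch; taking logs avoids
% overflow for long data series):
geom2 = exp(mean(log(ex_2_2)))           % antilog of the mean of the logs
geom3 = prod(ex_2_2)^(1/length(ex_2_2))  % direct product form, fine for short series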
___________________________________________________________________________

5.6 OTHER DISTRIBUTIONS AND THEIR APPLICATIONS

5.6.1 Binomial
Application: Gives the probability of exactly x successes in n independent trials, when the
probability of success p on a single trial is a constant. Used frequently in quality control,
reliability, survey sampling, and other industrial problems.

Example: What is the probability of 7 or more "heads" in 10 tosses of a fair coin? The
binomial distribution can sometimes be approximated by a normal or by a Poisson
distribution.
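
A minimal sketch of the coin-toss example, using the binomial probability function directly:

n = 10; p = 0.5;
prob = 0;
for x = 7:10
    prob = prob + nchoosek(n,x) * p^x * (1-p)^(n-x);   % P(X = x) for a binomial(n,p) variable
end
prob     % probability of 7 or more heads, approximately 0.17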

5.6.2 Multinomial
Application: Gives probability of exactly ni outcomes of event i, for i = 1, 2, ..., k in n
independent trials when the probability pi of event i in a single trial is a constant. Used
frequently in quality control and other industrial problems. Generalization of binomial
distribution for more than 2 outcomes.

Example: Four companies are bidding for each of three contracts, with specified success
probabilities. What is the probability that a single company will receive all the orders?

5.6.3 Poisson
Application: Gives probability of exactly x independent occurrences during a given
period of time if events take place independently and at a constant rate. May also represent
number of occurrences over constant areas or volumes. Used frequently in quality control,
reliability, queuing theory, and so on. Frequently used as approximation to binomial
distribution.

Probability function for the Poisson distribution:

P(X = x) = exp(−µt) (µt)^x / x!

Where µ is the mean occurrence of an event over a time period t.


___________________________________________________________________________
Worked Matlab Example: Poisson distribution

The number of major floods occurring in 50-year periods in a certain region has a Poisson
distribution with a mean of 2.2. What is the probability of the region experiencing

1) exactly two floods in a 50-year period?


2) Exactly one flood in a 25-year period?
3) At least one flood in a 50-year period?
4) Not more than two floods in a 25-year period?

% For a Poisson distribution


% Pr(X=x) = exp(-mean*t)*((mean*t)^x)/x!
% Nb 1 time period (t=1) is 50 years

% 2 floods in 50 years, X=2 t=1


m1=2.2

prob1=exp(-m1)*((m1).^2)/factorial(2)

% 1 flood in 25 years, X=1, t=0.5

prob2=exp(-m1*0.5)*((m1*0.5).^1)/factorial(1)
% at least one flood in 50 years (t=1) is all but zero in 50 years
% ie 1-P(X=0)

prob3=1-exp(-m1)*((m1).^0)/factorial(0) %nb 0!=1

% Not more than 2 is P(0)+P(1)+P(2), t=0.5

prob4 = exp(-m1*0.5)*((m1*0.5).^0)/factorial(0) + ...
        exp(-m1*0.5)*((m1*0.5).^1)/factorial(1) + ...
        exp(-m1*0.5)*((m1*0.5).^2)/factorial(2)
___________________________________________________________________________

5.6.4 Exponential

Many geological events can be represented by points in space or time. When discrete
events occur randomly and independently at a mean rate λ per unit interval, the intervals
between events give rise to a probability density function as follows:

f(x) = λ exp(−λx)   for λ > 0, x ≥ 0
f(x) = 0            otherwise

In this case, the number of events occurring in a unit interval has a Poisson distribution
with parameter λ, the mean rate of occurrence, and a mean time between events of 1/λ.

The probability for a certain time period x separating two events is:

P(X > x) = exp(−λx)

and

P(X ≤ x) = 1 − exp(−λx)

It follows that

P(x1 ≤ X ≤ x2) = exp(−λx1) − exp(−λx2)

where P(x1 ≤ X ≤ x2) is the probability that the time separating two events lies between
x1 and x2 years.
___________________________________________________________________________
Worked Matlab Example: Exponential distribution

The number of major earthquakes occurring in 100-year intervals in a certain region has a
Poisson distribution with a mean rate of 2.1. Find the probability that the time between two
successive earthquakes is

1) more than 25 years


2) less than 50 years
3) between 30 and 40 years

% 1 time interval (t=1) is 100 years


% Mean rate is 2.1

m1=2.1;

% more than 25 years ie t=0.25


prob1=exp(-m1*0.25)

% less than 50 = 1 - more than 50; t=0.5


prob2=1-exp(-m1*0.5)

% Between 30 and 40 year is P(>30)-P(>40)


prob3=exp(-m1*0.3)-exp(-m1*0.4)
___________________________________________________________________________

5.6.5 Gamma
Application: A basic distribution of statistics for variables bounded at one side - for
example x greater than or equal to zero. Gives distribution of time required for exactly k
independent events to occur, assuming events take place at a constant rate. Used frequently in
queuing theory, reliability, and other industrial applications.

Erlangian, exponential, and chi-square distributions are special cases. The Dirichlet is a
multidimensional extension of the Beta distribution.
What is the distribution of a product of iid uniform (0, 1) random variables? Like many problems with
products, this becomes a familiar problem when turned into a problem about sums. If X is
uniform (for simplicity of notation make it U(0,1)), then Y = −log(X) is exponentially distributed, so
minus the log of the product of X1, X2, ..., Xn is the sum of Y1, Y2, ..., Yn, which has a gamma
(scaled chi-square) distribution. Thus, it is a gamma density with shape parameter n and scale
1.

Example: Distribution of time between recalibrations of an instrument that needs
recalibration after k uses; time between inventory restocking; time to failure for a system with
standby components.

5.6.6 Beta
Application: A basic distribution of statistics for variables bounded at both sides - for
example x between 0 and 1. Useful for both theoretical and applied problems in many areas.
Uniform, right triangular, and parabolic distributions are special cases. To generate beta,
generate two random values from a gamma, g1 and g2. The ratio g1/(g1+g2) is distributed like a
beta distribution. The beta distribution can also be thought of as the distribution of X1 given
(X1+X2), when X1 and X2 are independent gamma random variables.

There is also a relationship between the Beta and Normal distributions. The conventional
calculation is that given a PERT Beta with highest value b, lowest value a and most likely value
m, the equivalent normal distribution has a mean and mode of (a + 4m + b)/6 and a standard
deviation of (b - a)/6. Many stixbox distribution functions are based on the beta and gamma
functions.
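A minimal sketch in base Matlab/Octave of both ideas above (the shape parameters and the
PERT values are illustrative only; the gamma variates are built as sums of exponentials, which
works for integer shape parameters):

k1 = 2; k2 = 3; N = 5000;
g1 = -sum(log(rand(k1, N)));        % gamma(k1,1) variates
g2 = -sum(log(rand(k2, N)));        % gamma(k2,1) variates
b  = g1./(g1 + g2);                 % Beta(k1,k2) variates
mean(b)                             % should be close to k1/(k1+k2) = 0.4
hist(b, 30)

% PERT-style normal approximation from lowest, most likely and highest values
low = 10; most = 14; high = 22;     % illustrative values
pert_mean = (low + 4*most + high)/6
pert_std  = (high - low)/6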

Example: Distribution of proportion of population located between lowest and highest
value in sample; distribution of daily per cent yield in a manufacturing process; description of
elapsed times to task completion (PERT).

5.6.7 Negative Binomial
Application: Gives probability similar to Poisson distribution when events do not
occur at a constant rate and occurrence rate is a random variable that follows a gamma
distribution. Generalization of Pascal distribution when s is not an integer. Many authors do
not distinguish between Pascal and negative binomial distributions.

Example: Distribution of number of cavities for a group of dental patients.

5.6.8 Log-normal
Application: Permits representation of random variable whose logarithm follows
normal distribution. Model for a process arising from many small multiplicative errors.
Appropriate when the value of an observed variable is a random proportion of the previously
observed value.

In the case where the data are lognormally distributed, the geometric mean acts as a better
data descriptor than the mean. The more closely the data follow a lognormal distribution, the
closer the geometric mean is to the median, since the log re-expression produces a
symmetrical distribution. The ratio of two log-normally distributed variables is log-normal.
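A minimal sketch in base Matlab/Octave (the synthetic sample below stands in for real data)
illustrating why the geometric mean is the better descriptor for lognormal data:

x = exp(0.8*randn(1, 2000) + 2);    % synthetic lognormal sample
arithmetic_mean = mean(x)
geometric_mean  = exp(mean(log(x)))
sample_median   = median(x)
% The geometric mean and the median should be close; the arithmetic mean is
% pulled upwards by the long right tail.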

Example: Many geological phenomena give rise to log-normal distributions.

5.6.9 Rayleigh
Application: Gives distribution of radial error when the errors in two mutually
perpendicular axes are independent and normally distributed around zero with equal
variances. Special case of Weibull distribution.

Example: Any data given as dips and strikes.

5.6.10 Weibull
Application: General time-to-failure distribution due to wide diversity of hazard-rate
curves, and extreme-value distribution for minimum of N values from distribution bounded at
left. The Weibull distribution is often used to model "time until failure." In this manner, it is
applied in actuarial science and in engineering work.

It is also an appropriate distribution for describing data corresponding to resonance
behaviour, such as the variation with energy of the cross section of a nuclear reaction or the
variation with velocity of the absorption of radiation in the Mössbauer effect. The Rayleigh and
exponential distributions are special cases.

Example: Life distribution for some capacitors, ball bearings, relays, and so on.

6 SIGNIFICANCE TESTS
6.1 INTRODUCTION
Significance tests are based on certain assumptions: The data have to be random samples
out of a well defined basic population and one has to assume that some variables follow a
certain distribution - in most cases the normal distribution is assumed.

The power of a test is the probability of correctly rejecting a false null hypothesis. A null
hypothesis is a hypothesis of no difference. If a null hypothesis is rejected when it is
actually true, a Type I error has occurred. Power is one minus the probability of making a
Type II error (β), which is the error that occurs when a false null hypothesis is accepted. We
choose the probability of making a Type I error when we set α; note that if we decrease the
probability of making a Type I error we increase the probability of making a Type II error.

Thus, the probability of correctly retaining a true null hypothesis has the same
relationship to Type I errors as the probability of correctly rejecting an untrue null hypothesis
does to Type II errors.

Power and the True Difference Between Population Means: Any time we test whether a
sample differs from a population, or whether two samples come from two separate populations,
there is the assumption that each of the populations we are comparing has its own mean and
standard deviation (even if we do not know it). The distance between the two population
means will affect the power of our test.

Power as a Function of Sample Size and Variance: You should notice that what really
makes the difference in the size of β is how much overlap there is in the two distributions.
When the means are close together the two distributions overlap a great deal compared to
when the means are farther apart. Thus, anything that affects the extent to which the two
distributions share common values will increase β (the likelihood of making a Type II error).

Sample size has an indirect effect on power because it affects the measure of variance we
use to calculate the t-test statistic. Since we are calculating the power of a test that involves
the comparison of sample means, we will be more interested in the standard error (the average
difference in sample values) than in the standard deviation or variance by itself. Thus, sample
size is of interest because it modifies our estimate of the standard error. When n is large we
will have a lower standard error than when n is small. In turn, when n is large we will have a
smaller β region than when n is small.
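A minimal sketch in base Matlab/Octave of how the spread of sample means shrinks with
sample size, using synthetic standard normal data (the sample sizes are illustrative):

N = 2000;                                % number of simulated samples
means_small = mean(randn(5,  N));        % means of N samples of size 5
means_large = mean(randn(50, N));        % means of N samples of size 50
std(means_small)                         % close to 1/sqrt(5)  ~ 0.45
std(means_large)                         % close to 1/sqrt(50) ~ 0.14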

6.2 TESTING WHETHER DATA ARE NORMALLY DISTRIBUTED


Many statistical methods are based on the assumption that data are normally distributed.
If an initial histogram plot indicates that the data to be analysed may be normally distributed,
we can perform another quick test, before conducting more formal tests (e.g. a chi-square
test), using the normfit Matlab script from Middleton (below). In order to determine if a
sample of data may have come from a Normal population, the best-fit Normal distribution is

computed & compared with a histogram. The non-standardised variable x is plotted and the
area under the curve is equal to the total frequency times the histogram class interval.
[Figure: Histogram and fitted Normal curve; frequency versus x]

A histogram of 200 values drawn at random from a Normal population with a mean 0 and a standard
deviation 1, together with the fitted Normal curve.

These data are clearly quite close to a normal distribution. Now let's look at another
example, namely data on uranium content (in ppm) in lake sediments from Saskatchewan,
Canada.
[Figure: Histogram of uranium data from lake sediments; frequency versus x]

Initially it looks like there is not too much hope that these data might be anything close to
a normal distribution. However, it turns out that log-normal distributions are quite common
in nature. For example, if you take a hammer and smash a quartz grain to pieces, the resulting
grain size distribution will likely be log-normal. Let's have a look at whether this might hold for
the lake sediment data:

[Figure: Histogram of the log of uranium data from lake sediments; frequency versus x]

Now the picture has changed: our data may indeed be log-normally distributed, but the
distribution is not quite symmetric. In order to be sure that we can safely follow the
assumption that, say at the 95% confidence level, the data are log-normally distributed, we
would now have to carry out a chi-squared test (see later).

The following plot shows a bathymetric profile across the Hawaiian seamount chain.
[Figure: Bathymetric profile across the Hawaiian seamount chain; depth (m) versus distance along the profile]

A histogram of the depths (below) clearly shows that this is at the very least a bimodal
distribution, with one very shallow peak that corresponds to the top of the Hawaiian chain,
and a large range of abyssal seafloor depths that could be broken up into two further
distributions (flexural bulges next to the seamount chain, and true abyssal seafloor depths).

[Figure: Histogram of depths along the Hawaiian profile; frequency versus depth (m)]

The stixbox normmix function estimates the mixture of normal distributions, and their
means, standard deviations, and mixture weights. In this case, normmix returns the
following values for three estimated distributions:

    Mean depth (m)    Standard deviation (m)    Weight
    5580              59                        0.2
    4654              391                       0.7
    568               166                       0.1

Other tests to determine the probability that a sample came from a normally distributed
population of observations are the Kolmogorov-Smirnov test and the Shapiro-Wilk W test.

6.3 OUTLIERS
Outliers are atypical (by definition), infrequent observations. Because of the way in
which the regression line is determined (especially the fact that it is based on minimizing not
the sum of simple distances but the sum of squares of distances of data points from the line),
outliers have a profound influence on the slope of the regression line and consequently on the
value of the correlation coefficient. A single outlier is capable of considerably changing the
slope of the regression line and, consequently, the value of a correlation, or it may also change
the outcome of a test for normality of a distribution. Note that if the sample size is relatively
small, then including or excluding specific data points that are not as clearly "outliers" can
substantially change the result of the analysis.

Typically, we believe that outliers represent a random error that we would like to be able
to control. Unfortunately, there is no single, widely accepted method to remove outliers
automatically. However, quantitative methods to exclude
outliers have been proposed. The best method that I have come across is "Chauvenet's
criterion", as described in Taylor's book "An introduction to error analysis". Chauvenet's
criterion states that if the expected number of measurements at least as bad as the
suspect measurement is less than 1/2, then the suspect measurement should be rejected.

tsus = (xsus - mean)/std

where tsus is the number of standard deviations by which the suspect measurement xsus
differs from the mean. We next find the probability P(outside tsus*std) that a legitimate
measurement will differ from the mean by tsus or more standard deviations. This can be done
in Matlab by using the stixbox function pnorm. pnorm computes values of the normal

distribution function (i.e. cumulative density function) (right on figure below), whereas dnorm
computes the normal density function (left on figure below).

For example:

pnorm(0)
ans = 0.5

As shown in the figure above, because at a z (or x)-value of 0, the grey area underneath
the density function is exactly half the area under this curve.

pnorm([-1 1])

yields:

0.1587 0.8413

z (or x) = -1 and 1 correspond to minus and plus one standard deviation of a normal
distribution function.

We can use a small Matlab script, based on the commands dnorm and pnorm to
reproduce the figure above:

clear
x = [-4:0.1:4];
dens=dnorm(x);
dist=pnorm(x);
subplot(1,2,1)
plot(x,dens)
grid on
title('Normal density function')
xlabel('x')
ylabel('Probability')
subplot(1,2,2)
plot(x,dist)
grid on
title('Normal distribution function')
xlabel('x')
ylabel('Cumulative area')

[Figure: Normal density function (left; probability versus x) and normal distribution function (right; cumulative area versus x)]

The probability that a measurement from a normally distributed population lies within x
standard deviations of the mean corresponds to the area under the normal curve between -x
and +x.

The likelihood that a measurement lies within one standard deviation of the mean corresponds
to the area under the normal density curve (left above) between -1 and 1. To find this area all
we have to do is look up the cumulative area at x=-1 and x=1 on the normal distribution
curve, and take the difference between them. In Matlab, we type:

diff(pnorm([-1 1]))
ans = 0.6827

We have just verified that about 68% of data from a normal distribution should lie within 1
standard deviation of the mean (which is zero for standardized data). In other words, the
probability for this to happen is 0.68, or 68%. By the same token, the probability that a
measurement will lie outside one standard deviation is 1-0.68=0.32.

Now back to Chauvenet's criterion. We want to find the probability that an "outlier" in our
series of measurements is actually an "erroneous" measurement that should be removed. For
that we need to find the probability P(outside tsus*std) that a legitimate measurement will differ
from the mean by tsus or more standard deviations. For a standard deviation of one, we have
just shown that this probability is 0.32. Now, we multiply this value by the number of
measurements N:

n(worse than tsus) = N * P(outside tsus*std)

This is the total number of measurements expected to be at least as bad as xsus.
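These steps can be wrapped into a small reusable function, sketched below in base
Matlab/Octave. It is not part of the stixbox or of Middleton's scripts: the function name is
illustrative, and the built-in erf is used in place of pnorm. Save it as chauvenet.m:

--------
function [reject,nworse] = chauvenet(x,xsus)
%Applies Chauvenet's criterion to a suspect value xsus in the data vector x
N = length(x);
tsus = abs(xsus - mean(x))/std(x);     % st. deviations from the mean
Pout = 1 - erf(tsus/sqrt(2));          % P(outside tsus*std) for a normal dist.
nworse = N*Pout;                       % expected number as bad as xsus
reject = nworse < 0.5;                 % Chauvenet's criterion
--------

For the belemnite data in the worked example below, chauvenet([46 48 44 38 45 47 58 44 45 43], 58)
should return nworse close to 0.16 and reject = 1.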


___________________________________________________________________________
Worked example: Chauvenet's criterion for identifying outliers

We have measured the length of ten belemnites in mm:

46, 48, 44, 38, 45, 47, 58, 44, 45, 43

We notice that the value 58 seems anomalously large, check our field records, but cannot find
any evidence that this measurement was caused by a mistake. We use Chauvenet's criterion
to evaluate whether this measurement should be rejected as an unexpected outlier in a data
set from a presumably normal population.

First we compute the mean and standard deviation for all ten measurements as 45.8 and 5.1.
The difference between the suspect value xsus and the mean is 12.2, or 2.4 standard deviations:

tsus = (xsus - mean)/std

(58 - 45.8)/5.1
ans = 2.4

diff(pnorm([-2.4 2.4]))
ans = 0.9836

1 - 0.9836
ans = 0.0164

10 * 0.0164
ans = 0.164

In ten measurements, we would expect to find only 0.164 of one measurement as bad as our
suspect result. This is less than the number 0.5 set by Chauvenet's criterion, so we should
consider rejecting the anomalous measurement.

Our next most suspect result is 38. Check whether this result is expected or not given our ten
measurements.
___________________________________________________________________________

6.4 DISTRIBUTION OF THE SAMPLE MEAN


Suppose 20 scientists were sent out to Botany Bay, and each one of them takes five
sediment samples and determines the average grain size. They then calculate the mean grain
size of the five samples, such that we have 20 mean grain sizes. If we plotted these
twenty values on a histogram we would find that they are much less scattered than the original
100 grain-size values.
It can be shown that the distribution of the sample mean has the same mean as that of the
single values, but the variance is much lower by a factor which depends on the sample size:

σx² = σ²/n

The square root of the variance of this or any other sampling distribution is called the
standard error (SE) of the statistic:

SE(x) = √(σ²/n)

The standard error is useful to calculate confidence intervals of means, as seen in the
following example.

Sometimes we have to estimate confidence intervals for differences between two means.
The estimated standard error of the difference in sample means is:

SE(xA − xB) = √( s²(1/nA + 1/nB) )

where s² is the common (pooled) variance:

s² = [ (nA − 1)sA² + (nB − 1)sB² ] / (nA + nB − 2)
A B
___________________________________________________________________________
Worked example: Confidence intervals of means
1) Given data on the percentage of quartz in thin sections from an igneous rock, what is the
confidence interval around the estimated mean quartz percentage in the rock?

qz=[23.5 16.6 25.4 19.1 19.3 22.4 20.9 24.9];


m1=mean(qz);
v1=var(qz);
n=length(qz);

% standard error: root(s^2/n)


s1=sqrt(v1/n);

% Critical t with n-1 df: the value not exceeded with probability 0.975 (2.5% upper tail)


t1=qt(0.975,n-1);
% Confidence limits
c1=m1-t1*s1
c2=m1+t1*s1

2) Is there any evidence that two brachiopod samples could have been derived from
populations having the same mean (data are contained in data files ex_2_22_a.dat and
ex_2_22_b.dat)?
load ex_2_22_a.dat;
load ex_2_22_b.dat;

% Calculate mean, std, variance, and sample size


m1=mean(ex_2_22_a)
m2=mean(ex_2_22_b)
s1=std(ex_2_22_a)
s2=std(ex_2_22_b)
ss1=s1*s1
ss2=s2*s2
n1=length(ex_2_22_a)
n2=length(ex_2_22_b)

% Calculate combined variance


sc=((n1-1)*ss1+(n2-1)*ss2)/(n1+n2-2)

% t distribution with n1+n2-2 = 16 df; the critical value is exceeded with
% probability 0.025, i.e. not exceeded with probability 0.975

t1=qt(0.975,16)

st=t1*sqrt(sc*((1/n1)+(1/n2)))

%now difference in means...


dm=m1-m2
%Confidence limits

c1=dm-st
c2=dm+st
___________________________________________________________________________

6.5 STUDENT T-DISTRIBUTION


The distribution of the sample mean requires knowledge of the value of the standard
deviation of the quantity, as shown above:

    (x − µ) / (σ/√n)

Here x is the sample mean and µ the (unknown) population mean. The population standard
deviation σ is usually unknown, and therefore we must estimate it with the statistic s, which
gives the following:

    t = (x − µ) / (s/√n)

The distribution of this quantity is not normal, but it is bell-shaped and centered on zero.
It is dependent on the degrees of freedom (d.f.), denoted by ν (nu), which are n-1, with n
being the number of sample observations.

[Figure: Student t density and distribution function]

The t distributions were discovered in 1908 by William Gosset, a chemist and
statistician employed by the Guinness brewing company. He considered himself a student still
learning statistics, so he signed his papers with the pseudonym "Student". Or perhaps he
used a pseudonym due to "trade secrets" restrictions by Guinness.

Note that there are different t distributions; it is a class of distributions. When we speak of
a specific t distribution, we have to specify the degrees of freedom. The t density curves are
symmetric and bell-shaped like the normal distribution and have their peak at 0.
However, the spread is greater than that of the standard normal distribution. The larger the
degrees of freedom, the closer the t-density is to the normal density (see figure below).
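A minimal sketch of this convergence using the stixbox density functions dnorm and dt
(assuming dt takes its arguments in the order dt(x, df), as the help text suggests; the degrees
of freedom chosen are illustrative):

x = -4:0.1:4;
plot(x, dnorm(x), 'k')             % standard normal density
hold on
for df = [2 5 30]                  % illustrative degrees of freedom
  plot(x, dt(x, df))
end
hold off
title('t densities (df = 2, 5, 30) and the normal density')
xlabel('x'); ylabel('Probability')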

6.6 STUDENT T-TEST FOR INDEPENDENT SAMPLES
The t-test is the most commonly used method to evaluate the differences in means
between two groups. Theoretically, the t-test can be used even if the sample sizes are very
small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long
as the variables are normally distributed within each group and the variation of scores in the
two groups is not reliably different. As mentioned before, the normality assumption can be
evaluated by investigating the distribution of the data via histograms or by performing a
normality test. If the normality condition is not met, then you can evaluate the differences in
means between two groups using a nonparametric alternatives to the t- test (discussed later).

The p-level reported with a t-test represents the probability of error involved in
accepting our research hypothesis about the existence of a difference. Technically
speaking, this is the probability of error associated with rejecting the hypothesis of no
difference between the two categories of observations (corresponding to the groups) in the
population when, in fact, the hypothesis is true.

If the difference is in the predicted direction, you can consider only one half (one "tail")
of the probability distribution and thus divide the standard p-level reported with a t-test (a
"two-tailed" probability) by two.

In order to perform the t-test for independent samples, one independent (grouping)
variable and at least one dependent variable (e.g., a test score) are required. The means of the
dependent variable will be compared between selected groups based on the specified values of
the independent variable.

It often happens in research practice that you need to compare more than two groups, or
compare groups created by more than one independent variable while controlling for the
separate influence of each of them. In these cases, you need to analyze the data using Analysis
of variance (ANOVA), which can be considered to be a generalization of the t-test. In fact, for
two group comparisons, ANOVA will give results identical to a t-test (t²(df) = F(1, df)).
However, when the design is more complex, ANOVA offers numerous advantages that t-tests
cannot provide.
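This identity is easy to check numerically with the stixbox quantile functions qt and qf used
elsewhere in these notes (the degrees of freedom below are illustrative):

df = 16;
t_crit = qt(0.975, df)      % two-sided 5% critical t
f_crit = qf(0.95, 1, df)    % 5% critical F with (1, df) degrees of freedom
t_crit^2                    % should equal f_crit, apart from rounding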

6.7 STUDENT T-TEST FOR DEPENDENT SAMPLES

6.7.1 Within-group Variation


The size of a relation between two variables, such as the one measured by a difference in
means between two groups, depends to a large extent on the differentiation of values within
the group. Depending on how differentiated the values are in each group, a given "raw
difference" in group means will indicate either a stronger or weaker relationship between the
independent (grouping) and dependent variable.

For example, if the mean count of oil inclusions was 102 in formation A and 104 in
formation B, then this difference of "only" 2 points would be extremely important if all values
for formation A fell within a range of 101 to 103, and all scores for formation B fell within a
range of 103 to 105. However, if the same difference of 2 was obtained from very
differentiated scores (e.g., if their range was 0-200), then we would consider the difference

36
entirely negligible. That is to say, reduction of the within-group variation increases the
sensitivity of our test.

6.7.2 Purpose
The t-test for dependent samples helps us to take advantage of one specific type of design
in which an important source of within-group variation (or so-called, error) can be easily
identified and excluded from the analysis. Specifically, if two groups of observations (that are
to be compared) are based on the same set of samples that were analysed twice (e.g., before
and after a particular treatment), then a considerable part of the within-group variation in both
groups of scores can be attributed to the initial individual differences between subjects.

Note that, in a sense, this fact is not much different than in cases when the two groups are
entirely independent, where individual differences also contribute to the error variance; but in
the case of independent samples, we cannot do anything about it because we cannot identify
(or "subtract") the variation due to individual differences in subjects.

However, if the same sample was tested twice, then we can easily identify (or "subtract")
this variation. Specifically, instead of treating each group separately, and analyzing raw
scores, we can look only at the differences between the two measures (e.g., "pre-test" and
"post test") in each subject. By subtracting the first score from the second for each subject and
then analyzing only those "pure (paired) differences," we will exclude the entire part of the
variation in our data set that results from unequal base levels of individual subjects. This is
precisely what is being done in the t-test for dependent samples, and, as compared to the t-test
for independent samples, it always produces "better" results (i.e., it is always more sensitive).
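A minimal sketch of the dependent-samples (paired) t-test, using the stixbox qt function; the
pre/post measurement vectors below are illustrative only:

% Paired t-test on illustrative before/after measurements of the same samples
pre  = [12.1 10.4 11.8  9.9 13.0 10.7 12.4 11.1];
post = [12.9 10.8 12.5 10.1 13.6 11.5 12.8 11.6];
d  = post - pre;                    % work with the paired differences
n  = length(d);
t  = mean(d)/(std(d)/sqrt(n))       % observed t statistic
tc = qt(0.975, n-1)                 % two-sided 5% critical value
% If t exceeds tc we reject the hypothesis of no change.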

6.7.3 Assumptions
The theoretical assumptions of the t-test for independent samples also apply to the
dependent samples test; that is, the paired differences should be normally distributed. If these
assumptions are clearly not met, then one of the nonparametric alternative tests should be
used.

6.7.4 One Sample Student's t test

Formula:

    tobs = (x − µ0) / (s/√n)

where x is the sample mean, µ0 is the hypothetical population mean, s is the sample standard
deviation and n is the sample size.

Two-Tailed Tests: [Figure: statistical hypotheses, critical values and comparisons for a
two-tailed test]

One-Tailed Tests: [Figure: statistical hypotheses, critical values and comparisons for a
one-tailed test]

Alternate hypothesis is population mean < hypothetical population mean

Alternate hypothesis is population mean > hypothetical population mean

6.7.5 Two Independent Samples Student's t test

Formula:

    tobs = (x1 − x2) / √( s²(1/n1 + 1/n2) )

where s² is the common (pooled) variance of the two samples, defined earlier.

Two-Tailed Tests: [Figure: statistical hypotheses, critical values and comparisons for a
two-tailed test]

One-Tailed Tests: [Figure: statistical hypotheses, critical values and comparisons for a
one-tailed test]

Alternate hypothesis is mean of population one < mean of population two

Alternate hypothesis is mean of population one > mean of population two

___________________________________________________________________________
Worked example: T-test
A random sample of 12 observations is obtained from a normal distribution. What value
of the t-statistic will be exceeded with a probability of 0.025? The number of degrees of
freedom (df) is n-1=11. Use the help function to find out how dt, pt and qt work.

% Using t-distribution generator (stixbox functions dt, pt and qt).


% t-stat will be exceeded by P=0.025 and is not exceeded by P=0.975

p=0.975;
df=11; %n-1
%inverse t is qt
t=qt(p,df)

A random sample of 8 hand specimens of rock was analysed for organic material; the
sample mean was found to be 5.8% and the sample standard deviation was 2.3. Do you
think it reasonable to suppose that the organic content of the rock is 5.0%?

%Use function tstat (not provided with stixbox)

--------
function t = tstat(m1,m2,s,n)
%Calculates t- test statistic
t = (m1-m2)/(s/sqrt(n))
--------

m1=5.8; %sample mean
m2=5.0; %suggested mean
s=2.3; %std
n=8;
t2=tstat(m1,m2,s,n)

%Now for 7df want t0.05 (ie not exceeded by 0.95)


t3=qt(0.95,7)

%ie no reason to doubt mean (ie t2 < t3)


___________________________________________________________________________
Worked example: Inference applied to t-test for equivalence of population
mean to a hypothetical mean

Is there any evidence that the igneous rock from which the eight measurements of quartz
were taken has a mean quartz percentage greater than 20%?
qz=[23.5 16.6 25.4 19.1 19.3 22.4 20.9 24.9];

m1=mean(qz)
sd1=std(qz)
s1=sd1*sd1
n=length(qz)
ss1=sqrt(s1/n)

% Working out our critical t


% Need 5% significance, ie not exceeded by 95%
t1=qt(0.95,n-1)

%Our t-stat
m2=20
t2=tstat(m1,m2,sd1,n)
___________________________________________________________________________

Worked example: Inference applied to t-test for equivalence of two


population means
Using the data on brachiopods introduced earlier, there are two alternative questions that
might need to be answered.

1) Is there any evidence that the brachiopods from horizon A are longer than those from
horizon B?
2) Is there any evidence of difference in lengths between A and B?

load ex_2_24_a.dat
load ex_2_24_b.dat

% Basic Stats
m1=mean(ex_2_24_a)
m2=mean(ex_2_24_b)
s1=std(ex_2_24_a)
s2=std(ex_2_24_b)
ss1=s1*s1
ss2=s2*s2
n1=size(ex_2_24_a,1)
n2=size(ex_2_24_b,1)

%Degrees of freedom n1 + n2 - 2
df=n1+n2-2;

%Critical t p=0.95
t1=qt(0.95,df)

%Common variance
sc=((n1-1)*ss1+(n2-1)*ss2)/(n1+n2-2)

%Calculated test stat


t2=(m1-m2)/(sqrt(sc*((1/n1)+(1/n2))))

% nb t2 > t1, therefore the null hypothesis is rejected

%Question II
%Here use both tails of the t-distribution to get alpha=5%, ie 0.025 in each tail
t3=qt(0.975,df)

%Therefore do not accept difference


___________________________________________________________________________

6.8 CHI-SQUARE DISTRIBUTION


The probability density curve of a chi-square distribution is an asymmetric curve
stretching over the positive side of the line and having a long right tail. The form of the curve
depends on the value of the degrees of freedom. It is the distribution of the sum of squares of
N unit variance Gaussian random variables.

There are two popular applications for the chi-square distribution:

1) Chi-square Test for Association is a (non-parametric, therefore usable for nominal
data) test of statistical significance widely used in bivariate tabular association analysis.
Typically, the hypothesis is whether or not two different populations are different
enough in some characteristic or aspect of their behavior based on two random
samples. This test procedure is also known as the Pearson chi-square test.

2) Chi-square Goodness-of-fit Test is used to test if an observed distribution conforms to


any particular distribution. Calculation of this goodness of fit test is by comparison of
observed data with data expected based on the particular distribution.

Like the Student's t-distribution, the Chi-square distribution's shape is determined by its
degrees of freedom. The figure below shows the shape of the Chi-square distribution as
the degrees of freedom increase (1, 2, 5, 10, 25 and 50).

[Figure: Chi-square density and distribution function]

6.9 PEARSON CHI-SQUARE TEST
The Pearson Chi-square test is the most common test for significance of the relationship
between categorical variables. This measure is based on the fact that we can compute the
expected frequencies in a two-way table (i.e., frequencies that we would expect if there was
no relationship between the variables). For example, suppose we ask 20 male and 20 female
geologists to choose between two brands of beer (brands A and B). If there is no relationship
between preference and gender, then we would expect about an equal number of choices of
brand A and brand B for each sex. The Chi-square test becomes increasingly significant as the
numbers deviate further from this expected pattern; that is, the more this pattern of choices for
males and females differs.

The value of the Chi-square test and its significance level depends on the overall number
of observations and the number of cells in the table. Relatively small deviations of the relative
frequencies across cells from the expected pattern will prove significant if the number of
observations is large.

The only assumption underlying the use of the Chi-square test (other than random
selection of the sample) is that the expected frequencies are not very small. The reason for this
is that the Chi-square test inherently tests the underlying probabilities in each cell; and when
the expected cell frequencies fall, for example, below 5, those probabilities cannot be
estimated with sufficient precision.

The chi-square statistic we use for this test is:

    Χ² = Σ (Oj − Ej)² / Ej     (sum over j = 1, ..., k)

where k = number of categories, Oj = frequency observed in the jth category, and Ej =
frequency expected in the jth category if the null hypothesis is true.

H0: Data drawn from population having specified properties
H1: Data drawn from population not having specified properties

Degrees of freedom (df) = k - 1 - (number of parameters estimated)

___________________________________________________________________________
Worked example: Pearson Chi-square test
You have developed a theory that the proportions of four minerals in granite are 4:1:2:3.
You have analysed a random sample of 100 grains in a thin section consisting of 35, 12, 22
and 31 of these species. You would like to test if this sample lends support to your theory or
not.

% Observed and expected frequency of 4 minerals

obsf=[35 12 22 31];
exf=[40 10 20 30];

% Observed minus expected


dif=obsf-exf;

% (Observed - Expected)^2/Expected
dif2=dif.^2./exf;
dif3=cumsum(dif2);
dif4=dif3(1,4);

% For df=3, 95% of values are <= 7.815

qchisq(0.95,3);

% Sum of departures of observed from expected values is about 1.26


% -> much smaller than max value allowed within 95%

___________________________________________________________________________
Worked example: Goodness-of-fit Chi-square test for a normal distribution
The data file ex_2_25.dat contains data on uranium content from lake sediments from 71
sites from Saskatchewan, Canada. Are these data drawn from a normally distributed
population? First plot a histogram of these data. You quickly see that the data are not
normally distributed at all. Our next guess is that they may be log-normally distributed, and
we proceed from there.

load ex_2_25.dat

histo(ex_2_25)

u=log(ex_2_25);
umean=mean(u)
ustd=std(u)
nsamp=length(u);

% Standardise the data (zero mean and std of 1)

ust=(u-umean)/ustd;

% Bin data
% To compare with normal dist must take bins symmetric about 0

% Centers of bins (midpoints); must be symmetric and an even number

y=[-1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75];


[ofreq,x]=hist(ust,y);
% bar(x,ofreq)

% Boundaries of bins

yb=[-5 -1.5 -1 -0.5 0 0.5 1 1.5 5];

% Now use histc to plot histogram using new bins

% Find out how histc works


help histc

[N,bin]=histc(ust,yb)
bar(yb,N,'histc','c')

n=length(yb);

% Find expected probabilities in % for intervals for x1-x2


% First calculate cum. probabilities, then prob for given intervals

ep=diff(pnorm(yb));

% Find expected frequencies of specimen


% Normalize w.r.t. total number of samples we have
% 100% corresponds to 71 samples

efreq=ep*nsamp;

% Calculate (Obsfreq-expfreq)^2/expfreq

chi2=((ofreq-efreq).^2)./efreq
chi2sum=sum(chi2)

% Deg. freedom = no. of classes - 1 - no. of param. est.


% (2 for normal dist)

classes=length(yb)-1;  % number of classes (intervals), not bin boundaries

df=classes-1-2

% Critical chi2 value at 95% and df=5

chi2inv(0.95,df)

% -> Calc. chi2sum (9.4) does not exceed critical Chi2 (11.07)
% -> Accept null hypo.
% -> Data do not differ sign. from normal dist.

6.10 FISHER F DISTRIBUTION


The F distribution is based on the ratio of two Chi squared variables and is commonly
used in the analysis of variance (ANOVA). It is used to test the equality of the variances
s1² and s2² of two normally distributed data sets, for example two sets of porosity
measurements of the same sandstone formation at two different wells.

Suppose we want to perform a hypothesis test to determine whether two population variances
are the same:

Sample statistics: the two sample variances s1² and s2²

Test statistic: F = s1²/s2² (note that we usually put the larger variance in the numerator)

Critical region: let α = 0.05; reject the null hypothesis of equal variances if F exceeds the
critical value from the F table.

A specific F distribution is denoted by the degrees of freedom for the numerator Chi-
square and the degrees of freedom for the denominator Chi-square. An example of the F(10,10)
distribution is shown in the figure below. When referencing the F distribution, the numerator
degrees of freedom are always given first, as switching the order of degrees of freedom
changes the distribution (e.g., F(10,12) does not equal F(12,10)).

[Figure: Fisher F density and distribution function]

___________________________________________________________________________
Worked example: F-test

We have obtained two sets of porosity measurements from a sandstone formation at two
different locations (5 and 11 samples). We are interested in determining if the variation in
porosity is the same in both areas. We will use a 95% level of significance. The two degrees

of freedom (n-1) are 4 and 10. What value of a statistic from the F4,10 distribution will be
exceeded with probability 0.05?

% Porosity measurements
a=[10.0 8.5 7.9 9.2 7.5]';
b=[10.5 7.9 8.7 7.3 10.4 8.8 7.7 9.4 10.4 8.3 9.2]';

% Compute F-statistic
av=(std(a))^2
bv=(std(b))^2
% the larger variance must be in the numerator
F=bv/av

p=0.95;
% use the inverse F distribution generator qf to obtain the critical F

f=qf(0.95,10,4)

% Calculated F-statistic (1.27) is much smaller than Fcrit (5.96)


% We accept the null hypothesis that the parent populations
% of the two sets of samples have equal variances
___________________________________________________________________________

7 ANOVA: ANALYSIS OF VARIANCE


The tests we have learned up to this point allow us to test hypotheses that examine the
difference between only two means. Analysis of Variance or ANOVA will allow us to test the
difference between 2 or more means. ANOVA does this by examining the ratio of variability
between two conditions and variability within each condition. For example, say we give a
drug that we believe will improve memory to a group of people and give a placebo to another
group of people. We might measure memory performance by the number of words recalled
from a list we ask everyone to memorize. A t-test would compare the likelihood of observing
the difference in the mean number of words recalled for each group. An ANOVA test, on the
other hand, would compare the variability that we observe between the two conditions to the
variability observed within each condition. Recall that we measure variability as the sum of
the squared differences of each score from the mean. When we actually calculate an ANOVA we will
use a short-cut formula.

Thus, when the variability that we predict (between the two groups) is much greater than
the variability we don't predict (within each group) then we will conclude that our treatments
produce different results.

Levene's Test: Suppose that the sample data do not support the homogeneity of variance
assumption; if, however, there is good reason to believe that the variances in the populations
are almost the same, then in such a situation you may like to use Levene's modified test (see
the sketch at the end of this section): in each group first compute the absolute deviation of the
individual values from the median in that group, then apply the usual one-way ANOVA on the
set of deviation values and interpret the results.

The stixbox does not have an anova tool available, but both the (commercial) statistics
toolbox from Matlab and the free octave software do have anova functions built in, which are
fairly easy to use.
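A minimal sketch of the Levene-style procedure described above, assuming Octave's one-way
anova(y, g) interface (y is the data vector, g the group labels); the data are illustrative and the
call is not part of the stixbox:

% Levene's modified test: one-way ANOVA on absolute deviations from each
% group's median (illustrative porosity-style data)
a = [10.0 8.5 7.9 9.2 7.5]';
b = [10.5 7.9 8.7 7.3 10.4 8.8 7.7 9.4 10.4 8.3 9.2]';
da = abs(a - median(a));                   % absolute deviations, group A
db = abs(b - median(b));                   % absolute deviations, group B
y  = [da; db];                             % stack the deviation values
g  = [ones(size(da)); 2*ones(size(db))];   % group labels
[pval, f] = anova(y, g)                    % assumed Octave anova(y, g) call
% A small pval would suggest that the group variances differ.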

8 STATISTICS BETWEEN TWO OR MORE VARIABLES
8.1 CORRELATIONS BETWEEN TWO OR MORE VARIABLES

8.1.1 Introduction
Correlation is a measure of the relation between two or more variables. The measurement
scales used should be at least interval scales, but other correlation coefficients are available to
handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value
of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect
positive correlation. A value of 0.00 represents a lack of correlation. The most widely-used
type of correlation coefficient is Pearson r, also called linear or product-moment correlation.

Pearson correlation (hereafter called correlation), assumes that the two variables are
measured on at least interval scales, and it determines the extent to which values of the two
variables are "proportional" to each other. The value of correlation (i.e., correlation
coefficient) does not depend on the specific measurement units used; for example, the
correlation between height and weight will be identical regardless of whether inches and
pounds, or centimeters and kilograms are used as measurement units. Proportional means
linearly related; that is, the correlation is high if it can be "summarized" by a straight line
(sloped upwards or downwards).

This line is called the regression line or least squares line, because it is determined such
that the sum of the squared distances of all the data points from the line is the lowest possible.
Note that the concept of squared distances will have important functional consequences on
how the value of the correlation coefficient reacts to various specific arrangements of data (as
we will later see).

As mentioned before, the correlation coefficient (r) represents the linear relationship
between two variables. If the correlation coefficient is squared, then the resulting value (r2,
the coefficient of determination) will represent the proportion of common variation in the two
variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the
correlation between variables, it is important to know this "magnitude" or "strength" as well
as the significance of the correlation.
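A minimal sketch using the built-in corrcoef function (the x and y vectors are illustrative):

x = [2 4 5 7 9 11 13 14];
y = [3 5 7 8 12 12 15 17];
R  = corrcoef(x, y);                % 2-by-2 correlation matrix
r  = R(1, 2)                        % Pearson correlation coefficient
r2 = r^2                            % coefficient of determination
% r2 is the proportion of variance in y "explained" by the linear relation.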

8.1.2 Significance of Correlations


The significance level calculated for each correlation is a primary source of information
about the reliability of the correlation. The significance of a correlation coefficient of a
particular magnitude will change depending on the size of the sample from which it was
computed. The test of significance is based on the assumption that the distribution of the
residual values (i.e., the deviations from the regression line) for the dependent variable y
follows the normal distribution, and that the variability of the residual values is the same for
all values of the independent variable x.

However, Monte Carlo studies suggest that meeting those assumptions closely is not
absolutely crucial if your sample size is not very large. It is impossible to formulate precise
recommendations based on those Monte Carlo results, but many researchers follow a rule of

thumb that if your sample size is 50 or more then serious biases are unlikely, and if your
sample size is over 100 then you should not be concerned at all with the normality
assumptions. There are, however, much more common and serious threats to the validity of
information that a correlation coefficient can provide; they are briefly discussed in the
following paragraphs.

8.1.3 Nonlinear Relations between Variables


Another potential source of problems with the linear (Pearson r) correlation is the shape
of the relation. As mentioned before, Pearson r measures a relation between two variables
only to the extent to which it is linear; deviations from linearity will increase the total sum of
squared distances from the regression line even if they represent a "true" and very close
relationship between two variables. The possibility of such non-linear relationships is another
reason why examining scatterplots is a necessary step in evaluating every correlation.

8.1.4 Measuring Nonlinear Relations


What do you do if a correlation is strong but clearly nonlinear (as concluded from
examining scatterplots)? Unfortunately, there is no simple answer to this question, because
there is no easy-to-use equivalent of Pearson r that is capable of handling nonlinear relations.
If the curve is monotonic (continuously decreasing or increasing) you could try to transform
one or both of the variables to remove the curvilinearity and then recalculate the correlation.
For example, a typical transformation used in such cases is the logarithmic function which
will "squeeze" together the values at one end of the range.

Another option available if the relation is monotonic is to try a nonparametric
correlation (e.g., Spearman R, see the spearman stixbox function), which is sensitive only to the
ordinal arrangement of values and thus, by definition, ignores monotonic curvilinearity.
However, nonparametric correlations are generally less sensitive and sometimes this method
will not produce any gains. Unfortunately, the two most precise methods are not easy to use
and require a good deal of "experimentation" with the data. Therefore you could try to
identify the specific function that best describes the curve. After a function has been
found, you can test its "goodness-of-fit" to your data.

___________________________________________________________________________
Worked example: Linear and polynomial regression

On ODP Leg 183 (Kerguelen Plateau) sediment velocity data were collected based on
both downhole geophysical logs and on laboratory measurements of core samples. The
question arises: How well do the two sets of measurements correlate?

We can only work with values collected below 80 m below the sea floor, because the hole
was cased above this depth, preventing us from collecting data based on downhole logs.

% Velocities from Kerguelen ODP Leg183

clear
load vel_log.dat
load vel_samp.dat

deplog=vel_log(:,1);
depsamp=vel_samp(:,1);

vellog=vel_log(:,2);
velsamp=vel_samp(:,2);

plot(vellog,-deplog,'o')
hold on
plot(velsamp,-depsamp,'+r')
title('Velocities from logs (blue) and from samples (red)');
xlabel('Velocity [km/s]')
ylabel('Depth [m]')
hold off

% Usable depth interval


depi=(95:0.15:345);

% Resample both sets of measurements at log sampling rate


% logs are more coarsely sampled than core data

help interp1
vell=interp1(deplog,vellog,depi);
vels=interp1(depsamp,velsamp,depi);

% Plot resampled data (velocity versus depth)


% on scatterplot

figure
plot(vell,vels,'+');
[Figure: Velocity (km/s) versus depth (m) for log data (left) and sample data (right)]

% Find out options for linreg stixbox function


help linreg

linreg(vell,vels)

[Figure: linreg output; sample velocities versus log velocities with fitted line, confidence band and prediction interval]

A pointwise confidence band for the expected y-value is plotted, as well as a dashed line
which indicates the prediction interval given x (e.g. by default the dashed line encompasses
95% of the data).

Now add arguments for confidence interval and polynomial degree (from 1-3).

Which polynomial model fits the data? Also check out the identify stixbox function.
___________________________________________________________________________

9 WHEN TO USE NONPARAMETRIC TECHNIQUES


One must use a statistical technique called nonparametric if it satisfies at least one of the
following five types of criteria:

1. The data entering the analysis are enumerative - that is, count data representing the
number of observations in each category or cross-category.

2. The data are measured and/or analyzed using a nominal scale of measurement.

3. The data are measured and/or analyzed using an ordinal scale of measurement.

4. The inference does not concern a parameter in the population distribution - as, for
example, the hypothesis that a time-ordered set of observations exhibits a random pattern.

5. The probability distribution of the statistic upon which the analysis is based is not
dependent upon specific information or assumptions about the population(s) from which the
sample(s) are drawn, but only on general assumptions, such as a continuous and/or
symmetric population distribution.

By this definition, the distinction of nonparametric is accorded either because of the level
of measurement used or required for the analysis, as in types 1 through 3; the type of
inference, as in type 4; or the generality of the assumptions made about the population
distribution, as in type 5. For example, one may use the Mann-Whitney rank test as a
nonparametric alternative to Student's t-test when one does not have normally distributed
data.

Mann-Whitney: To be used with two independent groups (analogous to the independent
groups t-test) (stixbox function test2r)

Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the
related samples t-test) (stixbox function test1r)

Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-
factor between-subjects ANOVA)

Friedman: To be used with two or more related groups (analogous to the single-factor
within-subjects ANOVA)

Spearman's rank correlation coefficient: Rank correlation is useful if variables are not
normally distributed. For example, the depth or time ranges of the occurrence of a
particular fossil can be cross-correlated (stixbox function spearman)
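A minimal sketch of rank correlation computed directly from ranks with the built-in corrcoef
(rather than via the stixbox spearman function, and ignoring the tie corrections a full
implementation would need); the depth values are illustrative:

% Spearman's rank correlation from first principles (no tie handling)
a = [12 35 47 60 72 88 95 110];           % illustrative occurrence depths
b = [15 30 58 55 80 99 85 120];
[dummy, ia] = sort(a); ra(ia) = 1:length(a);    % ranks of a
[dummy, ib] = sort(b); rb(ib) = 1:length(b);    % ranks of b
R = corrcoef(ra, rb);
spearman_rho = R(1, 2)                    % close to +1: similar orderings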

10 DIRECTIONAL AND ORIENTED DATA
10.1 INTRODUCTION
Directional and oriented data abound in geology and geophysics.

Directional data include:

1) Asymmetrical ripples
2) Flute marks
3) Faults (downthrown side known)
4) Belemnites
5) Gastropods

Oriented data include:

1) Symmetrical ripples
2) Grooves
3) Joints
4) Graptolites
5) Crinoid stems

10.2 ROSE PLOTS


___________________________________________________________________________
Worked example: Plotting dip directions on a rose plot

% Dips and dip directions of faults measured on the 592 North Level at
% Brown's Creek Copper-Gold Mine, Blayney NSW.

fdip = [48 65 83 74 87 52 56 57 60 73 79 52 43 57 69 71 69 87 70 35 33 75 ...
87 72 15 34 60 63 59 59 64 60 42 40 80 97 79 69 83 60 79 58 66 16 78 69 76 ...
82 84 73 85 73 74 81 77 52 69 68 81 83 71 83 87 78 69 63 74 81 86 87 77 69 ...
72 74 19 86 81 74 37 31 74 79 84 85 85 79 86 82 73 75 76 78 69 86 72 35 47 ...
51 55 82 79 71 86 81 84 72 79 76 85 78 79 85 60 81 71 75 74 71 78 50 84 77 ...
41 53 74 60];

fdipdir = [330 070 084 350 275 081 084 324 075 050 069 001 016 098 085 095 ...
108 114 107 271 339 305 110 311 123 324 295 295 055 096 268 290 067 105 090 ...
087 119 092 280 086 089 088 094 280 302 047 042 274 107 114 314 098 349 088 ...
255 090 092 065 078 078 085 028 045 091 117 285 295 104 273 105 108 034 048 ...
079 145 300 297 103 081 087 079 173 072 071 283 060 256 270 259 249 087 093 ...
105 109 108 125 095 103 105 291 266 250 062 268 271 284 279 278 105 295 268 ...
292 278 096 293 109 107 314 160 108 273 245 118 313 286 112];

These data can be loaded from a file called "browns_creek.dat".

% Use Middleton's function grose3 for plotting

load browns_creek.dat

% Dip direction
fdipdir=browns_creek(:,2);

% Plot dip directions on rose plot


% Find out how grose3 works
help grose3

figure
grose3(fdipdir,24,1,0)

[Figure: Rose plot of dip directions from grose3(fdipdir,24,1,0)]

figure
grose3(fdipdir,24,1,1)

[Figure: Rose plot of dip directions from grose3(fdipdir,24,1,1)]

What is the difference between these two plots?


Which one visually overemphasises high frequencies?

10.3 PLOTTING AND CONTOURING ORIENTED DATA ON STEREONETS

10.3.1 Types of stereonets

1) Equal-angle stereonet, also termed Wulff net. It maintains angular relationships within
the projection plane of the stereonet. For example, if the small circle intersection of a
cone with the lower hemisphere is plotted, on an equal-angle net the shape of this surface
will project as a perfect circle.

2) Equal-area stereonet, also termed Schmidt net. It maintains the proportion of the lower
hemisphere surface projected to the plane of the net. In other words, no preferred
alignment of data will be apparent if the data are truly random.

10.3.2 Which net to use for which purpose?


For the plotting of a large number of structural data elements, we must use the
equal area net to remove any bias when interpreting the average trend of the data. For
this reason most structural geologists will carry the equal area net with them in the field.

In effect, both nets preserve the angular relationships between lines and planes in three
dimensional space; however, when these elements are projected into the two dimensional
plane of the net diagram they are somewhat distorted on the equal area stereonet.

Equal-angle (Wulff) stereonets are used in crystallography because the plotted
angular relationships are preserved, and can be measured directly from the stereonet plot.
Equal-area (Schmidt) stereonets are also commonly used in structural geology because
they present no statistical bias when large numbers of data are plotted. On the equal-area
net area is preserved so, for example, each 2° polygon on the net has the same area. In
structural geology the stereonet is assumed to be a lower-hemisphere projection since all
structural elements are defined to be inclined below the horizontal. This is unlike
crystallographic projections where elements may plot on either the upper or lower
hemisphere.

___________________________________________________________________________
Worked example: Plotting and contouring oriented data on stereonets

Now we'll use a couple of Middleton's scripts to create a Schmidt (equal area) net, and plot
the Browns Creek data.

schmidt
Snetplot
% Now the data can be contoured
vgcnt3(X);
pause

Snetplot will ask you for input interactively, i.e. you have to type in the data file name
(e.g. "browns_creek.dat"), and the symbol for the plot (e.g. '+b' will plot blue plus signs).

___________________________________________________________________________

10.4 TESTS OF SIGNIFICANCE OF MEAN DIRECTION

We might be interested in calculating a prevailing current direction, a paleoslope, or the
direction of flow in magma. To do this, we have to make sure that the data are not random, and
that they don't have more than one mode. Here Rayleigh's test R is used to assess the
significance of a mean direction.

R = (1/n) √[ (Σ sinθi)² + (Σ cosθi)² ]     (sums over i = 1, ..., n)
___________________________________________________________________________
Worked example: Rayleigh's test

Files ex_5_2_a_fa.dat and ex_5_2_a_fb.dat contain two sets of paleocurrent
measurements from sandstone ripple foresets. Is there any evidence for a preferred
trend in either data set?

1) Calculate R based on above equation. You have to convert all angles to radians in
Matlab.

2) The critical value for R has to be taken from an appropriate table (see Swan and
Sandilands, 1995) for a given n and α=0.05. In this case, it is 0.27.

3) If the calculated value exceeds the critical value, we reject the null hypothesis (i.e. there is
a preferred trend). The direction of the preferred trend is found by:

θ = arctan( Σ sinθ / Σ cosθ )

%Rayleighs test
load ex_5_2_a_fa.dat;

fa=ex_5_2_a_fa;

histo(fa)
figure
na=length(fa);

%convert into radians


rfa=(fa.*pi)/180;

ssrfa=(sum(sin(rfa)))^2;
scrfa=(sum(cos(rfa)))^2;

Rfa=(1/na)*(sqrt(ssrfa+scrfa))

%Rfa > Rcrit


% null hypothesis rejected

%Direction of the preferred trend


dira=atan((sum(sin(rfa)))/(sum(cos(rfa))));

dirtenda=(dira/pi)*180
___________________________________________________________________________

For more examples for directional data analysis see Swan and Sandilands (1995).

11 SPATIAL DATA ANALYSIS: CONTOURING UNEQUALLY SPACED DATA
Spatial data collected in the field are often irregularly spaced. In order to grid them onto a
regular grid and plot them, MATLAB provides functions called "griddata" and "interp2".
These functions let you grid the data using four different methods:

1) 'linear' - Triangle-based linear interpolation (default).


2) 'cubic' - Triangle-based cubic interpolation.
3) 'nearest' - Nearest neighbor interpolation.
4) 'v4' - MATLAB 4 griddata method (Delaunay triangulation).
In order to assess the difference between these methods, we will take one of Middleton's data
files in /local/matlabr12/local/middleton called "dmap.dat". This is a data set from
Middleton's book, from a topographic data set of Davis (1986). It is a relatively small
unequally spaced data set. The lat and lon ranges are between about 0 and 6.5.

Let's plot an interpolated and gridded version of these data.


___________________________________________________________________________

Worked example: Gridding data
load dmap.dat
ti = 0:0.25:6.5;
[xi,yi,zi] = griddata(dmap(:,1),dmap(:,2),dmap(:,3), ti, ti');
v = 700:25:950;
contour(ti, ti', zi, v), axis('square')

Now let's look at a mesh version of this in 3d, with a defined perspective, then add the data to
it (note, these commands are in the runfile "gd2.m"):

mesh(xi,yi,zi);
view(170,30);
hold on;
plot3(dmap(:,1), dmap(:,2), dmap(:,3), 'o')
hold off;

Notice the "view" command. Let's do a "help view" to understand what that defined. Again,
we can get rid of the default colourising of the grid by defining any colour as a new colour
map, such as:

colormap(hot)

Try using different matlab colormaps:


hsv - Hue-saturation-value color map.
hot - Black-red-yellow-white color map.
gray - Linear gray-scale color map.
bone - Gray-scale with tinge of blue color map.
copper - Linear copper-tone color map.
pink - Pastel shades of pink color map.
white - All white color map.
lines - Color map with the line colors.
colorcube - Enhanced color-cube color map.
vga - Windows colormap for 16 colors.
jet - Variant of HSV.
prism - Prism color map.
cool - Shades of cyan and magenta color map.
autumn - Shades of red and yellow color map.
spring - Shades of magenta and yellow color map.
winter - Shades of blue and green color map.
summer - Shades of green and yellow color map.

Now use triangle based cubic and nearest neighbor interpolation to grid the data and then plot
them again. Do you notice differences?

Lastly, use the "interp2" function to grid the data. Type:

help interp2

to find out how it works. Try the "spline" option, plot the data again, and evaluate the
difference to previous results.
___________________________________________________________________________

12 OVERVIEW OF COMPUTER INTENSIVE
STATISTICAL INFERENCE PROCEDURES
12.1 INTRODUCTION
Resampling procedures, also commonly referred to as computer intensive statistical
inference procedures, may be used to assess the significance of a statistic in a hypothesis test
or to determine the lower and upper bounds for a confidence interval when the usual
assumptions of parametric statistical procedures are not met (Manly, 1991). Computer
intensive procedures require the recomputation of hundreds or thousands of artificially
constructed data sets. Like other nonparametric statistical procedures, these procedures
existed as theory on paper long before they were brought into the practical mainstream. The
Monte Carlo method of resampling, for example, was introduced by Barnard in 1963 (Noreen,
1989), but at that time could only be illustrated and implemented operationally on very small
sample sizes.

However, with the advent of fast, inexpensive computing, essentially since around 1990,
the use of computer intensive procedures has grown dramatically, particularly in the area of
basic academic research. Actually, with the widespread availability of powerful personal
computers and free statistical software like the stixbox for Matlab and Octave that even
brings resampling-type methods right into the home, the name computer intensive seems
today to be as anachronistic as it was descriptive just a few years ago.

Computer intensive procedures are for probability estimation; that is, they are used to
calculate p-values for test statistics or lower and upper bounds for confidence intervals
without relying on classical inferential assumptions like normality of the sampling
distribution and the Central Limit Theorem (Noreen, 1989). The taxonomy of computer
intensive methods is difficult to define, primarily for three reasons:

1) there are many subtly different ways to perform each of the methods, with each way
leading to slightly different results and interpretations;

2) each theoretician that has contributed to the literature on computer intensive
procedures has a name for his procedure, and sees unique links to the other
procedures; and

3) asymptotically, all the procedures are forms of each other and of the permutation test
(Noreen, 1989).

Manly (1991) and Noreen (1989) concur on a taxonomy that divides computer intensive
procedures into two related yet unique streams: randomization methods and Monte Carlo
methods.

12.2 MONTE CARLO METHODS

12.2.1 Introduction
Monte Carlo methods are most often used in simulation studies on computer-generated
data, to show how the p-values of a statistical test behave, or how an estimation method
performs, when no convenient real data exist. Monte Carlo methods are used to make
inferences about the population from which a sample has been drawn. The Monte Carlo
methods are Monte Carlo estimation, bootstrapping, the jackknife, and Markov Chain Monte
Carlo estimation.

It’s important to note that Monte Carlo refers to the type of resampling process used to
produce a probability estimate, not the act of generating a data set. That is, it is possible to
computer-generate a data set, and numerous recombinations of it, without it being a Monte
Carlo procedure. For example, Hambleton, Swaminathan, and Rogers (1991) identify a
procedure for detecting differential item functioning in IRT-calibrated test items that uses one
or more simulations of test data to generate sets of item and examinee ability parameters for
comparison, but this is not truly a Monte Carlo procedure. On the other hand, Yen (1986), in
assessing the distributional qualities of Thurstonian absolute scaling, presents a method for
generating two simulation samples of data drawn at random from an assumed population. In
this case, even though only two samples are drawn at a time, this is, according to Noreen
(1989), considered a Monte Carlo method.

12.2.2 Monte Carlo Estimation


The "original" Monte Carlo estimation method, as introduced by Barnard in 1963
(Noreen, 1989), is used to test the hypothesis that the data are a random sample from a
specified population. This method requires that a computational model of the specified
population be available, so that simulated random samples can be generated for use in
computing the test statistic(s) of interest (Manly, 1991). For example, to use Monte Carlo
estimation on a sample of high school mathematics test scores, it would first be necessary to
create a model for the true population distribution of test scores. As Noreen (1989) points out,
"the Monte Carlo estimation method is particularly valuable in situations where the
population distribution is known, but the sampling distribution of a statistic has not been
analytically derived." As a drawback, Monte Carlo studies (a) often simulate data that fail to
take into account the kinds of anomalies encountered in "real" data, and (b) are not
accompanied by sufficient inferential information to allow the reader to make an informed
decision about the models being tested.

The steps for performing a Monte Carlo procedure, as given by Noreen (1989), are as follows (a minimal MATLAB/Octave sketch is given after the list):
1. [Given a "real" sample of n observations from a defined population and a computed statistic of interest]
Identify a model of the population from which simulated samples are to be drawn.

2. Generate a large number N of simulated samples of size n, and compute the statistic of interest for each
sample.

3. Order the computed simulated sample statistics in a distribution, called the "Monte Carlo distribution" of
the statistic.

4. Map the "real" statistic to the Monte Carlo distribution; use the would-be percentile rank of the "real"
statistic to estimate its p-value.
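
As an illustration only (the population model, the sample size, and the choice of the sample
mean as the statistic are assumptions made for this sketch, not part of Noreen's description),
the four steps might look like this in MATLAB/Octave:

x = randn(20,1)*1.2 + 0.3;         % the "real" sample of n observations (example data)
n = length(x);
mu0 = 0; sigma0 = 1;               % step 1: model of the population (here a normal distribution)
N = 1000;                          % step 2: number of simulated samples
stat = zeros(N,1);
for i = 1:N
  xsim = mu0 + sigma0*randn(n,1);  % simulated sample of size n drawn from the model
  stat(i) = mean(xsim);            % statistic of interest for each simulated sample
end
stat = sort(stat);                 % step 3: the "Monte Carlo distribution" of the statistic
p = sum(abs(stat - mu0) >= abs(mean(x) - mu0)) / N;   % step 4: two-sided p-value estimate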

The procedure for establishing a confidence interval for the "real" statistic in a Monte
Carlo procedure is analogous to that used in a randomization test, as described above. As with
randomization confidence intervals, a Monte Carlo confidence interval need not be
symmetrical.

12.2.3 Bootstrapping
According to legend, Baron Münchhausen saved himself from drowning in quicksand by
pulling himself up using only his bootstraps. The statistical bootstrap, which uses
resampling from a given set of data to mimic the variability that produced the data in
the first place, has a rather more dependable theoretical basis and can be a highly effective
procedure for estimation of error quantities in statistical problems.

Bootstrapping is a special case of Monte Carlo estimation (Efron, 1982). Bootstrapping
procedures use resampling with replacement from an already-drawn sample. Bootstrapping is
used most often to approximate standard errors and associated p-values on estimates of
population parameters when the sampling distribution of the target population is either
indeterminate or difficult to obtain empirically.

Efron & Tibshirani (1993) provide the generic algorithm for performing a bootstrapping
procedure as follows (a minimal MATLAB/Octave sketch is given after the list):
1. [Given a sample of size n and a calculated sample statistic of interest] Draw a random "bootstrap" sample
of size n with replacement (that is, an observation, once drawn, may be drawn again), and calculate the
"bootstrap" statistic of interest from this sample.

2. Repeat step (1) a large number N of times.

3. Estimate the "bootstrap standard error" of the parameter of interest using the N bootstrap statistics as the
inputs for the usual standard error equation.
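
A minimal sketch of these steps in plain MATLAB/Octave (the example sample, the number of
resamples and the choice of the mean as the statistic are arbitrary choices for illustration; the
stixbox functions stdboot and ciboot listed at the end of this chapter provide ready-made
versions of this kind of computation):

x = randn(30,1);                   % example sample; replace with the real data
n = length(x);
N = 2000;                          % step 2: number of bootstrap resamples
bstat = zeros(N,1);
for i = 1:N
  idx = ceil(n*rand(n,1));         % step 1: draw n indices with replacement
  bstat(i) = mean(x(idx));         % bootstrap statistic for this resample
end
se_boot = std(bstat);              % step 3: bootstrap standard error of the mean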

An estimate of the bias of the statistic of interest is obtained simply by subtracting the
original sample statistic from the mean of the bootstrap statistics. While this bias estimate may be
useful as a descriptive tool for readers (Efron, 1982), a problem arises from "adjusting out"
the bias: "the bootstrap bias estimator from a single sample contains an indeterminate amount
of random variability along with bias, and this may artificially inflate the mean squared error
of the statistic" (Mooney & Duval, 1993).

A shortcoming of bootstrapping is that all methods for estimating bootstrap confidence
intervals rely to some degree on either the normal or t-distribution (Efron & Tibshirani, 1993).
For N reasonably large, however, this should not pose a problem, even for relatively small
sample sizes (Mooney & Duval, 1993), although no cited studies have shown the behavior of
such confidence intervals on extremely small, nonnormal data sets. The various procedures
for establishing a confidence interval around the "real" statistic all use some form of point
estimate plus or minus a normal (or t) variate times the bootstrap standard error.
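
In terms of the bootstrap sketch above, the normal-approximation form of such an interval
might look like the following (the 95% level and the use of the sample mean are illustrative
assumptions):

ci = mean(x) + [-1 1] * 1.96 * se_boot;   % point estimate +/- normal variate times the bootstrap standard error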

There are a variety of bootstrapping methods available; the shift method and the normal
approximation method are two popular methods (Noreen, 1989). According to Noreen, both
of these methods are frequently used "for estimating significance levels based on the
bootstrap sampling distribution … the ‘shift’ method assumes that the bootstrap sampling
distribution and the null hypothesis sampling distribution have the same shape but different
distributions… the ‘normal approximation’ method assumes that the null hypothesis sampling
distribution is normal; the bootstrap sampling distribution is used only to estimate the
variance of the normal distribution."

In the literature on computer intensive methods, bootstrapping methods are considered
somewhat speculative (Mooney & Duval, 1993). Bootstrap-derived probability estimates of
population parameters are highly sample-sensitive as well as sample size-sensitive and, in
contrast to Monte Carlo methods, "bootstrapping relies on an analogy between the sample and
the population from which the sample is drawn." Monte Carlo studies (!) on simulated highly
nonnormal distributions have shown bootstrap estimates to be discomfortingly liberal with
respect to Type I error (Mooney & Duval, 1993). Currently, bootstrapping is most commonly
used to estimate population variances in the absence of conventional parametric estimation
assumptions (Noreen, 1989).

12.2.4 The "Jackknife"


First presented by Tukey in 1958, the jackknife is a special case of the bootstrap. The
procedure given by Mooney & Duval (1993) for performing a jackknife is (a short
MATLAB/Octave sketch of the common leave-one-out case is given after the list):
1. [Given a sample of size n and a sample estimate, θ̂, of the parameter of interest, θ] Divide the sample
into g exhaustive and mutually exclusive subsamples of size h, such that gh = n.

2. Drop out one subsample from the entire original sample. Calculate θ̂(-1) from that reduced sample of
size (g-1)h.

3. Calculate the "pseudovalue", θ*g, from this θ̂(-1) by weighting:
θ*g = g θ̂ - (g - 1) θ̂(-1)

4. Repeat steps 2 and 3 for all g subsamples, yielding a vector of g pseudovalues.

5. Take the mean of these pseudovalues to yield the "jackknife estimate" of θ.
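
A minimal sketch of the common leave-one-out case (h = 1, so g = n), again using the sample
mean as the statistic purely for illustration:

x = randn(25,1);                    % example sample; replace with the real data
n = length(x);
theta_hat = mean(x);                % sample estimate of the parameter of interest
pseudo = zeros(n,1);
for i = 1:n
  xr = x([1:i-1, i+1:n]);           % step 2: drop out subsample i (here a single observation)
  theta_m1 = mean(xr);              % estimate from the reduced sample
  pseudo(i) = n*theta_hat - (n-1)*theta_m1;   % step 3: pseudovalue
end
theta_jack = mean(pseudo);          % step 5: jackknife estimate of theta
bias_hat = theta_hat - theta_jack;  % first-order bias estimate (see the next paragraph)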

Since the jackknife estimate of θ is second-order unbiased, an estimate of the first-order
bias of θ̂ is simply the difference of θ̂ and the jackknife estimate, similar to the bias
estimate from bootstrapping. Yet while the jackknife is intuitive and relatively easy to
perform on a computer, it also has several severe weaknesses. Like bootstrap estimates,
jackknife estimates are purely sample-referenced; inference to a population is at best
speculative. However, jackknife estimates "fail for markedly nonlinear statistics such as the
sample median, unlike bootstrap estimates" (Efron, 1982).

Since bootstrapping is considered a more general statistical procedure and works at least
as well as the jackknife in most situations, the jackknife is generally only of historical interest
today. An exception to this is in the area of identifying influential cases or strata in a
statistical model. Here, "the case subgroup or stratum pseudovalue can be used to determine
whether the subgroup or stratum has a greater-than-average effect on the overall parameter
estimate than other subgroups" (Mooney & Duval, 1993). This useful by-product of jackknife
estimation remains popular.

12.2.5 Markov Chain Monte Carlo Estimation


During the past decade, there have been interesting new applications of the Monte Carlo
framework. The most popular of these, judging from the literature, is Markov Chain Monte
Carlo estimation, or "MCMC," which combines the traditional Monte Carlo method with
Bayesian inference. Essentially, the assumed prior sampling distribution of the population
parameter becomes the "population" from which the Monte Carlo procedure draws many
random samples. Both the sampling and the "update" of the assumed distribution continue
iteratively until the parameter estimates converge. This procedure is said to provide superior
information in the case in which the sampling distribution of the true population of interest is
unknown but a "starting point," or prior distribution, is reasonable (Draper, 1995).

Applications of MCMC currently popular in the literature include Gibbs sampling and
data augmentation. Both of these have been used in the context of hierarchical modeling in
the social sciences. Gibbs sampling has been particularly popular in recent years, as interest in
modeling multilevel social phenomena has unearthed the problem of estimating level-wise
parameters when marginal posterior distributions of those parameters are unknown. This
problem is elegantly resolved by use of Gibbs sampling, in which Monte Carlo estimation is
performed on conditional posterior distributions of known form (e.g., "known" normals, as in
standard scores). Plots of Monte Carlo-estimated values then reveal the contours of the
unknown marginal posterior and joint posterior distributions of the parameters of interest, and
descriptions and inferences can be presented (Seltzer, 1993). For a complete introduction to
data augmentation, I refer the reader to Tanner & Wong (1987); for Gibbs sampling, Gelfand
& Smith (1990).

12.2.6 Meta-Analysis
Harwell (1990) provides an overview of many types of Monte Carlo studies done for the
purpose of illustrating the performance of certain statistical methods, such as single- and
multifactor ANOVA and ANCOVA, multiple regression, and hierarchical models. Harwell
reviews methods employed for synthesizing Monte Carlo studies, and criticizes the way
Monte Carlo studies are conducted and reported without regard for some "overarching theory" to guide
interpretation. He presents a five-step strategy for summarizing Monte Carlo results that
includes specific problem formulation, data design and collection, data evaluation, analysis
and interpretation, and presentation of results. In a later article (1992), Harwell embellishes
this strategy specifically for one- and two-factor fixed effects ANOVA.

12.2.7 Multivariate Modelling


Draper (1995) provides an extensive coverage of the use of hierarchical models (HM) in
social science, particularly in education. He illustrates the relative advantages of using HM in
educational settings, stating that "multi-level school analyses seem to have been waiting for
HM to come along." Draper criticizes earlier studies that use traditional statistical methods,
such as multiple regression and ANOVA, on clearly hierarchically nested data. He points out
that, prior to implementing HM in education, researchers need to carefully consider design
issues and inference-related interpretations of results so as not to mislead the body of
knowledge in the process. Draper recommends the increased use of MCMC methods, and
particularly Gibbs sampling, in place of the more common maximum likelihood estimation
procedures for student- and school-level effects on HMs.

Seltzer (1993) draws the same conclusion about the use of Gibbs sampling with HMs,
providing more detail on what goes wrong when, under the standard assumptions of normality
in HM, "fat-tailed" data is simulated. Seltzer concludes that many MCMC studies may
already exist that mislead the reader as to the precision of the HM primarily because proper
treatment for platykurtic data has not been addressed.

Several other authors present Monte Carlo investigation studies for performance of
statistical estimation procedures: Bacon (1995) for performance of correlational outlier
identification methods over a variety of data distribution types; Muthen (1994) for testing of a
Bayesian process for filling in missing data for covariance structure modeling; Wolins (1995)
for comparing the speed and efficiency of maximum likelihood and unweighted least squares
estimation procedures in factor analysis, over many samples and with a variety of
distributions represented; and Finch, et al. (1997) for examining bias in the estimation of
indirect effects and their standard errors, using simulated skewed data, in structural equation
models using maximum likelihood estimation.

These studies represent the type of research, almost entirely investigative in nature, that
has arisen from the advances in Monte Carlo methods. While highly sophisticated and likely
inaccessible beyond the abstract and discussion to those not intimately familiar with the
procedures, these studies nevertheless convey the complex dimensionality and depth of
inquiry now achievable using these new methods and a good computer.

12.2.8 Overall Assessment of Strengths and Weaknesses


Strengths

• Randomization methods: Uniformly most powerful testing, provided N large, even
when the population distribution of interest is unknown;

• Randomization methods: Conceptually fairly accessible;

• Monte Carlo estimation: Plenty of literature, on-line references, and e-mail/listserve
advice (some good, some not) available to users; useful for investigating the properties
of statistical estimates under extreme conditions, such as IRT parameter estimates;
useful in meta-analysis if results are presented in a meaningful format;

• MCMC and variations: Very promising for investigation of complex, multilevel social
phenomena, especially with HMs; provide both descriptive and inferential
information; great for use with missing data and/or latent variables;

• All methods: Easily adaptable (or perhaps even already adapted) to a user’s
substantive field;

• All methods: Easily and inexpensively automated on computer.

Weaknesses

• All methods: Not yet widely available in commercial PC- or LAN-based statistical
packages, although more widely available for use on mainframe computers;

• All methods: Requires computer automation, distancing the user/student from the
procedure and potentially undermining the acquisition of conceptual understanding;

• All methods: Plethora of procedures and variations, some appearing to border on the
proprietary; epidemic inconsistency of terms, symbols, and jargon across methods;
nearly overwhelming number and variety of applications to substantive fields; all of
these again alienating the user from the (at times rather simple) underlying concepts;

• Bootstrapping: Sensitive to sample characteristics and sample size; reliant on "faith" in
the population distribution; erratic in estimating with nonnormal samples; p-values
and confidence intervals still depend in some way on an assumed normal or t
distribution;

• Jackknife: Largely irrelevant, especially when bootstrapping procedures available;
same weaknesses as bootstrapping;

• Monte Carlo estimation: Tends to be overrated/overused, probably because it is "in
vogue;" often used with reasonably estimable distributions, providing little more than
conceptually and computationally simpler nonparametric or parametric methods could
have.

• MCMC and variations: Conceptually inaccessible to most audiences; not procedurally
relevant to studies other than multilevel HMs and other multidimensional inquiries.

In the Matlab/Octave stixbox, various Monte Carlo-type methods are implemented:


covjack - Jackknife estimate of the variance of a parameter estimate.
covboot - Bootstrap estimate of the variance of a parameter estimate.
stdjack - Jackknife estimate of the parameter standard deviation.
stdboot - Bootstrap estimate of the parameter standard deviation.
rboot - Simulate a bootstrap resample from a sample.
ciboot - Bootstrap confidence interval.
test1b - Bootstrap t test and confidence interval for the mean.

13 REFERENCES
13.1 GEOSCIENCES
Davis, J.C., 1973, Statistics and data analysis in geology, Wiley International, 550 pp.
Middleton, 1999, Data analysis in the earth sciences using Matlab, Prentice Hall.
Swan, A.R.H., and Sandilands, M., 1995, Introduction to geological data analysis, Blackwell
Science, 446 pp.

13.2 GENERAL
Akkermans, W. M. W. (1994). Monte Carlo estimation of the conditional Rasch model.
Research Report 94-09, Faculty of Educational Science and Technology, University of
Twente, The Netherlands.

Bacon, D. R. (1995). A maximum likelihood approach to correlational outlier identification.
Multivariate Behavioral Research, 30(2), 125-148.
Baxter M., Exploratory Multivariate Analysis in Archaeology, pp. 167-170, Edinburgh
University Press, Edinburgh, 1994.
Christakos G., Modern Spatiotemporal Geostatistics, Oxford University Press, 2000.
Cornwell, J. M., & Ladd, R. T. (1993). Power and accuracy of the Schmidt and Hunter meta-
analytic procedures. Educational and Psychological Measurement, 53(4), 877-895.

Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of
Educational and Behavioral Statistics, 20(2), 115-147.
Edgington, E. S. (1987). Randomization tests (2nd Ed.). New York: Marcel Dekker.

Efron, B. (1982). The Jackknife, the Bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics.
Efron B., and R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, 1994.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman
& Hall.
Finch, J. F., et al. (1997). Effects of sample size and nonnormality on the estimation of
mediated effects in latent variable models. Structural Equation Modeling, 4(2), 87-107.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling based approaches to calculating marginal
densities. Journal of the American Statistical Association, 85, 398-409.

Good, P. (1994). Permutation tests: A practical guide to resampling methods for testing
hypotheses. New York: Springer-Verlag New York.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage Publications.
Harwell, M. R. (1997). An investigation of the Raudenbush (1988) test for studying variance
homogeneity. Journal of Experimental Education, 65(2), 181-190.

Harwell, M. R. (1997). Analyzing the results of Monte Carlo studies in item response theory.
Educational and Psychological Measurement, 57(2), 266-279.
Harwell, M. R. (1990). Summarizing Monte Carlo results in methodological research. Journal
of Educational Statistics, 17(4), 297-313.
Harwell, M. R., Rubinstein, E.N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte
Carlo results in methodological research: The one- and two-factor fixed effects
ANOVA cases. Journal of Educational Statistics, 17(4), 315-339.
Ludbrook, J., & Dudley, H. (1998). Why permutation tests are superior to t and F tests in
biomedical research. The American Statistician, 52(2), 127-132.
Manly F., Multivariate Statistical Methods: A Primer, Chapman and Hall, London, 1986.
Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. London, U.K.:
Chapman & Hall
Marco D., Building and Managing the Meta Data Repository: A Full Lifecycle Guide,
John Wiley, 2000.

Mooney, C. Z., & Duval, R. D. (1993). Bootstrapping: A nonparametric approach to


statistical inference. Newbury Park, CA: Sage Publications.
Muthen, B. (1994). A simple approach to inference in covariance structure modeling with
missing data: Bayesian analysis. Evaluative report; Project 2.4; available from ERIC:
Identifier ED379321.
Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction.
New York: John Wiley & Sons.
Seltzer, M. H. (1993). Sensitivity analysis for fixed effects in the hierarchical model: A Gibbs
sampling approach. Journal of Educational Statistics, 18(3), 207-235.
Shao J., and D. Tu, The Jackknife and Bootstrap, Springer Verlag, 1995.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data
augmentation (with discussion). Journal of the American Statistical Association, 82,
528-550.
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing. New York: John
Wiley & Sons.

Westphal Ch., T. Blaxton, Data Mining Solutions: Methods and Tools for Solving Real-
World Problems, John Wiley, 1998.
Wolins, L. (1995). A Monte Carlo study of constrained factor analysis using maximum
likelihood and unweighted least squares. Educational and Psychological Measurement,
55(4), 545-557.

Yen, W.M. (1986). The choice of scale for educational measurement: An IRT perspective.
Journal of Educational Measurement, 23(4), 299-325.

14 APPENDIX
14.1 TABLES

14.1.1 Critical values of R for Rayleigh's test

14.1.2 Values of concentration parameter K from R for Rayleigh's test

14.1.3 Critical values of Spearman's rank correlation coefficient

14.1.4 Critical values of T for Mann-Whitney Test (α=5%)

14.2 STIXBOX CONTENTS
A statistics toolbox for Matlab and Octave.
Version 1.29, 10-May-2000
GNU Public Licence Copyright (c) Anders Holtsberg.
Comments and suggestions to andersh@maths.lth.se.

Distribution functions.
dbeta - Beta density function.
dbinom - Binomial probability function.
dchisq - Chisquare density function.
df - F density function.
dgamma - Gamma density function.
dhypg - Hypergeometric probability function.
dlognorm - The log-normal density function.
dnorm - Normal density function.
dt - Student t density function.
dweib - The Weibull density function.
dgumbel - The Gumbel density function.

pbeta - Beta distribution function.


pbinom - Binomial cumulative probability function.
pchisq - Chisquare distribution function.
pf - F distribution function.
pgamma - Gamma distribution function.
phypg - Hypergeometric cumulative probability function.
plognorm - The log-normal distribution function.
pnorm - Normal distribution function.
pt - Student t cdf.
pweib - The Weibull distribution function.
pgumbel - The Gumbel distribution function.

qbeta - Beta inverse distribution function.


qbinom - Binomial inverse cdf.
qchisq - Chisquare inverse distribution function.
qf - F inverse distribution function.
qgamma - Gamma inverse distribution function.
qhypg - Hypergeometric inverse cdf.
qlognorm - The inverse log-normal distribution function.
qnorm - Normal inverse distribution function.
qt - Student t inverse distribution function.
qweib - The Weibull inverse distribution function.
qgumbel - The Gumbel inverse distribution function.

rbeta - Random numbers from the beta distribution.


rbinom - Random numbers from the binomial distribution.
rchisq - Random numbers from the chisquare distribution.
rf - Random numbers from the F distribution
rgamma - Random numbers from the gamma distribution.
rhypg - Random numbers from the hypergeometric distribution.
rlognorm - Log-normal random numbers.
rnorm - Normal random numbers (use randn instead).
rt - Random numbers from the student t distribution.
rweib - Random numbers from the Weibull distribution.
rgumbel - Random numbers from the Gumbel distribution.

Logistic regression.
ldiscrim - Compute a linear discriminant and plot the result.
logitfit - Fit a logistic regression model.
lodds - Log odds function.

loddsinv - Inverse of log odds function.

Various functions.
bincoef - Binomial coefficients.
cat2tbl - Take category data and produce a table of counts.
getdata - Some famous multivariate data sets.
quantile - Empirical quantile (percentile).
ranktrf - Rank transform data.
spearman - Spearman's rank correlation coefficient.
stdize - Standardize columns to have mean 0 and standard deviation 1.
corr - Correlation coefficient.
cvar - Covariance.

Resampling methods.
covjack - Jackknife estimate of the variance of a parameter estimate.
covboot - Bootstrap estimate of the variance of a parameter estimate.
stdjack - Jackknife estimate of the parameter standard deviation.
stdboot - Bootstrap estimate of the parameter standard deviation.
rboot - Simulate a bootstrap resample from a sample.
ciboot - Bootstrap confidence interval.
test1b - Bootstrap t test and confidence interval for the mean.

Tests, confidence intervals, and model estimation.


cmpmod - Compare small linear model versus large one.
contincy - Test for contingency table row-column independence.
ciquant - Nonparametric confidence interval for quantile.
lsfit - Fit a least squares model.
lsselect - Select a predictor subset for regression.
test1n - Tests and confidence intervals, one normal sample.
test1r - Test for median equals 0 using rank test.
test2n - Tests and confidence intervals, two normal samples.
test2r - Test for equal location of two samples using rank test.
normmix - Estimate a mixture of normal distributions.

Graphics.
qqgamma - Gamma probability paper plot (and estimate).
qqnorm - Normal probability paper plot.
qqplot - Plot empirical quantile vs empirical quantile.
qqweib - Weibull probability paper plot.
qqgumbel - Gumbel probability paper plot.
kaplamai - Plot Kaplan-Meier estimate of survivor function.
linreg - Linear or polynomial regression, including plot.
histo - Plot a histogram (alternative to hist).
plotsym - Plot with symbols.
plotdens - Draw a nonparametric density estimate.
plotempd - Plot empirical distribution.
identify - Identify points on a plot by clicking with the mouse.
pairs - Pairwise scatter plots.

14.3 COMPUTATIONAL TOOLS AND DEMOS ON THE INTERNET


Annotated Review of Statistical Tools on the Internet
http://ubmail.ubalt.edu/~harsham/Business-stat/opre504.htm#rbw

Analysis of Variance, by B. Lewis.


http://nimitz.mcs.kent.edu/~blewis/stat/anova.html

Statistical Calculators hosted at UCLA. Material here includes: Power Calculator, Statistical
Tables, Regression and GLM Calculator, Two Sample Test Calculator, Correlation and
Regression Calculator, and CDF/PDF Calculators.
http://www.stat.ucla.edu/calculators

External Links, by SPSS, Free resources for spss, excel, word & more...
http://www.spss.org/wwwroot

Interactive Statistics, by University of Illinois. Examples from over five calculators include:
Data, Correlations, Scatter plot, Box Models, and Chisquare Applet.
http://www.stat.uiuc.edu/

Interactive Statistical Calculation, by John Pezzullo. Web pages that perform most common
statistical calculations.
http://members.aol.com/johnp71/javastat.html

Guide to Basic Laboratory Statistics, by B. Lewis. This is an informal guide to elementary
inferential statistical methods used in the laboratory. It is not a text on statistics. Instead, the
focus is on the proper planning of experiments and the interpretation of results. Some
examples include: Spearman's Rank Correlation, Simple Least Squares Data Fitting.
http://nimitz.mcs.kent.edu/~blewis/stat/scon.html

Java Applets. Includes demos for: Distributions (Histograms, Normal Approximation to
Binomial, Normal Density, The T distribution, Area Under Normal Curves, Z Scores & the
Normal Distribution).
http://www.isds.duke.edu/sites/java.html

Statistics (Guide to basic stats labs, ANOVA, Confidence Intervals, Regression, Spearman's
rank correlation, T-test, Simple Least-Squares Regression, and Discriminant Analysis).
Demos: Contains a few interesting demos, such as changing the parameters of various
distributions, convergence of the t-distribution to the normal, etc.
http://www-stat.stanford.edu/

Online Statistical Textbooks, by Haiko Lüpsen.


http://www.uni-koeln.de/themen/Statistik/onlinebooks.html

Statistics: The Study of Stability in Variation, by Jan de Leeuw, The Textbook has
components which can be used on all levels of statistics teaching. It is disguised as an
introductory textbook, perhaps, but many parts are completely unsuitable for introductory
teaching. Its contents are Introduction, Analysis of a Single Variable, Analysis of a Pair of
Variables, and Analysis of Multi-variables.
http://www.stat.ucla.edu/textbook

Introductory Statistics: Concepts, Models, and Applications, by David Stockburger. It
represents over twenty years of the author's experience in teaching the material contained
therein. The high price of textbooks and a desire to customize course material for his own
needs caused him to write this material. It contains projects, interactive exercises, animated
examples of the use of statistical packages, and inclusion of statistical packages.
http://www.psychstat.smsu.edu/sbk00.htm

Selecting Statistics, Cornell University. Answer the questions therein correctly, then Selecting
Statistics leads you to an appropriate statistical test for your data.
http://trochim.human.cornell.edu/selstat/ssstart.htm

Statistical training on the web, by Mike Talbot


http://www.bioss.sari.ac.uk/~mike/webtra.htm

SURFSTAT Australia, by Keith Dear, Summarizing and Presenting Data, Producing Data,
Variation and Probability, Statistical Inference, Control Charts.
http://www.anu.edu.au/nceph/surfstat/surfstat-home/surfstat.html

Introduction to Quantitative Methods, by Gene Glass, A basic statistics course in the College
of Education at Arizona State University.
http://olam.ed.asu.edu/%7eglass/502/home.html

Sanda Kaufman's Teaching Resources, contains teaching resources for a variety of topics
including quantitative methods.
http://cua6.csuohio.edu/~sanda/teach.htm

Some experimental pages for teaching statistics, by Juha Puranen, contains some different
methods for visualizing statistical phenomena, such as Power and Box-Cox transformations.
http://noppa5.pc.helsinki.fi/koe/index.htm

Statistical Home Page by David C. Howell, Containing statistical material covered in the
author's textbooks (Statistical Methods for Psychology and Fundamental Statistics for the
Behavioral Sciences), but it will be useful to others not using this book. It is always under
construction.
http://www.uvm.edu/~dhowell/StatPages/StatHomePage.html

Elementary Statistics, by J. McDowell. Contents: Frequency distributions, Statistical
moments, Standard scores and the standard normal distribution, Correlation and regression,
Probability, Sampling Theory, Inference: One Sample, Inference: Two Samples.
http://www.cc.emory.edu/EMORY_CLASS/PSYCH230/psych230.html

Encyclopedia Britannica, Description of some elementary topics in statistics.


http://search.eb.com/

VassarStats, by Richard Lowry, On-line elementary statistical computation.


http://faculty.vassar.edu/~lowry/VassarStats.html

Teaching Activities, by Statistics Canada, Contains interactive exercises focusing on data


analysis and survey skills.
http://www.statcan.ca/english/kits/teach.htm

