Department of Geophysics
e-mail: norbert.szabo.phd@gmail.com
Why Geostatistics?
How often does a specific value occur in the data set?
How many data occur below a specific value?
How can data frequency be modelled mathematically?
What is the most characteristic value in the measurement area?
What is the standard deviation of the data set?
How do we handle incorrect data?
How can we estimate measurements at points which have not been measured, based on other measurements?
What kind of relation does a data type have with other data?
Isaaks and Srivastava (1989)
Why Geostatistics?
Isaaks and Srivastava (1989)
What is the probability of joint occurrence of data?
Is there a relation between data sets or are they independent?
How strong is the relation between data sets, positive or negative?
How do we describe this functional relation mathematically and use it to interpolate the result to unmeasured locations?
How do we estimate the model parameters from the data?
What is the error of estimation?
How do we sort data in the case of a big data set?
Statistical sample of two variables
Course Outline
Data Sets, Histograms and Density Functions
Determination of the Mode
Characterization of Uncertainty
Maximum Likelihood Method
Measure of Joint Variability
Variogram Analysis and Kriging
Regression Analysis
Multi-Dimensional Samples and Scaling
Factor Analysis and Principal Component Analysis
Cluster Analysis
Linear Optimization Methods
Global Optimization Methods
Selected Bibliography
Edward H. Isaaks, R. Mohan Srivastava, 1989. An Introduction to Applied Geostatistics. Oxford University Press
Martin H. Trauth, 2006. MATLAB Recipes for Earth Sciences. Springer
Ferenc Steiner, 1991. The Most Frequent Value: Introduction to a Modern Conception of Statistics. Academic Press, Budapest
William Menke, 2012. Geophysical Data Analysis: Discrete Inverse Theory. MATLAB Edition. Elsevier
Zbigniew Michalewicz, 1992. Genetic Algorithms + Data Structures = Evolution Programs. Springer
Other journal articles, e-books and chapters
Data Sets, Histograms and
Density Functions
Data Represented on the Continuum of
Numbers
Observed data set - repeated radioactive measurements on the same rock
sample, with the same instrument and in the same environment
A different number of gamma particles is detected by the instrument over the same time period: during the decay of atomic nuclei, the number of particles emitted in equal time intervals is not constant. The measured values scatter around a typical (expected) value
Moving along the number line from left to right, the data density increases until it reaches a maximum, then decreases again. Depending on the position of the interval [a, b], we can expect values with different frequency. In repeated measurements, the number of data varies in the different intervals, but the overall data density does not change
Occurrence of Data
Attention - during the measurement the same data can occur
repeatedly
Plot the number of occurrences against the values
Steiner (1990)
Data Represented by Histogram
Plot the ratio ni/n on the ordinate axis; this is the relative frequency. Then the histogram is independent of the number of data (the data density distribution does not change). The ratio 100*ni/n gives the percentage of data inside the i-th subinterval
Plot the ratio ni/(n*h) on the ordinate axis. Then the total area of the histogram's columns equals 1. The i-th rectangular area is proportional to the number of data in the i-th subinterval
If the value of h is too big, the global data density image is distorted. If h is too small, large-amplitude fluctuations disturb the data analysis
Steiner (1990)
Walker Lake, Nevada (USA)
Steiner (1990)
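A minimal MATLAB sketch of these scaling choices (synthetic data; the bin width h is an assumption of this example) could be:

% Minimal sketch: histogram scaled to relative frequency and to unit area
x = randn(1000,1);                 % synthetic sample
h = 0.5;                           % assumed subinterval (bin) width
edges = min(x):h:max(x)+h;         % bin edges
ni = histc(x, edges);              % counts per subinterval
n  = length(x);
relfreq = ni/n;                    % relative frequency (independent of n)
density = ni/(n*h);                % column heights giving total area of 1
bar(edges, density, 'histc');
xlabel('x'); ylabel('n_i/(nh)');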
Properties of PDF
The whole area under the PDF curve is 1 (100%):

\int_{-\infty}^{\infty} f(x)\,dx = 1

Uniform distribution:

f_u(x) = \frac{1}{S} \quad \text{for} \quad T - \frac{S}{2} \le x \le T + \frac{S}{2}, \qquad 0 \ \text{otherwise}

Standard Laplace distribution:

f_L(x) = \frac{1}{2}\, e^{-|x|}

General form of the Laplace PDF:

f_L(x) = \frac{1}{2S}\, e^{-\frac{|x - T|}{S}}

Steiner (1990)

Standard Cauchy distribution:

f_C(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}

General form of the Cauchy PDF:

f_C(x) = \frac{1}{\pi}\,\frac{S}{S^2 + (x - T)^2}

Steiner (1990)

The PDF model parameters (T, S) can be fitted to the observed data density by least squares:

\sum_{i} \left[y_i - f(x_i, T, S)\right]^2 \rightarrow \min

Steiner (1990)
Asymmetric PDF Models
Weibull distribution Lognormal distribution
Steiner (1990)
Walker Lake, Nevada (USA)
PDF CDF
http://evolution-textbook.org
Properties of CDF
The CDF is the primitive function of the PDF: f(x) = \frac{dF(x)}{dx}

Arithmetic mean of the sample:

\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k = \frac{x_1 + x_2 + \dots + x_n}{n}

Weighted mean (with weights w_k):

\bar{x}_w = \frac{\sum_{k=1}^{n} w_k x_k}{\sum_{k=1}^{n} w_k} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}

Median - the value separating the higher half of the sample. For a data set, it may be thought of as the "middle" value

med(x) = x_{(n+1)/2} \ \text{if } n \text{ is odd}, \qquad med(x) = \frac{x_{n/2} + x_{(n+2)/2}}{2} \ \text{if } n \text{ is even}
Walker Lake, Nevada (USA)
\bar{V} = \frac{1}{100}\sum_{k=1}^{100} V_k = \frac{V_1 + V_2 + \dots + V_{100}}{100} = 97.55 \ \mathrm{ppm}

med(V) = \frac{V_{50} + V_{51}}{2} = 100.5 \ \mathrm{ppm}
MATLAB Recipe 1
% Put your sample here
clc;
clear all;
x=[-1 2.2 3.6 4 9.8]',
w=[0.5 1 2 1 0.5]',
n=length(x);
avr=0;
for k=1:n
    avr=avr+x(k);
end
avr=avr/n;
weigavr=0;
aux=0;
for k=1:n
    weigavr=weigavr+(w(k)*x(k));
    aux=aux+w(k);
end
weigavr=weigavr/aux;
x=sort(x);
if mod(n,2)==0
    med=0.5*(x(n/2)+x((n+2)/2));
else
    med=x((n+1)/2);
end
avr,
weigavr,
med,

Output:
x = -1.0000  2.2000  3.6000  4.0000  9.8000
w = 0.5000  1.0000  2.0000  1.0000  0.5000
avr = 3.7200
weigavr = 3.5600
med = 3.6000
MATLAB Recipe 2
>> x=[-1 2.2 120 3.6 4 9.8]'
x =
   -1.0000
    2.2000
  120.0000
    3.6000
    4.0000
    9.8000
>> w=[0.5 1 0.0001 2 1 0.5]'
w =
    0.5000
    1.0000
    0.0001
    2.0000
    1.0000
    0.5000
>> (w'*x)/sum(w)
ans =
    3.5624
Most Frequent Value
Weighted average - data far from the bulk of the data get small weights, data near the MFV get higher weights

M = \frac{\sum_{i=1}^{n} \varphi_i\, x_i}{\sum_{i=1}^{n} \varphi_i}, \qquad \varphi_i = \frac{\varepsilon^2}{\varepsilon^2 + (x_i - M)^2}

Large dihesion (ε) = big weights go to each data point
Small dihesion (ε) = small or zero weights go to outlying data
Estimation of MFV
Automated iterative process - in general, the values of M and ε are calculated simultaneously by a recursion formula (j is the number of iterations). In the first step, M_1 takes the value of the mean or the median, and the first approximation of the dihesion is estimated from the range of the sample:

\varepsilon_1 = \frac{\sqrt{3}}{2}\left(\max x_i - \min x_i\right)

\varepsilon_{j+1}^2 = 3\,\frac{\sum_{i=1}^{n} \dfrac{(x_i - M_{n,j})^2}{\left[\varepsilon_j^2 + (x_i - M_{n,j})^2\right]^2}}{\sum_{i=1}^{n} \dfrac{1}{\left[\varepsilon_j^2 + (x_i - M_{n,j})^2\right]^2}}

M_{n,j+1} = \frac{\sum_{i=1}^{n} \dfrac{\varepsilon_{j+1}^2}{\varepsilon_{j+1}^2 + (x_i - M_{n,j})^2}\, x_i}{\sum_{i=1}^{n} \dfrac{\varepsilon_{j+1}^2}{\varepsilon_{j+1}^2 + (x_i - M_{n,j})^2}}
Estimation of MFV
x=[-12.5 -6.7 -2 -1.5 0.1 2.4 6.8 9.8 15 23.5 30]

[Figure: convergence of the most frequent value M and of the dihesion ε over 30 iterations, started from M(1)=mean(x) and M(1)=median(x)]
Estimation of MFV Outlier
[Figure: the same iteration applied to a sample containing an outlier - convergence of M and ε over 30 iterations, started from M(1)=mean(x) and M(1)=median(x)]
MATLAB Recipe 3
clear all;
x=[-12.5 -6.7 -2 -1.5 0.1 2.4 6.8 9.8 15 23.5 30 100];
M1=mean(x);
epsilon1=0.5*sqrt(3)*(max(x)-min(x));
itermax=30;
for j=1:itermax
    szaml=0; szaml2=0;
    nev=0; nev2=0;
    seg=0; seg2=0;
    if j==1
        epsilon(j)=epsilon1;
        M(j)=M1;
    else
        for i=1:length(x)
            seg=(x(i)-M(j-1))^2;
            szaml=szaml+3*((seg)/(((epsilon(j-1)^2)+seg)^2));
            nev=nev+(1/((epsilon(j-1)^2)+seg)^2);
        end
        epsilon(j)=sqrt(szaml/nev);
        for i=1:length(x)
            seg2=(epsilon(j-1)^2)/((epsilon(j-1)^2)+((x(i)-M(j-1))^2));
            szaml2=szaml2+(seg2*x(i));
            nev2=nev2+seg2;
        end
        M(j)=szaml2/nev2;
    end
end

Output:
>> mean(x)
ans = 13.7417
>> median(x)
ans = 4.6000
>> M(1)
ans = 13.7417
>> M(itermax)
ans = 3.2485
>> epsilon(1)
ans = 97.4279
>> epsilon(itermax)
ans = 10.3280
Expected Value of Random Variable
Relative frequency - how often the event A (one data value) happens, divided by the number of all outcomes (the measured data set)
Probability - with an increasing number of measurements, the relative frequency fluctuates around P(A), which gives the probability of event A happening compared to all outcomes
Random variable - a variable whose possible values are numerical outcomes of a random phenomenon. For a discrete random variable, the probability of occurrence of x_k (k = 1, 2, …, n) is p_k (n is the number of possible outcomes)

p_k = P(x = x_k), \qquad \sum_{k=1}^{n} p_k = 1

The relative frequency in a narrow interval of width h around x_0 is approximately

\frac{n_0}{nh} \approx f(x_0), \qquad \frac{n_0}{n} \approx h\, f(x_0), \qquad E_n \approx \sum_{k} x_k\, h\, f(x_k) \rightarrow E = \sum_{k=1}^{n} x_k\, p_k
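A short numerical sketch of the discrete expected value (a hypothetical fair-die example):

% Minimal sketch: expected value of a discrete random variable
xk = [1 2 3 4 5 6];            % possible outcomes (e.g. a fair die)
pk = ones(1,6)/6;              % probabilities, sum(pk) = 1
E  = sum(xk.*pk)               % expected value, here 3.5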
Comparison of Typical Values
Balance example - consider the following data set including six data values, one of which is an outlier. The source of an outlier can be a defective instrument, a wrong measurement, data transfer or recording errors, etc.
It can be seen that the sample mean is very sensitive to the presence of the outlier, while the median and the most frequent value give more realistic estimations
Outlier
Measure the deviation of the data from a reference value x_0 by the distances

p = 1: \ \rho_1 = \sum_{i=1}^{n} |x_i - x_0| \quad \text{or} \quad p = 2: \ \rho_2 = \sum_{i=1}^{n} (x_i - x_0)^2
It can be seen that if x_i is far from the most frequently occurring range of x, these distances become large. The effect of large deviations can be reduced by a suitably chosen \varepsilon^2, for instance as

\sum_{i=1}^{n} \frac{(x_i - x_0)^2}{\varepsilon^2 + (x_i - x_0)^2}
Overall Deviation from the Mode
Decouple the characteristic distance from n and identify its unit of measurement with the unit of x

L_p = \left[\frac{1}{n}\sum_{i=1}^{n} |x_i - x_0|^p\right]^{1/p}, \qquad p=1: \ L_1 = \frac{1}{n}\sum_{i=1}^{n}|x_i - x_0|, \qquad p=2: \ L_2 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - x_0)^2}
P_k = \left\{\prod_{i=1}^{n}\left[1 + \frac{(x_i - x_0)^2}{k}\right]\right\}^{1/(2n)}, \qquad k=1: \ P_1 = \left\{\prod_{i=1}^{n}\left[1 + (x_i - x_0)^2\right]\right\}^{1/(2n)}, \qquad k=2: \ P_2 = \left\{\prod_{i=1}^{n}\left[1 + \frac{(x_i - x_0)^2}{2}\right]\right\}^{1/(2n)}

Steiner (1990)
Variance of Random Variable
The variance (square of the standard deviation) characterizes the random variable's deviation from the expected value (the expected squared deviation of the random variable from its mean). For a discrete random variable it is defined as

\sigma_n^2 = \sum_{k=1}^{n} (x_k - E_n)^2\, p_k

Gaussian PDF written with the location and scale parameters (T, S):

f_G(x) = \frac{1}{S\sqrt{2\pi}}\, e^{-\frac{(x - T)^2}{2S^2}}

The same written with the expected value and standard deviation (E, σ):

f_G(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - E)^2}{2\sigma^2}}
Unbiased Estimation of Variance
The empirical standard deviation (σ_n) is a biased estimate of the theoretical standard deviation (σ), because E(σ_n) ≠ σ:

E(\sigma_n^2) = \frac{n-1}{n}\,\sigma^2

The corrected (unbiased) estimate is therefore

\sigma^{*2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{n}{n-1}\,\sigma_n^2

Sample output of the corrected variance computed in MATLAB: CorrVariance = 49.4444
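A minimal MATLAB sketch contrasting the biased and corrected estimates (hypothetical sample):

% Minimal sketch: biased vs. corrected empirical variance (hypothetical sample)
x = [2 5 7 12 13 19 21]';                   % hypothetical data
n = length(x);
VarBiased    = sum((x-mean(x)).^2)/n        % divides by n (biased)
CorrVariance = sum((x-mean(x)).^2)/(n-1)    % divides by n-1, same as var(x)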
Isaaks and Srivastava (1989)
Walker Lake, Nevada (USA)
\bar{V} = \frac{1}{100}\sum_{k=1}^{100} V_k = 97.55 \ \mathrm{ppm}

\sigma_n^2 = \frac{1}{100}\sum_{k=1}^{100}\left(V_k - \bar{V}\right)^2 = 688 \ \mathrm{ppm}^2

\sigma_n = \sqrt{\frac{1}{100}\sum_{k=1}^{100}\left(V_k - \bar{V}\right)^2} = \sqrt{688 \ \mathrm{ppm}^2} = 26.23 \ \mathrm{ppm}

f_G(V) = \frac{1}{\sigma(V)\sqrt{2\pi}}\, e^{-\frac{(V - E(V))^2}{2\sigma^2(V)}} = \frac{1}{26.23\sqrt{2\pi}}\, e^{-\frac{(V - 97.55)^2}{2\cdot 26.23^2}}
Confidence Intervals
The dihesion characterizes the interval of most frequently occurring
data. It also shows what percentage of data can be expected in
intervals whose lengths are several times the dihesion
Confidence level is the percentage frequency of possible confidence
intervals that contain the true value of their corresponding parameter
Steiner (1990)
Confidence Intervals
Two-thirds of the data are expected in the intersextile range [-Q, Q] (66% confidence level), and half of the data in the interquartile range [-q, q] (50% confidence level). Error-characterizing quantities are the semi-interquartile range (q) and the semi-intersextile range (Q): -q is the lower quartile (1/4 of the data is less than this), q is the upper quartile (1/4 of the data is higher than this), -Q is the lower sextile (1/6 of the data is less than this), and Q is the upper sextile (1/6 of the data is higher than this)
A percentile (or a centile) is a measure indicating the value below which a given
percentage of observations fall. For instance, the 10th percentile is the value below
which 10% of the observations may be found
Confidence Intervals of Gaussian PDF
Standard Gaussian PDF
Skewness
The k-th central moment is defined as E((x - E(x))^k), where k is a
positive integer. For instance, the variance is equal to the second
central moment (k=2)
Skewness is a measure of the asymmetry of a unimodal PDF, which can be directly calculated from the sample

\mathrm{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}}

Kurtosis characterizes the peakedness of the PDF; the excess kurtosis is

\mathrm{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{2}} - 3
[Figure: histogram of the measured magnetic field [nT], Profile-N, Telkibánya, Hungary (2013)]

title('Profile-N');
grid on;
xatls=0;
for i=1:N
    xatls=xatls+DATA(i,2);
end
E=xatls/N,
s1=0;
for k=1:N
    s1=s1+((DATA(k,2)-E)*(DATA(k,2)-E));
end
Sigma=sqrt(s1/(N-1)),
Skew=skewness(DATA(:,2)),
Kurt=kurtosis(DATA(:,2))-3,
Propagation of Error
If the unmeasured quantity q depends on other measurable quantities, meaning that q = q(x, y, …), then by measuring the variables (x, y, …) and knowing the measurement errors (Δx, Δy, …), the mean of q and its absolute error Δq can be derived as

q = \bar{q} \pm \Delta q, \quad \text{where} \quad \bar{q} = q(\bar{x}, \bar{y}, \dots) \quad \text{and} \quad \Delta q = \left|\frac{\partial q}{\partial x}\right|\Delta x + \left|\frac{\partial q}{\partial y}\right|\Delta y + \dots

A property of variance applied to independent random variables gives

\sigma_q^2 = \left(\frac{\partial q}{\partial x}\right)^2_{\bar{x},\bar{y},\dots}\sigma_x^2 + \left(\frac{\partial q}{\partial y}\right)^2_{\bar{x},\bar{y},\dots}\sigma_y^2 + \dots
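A short sketch of these rules for a hypothetical quantity q = x*y (so that ∂q/∂x = y and ∂q/∂y = x; the measured values and errors are assumptions of the example):

% Minimal sketch: propagation of error for q = x*y (hypothetical values)
x = 10;  dx = 0.2;                  % measured value and its absolute error
y = 4;   dy = 0.1;
q  = x*y;
dq = abs(y)*dx + abs(x)*dy          % absolute error of q
sq = sqrt((y*dx)^2 + (x*dy)^2)      % standard deviation form, treating dx, dy as sigmas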
Maximum Likelihood Method
Estimation of Location Parameter
Let us assume that we know the type
of the PDF f(x) and its scale parameter
(S). The location parameter (T) is to be
determined
We search for the optimal value of T
that maximizes the probability of
getting the n data values we
observed. The parameter estimation
method is called the maximum
likelihood method (ML)
Consider a 10-element data set of
Cauchy distribution with a scale
parameter S=1
Select a small interval Δx and calculate the probabilities f(x_i)Δx at the locations of the observed data. The optimal value of T is given at the maximum of the product of the probabilities calculated over the data set
Steiner (1990)
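A minimal numerical sketch of this idea, assuming synthetic Cauchy data with S = 1 and using fminsearch to minimize the negative log-likelihood:

% Minimal sketch: ML estimation of the Cauchy location parameter T (S = 1 assumed)
S = 1;
x = 5 + S*tan(pi*(rand(10,1)-0.5));                 % 10 synthetic Cauchy data around T = 5
negLogL = @(T) sum(log(pi*S*(1 + ((x-T)/S).^2)));   % negative log-likelihood
T_ML = fminsearch(negLogL, median(x))               % start the search from the median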
Likelihood Function
Likelihood function - the condition of finding the optimum using the maximum likelihood method (the factor (Δx)^n can be neglected because it is independent of T)
L = \prod_{i=1}^{n} f(x_i, T) = f(x_1, T)\, f(x_2, T)\cdots f(x_n, T) \rightarrow \max

For Gaussian-distributed data:

L = \prod_{i=1}^{n} f_G(x_i, S, T) = \prod_{i=1}^{n} \frac{1}{S\sqrt{2\pi}}\, e^{-\frac{(x_i - T)^2}{2S^2}} = \left(\frac{1}{S\sqrt{2\pi}}\right)^{n} e^{-\frac{1}{2S^2}\sum_{i=1}^{n}(x_i - T)^2}

It is simpler to maximize the log-likelihood:

L^* = \ln L = -n\ln S - \frac{n}{2}\ln 2\pi - \frac{1}{2S^2}\sum_{i=1}^{n}(x_i - T)^2 \rightarrow \max

\frac{\partial L^*}{\partial T} = \frac{1}{S^2}\sum_{i=1}^{n}(x_i - T) = 0, \qquad \frac{\partial L^*}{\partial S} = -\frac{n}{S} + \frac{1}{S^3}\sum_{i=1}^{n}(x_i - T)^2 = 0
(x_1 - T) + (x_2 - T) + \dots + (x_n - T) = 0 \ \Rightarrow \ T = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \qquad S^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - T)^2

The sample mean is an unbiased estimate of the expected value, and its variance decreases with n:

E(\bar{x}) = E\!\left(\frac{x_1 + \dots + x_n}{n}\right) = \frac{1}{n}\left[E(x_1) + \dots + E(x_n)\right] = \frac{n\,E(x)}{n} = E(x)

\sigma_{\bar{x}}^2 = \sigma^2\!\left(\frac{x_1 + \dots + x_n}{n}\right) = \frac{1}{n^2}\left[\sigma^2(x_1) + \dots + \sigma^2(x_n)\right] = \frac{n\,\sigma_x^2}{n^2} = \frac{\sigma_x^2}{n}
As a consequence, for large n and a finite standard deviation, the standard deviation of the mean decreases as 1/√n, i.e. the estimation accuracy increases proportionally to √n
Measure of Joint Variability
Uncorrelated Variables
Let x(x1, x2) be a two-dimensional random variable. The joint probability density function f(x1, x2) gives the probability of the first measurement falling in a small interval around x1 and, at the same time, the second falling in a small interval around x2
For uncorrelated (independent) variables, the joint probability density function is the product of the individual densities, f(x1, x2) = f(x1)·f(x2)
Menke (1984)
Covariance
Divide the (x1, x2) plane into four quadrants. Then form the function (x1 − x̄1)(x2 − x̄2) from the data, multiply it by the values of the PDF and sum over the plane. The resulting quantity is the covariance, which is the measure of the joint variability of two random variables
Menke (1984)
For uncorrelated variables cov = 0, because the positive and negative contributions of the quadrants are equal and cancel out. For correlated variables cov ≠ 0. The sign of the covariance shows the tendency in the linear relationship between the variables: it is either positive (direct relationship) or negative (inverse relationship)
Properties of Covariance
The theoretical formula and properties of covariance:

\mathrm{cov}(x, y) = E\{[x - E(x)][y - E(y)]\} = E(xy) - E(x)E(y)

\sigma^2(x + y) = \sigma^2(x) + \sigma^2(y) + 2\,\mathrm{cov}(x, y)

|\mathrm{cov}(x, y)| \le \sigma(x)\,\sigma(y), \qquad \mathrm{cov}(x, x) = \sigma^2(x)

Correlation coefficient and its empirical estimate:

r(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma(x)\,\sigma(y)}, \qquad r_n = \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k - \bar{x})^2 \sum_{k=1}^{n}(y_k - \bar{y})^2}}

The sign of the correlation coefficient informs about the direction of co-variation
Walker Lake, Nevada (USA)
r = \frac{\sum_{k=1}^{100}(U_k - \bar{U})(V_k - \bar{V})}{\sqrt{\sum_{k=1}^{100}(U_k - \bar{U})^2 \sum_{k=1}^{100}(V_k - \bar{V})^2}} = 0.84
[Figure: B/z map, axes in m; Szabó (2004)]
MATLAB Recipe 7
[Figure: histograms of the total magnetic field T [nT] measured along the profiles y = 15 m and y = 18 m, and their cross-plot]

Sample output of the covariance matrix and correlation coefficient:

kovar =
   23099   12635
   12635   13041

var_x = 151.98
r = 0.64

Rank (Spearman) correlation - the correlation coefficient computed from the ranks of the data rather than from the data values themselves:

\rho_n = \frac{\frac{1}{n}\sum_{k=1}^{n}\left(\mathrm{rank}\,x_k - \overline{\mathrm{rank}\,x}\right)\left(\mathrm{rank}\,y_k - \overline{\mathrm{rank}\,y}\right)}{\sigma_{\mathrm{rank}\,x}\,\sigma_{\mathrm{rank}\,y}} = 0.95

Zhang (2009)
Note that points Z3 and Z5 should get higher weights than Z7, because they are in the same geological unit as Z0. How can we take this prior geological information into consideration during the interpolation?
Spatial Correlation of Porosity
[Figure: spatial correlation of porosity as a function of lag distance, Bohling (2005)]
Variogram
Assume that the data are normally distributed and their fluctuation can be characterized by a standard deviation
Our task - a given quantity at a point is estimated as the weighted average of the known data in its vicinity; the optimal values of the weights are those for which the variance of the estimate is minimal
Semi-variogram - a function describing the degree of spatial dependence of a spatial random variable. The curve γ(h) gives half of the mean squared difference of the observed variable Z over all data pairs separated by the distance (lag) h

\gamma(h) = \frac{1}{2n(h)}\sum_{i=1}^{n(h)}\left[Z(r_i) - Z(r_i + h)\right]^2

The covariance required for kriging is calculated from the variogram (Bohling, 2005):

\mathrm{COV}\left[Z(r), Z(r+h)\right] = C(h) = C(0) - \gamma(h)
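A minimal sketch of an empirical semi-variogram for a hypothetical, regularly sampled 1-D profile (lags in sample units):

% Minimal sketch: empirical semi-variogram gamma(h) along a 1-D profile
z = cumsum(randn(200,1));                 % hypothetical spatially correlated data
maxlag = 20;
gam = zeros(maxlag,1);
for h = 1:maxlag
    d = z(1+h:end) - z(1:end-h);          % all pairs separated by lag h
    gam(h) = sum(d.^2)/(2*length(d));     % half the mean squared difference
end
plot(1:maxlag, gam, 'o-'); xlabel('lag h'); ylabel('\gamma(h)');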
Kriging
Kriging is a robust interpolation method for estimating the properties of unsampled points using the information of the surrounding data and their spatial variance (it is not very sensitive to the variogram model and its directional dependence)
Approximate the unknown value of the physical property Z at a point P0 with the weighted average of the values Z(Pi) measured close to P0

\hat{Z}(P_0) = \sum_{i=1}^{n} w_i\, Z(P_i)

The unknown weights can be determined from the system W = K^{-1}D, with the help of which the unknown Z(P0) can be calculated. The estimation error (variance) is obtained as \sigma^2 = \mathbf{w}^{T}\mathbf{D}
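A minimal sketch of the W = K⁻¹D step for simple kriging, with hypothetical point locations and an assumed exponential covariance model:

% Minimal sketch: simple kriging weights from W = K^-1 * D
xi = [1 2 4 7]';  zi = [10 12 9 14]';    % hypothetical data locations and values
x0 = 3;                                   % location to be estimated
C  = @(h) 10*exp(-abs(h)/5);              % assumed exponential covariance model C(h)
n  = length(xi);
K  = C(abs(repmat(xi,1,n) - repmat(xi',n,1)));   % covariances between data points
D  = C(abs(xi - x0));                     % covariances between data points and P0
W  = K\D;                                 % kriging weights
Z0 = W'*zi                                % estimated value at P0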
Walker Lake, Nevada (USA)
[Figure: empirical variogram with a fitted exponential model, and the kriging interpolation result]
[Figure: B/z map, axes in m; Szabó (2004)]
Nyékládháza, North-East Hungary
Empirical variogram
Result of interpolation
Determine the optimal values of the parameters m and a using the LSQ method. The best fit between the measured and calculated data is given at the minimum of the objective function

E = \sum_{i=1}^{n}\left(y_i^{(m)} - y_i^{(c)}\right)^2 = \sum_{i=1}^{n}\left(y_i^{(m)} - m x_i - a\right)^2 \rightarrow \min
Other norms of the misfit between the measured data and the fitted function f(x):

p = 1: \ L_1 = \frac{1}{n}\sum_{i=1}^{n}\left|y_i^{(m)} - f(x_i)\right|, \qquad p = 2: \ L_2 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[y_i^{(m)} - f(x_i)\right]^2}, \qquad p = \infty: \ L_\infty = \max_{i}\left|y_i^{(m)} - f(x_i)\right|

The regression function may contain several unknown coefficients p = (p_1, p_2, …, p_J), which are estimated by minimizing

E = \sum_{i=1}^{N}\left[y_i^{(m)} - f(x_i, \mathbf{p})\right]^2 \rightarrow \min

For polynomial regression the model function is

f(x, \mathbf{p}) = f(x, p_1, p_2, \dots, p_J) = \sum_{j=0}^{J} p_j\, x^{j}
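A minimal sketch of the straight-line fit via the normal equations (synthetic data; polyfit gives the same coefficients):

% Minimal sketch: LSQ fit of y = m*x + a via the normal equations (synthetic data)
x = (1:10)';
y = 2.5*x + 1 + 0.3*randn(10,1);     % synthetic measured data
G = [x ones(10,1)];                  % columns belonging to m and a
p = (G'*G)\(G'*y);                   % p(1) = m, p(2) = a
m = p(1), a = p(2)
% check: polyfit(x, y, 1) returns the same [m a]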
The joint cumulative distribution function of two discrete variables and its marginal:

F(x_0, y_0) = P(x \le x_0,\ y \le y_0) = \sum_{x_i \le x_0,\ y_k \le y_0} p_{ik}, \qquad F(\infty, y_0) = P(x < \infty,\ y \le y_0) = \sum_{y_k \le y_0} P(y = y_k)

For continuous variables, the joint PDF is obtained from the joint CDF as

f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\, \partial y}
Bivariate Normal Distribution
For the bivariate Gaussian distribution, the marginal probability density functions of the variables x and y (with standard deviations σ1, σ2 and expected values m1, m2) also follow a normal distribution

f_G(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{(x - m_1)^2}{2\sigma_1^2}}, \qquad f_G(y) = \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{(y - m_2)^2}{2\sigma_2^2}}

The joint PDF with correlation coefficient r:

f_G(x, y) = \frac{1}{2\pi\,\sigma_1\sigma_2\sqrt{1 - r^2}}\, \exp\!\left\{-\frac{1}{2(1 - r^2)}\left[\frac{(x - m_1)^2}{\sigma_1^2} - 2r\,\frac{(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \frac{(y - m_2)^2}{\sigma_2^2}\right]\right\}
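A short sketch that evaluates this joint PDF on a grid with mvnpdf from the Statistics Toolbox (the parameter values are assumptions of the example):

% Minimal sketch: bivariate Gaussian PDF on a grid (assumed parameters)
m1 = 0; m2 = 0; s1 = 1; s2 = 2; r = 0.6;
C  = [s1^2 r*s1*s2; r*s1*s2 s2^2];          % covariance matrix
[X, Y] = meshgrid(-4:0.1:4, -6:0.1:6);
F = mvnpdf([X(:) Y(:)], [m1 m2], C);        % density at every grid point
surf(X, Y, reshape(F, size(X))); shading interp;
xlabel('x'); ylabel('y'); zlabel('f_G(x,y)');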
Multi-Dimensional Observations
The recorded values of observations are called data. During
exploration projects, the data set can contain different kinds of
measurements with different magnitude and measurement units
collected from a large number of geological objects from all over the
investigated area (surface or subsurface)
Data have two specifications. Object - one element of a geological
formation/structure that we observe. Feature - physical property of
the object element (i.e. measured variable)
Let us collect the data of J types measured on I objects into a data matrix

\mathbf{D} = \begin{pmatrix} d_{11} & \cdots & d_{1j} & \cdots & d_{1J} \\ \vdots & & \vdots & & \vdots \\ d_{i1} & \cdots & d_{ij} & \cdots & d_{iJ} \\ \vdots & & \vdots & & \vdots \\ d_{I1} & \cdots & d_{Ij} & \cdots & d_{IJ} \end{pmatrix}
Data Vectors
The i-th object vector derived from data matrix D contains all features of the
i-th object, and the j-th feature vector includes the values of the j-th feature
observed on different objects
\mathbf{d}_i^{(obj)} = \left(d_{i1}, d_{i2}, \dots, d_{iJ}\right), \qquad \mathbf{d}_j^{(feat)} = \begin{pmatrix} d_{1j} \\ d_{2j} \\ \vdots \\ d_{Ij} \end{pmatrix}

Centering - removing the column mean from each feature:

d_{ij}' = d_{ij} - \bar{d}_j, \qquad \bar{d}_j = \frac{1}{I}\sum_{i=1}^{I} d_{ij}

Extent scaling - scaling data into the interval [0, 1], which means a constant shift and compression. The elements of the j-th feature vector are now in the interval [0, 1]; the scaled variable is dimensionless

d_{ij}' = \frac{d_{ij} - \min d_{ij}}{\max d_{ij} - \min d_{ij}}

Scaling into an arbitrary interval - scaling data into the interval [A, B], where A and B are the desired lower and upper limits of the new interval

d_{ij}' = A_j + \left(B_j - A_j\right)\frac{d_{ij} - \min d_{ij}}{\max d_{ij} - \min d_{ij}}
MATLAB Recipe 8
>> d=[1 3.2 -5.6 5 8 11]'
d =
    1.0000
    3.2000
   -5.6000
    5.0000
    8.0000
   11.0000
>> datl=mean(d)
datl =
    3.7667
>> duj=d-datl
duj =
   -2.7667
   -0.5667
   -9.3667
    1.2333
    4.2333
    7.2333
>> mean(duj)
ans =
  -5.9212e-016
>> sigma=std(d)
sigma =
    5.7875
>> duj2=(d-mean(d))/sigma
duj2 =
   -0.4780
   -0.0979
   -1.6184
    0.2131
    0.7315
    1.2498
>> mean(duj2)
ans =
  -1.1102e-016
>> std(duj2)
ans =
     1
>> max(d)
ans =
    11
>> duj3=d/max(d)
duj3 =
    0.0909
    0.2909
   -0.5091
    0.4545
    0.7273
    1.0000
>> max(duj3)
ans =
     1
>> min(d)
ans =
   -5.6000
>> duj4=(d-min(d))/(max(d)-min(d))
duj4 =
    0.3976
    0.5301
         0
    0.6386
    0.8193
    1.0000
>> min(duj4)
ans =
     0
>> max(duj4)
ans =
     1
>> A=-10, B=20
A =
   -10
B =
    20
>> duj5=A+((B-A)*(d-min(d))/(max(d)-min(d)))
duj5 =
    1.9277
    5.9036
  -10.0000
    9.1566
   14.5783
   20.0000
>> min(duj5)
ans =
   -10
>> max(duj5)
ans =
    20
Matrix Product
X = A + E, i.e. the standardized data matrix X is split into a part A reproduced by the common factors and a residual part E
Matrix A is expressed as the linear combination of a < M factors

\mathbf{A} = \mathbf{F}\mathbf{L}^{T}

where F is the N-by-a matrix of factor scores and L is the M-by-a matrix of factor loadings (practically, they represent the correlation coefficients between the measured variables and the derived factors)
The number of factors (a) depends on the statistical problem. Some approximate methods can specify it (e.g. K. G. Jöreskog's non-iterative method). Sometimes the upper limit of a is estimated as

a \le \frac{1}{2}\left(2M + 1 - \sqrt{8M + 1}\right)
Types of Statistical Factors
General factor - common factor connected to most of the observed
variables with high or medium loadings
Common factor - at least two of its factor loadings are different from zero
Specific factor - influences only one variable
Residual factor - it is a specific factor generated by measurement error or
approximation
The decomposition of matrix A is not unique, because for any a-by-a orthogonal (rotation) matrix B

\mathbf{A} = \mathbf{F}\mathbf{B}\mathbf{B}^{-1}\mathbf{L}^{T} = (\mathbf{F}\mathbf{B})(\mathbf{L}\mathbf{B})^{T}
It is assumed that the factor scores are standardized and mutually uncorrelated, (1/N)F^T F = I, and uncorrelated with the residuals
If the above criteria are fulfilled, the M-by-M correlation matrix R containing the correlation coefficients of the measured variables can be written as

\mathbf{R} = \frac{1}{N}\mathbf{X}^{T}\mathbf{X} = \frac{1}{N}\mathbf{A}^{T}\mathbf{A} + \mathbf{U}^{2} = \frac{1}{N}\left(\mathbf{F}\mathbf{L}^{T}\right)^{T}\left(\mathbf{F}\mathbf{L}^{T}\right) + \mathbf{U}^{2} = \mathbf{L}\mathbf{L}^{T} + \mathbf{U}^{2}

The correlation matrix can thus be related to the factor loadings (the elements of matrix L). The specific variances that are not described by the common factors are in the main diagonal of matrix U²
Communalities
The diagonal elements of the correlation matrix R are equal to 1; they represent the variance of the standardized measured variables (σ = 1). The elements of the reduced correlation matrix derived from the matrix product LLᵀ are the correlation coefficients of the measured variables, with the communalities in the main diagonal

\mathbf{R} - \mathbf{U}^{2} = \mathbf{L}\mathbf{L}^{T} = \begin{pmatrix} h_1^2 & r_{12} & \cdots & r_{1M} \\ r_{21} & h_2^2 & \cdots & r_{2M} \\ \vdots & & \ddots & \vdots \\ r_{M1} & r_{M2} & \cdots & h_M^2 \end{pmatrix}, \qquad h_i^2 = \sum_{j=1}^{a} L_{ij}^2

Communalities (h²) - the elements in the main diagonal of matrix LLᵀ; they are the portions of the variance of the measured variables that can be described by the common factors. For the residual factors, we derive the specific variances from the M-by-M matrix of communalities H²

\mathbf{U}^{2} = \mathbf{I} - \mathbf{H}^{2}

If h² is far less than 1, the measured variable has little connection to the factors, and vice versa
Estimation of Factor Loadings
Determination of the M-by-a matrix of factor loadings (L) leads to an eigenvalue problem (Av = λv, where v ≠ 0 and λ is a scalar), which is solved by the spectral (eigen-) decomposition of symmetric matrices

\mathbf{L}\mathbf{L}^{T} = \mathbf{R} - \mathbf{U}^{2} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^{T}

\mathbf{L} = \mathbf{Z}\boldsymbol{\Lambda}^{1/2}, \qquad \mathbf{L}\mathbf{L}^{T} = \mathbf{Z}\boldsymbol{\Lambda}^{1/2}\left(\mathbf{Z}\boldsymbol{\Lambda}^{1/2}\right)^{T} = \mathbf{Z}\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Lambda}^{1/2}\mathbf{Z}^{T} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^{T}
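A minimal sketch of this spectral-decomposition step on a hypothetical reduced correlation matrix (one factor retained):

% Minimal sketch: factor loadings from a hypothetical reduced correlation matrix
Rred = [1.0 0.8 0.6; 0.8 1.0 0.5; 0.6 0.5 1.0];
[Z, Lambda] = eig(Rred);                      % eigenvectors Z and eigenvalues Lambda
[lam, idx]  = sort(diag(Lambda), 'descend');  % order by decreasing eigenvalue
Z = Z(:, idx);
a = 1;                                        % number of retained factors (assumed)
L = Z(:,1:a)*diag(sqrt(lam(1:a)))             % loadings L = Z*Lambda^(1/2)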
Initial estimates of the communalities can be obtained, for instance, as

I) \ h_i^2 = \max_{j \ne i} |r_{ij}|, \qquad II) \ h_i^2 = \frac{r_{ik}\, r_{il}}{r_{kl}}, \qquad III) \ h_i^2 = \frac{1}{M-1}\sum_{\substack{j=1 \\ j \ne i}}^{M} r_{ij}
Bartlett’s method - based on the hypothesis of linearity, the LSQ method weighted by
the specific variances gives an estimate to the factor scores
Maximum Likelihood method - factor scores and loadings are simultaneously
estimated to maximize the likelihood objective function in an iterative process
Jöreskog's method - a non-iterative (quick) approximate solution (it is independent of the scaling of the data and gives an objective estimate of the optimal number of factors)
Interpretation of Factors
The interpretation of the k-th factor is based on the magnitude and the sign of
the factor loading Lik. The higher the value of coefficient Lik the more closely the
k-th factor correlates to the i-th measured variable
A general factor is indicated by high relative loadings and identical signs.
In case of simple structure (the factor coefficients are close to either 1 or 0), the
factors can be easily interpretable (they can be associated with physical
properties)
If the factors cannot be interpreted well, they can be transformed by rotation
methods to be more informative. The orthogonal transformation methods lead to
uncorrelated factors. For instance, the aim of the varimax method is to produce
factor loadings close to 1 or 0. The varimax method maximizes the variance of the squared factor loadings

\mathrm{VAR}(L^2) = \sum_{i=1}^{M}\sum_{k=1}^{a}\left(L_{ik}^2 - \overline{L_{k}^2}\right)^2 \rightarrow \max
FaktorLoad =
   -0.0183    0.8819    0.4176
    0.8957   -0.1850   -0.2882
   -0.3719    0.8160    0.4372
    0.7625   -0.3221   -0.1011
    0.8550   -0.1932   -0.4526
    0.8617    0.2317    0.0521
   -0.3777    0.0864    0.9115

[Figure: measured variables plotted in the space of the factor loadings (Factor 1 vs. Component 3)]
X = TPᵀ
http://cnx.org/content/m11461/latest
Estimation of Principal Components
Multiply the model equation X = TPᵀ by matrix P on the right. The M-length vectors p_i are orthonormal (p_iᵀp_j = 1 if i = j, otherwise p_iᵀp_j = 0), from which the N-by-r matrix of principal components is

\mathbf{T} = \mathbf{X}\mathbf{P}

The covariance matrix of the (column-centered) data is proportional to XᵀX,

\mathbf{COV} \propto \mathbf{X}^{T}\mathbf{X}

Determine the eigenvectors and eigenvalues of matrix COV using e.g. the SVD (singular value decomposition) or the spectral decomposition of symmetric matrices (see the section on factor analysis)

\mathbf{COV} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^{T}, \qquad \mathbf{Z}^{T}\mathbf{Z} = \mathbf{I}, \qquad \lambda_i = \sigma_i^2

Principal components are sorted according to the eigenvalues of the covariance matrix. The direction of the highest variance is designated by the eigenvector of the highest eigenvalue. This is the first axis in the new coordinate system (identified as the first principal component). Then we find the second highest eigenvalue, and that gives the second coordinate (second principal component), etc.
With the M-by-r matrix of principal component coefficients (P), one can determine the N-by-r matrix of principal components (T)

\mathbf{P} = \mathbf{Z}, \qquad \mathbf{T} = \mathbf{X}\mathbf{Z}
MATLAB Recipe 10
clc;
clear all;
n=1000;
mu=[0,0];
covar=[4 1;1 2];
X=mvnrnd(mu,covar,n);
[Princoef PrincC EigValX]=princomp(X);
[Princoef2,Var,Percent]=pcacov(covar);
Princoef,
Var,
EigValX,
Percent,
Princoef =
0.9142 -0.4053
0.4053 0.9142
EigValX =
4.8125
1.5323
Var =
4.4142
1.5858
Percent =
73.5702
26.4298
Cluster Analysis
Principles of Clustering
Cluster analysis - multivariate statistical method the aim of which is to classify the
elements of a given data set based on their common properties and partition the
data objects into groups. The classification is based on the definition of distance.
Two objects are considered as similar if their distance is small, and they are
different if that is large. The goal of clustering is to have minimal distances within
groups and maximal distances between the groups
Cluster - a group of objects which are similar in a well defined aspect. Diameter -
the distance between the furthermost elements of the cluster. Centroid gives the
location of the cluster in the data space, the centroid cK of the k-th cluster is the
arithmetic mean of the cluster elements. Radius - the distance between the centroid
and the furthermost element from it
The following criteria must be met during the decomposition of the data matrix X into subclusters: each element must be classified into a cluster, each element must be classified into only one cluster, and no cluster may be empty (www.cse.unr.edu)
Cluster analysis is sensitive to outlying data. Data objects with extreme values and high noise levels can distort the estimation
Measure of Distance
Distance definitions: Minkowski distance - Lp-norm of the deviation of i-th
and j-th observations, Euclidean distance - L2-norm of the deviation of i-th
and j-th observations, Cityblock (Manhattan) distance - L1-norm of the
deviation of i-th and j-th observations, Mahalanobis distance - L2-norm of the
deviation of i-th and j-th observations weighted by the inverse covariance
matrix. If the observations are not independent of each other, then the measure of correlation has to be taken into account (and also when the distances of the variables are not comparable because of their different magnitudes)
Distance matrix - an n-by-n matrix, where d_ij gives the distance between the i-th and j-th data objects (n is the total number of data objects)

\mathbf{D} = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}

www.emeraldinsight.com
Hierarchical Clustering
The benefit of hierarchical cluster analysis is that the number of
generated clusters does not have to be preset. Disadvantage - it is
rather time consuming, therefore only used for relatively small data sets
(data distances need to be stored in matrix D). This method is sensitive
to noise and outlying data
Agglomerative methods - each object starts as an individual cluster (singleton); then the two closest clusters are repeatedly combined until all objects are gathered in the same cluster. After calculating all cluster solutions, the optimum number of clusters can be chosen
Divisive methods - start with one cluster and, following the above strategy in reverse order, separate objects until all of them are separated
Algorithm of hierarchical clustering - in the first step, the distance matrix
of initial configuration of data objects is calculated. At this point, each
data is in its own individual cluster. After this, the closest data points to
each other are merged and the distance matrix is re-calculated. The
above steps are repeated until only one cluster remains
Dendrogram
Dendrogram - tree diagram frequently used to illustrate the arrangement of the
clusters produced by hierarchical clustering methods. The inner branches
represent different clusters, and at the end of the branches the connected data
objects are represented (the nodes represent individual observations). The
distance between merged clusters is monotone increasing with the level of the
merger. This method illustrates how elements are grouped, but it is not suitable for
spatial representation of clusters
Leaf nodes
Linkage Criteria
The distance between two clusters can be defined several ways. Single-
linkage clustering - it applies the distance of the closest elements as
linkage criterion. Complete-linkage clustering - it examines the maximum
distance between the elements of each cluster. Average linkage
clustering - it is based on the average distance between the elements of
clusters. Centroid linkage clustering - it examines the distance of cluster
centroids. Ward’s linkage method - it minimizes the increase in total
within-cluster variance (sum of squared deviations from the centroid)
after merging clusters using agglomerative cluster analysis
Tan (2006)
MATLAB Recipe 11
Increase the number of objects here
clc;
clear all;
subplot(2,2,1);
X = 10*rand(30,1);
Y = pdist(X,'mahalanobis');
Z = linkage(Y,'single');
[H,T] = dendrogram(Z,'colorthreshold','default');
set(H,'LineWidth',2);
xlabel('Number of objects');
ylabel('Centroid distance');
subplot(2,2,2);
Y = pdist(X,'mahalanobis');
Z = linkage(Y,'average');
[H,T] = dendrogram(Z,'colorthreshold','default');
set(H,'LineWidth',2);
xlabel('Number of objects');
ylabel('Centroid distance');
subplot(2,2,3);
Y = pdist(X,'mahalanobis');
Z = linkage(Y,'centroid');
[H,T] = dendrogram(Z,'colorthreshold','default');
set(H,'LineWidth',2);
xlabel('Number of objects');
ylabel('Centroid distance');
subplot(2,2,4);
Y = pdist(X,'mahalanobis');
Z = linkage(Y,'ward');
[H,T] = dendrogram(Z,'colorthreshold','default');
set(H,'LineWidth',2);
xlabel('Number of objects');
ylabel('Centroid distance');
Petrophysical Example
Non-Hierarchical Clustering
Non-Hierarchical clustering or partitioning is an iterative cluster analysis
method. This kind of clustering requires the determination of the number of
clusters prior to the iterative process. It has the feature of handling big data
sets well; but the result is highly affected by the initial selection of centroids
K-Means Clustering - choose the number of clusters and the K starting centroids. Create K groups so that each element is classified into the cluster with the nearest centroid. Calculate the new cluster centroids. The previous steps are repeated until the convergence criterion is met
The sum of squared errors between the elements and their nearest centroids is

SSE = \sum_{i=1}^{K}\sum_{j=1}^{n_i} d^2(c_i, x_j)
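A minimal sketch of K-means with the Statistics Toolbox kmeans function (synthetic 2-D data; K = 3 assumed):

% Minimal sketch: K-means clustering of synthetic 2-D data (K = 3 assumed)
X = [randn(50,2); randn(50,2)+4; randn(50,2)+repmat([8 0],50,1)];
K = 3;
[idx, c] = kmeans(X, K);                 % cluster indices and centroids
gscatter(X(:,1), X(:,2), idx); hold on;
plot(c(:,1), c(:,2), 'kx', 'MarkerSize', 12);
xlabel('x'); ylabel('y');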
[Figure: K-means clustering example in the (x, y) plane - four panels showing groups 1-3, their centroids and the iterative update of the clusters]
MATLAB Recipe 12
c1=mean(X1);
c2=mean(X2);
c3=mean(X3);
X=[X1;X2;X3];
figure(1);
subplot(1,2,1);
The deviation between the measured and calculated data characterizes the quality of the fit

\mathbf{e} = \mathbf{d}^{(meas)} - \mathbf{d}^{(cal)}, \qquad e_k = d_k^{(meas)} - d_k^{(cal)}

E = \sum_{k=1}^{N} e_k^2 = \left(\mathbf{d}^{(meas)} - \mathbf{d}^{(cal)}\right)^{T}\left(\mathbf{d}^{(meas)} - \mathbf{d}^{(cal)}\right) \rightarrow \min

Different norms of the deviation vector can be used

\|\mathbf{e}\|_1 = \sum_{i=1}^{N}|e_i|, \qquad \|\mathbf{e}\|_2 = \sqrt{\sum_{i=1}^{N} e_i^2}, \qquad \|\mathbf{e}\|_p = \sqrt[p]{\sum_{i=1}^{N}|e_i|^p}, \qquad \|\mathbf{e}\|_\infty = \max_{i=1\ldots N}|e_i|
Gaussian Least Squares Method
Overdetermined inverse problems (number of data > number of
unknowns) do not have algebraic solutions (number of independent
equations > number of unknowns)
The deviation vector can be expressed using the Jacobi matrix G

\mathbf{e} = \mathbf{d}^{(meas)} - \mathbf{d}^{(calc)} = \mathbf{d}^{(meas)} - \mathbf{G}\mathbf{m} \ne \mathbf{0}

The objective function is minimized with respect to all model parameters

E = \mathbf{e}^{T}\mathbf{e} \rightarrow \min, \qquad \frac{\partial E}{\partial m_l} = 0 \quad (l = 1, 2, \dots, M)
Derivation of Normal Equation
Derive the objective function E and set its partial derivatives to zero (∂m_i/∂m_j = δ_ij, i.e. 1 if i = j, otherwise 0)

E = \sum_{k=1}^{N}\left(d_k^{(m)} - \sum_{i=1}^{M} G_{ki} m_i\right)\left(d_k^{(m)} - \sum_{j=1}^{M} G_{kj} m_j\right)

\frac{\partial E}{\partial m_l} = -2\sum_{k=1}^{N} G_{kl}\, d_k^{(m)} + 2\sum_{k=1}^{N}\sum_{i=1}^{M} G_{kl} G_{ki}\, m_i = 0

\sum_{k=1}^{N}\sum_{i=1}^{M} G_{kl} G_{ki}\, m_i = \sum_{k=1}^{N} G_{kl}\, d_k^{(m)} \quad \Longleftrightarrow \quad \mathbf{G}^{T}\mathbf{G}\mathbf{m} = \mathbf{G}^{T}\mathbf{d}^{(m)}

\mathbf{m} = \left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}\mathbf{G}^{T}\mathbf{d}^{(m)}
Linear Regression Solved By Inversion
Assume that the temperature versus depth relation is linear. Accordingly, the model (response) equation is

T(z) = m_1 + m_2 z, \qquad \mathbf{G} = \begin{pmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_N \end{pmatrix}

\mathbf{G}^{T}\mathbf{G} = \begin{pmatrix} N & \sum_{i=1}^{N} z_i \\ \sum_{i=1}^{N} z_i & \sum_{i=1}^{N} z_i^2 \end{pmatrix}, \qquad \mathbf{G}^{T}\mathbf{d}^{(m)} = \begin{pmatrix} \sum_{i=1}^{N} T_i^{(m)} \\ \sum_{i=1}^{N} z_i T_i^{(m)} \end{pmatrix}

\mathbf{m}^{(est)} = \begin{pmatrix} m_1^{(est)} \\ m_2^{(est)} \end{pmatrix} = \left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}\mathbf{G}^{T}\mathbf{d}^{(m)} = \begin{pmatrix} N & \sum_{i=1}^{N} z_i \\ \sum_{i=1}^{N} z_i & \sum_{i=1}^{N} z_i^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum_{i=1}^{N} T_i^{(m)} \\ \sum_{i=1}^{N} z_i T_i^{(m)} \end{pmatrix}
MATLAB Recipe 13
Estimated model
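The recipe's code is not preserved in the extracted text; a minimal sketch of the inversion-based line fit described above, with synthetic temperature-depth data and hypothetical variable names, could be:

% Minimal sketch: T(z) = m1 + m2*z estimated by m = (G'G)^-1 G'd (synthetic data)
z = (100:100:1000)';                     % depths [m]
T = 12 + 0.03*z + 0.5*randn(size(z));    % synthetic measured temperatures [degC]
G = [ones(size(z)) z];                   % Jacobi matrix of the linear model
m = (G'*G)\(G'*T);                       % estimated [m1; m2]
Tcalc = G*m;
plot(T, z, 'o', Tcalc, z, '-'); set(gca,'YDir','reverse');
xlabel('T [degC]'); ylabel('z [m]');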
Generalized Inverse
Measurement data always contain noise, which is transformed to the
model space during inversion. Thus, the model parameters estimated by
inversion will also contain some amount of error. On the other hand, the
forward modeling procedure is also a source of error, because in the (mostly empirical or approximate) response equations the selection of rock and fluid properties also introduces some modeling error into the results of inversion. These two independent error types are added and treated together as the data error (σd), which is also an input of the inversion
Generalized inverse - an M-by-N matrix (M) which creates a connection
between the model and data during the inversion
\mathbf{m} = \mathbf{M}\mathbf{d} + \mathbf{v}

In linearized inversion, vector v is zero and the data versus model relation is linear. For instance, in the case of the Gaussian LSQ method the generalized inverse is

\mathbf{m}^{(est)} = \mathbf{M}\mathbf{d}^{(meas)}, \qquad \text{where} \quad \mathbf{M} = \left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}\mathbf{G}^{T}
Model Covariance
In the case of a linear relation, the model equation can also be written with the average values

\mathbf{m} - \overline{\mathbf{m}} = \mathbf{M}\left(\mathbf{d} - \overline{\mathbf{d}}\right)

For the i-th and j-th model parameters (i, j = 1, 2, …, M)

m_i - \overline{m}_i = \sum_{k=1}^{N} M_{ik}\left(d_k - \overline{d}_k\right), \qquad m_j - \overline{m}_j = \sum_{l=1}^{N} M_{jl}\left(d_l - \overline{d}_l\right)

Assuming uncorrelated data of uniform variance σ_d², the model covariance matrix is

\mathrm{cov}(\mathbf{m}) = \mathbf{M}\,\mathrm{cov}(\mathbf{d})\,\mathbf{M}^{T} = \sigma_d^2\left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}\mathbf{G}^{T}\left[\left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}\mathbf{G}^{T}\right]^{T}

Using (AB)ᵀ = BᵀAᵀ with A = (GᵀG)⁻¹ and B = Gᵀ, the model covariance matrix takes the simpler form

\mathrm{cov}(\mathbf{m}) = \sigma_d^2\left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}\mathbf{G}^{T}\mathbf{G}\left(\mathbf{G}^{T}\mathbf{G}\right)^{-1} = \sigma_d^2\left(\mathbf{G}^{T}\mathbf{G}\right)^{-1}
Inversion of Direct-Push Data
Inverted direct push logs are: natural gamma-ray intensity (GR), density (ρb), neutron-porosity (ΦN), electric resistivity (R).
Estimated model parameters are: water volume (Vw), quartz volume (Vs), clay volume (Vcl).
Weighted LSQ Method
The uncertainty of the inverted data variables is normally different (measurement error). If we have a priori information about the accuracy of the data, it can be taken into consideration during the inversion procedure
Let W(d) be an N-by-N weighting matrix in the data space (a diagonal matrix in the case of uncorrelated data), which consists of weighting coefficients for each data point depending on its accuracy. The more accurate the data, the higher the weight it gets. The weighting matrix can be constructed on different principles, for instance from the data variances or from the individual data errors e_1, …, e_N

\mathbf{W}^{(d)} = \begin{pmatrix} \sigma_1^{-2} & 0 & \cdots & 0 \\ 0 & \sigma_2^{-2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^{-2} \end{pmatrix} \qquad \text{or} \qquad \mathbf{W}^{(d)} = \begin{pmatrix} e_1^{-1} & 0 & \cdots & 0 \\ 0 & e_2^{-1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & e_N^{-1} \end{pmatrix}
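A minimal sketch of the weighted normal equations m = (GᵀW⁽ᵈ⁾G)⁻¹GᵀW⁽ᵈ⁾d with a variance-based weighting matrix (hypothetical data and errors):

% Minimal sketch: weighted LSQ with W = diag(1/sigma_k^2) (hypothetical data)
z     = (1:8)';
d     = 3 + 2*z + 0.2*randn(8,1);               % hypothetical measured data
sigma = [0.1 0.1 0.5 0.1 0.1 1.0 0.1 0.1]';      % assumed data errors
W     = diag(1./sigma.^2);                       % more accurate data get larger weights
G     = [ones(8,1) z];
m_wlsq = (G'*W*G)\(G'*W*d)                       % weighted estimate of the model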
Halliburton Co.
Global Optimization Methods
Search for Global Optimum
Linearized optimization methods provide a quick and acceptable solution if the starting model is selected favorably (with sufficient a priori information, the optimization procedure can be started not too far from the optimum)
If the starting model is too far from the solution, then the linear optimization methods (such as LSQ), being gradient-based search techniques, tend to be trapped in one of the local minima of the objective function
Global optimization methods (such as Simulated Annealing and Genetic Algorithms) perform a derivative-free, practically initial-model-independent random search, and allow the determination of the global (absolute) minimum of the objective function, provided that the control parameters are set properly
[Figure: objective function with a local minimum near the initial model and the absolute minimum elsewhere, Szabó (2004)]
Simulated Annealing Method
Simulated Annealing (SA) is a random search technique used for finding
the global optimum of an objective function. The SA algorithm is based on
the analogy between the simulation of the annealing of solids and the
problem of solving multivariate optimization problems. It has more variants
to decrease the CPU time (e.g. FSA, VFSA)
The SA algorithm can be used to solve the inverse problem. Depending on
the rate and schedule of cooling some structure of the solid will form, which
will have an energy state (energy function) and the inverse problem also
has an objective function to be minimized
In metallurgy, the annealing of metals is done by heating the metals to the
melting point and then cooling them slowly. This process gradually reduces
the kinetic energy of atoms and the metal starts to crystallize. The energy of
the formed structure depends on the cooling schedule. A (theoretically)
infinitely slow cooling would result in a low energy, perfect structure which is
analogous to the geophysical inverse problem, where we try to stabilize the
objective function at the global minimum. In practice, such a slow cooling is
not possible, therefore a faster cooling schedule is necessary
Simulated Annealing Method
Faster cooling results in defects in the
crystal structure, which means that the
metal freezes up at a higher energy
state into an imperfect structure. This is
consistent with the stabilization of the
inversion process in a local minimum.
However, by a special heat treatment,
called annealing, atoms are released
from the higher energy state and they
reach the absolute minimum energy
structure under appropriate cooling. The
SA method is used to find the global
minimum of the objective function
In a convergent SA process the optimal model tends to the Gibbs-Boltzmann distribution (thermal equilibrium), the probability density function of which is given below. In the formula, P(m^(i)) is the probability of the i-th model, S is the number of models and T is the temperature as a control parameter of the SA procedure

P\!\left(\mathbf{m}^{(i)}\right) = \frac{e^{-E\left(\mathbf{m}^{(i)}\right)/T}}{\sum_{j=1}^{S} e^{-E\left(\mathbf{m}^{(j)}\right)/T}}
Energy Function
Energy function - the objective function of the SA procedure, which is related to the deviation of the measured and calculated data. If the data distribution is normal, the LSQ norm gives an optimal solution

E_2 = \frac{1}{N}\sum_{k=1}^{N}\left(d_k^{(meas)} - d_k^{(cal)}\right)^2 \rightarrow \min

For data types of different magnitudes, the deviation should be normalized by either the data variances or the measured data
In the case of outliers in the data set, the use of the L1-norm is preferable

E_1 = \frac{1}{N}\sum_{k=1}^{N}\left|d_k^{(meas)} - d_k^{(cal)}\right| \rightarrow \min

In each iteration a new model is generated by randomly perturbing the current one, and the resulting change of the energy function is evaluated

m_i^{(new)} = m_i^{(old)} \pm b, \quad 0 \le b \le b_{max}, \qquad \Delta E = E\left(\mathbf{m}^{(new)}\right) - E\left(\mathbf{m}^{(old)}\right)

Metropolis criterion - the probability rule of accepting a new model. If the acceptance probability P is greater than or equal to a random number generated from U[0, 1], the new model is accepted, otherwise it is rejected. Even in the case of ΔE > 0 some models can be accepted, which allows an escape from a local minimum

P(\Delta E) = \begin{cases} 1, & \Delta E \le 0 \\ e^{-\Delta E/T}, & \Delta E > 0 \end{cases}
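A minimal sketch of the Metropolis rule inside a simple SA loop (the one-dimensional energy function, step size and cooling schedule are assumptions of this illustration):

% Minimal sketch: simulated annealing with the Metropolis criterion
E  = @(m) (m-2).^2 + 3*sin(5*m);       % hypothetical multi-minimum energy function
m  = -5;  T0 = 10;  bmax = 0.5;
for q = 2:5000
    T    = T0/log(q);                  % assumed logarithmic cooling schedule
    mnew = m + bmax*(2*rand-1);        % random perturbation, |b| <= bmax
    dE   = E(mnew) - E(m);
    if dE <= 0 || rand < exp(-dE/T)    % Metropolis acceptance rule
        m = mnew;
    end
end
m, E(m)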
The classical Metropolis SA (MSA) uses a logarithmic cooling schedule

T(q) = \frac{T_0}{\ln q} \quad (q > 1)

Setting the initial temperature (T0) can be done empirically or based on test runs (q is the number of iterations). With the average energy method, one can calculate the mean energy of the accepted models for different temperatures; where this average energy is minimal, the given T0 can be set as the initial temperature of the MSA procedure
The cooling rate of MSA is very slow; the optimum search takes orders of magnitude more CPU time than linear methods
The Fast Simulated Annealing (FSA) method cools down according to a 1/q function and samples from a multivariate Cauchy distribution. The Very Fast Simulated Annealing (VFSA) applies an exponential cooling schedule with an individual temperature for each model parameter
Workflow of MSA Procedure
Initialization
Initial temperature
http://www.edc.ncl.ac.uk
Crossover Mutation
Float-Encoded Genetic Algorithm
Classical GA is a time-consuming optimum-seeking method, because in each iteration a decoding phase has to be applied before the theoretical data can be calculated
Float-encoded Genetic Algorithm
(FGA) calculates the fitness directly
with real valued model parameters
and does not apply any coding and
decoding operations. Since each
parameter is selected from a real
interval, the parameter space can
be better resolved than with the
binary GA
In the FGA procedure applied for seeking the absolute extremum of the fitness function, the model parameters are encoded as floating-point numbers and floating-point genetic operations are used to provide a solution with the highest precision and optimal computer processing time
[Figure: floating-point genetic operators - a) Selection, b) Crossover, c) Mutation]
Comparative Study
The FGA improves a large number of models (30-100) simultaneously, while the LSQ and SA methods update only one model. At the end of the optimization procedure, FGA provides a set of optimal models, but it is also the most time-consuming process. Linear inversion methods usually require only 5-20 iterations, SA needs 5,000-10,000 iterations, and FGA requires 10,000-100,000 generations to find the optimum
Since the random search in FGA is not performed only from point to point in the parameter space, but several points are examined simultaneously, the local extrema of the objective function can be avoided more effectively. Some hyperplane partitions can be automatically eliminated when we are outside the pre-defined range of the model parameters
FGA does not use any linearization like the LSQ, therefore the calculation of the Jacobi matrix (i.e. the partial derivatives of the data with respect to the model parameters) is unnecessary (it works only with the codes and the values of the objective function). FGA and SA, as global optimization methods, are independent of the derivatives and are practically initial-model independent (i.e. they do not require a priori information in the phase of inversion)
The convergence of global optimization methods (SA, FGA) is highly dependent on the setting of the control parameters (e.g. SA - temperature; FGA - types and parameters of the genetic operators)
Factor Analysis Using FGA
2D FGA Inversion of
Direct-Push Logging Data
The output of the j-th neuron is a nonlinear function of the weighted sum of its inputs

y_j = f(s_j), \qquad s_j = \sum_{i=1}^{N} W_{ij}\, x_i, \qquad \mathbf{s} = \mathbf{W}^{T}\mathbf{x}

where W_ij is the weight of the i-th input (x_i), giving the extent to which the i-th input is involved in the answer y_j of the j-th neuron (j = 1, 2, …, N)
[Figure: schematic of a neuron - inputs x_1, …, x_N with weights W_1j, …, W_Nj, summation, activation function f(s) and outputs y_1, y_2, …, y_N]
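A minimal sketch of this single-layer forward pass with a sigmoid activation (random weights and inputs, purely illustrative):

% Minimal sketch: forward pass of one neural layer, y = f(W'*x)
Nin = 4; Nout = 3;
W = randn(Nin, Nout);          % W(i,j): weight of input i feeding neuron j
x = randn(Nin, 1);             % input vector
s = W'*x;                      % weighted sums s_j = sum_i W_ij*x_i
y = 1./(1+exp(-s))             % sigmoid activation f(s), assumed here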
Activation Function
In ANNs, the activation function
of a node defines the output of
that node given an input or set
of inputs. The nonlinear transfer
function is called an activation
function. Activation takes
place above an appropriate
(stimulus) threshold value
The following types of activation
functions are used in practice:
- Sigmoid (I)
- Linear (II)
- Heaviside-step (III)
- Gaussian (IV)
- Sinus (V)
Benedek (2000)
- Tangent-hyperbolic (VI)
Structure of ANNs
In order to solve complex problems,
neurons are networked. This is done
because it can be shown that a properly
connected neural network can approximate
any nonlinear relationship to any arbitrary
precision. This function approximation
capability allows the effective modeling of
static and dynamic systems
Structurally the ANN consists of different
layers. A layer is the combination of
neurons involved in the processing of
similar information. The outputs of one
layer are connected to inputs of another
layer or to the outputs of the entire network. The input layer does not process any information, it only transmits the input of the entire network. The inputs and outputs of the hidden layers are connected only to the neurons of the neighbouring layers, while the output layer contains the output of the entire network. From a structural point of view, neural networks differ in how the neurons building them up are connected to each other (e.g. the multilayer perceptron)
Several parameters have to be preset that will affect the determination of the output values, e.g. the number of data, the number of layers, the initial weights, the objective function, and the optimization algorithm (linear or global optimization)
Training of ANNs
Training the neural network is possible with the existence of the associated input and output data
pairs. When teaching a given input, it is assumed that the output will display the desired
response. In the course of the training, in an iterative (optimization) method, the above two data
sets are fitted, i.e. we minimize a suitably chosen fit criterion function (objective function). The
operation of the network to be taught is achieved in such a way as to be as close as possible to
the operation of the system being tested. The result depends on the optimization method (i.e.
linear or global optimization). The success of teaching can be characterized quantitatively, since
the difference of the input and output data sets can be measured with the data distance or
correlation coefficient
The system is clearly characterized by its parameters. These parameters are independent
variables of an objective function that define a model. The parameters of the model are the
weights of the ANN. During optimization, these are refined, and at the end of the process, we
obtain an optimal combination of weighting coefficients. With the specified weights, the given
system configuration gives an appropriate response to the new input data, i.e. an output data set
[Figure: flowchart of network training - if the fit criterion is not met, the model weights are modified and the process repeats]
Types of ANNs
Multilayer Perceptron (MLP) - the most commonly used multi-layer feedforward neural
network. In the MLP, there are simple perceptrons which are connected as a network
using different weights. The three-layered MLP network can classify any kind of samples
Radial Basis Function (RBF) - often used as a feedforward network. It contains only one
hidden layer, where radially symmetric activation functions transform the input data non-
linearly. A network usually uses only one type of activation function but their parameters
can be changed neuron to neuron. The weighted sum is generated in the output layer
Hopfield network - is a simple recurrent ANN, where connections between units form a
directed cycle. They are used to solve complex problems. They can be locally (within
layers) or globally recurrent networks (from output into input), which have excellent
function approximations capability. The Hopfield network consists of one layer (input and
output). In the network all the neuron are connected to each other. The Hopfield network is
trained using unsupervised learning, which means that not the weights are being
optimized but the values of y. The weights (w) and the initial values of x are estimated.
Then using the network, we will get the values of y in the first iteration. In the next
iteration, the values of y will be identified as the input variables (x=y), and that is how we
get the second approximation etc.
Self-Organizing Map (SOM) - they are used for complicated problems using unsupervised
learning. In the Kohonen’s network for example only input and output neurons are present.
As with cluster analysis, we can use it efficiently for dimensionality reduction and
grouping of the input sample. Kohonen’s network is a computationally convenient
abstraction building on biological models of neural systems
Training of GIS-Based ANN