Dutta 111 35th Apcom Final

University of Alaska Fairbanks
Critical Assessment of Machine Learning Algorithms as Estimation Techniques for a
Polymetallic Ore Deposit

Sridhar Dutta, PhD
Sukumar Bandopadhyay, PhD, P.E
Rajive Ganguli, PhD, P.E
Debasmita Misra, PhD

Outline of the Presentation
Background
Objective
Neural Network (NN) for Grade Estimation
Support Vector Machine (SVM) for Grade
Estimation
Case Study
Results

Background
Resource Estimation
Importance: a reliable estimate is a prerequisite for mine planning and ore grade control.
Problems: complex geological structure; absolute determination is not possible; intrusions by other materials; variations in both vertical and lateral extents; structural disturbances; multiplicity of ore structures; variation in thickness and quality within the same structure; associated formations; inability to verify.
Factors: estimation is a challenge because of factors such as variability of ore grade, outliers, characteristics of the ore boundary, geometry, etc.
Hence the continual search for more reliable and robust estimation techniques.
Estimation Techniques
Polygonal Method.
Triangulation.
Local Sample Mean.
Distance Weighting Method.
Various kriging estimators
Ordinary kriging (OK), indicator kriging (IK), and other members of the kriging family.

Traditional approach has been the use of geostatistics

Background (Contd..)

Limitations
OK suited for Linear relationship
Semi-variogram modeling
Normality Assumption
Stationarity Assumption
Large Amount of Data Required
Anisotropy and Trend Analysis difficult with less data

Background (Contd..)
Emerging Techniques
Neural Networks (NN)
Support Vector Machines (SVM)

Advantages
Non Linear Mapping Capability.
No assumption on the distribution.
No Variogram Modeling
Fundamental Differences
OK utilizes information from local samples
NN utilizes information from all of the
samples
OK is regarded as a local estimation
technique
NN is a global estimation technique
NN will capture any non-linear spatial trend
Objectives
To develop a reserve model using machine
learning algorithms (MLA) for improved ore
grade estimation.
Compare the grade estimates obtained
using the MLA with the traditional ordinary
kriging method.
Neural Network Approach
In general the approach involves

Determination of a suitable Network Architecture.

Determination of the activation functions, number of
hidden neurons.

Determination of a mapping function by adjusting the
connection weights using some learning algorithm.
[Figure: A typical neural network architecture. Inputs feed a hidden layer and then the output layer through connection weights $w_{jk}$. Each hidden-layer element $j$ receives a total input $I_j$ and produces the output $f_j(I_j)$; each output-layer element $k$ receives a total input $I_k$ and produces the output $y_k = f_k(I_k)$.]
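The flow of information through such a network can be sketched in a few lines. This is only an illustration under stated assumptions: one hidden layer with tanh activation, a linear output layer, bias terms (not shown on the slide), and made-up weight values; the function name `forward` is hypothetical.

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> hidden layer (tanh) -> output layer (linear)."""
    # Total input I_j to each hidden element, then its output f_j(I_j)
    hidden = []
    for weights, bias in zip(w_hidden, b_hidden):
        total_in = sum(w * xi for w, xi in zip(weights, x)) + bias
        hidden.append(math.tanh(total_in))
    # Total input I_k to each output element; with a linear f_k, y_k = I_k
    outputs = []
    for weights, bias in zip(w_out, b_out):
        outputs.append(sum(w * h for w, h in zip(weights, hidden)) + bias)
    return outputs

# Two inputs, two hidden elements, one output (all weights are made-up values)
y = forward([1.0, 2.0],
            w_hidden=[[0.5, -0.25], [0.0, 0.0]],
            b_hidden=[0.0, 0.0],
            w_out=[[2.0, 1.0]],
            b_out=[0.5])
print(y)
```

Training then amounts to adjusting `w_hidden` and `w_out` so that the outputs match the observed grades as closely as possible.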

NN for Grade Estimation

NN needs to be trained (learning process) for
a minimal estimation error

Output accuracy depends upon how each
element in the layers is weighted to capture
the underlying phenomenon

Choose the network with minimal generalization
error
The quick-stop training method is employed.
The dataset is split into three subsets: training, calibration, and validation.
The network is trained on the training set. However, the decision to stop training is made on the network's performance on the calibration set.
The validation (prediction) dataset is used to evaluate the generalization performance.
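The stopping decision can be illustrated with a small sketch. The exact rule used by the quick-stop method is not given in the slides; the version below assumes a simple "patience" rule (stop once the calibration error has not improved for a few epochs), and both `stopping_epoch` and the error values are hypothetical.

```python
def stopping_epoch(calibration_errors, patience=3):
    """Return the epoch whose weights should be kept (lowest calibration error seen)."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(calibration_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up calibration errors, one per training epoch
errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(stopping_epoch(errors))
```

Training error would normally keep falling over these epochs; it is the rise in calibration error that signals the onset of over-fitting.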


SVM for Grade Estimation
Popularly known as support vector regression (SVR) for its regression abilities.
SVR is based on the structural risk minimization (SRM) principle and motivated by statistical learning theory (SLT).
Statistical learning theory (SLT)
Involves learning from training data; empirical risk minimization (ERM) is practiced.
Learning is an ill-posed problem: there is always a generalization error.

$$R(h) \le R_{emp}(h) + \Omega(h)$$

where $R$ is the testing error, bounded by the right-hand side; $R_{emp}$ is the empirical risk on the training data; and $\Omega$ is the confidence term, which depends on model complexity as defined by the VC dimension $h$. $h$ describes the general notion of complexity and is independent of the particular function used to model the data.

Statistical learning theory

The general strategy is to select a model that minimizes the training error and has the smallest VC dimension. This is the principle of SRM, which in turn results in the smallest bound on the test error.
[Figure: Dependence of the VC confidence on the VC dimension h and the number of training data l, h < l (from Kecman, 2001)]
[Figure: Bound on the test error derived in SLT. The minimum corresponds to an optimal model complexity (from Pozdnoukhov, 2005)]
Support Vector Regression (contd..)
As per the SRM principle, the objective is to select an approximating function that minimizes not only the confidence term but also the empirical risk.
The overall risk that is minimized is given by the following objective function:

$$\min R = \tfrac{1}{2} w^T w + C \sum_{i} \lvert y_i - f(x_i, w) \rvert_{\varepsilon}$$

or

$$\min R = \tfrac{1}{2} w^T w + C \sum_{i} (\xi_i + \xi_i^*)$$

under the constraints

$$y_i - w^T x_i - b \le \varepsilon + \xi_i,$$
$$w^T x_i + b - y_i \le \varepsilon + \xi_i^*,$$
$$\xi_i \ge 0, \quad \xi_i^* \ge 0.$$
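As a rough numerical illustration of this objective, the sketch below evaluates R for a given linear model, taking each slack variable at its smallest feasible value. The function name `svr_objective` and the toy data are assumptions made for the example, not part of the study.

```python
def svr_objective(w, b, X, y, C, eps):
    """R = 0.5 * w'w + C * sum(xi_i + xi_i*), with slacks at their smallest feasible values."""
    reg = 0.5 * sum(wj * wj for wj in w)
    slack_total = 0.0
    for x_row, y_i in zip(X, y):
        f = sum(wj * xj for wj, xj in zip(w, x_row)) + b
        slack_above = max(0.0, y_i - f - eps)   # xi_i: observation above the tube
        slack_below = max(0.0, f - y_i - eps)   # xi_i*: observation below the tube
        slack_total += slack_above + slack_below
    return reg + C * slack_total

# Toy data: the model f(x) = x with eps = 0.5 leaves two points outside the tube
X = [[1.0], [2.0], [3.0]]
y = [1.0, 3.0, 2.0]
risk = svr_objective(w=[1.0], b=0.0, X=X, y=y, C=2.0, eps=0.5)
print(risk)
```

Points inside the tube contribute nothing to the second term, which is why only a subset of the training patterns ends up determining the solution.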
SVR General Steps
1. Define the problem as classification or regression.
2. Standardize the input data.
3. Check for outliers, i.e. the strange data points.
4. Select the kernel function in order to transform the
data to a higher dimensional feature space. One of
the common kernels considered is Radial Basis
Function (RBF) kernel.
5. Select the shape, i.e. the smoothing parameter
of the kernel function. This is the polynomial degree
for the polynomial and variances for the Gaussian
RBF kernel.
SVR General Steps

6. Choose the penalty parameter C and the desired accuracy by defining the insensitivity zone ε.
7. Solve the quadratic programming problem in the 2 x L variables (L is the number of training patterns) for the corresponding regression task.
8. Train the model and validate it on a previously unseen dataset. If the validation result is not satisfactory, repeat steps 4 to 8.
9. Since searching for C, ε, and the shape parameter individually can be a tedious and time-consuming task, an alternative approach is cross-validation with a grid search to find the best parameter values.
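The grid search in step 9 can be sketched as an exhaustive loop over candidate parameter pairs. In the sketch below, `cv_error` is a hypothetical callable that would return the cross-validation error of an SVR trained with the given parameters; here it is replaced by a made-up error surface so that the example runs on its own.

```python
import itertools

def grid_search(cv_error, c_values, sigma_values):
    """Return (best_C, best_sigma, best_error) over the full parameter grid."""
    best = None
    for c, sigma in itertools.product(c_values, sigma_values):
        err = cv_error(c, sigma)
        if best is None or err < best[2]:
            best = (c, sigma, err)
    return best

# Made-up smooth error surface standing in for a real cross-validation MSE
def toy_cv_error(c, sigma):
    return (c - 1.0) ** 2 + (sigma - 2.0) ** 2 + 0.2

best_c, best_sigma, best_err = grid_search(
    toy_cv_error, c_values=[0.5, 1.0, 2.0, 4.0], sigma_values=[0.5, 1.0, 2.0])
print(best_c, best_sigma, best_err)
```

With a real model, each call to `cv_error` involves training k models, so the cost of the search grows with the product of the grid sizes.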
Approach
Three data division techniques were
investigated
Random Division
Genetic Algorithm
Kohonen Network

Appropriate model developed for Ore
Reserve Estimation.
Study Area and Data Characteristics
Greens Creek mine located in Southeast
Alaska
Polymetallic ore body (silver, zinc, gold, and
lead)
Data available in terms of easting (x) and
northing (y) co-ordinates (in m), gold, silver,
lead, zinc and copper content (in ppm).
Comparative Analysis in a Lode
Deposit
Data obtained from the Greens Creek mine.
432 exploratory borehole observations (x, y, gold,
silver, lead, zinc, copper).
Silver values were estimated.
Training= 216; Calibration= 108; Validation=108.
GA was used to obtain the model data subsets.

Modeling
A split-sampling approach was carried out.
The NN requires three datasets; the SVM requires two.
Random data division resulted in dissimilar datasets.
A genetic algorithm was therefore applied.

Data Modeling

Data were divided into three statistically similar subsets employing genetic algorithms (GA).

Statistical properties of the Greens Creek model datasets

            Training            Calibration         Validation
Variable    Mean      SD        Mean      SD        Mean      SD
X           5541.92   409.34    5558.50   416.80    5567.85   429.48
Y           3752.67   541.29    3707.91   494.98    3670.76   520.68
Gold        0.03      0.06      0.03      0.05      0.03      0.06
Lead        0.15      0.28      0.13      0.23      0.14      0.26
Zinc        3.41      7.38      3.74      5.02      3.03      4.03
Copper      2.89      3.91      2.75      3.39      2.69      3.53
Silver      0.96      1.43      0.92      1.18      0.89      1.20
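The slides do not state the fitness function that the GA minimized. One plausible choice, shown here purely as an assumption, is to penalize the mismatch in mean and standard deviation between each subset and the full dataset; `division_fitness` and the silver values below are hypothetical.

```python
import statistics

def division_fitness(values, assignment, n_subsets=3):
    """Lower is better: total mismatch in mean and SD between each subset and all data."""
    overall_mean = statistics.mean(values)
    overall_sd = statistics.stdev(values)
    penalty = 0.0
    for s in range(n_subsets):
        subset = [v for v, a in zip(values, assignment) if a == s]
        if len(subset) < 2:
            return float("inf")  # SD undefined: reject this division
        penalty += abs(statistics.mean(subset) - overall_mean)
        penalty += abs(statistics.stdev(subset) - overall_sd)
    return penalty

# Made-up silver values; each assignment labels every sample with a subset 0, 1 or 2
silver = [0.1, 0.2, 0.4, 0.5, 0.9, 1.0, 1.1, 1.3, 2.8, 3.0, 3.1, 3.3]
interleaved = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
blocked = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print(division_fitness(silver, interleaved), division_fitness(silver, blocked))
```

A GA would treat each assignment vector as a candidate solution and evolve the population toward divisions with a low penalty, summed over all variables rather than silver alone.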
Data Modeling

For NN modelling, the commercially available
software package Neuroshell was used
Several network architectures investigated
Final architecture consisted of 5 slabs (one input
slab, one output slab and three hidden slabs)
Histogram plot for the silver values

Snapshot of the semi-variogram
modeling on the variable silver

NN Modeling
For this modelling exercise, the network comprised 5 slabs: one input slab, three hidden slabs, and one output slab (a slab is a group of neurons; a particular layer may have multiple slabs). The slabs in the hidden layer and the output layer used different activation functions. The input slab has six neurons, one for each input variable, while the output slab has one neuron for silver, the output variable. The slabs in the hidden layer have 8, 6, and 8 neurons, respectively.
NN Architecture
[Figure: Ward Net architecture for the NN modeling. An input slab feeds three hidden slabs with tanh, Gaussian, and Gaussian complement activations, which feed an output slab with a linear activation; the slabs are labeled Slab 1 to Slab 5.]

SVR Modeling
For the SVM modelling, a grid-based approach with 10-fold cross-validation on the training dataset was employed to select the optimal model parameters C, σ, and ε.
The figure shows the model performance (troughs and flat regions) for different combinations of the C and σ values.
The cross-validation MSE was used as the criterion to select the optimum parameter values of C, σ, and ε.
The flat regions correspond to the various possible combinations for the optimal values of C and σ. The optimal estimates of C and σ were found to be 2.5 and 0.5, respectively.
[Figure: Effect of the cost and kernel width on the error for the silver values]
[Figure: Variation of error with epsilon for the variable silver]
SVR Modeling
Once the optimum values of these parameters were determined, the next step involved the selection of an optimum value of ε. This was done by fixing C and σ at their optimum values while varying ε. This exercise was also carried out through a cross-validation study on the training dataset. The previous figure showed the variation of the mean squared error with respect to ε for the training dataset.
SVR Modeling
Following this exercise, the optimum model parameter values of C, σ, and ε for the silver values were found to be 2.5, 0.5, and 0.05, respectively. The final step involved the assessment of the model's generalization ability through the examination of the generalization error on the prediction dataset.
Data Modeling
SVM modeling was performed using R.
A grid-based approach with 10-fold cross-validation on the training dataset was employed to select the optimal model parameters C, σ, and ε.
The cross-validation MSE was used as the criterion to select the optimum parameter values.
The optimum model parameter values of C, σ, and ε for the silver values were found to be 2.5, 0.5, and 0.05.
Results
Silver values predicted for 108 observations
Model performance evaluated based on a
summary statistic, termed the skill value
skill value = abs (ME) + MAE + RMSE + (1- RSQ)

where,
ME = mean error,
MAE = mean absolute error,
RMSE = root mean squared error,
RSQ = coefficient of determination.
SKILL VALUE
This summary statistic, termed the skill value, is an
entirely subjective measurement. One can possibly
devise numerous skill measures; however, the one
proposed here is quite simple and weights the ME,
MAE, RMSE equally and applies a scaling to the
RSQ so that it is of the same order of magnitude as
the other components. It should be noted that the
lower the skill value, the better the method is.
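The skill value is simple enough to check by hand. The sketch below applies the formula to the silver statistics reported in the results table of this presentation; the function name `skill_value` is only an illustrative choice.

```python
def skill_value(me, mae, rmse, rsq):
    """skill value = abs(ME) + MAE + RMSE + (1 - RSQ); lower is better."""
    return abs(me) + mae + rmse + (1.0 - rsq)

# (ME, MAE, RMSE, RSQ) for silver, taken from the results table
reported = {
    "SVM": (0.02, 0.25, 0.48, 0.91),
    "NN": (0.08, 0.36, 0.72, 0.79),
    "OK": (0.25, 0.64, 1.04, 0.59),
}
skills = {name: round(skill_value(*stats), 2) for name, stats in reported.items()}
ranking = sorted(skills, key=skills.get)
print(skills, ranking)
```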
Results
Statistics (Silver)        SVM    NN     OK
Mean Error                 0.02   0.08   0.25
Mean Absolute Error        0.25   0.36   0.64
Root Mean Squared Error    0.48   0.72   1.04
RSQ                        0.91   0.79   0.59
Generalization performance of the models for the variable Silver
Results
Statistics (Silver)        SVM    NN     OK
Skill value                0.84   1.37   2.34
Rank                       1      2      3
Model performances based on the skill values
Results
[Figure: True vs. Predicted (NN), R² = 0.793; x-axis: True, y-axis: Predicted]
[Figure: True vs. Predicted (SVM), R² = 0.9034; x-axis: True, y-axis: Predicted]
[Figure: True vs. Predicted (OK), R² = 0.5927; x-axis: True, y-axis: Predicted]
Results: Discussion
It can be seen from the plots that the SVM method outperforms the other two methods.
To further investigate the performance of the models, the prediction error distribution plots for the OK, NN, and SVM methods were analysed.
Results
[Figure: Error distribution for the silver values (NN)]
[Figure: Error distribution for the silver values (SVM)]
[Figure: Error distribution for the silver values (OK)]
It can be noted that the error distributions of the silver values for the SVM and NN models approximate a normal distribution, whereas the distribution for the OK method has more of a lognormal shape.
Normally distributed model errors are generally preferred. Thus, the lognormal error distribution of the OK method could be seen as a disadvantage. This is particularly significant where uncertainty analysis is conducted.
Conclusions
SVM produced better estimates for the silver values
in the lode deposit.

In general, MLAs can be used for the purpose of
predictive mapping such as ore reserve estimation if
data is used sensibly. These methods are
comparatively fast if the dataset is small.

Genetic Algorithm (GA) for Data Division
Optimization technique
based on the theory of
genetics and natural
selection.

Performs reproduction,
cross-over and mutation
operations on each solution
of the successive iterations
to generate the final
solution.

Principal stages of genetic algorithms:
1. Generate initial population
2. Assess fitness value
3. Reproduce population
4. Crossover population
5. Mutate population
6. Final population
Kohonen Map for Data Division

Unsupervised Learning
technique.

Identifies the various features
existing in the data by grouping
the similar features into a
cluster.

Sampling done from these
clusters to ensure proper
representation in the subsets.

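Stages in Kohonen mapping:

Competition: the winning unit $c$ satisfies

$$\lVert x - w_c \rVert = \min_i \lVert x - w_i \rVert$$

Cooperation: a neighbourhood $N_c(t)$ is defined around the winning unit.

Weight update:

$$w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\,\bigl(x(t) - w_i(t)\bigr), & i \in N_c(t) \\ w_i(t), & i \notin N_c(t) \end{cases}$$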
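The three stages of Kohonen mapping can be sketched for a one-dimensional map as follows. The neighbourhood definition (map units within a fixed index distance of the winner), the function name `som_step`, and the numbers are assumptions for illustration only.

```python
def som_step(weights, x, learning_rate, radius):
    """One Kohonen step on a 1-D map: competition, cooperation, weight update."""
    # Competition: the winner c minimizes the Euclidean distance ||x - w_i||
    def sq_dist(w):
        return sum((xj - wj) ** 2 for xj, wj in zip(x, w))
    c = min(range(len(weights)), key=lambda i: sq_dist(weights[i]))
    updated = []
    for i, w in enumerate(weights):
        # Cooperation: the neighbourhood N_c holds units within `radius` of the winner
        if abs(i - c) <= radius:
            # Weight update: w_i(t+1) = w_i(t) + alpha(t) * (x(t) - w_i(t))
            updated.append([wj + learning_rate * (xj - wj) for xj, wj in zip(x, w)])
        else:
            updated.append(list(w))  # units outside N_c keep their weights
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
winner, new_weights = som_step(weights, x=[2.0, 2.0], learning_rate=0.5, radius=0)
print(winner, new_weights)
```

After training, samples that map to the same unit form a cluster, and the model subsets can be drawn from every cluster in proportion.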
Background (Contd.)
Split Sampling
Data split into at least two subsets:
a training subset and a prediction subset.

Similar datasets can be obtained.

Merits: good with large data.

Demerits: larger variance; data division has to be proper (otherwise it is like a model trained in English and tested in French).
Standard linear regression equation: $y = w^T x + b$

The linear case is a special case of the nonlinear regression equation $y = f(x)$
Support vector regression
Idea: we define a tube of radius ε around the regression function (ε ≥ 0).
No error is counted if y lies inside the tube or band.

Support Vector Regression
Goal is to determine the functional dependency $y = f(x)$ (linear case: $y = w^T x + b$).

- A novel loss function, termed Vapnik's linear loss function with ε-insensitivity zone, is introduced.
- An error tube of thickness ε is defined around the regression line (ε ≥ 0).

We therefore define the ε-insensitive loss functions $L_1^{\varepsilon}$ and $L_2^{\varepsilon}$:

$$L_1^{\varepsilon}(x, y, f(x)) = \lvert y - f(x) \rvert_{\varepsilon} = \max(0, \lvert y - f(x) \rvert - \varepsilon)$$

$$L_2^{\varepsilon}(x, y, f(x)) = \lvert y - f(x) \rvert_{\varepsilon}^{2} = [\max(0, \lvert y - f(x) \rvert - \varepsilon)]^{2}$$
Support Vector Regression (Contd..)

Slack variables $\xi_i$ are defined for each observation:

$$\xi_i = \max(0, \lvert y_i - f(x_i) \rvert - \varepsilon) = L_1^{\varepsilon}(x_i, y_i, f(x_i)) = \lvert y_i - f(x_i) \rvert_{\varepsilon}$$

Kernels are used to linearize the problem under conditions of non-linearity.
SVR (contd..)
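Classic quadratic optimization problem.
It can be solved using Lagrange multipliers following the Karush-Kuhn-Tucker (KKT) conditions. The new primal objective function is

$$\min L_p(w, b, \xi_i, \xi_i^*, \alpha_i, \alpha_i^*, \beta_i, \beta_i^*) = \tfrac{1}{2} w^T w + C \sum_i (\xi_i + \xi_i^*) - \sum_i (\beta_i^* \xi_i^* + \beta_i \xi_i) - \sum_i \alpha_i^* (y_i - w^T x_i - b + \varepsilon + \xi_i^*) - \sum_i \alpha_i (w^T x_i + b - y_i + \varepsilon + \xi_i)$$

- At the optimal point, the first derivatives with respect to the primal variables vanish:

$$\partial L_p / \partial w = w_o - \sum_i (\alpha_i - \alpha_i^*) x_i = 0$$
$$\partial L_p / \partial b = \sum_i (\alpha_i - \alpha_i^*) = 0$$
$$\partial L_p / \partial \xi_i = C - \alpha_i - \beta_i = 0$$
$$\partial L_p / \partial \xi_i^* = C - \alpha_i^* - \beta_i^* = 0$$

- The complementarity conditions are:

$$\alpha_i (w^T x_i + b - y_i + \varepsilon + \xi_i) = 0$$
$$\alpha_i^* (y_i - w^T x_i - b + \varepsilon + \xi_i^*) = 0$$
$$\beta_i \xi_i = (C - \alpha_i)\, \xi_i = 0$$
$$\beta_i^* \xi_i^* = (C - \alpha_i^*)\, \xi_i^* = 0$$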
SVR (contd..)
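It can be expressed in dual form, in terms of $(\alpha_i, \alpha_i^*)$, by substituting the KKT conditions:

$$\max L_d(\alpha_i, \alpha_i^*) = -\tfrac{1}{2} \sum_i \sum_j (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i^T x_j - \varepsilon \sum_i (\alpha_i + \alpha_i^*) + \sum_i (\alpha_i - \alpha_i^*)\, y_i$$

subject to

$$\sum_i (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i \le C, \quad 0 \le \alpha_i^* \le C$$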
SVR (contd..)
Optimization yields L pairs $(\alpha_i, \alpha_i^*)$, one for each training pattern.
Patterns with non-zero $\alpha_i$ or $\alpha_i^*$ are support vectors (SV).
Complexity is proportional to the number of SVs.
The best regression hyperplane is given by

$$f(x, w) = w_o^T x + b = \sum_i (\alpha_i - \alpha_i^*)\, x_i^T x + b$$

The bias b is the average of

$$b = y_i - w_o^T x_i - \varepsilon \quad \text{for } 0 < \alpha_i < C$$
$$b = y_i - w_o^T x_i + \varepsilon \quad \text{for } 0 < \alpha_i^* < C$$
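Once the dual variables are known, prediction only needs the support vectors. The sketch below evaluates the regression hyperplane for the linear case; the dual solution shown is hypothetical (it is not the solution of any real optimization) and `svr_predict` is an illustrative name.

```python
def svr_predict(x, train_x, alpha, alpha_star, b):
    """f(x) = sum_i (alpha_i - alpha_i*) * x_i'x + b, for the linear case."""
    total = b
    for x_i, a, a_star in zip(train_x, alpha, alpha_star):
        coef = a - a_star
        if coef == 0.0:
            continue  # not a support vector: contributes nothing
        total += coef * sum(u * v for u, v in zip(x_i, x))
    return total

# Hypothetical dual solution; note that sum(alpha_i - alpha_i*) = 0 as required
train_x = [[1.0], [2.0], [3.0]]
alpha = [0.0, 0.0, 0.5]
alpha_star = [0.5, 0.0, 0.0]
prediction = svr_predict([2.0], train_x, alpha, alpha_star, b=0.25)
print(prediction)
```

Here the second training pattern has both multipliers at zero, so it plays no part in the prediction, which illustrates why complexity scales with the number of support vectors.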
SVR (contd..)
In nonlinear SVR, kernels are used (such as polynomial or RBF).

Same concept as in the linear case.

Parameters of the model are:

- C (penalty parameter), ε (error tube thickness), σ (RBF kernel width, when used).

- Optimal parameters can be selected by cross-validation techniques.

Basic kernels for vectorial data:
Linear kernel: $K(x_i, x_j) = x_i' x_j$
(feature space is Q-dimensional if Q is the dimension of x; the map is the identity)
RBF kernel: $K(x_i, x_j) = \exp\left(-\dfrac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$
(feature space is infinite-dimensional)
Polynomial kernel of degree two: $K(x_i, x_j) = (x_i' x_j)^2$
(feature space is d(d+1)/2-dimensional if d is the dimension of x)
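The three kernels are short enough to write out directly. The sketch below also checks the dimensionality remark for the degree-two polynomial kernel with d = 2, using the explicit feature map (x1², √2·x1·x2, x2²), which has d(d+1)/2 = 3 components; the function names are illustrative.

```python
import math

def linear_kernel(xi, xj):
    return sum(a * b for a, b in zip(xi, xj))

def rbf_kernel(xi, xj, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def poly2_kernel(xi, xj):
    return linear_kernel(xi, xj) ** 2

def phi2(x):
    """Explicit degree-two feature map for a 2-D input (3 components)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), poly2_kernel(u, v), rbf_kernel(u, v, sigma=0.5))
# The kernel equals an ordinary dot product in the mapped feature space
print(linear_kernel(phi2(u), phi2(v)))
```

The kernel gives the same value as the dot product of the mapped vectors without ever forming them, which is what makes the infinite-dimensional RBF case usable.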
K-fold Cross-validation
Learning typically involves training and testing the model. This can be done in two
approaches-
(1) Split sample approach
(2) K-fold cross-validation approach.
K-fold cross validation approach is typically the best method of developing a
learning model under conditions of sparse data (Hastie et al., 2001).

Background (Contd.)
K-fold Cross-validation
Data are split into K roughly equal-sized parts; for example, with K = 5:

Train | Test | Train | Train | Train

The model is fitted using K-1 parts and the error is estimated on the remaining k-th part. This is repeated for k = 1, 2, ..., 5, and the prediction errors are combined to estimate the model error.
Merits: can be good under conditions of data sparseness.
Demerits: more training time, imprecise way to measure the accuracy, model data subsets may not be similar.
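The procedure can be sketched generically. In the example below the "model" is a deliberately trivial stand-in that predicts the training mean (it is not an SVM or NN), and the names `k_fold_indices`, `cross_val_mse`, `fit_mean`, and `predict_mean` are hypothetical; in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mse(xs, ys, k, fit, predict):
    """Fit on k-1 folds, test on the held-out fold, and pool the squared errors."""
    squared_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        squared_errors += [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
    return sum(squared_errors) / len(squared_errors)

# Stand-in model: always predict the mean of the training targets
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = list(range(10))
ys = [2.0] * 10
print(k_fold_indices(10, 5), cross_val_mse(xs, ys, 5, fit_mean, predict_mean))
```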
Key Steps in Modeling
Using the learning curve, estimate the number of folds needed in cross-validation.
For the SVM, select a kernel and estimate the optimal values of sigma (width of the kernel) and the cost parameter using a grid search.
Using the optimal cost and sigma values, train the model and validate it using the k-fold cross-validation approach.
Model Development
For NN modeling a wardnet architecture with
three slabs in the hidden layer was used.
The slabs in the hidden layer had 8, 6, 8
neurons.
For SVM modeling a grid based approach
with 10 fold cross validation was used.

Background (contd.)
Model evaluation is necessary.
Split sampling (or holdout method)
Cross validation (k-cross, leave-one-out)
Model learns based on training data subset.
Performance evaluation is done based on
generalization on validation data subset.
For reliable performance evaluation, validation
subset of the data should have similar statistical
properties as the training data.

NN Modeling
Various network architectures with different numbers of hidden
layers and neurons in each layer were investigated prior to the
selection of an architecture with 9 hidden neurons.
The purpose behind this modelling exercise was to ensure that
the model is neither over-fitted nor under-fitted.
Over-fitting of an NN model is a condition that arises when there are too many neurons in the hidden layer, as a result of which the network performs exceptionally well on the training dataset but doesn't generalize well. Under-fitting, on the other hand, is a condition arising from too few neurons, in which the network shows both high training error and high generalization error.
NN Modeling
The three slabs in the hidden layer use three
different activation functions viz. tanh, gaussian and
complementary gaussian whereas the output layer
slab uses a linear activation function. The concept
behind using different combinations of the activation
functions is to identify various patterns in the
dataset. A particular activation function may be more
suitable for a few typical patterns; however, it may
not work at all for other patterns. Thus, the use of
different activation functions ensures that at least
some of the underlying trends in the data are
captured.
NN Modeling
For example, a gaussian activation function in one
hidden slab may detect features in the mid-range of
the data while a gaussian complement activation
function in another hidden slab may detect features
from the upper and the lower extremes of the data.
Similarly, a tanh activation function will tend to group
together data at the low and the high ends of the
original data range. This may be helpful in reducing
the effects of outliers. Implementation of these
features in the output layer may result in better
predictions.
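The complementary behaviour of the three hidden-slab activations can be seen numerically. The slides do not give the functional forms; the sketch assumes the commonly used definitions exp(-x²) for the Gaussian and 1 - exp(-x²) for the Gaussian complement.

```python
import math

def tanh_act(x):
    return math.tanh(x)

def gaussian(x):
    # Assumed form: peaks at 1 in the mid-range (x = 0), decays toward the extremes
    return math.exp(-x * x)

def gaussian_complement(x):
    # Assumed form: near 0 in the mid-range, approaches 1 at both extremes
    return 1.0 - math.exp(-x * x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(tanh_act(x), 3), round(gaussian(x), 3), round(gaussian_complement(x), 3))
```

The Gaussian responds most strongly to mid-range inputs, its complement to the two extremes, and tanh saturates toward -1 and +1, compressing the low and high ends of the range.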
Results
Optimum model parameter values of C, σ, and ε for the silver values were found to be 2.5, 0.5, and 0.05.
[Figure: Effect of the cost and kernel width on the error for the silver values]
[Figure: Variation of error with epsilon for the variable silver]
