Non-Linear Regression and Input Variable Importance in Machine Learning with CART
and Random Forests
José Mira
Statistical Laboratory
May 26, 2023
• What are data mining and Machine Learning?
– Data analysis
– Model development
• Objectives
– Understanding a complex, a priori “chaotic” process from the natural or social sciences
• Description of trends and regularities
• Pattern identification
– Prediction (could be “black box”)
– Quantification of uncertainty
– josemanuel.mira@upm.es
• Introduction
• CART
• Random Forests
• Sensitivity Analysis
• A reminder of Bayesian Stats and Bayesian
CART
• Examples
Essential (“delicate”) issues
• Outliers
– Errors
– Correct data: rare events can be more relevant/interesting than “regular” ones
• Missing Values
• Do the model hypotheses hold for our data? (especially for parametric models)
Two cultures:
L. Breiman (2001), “Statistical Modeling: The Two Cultures”, Statistical Science, 16(3), 199-231.
[Diagram: the data-modeling culture relates X to Y through an assumed stochastic model (linear regression, logistic regression, …); the algorithmic-modeling culture treats the mechanism linking X to Y as unknown and approximates it with algorithms (decision trees, SVM, …).]
– Model validation: predictive accuracy
Examples
• Modelling of natural-science processes
– Meteorology
– Astronomy
– Structural engineering
– Nuclear engineering
– Complementary to mechanistic approaches: quantification of uncertainty
– Complex computer simulators
• Energy:
– Demand
– Prices
– Patterns
• Customer loyalty
• Credit scoring, credit cards
• Sales
• Telecom or banking
– Client behavior:
• Early identification of future client needs ….
Examples
• Service industry
• Fraud detection
Characteristics
– 100 variables
– One million individuals
Definitions
Data mining
“If you torture the data long enough, Nature will always confess”
Data types
• Spatial
• Time-dependent
• Documentary
• Multimedia
• WWW
• There is more than one good model
• G. E. P. Box: “All models are wrong, but some are useful”
• Tradeoff between simplicity (parsimony) and accuracy
• Dimensionality
Tools (I)–(III)
Steps
• Define objective:
– E.g. Classification of customers in a bank
• Choose the type of model:
– E.g., classification trees
• Choose algorithm:
– CART, BART, CHAID, Random Forest….
• Choose software: R, Python, Matlab, SPSS, SAS….
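For illustration, a minimal R sketch of these four steps on the iris data used later in these notes (the calls assume the rpart package; every setting shown is illustrative):

library(rpart)                                   # CART implementation in R

# Step 1, objective: classify flowers by species
# Step 2, model type: classification tree
# Step 3, algorithm: CART (as implemented by rpart)
# Step 4, software: R
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                                       # text description of the fitted tree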
CART-Based Models
The basics of tree models
• Variable of interest:
– Qualitative/categorical → classification trees
– Continuous/frequencies → regression trees
• Prediction
– Difference between estimation and prediction
Algorithms
– CART
– CHAID
– BART
– DYNATREE
Software
– R
– MATLAB
– SPSS
– GUIDE
– SAS
– Python
Ensembles of trees
– Random Forests
– Bagging
– Boosting
Main Features
1) Categorical output
– Entropy reduction: $H = -\sum_{i=1}^{Q} p_i \log(p_i)$
2) Quantitative output
– F-test
– Variance reduction
Entropy as a measure of uncertainty
Stirling’s formula: $\log(N!) \approx N \log(N) - N$
$W = -\sum_{i=1}^{Q} p_i \log(p_i)$
Entropy: the basics
With four labelled balls there are 4! possible orderings. But now let us suppose we consider the ordering B1N1B2N2 to be the same as B2N1B1N2 or B1N2B2N1, i.e., what matters is the COLOUR of the ball in a given position (first, second, third or fourth), not WHICH of the two red ones or WHICH of the two black ones it is. The number of distinguishable colour patterns is then 4!/(2!·2!) = 6.
Gini index as a measure of uncertainty
$G = 1 - \sum_{i=1}^{Q} p_i^2$
• Similar to entropy
• $\sum_i p_i^2$ is the probability of two independent extractions being equal, so G is the probability that they differ
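Both impurity measures are one-line functions; a small sketch (the function names are ours):

# H = -sum p_i log(p_i); zero proportions are skipped so 0*log(0) counts as 0
entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
# G = 1 - sum p_i^2
gini <- function(p) 1 - sum(p^2)

p <- c(0.5, 0.5)   # maximally uncertain two-class node
entropy(p)         # log(2) = 0.693...
gini(p)            # 0.5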
Gini index
We have a proportion p1 of red balls and 1 − p1 of black ones.
Four possibilities for two independent extractions: RR, RB, BR and BB, with probabilities p1², p1(1 − p1), (1 − p1)p1 and (1 − p1)². Hence G = 1 − p1² − (1 − p1)² = 2p1(1 − p1).
Stopping criteria
1) Quantitative outputs
– MSE
2) Categorical outputs
– NEV (non-explained variability)
– Deviance or RSS
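In R’s rpart these criteria surface as control parameters that stop the growth of the tree; a hedged sketch (the particular values are illustrative, not taken from the slides):

library(rpart)

ctrl <- rpart.control(minsplit = 20,   # do not try to split nodes with fewer than 20 cases
                      cp = 0.01,       # keep a split only if it improves the fit by at least cp
                      maxdepth = 5)    # limit the depth of the tree
fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)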
Example: Iris (flower) classification
Categorical variable (species):
• Iris setosa
• Iris versicolor
• Iris virginica
OBJECTIVE: finding a classification criterion for new flowers
Iris data (X1: sepal length, X2: sepal width, X3: petal length, X4: petal width; 50 flowers per species):

               Group   X1    X2    X3    X4
1. Setosa
   1             1     5.1   3.5   1.4   0.2
   2             1     4.9   3.0   1.4   0.2
   3             1     4.7   3.2   1.3   0.2
   4             1     4.6   3.1   1.5   0.2
   5             1     5.0   3.6   1.4   0.2
   :             :      :     :     :     :
   50            1     5.0   3.3   1.4   0.2
2. Versicolor
   1             2     7.0   3.2   4.7   1.4
   2             2     6.4   3.2   4.5   1.5
   3             2     6.9   3.1   4.9   1.5
   4             2     5.5   2.3   4.0   1.3
   5             2     6.5   2.8   4.6   1.5
   :             :      :     :     :     :
   50            2     5.7   2.8   4.1   1.3
3. Virginica
   1             3     6.3   3.3   6.0   2.5
   2             3     5.8   2.7   5.1   1.9
   3             3     7.1   3.0   5.9   2.1
   4             3     6.3   2.9   5.6   1.8
   5             3     6.5   3.0   5.8   2.2
   :             :      :     :     :     :
   50            3     5.9   3.0   5.1   1.8
Fitted classification tree (SPECIES):

Node 0 (root node): Iris-setosa 33.33% (50), Iris-versicolor 33.33% (50), Iris-virginica 33.33% (50); total 150 (100.00%)

Split on petal length (improvement = 0.3333):
  Node 1 (petal length <= 2.45): Iris-setosa 100.00% (50), Iris-versicolor 0.00% (0), Iris-virginica 0.00% (0); total 50 (33.33%)
  Node 2 (petal length > 2.45): Iris-setosa 0.00% (0), Iris-versicolor 50.00% (50), Iris-virginica 50.00% (50); total 100 (66.67%)

Split of Node 2 on petal width (improvement = 0.2598):
  Node 3 (petal width <= 1.75): Iris-versicolor 90.74% (49), Iris-virginica 9.26% (5); total 54 (36.00%)
  Node 4 (petal width > 1.75): Iris-versicolor 2.17% (1), Iris-virginica 97.83% (45); total 46 (30.67%)

Leaf nodes: 1, 3 and 4. The split points (2.45, 1.75) and the variables selected (petal length, petal width) are chosen by the algorithm.
A new flower (prediction): what species is it?
• Petal width = 3.83
• Sepal length = 2.36
• Sepal width = 0.32
The fitted tree above is applied to these measurements to predict the species.
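A hedged R sketch of this prediction step (the slides’ tree was produced with SPSS; on the iris data rpart recovers the same two splits, and its surrogate splits take care of the petal length, which is not given for the new flower):

library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# the new flower from the slide; petal length is missing (NA)
new_flower <- data.frame(Sepal.Length = 2.36, Sepal.Width = 0.32,
                         Petal.Length = NA_real_, Petal.Width = 3.83)
predict(fit, new_flower, type = "class")   # routed through surrogate splits where needed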
Ensembles of trees
Bagging
Random Forest
Ensembles of trees: Bagging and Random Forests
• Bagging
• Random Forests
Ensembles of trees result from resampling of the original data.
More stable and better predictors than standard CART.
Ensembles of trees: Bagging
• Bootstrap (resampling)
Random Forests: a sophistication of bagging
• For each split, random selection of a subset of inputs
• Better statistical properties, especially in terms of variance
• More difficult to interpret than standard (single-tree) CART
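A minimal sketch contrasting the two ensembles with the randomForest package: bagging is the special case in which all inputs are candidates at every split, while a random forest tries only a random subset of size mtry:

library(randomForest)
set.seed(1)

p <- ncol(iris) - 1                                        # number of input variables (4)

bag <- randomForest(Species ~ ., data = iris, mtry = p)    # bagging: all p inputs at each split
rf  <- randomForest(Species ~ ., data = iris, mtry = 2)    # random forest: 2 random inputs per split

print(bag)                                                 # both print out-of-bag (OOB) error estimates
print(rf)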
Definition of sensitivity analysis
• Local
• Global
Regression-based sensitivity indices
• Regression coefficients:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K$
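A hedged sketch of the regression-based approach in R: fit a linear model on standardized data and read the coefficients as sensitivity measures (the simulated data are ours):

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 2 * x1 + 0.5 * x2 + rnorm(n, sd = 0.1)    # x3 is inactive

# standardizing inputs and output makes the coefficients comparable
d <- data.frame(scale(cbind(y, x1, x2, x3)))
coef(lm(y ~ x1 + x2 + x3, data = d))            # standardized regression coefficients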
Regression-based sensitivity indices
• Global SA
• Based on the decomposition of the (output) variability into its different (input) sources
• Noisy or non-noisy context
– No noise term for deterministic simulation computer codes
– There is noise in real data
Concept of interaction
• The effect of a given input on the output depends on the value of the other input(s)
– Z = X + Y + 4XY
Concept of Interaction (III)
– Z = 3X² + 4Y³ (non-linear, but without interaction)
– Z = X + Y + 4XY (with interaction)
• Formulation of interaction in terms of derivatives: there is interaction when $\partial^2 Z / \partial X \partial Y \neq 0$ (0 for the first example, 4 for the second)
Interaction and non-linearity
• Newton’s law:
– F = ma
– A rigid body is subjected to a force F1 at time t1
– The same body is subjected to a force F2 at time t2
– The same body is subjected to F1 and F2 at time t3
– The acceleration at t3 is the sum of the accelerations at t1 and t2: the law is linear, with no interaction between the forces
ANOVA-based SA
Two-factor case (grand and marginal means):
$\bar{y}_{..} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij}}{IJ}, \qquad \bar{y}_{i.} = \frac{\sum_{j=1}^{J} y_{ij}}{J}, \qquad \bar{y}_{.j} = \frac{\sum_{i=1}^{I} y_{ij}}{I}$
ANOVA-based SA
Three-factor case:
$\bar{y}_{...} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} y_{ijk}}{IJK}, \qquad \bar{y}_{i..} = \frac{\sum_{j=1}^{J}\sum_{k=1}^{K} y_{ijk}}{JK}$
$\bar{y}_{.j.} = \frac{\sum_{i=1}^{I}\sum_{k=1}^{K} y_{ijk}}{IK}, \qquad \bar{y}_{..k} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ijk}}{IJ}$
ANOVA-based SA
$(\alpha\beta)_{ij} = y_{ij} - \mu - \alpha_i - \beta_j = y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..}$
ANOVA-based SA
$(\alpha\beta)_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}$
$(\alpha\gamma)_{ik} = \bar{y}_{i.k} - \bar{y}_{i..} - \bar{y}_{..k} + \bar{y}_{...}$
$(\beta\gamma)_{jk} = \bar{y}_{.jk} - \bar{y}_{.j.} - \bar{y}_{..k} + \bar{y}_{...}$
ANOVA-based SA
• Continuous formulation
• Infinite number of i and infinite number of j:
$\alpha_i = \int_{x_2} y(x_{1i}, x_2)\, dx_2, \qquad \beta_j = \int_{x_1} y(x_1, x_{2j})\, dx_1, \qquad \mu = \int_{x_1}\int_{x_2} y(x_1, x_2)\, dx_1\, dx_2$
ANOVA-based SA
Sums of squares decomposition (two-factor case):
$\sum_{i=1}^{I}\sum_{j=1}^{J}(y_{ij}-\bar{y}_{..})^2 = J\sum_{i=1}^{I}(\bar{y}_{i.}-\bar{y}_{..})^2 + I\sum_{j=1}^{J}(\bar{y}_{.j}-\bar{y}_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(y_{ij}-\bar{y}_{i.}-\bar{y}_{.j}+\bar{y}_{..})^2$
or, equivalently,
$= J\sum_{i=1}^{I}\alpha_i^2 + I\sum_{j=1}^{J}\beta_j^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(\alpha\beta)_{ij}^2$
Sums of squares decomposition
• Continuous two-factor case:
$V_T = V_E(\alpha) + V_E(\beta) + V_E(\alpha\beta)$
ANOVA-based sensitivity indices
$D_1 = \frac{V_E(\alpha)}{V_T}; \qquad D_2 = \frac{V_E(\beta)}{V_T}; \qquad D'_1 = \frac{V_E(\alpha) + V_E(\alpha\beta)}{V_T}$
($D_1$ and $D_2$ are main-effect indices; $D'_1$ is a total-effect index that also credits $x_1$ with its share of the interaction.)
Kriging model-based SA
Input variable importance with RF (II)
Input variable importance with RF (III)
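A hedged sketch of how these importance measures are obtained with the randomForest package (the object name arbol.rf mirrors the slides; the simulated regression data are ours):

library(randomForest)
set.seed(1)

x <- data.frame(X1b = runif(500, -2.5, 2.5),
                X2b = runif(500, -2.5, 2.5),
                X3b = runif(500, -2.5, 2.5))
y <- x$X1b + x$X2b + x$X3b

arbol.rf <- randomForest(x, y, importance = TRUE)   # importance = TRUE enables the permutation measure
importance(arbol.rf)                                # %IncMSE and IncNodePurity columns
varImpPlot(arbol.rf)                                # graphical display of both measures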
Bayesian Statistics
• Conceptual differences
• Definition of probability
Bayesian Statistics
• Practical differences
• Incorporation of prior information by means of the prior distribution
• Bayes’ theorem:
$p(\theta \mid X) \propto \pi(\theta)\, l(X \mid \theta)$
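A tiny numerical illustration of the theorem: a grid approximation of the posterior for a binomial probability (prior, data and grid are all our illustrative choices):

theta <- seq(0.001, 0.999, length.out = 999)        # grid over the parameter
prior <- dbeta(theta, 2, 2)                         # pi(theta): a mildly informative prior
lik   <- dbinom(7, size = 10, prob = theta)         # l(X | theta): 7 successes in 10 trials
post  <- prior * lik                                # unnormalised posterior
post  <- post / sum(post * (theta[2] - theta[1]))   # normalise so it integrates to 1

theta[which.max(post)]                              # posterior mode: (2+7-1)/(4+10-2) = 2/3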
Advantages of Bayesian Statistics
• Bayes’ Blog:
– https://markpsite.wordpress.com
• Doing Bayesian Data Analysis:
– http://doindbayesiandatanalysis.blogspot.com
• Count Bayesie
– https://www.countbayesie.com
Non-informative prior distributions
Stan
JAGS (a successor to WinBUGS)
• Implementation of MCMC
• Models have to be “programmed” by the user (not the MCMC algorithms)
BART model
• Sobol indices
• Computed on a simplified surrogate model built as a sum of trees
• Similar to the kriging sensitivity methodology
• Within the Bayesian framework
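A hedged sketch of fitting such a sum-of-trees surrogate in R (we assume the dbarts package and its bart() interface; the data-generating function mirrors the second simulation example below):

library(dbarts)
set.seed(1)

x <- matrix(runif(500 * 3, -2.5, 2.5), ncol = 3)
y <- x[, 1] + x[, 2] + x[, 3] + 3 * x[, 1] * x[, 2]

fit <- bart(x, y, keeptrees = TRUE)   # posterior draws of the sum-of-trees model
# Sobol-type indices can then be estimated from posterior predictions of the surrogate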
Simulation examples
First example: $y = x_1 + x_2 + x_3$ (x4 is an inactive input)
Input grids (R output):
> X1
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
> X2
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
> X3
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
> X4
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
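A hedged reconstruction of this experiment (the exact grid is ours; with 10 levels per input, a full factorial reproduces the 9 degrees of freedom per main effect seen in the next table):

# full factorial grid, each input treated as a factor with 10 levels
g <- expand.grid(x1 = seq(-2.5, 2.5, length.out = 10),
                 x2 = seq(-2.5, 2.5, length.out = 10),
                 x3 = seq(-2.5, 2.5, length.out = 10))
g$y <- g$x1 + g$x2 + g$x3   # purely additive model, no noise

# ANOVA decomposition of the output variability
fit <- aov(y ~ factor(x1) * factor(x2) * factor(x3), data = g)
summary(fit)                # main effects carry all the Sum Sq; interactions are exactly 0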
Results, first simulation example (I)
Response: y
              Df  Sum Sq  Mean Sq
x1b            9    8250   916.67
x2b            9    8250   916.67
x3b            9    8250   916.67
x1b:x2b       81       0     0.00
x1b:x3b       81       0     0.00
x2b:x3b       81       0     0.00
x1b:x2b:x3b  729       0     0.00
Residuals      0       0
(With 0 residual degrees of freedom no F values or p-values are reported; all the variability is explained by the three main effects, and the interactions contribute nothing.)
Results, first simulation example (II)
Random Forests:
> importance(arbol.rf)
     %IncMSE IncNodePurity
X1b   122.51       7080.11
X2b   125.36       7122.51
X3b   128.94       7218.21
(The three active inputs receive essentially equal importance.)
Second simulation example
$y = x_1 + x_2 + x_3 + 3\,x_1 x_2$
> X1
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
> X2
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
> X3
[1] -2.5 -1.5 -0.5 0.5 1.5 2.5
Results, second simulation example (I)
Response: y
              Df  Sum Sq  Mean Sq
x1b            9    8250    916.7
x2b            9    8250    916.7
x3b            9    8250    916.7
x1b:x2b       81  612562   7562.5
x1b:x3b       81       0      0.0
x2b:x3b       81       0      0.0
x1b:x2b:x3b  729       0      0.0
Residuals      0       0
(The x1:x2 interaction now dominates the variability.)
Results, second simulation example (II)
> importance(arbol.rf)
     %IncMSE IncNodePurity
X1b    73.38     218499.14
X2b    74.25     203480.14
X3b   -21.46      19781.42
(x1 and x2, which also interact, dominate; the measured importance of x3 is far smaller.)
References (I)
– Azzalini, A., and Scarpa, B. (2012), “Data Analysis and Data Mining”. Oxford University Press.
– Breiman, L. (2001), “Random Forests”, Machine Learning, 45, 5-32.
– Breiman, L. (2001), “Statistical Modeling: The Two Cultures”, Statistical Science, 16(3), 199-231.
– Chipman, H., George, E., and McCulloch, R. (2010), “BART: Bayesian Additive Regression Trees”, The Annals of Applied Statistics, 4(1), 266-298.
– Gramacy, R., and Taddy, M. (2013), “Variable selection and sensitivity analysis via dynamic trees with application to computer code performance testing”, The Annals of Applied Statistics, 7(1), 51-80.
– Grömping, U. (2009), “Variable importance assessment in regression: linear regression versus random forest”, The American Statistician, 63(4).
– Hastie, T., Tibshirani, R., and Friedman, J. (2008), “The Elements of Statistical Learning”. Springer.
– James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), “An Introduction to Statistical Learning with Applications in R”. Springer.
– Matignon, R. (2007), “Data Mining Using SAS Enterprise Miner”. Wiley.
– Oakley, J., and O’Hagan, A. (2004), “Probabilistic sensitivity analysis of complex models: a Bayesian approach”, Journal of the Royal Statistical Society, Series B, 66(3), 751-769.
– Taddy, M., Gramacy, R., and Polson, N. (2011), “Dynamic trees for learning and design”, JASA, 106(493), 109-123.
References (II)
– Verikas, A., Gelzinis, A., and Bacauskiene, M. (2011), “Mining data with random forests: a survey and results of new tests”, Pattern Recognition, 44, 330-349.
– Raschka, S. (2016), “Python Machine Learning”. Packt Publishing.
– Mueller, J., and Massaron, L. (2016), “Machine Learning for Dummies”. Wiley.
– Lewis, N. D. (2016), “Deep Learning Made Easy with R: A Gentle Introduction for Data Science”. Auscov.
– Murphy, K. (2012), “Machine Learning: A Probabilistic Perspective”. MIT Press.
– Theodoridis, S., and Koutroumbas, K. (2009), “Pattern Recognition”. Academic Press.
– Theodoridis, S. (2015), “Machine Learning: A Bayesian and Optimization Perspective”. Academic Press.
– LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep learning”, Nature, 521, 436-444.
References (III)
– Torgo, L. (2017), “Data Mining with R: Learning with Case Studies”. CRC Press.
– Lantz, B. (2019), “Machine Learning with R”. Packt.
– Ahrazem, I., Mira, J., and González, C. (2019), “Multi-Output Conditional Inference Trees Applied to the Electricity Market: Variable Importance Analysis”. Energies, 12(6), 1097; https://doi.org/10.3390/en12061097
– Ahrazem, I., Forte, J., Mira, J., and González, C. (2020), “Variable Importance Analysis in Imbalanced Datasets: A New Approach”. DOI: 10.1109/ACCESS.2020.3008416
R vs Python et al