
UPM-PhD

Non-Linear Regression and Input Variable Importance in Machine Learning with CART
and Random Forests
José Mira
Statistical Laboratory
May 26, 2023
• What are data mining and Machine Learning?
– Data analysis
– Model development
• Objectives
– Understanding a complex, a priori "chaotic" process from the natural or social sciences
• Description of trends and regularities
• Pattern identification
– Prediction (could be “black box”)
– Quantification of uncertainty
– josemanuel.mira@upm.es
• Introduction
• CART
• Random Forests
• Sensitivity Analysis
• A reminder of Bayesian statistics and Bayesian CART
• Examples

Essential (“delicate”) issues

• Outliers
– Errors
– Correct data: rare events could be more relevant/interesting than "regular" ones
• Missing Values
• Do the model hypotheses hold for our data? (especially for parametric models)

Two cultures:

• Data are generated by a proposed stochastic model (data-generating process)
• Unknown stochastic model: algorithmic/non-parametric approaches are used instead

L. Breiman (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Two modelling situations for the input-output relationship X → Y:

• Known model form: linear regression, logistic regression, …
• Unknown model form: decision trees, SVM, …

– Model validation: predictive accuracy

Examples
• Modelling of natural sciences processes
– Meteorology
– Astronomy!
– Structural Engineering
– Nuclear Engineering
– Complementary to mechanistic approaches: quantification of uncertainty
– Complex Computer simulators
• Energy:
– Demand
– Prices
– Patterns
• CLIENT FIDELITY
• Credit scoring, credit cards
• SALES
• Telecom or banking
– Client behavior:
• EARLY IDENTIFICATION OF FUTURE CLIENT NEEDS …

Examples

• Service industry
• FRAUD DETECTION

TURNING DATA INTO KNOWLEDGE

Characteristics

• Tools to EXTRACT and EXPLOIT information, at very different scales:

– Two variables, 100 individuals

– 100 variables, one million individuals

Definitions

• "Data mining represents the work of processing graphically or numerically large amounts or continuous streams of data with the aim of extracting information useful to those who possess them"

Azzalini & Scarpa (2012)

Data mining

• Bridging the gap between


– Statistics
– Artificial Intelligence ( Machine learning)
– Data base management

“ If you torture the data long enough, Nature
will always confess”

R.H. Coase, 1991 Nobel Laureate in Economics

Data types
• …spatial
• …time dependent
• …documentary
• …multimedia
• WWW

• There is more than one good model
• George Box: "All models are wrong, some are useful"
• Tradeoff between simplicity (parsimony) and
accuracy
• Dimensionality

Tools (I)

• Unsupervised Learning Methods:


– Principal Component Analysis (PCA)
– Factor Analysis
– Cluster Analysis
– Autoencoders
– RBM (Restricted Boltzmann Machines)
– SOM (Self Organizing Maps)

Tools (II)

• Supervised Learning Methods:


– Multiple linear regression
– Non-parametric regression
– Splines
– Additive models, GLM
– CART, Bagging and Random Forests
– Neural networks: MLP, Deep Learning
– Support Vector Machines
– Boosting

Tools (III)

• In-between Supervised and Unsupervised Learning Methods:


– Reinforcement Learning (a reward instead of a label)
– Semisupervised

Steps

• Define objective:
– E.g. Classification of customers in a bank
• Choose the type of model:
– E.g. Classification Trees
• Choose algorithm:
– CART, BART, CHAID, Random Forest….
• Choose software: R, Python, Matlab, SPSS, SAS….

CART-Based Models
The basics of tree models

• Belong to data mining
• Build models which describe/express a given variable of interest in terms of the remaining ones
• Allow for assessment of explanatory variable importance

Classification and Regression Trees (Breiman et al., 1984)

Data mining vs traditional statistics: characteristics
• Does not assume any parametric model; it is thus fully data-dependent
• The data "choose" the model (a form of "Artificial Intelligence")
• Very computer- and algorithm-intensive
• Thus increasingly competitive, given the growth in computing power and lower costs
• Interpretation: from easy to not so easy

Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231
Classification and Regression Trees

Variable of interest:

• Qualitative/categorical → Classification trees → class estimation / prediction
• Continuous/frequencies → Regression trees → quantitative output estimation / prediction
Difference between estimation and prediction

• Estimation is about parameters
• Prediction is about observables
• Example with simple regression:
Weight = b0 + b1·Height + u (random error)
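A minimal R sketch of this distinction (simulated data; the variable names and coefficient values are illustrative, not from the slides):

# Simulated data: weight = b0 + b1*height + random error (illustrative values)
set.seed(123)
height <- rnorm(100, mean = 170, sd = 10)
weight <- 5 + 0.4 * height + rnorm(100, sd = 3)

fit <- lm(weight ~ height)

coef(fit)                      # ESTIMATION: estimates of the parameters b0 and b1
predict(fit,                   # PREDICTION: predicted weight for a new observable case
        newdata = data.frame(height = 180),
        interval = "prediction")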
Advantages (I)

• Not subject to distributional assumptions


• Works very well with a large number of explanatory
variables (“predictors”)
• Capable of identifying and modelling complex
interactions (non-linearities) between inputs
Advantages (II)

• Interpretability vs traditional probabilistic models:


– To think in terms of probability is abstract and often counter-
intuitive
– The inherent logic of a tree is straightforward to understand
• Easy to build
Drawbacks

• No analytical or "closed-form" expression; it is just an algorithm


• Not robust vs outliers
Software and Algorithms

• Software:
– R
– Python
– MATLAB
– SPSS
– SAS
– GUIDE

• Algorithms:
– CART
– CHAID
– BART
– DYNATREE

• Ensembles of trees:
– Bagging
– Random Forests
– Boosting
Main Features

• Response, output or dependent variable


• Predictor, input or independent variables
• Training sample
• Validation sample
• Test sample (Future data)

STEPS:
1) Tree building (growing)
2) Tree pruning
3) Tree validation / selection
The key points

• Characteristics of the algorithms:

– Selection of predictor variables and split points: splitting rules
– Criteria to decide whether a node is terminal or the partition process continues: stopping rules
– Three types of nodes: root, internal, leaf
– Category assignment to leaf ("terminal") nodes: classification criterion
– Value of the output variable for terminal nodes: prediction criterion
Splitting criteria-I
1) Categorical output, with Q categories

Entropy reduction:
$H = -\sum_{i=1}^{Q} p_i \log(p_i)$

Gini reduction (total leaf impurity):
$G = 1 - \sum_{i=1}^{Q} p_i^2$
Splitting criteria - II

2) Quantitative output

F-test

Variance reduction
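A hedged R sketch of the variance-reduction criterion for a quantitative output (function and data are illustrative): the split point that most reduces the within-node sum of squares is preferred.

# Reduction in the sum of squares achieved by splitting at x <= s
variance_reduction <- function(x, y, s) {
  rss <- function(v) sum((v - mean(v))^2)
  rss(y) - (rss(y[x <= s]) + rss(y[x > s]))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.2)
variance_reduction(x, y, 0.5)   # large reduction: good split point
variance_reduction(x, y, 0.9)   # small reduction: poor split point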
Entropy as a measure of uncertainty

Depends only on the probabilities, not on the values of the distribution.

Interpretation: number of possible configurations of N atoms:

$W = \frac{N!}{(Np_1)!\,\cdots\,(Np_Q)!}$

Stirling's formula:
$\log(N!) \approx N\log(N) - N$

which gives
$\log(W) \approx -N\sum_{i=1}^{Q} p_i \log(p_i)$
Entropy: the basics

Suppose we have four balls, 2 red (R1, R2) and 2 black (B1, B2).

In how many ways could we order them?

In 4! ways.

But now suppose we consider that the ordering R1B1R2B2 is the same as R2B1R1B2 or R1B2R2B1, i.e., that what matters is the COLOUR of the ball in a given position (first, second, third or fourth), not WHICH of the two red ones or WHICH of the two black ones it is.

Then the number of different orderings is NOT 4! but 4!/(2!·2!) = 6.

Gini index as a measure of uncertainty

$G = 1 - \sum_{i=1}^{Q} p_i^2$

Similar to entropy.
One minus the probability that two independent extractions are equal, i.e., the probability that two independent extractions belong to different classes.
Gini index
We have proportion p1 of red balls and 1-p1 of black ones

We now draw a ball at random and I guess its colour, in


accordance with the probabilities above.

Four possibilities:

It’s red and I say it’s red p1*p1


It’s red and I say it’s black p1(1-p1)
It’s black and I say it’s red (1-p1)p1
It’s black and I say it’s black (1-p1)^2

My guess is right for the first and fourth possibilities and wrong for the second and third.

$P(\text{wrong guess}) = 1 - P(\text{right guess}) = 1 - p_1^2 - p_2^2$, with $p_2 = 1 - p_1$: the Gini index.

Stopping criteria

– One or a minimal number of observations per node
– All observations in the node are equal
– External bound on the number of tree levels

• Beware of "overfitting" … and of outliers


Overall accuracy or model assessment statistics for CART
1) Numerical outputs

MSE

2) Categorical outputs

Proportion of misclassified data


Global impurity index in final nodes (Entropy or Gini)
Number of tree levels: cost-complexity pruning
with Cross validation
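A hedged R sketch of cost-complexity pruning with cross-validation using the rpart package (data set and control values are illustrative):

library(rpart)

# Grow a deliberately large regression tree on a built-in data set
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))

printcp(fit)   # cross-validated error (xerror) for each complexity parameter cp
plotcp(fit)    # plot of xerror against cp

# Prune back to the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)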
Assigning values to terminal nodes: Prediction

• Assignment of the mean value of the output for a given terminal node.

• A linear regression model may also be built from all observations in a terminal node, with the inputs associated to the node.

• NEV: non-explained variability (deviance or RSS)
Example: Iris (flower) Classification.

Available info on four physical flower measurements (predictor variables):

• petal length and width,
• sepal length and width,

and on the species of each individual flower in the sample (categorical variable):

• IRIS SETOSA
• IRIS VERSICOLOR
• IRIS VIRGINICA

OBJECTIVE: finding a classification criterion for new flowers
Iris species (photos): Virginica, Versicolor, Setosa

Example: Iris data
(X1: sepal length, X2: sepal width, X3: petal length, X4: petal width)

                 Group   X1    X2    X3    X4
1.- Setosa
  1                1     5.1   3.5   1.4   0.2
  2                1     4.9   3.0   1.4   0.2
  3                1     4.7   3.2   1.3   0.2
  4                1     4.6   3.1   1.5   0.2
  5                1     5.0   3.6   1.4   0.2
  ...
  50               1     5.0   3.3   1.4   0.2
2.- Versicolor
  1                2     7.0   3.2   4.7   1.4
  2                2     6.4   3.2   4.5   1.5
  3                2     6.9   3.1   4.9   1.5
  4                2     5.5   2.3   4.0   1.3
  5                2     6.5   2.8   4.6   1.5
  ...
  50               2     5.7   2.8   4.1   1.3
3.- Virginica
  1                3     6.3   3.3   6.0   2.5
  2                3     5.8   2.7   5.1   1.9
  3                3     7.1   3.0   5.9   2.1
  4                3     6.3   2.9   5.6   1.8
  5                3     6.5   3.0   5.8   2.2
  ...
  50               3     5.9   3.0   5.1   1.8
SPECIES classification tree (tree diagram)

Node 0 (root node): Iris-setosa 33.33% (50), Iris-versicolor 33.33% (50), Iris-virginica 33.33% (50); total 150 (100.00%)
Split on petal length (improvement = 0.3333):
  • Petal length <= 2.45 → Node 1 (leaf): Iris-setosa 100.00% (50); total 50 (33.33%)
  • Petal length > 2.45 → Node 2: Iris-versicolor 50.00% (50), Iris-virginica 50.00% (50); total 100 (66.67%)
    Split on petal width (improvement = 0.2598):
      • Petal width <= 1.75 → Node 3 (leaf): Iris-versicolor 90.74% (49), Iris-virginica 9.26% (5); total 54 (36.00%)
      • Petal width > 1.75 → Node 4 (leaf): Iris-versicolor 2.17% (1), Iris-virginica 97.83% (45); total 46 (30.67%)
A new flower: what species is it? (prediction)
Measurements of the new flower:

• Petal length = 5.82
• Petal width = 3.83
• Sepal length = 2.36
• Sepal width = 0.32

Following the tree above: petal length = 5.82 > 2.45 (go to Node 2), then petal width = 3.83 > 1.75 (go to Node 4), so the new flower is classified as Iris virginica.
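A minimal R sketch reproducing this tree and the prediction with rpart (the split points 2.45 and 1.75 come out of the fitted tree itself; the new-flower values are those on the slide):

library(rpart)

# Classification tree for the iris data (Gini splitting is rpart's default)
tree <- rpart(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
              data = iris, method = "class")
print(tree)   # shows the splits Petal.Length < 2.45 and Petal.Width < 1.75

# Classify the new flower from the slide
new_flower <- data.frame(Petal.Length = 5.82, Petal.Width = 3.83,
                         Sepal.Length = 2.36, Sepal.Width = 0.32)
predict(tree, new_flower, type = "class")   # predicted species
predict(tree, new_flower, type = "prob")    # class proportions in the corresponding leaf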
Ensembles of trees
Bagging
Random Forest
Ensembles of trees: Bagging and Random Forests

• Bagging
• Random Forests
Ensembles of trees result from resampling of the original data.
They are more stable and better predictors than standard CART.
Ensembles of trees: Bagging

• First sophistication of CART
• Bootstrap (resampling)
• Better statistical properties, especially in terms of variance
• More difficult to interpret than standard (single-tree) CART
Ensembles of trees: Random Forests

• Sophistication of bagging
• Bootstrap (resampling)
• For each split, random selection of a subset of the inputs
• Better statistical properties, especially in terms of variance
• More difficult to interpret than standard (single-tree) CART
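A hedged R sketch contrasting bagging and a random forest with the randomForest package: bagging is the special case in which every split may consider all the inputs (mtry = number of predictors).

library(randomForest)
set.seed(1)

p <- ncol(iris) - 1   # number of predictors

# Bagging: all p inputs are candidates at every split
bag <- randomForest(Species ~ ., data = iris, mtry = p, ntree = 500)

# Random forest: a random subset of inputs at every split (default mtry = sqrt(p) for classification)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

bag   # out-of-bag (OOB) error estimate and confusion matrix
rf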
Definition of sensitivity analysis

• Given a set of input variables, what is the effect on the output of "small" changes to those inputs?

• Not necessarily a statistical problem

Problems with traditional SA

• "Myopic" (blind) to non-monotonic transformations

• Unable to handle non-uniform inputs properly


Classification of SA

• Local

• Global
Regression-based sensitivity indices

• Regression coefficient

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K$
Regression-based sensitivity indices

• Standardized regression coefficient: the input and output variables are divided by their standard deviations.

• Interpretation: when a given factor Xi is increased by one standard deviation, the output increases on average by βi standard deviations (all other inputs held equal, ceteris paribus)
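A small R sketch of standardized regression coefficients (SRCs) as sensitivity indices on simulated data (the model and coefficients are illustrative):

set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n, sd = 3); x3 <- rnorm(n)
y  <- 2 * x1 + 0.5 * x2 + rnorm(n, sd = 0.5)     # x3 is irrelevant
dat <- data.frame(y, x1, x2, x3)

# Standardize output and inputs, then refit: the coefficients are the SRCs
src <- coef(lm(y ~ ., data = as.data.frame(scale(dat))))[-1]
round(src, 2)   # |SRC| ranks the inputs; the gap between x1 and x2 is smaller than
                # the raw coefficients (2 vs 0.5) suggest, because x2 has a larger sd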
ANOVA methods

• Global SA
• Based on decomposition of (output) variability in its
different (input) sources
• Noisy or non-noisy context
• No noise term for deterministic simulation computer
codes
• There is noise in real data
Concept of interaction

• The effect on the output of a given input depends on the value of the other input(s)

• Road traffic example

• Example for continuous models


Concept of interaction (II)

• The effect on the output of a given input depends on the value of the other input(s)

• Road traffic example:
– Experiment on the A6 motorway: Villalba-Aravaca (30 km)
» Response: average speed
• Factor I: car model: Ferrari or Hyundai Atos
• Factor II: traffic: traffic jam or very low
– Does the effect of one factor on the mean value of the response depend on the value taken by the other factor?

• Examples for continuous models:

– Z = 3X² + 4Y³

– Z = X + Y + 4XY
Concept of Interaction (III)

• Examples for continuous models:

– Z = 3X² + 4Y³: ∂²Z/∂X∂Y = 0, so there is no interaction
– Z = X + Y + 4XY: ∂²Z/∂X∂Y = 4 ≠ 0, so there is interaction

• Formulation of interaction in terms of derivatives: X and Y interact when the mixed partial derivative ∂²Z/∂X∂Y is non-zero

Interaction and non-linearity

• Newton’s law:

– F=ma
– A rigid body is subjected to a force F1 at time point t1
– The same body is subjected to a force F2 at time point t2
– The same body is subjected to F1 and F2 at time point t3

ANOVA-based SA

• Definition of main effect

• Definition of second or higher order interactions


ANOVA-based SA

• Sums of squares decomposition

• Global sensitivity measures


ANOVA-based SA

• Main effects: discrete models


• Two factors/inputs:

$\alpha_i = \bar{y}_{i.} - \bar{y}_{..}, \qquad \beta_j = \bar{y}_{.j} - \bar{y}_{..}$

$\bar{y}_{..} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij}, \qquad \bar{y}_{i.} = \frac{1}{J}\sum_{j=1}^{J} y_{ij}, \qquad \bar{y}_{.j} = \frac{1}{I}\sum_{i=1}^{I} y_{ij}$
ANOVA-based SA

• Discrete 3-factor models:

$\alpha_i = \bar{y}_{i..} - \bar{y}_{...}, \qquad \beta_j = \bar{y}_{.j.} - \bar{y}_{...}, \qquad \gamma_k = \bar{y}_{..k} - \bar{y}_{...}$

$\bar{y}_{...} = \frac{1}{IJK}\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} y_{ijk}, \qquad \bar{y}_{i..} = \frac{1}{JK}\sum_{j=1}^{J}\sum_{k=1}^{K} y_{ijk}$

$\bar{y}_{.j.} = \frac{1}{IK}\sum_{i=1}^{I}\sum_{k=1}^{K} y_{ijk}, \qquad \bar{y}_{..k} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ijk}$
ANOVA-based SA

• Definition of second order interactions

$(\alpha\beta)_{ij} = \bar{y}_{ij} - (\mu + \alpha_i + \beta_j) = \bar{y}_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..}$
ANOVA-based SA

• Second order interactions for discrete models

• Second order interactions for continuous models


ANOVA-based SA

• Second order interactions with 3 factors

$(\alpha\beta)_{ij} = \bar{y}_{ij.} - (\mu + \alpha_i + \beta_j) = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}$
$(\alpha\gamma)_{ik} = \bar{y}_{i.k} - (\mu + \alpha_i + \gamma_k) = \bar{y}_{i.k} - \bar{y}_{i..} - \bar{y}_{..k} + \bar{y}_{...}$
$(\beta\gamma)_{jk} = \bar{y}_{.jk} - (\mu + \beta_j + \gamma_k) = \bar{y}_{.jk} - \bar{y}_{.j.} - \bar{y}_{..k} + \bar{y}_{...}$
ANOVA-based SA

• Continuous formulation
• Infinite number of i and infinite number of j:

$\alpha_i = \int_{x_2} y(x_{1i}, x_2)\,dx_2 - \mu$
$\beta_j = \int_{x_1} y(x_1, x_{2j})\,dx_1 - \mu$
$\mu = \int_{x_2}\int_{x_1} y(x_1, x_2)\,dx_1\,dx_2$
ANOVA-based SA

- Decomposition into sums of squares

- Global sensitivity index


Sums of squares decomposition

• Three factor-case

$V_T = V_E(\alpha) + V_E(\beta) + V_E(\gamma) + V_E(\alpha\beta) + V_E(\alpha\gamma) + V_E(\beta\gamma) + V_E(\alpha\beta\gamma) + V_{NE}$


Sums of squares decomposition

• Discrete two-factor case:

$y_{ij} - \bar{y}_{..} = (\bar{y}_{i.} - \bar{y}_{..}) + (\bar{y}_{.j} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})$

$\sum_{i=1}^{I}\sum_{j=1}^{J}(y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}(\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(\bar{y}_{.j} - \bar{y}_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^2$

$= J\sum_{i=1}^{I}\alpha_i^2 + I\sum_{j=1}^{J}\beta_j^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(\alpha\beta)_{ij}^2$
Sums of squares decomposition

• Continuous 2-factor case:

$\int_{x_2}\int_{x_1}\big(y(x_1,x_2) - \mu\big)^2\,dx_1\,dx_2 = \int_{x_1}\alpha^2(x_1)\,dx_1 + \int_{x_2}\beta^2(x_2)\,dx_2 + \int_{x_2}\int_{x_1}\big((\alpha\beta)(x_1,x_2)\big)^2\,dx_1\,dx_2$

$V_T = V_E(\alpha) + V_E(\beta) + V_E(\alpha\beta)$
ANOVA-based sensitivity indices

• Also called Sobol indices

$D_1 = \frac{V_E(\alpha)}{V_T}; \qquad D_2 = \frac{V_E(\beta)}{V_T}; \qquad D'_1 = \frac{V_E(\alpha) + V_E(\alpha\beta)}{V_T}$
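A hedged R sketch of these variance-based (Sobol) indices estimated from an ANOVA decomposition on a full factorial design (the test function y = x1 + x2 + 4*x1*x2 is illustrative):

# Full factorial design on two discretized inputs and a deterministic test function
grid <- expand.grid(x1 = seq(-2.5, 2.5, by = 1), x2 = seq(-2.5, 2.5, by = 1))
grid$y <- with(grid, x1 + x2 + 4 * x1 * x2)

# ANOVA decomposition with the inputs treated as factors (residual SS is 0: no noise)
fit <- aov(y ~ factor(x1) * factor(x2), data = grid)
tab <- summary(fit)[[1]]
ss  <- tab[, "Sum Sq"]
names(ss) <- trimws(rownames(tab))

VT <- sum(ss)
D1       <- ss["factor(x1)"] / VT                                     # first-order index of x1
D2       <- ss["factor(x2)"] / VT                                     # first-order index of x2
D1_total <- (ss["factor(x1)"] + ss["factor(x1):factor(x2)"]) / VT     # total index of x1
round(c(D1 = unname(D1), D2 = unname(D2), D1_total = unname(D1_total)), 3)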
VT
Kriging model-based SA

• Paper by Oakley and O’Hagan (2004)


• For computer codes- deterministic (non noisy) functions
• Simplified (surrogate) kriging models
• Kriging = spatial statistics interpolators
• Very intensive in analytical manipulations
Input variable importance with RF (I)

Concept of Out-of-Bag (OOB) observations:

For each resampling, the OOB observations are those from the original sample which do not appear in the resampled one.

Input variable importance with RF (II)

For each resampling i, from a total of B (B trees):
1) Obtain the OOB subsample, L_i
2) Obtain the number of correct classifications on L_i, call it c_i

For each input variable X_j:
1) Randomly permute its values in L_i
2) Obtain the number of correct classifications, call it c_ij
3) Obtain the importance measure for X_j:

$D_j = \frac{1}{B}\sum_{i=1}^{B} |c_i - c_{ij}|$

Input variable importance with RF (III)

Gini importance measure:

• For each node where X_j is used for splitting, compute the decrease in the Gini index
• Average these decreases over all such tree nodes (and over all trees of the forest)
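The built-in measures in the randomForest package give both the permutation-based and the Gini-based importances (a minimal usage sketch):

library(randomForest)
set.seed(1)

rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf)   # MeanDecreaseAccuracy (OOB permutation) and MeanDecreaseGini columns
varImpPlot(rf)   # graphical comparison of both importance measures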

Bayesian Statistics

• Alternative to frequentist approach


• Practical and conceptual differences
Bayesian Statistics

• Conceptual differences
• Definition of probability
Bayesian Statistics

• Practical differences
• Incorporation of prior info by means of prior distribution
• Bayes’s theorem

p ( | X )   ( )l ( X |  )
Advantages of Bayesian Statistics

• Incorporation of prior info


• Computational methods : MCMC and particle filters
• Prediction: more "elegant" than in the frequentist approach
References for Bayesian Statistics

• Robert, C., (2007), “The Bayesian Choice”. Springer.


• Gelman, A., Carlin, J., Stern, H.S; Dunson, D., Vehtari. A.,
Rubin, D., (2013) Bayesian Data Analysis. CRC Press.
• Kruschke, J., (2015), Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press.
• McElreath, R., (2020), Statistical Rethinking: A Bayesian
Course with Examples in R and STAN. CRC Press.
References for Bayesian Statistics (Blogs)

• Bayes’ Blog:
– https://markpsite.wordpress.com
• Doing Bayesian Data Analysis:
– http://doindbayesiandatanalysis.blogspot.com
• Count Bayesie
– https://www.countbayesie.com
Non-informative prior distributions

• Express quasi “absolute” prior ignorance


• A priori improper uniform
• Not uniform under metric transformations
• More complex and subjective than it seems
• Practical and interesting approach of Box and Tiao (1973)
• Jeffreys' invariant prior
MCMC methods

• Markov Chain Monte Carlo


• Revolution in Bayesian Statistics from seminal paper by Smith
and Gelfand (1990)
• Allows the implementation of models which to date were unfeasible
• Origin in the Metropolis algorithm
MCMC-Gibbs sampling

• The joint distribution is unknown, but the full conditionals are known:

$f(X \mid Y, Z), \quad f(Y \mid X, Z), \quad f(Z \mid X, Y)$

• By iteratively sampling from the full conditionals, convergence to the joint distribution is reached
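A minimal R sketch of a Gibbs sampler (assumed illustration, not from the slides): sampling a bivariate normal with correlation rho by alternately drawing from the two full conditionals, which are known univariate normals.

# Gibbs sampler for (X, Y) bivariate normal, zero means, unit variances, correlation rho
set.seed(1)
rho <- 0.8
n   <- 5000
x <- y <- numeric(n)

for (t in 2:n) {
  # Full conditional X | Y = y is N(rho * y, 1 - rho^2), and symmetrically for Y | X
  x[t] <- rnorm(1, mean = rho * y[t - 1], sd = sqrt(1 - rho^2))
  y[t] <- rnorm(1, mean = rho * x[t],     sd = sqrt(1 - rho^2))
}

cor(x, y)                   # close to 0.8
apply(cbind(x, y), 2, sd)   # close to 1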
Software for Bayesian Statistics

• Stan
• JAGS (in the tradition of the earlier WinBUGS)
• Implementation of MCMC
• The models have to be "programmed", not the MCMC algorithms themselves
BART model

• Sum of trees model


• Within Bayesian framework
Dynatree model

• Dynamic version of the BART model


• More computational efficiency in dynamic or recursive estimation context
• Within Bayesian framework
SA with Dynatree

• Sobol indices
• On simplified surrogate model developed as sums of trees
• Similar to kriging sensitivity methodology
• Within the Bayesian framework
Simulation examples

• Simulate a model (stochastic or deterministic)


• Carry out SA
• Significant prior knowledge of input variable
importance ranking
• Useful for better understanding of ML tools, because the input-output relationship is fully known.
Results first simulation example

y  x1  x2  x3

• Full factorial design

• X1
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
• > X2
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
• > X3
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
• > X4
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
Results first simulation example (I)

• Analysis of Variance Table

Response: y
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
x1b            9    8250   916.67
x2b            9    8250   916.67
x3b            9    8250   916.67
x1b:x2b       81       0     0.00
x1b:x3b       81       0     0.00
x2b:x3b       81       0     0.00
x1b:x2b:x3b  729       0     0.00
Residuals      0       0
Results first simulation example (II)

• RANDOM FORESTS

importance(arbol.rf)
       %IncMSE  IncNodePurity
X1b     122.51        7080.11
X2b     125.36        7122.51
X3b     128.94        7218.21
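A hedged R sketch of how results of this kind can be produced (object names such as X1b and arbol.rf follow the slides only loosely; the exact numbers will differ):

library(randomForest)
set.seed(1)

# Full factorial design on three inputs, values as on the slides
levs   <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)
design <- expand.grid(X1b = levs, X2b = levs, X3b = levs)
design$y <- with(design, X1b + X2b + X3b)    # first simulated model (additive, no interaction)

# ANOVA with the inputs as factors: only the main effects carry variability
summary(aov(y ~ factor(X1b) * factor(X2b) * factor(X3b), data = design))

# Random forest importance measures, as in importance(arbol.rf) on the slide
arbol.rf <- randomForest(y ~ X1b + X2b + X3b, data = design, importance = TRUE)
importance(arbol.rf)   # %IncMSE and IncNodePurity, roughly equal for the three inputs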
Second simulation example

y  x1  x2  x3  3 x1 x2

• Full factorial design

• X1
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
• > X2
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
• > X3
• [1] -2.5 -1.5 -0.5 0.5 1.5 2.5
Results second simulation example (I)

• Analysis of Variance Table

Response: y
              Df   Sum Sq  Mean Sq
x1b            9     8250    916.7
x2b            9     8250    916.7
x3b            9     8250    916.7
x1b:x2b       81   612562   7562.5
x1b:x3b       81        0      0.0
x2b:x3b       81        0      0.0
x1b:x2b:x3b  729        0      0.0
Residuals      0        0
Results second simulation example (II)
importance(arbol.rf)
       %IncMSE  IncNodePurity
X1b      73.38      218499.14
X2b      74.25      203480.14
X3b     -21.46       19781.42
References (I)
– Azzalini, A., and Scarpa, B., (2012), "Data Analysis and Data Mining". Oxford University Press.
– Breiman, L., (2001), "Random Forests", Machine Learning, 45, 5-32.
– Breiman, L., (2001), "Statistical Modeling: The Two Cultures", Statistical Science, 16(3), 199-231.
– Chipman, H., George, E., and McCulloch, R., (2010), "BART: Bayesian Additive Regression Trees", The Annals of Applied Statistics, 4(1), 266-298.
– Gramacy, R., and Taddy, M., (2013), "Variable selection and sensitivity analysis via dynamic trees with application to computer code performance testing", The Annals of Applied Statistics, 7(1), 51-80.
– Grömping, U., (2009), "Variable importance assessment in regression: linear regression versus Random Forest", The American Statistician, 63(4).
– Hastie, T., Tibshirani, R., and Friedman, J., (2008), "The Elements of Statistical Learning". Springer.
– James, G., Witten, D., Hastie, T., and Tibshirani, R., (2013), "An Introduction to Statistical Learning with Applications in R". Springer.
– Matignon, R., (2007), "Data Mining Using SAS Enterprise Miner". Wiley.
– Oakley, J., and O'Hagan, A., (2004), "Probabilistic sensitivity analysis of complex models: a Bayesian approach". Journal of the Royal Statistical Society, Series B, 66(3), 751-769.
– Taddy, M., Gramacy, R., and Polson, N., (2011), "Dynamic trees for learning and design", JASA, 106(493), 109-123.
References (II)

– Verikas, A., Gelzinis, A., and Bacauskiene, M., (2011), "Mining data with Random Forests: a survey and results of new tests", Pattern Recognition, 44, 330-349.
– Raschka, S., (2016), "Python Machine Learning". Packt Publishing.
– Mueller, J., and Massaron, L., (2016), "Machine Learning for Dummies". Wiley.
– Lewis, (2016), "Deep Learning Made Easy with R: A Gentle Introduction for Data Science". AusCov.
– Murphy, K., (2012), "Machine Learning: A Probabilistic Perspective". MIT Press.
– Theodoridis, S., and Koutroumbas, K., (2009), "Pattern Recognition". Academic Press.
– Theodoridis, S., (2015), "Machine Learning: A Bayesian and Optimization Perspective". Academic Press.
– LeCun, Y., Bengio, Y., and Hinton, G., (2015), "Deep learning", Nature, 521, 436-444.
References (III)

– Torgo, L., (2017), "Data Mining with R: Learning with Case Studies". CRC Press.
– Lantz, B., (2019), "Machine Learning with R". Packt.
– Ahrazem, I., Mira, J., and González, C., (2019), "Multi-Output Conditional Inference Trees Applied to the Electricity Market: Variable Importance Analysis". Energies, 12(6), 1097; https://doi.org/10.3390/en12061097
– Ahrazem, I., Forte, J., Mira, J., and González, C., (2020), "Variable Importance Analysis in Imbalanced Datasets: A New Approach". DOI: 10.1109/ACCESS.2020.3008416
R vs Python et al
