
MINISTÉRIO DA EDUCAÇÃO

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL

ESCOLA DE ENGENHARIA

Programa de Pós-Graduação em Engenharia de Minas,

Metalúrgica e de Materiais (PPGEM)

INTRODUCTION TO MACHINE LEARNING

DAVID ALVARENGA DRUMOND

PORTO ALEGRE, RS

2017

Contents

1. INTRODUCTION ................................................................................................. 3

2. OBJECTIVE .......................................................................................................... 5

3. REVIEW ............................................................................................................... 5

3.1 Classification and Regression ......................................................................... 5

3.2 Generalization, Overfitting and Underfitting ................................................... 5

3.3 Metrics to evaluate machine learning algorithms .............................. 6

3.4 The Data Set ................................................................................................... 7

3.5 Principal Component Analysis ................................................ 10

3.6 Linear Discriminant Analysis........................................................................ 13

3.7 KNeighbors .................................................................................................. 15

3.8 Random Forest.............................................................................. 20

4. DISCUSSION ...................................................................................... 23

5. REFERENCES .................................................................................... 24

1. INTRODUCTION

The goal of machine learning is to build systems that can adapt to their environment and learn from data or recognized patterns. The field has attracted researchers from many areas, including computer science, engineering, mathematics, physics, neuroscience and cognitive science (Alpaydin, 2004). Learning is necessary in cases where we cannot directly write a computer program to solve a given problem, but instead need information extracted from data or experience (Alpaydin, 2004). For this reason, statistical tools are the main tools of machine learning procedures. For example, faces and spoken speech can be recognized by machine learning algorithms: as the training data grow, it becomes possible to derive patterns in images or sounds that characterize, for instance, the voice of a particular person. Machine learning is also of considerable importance for geological classification and regression in geostatistics. For example, methods such as random forests (RF) and support vector machines have been applied to the spatial interpolation of environmental variables in support of geometallurgy and environmental studies (Li et al., 2011). Machine learning algorithms can adapt to their circumstances, rather than requiring an explicitly written program for each special circumstance. This is an attractive feature for geostatistical sampling, since samples are obtained over a period of time and the algorithms can incorporate the effect of new information into mine planning procedures.

Most machine learning algorithms require a large amount of labeled data; these methodologies are called supervised learning. With advances in computer technology we are able to store large amounts of data and to access data over long distances through computer networks (Alpaydin, 2004). Machine learning algorithms are commonly used by enterprises interested in finding patterns in large databases, which typically grow by gigabytes of data every day. Alternatively, the computer can be left to find patterns without labeled data or prior experience; this is called unsupervised learning and includes methods such as some neural networks and hierarchical cluster analysis (HCA).

The core task of supervised learning is to build a predictive mathematical model from samples. The common procedure is twofold: first, we train the model on a training dataset and optimize its performance with the available data; second, we use the constructed model to estimate the test set and evaluate the predictive accuracy of the procedure.

Supervised learning algorithms are commonly related to:

1. Learning associations: learning the association between different features A and B through a conditional probability P(A|B).

2. Classification: the problem of splitting the dataset into different categories and estimating the group of each new sample that is incorporated.

3. Regression: the group of problems in which real values are estimated from the dataset. Unlike classification algorithms, regression aims to estimate non-categorical data.

In some applications the output of a machine learning algorithm is not only a program but a sequence of monitored actions. In this case a single action is not the purpose of the program; the goal is a policy, a sequence of correct actions that reaches an objective. Such learning methods are called reinforcement learning. Game playing is an example of reinforcement learning, where a single move cannot win the game but a sequence of moves can achieve this goal.

2. OBJECTIVE

The objective of this work is to illustrate the usage of machine learning algorithms on a dataset and to review common algorithms used in this field.

3. REVIEW

3.1 Classification and Regression

There are two major groups of machine learning problems, called classification and regression (Müller and Guido, 2016). In classification problems the goal is to predict the class label of a sample given an amount of previously labeled data. In regression problems the goal is to predict a real value using a pattern obtained from the training data. In other terms, classification problems deal with indicator (categorical) variables and regression problems with float or real variables.

3.2 Generalization, Overfitting and Underfitting

The main objective of a machine learning algorithm is to create a program that is able to find patterns in data. Because of that, we split the data into train and test sets to verify whether the patterns found in the train set also hold in the test set. When this is the case, we say that the program generalizes the problem: it produces good estimates on both the train and test sets. However, it is possible to build a model so complex that it only performs well on the training dataset; this problem is called overfitting. If the model produces poor estimates even on the training dataset, it is underfitting.
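This check can be written in a few lines. The sketch below is illustrative and not part of the original exercise; the choice of KNeighborsClassifier, the 70/30 split and the fixed random_state are assumptions made here only to show how train and test scores are compared.

# Minimal sketch: compare train and test accuracy to diagnose over/underfitting.
# The model choice and split size are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data_set = load_iris()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_set.data, data_set.target,
                                                test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(Xtrain, Ytrain)

# A large gap (high train score, low test score) suggests overfitting;
# low scores on both sets suggest underfitting.
print("Train accuracy: {:.3f}".format(clf.score(Xtrain, Ytrain)))
print("Test accuracy:  {:.3f}".format(clf.score(Xtest, Ytest)))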

3.3 Metrics to evaluate machine learning algorithms

There are many metrics to evaluate machine learning algorithms. For supervised learning they can be divided into classification metrics and regression metrics (a short computational sketch follows the list):

1. Classification metrics

a. Classification accuracy: the number of correct predictions made as a ratio of all predictions made. This is the most common evaluation metric for classification problems.

b. Logarithmic loss: a probability-based measure used in classification. Lower values of logarithmic loss indicate better estimates. It takes into account the probability assigned to the expected classification label through the logarithm of that probability.

c. Sensitivity: the ratio of correctly predicted positive instances to the total instances of the class.

d. Specificity: the ratio of correctly predicted negative instances to the total instances of the class.

e. Confusion matrix: a table that presents the predicted classes on one axis and the true classes on the other.

2. Regression metrics

a. Mean absolute error (MAE): the mean of the absolute differences between the estimated values and the true values.

b. Mean squared error (MSE): the mean of the squared differences between the estimated values and the true values. Taking the square root of the MSE converts the metric back to the original units; this is called the root mean squared error (RMSE).

c. R2 metric: the coefficient of determination between the estimates and the true values. The value ranges from 0 to 1, with 1 indicating a perfect estimate.
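A minimal sketch of how these metrics can be computed with sklearn.metrics is given below. The label, probability and value arrays are toy data invented for illustration; sensitivity is reported through recall_score, and specificity can be read off the confusion matrix.

# Sketch: computing the listed metrics with sklearn.metrics on invented toy data.
import numpy as np
import sklearn.metrics as metrics

# Classification metrics
y_true = np.array([0, 0, 1, 1, 2, 2])                 # true class labels
y_pred = np.array([0, 0, 1, 2, 2, 2])                 # predicted class labels
p_pred = np.array([[0.9, 0.05, 0.05],                 # predicted class probabilities
                   [0.8, 0.1, 0.1],
                   [0.1, 0.7, 0.2],
                   [0.1, 0.3, 0.6],
                   [0.05, 0.05, 0.9],
                   [0.1, 0.2, 0.7]])

print("Accuracy: {}".format(metrics.accuracy_score(y_true, y_pred)))
print("Log loss: {}".format(metrics.log_loss(y_true, p_pred)))
print("Recall (sensitivity) per class: {}".format(metrics.recall_score(y_true, y_pred, average=None)))
print("Confusion matrix:\n{}".format(metrics.confusion_matrix(y_true, y_pred)))

# Regression metrics
y_real = np.array([1.0, 2.0, 3.0, 4.0])
y_est = np.array([1.1, 1.9, 3.2, 3.8])

print("MAE:  {}".format(metrics.mean_absolute_error(y_real, y_est)))
print("MSE:  {}".format(metrics.mean_squared_error(y_real, y_est)))
print("RMSE: {}".format(np.sqrt(metrics.mean_squared_error(y_real, y_est))))
print("R2:   {}".format(metrics.r2_score(y_real, y_est)))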

3.4 The Data Set

The Iris dataset is a multivariate dataset created by the British scientist Ronald Aylmer Fisher, one of the creators of modern statistics, in his paper "The use of multiple measurements in taxonomic problems" (Fisher, 1936) as an example of LDA (Linear Discriminant Analysis). It contains three species of Iris flowers (Iris setosa, Iris virginica, Iris versicolor) and the variation of their morphological characteristics: petal length, petal width, sepal length and sepal width.

Figure 1 – Morphology of an Iris flower, showing its petals and sepals (Müller and Guido, 2016)

The following code loads the Iris dataset and prints its description:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from IPython.display import display
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import sklearn.discriminant_analysis as DA
import sklearn.neighbors as neighbors
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn.metrics as metrics
from sklearn.ensemble import RandomForestRegressor
from pylab import *
from sklearn.model_selection import cross_val_score

%matplotlib inline

'''
Exercise
.................................................

student: David Alvarenga Drumond
id_number: 00249899

This is an exercise made for Introduction to Machine Learning at UFRGS
(Universidade Federal do Rio Grande do Sul)
07/12/2017

The Iris dataset was chosen from sklearn.datasets to perform basic machine
learning algorithms
'''

# Import the Iris dataset
data_set = load_iris()

# Describe the dataset
print(data_set.DESCR)

Output

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A. Fisher

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

3.5 Principal Component Analysis

Dimensionality reduction is one of the most important techniques applied in machine learning. It consists of decomposing the features into new combinations, creating a reduced number of new features that explain the information in the data while maintaining its statistical properties. The central idea of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of intercorrelated variables while retaining as much as possible of the variation of the dataset (Jolliffe, 2002). Figure 2 shows the geometrical interpretation of the principal component analysis algorithm.

Figure 2 - a) Dispersion of 50 data points along two axes z1 and z2; b) Rotation of the data using PCA (Jolliffe, 2002)

Figure 2a shows data that are highly linearly correlated in two properties z1 and z2. Using PCA it is possible to transform the coordinates of the dataset, creating new orthogonal axes that make the data uncorrelated, as a rotation of the dataset. The PCA problem involves finding a linear transformation of the data that maximizes the variance of the transformation. Consider a a vector of linear weights and Σ the covariance matrix of the data. It can be shown for a data set X that Var(a'X) = a'Σa. Using the restriction that the transformation is normalized, a'a = 1, the principal components are found from Equation 1:

(Σ − λI) a = 0        (1)

where λ is the Lagrange multiplier and I an identity matrix. This is the same problem as obtaining the eigenvalues (λ) and eigenvectors (a) of the covariance matrix Σ. The eigenvalues are ordered according to the percentage of covariance they explain; the first eigenvalues reproduce the maximum percentage of the data covariance. Reducing the dimensionality then amounts to removing the eigenvectors that are responsible for the lowest percentages of the covariance.
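The eigendecomposition described above can be reproduced directly with NumPy. The sketch below is an illustrative check of Equation 1 on the Iris data and is separate from the report code that follows; keeping two components is an assumption made only for the example.

# Sketch of the eigendecomposition behind PCA on the Iris data (illustrative).
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)               # center the data
cov = np.cov(Xc, rowvar=False)        # covariance matrix (Sigma)

eigvals, eigvecs = np.linalg.eigh(cov)    # solve (Sigma - lambda I) a = 0
order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of variance explained by each component
print("Explained variance ratio: {}".format(eigvals / eigvals.sum()))

# Keeping only the first two eigenvectors reduces the data to two dimensions
X_2d = Xc.dot(eigvecs[:, :2])
print("Reduced shape: {}".format(X_2d.shape))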

# Use PCA to reduce dimensionality
# ..........................................................

# Collect iris properties and target
X = data_set.data
Y = data_set.target

# Print property correlations
columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
df = pd.DataFrame(X, columns=columns)
print(" Print correlation matrix of data \n")
print(df.corr())

# Scale data_set to mean 0 and standard deviation 1
scaler = StandardScaler().fit(X)
X_transform = scaler.transform(X)

# Perform Principal Component Analysis
# (note: the PCA below is fitted on the unscaled X, not on X_transform)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Precision metrics:
print("\n Explained variance ratio : {}".format(pca.explained_variance_ratio_))
print("\n Precision matrix\n{}".format(pca.get_precision()))
print("\n Components\n{}".format(pca.components_))

cmap = cm.get_cmap('Dark2', 3)

fig = plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=Y, cmap=cmap, s=100)

cbar = plt.colorbar(ticks=[0, 1, 2])
cbar.ax.set_yticklabels(['Setosa', 'Versicolor', 'Virginica'])

plt.title("Two PCA directions")
plt.xlabel("1st eigenvector")
plt.ylabel("2nd eigenvector")
plt.show()

Output

 Print correlation matrix of data

              sepal length  sepal width  petal length  petal width
sepal length      1.000000    -0.109369      0.871754     0.817954
sepal width      -0.109369     1.000000     -0.420516    -0.356544
petal length      0.871754    -0.420516      1.000000     0.962757
petal width       0.817954    -0.356544      0.962757     1.000000

 Explained variance ratio : [ 0.92461621  0.05301557]

 Precision matrix
[[ 10.38525514  -6.82204502  -4.20572606  -1.7510368 ]
 [ -6.82204502  11.21577598   3.34261      1.41240052]
 [ -4.20572606   3.34261      4.90734692  -6.14476038]
 [ -1.7510368    1.41240052  -6.14476038  16.99268128]]

 Components
[[ 0.36158968 -0.08226889  0.85657211  0.35884393]
 [ 0.65653988  0.72971237 -0.1757674  -0.07470647]]

3.6 Linear Discriminant Analysis

Discriminant analysis tries to address the following question: given a dataset X = {x_1, x_2, ..., x_n}, what is the analytical way (linear or not) to divide the space into regions for two paired classes. Considering only two classes, the Fisher discriminant analysis tries to maximize Equation 2:

J(w) = (w' S_B w) / (w' S_W w)        (2)

where

S_B = (m_1 − m_2)(m_1 − m_2)'        (3)

S_W = Σ_{i=1,2} Σ_{x ∈ C_i} (x − m_i)(x − m_i)'

and m_i is the vector of averages of all properties contained in class i. The intuition behind maximizing J(w) is to find a direction that maximizes the projected separation of the class means (numerator) while minimizing the within-class variance (denominator) (Mika et al., 1999). For multiclass problems, S_B and S_W can be described as matrices of the inner relations of each class. Using the quadratic formulation, obtaining the discriminant directions is the same problem as obtaining the eigenvectors and eigenvalues of Equation 4:

(S_B − λ S_W) w = 0        (4)

Linear discriminant analysis divides the feature space with linear separators. For non-linear relationships, discriminant analysis can be used with kernel functions to create different divisions of the space.
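For the two-class case, the direction w that maximizes J(w) can be written in closed form as w proportional to S_W^-1 (m_1 − m_2). The short sketch below illustrates this on two of the Iris classes; the class choice (Versicolor and Virginica) is an assumption made for the example, and the report itself uses sklearn's LinearDiscriminantAnalysis in the block that follows.

# Sketch of the two-class Fisher discriminant direction (illustrative only).
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, Y = data.data, data.target

# Use two of the three Iris classes (Versicolor = 1, Virginica = 2)
X1, X2 = X[Y == 1], X[Y == 2]
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter Sw = sum over classes and samples of (x - m)(x - m)'
Sw = (X1 - m1).T.dot(X1 - m1) + (X2 - m2).T.dot(X2 - m2)

# Direction maximizing J(w): w proportional to Sw^-1 (m1 - m2)
w = np.linalg.solve(Sw, m1 - m2)

# Projecting both classes on w separates their means relative to the
# within-class variance
print("Projected class means: {:.3f} vs {:.3f}".format(X1.dot(w).mean(), X2.dot(w).mean()))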

# Perform classification with Linear Discriminant Analysis

# Perform data_set split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)

Xtrain_transform = pca.transform(Xtrain)
Xtest_transform = pca.transform(Xtest)

clf = DA.LinearDiscriminantAnalysis().fit(Xtrain_transform, Ytrain)

# Predict ytest with the transformed Xtest
ytest = clf.predict(Xtest_transform)

# Plot the Linear Discriminant Analysis prediction graph
# ................................................................................

h = .01

fig = plt.figure(figsize=(8, 6))
x_min, x_max = Xtrain_transform[:, 0].min() - h, Xtrain_transform[:, 0].max() + h
y_min, y_max = Xtrain_transform[:, 1].min() - h, Xtrain_transform[:, 1].max() + h
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cmap = cm.get_cmap('Dark2', 3)
plt.pcolormesh(xx, yy, Z, cmap=cmap)
scat = plt.scatter(Xtrain_transform[:, 0], Xtrain_transform[:, 1], marker='o', s=100,
                   c=Ytrain, edgecolors='k', cmap=cmap, alpha=1)
plt.xlim(-4, 4)
plt.ylim(-1.5, 1.5)
cbar = plt.colorbar(ticks=[0, 1, 2])
cbar.ax.set_yticklabels(['Setosa', 'Versicolor', 'Virginica'])

plt.title("Linear discriminant analysis prediction using PCA eigenvectors")
plt.xlabel("1st eigenvector")
plt.ylabel("2nd eigenvector")

plt.show()

# ................................................................................

# Define metrics of classification
print("Score of train data {}".format(clf.score(Xtrain_transform, Ytrain)))
print("Score of test data {} \n".format(metrics.accuracy_score(ytest, Ytest, normalize=True)))
print("classification report:\n{}".format(metrics.classification_report(ytest, Ytest)))
print("confusion matrix: \n{}".format(metrics.confusion_matrix(ytest, Ytest)))

# ................................................................................
# Perform cross-validation of LDA

print("\n Perform cross validation of LDA \n")
print(cross_val_score(clf, X, Y, cv=10, scoring='r2'))

Output

Score of train data 0.952380952381
Score of test data 0.977777777778

classification report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        17
          1       1.00      0.94      0.97        16
          2       0.92      1.00      0.96        12

avg / total       0.98      0.98      0.98        45

confusion matrix:
[[17  0  0]
 [ 0 15  1]
 [ 0  0 12]]

 Perform cross validation of LDA

[ 1.   1.   1.   1.   0.9  1.   0.8  1.   1.   1. ]

3.7 KNeighbors

The KNeighbors algorithm classifies samples according to their proximity to the training set. The algorithm is very sensitive to the data, which makes the learning extremely dependent on the data configuration. Figure 3 shows the prediction of three points using data from two different classes. The class assigned to each prediction point is chosen according to its distance to the existing classes.

Figure 3 - Nearest neighbor applied to a training set with two classes. Star points are the prediction data; circle points are the raw data (Müller and Guido, 2016)

K-nearest neighbors is an extension of the nearest neighbor rule: it gives a prediction point the most frequent class among its k nearest samples. Figure 4 shows the same prediction points using the 3 nearest samples.

Figure 4 - K-nearest neighbors applied to a training set with two classes. Star points are the prediction data; circle points are the raw data (Müller and Guido, 2016)

There are several kinds of distance metrics that can be used in nearest neighbor algorithms. Some of them are shown in Table 1.

Table 1 – Types of distance measures

Types of distance

1) Euclidean distance

2) City-block distance

3) Correlation distance

4) Chi-square distance

5) Mahalanobis distance

The most commonly used distance measure is the Euclidean distance, and because of that spherical patterns can be present in the predicted values of the dataset. K-nearest neighbors can also be used for regression; in this procedure the average of the nearest samples is calculated, as sketched below.
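The regression variant mentioned above averages the target values of the nearest samples. The sketch below is illustrative: predicting petal width from the other three Iris features is an invented task, and the 'manhattan' (city-block) metric is chosen only to show that the distance measure from Table 1 can be swapped.

# Sketch: KNeighbors used for regression with an alternative distance metric.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X = load_iris().data
features, target = X[:, :3], X[:, 3]          # illustrative task: predict petal width

Xtr, Xte, ytr, yte = train_test_split(features, target, test_size=0.3, random_state=0)

# 'manhattan' is the city-block distance from Table 1; 'euclidean' is the default
reg = KNeighborsRegressor(n_neighbors=5, metric='manhattan').fit(Xtr, ytr)

# The prediction is the average of the 5 nearest training samples
print("R2 on the test set: {:.3f}".format(reg.score(Xte, yte)))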

# Perform classification with KNeighborsClassifier

# Perform data_set split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)

Xtrain_transform = pca.transform(Xtrain)
Xtest_transform = pca.transform(Xtest)

n_neighbors = 5

clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors).fit(Xtrain_transform, Ytrain)

# Predict ytest with the transformed Xtest
ytest = clf.predict(Xtest_transform)

# Plot the KNeighborsClassifier prediction graph
# ................................................................................

h = .01

fig = plt.figure(figsize=(8, 6))
x_min, x_max = Xtrain_transform[:, 0].min() - h, Xtrain_transform[:, 0].max() + h
y_min, y_max = Xtrain_transform[:, 1].min() - h, Xtrain_transform[:, 1].max() + h
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cmap = cm.get_cmap('Dark2', 3)
plt.pcolormesh(xx, yy, Z, cmap=cmap)
scat = plt.scatter(Xtrain_transform[:, 0], Xtrain_transform[:, 1], marker='o', s=100,
                   c=Ytrain, edgecolors='k', cmap=cmap, alpha=1)
plt.xlim(-4, 4)
plt.ylim(-1.5, 1.5)
cbar = plt.colorbar(ticks=[0, 1, 2])
cbar.ax.set_yticklabels(['Setosa', 'Versicolor', 'Virginica'])

plt.title("Kneighbors prediction using PCA eigenvectors")
plt.xlabel("1st eigenvector")
plt.ylabel("2nd eigenvector")

plt.show()

# ................................................................................

# Define metrics of classification
print("Number of neighbors chosen: {}".format(n_neighbors))
print("Score of train data {}".format(clf.score(Xtrain_transform, Ytrain)))
print("Score of test data: {} \n".format(metrics.accuracy_score(ytest, Ytest, normalize=True)))
print("classification report:\n{}".format(metrics.classification_report(ytest, Ytest)))
print("confusion matrix: \n{}".format(metrics.confusion_matrix(ytest, Ytest)))

# ................................................................................
# Perform cross-validation of Kneighbors

print("\n Perform cross validation of Kneighbors \n")
print(cross_val_score(clf, X, Y, cv=10, scoring='r2'))

Output

Number of neighbors chosen: 5
Score of train data 0.961904761905
Score of test data: 1.0

classification report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        14
          1       1.00      1.00      1.00        20
          2       1.00      1.00      1.00        11

avg / total       1.00      1.00      1.00        45

confusion matrix:
[[14  0  0]
 [ 0 20  0]
 [ 0  0 11]]

 Perform cross validation of Kneighbors

[ 1.   0.9  1.   1.   0.8  0.9  0.9  1.   1.   1. ]

3.8 Random Forest

The random forest approach computes several decision trees on subsets of the data and chooses the final classification based on the frequency of the labels estimated by the individual trees. A decision tree is a mathematical structure consisting of nodes and edges (Safavian and Landgrebe, 1991). Each node corresponds to a subset of the data and each edge to a division of samples. Figure 5 shows an example of a decision tree: the entire dataset corresponds to the root node and different edges connect the divisions of the data produced by different classifiers.

Figure 5 - Example of a decision tree (Safavian and Landgrebe, 1991)

Figure 6 shows the process of classifying the features X1 and X2 with a decision tree. Each feature is selected randomly and separated by a criterion.

Figure 6 - Process of linear division in decision trees (Safavian and Landgrebe, 1991)

For binary decision trees, the process of splitting the data is related to the "goodness-of-split" shown in Equation 5:

Φ(s, t) = Δi(s, t, t_L, t_R)        (5)

where s is a node of the decision tree, t is a candidate split (edge), P_L and P_R are the proportions of samples divided by the split criterion, and p_L and p_R are the class proportions in the left and right subsets. If p = (p_1, p_2, ..., p_J) is the vector of class proportions used to split the data, φ(p) is a function called impurity, which measures how spread the classification is. The most common impurity functions are the Gini index, shown in Equation 6 (Breiman, 1996),

φ(p) = Σ_j p_j (1 − p_j)        (6)

and the entropy

φ(p) = − Σ_j p_j log(p_j)        (7)

So the goodness of split can be derived as:

Δi(s, t) = φ(p) − P_L φ(p_L) − P_R φ(p_R)        (8)
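A small numerical sketch of these impurity functions and of Equation 8 is given below; the parent node and the candidate split (class proportions and the 50/50 division of samples) are invented for illustration.

# Sketch of the impurity functions and the goodness-of-split on an invented split.
import numpy as np

def gini(p):
    # phi(p) = sum_j p_j (1 - p_j)
    return np.sum(p * (1.0 - p))

def entropy(p):
    # phi(p) = - sum_j p_j log(p_j), ignoring empty classes
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Parent node: half the samples in class 0 and half in class 1
p_parent = np.array([0.5, 0.5])

# Candidate split: each child receives half the samples, with class
# proportions 80/20 on the left and 20/80 on the right
P_L, P_R = 0.5, 0.5
p_left = np.array([0.8, 0.2])
p_right = np.array([0.2, 0.8])

for name, phi in [("gini", gini), ("entropy", entropy)]:
    delta_i = phi(p_parent) - P_L * phi(p_left) - P_R * phi(p_right)
    print("Goodness of split ({}): {:.3f}".format(name, delta_i))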

# Perform classification with Random Forest

# Perform data_set split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)

Xtrain_transform = pca.transform(Xtrain)
Xtest_transform = pca.transform(Xtest)

# Note: a RandomForestRegressor is fitted here, so the predictions are
# real values rather than class labels
clf = RandomForestRegressor().fit(Xtrain_transform, Ytrain)

# Predict ytest with the transformed Xtest
ytest = clf.predict(Xtest_transform)

# Plot the Random Forest prediction graph
# ................................................................................

h = .01

fig = plt.figure(figsize=(8, 6))
x_min, x_max = Xtrain_transform[:, 0].min() - h, Xtrain_transform[:, 0].max() + h
y_min, y_max = Xtrain_transform[:, 1].min() - h, Xtrain_transform[:, 1].max() + h
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cmap = cm.get_cmap('Dark2', 3)
plt.pcolormesh(xx, yy, Z, cmap=cmap)
scat = plt.scatter(Xtrain_transform[:, 0], Xtrain_transform[:, 1], marker='o', s=100,
                   c=Ytrain, edgecolors='k', cmap=cmap, alpha=1)
plt.xlim(-4, 4)
plt.ylim(-1.5, 1.5)
cbar = plt.colorbar(ticks=[0, 1, 2])
cbar.ax.set_yticklabels(['Setosa', 'Versicolor', 'Virginica'])

plt.title("Random Forest prediction using PCA eigenvectors")
plt.xlabel("1st eigenvector")
plt.ylabel("2nd eigenvector")

plt.show()

# ................................................................................

# Define metrics of classification
print("Most important features:\n{}".format(clf.feature_importances_))

# ................................................................................
# Perform cross-validation of Random Forest

print("\n Perform cross validation of Random Forest \n")
print(cross_val_score(clf, X, Y, cv=10, scoring='r2'))

Output

Most important features:
[ 0.97708108  0.02291892]

 Perform cross validation of Random Forest

[ 1.     1.     1.     0.94   0.     0.     0.994  0.     0.     0.   ]
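Since this section treats the task as classification, a RandomForestClassifier evaluated with accuracy is the more usual setup. The sketch below shows this alternative under the assumptions of 100 trees and a fixed random_state; it is not the configuration used in the report's own code above, which fits a RandomForestRegressor and scores with r2.

# Alternative sketch: the same task treated explicitly as classification.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validated accuracy on the full Iris data
scores = cross_val_score(clf, data.data, data.target, cv=10, scoring='accuracy')
print("Mean 10-fold accuracy: {:.3f}".format(scores.mean()))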

4. DISCUSSION

The Iris dataset presents four features (petal length, petal width, sepal length, sepal width) used to estimate the class of flower (Iris virginica, Iris versicolor, Iris setosa) from taxonomic parameters. It was shown that only two principal components are responsible for more than 97% of the data variance.

5. REFERENCES

Alpaydin, E., 2004. Introduction to Machine Learning. MIT Press.


Breiman, L., 1996. Some properties of splitting criteria. Mach. Learn. 24, 41–47.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann.
Hum. Genet. 7, 179–188.
Jolliffe, I.T., 2002. Principal component analysis and factor analysis. Princ. Compon.
Anal. 150–166.
Li, J., Heap, A.D., Potter, A., Daniell, J.J., 2011. Application of machine learning
methods to spatial interpolation of environmental variables. Environ. Model.
Softw. 26, 1647–1659.
Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Mullers, K.-R., 1999. Fisher
discriminant analysis with kernels, in: Neural Networks for Signal Processing
IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop.
IEEE, pp. 41–48.
Müller, A.C., Guido, S., 2016. Introduction to Machine Learning with Python: A Guide
for Data Scientists. O’Reilly Media, Inc.
Safavian, S.R., Landgrebe, D., 1991. A survey of decision tree classifier methodology.
IEEE Trans. Syst. Man Cybern. 21, 660–674.

