ESCOLA DE ENGENHARIA
PORTO ALEGRE, RS
2017
Contents
1. INTRODUCTION
2. OBJECTIVE
3. REVIEW
4. DISCUSSION
5. REFERENCES
1. INTRODUCTION
The goal of machine learning is building systems that can adapt to their
environments and learn from data or recognition patterns. This science field attracted
physics, neuroscience and cognitive science (Alpaydin, 2004). The process of learning
given problem, but is need information of data or experience (Alpaydin, 2004). Because
of that, statistical tools are the main tool for machine learning proceedings. For
example, the recognition of faces and spoken speech can be recognize by machine
images or sounds to characterize and estimate the personal voice of someone. Machine
regression in geostatistics. For example the machine learning method as random forest
(RF) and support vector machine have been applied in spatial interpolation of
al., 2011). Machine learning algorithms can adapt to their circumstances, rather than
interesting feature for geostatistics sampling purposes since they are obtained over a
period of time, adapting the effect of information over the proceedings of mining
planning.
With the advance of computer technology, we currently have the ability to store large
amounts of data, as well as to access data over large distances through a computer
network (Alpaydin, 2004). Machine learning algorithms are commonly used by enterprises
that are interested in finding patterns in large databases, which typically grow by
gigabytes of data every day. Alternatively, it is possible to let the computer work to
find patterns in the samples without an explicit model. The common procedure to achieve
this is twofold: first, we train a model on a training dataset and optimize its
performance with a reasonable amount of data; second, we estimate the test set using
the constructed model and compare the predictions with the observed values.
In some applications the output of a machine learning algorithm is not a single
prediction but a sequence of monitored actions. In this case a single action is not
the purpose of the program; instead, the goal is a policy, a sequence of correct
actions that reaches a goal. Such learning methods are called Reinforcement Learning.
Game playing is an example of a reinforcement learning problem: a single move cannot
win a game by itself, but a sequence of good moves can.
2. OBJECTIVE
The objective of this work is to illustrate the usage of machine learning algorithms
on a sample dataset and to review common algorithms used in this field.
3. REVIEW
There are two major groups of machine learning algorithms, called classification
and regression (Müller and Guido, 2016). In classification problems the goal is to
predict the class label of a sample given an amount of previously labeled data. In
regression problems the goal is to predict a real value using patterns obtained from
training data. In other terms, classification problems deal with indicator variables
and regression problems with continuous variables. In both cases the program must be
able to find patterns in the data. Because of that, we split the data into train and
test sets to verify whether the patterns learned on the train set also hold on the
test set. If they do, we can say that the program generalizes the problem and can
estimate well on both train and test sets. However, it is possible to create a complex
model that only estimates well on the training dataset; this problem is called
overfitting. If, on the other hand, the model is too simple and performs poorly even
on the training set, the problem is called underfitting.
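As a concrete illustration of this split-and-compare procedure (a sketch only; the
one-nearest-neighbor model and the 30% split are illustrative choices, not taken from
the original text), a large gap between the train and test scores signals overfitting:

# Sketch of the train/test methodology: compare scores on both sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, Y = iris.data, iris.target

# Hold out 30% of the samples as a test set
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)

clf = KNeighborsClassifier(n_neighbors=1).fit(Xtrain, Ytrain)
print('Score of train data {}'.format(clf.score(Xtrain, Ytrain)))  # 1.0 by construction for k=1
print('Score of test data {}'.format(clf.score(Xtest, Ytest)))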
There are many metrics to evaluate machine learning algorithms. They can be
divided into classification and regression metrics (a short sketch computing them
follows the list):

1. Classification Metrics
   a. Accuracy: the ratio of correct predictions to all predictions made. This is
      the most common evaluation metric for classification problems.
   b. Confusion matrix: a table that crosses the predicted outcomes with the actual
      outcomes in the y-axis, showing how many samples of each class were assigned
      to each predicted class.

2. Regression Metrics
   a. Mean Absolute Error: the average of the absolute differences between the
      predictions and the true values.
   b. Mean Squared Error: the average of the squared differences between the
      predictions and the true values. Taking the square root of the MSE converts
      the metric back to the original units.
   c. R² (coefficient of determination): indicates how well the predictions fit
      the true values. The value ranges from 0 to 1, with 1 being the best estimate.
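The sketch below is illustrative and uses hypothetical arrays rather than values from
this work; all functions come from scikit-learn's metrics module:

# Illustrative sketch of the metrics above using sklearn.metrics
import numpy as np
from sklearn import metrics

# Classification metrics on hypothetical true/predicted labels
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print('accuracy: {}'.format(metrics.accuracy_score(y_true, y_pred)))
print('confusion matrix:\n{}'.format(metrics.confusion_matrix(y_true, y_pred)))

# Regression metrics on hypothetical true/predicted values
z_true = np.array([1.0, 2.0, 3.0, 4.0])
z_pred = np.array([1.1, 1.9, 3.2, 3.8])
print('MAE: {}'.format(metrics.mean_absolute_error(z_true, z_pred)))
print('MSE: {}'.format(metrics.mean_squared_error(z_true, z_pred)))
print('R2:  {}'.format(metrics.r2_score(z_true, z_pred)))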
The Iris dataset is a multivariate dataset created by the British scientist Ronald
Aylmer Fisher, one of the creators of modern statistical science, in his paper "The
use of multiple measurements in taxonomic problems" (1936), in which he introduced
LDA (Linear Discriminant Analysis). It contains three species of Iris flowers (Iris
Setosa, Iris Versicolour and Iris Virginica), each sample described by the length and
width of its petals and sepals.

Figure 1 – Morphology of an Iris flower by its petals and sepals (Müller and Guido, 2016)
The following code describes the Iris dataset.
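A minimal sketch of such a listing, assuming scikit-learn's bundled copy of the
dataset (its description matches the output below):

# Load the Iris dataset bundled with scikit-learn and print its description
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.DESCR)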
Output
        - Iris-Versicolour
        - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

    This is a copy of UCI ML iris datasets.
    http://archive.ics.uci.edu/ml/datasets/Iris

    The famous Iris database, first used by Sir R.A Fisher

    This is perhaps the best known database to be found in the
    pattern recognition literature. Fisher's paper is a classic in the field and
    is referenced frequently to this day. (See Duda & Hart, for example.) The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant. One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.

    References
    ----------
    - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
      Mathematical Statistics" (John Wiley, NY, 1950).
    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments". IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
      on Information Theory, May 1972, 431-433.
    - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
      conceptual clustering system finds 3 classes in the data.
    - Many, many more ...
3.5 Principal Component Analysis
Principal Component Analysis (PCA) uses linear combinations of the original features
to create a reduced number of new features that explain the information of the data
while maintaining its statistical properties. The central idea of Principal Component
Analysis is to reduce the dimensionality of a dataset while retaining as much as
possible of its variation (Jolliffe, 2002).
Figure 2 – a) Dispersion of 50 data points on two axes z1 and z2; b) Rotation of the data using PCA (Jolliffe, 2002)
Figure 2a shows data that are highly linearly correlated in two properties z1 and z2.
Using the PCA technique it is possible to transform the coordinates of the dataset,
creating new orthogonal axes that make the data uncorrelated, as a rotation of the
dataset. The PCA problem consists of finding the linear transformation that maximizes
the variance of the projected data. Consider a as a vector of linear weights and Σ the
covariance matrix of the data. It can be shown for a dataset X that Var(a'X) = a'Σa.
Using the restriction that the transform has unit norm (a'a = 1), maximizing the
variance by Lagrange multipliers leads to

(Σ − λI)a = 0        (1)

where λ is the Lagrange multiplier and I an identity matrix. This is the same problem
as obtaining the eigenvalues and eigenvectors of the covariance matrix. The
eigenvectors are ordered according to the percentage of the covariance explained by
their eigenvalues, and the first eigenvalues reproduce most of the variance of the data.
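As a quick numerical check of equation (1) (a sketch, not part of the original
listings; it assumes numpy and scikit-learn's copy of the Iris data), the eigenvalues
of the covariance matrix coincide with the variances PCA reports for its components:

# Verify equation (1): PCA component variances are the eigenvalues of the
# covariance matrix of the data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]   # eigenvalues, largest first
pca = PCA().fit(X)
print(eigvals)                     # solutions lambda of (cov - lambda*I)a = 0
print(pca.explained_variance_)     # the same values, as reported by PCA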
# Perform PCA on the Iris dataset and plot the two first components.
# (The head of this listing is missing; it is reconstructed here from the
# plot labels and from the variables reused in later listings.)
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, Y = iris.data, iris.target
pca = PCA(n_components=2).fit(X)
X_transform = pca.transform(X)

cmap = cm.get_cmap('Dark2', 3)
plt.scatter(X_transform[:, 0], X_transform[:, 1], marker='o', s=100,
            c=Y, edgecolors='k', cmap=cmap)
cbar = plt.colorbar(ticks=[0, 1, 2])
cbar.ax.set_yticklabels(['Setosa', 'Versicolor', 'Virginica'])
plt.title("Two PCA directions")
plt.xlabel("1st eigenvector")
plt.ylabel("2nd eigenvector")
plt.show()
Output
3.6 Linear Discriminant Analysis
Linear Discriminant Analysis addresses the following problem: given a set of classes
{C_1, C_2, …}, what is the analytical way (linear or not) to divide the feature space
to obtain regions for different pairs of classes? Considering only two classes to
divide, the Fisher criterion is

J(w) = (w'S_B w) / (w'S_W w)        (2)

Where:

S_B = (m_1 − m_2)(m_1 − m_2)'        (3)

S_W = Σ_{i∈{1,2}} Σ_{x∈C_i} (x − m_i)(x − m_i)'

where m_i is the average of all properties contained in class i. The intuition behind
this criterion is to maximize the between-class variance (numerator) while minimizing
the within-class variance (denominator) (Mika et al., 1999). Since J(w) is a quotient
of quadratic forms, the problem of obtaining the discriminant reduces to a generalized
eigenvalue problem:

(S_B − λS_W)w = 0        (4)

The linear discriminant analysis divides the feature space into linear separators.
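Before the full listing, a compact numerical sketch of equations (2) to (4) (an
illustration under stated assumptions: numpy/scipy and two of the Iris classes; it is
not part of the original code):

# Sketch: solve the Fisher discriminant (eq. 2-4) for two Iris classes.
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

iris = load_iris()
X0 = iris.data[iris.target == 0]   # Setosa samples
X1 = iris.data[iris.target == 1]   # Versicolour samples
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

d = (m0 - m1).reshape(-1, 1)
SB = d @ d.T                                                  # eq. 3
SW = ((X0 - m0).T @ (X0 - m0)) + ((X1 - m1).T @ (X1 - m1))    # within-class scatter

# Generalized eigenproblem (eq. 4): SB w = lambda SW w
lam, W = eigh(SB, SW)
w = W[:, -1]                       # eigenvector with the largest eigenvalue
print(w / np.linalg.norm(w))       # direction of maximum class separation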
# Perform classification with Discriminant Analysis
# (imports added so the listing runs on its own; X, Y and pca come from the
# PCA listing above)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn import metrics
import sklearn.discriminant_analysis as DA
from sklearn.model_selection import train_test_split, cross_val_score

# Perform data_set split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)

Xtrain_transform = pca.transform(Xtrain)
Xtest_transform = pca.transform(Xtest)

clf = DA.LinearDiscriminantAnalysis().fit(Xtrain_transform, Ytrain)

# predict ytest with xtest transform
ytest = clf.predict(Xtest_transform)

# Perform the Linear Discriminant analysis graph
# .............................................................................

h = .01

fig = plt.figure(figsize=(8, 6))
x_min, x_max = Xtrain_transform[:, 0].min() - h, Xtrain_transform[:, 0].max() + h
y_min, y_max = Xtrain_transform[:, 1].min() - h, Xtrain_transform[:, 1].max() + h
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cmap = cm.get_cmap('Dark2', 3)
plt.pcolormesh(xx, yy, Z, cmap=cmap)
scat = plt.scatter(Xtrain_transform[:, 0], Xtrain_transform[:, 1], marker='o',
                   s=100, c=Ytrain, edgecolors='k', cmap=cmap, alpha=1)
plt.xlim(-4, 4)
plt.ylim(-1.5, 1.5)
cbar = plt.colorbar(ticks=[0, 1, 2])
cbar.ax.set_yticklabels(['Setosa', 'Versicolor', 'Virginica'])

plt.title("Linear discriminant analysis prediction using PCA eigenvectors")
plt.xlabel("1st eigenvector")
plt.ylabel("2nd eigenvector")

plt.show()

# .............................................................................

# Define metrics of classification (true labels first, predictions second)
print("Score of train data {}".format(clf.score(Xtrain_transform, Ytrain)))
print("Score of test data {} \n".format(metrics.accuracy_score(Ytest, ytest, normalize=True)))
print("classification report:\n{}".format(metrics.classification_report(Ytest, ytest)))
print("confusion matrix: \n{}".format(metrics.confusion_matrix(Ytest, ytest)))

# .............................................................................
# Perform cross-validation of LDA

print("\n Perform cross validation of LDA \n")
# note: 'accuracy' would be the natural scorer for a classifier; 'r2' is kept
# from the original listing, consistent with the output shown below
print(cross_val_score(clf, X, Y, cv=10, scoring='r2'))
57.
58. Score of train data 0.952380952381
59. Score of test data 0.977777777778
60.
61. classification report:
62. precision recall f1-score support
63.
64. 0 1.00 1.00 1.00 17
65. 1 1.00 0.94 0.97 16
66. 2 0.92 1.00 0.96 12
67.
68. avg / total 0.98 0.98 0.98 45
69.
70. confusion matrix:
71. [[17 0 0]
72. [ 0 15 1]
73. [ 0 0 12]]
74.
75. Perform cross validation of Kneighbors
76.
77. [ 1. 1. 1. 1. 0.9 1. 0.8 1. 1. 1. ]
3.7 K-Nearest Neighbors
The nearest neighbor algorithm classifies prediction points using the data of two (or
more) different classes: according to the distance of the prediction point to the
training samples, the point receives the class of its closest sample. Figure 3 shows
the method applied to a training set with two classes.

Figure 3 – Nearest neighbor applied to a training set with two classes. Star points are predictive data; circle points are raw data. (Müller and Guido, 2016)
K nearest neighbors is an extension of nearest neighbor: it gives a prediction point
the most frequent class among its k nearest samples. Figure 4 shows an example of the
method.

Figure 4 – K nearest neighbors applied to a training set with two classes. Star points are predictive data; circle points are raw data. (Müller and Guido, 2016)
There are several kinds of distance metrics that can be used in nearest neighbor
methods; a short sketch computing some of them follows the list.
Types of distance
1) Euclidean Distance
2) City-Block Distance
3) Correlation distance
5) Chi-square Distance
6) Mahalanobis distance
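A minimal sketch computing some of these distances (it assumes scipy.spatial.distance;
the chi-square distance has no built-in function there and is omitted):

# Sketch: compute several of the distances above with scipy
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, correlation, mahalanobis
from sklearn.datasets import load_iris

X = load_iris().data
u, v = X[0], X[50]                      # a Setosa and a Versicolour sample
VI = np.linalg.inv(np.cov(X.T))         # inverse covariance for Mahalanobis

print('euclidean:   {}'.format(euclidean(u, v)))
print('city-block:  {}'.format(cityblock(u, v)))
print('correlation: {}'.format(correlation(u, v)))
print('mahalanobis: {}'.format(mahalanobis(u, v, VI)))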
The most commonly used distance measure is the Euclidean distance, and because of that
it is the usual default in implementations. The nearest neighbors method can be used
for regression too: in this procedure, the average of the values of the k nearest
samples is calculated.
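The beginning of the following listing is missing; the sketch below reconstructs what
it presumably contained, by analogy with the LDA listing and with the output further
down (the choice n_neighbors=5 comes from the printed line "Number of neighbors
choose: 5"):

# Perform classification with K nearest neighbors
# (reconstructed head; X, Y, pca, metrics, train_test_split and
# cross_val_score come from the listings above)
from sklearn.neighbors import KNeighborsClassifier

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)
Xtrain_transform = pca.transform(Xtrain)
Xtest_transform = pca.transform(Xtest)

clf = KNeighborsClassifier(n_neighbors=5).fit(Xtrain_transform, Ytrain)
ytest = clf.predict(Xtest_transform)

print('Number of neighbors choose: {}'.format(clf.n_neighbors))
print('Score of train data {}'.format(clf.score(Xtrain_transform, Ytrain)))
print('Score of test data: {}'.format(metrics.accuracy_score(Ytest, ytest)))
print('classification report:\n{}'.format(metrics.classification_report(Ytest, ytest)))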
print('confusion matrix: \n{}'.format(metrics.confusion_matrix(Ytest, ytest)))

# .............................................................................
# Perform cross-validation of Kneighbors

print("\n Perform cross validation of Kneighbors \n")
print(cross_val_score(clf, X, Y, cv=10, scoring='r2'))
Output
Number of neighbors choose: 5
Score of train data 0.961904761905
Score of test data: 1.0

classification report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        14
          1       1.00      1.00      1.00        20
          2       1.00      1.00      1.00        11

avg / total       1.00      1.00      1.00        45

confusion matrix:
[[14  0  0]
 [ 0 20  0]
 [ 0  0 11]]

Perform cross validation of Kneighbors

[ 1.   0.9  1.   1.   0.8  0.9  0.9  1.   1.   1. ]
3.8 Random Forest
The random forest approach builds several decision trees on subsets of the data and
chooses the best classification based on the frequency of the estimated labels. The
decision tree is a mathematical structure consisting of nodes and edges (Safavian and
Landgrebe, 1991). Each node corresponds to a subset of the data and each edge to a
division of the samples. Figure 5 shows an example of a decision tree: the entire
dataset corresponds to the root node and different edges connect the divisions of the
data made by different classifiers.

Figure 6 shows the process of classifying the features X1 and X2 with a decision tree.
Each feature is selected randomly and separated by a criterion.
Figure 6 – Process of linear division in decision trees (Safavian and Landgrebe, 1991)
For binary decision trees the process of splitting the data is driven by the
"goodness-of-split" shown in equation 5:

Φ(s, t) = ΔI(s, t, t_L, t_R)        (5)

where s is a candidate split at a node t of the decision tree, t_L and t_R are the
left and right descendant nodes, P_L and P_R are the proportions of samples sent by
the split to each descendant, and p_j is the proportion of samples of class j at a
node. The function φ measures how mixed the classes are at a node: it is maximal when
the classes are maximally spread and zero when the node is pure, and for that reason
it is called impurity. The common impurity functions are the Gini index (equation 6)
and the entropy (equation 7):

φ(t) = Σ_j p_j (1 − p_j)        (6)

φ(t) = − Σ_j p_j log p_j        (7)

With these, the goodness of a split is the decrease of impurity it produces:

Φ(s, t) = φ(t) − P_L φ(t_L) − P_R φ(t_R)        (8)
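A small sketch of equations (6) to (8) (illustrative only; the helper functions and
the example class proportions are hypothetical, not taken from the original text):

# Sketch of impurity (eq. 6-7) and goodness-of-split (eq. 8)
import numpy as np

def gini(p):                      # eq. 6
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))

def entropy(p):                   # eq. 7
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def goodness_of_split(p, pl, pr, PL, PR, phi=gini):   # eq. 8
    return phi(p) - PL * phi(pl) - PR * phi(pr)

# Hypothetical node with two classes 50/50, split into two purer children
print(goodness_of_split([.5, .5], [.9, .1], [.1, .9], .5, .5))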
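The beginning of the random forest listing is likewise missing; a minimal sketch of
what it presumably contained, assuming the same structure as the earlier listings
(fitting on the four raw features and n_estimators=100 are assumptions):

# Perform classification with Random Forest
# (reconstructed head; X, Y, metrics, train_test_split and cross_val_score
# come from the listings above)
from sklearn.ensemble import RandomForestClassifier

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.3)

clf = RandomForestClassifier(n_estimators=100).fit(Xtrain, Ytrain)
ytest = clf.predict(Xtest)

print('Score of train data {}'.format(clf.score(Xtrain, Ytrain)))
print('Score of test data: {}'.format(metrics.accuracy_score(Ytest, ytest)))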
print('Most important features:\n{}'.format(clf.feature_importances_))

# .............................................................................
# Perform cross-validation of Random Forest

print("\n Perform cross validation of Random Forest \n")
print(cross_val_score(clf, X, Y, cv=10, scoring='r2'))
Output
4. DISCUSSION
The Iris dataset presents 4 features (Petal Length, Petal Width, Sepal Length, Sepal
Width) used to estimate the class of flower (Iris Virginica, Versicolor, Setosa) given
taxonomic parameters. It could be shown that only two principal components are
responsible for more than 97% of the data variance.
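That figure can be checked directly (a short sketch using scikit-learn):

# Check the share of variance captured by the two first principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(load_iris().data)
print(pca.explained_variance_ratio_.sum())   # approximately 0.977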
5. REFERENCES
ALPAYDIN, E. Introduction to Machine Learning. Cambridge, MA: MIT Press, 2004.

FISHER, R. A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, v. 7, n. 2, p. 179-188, 1936.

JOLLIFFE, I. T. Principal Component Analysis. 2. ed. New York: Springer, 2002.

LI, J. et al. Application of machine learning methods to spatial interpolation of environmental variables. Environmental Modelling & Software, v. 26, n. 12, p. 1647-1659, 2011.

MIKA, S. et al. Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX. IEEE, 1999. p. 41-48.

MÜLLER, A. C.; GUIDO, S. Introduction to Machine Learning with Python. Sebastopol, CA: O'Reilly Media, 2016.

SAFAVIAN, S. R.; LANDGREBE, D. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, v. 21, n. 3, p. 660-674, 1991.