Data Mining Lab 1
First A. Author, Fellow, IEEE, Second B. Author, and Third C. Author, Jr., Member, IEEE

Abstract—This document gives a brief introduction to some of the key concepts of decision trees, such as the Gini index, entropy, and information gain, and to the foundations of PCA, such as the covariance matrix, eigenvalues, and eigenvectors. It also describes different distance measures and similarity calculations. These concepts are implemented in the Python language using the machine learning library scikit-learn and the data analysis and visualization tools numpy, pandas, and matplotlib. They are also applied to the Iris dataset, and different graphs are visualized.

Index Terms—Enter key words or phrases in alphabetical order, separated by commas.

I. INTRODUCTION

Entropy is the measure of uncertainty of a random variable. The higher the entropy, the harder it is to draw any conclusions from that information. The entropy is 0 if all samples of a node belong to the same class, and it is maximal if we have a uniform class distribution.

The Gini index, or Gini impurity, measures the degree or probability of a particular variable being wrongly classified when it is chosen at random. The Gini index varies between 0 and 1, where 0 means all elements belong to one class and 1 means they are randomly distributed across classes; a Gini index of 0.5 denotes elements equally distributed into some classes. The classification error is a measure of the misclassification made by a node; its curve is linear in nature.

Variance is a measure of the variability or spread in a set of data. Covariance is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. High variance implies high information in a dimension, and high covariance means high correlation between attributes. A covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. When the population contains higher dimensions or more random variables, a matrix is used to describe the relationship between the dimensions: the covariance matrix defines the relationship across the entire set of dimensions as the relationships between every two random variables.

An eigenvector is a nonzero vector that changes at most by a scalar factor when a linear transformation is applied to it; it gives the spread of the data in a particular direction. Eigenvalues are a special set of scalars associated with a linear system of equations (i.e., a matrix equation), sometimes also known as characteristic roots; they give the magnitude of the spread of the data in that direction.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables. PCA is sensitive to the relative scaling of the original variables. It was invented in 1901 by Karl Pearson and is mostly used as a tool in exploratory data analysis, for making predictive models, and above all for dimensionality reduction. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or by singular value decomposition of a data matrix, usually after a normalization step on the initial data. PCA can be used to reduce the original variables to a smaller number of new variables: the principal components with more variance, or spread, are kept and those with less are rejected. To implement PCA in Python, we first find the center of the data by calculating its mean, then compute the covariance matrix using the numpy.cov function, and then compute the eigenvalues and eigenvectors using the numpy.linalg.eig function, passing the covariance matrix as a parameter. Finally, we plot the principal components using matplotlib.pyplot.quiver, which draws a 2D field of arrows; its first parameters give the location of the start of each arrow, and the following parameters are the eigenvectors, which give the arrows their direction.
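A minimal sketch of this PCA procedure on two columns of the Iris data is shown below; the choice of columns, the variable names, and the scaling of the arrows are illustrative assumptions, not the original lab code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load two possibly correlated features (petal length and petal width, an assumed choice).
iris = load_iris()
X = iris.data[:, 2:4]

# Center of the data: the mean of each column.
center = X.mean(axis=0)

# Covariance matrix of the two variables (rowvar=False: columns are the variables).
cov = np.cov(X, rowvar=False)

# Eigenvalues give the amount of spread, eigenvectors give its direction.
eig_vals, eig_vecs = np.linalg.eig(cov)

# Plot the data and draw each principal component as an arrow starting at the center,
# scaled by its eigenvalue so the longer arrow marks the direction of more variance.
plt.scatter(X[:, 0], X[:, 1], alpha=0.4)
for val, vec in zip(eig_vals, eig_vecs.T):
    plt.quiver(center[0], center[1], vec[0], vec[1],
               scale=1.0 / (3 * np.sqrt(val)), scale_units='xy', angles='xy')
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('Principal components of petal length vs width')
plt.show()
```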
fig: principal components of petal length vs width

II. METHODS & EXPERIMENTS

In order to calculate the eigenvalues, eigenvectors, and covariance matrix using the Python programming language we use modules such as statistics, scipy.linalg, and numpy. We used statistics.mean and numpy.cov to calculate the mean and covariance of two lists of data, and scipy.linalg.eig to calculate the eigenvalues and eigenvectors; this function takes a covariance matrix as a parameter and returns the eigenvalues and eigenvectors. The purpose of subtracting the mean from a dataset is to obtain a dataset whose mean is zero, which allows different datasets to be compared, centers the data, and makes it easier to calculate the spread of the data.
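A small sketch of this calculation on two example lists of data (the sample values are made up for illustration) might look like the following:

```python
import statistics
import numpy as np
from scipy import linalg

# Two small example lists of data (illustrative values, not the lab's actual data).
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7]

# Means of each list, used to center the data so its mean becomes zero.
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
centered = np.array([[xi - mean_x for xi in x],
                     [yi - mean_y for yi in y]])

# Covariance matrix of the two centered lists (rows are the variables).
cov = np.cov(centered)

# scipy.linalg.eig takes the covariance matrix and returns eigenvalues and eigenvectors.
eig_vals, eig_vecs = linalg.eig(cov)
print(cov)
print(eig_vals.real)   # eigenvalues: spread along each principal direction
print(eig_vecs)        # columns are the corresponding eigenvectors
```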


Various methods are used to calculate the distances and similarities that are used to analyze the data.

A. Euclidean distance
The Euclidean distance is the root of the squared differences between the coordinates of a pair of objects. If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance between them is
√((x2 − x1)² + (y2 − y1)²).

B. Manhattan distance
The Manhattan distance between two points is measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is
|x1 − x2| + |y1 − y2|.

C. Minkowski distance
The Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean and Manhattan distances. The Minkowski distance of order p between two variables X and Y is defined as
D(X, Y) = (Σ |Xi − Yi|^p)^(1/p).

D. Cosine Similarity
Cosine similarity is a metric used to measure how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space:
Cosine similarity = A · B / (‖A‖ ‖B‖) = Σ Ai Bi / (√(Σ Ai²) · √(Σ Bi²)), where the sums run over i = 1 … n.

E. Jaccard Similarity
The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
Jaccard sim = (the number in both sets) / (the number in either set) * 100.

In order to implement these methods to calculate distances and similarities in Python we use modules such as numpy, math, and Decimal. For the Euclidean distance, we define a function which takes lists of points and returns the value using the math.sqrt, math.pow, and sum functions. For the Manhattan distance, we define a function which takes lists of points and returns the value using the zip, abs, and sum functions. For the Minkowski distance, we define a function which takes lists of points and returns the value using the sum, math.pow, and zip functions and an nth-root helper which returns the nth root of its input value. For cosine similarity, we define a function which takes lists of points and returns the value using the math.sqrt, math.pow, zip, round, float, and sum functions and a helper which returns the square root of its input value. For Jaccard similarity, we define a function which takes lists of points and returns the value using the len, set, float, and sum functions.
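A sketch of these helper functions, following the module and function choices described above (the exact signatures and variable names are assumptions; the original notebook may differ):

```python
import math

def euclidean_distance(p, q):
    # Root of the sum of squared coordinate differences.
    return math.sqrt(sum(math.pow(a - b, 2) for a, b in zip(p, q)))

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences measured along the axes.
    return sum(abs(a - b) for a, b in zip(p, q))

def nth_root(value, n):
    # Helper returning the nth root of the input value.
    return value ** (1.0 / n)

def minkowski_distance(p, q, n=3):
    # Generalization of the Euclidean (n=2) and Manhattan (n=1) distances.
    return nth_root(sum(math.pow(abs(a - b), n) for a, b in zip(p, q)), n)

def square_root(value):
    # Helper returning the square root of the input value.
    return round(math.sqrt(value), 3)

def cosine_similarity(p, q):
    # Cosine of the angle between the two vectors.
    numerator = sum(a * b for a, b in zip(p, q))
    denominator = square_root(sum(math.pow(a, 2) for a in p)) * \
                  square_root(sum(math.pow(b, 2) for b in q))
    return round(numerator / float(denominator), 3)

def jaccard_similarity(p, q):
    # Size of the intersection divided by the size of the union of the two sets.
    intersection = len(set(p) & set(q))
    union = len(set(p) | set(q))
    return intersection / float(union)

print(euclidean_distance([0, 3], [4, 0]))      # 5.0
print(manhattan_distance([0, 3], [4, 0]))      # 7
print(minkowski_distance([0, 3], [4, 0], 3))   # ~4.498
print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 4, 5, 7, 9]))
```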
III. MATH

A. Formulas
Mathematically, variance is the average squared deviation from the mean score. We use the following formula to compute variance:

Var(X) = Σ (Xi − X̄)² / N = Σ xi² / N

where
N is the number of data points,
X̄ is the mean of the N data points,
Xi is the ith data point in the dataset,
xi is the ith deviation score in the dataset.

We use the following formula to compute covariance:

Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / N = Σ xi yi / N

where
N is the number of data points in each dataset,
X̄ is the mean of the N data points in the first set,
Xi is the ith data point in the first set of data,
xi is the ith deviation score in the first set of data,
Ȳ is the mean of the N data points in the second set,
Yi is the ith raw data point in the second set,
yi is the ith deviation score in the second set.
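As a quick check of these formulas, a short Python sketch (with made-up example values) computes both quantities directly from the deviation scores:

```python
# Compute variance and covariance directly from the deviation scores,
# matching the formulas above (the example values are illustrative only).
x = [2.1, 2.5, 3.6, 4.0]
y = [8.0, 10.0, 12.0, 14.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Deviation scores: xi = Xi - X̄, yi = Yi - Ȳ
dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]

variance_x = sum(d ** 2 for d in dev_x) / n                        # Var(X) = Σ xi² / N
covariance_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / n   # Cov(X, Y) = Σ xi yi / N

print(variance_x, covariance_xy)
```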

B. Equations
Many equations are used in this process to analyze the Iris dataset. Some of them are given below.

The equation for entropy is given by

H = − Σ pi log2(pi)

where pi is the probability of class i in our data.

The equation for Gini impurity is given by

G = 1 − Σ pi²

where pi is the probability of an object being classified to a particular class i.

The classification error is given by

Classification error (E) = 1 − max(p1, p2, …, pj)

where pj is the probability of class j.

IV. DATA

The Iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The dataset contains a set of 150 records under five attributes: petal length, petal width, sepal length, sepal width, and species.

To analyze the Iris dataset we used modules such as numpy, pandas, matplotlib.pyplot, datasets from sklearn, and seaborn. The data is loaded into a variable using the load_iris function. Then we create a dataframe of the Iris data using pandas.DataFrame, plot a pairplot of the dataframe using seaborn.pairplot, and calculate the mean of each column to get the center of the data.
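A brief sketch of this loading and inspection step (how the dataframe columns are handled is an assumption about the notebook's organization):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the Iris data and build a dataframe with the four feature columns.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Pairplot of every pair of features, colored by species, to inspect their relationships.
sns.pairplot(df, hue='species')
plt.show()

# Mean of each feature column gives the center of the data.
print(df[iris.feature_names].mean())
```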
V. GRAPHS

fig: Entropy, Gini impurity, and classification error of a binary system

The graph above is generated in Python using libraries such as numpy and matplotlib. In order to generate this curve, np.log2 is used for the logarithm with base 2, np.max to select the maximum value in a list, and np.arange to generate a list of numbers given a start number, an end number, and a step size. The plt.plot function plots the graph with parameters x (the list of values on the x-axis) and y (the list of outputs for the inputs x, on the y-axis), plt.legend is used to differentiate the different curves in the same graph, and plt.xlabel, plt.ylabel, and plt.title are used to label the x-axis, the y-axis, and the title.
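A sketch of how such a figure can be produced for a binary system, using the functions named above (the function and variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def entropy(p):
    # H = -p*log2(p) - (1-p)*log2(1-p) for a binary system.
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    # G = 1 - p^2 - (1-p)^2 for a binary system.
    return 1 - p ** 2 - (1 - p) ** 2

def classification_error(p):
    # E = 1 - max(p, 1-p); its curve is linear in p.
    return 1 - np.max([p, 1 - p])

# Probabilities of the first class, excluding 0 and 1 so that log2 stays defined.
x = np.arange(0.01, 1.0, 0.01)

plt.plot(x, entropy(x), label='Entropy')
plt.plot(x, gini(x), label='Gini impurity')
plt.plot(x, [classification_error(p) for p in x], label='Classification error')
plt.xlabel('p(class 1)')
plt.ylabel('Impurity')
plt.title('Entropy, Gini impurity, and classification error of a binary system')
plt.legend()
plt.show()
```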

VI. RESULT AND ANALYSIS

We defined our own functions to calculate the Gini impurity, entropy, and classification error and plotted each of them by generating data for a binary system, analyzing the nature of their curves using various modules in Python. We also created functions for distance and similarity metrics such as the Euclidean, Manhattan, and Minkowski distances and the cosine and Jaccard similarities. Then we loaded the Iris dataset, counted the values, and checked the relationships between the features using a pairplot. Finally, we implemented principal component analysis by first calculating the mean to find the starting location of each arrow, then calculating the covariance, the eigenvectors to find the direction of the arrows, and the eigenvalues to get the length of the arrows; the component with more variance was kept and the other was rejected.

VII. CONCLUSION

We concluded that the Gini index, classification error, and entropy are important metrics for evaluating the classification and splitting of a decision tree. We also learnt about various distance and similarity metrics, which play a vital role in the calculation of different evaluation metrics. We also found that PCA is one of the most important concepts and one of the first steps for dimensionality reduction and feature selection in data analysis. Finally, we learnt that a pairplot is one of the easiest ways to select features such that there is maximum accuracy.

VIII. APPENDIX
The code for this experiment is uploaded to GitHub and its link is provided below:
https://github.com/rj7shakya/datamining/blob/master/lab/lab1.ipynb

IX. REFERENCES

[1] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, 1936.
[2] E. Anderson, "The species problem in Iris," Annals of the Missouri Botanical Garden, 1936.
[3] E. Anderson, "The irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 1935.
[4] "UCI Machine Learning Repository: Iris Data Set."
[5] E. H. Miller, "A note on reflector arrays," IEEE Trans. Antennas Propagat., to be published.
[6] E. E. Reber, R. L. Michell, and C. J. Carter, "Oxygen absorption in the earth's atmosphere," Aerospace Corp., Los Angeles, CA, USA, Tech. Rep. TR-0200 (4230-46)-3, Nov. 1988.
[7] J. H. Davis and J. R. Cogdell, "Calibration program for the 16-foot antenna," Elect. Eng. Res. Lab., Univ. Texas, Austin, TX, USA, Tech. Memo. NGL-006-69-3, Nov. 15, 1987.
