Data Mining Lab 1

First A. Author, Fellow, IEEE, Second B. Author, and Third C. Author, Jr., Member, IEEE

Abstract—This document gives a brief introduction to some of the key concepts of decision trees, such as Gini index, entropy, and information gain, and to the foundations of PCA, such as the covariance matrix, eigenvalues, and eigenvectors. The paper also describes different distance measurements and similarity calculations. These concepts are implemented in the Python language using the machine learning library scikit-learn and the data analysis and visualization tools numpy, pandas, and matplotlib. The concepts were also applied to the iris dataset, and different graphs were visualized.

Index Terms—covariance matrix, decision trees, distance measures, eigenvalues, entropy, Gini index, principal component analysis
I. INTRODUCTION
Entropy is the measure of uncertainty of a random variable. The higher the entropy, the harder it is to draw any conclusions from that information. The entropy is 0 if all samples of a node belong to the same class, and the entropy is maximal if we have a uniform class distribution.
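For illustration, a minimal sketch of this computation in Python (the entropy helper below is ours, not taken from the lab code):

import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # p * log2(1/p) summed over classes; 0 for a pure node.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(entropy(["a", "a", "a"]))       # 0.0: all samples in one class
print(entropy(["a", "b", "a", "b"]))  # 1.0: uniform two-class distribution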
The Gini index or Gini impurity measures the degree or probability of a particular variable being wrongly classified when it is randomly chosen. The degree of the Gini index varies between 0 and 1, where 0 means all elements belong to one class and 1 means they are randomly distributed across various classes. A Gini index of 0.5 denotes elements equally distributed into some classes. Classification error is a measure of the misclassifications made by a node. Its curve is linear in nature.
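A similar sketch for the Gini impurity (the helper name is ours):

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((c / n) ** 2 for c in counts.values())

print(gini(["a", "a", "a"]))       # 0.0: pure node
print(gini(["a", "b", "a", "b"]))  # 0.5: two equally distributed classes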
Variance is a measure of the variability or spread in a set of data. Covariance is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. High variance implies high information in a dimension. High covariance means high correlation between attributes. A covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. When the population contains higher dimensions or more random variables, a matrix is used to describe the relationship between different dimensions. The idea of the covariance matrix is to define the relationships across the entire set of dimensions as the relationships between every two random variables.
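For example, the covariance matrix of two attributes can be inspected with numpy (a sketch; the sample values are ours):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Each argument is one variable; the result is a 2x2 matrix with
# the variances on the diagonal and the covariance off the diagonal.
cov_matrix = np.cov(x, y)
print(cov_matrix)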
An eigenvector is a nonzero vector that changes at most by a scalar factor when a linear transformation is applied to it. It also gives the direction of spread of the data. Eigenvalues are a special set of scalars associated with a linear system of equations (i.e., a matrix equation) that are sometimes also known as characteristic roots. An eigenvalue gives the amount of spread of the data along the direction of its eigenvector.
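A sketch of extracting eigenvalues and eigenvectors from a covariance matrix with numpy (the matrix values are ours):

import numpy as np

cov_matrix = np.array([[6.0, 4.0],
                       [4.0, 3.0]])

# numpy.linalg.eig returns a pair (eigenvalues, eigenvectors);
# the column eig_vecs[:, i] is the eigenvector for eig_vals[i].
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print(eig_vals)
print(eig_vecs)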
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is sensitive to the relative scaling of the original variables. PCA was invented in 1901 by Karl Pearson. PCA is mostly used as a tool in exploratory data analysis and for making predictive models, and mainly in dimensionality reduction. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after a normalization step on the initial data. PCA can be used to reduce the original variables to a smaller number of new variables. The principal components with more variance or spread are kept, and those with less are rejected. To implement PCA in Python, we first calculate the center of the data by finding its mean; then we calculate the covariance matrix using the numpy.cov function and the eigenvalues and eigenvectors using the numpy.linalg.eig function, which takes the covariance matrix as a parameter. Finally, we plot the principal components using matplotlib.pyplot.quiver, which plots a 2D field of arrows. This function takes as its first parameters the location of the start of the arrows, and the following parameters are the eigenvector components, which give the arrows their direction.
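A compact sketch of these steps on two iris features (the use of sklearn's iris loader and the eigenvalue scaling of the arrows are our choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X = load_iris().data[:, 2:4]   # petal length and petal width
mean = X.mean(axis=0)
centered = X - mean            # center the data at zero mean

cov = np.cov(centered.T)       # 2x2 covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov)

plt.scatter(X[:, 0], X[:, 1], alpha=0.4)
# One arrow per principal component, starting at the mean and
# scaled by its eigenvalue so the longer arrow marks more spread.
plt.quiver([mean[0]] * 2, [mean[1]] * 2,
           eig_vecs[0, :] * eig_vals, eig_vecs[1, :] * eig_vals,
           angles="xy", scale_units="xy", scale=1, color="r")
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.show()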
Fig. 1. Principal components of petal length vs. petal width.

II. METHODS & EXPERIMENTS

In order to calculate the eigenvalues, eigenvectors, and covariance matrix using the Python programming language, we use modules like statistics, scipy.linalg, and numpy. We have used statistics.mean and numpy.cov in order to calculate the mean and covariance of two lists of data, and we used scipy.linalg.eig to calculate the eigenvalues and eigenvectors, where this function takes a covariance matrix as a parameter and returns a list of eigenvalues and eigenvectors. The purpose of subtracting the mean from a dataset is to obtain a dataset whose mean is zero, which allows different datasets to be comparable and centered and makes it easy to calculate the spread of the data.
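A minimal sketch of this pipeline (the sample lists are ours):

import statistics
import numpy as np
from scipy import linalg

a = [2.5, 0.5, 2.2, 1.9, 3.1]
b = [2.4, 0.7, 2.9, 2.2, 3.0]

# Subtract each list's mean so both are centered at zero.
mean_a, mean_b = statistics.mean(a), statistics.mean(b)
centered = np.array([[v - mean_a for v in a],
                     [v - mean_b for v in b]])

cov = np.cov(centered)                # 2x2 covariance matrix
eig_vals, eig_vecs = linalg.eig(cov)  # scipy.linalg.eig
print(eig_vals)
print(eig_vecs)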
Various methods for the calculation of distances and similarities are used to analyze the data.

A. Euclidean distance

Euclidean distance is the root of the squared differences between the coordinates of a pair of objects. If the points $(x_1, y_1)$ and $(x_2, y_2)$ are in 2-dimensional space, then the Euclidean distance between them is
$\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
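A one-function sketch (the helper name is ours):

import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0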
B. Manhattan distance

Manhattan distance is the distance between two points measured along axes at right angles. In a plane with $p_1$ at $(x_1, y_1)$ and $p_2$ at $(x_2, y_2)$, it is
$|x_1 - x_2| + |y_1 - y_2|$.
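The corresponding sketch:

def manhattan_distance(p, q):
    """Manhattan (city-block) distance between two equal-length points."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan_distance((1, 2), (4, 6)))  # 7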
C. Minkowski distance

Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean and Manhattan distances. The Minkowski distance of order $p$ between two variables $X$ and $Y$ is defined as
$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$.
D. Cosine Similarity

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space:
$\text{cosine similarity} = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$.

For the Minkowski distance, we define a function which takes lists of points and returns a value using the sum, math.pow, and zip functions and an nth-root helper which returns the nth root of its input value. For cosine similarity, we define a function which takes lists of points and returns a value using the math.sqrt, math.pow, zip, round, float, and sum functions, together with a helper which returns the square root of its input value. For Jaccard similarity, we define a function which takes lists of points and returns a value using the len, set, float, and sum functions.
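A sketch of these three helpers, following the description above (function names and sample data are ours):

import math

def nth_root(value, n):
    """Return the nth root of a value."""
    return value ** (1.0 / n)

def minkowski_distance(p, q, order=3):
    """Minkowski distance of the given order between two points."""
    return nth_root(sum(math.pow(abs(pi - qi), order) for pi, qi in zip(p, q)), order)

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors, rounded to 3 places."""
    numerator = sum(pi * qi for pi, qi in zip(p, q))
    denominator = math.sqrt(sum(math.pow(pi, 2) for pi in p)) * \
                  math.sqrt(sum(math.pow(qi, 2) for qi in q))
    return round(numerator / float(denominator), 3)

def jaccard_similarity(p, q):
    """Intersection size over union size of the two value sets."""
    intersection = len(set(p) & set(q))
    union = len(set(p) | set(q))
    return intersection / float(union)

print(minkowski_distance([0, 0], [3, 4]))
print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 4, 5, 7, 9]))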
III. MATH

A. Formulas

Mathematically, variance is the average squared deviation from the mean score. We use the following formula to compute variance:
$\mathrm{Var}(X) = \sum (X_i - \bar{X})^2 / N = \sum x_i^2 / N$
where $N$ is the number of data points, $\bar{X}$ is the mean of the $N$ data points, $X_i$ is the $i$th data point in the dataset, and $x_i$ is the $i$th deviation score in the dataset.

We use the following formula to compute covariance:
$\mathrm{Cov}(X, Y) = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / N = \sum x_i y_i / N$
where $N$ is the number of data points in the dataset.
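A quick numeric check of these formulas against numpy (the bias=True flag selects the $1/N$ normalization used above; sample values are ours):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
var_manual = ((x - x.mean()) ** 2).sum() / n
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / n

print(var_manual, np.var(x))                      # numpy.var also divides by N
print(cov_manual, np.cov(x, y, bias=True)[0, 1])  # same 1/N covariance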
B. Equations

Many equations are used in this process to analyze the iris dataset. Some of them are given below. The equation for entropy is given by
$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
where $p_i$ is the proportion of samples belonging to class $i$.
V. GRAPHS

VII. CONCLUSION

We concluded that the Gini index, classification error, and entropy are important metrics for the evaluation of classification and the splitting of decision trees. We also learnt about various distance measures and similarity calculations.
VIII. APPENDIX
The code for this experiment is uploaded to GitHub, and its link is provided below:
https://github.com/rj7shakya/datamining/blob/master/lab/lab1.ipynb