
CSE Sem V

Big Data Analytics


Unit 3

DATA ANALYSIS

(For reference purposes only) Priyanka Suryawanshi


Classification
• Classification is the task of identifying the category or class label of a new
observation.
• First, a set of data is used as training data.
• The set of input data and the corresponding outputs are given to the
algorithm. So, the training data set includes the input data and their
associated class labels.
• Using the training dataset, the algorithm derives a model or the
classifier.
• The derived model can be a decision tree, a mathematical formula, or a
neural network. In classification, when unlabeled data is given to the
model, it should find the class to which it belongs. The new data
provided to the model is called the test data set.
• Example: The bank needs to analyze whether giving a loan to a particular
customer is risky or not.
• For example, based on observable data for multiple loan borrowers, a
classification model may be established that forecasts credit risk. The data
could track job records, homeownership or leasing, years of residency,
number and type of deposits, historical credit ranking, etc. The goal would
be credit ranking, the predictors would be the other characteristics, and the
data would represent a case for each consumer. In this example, a model
is constructed to find a categorical label, and the labels are risky or safe.
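
The following is a minimal sketch of this loan-risk example in Python using
scikit-learn. The feature columns and their values are invented for
illustration; any classification algorithm could stand in for the decision
tree used here.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical borrower records; columns and values are invented.
# Each row: [years_employed, owns_home (1/0), years_at_address, num_deposits]
X_train = [
    [10, 1, 8, 4],
    [1,  0, 1, 0],
    [7,  1, 5, 3],
    [2,  0, 2, 1],
]
y_train = ["safe", "risky", "safe", "risky"]  # categorical class labels

# Model creation: derive a classifier (here, a decision tree) from the
# training data and its associated class labels.
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Classification: predict the class label of a new, unlabeled customer.
new_customer = [[5, 1, 3, 2]]
print(clf.predict(new_customer))  # e.g. ['safe']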
• How does Classification Work?
There are two stages in a data classification system: building the
classifier (model creation) and applying the classifier for classification.
1. Developing the classifier (model creation):
This stage is the learning stage, or the learning process, in which the
classification algorithms construct the classifier.
A classifier is constructed from a training set composed of database
records and their corresponding class labels.
Each record in the training set is assumed to belong to a predefined
category or class.
These records may also be referred to as samples, objects, or data points.
2. Applying the classifier for classification:
The classifier is used for classification at this stage. The test data are
used here to estimate the accuracy of the classification rules. If the
accuracy is deemed sufficient, the rules can be applied to new data
records. Applications include:
• Sentiment Analysis: Sentiment analysis is highly helpful in social
media monitoring. We can use it to extract social media insights. With
advanced machine learning algorithms, we can build sentiment analysis
models that read and analyze even misspelled words. Accurately trained
models provide consistently accurate outcomes in a fraction of the time.
• Document Classification: We can use document classification to
organize documents into sections according to their content.
Document classification is a form of text classification, in which the
words of an entire document are classified. With the help of machine
learning classification algorithms, we can execute it automatically
(a short sketch follows this list).

• Image Classification: Image classification assigns an image to one of a
set of trained categories, such as a caption, a statistical value, or a
theme. You can tag images to train your model for the relevant categories
by applying supervised learning algorithms.

• Machine Learning Classification: It uses statistically demonstrable
algorithmic rules to execute analytical tasks that would take humans
hundreds of hours to perform.
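
As a concrete illustration of automatic document classification, here is a
hedged sketch in Python using scikit-learn. The documents, labels, and
section names are made up, and a Naive Bayes model is just one possible
choice of classifier.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented example documents and the sections they belong to.
docs = [
    "quarterly revenue and profit figures",
    "match results and player transfers",
    "new processor benchmarks released",
    "stock market closes higher today",
]
labels = ["finance", "sports", "technology", "finance"]

# Pipeline: turn each document into word counts, then classify the counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the team won the championship match"]))
# expected: ['sports']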
3. Data Classification Process: The data classification process can be
divided into five steps:
1. Define the goals, strategy, workflows, and architecture of data
classification.
2. Classify the confidential details that we store.
3. Mark the data by applying data labels.
4. Use the results to improve protection and compliance.
5. Data is dynamic, so classification must be a continuous process.
• Data Classification Lifecycle
The data classification life cycle provides an excellent structure for
controlling the flow of data through an enterprise. Businesses need to
account for data security and compliance at each level. With the help of
data classification, we can do this at every stage of the data life cycle,
from origin to deletion.
• Support Vector Machine, or SVM, is one of the most popular
Supervised Learning algorithms, used for Classification as well as
Regression problems. However, it is primarily used for Classification
problems.

• The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that
we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence
the algorithm is termed Support Vector Machine. Two different
categories can thus be classified using a decision boundary or
hyperplane.
• Example
Suppose we see a strange cat that also has some features of a dog. If we
want a model that can accurately identify whether the creature is a cat or
a dog, such a model can be created using the SVM algorithm. We first
train the model with many images of cats and dogs so that it can learn
their different features, and then we test it with this strange creature.
The SVM creates a decision boundary between the two classes (cat and
dog) and chooses the extreme cases (the support vectors) of each class.
On the basis of the support vectors, it will classify the creature as a cat.
• Hyperplane and Support Vectors in the SVM algorithm:
• Hyperplane: There can be multiple lines/decision boundaries to segregate
the classes in n-dimensional space, but we need to find the best decision
boundary that helps to classify the data points. This best boundary is known
as the hyperplane of SVM.
• The dimension of the hyperplane depends on the number of features in the
dataset: if there are 2 features, the hyperplane is a straight line, and if
there are 3 features, the hyperplane is a 2-dimensional plane.
• We always create the hyperplane that has the maximum margin, i.e., the
maximum distance from the hyperplane to the nearest data points of each
class (a standard formulation is given after the next bullet).
• Support Vectors:
• The data points or vectors that are closest to the hyperplane and that
affect the position of the hyperplane are termed support vectors. Since
these vectors support the hyperplane, they are called support vectors.
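
For reference, the maximum-margin idea can be written out formally. The
following is the standard linear SVM formulation in the usual notation; it
is not stated explicitly in these notes:

\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 \ \text{for every training point } (x_i, y_i),
\]

where the hyperplane is the set of points x with w · x + b = 0 and the
margin between the two classes equals 2/‖w‖, so maximizing the margin is
equivalent to minimizing ‖w‖.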
• How does SVM work?
1. Linear SVM:
The working of the SVM algorithm can be understood through an example.
Suppose we have a dataset that has two tags (green and blue), and the
dataset has two features, x1 and x2. We want a classifier that can
classify a pair (x1, x2) of coordinates as either green or blue.
• Since this is a 2-d space, we can easily separate these two classes just
by using a straight line. But there can be multiple lines that can
separate the classes.
• Hence, the SVM algorithm helps to find the best line or decision
boundary; this best boundary or region is called a hyperplane. The SVM
algorithm finds the closest points of the lines from both classes. These
points are called support vectors. The distance between the vectors and
the hyperplane is called the margin, and the goal of SVM is to maximize
this margin. The hyperplane with maximum margin is called the optimal
hyperplane.
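
A minimal sketch of this linear case in Python with scikit-learn follows;
the (x1, x2) points and their green/blue tags are invented for illustration.

from sklearn.svm import SVC

# Invented 2-d points with two tags.
X = [[1, 2], [2, 3], [2, 1],      # "blue" points
     [6, 5], [7, 7], [8, 6]]      # "green" points
y = ["blue", "blue", "blue", "green", "green", "green"]

clf = SVC(kernel="linear")        # a linear kernel gives a straight-line boundary
clf.fit(X, y)

# The points closest to the optimal hyperplane are the support vectors.
print(clf.support_vectors_)
print(clf.predict([[3, 2], [7, 6]]))  # expected: ['blue' 'green']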
2. Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data we cannot draw a single straight line.
• So, to separate these data points, we need to add one more dimension.
For linear data we have used the two dimensions x and y, so for non-linear
data we will add a third dimension z, which can be calculated as
z = x² + y². By adding this third dimension, the sample space becomes
separable by a plane.
• So now, SVM will divide the datasets into classes with a linear boundary
in this 3-d space.
• Since we are in 3-d space, the boundary looks like a plane parallel to
the x-axis. If we convert it back to 2-d space by setting z = 1, the
boundary becomes a circle of radius 1.
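
The lifting trick can be sketched in a few lines of Python. The circular
toy data below is invented, and scikit-learn's RBF kernel is shown as the
implicit equivalent of adding the z dimension by hand.

import numpy as np
from sklearn.svm import SVC

# Toy data: class 0 inside a disc, class 1 on a ring around it, so no
# straight line in (x, y) separates them.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 40)
radii = np.concatenate([rng.uniform(0.0, 0.8, 20),   # class 0: inner disc
                        rng.uniform(1.5, 2.5, 20)])  # class 1: outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 20 + [1] * 20)

# Explicit lifting: add the third dimension z = x^2 + y^2 from the text;
# a *linear* SVM in 3-d then separates the classes with a plane.
z = (X ** 2).sum(axis=1)
X3d = np.column_stack([X, z])
print(SVC(kernel="linear").fit(X3d, y).score(X3d, y))  # 1.0

# Equivalent shortcut: a non-linear (RBF) kernel in the original 2-d space.
print(SVC(kernel="rbf").fit(X, y).score(X, y))         # ~1.0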
Cluster Analysis
• Cluster analysis is a multivariate data mining technique whose
goal is to group objects (e.g., products, respondents, or other
entities) based on a set of user-selected characteristics or
attributes. It is a basic and important step of data mining and a
common technique for statistical data analysis, used in many fields
such as data compression, machine learning, pattern recognition,
information retrieval, etc.
• What does this mean?
• When plotted geometrically, objects within a cluster should be
very close together, and different clusters should be far apart.
• Clustering Analysis is useful in
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
• Types of Cluster Analysis
The clustering algorithm needs to be chosen experimentally unless there
is a mathematical reason to prefer one cluster method over another. It
should be noted that an algorithm that works on one set of data may not
work on another. There are a number of different methods to perform
cluster analysis. Some of them are:
Partitioning methods (centroid-based clustering), hierarchical methods,
density-based methods, grid-based methods, model-based clustering
methods, and distribution-based clustering.
• Partitioning Method
• Suppose we are given a database of ‘n’ objects, and the partitioning
method constructs ‘k’ partitions of the data. Each partition will
represent a cluster, and k ≤ n. It means that it will classify the data
into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
• Points to remember −
• For a given number of partitions (say k), the partitioning method will
create an initial partitioning.
• Then it uses an iterative relocation technique to improve the
partitioning by moving objects from one group to another.
• Many algorithms come under the partitioning method; some of the popular
ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large
Applications).
• K-Means (A Centroid-Based Technique):
The K-means algorithm takes an input parameter K from the user and
partitions the dataset containing N objects into K clusters so that the
resulting similarity among the data objects inside a group
(intra-cluster similarity) is high, while the similarity of data objects
with data objects from outside the cluster (inter-cluster similarity)
is low.
The similarity of a cluster is determined with respect to the mean value
of the cluster.
• It is a type of squared-error algorithm.
• At the start, k objects are chosen at random from the dataset, each of
which represents a cluster mean (centre).
• Each of the remaining data objects is assigned to the nearest cluster
based on its distance from the cluster mean.
• The new mean of each cluster is then calculated from the data objects
assigned to it, and the assignment and update steps repeat until the
cluster means no longer change (the sketch below walks through this loop).
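
These steps can be written out directly. Below is a minimal sketch of the
K-means loop on invented toy data (K = 2); a production implementation such
as scikit-learn's KMeans adds smarter initialization and empty-cluster
handling.

import numpy as np

# Invented toy data: two obvious groups in 2-d.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
k = 2

rng = np.random.default_rng(0)
# Step 1: choose k objects at random as the initial cluster means (centres).
centres = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign every object to the nearest cluster mean.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each cluster mean from its assigned objects.
    new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Repeat until the means stop changing (convergence).
    if np.allclose(new_centres, centres):
        break
    centres = new_centres

print(labels)   # e.g. [0 0 0 1 1 1]
print(centres)  # the two cluster means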
