
BIG DATA ANALYTICS (2017 REGULATION)

UNIT – 2 CLUSTERING AND CLASSIFICATION

Advanced Analytical Theory and Methods: Overview of Clustering - K-means - Use Cases - Overview of the
Method - Determining the Number of Clusters - Diagnostics - Reasons to Choose and Cautions.

Classification: Decision Trees - Overview of a Decision Tree - The General Algorithm - Decision Tree Algorithms
- Evaluating a Decision Tree - Decision Trees in R - Naïve Bayes - Bayes' Theorem - Naïve Bayes Classifier.


Why Is Big Data Important?


• The success of an organization lies not just in how good it is at doing its business but also in how
well it can analyze its data and derive insights about the company, its competitors, etc.
• Big data can help you make the right decision at the right time.

Analytics:
• Analytics is the process of breaking a problem into simpler parts and using inferences based on data to
make decisions. Analytics is not a tool or a technology; rather, it is a way of thinking and acting.
• Analytics is used in all sorts of industries, for example finance analytics, healthcare analytics, retail analytics,
telecom analytics, and web analytics.

Analytics Lifecycle:
Let us consider the following stages of an analytics project lifecycle:
1. Problem Identification : A problem is a situation that is judged as something that needs to
be corrected.
2. Hypothesis Formulation : Break down problems and formulate hypotheses.
3. Data Collection
4. Data Exploration : Before a formal data analysis can be conducted, the analyst must know
how many cases are in the dataset, what variables are included, etc.
5. Data Preparation/ Manipulation : Data rarely arrives in a form that is easy to analyze. We need to
clean the data and check it for consistency; extensive manipulation is often needed before the data can be analyzed.
6. Model planning / Building : This is the entire process of building the solution and implementing
it.
7. Validate Model
8. Evaluate/Monitor results
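As a rough illustration of steps 4 and 5, the sketch below counts the cases and variables in a small dataset and then cleans it. The sample data, the column names, and the helper names `explore` and `prepare` are all hypothetical, invented only for this example.

```python
import csv
import io

# Hypothetical sample data standing in for a collected dataset.
RAW_CSV = """customer_id,age,city
1,34,Chennai
2,,Madurai
3,29, chennai
"""

def explore(rows):
    """Step 4: return the number of cases and the variable names."""
    variables = list(rows[0].keys()) if rows else []
    return len(rows), variables

def prepare(rows):
    """Step 5: trim whitespace, normalise case, and mark missing ages."""
    cleaned = []
    for row in rows:
        age_text = row["age"].strip()
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "age": int(age_text) if age_text else None,  # None marks a missing value
            "city": row["city"].strip().title(),
        })
    return cleaned

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
n_cases, variables = explore(rows)
clean_rows = prepare(rows)
```

Here `n_cases` and `variables` answer the exploration questions, while `clean_rows` holds consistently typed values that later modelling steps can use.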

Machine Learning:
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use
to perform a specific task without using explicit instructions. It is seen as a subset of artificial intelligence.
(The primary aim is to allow computers to learn automatically, without human intervention or assistance.)
Types of learning algorithms:
1. Supervised learning : Supervised learning algorithms involve direct supervision of the
operation. The developer labels a sample data corpus and sets strict boundaries within which the
algorithm operates.
The most common uses of supervised learning are price prediction and trend forecasting in
sales, retail commerce, and stock trading. In both cases, the algorithm uses incoming data to assess
likelihoods and calculate possible outcomes.
Supervised machine learning includes two major processes:
Classification: The process in which incoming data is labeled based on past data samples; the algorithm is
manually trained to recognize certain types of objects and categorize them accordingly.
Regression: The process of identifying patterns and predicting continuous outcomes. The
system has to understand the numbers, their values, their grouping, etc.
List of Common Algorithms:
• Nearest Neighbor
• Naive Bayes
• Decision Trees
• Linear Regression
• Support Vector Machines (SVM)
• Neural Networks
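To make the two supervised processes concrete, here is a minimal sketch using two algorithms from the list above: nearest neighbor for classification and linear regression for predicting a continuous value. The data points, labels, and function names are assumptions made up for illustration, not taken from the course material.

```python
import math

def nearest_neighbor_classify(train, point):
    """Classification: return the label of the closest labeled training sample."""
    closest = min(train, key=lambda sample: math.dist(sample[0], point))
    return closest[1]

def fit_line(xs, ys):
    """Regression: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labeled samples: (features, label).
labeled = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
           ((8.0, 8.0), "high"), ((9.0, 7.5), "high")]
label = nearest_neighbor_classify(labeled, (7.0, 9.0))

# Hypothetical price history that follows price = 2 * month + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The classifier returns a category for the new point, whereas the fitted line returns numbers (`slope`, `intercept`) that can be used to predict a continuous outcome for a new input.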

2. Unsupervised Learning : Unsupervised learning does not involve direct control by the developer.
• In supervised machine learning you know the results and need to sort out the data; in
unsupervised machine learning the desired results are unknown and yet to be defined.
The unsupervised machine learning algorithm is used for exploring the structure of the
information, extracting valuable insights, detecting patterns, and implementing this into its operation to
increase efficiency (for example, in digital marketing).

Unsupervised learning algorithms apply the following techniques:


Clustering: An exploration of data that segments it into meaningful groups (i.e., clusters) based on
internal patterns, without prior knowledge of the groups.
Dimensionality reduction: Incoming data contains a lot of noise. Machine learning algorithms use
dimensionality reduction to remove this noise while distilling the relevant information.
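As a sketch of how clustering can segment data without labels, the following implements the usual two-step k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The sample points and the hand-picked starting centroids are assumptions chosen for illustration; real implementations normally choose starting centroids more carefully and stop when assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between the assignment step and the centroid update step."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Assumption: starting centroids are picked by hand for this example.
centroids, assignments = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied anywhere; the grouping in `assignments` emerges only from the distances between points.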
List of Common Algorithms:
• k-means clustering
• Principal Component Analysis
• Association Rules
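Principal Component Analysis from the list above is one way to perform dimensionality reduction. The sketch below is a simplified version restricted to 2-D data: it finds the direction of largest variance by power iteration on the covariance matrix and projects each point onto it. The data and function names are hypothetical, and the code assumes the points are not all identical and that the start vector is not orthogonal to the main direction.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (mean, direction) of the axis with the largest variance (2-D data)."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(2)]
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # Entries of the 2x2 covariance matrix of the centered data.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        wx, wy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return mean, (vx, vy)

def project(points, mean, direction):
    """Reduce each 2-D point to a single coordinate along the direction."""
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]

# Hypothetical noisy measurements lying roughly along y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, direction = first_principal_component(data)
reduced = project(data, mean, direction)
```

Each point is now described by one number in `reduced` instead of two, with the small deviations from the main trend discarded.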

3. Semi-supervised Learning : Trains on a mix of data, typically a small amount of labeled data and a large
amount of unlabeled data.
4. Reinforcement Learning : A learning method in which the learner interacts with its environment by producing
actions and discovers errors or rewards.

List of Common Algorithms:
• Q-Learning
• Temporal Difference (TD)
• Deep Adversarial Networks
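A minimal sketch of tabular Q-learning, which uses a temporal-difference update, may help show how a learner discovers rewards through its own actions. The environment here (five states on a line, with a reward only at the rightmost state) and all parameter values are assumptions invented for this example.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Hypothetical environment: reward 1 only when the goal state is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from random exploration of the environment."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # q[state][action index]
    for _episode in range(episodes):
        state = 0
        for _step in range(200):                 # cap the episode length
            a = rng.randrange(2)                 # explore with random actions
            next_state, reward, done = step(state, ACTIONS[a])
            # Update: move q towards reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q

q_table = q_learning()
# Greedy policy for the non-goal states: pick the action with the highest value.
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
```

The learner is never told that moving right is correct; the preference appears in `policy` only because actions leading towards the goal accumulate higher values.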