Sonali Jadhav
M.E. Computer Engineering
Assistant Professor, Computer Engineering Department,
D. J. Sanghvi College of Engineering,
Mumbai.
TECHNICAL PUBLICATIONS (Since 1993) - An Up-Thrust for Knowledge
Big Data Analytics
Subject Code : CS8091
Published by :
TECHNICAL PUBLICATIONS (Since 1993) - An Up-Thrust for Knowledge
Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph. : +91-020-24495496/97, Telefax : +91-020-24495497
Email : sales@technicalpublications.org, Website : www.technicalpublications.org
Printer :
Yogiraj Printers & Binders
Sr.No. 10/1A,
Ghule Industrial Estate, Nanded Village Road,
Tal. - Haveli, Dist. - Pune - 411041.
Price : ₹ 250/-
ISBN 978-93-89420-88-3
Contents
1.1 Introduction
1.2 Evolution of Big data
1.3 Best Practices for Big Data Analytics
1.4 Big Data Characteristics
1.5 Validating the Promotion of the Value of Big Data
1.6 Big Data Use Cases
1.7 Characteristics of Big Data Applications
1.8 Perception and Quantification of Value
1.9 Understanding Big Data Storage
1.10 A General Overview of High-Performance Architecture
1.11 Architecture of Hadoop
1.12 Hadoop Distributed File System (HDFS)
1.13 Architecture of HDFS
1.14 Map Reduce and YARN
1.15 Map Reduce Programming Model
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions
1.1 Introduction
Due to massive digitalization, a large amount of data is generated by the web applications and social networking sites that many organizations run on the internet. In today's technological world, high computational power and large storage capacity are basic needs, and both have increased significantly over time. Organizations produce huge amounts of data at a rapid rate; as per the global internet usage report cited by Wikipedia, about 51 % of the world's population uses the internet to perform their day-to-day activities. Most of them use the internet for web surfing, online shopping, or interacting on social media sites such as Facebook, Twitter or LinkedIn. These websites generate massive amounts of data through the uploading and downloading of videos, pictures and text messages, whose size is almost unpredictable given the large number of users.
A recent survey on data generation says that Facebook produces 600 TB of data per day and analyzes 30+ petabytes of user-generated data; a Boeing jet airplane generates more than 10 TB of data per flight, including geo maps and other information; Walmart handles more than 1 million customer transactions every hour, amounting to an estimated 2.5+ petabytes of data per day; around 0.4 million tweets are generated on Twitter per minute; and 400 hours of new video are uploaded to YouTube, accessed by 4.1 million users. Therefore, it becomes necessary to manage such a huge amount of data, generally called "Big data", from the perspective of its storage, processing and analytics.
In big data, the data is generated in many formats : structured, semi-structured or unstructured. Structured data has a fixed pattern or schema and can be stored and managed using tables in an RDBMS. Semi-structured data does not have a pre-defined structure or pattern; it includes scientific or bibliographic data that can be represented using graph data structures. Unstructured data also does not have a standard structure, pattern or schema; examples are videos, audio, images, PDFs, compressed files, log files and JSON files. Traditional database management techniques are incapable of storing, processing, handling and analyzing big data in its various formats, which include images, audio, video, maps, text, XML and so on.
Processing big data using a traditional database management system is very difficult because of its four characteristics, called the 4 Vs of Big data, shown in Fig. 1.1.1. Volume refers to the size of data being generated per minute or second; Variety means the types of data generated, including structured, unstructured or semi-structured
data; Velocity refers to the speed at which data is generated per minute or second; and Veracity refers to the uncertainty of the data being generated.
Because of these four Vs, it becomes more and more difficult to capture, store, organize, process and analyze the data generated by various web applications and websites. In a traditional analytics system, cleansed, meaningful data is collected and stored in an RDBMS-based data warehouse and analyzed by performing Extract, Transform and Load (ETL) operations. Such a system supports only cleansed, structured data used for batch processing, and parallel processing of this data is costly because it requires expensive hardware. Therefore, big data analytics solutions came into the picture, with many advantages over traditional analytics solutions : they support both real-time and batch processing, analyze different formats of data, can process uncleansed or uncertain data, do not require expensive hardware, support huge volumes of data generated at any velocity, and perform data analytics at low cost.
Therefore, it is best to begin with a definition of big data. The analyst firm Gartner can
be credited with the most-frequently used (and perhaps, somewhat abused) definition:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.
Therefore, every successful Big Data project tends to start with smaller data sets and targeted goals.
2) Think big for scalability : While defining a Big data system, always follow a futuristic approach. That means estimating how much data will be collected six months from now, or how many more servers will be needed to handle it. This approach allows applications to be scaled easily without any bottleneck.
3) Avoid bad practices : There are many potential reasons for Big Data projects to fail. So, to make a Big data project successful, the following wrong practices must be avoided :
a) Rather than blindly adopting and deploying something, first understand the business purpose of the technology you are deploying, so as to implement the right analytics tools for the job at hand. Without a solid understanding of business requirements, the project will end up without the intended outcome.
b) Do not assume that the software will have all the solutions to your problem, as the business requirements, environment and inputs/outputs vary from project to project.
c) Do not consider the solution of one problem relevant for every problem; each problem has unique requirements and needs a unique solution that cannot simply be reused elsewhere. As a result, new methods and tools might be required to capture, cleanse, store and process at least some of your Big Data.
d) Do not appoint the same person to handle multiple types of analytical operations, as a lack of business and analytical expertise may lead to the failure of the project. Such projects require analytics professionals with statistical, actuarial and other sophisticated skills, and with expertise in advanced analytics operations.
4) Treat the big data problem as a scientific experiment : In a Big data project, collecting and analyzing the data is just a part of the procedure; analytics only produces business value when it is incorporated into business processes intended to improve performance and results. Therefore, every Big data problem requires a feedback loop that reports the success of the actions taken as a result of analytical findings, followed by improvement of the analytical models based on the business results.
5) Decide what data to include and what to leave out : Although Big Data analytics projects involve large data sets, that does not mean all the data generated by a system can be analyzed. Therefore, it is necessary to select the appropriate datasets for analysis based on their value and expected outcomes.
6) Have a periodic maintenance plan : The success of a Big Data analytics initiative requires regular maintenance of the analytics programs on top of changes in business requirements.
7) In-memory processing : In-memory processing of large datasets should be evaluated for improvements in data processing, speed of execution and the volume of data handled. It gives performance hundreds of times higher than older technologies, better price-to-performance ratios and reductions in the cost of central processing units and memory, and it can handle rapidly expanding volumes of information.
Does not need high-end servers, as it can run on commodity hardware.
Provides predictive, descriptive and prescriptive analytics for improving supply chain management.
Provides accuracy in fraud detection.
There are some more benefits promoted by incorporating business intelligence and data warehouse tools into big data, such as enhanced business planning with product analysis, optimized supply chain management, and detection and analysis of fraud, waste and abuse.
3. Large data volumes : Due to the huge volume of data, the analytical application needs to handle high rates of data creation and delivery.
4. Significant data variety : Due to the diversity of applications, the data they generate may come in a variety of forms, such as structured or unstructured data produced by different data sources.
5. Data parallelization : As a big data application needs to process a huge amount of data, its runtime can be improved through task- or thread-level parallelization applied to independent data segments.
Some of the big data applications and their characteristics are given in Table 1.7.1.
b) Whether the Big data system is "Lowering the costs" of the organization's spending, such as capital expenses (Capex) and operational expenses (Opex).
c) Whether the Big data system is "Increasing the productivity" by speeding up the process of execution with efficient results.
d) Whether the Big data system is "Reducing the risk", for example when the big data platform collects data from streams of automated sensors and can provide full visibility.
To get a better understanding of the architecture for big data platform, we will
examine the Apache Hadoop software stack, since it is a collection of open source projects
that are combined to enable a software-based big data appliance. A general overview of the high-performance architecture of Hadoop is shown in Fig. 1.10.1.
The different components of the Hadoop ecosystem are explained in the following table.
1) HDFS : The Hadoop Distributed File System splits data into blocks and distributes them among the servers for processing. Several copies of each data block are stored across the cluster, which can be used in case a failure occurs.
2) Map reduce : A programming method to process big data, comprising two programs written in Java, the mapper and the reducer. The mapper extracts data from HDFS and puts it into maps, while the reducer aggregates the results generated by the mappers.
3) Zookeeper : A centralized service used for maintaining configuration information and for providing distributed synchronization and coordination.
3) HDFS Client : In the Hadoop distributed file system, user applications access the file system using the HDFS client. Like any other file system, HDFS supports various operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application does not need to be aware that file system metadata and storage are on
file, the HDFS client first asks the name node for the list of data nodes that host
replicas of the blocks of the file. The client contacts a data node directly and requests
the transfer of the desired block. When a client writes, it first asks the name node to
choose data nodes to host replicas of the first block of the file. The client organizes a
pipeline from node-to-node and sends the data. When the first block is filled, the
client requests new data nodes to be chosen to host replicas of the next block. The
choice of data nodes for each block is likely to be different.
4) HDFS Blocks : In general, user data is stored in HDFS in terms of blocks. The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block is 64 MB, which can be increased as per need.
HDFS is fault tolerant : if a data node fails, the block being written on that data node is re-replicated to some other node. The block size and the replication factor are specified in the hadoop configuration file. The synchronization between the name node and the data nodes is done by heartbeat messages, which are periodically sent by each data node to the name node.
Apart from the above components, a job tracker and task trackers are used when a map reduce application runs over HDFS. Hadoop Core consists of one master job tracker and several task trackers. The job tracker runs on the name node as a master, while the task trackers run on the data nodes as slaves.
The job tracker is responsible for taking requests from a client and assigning task trackers with the tasks to be performed. It always tries to assign tasks to the task tracker on the data node where the data is locally present. If for some reason a node fails, the job tracker assigns the task to another task tracker where a replica of the data exists, since data blocks are replicated across the data nodes. This ensures that the job does not fail even if a node fails within the cluster.
HDFS can be manipulated either using the command line or through Java APIs. All the commands used for manipulating HDFS through the command line interface begin with the "hadoop fs" command. Most of the usual Linux file commands are supported over HDFS and start with the "-" sign.
For example, the command for listing the files in a hadoop directory is
#hadoop fs -ls
The general syntax of HDFS command line manipulation is
#hadoop fs -<command>
The most popular HDFS commands are given in Table 1.13.1.
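A few of the commonly used commands, following the same "#hadoop fs -<command>" pattern (the paths shown here are only examples), are :
#hadoop fs -ls /user/data                 (list the files in an HDFS directory)
#hadoop fs -mkdir /user/data              (create a directory in HDFS)
#hadoop fs -put local.txt /user/data      (copy a file from the local disk into HDFS)
#hadoop fs -get /user/data/local.txt .    (copy a file from HDFS to the local disk)
#hadoop fs -cat /user/data/local.txt      (display the contents of an HDFS file)
#hadoop fs -rm /user/data/local.txt       (delete a file from HDFS)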
been centralized, and the management of resources at each node is now performed by a local Node Manager. There is also the idea of an Application Master associated with every application, which directly negotiates with the central Resource Manager for resource allocation and effective scheduling to improve node utilization, and which monitors progress and tracks status.
Last, the YARN approach enables applications to be better aware of how data is allocated across the topology of the resources inside a cluster. This awareness allows improved colocation of compute and data resources, reducing data movement and thus lessening the delays related to data access latencies. The outcome should be increased scalability and performance.
Error handling : The map reduce engine provides different fault tolerance mechanisms in case of failure. When tasks are running on different cluster nodes and a failure occurs, the map reduce engine finds those incomplete tasks and reschedules them for execution on other nodes.
Scheduling : Map reduce involves map and reduce operations that divide large problems into smaller chunks which are run in parallel by different machines, so different tasks need to be scheduled on computational nodes on a priority basis; this is taken care of by the map reduce engine.
Shuffle and sort are components of the reducer. Shuffling is the process of partitioning the mapped output and moving it to the reducers, where intermediate keys are assigned to a reducer. Each partition is called a subset, and each subset becomes input to a reducer. In general, the shuffle phase ensures that the partitioned splits reach the appropriate reducers, where each reducer uses the HTTP protocol to retrieve its own partition from the mappers.
The sort phase is responsible for automatically sorting the intermediate keys on a single node before they are presented to the reducer. The shuffle and sort phases occur simultaneously, while the mapped output is being fetched and merged.
The reducer reduces each set of intermediate values that share a key to a smaller set of values. The reducer uses the sorted input to generate the final output. The final output is written by the reducer using a record writer into an output file with a standard output format.
The final output of each map reduce program consists of key-value pairs written into the output file, which is written back to the HDFS store.
As an example, the word count process using map reduce, with all phases of execution, is illustrated in Fig. 1.15.3, and a small code sketch of the same flow is given below.
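The same flow can be imitated in a few lines of Python. This is only an illustrative sketch, not Hadoop code; in Hadoop the mapper and reducer would normally be written in Java and driven by the framework.
from collections import defaultdict

def mapper(line):
    # Map phase : emit a (word, 1) pair for every word in the input line
    for word in line.strip().lower().split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle and sort phase : group the values of every intermediate key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase : aggregate the grouped values of one key
    return (key, sum(values))

lines = ["deer bear river", "car car river", "deer car bear"]
mapped = [pair for line in lines for pair in mapper(line)]
for key, values in shuffle_and_sort(mapped):
    print(reducer(key, values))    # ('bear', 2), ('car', 3), ('deer', 2), ('river', 2)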
The split size can be calculated using the computeSplitSize() method provided by FileInputFormat. Each input split is associated with a record reader that loads the data from the source and converts it into the key-value pairs defined by the InputFormat.
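In essence, the split size is the HDFS block size clamped between the configured minimum and maximum split sizes. A one-line sketch of that rule (written here in Python only for brevity; the actual method is Java code inside FileInputFormat) is :
def compute_split_size(block_size, min_size, max_size):
    # split size = max(minimumSize, min(maximumSize, blockSize))
    return max(min_size, min(max_size, block_size))

print(compute_split_size(64 * 1024 * 1024, 1, 128 * 1024 * 1024))   # 67108864 (64 MB)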
The input to the mapper is an input key, an input value and a context, which gives information about the current task, while the input to the reducer is a set of keys with their values, together with an output collector that collects the key-value pairs output by the mapper or the reducer. The reporter is used to report progress and to update counters and status information of the map reduce job.
Summary
Due to massive digitalization, a large amount of data is generated by the web applications and social networking sites that many organizations run on the internet; such data is called Big data.
In big data, the data is generated in many formats like structured, semi structured
or unstructured.
Structured data has a fixed pattern or schema and can be stored and managed using tables in an RDBMS; semi-structured data does not have a pre-defined structure or pattern and includes scientific or bibliographic data that can be represented using graph data structures; unstructured data also does not have a standard structure, pattern or schema.
The processing of big data using a traditional database management system is very difficult because of its four characteristics, called the 4 Vs of Big data : Volume, Variety, Velocity and Veracity.
The first Big Data challenge came into the picture at the US Census Bureau in 1880, where information concerning approximately 60 million people had to be collected, classified and reported; that process took more than 10 years.
The Apache Hadoop is the open source framework for solving Big data problem.
The common Big data use cases are Business intelligence by querying, reporting
and searching the datasets, using big data analytics tools for report generation,
trend analysis, search optimization, and information retrieval and performing
predictive, prescriptive and descriptive analytics.
The popular applications of Big data analytics are Fraud detection, Data profiling,
Clustering, Price modelling and Recommendation System.
Two Marks Questions with Answers [Part A - Questions]
Q.1 Define Big data and also enlist the advantages of Big data analytics.
Ans. : Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight
and decision making.
Does not need high-end servers, as it can run on commodity hardware.
It supports file operations such as read, write, delete and append, but not in-place update.
It provides Java APIs and command line interfaces to interact with HDFS.
It provides different file permissions and authentications for files on HDFS.
Shuffle and sort are components of the reducer. Shuffling is the process of partitioning the mapped output and moving it to the reducers, where intermediate keys are assigned to a reducer. Each partition is called a subset, and each subset becomes
input to a reducer. In general, the shuffle phase ensures that the partitioned splits reach the appropriate reducers, where each reducer uses the HTTP protocol to retrieve its own partition from the mappers. The sort phase is responsible for automatically sorting the intermediate keys on a single node before they are presented to the reducer. The shuffle and sort phases occur simultaneously, while the mapped output is being fetched and merged.
The reducer reduces each set of intermediate values that share a key to a smaller set of values. The reducer uses the sorted input to generate the final output. The final output is written by the reducer using a record writer into an output file with a standard output format. The final output of each map reduce program consists of key-value pairs written into the output file, which is written back to the HDFS store.
Part - B Questions
Q.1 Explain Big data characteristics along with their use cases. AU : May-17
Q.2 What are 4 V’s of Big data ? Also explain best practices for Big data analytics
Q.3 Explain the general overview of the high-performance architecture of Big data.
Q.4 Explain architecture of Hadoop along with components of High-performance
architecture of Big data.
Q.5 Explain the functionality of map-Reduce Programming model. AU : May-17
Q.6 Explain the functionality of HDFS and map-reduce in detail . AU : Nov.-18
UNIT - II
Clustering and Classification
Syllabus
Advanced Analytical Theory and Methods : Overview of Clustering - K-means - Use Cases - Overview of the Method - Determining the Number of Clusters - Diagnostics - Reasons to Choose and Cautions - Classification : Decision Trees - Overview of a Decision Tree - The General Algorithm - Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R - Naïve Bayes - Bayes' Theorem - Naïve Bayes Classifier.
Contents
2.1 Overview of Clustering
2.2 K-means Clustering
2.3 Use Cases of K-means Clustering
2.4 Determining the Number of Clusters
2.5 Diagnostics
2.6 Reasons to Choose and Cautions
2.7 Classification
2.8 Decision Tree Algorithms
2.9 Evaluating Decision Tree
2.10 Decision Tree in R
2.11 Bayes' Theorem
2.12 Naive Bayes Classifier
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions
cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts, after creating and optimizing the clusters, when either the centroids have stabilized (there is no change in their values because the clustering has been successful) or the defined number of iterations has been reached. This concept is represented as an algorithm below.
Step 3 : Calculate the distance between each data point and the cluster centers using any distance measure.
Step 4 : Assign each data point to the cluster center whose distance from the data point is minimum with respect to the other cluster centers.
Step 5 : Recalculate the cluster centers by taking the mean of the data points assigned to each cluster.
Step 6 : Repeat from Step 3 till there is no change in the cluster centers.
The pictorial representation of the K-means algorithm is shown in Fig. 2.2.1.
Calculate the distance between each data point in X and the cluster centers C1 = 4 and C2 = 12 using the Euclidean distance.
Cluster centers    Data points
                   2     3     4     10    11    12    20    25    30
C1 = 4             2     1     0     6     7     8     16    21    26
C2 = 12            10    9     8     2     1     0     8     13    18
Table 2.2.1 : Distance of data points with cluster centers C1 and C2 for iteration 1
From the above Table 2.2.1, the data points are clustered in accordance with the minimum distance from the cluster centers. So, the cluster center C1 = 4 has the data points {2, 3, 4} and the cluster center C2 = 12 has the data points {10, 11, 12, 20, 25, 30}. As per Step 4 of the algorithm, we have assigned each data point to the cluster center whose distance from the data point is minimum with respect to the other cluster centers.
Now, calculate the new cluster center for each cluster using the mean :
Mean = (1 / n) * Σ (i = 1 to n) Xi
So, for data points {2,3,4}, mean = 3 which is the new cluster center C1 = 3 while for
data points {10,11,12,20,25,30}, mean = 18, which is the new cluster center C2 = 18. Now we
have to repeat the same steps till there is no change in the cluster center.
Cluster centers    Data points
                   2     3     4     10    11    12    20    25    30
C1 = 3             1     0     1     7     8     9     17    22    27
C2 = 18            16    15    14    8     7     6     2     7     12
Table 2.2.2 : Distance of data points with cluster centers C1 and C2 for iteration 2
As per the above Table 2.2.2, the cluster center C1 = 3 clusters the data points {2, 3, 4, 10} while the cluster center C2 = 18 clusters the data points {11, 12, 20, 25, 30}.
Now, calculate the new cluster center for each cluster using the mean.
So, for the data points {2, 3, 4, 10}, mean = 4.75, which is the new cluster center C1 = 4.75, while for the data points {11, 12, 20, 25, 30}, mean = 98 / 5 = 19.6, which is the new cluster center C2 = 19.6. Now we have to repeat the same steps till there is no change in the cluster centers.
Cluster centers    Data points
                   2      3      4      10     11     12     20     25     30
C1 = 4.75          2.75   1.75   0.75   5.25   6.25   7.25   15.25  20.25  25.25
C2 = 19.6          17.6   16.6   15.6   9.6    8.6    7.6    0.4    5.4    10.4
Table 2.2.3 : Distance of data points with cluster centers C1 and C2 for iteration 3
As per Table 2.2.3, the cluster center C1 = 4.75 now clusters the data points {2, 3, 4, 10, 11, 12} while C2 = 19.6 clusters {20, 25, 30}. The memberships have changed, so one more iteration is needed : the new means are C1 = 42 / 6 = 7 and C2 = 75 / 3 = 25. Repeating the distance calculation with C1 = 7 and C2 = 25 assigns exactly the same points to each cluster again, so we can say that the final cluster centers are C1 = 7 and C2 = 25.
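The iterations above can be cross-checked with a short Python sketch, a minimal 1-D K-means using only the data and the starting centers of this example; it converges to the same final centers, 7 and 25.
def kmeans_1d(points, centers):
    # Repeat the assign / recompute steps until the centers stop changing
    while True:
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans_1d([2, 3, 4, 10, 11, 12, 20, 25, 30], [4, 12])
print(centers)     # [7.0, 25.0]
print(clusters)    # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]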
1. Document clustering :
Step 2 : Create two databases; one for storing the details of the authorized user and
the other for storing details of the crime occurring in a particular location
Step 3 : The data can be added to the database using SQL queries
Step 5 : The PHP file to retrieve data converts the database in the JSON format.
Step 6 : This JSON data is parsed from the android so that it can be used.
Step 7 : The location added by the user from the android device is in the form of
address which is converted in the form of latitudes and longitude that is further added to
the online database.
Step 9 : The various crime types used are Robbery, Kidnapping, Murder, Burglary
and Rape. Each crime type is denoted using a different color marker.
Step 10 : The crime data plotted on the maps is passed to the K - means algorithm.
Step 11 : The data set is divided into different clusters by computing the distance of
the data from the centroid repeatedly.
Step 12 : A different colored circle is drawn for different clusters by taking the
centroid of the clusters as the center where the color represents the frequency of the crime
Step 13 : This entire process of clustering is also performed on each of the crime types
individually.
Step 14 : In the end, a red colored circle indicates the location where safety measures
must be adopted.
From the clustered results it is easy to identify crime prone areas and can be used to
design precaution methods for future. The classification of data is mainly used to
distinguish types of preventive measures to be used for each crime. Different crimes
require different treatment and it can be achieved easily using this application. The
clustering technique is effective in terms of analysis speed, identifying common crime
patterns and crime prone areas for future prediction. The developed application has
promising value in the current complex crime scenario and can be used as an effective
tool by the Indian police and law enforcement organizations for crime detection and prevention.
3. Cyber-profiling criminals :
The activities of Internet users are increasing from year to year and have had an
impact on the behavior of the users themselves. Assessment of user behavior is often only
based on interaction across the Internet without knowing any others activities. The log
activity can be used as another way to study the behavior of the user. The Log Internet
activity is one of the types of big data so that the use of data mining with K-Means
technique can be used as a solution for the analysis of user behavior. This study is the
process of clustering using K-Means algorithm which divides into three clusters, namely
high, medium, and low. The cyber profiling is strongly influenced by environmental
factors and daily activities. For investigation, the cyber-profiling process makes a good contribution to the field of forensic computer science. Cyber Profiling is one of the efforts
to know the alleged offenders through the analysis of data patterns that include aspects of
technology, investigation, psychology, and sociology. Cyber Profiling process can be
directed to the benefit of:
1. Identification of users of computers that have been used previously.
2. Mapping the subject of family, social life, work, or network-based organizations,
including those for whom he/she worked.
3. Provision of information about the user regarding his ability, level of threat, and
how vulnerable to threats to identify the suspected abuser
Criminal profiles are generated in the form of data on the personal traits, tendencies, habits, and geographic-demographic characteristics of the offender (for example : age, gender, socio-economic status, education, origin and place of residence). Preparation of a criminal profile involves the analysis of the physical evidence found at the crime scene, the process of building an understanding of the victim (victimology), looking for a modus
operandi (whether the crime scene was planned or unplanned), and the process of tracing what the perpetrators deliberately left behind (the signature).
WSS = Σ (pi - q(i))², where pi represents a data point and q(i) represents the center of the cluster to which pi is assigned.
The optimal number of clusters can be defined as follows
1. Run the clustering algorithm for different values of K, for instance by varying K from 1 to 10 clusters.
2. Calculate the Within-Cluster Sum of Squared errors (WSS) for each value of K, and choose the K beyond which the WSS starts to diminish only slowly. The squared error for each point is the square of the distance of the point from its cluster center.
The WSS score is the sum of these Squared Errors for all the points.
Any distance metric like the Euclidean Distance or the Manhattan Distance can be
used.
3. Plot the curve of WSS according to the number of clusters K as shown in Fig. 2.4.1.
4. Location of a bend (knee) in the plot is generally considered as an indicator of the
appropriate number of clusters.
5. In the plot of WSS-versus-K, this is visible as an elbow.
From Fig. 2.4.1 we conclude that, for the given data, the elbow point is found at K = 3; so for this problem the appropriate number of clusters is 3. A small code sketch of this procedure is given below.
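The following minimal Python sketch reuses the nine 1-D points of Section 2.2 purely as an illustration; printing WSS for K = 1 to 5 and looking for the bend in the values reproduces the elbow procedure.
def kmeans_1d(points, k, iters=50):
    # Simple 1-D K-means : seed the centers with the first k points,
    # then repeat the assignment and mean-update steps a fixed number of times
    centers = [float(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

def wss(centers, clusters):
    # Within-Cluster Sum of Squared errors
    return sum((x - c) ** 2 for c, pts in zip(centers, clusters) for x in pts)

points = [2, 3, 4, 10, 11, 12, 20, 25, 30]
for k in range(1, 6):
    centers, clusters = kmeans_1d(points, k)
    print(k, round(wss(centers, clusters), 1))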
These diagnostics can be complemented by the silhouette method. For a point i in cluster Ci, a(i) is the average distance between i and all the other points of its own cluster :
a(i) = (1 / (|Ci| - 1)) * Σ d(i, j), summed over all points j in Ci with j ≠ i
b(i) is the smallest average distance from i to the points of any other cluster :
b(i) = min over clusters Cj ≠ Ci of (1 / |Cj|) * Σ d(i, j), summed over all points j in Cj
d(i, j) is the distance between points i and j. Generally, Euclidean Distance is used as
the distance metric.
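The silhouette coefficient of a point is s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the average of s(i) over all points measures how well the chosen K fits the data. Assuming scikit-learn is available, it can be computed in a few lines of Python (again reusing the small 1-D data set only as an illustration) :
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Nine 1-D points reshaped into a column vector
X = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # The higher the mean silhouette, the better the clustering for this k
    print(k, round(silhouette_score(X, labels), 3))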
2.5 Diagnostics
The WSS heuristic provides at least several possible values of K to consider. When the number of attributes is relatively small, a common approach to further refine the choice of K is to examine how distinct the identified clusters are. An example of distinct clusters is shown in Fig. 2.5.1 (a).
Fig. 2.5.1 (a) Example of distinct clusters Fig. 2.5.1 (b) Example of less obvious clusters
To judge how distinct the clusters are, the following three questions need to be considered :
Q 1) Are the clusters well separated from each other ?
Q 2) Does any cluster have only a few points ?
Q 3) Do any of the centroids appear to be too close to each other ?
If the answers are unsatisfactory, the clustering may look less obvious, as shown in Fig. 2.5.1 (b).
2.7 Classification
In machine learning and statistics, classification is a supervised learning approach in
which the computer program learns from the data input given to it and then uses this
learning to classify new observations. Classification is a technique to categorize data into a desired and distinct number of classes, where a label can be assigned to each class. Applications of classification include speech recognition, handwriting recognition, biometric identification, document classification, etc. A binary classifier classifies data into 2 distinct classes (2 possible outcomes), while a multi-class classifier classifies data into more than two distinct classes. The different types of classification algorithms in Machine Learning are :
1. Linear Classifiers : Logistic Regression, Naive Bayes Classifier
2. Nearest Neighbor
3. Support Vector Machines
4. Decision Trees
5. Boosted Trees
6. Random Forest
7. Neural Networks
4. Branch / sub-tree : A branch or sub-tree is a sub-part of the entire tree.
5. Decision node : A decision node is created when a sub-node is split into further sub-nodes.
6. Parent and child node : A node which is divided into sub-nodes is called the parent node, and the sub-nodes are called child nodes.
7. Leaf / terminal node : Nodes which do not get split further are called leaf or terminal nodes.
Greedy algorithms cannot guarantee to return the globally optimal decision tree.
This can be mitigated by training multiple trees, where the features and samples are
randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the data set prior to fitting with the decision tree.
Information gain in a decision tree with categorical variables gives a biased response for attributes with a greater number of categories.
Generally, a decision tree gives lower prediction accuracy for a dataset as compared to some other machine learning algorithms.
Calculations can become complex when there are many class labels.
Basically, there are three algorithms for creating a decision tree; namely Iterative
Dichotomiser 3 (ID3), Classification And Regression Trees (CART) and C4.5. The next
section 2.8 will describe the above three decision tree algorithms in detail.
Play Cricket. Make a decision tree that predicts whether cricket will be played on the
day.
Sr. No Outlook Temperature Humidity Windy Play cricket
The play cricket column is the final outcome, which depends on the other four attributes. To start with, we have to choose the root node. To choose the best attribute, we find the entropy, which specifies the uncertainty of the data, and the information gain. With p positive (Yes) and n negative (No) examples,
Entropy = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))
The average information of an attribute is calculated as
I(Attribute) = Σ ((pi + ni) / (p + n)) * Entropy(Ai)
where the sum runs over the values Ai of the attribute, and the information gain is Gain = Entropy(S) - I(Attribute).
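These formulas can be written as a short Python sketch. The data set below is the standard 14-row play-cricket table that this example assumes (the rows are reproduced here by hand); running the sketch gives the gains derived step by step in the following pages : Outlook 0.247, Temperature 0.029, Humidity 0.152 and Windy 0.048.
import math
from collections import Counter

# Assumed 14-row data set : (Outlook, Temperature, Humidity, Windy, Play cricket)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),      ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rainy", "Mild", "High", "Strong", "No"),
]
COLUMNS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(rows):
    # Entropy = - sum over classes of P(class) * log2 P(class)
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, attribute):
    # Gain = Entropy(S) - I(attribute), with I the weighted entropy of the subsets
    col = COLUMNS[attribute]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for attribute in COLUMNS:
    print(attribute, round(gain(DATA, attribute), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048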
Outlook p n Entropy
Sunny 2 3 0.971
Rainy 3 2 0.971
Overcast 4 0 0
Temperature p n Entropy
Hot 2 2 1
Mild 4 2 0.918
Cool 3 1 0.811
I(Temperature) = (4/14) Entropy(Temperature = hot) + (6/14) Entropy(Temperature = mild) + (4/14) Entropy(Temperature = cool) = (4/14)(1) + (6/14)(0.918) + (4/14)(0.811) = 0.911
Gain = Entropy(S) - I(Temperature) = 0.940 - 0.911 = 0.029
Now, repeat the same procedure for finding the entropy for Humidity as shown in
Table 2.8.4.
Humidity p n Entropy
High 3 4 0.985
Normal 6 1 0.591
I(Humidity) = (7/14)(0.985) + (7/14)(0.591) = 0.788
Now we calculate the gain for the attribute Humidity :
Gain = Entropy(S) - I(Humidity) = 0.940 - 0.788 = 0.152
Now, repeat the same procedure for finding the entropy for Windy as shown in
Table 2.8.5.
Windy p n Entropy
Strong 3 3 1
Weak 6 2 0.811
I(Windy) = (6/14)(1) + (8/14)(0.811) = 0.892
Now we calculate the gain for the attribute Windy :
Gain = Entropy(S) - I(Windy) = 0.940 - 0.892 = 0.048
Finally, the attributes and their information gains are shown in Table 2.8.6. From this table we conclude that Outlook has the maximum gain among all the attributes. So, the root node of the decision tree is selected as Outlook.
Attribute        Information Gain
Outlook          0.247
Temperature      0.029
Humidity         0.152
Windy            0.048
Table 2.8.6 : Information gain of each attribute
So, the initial decision tree will look like Fig. 2.8.1.
As seen, for overcast there is only the outcome "Yes", because of which further splitting is not required and "Yes" becomes a leaf node, whereas the sunny and rainy branches have to be split further. So a new data set is created and the process is repeated. Now consider the new tables for Outlook = Sunny and Outlook = Rainy, as shown in Table 2.8.7. Now we solve for the attribute Outlook = Sunny. As seen from Table 2.8.8, for Outlook = Sunny the play cricket column has p = 2 and n = 3.
Attribute        Information Gain
Humidity         0.971
Temperature      0.571
Windy            0.02
From Table 2.8.12, it is seen that Humidity has the highest gain amongst the attributes, so it is selected as the next node, as shown in Fig. 2.8.2. As seen from Fig. 2.8.2, for Humidity there are only two outcomes : Normal "Yes" and High "No". So further expansion is not required and both become leaf nodes. Now consider the new table for Outlook = Rainy, as shown in Table 2.8.13.
For each attribute, such as Temperature, calculate the entropy for Cool and Mild, as shown in Table 2.8.16. The resulting information gains for the Outlook = Rainy subset are :
Attribute        Information Gain
Humidity         0.02
Windy            0.971
Temperature      0.02
Since Windy has the highest gain, it is selected as the next node under the rainy branch.
As seen from Fig. 2.8.3, for Windy there are only two outcomes : Weak "Yes" and Strong "No". So further expansion is not required, both become leaf nodes, and this becomes the final decision tree. Hence, given the attributes and decisions, we can easily construct the decision tree using the ID3 algorithm.
2.8.2 CART
Classification and Regression Tree (CART) is one of the commonly used decision tree algorithms. It uses a recursive partitioning approach where each input node is split into two child nodes; therefore, a CART decision tree is often called a binary decision tree. In CART, at each level of the decision tree, the algorithm identifies a condition, that is, which variable and level are to be used for splitting the input node (data sample) into two child nodes, and accordingly builds the decision tree.
CART is an alternative decision tree algorithm which can handle both regression and classification tasks. For classification it uses a metric named the Gini index to create decision points. Given the attributes and the decision as shown in Table 2.8.18, the procedure for creating the decision tree using CART is explained below.
Day Outlook Temperature Humidity Wind Decision
The Gini index is the measure used for classification tasks in the CART algorithm; it is based on the sum of the squared probabilities of each class. For one value v of an attribute it is defined as
Gini Index (Attribute = value) = GI(v) = 1 - Σ (Pi)², for i = 1 to the number of classes,
and for the attribute as a whole it is the weighted sum over its values :
Gini Index (Attribute) = Σ over values v of Pv * GI(v)
Outlook is an attribute which can be either sunny, overcast or rain. The summary of decisions for the outlook feature is given in Table 2.8.19.
Outlook      Yes   No   Number of instances
Sunny        2     3    5
Overcast     4     0    4
Rain         3     2    5
Using the above information from the table, we calculate Gini(Outlook) by using the formulae defined earlier :
Gini(Outlook = Sunny) = 1 - (2/5)² - (3/5)² = 1 - 0.16 - 0.36 = 0.48
Gini(Outlook = Overcast) = 1 - (4/4)² - (0/4)² = 0
Gini(Outlook = Rain) = 1 - (3/5)² - (2/5)² = 1 - 0.36 - 0.16 = 0.48
Then, we calculate the weighted sum of the Gini indexes for the outlook feature :
Gini(Outlook) = (5/14)(0.48) + (4/14)(0) + (5/14)(0.48) = 0.171 + 0 + 0.171 = 0.342
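The same computation can be expressed as a small Python sketch using only the Yes/No counts from Tables 2.8.19 to 2.8.22; it reproduces the weighted Gini indexes of this section (tiny differences in the last digit are rounding effects of the hand calculation).
def gini_value(yes, no):
    # GI(v) = 1 - P(Yes)^2 - P(No)^2 for one attribute value
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_attribute(counts):
    # Weighted sum : sum over values of (n_v / n) * GI(v)
    n = sum(yes + no for yes, no in counts.values())
    return sum((yes + no) / n * gini_value(yes, no) for yes, no in counts.values())

outlook     = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
temperature = {"Hot": (2, 2), "Cool": (3, 1), "Mild": (4, 2)}
humidity    = {"High": (3, 4), "Normal": (6, 1)}
wind        = {"Weak": (6, 2), "Strong": (3, 3)}

for name, counts in [("Outlook", outlook), ("Temperature", temperature),
                     ("Humidity", humidity), ("Wind", wind)]:
    print(name, round(gini_attribute(counts), 3))
# Outlook 0.343, Temperature 0.44, Humidity 0.367, Wind 0.429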
Hereafter, the same procedure is repeated for the other attributes. Temperature is an attribute which has 3 different values : Cool, Hot and Mild. The summary of decisions for temperature is given in Table 2.8.20.
Temperature    Yes   No   Number of instances
Hot            2     2    4
Cool           3     1    4
Mild           4     2    6
Gini(Temp = Hot) = 1 - (2/4)² - (2/4)² = 0.5
Gini(Temp = Cool) = 1 - (3/4)² - (1/4)² = 1 - 0.5625 - 0.0625 = 0.375
Gini(Temp = Mild) = 1 - (4/6)² - (2/6)² = 1 - 0.444 - 0.111 = 0.445
We will calculate the weighted sum of the Gini indexes for the temperature feature :
Gini(Temp) = (4/14)(0.5) + (4/14)(0.375) + (6/14)(0.445) = 0.142 + 0.107 + 0.190 = 0.439
Humidity is a binary class feature. It can be high or normal as shown in Table 2.8.21.
Humidity Yes No Number of instances
High 3 4 7
Normal 6 1 7
Gini(Humidity = High) = 1 - (3/7)² - (4/7)² = 1 - 0.183 - 0.326 = 0.489
Gini(Humidity = Normal) = 1 - (6/7)² - (1/7)² = 1 - 0.734 - 0.020 = 0.244
The weighted sum for the humidity feature is calculated next :
Gini(Humidity) = (7/14)(0.489) + (7/14)(0.244) = 0.367
Wind is a binary class feature similar to humidity. It can be weak or strong, as shown in Table 2.8.22.
Wind      Yes   No   Number of instances
Weak      6     2    8
Strong    3     3    6
Gini(Wind = Weak) = 1 - (6/8)² - (2/8)² = 1 - 0.5625 - 0.0625 = 0.375
Gini(Wind = Strong) = 1 - (3/6)² - (3/6)² = 1 - 0.25 - 0.25 = 0.5
Gini(Wind) = (8/14)(0.375) + (6/14)(0.5) = 0.428
After calculating the Gini index for each attribute, the attribute having the minimum value is selected as the node. So, from Table 2.8.23, the outlook feature has the minimum value; therefore, the outlook attribute will be at the top of the tree, as shown in Fig. 2.8.4.
Feature Gini index
Outlook 0.342
Temperature 0.439
Humidity 0.367
Wind 0.428
We will apply the same principles to the other sub data sets in the following steps. Let us take the sub data set for Outlook = Sunny; we need to find the Gini index scores for the temperature, humidity and wind features respectively. The sub data set for Outlook = Sunny is shown in Table 2.8.24, and the decision counts for temperature within it are :
Temperature    Yes   No   Number of instances
Hot            0     2    2
Cool           1     0    1
Mild           1     1    2
Gini(Outlook = Sunny and Temperature = Hot) = 1 - (0/2)² - (2/2)² = 0
Gini(Outlook = Sunny and Temperature = Cool) = 1 - (1/1)² - (0/1)² = 0
Gini(Outlook = Sunny and Temperature = Mild) = 1 - (1/2)² - (1/2)² = 1 - 0.25 - 0.25 = 0.5
Gini(Outlook = Sunny and Temperature) = (2/5)(0) + (1/5)(0) + (2/5)(0.5) = 0.2
Now, we determine the Gini index of humidity for Outlook = Sunny as per Table 2.8.26.
Humidity    Yes   No   Number of instances
High        0     3    3
Normal      2     0    2
Gini(Outlook = Sunny and Humidity = High) = 1 - (0/3)² - (3/3)² = 0
Gini(Outlook = Sunny and Humidity = Normal) = 1 - (2/2)² - (0/2)² = 0
Gini(Outlook = Sunny and Humidity) = (3/5)(0) + (2/5)(0) = 0
Now, we determine Gini of Wind for Outlook=Sunny as per Table 2.8.27.
Wind Yes No Number of instances
Weak 1 2 3
Strong 1 1 2
Gini(Outlook = Sunny and Wind = Weak) = 1 - (1/3)² - (2/3)² = 0.444
Gini(Outlook = Sunny and Wind = Strong) = 1 - (1/2)² - (1/2)² = 0.5
Gini(Outlook = Sunny and Wind) = (3/5)(0.444) + (2/5)(0.5) = 0.266 + 0.2 = 0.466
Decision for sunny outlook
We’ve calculated gini index scores for features when outlook is sunny as shown
in Table 2.8.28. The winner is humidity because it has the lowest value.
Temperature 0.2
Humidity 0
Wind 0.466
Humidity is placed under the Outlook = Sunny branch, as the Gini index for Humidity is the minimum, as shown in Fig. 2.8.6.
As seen from Fig. 2.8.6, the decision is always No for high humidity with a sunny outlook. On the other hand, the decision is always Yes for normal humidity with a sunny outlook. Therefore, this branch is over, and the leaf nodes of Humidity for Outlook = Sunny are shown in Fig. 2.8.7.
Let us take sub dataset for Outlook= Rain, and determine the gini index for
temperature, humidity and wind features respectively. The sub dataset for Outlook=
Rain is as shown in Table 2.8.29.
Day Outlook Temperature Humidity Wind Decision
The calculation of the Gini index scores for the temperature, humidity and wind features when the outlook is rain is shown in the following Tables 2.8.30, 2.8.31 and 2.8.32.
Temperature Yes No Number of instances
Cool 1 1 2
Mild 2 1 3
Table 2.8.30 : Decision table of temperature for outlook=rain
Gini(Outlook = Rain and Temperature = Cool) = 1 - (1/2)² - (1/2)² = 0.5
Gini(Outlook = Rain and Temperature = Mild) = 1 - (2/3)² - (1/3)² = 0.444
Gini(Outlook = Rain and Temperature) = (2/5)(0.5) + (3/5)(0.444) = 0.466
Table 2.8.31 : Decision table of humidity for outlook=rain
Gini(Outlook = Rain and Humidity = High) = 1 - (1/2)² - (1/2)² = 0.5
Gini(Outlook = Rain and Humidity = Normal) = 1 - (2/3)² - (1/3)² = 0.444
Gini(Outlook = Rain and Humidity) = (2/5)(0.5) + (3/5)(0.444) = 0.466
Wind      Yes   No   Number of instances
Weak      3     0    3
Strong    0     2    2
Table 2.8.32 : Decision table of wind for outlook = rain
Since both wind values are pure (all Yes for weak, all No for strong), Gini(Outlook = Rain and Wind) = 0.
Feature         Gini index
Temperature     0.466
Humidity        0.466
Wind            0
Table 2.8.33 : Gini index of each feature for Outlook = Rain
Place the wind attribute on the outlook = rain branch and examine the new sub data sets as shown in Fig. 2.8.8.
As seen, when wind is weak the decision is always yes. On the other hand, if wind is
strong the decision is always no. This means that this branch is over and the final decision
tree using CART algorithm is depicted in Fig. 2.8.9.
2.8.3 C4.5
The C4.5 algorithm is used to generate a decision tree that is used as a decision tree classifier. It is mostly used in data mining, where decisions are generated based on a certain sample of data. It has many improvements over the original ID3 algorithm; in particular, the C4.5 algorithm can handle missing data.
If the training records contain unknown attribute values, C4.5 evaluates the gain for each attribute by considering only the records where the attribute is defined. For the corresponding records of each partition, the gain is calculated, and the partition that maximizes the gain is chosen for the next split. C4.5 also supports both categorical and continuous attributes; the values of a continuous variable are sorted, and split points between them are considered for partitioning.
The ID3 algorithm may construct a deep and complex tree, which would cause
overfitting. The C4.5 algorithm addresses the overfitting problem in ID3 by using a
bottom-up technique called pruning to simplify the tree by removing the least visited
nodes and branches.
This selection of an attribute may not be the best overall, but it is guaranteed to be the best at that step. This characteristic reinforces the efficiency of decision trees. However, selecting the wrong attribute with a bad split may propagate through the rest of the tree. To address this issue, an ensemble method such as random forest can be used, which randomizes the splitting or even randomizes the data and builds numerous tree structures. These trees then vote for each class, and the class with the most votes is picked as the predicted class.
There are a few ways to evaluate a decision tree; some of the important evaluations are given as follows.
First, evaluate whether the splits of the tree make sense. Conduct stability checks by validating the decision rules with domain experts, and determine if the decision rules are sound. Second, look at the depth and nodes of the tree. Having a very large number of layers and nodes with few members might be signs of overfitting. In overfitting, the model fits the training set well, but it performs poorly on new samples in the testing set.
For decision tree learning, overfitting can be caused by either a lack of training data or biased data in the training set. To avoid overfitting in a decision tree, two methodologies can be used : first, stop growing the tree early, before it reaches the point where all the training data is perfectly classified; second, grow the full tree and then post-prune it with methods such as reduced-error pruning and rule-based post-pruning. Lastly, use the standard diagnostic tools that apply to classifiers to help evaluate overfitting.
The structure of a decision tree is sensitive to small variations in the training data; constructing two decision trees on two different subsets of the same dataset may result in very different trees. A decision tree is also not a good choice if the dataset contains many irrelevant or redundant variables. If the dataset contains redundant variables, the resulting decision tree ignores all but one of them, because the algorithm cannot detect any information gain from including further redundant variables. On the other hand, if the dataset contains irrelevant variables and such variables are accidentally chosen as splits in the tree, the tree may grow too large and may end up with less data at every split, where overfitting is likely to occur.
Although decision trees are able to handle correlated variables, when most of the variables in the training set are correlated, overfitting is likely to occur. To overcome the issues of instability and potential overfitting, one can combine the decisions of several randomized shallow decision trees using a classifier called a random forest.
For binary decisions, a decision tree works better if the training dataset consists of records with an even probability of each result. When methods such as logistic regression are applied to a dataset with many variables, a decision tree can help determine which variables are the most useful to select, based on information gain.
This equation is known as Bayes Theorem, which relates the conditional and marginal
probabilities of stochastic events A and B as
P(A|B) = P(B|A) P(A) / P(B)
Each term in Bayes’ theorem has a conventional name. P(A) is the prior probability or
marginal probability of A. It is “prior” in the sense that it does not take into account any
information about B. P(A|B) is the conditional probability of A, given B. It is also called
the posterior probability because it is derived from or depends upon the specified value
of B. P(B|A) is the conditional probability of B given A. P(B) is the prior or marginal
probability of B, and acts as a normalizing constant. This theorem plays an important role
in determining the probability of the event, provided the prior knowledge of another
event.
Temperature = Cool, Humidity = High, Wind = Strong and Outlook = Sunny which is not
in the Table 2.12.1.
From Table 2.12.1, it is seen that the attribute play cricket has the outcome Yes = 9 and No = 5 for the 14 records. Using the table, we determine P (Strong | Yes), P (Strong | No), P
(Weak | Yes), P (Weak| No), P (High | Yes), P (High| No), P (Normal| Yes), P (Normal
| No), P (Hot | Yes), P (Hot | No), P (Mild| Yes), P (Mild| No), P (Cool| Yes), P (Cool|
No), P (Sunny| Yes), P (Sunny| No), P (Overcast| Yes), P (Overcast| No) and P (Rain|
Yes), P (Rain| No) as shown in Fig. 2.12.1.
Consider the first attribute, Wind, which has two possible values, Strong and Weak. Finding the probabilities for Wind is demonstrated in Fig. 2.12.1.
As seen from Fig. 2.12.1, Wind has the two sub-parameters Strong and Weak. From Table 2.12.2, Strong appears 6 times while Weak appears 8 times. The number of Yes outcomes for Strong is 3 and the number of No outcomes for Strong is 3; therefore we can say P(Strong | Yes) = 3/9 and P(Strong | No) = 3/5. Similarly, for Weak, the number of Yes outcomes is 6 and the number of No outcomes is 2, so P(Weak | Yes) = 6/9 and P(Weak | No) = 2/5. Similarly, for Humidity, the two values are High and Normal. From the table, High appears 7 times while Normal appears 7 times. The number of Yes outcomes for High is 3 and the number of No outcomes is 4; therefore P(High | Yes) = 3/9 and P(High | No) = 4/5. Similarly, for Normal, the number of Yes outcomes is 6 and the number of No outcomes is 1, so P(Normal | Yes) = 6/9 and P(Normal | No) = 1/5. For Temperature and Outlook, the probabilities are given as shown in Table 2.12.2.
Outlook, the probabilities are given as shown in Table 2.12.2.
Temperature :
P(Hot | Yes) = 2/9     P(Mild | Yes) = 4/9     P(Cool | Yes) = 3/9
P(Hot | No) = 2/5      P(Mild | No) = 2/5      P(Cool | No) = 1/5
Outlook :
P(Sunny | Yes) = 2/9     P(Overcast | Yes) = 4/9     P(Rain | Yes) = 3/9
P(Sunny | No) = 3/5      P(Overcast | No) = 0        P(Rain | No) = 2/5
Now, considering the problem statement, let X = {Sunny, Cool, High, Strong}. Using the relation we can write
P(X|Yes) = P(Yes) * P(Sunny|Yes) * P(Cool|Yes) * P(High|Yes) * P(Strong|Yes)
= 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053
P(X|No) = P(No) * P(Sunny|No) * P(Cool|No) * P(High|No) * P(Strong|No)
= 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206
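A minimal Python sketch of this calculation is given below. It simply hard-codes the prior and conditional probabilities derived above; it is an illustration of the naive Bayes scoring, not a general implementation.

```python
# Conditional probabilities taken from the worked example above; P(Yes) = 9/14, P(No) = 5/14.
prior = {"Yes": 9/14, "No": 5/14}
cond_prob = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}

def naive_bayes_score(x, label):
    # Unnormalised posterior : P(label) multiplied by P(attribute value | label) for each attribute
    score = prior[label]
    for value in x:
        score *= cond_prob[label][value]
    return score

x = ["Sunny", "Cool", "High", "Strong"]
scores = {label: naive_bayes_score(x, label) for label in ("Yes", "No")}
print(scores)                       # approximately {'Yes': 0.0053, 'No': 0.0206}
print(max(scores, key=scores.get))  # 'No' - cricket is not played
```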
As P(X|No) is greater than P(X|Yes), the answer is No for playing cricket under these conditions, i.e. Outlook is Sunny, Temperature is Cool, Humidity is High and Wind is Strong. With this approach, we can determine the answer for playing cricket as Yes or No for other conditions which are not mentioned in Table 2.12.2.
In this way, we have learned K-means clustering along with its use cases and two statistical classifiers in this chapter.
Summary
Clustering is one of the most popular exploratory data analysis techniques. It involves the task of grouping data into subgroups such that data points in the same subgroup (cluster) are very similar and data points in different clusters are dissimilar.
K-means clustering is one of the simplest and popular unsupervised machine
learning algorithms which tries to partition the dataset into K pre-defined distinct
non-overlapping subgroups (clusters) in iterative manner where each data point
belongs to only one group.
Some use cases of clustering are document clustering, fraud detection, cyber-
profiling criminals, delivery store optimization, customer segmentation etc.
The Silhouette method is used for interpretation and validation of consistency within clusters of data, while the gap statistic compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data.
Part - B Questions
UNIT - III
3
Association and
Recommendation System
Syllabus
lAdvanced Analytical Theory and Methods: Association Rules - Overview - Apriori Algorithm -
Evaluation of Candidate Rules - Applications of Association Rules - Finding Association and finding
similarity - Recommendation System : Collaborative Recommendation- Content Based
Recommendation - Knowledge Based Recommendation- Hybrid Recommendation Approaches.
Contents
3.1 Overview of Association Rules
3.2 Apriori Algorithm
3.3 Evaluation of Candidate Rules
3.4 Applications of Association Rules
3.5 Finding Associations and Finding Similarity
3.6 Recommendation System
3.7 Collaborative Recommendation
3.8 Content-based Recommendation
3.9 Knowledge based Recommendation
3.10 Hybrid Recommendation Approaches
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions
Here, the IF part of the rule (the {X} above) is known as the antecedent and the THEN part of the rule is known as the consequent (the {Y} above). The antecedent is the condition and the consequent is the result. For example, if a customer is purchasing Bread, then the customer is also likely to purchase Butter or Eggs.
Generally, association rule mining is a two-step process : first, find all the frequent itemsets and second, generate strong association rules from those frequent itemsets.
There are mainly two algorithms used for finding the frequent itemsets, namely the Apriori algorithm and the FP-growth algorithm. In this chapter, we mainly focus on the Apriori algorithm for the discussion of association rules.
In the first iteration, the Apriori algorithm identifies the candidate 1-itemsets (for example, {Bread}, {Milk}, {Butter}, ...) and evaluates them to identify the frequent 1-itemsets among them. In the second iteration, the algorithm pairs the frequent itemsets into candidate 2-itemsets (for example, {Bread, Butter}, {Bread, Milk}, {Jam, Butter}, ...) and again evaluates them to identify the frequent 2-itemsets among them. Likewise, the algorithm continues to run n iterations on n-itemsets to evaluate the frequent n-itemsets. At each iteration, the algorithm checks whether the support criterion is met and stops when no itemset meets the minimum support or when the itemsets reach a predefined length.
This algorithm uses two steps, "join" and "prune", to reduce the search space. The join step generates candidate (K+1)-itemsets from the frequent K-itemsets by joining them with each other, while the prune step scans the count of each candidate itemset in the database. If a candidate itemset does not meet the minimum support, it is removed. This step is performed to reduce the size of the candidate itemsets.
The algorithm is composed of a sequence of steps to be followed to find the most frequent itemsets in a given database; it performs the join and the prune steps iteratively until the most frequent itemsets are obtained. The generalized steps of the Apriori algorithm are given below.
Apriori Algorithm
Step 1 : In the first iteration of the algorithm, each item is taken as a candidate 1-itemset. The algorithm counts the occurrences of each item.
Step 2 : In the set of 1-itemsets, the items which satisfy the minimum support are determined. The minimum support is already defined in the problem. So, the candidates whose count is greater than or equal to the minimum support are taken forward to the next iteration and the others are pruned, i.e. removed.
Step 3 : In the second iteration, the frequent 2-itemsets with minimum support are revealed. In this step, the candidate 2-itemsets are generated by joining the frequent 1-itemsets with each other.
Step 4 : The candidate 2-itemsets are pruned using the minimum support threshold value, as in Step 2.
Step 5 : The next iteration forms 3-itemsets using the join and prune steps. If all the 2-itemset subsets of a candidate are frequent, then the superset can be frequent; otherwise it is pruned.
Step 6 : The process continues for {4-itemsets, 5-itemsets, ..., K-itemsets} until no itemset meets the minimum support criterion or the itemsets reach a predefined length. A small sketch of these steps is given below.
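A minimal Python sketch of these steps is given below, assuming the transactions are given as lists of items. The candidate generation here simply unions pairs of frequent itemsets, which is a simplification of the full join-and-prune rule.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    transactions = [set(t) for t in transactions]
    # Iteration 1 : every individual item is a candidate 1-itemset
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate k-itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        # Prune step : keep only the itemsets meeting the minimum support count
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Join step : build candidate (k+1)-itemsets from the frequent k-itemsets
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

# Example 1 below : minimum support of 50 % over 4 transactions, i.e. a count of 2
tx = [["A", "B", "C"], ["A", "C"], ["A", "D"], ["B", "E", "F"]]
print(apriori(tx, min_support_count=2))   # {A}: 3, {B}: 2, {C}: 2, {A, C}: 2
```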
Example 1:
In a given table the transaction Ids (TID's) and Itemsets are provided, so find all the
frequent itemsets using Apriori algorithm with minimum support = 50 % and Minimum
confidence = 50 %.
TID Itemsets
T1 A, B, C
T2 A, C
T3 A, D
T4 B, E, F
C1
Items Support_count
{A} 3
{B} 2
{C} 2
{D} 1
{E} 1
{F} 1
Now, compare the candidate support count with the minimum support count. Here, the minimum support count = 2 (50 % of 4 transactions). So, from C1, compare the support count of each item with the minimum support count; it should be greater than or equal to 2. Therefore, we need to prune/remove the items whose support count is less than 2. The itemsets that satisfy the minimum support count (L1) are shown in Table 3.2.2.
L1
Items Support_count
{A} 3
{B} 2
{C} 2
Table 3.2.2 : Items satisfying minimum support count for 1-Itemset in Iteration 1
Iteration 2 : Now perform Join operations by combining single items from L1 into 2-
Itemsets with all combinations and find their support count. Therefore, candidate support
(C2) calculated for 2-Itemset is shown in Table 3.2.3.
C2
Items Support_count
{A, B} 1
{B, C} 1
{A, C} 2
Compare the candidate support count with the minimum support count 2. Here, we find that only one itemset, {A, C}, has support count equal to 2. Therefore, we need to prune/remove the other itemsets whose support count is less than 2. The itemset that satisfies the minimum support count (L2) is shown in Table 3.2.4.
L2
Items Support_count
{A, C} 2
Table 3.2.4 : Items satisfying minimum support count for 2-Itemset in Iteration 2
Here, we find only one itemset, {A, C}, which satisfies the minimum support count and is therefore the frequent itemset for the given problem. Now, let us find the final rules with their support and confidence for this frequent itemset, as given in Table 3.2.5.
A → C : Support count = 2, Confidence = Support / Occurrence of A = 2/3 = 0.66, i.e. 66 %
C → A : Support count = 2, Confidence = Support / Occurrence of C = 2/2 = 1.0, i.e. 100 %
Table 3.2.5 : Final rules with support and confidence
In both the cases, the confidence is greater than the minimum confidence of 50 % given in the problem. Therefore, the final rules will be A → C and C → A.
Example 2 :
In the given Table 3.2.6, the transaction Ids (TID's) and the list of itemsets are provided. Find all the frequent itemsets using the Apriori algorithm with minimum support count = 2 and minimum confidence = 50 %.
TID Itemsets
T1 I1, I2, I5
T2 I2, I4
T3 I2, I3
T4 I1, I2, I4
T5 I1, I3
T6 I2, I3
T7 I1, I3
T8 I1, I2, I3, I5
T9 I1, I2, I3
As per previous example, the different iterations of Apriori algorithm are given in
following tables. Now let us find candidate support (C1) and Items satisfying minimum
support count (L1) for 1-Itemset as shown in Table 3.2.7. In this example the minimum
support count provided is 2, so prune all the values which are less than 2 in
Support_count column.
C1 / L1
{I1} 6
{I2} 7
{I3} 6
{I4} 2
{I5} 2
Table 3.2.7 : Candidate support and items satisfying minimum support count for 1-Itemset (since every item has a support count of at least 2, L1 is identical to C1)
Now calculate candidate support (C2) and Items satisfying minimum support count
(L2) for 2-Itemset as shown in Table 3.2.8.
C2 : {I1, I2} 4, {I1, I3} 4, {I1, I4} 1, {I1, I5} 2, {I2, I3} 4, {I2, I4} 2, {I2, I5} 2, {I3, I4} 0, {I3, I5} 1, {I4, I5} 0
L2 : {I1, I2} 4, {I1, I3} 4, {I1, I5} 2, {I2, I3} 4, {I2, I4} 2, {I2, I5} 2
Table 3.2.8 : Candidate support and items satisfying minimum support count for 2-Itemset
Similarly, calculate candidate support (C3) and Items satisfying minimum support
count (L3) for 3-Itemset as shown in Table 3.2.9.
C3 : {I1, I2, I3} 2, {I1, I2, I5} 2 ; L3 : {I1, I2, I3} 2, {I1, I2, I5} 2
Table 3.2.9 : Candidate support and items satisfying minimum support count for 3-Itemset
Now, let us find the final rules with support and confidence for the frequent itemsets, as given in Table 3.2.10.
Association rule : I1 → I2 ∧ I3, Support count = 2, Confidence = Support / Occurrence of I1 = 2/6 = 0.33, i.e. 33 %
From the above Table 3.2.10, we can conclude that the final rule for the given problem would be I5 → I1 ∧ I2 (from the frequent itemset {I1, I2, I5}), as its confidence is more than the minimum confidence of 50 %.
Confidence is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental. Measures such as lift and leverage not only ensure that interesting rules are identified but also filter out the coincidental rules.
Association rule mining discovers all sets of items that have support greater than the minimum support and then uses those large itemsets to produce the desired rules that have confidence greater than the minimum confidence. The lift of a rule is the ratio of the observed support to that expected if X and Y were independent.
For a rule X → Y, the Support, Confidence, Lift and Leverage can be calculated as
Support (X → Y) = Frequency(X ∪ Y) / N = (Number of transactions in which X and Y appear together) / (Total number of transactions), where N is the total number of transactions
Confidence (X → Y) = Frequency(X ∪ Y) / Frequency(X) = Support(X ∪ Y) / Support(X)
Lift (X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
Leverage (X → Y) = Support(X ∪ Y) − Support(X) × Support(Y)
For example, consider the following table, which has a list of itemsets along with their transaction IDs (TID). The support, confidence and lift can be calculated as follows :
TID ITEMS
T1 Bread, Milk
T2 Bread, Jam, Butter, Eggs
T3 Milk, Jam, Butter, Coke
T4 Bread, Milk, Jam, Butter
T5 Bread, Milk, Jam, Coke
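A minimal Python sketch of these calculations over the table above is given below; the rule {Bread} → {Milk} used at the end is only a hypothetical illustration.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Jam", "Butter", "Eggs"},
    {"Milk", "Jam", "Butter", "Coke"},
    {"Bread", "Milk", "Jam", "Butter"},
    {"Bread", "Milk", "Jam", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_measures(x, y):
    s_xy, s_x, s_y = support(x | y), support(x), support(y)
    return {
        "support": s_xy,
        "confidence": s_xy / s_x,
        "lift": s_xy / (s_x * s_y),
        "leverage": s_xy - s_x * s_y,
    }

# Hypothetical rule {Bread} -> {Milk}
print(rule_measures({"Bread"}, {"Milk"}))
# support 0.6, confidence 0.75, lift 0.9375, leverage -0.04
```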
new symptoms and defining relationships between the new signs and the
corresponding diseases.
3. Census Data : Every government has tons of census data which can be used to plan efficient public services as well as to help public businesses. This application of association rule mining helps governments support sound public policy and bring forth an efficient functioning of a democratic society.
4. Protein Sequence : Association rules can be used in protein sequence analysis. Protein sequences are made up of twenty types of amino acids, and each protein bears a unique 3D structure which depends on the sequence of these amino acids. A slight change in the sequence can cause a change in structure which might change the functioning of the protein. So, association rules can be effectively used to predict the protein sequence.
5. Retail Marketplace : In Retail, association rules can help determine what items are
purchased together, purchased sequentially, and purchased by season. This can
assist retailers to determine product placement and promotion optimization.
6. Telecommunications : In Telecommunications, association rules can be used to
determine what services are being utilized and what packages customers are
purchasing. For instance, Telecommunications these days is also offering TV and
Internet. Creating bundles for purchases can be determined from an analysis of
what customers purchase, thereby giving the company an idea of how to price the
bundles.
7. Banks : In Financial organizations like Banks, the association rules can be used to
analyze credit card purchases of customers to build profiles for fraud detection
purposes and cross-selling opportunities.
8. Insurance : In Insurance, the association rule mining can be used to build profiles
to detect medical insurance claim fraud by viewing profiles of claims. It allows scanning the profiles to determine whether more than one claim belongs to a particular claimant within a specified period of time.
A) Euclidean distance :
The Euclidean distance is the distance between two points in Euclidean space. Considering two points A and B with associated coordinates {a1, a2, ..., an} and {b1, b2, ..., bn}, the Euclidean distance between A and B is given by the formula
d(A, B) = sqrt((a1 − b1)^2 + (a2 − b2)^2 + ... + (an − bn)^2)
The Euclidean distance between two points cannot be negative, because the positive square root is intended. Therefore, in Euclidean space we assume that the lower the distance between two points, the higher the similarity.
B) Jaccard similarity :
The Jaccard similarity is another measure for finding a similarity index. The Jaccard similarity of sets A and B is the ratio of the sizes of the intersection and union of A and B :
J(A, B) = |A ∩ B| / |A ∪ B|
Here, we consider that the lower the value of the Jaccard distance (1 − J(A, B)), the higher the similarity.
C) Cosine similarity :
This method is very similar to the Jaccard similarity, but gives somewhat different
results, because it measures similarity instead of dissimilarity. The cosine distance
between two points is an angle that the vectors to those points make. This angle must be
in the range of 0 degree to 180 degrees, regardless of how many dimensions that space
has. The Cosine similarity can be calculated by the formula :
cos(θ) = (A · B) / (||A|| ||B||) = Σ(i=1 to n) Ai Bi / ( sqrt(Σ(i=1 to n) Ai^2) × sqrt(Σ(i=1 to n) Bi^2) )
D) Edit distance :
The edit distance between two strings x = x1 x2 ... xn and y = y1 y2 ... ym is the smallest number of insertions and deletions of single characters that will convert x to y. It is used to find how dissimilar two strings are by counting the number of operations required to convert one string into the other.
E) Hamming distance :
The Hamming distance is another measure for finding the similarity. In a given a
space of vectors, we define the Hamming distance between two vectors to be the number
of components in which they differ. The Hamming distance cannot be negative, and if it
is zero, then the vectors are identical.
The above similarity measures are used in applications such as plagiarism detection software, where textual similarity is computed between an original document and plagiarized documents, and mirror-page analysis, which detects fake copies of popular websites whose pages are quite similar to, but rarely identical with, the originals. A small sketch of these measures is given below.
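A minimal Python sketch of these measures is given below. Note that the edit-distance function shown uses the common Levenshtein variant, which also allows substitutions, so it differs slightly from the insert/delete-only definition above.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def edit_distance(x, y):
    # Classic dynamic programming over insertions, deletions and substitutions
    dp = [[i + j if i * j == 0 else 0 for j in range(len(y) + 1)] for i in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[len(x)][len(y)]

print(euclidean([1, 2], [4, 6]))           # 5.0
print(jaccard({1, 2, 3}, {2, 3, 4}))       # 0.5
print(cosine([1, 0, 1], [1, 1, 0]))        # 0.5
print(hamming("10011", "11010"))           # 2
print(edit_distance("kitten", "sitting"))  # 3
```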
suggestions about what they might like to buy, based on their past history of product searches or purchases on an online retailer's website. In simple words, a recommendation system filters the data using different algorithms and recommends the most relevant items to users. It captures the past behavior of a customer and, based on that, recommends the products which the user might be likely to buy.
The aim of developing recommender systems is to reduce information overload by
retrieving the most relevant information and services from a huge amount of data,
thereby providing personalized services. The most important feature of a recommender
system is its ability to "guess" a user's preferences and interests by analyzing the behavior
of this user and/or the behavior of other users to generate personalized recommendations.
The purpose of a recommender system is to suggest relevant items to users. To achieve this task, there exist four categories of methods, namely collaborative filtering methods, content-based methods, knowledge-based recommendation methods and hybrid recommendation methods.
There are four major types of recommendation systems shown in Fig. 3.6.1.
information relating to each user's preferences and needs. Using functional knowledge, it can draw connections between a customer's need and a suitable product. The hybrid recommender system, in turn, combines the strengths of two or more recommender systems and also eliminates the weaknesses which exist when only one recommender system is used. A detailed description of the above recommendation systems is given in the next sections.
they have liked in the past. The similarity between users or items can be calculated by correlation-based or cosine-based similarity measures. When calculating the similarity between items using these measures, only users who have rated both items are considered. The process of identifying similar users and recommending what similar users like is called collaborative filtering or a collaborative recommendation system. Users and items are represented as the rows and columns of a utility matrix, as shown in Fig. 3.7.1.
(Fig. 3.7.1 : Utility matrix of ratings given by User 1 to User 4 for the movies Harry Potter 1, Harry Potter 2, Harry Potter 3 and Jumanji; several cells of the matrix are blank)
In Fig. 3.7.1 above, the utility matrix represents movie ratings provided by different users for the movies Harry Potter 1, Harry Potter 2, Harry Potter 3 and Jumanji on a scale of 1 to 5, where 1 is the lowest and 5 is the highest rating. Many values in the matrix are blank; therefore, a collaborative recommendation system can be used to predict those missing values in the matrix.
The main challenge in designing a collaborative recommender is that the ratings matrices are sparse in nature. Consider the example of a movie application in which users specify ratings indicating how much they like or dislike specific movies. Most users would have watched only a small fraction of the large number of available movies. As a result, most of the ratings are unspecified. The specified ratings are referred to as observed ratings, while the unspecified ratings are referred to as "missing" or "unobserved" ratings. The basic idea of collaborative filtering methods is that these unspecified ratings can be estimated, because the observed ratings are often highly correlated across various users and items. This similarity can be used to make inferences about the partly specified values. Most models for collaborative filtering focus on leveraging either item-based correlations or user-based correlations for the prediction process; a few models use both types of correlations. The model is then used to estimate the missing values in the matrix, in the same way that a classifier imputes missing test labels.
There are two kinds of methods that are effectively used in collaborative filtering, namely memory-based methods and model-based methods. In memory-based methods, the ratings of user-item combinations are predicted on the basis of their neighborhoods; because of that, they are also called neighborhood-based collaborative filtering. Here, the similarity functions are computed between the columns of the ratings matrix to discover similar items. The upside of memory-based techniques is that they are simple to implement and the resulting recommendations are easy to explain. On the other hand, memory-based algorithms do not work very well with sparse ratings matrices. Although memory-based collaborative filtering algorithms are simple in nature, they tend to be heuristic and do not work well in all circumstances.
In model-based methods, a combination of data mining and machine learning methods is utilized to build predictive models : the model is parameterized and its parameters are learned within the context of an optimization framework. Examples of model-based methods include rule-based models, Bayesian methods, decision trees etc. In general, collaborative filtering is used for missing-value analysis where the underlying data matrix for the problem is very large and sparse. A small sketch of memory-based collaborative filtering is given below.
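A minimal Python sketch of neighborhood-based (memory-based) collaborative filtering is given below. The small ratings dictionary is a hypothetical example loosely modeled on Fig. 3.7.1, and missing ratings are simply absent keys.

```python
import math

ratings = {
    "User1": {"HP1": 4, "HP2": 5, "Jumanji": 3},
    "User2": {"HP1": 5, "HP3": 5, "Jumanji": 4},
    "User3": {"HP2": 4, "HP3": 3, "Jumanji": 4},
    "User4": {"HP1": 4, "HP2": 4, "HP3": 4},
}

def cosine_sim(u, v):
    # Cosine similarity over the items both users have rated
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(r * r for r in ratings[u].values()))
    nv = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def predict(user, item):
    # Weighted average of the ratings given to `item` by similar users
    sims = [(cosine_sim(user, v), ratings[v][item])
            for v in ratings if v != user and item in ratings[v]]
    total = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / total if total else None

print(round(predict("User1", "HP3"), 2))   # estimate of a missing rating
```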
machine learning approaches and are capable of learning users' interests from the historical data of those users.
Content-based methods are advantageous for making recommendations for new items which do not yet have a sufficient number of ratings. For such items, recommendations are made based on the similarity of their attributes to other items which might have been rated by the users. Therefore, to make recommendations for such items, no history of ratings is required, since the supervised model can leverage the ratings of other items with similar attributes. However, content-based methods have several disadvantages. First, they tend to provide obvious recommendations because of their reliance on keywords or content; an item whose description does not match the keywords of what the user has already consumed has little chance of being recommended, which is undesirable. Second, even though content-based methods are effective at providing recommendations for new items, they are not reliable for new users, because the training model for the target user needs a sufficiently large number of that user's ratings in order to make robust predictions without overfitting. Because of this, content-based methods have different trade-offs from collaborative filtering systems.
automobiles, real estate, tourism requests etc. In such examples, enough ratings may not
be available for the recommendation process. As the items are bought rarely, and with
different types of detailed options, it is difficult to obtain a sufficient number of ratings
for a specific instantiation of the item at hand. This problem is also referred to as the cold-start problem. Furthermore, the historical data about such cases may not be meaningful, because consumer preferences evolve over time. When dealing with such items, a particular item may have many different attributes associated with its various properties, and a user may be interested only in items with specific properties. For example, cars may have
several makes, models, colors, engine options, and interior options, and user interests
may be very specific based on combination of these options. Thus, in such cases the
knowledge-based recommender systems can be useful, in which ratings are not used for
the purpose of recommendations, rather, the recommendation process is performed on
the basis of similarities between customer requirements and item descriptions, or the use
of constraints specifying user requirements. The explicit specification of requirements
results in greater control of users over the recommendation process. The knowledge-
based recommender systems can be classified on the basis of the type of the interfaces
used like constraint-based recommender systems and case-based recommender system.
In constraint-based recommender systems, the users typically specify the requirements or
constraints with lower or upper limits on the item attributes where domain-specific rules
are used for matching the user's requirement for item attributes. While in case-based
recommender systems the specific cases are specified by the user as targets or anchor
points and similarity metrics are defined on the item attributes to retrieve analogous
items to these cases. The similarity metrics form the domain knowledge that is used in
such systems. The obtained results are often used as new target cases with some
interactive modifications by the user. Note that in both the cases, the system offers an
opportunity to the user to modify their specified requirements. However, the interactivity
in knowledge-based recommender system is achieved using guidance takes place
through either conversational systems, Search-based systems or navigation-based
systems. In conversational systems, the preferences of users are determined iteratively in
the context of a feedback loop as the item domain is complex and the user preferences can
be determined only in the context of an iterative informal system. In search-based
systems, the user preferences are formed by using a precise sequence of questions that
can provide the ability to specify user constraints, while in navigation-based systems, the
user specifies a number of modification requests to the item being currently
recommended through an iterative set of change requests. Most of the times the
knowledge-based systems work analogous to content-based systems but major difference
between them is content-based systems learn from past user behavior, whereas
Feature combination : In this technique, the features are derived from different
knowledge sources, then they are combined together and assigned to a single
recommendation algorithm.
Feature augmentation : This technique computes the feature or set of features, which
are the part of the input for the next technique.
Cascade : In this technique, a strict priority order is given to the recommenders, with the lower priority recommenders breaking ties in the scoring of the higher priority ones.
Meta-level : In this technique, the model produced by one recommendation technique is used as the input for another technique.
Summary
The Association rules are useful for analyzing and predicting customers behavior.
They play an important role in applications like market basket analysis, customer-
based analytics, catalog design, product clustering and store layout.
The "Market-basket Analysis" model is commonly used for a better understanding of association rules; it represents a many-to-many relationship between "items" and "baskets".
The Apriori is one of the most fundamental algorithms for generating association
rules. It uses support for pruning the itemsets and controlling the exponential
growth of candidate itemsets where smaller candidate itemsets, which are known
to be frequent itemsets, are combined and pruned to generate longer frequent
itemsets.
The association rule has three important measures that express the degree of
confidence in the rule called Support, Confidence, and Lift.
Applications of Association Rules are Market Basket Analysis, Medical Diagnosis,
Census Data, Protein Sequence, Telecommunications etc.
A distance measure is the measure of how similar one observation is compared to
a set of other observations. The similarity between entities can be measured by
finding the closest distance between two data points. The important distance
measures are Euclidean distance, Jaccard distance, Cosine distance, Edit Distance
and Hamming distance.
A recommendation system filters the data using different algorithms and
recommends the most relevant items to users. It captures the past behavior of a
customer and based on that, recommends the products which the users might be
likely to buy.
There are four basic types of recommendation systems namely Collaborative,
Content-based, Knowledge based and Hybrid.
Support count = (50/100) × No. of transactions = (50/100) × 4 = 2
Suppose in the transactions, the occurrence of itemset i is 6.
Therefore, Confidence = Support count / Occurrence of itemset i = 2/6 = 0.33, i.e. 33 %
Q.4 What are the applications of association rules ?
Ans. : There are many applications of association rules, some of them are described as
below
1. Market Basket Analysis : This is the most commonplace example of association
rules where data is gathered utilizing standardized barcode scanners in many
supermarkets. This database, known as the " market basket " database, comprises
of an enormous number of records on past transactions. A record in a database
lists all the items purchased by a customer in one sale. Realizing which groups are inclined towards which sets of items gives these shops the opportunity to alter the store layout and the store catalog to place associated items optimally with respect to each other.
2. Medical Diagnosis : Association rules in medical diagnosis can be beneficial for
assisting physicians for curing patients. As, Diagnosis is not an easy process and
has a scope of errors which may result in unreliable end-results. Using relational
association rule mining, we can identify the probability of the occurrence of illness
concerning various factors and symptoms. Further, it can be extended by adding
new symptoms and defining relationships between the new signs and the
corresponding diseases.
3. Census Data : Every government has tons of census data which can be used to plan efficient public services as well as to help public businesses. This application of association rule mining helps governments support sound public policy and bring forth an efficient functioning of a democratic society.
4. Protein Sequence : Association rules can be used in protein sequence analysis. Protein sequences are made up of twenty types of amino acids, and each protein bears a unique 3D structure which depends on the sequence of these amino acids. A slight change in the sequence can cause a change in structure which might change the functioning of the protein. So, association rules can be effectively used to predict the protein sequence.
5. Retail Marketplace : In Retail, association rules can help determine what items
are purchased together, purchased sequentially, and purchased by season. This
can assist retailers to determine product placement and promotion optimization.
eBay or Walmart etc. who uses recommendation system for their product
recommendations.
Q.6 What is Memory-based and model-based filtering ?
Ans. : In Memory-based methods, the ratings of user and item combinations are
predicted on the basis of their neighborhoods because of that they are also called as
neighborhood based collaborative filtering. Here, the similarity functions are computed
between the columns of the ratings matrix to discover similar items. The upsides of
memory-based techniques are that they are simple to implement and the resulting
recommendations are easy to describe.
In model-based methods, the combination of data mining and machine learning
methods are utilized with regards to predictive models where the model is
parameterized, the parameters of this model are learned within the context of an
optimization framework. The examples of model-based methods include rule-based
models, Bayesian methods, decision trees etc.
Part - B Questions
Q.1 Explain the Apriori algorithm for mining frequent itemsets with an example.
AU : May-17
UNIT - IV
4 Stream Memory
Syllabus
Introduction to Streams Concepts - Stream Data Model and Architecture - Stream Computing,
Sampling Data in a Stream - Filtering Streams - Counting Distinct Elements in a Stream -
Estimating moments . Counting oneness in a Window - Decaying Window - Real time Analytics
Platform (RTAP) applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions - Using Graph Analytics for Big Data : Graph Analytics.
Contents
4.1 Introduction to Streams Concepts
4.2 Stream Data Model and Architecture
4.3 Stream Computing
4.4 Sampling Data in a Stream
4.5 Filtering Streams
4.6 Counting Distinct Elements in a Stream
4.7 Estimating Moments
4.8 Counting ones in a Window
4.9 Decaying Window
4.10 Real Time Analytics Platform (RTAP)
4.11 Real Time Sentiment Analysis
4.12 Stock Market Predictions
4.13 Graph Analytics
Summary
Two Marks Questions with Answers [Part A - Questions]
Part B - Questions
the clickstream records and generates a security alert if the clickstream shows
suspicious behavior.
b) In financial institutions, to track market changes, customer portfolios are adjusted based on configured constraints.
c) In a power grid, an alert or notification is generated based on throughput when certain thresholds are reached.
d) In news sources, articles that are relevant to the audience are selected by analyzing the clickstream records from various sources based on their demographic information.
e) In network management and web traffic engineering, the streams of packets are
collected and processed to detect anomalies.
The streams which are input to the stream processor have to be stored in a temporary store or working store. The temporary store in the data stream model is a transient store used for holding the parts of the streams which can be queried for processing. The temporary store can be a disk or main memory, depending on how fast the queries have to be processed. The streams may also be placed in a large archival store, but archival data cannot be used for answering queries quickly; it can be used only in special circumstances. A streaming query is a continuous query that executes over the streaming data. Such queries are similar to database queries used for analyzing data, but differ in that they operate continuously on the data as the data arrive incrementally in real time. The stream processor supports two types of queries, namely ad-hoc queries and standing queries. An ad-hoc query is asked about the current state of a stream or streams, so each query can generate different results depending on when it is asked and on the values of its parameters. A common approach for supporting ad-hoc queries is to store a sliding window of each stream in the working or temporary store; since we cannot store the streams entirely, arbitrary queries about the streams are answered by storing appropriate parts or summaries of the streams. Ad-hoc queries are issued for a specific purpose, in contrast to a predefined query.
Alternatively, standing queries are continuous queries that execute over the streaming data and whose functions are predetermined. For a standing query, each time a new stream element arrives, the aggregate result is updated and produced.
The ETL platform receives queries from users; based on these, it fetches the events from message queues and applies the queries to the stream data to generate a result by performing additional joins, transformations or aggregations. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream. Popular ETL tools for streaming data are Apache Storm, Spark, Flink and Samza.
The third component of the data stream architecture is Query Engine which is used
once streaming data is prepared for consumption by the stream processor. Such data
must be analyzed to provide valuable insights, while the fourth component is streaming
data storage, which is used to store the streaming event data into different data storage
mediums like data lakes.
Stream data processing provides several benefits, such as the ability to deal with never-ending streams of events, real-time data processing, detection of patterns in time-series data and easy data scalability, while some of its limitations are network latency, limited throughput, slow processing, support for only window-sized portions of streams and limitations related to in-memory access to stream data.
The common examples of data stream applications are
Sensor networks : These are a huge source of data occurring in streams and are used in numerous situations that require constant monitoring of several variables, based on which important decisions are made.
Network traffic analysis : In which, network service providers can constantly get
information about Internet traffic, heavily used routes, etc. to identify and predict
potential congestions or identify potentially fraudulent activities.
Financial applications : In which online analysis of stock prices is performed, which is used for making sell decisions about a product, quickly identifying correlations with other products, understanding fast-changing trends and, to an extent, forecasting future valuations of the product.
Queries over continuous data streams have much in common with queries in a traditional DBMS. Two types of queries can be identified as typical over data streams, namely one-time queries and continuous queries :
a) One-time queries : One-time queries are queries that are evaluated once over a
point-in-time snapshot of the data set, with the answer returned to the user. For
example, a stock price checker may alert the user when a stock price crosses a
particular price point.
b) Continuous queries : Continuous queries, on the other hand, are evaluated
continuously as data streams continue to arrive. The answer to a continuous query is
produced over time, always reflecting the stream data seen so far. Continuous query
answers may be stored and updated as new data arrives, or they may be produced as
data streams themselves.
analyze and act on insights, and to move from batch processing to real time analytical
decisions. The stream computing supports low-latency velocities and massively parallel
processing architectures to obtain the useful knowledge from big data. Consequently, the
stream computing model is a new trend for high-throughput computing in the big data
analytics. The different organizations that use stream computing include telecommunications, health care, utility companies, municipal transit, security agencies and many more. Two popular use cases of stream computing are (i) distribution load forecasting, condition-based maintenance and smart meter analytics in the energy industry, and (ii) monitoring a continuous stream of data and generating alerts when an intrusion is detected on a network through a sensor input.
The Integrated Development Environment (IDE) is used for debugging and testing of stream processing applications that process streams using streaming operators. It supports visual development of applications and provides filtering, aggregation and correlation methods for streamed data, along with a user interface for time-window analysis. The database connectors are used for providing rule engines and stream processing engines for processing streamed data with multiple DBMS features; common main-memory DBMSs and rule engines have to be redesigned for use in stream computing. The streaming analytics engine allows management, monitoring and real-time analytics for real-time streaming data, and the data mart is used for storing live data for processing, with additional features like operational business intelligence. It also provides automated alerts for events.
In data stream processing, the three important operations used are sampling, filtering
and counting distinct elements from the stream which are explained in next subsequent
sections.
This method works well as long as we can keep the list of all users and their in/out decisions in main memory. By using a hash function, one can avoid keeping the list of users : hash each user name to one of ten buckets, 0 to 9. If the user hashes to bucket 0, then accept that user's search queries for the sample, and if not, then do not. Effectively, we use the hash function as a random number generator; without storing the in/out decision for any user, we can reconstruct that decision any time a search query by that user arrives.
In the generalized sampling problem, the stream consists of tuples with n components. A subset of the components are the key components, on which the selection of the sample is based. In our example, the tuples consist of user, query and time, and the user is the key component; however, we could equally well sample on the query attribute if we wanted a sample of queries rather than of users. In general, to generate a sample containing a fraction a/b of the key values, we hash the key value of each tuple into b buckets and accept the tuple for the sample if the hash value is less than a. The result will be a sample consisting of all tuples with certain key values, and the selected key values will be approximately a/b of all the key values appearing in the stream. While sampling methods reduce the amount of data to process and, by consequence, the computational costs, they can also be a source of errors. The main problem is to obtain a representative sample, i.e. a subset of the data that has approximately the same properties as the original data. A small sketch of this hash-based key sampling is given below.
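A minimal Python sketch of this hash-based key sampling is given below; the user ids and the choice of MD5 as the hash function are illustrative assumptions.

```python
import hashlib

def in_sample(user_id, a=1, b=10):
    # Accept the tuple if the hash of its key (the user) falls into one of the first `a` of `b` buckets
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % b
    return bucket < a

stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q3"), ("carol", "q4")]
sample = [t for t in stream if in_sample(t[0])]
# Every accepted user keeps ALL of their queries, so per-user statistics remain unbiased.
print(sample)
```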
In reservoir sampling, a randomized algorithm is used for choosing a random sample from a list of items, where the list is either very large or of unknown length. For example, imagine you are given a really large stream of data and your goal is to efficiently return a random sample of 1,000 elements evenly distributed over the original stream. A naive way is to generate random integers between 0 and N − 1 and retrieve the elements at those indices, but this requires knowing the stream length N in advance; reservoir sampling achieves the same result in a single pass without knowing N, as in the sketch below.
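A minimal Python sketch of the standard single-pass reservoir sampling procedure (often called Algorithm R) is given below.

```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, element in enumerate(stream):
        if n < k:
            reservoir.append(element)      # fill the reservoir with the first k elements
        else:
            j = random.randint(0, n)       # element n+1 replaces a slot with probability k/(n+1)
            if j < k:
                reservoir[j] = element
    return reservoir

print(reservoir_sample(range(1_000_000), 10))
```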
4.4.1.2 Biased Reservoir Sampling
In biased reservoir sampling, a bias function is used to regulate the sampling from the stream. In many cases, the stream data may evolve over time, and the corresponding data mining or query results may also change over time. Thus, the results of a query over a
more recent window may be quite different from the results of a query over a more
distant window. Similarly, the entire history of the data stream may not be relevant for use in a repetitive data mining application such as classification. The simple reservoir sampling algorithm can be adapted to sample from a moving window over data streams. This is useful in many data stream applications where a small amount of recent history is more relevant than the entire previous stream. The bias gives a higher probability of selecting data points from recent parts of the stream as compared to the distant past. The bias function in sampling is quite effective since it regulates the sampling in a smooth way, so that queries over recent horizons are resolved more accurately.
4.4.1.3 Concise Sampling
Many a time, the size of the reservoir is sometimes restricted by the available main
memory. It is desirable to increase the sample size within the available main memory
restrictions. For this purpose, the technique of concise sampling is quite effective. Concise
sampling exploits the fact that the number of distinct values of an attribute is often
significantly smaller than the size of the data stream. In many applications, sampling is
performed based on a single attribute in multi-dimensional data that type of sampling is
called concise sampling. For example, customer data in an e-commerce site sampling may
be done based on only customer ids. The number of distinct customer ids is definitely
much smaller than “n” the size of the entire stream.
email itself as a pair. As each email address consumes 20 bytes or more of space, it is not reasonable to store the set S in main memory; we would have to use disk to store and access it.
Suppose instead we use main memory as a bit array of eight million bits and run a hash function h that maps email addresses to eight million buckets. Since there are one million members of S, approximately 1/8th of the bits will be 1 and the rest will be 0. As soon as a stream element arrives, we hash its email address; if the bit at the hashed position is 1, then we let the email through, else we drop the stream element. Sometimes a spam email will still get through, so to eliminate every spam we need to check membership in the set S only for those (good and bad) emails that pass the filter. The Bloom filter is used in such cases to eliminate most of the tuples which do not meet the selection criterion; a small sketch is given below.
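A minimal Python sketch of a Bloom filter with n bits and k hash functions is given below; the bit-array size, the number of hash functions and the use of SHA-256 are illustrative choices, not prescriptions from the text.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=8_000_000, k=3):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from k independent-looking hashes of the item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("good.user@example.com")
print(bf.might_contain("good.user@example.com"))  # True
print(bf.might_contain("spammer@example.com"))    # almost certainly False
```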
the stream element passes through; else it is discarded. That means, if one or more of these bits remain 0, then the key K cannot be in S, so the stream element is rejected. To find out how many unwanted elements pass through, we need to calculate the probability of a false positive as a function of n, the bit-array length, m, the number of members of the set S, and k, the number of hash functions.
Let us take an example of a model in which darts are thrown at targets. Suppose we have T targets and D darts, and any dart is equally likely to hit any target. The analysis of how many targets we can expect to be hit at least once rests on the following observations :
The probability that a given dart will not hit a given target is (T − 1) / T.
The probability that none of the D darts will hit a given target is ((T − 1) / T)^D.
With approximation, the probability that none of the D darts hits a given target is approximately e^(−D/T).
would be {3, 4, 2, 3}, the count of distinct numbers in fourth pass is 3. Therefore, the final
count of distinct numbers are 3, 4, 4, 3.
Let us take another example : suppose we want to find out how many unique users have accessed a particular website, say Amazon, in a given month, based on gathered statistics. Here, the universal set would be the set of logins or IP addresses (sequences of four 8-bit bytes) from which queries are sent to that site. The easiest way to solve this problem is to keep in main memory a list of all the elements seen in the stream, arranged in an efficient search structure such as a hash table or search tree, so that new elements can be added quickly and an exact count of the distinct elements can be obtained. However, if the number of distinct elements is too large, we cannot store them all in main memory. One solution is to use several machines, each handling one or more of the streams, and to store most of the data structure in secondary memory; another is to estimate the count using the Flajolet-Martin algorithm described below.
Flajolet-Martin Algorithm :
1) Pick a hash function h that maps each of the n elements to at least log2(n) bits.
2) For each stream element x, let r(x) be the number of trailing 0's in h(x).
3) Record R, the maximum r(x) seen so far.
4) Estimate the count of distinct elements as 2^R.
The steps for counting distinct elements in a stream using the Flajolet-Martin algorithm are as follows :
Step 1 : Create a bit array/vector of length L, where n is the number of elements in the stream and 2^L > n.
Step 2 : The i-th bit in the array/vector represents whether a hash value whose binary representation ends in 0^i (i trailing zeros) has been seen. Initialize each bit to 0.
Step 3 : Generate a feasible random hash function that maps input string to natural
numbers.
Step 4 : For each word in an input stream perform hashing and determine the number
of trailing zeros, such that if the number of trailing zeros is k, set the kth bit in the bit
array/vector to 1.
Step 5 : When the input is exhausted, get the index of the first 0 (call it R) in the bit array/vector; equivalently, count the number of consecutive 1's starting from position 0, since we have seen hash outputs ending in 0, 00, ..., 0^(R−1).
Step 6 : Estimate the number of distinct elements as 2^R.
Step 7 : This implies that our count can be off by a factor of 2 for 32 % of the observations, off by a factor of 4 for 5 % of the observations, off by a factor of 8 for 0.3 % of the observations and so on, as the standard deviation of R is a constant : σ(R) = 1.12. (R can be off by 1 for 1 − 0.68 = 32 % of the observations, off by 2 for about 1 − 0.95 = 5 % of the observations and off by 3 for 1 − 0.997 = 0.3 % of the observations, using the empirical rule of statistics.) A small sketch of the algorithm is given below.
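A minimal Python sketch of the Flajolet-Martin estimate is given below; it reproduces Example 1 that follows, using the same hash function h(x) = (3x + 7) mod 32.

```python
def trailing_zeros(value, width=32):
    if value == 0:
        return width              # by convention, an all-zero hash has `width` trailing zeros
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count

def fm_estimate(stream, h, width=32):
    R = 0
    for x in stream:
        R = max(R, trailing_zeros(h(x), width))   # R = maximum tail length seen so far
    return 2 ** R

# Example 1 below : h(x) = (3x + 7) mod 32 over S = {4, 2, 5, 9, 1, 6, 3, 7}
print(fm_estimate([4, 2, 5, 9, 1, 6, 3, 7], lambda x: (3 * x + 7) % 32, width=5))  # 16
```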
Let us assume a = 3 and b = 7, therefore the hash function h(x) would be (3x + 7) mod 32.
So, calculate the hash value in binary format for each element of the stream S = {4, 2, 5, 9, 1, 6, 3, 7} :
h(4) = 3(4) + 7 mod 32 = 19 mod 32 = 19 = (10011)
h(2) = 3(2) + 7 mod 32 = 13 mod 32 = 13 = (01101)
h(5) = 3(5) + 7 mod 32 = 22 mod 32 = 22 = (10110)
h(9) = 3(9) + 7 mod 32 = 34 mod 32 = 2 = (00010)
h(1) = 3(1) + 7 mod 32 = 10 mod 32 = 10 = (01010)
h(6) = 3(6) + 7 mod 32 = 25 mod 32 = 25 = (11001)
h(3) = 3(3) + 7 mod 32 = 16 mod 32 = 16 = (10000)
h(7) = 3(7) + 7 mod 32 = 28 mod 32 = 28 = (11100)
Now let us find the number of trailing 0's in each binary output by observing the rightmost zeroes. The trailing zeros for the given stream would be {0, 0, 1, 1, 1, 0, 4, 2}.
Therefore, the value of R would be the maximum number of trailing zeros, i.e. R = 4.
So, the number of distinct elements (N) = 2^R = 2^4 = 16.
Example 2 : Given a stream S = {1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1} and hash function h(x) = (6x + 1) mod 5, find the number of distinct elements, treating each hash value as a 5-bit binary integer.
So, calculate the hash value in binary format for each element of the stream S = {1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1} :
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(2) = (6 * (2) +1) mod 5 = 13 mod 5 = 3 = (00011)
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
h(2) = (6 * (2) +1) mod 5 = 13 mod 5 = 3 = (00011)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(4) = (6 * (4)+1) mod 5 = 25 mod 5 = 0 = (00000)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
h(2) = (6 * (2) +1) mod 5 = 13 mod 5 = 3 = (00011)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
Now let us find the number of trailing 0's in each binary output by observing the rightmost zeroes. The trailing zeros for the given stream would be {1, 2, 0, 1, 0, 2, 5, 2, 1, 0, 2, 1}.
Therefore, the value of R would be the maximum number of trailing zeros, i.e. R = 5.
So, the number of distinct elements (N) = 2^R = 2^5 = 32.
Here, whenever we apply a hash function H to a stream element a, the bit string H(a) will end in some number of 0's. Call this number the tail length for a and H. Let R be the maximum tail length seen in the stream. Then we use the estimate 2^R for the number of distinct elements seen in the stream. This estimate makes intuitive sense :
a) The probability that a given stream element a has a hash H(a) ending in at least n 0's is 2^(−n).
b) If there are m distinct elements in the stream, then the probability that none of them has tail length at least n is (1 − 2^(−n))^m.
We conclude that if m is much larger than 2^n, then the probability of finding a tail length of at least n approaches 1, and if m is much less than 2^n, then the probability of finding a tail length of at least n approaches 0. Unfortunately, there is a trap regarding the strategy for combining the estimates of m obtained by using many different hash functions. Intuitively, if we take the average of the values 2^R over many hash functions, we would hope to obtain a value that approaches the true m; however, overestimates have a disproportionate influence on the average : if 2^n is much larger than m, there is some probability p of discovering n to be the largest number of 0's at the end of the hash value for any of the m stream elements, and the probability of finding n + 1 as the largest number of 0's is at least p/2.
Regarding the space requirement, while reading the stream only one integer per hash function needs to be kept in main memory, namely the largest tail length seen so far for that hash function. In practice, processing a single stream could use millions of hash functions, which is far more than are needed to get a close estimate, so main memory becomes a constraint on the number of hash functions only when many streams are being processed at the same time.
stream consists of elements chosen from a universal set U. Assume the universal set is ordered so that we can speak of the i-th element, and let mi be the number of occurrences of the i-th element. Then the k-th order moment of the stream is the sum over all i of (mi)^k, i.e.
Fk = Σi (mi)^k
Here, the 0th moment of the stream is the sum of 1 for each mi > 0, i.e. the number of distinct elements. The 1st moment of the stream is the sum of all the mi, which must be the length of the stream. The 2nd moment of the stream is the sum of the squares of the mi; it is also called the surprise number (S), since it measures how uneven the distribution of elements in the stream is - the smaller the value of the 2nd moment, the less skewed the distribution.
For example, suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would be one element appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number would be 1 × 10^2 + 10 × 9^2 = 910. In general, we cannot keep a count for each element that appears in a stream in main memory, so we need to estimate the k-th moment of a stream by keeping a limited number of values in main memory and computing an estimate from these values.
Examples :
Consider streams whose distinct elements have the following occurrence counts, and calculate the surprise number :
1) Counts 5, 5, 5, 5, 5 : Surprise number = 5 × 5^2 = 125
2) Counts 9, 9, 5, 1, 1 : Surprise number = (2 × 9^2 + 1 × 5^2 + 2 × 1^2) = 189
To estimate the second moment of the stream with a limited amount of main memory, we can use the Alon-Matias-Szegedy (AMS) algorithm; the more space we use, the more accurate the estimate will be. In this algorithm, we compute some number of variables X. For each variable X, we store a particular element of the universal set, which we refer to as X.element, and an integer X.value. To determine a variable X, we select a position in the stream between 1 and n at random, set X.element to the element found at that position, and initialize X.value to 1. As we read the rest of the stream, we add 1 to X.value each time we encounter another occurrence of X.element. Technically, the estimates of the second and higher moments assume that the stream length n is a constant, whereas in reality n grows with time; therefore we store only the values of the variables and multiply the appropriate function of the value by n when it is time to estimate the moment. A small sketch of the AMS estimate is given below.
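A minimal Python sketch of the AMS estimate of the second moment is given below; for simplicity it assumes the whole stream is available as a list so that random positions can be drawn, and it averages the per-variable estimates n(2·X.value − 1).

```python
import random

def ams_second_moment(stream, num_variables):
    n = len(stream)
    estimates = []
    for _ in range(num_variables):
        pos = random.randrange(n)
        element = stream[pos]                    # X.element
        value = stream[pos:].count(element)      # X.value : occurrences from that position onward
        estimates.append(n * (2 * value - 1))    # per-variable estimate of the second moment
    return sum(estimates) / len(estimates)

stream = ["a", "b", "c", "b", "d", "a", "c", "d", "a", "b", "d", "c", "a", "a", "b"]
true_f2 = sum(stream.count(x) ** 2 for x in set(stream))
print(true_f2, ams_second_moment(stream, num_variables=200))   # 59 and an estimate close to it
```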
Fig. 4.8.1 : Bitstream divided into buckets following the DGIM rules (bucket sizes are powers of 2, e.g. 2^2 = 4, 2^1 = 2, 2^0 = 1)
Here, when a new bit comes in, we first drop the last (oldest) bucket if its timestamp is prior to N time units before the current time. If the new bit that arrives is a 0 (say with timestamp 101), then no changes are needed in the buckets, but if the new bit that arrives is a 1, then we need to make some changes.
Stream : 101011 000 10111 0 11 00 101   (new bits to be entered : 1 0 1 1)
Since the current bit is 1, create a new bucket of size 1 containing just this bit, with the current timestamp. If there was only one bucket of size 1 before, then nothing more needs to be done. However, if there are now three buckets of size 1 (say the buckets with timestamps 100, 102 and 103), then combine the leftmost (oldest) two of them into a single bucket of size 2, as shown below.
101011 000 10111 0 11 00 101 1001 1 1
(buckets, left to right : size 4, size 4, size 2, size 2, size 2, size 1)
To combine any two adjacent buckets of the same size, replace them by one bucket of twice the size. The timestamp of the new bucket is the timestamp of the rightmost (most recent) of the two buckets. Combining may cascade : merging two buckets of size 1 can create a third bucket of size 2, which in turn forces two buckets of size 2 to be merged into a bucket of size 4, and so on, until no three buckets of the same size remain. A sketch of this update step is given below.
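A minimal R sketch of the update step follows; buckets are kept in a data frame ordered most recent first, and the function and column names are illustrative assumptions, not the book's code.

dgim_add_bit <- function(buckets, bit, t, N) {
  # drop the oldest bucket once its timestamp falls out of the window of length N
  if (nrow(buckets) > 0 && buckets$time[nrow(buckets)] <= t - N)
    buckets <- buckets[-nrow(buckets), ]
  if (bit == 0) return(buckets)                      # a 0 never changes the buckets
  # a 1 arrived : create a new bucket of size 1 with the current timestamp
  buckets <- rbind(data.frame(time = t, size = 1), buckets)
  # cascade : whenever three buckets of the same size exist, merge the two oldest of them
  repeat {
    sizes <- buckets$size
    dup <- NULL
    for (s in unique(sizes)) if (sum(sizes == s) >= 3) { dup <- s; break }
    if (is.null(dup)) break
    idx <- which(sizes == dup)
    i1 <- idx[length(idx) - 1]                       # more recent of the two oldest buckets
    i2 <- idx[length(idx)]                           # oldest bucket of that size
    buckets$size[i1] <- 2 * dup                      # merged bucket keeps i1's timestamp
    buckets <- buckets[-i2, ]
  }
  buckets
}
b <- data.frame(time = integer(0), size = integer(0))
for (t in 1:30) b <- dgim_add_bit(b, sample(0:1, 1), t, N = 20)
b                                                    # at most two buckets of any size remain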
In decaying window, it is easier to adjust the sum exponentially than sliding window
of fixed length. The effect of this definition is to spread out the weights of the stream
elements as far back in time as the stream goes. In sliding window, the element that falls
out of the window each time a new element arrives needs to be taken care. In contrast, a
fixed window with the same sum of the weights, 1/c, would put equal weight 1 on each of
the most recent 1/c elements to arrive and weight 0 on all previous elements which is
illustrated in Fig. 4.9.1. However, when a new element at+1 arrives at the stream input,
we first multiply the current sum by 1 – c and then add at+1.
In this method, each of the previous elements get moved one position further from the
current element, so its weight is multiplied by 1 − c. Further, the weight on the current
element is (1 − c)0 = 1, so adding at+1 is the correct way to include the new element’s
contribution.
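A minimal R sketch of this update follows; the decay constant c and the example stream are illustrative assumptions.

decaying_sum <- function(stream, c = 0.01) {
  s <- 0
  for (a in stream) {
    s <- s * (1 - c) + a            # multiply the old sum by (1 - c), then add the new element
  }
  s
}
decaying_sum(c(5, 3, 8, 1, 4), c = 0.1)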
smart meters and a huge number of other new data sources. Real-time analytics leverages information from all these devices to apply analytics algorithms and generate automated actions within milliseconds of a trigger. A real-time analytics platform is composed of three components, namely :
Input : which is generated when an event happens (like a new sale, a new customer, someone entering a high-security zone etc.),
Processing unit : which captures the data of the event and analyzes it without leveraging resources that are dedicated to operations; it also involves executing different standing and ad-hoc queries over the streamed data, and
Output : which consumes this data without disturbing operations, explores it for better insights and generates analytical results by means of different visual reports on a dedicated dashboard. The general architecture of a Real-Time Analytics Platform is shown in Fig. 4.10.1.
The various requirements for a real-time analytics platform are as follows :
1. It must support continuous queries over real-time events.
2. It must provide features like robustness, fault tolerance, low-latency reads and updates, incremental analytics and learning, and scalability.
3. It must offer improved in-memory transaction speed.
4. It should quickly move data that is no longer needed onto secondary disk for persistent storage.
5. It must support distributing data from various sources with speedy processing.
The basic building blocks of a Real-Time Streaming Platform are shown in Fig. 4.10.2. The streaming data is collected from various flexible data sources by producer connectors, which receive data from the sources and move it to the queuing system. The queuing system is fault tolerant and persistent in nature. The streamed data is then buffered to be consumed by the stream processing engine. The queuing system is a high-throughput, low-latency system which provides high availability and fail-over capabilities. There are many technologies that support real-time analytics, such as :
SAP HANA : It is a streaming analytical tool that allows SAP users to capture, stream
and analyze data with active event monitoring and event driven response to
applications.
Apache Spark : It is a streaming platform for big data analytics in real-time developed
by Apache.
Cisco Connected Streaming Platform : It is used for finding insights from high-velocity streams of live data arriving over the network from multiple sources, enabling immediate actions.
Oracle Stream Analytics : It provides a graphical interface for performing analytics over real-time streamed data.
Google Real Time Analytics : It is used for performing real-time analytics over cloud data collected from different applications.
comments and feedback with emotional states such as “angry”, “sad” and “happy”. It tries to identify and extract sentiments within the text. The analysis of sentiments can be either document based, where the sentiment in the entire document is summarized as positive, negative or objective, or sentence based, where the individual sentences bearing sentiments in the text are classified.
Sentiment analysis is widely applied to reviews and social media for a variety of
applications, ranging from marketing to customer service. In the context of analytics,
sentiment analysis is “the automated mining of attitudes, opinions and emotions from
text, speech and database sources”. With the proliferation of reviews, ratings,
recommendations and other forms of online expression, online opinion has turned into a
kind of virtual currency for businesses looking to market their products, identify new
opportunities and manage their reputations.
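A tiny illustrative keyword-based sentiment scorer in R is sketched below; the word lists and the scoring rule are assumptions made only for this example, and real systems use far richer NLP models.

positive <- c("good", "great", "happy", "love")
negative <- c("bad", "sad", "angry", "hate")
sentiment_score <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^a-zA-Z]+")))
  sum(words %in% positive) - sum(words %in% negative)   # > 0 positive, < 0 negative, 0 neutral
}
sentiment_score("I love this product, it is great")      # returns 2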
Some of the popular applications of real-time sentiment analysis are,
1) Collecting and analyzing sentiments on Twitter, as Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, which, if analyzed in real time, can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, real-time sentiment analysis can look at the entire Twitter traffic about an election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.
2) Analyzing the sentiments of messages posted to social networks or online forums can generate significant business value for organizations which aim to extract timely business intelligence about how their products or services are perceived by their customers. As a result, proactive marketing or product design strategies can be developed to effectively increase the customer base.
3) Tracking crowd sentiment while commercials are being viewed on TV, so that advertising agencies can decide which commercials result in positive sentiments and which do not.
4) A news media website is interested in getting an edge over its competitors by featuring site content that is immediately relevant to its readers. They use social media to learn which topics are relevant to their readers by doing real-time sentiment analysis on Twitter data. Specifically, to identify which topics are trending in real time on Twitter, they need real-time analytics about the tweet volume and sentiment for key topics.
5) In marketing, real-time sentiment analysis can be used to know the public's reactions to the products or services supplied by an organization. The analysis reveals which products or services people like or dislike and how they can be improved.
6) In quality assurance, real-time sentiment analysis can be used to detect errors in your products based on your actual users' experience.
7) In politics, real-time sentiment analysis can be used to determine the views of the people regarding specific situations about which they are angry or happy.
8) In finance, real-time sentiment analysis tries to detect the sentiment towards a brand in order to anticipate market moves.
The best example of real-time sentiment analysis is predicting the pricing or promotions of a product being offered through social media and the web. The solution for price or promotion prediction can be implemented using software solutions like RADAR (Real-Time Analytics Dashboard Application for Retail) and Apache Storm. RADAR is a software solution for retailers built using a Natural Language Processing (NLP) based sentiment analysis engine that utilizes different Hadoop ecosystem technologies, including HDFS, Apache Storm, Apache Solr, Oozie and ZooKeeper, to help enterprises maximize sales through data-driven continuous re-pricing. Apache Storm is a distributed real-time
computation system for processing large volumes of high-velocity data. It is part of the
Hadoop ecosystem. Storm is extremely fast, with the ability to process over a million
records per second per node on a cluster of modest size. Apache Solr is another tool from the Hadoop ecosystem which provides a highly reliable, scalable search facility in real time. RADAR uses Apache Storm for real-time data processing and Apache Solr for indexing and data analysis. The generalized architecture of RADAR for retail is shown in Fig. 4.11.1.
For retailers, RADAR can be used to customize their environment so that, for any number of products or services in their portfolio, they can track the social sentiment for each product or service they are offering as well as competitive pricing and promotions being offered through social media and the web. With this solution, retailers can create continuous re-pricing campaigns and implement them in real time in their pricing systems, track the impact of re-pricing on sales and continuously compare it with social sentiment.
Traditionally, stock market prediction algorithms checked historical stock prices and tried to predict the future using different models. The traditional approach is not effective in real time because stock market trends change continually based upon economic forces, regulations, competition, new products, world events and even (positive or negative) tweets, all of which affect stock prices. Thus, predicting stock prices using real-time analytics becomes a necessity. The generalized architecture for real-time stock prediction has three basic steps, as shown in Fig. 4.12.1.
Fig. 4.12.2 : Detailed representation of real-time stock prediction using machine learning
In graph analysis, labeled vertices indicate the types of entities that are related, labeled edges represent the nature of the relationship, and multiple relationships between a pair of vertices are represented by multiple edges between that pair of vertices.
In graph analytics, a directed graph can be represented by a triple consisting of a subject, which is the source point of the relationship, an object, which is the target point of the relationship, and a predicate that represents the type of the relationship; a database which supports these triples is called a semantic database. The graph model supports all types of entities and their relationships. It comprises several kinds of models, namely a communication model that represents communication across a community triggered by a specific event, an influence model that represents entities holding influential sites within a network for intermittent periods of time, distance modeling for analyzing the distances between sets of entities, such as finding strong correlations between occurrences of sets of statistically improbable phrases, and a collaborative model that identifies isolated groups of individuals that share similar interests. Graph analytics is mainly used for business problems that have characteristics like the ad hoc nature of the analysis, absence of structure in the problem, knowledge embedded in the network, connectivity problems, predictable performance, undirected discovery and flexible semantics. A small sketch of the triple representation is given below.
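A small illustrative R sketch of the subject-predicate-object representation follows; the entities and relationship names are assumptions made only for the example.

triples <- data.frame(
  subject   = c("Alice", "Alice", "Bob"),
  predicate = c("follows", "works_at", "follows"),
  object    = c("Bob", "AcmeCorp", "Alice"),
  stringsAsFactors = FALSE
)
triples[triples$subject == "Alice", ]    # all relationships whose source vertex is Alice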
Summary
A stream is a sequence of data elements which flows in a group, while stream processing is a big data technology used to query continuous data streams generated in real time, in order to find insights or detect conditions and quickly take actions within a small period of time. The sources of streamed data are sensor data, satellite image data, web data, social web data etc.
The stream data model uses a data stream management system, rather than a database management system, for managing and processing data streams.
A streaming data architecture is a framework for processing huge volumes of
streaming data from multiple sources. The generalized streaming architecture
composed of four components like Message Broker or Stream Processor, ETL
tools, Query Engine and streaming data storage.
The Stream computing is a computing paradigm that reads data from collections
of sensors in a stream form and as a result it computes continuous real-time data
streams. It enables graphics processors (GPUs) to work in coordination with low-
latency and high-performance CPUs to solve complex computational problems.
The architecture of stream computing consists of five components namely: Server,
Integrated development environment, Database Connectors, Streaming analytics
engine and data mart.
Sampling in a data stream is the process of collecting and representing a sample of the elements of the data stream. The sample is usually a much smaller subset of the entire stream, but it is designed to retain the original characteristics of the stream.
There are basic three types of sampling namely Reservoir Sampling, Biased
Reservoir Sampling and Concise Sampling.
The filtering is the process of accepting the tuples in the stream that meets the
selection criterion where accepted tuples are provided to another process as a
stream and rejected tuples are dropped.
The purpose of the Bloom filter is to allow through all the stream elements whose keys K lie in a set S, while rejecting most of the stream elements whose keys are not part of S, whereas the Flajolet-Martin algorithm is used for estimating the number of distinct elements in a stream.
b) Satellite Image Data : Where data received from satellites streams to the earth, consisting of many terabytes of images per day; surveillance cameras fitted on a satellite produce images that are streamed for processing to a station on earth.
c) Web data : Where real-time streams of IP packets generated on the internet are provided to a switching node, which runs queries to detect denial-of-service or other attacks and then reroutes the packets based on information about congestion in the network.
d) Data in online retail stores : Where retail firms collect, store and process data about product purchases and services by particular customers in order to analyze and understand customer behaviour.
e) Social web data : Where data generated through social media websites like Twitter and Facebook is used by third-party organizations for sentiment analysis and prediction of human behaviour.
The applications which use data streams are :
Real-time maps, which use location-based services to find the nearest point of interest
Monitoring
Performing fraud detection on live online transactions
A timestamp within a window of length N can be represented by log2 N bits if timestamps are recorded modulo N; this avoids having to store the total number of bits ever seen in the stream. The window is divided into buckets, each consisting of the timestamp of its right (most recent) end and the number of 1's in the bucket. This number must be a power of 2, and we refer to the number of 1's as the size of the bucket. To represent a bucket, we need log2 N bits for the timestamp (modulo N) of its right end and only log2 log2 N bits for the number of 1's. Thus, O(log N) bits suffice to represent a bucket. There are six rules that must be followed when representing a stream by buckets :
a) The right end of every bucket must be a position holding a 1; a bucket never ends in a 0. For example, in 1001011 a bucket covering all of these bits would have size 4, since it contains four 1's and its right end is a 1.
b) Every position with a 1 is in some bucket.
c) No position is in more than one bucket.
d) There are one or two buckets of any given size, up to some maximum size.
e) All bucket sizes must be a power of 2.
f) Buckets cannot decrease in size as we move to the left (back in time).
Q.4 What is sentiment analysis ? AU : May-17
Ans. : Sentiment Analysis (also referred to as opinion mining) is a Natural Language Processing and Information Extraction task that aims to obtain the feelings expressed in positive or negative comments, questions and requests by analyzing a large amount of data over the web. In real-time sentiment analysis, the sentiments are collected and analyzed in real time from live data on the web. It uses natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials.
Part - B Questions
Q.1 With neat sketch explain the architecture of data stream management system
AU : May-17
Q.2 Outline the algorithm used for counting distinct elements in a data stream AU : May-17
Q.3 Explain with example Real Time Analytics Platform (RTAP)
Q.4 State and explain Bloom filtering with the example.
Q.5 State and explain Real Time Analytics Platform (RTAP) applications AU : Nov.-18
Q.6 Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2.
What is the third moment of the stream ?
Ans. : For the stream 3, 1, 4, 1, 3, 4, 2, 1, 2 the element counts mi and their powers are :
Element i      mi      (mi)^2      (mi)^3
    1           3         9           27
    2           2         4            8
    3           2         4            8
    4           2         4            8
Totals      Σmi = 9   Σ(mi)^2 = 21   Σ(mi)^3 = 51
From the table it is concluded that the first moment (the length of the stream) is 9, the second moment (the surprise number) is 21 and the third moment of the stream is 51.
UNIT - V
5   NoSQL Data Management for Big Data and Visualization
Syllabus
NoSQL Databases : Schema-less Models : Increasing Flexibility for Data Manipulation - Key Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases - Hive - Sharding - Hbase - Analyzing big data with Twitter - Big data for E-Commerce - Big data for blogs - Review of Basic Data Analytic Methods using R.
Contents
5.1 Introduction to NoSQL
5.2 "Schema-Less Models" : Increasing Flexibility for Data Manipulation
5.3 Key Value Stores
5.4 Document Stores
5.5 Tabular Stores
5.6 Object Datastores
5.7 Graph Datastores
5.8 Hive
5.9 Sharding
5.10 Hbase
5.11 Analyzing Big Data with Twitter
5.12 Big Data for E-Commerce and Blogs
5.13 Review of Basic Data Analytic Methods using R
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions
generated with huge volume, variety, velocity and veracity. For example, social media websites like Twitter, Facebook, Google and Instagram collect and process terabytes of users' data every single day and use NoSQL databases for all of their operations.
NoSQL databases came into the picture because they have many advantages over traditional RDBMS. The major advantages of NoSQL databases over traditional SQL databases are given below :
a) They support real-time or batch processing of data in different formats like structured, unstructured or semi-structured.
b) They can process uncleansed or uncertain data for analytics.
c) They do not require expensive hardware for implementation, as they can run on commodity hardware.
d) They support huge volumes of data generated in real time with high velocity.
e) They can perform data analytics at low cost.
f) They are free of joins and schema and do not require complex queries.
g) They can work on many processors in parallel with linear scalability.
h) Most NoSQL systems are open source, which makes them very cost efficient.
i) The cost per bit of processing is very low, and NoSQL is compatible with cloud computing.
5. Graph-based databases : These are network databases that use nodes and edges to represent the stored data and the relationships between data items. e.g. Apache Giraph, GraphX, Neo4j etc.
The above NoSQL databases are explained briefly in the subsequent sections of this chapter.
Sr. No.   Operation                          Description
1.        Get (key)                          Returns the value associated with the key.
2.        Put (key, value)                   Adds or updates the value associated with the key.
3.        Multi-get (key1, key2, .., keyN)   Returns the list of values associated with the list of keys.
4.        Delete (key)                       Removes the entry for the specified key from the data store.
Table 5.3.2 : Different operations in a key value store
Under the right conditions, the table is distributed across the nodes in a way that is aligned with how the keys are organized, so that a hashing function applied to the key determines which node holds that key's bucket. Key-value pairs are extremely valuable both for storing the results of analytical algorithms and for producing those results for reports.
Some of the popular use cases of the key-value databases are
Storing and processing user session data
Maintaining schema-less user profiles
Storing user preferences
Storing shopping cart data
One limitation is that the model does not naturally provide conventional database capabilities (such as atomicity of transactions, or consistency when numerous transactions are executed at the same time); those capabilities must be provided by the application itself. Another is that as the model grows, maintaining unique values as keys may become more difficult, requiring the introduction of some complexity in generating character strings that remain unique among a myriad of keys.
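A minimal R sketch of the Get / Put / Multi-get / Delete operations of Table 5.3.2, together with hash-based routing of keys to nodes, is given below; the number of nodes, the hash and the function names are illustrative assumptions, not any particular product's API.

nodes <- lapply(1:4, function(i) new.env())            # four "physical" nodes
node_for <- function(key) sum(utf8ToInt(key)) %% length(nodes) + 1   # hash the key to a node
put      <- function(key, value) assign(key, value, envir = nodes[[node_for(key)]])
get_key  <- function(key)        get(key, envir = nodes[[node_for(key)]])
multiget <- function(keys)       lapply(keys, get_key)
del      <- function(key)        rm(list = key, envir = nodes[[node_for(key)]])

put("user:101", list(name = "Asha", cart = c("pen", "book")))
get_key("user:101")$name                               # "Asha"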
and everything is searchable within the document, most of the document stores group documents together in a collection. Each document has some structure and encoding for managing its data; some of the common encodings used with document stores are XML, JSON (JavaScript Object Notation), BSON (a binary encoding of JSON objects) and other means of serializing data. The document representation embeds the model so that the meanings of the document values can be inferred by the application. It is not suitable for running complex queries or for applications that require complex multi-operation transactions.
The difference between an RDBMS and a document store database is given in Table 5.4.1.
RDBMS Document store
Database Database
Table Collection
Tuple/Row Document
Column Field
Primary key Primary key (Default key _id provided by mongodb itself)
The commonly used use cases of document store databases are E-commerce platforms,
content management systems, web analytics, analytics platforms or blogging platforms.
One of the differences between a key value store and a document store is that the key
value store requires the use of a key to retrieve data, while document store often provides
an object metadata used for querying the data based on the contents either through a
programming API or using a query language.
The following example shows the insertion of data values and their respective keys into a collection called Book using MongoDB. It has keys like _id, title, description, by, an array of tags, and likes; the respective values for the keys are given on the right-hand side.
db.Book.insert(
{
"_id": ObjectId(7df78458902c),
"title": "Python Programming",
"description": "Python is a Object oriented language ",
"by": "Technical publication",
"tags": ["SciPy", "NumPy", "Pandas"],
"likes": "100"
})
The graph datastore of movies with different actors, properties and relationship are
shown in Fig. 5.7.2.
In the above graph, the actors (like Emile Hirsch, Rain, Halle Berry etc.) are represented by leaf nodes, which are connected to movie nodes with their properties, and the relationships between actors and movies are indicated by labelled arrows.
5.8 Hive
The Apache Hive is a data warehouse software built on top of Hadoop that facilitates reading, writing and managing large datasets stored in HDFS. The data stored by Hive may reside on distributed storage and can be queried using SQL. It is mainly used for storing data in a data warehouse and performing different SQL-like operations over it. Hive provides a simple SQL-like query language called Hive Query Language (HQL) for querying and managing large datasets. The Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop.
The basic features of Hive are given as follows :
It stores the schema in a database and the processed data in HDFS.
Using the Hive query language (HiveQL), which is very similar to SQL, queries are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark. Users can run batch processing workloads with Hive and can also analyze the same data simultaneously using interactive SQL or machine-learning workloads with tools like Apache Impala or Apache Spark, all within a single platform.
The various SQL commands used in hive are given in Table 5.8.1.
It has support for different aggregation functions of SQL like SUM, COUNT, MAX,
MIN etc. and other functions like CONCAT, SUBSTR, ROUND etc. It also supports
GROUP BY and SORT BY clauses along with joins. It comes with a command-line shell
interface which can be used to create tables and execute queries. In addition, custom
Map-Reduce scripts can also be plugged into queries.
As Hive is a petabyte-scale data warehouse system built on top of the Hadoop platform, it allows programmers to plug in custom Map-Reduce code to perform more sophisticated analysis and data processing. Therefore, Hive on MapReduce or Spark is best suited for batch data preparation or ETL.
5.9 Sharding
Sharding is a type of database partitioning that splits very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole. It is a database architecture pattern related to horizontal partitioning, which separates one table's rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but entirely different rows. Likewise, the data held in each partition is unique and independent of the data held in other partitions.
In a vertically-partitioned table, entire columns are separated out and put into new,
distinct tables. The data held within one vertical partition is independent from the data in
all the others, and each table holds both distinct rows and columns. Table 5.9.1 illustrates how a table could be partitioned both horizontally and vertically.
The Sharding involves breaking up the data into two or more smaller chunks, called
logical shards. The logical shards are then distributed across separate database nodes,
referred to as physical shards, which can hold multiple logical shards. Despite this, the
data held within all the shards collectively represent an entire logical dataset.
The major benefits of sharding are that it can help to facilitate horizontal scaling easily, it helps to speed up query response times, and it helps to make an application more reliable by mitigating the impact of outages. The drawbacks of sharding are the complexity of properly implementing a sharded database architecture, the tendency of shards to eventually become unbalanced, and the difficulty of returning to an unsharded architecture once sharding is done. A small sketch of horizontal sharding is given below.
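A small illustrative R sketch of horizontal sharding follows; the table, the shard key ranges and the shard names are assumptions made only for the example.

customers <- data.frame(customer_id = 1:8,
                        name = c("Ana", "Ben", "Chen", "Dev", "Eli", "Fay", "Gus", "Hana"))
shard_of <- function(id) ifelse(id <= 4, "shard_1", "shard_2")   # range-based shard key
shards <- split(customers, shard_of(customers$customer_id))      # each shard gets disjoint rows
shards$shard_1                                                    # same schema, different rows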
5.10 Hbase
HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS) and is used to store and process unstructured data. It provides a fault-tolerant way of storing and processing large data sets and is well suited for real-time data processing and random read/write access to large volumes of data. It does not support a structured query language like SQL, and it is not a relational data store at all. HBase applications are usually written in Java, like a typical MapReduce application. It is designed to scale linearly and comprises a set of standard tables with rows and columns, much like a traditional database. Each table in HBase must have an element defined as a primary key (row key) that uniquely identifies each row. It supports a rich set of primitive data types like numeric, binary and strings, along with complex types including arrays, maps, enumerations and records. HBase relies on ZooKeeper for high-performance coordination, and a ZooKeeper instance is built into HBase. It is also compatible with Hive, which provides a query engine for batch processing of big data and fault tolerance in big data applications.
The basic features provided by Hbase are listed as follows
It offers consistent reads and writes on stored and real time data
It provides atomic Read and Write where during one read or write process, all other
processes are prevented from performing any read or write operations
It offers database sharding of tables which is required to reduce I/O time and
overhead
It provides high availability and high throughput, along with scalability in both linear and modular form.
On Twitter, trending analysis looks for certain words among individual tweets, while sentiment analysis looks for certain keywords in a tweet and analyzes them to compute a sentiment score. Sentiment analysis is the process of detecting the contextual polarity of text. A common use case for sentiment analysis is to discover how people feel about a particular topic.
Fig. 5.11.1 (a) : Sign in to Twitter      Fig. 5.11.1 (b) : Create new app
Fill in all the required fields to create the application (any placeholder website such as google.com can be used), as shown in Fig. 5.11.2, then complete the form and click on Create your Twitter application.
Step 2 : Open Manage Keys and Access Tokens from the application settings and select My Access Token to create a new token for accessing the Twitter data, as shown in Fig. 5.11.3.
Step 3 : After creating the access token, open the flume.conf file in the directory /usr/lib/flume/conf and then change the following keys in the file. These keys will be used by Flume to authenticate with the Twitter API.
Step 4 : Now, changes need to be made in the flume.conf file for the Access Token, Access Token Secret, Consumer Key (API Key) and Consumer Secret (API Secret), along with adding the keywords that we want to search for and extract from Twitter. In this example, we are extracting data on demonetization in India, as shown in Fig. 5.11.5.
The tweets coming in from Twitter are in JSON format, therefore we need to load the tweets into Hive using a JSON input format (SerDe). So add the jar file to Hive using the following command.
ADD jar /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
After successfully adding the jar file, create a Hive table to store the Twitter data. For calculating the hashtags, we need the tweet_id and hashtag_text to be stored in the Hive table. So, use the following command to create a Hive table.
CREATE EXTERNAL TABLE tweets (id BIGINT, entities
STRUCT<hashtags:ARRAY<STRUCT<text:STRING>>>) ROW FORMAT SERDE
'com.cloudera.hive.serde.JSONSerDe' LOCATION '/flumedir/data/tweets_raw';
Now, let's create another table which can store id and the hashtag text using the below
command:
create table hashtag_word as select id as id, hashtag from hashtags LATERAL VIEW
explode(words) w as hashtag;
Now, let's use the query to calculate the number of times each hashtag has been
repeated.
select hashtag, count(hashtag) as total_count from hashtag_word group by hashtag order by
total_count desc;
The hashtag and the number of times it is repeated in the Twitter data will appear as a
output as shown in Fig. 5.11.7.
extract the id and the hashtag from the above tweets using following command
extract_details = FOREACH load_tweets GENERATE FLATTEN(myMap#'entities') as
(m:map[]),FLATTEN(myMap#'id') as id;
DESC;
Using the above command, we have calculated the average rating of each tweet by using each word of the tweet and arranging the tweets in descending order as per their rating. Now let us run the sentiments.sql script as shown in Fig. 5.11.9.
The output of average ratings by tweet id is shown in Fig. 5.11.10, which shows the tweet_id and its rating.
The head() function prints the first few records of a dataset, for example :
head(sales)
The summary() function provides some descriptive statistics, such as the mean and
median, for each data column.
summary(sales)
cust_id sales_total num_of_orders gender
Min. :10001 Min. : 30.02 Min. : 1.000 F:5035
1st Qu.:10251 1st Qu.: 80.29 1st Qu.: 2.000 M:4965
Median :10501 Median : 151.65 Median : 2.000
Mean :10501 Mean : 249.46 Mean : 2.428
3rd Qu.:10750 3rd Qu.: 295.50 3rd Qu.: 3.000
Max. :11000 Max. :7606.09 Max. :22.000
Plotting a dataset's contents can provide information about the relationships between the various columns.
The hist() function is used to plot a histogram; for example, for the residuals of a fitted model stored in a variable called results :
hist(results$residuals, breaks = 800)
Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors. A vector can only consist of values of the same class. Tests for vectors can be conducted using the is.vector() function, while the array() function can be used to restructure a vector as an array. The matrix() function is used to create a matrix, as shown below.
sales <- matrix(0, nrow = 3, ncol = 4)
sales
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
The basic data analytical methods in R are explained in Lab no. 3.
Summary
The NoSQL is a set of concepts that allows the rapid and efficient processing of
data sets with a focus on performance, reliability, and agility. It does not require a
fixed schema, normalized data (3NF), tables or joins, instead it is used for
distributed data stores with easier scalability.
There are five basic types of NoSQL databases like Key-Value Store, Document
Store, Tabular Store, Object Store and Graph-based databases
The Key-Value store is presented with a simple string called a key and returns an arbitrarily large BLOB of data called the value; a document-based NoSQL database stores and retrieves the data as a key-value pair where the value part is stored as a document with an associated key; a tabular store holds sparse data in a three-dimensional table that is indexed by a row key, a column key and a timestamp that may allude to the time at which the row's column was stored.
The Object data store stores and manages different kinds of objects in a database, while Graph stores are highly optimized to proficiently store graph nodes and links and enable you to query these graphs.
The Apache Hive is a data warehouse software built on top of Hadoop that facilitates reading, writing and managing large datasets stored in HDFS. The data stored by Hive may reside on distributed storage and can be queried using SQL.
Sharding is a type of database partitioning that splits very large databases into smaller, faster, more easily managed parts called data shards.
HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS) and is used to store and process unstructured data.
On Twitter, trending analysis analyzes individual tweets and looks for certain words among them, while sentiment analysis looks for certain keywords in a tweet and analyzes them to compute a sentiment score.
Applications of big data analytics for E-Commerce and Blogs are Predictive
Analytics, Dynamic Pricing, Enhanced Customer Service, Fraud detection, Supply
Chain Visibility etc.
The document-based NoSQL database stores and retrieves the data as a key value
pair where value part is stored as a document with associated key. In document store
each record is considered as a separate document where everything inside a document
is automatically indexed when a new document is added. Although the indexes are
large and everything is searchable within the document, most of the document stores
group documents together in a collection. Each document has some structure and
encoding for managing its data. Some of the common encodings used with the
document stores are XML, JSON (Java Script Object Notation), BSON (which is a binary
encoding of JSON objects) etc. or other means of serializing data. An example of a
document store for the collection Book using MongoDB is shown as follows.
{"_id": ObjectId(7df78458902c), "title": "Python Programming", "description": "Python
is a Object oriented language ", "by": "Technical publication", "tags": ["SciPy", "NumPy",
"Pandas"], "likes": "100"}
Q.4 Enlist the categories of NOSQL datastores.
Ans. : There are five basic categories of NoSQL databases described as follows
1. Key-Value store : It uses a combination of keys and values stored in big hash tables to process the data. e.g. Redis, Riak, Amazon DynamoDB etc.
2. Document store : It stores and retrieves the data as a key-value pair where the value part is stored as a document with an associated key. e.g. MongoDB, CouchDB etc.
3. Tabular store (Column-based store) : These databases are capable of storing and processing data in the columns of a table. Some of the columnar databases are Hbase, Cassandra, Azure Table Storage (ATS), BerkeleyDB etc.
4. Object store : These types of databases stores object in a database with capability to
represent ORDBMS. The examples of Object stores are ObjectDB, Perst, Objectstore
etc.
5. Graph-based databases : This type of databases are network databases that uses
edges and nodes to represent the stored data with relationship. e.g. Apache Giraph,
GraphX, Neo4j etc.
It provides atomic read and write where during one read or write process, all other
processes are prevented from performing any read or write operations.
It offers database sharding of tables which is required to reduce I/O time and
overhead.
It provides high availability, high throughput along with scalability in both linear
and modular form.
It supports distributed storage like HDFS.
It supports real time data processing using block cache and Bloom filters to make real
time query processing easier.
It supports data replication across clusters.
Q.6 What is sharding ? Explain horizontal and vertical partitioning. [Refer section 5.9]
Part - B Questions
Q.5 Explain in brief Big data analytics for E-commerce and Blogs.
Q.6 Explain the architecture of Hive.
Ans. : The Hive architecture is composed of five components, namely : the Hive user interface, the metadata store, HDFS/HBase, the Hive query processing engine (HiveQL) and the execution engine. The architecture of Hive is shown in Fig. 5.2.
Big Data Analytics Lab
Contents
Lab 1 : Installation of Hadoop single node cluster on Ubuntu 16.04
In a single node setup, the name node and the data node run on the same machine. The detailed steps to install Hadoop on Ubuntu 16.04 are explained as follows :
Step 1 : Update the Ubuntu
$ sudo apt-get update
Add the newly created key to the list of authorized keys so that Hadoop can use ssh
without prompting for a password.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Disable the IPv6 feature, because using 0.0.0.0 for the various networking-related Hadoop configuration options can result in Hadoop binding to the IPv6 addresses. To disable it, open the sysctl.conf file :
$ sudo nano /etc/sysctl.conf
Add the following lines at the end of the sysctl.conf file and reboot the machine.
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled and a value of 1 means disabled.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
4. Configure mapred-site.xml file
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be renamed/copied with the name mapred-site.xml.
So copy the file and open it for configuration.
$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
5. Configure hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster on which it is being used. It is used to specify the directories which will be used as the namenode and the datanode on that host. So first create directories under hdfs for the name node, the data node and the hdfs store.
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
To verify that all the services are running, run the jps command. If the output of the jps command shows the expected daemons (typically NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and Jps), then Hadoop has been installed successfully.
In this practical, a single node Hadoop cluster has been used. A Hadoop cluster with Eclipse pre-installed on CentOS is going to be used for running the Map-Reduce program. The steps to run the word count program using the Map-Reduce framework are as follows :
Step 1 : Open Eclipse and create new Java project specify name and click on finish
Step 3 : Right click on the package named wordcount, create a new class in it and assign it the name wordcount.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
while(token.hasMoreTokens())
{
String status = new String();
String word = token.nextToken();
Text outputKey = new Text(word);
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
} // end of map()
} //end of Mapper Class
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException,
InterruptedException
{
int sum = 0;
for (IntWritable value : values) {
sum += value.get(); // accumulate the counts emitted by the mappers for this word
}
con.write(word, new IntWritable(sum)); // emit the word with its total count
} // end of reduce()
} // end of Reducer class
/*
*/
// job definition
} // end of main()
Step 6 : Once all the errors have been resolved then right click on project and select
export jar files,specify name to it and click on finish.
Step 7 : Create an input text file and copy both the input and jar files to the Hadoop directory
[training@Localhost ~]$ cat > inputtsec
this is a demo program on map reduce
the the is is a a demo mo
map map
reduce reduce
[1] + Stopped cat > inputtsec
[training@Localhost ~]$ hadoop fs -copyFromLocal /home/training/inputtsec hadoop/
[training@Localhost ~]$ hadoop fs -copyFromLocal /home/training/word.jar hadoop/
[training@Localhost ~]$
In this lab, we are going to study various methods in R available for exploring,
conditioning, modeling, and presenting the data in a dataset. The various functions used
to perform data analytics are explained as follows.
1) Import data
Syntax : Var_name <- read.csv("<Filename.csv>")
Reading a CSV file from a local path :
titanic_data <- read.csv("c:/data/titanic.csv")
Loading a dataset from a remote website :
titanic_data <- read.csv("https://goo.gl/At238b")
Choosing a file interactively :
data <- read.csv(file.choose())
2) Examine the dataset
head(Var_name) # For printing the top lines of a dataset
summary(Var_name) #For printing the summary of a dataset
3) Plot the graph
Syntax for bar plot : barplot(height, names.arg = NULL, col = NULL, main = NULL)
Syntax for plot : plot(x, y, ...)
Syntax for histogram : hist(x, breaks, freq, probability, density, angle, col, border)
hist(housing$Home.Value)
4) Install libraries
install.packages("dplyr")
install.packages("ggplot2")
5) Import libraries
library(dplyr)
library(ggplot2)
6) Read table
Syntax - read.table(file, header, sep, quote =, row.names, col.names, allowEscapes,flush,
stringsAsFactors, fileEncoding)
> x <- read.table(file.choose(), header = T, sep = "\t")
> x
State region date Home.value structur.cost
1 MH west 2019 214952 160599
2 MH west 2019 225511 160252
3 RS west 2019 234994 163791
4 GJ west 2017 235820 161787
5 GA west 2016 244590 155400
6 MH west 2017 253714 157458
7) Operators in R
Arithmetic operators : + - * / %% %/% ^
Relational operators : < > == <= >= !=
Logical operators : & | ! && ||
Assignment operators : = <- -> <<- ->>
8) Constants in R
Constant Value
pi 3.141593
LETTERS A........Z
letters a........z
month.abb “Jan”... “Dec”
month.name “January” ...
“December”
9) Datatypes in R
Example 1
In this example, we are going to use the "USArrests" dataset available on the Kaggle website. The steps to perform clustering using the K-means technique are given as follows :
Step 2 : Load dataset “USArrest” and remove the missing values from dataset
> data('USArrests')
> d_frame <- USArrests
> d_frame <- na.omit(d_frame) #Removing the missing values
> d_frame <- scale(d_frame)
> head(d_frame)
The output of above command is given as follows.
> data(‘USArrests’)
> d_frame <- USArrests
> d_frame <- na.omit(d_frame) #Removing the missing values
> d_frame <- scale(d_frame)
> head(d_frame)
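The clustering step that typically follows is sketched below; the number of clusters and the nstart value are illustrative assumptions, not fixed values from this lab.

set.seed(123)
km <- kmeans(d_frame, centers = 4, nstart = 25)   # K-means on the scaled USArrests data
km$size                                           # number of states assigned to each cluster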
Example 2
In Example 2, we are going to use the Uber dataset provided by the Kaggle website, where the value of k is predefined as 5. The steps to perform clustering using the K-means technique for the Uber dataset are given as follows :
Syntax :
Variable_name <- read.csv (<Path to csv file>)
Here symbol <- is used as an assignment operator
Example:
X <- read.csv("https://raw.githubusercontent.com/raw-data.csv")
In this example we are going to use the Uber datasets which are available on the Kaggle website for the analysis. The URLs for the six data sets are given below :
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-
trip-data/uber-raw-data-apr14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-may14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-
trip-data/uber-raw-data-jun14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-
trip-data/uber-raw-data-jul14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-aug14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-
trip-data/uber-raw-data-sep14.csv
Now load all the datasets using the read.csv command into the respective variables.
(Note : You must have an active internet connection to load data from the Kaggle website into your local RStudio.)
apr <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-apr14.csv")
may <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-may14.csv")
jun <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-jun14.csv")
jul <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-jul14.csv")
Now, to bind the data of all the separate datasets into one, we need to use the dplyr package's bind_rows function.
To install the dplyr package, type the following command at the R prompt :
>install.packages("dplyr")
To bind datasets, use following command to merge all records in variable data.
> library("dplyr")
> data <- bind_rows(apr, may, jun, jul, aug, sep)
Now view the summary of data using following command
summary(data)
OUTPUT
> summary(data)
Date.Time Lat Lon Base
Length : 4534327 min. : 39.66 min. : 74.93 B02512 : 205673
Class : character 1st Qu. : 40.72 1st Qu. : 74.00 B02598 : 1393113
Mode : character Median : 40.74 Median : 73.98 B02617 : 1458853
Mean : 40.74 Mean : 73.97 B02682 : 1212789
3rd Qu. : 40.76 3rd Qu. : 73.97 B02764 : 263899
Max. : 42.12 Max. : 72.07
Step 2 : Prepare data for clustering and perform Missing value analysis
This step consists of cleaning and rearranging your data so that you can work on it
more easily. It's a good idea to first think of the sparsity of the dataset and check the
amount of missing data. For that first install library VIM.
install.packages("VIM")
Now, aggregate the dataset using the aggr command, which plots the missing-value graphs as shown below.
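A minimal sketch of this step, assuming the combined data frame is named data as above (the resulting plot appears in the printed figure) :
> library("VIM")     # load the VIM package installed above
> aggr(data)         # plot the amount and pattern of missing values per variable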
From the above output, we can see that there are no missing values in the dataset. Now, to work with the date and time values, we use the lubridate library. Lubridate makes it simple to identify the order in which the year, month and day appear in your dates and to manipulate them.
> install.packages("lubridate")
> library("lubridate")
> data$Date.Time <- mdy_hms(data$Date.Time)
> data$Year <- factor(year(data$Date.Time))
> data$Month <- factor(month(data$Date.Time))
> data$Day <- factor(day(data$Date.Time))
> data$Weekday <- factor(wday(data$Date.Time))
> data$Hour <- factor(hour(data$Date.Time))
> data$Minute <- factor(minute(data$Date.Time))
> data$Second <- factor(second(data$Date.Time))
> data$Month
Now, check the first few rows to see what the data looks like :
> head(data, n = 10)
Date.Time Lat Lon Base Year Month Day Weekday Hour Minute
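The clustering step itself appears in the printed figures; a minimal sketch of it is given below, assuming the pickups are clustered on the Lat and Lon columns with the predefined k = 5 (the seed and the new column name are illustrative).
# Minimal sketch of the clustering step (assumed; seed and column name are illustrative)
set.seed(20)                                       # reproducible cluster assignment
clusters <- kmeans(data[, c("Lat", "Lon")], centers = 5)
data$Cluster <- factor(clusters$cluster)           # attach the cluster label to each pickup
clusters$centers                                   # coordinates of the five cluster centres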
References:
https://www.kaggle.com/deepakg/usarrests
https://uc-r.github.io/kmeans_clustering
https://www.datacamp.com/community/tutorials/k-means-clustering-r
https://www.guru99.com/r-k-means-clustering.html
https://data-flair.training/blogs/clustering-in-r-tutorial/
Decision trees are a popular data mining technique that uses a tree-like structure to derive outcomes from input decisions. In this lab we are going to implement decision tree classification in R. The steps are as follows.
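The data-loading and preparation steps appear in the printed figures; the sketch below reconstructs them under stated assumptions: the Kaggle Titanic train.csv file, the party package for ctree, and an illustrative column selection and 80/20 split.
# Minimal sketch of the assumed preparation steps (file path, selected columns
# and split ratio are illustrative; the original pages show them as figures)
library(party)                                     # provides the ctree() function used below
titanic <- read.csv("train.csv")                   # Kaggle Titanic training file
titanic$survived <- factor(titanic$Survived)       # response variable as a factor
titanic$Sex      <- factor(titanic$Sex)
titanic <- na.omit(titanic[, c("survived", "Pclass", "Sex", "Age", "SibSp", "Fare")])
set.seed(123)                                      # reproducible train/test split
idx        <- sample(nrow(titanic), 0.8 * nrow(titanic))
train_data <- titanic[idx, ]
test_data  <- titanic[-idx, ]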
The conditional partitioning shown in the above plot can be performed using the following commands.
> ctree_ <- ctree(survived ~ ., train_data)
> plot(ctree_)
The output of the conditional partitioning is shown as follows.
References :
https://www.kaggle.com/c/titanic/data
https://data-flair.training/blogs/r-decision-trees/
https://www.guru99.com/r-decision-trees.html#1
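The next lab's pages show the printed output of a Naive Bayes classifier with kernel density estimation on the graduate-admission dataset; a minimal sketch of the setup that produces this output is given below, assuming the naivebayes package and the UCLA binary.csv dataset (the URL, seed and split ratio are illustrative).
# Minimal sketch of the assumed setup (illustrative URL, seed and split; the
# original pages show these steps as figures)
library(naivebayes)
admission <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admission$admit <- factor(admission$admit)
admission$rank  <- factor(admission$rank)
set.seed(1234)                                     # reproducible ~80/20 split
ind   <- sample(2, nrow(admission), replace = TRUE, prob = c(0.8, 0.2))
train <- admission[ind == 1, ]
test  <- admission[ind == 2, ]
# usekernel = TRUE replaces the Gaussian class-conditional densities with KDEs
model <- naive_bayes(admit ~ ., data = train, usekernel = TRUE)
model                                              # prints the KDE summaries shown below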
-------------------------------------------------------------------------------
::: gre::1 (KDE)
-------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (102 obs.) ;  Bandwidth 'bw' = 39.59
              x                          y
  Min.    : 181.2            Min.    : 1.145e-06
  1st Qu. : 365.6            1st Qu. : 2.007e-04
  Median  : 550.0            Median  : 1.291e-03
  Mean    : 550.0            Mean    : 1.354e-03
  3rd Qu. : 734.4            3rd Qu. : 2.375e-03
  Max.    : 918.8            Max.    : 3.465e-03
-------------------------------------------------------------------------------
::: gpa::0 (KDE)
-------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (223 obs.) ;  Bandwidth 'bw' = 0.1134
              x                          y
  Min.    : 2.080            Min.    : 0.0002229
  1st Qu. : 2.645            1st Qu. : 0.0924939
  Median  : 3.210            Median  : 0.4521795
  Mean    : 3.210            Mean    : 0.4419689
  3rd Qu. : 3.775            3rd Qu. : 0.6603271
  Max.    : 4.340            Max.    : 1.1433285
-------------------------------------------------------------------------------
::: gpa::1 (KDE)
-------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (102 obs.) ;  Bandwidth 'bw' = 0.1234
              x                          y
  Min.    : 2.25             Min.    : 0.0005231
  1st Qu. : 2.78             1st Qu. : 0.0800747
  Median  : 3.31             Median  : 0.4801891
  Mean    : 3.31             Mean    : 0.4710851
  3rd Qu. : 3.84             3rd Qu. : 0.8626207
  Max.    : 4.37             Max.    : 1.0595464
-------------------------------------------------------------------------------
::: rank (categorical)
-------------------------------------------------------------------------------
rank             0            1
  1     0.10313901   0.24509804
  2     0.36771300   0.42156863
  3     0.33183857   0.24509804
  4     0.19730942   0.08823529
Step 6 : Make predictions and evaluate them with confusion matrices on the training and test data
First, make the predictions on the training dataset :
> p <- predict(model, train, type = 'prob')
> head(cbind(p, train))
The output of the above command is shown below.
> head(cbind(p, train))
          0         1 admit gre  gpa rank
1 0.8528794 0.1471206     0 380 3.61    3
2 0.5621460 0.4378540     1 660 3.67    3
3 0.2233490 0.7766510     1 800 4.00    1
4 0.8643901 0.1356099     1 640 3.19    4
6 0.6263274 0.3736726     1 760 3.00    2
7 0.5933791 0.4066209     1 560 2.98    1
Now create the class predictions for the training data as p1, build the confusion matrix from them and observe the result.
> p1 <- predict(model, train)
Non kernel-based densities :
> (tab1 <- table(p1, train$admit))
p1      0    1
   0  196   69
   1   27   33
> 1 - sum(diag(tab1)) / sum(tab1)
[1] 0.2953846
Kernel-based densities :
> (tab1 <- table(p1, train$admit))
p1      0    1
   0  203   69
   1   20   33
> 1 - sum(diag(tab1)) / sum(tab1)
[1] 0.2738462
Similarly, create the predictions for the test data as p2, build the confusion matrix from them and observe the result.
> p2 <- predict(model, test)
Non kernel-based densities :
> (tab2 <- table(p2, test$admit))
p2      0    1
   0   47   21
   1    3    4
> 1 - sum(diag(tab2)) / sum(tab2)
[1] 0.32
Kernel-based densities :
> (tab2 <- table(p2, test$admit))
p2      0    1
   0   47   20
   1    3    5
> 1 - sum(diag(tab2)) / sum(tab2)
[1] 0.3066667
If we do not use kernel-based densities while developing the model, the accuracy may be degraded. From the above results it is observed that the kernel-based densities reduce the misclassification rate from 29.53 % to 27.38 % on the training data and from 32 % to 30.67 % on the test data; in other words, the kernel-based model is more accurate in both cases.
References:
https://www.youtube.com/watch?v=RLjSQdcg8AM
Lab 7 : To study the NoSQL database MongoDB and implement CRUD operations
The various commands for performing CRUD operations in MongoDB are given as follows.
1. Create / switch database
   Syntax  : > use <database_name>
   Example : > use mydb
             switched to db mydb
2. Show databases
   Syntax  : > show dbs
   Example : > show dbs
             local 0.78125GB
             test  0.23012GB
3. Delete the current database
   Syntax  : > db.dropDatabase()
   Example : > db.dropDatabase()
4. Create a collection
   Syntax  : > db.createCollection("<Collection_Name>")
   Example : > db.createCollection("Student")
5. Insert records into a collection
   Syntax  : > db.<Collection_name>.insert({"Key" : "Value"})
   Example : > db.movie.insert({"name" : "tsec"})
6. Delete a collection
   Syntax  : > db.<Collection_Name>.drop()
   Example : > db.Student.drop()
             true
7. View records
   Syntax  : > db.<Collection_name>.find()
   Example : > db.Student.find()
8. Update records
   Syntax  : > db.<Collection_name>.update({<oldKey : oldValue>}, {$set : {<newKey : newValue>}})
   Example : > db.Student.update({'title' : 'Manager'}, {$set : {'title' : 'Leader'}})
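The same operations can also be driven from R; the following is a minimal sketch using the mongolite package (an assumption, since the lab itself uses the mongo shell) and a MongoDB server running locally on the default port.
# Minimal sketch using the mongolite package (assumed; the lab uses the mongo shell).
# Requires a MongoDB server listening on localhost:27017.
library(mongolite)
student <- mongo(collection = "Student", db = "mydb",
                 url = "mongodb://localhost:27017")
student$insert(data.frame(name = "tsec", title = "Manager"))     # Create
print(student$find('{}'))                                        # Read
student$update('{"title" : "Manager"}',
               '{"$set" : {"title" : "Leader"}}')                # Update
student$drop()                                                   # Delete the collection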
To work with HDFS, Hive and HBase we need a configured Hadoop server, which can either be installed manually as in Lab 1 or obtained through a ready-made Hadoop distribution such as Cloudera or Hortonworks. The Cloudera QuickStart Hadoop virtual machine (CDH) can be downloaded from https://www.cloudera.com/downloads/quickstart_vms/5-13.html as shown below.
The CDH virtual machine can be opened in any virtualization software, such as VMware Workstation Player, as shown in the following figure.
Once Hadoop is ready, we can work with any Hadoop ecosystem component such as Hive, HDFS, Pig, HBase etc.
1. HDFS
Apache Hadoop is an open-source software framework that enables distributed processing of large datasets across clusters of commodity servers using simple programming models. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. The Hadoop framework includes the Hadoop Distributed File System (HDFS) as its storage layer.
a) Hive
Apache Hive is data warehouse software that facilitates reading, writing and managing large datasets residing on distributed storage using SQL. Hive provides a simple SQL-like query language called Hive Query Language (HQL) for querying and managing large datasets, and it also allows programmers to plug in custom Map-Reduce programs to perform more sophisticated analysis. To work with Hive, open the terminal in CDH and type hive, which opens the Hive shell as shown below.
b) HBase
HBase is a column-oriented database used to store unstructured data in column format. It is a NoSQL solution for big data available within the Hadoop platform. HBase is not a relational database system and, in general, does not support SQL syntax. To work with HBase, open the terminal in CDH and type hbase shell, which opens the HBase shell as shown below.