
SUBJECT CODE : CS8091

Strictly as per Revised Syllabus of


Anna University
Choice Based Credit System (CBCS)
Semester - VI (IT)
Semester - VII (CSE) Professional Elective - II

Big Data Analytics

Dr. Bhushan Jadhav


Ph.D. Computer Engineering
Assistant Professor, Information Technology Department,
Thadomal Shahani Engineering College,
Bandra, Mumbai.

Sonali Jadhav
M.E. Computer Engineering
Assistant Professor, Computer Engineering Department,
D. J. Sanghvi College of Engineering,
Mumbai.

TECHNICAL PUBLICATIONS (Since 1993) - An Up-Thrust for Knowledge

Big Data Analytics
Subject Code : CS8091

Semester - VI ( Information Technology)


Semester - VII ( Computer Science & Engineering) Professional Elective - II

First Edition : January 2020

© Copyright with Authors


All publishing rights (printed and ebook version) reserved with Technical Publications. No part of this book
may be reproduced in any form, electronic, mechanical, photocopy, or by any information storage and
retrieval system, without prior permission in writing from Technical Publications, Pune.

Published by :
Technical Publications, Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S., INDIA
Ph. : +91-020-24495496/97, Telefax : +91-020-24495497
Email : sales@technicalpublications.org, Website : www.technicalpublications.org

Printer :
Yogiraj Printers & Binders
Sr.No. 10/1A,
Ghule Industrial Estate, Nanded Village Road,
Tal. - Haveli, Dist. - Pune - 411041.

Price : ₹ 250/-
ISBN 978-93-89420-88-3




UNIT - I

1 Introduction to Big Data


Syllabus
Evolution of big data - Best practices for big data analytics - Big data characteristics - Validating the
promotion of the value of big data - Big data use cases - Characteristics of big data applications -
Perception and quantification of value - Understanding big data storage - A general overview of high
performance architecture - HDFS - MapReduce and YARN - MapReduce programming model.

Contents
1.1 Introduction
1.2 Evolution of Big data
1.3 Best Practices for Big Data Analytics
1.4 Big Data Characteristics
1.5 Validating the Promotion of the Value of Big Data
1.6 Big Data Use Cases
1.7 Characteristics of Big Data Applications
1.8 Perception and Quantification of Value
1.9 Understanding Big Data Storage
1.10 A General Overview of High-Performance Architecture
1.11 Architecture of Hadoop
1.12 Hadoop Distributed File System (HDFS)
1.13 Architecture of HDFS
1.14 Map Reduce and YARN
1.15 Map Reduce Programming Model
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions


1.1 Introduction
Due to massive digitalization, a large amount of data is being generated by the web
applications and social networking sites that many organizations run on the internet. In
today's technological world, high computational power and large storage are basic needs,
and both have increased significantly over time. Organizations today produce huge
amounts of data at a rapid rate; as per the global internet usage report cited by Wikipedia,
51% of the world's population uses the internet to perform their day-to-day activities. Most
of them use the internet for web surfing, online shopping, or interacting on social media
sites such as Facebook, Twitter or LinkedIn. These websites generate a massive amount of
data that involves uploading and downloading of videos, pictures or text messages, whose
size is almost unpredictable with such a large number of users.
A recent survey on data generation says that Facebook produces 600 TB of data per day
and analyzes 30+ petabytes of user-generated data, a Boeing jet airplane generates more
than 10 TB of data per flight including geo maps and other information, Walmart handles
more than 1 million customer transactions every hour with an estimated more than
2.5 petabytes of data per day, Twitter generates 0.4 million tweets per minute, and
400 hours of new video are uploaded to YouTube, accessed by 4.1 million users. Therefore,
it becomes necessary to manage such a huge amount of data, generally called "Big data",
from the perspective of its storage, processing and analytics.
In big data, the data is generated in many formats : structured, semi-structured or
unstructured. Structured data has a fixed pattern or schema and can be stored and
managed using tables in an RDBMS. Semi-structured data does not have a pre-defined
structure or pattern; it includes scientific or bibliographic data, which can be represented
using graph data structures. Unstructured data also does not have a standard structure,
pattern or schema; examples of unstructured data are videos, audio, images, PDFs,
compressed files, log files or JSON files. Traditional database management techniques are
incapable of storing, processing, handling and analyzing big data in its various formats,
which include images, audio, videos, maps, text, XML and so on.
The processing of big data using a traditional database management system is very
difficult because of its four characteristics, called the 4 Vs of Big data, shown in Fig. 1.1.1.
In Big data, Volume refers to the size of data being generated per minute or second,
Variety refers to the types of data generated, including structured, unstructured or
semi-structured data, Velocity refers to the speed at which data is generated per minute or
second, and Veracity refers to the uncertainty of the data being generated.

Fig. 1.1.1 : Four Vs of Big data

Because of the above four Vs, it becomes more and more difficult to capture, store,
organize, process and analyze the data generated by various web applications or websites.
In a traditional analytics system, cleansed, meaningful data is collected and stored in a
data warehouse fed by an RDBMS, and the data is moved and prepared through Extract,
Transform and Load (ETL) operations before being analyzed. Such a system supports only
cleansed, structured data used for batch processing, and the parallel processing of such
data with traditional analytics was costlier because of expensive hardware. Therefore, big
data analytics solutions came into the picture, with many advantages over traditional
analytics solutions. The major advantages of Big data analytics are support for real-time as
well as batch processing of data, the ability to analyze different formats of data, the ability
to process uncleansed or uncertain data, no requirement for expensive hardware, support
for huge volumes of data generated at any velocity, and data analytics at low cost.
Therefore, it is best to begin with a definition of big data. The analyst firm Gartner can
be credited with the most-frequently used (and perhaps, somewhat abused) definition:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.


1.2 Evolution of Big Data


To deeply understand the consequences of Big Data analytics, some computing history,
specifically that of Business Intelligence (BI) and scientific computing, needs to be
understood. Problems related to Big Data can be traced back to before the evolution of
computers, when unstructured data in paper format had to be tackled. Perhaps the first
Big Data challenge came into the picture at the US Census Bureau in 1880, where
information concerning approximately 60 million people had to be collected, classified and
reported; that process took more than 10 years to complete. Therefore, in 1890, the first Big
Data platform was introduced with a mechanical device called the Hollerith Tabulating
System, which worked with punch cards capable of holding about 80 variables per card,
which was very inefficient. In 1927, an Austrian-German engineer developed a device that
could store information magnetically on tape, but it too had very limited storage space. In
1943, British engineers developed a machine called Colossus, which was capable of
scanning 5,000 characters a second and reduced the workload from weeks to hours.
In 1969, the Advanced Research Projects Agency (ARPA), a subdivision of the US
Department of Defense, developed ARPANET for military operations, which evolved into
the Internet in 1990. With the evolution of the World Wide Web, the generation of truly
big amounts of data began, and it accelerated after the introduction of emerging
technologies such as the Internet of Things (IoT). By 2013, the IoT had evolved to
encompass multiple technologies that use the Internet, wireless communications,
embedded systems, mobile technologies and so on. The relational databases running on
today's desktop computers have enough compute power to process the information
contained in the 1890 census with some basic code. Therefore, the definition of Big Data
continues to evolve with time and advances in technology.

1.3 Best Practices for Big Data Analytics


Like other technologies, there are some best practices that can be applied to the
problems of Big Data. The best practices for Big data analytics are explained as follows.
1) Start small with big data : In Big data analytics, always start the analysis with a
smaller task. Ideally, those smaller tasks will build the expertise needed to deal with
the larger analytical problem. In a Big data problem, a variety of data is generated,
and patterns and correlations must be uncovered in both structured and unstructured
data; starting with a bigger task may create a dead spot in the analytics matrix, where
the patterns found may not be relevant to the problem being asked.


Therefore, every successful Big Data project tends to start with smaller data
sets and targeted goals.
2) Think big for scalability : While defining a Big data system, always follow a
futuristic approach. That means determining how much data will be collected over the
next six months, or calculating how many more servers will be needed to handle it.
This approach allows applications to be scaled easily without any bottleneck.
3) Avoid bad practices : There are many potential reasons why Big Data projects fail.
To make a Big data project successful, the following bad practices must be avoided :
a) Rather than blindly adopting and deploying something, first understand the
business purpose of the technology you are using for the deployment, so as to
implement the right analytics tools for the job at hand. Without a solid
understanding of the business requirements, the project will end up without the
intended outcome.
b) Do not assume that the software will have all of the solutions for your problem, as
the business requirements, environment and input/output vary from project to
project.
c) Do not consider the solution to one problem relevant for every problem; each
problem has unique requirements and needs a unique solution that cannot simply
be reused elsewhere. As a result, new methods and tools might be required to
capture, cleanse, store and process at least some of your Big Data.
d) Do not appoint the same person to handle every type of analytical operation, as a
lack of business and analytical expertise may lead to failure of the project. Big data
projects require analytics professionals with statistical, actuarial and other
sophisticated skills, and with expertise in advanced analytics operations.
4) Treat big data problems as scientific experiments : In a Big data project, collecting
and analyzing the data is just part of the procedure; the analytics produce business
value only when the findings are incorporated into the business processes they are
intended to improve. Therefore, every Big data problem requires a feedback loop that
passes back the success of actions taken as a result of the analytical findings, followed
by improvement of the analytical models based on the business results.
5) Decide what data to include and what to leave out : Although Big Data analytics
projects involve large data sets, that does not mean all the data generated by a system
can be analyzed. Therefore, it is necessary to select the appropriate datasets for
analysis based on their value and expected outcomes.

6) Must have a periodic maintenance plan : The success of a Big Data analytics initiative
requires regular maintenance of the analytics programs to keep up with changes in the
business requirements.
7) In-memory processing : In-memory processing of large datasets should be considered
for improvements in data processing, speed of execution and the volume of data
handled. It gives performance that is hundreds of times better than older technologies,
better price-to-performance ratios, reductions in the cost of central processing units
and memory, and the ability to handle rapidly expanding volumes of information.

1.4 Big Data Characteristics


Big data can be described by the following characteristics :
a) Volume : The quantity of data that is generated is very important in this context. It
is the size of the data which determines the value and potential of the data under
consideration and whether it can actually be considered Big Data or not. The name
‘Big Data’ itself contains a term which is related to size and hence the characteristic.
b) Variety : The next aspect of Big Data is its variety. This means that the category to
which Big Data belongs is also an essential fact that needs to be known by the
data analysts. This helps the people who are closely analyzing the data, and are
associated with it, to use the data effectively to their advantage, thus upholding
the importance of the Big Data.
c) Velocity : The term ‘velocity’ in the context refers to the speed of generation of data
or how fast the data is generated and processed to meet the demands and the
challenges which lie ahead in the path of growth and development.
d) Variability : This is a factor which can be a problem for those who analyze the data.
This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
e) Veracity : The quality of the data being captured can vary greatly. Accuracy of
analysis depends on the veracity of the source data.
f) Complexity : Data management can become a very complex process, especially
when large volumes of data come from multiple sources. These data need to be
linked, connected and correlated in order to grasp the information that is supposed
to be conveyed by them. This situation is therefore termed the ‘complexity’ of
Big Data.


1.5 Validating the Promotion of the Value of Big Data


In previous sections, we have seen the characteristics of and best practices for big data
analytics. From these, the key features of successful big data technologies that are
beneficial to an organization are :
• Reduced capital and operational cost
• No need for high-end servers, as they can run on commodity hardware
• Support for both structured and unstructured data
• Support for high-performance and scalable analytical operations
• A simple programming model for scalable applications


Previously, the implementation of high-performance computing systems was
restricted to large organizations; because of low budgets, many other organizations were
not able to implement them. However, with improving market conditions and economics,
high-performance computing systems have attracted many organizations willing to invest
in the implementation of big data analytics. This is particularly true for those
organizations whose financial limits were previously too small to even consider
accommodating the venture.
There are many factors that need to be considered before adopting any new
technology like big data analytics. A new technology cannot be adopted blindly just
because of its feasibility and popularity within the organization. Without considering the
risk factors, the procured technology may fail, leading to the disappointment phase of the
hype cycle and nullifying the expectations of clear business improvements. Therefore,
before opting for a new technology, five factors need to be considered : the sustainability
of the technology, feasibility, integrability, value and reasonability.
Apart from that, both the reality and the hype about big data analytics must be checked
before opting for it. To see the difference between reality and hype, one must compare
what can actually be done with big data against what is said about it.
The Center for Economics and Business Research (CEBR) has published the
advantages of big data as :
• Improvements in strategy, business planning, research and analytics, leading to new
innovation and product development
• Optimized spending with improved customer marketing
• Predictive, descriptive and prescriptive analytics for improving supply chain
management
• Improved accuracy in fraud detection.


There are some more benefits promoted by integrating business intelligence and data
warehouse tools with big data, such as enhanced business planning with product analysis,
optimized supply chain management with fraud detection, and analysis of waste and
abuse of products.

1.6 Big Data Use Cases


A big data system is designed to provide high-performance capabilities over
elastically harnessed parallel computing resources with distributed storage. It is intended
to provide optimized results over scalable hardware and high-speed networks. Apache
Hadoop is the open source framework for solving the Big data problem. The typical Big
Data use cases addressed by Hadoop are given as follows :
a) It provides support for Business intelligence by querying, reporting, searching,
filtering, indexing, aggregating the datasets.
b) It provides tools for report generation, trend analysis, search optimization, and
information retrieval.
c) It has improved performance for data management operations like log storage, data
storage and archiving, followed by sorting, running joins, Extract, Transform and
Loading (ETL) processing, data conversions, duplicate analysis and elimination.
d) It supports text processing, genome and protein sequencing, web crawling, workflow
monitoring, image processing, structure prediction, and so on.
e) It also supports applications like data mining and analytical applications like facial
recognition, social network analysis, profile matching, text analytics, machine
learning, recommendation system analysis, web mining, information extraction, and
behavior analysis.
f) It supports different types of analytics, such as predictive, prescriptive and descriptive,
along with functions such as document indexing, concept filtering, aggregation,
transformation, semantic text analysis, pattern recognition, and searching.

1.7 Characteristics of Big Data Applications


Each Big data solution (like Hadoop) is intended to solve a business problem in
quicker time over larger deployments, subject to one or more of the following criteria :
1. Data throttling : The existing solution running on traditional hardware suffers from
throttling-related challenges such as data accessibility issues, data latency, limited
data availability and limits on bandwidth.
2. Computation-restricted throttling : Due to the limitation of computing resources, the
existing solution suffers from computation-restricted throttling, where the expected
computational performance cannot be met with conventional systems.


3. Large data volumes : Due to the huge volume of data, the analytical application
needs to sustain high rates of data creation and delivery.
4. Significant data variety : Due to the diversity of applications, the data they generate
may come in a variety of forms, structured or unstructured, produced by different
data sources.
5. Data parallelization : As a big data application needs to process a huge amount of
data, the application's runtime can be improved through task- or thread-level
parallelization applied to independent data segments.
Some of the big data applications and their characteristics are given in Table 1.7.1.

Sr. No.   Application Name        Possible Characteristics
1         Fraud detection         Data throttling, Computation-restricted throttling, Large data volumes, Significant data variety, Data parallelization
2         Data profiling          Large data volumes, Data parallelization
3         Clustering              Data throttling, Computation-restricted throttling, Large data volumes, Significant data variety, Data parallelization
4         Price modelling         Data throttling, Computation-restricted throttling, Large data volumes, Data parallelization
5         Recommendation System   Data throttling, Computation-restricted throttling, Large data volumes, Significant data variety, Data parallelization

Table 1.7.1 : Characteristics of Big data applications

1.8 Perception and Quantification of Value


As we know, the three important facets of a Big data system are organizational
readiness, suitability of the business challenge and big data's contribution to the
organization. Therefore, to test the perception and quantification of the value of Big data,
the following criteria must be examined :
a) Whether the big data system is "increasing the revenues of the organization". This can
be tested, for example, by using a recommendation engine.


b) Whether the Big data system is "lowering the costs" of the organization's spending,
such as capital expenses (CapEx) and operational expenses (OpEx).
c) Whether the Big data system is "increasing the productivity" by speeding up the
process of execution with efficient results.
d) Whether the Big data system is "reducing the risk", for example by using the big data
platform to collect data from streams of automated sensors and provide full visibility.

1.9 Understanding Big Data Storage


Every big data application requires a collection of storage and computing resources to
achieve its performance and scalability within a runtime environment. The four key
computing resources essential for running a Big data application are :
a) CPU or Processor : allows multiple tasks to be executed simultaneously.
b) Memory : holds the data for faster processing in association with the CPU.
c) Storage : provides persistent storage of data.
d) Network : provides the communication channel between different nodes, through
which datasets are exchanged between processing nodes and storage nodes.
As single-node computers are incapable of processing huge amounts of data,
high-performance platforms are used, composed of collections of computers with a pool
of resources that can process massive amounts of data.

1.10 A General Overview Of High-Performance Architecture


The high-performance architecture for Big data is composed of multiple nodes
connected together through a variety of network topologies; what distinguishes one
platform from another is how computing and data are organized across the network of
storage nodes.
In this architecture, a master job manager is responsible for managing the pool of
processing nodes, assigning tasks and monitoring their activity, while a storage manager
manages the data storage pool and distributes datasets across the collection of storage
resources. Colocation of data and processing tasks is not strictly required, but it is
desirable in order to minimize the costs of data access latency.


Fig. 1.10.1 : Generalized high-performance architecture of a Big data system

To get a better understanding of the architecture of a big data platform, we will
examine the Apache Hadoop software stack, since it is a collection of open source projects
that are combined to enable a software-based big data appliance. A general overview of
the high-performance architecture of Hadoop is shown in Fig. 1.10.1.

1.11 Architecture of Hadoop


The challenges associated with Big data can be solved using one of the most popular
frameworks provided by Apache, called Hadoop. Big data is a term that refers to data
sets, or combinations of data sets, whose size (volume), complexity (variability) and rate of
growth (velocity) make them difficult to capture, manage, process or analyze with
conventional technologies and tools, such as relational databases and desktop statistics or
visualization packages, within the time necessary to make them useful. In simple terms,
Big data is the problem while Hadoop is the solution to that problem.
The Apache Hadoop is an open source software project that enables distributed
processing of large data sets across clusters of commodity servers using programming
models. It is designed to scale up from a single server to thousands of machines, with a
very high degree of fault tolerance. It is a software framework for storing data and
running applications on clusters of commodity hardware that provides massive storage
for any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
Hadoop provides various tools for processing big data, collectively termed the
Hadoop ecosystem (see Fig. 1.11.1).

Fig. 1.11.1 Hadoop Ecosystem

The different components of the Hadoop ecosystem are explained in Table 1.11.1.

1) HDFS : The Hadoop Distributed File System, used to split data into blocks that are
distributed among servers for processing. It keeps several copies of each data block
across the cluster so that they can be used if a failure occurs.
2) MapReduce : A programming method to process big data, comprising two programs
written in Java, the mapper and the reducer. The mapper extracts data from HDFS
and puts it into maps, while the reducer aggregates the results generated by the
mappers.
3) Zookeeper : A centralized service used for maintaining configuration information
with distributed synchronization and coordination.
4) HBase : A column-oriented database service used as a NoSQL solution for big data.
5) Pig : A platform for analyzing large data sets using a high-level dataflow language;
it provides a parallel execution framework.
6) Hive : Provides data warehouse infrastructure for big data.
7) Flume : Provides a distributed and reliable service for efficiently collecting,
aggregating and moving large amounts of log data.
8) Sqoop : A tool designed for efficiently transferring bulk data between Hadoop and
structured data stores such as relational databases.
9) Mahout : Provides libraries of scalable machine learning algorithms implemented on
top of Hadoop using the MapReduce framework.
10) Oozie : A workflow scheduler system to manage Hadoop jobs.
11) Ambari : Provides a software framework for provisioning, managing and monitoring
Hadoop clusters.

Table 1.11.1 : Different components of the Hadoop ecosystem

1.12 Hadoop Distributed File System (HDFS)


The Hadoop Distributed File System (HDFS) is Hadoop's implementation of a
distributed file system design that holds large amounts of data and provides easy access
to many clients distributed across a network. It is highly fault tolerant and designed to
run on low cost hardware (called commodity hardware). The files in HDFS are stored
across multiple machines in a redundant fashion so that the data can be recovered if a
failure occurs.
HDFS enables the storage and management of large files stored on a distributed storage
medium over a pool of data nodes. A single name node running in a cluster is associated
with multiple data nodes and provides the management of the hierarchical file
organization and namespace. An HDFS file is composed of fixed-size blocks, or chunks,
that are stored on data nodes. The name node is responsible for storing the metadata
about each file, which includes attributes such as the type of file, its size, date and time of
creation and properties of the file, as well as the mapping of blocks to files at the data
nodes. A data node treats each data block as a separate file and shares the critical
information with the name node.
HDFS provides fault tolerance through data replication, which can be specified at the
time of file creation through an attribute called the degree of replication (i.e., the number
of copies made); this becomes progressively more significant in bigger environments
consisting of many racks of data servers. The significant benefits provided by HDFS are
given as follows :
• It provides streaming access to file system data.
• It is suitable for distributed storage and processing.
• It is optimized to support high-throughput streaming read operations rather than
random access.
• It supports file operations such as read, write, delete and append, but not in-place
update.
• It provides Java APIs and command line interfaces to interact with HDFS (a short
Java API sketch is given at the end of this section).
• It provides different file permissions and authentications for files on HDFS.
• It provides continuous monitoring of name nodes and data nodes based on continuous
"heartbeat" communication from the data nodes to the name node.
• It provides rebalancing of data nodes so as to equalize the load by migrating blocks of
data from one data node to another.
• It uses checksums and digital signatures to manage the integrity of data stored in a file.
• It has built-in metadata replication so as to recover data after a failure or to protect
against corruption.
• It also provides synchronous snapshots to facilitate rollback after a failure.
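
As a brief illustration of the Java API mentioned in the list above, the following is a minimal sketch that uses the standard org.apache.hadoop.fs.FileSystem class to create a directory, copy a local file into HDFS and list the directory contents. The class name and the paths are hypothetical, and the cluster address is assumed to be picked up from the configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths, used only for illustration
        Path localFile = new Path("/home/training/a.txt");
        Path hdfsDir = new Path("/user/hadoop");

        fs.mkdirs(hdfsDir);                        // equivalent of "hadoop fs -mkdir"
        fs.copyFromLocalFile(localFile, hdfsDir);  // equivalent of "hadoop fs -copyFromLocal"

        for (FileStatus status : fs.listStatus(hdfsDir)) {   // equivalent of "hadoop fs -ls"
            System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
        }
        fs.close();
    }
}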

1.13 Architecture of HDFS


HDFS follows a master-slave architecture using name and data nodes. The name node
acts as the master while multiple data nodes work as slaves. HDFS is implemented as a
block-structured file system in which files are broken into blocks of fixed size that are
stored on hadoop clusters. The HDFS architecture is shown in Fig. 1.13.1.

Fig. 1.13.1 : HDFS Architecture



The components of HDFS include the following elements :


1) Name Node : An HDFS cluster consists of a single name node, called the master server,
which manages the file system namespace and regulates access to files by clients. It
runs on commodity hardware and stores all the metadata for the file system across
the clusters. The name node serves as the single arbitrator and repository for HDFS
metadata, which is kept in main memory for faster random access. The entire file
system namespace is contained in a file called FsImage, stored on the name node's
local file system, while the transaction log records are stored in the EditLog file.
2) Data Node : In HDFS, multiple data nodes exist, and each manages the storage
attached to the node it runs on. They are usually used to store users' data on the
HDFS cluster. Internally, a file is split into one or more blocks that are stored on data
nodes. The data nodes are responsible for handling read/write requests from clients.
They also perform block creation, deletion and replication upon instruction from the
name node. A data node stores each HDFS data block in a separate file, and several
blocks are stored on different data nodes. The requirement of such a block-structured
file system is to store, manage and access file metadata reliably. The representation of
the name node and data nodes is shown in Fig. 1.13.2.

Fig. 1.13.2 : Representation of Name node and Data nodes

3) HDFS Client : In the Hadoop distributed file system, user applications access the file
system using the HDFS client. Like other file systems, HDFS supports operations to
read, write and delete files, and operations to create and delete directories. The user
references files and directories by paths in the namespace. The user application does
not need to be aware that file system metadata and storage are on different servers, or
that blocks have multiple replicas. When an application reads a
file, the HDFS client first asks the name node for the list of data nodes that host
replicas of the blocks of the file. The client contacts a data node directly and requests
the transfer of the desired block. When a client writes, it first asks the name node to
choose data nodes to host replicas of the first block of the file. The client organizes a
pipeline from node to node and sends the data. When the first block is filled, the
client requests new data nodes to be chosen to host replicas of the next block. The
choice of data nodes for each block is likely to be different.
4) HDFS Blocks : In general, the user's data is stored in HDFS in terms of blocks. The
files in the file system are divided into one or more segments called blocks. The
default size of an HDFS block is 64 MB, which can be increased as needed.
HDFS is fault tolerant in the sense that if a data node fails, the block being written on
that data node is re-replicated to some other node. The block size and the replication
factor (the number of replicas) are specified in the hadoop configuration file, as shown
in the sketch below. The synchronization between the name node and the data nodes
is achieved by heartbeat messages which are periodically sent by the data nodes to the
name node.
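
As a small illustration of this point, the sketch below shows how the same settings can also be supplied programmatically through a Configuration object, which overrides the defaults read from the Hadoop configuration files. The property names (dfs.blocksize, dfs.replication) are those used in Hadoop 2.x (older releases used dfs.block.size), and the values are only illustrative.

import org.apache.hadoop.conf.Configuration;

public class BlockSettingsSketch {
    public static void main(String[] args) {
        // Loads defaults from any Hadoop configuration files found on the classpath
        Configuration conf = new Configuration();

        // Illustrative overrides : 64 MB block size (the default cited above) and 3 replicas per block
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        System.out.println("block size = " + conf.getLong("dfs.blocksize", 0)
                + " bytes, replication = " + conf.getInt("dfs.replication", 0));
    }
}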
Apart from the above components, a job tracker and task trackers are used when a
map reduce application runs over HDFS. Hadoop Core consists of one master job
tracker and several task trackers. The job tracker runs on the name node like a master
while the task trackers run on the data nodes like slaves.
The job tracker is responsible for taking the requests from a client and assigning task
trackers to it with tasks to be performed. The job tracker always tries to assign tasks
to the task tracker on the data nodes where the data is locally present. If for some
reason the node fails the job tracker assigns the task to another task tracker where the
replica of the data exists since the data blocks are replicated across the data nodes.
This ensures that the job does not fail even if a node fails within the cluster.
HDFS can also be manipulated using the command line. All the commands used for
manipulating HDFS through the command line interface begin with the "hadoop fs"
command, followed by an option starting with a "-" sign; most of the familiar Linux
file commands are supported over HDFS.
For example, the command for listing the files in the hadoop directory is
#hadoop fs -ls
The general syntax of HDFS command line manipulation is
#hadoop fs -<command>
The most popular HDFS commands are given in Table 1.13.1

Sr. No.  Command                                                Description
1.       #hadoop fs -ls                                         List the files
2.       #hadoop fs -count hdfs:/                               Count the number of directories, files and bytes under the path
3.       #hadoop fs -mkdir /user/hadoop                         Create a new directory hadoop under the user directory
4.       #hadoop fs -rm hadoop/cust                             Delete the file cust from the hadoop directory
5.       #hadoop fs -mv /user/training/cust hadoop/             Move the file cust from the /user/training directory to the hadoop directory
6.       #hadoop fs -cp /user/training/cust hadoop/             Copy the file cust from the /user/training directory to the hadoop directory
7.       #hadoop fs -copyToLocal hadoop/a.txt /home/training/   Copy the file a.txt from HDFS to the local disk
8.       #hadoop fs -copyFromLocal /home/training/a.txt hadoop/ Copy the file a.txt from the local directory /home/training to HDFS

Table 1.13.1 : HDFS Commands

1.14 Map Reduce and YARN


In Hadoop, MapReduce is the programming model for the execution of Hadoop jobs,
and it also performs job management. The MapReduce execution environment uses a
master/slave execution model, in which one master node, called the JobTracker, manages
a pool of slave computing resources called TaskTrackers. The job of the JobTracker is to
manage the TaskTrackers : continuously monitoring their accessibility, managing and
scheduling jobs and tasks, tracking the assigned tasks and ensuring fault tolerance.
The job of the TaskTracker is much simpler : it waits for a task assignment, executes the
task and reports status back to the JobTracker on a periodic basis. Clients make requests to
the JobTracker, which becomes the sole arbitrator for the allocation of resources. There are
constraints within this MapReduce model. First, the programming paradigm is nicely
suited to applications where there is locality between the processing and the data, but
applications that demand data movement will quickly become bogged down by latency
issues. Second, not all applications are easily mapped to the MapReduce model. Third, the
allocation of processing nodes within the cluster is fixed through the allocation of certain
nodes as "map slots" versus "reduce slots"; when the computation is weighted toward one
of the phases, the nodes assigned to the other phase are largely unused, resulting in
processor underutilization. This is being addressed in newer versions of Hadoop through
the separation of duties within a revision called YARN. In this approach, overall resource
management has been centralized in a Resource Manager, while the management of
resources at each node is performed by a local Node Manager. In addition, there is the
concept of an Application Master associated with every application, which directly
negotiates with the central Resource Manager for resource allocation and provides
effective scheduling to improve node utilization as well as monitoring of progress and
tracking of status.
status.
Finally, the YARN approach enables applications to be better aware of the data
allocation across the topology of the resources within a cluster. This awareness allows for
improved colocation of compute and data resources, reducing data movement and thus
reducing the delays associated with data access latencies. The result should be increased
scalability and performance.

1.15 Map Reduce Programming Model


Map reduce is a programming model provided by Hadoop that allows distributed
computations to be expressed on huge amounts of data. It provides easy scaling of data
processing over multiple computational nodes or clusters. In the map reduce model, the
data processing primitives used are called the mapper and the reducer. Every map reduce
program must have at least one mapper and one reducer subroutine. The mapper has a
map method that transforms an input key, value pair into any number of intermediate
key, value pairs, while the reducer has a reduce method that transforms the intermediate
key, value pairs, which are aggregated into any number of output key, value pairs.
Map reduce keeps all processing operations separate for parallel execution : a complex,
extremely large problem is decomposed into sub tasks, and these subtasks are executed
independently from each other. After that, the results of all the independent executions
are combined together to get the complete output.

1.15.1 Features of Map Reduce


The different features provided by map reduce are explained as follows :
• Synchronization : Map reduce supports the execution of concurrent tasks. When
concurrent tasks are executed, they need synchronization. Synchronization is provided
by reading the state of each map reduce operation during execution and using shared
variables for it.
• Data locality : In map reduce, although the data resides on different clusters, it appears
local to the user's application. To obtain the best results, the code and the data of an
application should reside on the same machine.
• Error handling : The map reduce engine provides different fault tolerance mechanisms
in case of failure. When tasks are running on different cluster nodes and a failure
occurs, the map reduce engine finds the incomplete tasks and reschedules them for
execution on different nodes.
• Scheduling : Map reduce involves map and reduce operations that divide large
problems into smaller chunks, which are run in parallel by different machines, so
different tasks need to be scheduled on computational nodes on a priority basis; this is
taken care of by the map reduce engine.

1.15.2 Working of Map Reduce Framework


The unit of work in map reduce is a job. During the map phase, the input data is
divided into input splits for analysis, where each split is an independent task. These tasks
run in parallel across hadoop clusters. The reducer phase uses the result obtained from
the mapper as an input to generate the final result.
Map reduce takes a set of input <key, value> pairs and produces a set of output
<key, value> pairs by passing the data through map and reduce functions. The typical
map reduce process is shown in Fig. 1.15.1.

Fig. 1.15.1 : Map reduce process

Every map reduce program undergoes different phases of execution. Each phase has
its own significance in the map reduce framework. The different phases of execution in
map reduce are shown in Fig. 1.15.2 and explained as follows :
In the input phase, a large data set in the form of <key, value> pairs is provided as the
standard input for the map reduce program. The input files used by map reduce are kept
on the HDFS (Hadoop Distributed File System) store and have a standard InputFormat
specified by the user.
Once the input file is selected, the split phase reads the input data and divides it into
smaller chunks. The split chunks are then given to the mapper.
The map operations extract the relevant data and generate intermediate key, value
pairs. The mapper reads input data from a split using a record reader and generates
intermediate results. It is used to transform the input key, value list into an output
key, value list, which is then passed to the combiner.
The combiner is used between the mapper and the reducer to reduce the volume of
data transferred. It is also known as a semi-reducer; it accepts input from the mapper and
passes the output key, value pairs to the reducer.


The shuffle and sort are components of the reducer. Shuffling is the process of
partitioning and moving the mapped output to the reducers, where the intermediate keys
are assigned to a reducer. Each partition is called a subset, and each subset becomes input
to a reducer. In general, the shuffle phase ensures that the partitioned splits reach the
appropriate reducers; each reducer uses the HTTP protocol to retrieve its own partition
from the mappers.
The sort phase is responsible for sorting the intermediate keys on a single node
automatically before they are presented to the reducer. The shuffle and sort phases occur
simultaneously, with the mapped output being fetched and merged as it arrives.
The reducer reduces the set of intermediate values that share a key to a smaller set of
values. The reducer uses the sorted input to generate the final output. The final output is
written by the reducer, using a record writer, into an output file with the standard output
format.

Fig. 1.15.2 Different phases of execution in map reduce

The final output of each map reduce program consists of key, value pairs written to an
output file, which is written back to the HDFS store.
As an example, the word count process using map reduce, with all the phases of
execution, is illustrated in Fig. 1.15.3.


Fig. 1.15.3 : Word count process using map reduce
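
To make the figure concrete, a minimal word count mapper and reducer are sketched below using the classic org.apache.hadoop.mapred API, the same style as the class skeletons shown later in section 1.15.5. The class names and the whitespace-based tokenization are illustrative choices rather than part of the original example.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper : emits an intermediate (word, 1) pair for every word in the input line
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }
}

// Reducer : sums the counts received for each word
// (in practice each public class would live in its own .java file)
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));   // final (word, count) pair
    }
}

The shuffle and sort phases described above sit between these two classes : they group all the (word, 1) pairs with the same key and present them to the reducer as a single iterator.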

1.15.3 Input Splitting


The input to map reduce is provided by an input file, which may have an arbitrary
format and resides on HDFS. The input and output formats of the files define how the
input files are split and how the output files are written. The common input and output
formats are TextInputFormat/TextOutputFormat, which read or write lines of text in files;
SequenceFileInputFormat/SequenceFileOutputFormat, which read or write sequence files
that can be fed as input to other map reduce jobs; KeyValueTextInputFormat, which
parses lines into key, value pairs; and SequenceFileAsBinaryOutputFormat, which writes
keys and values to a sequence file in binary format.
Once the input format is selected, the next operation is to define the input splits that
break the file up into tasks. An input split describes a unit of work that comprises at least
a single map task. The map and reduce functions are then applied to the data sets,
collectively called a job, which has several tasks. Input splits are often mapped to HDFS
blocks; the default size of an HDFS block is 64 MB, and the user can define the split size
manually. Each split is processed by an independent map function, i.e. we can say that an
input split is the data processed by an individual mapper. Basically, each split contains a
number of records with key, value pairs. The tasks are processed according to split size,
with the largest one processed first. The minimum split size is usually 1 byte. The user
can define a split size greater than the HDFS block size, but it is not always preferred. The
input split performed on an input file is shown in Fig. 1.15.4, where a record reader is
used to read the records from the input split and give them to the mapper.


Fig. 1.15.4 : Input split performed on Input file

The split size is calculated by the computeSplitSize() method provided by
FileInputFormat. Each input split is associated with a record reader that loads data from
the source and converts it into the key, value pairs defined by the InputFormat.
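
As a rough illustration of that calculation, the rule applied by computeSplitSize() in the newer mapreduce API is splitSize = max(minSize, min(maxSize, blockSize)). The small sketch below reproduces this rule with the default values mentioned in the text; the class name and values are purely illustrative.

public class SplitSizeSketch {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // HDFS block size assumed in the text : 64 MB
        long minSize   = 1L;                  // default minimum split size : 1 byte
        long maxSize   = Long.MAX_VALUE;      // effectively no maximum by default

        // splitSize = max(minSize, min(maxSize, blockSize)) -> equals the block size here
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println("split size = " + splitSize + " bytes");
    }
}

Raising the minimum above the block size, or lowering the maximum below it, is how a user-defined split size larger or smaller than the HDFS block is obtained.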

1.15.4 Map and Reduce Functions


The map function processes the data and generates intermediate key, value pair
results, while the reduce function merges all the intermediate values associated with a
particular key to generate the output.
In the map phase, the elements of the data list are provided one at a time to the
mapping function. It passes each input element through the map function to generate
output data elements, as shown in Fig. 1.15.5.
The reduce function receives an iterator over input values from the mapper's output
list and aggregates those values together to return a single output value. For the reducer,
the input list is the mapper's output list, as shown in Fig. 1.15.6.

Fig. 1.15.5 : Mapper function


Fig. 1.15.6 : Reducer function


The input to the mapper is defined by the input format, while the output generated by
the reducer is written in the specified output format. If the user does not specify the input
and output formats, the Text format is selected by default.

1.15.5 Input and Output Parameters


As we have seen, the map reduce framework operates on key, value pairs. It accepts
the input to the job as a set of key, value pairs and produces a set of key, value pairs as
the output of the job, as shown in Fig. 1.15.7.

Fig. 1.15.7 : Input output Key value pair


The input to the mapper is defined by the various input formats, which validate the
input of the job, split the input files into logical splits and provide record readers for
processing, while the output format validates the output of the job and provides a record
writer to write the output into the output file of the job.
The syntax of the mapper and reducer classes in Java is given below :

Mapper Class :

public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        ……………………
        ……………………
    }
}

Reducer Class :

public class MyReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        ……………………
        ……………………
    }
}


The input to the mapper is the input key and input value, together with an
OutputCollector and a Reporter; the OutputCollector collects the key, value pairs emitted
by the mapper or reducer, while the Reporter is used to report progress and to update
counters and status information of the map reduce job. The input to the reducer is a key
together with the iterator over the set of values associated with it. (In the newer
org.apache.hadoop.mapreduce API, the OutputCollector and Reporter are replaced by a
single Context object that gives information about the current task.)
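
For completeness, a minimal driver sketch is given below, showing how mapper and reducer classes such as the MyMapper and MyReducer skeletons above can be wired into a job using the same classic org.apache.hadoop.mapred API. The job name, the optional combiner and the use of command line arguments for the input and output paths are illustrative choices, not requirements.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJobDriver.class);
        conf.setJobName("my-mapreduce-job");

        // Wire in the mapper and reducer classes sketched above
        conf.setMapperClass(MyMapper.class);
        conf.setReducerClass(MyReducer.class);
        // Optionally reuse the reducer as a combiner to cut shuffle traffic
        conf.setCombinerClass(MyReducer.class);

        // (key, value) types produced as the job output
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Input and output paths on HDFS, taken here from the command line
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // submits the job and waits for completion
    }
}

Because no input or output format is set explicitly, the Text formats mentioned above are used by default.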

Summary
• Due to massive digitalization, a large amount of data is being generated by the web
applications and social networking sites that many organizations run on the
internet; such data is called Big data.
• In big data, the data is generated in many formats : structured, semi-structured or
unstructured.
• Structured data has a fixed pattern or schema and can be stored and managed using
tables in an RDBMS. Semi-structured data does not have a pre-defined structure or
pattern; it includes scientific or bibliographic data, which can be represented using
graph data structures. Unstructured data also does not have a standard structure,
pattern or schema.
• The processing of big data using a traditional database management system is very
difficult because of its four characteristics, called the 4 Vs of Big data : Volume,
Variety, Velocity and Veracity.
• The first Big Data challenge came into the picture at the US Census Bureau in 1880,
where information concerning approximately 60 million people had to be collected,
classified and reported; that process took more than 10 years to complete.
• Apache Hadoop is the open source framework for solving the Big data problem.
• The common Big data use cases are business intelligence by querying, reporting and
searching the datasets; using big data analytics tools for report generation, trend
analysis, search optimization and information retrieval; and performing predictive,
prescriptive and descriptive analytics.
• The popular applications of Big data analytics are fraud detection, data profiling,
clustering, price modelling and recommendation systems.
Two Marks Questions with Answers [Part A - Questions]
Q.1 Define Big data and also enlist the advantages of Big data analytics.
Ans. : Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight
and decision making.

The major advantages of Big data analytics are :


• Supports both real-time and batch data processing
• Supports huge volumes of data generated at any velocity
• Reduces capital and operational cost
• Does not need high-end servers, as it can run on commodity hardware
• Supports both structured and unstructured data
• Supports high-performance and scalable analytical operations
• Offers a simple programming model for scalable applications
• Can process uncleansed or uncertain data

Q.2 What are the characteristics of Big data applications ?


Ans. : Big data can be described by the following characteristics :
1. Volume - The quantity of data that is generated is very important in this context. It
is the size of the data which determines the value and potential of the data under
consideration and whether it can actually be considered Big Data or not. The name
'Big Data' itself contains a term which is related to size and hence the characteristic.
2. Variety - The next aspect of Big Data is its variety. This means that the category to
which Big Data belongs is also an essential fact that needs to be known by the data
analysts. This helps the people who are closely analyzing the data, and are
associated with it, to use the data effectively to their advantage, thus upholding
the importance of the Big Data.
3. Velocity - The term 'velocity' in the context refers to the speed of generation of data
or how fast the data is generated and processed to meet the demands and the
challenges which lie ahead in the path of growth and development.
4. Variability - This is a factor which can be a problem for those who analyze the data.
This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
5. Veracity - The quality of the data being captured can vary greatly. Accuracy of
analysis depends on the veracity of the source data.
6. Complexity - Data management can become a very complex process, especially
when large volumes of data come from multiple sources. These data need to be
linked, connected and correlated in order to grasp the information that is supposed
to be conveyed by them. This situation is therefore termed the 'complexity' of
Big Data.


Q.3 What are Features of HDFS ? AU : May 17


Ans. : The features of HDFS are given as follows :

• It is suitable for distributed storage and processing.
• It provides streaming access to file system data.
• It is optimized to support high-throughput streaming read operations rather than
random access.
• It supports file operations such as read, write, delete and append, but not in-place
update.
• It provides Java APIs and command line interfaces to interact with HDFS.
• It provides different file permissions and authentications for files on HDFS.
• It provides continuous monitoring of name nodes and data nodes based on
continuous "heartbeat" communication from the data nodes to the name node.
• It provides rebalancing of data nodes so as to equalize the load by migrating
blocks of data from one data node to another.
• It uses checksums and digital signatures to manage the integrity of data stored in a
file.
• It has built-in metadata replication so as to recover data after a failure or to
protect against corruption.
• It also provides synchronous snapshots to facilitate rollback after a failure.
Q.4 What are the features provided by the Map-Reduce programming model ? AU : Nov.-18
Ans. : The different features provided by map reduce are explained as follows :

• Synchronization : Map reduce supports the execution of concurrent tasks. When
concurrent tasks are executed, they need synchronization. Synchronization is
provided by reading the state of each map reduce operation during execution and
using shared variables for it.
• Data locality : In map reduce, although the data resides on different clusters, it
appears local to the user's application. To obtain the best results, the code and the
data of an application should reside on the same machine.
• Error handling : The map reduce engine provides different fault tolerance
mechanisms in case of failure. When tasks are running on different cluster nodes
and a failure occurs, the map reduce engine finds the incomplete tasks and
reschedules them for execution on different nodes.
• Scheduling : Map reduce involves map and reduce operations that divide large
problems into smaller chunks, which are run in parallel by different machines. So
different tasks need to be scheduled on computational nodes on a priority basis,
which is taken care of by the map reduce engine.

Q.5 What are the different phases of execution in map-reduce programming ?


Ans. : Every map reduce program undergoes different phases of execution. Each phase
has its own significance in the map reduce framework. The different phases of execution
in map reduce are shown in Fig. 1.1 and explained as follows :
In the input phase, a large data set in the form of <key, value> pairs is provided as the
standard input for the map reduce program. The input files used by map reduce are kept
on the HDFS (Hadoop Distributed File System) store and have a standard InputFormat
specified by the user. The map operations extract the relevant data and generate
intermediate key, value pairs. The mapper reads input data from a split using a record
reader and generates intermediate results. It is used to transform the input key, value list
into an output key, value list, which is then passed to the combiner.
The combiner is used between the mapper and the reducer to reduce the volume of
data transferred. It is also known as a semi-reducer; it accepts input from the mapper and
passes the output key, value pairs to the reducer.

Fig. 1.1 : Different phases of execution in map reduce

The shuffle and sort are components of the reducer. Shuffling is the process of partitioning and moving the mapped output to the reducers, where the intermediate keys are assigned to a reducer. Each partition is called a subset, and each subset becomes input to a reducer. In general, the shuffle phase ensures that the partitioned splits reach the appropriate reducers, where each reducer uses the HTTP protocol to retrieve its own partition from the mappers. The sort phase is responsible for sorting the intermediate keys on a single node automatically before they are presented to the reducer. The shuffle and sort phases occur simultaneously, where the mapped output is being fetched and merged.
The reducer reduces the set of intermediate values which share a key to a smaller set of values. The reducer uses the sorted input to generate the final output. The final output is written by the reducer using a record writer into an output file with a standard output format. The final output of each map reduce program is generated as key value pairs written to an output file, which is written back to the HDFS store.
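The flow of these phases can be imitated outside Hadoop with a few lines of ordinary R. The sketch below only illustrates the programming model on an in-memory word count (the input lines and function names are made up for this example); it is not Hadoop code.

# Word-count sketch of the map -> shuffle/sort -> reduce flow (plain R, not Hadoop)
input_lines <- c("big data is big", "data analytics on big data")   # hypothetical input split

map_phase <- function(lines) {                 # map : emit an intermediate pair (word, 1) per word
  words <- unlist(strsplit(lines, "\\s+"))
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
shuffle_sort <- function(pairs) {              # shuffle and sort : group intermediate pairs by key
  split(pairs$value, pairs$key)                # one partition per key, keys in sorted order
}
reduce_phase <- function(groups) {             # reduce : aggregate the list of values of each key
  sapply(groups, sum)
}

result <- reduce_phase(shuffle_sort(map_phase(input_lines)))
print(result)    # analytics = 1, big = 3, data = 3, is = 1, on = 1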
Part - B Questions
Q.1 Explain Big data characteristics along with their use cases. AU : May-17
Q.2 What are the 4 V's of Big data ? Also explain best practices for Big data analytics.
Q.3 Explain the general overview of the high-performance architecture for Big data.
Q.4 Explain architecture of Hadoop along with components of High-performance
architecture of Big data.
Q.5 Explain the functionality of map-Reduce Programming model. AU : May-17
Q.6 Explain the functionality of HDFS and map-reduce in detail . AU : Nov.-18




UNIT - II

2 Clustering and Classification

Syllabus
Advanced Analytical Theory and Methods : Overview of Clustering - K-means - Use Cases - Overview of the Method - Determining the Number of Clusters - Diagnostics - Reasons to Choose and Cautions - Classification : Decision Trees - Overview of a Decision Tree - The General Algorithm - Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R - Naïve Bayes - Bayes' Theorem - Naïve Bayes Classifier.

Contents
2.1 Overview of Clustering
2.2 K-means Clustering
2.3 Use Cases of K-means Clustering
2.4 Determining the Number of Clusters
2.5 Diagnostics
2.6 Reasons to Choose and Cautions
2.7 Classification
2.8 Decision Tree Algorithms
2.9 Evaluating Decision Tree
2.10 Decision Tree in R
2.11 Baye’s Theorem
2.12 Naive Bayes Classifier
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions

(2 - 1)

2.1 Overview of Clustering


Clustering is one of the most popular exploratory data analysis techniques, which is used to get deeper insights into the data. It can be defined as the task of classifying data into subgroups where data points in the same subgroup (cluster) are very similar
and data points in other clusters are different. In Big data analytics, clustering plays an
important role in classifying different objects intended for finding the hidden structure in
a data. Consider a situation wherein you need to analyze a set of data objects, and unlike
classification the class label of each object is unknown. This condition occurs mainly in
the case of large databases, where the process of defining class labels to a large number of
objects is costly.
Clustering is the process of collecting and grouping similar data into classes or
clusters. In other words, clustering is a process in which similar data is grouped into
classes or clusters so that the objects within the same cluster or class have high similarity
with respect to the dissimilar objects in another cluster or group. Often, the dissimilarities
are evaluated on the basis of the attributes that describe an object. These attributes are
also known as distance measures. The clustering is an unsupervised learning technique to
group the similar objects. It is often used for exploratory analysis of data where
predictions can’t be made. It is specifically used to find the similarity between the objects
based on their attributes. The similar objects are placed into a group called a cluster. Each object is called a data point; data points that lie in the same cluster are similar to each other, while points from different clusters are dissimilar in nature. In the next section, we are going to study one of the most popular clustering algorithms, called “K-means clustering”.

2.2 K-means Clustering


K-means clustering is one of the simplest and popular unsupervised machine learning
algorithms. The K-means algorithm tries to partition the dataset into K pre-defined
distinct non-overlapping subgroups (clusters) in iterative manner where each data point
belongs to only one group.
Initially define a target number K, which refers to the number of centroids you need in
the dataset. A centroid is the imaginary or real location representing the center of the
cluster. The K-means algorithm identifies the K centroids, and then allocates every data point to the nearest centroid, while keeping the clusters as compact as possible.
To process the learning data, the K-means algorithm in data mining starts with a first
group of randomly selected centroids, which are used as the beginning points for every


cluster, and then performs iterative (repetitive) calculations to optimize the positions of
the centroids. It halts after creating and optimizing clusters when either the centroids
have stabilized i.e. there is no change in their values because the clustering has been
successful or the defined number of iterations has been achieved. This concept is
represented in terms of algorithm as below.

Algorithmic Steps for K-Means Clustering

Step 1 : Let X = {X1,X2,…,Xn} be the set of data points.

Step 2 : Arbitrarily select ‘K’ cluster centers denoted as C1,C2,...,Ck.

Step 3 : Calculate the distance between each data point and the cluster centers by using any distance measure.

Step 4 : Assign the data points to the cluster center whose distance from the cluster
center is minimum with respect to other cluster centers.

Step 5 : Recalculate each cluster center as the mean of the data points assigned to it.

Step 6 : Repeat from step 3 till there is no change in the cluster center
The pictorial representation of K-means algorithm is shown in Fig. 2.2.1.

Fig. 2.2.1 Pictorial representation of K-means algorithm
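The steps above translate almost line for line into R. The following bare-bones sketch implements them for one-dimensional data; it is meant only as an illustration (the function name simple_kmeans and its arguments are invented here), not as a replacement for R's built-in kmeans() function.

# A bare-bones K-means for a numeric vector x and a vector of initial centers
simple_kmeans <- function(x, centers, max_iter = 100) {
  for (iter in 1:max_iter) {
    # Steps 3 and 4 : assign every data point to its nearest cluster center
    assignment <- sapply(x, function(p) which.min(abs(p - centers)))
    # Step 5 : recompute every cluster center as the mean of its assigned points
    # (the sketch assumes no cluster ever becomes empty)
    new_centers <- sapply(seq_along(centers), function(k) mean(x[assignment == k]))
    # Step 6 : stop when the cluster centers no longer change
    if (all(new_centers == centers)) break
    centers <- new_centers
  }
  list(centers = centers, cluster = assignment)
}
simple_kmeans(c(1, 2, 9, 10), centers = c(1, 9))   # returns centers 1.5 and 9.5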

2.2.1 One Dimensional Problem for K-means Clustering


Let, X = {2,3,4,10,11,12,20,25,30} be the set of data points. Arbitrarily select K = 2 cluster
centers C1 = 4 and C2 = 12.


Calculate the distance between each data point from X with the cluster centers
C1 = 4 and C2 = 12 by using Euclidean distance.

The Euclidean distance can be calculated by the formula
d(p, q) = d(q, p) = √((q1 − p1)² + (q2 − p2)² + ... + (qn − pn)²) = √(Σ_{i=1}^{n} (qi − pi)²)
For this one-dimensional example, d(i, m) = √((C_i − X_m)²) = |C_i − X_m| for i = 1, 2 and m = 1, 2, …, 9.
So, for cluster 1 and 2 the distances are calculated as shown in Table 2.2.1.

Cluster centers    Data points :   2    3    4    10   11   12   20   25   30
C1 = 4                             2    1    0    6    7    8    16   21   26
C2 = 12                            10   9    8    2    1    0    8    13   18

Table 2.2.1 : Distance of data points with cluster centers C1 and C2 for iteration 1
From the above Table 2.2.1, the data points are clustered in accordance with the
minimum distance from the cluster centers. So, the cluster center C1=4 has the data points
{2,3,4} and the cluster centers C2=12 has the data points {10,11,12,20,25,30}. As per Step 4
of algorithm, we have assigned the data points to the cluster center whose distance from
the cluster center is minimum with respect to other cluster centers.
Now, calculate the new cluster center for each cluster using the mean :
Mean = (1/n) Σ_{i=1}^{n} X_i
So, for data points {2,3,4}, mean = 3 which is the new cluster center C1 = 3 while for
data points {10,11,12,20,25,30}, mean = 18, which is the new cluster center C2 = 18. Now we
have to repeat the same steps till there is no change in the cluster center.
Cluster centers    Data points :   2    3    4    10   11   12   20   25   30
C1 = 3                             1    0    1    7    8    9    17   22   27
C2 = 18                            16   15   14   8    7    6    2    7    12

Table 2.2.2 : Distance of data points with cluster centers C1 and C2 for iteration 2

As per the above Table 2.2.2, the cluster center C1=3 clusters data points {2,3,4,10} while
the cluster center C2=18 clusters data points {11,12,20,25,30}.
Now, calculate the new cluster center for each cluster using the mean.
So, for the data points {2,3,4,10}, mean = 4.75, which is the new cluster center C1 = 4.75, while for the data points {11,12,20,25,30}, mean = 19.6, which is the new cluster center C2 = 19.6. Now we have to repeat the same steps till there is no change in the cluster centers.

Cluster centers    Data points :   2      3      4      10     11     12     20     25     30
C1 = 4.75                          2.75   1.75   0.75   5.25   6.25   7.25   15.25  20.25  25.25
C2 = 19.6                          17.6   16.6   15.6   9.6    8.6    7.6    0.4    5.4    10.4

Table 2.2.3 : Distance of data points with cluster centers C1 and C2 for iteration 3

As per Table 2.2.3, the data point 12 now moves to the first cluster (7.25 < 7.6), so the cluster center C1 = 4.75 gets the data points {2,3,4,10,11,12} and the cluster center C2 = 19.6 gets the data points {20,25,30}. Recomputing the means gives the new cluster centers C1 = 7 and C2 = 25. Repeating the assignment step with C1 = 7 and C2 = 25 produces exactly the same two clusters, so there is no further change and the final cluster centers are C1 = 7 and C2 = 25.
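The same result can be cross-checked with R's built-in kmeans() function. The sketch below reruns the example with the same initial centers and the classical Lloyd iteration described above.

# Cross-check the one dimensional example with R's built-in kmeans()
x   <- c(2, 3, 4, 10, 11, 12, 20, 25, 30)                    # data points of the example
fit <- kmeans(x, centers = c(4, 12), algorithm = "Lloyd")    # initial centers C1 = 4, C2 = 12
fit$centers        # final cluster centers : 7 and 25 for this data
fit$cluster        # cluster membership of every data point
fit$tot.withinss   # total within-cluster sum of squares (WSS)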

2.3 Use Cases of K-means Clustering


K-means clustering has wide range of applications. Some of the popular use cases are
listed below.
1. Document clustering 6. Fraud detection in insurance
2. Delivery store optimization 7. Rideshare data analysis
3. Identifying crime localities 8. Cyber-profiling criminals
4. Customer segmentation 9. Call record detail analysis
5. Fantasy league stat analysis 10. Automatic clustering of IT alerts
From above listed use cases, four popular use cases are explained below

1. Document clustering :

Document clustering refers to unsupervised classification (categorization) of


documents into groups (clusters) in such a way that the documents in a cluster are
similar, whereas documents in different clusters are dissimilar. The documents may be
web pages, blog posts, news articles, or other text files.


2. Identifying crime localities :


Crime analysis and prevention is a systematic approach for identifying and analyzing
patterns and trends in crime. Even though we cannot predict who the victims of a crime may be, we can predict the places that have a high probability of its occurrence. The K-means algorithm partitions data into groups based on their means. K-means
algorithm has an extension called expectation - maximization algorithm where we
partition the data based on their parameters. The following are the steps to demonstrate
crime localities.

Step 1 : Create a new server on the web hosting sites available

Step 2 : Create two databases; one for storing the details of the authorized user and
the other for storing details of the crime occurring in a particular location

Step 3 : The data can be added to the database using SQL queries

Step 4 : Create PHP scripts to add and retrieve data.

Step 5 : The PHP file to retrieve data converts the database in the JSON format.

Step 6 : This JSON data is parsed from the android so that it can be used.

Step 7 : The location added by the user from the android device is in the form of
address which is converted in the form of latitudes and longitude that is further added to
the online database.

Step 8 : The added locations are marked on the Google map.

Step 9 : The various crime types used are Robbery, Kidnapping, Murder, Burglary
and Rape. Each crime type is denoted using a different color marker.

Step 10 : The crime data plotted on the maps is passed to the K - means algorithm.

Step 11 : The data set is divided into different clusters by computing the distance of
the data from the centroid repeatedly.

Step 12 : A different colored circle is drawn for different clusters by taking the
centroid of the clusters as the center where the color represents the frequency of the crime

Step 13 : This entire process of clustering is also performed on each of the crime types
individually.


Step 14 : In the end, a red colored circle indicates the location where safety measures
must be adopted.
From the clustered results it is easy to identify crime prone areas and can be used to
design precaution methods for future. The classification of data is mainly used to
distinguish types of preventive measures to be used for each crime. Different crimes
require different treatment and it can be achieved easily using this application. The
clustering technique is effective in terms of analysis speed, identifying common crime
patterns and crime prone areas for future prediction. The developed application has
promising value in the current complex crime scenario and can be used as an effective
tool by Indian police and enforcement of law organizations for crime detection and
prevention.

3. Cyber-profiling criminals :
The activities of Internet users are increasing from year to year and have had an
impact on the behavior of the users themselves. Assessment of user behavior is often only
based on interaction across the Internet without knowing any others activities. The log
activity can be used as another way to study the behavior of the user. The Log Internet
activity is one of the types of big data so that the use of data mining with K-Means
technique can be used as a solution for the analysis of user behavior. This study is the
process of clustering using K-Means algorithm which divides into three clusters, namely
high, medium, and low. The cyber profiling is strongly influenced by environmental
factors and daily activities. For investigation, the cyber-profiling process makes a good contribution to the field of forensic computer science. Cyber profiling is one of the efforts
to know the alleged offenders through the analysis of data patterns that include aspects of
technology, investigation, psychology, and sociology. Cyber Profiling process can be
directed to the benefit of:
1. Identification of users of computers that have been used previously.
2. Mapping the subject of family, social life, work, or network-based organizations,
including those for whom he/she worked.
3. Provision of information about the user regarding his ability, level of threat, and
how vulnerable to threats to identify the suspected abuser
Criminal profiles generated in the form of data on personal traits, tendencies, habits,
and geographic-demographic characteristics of the offender (for example: age, gender,
socio-economic status, education, origin place of residence). Preparation of criminal
profiling will relate to the analysis of physical evidence found at the crime scene, the
process of extracting the understanding of the victim (victimology), looking for a modus


operandi (whether the crime scene was planned or unplanned), and the process of tracing what the perpetrators deliberately left behind (their signature).

4. Fraud detection in Insurance :


Insurance industry is one of the most important issues in both economy and human
being life in modern societies which awards peace and safety to the people by
compensating the financial risk of detriments and losses. This industry, like others,
requires choosing some strategies to obtain desired ranking and remain in competitive
market. One of efficient factors which affects enormous decision makings in insurance is
paying attention to important information of customers and bazar that each insurance
company stores in its own database. But with the daily increase of data in databases, hidden knowledge and pattern discovery using the usual statistical methods has become complicated, time-consuming and practically impossible to achieve. Data mining is a powerful
approach for extracting hidden knowledge and patterns on massive data to guide
insurance industry. For example, one of the greatest deleterious challenges here is
interacting between insurance companies and policyholders which create a feasible
situation for fraudulent claims. Due to importance of this issue, after investigating
different ways of fraudulent crimes in insurance, we use K-Means clustering technique to
find fraud patterns in automobile insurance include body and third-party. Our
experimental results indicate a high accuracy when have been compared with statistical
information extracted from data sets. Outcomes show significant relations among efficient
factors in similar fraud cases.

2.4 Determining the Number of Clusters


One of the important issues of K-means is to find the appropriate number of clusters in
a data set i.e. to specify the number of clusters “K” to be generated. The optimal number
of clusters is somehow subjective and depends on the method used for measuring
similarities and the parameters used for partitioning. These methods include direct
methods and statistical testing methods :
1. Direct methods which consists of optimizing a criterion, such as the within cluster
sums of squares or the average silhouette. The corresponding methods are named
as elbow and silhouette methods, respectively.
2. Statistical testing methods which consists of comparing evidence against null
hypothesis. An example is the gap statistic.


2.4.1 Elbow Method


The Elbow method looks at the total within-cluster sum of square (WSS) as a function
of the number of clusters. One should choose a number of clusters such that adding another cluster does not substantially improve the total WSS. The total WSS measures the
compactness of the clustering and we want it to be as small as possible. WSS is the sum of
squares used to determine the optimal value of K. The formula to find WSS is given as
WSS = i=1
m
d(pi,q(i))2

Where pi represents data point and q(i) represents the cluster center
The optimal number of clusters can be defined as follows
1. Compute clustering algorithm for different values of K. For instance, by varying K
from 1 to 10 clusters.
2. Calculate the within-cluster sum of squared errors (WSS) for different values of K, and choose the K at which the decrease in WSS first starts to diminish. The squared error for each point is the square of the distance of the point from its cluster center.
 The WSS score is the sum of these Squared Errors for all the points.
 Any distance metric like the Euclidean Distance or the Manhattan Distance can be
used.
3. Plot the curve of WSS according to the number of clusters K as shown in Fig. 2.4.1.
4. Location of a bend (knee) in the plot is generally considered as an indicator of the
appropriate number of clusters.
5. In the plot of WSS-versus-K, this is visible as an elbow.

Fig. 2.4.1 Elbow method


From Fig. 2.4.1, we conclude that the elbow point is found at K = 3. So for this problem, the optimal number of clusters would be 3.
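A minimal sketch of the elbow method in R, assuming x is a numeric matrix of observations (the data generated below is purely synthetic), could look as follows :

# Elbow method sketch : total WSS for K = 1, ..., 10
set.seed(42)
x   <- matrix(rnorm(200), ncol = 2)          # hypothetical two-dimensional data set
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares (WSS)")
# The K at which the curve bends (the elbow) is taken as the number of clusters.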

2.4.2 Average Silhouette Method


The Silhouette method is used for interpretation and validation of consistency within
clusters of data. It provides graphical representation of how well each object has been
classified. It measures the quality of a clustering. That is, it determines how well each
object lies within its cluster. A high average silhouette width indicates a good clustering.
Average silhouette method computes the average silhouette of observations for different
values of K. The optimal number of clusters K is the one that maximize the average
silhouette over a range of possible values for K. The algorithm can be computed as
follow :
1. Compute the clustering algorithm (e.g., K-means clustering) for different values of K. For instance, by varying K from 2 to 10 clusters (at least two clusters are needed to compute a silhouette).
2. For each K, calculate the average silhouette of the observations.
3. Plot the curve of the average silhouette width according to the number of clusters K.
4. The location of the maximum is considered as the appropriate number of clusters.
The silhouette value measures the similarity of a data point to its own cluster (cohesion) compared to the other clusters (separation).
The range of the Silhouette value is between +1 and – 1. A high value is desirable and
indicates that the point is placed in the correct cluster. If many points have a negative
Silhouette value, it may indicate that we have created too many or too few clusters.
The silhouette value s(i) for each data point i is defined as follows :
s(i) = (b(i) − a(i)) / max{a(i), b(i)}, if |C_i| > 1
and s(i) = 0, if |C_i| = 1
s(i) is defined to be equal to zero if i is the only point in the cluster. This is to prevent the number of clusters from increasing significantly with many single-point clusters.
Here, a(i) is the measure of similarity of the point i to its own cluster. It is measured as the average distance of i from the other points in the same cluster.
For each data point i ∈ C_i (data point i in the cluster C_i), let
a(i) = (1 / (|C_i| − 1)) Σ_{j ∈ C_i, j ≠ i} d(i, j)


Similarly, b(i) is the measure of dissimilarity of i from the points in the other clusters.
For each data point i ∈ C_i, we now define
b(i) = min_{C_k ≠ C_i} (1 / |C_k|) Σ_{j ∈ C_k} d(i, j)
d(i, j) is the distance between points i and j. Generally, the Euclidean distance is used as the distance metric.
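A corresponding sketch for the average silhouette method, using the silhouette() function from the cluster package (again on synthetic data), is shown below :

# Average silhouette method sketch using the 'cluster' package
library(cluster)                                   # provides silhouette()
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)                  # hypothetical data set
avg_sil <- sapply(2:10, function(k) {              # a silhouette needs at least two clusters
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(x))           # s(i) for every data point
  mean(sil[, "sil_width"])                         # average silhouette width for this K
})
plot(2:10, avg_sil, type = "b",
     xlab = "Number of clusters K", ylab = "Average silhouette width")
# The K that maximizes the average silhouette width is chosen.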

2.4.3 Gap Statistic Method


The gap statistic approach can be applied to any clustering method. The gap statistic
compares the total within intra-cluster variation for different values of K with their
expected values under a null reference distribution of the data. The estimate of the optimal number of clusters will be the value of K that maximizes the gap statistic. This means that the clustering structure is far away from a random uniform distribution of points.
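In R, the gap statistic is available, for example, through the clusGap() function of the cluster package; a brief sketch on synthetic data (the parameter values are arbitrary) is given below.

# Gap statistic sketch using cluster::clusGap()
library(cluster)
set.seed(42)
x   <- matrix(rnorm(200), ncol = 2)                          # hypothetical data set
gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
gap                                                          # gap value for every K
plot(gap)                                                    # the K maximizing the gap statistic is chosen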

2.5 Diagnostics
The WSS heuristic does not give a single answer; it provides at least several possible values of K to consider. When the number of attributes is relatively small, a common approach for further refining the chosen K is to plot the data and examine how distinct the identified clusters are. An example of distinct clusters is shown in Fig. 2.5.1 (a).

Fig. 2.5.1 (a) Example of distinct clusters Fig. 2.5.1 (b) Example of less obvious clusters

To judge how distinct the identified clusters are, the following three questions need to be considered :
Q 1) Are the clusters well separated from each other ?
Q 2) Does any cluster have only a few points ?
Q 3) Do any of the centroids appear to be too close to each other ?
Depending on the answers, the same kind of plot may instead reveal less obvious clusters, as shown in Fig. 2.5.1 (b).


2.6 Reasons to Choose and Cautions


As we know that, K-means is a simplest method for defining clusters, where clusters
and their associated centroids are identified for easy assignment of new objects. Each new
object has its own distance from the closest centroid. But, as this method is unsupervised, some considerations have to be looked at by the practitioner, such as the type of object attributes included in the analysis, the unit of measure for each attribute, rescaling of attributes so that no attribute has a disproportionate effect on the results, and several additional considerations which are explained below.
The first consideration is analysis of object attribute, where it is important to
understand what attributes will be known at the time a new object is assigned to a cluster
rather than which object attributes are used in the analysis. Usually, Data Scientist may
have a choice of a dozen or more attributes to use in the clustering analysis. But whenever
possible they reduce the number of attributes in clustering analysis as too many attributes
can minimize the impact of the most important variables. So, one must identify the highly
correlated attributes and use only one or two of the correlated attributes in the clustering
analysis.
The second consideration is Units of Measure, where each attribute must be defined
with some or other units of measure (like gram or kilogram for weight and meters or
centimeters for a patient’s height). So, all the attributes in a given cluster must have same
unit of measure. The dissimilar units of measure for a different attribute may result into
inconsistency in results.
The third consideration is rescaling, where attributes used in a clustering analysis can differ in magnitude from the other attributes; attributes with large magnitudes can dominate the distance calculation unless they are rescaled. With the rescaled
attributes, the borders of the resulting clusters fall somewhere between the two earlier
clustering analyses. Some practitioners also subtract the means of the attributes to center
the attributes around zero. However, this step is unnecessary because the distance
formula is only sensitive to the scale of the attribute, not its location.
The additional consideration includes, defining the starting positions of the initial
centroid as it is sensitive for working of K-means algorithm. Thus, it is important to rerun
the k-means analysis several times for a particular value of K to ensure the cluster results
provide the overall minimum WSS.
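As a small illustration of the rescaling and rerunning considerations above, the following R sketch standardizes two attributes of very different magnitude and lets kmeans() restart from several random initial centroids (the data frame and its columns are invented for the example) :

# Rescale attributes and rerun K-means from several random starts
set.seed(42)
df <- data.frame(height_cm = rnorm(100, 170, 10),      # hypothetical attributes on
                 weight_kg = rnorm(100, 70, 12))       # very different scales
df_scaled <- scale(df)                                 # center to mean 0, rescale to unit variance
fit <- kmeans(df_scaled, centers = 3, nstart = 25)     # nstart = 25 random restarts
fit$tot.withinss                                       # overall minimum WSS among the restarts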

2.7 Classification
In machine learning and statistics, classification is a supervised learning approach in
which the computer program learns from the data input given to it and then uses this

learning to classify new observations. Classification is a technique to categorize data into a desired and distinct number of classes where we can assign a label to each class. Applications of classification are speech recognition, handwriting recognition, biometric identification, document classification, etc. A binary classifier classifies two distinct classes (two possible outcomes), while a multi-class classifier classifies more than two distinct classes. The different types of classification algorithms in machine learning are :
1. Linear Classifiers : Logistic Regression, Naive Bayes Classifier
2. Nearest Neighbor
3. Support Vector Machines
4. Decision Trees
5. Boosted Trees
6. Random Forest
7. Neural Networks

2.7.1 Overview of Decision Tree Classifier


Given a data of attributes together with its classes, a decision tree produces a sequence
of rules that can be used to classify the data. A decision tree, as its name says, makes decisions with a tree-like model. It splits the samples into one or more similar sets (leaves) based on the most significant differentiators in the input attributes. To choose a
differentiator (predictor), the algorithm considers all features and does a binary split on
them. It will then choose the one with the least cost (i.e. highest accuracy), and repeats
recursively, until it successfully splits the data in all leaves (or reaches the maximum
depth). The final result is a tree with decision nodes and leaf nodes where decision node
has two or more branches and a leaf node represents a classification or decision. The
uppermost decision node in a tree corresponds to the best predictor and is called the root
node. Decision trees can handle both categorical and numerical data.

2.7.2 Terms used in Decision Trees


1. Root node : The root node represents the entire dataset which further gets divided
into two or more standardized subsets.
2. Splitting : The Splitting is a process of dividing the nodes into two or more sub-
nodes.
3. Pruning : Pruning is the procedure of removing the sub-nodes of a decision node; it is the opposite of the splitting process.


4. Branch / sub-tree : The branch or sub-tree are the sub part of an entire tree.
5. Decision node : A decision node is generated when a sub-node is split into further sub-nodes.
6. Parent and child node : The node which is divided into sub-nodes is called the parent node and the sub-nodes are called child nodes.
7. Leaf / terminal node : The nodes which are not split further are called leaf or terminal nodes.

2.7.3 Advantages and Disadvantages of Decision Trees


The advantages of decision trees are listed below :
 Decision trees generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are capable of handling both continuous and categorical variables.
 Decision trees provide a path to find out which fields are most important for
prediction or classification.
The disadvantages of decision trees are listed below :
 Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
 Decision trees can be computationally expensive to train. The process of growing a
decision tree is computationally expensive. At each node, each candidate splitting
field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining
weights. Pruning algorithms can also be expensive since many candidate sub-trees
must be formed and compared.
 Overfitting : Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
 Decision trees are less suitable for continuous variables : while working with continuous numerical variables, a decision tree loses information when it discretizes them into categories.
 Decision trees can be unbalanced because small discrepancies in the data might
result in a completely different tree. This is called variance, which needs to be
lowered by methods like bagging and boosting.

 Greedy algorithms cannot guarantee to return the globally optimal decision tree.
This can be mitigated by training multiple trees, where the features and samples are
randomly sampled with replacement.
 Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the data set prior to fitting with the decision tree.
 Information gain in a decision tree with categorical variables gives a biased response for attributes with a greater number of categories.
 Generally, it gives lower prediction accuracy for a dataset as compared to other machine learning algorithms.
 Calculations can become complex when there are many class labels.
Basically, there are three algorithms for creating a decision tree; namely Iterative
Dichotomiser 3 (ID3), Classification And Regression Trees (CART) and C4.5. The next
section 2.8 will describe the above three decision tree algorithms in detail.

2.8 Decision Tree Algorithms


In general, the objective of a decision tree algorithm is to construct a tree T from a training set S. The general algorithm recursively constructs subtrees T1, T2, …, Tn for subsets of S until one of the following criteria is met :
 All the leaf nodes in the tree satisfy the minimum purity threshold.
 The tree cannot be further split with the preset minimum purity threshold.
 Any other stopping criterion is satisfied (such as the maximum depth of the tree).
The first step in constructing a decision tree is to choose the most informative attribute
i.e. Root node. A common way to identify the most informative attribute is to use
entropy-based methods, which is used by decision tree learning algorithms such as ID3.
The ID3 algorithm is explained in next section.

2.8.1 ID3 Algorithm


The ID3 algorithm makes use of entropy function and information gain. The ID3
method selects the most informative attribute based on two basic measures like Entropy,
which measures the impurity of an attribute and Information gain, which measures the
purity of an attribute.
Let us see an example how the decision tree is formed using ID3 algorithm. As shown
in Table 2.8.1, there are five attributes like Outlook, Temperature, Humidity, Windy and


Play Cricket. Make a decision tree that predicts whether cricket will be played on the
day.
Sr. No Outlook Temperature Humidity Windy Play cricket

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool High Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No


Table 2.8.1 : Attributes for ID3 algorithms

The play cricket is the final outcome which depends on the other four attributes. To
start with we have to choose the root node. To choose the best attribute we find entropy
which specifies the uncertainty of data and information gain represented as,
p  p  n  n 
Entropy = log2    p + n log2 p + n
p+n  p + n  
Average information can be calculated as,
p i + ni
I (Attribute) =  p+n
Entropy (A)

And the Information Gain can be calculated as :


Information Gain = Entropy(S) – I (Attribute)
The ID3 algorithm to create the decision tree will work as follows
1. Calculate the Entropy for Data-Set Entropy(S)
2. For Every Attribute do the following


a. Calculate the Entropy for all other Values Entropy(A)


b. Find Average Information Entropy for the Current attribute
c. Calculate Information Gain for the current attribute
3. Pick the Highest gain attribute
4. Repeat the above steps until we get the desired decision tree
As seen from the Table 2.8.1, the attribute play cricket has Yes/Positive (p) = 9 and
No/Negative (n) = 5. So using the formula we calculate the Data-Set Entropy.
9  9  5  5 
Entropy = log2    9 + 5 log2 9 + 5 = 0.940
9+5  9 + 5  
Next step is to calculate the entropy for each attribute. Let us start with outlook, which
has three outcomes like sunny, rainy and overcast. As seen from the Table 2.8.2 we
determine the entropy for each case.

Outlook   Play cricket        Outlook   Play cricket        Outlook    Play cricket
Sunny     No                  Rainy     Yes                 Overcast   Yes
Sunny     No                  Rainy     Yes                 Overcast   Yes
Sunny     No                  Rainy     No                  Overcast   Yes
Sunny     Yes                 Rainy     Yes                 Overcast   Yes
Sunny     Yes                 Rainy     No
Outlook p n Entropy

Sunny 2 3 0.971

Rainy 3 2 0.971

Overcast 4 0 0

Table 2.8.2 : Relation of attribute outlook with play cricket

Now we calculate the average information entropy for outlook :
I(Outlook) = [(p_sunny + n_sunny)/(p + n)] × Entropy(Outlook = Sunny) + [(p_rainy + n_rainy)/(p + n)] × Entropy(Outlook = Rainy) + [(p_overcast + n_overcast)/(p + n)] × Entropy(Outlook = Overcast)
i.e. I(Outlook) = (5/14) × 0.971 + (5/14) × 0.971 + (4/14) × 0 = 0.693
Now we calculate the information gain for the attribute outlook :
Gain = Entropy(S) − I(Outlook) = 0.940 − 0.693 = 0.247
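These hand computations can be checked with a few lines of R. The sketch below recomputes the data set entropy, I(Outlook) and the information gain from the values in Table 2.8.1 (the helper function entropy() is defined here only for illustration) :

# Entropy and information gain of the Outlook attribute (values from Table 2.8.1)
entropy <- function(labels) {                       # labels : vector of "Yes"/"No" outcomes
  p <- table(labels) / length(labels)
  -sum(ifelse(p > 0, p * log2(p), 0))
}
outlook <- c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
             "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain")
play    <- c("No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No")

entropy(play)                                       # data set entropy                = 0.940
info_outlook <- sum(sapply(split(play, outlook),
                    function(s) length(s) / length(play) * entropy(s)))
info_outlook                                        # average information I(Outlook) ≈ 0.693
entropy(play) - info_outlook                        # information gain of Outlook     = 0.247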
Now, repeat the same procedure for Temperature as shown Table 2.8.3.
Temperature   Play cricket        Temperature   Play cricket        Temperature   Play cricket
Hot           No                  Mild          Yes                 Cool          Yes
Hot           No                  Mild          No                  Cool          No
Hot           Yes                 Mild          Yes                 Cool          Yes
Hot           Yes                 Mild          Yes                 Cool          Yes
                                  Mild          Yes
                                  Mild          No

Temperature p n Entropy

Hot 2 2 1

Mild 4 2 0.918

Cool 3 1 0.811

Table 2.8.3 : Relation of attribute temperature with play cricket

Now we calculate the average information entropy for temperature :
I(Temperature) = [(p_hot + n_hot)/(p + n)] × Entropy(Temperature = Hot) + [(p_mild + n_mild)/(p + n)] × Entropy(Temperature = Mild) + [(p_cool + n_cool)/(p + n)] × Entropy(Temperature = Cool)
i.e. I(Temperature) = (4/14) × 1 + (6/14) × 0.918 + (4/14) × 0.811 = 0.911
Now we calculate the information gain for the attribute temperature :
Gain = Entropy(S) − I(Temperature) = 0.940 − 0.911 = 0.029

Now, repeat the same procedure for finding the entropy for Humidity as shown in
Table 2.8.4.


Humidity p n Entropy

High 3 4 0.985

Normal 6 1 0.591

Table 2.8.4 : Entropy of humidity

I(Humidity) = (7/14) × 0.985 + (7/14) × 0.591 = 0.788
Now we calculate the information gain for the attribute humidity :
Gain = Entropy(S) − I(Humidity) = 0.940 − 0.788 = 0.152
Now, repeat the same procedure for finding the entropy for Windy as shown in
Table 2.8.5.
Windy p n Entropy

Strong 3 3 1

Weak 6 2 0.811

Table 2.8.5 : Entropy of windy

I(Windy) = (6/14) × 1 + (8/14) × 0.811 = 0.892
Now we calculate the information gain for the attribute windy :
Gain = Entropy(S) − I(Windy) = 0.940 − 0.892 = 0.048
Finally, the attributes and their information gains are shown in Table 2.8.6. From this table we conclude that outlook has the maximum gain compared to the others, so outlook is selected as the root node for the decision tree.
Attribute Information Gain

Outlook 0.247 Selected as a Root


Temperature 0.029

Humidity 0.152

Windy 0.048

Table 2.8.6 : Information Gain for Different Attributes

So, the initial decision tree will look like Fig. 2.8.1.


Fig. 2.8.1 : Initial decision tree with root node outlook

As seen for overcast, there is only outcome “Yes” because of which further splitting is
not required and “Yes” becomes the leaf node. Whereas the sunny and rain has to be
further splitted. So a new data set is created and the process is again repeated. Now
consider the new tables for Outlook=Sunny and Outlook=Rainy as shown in Table 2.8.7.

Outlook Temperature Humidity Windy Play Cricket


Sunny Hot High Weak No
Sunny Hot High Strong No
Sunny Mild High Weak No Outlook = Sunny
Sunny Cool Normal Weak Yes
Sunny Mild Normal Strong Yes

Outlook Temperature Humidity Windy Play Cricket

Rainy Mild High Weak Yes

Rainy Cool Normal Weak Yes

Rainy Cool Normal Strong No


Outlook =
Rainy Mild Normal Weak Yes Rainy

Rainy Mild High Strong No

Table 2.8.7 : Attribute outlook with sunny and rainy

Now we solve for attribute Outlook=Sunny. As seen from the Table 2.8.8, for Outlook=
Sunny the play cricket has p=2 and n=3.


Outlook Temperature Humidity Windy Play Cricket


Sunny Hot High Weak No
Sunny Hot High Strong No P=2
Sunny Mild High Weak No N=3
Sunny Cool Normal Weak Yes Total = 5
Sunny Mild Normal Strong Yes

Table 2.8.8 : Outlook=Sunny


Calculate the data set entropy for the above table :
Entropy = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
We calculate the information gain for Humidity and as seen from the Table 2.8.9, the
entropy for Humidity with outcome high and normal is 0.

Outlook   Humidity   Play cricket
Sunny     High       No
Sunny     High       No
Sunny     High       No
Sunny     Normal     Yes
Sunny     Normal     Yes

Humidity   p   n   Entropy
High       0   3   0
Normal     2   0   0

Table 2.8.9 : Entropy for humidity

As I (Humidity) =0, hence information gain=0.971


Similarly, we find the Entropy and information gain for Windy as shown in
Table 2.8.10.
Outlook   Windy    Play cricket
Sunny     Strong   No
Sunny     Strong   Yes
Sunny     Weak     No
Sunny     Weak     No
Sunny     Weak     Yes

Windy    p   n   Entropy
Strong   1   1   1
Weak     1   2   0.918

Table 2.8.10 : Entropy for windy
As I (Windy) =0.951, hence information gain=0.971-0.951=0.020
Similarly, we find the Entropy and information gain for temperature as shown in
Table 2.8.11.

Outlook Temperature Play Cricket

Sunny Cool Yes Temperature p n Entropy

Sunny Hot No Cool 1 0 0

Sunny Hot No Hot 0 2 0

Sunny Mild No Mild 1 1 1

Sunny Mild Yes

Table 2.8.11 : Entropy for temperature

As I(Temperature) = 0.4, hence information gain = 0.971 − 0.4 = 0.571


The Information gain for the attributes is shown in Table 2.8.12.

Attribute Information Gain

Temperature 0.571

Humidity 0.971 Selected as a Next


Windy 0.02

Table 2.8.12 : Information gain for attributes

From the Table 2.8.12, it is seen that the Humidity has the highest gain amongst the
other attributes. So, it will be selected as a next node as shown in Fig. 2.8.2.

Fig. 2.8.2 : Decision tree with selected node humidity

As seen from Fig. 2.8.2, for humidity there are only two outcomes : Normal “Yes” and High “No”. So, further expansion is not required and both become leaf nodes.
Now consider the new tables for Outlook=Rainy as shown in Table 2.8.13.

Outlook Temperature Humidity Windy Play Cricket

Rainy Mild High Weak Yes


P=3
Rainy Cool Normal Weak Yes
N=2
Rainy Cool Normal Strong No
Total = 5
Rainy Mild Normal Weak Yes

Rainy Mild Normal Strong No

Table 2.8.13 : Attribute outlook with rainy

Now, calculate the dataset entropy for the above table.


Entropy = 0.971
For each attribute like Humidity, calculate the entropy for High and Normal as shown
in Table 2.8.14.
Outlook Humidity Play Cricket

Rainy High Yes Attribute Entropy

Rainy High No High 1

Rainy Normal Yes Normal 0.918

Rainy Normal No

Rainy Normal Yes

Table 2.8.14 : Entropy for attribute humidity (High and normal)

Therefore, I (Humidity) = 0.951 and Information Gain=0.971-0.951=0.020.


For the attribute Windy, calculate the entropy for Strong and Weak as shown in
Table 2.8.15.
Outlook Windy Play Cricket
Attribute p n Entropy
Rainy Strong No
Strong 0 2 0
Rainy Strong No
Weak 3 0 0
Rainy Weak Yes

Rainy Weak Yes

Rainy Weak Yes

Table 2.8.15 : Entropy for attribute windy (Strong and weak)

Therefore, I (Windy) = 0 and Information Gain=0.971.


For each attribute like Temperature calculate the entropy for Cool, and Mild as shown
in Table 2.8.16.

Outlook Temperature Play Cricket


Attribute p n Entropy
Rainy Mild Yes
Cool 1 1 1
Rainy Cool Yes
Mild 2 1 0.918
Rainy Cool No

Rainy Mild Yes

Rainy Mild No

Table 2.8.16 : Entropy for attribute cool and mild

Therefore, Entropy for temperature, I (Temperature) = 0.951 and Information


Gain=0.020.
The following Table 2.8.17 shows the information gain for all the attributes. From this
table it is observed, the windy has highest gain so it will be selected as next node.
Attribute Information Gain

Humidity 0.02

Windy 0.971

Temperature 0.02

Table 2.8.17 : Information gain for attributes

So, accordingly the decision tree is formed as shown in Fig. 2.8.3.

Fig. 2.8.3 : Final decision tree


As seen from Fig. 2.8.3, for Windy there are only two conditions as Weak “Yes” and
Strong “No”. So, further expansion is not required and both become the leaf node. So, this
becomes the final decision tree. Hence, given the attributes and decisions we can easily
construct the decision tree using ID3 algorithm.

2.8.2 CART
Classification and Regression Tree (CART) is one of the commonly used decision tree algorithms. It uses a recursive partitioning approach where each input node is split into two child nodes. Therefore, a CART decision tree is often called a binary decision tree. In CART, at each level of the decision tree, the algorithm identifies a condition, that is, which variable and which level are to be used for splitting the input node (data sample) into two child nodes, and accordingly builds the decision tree.
CART is an alternate decision tree algorithm which can handle both regression and
classification tasks. This algorithm uses a new metric named gini index to create decision
points for classification tasks. Given the attributes and the decision as shown in Table
2.8.18. The procedure for creating the decision tree using CART is explained below.
Day Outlook Temperature Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Table 2.8.18 : Attributes/Features for CART problem


The Gini index is a measure used for classification tasks in the CART algorithm; it is based on the sum of the squared probabilities of each class. It is defined as
Gini Index (Attribute = value) = GI(v) = 1 − Σ_i (P_i)², for i = 1 to the number of classes
Gini Index (Attribute) = Σ_{v ∈ values} P_v × GI(v)
where P_v is the fraction of the records that take the value v.
Outlook is attribute which can be either sunny, overcast or rain. Summarizing the final
decisions for outlook feature is given in Table 2.8.19.

Outlook Yes No Number of instances

Sunny 2 3 5

Overcast 4 0 4

Rain 3 2 5

Table 2.8.19 : Decision table for outlook attribute

Using the above information from the table, we calculate Gini(Outlook) by using the formulae defined earlier :
Gini(Outlook = Sunny) = 1 − (2/5)² − (3/5)² = 1 − 0.16 − 0.36 = 0.48
Gini(Outlook = Overcast) = 1 − (4/4)² − (0/4)² = 0
Gini(Outlook = Rain) = 1 − (3/5)² − (2/5)² = 1 − 0.36 − 0.16 = 0.48
Then, we calculate the weighted sum of the Gini indexes for the outlook feature :
Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342
Here after, the same procedure is repeated for other attributes. Temperature is an
attribute which has 3 different values : Cool, Hot and Mild. The summary of decisions for
temperature is given in Table 2.8.20.

Temperature Yes No Number of instances

Hot 2 2 4

Cool 3 1 4

Mild 4 2 6

Table 2.8.20 : Decision table for temperature attribute


22 22
Gini(Temp = Hot) = 1 –   –   = 0.5
4 4
32 12
Gini(Temp = Cool) = 1 –   –   = 1 – 0.5625 – 0.0625 = 0.375
4 4
42 22
Gini(Temp= Mild) = 1–   –   = 1 – 0.444 – 0.111 = 0.445
6 6
We'll calculate weighted sum of gini index for temperature feature
4 4 6
Gini(Temp) =    0.5 +    0.375 +    0.445 = 0.142 + 0.107 + 0.190 = 0.439
 
14  
14 14
Humidity is a binary class feature. It can be high or normal as shown in Table 2.8.21.
Humidity Yes No Number of instances

High 3 4 7

Normal 6 1 7

Table 2.8.21 : Decision table for humidity attribute

Gini(Humidity = High) = 1 − (3/7)² − (4/7)² = 1 − 0.183 − 0.326 = 0.489
Gini(Humidity = Normal) = 1 − (6/7)² − (1/7)² = 1 − 0.734 − 0.020 = 0.244
The weighted sum for the humidity feature is calculated next :
Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.244 = 0.367
Wind is a binary class similar to humidity. It can be weak and strong as shown in
Table 2.8.22.

Wind Yes No Number of instances

Weak 6 2 8

Strong 3 3 6

Table 2.8.22 : Decision table for wind attribute


Gini(Wind = Weak) = 1 − (6/8)² − (2/8)² = 1 − 0.5625 − 0.0625 = 0.375
Gini(Wind = Strong) = 1 − (3/6)² − (3/6)² = 1 − 0.25 − 0.25 = 0.5
Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428
After calculating the gini index for each attribute, the attribute having minimum value
is selected as the node. So, from the Table 2.8.23 the outlook feature has the minimum
value, therefore, outlook attribute will be at the top of the tree as shown in Fig. 2.8.4.
Feature Gini index

Outlook 0.342

Temperature 0.439

Humidity 0.367

Wind 0.428

Table 2.8.23 : Gini index for each feature

Fig. 2.8.4 : Representation of outlook as a top node


From Fig. 2.8.4, we can see that the sub dataset Overcast has only yes decisions, which
means the leaf node for Overcast is “Yes” as shown in Fig. 2.8.5 which will not require
further expansion.


We will apply same principles to the other sub datasets in the following steps. Let us
take sub dataset for Outlook=Sunny. We need to find the gini index scores for
temperature, humidity and wind features respectively. The sub dataset for
Outlook=Sunny is as shown in Table 2.8.24.

Fig. 2.8.5 : Representation of leaf node for overcast

Day Outlook Temperature Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

11 Sunny Mild Normal Strong Yes

Table 2.8.24 : Sub dataset of outlook=sunny

Now, we determine Gini of temperature for Outlook=Sunny as per Table 2.8.25.


Temperature Yes No Number of instances

Hot 0 2 2

Cool 1 0 1

Mild 1 1 2

Table 2.8.25 : Decision table of temperature for outlook=sunny

02 22
Gini(Outlook = Sunny and Temperature = Hot) = 1–   –   = 0
2 2
12 02
Gini(Outlook=Sunny and Temperature = Cool) = 1–   –   = 0
1 1
®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
Big Data Analytics 2 - 30 Clustering and Classification

12 12
Gini(Outlook = Sunny and Temperature = Mild) = 1–   –   = 1 – 0.25 – 0.25 = 0.5
2 2
2 1 2
Gini(Outlook=Sunny and Temperature) =    0 +    0 +    0.5 = 0.2
5 5 5
Now, we determine Gini of humidity for Outlook=Sunny as per Table 2.8.26.

Humidity Yes No Number of instances

High 0 3 3

Normal 2 0 2

Table 2.8.26 : Decision table of humidity for outlook=sunny

02 32
Gini(Outlook=Sunny and Humidity=High) = 1–   –   = 0
3 3
22 02
Gini(Outlook=Sunny and Humidity=Normal) = 1–   –   = 0
2 2
3 2
Gini(Outlook=Sunny and Humidity) =    0 +    0 = 0
5 5
Now, we determine Gini of Wind for Outlook=Sunny as per Table 2.8.27.
Wind Yes No Number of instances

Weak 1 2 3

Strong 1 1 2

Table 2.8.27 : Decision table of wind for outlook=sunny

12 22
Gini(Outlook=Sunny and Wind=Weak) = 1–   –   = 0.266
3 3
12 12
Gini(Outlook=Sunny and Wind=Strong) = 1–   –   = 0.2
2 2
3 2
Gini(Outlook=Sunny and Wind) =    0.266 +    0.2 = 0.466
 
5 5
Decision for sunny outlook
We’ve calculated gini index scores for features when outlook is sunny as shown
in Table 2.8.28. The winner is humidity because it has the lowest value.

Feature        Gini index
Temperature    0.2
Humidity       0
Wind           0.466

Table 2.8.28 : Gini index of each feature for outlook=sunny

Humidity is the extension of Outlook Sunny as the gini index for Humidity is
minimum as shown in Fig. 2.8.6.

Fig. 2.8.6 : Node of humidity for outlook=sunny

As seen from Fig. 2.8.6, decision is always no for high humidity and sunny outlook.
On the other hand, decision will always be yes for normal humidity and sunny outlook.
Therefore, this branch is over and the leaf nodes of Humidity for Outlook=Sunny is
shown in Fig. 2.8.7.

Fig. 2.8.7 : Leaf node of humidity for outlook=sunny


Let us take sub dataset for Outlook= Rain, and determine the gini index for
temperature, humidity and wind features respectively. The sub dataset for Outlook=
Rain is as shown in Table 2.8.29.
Day Outlook Temperature Humidity Wind Decision

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

10 Rain Mild Normal Weak Yes

14 Rain Mild High Strong No

Table 2.8.29 : Sub dataset of outlook=rain

The calculation for gini index scores for temperature, humidity and wind features
when outlook is rain is shown following Tables 2.8.30, 2.8.31, and 2.8.32 .
Temperature Yes No Number of instances

Cool 1 1 2

Mild 2 1 3
Table 2.8.30 : Decision table of temperature for outlook=rain

12 12
Gini(Outlook=Rain and Temperature=Cool) = 1 –   –   = 0.5
2 2
22 12
Gini(Outlook=Rain and Temperature=Mild) = 1 –   –   = 0.444
3 3
2 3
Gini(Outlook=Rain and Temperature) =    0.5 +    0.444 = 0.466
 
5 5
Table 2.8.31 : Decision table of humidity for outlook=rain

Humidity Yes No Number of instances


High 1 1 2
Normal 2 1 3

12 12
Gini(Outlook=Rain and Humidity=High) = 1 –   –   = 0.5
2 2
22 12
Gini(Outlook=Rain and Humidity=Normal)= 1 –   –   = 0.444
3 3
2 3
Gini(Outlook=Rain and Humidity) =    0.5 +    0.444 = 0.466
 
5 5

Wind Yes No Number of instances

Weak 3 0 3

Strong 0 2 2

Table 2.8.32 : Decision table of wind for outlook=rain


Gini(Outlook = Rain and Wind = Weak) = 1 − (3/3)² − (0/3)² = 0
Gini(Outlook = Rain and Wind = Strong) = 1 − (0/2)² − (2/2)² = 0
Gini(Outlook = Rain and Wind) = (3/5) × 0 + (2/5) × 0 = 0
The winner is wind feature for rain outlook because it has the minimum gini index
score in features as per Table 2.8.33.
Feature Gini index

Temperature 0.466

Humidity 0.466

Wind 0
Table 2.8.33 : Gini index of each feature for Outlook=Rain

Place the wind attribute for outlook rain branch and monitor the new sub data sets as
shown in Figure 2.8.8.

Fig. 2.8.8 : Node of Wind for Outlook=Rain


As seen, when wind is weak the decision is always yes. On the other hand, if wind is
strong the decision is always no. This means that this branch is over and the final decision
tree using CART algorithm is depicted in Fig. 2.8.9.

Fig. 2.8.9 : Final decision tree using CART algorithm

2.8.3 C4.5
The C4.5 algorithm is used to generate a decision tree for a decision tree classifier. It is mostly used in data mining, where decisions are generated based on a certain sample of data. It has many improvements over the original ID3 algorithm; in particular, the C4.5 algorithm can handle missing data. If the training records contain unknown attribute values, then C4.5 evaluates the gain for each attribute by considering only the records where that attribute is defined. For the corresponding records of each partition, the gain is calculated, and the partition that maximizes the gain is chosen for the next split. It also supports both categorical and
continuous attributes where values of a continuous variable are sorted and partitioned.
The ID3 algorithm may construct a deep and complex tree, which would cause
overfitting. The C4.5 algorithm addresses the overfitting problem in ID3 by using a
bottom-up technique called pruning to simplify the tree by removing the least visited
nodes and branches.

2.9 Evaluating Decision Tree


The decision tree often uses greedy algorithms to choose the option that seems the best
available at that moment. At each step, the algorithm selects the attribute to be used for
splitting the remaining records.

®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
Big Data Analytics 2 - 35 Clustering and Classification

This selection of attribute may not be the best overall, but it is guaranteed to be the
best at that step. This characteristic strengthens the effectiveness of decision trees.
However, selecting the wrong attribute with a bad split may propagate through the rest of the tree. Thus, to address this issue, an ensemble method such as random forest can be utilized, which randomizes the splitting, or even randomizes the data, to come up with numerous tree structures. These trees then vote for each class, and the class with the most votes is picked as the predicted class.
There are few ways to evaluate a decision tree. Some of the important evaluations are
given as follows.
Firstly, evaluate whether the splits of the tree make sense. Conduct stability checks by
validating the decision rules with domain experts, and determine if the decision rules are
sound. Second, look at the depth and nodes of the tree. Having too many layers and getting nodes with very few members might be a sign of overfitting. In overfitting, the model fits the training set well but performs poorly on the new samples in the testing set.
For decision tree learning, overfitting can be caused by either a lack of training data or biased data in the training set. To avoid overfitting in a decision tree, two methodologies can be utilized : first, stop growing the tree early, before it reaches the point where all the training data is perfectly classified; second, grow the full tree and then post-prune it with methods such as reduced-error pruning and rule-based post-pruning. Lastly, use standard diagnostic tools that apply to classifiers, which can help evaluate overfitting.
The structure of a decision tree is sensitive to small variations in the training data.
Therefore, constructing two decision trees based on two different subsets of the same dataset may result in very different trees. A decision tree is also not a good choice if the dataset contains many irrelevant or redundant variables. If the dataset contains redundant variables, the resulting decision tree ignores all but one of them, because the algorithm cannot detect any additional information gain from the duplicates. On the other hand, if irrelevant variables are accidentally chosen as splits in the tree, the tree may grow too large and may end up with less data at every split, where overfitting is likely to occur.
Although decision trees are able to handle correlated variables, when most of the variables in the training set are correlated, overfitting is likely to occur. To overcome the issue of instability and potential overfitting, one can combine the decisions of several randomized shallow decision trees using a classifier called random forest.


For binary decisions, a decision tree works better if the training dataset consists of records with an even probability of each result. In that scenario, when using methods such as logistic regression on a dataset with multiple variables, a decision tree can help determine which variables are the most useful to select based on information gain.
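As a small illustration of combining many randomized trees, the sketch below uses the randomForest package (assumed to be installed) with R's built-in iris data set; the 70/30 split and the choice of 200 trees are assumptions made purely for this example, not a prescribed setting.

# Sketch : combining many randomized trees with the randomForest package
# (package assumed installed; built-in iris data set used for illustration)
library(randomForest)

set.seed(7)
idx   <- sample(nrow(iris), size = floor(0.7 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

rf   <- randomForest(Species ~ ., data = train, ntree = 200)
pred <- predict(rf, test)
table(Predicted = pred, Actual = test$Species)   # confusion matrix on the test set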

2.10 Decision Tree in R


In R programming, the decision tree can be plotted using a package called rpart.plot. The common steps for implementing a decision tree in R are as follows :

Step 1 : Import the data

Step 2 : Clean the dataset

Step 3 : Create train/test set

Step 4 : Build the model

Step 5 : Make prediction

Step 6 : Measure performance

Step 7 : Tune the hyper-parameters


The implementation of Decision tree in R is briefly explained in Practical section.
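As a quick preview of those steps, the following sketch shows how they might look with the rpart and rpart.plot packages; it is only illustrative, and the file name play_cricket.csv, the column names and the 70/30 split are assumptions, not part of the book's practical.

# Minimal illustrative sketch of the seven steps using rpart / rpart.plot
# (file name, column names and split ratio are assumptions)
library(rpart)
library(rpart.plot)

# Step 1 - 2 : import and clean the data (hypothetical file)
play <- read.csv("play_cricket.csv", stringsAsFactors = TRUE)
play <- na.omit(play)

# Step 3 : create a 70/30 train/test split
set.seed(42)
idx   <- sample(nrow(play), size = floor(0.7 * nrow(play)))
train <- play[idx, ]
test  <- play[-idx, ]

# Step 4 : build the classification tree and plot it
fit <- rpart(Play ~ ., data = train, method = "class")
rpart.plot(fit)

# Step 5 : make predictions on the test set
pred <- predict(fit, test, type = "class")

# Step 6 : measure performance with a confusion matrix
print(table(Predicted = pred, Actual = test$Play))

# Step 7 : tune hyper-parameters, e.g. the complexity parameter cp
fit2 <- rpart(Play ~ ., data = train, method = "class",
              control = rpart.control(cp = 0.01, minsplit = 5))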

2.11 Bayes’ Theorem


Bayes' theorem is also called Bayes' Rule or Bayes' Law and is the foundation of the
field of Bayesian statistics. Bayes' Theorem was named after 18th century mathematician
Thomas Bayes. Bayes' Theorem allows you to update predicted probabilities of an event
by incorporating new information. It is often employed in finance in updating risk
evaluation.
The probability of two events A and B happening, i.e. P(A∩B), is the probability of A,
i.e. P(A), times the probability of B given that A has occurred, P(B|A).
P(A ∩ B) = P(A)P(B|A)
Similarly, it is also equal to the probability of B times the probability of A given B.
P(A ∩ B) = P(B)P(A|B)
Equating the two yields,
P(B)P(A|B) = P(A)P(B|A)
P(A|B) = P(A) P(B|A) / P(B)


This equation is known as Bayes Theorem, which relates the conditional and marginal
probabilities of stochastic events A and B as
P(A|B) = P (B|A) P (A) / P (B) .
Each term in Bayes’ theorem has a conventional name. P(A) is the prior probability or
marginal probability of A. It is “prior” in the sense that it does not take into account any
information about B. P(A|B) is the conditional probability of A, given B. It is also called
the posterior probability because it is derived from or depends upon the specified value
of B. P(B|A) is the conditional probability of B given A. P(B) is the prior or marginal
probability of B, and acts as a normalizing constant. This theorem plays an important role
in determining the probability of the event, provided the prior knowledge of another
event.
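As a small numerical illustration of this updating in R, consider a risk-scoring setting; the numbers below are assumed purely for illustration and are not taken from the text.

# Bayes' theorem on illustrative numbers : P(A|B) = P(B|A) * P(A) / P(B)
# Suppose 1 % of transactions are fraudulent (event A), a model flags 90 %
# of frauds (B given A) and 5 % of legitimate transactions (B given not A).
p_A      <- 0.01
p_B_A    <- 0.90
p_B_notA <- 0.05
p_B   <- p_B_A * p_A + p_B_notA * (1 - p_A)   # total probability of a flag
p_A_B <- p_B_A * p_A / p_B                    # posterior P(A|B)
p_A_B                                         # approx. 0.154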

2.12 Naive Bayes Classifier


It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that
the presence of a particular feature in a class is unrelated to the presence of any other
feature. Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability. Naive Bayes
model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification
methods. Naive Bayes is a probabilistic classifier inspired by Bayes' theorem, under the simple assumption that the attributes are conditionally independent.

The classification is conducted by deriving the maximum posterior, which is the maximal P(c|X), with the above assumption applied to Bayes' theorem. This assumption greatly reduces the computational cost by only counting the class distribution.
Given a data set with class labels, we can determine the appropriate class for a condition which is not present in the data set. Consider Table 2.12.1 and find the probability of playing cricket with outcome Yes or No when the conditions are


Temperature = Cool, Humidity = High, Wind = Strong and Outlook = Sunny, a combination which does not appear in Table 2.12.1.

Day Outlook Temperature Humidity Wind Play cricket

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Table 2.12.1 : Attributes for Naive Bayes classifier problem

From Table 2.12.1, it is seen that the attribute play cricket has outcome Yes = 9 and No =
5 for 14 conditions. Using the table, we determined P (Strong | Yes), P (Strong | No), P
(Weak | Yes), P (Weak| No), P (High | Yes), P (High| No), P (Normal| Yes), P (Normal
| No), P (Hot | Yes), P (Hot | No), P (Mild| Yes), P (Mild| No), P (Cool| Yes), P (Cool|
No), P (Sunny| Yes), P (Sunny| No), P (Overcast| Yes), P (Overcast| No) and P (Rain|
Yes), P (Rain| No) as shown in Fig. 2.12.1.


Fig. 2.12.1 : Conditional probabilities for different attributes

Consider the first attribute Wind, which has two possible values, Strong and Weak. Finding the conditional probability of wind, i.e. P(Wind | Yes) and P(Wind | No), is demonstrated in Fig. 2.12.1.

As seen from Fig. 2.12.1, Wind has two sub-parameters, Strong and Weak. From Table 2.12.1, Strong appears 6 times while Weak appears 8 times. The number of Yes for Strong is 3 and the number of No for Strong is 3, therefore P(Strong|Yes) = 3/9 and P(Strong|No) = 3/5. Similarly, for Weak, the number of Yes is 6 and the number of No is 2, so P(Weak|Yes) = 6/9 and P(Weak|No) = 2/5.
Similarly, for Humidity, the two conditions are High and Normal. From the table, High appears 7 times while Normal appears 7 times. The number of Yes for High is 3 and the number of No is 4, therefore P(High|Yes) = 3/9 and P(High|No) = 4/5. Similarly, for Normal, the number of Yes is 6 and the number of No is 1, so P(Normal|Yes) = 6/9 and P(Normal|No) = 1/5. In the same way, the probabilities for Temperature and Outlook are given in Table 2.12.2.

Temperature :

P(Hot | Yes) = 2/9        P(Mild | Yes) = 4/9        P(Cool | Yes) = 3/9
P(Hot | No) = 2/5         P(Mild | No) = 2/5         P(Cool | No) = 1/5

Outlook :

P(Sunny | Yes) = 2/9      P(Overcast | Yes) = 4/9    P(Rain | Yes) = 3/9
P(Sunny | No) = 3/5       P(Overcast | No) = 0/5     P(Rain | No) = 2/5

Table 2.12.2 : Conditional probabilities for attributes temperature and outlook

Now, considering the problem statement, let X = {Sunny, Cool, High, Strong}. Using the Naive Bayes relation (the class prior multiplied by the conditional probability of each attribute value), we can write

P(Yes) P(X|Yes) = P(Yes)  P(Sunny|Yes)  P(Cool|Yes)  P(High|Yes)  P(Strong|Yes)
= (9/14)  (2/9)  (3/9)  (3/9)  (3/9) = 0.0053

P(No) P(X|No) = P(No)  P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)
= (5/14)  (3/5)  (1/5)  (4/5)  (3/5) = 0.0206

As the score for No is greater than the score for Yes, the answer is No for playing cricket under these conditions, i.e. Outlook is Sunny, Temperature is Cool, Humidity is High and Wind is Strong. With this approach, we can determine the answer for playing cricket as Yes or No for other conditions which are not mentioned in Table 2.12.1. In this way, we have learned K-means clustering along with its use cases and two statistical classifiers in this chapter.

Summary
 Clustering is one of the most popular exploratory data analysis techniques. It involves the task of grouping data into subgroups, where data points in the same subgroup (cluster) are very similar and data points in other clusters are different.
 K-means clustering is one of the simplest and popular unsupervised machine
learning algorithms which tries to partition the dataset into K pre-defined distinct
non-overlapping subgroups (clusters) in iterative manner where each data point
belongs to only one group.
 Some use cases of clustering are document clustering, fraud detection, cyber-
profiling criminals, delivery store optimization, customer segmentation etc.
 The Silhouette method is used for interpretation and validation of consistency within clusters of data, while the gap statistic compares the total within intra-cluster variation for different values of k with their expected values under a null reference distribution of the data.


 The classification is a supervised learning approach in which the computer
program learns from the data input given to it and then uses this learning to
classify new observation.
 The different types of classification algorithms are Linear Classifiers, Logistic
Regression, Naive Bayes Classifier, Nearest Neighbor, Support Vector Machines,
Decision Trees, Random Forest and Neural Networks.
 Given a data of attributes together with its classes, a decision tree produces a
sequence of rules that can be used to classify the data. Decision Tree, as its names
says, makes decision with tree-like model. In general, the objective of a decision
tree algorithm is to construct a tree T from a training set S.
 The ID3 is a decision tree algorithm that selects the most informative attribute
based on two basic measures like Entropy, which measures the impurity of an
attribute and Information gain, which measures the purity of an attribute.
 Classification and Regression Tree (CART) is one of the commonly used decision tree algorithms. It uses a recursive partitioning approach where each input node is split into two child nodes. Therefore, a CART decision tree is often called a binary decision tree.
 The C4.5 algorithm is used to generate a decision tree for classification. It is mostly used in data mining where decisions are generated based on a certain sample of data.
 Bayes' Theorem allows you to update predicted probabilities of an event by
incorporating new information. It is often employed in finance in updating risk
evaluation while Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.

Two Marks Questions with Answers [Part A - Questions]


Q.1 Define Clustering and Classification. AU : May-17
Ans. : Clustering is the process of collecting and grouping similar data into classes or clusters. In other words, clustering is a process in which similar data is grouped into classes or clusters, so that the objects within the same cluster or class have high similarity to one another and are dissimilar to the objects in other clusters or groups.
In machine learning and statistics, classification is a supervised learning approach in
which the computer program learns from the data input given to it and then uses this


learning to classify a new observation. Classification is a technique to categorize our data into a desired and distinct number of classes where we can assign a label to each class.
Q.2 Give the advantages and disadvantages of decision tree.
Ans. : The advantages of decision tree are listed below :

 Decision trees generate understandable rules.


 Decision trees perform classification without requiring much computation.
 Decision trees are capable of handling both continuous and categorical variables.
 Decision trees provide a path to find out which fields are most important for
prediction or classification.
The disadvantages of decision tree are listed below :
 Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and
a relatively small number of training examples.
 Decision trees can be computationally expensive to train.
 Pruning algorithms can also be expensive since many candidate sub-trees must be
formed and compared.
 Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
Q.3 Explain Bayes' theorem.
Ans. : Bayes' theorem allows you to update predicted probabilities of an event by
incorporating new information. It is often employed in finance in updating risk
evaluation.
The probability of two events A and B happening, i.e. P(A  B), is the probability of
A, i.e. P(A), times the probability of B given that A has occurred, P(B|A).
P(A  B) = P(A)P(B|A)
Similarly, it is also equal to the probability of B times the probability of A given B.
P(A  B) = P(B)P(A|B)
Equating the two yields,
P(B)P(A|B) = P(A)P(B|A)
P(A|B) = P(A) P(B|A) / P(B)
This equation is known as Bayes theorem, which relates the conditional and
marginal probabilities of stochastic events A and B as

P(A|B) = P (B|A) P (A) / P (B) .


Q.4 Explain K-means clustering algorithm.
Ans. : Algorithmic steps for K-means clustering are as follows :
Step 1 : Let X = {X1,X2……..,Xn} be the set of data points.
Step 2 : Arbitrarily select 'K' cluster centers denoted as C1,C2,…….,Ck.
Step 3 : Calculate the distance between each data point with the cluster centers by
using any distance measurers.
Step 4 : Assign the data points to the cluster center whose distance from the cluster
center is minimum with respect to other cluster centers.
Step 5 : Recalculate the cluster centers by taking the mean of the data points assigned to each cluster.
Step 6 : Repeat from step 3 till there is no change in the cluster center .
Q.5 State the significance of the C4.5 algorithm.
Ans. : The C4.5 algorithm is used to generate a decision tree for classification. It is mostly used in data mining where decisions are generated based on a certain sample of data. It has many improvements over the original ID3 algorithm; in particular, C4.5 can handle missing data.
If the training records contain unknown attribute values, then C4.5 evaluates the gain for each attribute by considering only the records where the attribute is defined.
For the corresponding records of each partition, the gain is calculated, and the partition
that maximizes the gain is chosen for the next split. It also supports both categorical
and continuous attributes where values of a continuous variable are sorted and
partitioned.

Part - B Questions

Q.1 Explain the K-means clustering algorithm with an example. AU : May 17

Q.2 State and explain decision tree ID3 algorithm in detail.


Q.3 Describe with example CART algorithm to generate decision tree.
Q.4 Explain in detail the methods to determine the number of clusters.
Q.5 Explain the use cases of K-means clustering.
Q.6 Explain the Naïve Bayes algorithm with an example.


UNIT - III

3
Association and
Recommendation System
Syllabus
Advanced Analytical Theory and Methods : Association Rules - Overview - Apriori Algorithm -
Evaluation of Candidate Rules - Applications of Association Rules - Finding Association and finding
similarity - Recommendation System : Collaborative Recommendation- Content Based
Recommendation - Knowledge Based Recommendation- Hybrid Recommendation Approaches.

Contents
3.1 Overview of Association Rules
3.2 Apriori Algorithm
3.3 Evaluation of Candidate Rules
3.4 Applications of Association Rules
3.5 Finding Associations and Finding Similarity
3.6 Recommendation System
3.7 Collaborative Recommendation
3.8 Content-based Recommendation
3.9 Knowledge based Recommendation
3.10 Hybrid Recommendation Approaches
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions


3.1 Overview of Association Rules


Association rules are useful for analyzing and predicting customer behavior. They play an important role in applications like market basket analysis, customer-based analytics, catalog design, product clustering and store layout. Association rule mining is a descriptive and unsupervised learning method to discover relationships in a dataset by mining the transactions in databases.
The "Market-basket Analysis" model is commonly used for better understanding of
association rules, where a many to many relationships between "items" and "baskets" are
represented. In Market-basket Analysis, the large collection of transactions is given where
each transaction comprises one or more associated items. In this analysis, the association
rules go through each item being purchased to see what items are frequently bought
together and to determine the list of rules that define the procuring behavior of a
customer. It is one of the key techniques used by many online stores to uncover associations between items; it works by looking for combinations of items that frequently occur together in transactions. It identifies the strength of association between pairs of products purchased together and identifies patterns of co-occurrence, where co-occurrence means two or more things taking place together. It also allows retailers to recognize the relationships between the items or itemsets that people buy. The term itemset is nothing but a collection of items or individual entities that share some kind of relationship. The relationships depend on both the business context and the functionality of the algorithm being used for the discovery.
Technically, the Market Basket Analysis creates If-Then scenario rules, for example, if
item X is purchased then item Y is likely to be purchased. The rules are probabilistic in
nature or, in other words, they are derived from the frequencies of co-occurrence in the
observations. The rules can be used in pricing strategies, product placement, and various
types of cross-selling strategies. The Market Basket Analysis takes data at transaction
level, which lists all items bought by a customer in a single purchase. The technique
determines relationships of what products were purchased with which other product(s).
These relationships are then used to build profiles containing If-Then rules of the items
purchased.
The rules for Market Basket Analysis could be written as :
IF {X} THEN {Y}
Where X and Y are the items or products


Here, the IF part of the rule (the {X} above) is known as the antecedent and the THEN part of the rule is known as the consequent (the {Y} above). The antecedent is the condition and the consequent is the result. For example, if a customer is purchasing bread, then the customer is also likely to purchase butter or eggs.
Generally, association rule mining is a two-step process : first, find all the frequent itemsets, and second, generate strong association rules from the frequent itemsets.
There are mainly two algorithms used for finding the frequent itemsets, namely the Apriori algorithm and the FP-growth algorithm. In this chapter, we mainly focus on the Apriori algorithm for the discussion of association rules.

3.2 Apriori Algorithm


The Apriori is one of the most fundamental algorithms for generating association
rules. It uses support for pruning the itemsets and controlling the exponential growth of
candidate itemsets where smaller candidate itemsets, which are known to be frequent
itemsets, are combined and pruned to generate longer frequent itemsets.
This approach eliminates the need for all possible itemsets to be enumerated within
the algorithm, since the number of all possible itemsets can become exponentially large.
The major components of Apriori algorithm are Support and Confidence. For example, if
60 % of transactions contain itemset {Bread}, then the support of {Bread} is 0.6 and 50 % of
all transactions contain itemset {Bread, Butter}, then the support of {Bread, Butter} would
be 0.5.
A frequent itemset has items that satisfy the minimum support criterion. If the minimum support is set at 0.5, any itemset can be considered a frequent itemset if at least 50 % of the transactions contain this itemset. In other words, the support of a frequent itemset must be greater than or equal to the minimum support. Therefore, if an itemset is frequent, then any subset of that itemset must also be frequent. This is referred to
as the Apriori property. For example, if 50 % of the transactions contain {Bread, Jam}, then
the support of {Bread, Jam} is 0.5, the support of {Bread} or {Jam} should be at least 0.5.
The Apriori algorithm takes a bottom-up iterative approach to discover the frequent
itemsets by determining all the possible items and then identify which amongst them are
frequent. Assuming the minimum support threshold is set to 0.5, the algorithm identifies
and retains those itemsets that appear in at least 50 % of all transactions and discards or
prunes the itemsets which are having support less than 0.5.


In the first iteration, the Apriori algorithm identifies the candidate 1-itemsets (for example, {Bread}, {Milk}, {Butter}, …) and evaluates them to identify the frequent 1-itemsets among them. In the second iteration, the frequent itemsets are paired into candidate 2-itemsets (for example, {Bread, Butter}, {Bread, Milk}, {Jam, Butter}, …) and again evaluated to identify the frequent 2-itemsets among them. Likewise, the algorithm continues to run n iterations on n-itemsets to evaluate the frequent n-itemsets among them. At each iteration, the algorithm checks whether the support criterion can be met, and it stops when no itemsets meet the support threshold or when the itemsets reach a predefined length.
This algorithm uses two steps "join" and "prune" to reduce the search space. The join
step generates (K+1) itemset from K-itemsets by joining each item with itself while prune
step scans the count of each item in the database. If the candidate item does not meet
minimum support, then it gets removed. This step is performed to reduce the size of the
candidate itemsets.
The algorithm is composed of a sequence of steps to find the frequent itemsets in a given database, performing the join and prune steps iteratively until no further frequent itemsets can be generated. The generalized steps of the Apriori algorithm are given below, followed by a short illustrative sketch in R.

Apriori Algorithm
Step 1 : In the first iteration of the algorithm, each item is taken as a 1-itemsets
candidate. The algorithm counts the occurrences of each item.
Step 2 : In the set of 1-itemsets, the occurrence of items which satisfies the minimum
support are determined. The minimum support is already defined in the problem. So,
the candidates which count more than or equal to minimum support, are taken forward
for the next iteration and the others are pruned or removed.
Step 3 : In the second iteration, 2-itemset frequent items with minimum support are
revealed. In this step the join of 2-itemset is generated by forming a group of 2 by
combining items with itself.
Step 4 : The 2-itemset candidates are pruned using minimum support threshold value
as like step 2.
Step 5 : The next iteration will form 3-itemsets using join and prune step. If all 2-itemset
subsets are frequent then the superset will be frequent otherwise it is pruned.
Step 6 : The process continues for {4-itemsets, 5-itemsets…. K-itemsets} till itemset does
not meet the minimum support criteria or itemsets reach a predefined length.


Example 1:
In a given table the transaction Ids (TID's) and Itemsets are provided, so find all the
frequent itemsets using Apriori algorithm with minimum support = 50 % and Minimum
confidence = 50 %.

TID Itemsets
T1 A, B, C
T2 A, C
T3 A, D
T4 B, E, F

As we have provided a minimum support = 50 %, the support count can be calculated as :

Support count = (50/100)  No. of transactions = (50/100)  4 = 2
Iteration 1 : Now, let us scan the dataset for each candidate (item) to get its support count, that is, generate the support count for each of the candidates. The candidate support calculated for the 1-itemset is shown in Table 3.2.1.

C1

Items Support_count

{A} 3

{B} 2

{C} 2

{D} 1

{E} 1

{F} 1

Table 3.2.1 : Candidate support for 1-Itemset

Now, compare candidate support count with minimum support count. Here,
minimum support count = 2. So, from C1 compare support count of each item with
minimum support count which should be greater than equal to 2. Therefore, we need to
prune/remove the items whose support count is less than 2. So, the itemset that satisfies
the minimum support count (L1) are shown in Table 3.2.2.


L1

Items Support_count

{A} 3

{B} 2

{C} 2

Table 3.2.2 : Items satisfying minimum support count for 1-Itemset in Iteration 1

Iteration 2 : Now perform Join operations by combining single items from L1 into 2-
Itemsets with all combinations and find their support count. Therefore, candidate support
(C2) calculated for 2-Itemset is shown in Table 3.2.3.

C2
Items Support_count
{A, B} 1
{B, C} 1
{A, C} 2

Table 3.2.3 : Candidate support for 2-Itemset

Compare the candidate support count with the minimum support count 2. Here, we find that only one itemset, {A, C}, has a support count equal to 2. Therefore, we need to prune/remove
the other items whose support count is less than 2. The itemset that satisfies the minimum
support count (L2) is shown in Table 3.2.4.
L2
Items Support_count
{A, C} 2

Table 3.2.4 : Items satisfying minimum support count for 2-Itemset in Iteration 2

Here, we find only one itemset which satisfies the minimum support count, and it will be the frequent itemset for the given problem. Now, the final rules with their support and confidence for the frequent itemset are given in Table 3.2.5.

Association rule    Support count    Confidence    Confidence %

A  C    2    Support/Occurrence of A = 2/3 = 0.66    66 %

C  A    2    Support/Occurrence of C = 2/2 = 1    100 %

Table 3.2.5 : Final rules with support and confidence

In both cases, the confidence is greater than the minimum confidence of 50 % given in the problem. Therefore, the final rules will be A  C and C  A.

Example 2 :
In the given Table 3.2.6, the transaction Ids (TID's) and list of Itemsets are provided, so
find all the frequent itemsets using the Apriori algorithm with minimum support count = 2 and minimum confidence = 50 %.

TID List of itemsets

T1 I1, I2, I5

T2 I2, I4

T3 I2, I3

T4 I1, I2, I4

T5 I1, I3

T6 I2, I3

T7 I1, I3

T8 I1, I2, I3, I5

T9 I1, I2, I3

Table 3.2.6 : Example data for Apriori algorithm

As per previous example, the different iterations of Apriori algorithm are given in
following tables. Now let us find candidate support (C1) and Items satisfying minimum
support count (L1) for 1-Itemset as shown in Table 3.2.7. In this example the minimum
support count provided is 2, so prune all the values which are less than 2 in
Support_count column.


C1 L1

Items Support_count Items Support_count

{I1} 6 {I1} 6

{I2} 7 {I2} 7

{I3} 6 {I3} 6

{I4} 2 {I4} 2

{I5} 2 {I5} 2

Table 3.2.7 : Candidate support and items satisfying minimum support count for 1-itemset

Now calculate candidate support (C2) and Items satisfying minimum support count
(L2) for 2-Itemset as shown in Table 3.2.8.

C2 L2

Items Support_count Items Support_count

{I1, I2} 4 {I1, I2} 4

{I1, I3} 4 {I1, I3} 4

{I1, I4} 1 {I1, I5} 2

{I1, I5} 2 {I2, I3} 4

{I2, I3} 4 {I2, I4} 2

{I2, I4} 2 {I2, I5} 2

{I2, I5} 2

{I3, I4} 0

{I3, I5} 1

{I4, I5} 0

Table 3.2.8 : Candidate support and Items satisfying minimum support count for 2-Itemset

Similarly, calculate candidate support (C3) and Items satisfying minimum support
count (L3) for 3-Itemset as shown in Table 3.2.9.

C3 L3

Items Support_count Items Support_count

{I1, I2, I3} 2 {I1, I2, I3} 2

{I1, I2, I5} 2 {I1, I2, I5} 2

Table 3.2.9 : Candidate support and Items satisfying minimum support count for 3-Itemset

Now, the final rules with their support and confidence for the frequent itemsets are given in Table 3.2.10.
Association rule Support count Confidence Confidence %
I1  I2  I3 2 = Support/Occurrence of I1 = 2/6 = 0.33 33 %

I2  I3  I1 2 = Support/Occurrence of I2 = 2/7 = 0.28 28 %

I3  I1  I2 2 = Support/Occurrence of I3 = 2/6 = 0.33 33 %

I1  I2  I5 2 = Support/Occurrence of I1 = 2/6 = 0.33 33 %

I2  I5  I1 2 = Support/Occurrence of I2 = 2/7 = 0.28 28 %

I5  I1  I2 2 = Support/Occurrence of I5 = 2/2 = 1 100 %

Table 3.2.10 : Final rules with support and confidence

From Table 3.2.10, we can conclude that the final rule for the given problem would be I5  I1  I2, as it is the only rule with confidence greater than the minimum confidence of 50 %.

3.3 Evaluation of Candidate Rules


The association rule has three important measures that express the degree of
confidence in the rule called Support, Confidence, and Lift. The Support(S) is the number
of transactions that include items in the {X} and {Y} parts of the rule as a percentage of the
total number of transactions. It is a measure of how frequently the collection of items
occur together as a percentage of all transactions.
The Confidence (C) is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Lift (L) of the rule X  Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support of {Y}. Leverage is similar to lift, but instead of using a ratio, leverage uses the difference. Here, confidence is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental.
Measures such as lift and leverage not only ensure interesting rules are identified but also
filter out the coincidental rules.
Association rule mining discovers all sets of items that have support greater than the minimum support and then uses the large itemsets to produce the desired rules that have confidence greater than the minimum confidence. The lift of a rule is the ratio of the observed support to that expected if X and Y were independent.
For the rule X  Y, the Support, Confidence, Lift and Leverage can be calculated as :

Support = Frequency(X, Y) / N = Number of transactions in which X and Y appear together / Total number of transactions, where N is the total number of transactions

Confidence = Frequency(X, Y) / Frequency(X) = Support(X, Y) / Support(X)

Lift = Support(X, Y) / (Support(X)  Support(Y))

Leverage = Support(X, Y) – (Support(X)  Support(Y))
For example, consider the following table which has list of Itemsets along with their
transaction ID (TID). The support, confidence and lift can be calculated as follows
TID ITEMS
T1 Bread, Milk
T2 Bread, Jam, Butter, Eggs
T3 Milk, Jam, Butter, Coke
T4 Bread, Milk, Jam, Butter
T5 Bread, Milk, Jam, Coke

From the above table, consider the rule {Milk, Jam}  {Butter}.

In this example, the combination of Milk, Jam and Butter occurs 2 times, while the total number of transactions N = 5.

The support can be calculated as :
Support = Frequency(X, Y) / N
Therefore, Support (S) of ({Milk, Jam, Butter}) = 2/5 = 0.4

Now the confidence can be calculated as :
Confidence = Frequency(X, Y) / Frequency(X)
So, the confidence (C) for the rule {Milk, Jam}  {Butter} would be
Confidence (C) = Frequency(Milk, Jam, Butter) / Frequency(Milk, Jam) = 2/3 = 0.67

Now the lift can be calculated as :
Lift (L) = Support(X, Y) / (Support(X)  Support(Y))
L = Support({Milk, Jam, Butter}) / (Support({Milk, Jam})  Support({Butter}))
Here, Support({Milk, Jam, Butter}) = 0.4, Support({Milk, Jam}) = 3/5 = 0.6 and Support({Butter}) = 3/5 = 0.6
Therefore, Lift (L) = 0.4 / (0.6  0.6) = 1.11

Now the leverage can be calculated as :
Leverage = Support(X, Y) – (Support(X)  Support(Y))
= Support({Milk, Jam, Butter}) – (Support({Milk, Jam})  Support({Butter}))
= 0.4 – (0.6  0.6) = 0.04
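The same numbers can be reproduced with a few lines of base R; the freq() helper below is an assumption introduced only for this illustration.

# Base-R check of the measures for the rule {Milk, Jam} => {Butter}
baskets <- list(c("Bread", "Milk"),
                c("Bread", "Jam", "Butter", "Eggs"),
                c("Milk", "Jam", "Butter", "Coke"),
                c("Bread", "Milk", "Jam", "Butter"),
                c("Bread", "Milk", "Jam", "Coke"))
N <- length(baskets)

# number of transactions containing all of the given items
freq <- function(items) sum(sapply(baskets, function(b) all(items %in% b)))

support    <- freq(c("Milk", "Jam", "Butter")) / N                             # 0.4
confidence <- freq(c("Milk", "Jam", "Butter")) / freq(c("Milk", "Jam"))        # 0.67
lift       <- support / ((freq(c("Milk", "Jam")) / N) * (freq("Butter") / N))  # 1.11
leverage   <- support - (freq(c("Milk", "Jam")) / N) * (freq("Butter") / N)    # 0.04
c(support = support, confidence = confidence, lift = lift, leverage = leverage)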

3.4 Applications of Association Rules


There are many applications of association rules, some of them are described as
below :
1. Market Basket Analysis : This is the most commonplace example of association rules, where data is gathered using standardized barcode scanners in many supermarkets. This database, known as the "market basket" database, comprises an enormous number of records on past transactions. A record in the database lists all the items purchased by a customer in one sale. Knowing which groups are inclined towards which sets of items gives these shops the opportunity to alter the store layout and the store catalog to place related items optimally with respect to each other.
2. Medical Diagnosis : Association rules in medical diagnosis can be beneficial for assisting physicians in treating patients. Diagnosis is not an easy process and has a scope for errors which may result in unreliable end results. Using relational
association rule mining, we can identify the probability of the occurrence of illness
concerning various factors and symptoms. Further, it can be extended by adding

new symptoms and defining relationships between the new signs and the
corresponding diseases.
3. Census Data : Every government has tons of census data which is used to plan efficient public services as well as to help public businesses. This application of association rule mining helps governments support sound public policy and brings forth the efficient functioning of a democratic society.
4. Protein Sequence : Association rules can be used in protein sequence analysis. Protein sequences are made up of twenty types of amino acids. Each protein bears a unique 3D structure which depends on the sequence of these amino acids. A slight
change in the sequence can cause a change in structure which might change the
functioning of the protein. So, association rules can be effectively used to predict the
exact protein sequence.
5. Retail Marketplace : In Retail, association rules can help determine what items are
purchased together, purchased sequentially, and purchased by season. This can
assist retailers to determine product placement and promotion optimization.
6. Telecommunications : In Telecommunications, association rules can be used to
determine what services are being utilized and what packages customers are
purchasing. For instance, telecommunication providers these days also offer TV and Internet services. Bundles for purchases can be determined from an analysis of
what customers purchase, thereby giving the company an idea of how to price the
bundles.
7. Banks : In Financial organizations like Banks, the association rules can be used to
analyze credit card purchases of customers to build profiles for fraud detection
purposes and cross-selling opportunities.
8. Insurance : In Insurance, the association rule mining can be used to build profiles
to detect medical insurance claim fraud by viewing profiles of claims. It allows the profiles to be scanned to determine whether a particular claimant has filed more than one claim within a specified period of time.

3.5 Finding Associations and Finding Similarity


There are various measures used to find the associations and similarity between the
different entities. Finding association involves finding the frequent itemsets based on
association rules using any of two methods like Apriori or FP-Growth methods. Here,
association rules are used for finding those frequent itemsets whose support is greater or
equal to a support threshold. So, if we are looking for association rules X  Y that apply to a reasonable fraction of the baskets in market-basket analysis, then the support of X must be reasonably high. We also want the confidence of the rule to be reasonably high for the rule to have any practical effect. So, for finding the association between the entities, we have to find all the itemsets that meet the support threshold and have a high confidence value.
During the extraction of itemsets, we assume that there are not too many frequent itemsets and thus not too many candidates for high-support, high-confidence association rules. This assumption has important consequences for the efficiency of the algorithms. Therefore, having too many candidates for high-support, high-confidence association rules must be avoided.

3.5.1 Finding Similarity using Distance Measures


The similarity between entities can be measured by finding the distance between two data points. In big data analytics, the similarity can be found using various distance measures. A distance measure is a measure of how similar one observation is compared to a set of other observations. The various distance measures used in big data analytics are explained as follows.

A) Euclidean distance :
The Euclidean distance is the distance between two points in Euclidean space.
Considering two points A and B with associated coordinates {a1, a2, …, an} and {b1, b2, …, bn}, the Euclidean distance between A and B is given by the formula

distance (A, B) = √( (a1 – b1)² + (a2 – b2)² + … + (an – bn)² )

The Euclidean distance between two points cannot be negative, because the positive square root is intended. Therefore, in Euclidean space, the lower the distance between two points, the higher the similarity.

B) Jaccard similarity :
The Jaccard similarity is another measure for finding a similarity index. The Jaccard similarity of sets A and B is nothing but the ratio of the size of the intersection to the size of the union of A and B. It is given by the formula

J(A, B) = |A ∩ B| / |A ∪ B|

Here, the higher the Jaccard similarity, the more similar the sets; equivalently, the lower the Jaccard distance (1 – J(A, B)), the higher the similarity.


C) Cosine similarity :
This method is very similar to the Jaccard similarity, but gives somewhat different results. The cosine distance between two points is the angle that the vectors to those points make. This angle must be in the range of 0 to 180 degrees, regardless of how many dimensions the space has. The cosine similarity can be calculated by the formula :

cos () = (A · B) / (||A|| ||B||) = (A1B1 + A2B2 + … + AnBn) / ( √(A1² + A2² + … + An²)  √(B1² + B2² + … + Bn²) )

D) Edit distance :
The edit distance between two strings x = x1 x2 · · · xn and y = y1 y2 · · · ym is the smallest number of insertions and deletions of single characters that will convert x to y. It is used to find how dissimilar two strings are by counting the number of operations required to convert one string into the other.

E) Hamming distance :
The Hamming distance is another measure for finding similarity. Given a space of vectors, we define the Hamming distance between two vectors to be the number
of components in which they differ. The Hamming distance cannot be negative, and if it
is zero, then the vectors are identical.
The above similarity measures are used in applications like plagiarism software, where textual similarity is found between an original document and plagiarized documents, and mirror-page analysis to detect fake webpages of popular websites, where the pages of these mirror sites are quite similar but rarely identical, and so on.
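The following base-R lines sketch how each of these measures can be computed; the vectors, sets and strings used are purely illustrative.

# Illustrative base-R computation of the distance / similarity measures
a <- c(1, 2, 3); b <- c(2, 4, 6)                       # numeric points / vectors
euclidean <- sqrt(sum((a - b)^2))                      # Euclidean distance

A <- c("bread", "milk", "jam"); B <- c("milk", "jam", "butter")   # sets
jaccard <- length(intersect(A, B)) / length(union(A, B))          # Jaccard similarity

cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))          # cosine similarity

x <- c(1, 0, 1, 1); y <- c(1, 1, 0, 1)                 # equal-length binary vectors
hamming <- sum(x != y)                                 # Hamming distance

edit_dist <- adist("kitten", "sitting")                # edit (Levenshtein) distance, base R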

3.6 Recommendation System


With the increase in the use of the internet, today most websites like Netflix, Amazon, YouTube, etc. use a recommendation system to suggest relevant items for the searched content to their users. It involves predicting user responses to options; such a facility is called a recommendation system. Some examples of recommendation systems are suggesting news videos based on the topic searched by the user, offering articles to online newspaper readers based on a prediction of reader interest, or offering customers a suggestion about what they might like to buy, based on their past history of product searches or purchases on an online retailer's website. In simple words, a recommendation system filters the data using different algorithms and recommends the most relevant items to users. It captures the past behavior of a customer and, based on that, recommends the products which the user might be likely to buy.
The aim of developing recommender systems is to reduce information overload by
retrieving the most relevant information and services from a huge amount of data,
thereby providing personalized services. The most important feature of a recommender
system is its ability to "guess" a user's preferences and interests by analyzing the behavior
of this user and/or the behavior of other users to generate personalized recommendations.
The purpose of a recommender system is to suggest relevant items to users. To achieve
this task, four categories of methods exist, namely collaborative filtering
methods, content-based methods, knowledge based recommendation methods and
hybrid recommendation methods.
There are four major types of recommendation systems shown in Fig. 3.6.1.

3.6.1 Classification of Recommendation Systems


There are four basic types of recommendation systems namely Collaborative, Content-
based, Knowledge based and Hybrid as shown in Fig. 3.6.1.

Fig. 3.6.1 : Classification of recommendation systems

The Collaborative recommendation systems aggregate ratings or recommendations of objects, recognize commonalities between users on the basis of their ratings, and generate new recommendations based on inter-user comparisons. The Content-based recommendation system uses algorithms to recommend to a user items similar to those the user has liked in the past or is examining currently. The Knowledge-based recommendation works by keeping functional knowledge about how a particular item meets a particular user need, and can therefore reason about the relationship between a need and a possible recommendation. This type of system makes suggestions based on information relating to each user's preferences and needs; using functional knowledge, it can draw connections between a customer's need and a suitable product. The Hybrid recommender system combines the strengths of two or more recommender systems and also eliminates weaknesses which exist when only one recommender system is used. The detailed description of the above recommendation systems is given in the next sections.

3.6.2 Applications of Recommendation Systems


Some of the popular applications of recommendation systems are given as follows :
a) Recommendation of news articles : Online news agencies use a recommendation system to identify articles of interest to readers, based on the articles that they have read in the past. The similarity might be based on the similarity of important words in the documents, or on the articles that are read by people with similar reading tastes.
b) Recommendations of movies : These types of applications use a recommendation system to suggest movies a user might like based on the ratings provided by other users. Popular service providers who use movie recommendation are YouTube, Netflix, Amazon Prime, etc.
c) Recommendations of products : This is one of the most popular applications of recommendation systems, used by online retailers. It provides users with suggestions of products that they might like to buy based on the purchasing decisions made by similar customers. Popular online retailers such as Amazon, eBay and Walmart use recommendation systems for their product recommendations.

3.7 Collaborative Recommendation


In collaborative recommendation, instead of using features of items to determine their similarity, the similarity of the user ratings for two items is used. The process of identifying similar users and recommending what similar users like is called collaborative
filtering. Collaborative recommendation techniques help to make choices based on the opinions of others who share similar interests. To make recommendations, they use the collaborative power of the ratings given by users. Two approaches are used for recommendations : the user-based approach and the item-based approach. In the user-based
approach, a user receives recommendations of items liked by similar users while in the
item-based approach, a user receives recommendations of items that are similar to those


they have liked in the past. The similarity between users or items can be calculated by
correlation similarity or cosine-based similarity measures. When calculating the similarity
between items using the above measures, only users who have rated both items are
considered. The process of identifying similar users and recommending what similar
users like is called collaborative filtering or a collaborative recommendation system. The ratings are represented in a utility matrix, with users as rows and items as columns, as shown in Fig. 3.7.1.

Harry Potter 1 Harry Potter 2 Harry Potter 3 Jumanji

User 1 4 5 3
User 2 5 5 4
User 3 4 3 4
User 4 4 4 4

Fig. 3.7.1 Utility matrix for collaborative recommendation

In Fig. 3.7.1, the utility matrix represents the movie ratings provided by different users for the movies Harry Potter 1, Harry Potter 2, Harry Potter 3 and Jumanji on a scale of 1 to 5, where 1 is the lowest and 5 is the highest rating. There are many blank values in the matrix; a collaborative recommendation system can be used to predict those missing values.
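A minimal user-based sketch of this idea in base R is shown below. Since the exact blank cells of Fig. 3.7.1 are not reproduced here, the placement of the NA (missing) ratings, the shortened column names and the cos_sim helper are all assumptions made for illustration.

# Sketch : predicting a missing rating with user-based collaborative filtering
# (NA placement, column names and helper function are illustrative assumptions)
ratings <- matrix(c(4, 5, NA, 3,
                    5, 5, 4, NA,
                    NA, 4, 3, 4,
                    4, NA, 4, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("User", 1:4),
                                  c("HP1", "HP2", "HP3", "Jumanji")))

# cosine similarity between two users over their co-rated items
cos_sim <- function(u, v) {
  ok <- !is.na(u) & !is.na(v)
  sum(u[ok] * v[ok]) / (sqrt(sum(u[ok]^2)) * sqrt(sum(v[ok]^2)))
}

# predict User1's missing rating for HP3 as a similarity-weighted average
sims  <- sapply(2:4, function(i) cos_sim(ratings[1, ], ratings[i, ]))
known <- !is.na(ratings[2:4, "HP3"])
pred  <- sum(sims[known] * ratings[2:4, "HP3"][known]) / sum(sims[known])
pred   # estimated rating of User1 for Harry Potter 3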
The main challenge in designing collaborative recommender is that the ratings
matrices are sparse in nature. Consider an example of a movie application in which users
specify ratings indicating a like or dislike of specific movies. Most users would have watched only a small fraction of the large number of available movies. As a result, most of the ratings are unspecified. The specified ratings are also referred to as observed ratings, while the unspecified ratings are referred to as "missing" or "unobserved" ratings. The basic
idea of collaborative filtering methods is that, these unspecified ratings can be estimated
because the observed ratings are often highly correlated across various users and items.
This similarity can be used to make the interpretations about partly specified values. Most
of the models for collaborative filtering focus on leveraging either item-based correlations
or user-based correlations for the prediction process. Few models use both types of
correlations. This model is then used to estimate the missing values in the matrix, in the
same way that a classifier imputes the missing test labels.
There are two methods that are effectively used in collaborative filtering namely
Memory-based methods and Model-based methods. In Memory-based methods, the
ratings of user and item combinations are predicted on the basis of their neighborhoods


because of this, they are also called neighborhood-based collaborative filtering. Here, the similarity functions are computed between the columns of the ratings matrix to discover similar items. The upsides of memory-based techniques are that they are simple to implement and the resulting recommendations are easy to describe. On the other hand, memory-based algorithms do not work very well with sparse ratings matrices. Although memory-based collaborative filtering algorithms are simple in nature, they tend to be heuristic and do not work well in all circumstances.
In model-based methods, a combination of data mining and machine learning methods is utilized to build predictive models; the model is parameterized, and its parameters are learned within the context of an optimization framework. Examples of model-based methods include rule-based models, Bayesian methods, decision trees, etc. In general, collaborative filtering is used for missing-value analysis where the underlying data matrix for the problem is very large and sparse.

3.8 Content-based Recommendation


The content-based recommendation system focuses on properties of items to find their
similarity. It recommends the items based on a comparison between the content of the
items and a user profile. In this technique, the descriptive attributes of items are used to
make recommendations where the buying behavior of users and ratings are combined
with the content information available in the items. The construction of each item's profile is a must, where a profile is a record or collection of records representing important characteristics of that item. The basic principles of content-based recommender systems are to analyze the descriptions of the items preferred by a particular user, to determine the common attributes or preferences that can be used to distinguish these items and store them in a user profile, and to compare each item's attributes with the user profile so that only items that have a high degree of similarity with the user's profile will be recommended.
Here, the item descriptions are labeled with the ratings, which are used as training data sets to create a user-specific regression or classification model. For each user, the training dataset stores the descriptions of the items that the user has bought or rated, while the class variable corresponds to the specified ratings or buying behavior. Content-based recommender systems use two techniques to generate recommendations. In the first technique, the recommendations are generated heuristically using traditional information retrieval methods like the cosine similarity measure, while in the second technique, the recommendations are generated by building models using statistical learning and machine learning approaches, which are capable of learning users' interests from the historical data of users.
Content-based methods are advantageous in making recommendations for new items which do not yet have a sufficient number of ratings. For such items, the recommendations are made based on the similarity of their attributes to those of other items which might have been rated by the users. Therefore, to make recommendations for such items, no rating history for the new item is required, as the supervised model is able to leverage the user's ratings of other items with similar attributes. The content-based methods also have several disadvantages. First, the recommendations provided tend to be obvious because of the use of keywords or content, and an item unlike anything the user has consumed before has little chance of being recommended. Second, even though content-based methods are effective in providing recommendations for new items, they are not reliable for new users. This is because the training model for the target user needs a large number of ratings from that user in order to make robust predictions without overfitting. Because of this, content-based methods have different trade-offs from collaborative filtering systems.
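A tiny illustration of the content-based idea in base R is given below; the binary genre attributes, the item names and the averaging of liked-item profiles are all assumptions made for this sketch rather than a prescribed method.

# Sketch : content-based scoring of unseen items against a user profile
# (item attributes and profile construction are illustrative assumptions)
items <- matrix(c(1, 0, 1,     # Item1 : action, sci-fi
                  1, 1, 0,     # Item2 : action, comedy
                  0, 1, 0,     # Item3 : comedy
                  0, 0, 1),    # Item4 : sci-fi
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("Item", 1:4),
                                c("action", "comedy", "scifi")))

liked <- c("Item1", "Item4")                       # items the user rated highly
user_profile <- colMeans(items[liked, , drop = FALSE])

cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# score the items the user has not seen and rank them
unseen <- setdiff(rownames(items), liked)
scores <- sapply(unseen, function(i) cosine(items[i, ], user_profile))
sort(scores, decreasing = TRUE)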

3.9 Knowledge based Recommendation


In a Knowledge-Based (KB) recommendation system, the users are offered items based on knowledge about the users, the items and their relationships. It retains a functional knowledge base that defines how a particular item meets a specific user's requirement, and the exact recommendation is performed based on inferences about the relationship between a user's need and a possible recommendation. In this technique, an ontology is used as a knowledge representation method that represents the domain concepts and the relationships between those concepts; it is mostly used to express the domain knowledge and the similarity between items. Most recommendation systems have a limitation known as the cold-start problem, which may arise due to the absence of information about ratings, user history, preferences or clickstreams. It is difficult to resolve the cold-start problem because the absence of information about a user's preferences and the lack of enough user ratings make predictions hard. As we know, in collaborative and content-based systems the recommendations are decided entirely by the user's historic preferences, actions or ratings, which becomes difficult when they are not available. In such cases, knowledge-based systems are used, in which users are allowed to explicitly specify what they want.
Because knowledge-based recommendations are mostly based on specific queries made by the user rather than on the user's rating history, they are mainly useful in the context of items that are not bought very often. Examples include items belonging to financial services, automobiles, real estate, tourism requests and so on. In such examples, enough ratings may not
be available for the recommendation process. As such items are bought rarely, and with many different detailed options, it is difficult to obtain a sufficient number of ratings for a specific instantiation of the item at hand. This situation is another form of the cold-start problem. Furthermore, historic data about such cases may not be meaningful, because consumer preferences evolve over time and a particular item may have many attributes associated with its various properties, while a user may be interested only in items with specific properties. For example, cars may have several makes, models, colors, engine options and interior options, and a user's interest may be very specific to a combination of these options. Thus, in such cases knowledge-based recommender systems are useful : ratings are not used for the purpose of recommendation; rather, the recommendation process is performed on the basis of similarities between customer requirements and item descriptions, or on the use of constraints specifying user requirements. The explicit specification of requirements gives users greater control over the recommendation process. Knowledge-based recommender systems can be classified, on the basis of the type of interface used, into constraint-based recommender systems and case-based recommender systems.
In constraint-based recommender systems, users typically specify requirements or constraints with lower or upper limits on the item attributes, and domain-specific rules are used to match the user's requirements to item attributes. In case-based recommender systems, specific cases are specified by the user as targets or anchor points, and similarity metrics defined on the item attributes are used to retrieve items analogous to these cases. The similarity metrics form the domain knowledge that is used in such systems. The retrieved results are often used as new target cases after some interactive modifications by the user. Note that in both cases, the system offers the user an opportunity to modify the specified requirements. The interactivity in a knowledge-based recommender system is achieved through guidance that takes place in conversational systems, search-based systems or navigation-based systems. In conversational systems, the preferences of users are determined iteratively in the context of a feedback loop, because the item domain is complex and the user preferences can be determined only through an iterative, conversational process. In search-based systems, the user preferences are elicited through a preset sequence of questions that allow the user to specify constraints, while in navigation-based systems the user specifies a number of modification requests to the item currently being recommended, through an iterative series of change requests. Knowledge-based systems work in a way that is largely analogous to content-based systems, but the major difference between them is that content-based systems learn from past user behavior, whereas
knowledge-based recommendation systems recommend on the basis of the active user's explicitly stated requirements and interests.
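To make the constraint-based flavour concrete, the following Python sketch (purely illustrative; the car catalog, attribute names and limits are invented for this example) filters a small catalog against explicitly stated user constraints, which is the essence of matching user requirements to item attributes described above.

# A minimal sketch of a constraint-based recommendation step:
# items that satisfy every user-specified constraint are recommended.
cars = [
    {"model": "Car X", "price": 18000, "seats": 5, "fuel": "petrol"},
    {"model": "Car Y", "price": 32000, "seats": 7, "fuel": "diesel"},
    {"model": "Car Z", "price": 24000, "seats": 5, "fuel": "diesel"},
]

# Hypothetical user requirements expressed as upper/lower limits or exact values.
constraints = {
    "price": lambda v: v <= 25000,
    "seats": lambda v: v >= 5,
    "fuel":  lambda v: v == "diesel",
}

def satisfies(item, constraints):
    """True if the item meets every stated constraint."""
    return all(check(item[attr]) for attr, check in constraints.items())

recommendations = [c["model"] for c in cars if satisfies(c, constraints)]
print(recommendations)   # ['Car Z']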
3.10 Hybrid Recommendation Approaches
The hybrid recommendation technique combines the best features of two or more
recommendation techniques into one to achieve higher performance and overcome the
drawbacks of traditional recommendation techniques. Most of the recommendation
systems now use a hybrid approach by combining collaborative filtering, content-based
filtering, and other approaches. The Hybrid approaches can be applied in numerous
ways: by making content-based and collaborative-based predictions separately and then
combining them or by adding content-based abilities to a collaborative-based approach
(and vice versa) or by merging the methods into a single model. These methods can also
be used to overcome some of the common problems in recommender systems such as
cold start and the sparsity problem, as well as the knowledge engineering bottleneck in
knowledge-based approaches.
Netflix is a well-known example of a hybrid recommender system : the website makes recommendations by comparing the watching and searching habits of similar users (i.e. collaborative filtering) as well as by offering movies that share characteristics with movies that a user has rated highly (i.e. content-based filtering).
Some of the hybrid recommendation techniques are :
 Weighted : The recommendation system combines the scores (weights) of different recommendation components numerically, and the weights can be adjusted as predictions are confirmed or disconfirmed by actual outcomes (see the sketch after this list).
 Switching : Here, the recommendation system switches between different recommenders, choosing whichever is appropriate for the current situation. Suppose the hybrid system combines a content-based and a collaborative recommender. The switching hybrid first deploys the content-based recommender and, if that does not work, it deploys the collaborative recommender.
 Mixed : In mixed mode, the recommendations from more than one technique are presented together, so that the user can choose from a wide range of recommendations. Here, the recommendations from the different recommenders are combined to produce the new recommendation list.
 Feature combination : In this technique, features derived from different knowledge sources are combined together and given to a single recommendation algorithm.
 Feature augmentation : This technique computes a feature or set of features, which then forms part of the input to the next technique.
 Cascade : In this technique, a strict priority is given to the recommenders, with the lower-priority recommenders breaking ties in the scoring of the higher-priority ones.
 Meta-level : In this technique, one recommendation technique is applied and produces some sort of model, which is then used as the input for another technique.
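As mentioned against the weighted strategy above, a minimal sketch of a weighted hybrid is shown below. The scores and weights are invented for illustration; in practice the component scores would come from actual content-based and collaborative models, and the weights would be tuned over time.

# A minimal sketch of the "weighted" hybrid strategy: numeric scores produced
# by two recommendation components are combined linearly per item.
content_scores = {"Item A": 0.9, "Item B": 0.4, "Item C": 0.7}   # hypothetical content-based scores
collab_scores  = {"Item A": 0.2, "Item B": 0.8, "Item C": 0.6}   # hypothetical collaborative scores

weights = {"content": 0.4, "collaborative": 0.6}   # could be adjusted as predictions are confirmed

def weighted_hybrid(item):
    return (weights["content"] * content_scores.get(item, 0.0) +
            weights["collaborative"] * collab_scores.get(item, 0.0))

for item in sorted(content_scores, key=weighted_hybrid, reverse=True):
    print(item, round(weighted_hybrid(item), 2))
# e.g. Item B 0.64, Item C 0.64, Item A 0.48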

Summary
 The association rules are useful for analyzing and predicting customer behavior. They play an important role in applications like market basket analysis, customer-based analytics, catalog design, product clustering and store layout.
 The “Market-basket Analysis” model is commonly used for a better understanding of association rules, where many-to-many relationships between “items” and “baskets” are represented.
 The Apriori is one of the most fundamental algorithms for generating association
rules. It uses support for pruning the itemsets and controlling the exponential
growth of candidate itemsets where smaller candidate itemsets, which are known
to be frequent itemsets, are combined and pruned to generate longer frequent
itemsets.
 The association rule has three important measures that express the degree of
confidence in the rule called Support, Confidence, and Lift.
 Applications of Association Rules are Market Basket Analysis, Medical Diagnosis,
Census Data, Protein Sequence, Telecommunications etc.
 A distance measure is the measure of how similar one observation is compared to
a set of other observations. The similarity between entities can be measured by
finding the closest distance between two data points. The important distance
measures are Euclidean distance, Jaccard distance, Cosine distance, Edit Distance
and Hamming distance.
 A recommendation system filters the data using different algorithms and
recommends the most relevant items to users. It captures the past behavior of a
customer and based on that, recommends the products which the users might be
likely to buy.
 There are four basic types of recommendation systems namely Collaborative,
Content-based, Knowledge based and Hybrid.
Two Marks Questions with Answers [Part A - Questions]


Q.1 What is the significance of association rules in E-commerce applications ?
Ans. : In E-commerce, the association rules go through each item being purchased to see what items are frequently bought together and to determine the list of rules that define the purchasing behavior of a customer. It is one of the key techniques used by many online stores to uncover the associations between items; it works by looking for combinations of items that frequently occur together in a transaction.
Q.2 What is Market-basket analysis ? AU : May-17
Ans. : In Market-basket Analysis, the large collection of transactions is given where
each transaction comprises one or more associated items. In this analysis, the
association rules go through each item being purchased to see what items are
frequently bought together and to determine the list of rules that define the procuring
behavior of a customer. It identifies the strength of association between pairs of
products purchased together and identify patterns of co-occurrence where co-
occurrence is when two or more things take place together. It also allows retailers to
recognize the relationships between the items or itemset that people buy. The term
itemset is nothing but the collection of items or individual entities that contain some
kind of relationship. Technically, the Market Basket Analysis creates If-Then scenario
rules, for example, if item X is purchased then item Y is likely to be purchased. The
rules are probabilistic in nature or, in other words, they are derived from the
frequencies of co-occurrence in the observations. The rules for Market Basket Analysis
could be written as :
IF {X} THEN {Y}
Where X and Y are the items or products
Here, the IF part of the rule ({X} above) is known as the antecedent and the THEN part of the rule is known as the consequent ({Y} above). The antecedent is the condition and the consequent is the result. For example, IF a customer purchases bread, THEN the customer is also likely to purchase butter or eggs.
Q.3 Explain how to calculate the support and confidence for given itemsets
Ans. : In Apriori algorithm, the support is used for pruning the itemsets and
controlling the exponential growth of candidate itemsets where smaller candidate
itemsets, which are known to be frequent itemsets, are combined and pruned to
generate longer frequent itemsets.
For example : Suppose we have provided minimum support 50% and number of
transactions are 4. Therefore, support count can be calculated as
Support count = (50/100) × No. of transactions = (50/100) × 4 = 2
Suppose in the transactions the occurrence of itemset i is 6.
Therefore, Confidence = Support count / Occurrence of itemset i = 2/6 = 0.33, i.e. 33 %.
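The same kind of quantities can be computed programmatically. The sketch below is illustrative only; it uses the standard definitions of support and confidence for a rule X → Y over a toy transaction list, and the transactions and items are invented for the example.

# A minimal sketch of computing support and confidence for a candidate rule X -> Y:
# support(X -> Y) = freq(X ∪ Y) / N, confidence(X -> Y) = freq(X ∪ Y) / freq(X).
transactions = [
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_x, rule_y = {"bread"}, {"butter"}
print(support(rule_x | rule_y, transactions))     # 0.5   -> 50 % support
print(confidence(rule_x, rule_y, transactions))   # 0.666 -> about 67 % confidence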
Q.4 What are the applications of association rules ?
Ans. : There are many applications of association rules; some of them are described below.
1. Market Basket Analysis : This is the most commonplace example of association rules, where data is gathered using standardized barcode scanners in many supermarkets. This database, known as the “market basket” database, comprises an enormous number of records of past transactions. A record in the database lists all the items purchased by a customer in one sale. Knowing which groups of customers are inclined towards which sets of items gives these shops the opportunity to alter the store layout and the store catalog so that related items are placed optimally with respect to each other.
2. Medical Diagnosis : Association rules in medical diagnosis can be beneficial for assisting physicians in treating patients, since diagnosis is not an easy process and has scope for errors which may lead to unreliable end results. Using relational association rule mining, we can identify the probability of the occurrence of an illness with respect to various factors and symptoms. Further, it can be extended by adding new symptoms and defining relationships between the new signs and the corresponding diseases.
3. Census Data : Every government has large volumes of census data, which is used to plan efficient public services as well as to help public businesses. This application of association rule mining helps governments support comprehensive public policy and bring forth the efficient functioning of a democratic society.
4. Protein Sequence : Association rules can be used in protein sequence analysis. Protein sequences are made up of twenty types of amino acids, and each protein bears a unique 3D structure which depends on the sequence of these amino acids. A slight change in the sequence can cause a change in structure, which might change the functioning of the protein. So, association rules can be used effectively to predict the protein sequence.
5. Retail Marketplace : In Retail, association rules can help determine what items
are purchased together, purchased sequentially, and purchased by season. This
can assist retailers to determine product placement and promotion optimization.
6. Telecommunications : In Telecommunications, association rules can be used to


determine what services are being utilized and what packages customers are
purchasing. For instance, Telecommunications these days is also offering TV and
Internet. Creating bundles for purchases can be determined from an analysis of
what customers purchase, thereby giving the company an idea of how to price the
bundles.
7. Banks : In Financial organizations like Banks, the association rules can be used to
analyze credit card purchases of customers to build profiles for fraud detection
purposes and cross-selling opportunities.
8. Insurance : In insurance, association rule mining can be used to build profiles to detect medical insurance claim fraud by examining profiles of claims. It allows profiles to be scanned to determine whether more than one claim belongs to a particular claimant within a specified period of time.
Q.5 Explain the importance of recommendation system along with their applications
Ans. : A recommendation system filters the data using different algorithms and
recommends the most relevant items to users. It captures the past behavior of a
customer and based on that, recommends the products which the users might be likely
to buy. The most important feature of a recommender system is its ability to “guess” a
user’s preferences and interests by analyzing the behavior of this user and/or the
behavior of other users to generate personalized recommendations.
Some of the popular applications of recommendation systems are given as follows
a) Recommendation of News Articles : Online news agencies use recommendation systems to identify articles of interest to readers, based on the articles that they have read in the past. The similarity might be based on the similarity of important words in the documents, or on the articles that are read by people with similar reading tastes.
b) Recommendations of Movies : These types of applications use recommendation systems to suggest movies a user might like, based on ratings provided by other users. Popular service providers who use movie recommendation are YouTube, Netflix, Amazon Prime, etc.
c) Recommendations of Products : This is one of the most popular applications of recommendation systems, used by online retailers. It provides users with suggestions of products that they might like to buy, based on the purchasing decisions made by similar customers. Popular online retailers such as Amazon, eBay and Walmart use recommendation systems for their product recommendations.
Q.6 What is Memory-based and model-based filtering ?
Ans. : In memory-based methods, the ratings of user-item combinations are predicted on the basis of their neighborhoods, which is why they are also called neighborhood-based collaborative filtering. Here, similarity functions are computed between the columns of the ratings matrix to discover similar items. The upside of memory-based techniques is that they are simple to implement and the resulting recommendations are easy to explain.
In model-based methods, a combination of data mining and machine learning methods is utilized to build predictive models : the model is parameterized, and the parameters of this model are learned within the context of an optimization framework. Examples of model-based methods include rule-based models, Bayesian methods, decision trees, etc.

Part - B Questions
Q.1 Explain the Apriori algorithm for mining frequent itemsets with an example.
AU : May-17

Q.2 Write short note on evaluation of candidate rules.


Q.3 Explain different distance measures to find the similarities between different entities.
Q.4 Briefly explain the strategies used in collaborative and content-based recommendation
systems.
Q.5 Explain the classification recommendation system in brief.



UNIT - IV

4 Stream Memory
Syllabus
Introduction to Streams Concepts - Stream Data Model and Architecture - Stream Computing,
Sampling Data in a Stream - Filtering Streams - Counting Distinct Elements in a Stream -
Estimating moments - Counting oneness in a Window - Decaying Window - Real time Analytics
Platform (RTAP) applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions - Using Graph Analytics for Big Data : Graph Analytics.

Contents
4.1 Introduction to Streams Concepts
4.2 Stream Data Model and Architecture
4.3 Stream Computing
4.4 Sampling Data in a Stream
4.5 Filtering Streams
4.6 Counting Distinct Elements in a Stream
4.7 Estimating Moments
4.8 Counting ones in a Window
4.9 Decaying Window
4.10 Real Time Analytics Platform (RTAP)
4.11 Real Time Sentiment Analysis
4.12 Stock Market Predictions
4.13 Graph Analytics
Summary
Two Marks Questions with Answers [Part A - Questions]
Part B - Questions
4.1 Introduction to Streams Concepts


A stream is a sequence of data elements which flows as a group. Big data analytics processes data that is either stored in databases or generated in real time. Traditionally, data was processed in batches after being stored in databases. Batch processing is nothing but the processing of blocks of data that have been stored in a database over a period of time. Such data contains millions of records generated in a day, which are stored as files or records and undergo processing at the end of the day for various kinds of analysis. Batch processing is capable of processing huge volumes of stored data, but with longer periods of latency. For example, all the transactions performed by a financial firm in a week may be processed together to obtain analytics.
At the network level, a stream is the communication of bytes or characters over a socket in a computer network. Stream processing is another big data technology, used to query continuous data streams generated in real time in order to find insights or detect conditions and quickly take action within a small period of time. Data streaming is the process of sending data continuously rather than in batches. In some applications, the data arrives as a continuous stream of events. If we use batch processing in such applications, then we need to stop the data collection at some point and store the data as a batch for processing; when the next batch is processed, we have to aggregate its results with those of previously processed batches, which is difficult and time consuming. In contrast, stream processing supports never-ending data streams : it can detect patterns and inspect results at multiple levels on real-time data without the data being stored in batches, with simpler aggregation of the processed results. For example, with stream processing you can receive an alert when a stock market price crosses a threshold, or get a notification when the temperature reaches freezing point, by querying the data stream coming from a temperature sensor. Data streaming is ideally suited to time series analysis or detecting hidden patterns over time, and it can process multiple streams simultaneously. In the data stream model, individual data items may be relational tuples, for example network measurements, call records, web page visits, sensor readings and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams appears to yield some fundamentally new research problems.

4.1.1 Applications of Data Streaming


The popular applications of data streaming are :
a) In an E-commerce site, the clickstream records are streamed to find anomalous behavior, and a security alert is generated if the clickstream shows suspicious behavior.
b) In financial institutions, customer portfolios are adjusted based on configured constraints in order to track market changes.
c) In a power grid, an alert or notification is generated based on throughput when certain thresholds are reached.
d) In a news source, articles that are relevant to the audience are selected by analyzing clickstream records from various sources together with the audience's demographic information.
e) In network management and web traffic engineering, the streams of packets are
collected and processed to detect anomalies.

4.1.2 Sources of Streamed Data


There are various sources of streamed data which provide data for stream processing. The sources of streaming data range from computer applications to Internet of Things (IoT) sensors. They include satellite data, sensor data, IoT applications, websites, social media data and so on. Various examples of sources of stream data are listed below.
a) Sensor data : Data is received from different kinds of wired or wireless sensors. For example, the real-time data generated by a temperature sensor is provided to the stream processing engine so that action can be taken when a threshold is met.
b) Satellite image data : Data is streamed from satellites to earth stations and can consist of many terabytes of images per day. Surveillance cameras fitted on a satellite produce images that are streamed to a station on earth for processing.
c) Web data : Real-time streams of IP packets generated on the internet are provided to a switching node, which runs queries to detect denial-of-service or other attacks and then reroutes the packets based on information about congestion in the network.
d) Data in online retail stores : Retail firms collect, store and process data about product purchases and services by particular customers in order to understand and analyze customer behavior.
e) Social web data : Data generated through social media websites like Twitter and Facebook is used by third-party organizations for sentiment analysis and prediction of human behavior.
The applications which use data streams are :

 Real-time maps which use location-based services to find the nearest point of interest
 Location-based advertisements or notifications
 Watching streamed videos or listening to streamed music
 Subscribing to online news alerts or a weather forecasting service
 Monitoring
 Performing fraud detection on live online transaction
 Detection of anomaly on network applications
 Monitoring and detection of potential failure of system using network monitoring
tools
 Monitoring embedded systems and industry machinery in real-time using
surveillance cameras
 Subscribing to real-time updates on social media like Twitter, Facebook etc.

4.2 Stream Data Model and Architecture


The stream data model is responsible for receiving and processing real-time data on analytical platforms. The stream data model uses a data stream management system rather than a database management system. It consists of stream processors for managing and processing the data streams. The input for a stream processor is provided by applications, which allows multiple streams to enter the system for processing.

4.2.1 Data Stream Management System


The traditional relational databases are intended for storing and retrieving records of
data that are static in nature. Further these databases do not perceive a notion of time
unless time is added as an attribute to the database during designing the schema itself.
While this model was adequate for most of the legacy applications and older repositories
of information, many current and emerging applications require support for online
analysis of rapidly arriving and changing data streams. This has prompted the development of new models for managing streaming data, resulting in data stream management systems (DSMS), with an emphasis on continuous query languages and query evaluation. Each input stream in a data stream management system may have different data types and data rates. The typical architecture of a data stream management system is shown in Fig. 4.2.1.
Fig. 4.2.1: Architecture of data stream management system

The streams which are input to the stream processor have to be stored in a temporary or working store. The temporary store in the data stream model is a transient store used for holding the parts of the streams that can be queried for processing. The temporary store can be a disk, or main memory, depending on how fast queries must be processed. The results of the queries are stored in a large archival store, but archival data is not normally used for query processing; it can be examined only in special circumstances. A streaming query is a continuous query that executes over the streaming data. Streaming queries are similar to the database queries used for analyzing data, but differ in that they operate continuously on the data as it arrives incrementally in real time. The stream processor supports two types of queries, namely ad-hoc queries and standing queries. In an ad-hoc query, variable-dependent results are generated : each query may generate different results depending on the value of the variable and on when it is asked. The common approach for ad-hoc queries is to store a sliding window of each stream in the working or temporary store; the system does not store the streams in their entirety, because it expects to answer arbitrary queries about the streams from appropriate parts or summaries of the streams. Ad-hoc queries are intended for a specific purpose, in contrast to a predefined query.
Alternatively, standing queries are continuous queries with predetermined functions that execute over the streaming data. For a standing query, each time a new stream element arrives, the aggregate result is updated and produced.
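The behavior of a standing query can be illustrated with a short sketch. The following Python snippet is only an illustration (it is not a DSMS, and the sensor readings are invented) : a generator consumes stream elements one at a time and re-emits a continuously updated aggregate, which is the predetermined, always-on character of a standing query.

# A minimal sketch of a standing query: each arriving element updates and
# re-emits the aggregate answer (here a running maximum and mean).
def standing_query(stream):
    count, total, maximum = 0, 0.0, float("-inf")
    for reading in stream:                 # elements arrive one at a time
        count += 1
        total += reading
        maximum = max(maximum, reading)
        yield {"max": maximum, "mean": total / count}   # continuously updated answer

sensor_stream = [21.5, 22.0, 23.7, 22.1]   # hypothetical temperature readings
for answer in standing_query(sensor_stream):
    print(answer)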
4.2.2 Data Streaming Architecture


A streaming data architecture is a framework for processing huge volumes of streaming data from multiple sources. Traditional data solutions concentrate on consuming and processing data in batches, whereas a streaming data architecture consumes data immediately as it is produced, stores it in storage media, and performs real-time data processing, manipulation and analytics. Most streaming architectures are built from solutions specific to problems such as stream processing, data integration, data storage and real-time analytics. The generalized streaming architecture is composed of four components, namely the message broker or stream processor, ETL tools, the query engine and streaming data storage, as shown in Fig. 4.2.2.
The first component of the data streaming architecture is the message broker or stream processor. It receives the data from producers and translates the streams into a standard message format, and the other components in the architecture can consume the messages passed by the broker. Popular legacy message brokers are RabbitMQ and Apache ActiveMQ, which are based on message-oriented middleware, while the newer messaging platforms for stream processing are Apache Kafka and Amazon Kinesis.
The second component of the data stream architecture is the batch or real-time ETL (Extract, Transform and Load) tool, which streams data from one or more message brokers and aggregates or transforms it into a well-defined structure before the data can be analyzed with SQL-based analytics tools.

Fig. 4.2.2 : Generalized data streaming architecture
The ETL platform receives queries from users; based on these, it fetches the events from the message queues and applies the query to the stream data to generate a result, performing additional joins, transformations or aggregations. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream. Popular ETL tools for streaming data are Apache Storm, Spark, Flink and Samza.
The third component of the data stream architecture is Query Engine which is used
once streaming data is prepared for consumption by the stream processor. Such data
must be analyzed to provide valuable insights, while the fourth component is streaming
data storage, which is used to store the streaming event data into different data storage
mediums like data lakes.
Stream data processing provides several benefits, such as the ability to deal with never-ending streams of events, real-time data processing, detection of patterns in time-series data and easy data scalability, while some of its limitations are network latency, limited throughput, slow processing, support only for window-sized portions of streams, and limitations related to in-memory access to the stream data.
The common examples of data stream applications are
 Sensor networks : Which is a huge source of data occurring in streams and are used in
numerous situations that require constant monitoring of several variables, based on
which important decisions are made.
 Network traffic analysis : In which, network service providers can constantly get
information about Internet traffic, heavily used routes, etc. to identify and predict
potential congestions or identify potentially fraudulent activities.
 Financial applications : In which online analysis of stock prices is performed; this is used for making sell decisions, quickly identifying correlations with other products, understanding fast-changing trends and, to an extent, forecasting future valuations of the product.
Queries over continuous data streams have much in common with queries in a traditional DBMS. Two types of queries can be identified as typical over data streams, namely one-time queries and continuous queries :
a) One-time queries : One-time queries are queries that are evaluated once over a
point-in-time snapshot of the data set, with the answer returned to the user. For
example, a stock price checker may alert the user when a stock price crosses a
particular price point.
b) Continuous queries : Continuous queries, on the other hand, are evaluated
continuously as data streams continue to arrive. The answer to a continuous query is
produced over time, always reflecting the stream data seen so far. Continuous query
answers may be stored and updated as new data arrives, or they may be produced as
data streams themselves.

4.2.3 Issues in Data Stream Query Processing


Apart from benefits, there are some issues in data stream query processing which are
explained as follows
a) Unbounded memory requirements : Since data streams are potentially unbounded
in size, the amount of storage required to compute an exact answer to a data stream
query may also grow without bound. Algorithms that use external memory are not
well-suited to data stream applications since they do not support continuous queries.
For this reason, we are interested in algorithms that are able to confine themselves to
main memory without accessing disk.
b) Approximate query answering : When we are limited to a bounded amount of
memory, it is not always possible to produce exact answers for the data stream
queries; however, high-quality approximate answers are often acceptable in lieu of
exact answers.
c) Sliding windows : One technique for approximate query answering is to evaluate
the query not over the entire past history of the data streams, but rather only over
sliding windows of the recent data from the streams. Imposing sliding windows on
data streams is a natural method of approximation that has several attractive properties; moreover, for many applications, sliding windows are required as part of the desired query semantics, explicitly expressed in the user's query.
d) Blocking operators : A blocking query operator is a query operator that is unable to
produce an answer until it has seen its entire input.

4.3 Stream Computing


Stream computing is a computing paradigm that reads data from collections of sensors in stream form and, as a result, computes over continuous real-time data streams. Stream computing enables graphics processing units (GPUs) to work in coordination with high-performance, low-latency CPUs to solve complex computational problems. A data stream in stream computing is a sequence of data sets, and a continuous stream carries an infinite sequence of data sets. Stream computing can be applied to high-velocity streams of data from real-time sources such as market data, mobile devices, sensors, clickstreams and even transactions. It empowers organizations to analyze and follow up on rapidly changing data in real time, upgrade existing models with new insights, capture, analyze and act on insights, and move from batch processing to real-time analytical decisions. Stream computing supports low-latency processing and massively parallel processing architectures to obtain useful knowledge from big data. Consequently, the stream computing model is a new trend for high-throughput computing in big data analytics. Organizations that use stream computing include telecommunications, health care, utility companies, municipal transit, security agencies and many more. Two popular use cases of stream computing are distribution load forecasting, condition-based maintenance and smart meter analytics in the energy industry, and monitoring a continuous stream of data to generate alerts when an intrusion is detected on a network through a sensor input.

4.3.1 Stream Computing Architecture


The architecture of stream computing consists of five components namely: Server,
integrated development environment, database connectors, streaming analytics engine
and data mart. The generalized architecture of stream computing is shown in Fig. 4.3.1.
In this architecture, the server is responsible for processing the real-time streaming
event data with high throughputs and low latency. The low latency is provided by means
of processing the streams in a main memory.

Fig. 4.3.1 : Generalized architecture of stream computing
The integrated development environment (IDE) is used for debugging and testing stream processing applications that process streams using streaming operators; it supports visual development of applications and provides filtering, aggregation and correlation methods for streamed data, along with a user interface for time-window analysis. The database connectors provide rule engines and stream processing engines for processing streamed data with multiple DBMS features; common main-memory DBMSs and rule engines have to be redesigned for use in stream computing. The streaming analytics engine allows management, monitoring and real-time analytics for real-time streaming data, while the data mart is used for storing live data for processing, with additional features such as operational business intelligence. It also provides automated alerts for events.

4.3.2 Advantages of stream computing


The advantages of stream computing are listed as follows :
 It provides simple and extremely complex analytics with agility
 It is scalable as per computational intensity
 It supports a wide range of relational and non-relational data types
 It can analyze continuous, massive volumes of data at rates up to petabytes
 Performs complex analytics of heterogeneous data types including text, images,
audio, voice, VoIP, video, web traffic, email, GPS data, financial transaction data,
satellite data, sensors, and any other type of digital information that is relevant to
your business.
 Leverages sub-millisecond latencies to react to events and trends as they are
unfolding, while it is still possible to improve business outcomes.
 It can seamlessly deploy applications on any size computer cluster and adapts to
work in rapidly changing environment.

4.3.3 Limitations of Stream Computing


 In an extreme case security and data confidentiality is the main concern in Stream
computing.
 The flexibility, resiliency and data type handling are the serious considerations in
stream computing.
In data stream processing, the three important operations used are sampling, filtering and counting distinct elements in the stream, which are explained in the subsequent sections.

4.4 Sampling Data in a Stream


Sampling in a data stream is the process of collecting and representing a sample of the elements of the data stream. The sample is usually a much smaller subset of the entire stream, but it is designed to retain the original characteristics of the stream. Elements that are not stored in the sample are lost forever and cannot be retrieved. The sampling process is intended to extract reliable samples from a stream. Data stream sampling uses many streaming algorithms and techniques to extract the sample, but the most popular technique is hashing.
Sampling is used when we want to run queries on a subset of the stream that is statistically representative of the stream as a whole. In such cases, ad-hoc queries can be run on the sample, with hashing used to select it.
For example, suppose we want to study users' behavior on a search engine, which receives a stream of queries. Assume that the stream consists of tuples (user, query, time), and that we want to answer an ad-hoc query such as “what fraction of the typical user's search queries were repeated over the past month?” One approach is to generate a random number between 0 and 9 in response to each search query and store the tuple if and only if the random number is 0, so that on average 1/10th of each user's queries are stored. However, statistical fluctuations introduce noise when users issue large numbers of queries, and this scheme gives the wrong answer to the query asking for the fraction of duplicate queries for a user. To see this, suppose a user issued S search queries once in the month and T search queries twice (and none more often). For the full stream, the correct answer to the query about the fraction of repeated searches is T/(S + T). In the 1/10th sample, however, we can expect only S/10 of the singleton queries to appear, and for a twice-issued query the probability that both occurrences fall in the sample is (1/10) × (1/10) = 1/100, so only about T/100 of the repeated queries show up as repeats, which biases the estimate.
To obtain a representative sample, we instead sample users rather than individual queries, using “in” and “out” labels. If we have already seen search records for this user, we follow the stored decision and do nothing further; if we have no record for the user, we generate a random integer between 0 and 9. If the number generated is 0, then we add this user to our list with the value “in”, otherwise we add the user with the value “out.”
This method works well as long as we keep the list of all users and in/out decision in
main memory. By using a hash function, one can avoid keeping the list of users such that
for each user name hash to one of ten buckets, 0 to 9. Therefore, if the user hashes to
bucket 0, then accept this search query for the sample, and if not, then not. Effectively, we
use the hash function as a random number generator and without storing the in/out
decision for any user, we can reconstruct that decision any time a search query by that
user arrives.
The generalized sampling problem involves streams of tuples with n components. A subset of the components are the key components, on which the selection of the sample is based. In our example, the components are user, query and time, and the user is the key. In general, to generate a sample consisting of a fraction a/b of the stream's key values, we hash the key value of each tuple to one of b buckets and accept the tuple for the sample if the hash value is less than a. The result is a sample consisting of all tuples with certain key values, and the selected key values will be approximately a/b of all the key values appearing in the stream. While sampling methods reduce the amount of data to process, and consequently the computational cost, they can also be a source of error. The main problem is to obtain a representative sample, that is, a subset of the data that has approximately the same properties as the original data.
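The hashing trick described above can be sketched in a few lines of Python. This is only an illustration (the md5-based hash, the a = 1, b = 10 split and the example tuples are assumptions made here, not part of the textbook) : hashing the key (the user) to b buckets and keeping buckets below a reproduces the in/out decision without storing a user list.

# A minimal sketch of key-based stream sampling: keep a tuple iff its key hashes
# to one of the first a of b buckets, giving roughly an a/b sample of users.
import hashlib

def bucket(key, b=10):
    """Deterministically map a key to one of b buckets."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % b

def keep(user, a=1, b=10):
    return bucket(user, b) < a          # keep roughly a/b of all users

stream = [("alice", "cheap flights", "10:01"), ("bob", "weather", "10:02"),
          ("alice", "cheap flights", "10:05")]
sample = [t for t in stream if keep(t[0])]   # all or none of a given user's queries are kept
print(sample)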

4.4.1 Types of Sampling


There are three basic types of sampling, explained as follows
4.4.1.1 Reservoir Sampling

In reservoir sampling, randomized algorithms are used for randomly choosing samples from a list of items, where the list is either very large or of unknown length. For example, imagine you are given a really large stream of data and your goal is to efficiently return a random sample of 1000 elements evenly distributed over the original stream. If the length N were known and the data were accessible by index, a simple way would be to generate random integers between 0 and (N − 1) and retrieve the elements at those indices; reservoir sampling achieves the same effect in a single pass without knowing N in advance.
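A minimal sketch of the classic single-pass reservoir idea (often called Algorithm R) is shown below; it is illustrative rather than the textbook's code, and it assumes the stream length is unknown in advance.

# A minimal sketch of reservoir sampling: fill the reservoir with the first k
# elements, then replace a random slot so that every element is kept with equal probability.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)      # fill the reservoir first
        else:
            j = random.randint(0, i)       # random index in [0, i]
            if j < k:                      # element i is chosen with probability k/(i+1)
                reservoir[j] = element
    return reservoir

print(reservoir_sample(range(1, 1001), 10))   # 10 elements drawn uniformly from 1..1000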
4.4.1.2 Biased Reservoir Sampling

Biased reservoir sampling uses a bias function to regulate the sampling from the stream. In many cases, the stream data may evolve over time, and the corresponding data mining or query results may also change over time. Thus, the results of a query over a
more recent window may be quite different from the results of a query over a more
distant window. Similarly, the entire history of the data stream may not be relevant for use
in a repetitive data mining application such as classification. The simple reservoir
sampling algorithm can be adapted to a sample from a moving window over data
streams. This is useful in many data stream applications where a small amount of recent
history is more relevant than the entire previous stream. This will give a higher
probability of selecting data points from recent parts of the stream as compared to distant
past. The bias function in sampling is quite effective since it regulates the sampling in a
smooth way so that the queries over recent horizons are more accurately resolved.
4.4.1.3 Concise Sampling

The size of the reservoir is sometimes restricted by the available main memory, and it is desirable to increase the sample size within the available main-memory restrictions. For this purpose, the technique of concise sampling is quite effective. Concise sampling exploits the fact that the number of distinct values of an attribute is often significantly smaller than the size of the data stream. In many applications, sampling is performed based on a single attribute of multi-dimensional data; that type of sampling is called concise sampling. For example, for customer data in an e-commerce site, sampling may be done based only on customer ids. The number of distinct customer ids is definitely much smaller than n, the size of the entire stream.

4.5 Filtering Streams


Data stream processing supports another operation called selection, or filtering. Filtering is the process of accepting the tuples in the stream that meet a selection criterion; accepted tuples are passed to another process as a stream, while rejected tuples are dropped. If the selection criterion is a property that can be computed from the tuple itself, filtering is easy; but if the criterion involves a lookup for membership in a set that is too large to store in main memory, it becomes hard to filter the stream.
The Hashes are the individual entries in a hash table that act like the index. The hash
function is used to produce the hash values where input is an element containing
complex data, and the output is a simple number that acts as an index to that element. A
hash function is deterministic in nature because it produces the same number every time
you feed it a specific data input.
Let us take an example. Suppose we have a set S of one million allowed email addresses which we believe are not spam, and the stream consists of pairs of an email address and the email itself. As each email address consumes 20 bytes or more of space, it may not be reasonable to keep the set S in main memory, so it would have to be stored and accessed on disk. Suppose instead we use main memory as a bit array of eight million bits, and we use a hash function h that maps email addresses to eight million buckets. Since there are one million members of S, approximately 1/8th of the bits will be 1 and the rest will be 0. As soon as a stream element arrives, we hash its email address; if the bit at the hashed position is 1, then we let the email through, otherwise we drop the stream element. Some spam email will still get through, so to eliminate every spam message we would need to check membership in S for the good and bad emails that pass the filter. The Bloom filter is used in such cases to eliminate most of the tuples that do not meet the selection criterion.

4.5.1 Bloom Filter


The purpose of the Bloom filter is to allow through all the stream elements whose keys K lie in a set S, while rejecting most of the stream elements whose keys are not in S. The basic Bloom filter algorithm consists of test and add methods, where test is used to check whether a given element is in the set or not. If test returns false, we conclude that the element is definitely not in the set; if it returns true, we consider that the element is probably in the set, and the false positive rate is a function of the Bloom filter's size and the number of independent hash functions used. The add method simply adds an element to the set; removal is not possible without introducing false negatives, although extensions to the Bloom filter that support removal exist.
Typically, a Bloom filter consists of three elements :
a) An array/vector of n bits, whose bits are initially all set to 0.
b) A group of hash functions {h1, h2, . . . , hk}, where each hash function maps “key” values K to n buckets, corresponding to the n bits of the array.
c) A set S of m key values.
In Bloom filtering, the first step is to initialize the n-bit array by setting all bits to 0. Then we take each key value K in the set S and hash it with each of the k hash functions, setting to 1 every bit that is hi(K) for some hash function hi and some key K in S.
To test a key K that arrives in the stream, check whether all of the positions h1(K), h2(K), . . . , hk(K) hold 1's in the bit array. If all of these bits are 1, then let the stream element pass through; otherwise discard it. That is, if one or more of these bits is still 0, then K cannot be in S, so the stream element is rejected. To find out how many elements are wrongly passed, we need to calculate the probability of a false positive as a function of n, the bit-array length, m, the number of members of S, and k, the number of hash functions.
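The construction and test just described can be sketched as follows. This Python snippet is illustrative only : the use of salted SHA-256 digests to simulate k independent hash functions and the 8,000,000-bit array size (matching the e-mail example) are assumptions of this sketch, not a prescribed implementation.

# A minimal sketch of a Bloom filter: n bits, k hash functions, add() sets bits,
# test() may give false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=8_000_000, k=3):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):            # salted hashes stand in for k independent hash functions
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)          # set bit p to 1

    def test(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

allowed = BloomFilter()
allowed.add("alice@example.com")
print(allowed.test("alice@example.com"))    # True
print(allowed.test("spammer@example.com"))  # almost certainly False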
Let us take an example of a model in which darts are thrown at targets. Suppose we have T targets and D darts, and any dart is equally likely to hit any target. The analysis of how many targets we can expect to be hit at least once proceeds as follows (a small numeric check is given after the list) :
 The probability that a given dart will not hit a given target is (T − 1)/T.
 The probability that none of the D darts will hit a given target is ((T − 1)/T)^D.
 Using the approximation (1 − 1/T)^D ≈ e^(−D/T), the probability that none of the D darts hits a given target is approximately e^(−D/T).
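Applying this approximation to the earlier e-mail example gives a quick numeric check. The snippet below is a small illustration only; it assumes the standard generalisation to k hash functions, in which D = k × m “darts” are thrown and the false-positive rate is roughly (1 − e^(−km/n))^k.

# A small numeric check of the dart analysis for the e-mail example
# (n = 8,000,000 bits, m = 1,000,000 allowed addresses).
import math

def false_positive_rate(n_bits, m_keys, k_hashes):
    return (1 - math.exp(-k_hashes * m_keys / n_bits)) ** k_hashes

print(round(false_positive_rate(8_000_000, 1_000_000, 1), 4))  # ~0.1175 (about 11.75 %)
print(round(false_positive_rate(8_000_000, 1_000_000, 4), 4))  # lower, roughly 0.024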

4.6 Counting Distinct Elements in a Stream


After sampling and filtering of a data stream, the third kind of processing is the count-distinct problem. As with sampling and filtering, the aim is to work within a reasonable amount of main memory per stream, using a variety of hashing techniques and randomized algorithms.

4.6.1 The Count-Distinct Problem


The count-distinct problem is used for finding the number of distinct elements in a
data stream with repeated elements. Suppose stream elements are chosen from some
universal set. We would like to know how many different elements have appeared in the
stream, counting either from the beginning of the stream or from some known time in the
past. A simple solution is to traverse the given array, consider every window in it and
count distinct elements in the window.
For example : Given an array of size n and an integer k, return the count of distinct numbers in all windows of size k, where k = 4 and the input array is {1, 2, 1, 3, 4, 2, 3}.
Here as window size k = 4, in the first pass the window would be {1, 2, 1, 3}. So, the
count of distinct numbers in first pass is 3. In second pass, window would be {2, 1, 3, 4}
and the count of distinct numbers in second pass is 4. In third pass, window would be
{1, 3, 4, 2} and the count of distinct numbers in third pass is 4 and in fourth pass, window
would be {3, 4, 2, 3}, the count of distinct numbers in fourth pass is 3. Therefore, the final
count of distinct numbers are 3, 4, 4, 3.
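The brute-force window count used in this example can be written directly, as in the short sketch below (illustrative only).

# A minimal sketch of the brute-force window count: slide a window of size k
# over the array and count the distinct values in each window.
def distinct_per_window(values, k):
    return [len(set(values[i:i + k])) for i in range(len(values) - k + 1)]

print(distinct_per_window([1, 2, 1, 3, 4, 2, 3], 4))   # [3, 4, 4, 3], matching the worked example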
Let us take another example : suppose we want to find out how many unique users have accessed a particular website, say Amazon, in a given month, based on gathered statistics. Here the universal set would be the set of logins and IP addresses (sequences of four 8-bit bytes) from which queries are sent to that site. The easiest way to solve this problem is to keep in main memory a list of all the elements seen in the stream, arranged in a search structure such as a hash table or search tree so that new elements can be added quickly; this yields the exact number of distinct elements that appear in the stream. However, if the number of distinct elements is too large, we cannot store them all in main memory. One solution is to use several machines, each handling only one or a few of the streams, or to store most of the data structure in secondary memory.

4.6.2 The Flajolet-Martin Algorithm


The Flajolet-Martin algorithm is used for estimating the number of distinct elements in a stream. It approximates the number of unique objects in a single pass. It is possible to estimate the number of distinct elements by hashing the elements of the universal set to a bit-string; the basic property of a hash function is that, when applied to the same element, it always produces the same result. The intuition behind the Flajolet-Martin algorithm is that the more different elements there are in the stream, the more different hash values we shall see, and the more likely it is that one of these values is “unusual” (for example, ending in many 0's). If the stream contains n elements, of which m are unique, the algorithm runs in O(n) time and needs O(log m) memory space. The Flajolet-Martin algorithm is given as follows

Flajolet-Martin Algorithm :
1) Pick a hash function h that maps each of the n elements to at least log2(n) bits.
2) For each stream element x, let r(x) be the number of trailing 0's in h(x).
3) Record R = the maximum r(x) seen.
4) Estimate the count of distinct elements as 2^R.
The steps for counting distinct elements in a stream using Flajolet-Martin algorithm is
as follows :
Step 1 : Create a bit array/vector of length L, and suppose there are n elements in the stream, such that 2^L > n.
Step 2 : The ith bit in the array/vector represents a hash value whose binary representation ends in 0^i. So, initialize each bit to 0.
Step 3 : Generate a feasible random hash function that maps input string to natural
numbers.
Step 4 : For each word in an input stream perform hashing and determine the number
of trailing zeros, such that if the number of trailing zeros is k, set the kth bit in the bit
array/vector to 1.
Step 5 : When the input is exhausted, get the index of the first 0 (call it R) in the bit array/vector, i.e. count the number of consecutive 1's starting at position 0. This means we have seen hash outputs with trailing-zero runs of lengths 0, 1, ..., R − 1.

Step 6 : Calculate the number of unique words as 2^R/ϕ, where ϕ ≈ 0.77351.

Step 7 : This implies that our count can be off by a factor of 2 for 32 % of the observations, off by a factor of 4 for 5 % of the observations, off by a factor of 8 for 0.3 % of the observations, and so on, since the standard deviation of R is a constant : σ(R) = 1.12. (R can be off by 1 for 1 − 0.68 = 32 % of the observations, off by 2 for about 1 − 0.95 = 5 % of the observations, and off by 3 for 1 − 0.997 = 0.3 % of the observations, using the empirical rule of statistics.)

Step 8 : To improve the accuracy of this approximation, use averaging : apply multiple hash functions and use the average R instead of a single value. Then use bucketing : group the hash functions into several buckets and take the median of the per-bucket averages, since plain averages are susceptible to large fluctuations while the median of averages gives fairly good accuracy. Use an appropriate number of hash functions in the averaging and bucketing steps, because the accuracy depends on the number of hash functions : more hash functions give higher accuracy, but also higher computation cost. (A short sketch of this estimate is given after these steps.)
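The estimate 2^R can be sketched in a few lines of Python. The snippet below is illustrative only : it uses a single SHA-1-based hash function, whereas, as noted in Step 8, a practical estimator would average and bucket the results of many hash functions.

# A minimal sketch of the Flajolet-Martin estimate 2^R with a single hash function.
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 0            # convention for this sketch; a fixed-width variant would return the word size
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    r_max = 0
    for element in stream:
        h = int(hashlib.sha1(str(element).encode()).hexdigest(), 16)
        r_max = max(r_max, trailing_zeros(h))
    return 2 ** r_max        # estimate of the number of distinct elements

print(fm_estimate([4, 2, 5, 9, 1, 6, 3, 7, 4, 2]))   # a power-of-2 estimate of the 8 distinct values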
Example 1 : Given a stream S = {4, 2, 5 ,9, 1, 6, 3, 7} and hash function h(x) = (ax + b)
mod 32. So, count the distinct elements in a stream using Flajolet-Martin (FM) algorithm
and treat the result as a 5-bit binary integer.
In a given example, the hash function is given as h(x) = (ax + b) mod 32. So, to estimate
the number of elements appearing in a stream, we have to use hash function to integer
elements interpreted as binary numbers and find out 2 raised to the power of that which
is the longest sequence of 0's seen in the hash value of any stream element is an estimate
of the number of distinct elements.
Let us assume a = 3 and b = 7; therefore the hash function h(x) would be 3x + 7 mod 32.
So, calculate the hash value in binary format for each stream in S i.e. S = {4, 2, 5 ,9, 1, 6,
3, 7}
h(4) = 3(4) + 7 mod 32 = 19 mod 32 = 19 = (10011)
h(2) = 3(2) + 7 mod 32 = 13 mod 32 = 13 = (01101)
h(5) = 3(5) + 7 mod 32 = 22 mod 32 = 22 = (10110)
h(9) = 3(9) + 7 mod 32 = 34 mod 32 = 2 = (00010)
h(1) = 3(1) + 7 mod 32 = 10 mod 32 = 10 = (01010)
h(6) = 3(6) + 7 mod 32 = 25 mod 32 = 25 = (11001)
h(3) = 3(3) + 7 mod 32 = 16 mod 32 = 16 = (10000)
h(7) = 3(7) + 7 mod 32 = 28 mod 32 = 28 = (11100)
Now find the number of trailing 0's in each binary output by observing the rightmost zeroes. The trailing zeros for the given stream are {0, 0, 1, 1, 1, 0, 4, 2}.
Therefore, R is the maximum number of trailing zeros, i.e. R = 4.
So, the number of distinct elements is N = 2^R = 2^4 = 16.
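The computation above can be reproduced with a short Python sketch (a minimal illustration; the helper names and the 5-bit convention follow the worked example and are not from any library) :
```python
def trailing_zeros(value, bits=5):
    # Number of trailing 0's in the 'bits'-bit binary representation.
    # A hash value of 0 contributes 'bits' zeros by convention.
    if value == 0:
        return bits
    count = 0
    while value & 1 == 0:
        count += 1
        value >>= 1
    return count

def fm_estimate(stream, h, bits=5):
    # Flajolet-Martin estimate: 2^R, where R is the maximum tail length seen.
    R = max(trailing_zeros(h(x), bits) for x in stream)
    return 2 ** R

stream = [4, 2, 5, 9, 1, 6, 3, 7]
h = lambda x: (3 * x + 7) % 32
print(fm_estimate(stream, h))   # prints 16, matching the worked example
```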
Example 2 : Given a Stream S = {1,3,2,1,2,3,4,3,1,2,3,1} and hash function h(x) = (6x+1)
mod 5, treat the result as a 5-bit binary integer.
So, calculate the hash value, in binary format, for each element of the stream S = {1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1} :
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(2) = (6 * (2) +1) mod 5 = 13 mod 5 = 3 = (00011)
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
h(2) = (6 * (2) +1) mod 5 = 13 mod 5 = 3 = (00011)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(4) = (6 * (4)+1) mod 5 = 25 mod 5 = 0 = (00000)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
h(2) = (6 * (2) +1) mod 5 = 13 mod 5 = 3 = (00011)
h(3) = (6 * (3)+1) mod 5 = 19 mod 5 = 4 = (00100)
h(1) = (6 * (1)+1) mod 5 = 7 mod 5 = 2 = (00010)
Now find the number of trailing 0's in each binary output by observing the rightmost zeroes. The trailing zeros for the given stream are {1, 2, 0, 1, 0, 2, 5, 2, 1, 0, 2, 1} (the hash value 0 = (00000) contributes 5 trailing zeros).
Therefore, R is the maximum number of trailing zeros, i.e. R = 5.
So, the estimated number of distinct elements is N = 2^R = 2^5 = 32. (The true number of distinct elements here is only 4; a single weak hash function can overestimate badly, which is why averaging and bucketing over many hash functions are used in practice.)
In general, whenever we apply a hash function H to a stream element a, the bit string H(a) will end in some number of 0's. Call this number the tail length for a and H. Let N be the maximum tail length seen in the stream. Then we use 2^N as the estimate of the number of distinct elements seen in the stream. This estimate makes intuitive sense because :
a) The probability that a given stream element a has a hash value H(a) ending in at least n 0's is 2^-n.
b) If there are m distinct elements in the stream, the probability that none of them has tail length at least n is (1 - 2^-n)^m.
From this we conclude that if m is much larger than 2^n, the probability of finding a tail length of at least n approaches 1, while if m is much smaller than 2^n, that probability approaches 0. There is, however, a trap in the strategy for combining the estimates of m obtained from many different hash functions. If we simply take the average of the values 2^N, the average is dominated by overestimates : suppose 2^n is much larger than m and there is some probability p that n is the largest number of trailing 0's seen for any stream element; then the probability that n + 1 is the largest is still at least p/2, while its contribution to the average, 2^(n+1), is twice as large. This is why the median of averages over buckets of hash functions is preferred.
As for the space requirement, while reading the stream we keep only one integer per hash function in main memory; that integer records the largest tail length seen so far for that hash function. In practice, processing even a single stream could use millions of hash functions, which is far more than are needed for a good estimate, so the number of hash functions becomes a real main-memory constraint only when many streams must be processed at the same time.
4.7 Estimating Moments
A useful generalization of the problem of counting distinct elements in a stream is the problem of computing "moments", which concerns the distribution of frequencies of the different elements in the stream. Suppose a
stream consists of elements chosen from a universal set U whose elements can be ordered, and let m_i be the number of occurrences of the i-th element. Then the k-th order moment of the stream is the sum over all i :
F_k = Σ_{i ∈ U} (m_i)^k
Here, the 0th moment of the stream is the sum of 1 for each m_i > 0, i.e. the number of distinct elements. The 1st moment is the sum of all the m_i, which is simply the length of the stream. The 2nd moment is the sum of the squares of the m_i; it is also called the surprise number S, since it measures how uneven the distribution of elements in the stream is : the smaller the 2nd moment, the less skewed the distribution.
For example, suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would be one element appearing 10 times and the other ten appearing 9 times each. In this case the surprise number would be 1 × 10^2 + 10 × 9^2 = 910. Since we cannot keep a count for every element that appears in the stream in main memory, we need to estimate the k-th moment of a stream by keeping a limited number of values in main memory and computing an estimate from those values.
Examples :
Consider the following occurrence counts (the frequency of each distinct element) and calculate the surprise number :
1) 5, 5, 5, 5, 5 : surprise number = 5 × 5^2 = 125
2) 9, 9, 5, 1, 1 : surprise number = 2 × 9^2 + 1 × 5^2 + 2 × 1^2 = 189
To estimate the second moment of the stream with a limited amount of main memory we can use the Alon-Matias-Szegedy (AMS) algorithm; the more space we use, the more accurate the estimate will be. In this algorithm we compute some number of variables X. For each variable X we store X.element, a particular element of the universal set, and an integer X.value. To define a variable X, we choose a position in the stream between 1 and n at random, set X.element to the element found at that position and initialize X.value to 1. As we continue reading the stream, we add 1 to X.value each time we encounter another occurrence of X.element. Technically, this treatment of the second and higher moments assumes that the stream length n is a constant, whereas in practice n grows with time; so we store only the values of the variables and multiply the appropriate function of those values by n when it is time to estimate the moment.
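A minimal Python sketch of the AMS estimator just described; it assumes the standard form of the estimate, n(2 · X.value - 1) per variable, averaged over several variables, and all names are illustrative :
```python
import random

def ams_second_moment(stream, num_vars=20):
    # Estimate the 2nd moment (surprise number) of a stream of known length n.
    n = len(stream)
    estimates = []
    for _ in range(num_vars):
        pos = random.randrange(n)           # pick a random position in the stream
        element = stream[pos]               # X.element
        value = 0                           # X.value: occurrences from pos onward
        for item in stream[pos:]:
            if item == element:
                value += 1
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)  # average over the variables

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
exact = sum(stream.count(x) ** 2 for x in set(stream))   # 21 for this stream
print(exact, ams_second_moment(stream))
```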
4.8 Counting ones in a Window
Now let us consider counting problems for streams. Suppose we have a window of length N on a binary stream and want to answer queries of the form "how many 1's are there in the last k bits ?" for any k ≤ N. Since in practice we cannot afford to store the entire window of the stream in memory, we use an approximation algorithm to count the number of 1's in the last k bits, which is explained in the subsequent section.
To answer such queries exactly, it is necessary to store all N bits of the window; a representation using fewer than N bits cannot work. The reason is that there are 2^N sequences of N bits, so with fewer than 2^N possible representations there must be two different bit strings x and y that have the same representation, and if x ≠ y they must differ in at least one bit, so some query would be answered incorrectly for one of them.
4.8.1 The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
The DGIM algorithm is used to estimate the number of 1's in a window of a bit stream. The algorithm uses O(log² N) bits to represent a window of N bits and allows us to estimate the number of 1's in the window with an error of no more than 50 %. In this algorithm, each bit of the stream has a timestamp, the position at which it arrives : the first bit has timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we represent timestamps modulo N, which requires log₂ N bits; in addition we store the total number of bits ever seen in the stream, so that the window can be located from the timestamps modulo N. We divide the window into buckets, each consisting of the timestamp of its right (most recent) end and the number of 1's in the bucket. This number must be a power of 2, and we refer to the number of 1's as the size of the bucket. To represent a bucket we need log₂ N bits for the timestamp (modulo N) of its right end and only log₂ log₂ N bits for the number of 1's, so O(log N) bits suffice to represent a bucket. The following rules must be followed when representing a stream by buckets :
A) The right end of every bucket is always a position with a 1; if the rightmost bit were 0 it could not end a bucket. For example, in 1001011 a bucket covering these bits would have size 4, since it contains four 1's and its right end is a 1.
B) Every position with a 1 is in some bucket; every bucket contains at least one 1, otherwise no bucket can be formed.
C) No position is in more than one bucket.
D) There are one or two buckets of any given size, up to some maximum size.
E) All bucket sizes must be powers of 2.
F) Buckets cannot decrease in size as we move to the left (back in time).
Suppose the given stream is . . 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0. The bitstream divided into buckets following the DGIM rules is shown in Fig. 4.8.1.

Fig. 4.8.1 : Bitstream divided into buckets following the DGIM rules

For example : Suppose the input bit stream is ….101011000101110110010110 and the window size is N = 24; estimate the total number of 1's and the number of buckets.
Create buckets whose rightmost bit is always a 1. In this example we find 5 buckets :
101011 : bucket of size 4 (2^2 = 4)
000 10111 : bucket of size 4 (2^2 = 4)
0 11 : bucket of size 2 (2^1 = 2)
00 101 : bucket of size 2 (2^1 = 2)
1 : bucket of size 1 (2^0 = 1)
(the final 0 belongs to no bucket)
Here, when a new bit comes in, drop the last (leftmost) bucket if its timestamp is more than N time units before the current time. If the new bit that arrives is 0, say with timestamp 101, then no changes are needed in the buckets, but if the new bit that arrives is 1, then we need to make some changes.
101011  000 10111  0 11  00 101  1  0   1 1      (the two rightmost 1's are the new bits to be entered)
Since the current bit is 1, create a new bucket of size 1 with the current timestamp. If there was only one bucket of size 1, nothing more needs to be done. However, if there are now three buckets of size 1 (buckets with timestamps 100, 102, 103), then combine the leftmost (oldest) two of them into a bucket of size 2, as shown below :
101011  000 10111  0 11  00 101  1001  1  1
(buckets of sizes 4, 4, 2, 2, 2 and 1, reading left to right)
To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost of the
two buckets. After performing the combining operation on the buckets, the resulting buckets would be :
101011  000 10111  1100101  1001  1  1
(buckets of sizes 4, 4, 4, 2, 1 and 1, reading left to right)
Now, combining two buckets of size 1 may sometimes create a third bucket of size 2; if so, we combine the leftmost two buckets of size 2 into a bucket of size 4, and this process may ripple through the bucket sizes. A bucket is retained only while (current timestamp - leftmost bucket timestamp) < N, i.e. 24 here.
So finally, by summing the sizes of the buckets that lie in the last 20 bits, we get the answer to the query, i.e. approximately 11 ones.
Each bucket can be represented by O(log N) bits, and if the window has length N then there are no more than N 1's, so there are O(log N) buckets; the total space required for all the buckets representing a window of size N is therefore O(log² N). To answer the query "how many 1's are there in the last k bits of the window", for some 1 ≤ k ≤ N, find the bucket b with the earliest timestamp that includes at least some of the k most recent bits, then estimate the number of 1's as the sum of the sizes of all buckets to the right of (more recent than) bucket b, plus half the size of b itself.
When a new bit is added to a window of length N represented by buckets, we may need to modify the buckets while preserving the DGIM conditions. First, whenever a new bit enters, check the leftmost bucket : if its timestamp has now reached the current timestamp minus N, the bucket no longer has any of its 1's in the window, so drop it from the list of buckets. Then, if the new bit is 1, create a new bucket with the current timestamp and size 1. If there are now three buckets of size 1, fix this by combining the leftmost two buckets of size 1; to combine any two adjacent buckets of the same size, replace them by one bucket of twice the size, whose timestamp is the timestamp of the rightmost (later in time) of the two buckets. This combining may cascade to larger sizes, but since there are only O(log N) distinct sizes, any new bit can be processed in O(log N) time.
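The bucket bookkeeping described above can be sketched in Python as follows; the class name, the list-of-pairs representation and the use of absolute (rather than modulo-N) timestamps are simplifications for illustration :
```python
class DGIM:
    """Approximate count of 1's in the last N bits of a binary stream."""

    def __init__(self, window_size):
        self.N = window_size
        self.time = 0
        self.buckets = []   # (timestamp of right end, size), kept oldest first

    def add_bit(self, bit):
        self.time += 1
        # Drop the oldest bucket once it slides completely out of the window.
        if self.buckets and self.buckets[0][0] <= self.time - self.N:
            self.buckets.pop(0)
        if bit != 1:
            return
        # A new 1 starts a bucket of size 1; merge while any size occurs 3 times.
        self.buckets.append((self.time, 1))
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) < 3:
                break
            first, second = same[0], same[1]              # two oldest of this size
            merged = (self.buckets[second][0], 2 * size)  # keep the newer timestamp
            self.buckets[first:second + 1] = [merged]
            size *= 2

    def count_ones(self, k):
        # Buckets whose right end lies within the last k bits.
        relevant = [(ts, s) for ts, s in self.buckets if ts > self.time - k]
        if not relevant:
            return 0
        # Full sizes of the newer buckets plus half the size of the oldest one.
        return sum(s for _, s in relevant[1:]) + relevant[0][1] / 2

dgim = DGIM(window_size=24)
for b in "101011000101110110010110":
    dgim.add_bit(int(b))
print(dgim.count_ones(24))   # prints 11.0 here; the exact count in the window is 13
```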
4.9 Decaying Window
The decaying window is used for finding the most common "recent" elements in a stream. Suppose a stream consists of the elements a_1, a_2, . . . , a_t, where a_1 is the first element to arrive and a_t is the current element, and let c be a small constant, such as 10^-6 to 10^-9. Then the exponentially decaying window for this stream is the sum
Σ_{i=0}^{t-1} a_{t-i} (1 - c)^i
With a decaying window it is easier to adjust the sum than with a sliding window of fixed length. The effect of this definition is to spread the weights of the stream elements out as far back in time as the stream goes, whereas in a sliding window the element that falls out of the window each time a new element arrives has to be handled explicitly. In contrast, a fixed window with the same total weight, 1/c, would put weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous elements, as illustrated in Fig. 4.9.1. When a new element a_{t+1} arrives at the stream input, we first multiply the current sum by 1 - c and then add a_{t+1}.

Fig. 4.9.1 : Decaying window

With this update, each of the previous elements moves one position further from the current element, so its weight is multiplied by 1 - c. Further, the weight on the current element is (1 - c)^0 = 1, so adding a_{t+1} is the correct way to include the new element's contribution.
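A small Python sketch of this update rule, applied per element so that each element keeps a decayed score that can be used to rank the most common recent elements; the dictionary-of-scores approach and the drop threshold are illustrative choices, not part of the definition above :
```python
def update_decayed_scores(scores, new_element, c=1e-6):
    # Multiply every existing score by (1 - c), then add 1 for the arriving element.
    for element in list(scores):
        scores[element] *= (1 - c)
        if scores[element] < 0.5:      # optional: drop negligible scores to save space
            del scores[element]
    scores[new_element] = scores.get(new_element, 0.0) + 1.0
    return scores

scores = {}
for item in ["a", "b", "a", "c", "a", "b"]:
    update_decayed_scores(scores, item, c=0.1)
print(max(scores, key=scores.get))   # "a" has the highest decayed count
```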
4.10 Real Time Analytics Platform (RTAP)
A real-time analytics platform enables organizations to extract valuable information and trends from real-time data as it arrives. Such platforms help in measuring data from the business point of view in real time. An ideal real-time analytics platform helps in analyzing the data, correlating it and predicting outcomes on a real-time basis. It helps organizations track events as they happen, supporting the decision-making process, and connects the data sources needed for better analytics and visualization. RTAP is concerned with the responsiveness of data, which needs to be processed immediately upon generation, sometimes updating information at the same rate at which it is received. The RTAP analyzes the data, correlates it, predicts outcomes in real time and thereby supports timely decision making.
As we know, social media platforms like Facebook and Twitter generate petabytes of real-time data. This data must be harnessed to provide real-time analytics for better business decisions. Further, in today's context, billions of devices are connected to the internet, such as mobile phones, personal computers, laptops, wearable medical devices,
smart meters and a huge number of other new data sources. Real-time analytics leverages information from all these devices to apply analytics algorithms and generate automated actions within milliseconds of a trigger. A real-time analytics platform is composed of three components, namely :
Input : generated when an event happens (a new sale, a new customer, someone entering a high-security zone, etc.).
Processing unit : captures the data of the event and analyzes it without using resources that are dedicated to operations; it also executes different standing and ad-hoc queries over the streamed data.
Output : consumes this data without disturbing operations, explores it for better insights and generates analytical results through different visual reports on a dedicated dashboard.
The general architecture of a Real-Time Analytics Platform is shown in Fig. 4.10.1.
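As a toy illustration of the input, processing unit and output components just described, the following Python sketch wires a simulated event source to a standing query and a simple alerting output; all names and thresholds are invented for illustration :
```python
import random

def event_source(n=10):
    # Input: simulated sensor events (e.g. temperature readings).
    for _ in range(n):
        yield {"sensor": "T1", "temperature": random.uniform(20, 90)}

def process(events, threshold=75):
    # Processing unit: a standing query over the incoming stream.
    for event in events:
        if event["temperature"] > threshold:
            yield {"alert": "overheat", **event}

def output(alerts):
    # Output: push results to a dashboard or notification channel (here, print).
    for alert in alerts:
        print("ALERT:", alert)

output(process(event_source()))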
The various requirements for a real-time analytics platform are as follows :
1. It must support continuous queries over real-time events.
2. It must provide features like robustness, fault tolerance, low-latency reads and updates, incremental analytics and learning, and scalability.
3. It must offer improved in-memory transaction speed.
4. It should quickly move data that is no longer needed to secondary disk for persistent storage.
5. It must support distributing and processing data from various sources at speed.

Fig. 4.10.1 : Architecture of Real-Time Analytics Platform
The basic building blocks of a real-time streaming platform are shown in Fig. 4.10.2. The streaming data is collected from various flexible data sources by producing connectors, which move data from the sources to the queuing system. The queuing system is fault tolerant and persistent in nature, and the streamed data is buffered there to be consumed by the stream processing engine. The queuing system is a high-throughput, low-latency system which provides high availability and fail-over capabilities. There are many technologies that support real-time analytics, such as :

Fig. 4.10.2 : Basic building blocks of Real-Time Analytics Platform

1. Processing In Memory (PIM), a chip architecture in which the processor is integrated into a memory chip to reduce latency.
2. In-database Analytics, a technology that allows data processing to be conducted
within the database by building analytic logic into the database itself.
3. Data Warehouse Appliances, combination of hardware and software products
designed specifically for analytical processing. An appliance allows the purchaser to
deploy a high-performance data warehouse right out of the box.
4. In-memory Analytics, an approach to querying data when it resides in Random
Access Memory (RAM), as opposed to querying data that is stored on physical disks.
5. Massively Parallel Processing (MPP), the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory.
Some of the popular Real Time Analytics Platforms are :
 IBM Info Streams : It is used as a streaming platform for analyzing broad range of real-
time unstructured data like text, videos, geospatial images, sensors data etc.
 SAP HANA : It is a streaming analytical tool that allows SAP users to capture, stream
and analyze data with active event monitoring and event driven response to
applications.
 Apache Spark : It is a streaming platform for big data analytics in real-time developed
by Apache.
 Cisco Connected Streaming Platform : It is used for finding the insights from high
velocity streams of live data over the network with multiple sources with enabled
immediate actions.
 Oracle Stream Analytics : It provides graphical interface to performing analytics over
the real-time streamed data.
 Google Real Time Analytics : It is used for performing real-time analytics over the
cloud data collected over different applications.
4.10.1 Applications of Realtime Analytics Platforms
There are many real-time applications that use real-time analytics platforms; some of them are listed below :
 Click analytics for online product recommendation
 Automated event actions for emergency services like fires, accidents or any disasters in
the industry
 Notification for any abnormal measurement in healthcare which requires immediate
actions
 Log analysis for understanding user’s behavior and usage pattern
 Fraud detection for online transactions
 Push notifications to the customers for location-based advertisements for retail
 Broadcasting news to the users which are relevant to them
4.11 Real Time Sentiment Analysis
Sentiment analysis (also referred to as opinion mining) is a natural language processing and information extraction task that aims to obtain the feelings expressed in positive or negative comments, questions and requests by analyzing large amounts of data over the web. In real-time sentiment analysis, sentiments are collected and analyzed in real time from live data on the web. It uses natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. The goal of sentiment analysis is to allow organizations, political parties and ordinary people to track sentiment by identifying the feelings, attitude and state of mind of people towards a product or service, and to classify them as positive, negative or neutral from the tremendous amount of data in the form of reviews, tweets,
comments and feedback, with emotional states such as "angry", "sad" and "happy". It tries to identify and extract the sentiments within the text. The analysis of sentiment can be either document based, where the sentiment of the entire document is summarized as positive, negative or objective, or sentence based, where the individual sentiment-bearing sentences in the text are classified.
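As a toy illustration of sentence-level classification into positive, negative and neutral, the sketch below scores text against a tiny hand-made lexicon; the word lists are invented and far simpler than the NLP techniques described here :
```python
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "angry", "hate", "poor"}

def classify_sentence(sentence):
    # Count positive and negative lexicon hits and compare.
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentence("I love this phone, the camera is great"))   # positive
print(classify_sentence("Battery life is bad and support is poor"))  # negative
```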
Sentiment analysis is widely applied to reviews and social media for a variety of
applications, ranging from marketing to customer service. In the context of analytics,
sentiment analysis is “the automated mining of attitudes, opinions and emotions from
text, speech and database sources”. With the proliferation of reviews, ratings,
recommendations and other forms of online expression, online opinion has turned into a
kind of virtual currency for businesses looking to market their products, identify new
opportunities and manage their reputations.
Some of the popular applications of real-time sentiment analysis are,
1) Collecting and analyzing sentiments over the Twitter. As Twitter has become a
central site where people express their opinions and views on political parties and
candidates. Emerging events or news are often followed almost instantly by a burst
in Twitter volume, which if analyzed in real time can help explore how these events
affect public opinion. While traditional content analysis takes days or weeks to
complete, real time sentiment analysis can look into the entire Twitter traffic about
the election, delivering results instantly and continuously. It offers the public, the
media, politicians and scholars a new and timely perspective on the dynamics of the
electoral process and public opinion.
2) Analyzing the sentiments of messages posted to social networks or online forums
can generate countless business values for the organizations which aim to extract
timely business intelligence about how their products or services are perceived by
their customers. As a result, proactive marketing or product design strategy can be
developed to effectively increase the customer base.
3) Tracking the crowd sentiments during commercial viewing by advertising agencies
on TVs and decide which commercials are resulting in positive sentiments and
which are not.
4) A news media website may be interested in getting an edge over its competitors by featuring site content that is immediately relevant to its readers. It can use social media to learn which topics are relevant to its readers by doing real-time
sentiment analysis on Twitter data. Specifically, to identify what topics are trending in real time on Twitter, it needs real-time analytics about the tweet volume and sentiment for key topics.
5) In Marketing, the real-time sentiment analysis can be used to know the public
reactions on product or services supplies by an organization. The analysis is
performed on which product or services they like or dislike and how they can be
improved,
6) In Quality Assurance, the real-time sentiment analysis can be used to detect errors
in your products based on your actual user’s experience.
7) In Politics, the real-time sentiment analysis can be used to determine the views of
the people regarding specific situations on which they angry or happy.
8) In finance, real-time sentiment analysis tries to detect the sentiment towards a brand in order to anticipate market moves.
The best example of real-time sentiment analysis is predicting the pricing or promotions of a product being offered through social media and the web. A solution for price or promotion prediction can be implemented using software such as RADAR (Real-Time Analytics Dashboard Application for Retail) together with Apache Storm. RADAR is a software solution for retailers built on a Natural Language Processing (NLP) based sentiment analysis engine that utilizes different Hadoop technologies, including HDFS, Apache Storm, Apache Solr, Oozie and ZooKeeper, to help enterprises maximize sales through data-based continuous re-pricing. Apache Storm is a distributed real-time computation system for processing large volumes of high-velocity data; it is part of the Hadoop ecosystem and is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Apache Solr is another tool from the Hadoop ecosystem which provides a highly reliable, scalable search facility in real time. RADAR uses Apache Storm for real-time data processing and Apache Solr for indexing and data analysis. The generalized architecture of RADAR for retail is shown in Fig. 4.11.1.
Fig. 4.11.1 : Generalized architecture of RADAR for retail

Retailers can customize RADAR to their environment so that, for any number of products or services in their portfolio, they can track the social sentiment for each product or service they offer as well as the competitive pricing and promotions being offered through social media and the web. With this solution, retailers can create continuous re-pricing campaigns and implement them in real time in their pricing systems, track the impact of re-pricing on sales and continuously compare it with social sentiment.
4.12 Stock Market Predictions
Stock market prediction is the act of trying to determine the future value of a company
stock or other financial instrument traded on an exchange. The successful prediction of a
stock's future price could yield significant profit.
Predicting stock prices is a challenging problem in itself because of the number of variables involved. The stock market process is full of uncertainty and is affected by many factors, which makes stock market prediction one of the important endeavours in business and finance. Since the market produces a large amount of data every day, it is very difficult for an individual to consider all the current and past information when predicting the future trend of a stock.
Traditionally, stock market prediction algorithms would examine historical stock prices and try to predict the future using different models. This traditional approach is not effective in real time, because stock market trends change continually based on economic forces, regulations, competition, new products, world events and even (positive or negative) tweets, all of which affect stock prices. Thus, predicting stock prices using real-time analytics becomes a necessity. The generalized architecture for real-time stock prediction has three basic steps, as shown in Fig. 4.12.1.
Fig. 4.12.1 : Generalized architecture for real-time stock prediction
There are three basic components :
1. In the first step, the incoming real-time trading data is captured and stored in persistent storage, where it becomes historical data over a period of time.
2. Second, the system must be able to learn from historical trends in the data and recognize patterns and probabilities to inform decisions.
3. Third, the system needs to do a real-time comparison of new, incoming trading data with the patterns and probabilities learned from the historical data; it then predicts an outcome and determines an action to take.
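The three components above can be illustrated with a deliberately simple Python sketch, in which synthetic prices stand in for captured trading data, a moving average stands in for the learned patterns, and the comparison step emits a buy/sell signal; this is illustrative only, not a real prediction model :
```python
from collections import deque

class MovingAverageModel:
    def __init__(self, window=5):
        self.history = deque(maxlen=window)   # step 1: retained historical data

    def learn(self, price):
        self.history.append(price)            # step 2: update the learned pattern

    def predict_and_act(self, new_price):
        # Step 3: compare incoming data with the learned pattern.
        if not self.history:
            return "hold"
        average = sum(self.history) / len(self.history)
        return "buy" if new_price < average else "sell"

model = MovingAverageModel()
for price in [101.0, 102.5, 101.8, 103.2, 99.7, 104.1]:
    action = model.predict_and_act(price)
    model.learn(price)
    print(price, action)
```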
A more detailed picture of a machine learning approach for stock prediction is given in Fig. 4.12.2.
Fig. 4.12.2 : Detailed representation of real-time stock prediction using machine learning

The following steps are followed :


1. The Live data, from Yahoo! Finance or any other finance news RSS feeds is read and
processed. The data is then stored in memory with a fast, consistent, resilient, and
linearly scalable system.
2. Using the live, hot data from Apache Geode, a Spark MLib application creates and
trains a model, comparing new data to historical patterns. The models could also be
supported by other toolsets, such as Apache MADlib or R.
3. Results of the machine learning model are pushed to other interested applications
and also updated within Apache Geode for real-time prediction and decisioning.
4. As data ages and starts to become cool, it is moved from Apache Geode to Apache
HAWQ and eventually lands in Apache Hadoop™. Apache HAWQ allows for SQL-
based analysis on petabyte-scale data sets and allows data scientists to iterate on
and improve models.
5. Another process is triggered to periodically retrain and update the machine
learning model based on the whole historical data set. This closes the loop and
creates ongoing updates and improvements when historical patterns change or as
new models emerge.
The most common advantages of stock prediction using big data approach are
 It stabilizes the online trading
 Real-time data analysis with a rapid speed
 Improves the relationship between investors and stock trading firms


 Provides the best estimation of outcomes and returns :
 Mitigate the probable risks on stock trading online and make a right investment
decision
 Enhances the machine learning ability to produces accurate predictions
4.13 Graph Analytics
Big data analytics systems are intended to provide tools and platforms that support various analytic techniques and can be adapted to overcome the challenges of existing systems. Graph analytics is one such technique, in which both structured and unstructured data from various sources are supported so that analysts can probe the data in an undirected manner. It has been adopted by many organizations because it offers simpler visualization of data than past data warehouse and analytics techniques.

Fig. 4.13.1 : Graph representation

A graph is composed of numerous individual entities and the relationships that connect them. It consists of a collection of vertices (referred to as nodes) that represent entities, connected by edges (referred to as links or connections) that represent relationships between entities. Fig. 4.13.1 shows a typical graph representation, in which the edges between vertices represent the nature of the relationship, with a direction between the entities. In typical
graph analysis, labeled vertices indicate the types of entities that are related, labeled edges represent the nature of each relationship, and multiple relationships between a pair of vertices are represented by multiple edges between that pair of vertices.
In graph analytics, a directed graph can be represented by triplets consisting of a subject, which is the source point of the relationship, an object, which is the target point of the relationship, and a predicate that represents the type of the relationship; a database which supports such triplets is called a semantic database. The graph model supports all types of entities and their relationships. It comprises several modelling styles : communication models that represent communication across a community triggered by a specific event; influence models that represent entities holding influential positions within a network for intermittent periods of time; distance models for analyzing the distances between sets of entities, such as finding strong correlations between occurrences of sets of statistically improbable phrases; and collaborative models that capture isolated groups of individuals who share similar interests. Graph analytics is mainly used for business problems that have characteristics such as the ad hoc nature of the analysis, absence of structure in the problem, knowledge embedded in the network, connectivity problems, the need for predictable performance, undirected discovery and flexible semantics.
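A small Python sketch of the subject-predicate-object (triplet) representation described above, storing triples in a list and answering simple relationship queries; the sample data and helper names are invented for illustration :
```python
# Each triple is (subject, predicate, object).
triples = [
    ("Alice", "works_at", "Acme"),
    ("Bob", "works_at", "Acme"),
    ("Alice", "knows", "Bob"),
    ("Acme", "located_in", "Pune"),
]

def objects_of(subject, predicate):
    # Return all objects linked to 'subject' by 'predicate'.
    return [o for s, p, o in triples if s == subject and p == predicate]

def colleagues(person):
    # Entities sharing a 'works_at' relationship with the given person.
    companies = objects_of(person, "works_at")
    return [s for s, p, o in triples
            if p == "works_at" and o in companies and s != person]

print(objects_of("Alice", "knows"))   # ['Bob']
print(colleagues("Alice"))            # ['Bob']
```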
4.13.1 Features of Graph Analytics
Graph analytics encompasses the following features :
a) Easier visualization : Graph analytics provides visualization tools that make it easier to discover valuable information and highlight its value.
b) Seamless data intake : It provides the capability to easily collect and use data from a variety of different sources.
c) Simple data integration : A semantics-based approach allows graph analytics to easily integrate different sets of data that do not have a predetermined structure.
d) Seamless workflow integration : A graph analytics platform provides seamless approaches for workflow integration, since results that remain segregated in the existing reporting and analytics environments have limited value.
e) Multithreading : The graph analytics platform uses fine-grained multithreading approaches which allow exploration of different paths by creating, managing and allocating threads to the available nodes of a parallel processing architecture.
f) Standardized representation : Graph analytics platforms have built-in support for resource description standards such as RDF and ontologies, using triplets to represent the graph.
g) Built-in inferencing mechanisms : The graph analytics platform provides methods for finding insights derived from the embedded relationships by using different built-in inference mechanisms for the deduction of new information.
Graph analytics applications run different algorithms to traverse or analyze graphs and find interesting patterns within them. Those patterns are useful for identifying new business opportunities, increasing revenue, detecting fraud and identifying different security risks.
The different approaches used by graph analytics algorithms are given below :
i) Path analysis : This approach examines the shapes and distances of the diverse
paths that connect entities within the graph.
ii) Clustering analytics : This approach examines the properties of the vertices and
edges to recognize features of entities that can be used to group them together.
iii) Pattern detection and analysis : This approach provides methods for finding inconsistent or unexpected patterns within a graph for analysis.
iv) Probabilistic analysis : This approach provides different graphical models for
probabilistic analysis on various applications like risk analysis, speech recognition,
medical diagnosis, protein structure prediction etc. using Bayesian networks.
v) Community analysis : In this approach the graph structures are traversed in search
of groups of entities connected in close ways.
Graph analytics is used in many applications. In health care, patients' health records such as medical histories, prescription records, laboratory results and clinical records from many different sources are analyzed, and on that basis rapid assessment of therapies can be provided for other patients facing the same medical problem. Cyber security is another application, where patterns of attack are recorded and actions are taken against those attacks. A third application is concept-based correlation, which is used for finding contextual relationships between different entities, for example helping fraud analysts evaluate financial irregularities across multiple related organizations.
Despite its many advantages, graph analytics has some limitations, such as the complexity of graph partitioning, the unpredictability of graph memory accesses, dynamic interactions with graphs, and the unpredicted growth of graph models.
Summary
 The stream is the sequence of data elements which flows in a group while stream
processing is a big data technology which is used to query continuous data stream
generated in a real-time for finding the insights or detect conditions and quickly
take actions within a small period of time. The Sources of streamed data are
Sensor Data, Satellite Image Data, Web data, Social web data etc.
 The stream data model uses data stream management system unlike database
management systems for managing and processing the data streams
 A streaming data architecture is a framework for processing huge volumes of
streaming data from multiple sources. The generalized streaming architecture
composed of four components like Message Broker or Stream Processor, ETL
tools, Query Engine and streaming data storage.
 The Stream computing is a computing paradigm that reads data from collections
of sensors in a stream form and as a result it computes continuous real-time data
streams. It enables graphics processors (GPUs) to work in coordination with low-
latency and high-performance CPUs to solve complex computational problems.
 The architecture of stream computing consists of five components namely: Server,
Integrated development environment, Database Connectors, Streaming analytics
engine and data mart.
 The Sampling in a data stream is the process of collecting and representing the
sample of the elements of a data stream. The samples are usually much smaller
element of entire stream, but designed to retain the original characteristics of the
stream.
 There are basic three types of sampling namely Reservoir Sampling, Biased
Reservoir Sampling and Concise Sampling.
 The filtering is the process of accepting the tuples in the stream that meets the
selection criterion where accepted tuples are provided to another process as a
stream and rejected tuples are dropped.
 The purpose of the Bloom filter is to allow all the stream elements whose keys (K)
lie in set (S), otherwise rejecting most of the stream elements whose keys (K) are not part of set (S), while the Flajolet-Martin algorithm is used for estimating the number of distinct elements in a stream.
 A real-time analytics platform enables organizations by helping them to extract
valuable information and trends from real-time data. In real-time sentiment analysis, the sentiments are collected and analyzed in real time from live data over the web.
 Stock market prediction is the act of trying to determine the future value of a
company stock or other financial instrument traded on an exchange. The
successful prediction of a stock's future price could yield significant profit.
 The Graph analytics is one of the techniques in which both structured and
unstructured data is supported from various sources to enable analysts to probe the
data in an undirected manner.
 In graph analytics, the directed graph can be represented by triplet consist of
subject which is the source point of the relationship, an object is the target point of
relationship, and a predicate that represents type of the relationship.
Two Marks Questions with Answers [Part A - Questions]
Q.1 Outline the need of sampling in the stream. AU : May-17
Ans. : The Sampling in a data stream is the process of collecting and representing the
sample of the elements of a data stream. The samples are usually much smaller element
of entire stream, but designed to retain the original characteristics of the stream.
Therefore, the elements that are not stored within the sample are lost forever, and
cannot be retrieved. The sampling process is intended for extracting reliable samples
from a stream. The data stream sampling uses many stream algorithms and techniques
to extract the sample, but most popular technique is hashing. It is used when we have
multiple subsets of a stream and want to run the query on that which can retrieve
statistically representative of the stream as a whole. In such cases ad-hoc queries can be
used on the sample along with hashing.
Q.2 State the examples of stream sources. AU : Nov.-18
Ans. : There are various sources of streamed data which provide data for stream processing. The sources of streaming data range from computer applications to Internet of Things (IoT) sensors, and include satellite data, sensor data, IoT applications, websites, social media data etc. The various examples of sources of stream data are listed below :
a) Sensor Data : Where data is received from different kinds of wired or wireless sensors. For example, real-time data generated by a temperature sensor is provided to the stream processing engine, which takes action when a threshold is met.
b) Satellite Image Data : Where data is received from satellites as streams to earth consisting of many terabytes of images per day. Surveillance cameras fitted in a satellite produce images which are streamed to a station on earth for processing.
c) Web data : Where real-time streams of IP packets generated on the internet are provided to a switching node, which runs queries to detect denial-of-service or other attacks and then reroutes the packets based on information about congestion in the network.
d) Data in an online retail stores : Where retail firm data collect, store and process
data about product purchase and services by particular customer to understand
the customers behavior analysis
e) Social web data : where data generated through social media websites like
Twitter. Facebook is used by third party organization for sentimental analysis and
prediction of human behavior.
The applications which uses data streams are
 Realtime maps which uses location-based services to find nearest point of interest

 Location based advertisement of notifications

 Watching streamed videos or listening streamed music

 Subscribing online news alerts or weather forecasting service

 Monitoring
 Performing fraud detection on live online transaction

 Detection of anomaly on network applications


 Monitoring and detection of potential failure of system using network monitoring
tools
 Monitoring embedded systems and industry machinery in real-time using
surveillance cameras
 Subscribing real-time updates on social medias like twitter, Facebook etc.
Q.3 What is the storage requirement for the DGIM algorithm ? AU : Nov.-18
Ans. : The DGIM algorithm is used to estimate the number of 1's in a window of a bit stream. The algorithm uses O(log² N) bits to represent a window of N bits and allows us to estimate the number of 1's in the window with an error of no more than 50 %. In this, each bit of the stream has a
timestamp which signifies the position in which it arrives. The first bit has timestamp 1,
the second has timestamp 2, and so on. As we need to distinguish positions within the
window of length N, we shall represent timestamps modulo N which can be
represented by log2 N bits. To store the total number of bits we ever seen in the stream,
we need to determine the window by timestamp modulo N. For that, we need to divide
the window into buckets consisting of timestamp of its right (most recent) end and the
number of 1's in the bucket. This number must be a power of 2, and we refer to the
number of 1's as the size of the bucket. To represent a bucket, we need log2 N bits to
represent the timestamp which is modulo N of its right end. To represent the number of
1's we only need log2 log2 N bits. Thus, O(logN) bits suffice to represent a bucket. There
are six rules that must be followed when representing a stream by buckets.
a) The right end of every bucket is always a position with a 1; if the rightmost bit were 0 it could not end a bucket. For example, in 1001011 a bucket covering these bits would have size 4, since it contains four 1's and its right end is a 1.
b) Every position with a 1 is in some bucket; every bucket contains at least one 1, otherwise no bucket can be formed.
c) No position is in more than one bucket.
d) There are one or two buckets of any given size, up to some maximum size.
e) All buckets sizes must be in a power of 2.
f) Buckets cannot decrease in size as we move to the left.
Q.4 What is sentiment analysis ? AU : May-17
Ans. : The Sentiment Analysis (also referred as opinion mining) is a Natural Language
Processing and Information Extraction task that aims to obtain the feelings expressed in
positive or negative comments, questions and requests, by analyzing a large number of
data over the web. In real-time sentimental analysis, the sentiments are collected and
analyzed in real time with live data over the web. It uses natural language processing,
text analysis and computational linguistics to identify and extract subjective
information in source materials.

Part - B Questions
Q.1 With neat sketch explain the architecture of data stream management system
AU : May-17
Q.2 Outline the algorithm used for counting distinct elements in a data stream AU : May-17
Q.3 Explain with example Real Time Analytics Platform (RTAP)
Q.4 State and explain Bloom filtering with the example.
Q.5 State and explain Real Time Analytics Platform (RTAP) applications AU : Nov.-18

Q.6 Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2.
What is third moment of the stream ?

Ans. : Here the given stream is 3, 1, 4, 1, 3, 4, 2, 1, 2, and the distinct elements are {1, 2, 3, 4}.
The k-th order frequency moment of a stream is given by the formula
F_k = Σ_i (m_i)^k
where k is the order of the moment and m_i is the number of occurrences of the i-th element. Therefore, the moments for all elements are given as :
Element    Occurrence (m_i)    1st moment (m_i)    2nd moment (m_i^2)    3rd moment (m_i^3)
1          3                   3                   9                     27
2          2                   2                   4                     8
3          2                   2                   4                     8
4          2                   2                   4                     8
Total                          Σ m_i = 9           Σ (m_i)^2 = 21        Σ (m_i)^3 = 51
From the table it can be concluded that the first moment (the length of the stream) is 9, the second moment of the stream is 21 and the third moment is 51. The third moment of the stream for the given problem is therefore 51.
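The same moments can be verified with a few lines of Python using collections.Counter (a quick checking sketch) :
```python
from collections import Counter

def moment(stream, k):
    # k-th frequency moment: sum of (occurrence count)^k over distinct elements.
    return sum(count ** k for count in Counter(stream).values())

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
print(moment(stream, 1), moment(stream, 2), moment(stream, 3))   # 9 21 51
```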
UNIT - V
5 NoSQL Data Management for Big Data and Visualization
Syllabus
NoSQL Databases : Schema-less Models : Increasing Flexibility for Data Manipulation - Key Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases - Hive - Sharding - Hbase - Analyzing big data with twitter - Big data for E-Commerce - Big data for blogs - Review of Basic Data Analytic Methods using R.

Contents
5.1 Introduction to NoSQL
5.2 "Schema-Less Models" : Increasing Flexibility for Data Manipulation
5.3 Key Value Stores
5.4 Document Stores
5.5 Tabular Stores
5.6 Object Datastores
5.7 Graph Datastores
5.8 Hive
5.9 Sharding
5.10 Hbase
5.11 Analyzing Big Data with Twitter
5.12 Big Data for E-Commerce and Blogs
5.13 Review of Basic Data Analytic Methods using R
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions

(5 - 1)
N OSQL Data Management for
Big Data Analytics 5-2 Big Data and Visualization

5.1 Introduction to NoSQL
With Rapid development in digital technologies, the massive amount of data is being
generated which has four Big data characteristics like volume, variety, veracity and
velocity. Earlier, the relational database management system was used to perform all
database operations. The RDBMS databases used to have SQL databases which used to
store data in to table based databases. It has rows to store records and columns to define
the attributes. Each row of table contains a unique instance of data for the categories
defined by the columns along with primary key, to uniquely identify the rows. In
traditional analytics system, the cleansed or meaningful data were collected and stored by
RDBMS in a data ware house. That data was analyzed by means of performing Extract,
Transform and Load (ETL) operations. But, because of above four Big data characteristics
it becomes more and more difficult to capture, store, organize, process and analyze the
data generated by various web applications or websites. Therefore, organizations were
faced many limitations related to traditional relational database management system
which are listed below
a) It has support of only structured and stored data
b) It required data to be in table form
c) The database must have a predefined schema and is limited in size
d) It does not support the scalability
e) Parallel processing of such data by traditional analytics were costlier because of
expensive hardware.
f) Unstructured and semi structured data like images, pdfs, compressed files, videos
etc. was not supported
g) Realtime data capturing and analytics was not supported
h) Most of the databases were proprietary which incurs huge cost for storing and
processing data
Therefore, to overcome the above challenges, a special kind of database called NoSQL was developed. NoSQL stands for "Not Only SQL" or "Not SQL". NoSQL is a set of concepts that allows rapid and efficient processing of data sets with a focus on performance, reliability and agility. It does not require a fixed schema, normalized data (3NF), tables or joins; instead it is used for distributed data stores with easier scalability. It is largely used for Big data applications where real-time data is
generated along with huge volume, variety, velocity and veracity. For example, social media websites like Twitter, Facebook, Google and Instagram collect and process terabytes of users' data every single day and use NoSQL databases for all of their operations.
NoSQL databases have many advantages over the traditional RDBMS. The major advantages of NoSQL databases over traditional SQL databases are given below :
a) It supports real time or batch processing data along with different formats like
structured, unstructured or semi-structured.
b) It can process uncleansed or uncertain data for analytics
c) It does not require an expensive hardware for implementation as it can be run on
commodity hardware
d) It supports huge volume of data generated real time with velocity
e) It can perform data analytics at low cost.
f) It is free of joins and schema and doesn't require complex queries
g) It can work on many processors parallelly with linear scalability
h) Most of NOSQL systems are opensource that makes it very cost efficient
i) Cost per bit for processing is very low and compatible with cloud computing
5.1.1 Types of NoSQL Databases
There are five basic types of NoSQL databases :
1. Key-Value Store : It uses combination of keys and values stored in big hash tables to
process the data. e.g. Redis, Riak, Amazon DynamoDB etc.
2. Document Store : It store and process data in terms of documents in a collection. e.g.
MongoDB, CouchDB, Amazon DocumentDB etc.
3. Tabular Store (Column-based Store) : These databases are capable of storing and
processing data in a column of a table. Some of the columnar databases are Hbase,
Cassandra, Azure Table Storage (ATS), BerkeleyDB etc.
4. Object Store : These types of databases stores object in a database with capability to
represent ORDBMS. The examples of Object stores are ObjectDB, Perst, Objectstore
etc.
5. Graph-based databases : These are network databases that use edges and nodes to represent the stored data and their relationships, e.g. Apache Giraph, GraphX, Neo4j etc.
The above NoSQL databases are explained briefly in the subsequent sections of this chapter.
5.2 "Schema-Less Models" : Increasing Flexibility for Data Manipulation
NoSQL data systems hold out the guarantee of flexibility and more noteworthy
adaptability in database management while lessening the reliance on progressively
formal database administration. NoSQL databases have increasingly relaxed modeling constraints, which may benefit both the application engineer and the end user's analyst, whose interactive analyses are not throttled by having to cast each query into a relational, table-based form. Therefore, different NoSQL databases are optimized
for various kinds of analysis. For instance, some are implemented as key value stores,
which pleasantly adjust to certain big data programming models, while another rising
model is a graph database, where a graph abstraction is implemented to insert both
semantics and connectivity inside its structure. In fact, the general ideas for NoSQL
incorporate schema less modelling in which the semantics of the data are inserted inside
an adaptable connectivity and storage model; this accommodates automated distribution
of data and scalability with respect to utilization of compute, storage and network for
data transmission in manners that don't drive specific binding of data to be persistently
stored in physical locations. NoSQL databases additionally accommodate integrated data
caching that makes a difference in reducing the data latency and speedy execution.
5.3 Key Value Stores
A relatively simple type of NoSQL data store is the key value store, a schema-less model in which values are associated with particular character strings called keys and stored in a hash table. The store is presented with a simple string (the key) and returns an arbitrarily large BLOB of data (the value). It does not have a query language; it simply provides a way to add and remove key-value pairs into/from a database. The store is like a dictionary which has a list of words, each word having one or more definitions. Basically, the key value type uses a hash table in which there is a unique key and a pointer to a particular item of data. An example of a key value store is shown in Fig. 5.3.1.
Key                                                               Value
Image name : image-12345.jpg                                      Binary image file
Web page URL : http://www.example.com/my-web-page.html            HTML of a web page
File path name : N:/folder/subfolder/mylife.pdf                   PDF document
MD5 hash : 9e107d9d372bb6826bd81d3542a419d6                       "The quick brown fox jumps over the lazy dog"
REST web service call : view-person?person-id=12345&format=xml    <Person><id>12345</id></Person>
SQL query : SELECT PERSON FROM PEOPLE WHERE PID = "12345"         <Person><id>12345</id></Person>
Fig. 5.3.1 : Example of key value store
Let us take another example of countries and localities, represented in Table 5.3.1, where the name of the country is the key and a locality (address) in that country is the value.
Key          Value
"India"      {"C/09, Shastrinagar, Maharastra, India - 400050"}
"Romania"    {"City Business Center, Coriolan Brediceanu No. 10, Building B, Casario, 30211"}
"US"         {"Ridge Drive. Suite 200 South, Fairfax, California 22033"}
Table 5.3.1 : Example of key-value store for a country database
The key value store does not impose any constraints on data typing or data structure; it is left to the consuming business applications to assert expectations about the data values, their semantics and their interpretation, which is the schema-less property of the model. The different operations that can be performed on a key value store are given in Table 5.3.2 and explained in Fig. 5.3.2. The simplicity of the representation in a key value data store allows massive amounts of indexed data values to be added to the same key value table, which can then be sharded, or distributed, across the storage nodes.

Sr. No.  Function                           Description

1.       Get(key)                           Returns the value associated with the provided key.
2.       Put(key, value)                    Adds the value and associates it with the key (or updates the value if the key already exists).
3.       Multi-get(key1, key2, ..., keyN)   Returns the list of values associated with the list of keys.
4.       Delete(key)                        Removes the entry for the specified key from the data store.

Table 5.3.2 : Different operations in key value store
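The following is a minimal Python sketch of the four operations in Table 5.3.2. It is not tied to any particular NoSQL product; an in-memory dictionary plays the role of the hash table, and the class and method names are illustrative only.

class KeyValueStore:
    # A toy in-memory key value store backed by a Python dict (hash table).
    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # Adds (or overwrites) the value associated with the key.
        self._table[key] = value

    def get(self, key):
        # Returns the value associated with the provided key (None if absent).
        return self._table.get(key)

    def multi_get(self, *keys):
        # Returns the list of values associated with the list of keys.
        return [self._table.get(k) for k in keys]

    def delete(self, key):
        # Removes the entry for the specified key from the data store.
        self._table.pop(key, None)

store = KeyValueStore()
store.put("India", "C/09, Shastrinagar, Maharashtra, India - 400050")
store.put("US", "Ridge Drive, Suite 200 South, Fairfax, California 22033")
print(store.get("India"))
print(store.multi_get("India", "US"))
store.delete("US")

In a real key value database the dictionary would be replaced by a distributed, persistent hash table spread across storage nodes, but the interface presented to the application is essentially the same.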

Under the right conditions, the table is distributed in a way that is aligned with how the keys are organized, so that a hashing function can be used to determine which node holds a given key's bucket. Key value pairs are extremely valuable both for storing the results of analytical algorithms and for producing those results for reports.
Some of the popular use cases of the key-value databases are
 Storing and processing user session data
 Maintaining schema-less user profiles
 Storing user preferences
 Storing shopping cart data

Fig. 5.3.2 Key value datastore operations

The key value model does have drawbacks. One is that it does not natively provide conventional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed concurrently); those capabilities must be provided by the application itself. Another is that, as the model grows, keeping the keys unique may become more difficult, requiring the introduction of some complexity in generating character strings that will remain unique among a myriad of keys.

5.4 Document Stores


The document-based NoSQL database stores and retrieves data as key value pairs in which the value part is stored as a document with an associated key. In a document store, each record is treated as a separate document, and everything inside a document is automatically indexed when a new document is added. Although the indexes are large and everything within a document is searchable, most document stores group documents together in a collection. Each document has some structure and encoding for managing the data. Some of the common encodings used with document stores are XML, JSON (JavaScript Object Notation), BSON (a binary encoding of JSON objects), or other means of serializing data. The document representation embeds the model so that the meanings of the document values can be inferred by the application. A document store is not well suited for running complex queries or for applications that require complex multi-operation transactions.
The difference between an RDBMS and a document store database is given in Table 5.4.1.
RDBMS Document store

Database Database
Table Collection
Tuple/Row Document
Column Field
Primary key Primary key (Default key _id provided by mongodb itself)

Table 5.4.1 : RDBMS vs Document store database

Common use cases of document store databases are e-commerce platforms, content management systems, web analytics and analytics platforms, and blogging platforms. One of the differences between a key value store and a document store is that the key value store requires the use of a key to retrieve data, while a document store often provides object metadata that can be used to query the data based on its contents, either through a programming API or using a query language.
The following example shows the insertion of data values and their respective keys into a document collection called Book. It has keys like _id, title, description, by, an array of tags, and likes. The respective values for the keys are assigned on the right-hand side.
db.Book.insert(
{
"_id": ObjectId(7df78458902c),
"title": "Python Programming",
"description": "Python is a Object oriented language ",
"by": "Technical publication",
"tags": ["SciPy", "NumPy", "Pandas"],
"likes": "100"
})
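As a complementary sketch, the same kind of document can also be inserted and then queried by its contents from Python using the pymongo driver. This is illustrative only : it assumes a MongoDB server is running locally, and the database name bookdb is a placeholder.

from pymongo import MongoClient

# Assumes a local MongoDB instance; the database name is illustrative.
client = MongoClient("mongodb://localhost:27017/")
db = client["bookdb"]

# Insert a document into the Book collection (the _id is generated automatically).
db.Book.insert_one({
    "title": "Python Programming",
    "description": "Python is an object oriented language",
    "by": "Technical publication",
    "tags": ["SciPy", "NumPy", "Pandas"],
    "likes": "100"
})

# Unlike a plain key value store, a document store can be queried on content :
for doc in db.Book.find({"tags": "NumPy"}):
    print(doc["title"], doc["likes"])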

5.5 Tabular Stores


Tabular, or table-based, stores are largely descended from Google's original Bigtable design and are used to manage structured and unstructured data. HBase is an example of a Hadoop-integrated tabular NoSQL database that evolved from Bigtable. The Bigtable NoSQL model enables sparse data to be stored in a three-dimensional table that is indexed by a row key, a column key that indicates the attribute for which the data value is stored, and a timestamp that may refer to the time at which the row's column value was stored.
For example, different attributes of a web page can be associated with the page's URL: the HTML content of the page, the URLs of other web pages that link to this page, and the author of the web page. Columns in the Bigtable model are grouped into "families," and the timestamps enable management of multiple versions of an object. Each time the content changes, a new column affiliation can be created with the timestamp of when the content was downloaded.
The column-oriented NoSQL database is one of the popular tabular databases and is shown in Fig. 5.5.1. Such databases are widely used in applications like business intelligence, CRM, library card catalogs and data warehouses.
Fig. 5.5.1 : Column based tabular datastore (a ColumnFamily groups columns; each row key maps to named columns holding key-value pairs)

In a column-based tabular datastore, a ColumnFamily is a single structure that groups columns together. The Key attribute is the permanent name of the record; a record can have a varying number of columns and values, so it can scale in an irregular way. The Value attribute holds the data described by the key, whereas a Column maintains an ordered list of elements, also known as a tuple, with a name and a value defined in it.
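A minimal Python sketch of this layout (purely illustrative, and not the actual HBase or Bigtable API) represents the three-dimensional table as a nested dictionary indexed by row key, column family, column and timestamp.

import time
from collections import defaultdict

# rows[row_key][column_family][column] holds a list of (timestamp, value) versions.
rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def put(row_key, family, column, value):
    # Store a new version of the cell, stamped with the current time.
    rows[row_key][family][column].append((time.time(), value))

def get_latest(row_key, family, column):
    # Return the most recently written version of the cell, if any.
    versions = rows[row_key][family][column]
    return max(versions)[1] if versions else None

# Web page example : attributes of a page keyed by its URL (row key).
put("http://example.com", "contents", "html", "<html>version 1</html>")
put("http://example.com", "contents", "html", "<html>version 2</html>")
put("http://example.com", "anchor", "author", "J. Smith")

print(get_latest("http://example.com", "contents", "html"))   # <html>version 2</html>

Because every cell carries a timestamp, multiple versions of the same column can coexist, which is exactly how the Bigtable model keeps the history of a changing web page.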


5.6 Object Datastores


An object data store stores and manages different kinds of objects in a database. In some ways, object data stores and object databases appear to bridge the gap between schema-less data management and the traditional relational models. Object databases are similar to document stores, except that document stores explicitly serialize the objects so the data values are stored as strings, while object databases maintain the object structures as they are bound to object-oriented programming languages like C++, Objective-C, Java, and Smalltalk. On the other hand, object database management systems tend to provide traditional ACID (atomicity, consistency, isolation, and durability) compliance. Object databases are not relational databases and cannot be queried using SQL. They are mostly used in cloud computing to store files as objects.

5.7 Graph Datastores


Graph stores are highly optimized to efficiently store graph nodes and links, and they enable you to query these graphs. Graph databases are valuable for any business problem that involves complex relationships between objects, such as social networking, rules-based engines, creating mashups, analyzing complex network structures and finding patterns within those structures.
A graph store has three data fields, namely nodes, relationships, and properties, as shown in Fig. 5.7.1. Graph stores are mainly used for identifying distinct patterns of connections between nodes. Graph databases provide a model for representing discrete entities and the many kinds of relationships that link those entities.

Fig. 5.7.1 Data fields in Graph datastore

A graph datastore of movies, with different actors, properties and relationships, is shown in Fig. 5.7.2.


Fig. 5.7.2 Graph of movies database

In the above graph, the actors (like Emile Hirsch, Rain, Halle Berry etc.) are represented by leaf nodes, which are connected to entities such as movies, and the relationships between actors and movies are represented by the arrows.
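A small Python sketch, independent of any particular graph database product, shows how the three data fields of Fig. 5.7.1 (nodes, relationships and properties) can represent a fragment of such a movies graph; the node keys and property values are illustrative.

# Nodes carry properties; relationships connect two nodes and may carry their own properties.
nodes = {
    "halle_berry": {"label": "Actor", "name": "Halle Berry"},
    "cloud_atlas": {"label": "Movie", "title": "Cloud Atlas", "released": 2012},
}

relationships = [
    {"from": "halle_berry", "to": "cloud_atlas", "type": "ACTED_IN", "roles": ["Luisa Rey"]},
]

def movies_of(actor_key):
    # Follow ACTED_IN relationships from an actor node to the connected movie nodes.
    return [nodes[r["to"]]["title"]
            for r in relationships
            if r["from"] == actor_key and r["type"] == "ACTED_IN"]

print(movies_of("halle_berry"))   # ['Cloud Atlas']

A real graph database indexes these connections so that such traversals stay fast even when there are millions of nodes and relationships.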

5.8 Hive
Apache Hive is data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets stored in HDFS. The data managed by Hive resides on distributed storage and can be queried using SQL. Hive is mainly used for storing data in a data warehouse and performing SQL-like operations over it. Hive provides a simple SQL-like query language called Hive Query Language (HQL) for querying and managing large datasets. The Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop.
The basic features of hive are given as follows
 It stores schema in a database and processed data into HDFS.

 It provides SQL type language for querying called HiveQL or HQL.

 It is fast, scalable and extensible.

 Multiple users can simultaneously query the data using Hive-QL.

 It supports different data formats with real time data analytics

 It can store and process large datasets in data warehouse


Using the Hive query language (HiveQL), which is very similar to SQL, queries are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark. Users can run batch processing workloads with Hive and, within the same platform, simultaneously analyze the same data using interactive SQL or machine-learning workloads with tools such as Apache Impala or Apache Spark.
The various SQL commands used in Hive are given in Table 5.8.1.
Sr. No.  Function                             Command

1        Create database                      hive> create database dbname;
2        View databases                       hive> show databases;
3        Select database                      hive> use dbname;
4        Copy input data to HDFS from local   $ hadoop dfs -copyFromLocal <local path to file> hdfs:/
5        Create table                         hive> CREATE TABLE <tablename> (attribute-1 datatype, attribute-2 datatype, ... attribute-n datatype);
6        Load local data into Hive table      hive> LOAD DATA INPATH <filename> OVERWRITE INTO TABLE <tablename>
7        Insert data                          hive> INSERT INTO [TABLE] <tablename> [(column_list)] [PARTITION (partition_clause)]
8        Retrieve all the values              hive> SELECT * FROM <tablename>;
9        Update data                          hive> UPDATE <tablename> SET [column = value ...] [WHERE expression]
10       Delete tuple                         hive> DELETE FROM <tablename> [WHERE expression]

Table 5.8.1 : SQL commands used in Hive

It has support for different aggregation functions of SQL like SUM, COUNT, MAX,
MIN etc. and other functions like CONCAT, SUBSTR, ROUND etc. It also supports
GROUP BY and SORT BY clauses along with joins. It comes with a command-line shell
interface which can be used to create tables and execute queries. In addition, custom
Map-Reduce scripts can also be plugged into queries.
As Hive is a petabyte-scale data warehouse system built on top of the Hadoop platform, it allows programmers to plug in custom Map-Reduce code to perform more sophisticated analysis and data processing. Hive on MapReduce or Spark is therefore best suited for batch data preparation or ETL.
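The commands in Table 5.8.1 can also be issued programmatically. The sketch below uses the third-party PyHive driver to run a HiveQL query from Python; it is an assumption-laden example, in that it expects a HiveServer2 instance to be reachable on localhost:10000 and uses a placeholder table name (customers).

from pyhive import hive   # third-party driver : pip install pyhive

# Assumes HiveServer2 is running locally on its default port (10000).
connection = hive.Connection(host="localhost", port=10000, username="hduser")
cursor = connection.cursor()

# The HiveQL query is compiled by Hive into MapReduce (or Spark) jobs on the cluster.
cursor.execute("SELECT gender, COUNT(*) FROM customers GROUP BY gender")
for row in cursor.fetchall():
    print(row)

connection.close()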


5.9 Sharding
Sharding is a type of database partitioning that splits very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole. Sharding is a database architecture pattern related to horizontal partitioning, which separates one table's rows into multiple tables, known as partitions. Each partition has the same schema and columns but entirely different rows. Likewise, the data held in each partition is unique and independent of the data held in the other partitions.
Original table

Cust_id   Name      Gender   City
1         Mukesh    Male     Mumbai
2         Arun      Male     Delhi
3         Bhushan   Male     Chennai
4         Sonali    Female   Hyderabad

Horizontal partition 1                          Horizontal partition 2

Cust_id   Name      Gender   City               Cust_id   Name      Gender   City
1         Mukesh    Male     Mumbai             3         Bhushan   Male     Chennai
2         Arun      Male     Delhi              4         Sonali    Female   Hyderabad

Vertical partition 1                            Vertical partition 2

Cust_id   Name      Gender                      Cust_id   City
1         Mukesh    Male                        1         Mumbai
2         Arun      Male                        2         Delhi
3         Bhushan   Male                        3         Chennai
4         Sonali    Female                      4         Hyderabad

Table 5.9.1 : Horizontal and vertical partitioning

In a vertically-partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent from the data in all the others, and each table holds both distinct rows and columns. Table 5.9.1 illustrates how a table could be partitioned both horizontally and vertically.
Sharding involves breaking the data up into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, each of which can hold multiple logical shards. Despite this distribution, the data held within all the shards collectively represents the entire logical dataset.
The major benefits of sharding are that it facilitates horizontal scaling, it helps to speed up query response times, and it helps to make an application more reliable by mitigating the impact of outages. The drawbacks of sharding are the complexity of properly implementing a sharded database architecture, the tendency of shards to eventually become unbalanced, and the difficulty of returning to an unsharded architecture once sharding has been done.
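A minimal Python sketch of hash-based (key-based) sharding, with illustrative node names, shows how a hashing function maps each row's shard key to one of the physical shards :

import hashlib

# Illustrative physical shards (e.g. separate database nodes).
shards = ["shard-node-0", "shard-node-1", "shard-node-2"]

def shard_for(key):
    # Hash the shard key and map the digest onto one of the physical shards.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Rows from Table 5.9.1 routed by their Cust_id shard key.
for cust_id in (1, 2, 3, 4):
    print(cust_id, "->", shard_for(cust_id))

A known weakness of this simple modulo scheme is that adding or removing a shard remaps most keys; production systems therefore often use consistent hashing or range-based sharding instead.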

5.10 Hbase
HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS) and is used to store and process unstructured data. It provides a fault-tolerant way of storing and processing large data sets and is well suited for real-time data processing and random read/write access to large volumes of data. It does not support a structured query language like SQL and is not a relational data store. HBase applications are usually written in Java, like a typical MapReduce application. It is designed to scale linearly and comprises a set of standard tables with rows and columns, much like a traditional database. Each table in HBase must have an element defined as the primary key (row key), which uniquely identifies each row. It supports a rich set of primitive data types such as numeric, binary and string, along with complex types including arrays, maps, enumerations and records. HBase relies on ZooKeeper for high-performance coordination, and ZooKeeper support is built into HBase. It is also compatible with Hive, which provides a query engine for batch processing of big data, and it offers fault tolerance for big data applications.
The basic features provided by Hbase are listed as follows
 It offers consistent reads and writes on stored and real time data
 It provides atomic Read and Write where during one read or write process, all other
processes are prevented from performing any read or write operations
 It offers database sharding of tables which is required to reduce I/O time and
overhead
 It provides high availability, high throughput along with scalability in both linear and modular form
 It supports distributed storage like HDFS
 It has built-in support for failover along with backup
 It supports real time data processing using block cache and Bloom filters to make
real time query processing easier
 It supports data replication across clusters.
HBase provides the HBase shell to execute queries. The commonly used commands (queries) in HBase are described in Table 5.10.1.
Sr. No.  Function              Command

1        List the tables       hbase(main):003:0> list
2        Scan (view) a table   hbase(main):012:0> scan '<table_name>'
3        Add a tuple           hbase(main):004:0> put '<table_name>', '<row_key>', '<column_family:column>', '<value>'
4        Retrieve a tuple      hbase(main):023:0> get '<table_name>', '<row_key>'
5        Create table          hbase(main):002:0> create '<table_name>', '<column_family_name>'
6        Delete table          hbase(main):003:0> drop '<table_name>' (the table must be disabled first)

Table 5.10.1 : HBase shell commands
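The same operations can also be performed from Python through HBase's Thrift gateway using the third-party happybase library. This is a sketch under stated assumptions : the HBase Thrift server must be running locally, the table must not already exist, and the table and column family names are placeholders.

import happybase   # third-party client : pip install happybase

# Assumes the HBase Thrift server is running locally on its default port (9090).
connection = happybase.Connection("localhost")

# Create a table with one column family ('info'), then write and read a row.
connection.create_table("customers", {"info": dict()})
table = connection.table("customers")

table.put(b"row1", {b"info:name": b"Mukesh", b"info:city": b"Mumbai"})
print(table.row(b"row1"))          # roughly equivalent to : get 'customers', 'row1'

for key, data in table.scan():     # roughly equivalent to : scan 'customers'
    print(key, data)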

5.11 Analyzing Big Data with Twitter


Twitter (www.twitter.com) is an online social networking platform that allows users to send and read short 140-character text messages called "tweets". It has more than 250 million monthly active users who send about 500 million tweets per day, and as per Alexa's web traffic analysis it is ranked amongst the top 10 most visited websites in the world. Twitter shares its data in a document store format (mostly JSON - JavaScript Object Notation) and allows developers to access that data using a secure authentication API, OAuth. OAuth provides a safer alternative because the user logs into Twitter and approves the sharing of his or her information with each application. Once we have received data from Twitter, we store it in a NoSQL database, over which Map-Reduce approaches can be applied for analysis. To analyze such massive data, relational databases are not enough; massively parallel and distributed systems such as Hadoop are required.
There are two kinds of analysis that can be performed over Twitter data. Trending analysis examines individual tweets and looks for certain words amongst them, while sentiment analysis looks for certain keywords in a tweet and analyzes them to compute a sentiment score. Sentiment analysis is the process of detecting the contextual polarity of text. A common use case for sentiment analysis is to discover how people feel about a particular topic.
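As a simple illustration of dictionary-based sentiment scoring (the same idea that is used later when tweets are rated word by word), the following Python sketch looks up each word of a tweet in a small sentiment dictionary and averages the ratings. The dictionary entries here are made up for the example; real systems use word lists such as AFINN with thousands of rated words.

# Toy sentiment dictionary mapping words to ratings (positive or negative).
ratings = {"good": 3, "great": 4, "happy": 3, "bad": -3, "terrible": -4}

def tweet_score(tweet):
    # Average the ratings of the words of the tweet that appear in the dictionary.
    words = tweet.lower().split()
    scores = [ratings[w] for w in words if w in ratings]
    return sum(scores) / len(scores) if scores else 0.0

print(tweet_score("Demonetization had a good effect"))       # positive score
print(tweet_score("The queues at the banks were terrible"))  # negative score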

5.11.1 Sentiment Analysis using Hadoop


The technologies required to perform this sentiment analysis are Apache Flume, Hive and Pig. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS); it can be used for dumping Twitter data into HDFS. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms for failover and recovery. Hive is a data warehouse tool that allows data to be stored in HDFS and processed using SQL; it is an excellent tool for analytical querying of historical data. Pig uses a procedural programming language called Pig Latin to perform data processing and analytics. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline, which is useful for pipeline development and is accomplished using User Defined Functions (UDFs) that specify how data is loaded, stored and processed.
The procedure for Twitter sentiment analysis is explained as follows

Step 1 : Creating Twitter Application


Open the website dev.twitter.com/apps in the browser. The website will suggest signing in, as shown in Fig. 5.11.1 (a). So we sign into our Twitter account and will then be asked to create a new app; click on Create New App as shown in Fig. 5.11.1 (b).

Fig. 5.11.1 (a) Signin to twitter Fig. 5.11.1 (b) Create new app


Fill in all the required fields to create the application (a placeholder website such as google.com can be used), as shown in Fig. 5.11.2, then complete the form and click on Create your Twitter application.

Fig. 5.11.2 : Create twitter application wizard

Step 2 : Open Manage keys and access tokens from the application settings and select My access token to create a new token for accessing the Twitter data, as shown in Fig. 5.11.3.

Fig. 5.11.3 Twitter access token

Step 3 : After creating the access token, open flume.conf file in the directory
/usr/lib/flume/conf and then change the following keys in the file. These keys will be

obtained from the page as shown in Fig. 5.11.4.

Fig. 5.11.4 : Flume Configuration File

Step 4 : Now changes need to be made in the flume.conf file for the Access Token, Access Token Secret, Consumer Key (API Key) and Consumer Secret (API Secret), along with adding the search keywords that we want to extract from Twitter. In this example, we are extracting data on demonetization in India, as shown in Fig. 5.11.5.

Fig. 5.11.5 : Configuration settings

Step 5 : Getting Data using Flume


After creating an application on the Twitter developer website, we can now access the Twitter data that we want to extract. Flume will be used to retrieve the data from Twitter, and we will get everything in JSON format. The data fetched from Twitter will be saved in HDFS at the configured location; after running Flume, the Twitter data is automatically saved into HDFS.
Following are the steps to collect and store the dataset from Twitter into HDFS :
1. Start a terminal in Ubuntu and execute the following command to start the Flume agent
$ /usr/lib/flume/bin/flume-ng agent --conf ./conf/ -f /usr/lib/flume/conf/flume.conf \
  -Dflume.root.logger=DEBUG,console -n TwitterAgent \
  -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/
2. By browsing HDFS, the list of extracted Twitter data for the keywords specified in the conf file can be seen.
3. The dataset provided by Twitter can be seen as shown in Fig. 5.11.6.

Fig. 5.11.6 Twitter dataset

Step 6 : Determining Popular Hash Tags using Hive

The tweets coming in from Twitter are in JSON format, therefore we need to load them into Hive using a JSON input format (SerDe). So add the jar file to Hive using the following command.
ADD jar /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
After successfully adding the jar file, create a Hive table to store the Twitter data. For calculating the hashtags, we need the tweet id and the hashtag text to be stored in a Hive table. So, use the following command to create a Hive table.
CREATE EXTERNAL TABLE tweets (id BIGINT, entities
STRUCT<hashtags:ARRAY<STRUCT<text:STRING>>>) ROW FORMAT SERDE
'com.cloudera.hive.serde.JSONSerDe' LOCATION '/flumedir/data/tweets_raw';
Now, let's create another table which stores the id and the hashtag text using the command below (here hashtags refers to an intermediate table, derived from the tweets table, that holds the tweet id and its array of hashtag words) :
create table hashtag_word as select id as id, hashtag from hashtags LATERAL VIEW explode(words) w as hashtag;

®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
N OSQL Data Management for
Big Data Analytics 5 - 19 Big Data and Visualization

Now, let's use the query to calculate the number of times each hashtag has been
repeated.
select hashtag, count(hashtag) as total_count from hashtag_word group by hashtag order by
total_count desc;

Then execute the following command to generate the output

hive -f /hashtag.sql

The hashtags and the number of times each is repeated in the Twitter data will appear as output, as shown in Fig. 5.11.7.

Fig. 5.11.7 Hashtag output using hive

Step 7 : Load the tweets in to pig


The tweets are in nested JSON format and contain map data types. We need to load the tweets using a JsonLoader that supports maps, using the following command

load_tweets = LOAD '/flumedir/data/tweets_raw/' USING


com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

Extract the id and the hashtags from the above tweets using the following command
extract_details = FOREACH load_tweets GENERATE FLATTEN(myMap#'entities') as
(m:map[]),FLATTEN(myMap#'id') as id;

Use the following commands to extract the hashtag text, group the tuples by hashtag and count the occurrences of each hashtag (the grouping step that produces grp is included here, since cnt iterates over it)

hash = foreach extract_details generate FLATTEN(m#'hashtags') as (tags:map[]), id as id;
txt = foreach hash generate FLATTEN(tags#'text') as text, id;
grp = group txt by text;
cnt = foreach grp generate group as hashtag_text, COUNT(txt.text) as hashtag_cnt;

®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
N OSQL Data Management for
Big Data Analytics 5 - 20 Big Data and Visualization

The output listing the popular hashtags using Pig is shown in Fig. 5.11.8.

Fig. 5.11.8 : Popular Hashtags using Pig

Step 8 : Determining Average Rating of Tweets using hive


The command for creating a Hive table to store id and text of the tweets is as follows :
CREATE EXTERNAL TABLE load_tweets(id BIGINT, text STRING) ROW FORMAT SERDE
'com.cloudera.hive.serde.JSONSerDe' LOCATION '/flumedir/data/tweets_raw/';

Create a split_words table that splits each tweet's text into words

create table split_words as select id as id, split(text,' ') as words from load_tweets;

Run the average rating query (word_join here is assumed to be a table obtained by joining the split words with a sentiment dictionary, such as AFINN, that assigns a rating to each word)

select id, AVG(rating) as rating from word_join GROUP BY word_join.id order by rating DESC;

Using the above command, we have calculated the average rating of each tweet from the ratings of its individual words and arranged the tweets in descending order of rating. Now let us run the sentiments.sql script as shown in Fig. 5.11.9.

Fig. 5.11.9 : Running sentiment.sql script

The output of average ratings by tweet id is shown in Fig. 5.11.10, which shows the tweet_id and its rating.


Fig. 5.11.10 Average ratings by tweet ids

5.12 Big Data for E-Commerce and Blogs


The massive usage of the Internet has transformed E-Commerce. E-commerce customers can now access a wide range of products offered through E-Commerce websites. Therefore, in order to remain competitive and defend market share, E-Commerce firms formulate online marketing strategies based on real-time data. E-commerce websites generate real-time data in different formats and of massive size, which cannot be processed using traditional database management systems. So E-Commerce firms are finding ways to extract meaningful information from large datasets in which data gets generated at great velocity, in different varieties and at high volumes, characteristics that are often referred to as Big Data.
This is where Big Data analytics comes into the picture: it can be used to perform data acquisition, storage, processing and analytics over large datasets. As a result, E-Commerce firms are investing heavily in Big Data analytics to empower them to take accurate and timely decisions.


5.12.1 Applications of Big Data Analytics for E-Commerce


Following are the applications of big data analytics for E-Commerce :
a) Predictive analytics : Predictive analytics refers to the identification of events before they take place through the use of big data analytics, and it allows E-commerce firms to anticipate customer behavior. Big Data analytics can therefore enable you to predict product demand, consumer behavior patterns and supply chain mechanics.
b) Dynamic pricing : Big data analytics in E-commerce can improve customer satisfaction by providing dynamic pricing. Dynamic pricing is required because the majority of products compete on the price offered by other sites. In order to attract new customers, E-Commerce companies must be vigilant and vibrant while setting competitive prices for their products, and they need to actively influence the customer to buy at their site, which involves setting a competitive price.
c) Enhanced customer service : E-commerce firms can use Big Data to provide an exceptional customer service experience. Providing excellent customer service can lead firms to achieve a competitive advantage.
d) Personalization : Big data analytics offers personalized services, products, content and promotions for specific segments, which improves customer satisfaction along with ROI.
e) Fraud detection : The use of Big Data in E-commerce can help in detecting fraudulent activities over the web. If fraud is detected in real time, it leads to a speedier resolution and lower damage or loss. When a fraud detection pattern is combined with Big Data powered real-time detection, the system gathers the intelligence required to detect and negate fraudulent practices. In addition, E-Commerce firms are able to identify fraud in real time by combining transaction data with customers' purchase history, web logs, social feeds, and geospatial location data from smartphone apps.
f) Supply chain visibility : Big data in E-commerce allows customers to track goods ordered online while they are still in shipment, which has become the standard. Customers expect specific supply chain information, such as the exact availability, status and location of their orders. This involves intensive use of data infrastructure when E-Commerce firms have multiple third parties, such as warehousing and transportation providers, in their supply chain. The use of Big Data therefore allows firms to build better models, which produce results with higher precision.

5.12.2 Big Data for Blogs


Beyond E-commerce, big data analytics can also be used with blogs to improve blog sites. Nowadays, blogs are found for every topic, such as travel, pets, food, etc. Big data analytics for blogs can give
 A brief description of each blog from the users' perspective.
 The Alexa rank, through which one can view the users' perspective about the blog, and page views over the last three months, to give a global traffic ranking along with update timing.
 The frequency with which the blog is updated.
 An interesting article from the site being reviewed.

5.13 Review of Basic Data Analytic Methods using R


R is a programming language used mostly for different types of data analysis and graphics, and it is available for free under the GNU General Public License. The following section explains how to perform data analytics using R and how to understand the flow of a basic R script that addresses an analytical problem. The steps are as follows.

Step 1 : Import dataset


Suppose we have a dataset in a comma-separated-value (csv) file that contains data about annual sales in U.S. dollars for a set of customers. The read.csv() function is used to import the CSV file. The command for reading the csv file is given below.
sales <- read.csv("c:/data/sales.csv")
Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded properly, as well as to become familiar with the data. In the example, the head() function, by default, displays the first six records of sales.

head(sales)

cust_id sales_total num_of_orders gender


1 100001 500 F
2 100002 200 F
3 100003 708 M
4 100004 490 M
5 100005 721 F
6 100006 643 F

®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
N OSQL Data Management for
Big Data Analytics 5 - 24 Big Data and Visualization

The summary() function provides some descriptive statistics, such as the mean and
median, for each data column.
summary(sales)
cust_id sales_total num_of_orders gender
Min. :10001 Min. : 30.02 Min. : 1.000 F:5035
1st Qu.:10251 1st Qu.: 80.29 1st Qu.: 2.000 M:4965
Median :10501 Median : 151.65 Median : 2.000
Mean :10501 Mean : 249.46 Mean : 2.428
3rd Qu.:10750 3rd Qu.: 295.50 3rd Qu.: 3.000
Max. :11000 Max. :7606.09 Max. :22.000

Plotting a dataset's contents can provide information about the relationships between the various columns.
The hist() function is used to plot a histogram. For example, after a regression model has been fitted and stored in a variable called results, the distribution of its residuals can be plotted with
hist(results$residuals, breaks = 800)

®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
N OSQL Data Management for
Big Data Analytics 5 - 25 Big Data and Visualization

Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors. A vector can only consist of values of the same class. Tests for vectors can be conducted using the is.vector() function, while the array() function can be used to restructure a vector as an array. The matrix() function is used to create a matrix, as shown below.
sales <- matrix(0, nrow = 3, ncol = 4)
sales
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
The basic data analytical methods in R are explained in Lab no. 3.

Summary

 The NoSQL is a set of concepts that allows the rapid and efficient processing of
data sets with a focus on performance, reliability, and agility. It does not require a
fixed schema, normalized data (3NF), tables or joins, instead it is used for
distributed data stores with easier scalability.
 There are five basic types of NoSQL databases like Key-Value Store, Document
Store, Tabular Store, Object Store and Graph-based databases
 The key-value store is presented with a simple string called a key and returns an arbitrarily large BLOB of data called the value; the document-based NoSQL database stores and retrieves data as key value pairs in which the value part is stored as a document with an associated key; and a tabular store keeps sparse data in a three-dimensional table that is indexed by a row key, a column key and a timestamp that may refer to the time at which the row's column value was stored.
 The object data store stores and manages different kinds of objects in a database, while graph stores are highly optimized to efficiently store graph nodes and links and enable you to query these graphs.


 Apache Hive is data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets stored in HDFS. The data managed by Hive resides on distributed storage and can be queried using SQL.
 Sharding is a type of database partitioning that splits very large databases into smaller, faster, more easily managed parts called data shards.
 HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS) and is used to store and process unstructured data.
 On Twitter, trending analysis examines individual tweets and looks for certain words amongst them, while sentiment analysis looks for certain keywords in a tweet and analyzes them to compute a sentiment score.
 Applications of big data analytics for E-Commerce and Blogs are Predictive
Analytics, Dynamic Pricing, Enhanced Customer Service, Fraud detection, Supply
Chain Visibility etc.

Two Marks Questions with Answers [Part - A Questions]


Q.1 What is NOSQL database ? AU : May-17
Ans. : The NOSQL database stands for "Not Only SQL" or "Not SQL." The NOSQL is a
set of concepts that allows the rapid and efficient processing of data sets with a focus on
performance, reliability, and agility. It does not require a fixed schema, normalized data
(3NF), tables or joins, instead it is used for distributed data stores with easier scalability.
It is largely used for Big Data applications where real-time data is generated with huge volume, variety, velocity and veracity. For example, social media websites like Twitter, Facebook, Google, Instagram etc. collect and process terabytes of user data every single day and use NoSQL databases for all of their operations.
Q.2 What is the significance of NOSQL datastores over traditional DBMS ?
Ans. : The major advantages of NOSQL databases over traditional SQL databases are
given as below
1. It supports real time or batch processing data along with different formats like
structured, unstructured or semi-structured.
2. It can process uncleansed or uncertain data for analytics.


3. It does not require expensive hardware for implementation, as it can be run on commodity hardware.
4. It supports huge volume of data generated real time with velocity.
5. It can perform data analytics at low cost.
6. It is free of joins and schema and doesn't require complex queries.
7. It can work on many processors parallelly with linear scalability.
8. Most of NOSQL systems are Opensource that makes it very cost efficient.
9. Cost per bit for processing is very low and compatible with cloud computing.
Q.3 What are key value store and document store datastores ?
Ans. : A relatively simple type of NoSQL data store is the key value store, a schema-less model in which values are associated with particular character strings called keys and are stored in a hash table. The store is presented with a simple string (the key) and returns an arbitrarily large BLOB of data (the value). It does not have a query language; instead it provides simple operations to add and remove key-value pairs to/from a database. The store is like a dictionary that has a list of words, where each word has one or more definitions. An example of a key-value store is shown in Table 5.1, where the country is represented as the key and a locality is represented as the value.
Key Value

“India” {“C/09, Shastrinagar, Maharastra, India – 400050”}

“Romania” {“ City Business Center, Coriolan Brediceanu No. 10, Building B,


Casario, 30211”}

“US” {“ Ridge Drive. Suite 200 South, Fairfax, California 22033”}

Table 5.1 : Example of Key-value store for country database

The document-based NoSQL database stores and retrieves the data as a key value
pair where value part is stored as a document with associated key. In document store
each record is considered as a separate document where everything inside a document
is automatically indexed when a new document is added. Although the indexes are
large and everything is searchable within the document, most of the document stores
group documents together in a collection. Each document has some structure and encoding for managing the data. Some of the common encodings used with document stores are XML, JSON (JavaScript Object Notation), BSON (which is a binary encoding of JSON objects) etc., or other means of serializing data. An example of a document store for the collection Book using MongoDB is shown as follows.
{"_id": ObjectId(7df78458902c), "title": "Python Programming", "description": "Python
is a Object oriented language ", "by": "Technical publication", "tags": ["SciPy", "NumPy",
"Pandas"], "likes": "100"}
Q.4 Enlist the categories of NOSQL datastores.
Ans. : There are five basic categories of NoSQL databases, described as follows :

1. Key-Value store : It uses a combination of keys and values stored in big hash tables to process the data, e.g. Redis, Riak, Amazon DynamoDB etc.

2. Document store : It stores and processes data in terms of documents in a collection, e.g. MongoDB, CouchDB, Amazon DocumentDB etc.

3. Tabular store (Column-based store) : These databases are capable of storing and processing data in the columns of a table. Some of the columnar databases are HBase, Cassandra, Azure Table Storage (ATS), BerkeleyDB etc.

4. Object store : These databases store objects, with the capability to represent an ORDBMS. Examples of object stores are ObjectDB, Perst, ObjectStore etc.

5. Graph-based databases : These are network-style databases that use edges and nodes to represent the stored data and its relationships, e.g. Apache Giraph, GraphX, Neo4j etc.

Q.5 List the features of Hive and Hbase.


Ans. : The basic features of hive are given as follows :

 It stores schema in a database and processed data into HDFS.


 It provides SQL type language for querying called HiveQL or HQL.

 It is fast, scalable and extensible.

 Multiple users can simultaneously query the data using Hive-QL.

 It supports different data formats with real time data analytics.

 It can store and process large datasets in data warehouse.

®
TECHNICAL PUBLICATIONS - An up thrust for knowledge
N OSQL Data Management for
Big Data Analytics 5 - 29 Big Data and Visualization

The features provided by Hbase are listed as follows :


 It offers consistent reads and writes on stored and real time data.

 It provides atomic read and write where during one read or write process, all other
processes are prevented from performing any read or write operations.
 It offers database sharding of tables which is required to reduce I/O time and
overhead.
 It provides high availability, high throughput along with scalability in both linear
and modular form.
 It supports distributed storage like HDFS.

 It has built-in support for failover along with backup.

 It supports real time data processing using block cache and Bloom filters to make real
time query processing easier.
 It supports data replication across clusters.

Q.6 What is sharding ? Explain horizontal and vertical partitioning. [Refer section 5.9]

Part - B Questions

Q.1 What is NOSQL ? Explain different types of NOSQL in detail.

Q.2 Explain in brief document store and graph-based stores.

Q.4 Write a short note on Hive and Hbase.

Q.5 Explain in brief Big data analytics for E-commerce and Blogs.
Q.6 Explain the architecture of Hive.
Ans. : The Hive architecture is composed of five components, namely the Hive user interface, the metadata store, HDFS/HBase, the Hive query processing engine (HiveQL) and the execution engine. The architecture of Hive is shown in Fig. 5.2.


Fig. 5.2 : Hive architecture

The functionality of each component is explained as follows :
a) Hive user interface : Hive creates interaction between the user and HDFS through the Hive Web UI, the Hive command line and, on Windows Server, Hive HDInsight.
b) Metadata store : Hive stores the database schema and its HDFS mapping in the metadata database server.
c) HDFS/HBase : HDFS or HBase are the data storage techniques used to store data in the file system.
d) Hive query processing engine (HiveQL) : HiveQL is used for querying the schema information in the metadata store. Instead of writing a MapReduce program in Java, a query can be written for the MapReduce job and processed.
e) Execution engine : The execution engine processes the query and generates results in the same way as MapReduce.
Initially, the Hive user interface (Web UI/command line/HDInsight) sends a query to the database driver (JDBC/ODBC) for execution. The driver takes the help of the query compiler, which parses the query and checks its syntax and requirements. The compiler then sends a metadata request to the database where the metadata is stored, and the database sends its response back to the compiler. The compiler then sends its response to the driver, which passes it to the execution engine. The execution engine (MapReduce process) sends the job to the JobTracker, which is in the NameNode, and the JobTracker assigns this job to the TaskTracker, which is in the DataNode. The execution engine receives the results from the DataNodes and sends them to the driver. Finally, the driver sends the result back to the UI.


Big Data Analytics Lab
Contents
Lab 1 : Installation of Hadoop single node cluster on Ubuntu 16.04

Lab 2 : Write a Map-reduce program to demonstrate the Word count application

Lab 3 : Review of Basic Data Analytic Methods Using R

Lab 4 : To implement K-Means clustering in R

Lab 5 : To implement decision tree classification in R

Lab 6 : To implement Naive Bayes Classification in R

Lab 7 : To study NOSQL databases Mongo dB and implement CRUD operations

Lab 8 : To study HDFS, Hive and Hbase commands


Lab 1 : Installation of Hadoop single node cluster on Ubuntu 16.04

In a single node setup, the name node and the data node run on the same machine. The detailed steps to install Hadoop on Ubuntu 16.04 are explained as follows
Step 1 : Update the Ubuntu
$ sudo apt-get update

Step 2 : Install JDK


$ sudo apt-get install default-jdk


Verify the Java version
$ java -version

Step 3 : Add dedicated hadoop users


$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser


Step 4 : Install SSH


$ sudo apt-get install ssh

Verify SSH using the which command
$ which ssh

Step 5 : Create and setup SSH certificates


Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local
machine. For our single-node setup of Hadoop, we therefore need to configure SSH
access to localhost. So, we need to have SSH up and running on our machine and
configured it to allow SSH public key authentication.
$ ssh-keygen -t rsa -P ""


Add the newly created key to the list of authorized keys so that Hadoop can use ssh
without prompting for a password.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Check the SSH to localhost


$ ssh localhost


Disable the IPv6 feature, because using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses. To disable it, open the sysctl.conf file
$ sudo nano /etc/sysctl.conf
add following lines at the end of sysctl.conf file and reboot the machine.
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled and a value of 1 means disabled.


Step 6 : Download hadoop


$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Extract the Hadoop archive and move it to the /usr/local directory

$ tar xvzf hadoop-2.6.0.tar.gz
$ sudo mv hadoop-2.6.0 /usr/local/hadoop


Step 7 : Assign root privileged to hduser


$ sudo adduser hduser sudo
$ sudo chown -R hduser:hadoop /usr/local/hadoop

Step 8 : Setup Configuration Files


The following files will have to be modified to complete the Hadoop setup:
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

1. Configure bashrc file


We need to find the path where Java has been installed in order to set the JAVA_HOME environment variable in the bashrc file. So open the bashrc file
$ nano ~/.bashrc


Append the following lines at the end of bashrc file


#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END


Compile the file and check java version


$ source ~/.bashrc
$ javac -version
$ which javac
$ readlink -f /usr/bin/javac


2. Configure hadoop-env.sh to set JAVA_HOME

Open the file and export the path of the Java home
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


3. Configure core-site.xml file


The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting up. This file can be used to override the default settings that Hadoop starts with. So create a temp directory for Hadoop and assign it to hduser, then open the core-site.xml file
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following lines inside <configuration> section


<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>


4. Configure mapred-site.xml file
By default, the /usr/local/hadoop/etc/hadoop/ folder contains the /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be copied/renamed to mapred-site.xml.
So copy the file and open it for configuration.
$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Add following lines inside <Configuration> section


<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map


and reduce task.


</description>
</property>
</configuration>

5. Configure hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. It is used to specify the directories that will be used as the namenode and the datanode on that host. So first create directories under the hdfs store for the name node and the data node.
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store


Open hdfs-site.xml for configuration


$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following lines under <configuration> section


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>

Step 9 : Format the New Hadoop Filesystem


The Hadoop file system needs to be formatted so that we can start to use it.
$ hadoop namenode -format


Step 10 : Start all services of hadoop to use it.


There are two ways to start all the services of Hadoop. They are given as follows
$ start-all.sh
$ ./start-all.sh


To verify that all the services are running, run the jps command. If the output of jps lists the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager along with Jps), then we can say that Hadoop is successfully installed.
$ jps


Lab 2 : Write a Map-reduce program to demonstrate the Word count application

In this practical, a single node Hadoop cluster has been used. A Hadoop cluster with Eclipse pre-installed on CentOS is used for running the Map-reduce program. The steps to run the word count program using the map-reduce framework are as follows

Step 1 : Open Eclipse, create a new Java project, specify a name and click on Finish


Step 2 : Right click on project and Create new package wordcount


Step 3 : Right click on Package name wordcount and create new class in it and assign
name wordcount


Step 4 : Write the MapReduce program for wordcount within that class


package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class wordcount


{
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>
{
public void map(LongWritable key, Text value, Context con) throws
IOException, InterruptedException
{
String line = value.toString();

StringTokenizer token = new StringTokenizer(line);

while(token.hasMoreTokens())
{
String status = new String();
String word = token.nextToken();
Text outputKey = new Text(word);
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
} // end of map()
} //end of Mapper Class

public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,


IntWritable>
{


public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException,
InterruptedException
{

int sum = 0;

for(IntWritable value : values)


{
sum += value.get();
}

con.write(word, new IntWritable(sum));

} // end of reduce()
} // end of Reducer class
/*

*/

// job definition

public static void main(String[] args) throws Exception


{

Configuration c = new Configuration();


String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
Path input = new Path(files[0]);
Path output = new Path(files[1]);
Job j = new Job(c, "wordcount");
j.setJarByClass(wordcount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true) ? 0:1);

} // end of main()

} //end of main class


Step 5 : Add the required jar files to resolve errors

To add jar files, right click on the class file, select the Build Path option and open the Configure Build Path window. To add the essential libraries, click on the Add External JARs button and add the three jar files one by one. Here we need three jar files, namely hadoop-core.jar, commons-cli-1.2.jar and core-3.1.1.jar.

Step 6 : Once all the errors have been resolved, right click on the project, select Export as a JAR file, specify a name for it and click on Finish.


Step 7 : Create an input text file and copy both the input file and the jar file to the hadoop directory on HDFS.
[training@localhost ~]$ cat > inputtsec
this is a demo program on map reduce
the the is is a a demo mo
map map
reduce reduce
[1]+ Stopped          cat > inputtsec
[training@localhost ~]$ hadoop fs -copyFromLocal /home/training/inputtsec hadoop/
[training@localhost ~]$ hadoop fs -copyFromLocal /home/training/word.jar hadoop/
[training@localhost ~]$

Step 8 : Run the program using the following command

$ hadoop jar <jar-name>.jar <package>.<class> <input-file(s)> <output-directory>
In our program, the jar file name is word.jar, the package name is wordcount, the class name is wordcount and the input file name is inputtsec. So the command will be
$ hadoop jar word.jar wordcount.wordcount hadoop/inputtsec hadoop/output002/


Step 9 : Check the output

To see the output, open the part file which lies inside the output002 directory.
[training@localhost ~]$ hadoop fs -cat hadoop/output002/part-r-00000
Map 1
a 3
demo 2
is 3
map 2
mo 1
on 1
program 1
reduce 3
the 2
this 1
[training@localhost ~]$

Lab 3 : Review of Basic Data Analytic Methods Using R

R is a language and environment for statistical computing and graphics. It is a GNU


project which is similar to the S language and environment which was developed at Bell
Laboratories. R provides a wide variety of statistical (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical
techniques, and is highly extensible. R is available as Free Software under the terms of the
Free Software Foundation’s GNU General Public License in source code form. It compiles
and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD
and Linux), Windows and MacOS. RStudio is an integrated development environment for R, with a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management.
To use the R programming tool, we need to download and install R and RStudio (an IDE for R) from the following URLs respectively :
https://cran.r-project.org/bin/windows/base/
https://rstudio.com/products/rstudio/download/#downloadstudio
Once the installation is done, the RStudio IDE looks as shown below.


In this lab, we are going to study various methods in R available for exploring,
conditioning, modeling, and presenting the data in a dataset. The various functions used
to perform data analytics are explained as follows.

1) Import data
Syntax : var_name <- read.csv("<Filename.csv>")
Reading a csv file from a local path :
titanic_data <- read.csv("c:/data/titanic.csv")
Loading a dataset from a remote website :
titanic_data <- "https://goo.gl/At238b" %>% read.csv
Choosing a file interactively :
data <- read.csv(file.choose())
2) Examine the dataset
head(Var_name) # For printing the top lines of a dataset
summary(Var_name) #For printing the summary of a dataset
3) Plot the graph
Syntax for bar plot : barplot(height, names.arg = NULL, col = NULL, main = NULL)
Syntax for plot : plot(x, y, ...)
Syntax for histogram : hist(x, breaks, freq, probability, density, angle, col, border)


Examples :
barplot(height = data$Marks, names.arg = data$Names, col = "Blue")
x <- seq(-pi, pi, 0.1)
plot(x, sin(x))
library(ggplot2)
hist(housing$Home.Value)

4) Install libraries
install.packages("dplyr")
install.packages("ggplot2")
5) Import libraries
library(dplyr)
library(ggplot2)
6) Read table
Syntax : read.table(file, header, sep, quote, row.names, col.names, allowEscapes, flush, stringsAsFactors, fileEncoding)
> x <- read.table(file.choose(), header = T, sep = "\t")
> x
State region date Home.value structur.cost
1 MH west 2019 214952 160599
2 MH west 2019 225511 160252
3 RS west 2019 234994 163791
4 GJ west 2017 235820 161787
5 GA west 2016 244590 155400
6 MH west 2017 253714 157458
7) Operators in R
Arithmetic operators : + - * / %% %/% ^
Relational operators : < > == <= >= !=
Logical operators : & | ! && ||
Assignment operators : = <- -> <<- ->>


Miscellaneous operators : %in% %*%
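A tiny illustrative snippet using some of these operators (the values are made up) :
> x <- c(2, 4, 6)       # assignment with <-
> x + 1                 # arithmetic
> x >= 4                # relational
> (x > 2) & (x < 6)     # logical (element-wise)
> 3 %in% x              # miscellaneous : membership test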

8) Constants in R

Constant        Value
pi              3.141593
LETTERS         A ... Z
letters         a ... z
month.abb       "Jan" ... "Dec"
month.name      "January" ... "December"
9) Datatypes in R

Vectors (Numeric, Logical, Integer, Complex, Characters)


Matrix (Rows, Columns)
Data frame (Tables)
List (Objects)
Array (Elements)
Factor
10) Functions for Descriptive Statistics

Function                                        Description
mean(x)                                         Mean
median(x)                                       Median
var(x)                                          Variance
sd(x)                                           Standard deviation
scale(x)                                        Standard scores (z-scores)
cor(x, y)                                       Correlation
cor.test(x, y)                                  Correlation test
coefficients(a)                                 Slope and intercept of a linear regression
sort(x)                                         Sort in ascending order
rank(x)                                         Rank of each number in a vector
quantile(x)                                     The 0th, 25th, 50th, 75th and 100th percentiles
t.test(x, mu = n, alternative = "two.sided")    Two-tailed t-test that the mean of the numbers in vector x differs from n
aov(y ~ x, data = d)                            Single-factor ANOVA
lm(y ~ x, data = d)                             Linear regression
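As a quick illustration of a few of the functions listed above, the short R snippet below uses a small made-up vector x that is not part of any lab dataset :
> x <- c(4, 8, 15, 16, 23, 42)       # hypothetical sample values
> mean(x); median(x); var(x); sd(x)  # basic descriptive statistics
> quantile(x)                        # 0 %, 25 %, 50 %, 75 % and 100 % percentiles
> sort(x); rank(x)                   # ordering and ranks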


Lab 4 : To implement K-Means clustering in R

Clustering is an unsupervised learning technique and the task of grouping together a


set of objects in a way that objects in the same cluster are more similar to each other than
to objects in other clusters. Similarity is a measure that reflects the strength of the
relationship between two data objects. Clustering is mainly used for exploratory data
mining. It is used in many fields such as machine learning, pattern recognition, image
analysis, information retrieval, bio-informatics, data compression, and computer
graphics. In K-means clustering, k represents the number of clusters and must be
provided by the user.

Example 1
In this example, we are going to use the "USArrests" dataset available on the Kaggle
website. The steps to perform clustering using the K-means technique are given as follows :

Step 1 : Install and load the required packages

> install.packages("tidyverse")
> install.packages("cluster")
> install.packages("factoextra")
> install.packages("gridExtra")
> library(tidyverse)
> library(cluster)
> library(factoextra)
> library(gridExtra)

Step 2 : Load dataset “USArrest” and remove the missing values from dataset
> data('USArrests')
> d_frame <- USArrests
> d_frame <- na.omit(d_frame) #Removing the missing values
> d_frame <- scale(d_frame)
> head(d_frame)
The output of the above commands is given as follows.


               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Step 3 : Now apply K-means clustering with cluster size 2


> kmeans2 <- kmeans(d_frame, centers = 2, nstart = 25)
> str(kmeans2)
The output of the above commands for K-means with cluster size = 2 on the given dataset is depicted below.
List of 9
 $ cluster      : Named int [1:50] 2 2 2 1 2 2 1 1 2 2 ...
  ..- attr(*, "names") = chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ centers      : num [1:2, 1:4] -0.67 1.005 -0.6076 1.014 -0.132 ...
  ..- attr(*, "dimnames") = List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ totss        : num 196
 $ withinss     : num [1:2] 56.1 46.7
 $ tot.withinss : num 103
 $ betweenss    : num 93.1
 $ size         : int [1:2] 30 20
 $ iter         : int 1
 $ ifault       : int 0
 - attr(*, "class") = chr "kmeans"
Now, plot the graph for analyzing the output of k-means clustering with cluster size=2
using following command
> fviz_cluster(kmeans2, data = d_frame)


Step 4 : Plot the graphs with variable cluster sizes


Now, to compare different cluster structures, create additional clusterings with k = 3, 4 and 5, and analyze
the results by plotting the outputs using the following commands.
> kmeans3 <- kmeans(d_frame, centers = 3, nstart = 25)
> kmeans4 <- kmeans(d_frame, centers = 4, nstart = 25)
> kmeans5 <- kmeans(d_frame, centers = 5, nstart = 25)
> #Comparing the Plots
> plot1 <- fviz_cluster(kmeans2, geom = "point", data = d_frame) + ggtitle("k = 2")
> plot2 <- fviz_cluster(kmeans3, geom = "point", data = d_frame) + ggtitle("k = 3")
> plot3 <- fviz_cluster(kmeans4, geom = "point", data = d_frame) + ggtitle("k = 4")
> plot4 <- fviz_cluster(kmeans5, geom = "point", data = d_frame) + ggtitle("k = 5")
> grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)
The results of K-means for the above problem are plotted as follows.
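As a hedged side note before moving to Example 2 : a common way to choose the number of clusters k (not covered in the original steps) is the elbow method, and the factoextra package loaded above provides fviz_nbclust() for this. A minimal sketch, assuming d_frame is the scaled data prepared in Step 2 :
> # Total within-cluster sum of squares for k = 1, 2, ..., 10; look for the "elbow" in the curve
> fviz_nbclust(d_frame, kmeans, method = "wss")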


Example 2
In Example 2, we are going to use the Uber dataset provided on the Kaggle website, where
the value of k is predefined as 5. The steps to perform clustering using the K-means
technique on the Uber dataset are given as follows :

Step 1 : Initialize and load the datasets in R


To load datasets in R, read.csv command is used with following syntax

Syntax :
Variable_name <- read.csv (<Path to csv file>)
Here symbol <- is used as an assignment operator

Example :
X <- read.csv("https://raw.githubusercontent.com/raw-data.csv")
In this example, we are going to use the Uber datasets available on the Kaggle website for
the analysis. The URLs for the six datasets are given below :
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-apr14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-may14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jun14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jul14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-aug14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-sep14.csv

Now load all the datasets using the read.csv command into the respective variables.
(Note : You must have an active internet connection to load the data from the Kaggle
website into your local RStudio.)
apr <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-apr14.csv")
may <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-may14.csv")
jun <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-jun14.csv")
jul <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-jul14.csv")


aug <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-


response/master/uber-trip-data/uber-raw-data-aug14.csv")
sep <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-
response/master/uber-trip-data/uber-raw-data-sep14.csv")
The output of the above commands in RStudio is shown below.

Now, to bind the data of all the separate datasets into one, we need to use the dplyr
package's bind_rows function.
To install the dplyr package, type the following command at the R prompt :
> install.packages("dplyr")
To bind the datasets, use the following commands to merge all records into the variable data :
> library("dplyr")
> data <- bind_rows(apr, may, jun, jul, aug, sep)
Now view the summary of data using the following command :
> summary(data)
OUTPUT
> summary(data)
   Date.Time               Lat               Lon                Base
 Length : 4534327     Min.    : 39.66   Min.    : -74.93   B02512 :  205673
 Class  : character   1st Qu. : 40.72   1st Qu. : -74.00   B02598 : 1393113
 Mode   : character   Median  : 40.74   Median  : -73.98   B02617 : 1458853
                      Mean    : 40.74   Mean    : -73.97   B02682 : 1212789
                      3rd Qu. : 40.76   3rd Qu. : -73.97   B02764 :  263899
                      Max.    : 42.12   Max.    : -72.07


The merged dataset contains the following columns :


 Date.Time: represents the date and time of the Uber pickup;
 Lat: represents the latitude of the Uber pickup;
 Lon: represents the longitude of the Uber pickup;
 Base: represents the TLC base company code affiliated with the Uber pickup.

Step 2 : Prepare the data for clustering and perform missing value analysis
This step consists of cleaning and rearranging the data so that it is easier to work with. It is
a good idea to first think about the sparsity of the dataset and check the amount of missing
data. For that, first install the VIM library.
> install.packages("VIM")
Now, run the aggr command on the merged dataset, which plots the missing value pattern
as shown below (a minimal sketch of this call follows).
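A minimal sketch of this call, assuming the merged data frame data created in Step 1 (the plotting options may differ from the run shown in the original figure) :
> library(VIM)
> aggr(data)   # plots the count and pattern of missing values per column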

From the above output, we can say that there are no missing values in the dataset. Now, to
work with the date and time values, let us use the lubridate library. Lubridate makes it
simple to identify the order in which the year, month, and day appear in your dates and to
manipulate them.
> install.packages("lubridate")

> library("lubridate")
> data$Date.Time <- mdy_hms(data$Date.Time)
> data$Year <- factor(year(data$Date.Time))
> data$Month <- factor(month(data$Date.Time))
> data$Day <- factor(day(data$Date.Time))
> data$Weekday <- factor(wday(data$Date.Time))
> data$Hour <- factor(hour(data$Date.Time))
> data$Minute <- factor(minute(data$Date.Time))
> data$Second <- factor(second(data$Date.Time))
> data$Month


Now, check out the first few rows to see what our data looks like now
> head(data, n = 10)

        Date.Time             Lat       Lon       Base    Year  Month  Day  Weekday  Hour  Minute
 1      2014-04-01 00:11:00   40.7690   -73.9549  B02512  2014  4      1    3        0     11
 2      2014-04-01 00:17:00   40.7267   -74.0345  B02512  2014  4      1    3        0     17
 3      2014-04-01 00:21:00   40.7316   -73.9873  B02512  2014  4      1    3        0     21
 4      2014-04-01 00:28:00   40.7588   -73.9776  B02512  2014  4      1    3        0     28
 5      2014-04-01 00:33:00   40.7594   -73.9722  B02512  2014  4      1    3        0     33
 6      2014-04-01 00:33:00   40.7383   -74.0403  B02512  2014  4      1    3        0     33
 7      2014-04-01 00:39:00   40.7223   -73.9887  B02512  2014  4      1    3        0     39
 8      2014-04-01 00:45:00   40.7620   -73.9790  B02512  2014  4      1    3        0     45
 9      2014-04-01 00:55:00   40.7524   -73.9960  B02512  2014  4      1    3        0     55
 10     2014-04-01 01:01:00   40.7575   -73.9846  B02512  2014  4      1    3        1     1

Step 3 : Apply K-means Clustering


Now, use the kmeans() function with the k value set to 5. The nstart option makes kmeans
attempt multiple initial configurations and report the best one. Setting a seed creates a
fixed starting point for the random number generator, so that each time the code is run the
same answer is generated. Use the following commands to apply K-means clustering on
the given dataset.
> set.seed(20)
>
> clusters <- kmeans(data[,2:3], 5)
>
> data$borough <- as.factor(clusters$cluster)
>
> str(clusters)
List of 9
 $ cluster      : int [1:4534327] 3 4 4 3 3 4 4 3 4 3 ...
 $ centers      : num [1:5, 1:2] 40.7 40.8 40.8 40.7 40.7 ...
  ..- attr(*, "dimnames") = List of 2
  .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "Lat" "Lon"
$ totss : num 22107
$ withinss : num [1:5] 1386 1264 948 2787 1029
$ tot.withinss : num 7414
$ betweenss : num 14692


$ size : int [1:5] 145109 217566 1797598 1802301 571753


$ iter : int 4
$ ifault : int 0
- attr(*, “class”)= chr “kmeans”
The above list is the output of the kmeans() function, where cluster is a vector of integers
(from 1 to k) indicating the cluster to which each point is allocated, centers is a matrix of
cluster centres, withinss is the vector of within-cluster sums of squares (one component
per cluster), tot.withinss is the total within-cluster sum of squares (that is, sum(withinss)),
and size is the number of points in each cluster.
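As a small hedged follow-up, these components can be inspected directly from the clusters object created above, for example :
> clusters$centers                        # latitude/longitude of the five cluster centres
> clusters$size                           # number of pickups assigned to each cluster
> clusters$betweenss / clusters$totss     # share of total variance explained by the clustering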

Step 4 : Plot the results


Now, let us plot graphs of the obtained output to visualize the data as well as the results of
the K-means clustering, using the following commands.
> install.packages("ggmap")
> library(ggmap)
> library(DT)
> data$Month <- as.double(data$Month)
> month_borough_14 <- count_(data, vars = c('Month', 'borough'), sort = TRUE) %>%
+   arrange(Month, borough)
> datatable(month_borough_14)
> library(dplyr)
> monthly_growth <- month_borough_14 %>%
+   mutate(Date = paste("04", Month)) %>%
+   ggplot(aes(Month, n, colour = borough)) + geom_line() +
+   ggtitle("Uber Monthly Growth")
> monthly_growth
The output of the above commands is given below.
Uber Monthly Growth


References:
https://www.kaggle.com/deepakg/usarrests
https://uc-r.github.io/kmeans_clustering
https://www.datacamp.com/community/tutorials/k-means-clustering-r
https://www.guru99.com/r-k-means-clustering.html
https://data-flair.training/blogs/clustering-in-r-tutorial/

Lab 5 : To implement decision tree classification in R

Decision trees are a popular data mining technique that uses a tree-like structure to derive
outcomes from a series of input decisions. In this lab we are going to implement decision
tree classification in R. The steps are as follows :

Step 1 : Install and import the required packages


To build a decision tree in R, we need to install and import the packages dplyr, readr,
caTools, party, partykit, rpart and rpart.plot. The commands for installation and
initialization are as follows :
> install.packages("dplyr")
> install.packages("readr")
> install.packages("caTools")
> install.packages("party")
> install.packages("partykit")
> install.packages("rpart")
> install.packages("rpart.plot")
> library(dplyr)
> library(readr)
> library(caTools)
> library(party)
> library(partykit)
> library(rpart)
> library(rpart.plot)

Step 2 : Import the dataset


In this example, we are going to use the Titanic dataset freely available on the Kaggle website.
The purpose of this dataset is to predict which people are more likely to survive the
collision with the iceberg. The dataset contains 13 variables and 1309 observations and is
ordered by the variable X. Now, let us read the dataset and store it inside the
titanic_data variable. The commands for importing the dataset are as follows.


titanic_data <- "https://goo.gl/At238b" %>%
  read.csv %>%
  select(survived, embarked, sex, sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))

Step 3 : Split the dataset into training and testing sets


Now, split our dataset into two sub datasets namely training and testing sets using
following commands
> set.seed(123)
> sample_data = sample.split(titanic_data, SplitRatio = 0.75)
> train_data <- subset(titanic_data, sample_data == TRUE)
> test_data <- subset(titanic_data, sample_data == FALSE)

Step 4 : Plot the Decision Tree


In this step, we are going to build and plot the decision tree based on the training dataset
using the rpart function. The commands for plotting the decision tree are as follows :
> rtree <- rpart(survived ~ ., train_data)
> rpart.plot(rtree)
The output of decision tree is shown in following figure


The conditional inference tree for the same model can be built and plotted using the following commands.
> ctree_ <- ctree(survived ~ ., train_data)
> plot(ctree_)
The output of the conditional inference tree is shown as follows.
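As a hedged extension beyond the original steps, the fitted rpart tree can also be evaluated on the held-out test set. Because survived was kept numeric, rpart fits a regression tree here, so the predicted values are probabilities that can be thresholded at 0.5 :
> pred <- predict(rtree, test_data)     # predicted survival probability for each passenger
> table(actual = test_data$survived, predicted = as.numeric(pred > 0.5))   # simple confusion matrix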

References :
https://www.kaggle.com/c/titanic/data
https://data-flair.training/blogs/r-decision-trees/
https://www.guru99.com/r-decision-trees.html#1

Lab 6 : To implement Naive Bayes Classification in R

The steps for implementing Naive Bayes Classification in R are as follows

Step 1 : Load the libraries using following commands


> library(naivebayes)
> library(dplyr)
> library(ggplot2)
> library(psych)


Step 2 : Import the dataset


For this example, we have created a csv file with the attributes admit, gre, gpa and rank,
containing 400 records. The admit attribute has two values, 0 and 1; gre has values
between 380 and 700; gpa has values between 2 and 4; and rank lies between 1 and 4. You
can also create your own dataset.
Now load the dataset using the following commands and check its structure.
> data <- read.csv(file.choose(), header = T)
> str(data)
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
> xtabs(~admit+rank, data = data)
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12

Step 3 : Visualize the data


In this example, the ggplot() function is used to plot the data for admit, with density on
the y-axis and gpa on the x-axis.
> data$rank <- as.factor(data$rank)
> data$admit <- as.factor(data$admit)
> pairs.panels(data[-1])
> data %>%
+ ggplot(aes(x=admit, y=gpa, fill = admit)) +
+ geom_boxplot() +
+ ggtitle("Box Plot")
>
> data %>% ggplot(aes(x=gpa, fill = admit)) +
+ geom_density(alpha=0.8, color= 'black') +
+ ggtitle("Density Plot")


The output of visualization is shown as follows

Step 4 : Perform data partitioning


For classification, make the data partition using following commands
> set.seed(1234)
> ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
> train <- data[ind == 1,]
> test <- data[ind == 2,]

Step 5 : Create a naive Bayes model and train the model


Now create a naive Bayes model. In this example, models built with and without
kernel-based densities are compared. To create the two models, use the following commands.

Create a naive Bayes model without kernel-based densities :
> model <- naive_bayes(admit ~ ., data = train)
> model

Create a naive Bayes model with kernel-based densities :
> model <- naive_bayes(admit ~ ., data = train, usekernel = T)
> model

The output of the above commands is shown below.

Naive Bayes
Call :
naive_bayes.formula(formula = admit ~ ., data = train, usekernel = T)
---------------------------------------------------------------------------------------------
Laplace smoothing : 0
---------------------------------------------------------------------------------------------
A priori probabilities :
        0         1
0.6861538 0.3138462
---------------------------------------------------------------------------------------------
Tables :
---------------------------------------------------------------------------------------------
::: gre::0 (KDE)
---------------------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (223 obs.) ;  Bandwidth 'bw' = 35.5
        x                       y
 Min.    : 193.5        Min.    : 6.010e-07
 1st Qu. : 371.7        1st Qu. : 2.924e-04
 Median  : 550.0        Median  : 1.291e-03
 Mean    : 550.0        Mean    : 1.401e-03
 3rd Qu. : 728.3        3rd Qu. : 2.405e-03
 Max.    : 906.5        Max.    : 3.199e-03
---------------------------------------------------------------------------------------------

::: gre::1 (KDE)
---------------------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (102 obs.) ;  Bandwidth 'bw' = 39.59
        x                       y
 Min.    : 181.2        Min.    : 1.145e-06
 1st Qu. : 365.6        1st Qu. : 2.007e-04
 Median  : 550.0        Median  : 1.291e-03
 Mean    : 550.0        Mean    : 1.354e-03
 3rd Qu. : 734.4        3rd Qu. : 2.375e-03
 Max.    : 918.8        Max.    : 3.465e-03
---------------------------------------------------------------------------------------------
::: gpa::0 (KDE)
---------------------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (223 obs.) ;  Bandwidth 'bw' = 0.1134
        x                       y
 Min.    : 2.080        Min.    : 0.0002229
 1st Qu. : 2.645        1st Qu. : 0.0924939
 Median  : 3.210        Median  : 0.4521795
 Mean    : 3.210        Mean    : 0.4419689
 3rd Qu. : 3.775        3rd Qu. : 0.6603271
 Max.    : 4.340        Max.    : 1.1433285
---------------------------------------------------------------------------------------------
::: gpa::1 (KDE)
---------------------------------------------------------------------------------------------
Call :
density.default(x = x, na.rm = TRUE)
Data : x (102 obs.) ;  Bandwidth 'bw' = 0.1234
        x                       y
 Min.    : 2.25         Min.    : 0.0005231
 1st Qu. : 2.78         1st Qu. : 0.0800747
 Median  : 3.31         Median  : 0.4801891
 Mean    : 3.31         Mean    : 0.4710851
 3rd Qu. : 3.84         3rd Qu. : 0.8626207
 Max.    : 4.37         Max.    : 1.0595464
---------------------------------------------------------------------------------------------
::: rank (categorical)
---------------------------------------------------------------------------------------------
rank           0          1

1 0.10313901 0.24509804
2 0.36771300 0.42156863
3 0.33183857 0.24509804
4 0.19730942 0.08823529

Now verify the trained model against the training data; for example, compute the mean and
standard deviation of gre for the admitted class (admit == 1).

> train %>%
+   filter(admit == "1") %>%
+   summarise(mean(gre), sd(gre))
  mean(gre)  sd(gre)
1  622.9412  110.924

Plot the model :
> plot(model)


Step 6 : Create predictions and confusion matrices over the training and test data
First make the predictions on the training dataset :
> p <- predict(model, train, type = 'prob')
> head(cbind(p, train))
The output of the above command is shown below :
0 1 admit gre gpa rank
1 0.8528794 0.1471206 0 380 3.61 3
2 0.5621460 0.4378540 1 660 3.67 3
3 0.2233490 0.7766510 1 800 4.00 1
4 0.8643901 0.1356099 1 640 3.19 4
6 0.6263274 0.3736726 1 760 3.00 2
7 0.5933791 0.4066209 1 560 2.98 1

Now create the confusion matrix for the training data using the class predictions p1 and
observe the misclassification rate.
> p1 <- predict(model, train)

Non kernel-based densities :
> (tab1 <- table(p1, train$admit))
p1      0    1
   0  196   69
   1   27   33
> 1 - sum(diag(tab1)) / sum(tab1)
[1] 0.2953846

Kernel-based densities :
> (tab1 <- table(p1, train$admit))
p1      0    1
   0  203   69
   1   20   33
> 1 - sum(diag(tab1)) / sum(tab1)
[1] 0.2738462

Similarly, create the confusion matrix for the test data using the predictions p2 and
observe the misclassification rate.
> p2 <- predict(model, test)

Non kernel-based densities :
> (tab2 <- table(p2, test$admit))
p2      0    1
   0   47   21
   1    3    4
> 1 - sum(diag(tab2)) / sum(tab2)
[1] 0.32

Kernel-based densities :
> (tab2 <- table(p2, test$admit))
p2      0    1
   0   47   20
   1    3    5
> 1 - sum(diag(tab2)) / sum(tab2)
[1] 0.3066667

While developing a model, if we do not use kernel-based densities the accuracy may be
degraded. From the above results, it is observed that with kernel-based densities the
misclassification rate on the training dataset reduces from 29.53 % to 27.38 % and on the
test dataset from 32 % to 30.66 %; in other words, the accuracy improves from about
70.5 % to 72.6 % on the training data and from about 68 % to 69.3 % on the test data.
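Equivalently, the accuracy itself can be computed directly from the confusion matrices above; a small sketch, assuming tab1 and tab2 currently hold the kernel-based results :
> sum(diag(tab1)) / sum(tab1)   # training accuracy (about 0.726 for the kernel-based model)
> sum(diag(tab2)) / sum(tab2)   # test accuracy (about 0.693 for the kernel-based model)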

References:
https://www.youtube.com/watch?v=RLjSQdcg8AM
Lab 7 : To study NOSQL databases Mongo dB and implement CRUD operations

MongoDB is a general-purpose, document-based, distributed database built for modern
application developers and for the cloud. MongoDB can be downloaded for any operating
system from the https://www.mongodb.com/ website as shown below.

Once it is downloaded and installed, it can be started by using the mongod command on
the terminal to start the server, followed by the mongo command to start a client session.
In the client session, we can perform various database manipulation operations such as CRUD.


The various commands for performing CRUD operations in MongoDB are given as follows :

Sr. No.  Operation                  Syntax                                             Example
1        Create / switch database   > use <database_name>                              > use mydb
                                                                                       switched to db mydb
2        Show databases             > show dbs                                         > show dbs
                                                                                       local 0.78125GB
                                                                                       test  0.23012GB
3        Delete current database    > db.dropDatabase()                                > db.dropDatabase()
4        Create collection          > db.createCollection("<Collection_Name>")         > db.createCollection("Student")
5        Insert records in a        > db.<Collection_name>.insert({"Key" : "Value"})   > db.movie.insert({"name" : "tsec"})
         database
6        Delete collection          > db.<Collection_Name>.drop()                      > db.Student.drop()
                                                                                       true
7        View records               > db.<Collection_name>.find()                      > db.Student.find()
8        Update records             > db.<Collection_name>.update({<oldKey : oldValue>}, {$set : {<newKey : newValue>}})
                                                                                       > db.Student.update({'title' : 'Manager'}, {$set : {'title' : 'Leader'}})

Lab 8 : To study HDFS, Hive and Hbase commands

To work on HDFS, Hive and HBase we need a configured Hadoop server, which can either
be installed manually as in Lab 1 or obtained through readily available Hadoop
distributions like Cloudera or Hortonworks. The Cloudera Hadoop (CDH) QuickStart
virtual machine can be downloaded from
https://www.cloudera.com/downloads/quickstart_vms/5-13.html as shown below.


The CDH virtual machine can be opened in any virtualization software like VMware
Workstation Player, as shown in the following figure.

Once Hadoop is ready, we can work on any Hadoop ecosystem component like Hive,
HDFS, Pig, HBase etc.

1. HDFS
The Apache Hadoop is an open source software framework that enables distributed
processing of large data sets across clusters of commodity servers using simple
programming models. It is designed to scale up from a single server to thousands of
machines, with a very high degree of fault tolerance. The Hadoop framework includes the
Hadoop Distributed File System (HDFS) as its storage layer.


To work on HDFS, open the terminal and try the following commands :

Sr. No.  Command                                                     Description
1        # hadoop fs -ls                                             List the files
2        # hadoop fs -count hdfs:/                                   Count the number of directories, files and bytes under the path
3        # hadoop fs -mkdir /user/hadoop                             Create a new directory hadoop under the user directory
4        # hadoop fs -rm hadoop/cust                                 Delete the file cust from the hadoop directory
5        # hadoop fs -mv /user/training/cust hadoop/                 Move the file cust from the /user/training directory to the hadoop directory
6        # hadoop fs -cp /user/training/cust hadoop/                 Copy the file cust from the /user/training directory to the hadoop directory
7        # hadoop fs -copyToLocal hadoop/a.txt /home/training/       Copy the file a.txt from HDFS to the local disk
8        # hadoop fs -copyFromLocal /home/training/a.txt hadoop/     Copy the file a.txt from the local directory /home/training to HDFS
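A short worked sequence using these commands is sketched below; the directory and file names are illustrative and follow the conventions used earlier in this manual :
$ hadoop fs -mkdir hadoop                                   # create a working directory on HDFS
$ hadoop fs -copyFromLocal /home/training/a.txt hadoop/     # upload a local file
$ hadoop fs -ls hadoop/                                     # verify that the file is present
$ hadoop fs -cat hadoop/a.txt                               # display its contents
$ hadoop fs -rm hadoop/a.txt                                # remove it again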

2. Hive
The Apache Hive is a data warehouse software that facilitates reading, writing, and
managing large datasets residing on distributed storage using SQL. Hive provides a
simple SQL-like query language called Hive Query Language (HQL) for querying and
managing large datasets. It also allows programmers to plug in custom MapReduce
programs to perform more sophisticated analysis. To work on Hive, open the terminal in
CDH and type hive, which opens the Hive shell as shown below.

The different commands for Hive are explained in the table as follows :


Sr. No.  Function                 Command
1        List the databases       hive> show databases;
2        Create a database        hive> create database <database_name>;
3        Use a database           hive> use <database_name>;
4        List the tables          hive> show tables;
5        Create a table           hive> create table <table_name> (<column> <type>, ...) row format delimited fields terminated by ',';
6        Load data into a table   hive> load data local inpath '<path_to_file>' into table <table_name>;
7        Query a table            hive> select * from <table_name>;
8        Drop a table             hive> drop table <table_name>;
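A small worked example in the Hive shell is sketched below; the database, table and file names are illustrative only :
hive> create database studentdb;
hive> use studentdb;
hive> create table student (roll int, name string)
    > row format delimited fields terminated by ',';
hive> load data local inpath '/home/training/student.csv' into table student;
hive> select * from student;
hive> drop table student;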

3. HBase
HBase is a column-oriented NoSQL database used to store unstructured data in a
column-family format on top of the Hadoop platform. HBase is not similar to relational
database systems and, in general, does not support SQL syntax. To work on HBase, open
the terminal in CDH and type hbase shell as shown below.

The different commands for HBase are explained in the table as follows :

Sr. No.  Function             Command
1        List the tables      hbase(main):003:0> list
2        Scan a table         hbase(main):012:0> scan '<table_name>'
3        Add a tuple          hbase(main):004:0> put '<table_name>', '<row_key>', '<column_family:column>', '<value>'
4        Retrieve a tuple     hbase(main):023:0> get '<table_name>', '<row_key>'
5        Create a table       hbase(main):002:0> create '<table_name>', '<column_family_name>'
6        Delete a table       hbase(main):003:0> disable '<table_name>' followed by drop '<table_name>'
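A small worked example in the HBase shell is sketched below; the table name, column family and values are illustrative only :
hbase(main):001:0> create 'student', 'info'
hbase(main):002:0> put 'student', '1', 'info:name', 'Riya'
hbase(main):003:0> get 'student', '1'
hbase(main):004:0> scan 'student'
hbase(main):005:0> disable 'student'
hbase(main):006:0> drop 'student'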




