
Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purposes of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only for the respective group /
learning community. If you are not the addressee, you should not
disseminate, distribute, or copy it through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
it from your system. If you are not the intended recipient, you are
notified that disclosing, copying, distributing, or taking any action in reliance on
the contents of this information is strictly prohibited.
CS8091
Big Data Analytics

Department: IT

Batch/Year: 2018-22/ III


Created by: Ms. S. Jhansi Ida,
Assistant Professor, RMKEC

Date: 17.04.2021
Table of Contents

S NO  CONTENTS
1     Contents
2     Course Objectives
3     Pre Requisites (Course Names with Code)
4     Syllabus (With Subject Code, Name, LTPC details)
5     Course Outcomes
6     CO-PO/PSO Mapping
7     Lecture Plan
8     Activity Based Learning
9     Unit 5: NoSQL Data Management for Big Data and Visualization
      5.1  NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation
      5.2  Key Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases
      5.3  Hive
      5.4  HBase
      5.5  Sharding
      5.6  Analyzing Big Data with Twitter
      5.7  Big Data for E-Commerce - Big Data for Blogs
      5.8  Review of Basic Data Analytic Methods using R
10    Assignments
11    Part A (Questions & Answers)
12    Part B Questions
13    Supportive Online Certification Courses
14    Real Time Applications
15    Content Beyond the Syllabus
16    Assessment Schedule
17    Prescribed Text Books & Reference Books
18    Mini Project Suggestions

Course Objectives
To know the fundamental concepts of big data and analytics.
To explore tools and practices for working with big data.
To learn about stream computing.
To know about the research that requires the integration of large amounts of data.
Pre Requisites

CS8391 – Data Structures

CS8492 – Database Management System


Syllabus
CS8091  BIG DATA ANALYTICS                                            L T P C
                                                                      3 0 0 3
UNIT I  INTRODUCTION TO BIG DATA  9

Evolution of Big Data - Best Practices for Big Data Analytics - Big Data
Characteristics - Validating the Promotion of the Value of Big Data - Big
Data Use Cases - Characteristics of Big Data Applications - Perception and
Quantification of Value - Understanding Big Data Storage - A General
Overview of High-Performance Architecture - HDFS - MapReduce and
YARN - MapReduce Programming Model.

UNIT II CLUSTERING AND CLASSIFICATION 9

Advanced Analytical Theory and Methods: Overview of Clustering -
K-means - Use Cases - Overview of the Method - Determining the Number
of Clusters - Diagnostics - Reasons to Choose and Cautions - Classification:
Decision Trees - Overview of a Decision Tree - The General Algorithm -
Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R -
Naïve Bayes - Bayes’ Theorem - Naïve Bayes Classifier.

UNIT III ASSOCIATION AND RECOMMENDATION SYSTEM 9

Advanced Analytical Theory and Methods: Association Rules - Overview -
Apriori Algorithm - Evaluation of Candidate Rules - Applications of
Association Rules - Finding Association & Finding Similarity -
Recommendation System: Collaborative Recommendation - Content-Based
Recommendation - Knowledge-Based Recommendation - Hybrid
Recommendation Approaches.
UNIT IV  STREAM MEMORY  9
Introduction to Streams Concepts - Stream Data Model and Architecture -
Stream Computing - Sampling Data in a Stream - Filtering Streams -
Counting Distinct Elements in a Stream - Estimating Moments -
Counting Oneness in a Window - Decaying Window - Real Time Analytics
Platform (RTAP) Applications - Case Studies - Real Time Sentiment
Analysis, Stock Market Predictions - Using Graph Analytics for Big Data:
Graph Analytics.

UNIT V  NOSQL DATA MANAGEMENT FOR BIG DATA AND VISUALIZATION  9
NoSQL Databases: Schema-less Models: Increasing Flexibility for Data
Manipulation - Key Value Stores - Document Stores - Tabular Stores -
Object Data Stores - Graph Databases - Hive - Sharding - HBase -
Analyzing Big Data with Twitter - Big Data for E-Commerce - Big Data for
Blogs - Review of Basic Data Analytic Methods using R.
Course Outcomes
CO#  COs                                                                                                    K Level
CO1  Identify big data use cases and characteristics, and make use of HDFS and the MapReduce
     programming model for data analytics                                                                   K3
CO2  Examine the data with clustering and classification techniques                                         K4
CO3  Discover the similarity of huge volumes of data with association rule mining and examine
     recommender systems                                                                                    K4
CO4  Perform analytics on data streams                                                                      K4
CO5  Inspect NoSQL databases and their management                                                           K4
CO6  Examine the given data with R programming                                                              K4

CO-PO/PSO Mapping

CO#   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3

CO1    2    3    3    3    3    1    1    -    1    2     1     1     2     2     2
CO2    2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO3    2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO4    2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO5    2    3    2    3    3    1    1    -    1    2     1     1     1     1     1
CO6    2    3    2    3    3    1    1    -    1    2     1     1     1     1     1
Lecture Plan
UNIT – V
S No | Topics | No of periods | Proposed date | Actual date | Pertaining CO | Taxonomy level | Mode of delivery
1 | NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation | 1 | 28.04.2021 | 28.04.2021 | CO5 | K4 | PPT
2 | Key Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases | 1 | 01.05.2021 | 01.05.2021 | CO5 | K4 | PPT
3 | Hive - HBase | 1 | 04.05.2021 | 04.05.2021 | CO5 | K4 | PPT / Video
4 | Sharding | 1 | 05.05.2021 | 05.05.2021 | CO5 | K4 | PPT
5 | Analyzing big data with Twitter | 1 | 07.05.2021 | 07.05.2021 | CO5 | K4 | PPT / Video
6 | Big data for E-Commerce - Big data for blogs | 1 | 08.05.2021 | 08.05.2021 | CO5 | K4 | PPT
7 | Review of Basic Data Analytic Methods using R | 1 | 11.05.2021 | 11.05.2021 | CO6 | K4 | PPT / RStudio
8 | Review of Basic Data Analytic Methods using R | 1 | 15.05.2021 | 15.05.2021 | CO6 | K4 | PPT / RStudio
ACTIVITY BASED LEARNING

Crossword Puzzle

https://crosswordlabs.com/view/cs8091-bda-unit-5

Flash cards

https://quizlet.com/in/597267333/big-data-analytics-unit-5-flash-cards/?x=1qqt
Lecture Notes
5.1 NoSQL Databases

The availability of a high-performance, elastic distributed data environment enables
creative algorithms to exploit variant modes of data management in different ways.
Some algorithms will not be able to consume data in traditional RDBMS systems and
will be acutely dependent on alternative means for data management. Many of these
alternate data management frameworks are bundled under the term “NoSQL
databases.”
The term “NoSQL” may convey two different connotations—one implying that the data
management system is not an SQL-compliant one, while the more accepted implication
is that the term means “Not only SQL,” suggesting environments that combine
traditional SQL (or SQL-like query languages) with alternative means of querying and
access.

Schema-less Models: Increasing Flexibility For Data Manipulation


NoSQL data systems offer greater flexibility in database management while reducing
the dependence on formal database administration. NoSQL databases have more
relaxed modeling constraints, which benefit both the application developer and the
end-user analysts, whose interactive analyses are not throttled by the need to cast each
query in terms of a relational table-based environment.
Different NoSQL frameworks are optimized for different types of analyses. For
example, some are implemented as key-value stores, which align to certain big data
programming models, while another emerging model is a graph database, in which a
graph abstraction is implemented to embed both semantics and connectivity within its
structure.
The general concepts for NoSQL include schema-less modeling in which the semantics
of the data are embedded within a flexible connectivity and storage model. This
provides automatic distribution of data and elasticity with respect to the use of
computing, storage, and network bandwidth that don’t force specific binding of data to
be persistently stored in particular physical locations. NoSQL databases also provide for
integrated data caching that helps reduce data access latency and speed performance.
The loosening of the relational structure is intended to allow different models to be
adapted to specific types of analyses. Because this “relaxed” approach to modeling and
management does not enforce shoe-horning data into strictly defined relational
structures, and the models themselves do not necessarily impose any validity rules, it
potentially introduces risks associated with ungoverned data management activities
such as inadvertent inconsistent data replication, reinterpretation of semantics, and
currency and timeliness issues.

5.2 KEY-VALUE STORES

A simple type of NoSQL data store is a key-value store, a schema-less model in which
values (or sets of values, or complex entity objects) are associated with distinct
character strings called keys. Programmers may see similarity with the data structure
known as a hash table. Other alternative NoSQL data stores are variations on the key-
value theme, which lends a degree of credibility to the model.
Consider the data subset represented in the table below.

The key is the name of the automobile make, while the value is a list of names of
models associated with that automobile make. From the example, the key-value store
does not impose any constraints about data typing or data structure—the value
associated with the key is the value, and it is up to the business applications to assert
expectations about the data values and their semantics and interpretation. This
demonstrates the schema-less property of the model.
The core operations performed on a key-value store include:
❖Get(key), which returns the value associated with the provided key.
❖Put(key, value), which associates the value with the key.
❖Multi-get(key1, key2, ..., keyN), which returns the list of values associated with the list
of keys.
❖Delete(key), which removes the entry for the key from the data store.
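
A minimal, illustrative sketch of these four operations in R, using an in-memory environment as the hash table (the automobile make and model names below are hypothetical, following the pattern of the example above; this is not tied to any particular NoSQL product):

store <- new.env(hash = TRUE)                 # the key-value table

kv_put <- function(key, value) assign(key, value, envir = store)

kv_get <- function(key) {
  if (exists(key, envir = store, inherits = FALSE)) get(key, envir = store) else NULL
}

kv_multi_get <- function(...) lapply(c(...), kv_get)

kv_delete <- function(key) {
  if (exists(key, envir = store, inherits = FALSE)) rm(list = key, envir = store)
}

kv_put("Ford", c("Focus", "Fiesta", "Mustang"))   # value is a list of model names (hypothetical)
kv_put("Toyota", c("Corolla", "Camry"))
kv_get("Ford")                                    # returns the models stored under "Ford"
kv_multi_get("Ford", "Toyota")                    # returns both value sets as a list
kv_delete("Ford")                                 # removes the entry for the key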

The critical characteristic of a key-value store is the uniqueness of the key: to find a
value, the exact key must be used. In this data management approach, if you want to
associate multiple values with a single key, you need to consider how the objects are
represented and how they are associated with the key.
For example, if you want to associate a list of attributes with a single key, the value
stored with the key may itself be another key-value store object.
Key-value stores are essentially very long, and presumably thin, tables (in that there are
not many columns associated with each row). The table’s rows can be sorted by the key
value to simplify finding the key during a query. Alternatively, the keys can be hashed
using a hash function that maps each key to a particular location (sometimes called a
“bucket”) in the table.
Additional supporting data structures and algorithms (such as bit vectors and Bloom
filters) can even be used to determine whether a key exists in the data set at all.
The representation can grow indefinitely, which makes it good for storing large
amounts of data that can be accessed relatively quickly, as well as for environments
requiring incremental appends of data.
Examples include capturing system transaction logs, managing profile data about
individuals, or maintaining access counts for millions of unique web page URLs. The
simplicity of the representation allows massive amounts of indexed data values to be
appended to the same key-value table, which can then be sharded, or distributed
across the storage nodes.
Under the right conditions, the table is distributed in a way that is aligned with the way
the keys are organized, so that the hashing function used to determine where any
specific key exists in the table can also be used to determine which node holds that
key’s bucket (i.e., the portion of the table holding that key).
Key-value pairs are very useful for both storing the results of analytical algorithms (such
as phrase counts among massive numbers of documents) and for producing those results
for reports.

Drawbacks

The potential drawbacks are


❖The model will not inherently provide any kind of traditional database capabilities
(such as atomicity of transactions, or consistency when multiple transactions are
executed simultaneously)—those capabilities must be provided by the application itself.
❖As the model grows, maintaining unique values as keys may become more difficult,
requiring the introduction of some complexity in generating character strings that will
remain unique among a myriad of keys.

5.2.1 DOCUMENT STORES

A document store is similar to a key-value store in that stored objects are associated with
(and therefore accessed via) character string keys. The difference is that the values being
stored, referred to as “documents,” provide some structure and encoding of the
managed data.
There are different common encodings, including XML (Extensible Markup Language),
JSON (JavaScript Object Notation), BSON (a binary encoding of JSON objects), and other
means of serializing data (i.e., packaging up, and potentially linearizing, the data values
associated with a data record or object).
The example below shows documents stored in association with the names of specific
retail locations. Note that while the three examples all represent locations, the
representative models differ. The document representation embeds the model, so that the
meanings of the document values can be inferred by the application.
The difference between a key-value store and a document store is that while the key-
value store requires the use of a key to retrieve data, the document store provides a
means (either through a programming API or a query language) for querying the
data based on its contents. Because the approaches used for encoding the documents
embed the object metadata, one can use methods such as querying by example.
For instance, using the example, one could execute a FIND (MallLocation: “Westfield
Wheaton”) that would pull out all documents associated with the Retail Stores in that
particular shopping mall.
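
A small illustrative sketch of querying by example over documents represented as R lists (apart from the MallLocation value quoted above, the store and field names are hypothetical):

docs <- list(
  list(StoreName = "Retail Store 1", MallLocation = "Westfield Wheaton", City = "Wheaton"),
  list(StoreName = "Retail Store 2", MallLocation = "Westfield Wheaton"),
  list(StoreName = "Retail Store 3", City = "Rockville", State = "MD")
)
# FIND(MallLocation: "Westfield Wheaton"): keep every document whose field matches the example
find_by <- function(documents, field, value) {
  Filter(function(d) !is.null(d[[field]]) && d[[field]] == value, documents)
}
find_by(docs, "MallLocation", "Westfield Wheaton")   # returns the first two documents

Note that the three documents deliberately carry different fields, mirroring the point above that documents in the same store need not share one model.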

5.2.2 TABULAR STORES

Tabular, or table-based, stores are largely descended from Google’s original Bigtable
design to manage structured data. The HBase model is an example of a Hadoop-
related NoSQL data management system that evolved from Bigtable.
The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table
that is indexed by a row key (similar to the key-value and document stores), a column
key that indicates the specific attribute for which a data value is stored, and a
timestamp that may refer to the time at which the row’s column value was stored.
Example: Various attributes of a web page can be associated with the web page’s URL:
the HTML content of the page, URLs of other web pages that link to this web page,
and the author of the content.
Columns in a Bigtable model are grouped together as “families,” and the timestamps
enable management of multiple versions of an object. The timestamp can be used to
maintain history— each time the content changes, new column affiliations can be
created with the timestamp of when the content was downloaded.
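
A brief sketch of this three-part addressing (row key, column, timestamp) using the web-page example above; the concrete URLs, column names, and values are assumptions:

cells <- data.frame(
  row_key   = c("www.example.com", "www.example.com", "www.example.com"),
  column    = c("contents:html", "anchor:referrer", "contents:html"),
  timestamp = c(1, 2, 3),
  value     = c("<html>version 1</html>", "linking page", "<html>version 2</html>"),
  stringsAsFactors = FALSE
)
# The latest version of a cell is the one with the highest timestamp for that (row, column) pair.
latest <- function(cells, row, col) {
  sub <- cells[cells$row_key == row & cells$column == col, ]
  sub$value[which.max(sub$timestamp)]
}
latest(cells, "www.example.com", "contents:html")   # "<html>version 2</html>"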

5.2.3 OBJECT DATA STORES

Object data stores and object databases seem to bridge the worlds of schema-less
data management and the traditional relational models. Approaches to object
databases can be similar to document stores, except that document stores explicitly
serialize the object so the data values are stored as strings, while object databases
maintain the object structures as they are bound to object-oriented programming
languages such as C++, Objective-C, Java, and Smalltalk.
Object database management systems are more likely to provide traditional ACID
(atomicity, consistency, isolation, and durability) compliance—characteristics that are
bound to database reliability. Object databases are not relational databases and are not
queried using SQL.

5.2.4 GRAPH DATABASES

Graph databases provide a model for representing individual entities and the numerous
kinds of relationships that connect those entities. They employ the graph abstraction for
representing connectivity, consisting of a collection of vertices (also referred to
as nodes or points) that represent the modeled entities, connected by edges (also
referred to as links, connections, or relationships) that capture the way two
entities are related. Graph analytics performed on graph data stores is different from the
more frequently used querying and reporting.
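
A minimal sketch of the vertex/edge abstraction (the entity and relationship names are hypothetical and the tables are kept deliberately tiny):

vertices <- c("Alice", "Bob", "Acme Corp")
edges <- data.frame(from = c("Alice", "Bob"),
                    to   = c("Bob", "Acme Corp"),
                    relationship = c("knows", "works_for"),
                    stringsAsFactors = FALSE)
# Neighbors of a vertex: every vertex it is connected to by an edge, in either direction.
neighbors <- function(v) unique(c(edges$to[edges$from == v], edges$from[edges$to == v]))
neighbors("Bob")    # returns "Acme Corp" and "Alice"

Both the semantics (the relationship label) and the connectivity live in the edge table, which is the essence of the graph abstraction described above.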

5.3 HIVE

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides
massive scale-out and fault-tolerance capabilities for data storage and processing on
commodity hardware. Hive provides an SQL-like interface that enables users to do data
summarization, ad hoc querying, and analysis of large volumes of data. Hive’s SQL gives
users multiple places to integrate their own functionality to do custom analysis, such as
User Defined Functions (UDFs).
Hive is layered on top of the file system and execution framework for Hadoop and
enables applications and users to organize data in a structured data warehouse and
therefore query the data using a query language called HiveQL that is similar to SQL.
The Hive system provides tools for extracting/transforming/loading data (ETL) into a
variety of different data formats. Since the data warehouse system is built on top of
Hadoop, it enables native access to the MapReduce model, allowing programmers to
develop custom Map and Reduce functions that can be directly integrated into HiveQL
queries.
Hive provides scalability and extensibility for batch-style queries over large datasets
that are being expanded, while relying on the fault-tolerant aspects of the underlying
Hadoop execution model. Apache Hive enables users to process data without explicitly
writing MapReduce code. A Hive table structure consists of rows and columns. The
rows correspond to a record, transaction, or particular entity (for example, customer)
detail. The values of the corresponding columns represent the various attributes or
characteristics for each row. Hadoop and its ecosystem are used to apply some
structure to unstructured data. A Hive query is first translated into a MapReduce job,
which is then submitted to the Hadoop cluster. Thus, the execution of the query has to
compete for resources with any other submitted job.

Hive Architecture

Hive User Interface: Hive creates interaction between the user and HDFS through the
Hive Web UI and the Hive command line.
Metadata Store: Hive stores the database schema and its HDFS mapping in a
database server.
HDFS/HBase: HDFS or HBase is the data storage technique used to store data in the
file system.
Hive Query Processing Engine (HiveQL): HiveQL is used for querying the schema
information in the metadata store. Instead of writing a MapReduce program directly, a
query can be written in HiveQL and processed as a MapReduce job.
Execution Engine: The execution engine processes the query and generates
results.
Hive – Working Principles

1. The Hive user interface (Web UI or command line) sends a query to the database
   driver (JDBC/ODBC) for execution.
2. The driver, with the help of the query compiler, parses the query and checks its
   syntax and requirements.
3. The compiler sends a metadata request to the database where the metadata is
   stored.
4. The database sends the metadata back to the compiler. The compiler sends its
   response to the driver, which passes it to the execution engine.
5. The execution engine (the MapReduce process) sends the job to the JobTracker,
   which resides in the NameNode; the JobTracker assigns the job to TaskTrackers,
   which reside in the DataNodes.
6. The execution engine receives the results from the DataNodes and sends them to
   the driver.
7. The driver sends the results to the user interface.

HiveQL Basics

From the command prompt, a user enters the interactive Hive environment by simply
entering hive:
$ hive
hive>
A user can define new tables, query them, or summarize their contents.
Example: The following steps define a new Hive table to hold customer data, load
existing HDFS data into the Hive table, and query the table.
The first step is to create a table called customer to store customer details.
Because the table will be populated from an existing tab (‘\t’)-delimited HDFS file, this
format is specified in the table creation query.
hive> select count(*) from customer;
Result: 0
The HiveQL query is executed to count the number of records in the newly created
table, customer. Because the table is currently empty, the query returns a result of
zero. The query is converted and run as a MapReduce job, which results in one map
task and one reduce task being executed.

Hive use cases

❖ Exploratory or ad-hoc analysis of HDFS data: Data can be queried, transformed, and
exported to analytical tools, such as R.
❖ Extracts or data feeds to reporting systems, dashboards, or data repositories such
as HBase: Hive queries can be scheduled to provide such periodic feeds.
❖ Combining external structured data to data already residing in HDFS: Hadoop is
excellent for processing unstructured data, but often there is structured data residing
in an RDBMS, such as Oracle or SQL Server, that needs to be joined with the data
residing in HDFS. The data from an RDBMS can be periodically added to Hive tables for
querying with existing data in HDFS.
Reference Video
https://www.youtube.com/watch?v=cMziv1iYt28

5.4 HBASE

HBase is another example of a non-relational data management environment that
distributes massive datasets over the underlying Hadoop framework. HBase is derived
from Google’s Bigtable and is a column-oriented data layout that, when layered on top
of Hadoop, provides a fault-tolerant method for storing and manipulating large data
tables. HBase is not a relational database, and it does not support SQL queries.

Basic operations for HBase

❖Get (which accesses a specific row in the table)
❖Put (which stores or updates a row in the table)
❖Scan (which iterates over a collection of rows in the table)
❖Delete (which removes a row from the table)
HBase Architecture

HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.

HBase Architecture
1. HMaster:
The implementation of the Master Server in HBase is HMaster.
It is the process that assigns regions to region servers, taking the help of Apache
ZooKeeper.
It handles load balancing of the regions across region servers. It unloads the busy
servers, shifts the regions to less occupied servers, and maintains the state of the
cluster by negotiating the load balancing.
It is responsible for schema changes and other metadata operations such as the
creation of tables and column families.
2. Region:
Regions are nothing but tables that are split up and spread across the region
servers. The default size of a region is 256 MB.
3. Region Server:
A region server holds the regions that communicate with the client and handle
data-related operations.
It handles read and write requests for all the regions under it.
It decides the size of a region by following the region size thresholds.
4. ZooKeeper:
ZooKeeper is an open-source project that provides services like maintaining
configuration information, naming, and providing distributed synchronization.
ZooKeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or
network partitions.
Clients communicate with region servers via ZooKeeper. In pseudo-distributed and
standalone modes, HBase itself takes care of ZooKeeper.
Advantages of HBase
1. Can store large data sets
2. Database can be shared
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication
Disadvantages of HBase
1. No support for SQL structure
2. No transaction support
3. Sorted only on key
4. Memory issues on the cluster
Reference Video
https://www.youtube.com/watch?v=VRD775iqAko

5.5 Sharding

Sharding (also known as Data Partitioning) is the process of splitting a large dataset into
many small partitions which are placed on different machines. Each partition is known as
a "shard". Each shard has the same database schema as the original database. Most data
is distributed such that each row appears in exactly one shard. The combined data from
all shards is the same as the data from the original database.
Sharding is the process of storing data records across multiple machines and it is
MongoDB's approach to meeting the demands of data growth. As the size of the data
increases, a single machine may not be sufficient to store the data nor provide an
acceptable read and write throughput. Sharding solves the problem with horizontal
scaling. With sharding, you add more machines to support data growth and the
demands of read and write operations. MongoDB supports horizontal scaling through
sharding.

Sharding Strategies

1. Horizontal or Range Based Sharding:

The data is split based on the value ranges that are inherent in each entity. For
example, if you store the contact information for online customers, you might choose
to store the information for customers whose last name starts with A-H on one shard,
while storing the rest on another shard.
The disadvantage of this scheme is that the last names of the customers may not be
evenly distributed. You might have a lot more customers whose names fall in the
range of A-H than customers whose last name falls in the range I-Z. In that case, your
first shard will be experiencing a much heavier load than the second shard and can
become a system bottleneck.
The benefit of this approach is that it is the simplest sharding scheme available. Each
shard also has the same schema as the original database. It works well for relatively
non-static data -- for example, storing the contact info for students in a college, because
the data is unlikely to see huge churn.

2. Vertical Sharding

The different features of an entity will be placed in different shards on different
machines. For example, in a LinkedIn application, a user might have a profile, a list of
connections, and a set of articles he has authored.
In the vertical sharding scheme, we might place the various user profiles on one shard,
the connections on a second shard, and the articles on a third shard.
The main benefit of this scheme is that you can handle the critical part of your data (for
example, user profiles) differently from the not-so-critical part of your data (for example,
blog posts) and build different replication and consistency models around it.
Two main disadvantages of vertical sharding scheme are as follows:
1. Depending on the system, the application layer might need to combine data from
multiple shards to answer a query. For example, a profile view request will need
to combine data from the User Profile, Connections and Articles shard. This
increases the development and operational complexity of the system.
2. If your site/system experiences additional growth then it may be necessary to
further shard a feature specific database across multiple servers.

3. Key or Hash based Sharding

An entity has a value (e.g., the IP address of a client application) which can be used as an
input to a hash function, and a resultant hash value is generated. This hash value
determines which database server (shard) to use.
Example: Imagine you have 4 database servers and each request contains an
application id which is incremented by 1 every time a new application is registered.
Perform a modulo operation on the application id with the number 4 and take the
remainder to determine which server the application data should be placed on.

The main drawback of this method is that elastic load balancing (dynamically
adding/removing database servers) becomes very difficult and expensive.
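
A minimal sketch of this placement rule in R (four servers, as in the example above; the application id 17 is an arbitrary illustration):

n_shards <- 4
shard_for <- function(application_id) application_id %% n_shards   # remainder picks the shard
shard_for(17)    # 17 %% 4 = 1, so this application's data is placed on shard 1

The weakness noted above follows directly from this rule: changing n_shards changes almost every remainder, so adding or removing servers forces large amounts of data to move.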
4. Directory based Sharding

Directory based shard partitioning involves placing a lookup service in front of the
sharded databases. The lookup service knows the current partitioning scheme and
keeps a map of each entity and which database shard it is stored on. The lookup
service is usually implemented as a webservice.
The client application first queries the lookup service to figure out the shard (database
partition) on which the entity resides/should be placed. Then it queries / updates the
shard returned by the lookup service.
In the previous example, we had 4 database servers and a hash function that
performed a modulo 4 operation on the application ids. Now, if we wanted to add 6
more database servers without incurring any downtime, we'll need to do the following
steps:
1. Keep the modulo 4 hash function in the lookup service.
2. Determine the data placement based on the new hash function: modulo 10.
3. Write a script to copy all the data based on step 2 into the six new shards and
   possibly onto the 4 existing shards. Note that it does not delete any existing data on
   the 4 existing shards.
4. Once the copy is complete, change the hash function to modulo 10 in the lookup
   service.
5. Run a cleanup script to purge unnecessary data from the 4 existing shards based on
   step 2, since the purged data now exists on other shards.
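
A small sketch of the placement comparison implied by steps 1 and 2 (the application ids 0-19 are arbitrary illustrations):

old_shard <- function(app_id) app_id %% 4     # hash function kept in the lookup service (step 1)
new_shard <- function(app_id) app_id %% 10    # hash function after adding six more servers (step 2)
app_ids <- 0:19
placement <- data.frame(app_id = app_ids,
                        old = old_shard(app_ids),
                        new = new_shard(app_ids))
placement[placement$old != placement$new, ]   # rows whose data must be copied before the switch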
5.6 Twitter Data Analysis
Carefully listening to voice of the customer on Twitter using sentiment analysis allows
companies to understand their audience, keep on top of what’s being said about their
brand – and their competitors – and discover new trends in the industry.
What is Sentiment Analysis?
Sentiment analysis is the automated process of identifying and classifying subjective
information in text data. This might be an opinion, a judgment, or a feeling about a
particular topic or product feature.
The most common type of sentiment analysis is ‘polarity detection’ and involves
classifying statements as positive, negative or neutral.
Performing sentiment analysis on Twitter data involves five steps:
i. Gather relevant Twitter data
ii. Clean data using pre-processing techniques
iii. Create a sentiment analysis machine learning model
iv. Analyze Twitter data using sentiment analysis model
v. Visualize the results of Twitter sentiment analysis

i. Gather relevant Twitter data


Use the Twitter API
Use the Twitter Streaming API to connect to Twitter data streams and gather tweets
containing keywords, brand mentions, and hashtags, or collect tweets from specific
users.
Connect with Tweepy
Tweepy is an easy-to-use Python library for accessing the Twitter API.
ii. Clean data using pre-processing techniques
Preprocessing a Twitter dataset involves a series of tasks such as removing all types of
irrelevant information like emojis, special characters, and extra blank spaces. It can
also involve making format improvements and deleting duplicate tweets, or tweets that
are shorter than three characters.
iii. Create a sentiment analysis machine learning model
There are a number of techniques and complex algorithms used to command and train
machines to perform sentiment analysis. There are pros and cons to each. But, used
together, they can provide exceptional results. Below are some of the most used
algorithms.
Naive Bayes
Naive Bayes classifier works very well for text classification as it computes the
posterior probability of a class, based on the distribution of the words (features) in the
document. The model uses the Bag of words feature extraction. Naïve Bayes classifier
assumes that each feature is independent of each other. It uses Bayes Theorem to
predict the probability that a given feature set belongs to a particular label.
The probability of A, if B is true, is equal to the probability of B, if A is true, times the
probability of A being true, divided by the probability of B being true:
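
Written symbolically (a restatement of the sentence above, with A standing for the label and B for the observed features):

P(A | B) = P(B | A) × P(A) / P(B)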

P(label) is the prior probability of a label, or the likelihood that a random feature set has
that label. P(features | label) is the probability that a given feature set is observed for
that label. P(features) is the probability that a given feature set occurs. When
techniques like lemmatization, stop word removal, and TF-IDF are implemented, Naive
Bayes becomes more and more predictively accurate.


Linear Regression
Linear regression is a statistical algorithm used to predict a Y value, given X features.
Using machine learning, the data sets are examined to show a relationship. The
relationships are then placed along the X/Y axis, with a straight line running through
them to predict further relationships.
Linear regression calculates how the X input (words and phrases) relates to
the Y output (polarity). This will determine where words and phrases fall on a scale of
polarity from “really positive” to “really negative” and everywhere in between.
iv. Analyze Twitter data using sentiment analysis model
There are various measures that help to evaluate how well the sentiment analysis model performs.
Accuracy
The accuracy is the percentage of texts that were predicted with the correct tag. It is
the total number of correct predictions divided by the total number of texts in the
dataset.
F1 Score
F1 Score is another measure for how well the classifier is doing its job, by combining
both Precision and Recall for all the tags
Precision
Precision refers to the percentage of texts the classifier got right out of the total
number of texts that it predicted for a given tag.
Recall
Recall refers to the percentage of texts the classifier predicted for a given tag out of
the total number of texts it should have predicted for that given tag.
Many of the statistics for a classifier start with a simple question: was a text
correctly classified or not?
This forms the basis for four possible outcomes:
A true positive is an outcome where the model correctly predicts a tag that does apply.
Similarly, a true negative is an outcome where the model correctly predicts that a tag
does not apply.
A false positive is an outcome where the model incorrectly predicts a tag that does not
apply, and a false negative is an outcome where the model incorrectly fails to predict a
tag that does apply.
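
A hedged sketch in R of computing the four measures above from confusion-matrix counts (tp, fp, fn, and tn are hypothetical counts for a single tag):

tp <- 70; fp <- 10; fn <- 20; tn <- 100            # illustrative counts, not real results
accuracy  <- (tp + tn) / (tp + fp + fn + tn)       # correct predictions over all texts
precision <- tp / (tp + fp)                        # correct out of everything predicted for the tag
recall    <- tp / (tp + fn)                        # correct out of everything that should have the tag
f1        <- 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)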
v. Visualize the results of Twitter sentiment analysis
The results obtained from the analysis can be visualized in the form of bar graphs, pie
charts, time series graphs, etc. The visualized results are then made available on the
website for the end user.
Reference Video
https://www.youtube.com/watch?v=O_B7XLfx0ic
5.7 Big Data for Ecommerce and Blogs

Incorporating big data in the e-commerce industry allows businesses to gain access to
significantly larger amounts of data, helping them convert that growth into revenue,
streamline operational processes, and gain more customers.
Big data solutions can help the e-commerce industry flourish.
Eight ways big data can foster positive change in any E-commerce business:
1. Elevated shopping experience
2. More secure online payments
3. Increased personalization
4. Increased focus on “micro moments”
5. Optimized pricing and increased sales
6. Dynamic customer service
7. Generate increased sales
8. Predict trends, forecast demand

1. Elevated shopping experience

E-commerce companies have an endless supply of data to fuel predictive analytics that
anticipate how customers will behave in the future. Retail websites track the number of
clicks per page, the average number of products people add to their shopping carts
before checking out, and the average length of time between a homepage visit and a
purchase. If the customers are signed up for a rewards or subscription program,
companies can analyze demographic, age, style, size, and socioeconomic information.
Predictive analytics can help companies develop new strategies to prevent shopping cart
abandonment, lessen time to purchase, and cater to budding trends. Likewise, E-
commerce companies use this data to accurately predict inventory needs with changes in
seasonality or the economy.

2.More secure online payments

To provide a peak shopping experience, customers need to know that their payments are
secure. Big data analysis can recognize atypical spending behavior and notify customers
as it happens. Companies can set up alerts for various fraudulent activities, like a series
of different purchases on the same credit card within a short time frame or multiple
payment methods coming from the same IP address.
Many E-commerce sites now offer several payment methods on one centralized platform.
Big data analysis can determine which payment methods work best for which customers,
and can measure the effectiveness of new payment options like “bill me later”. Some e-
commerce sites have implemented an easy checkout experience to decrease the chances
of an abandoned shopping cart. The checkout page gives customers the ability to put an
item on a wish list, choose a “bill me later” option, or pay with multiple various credit
cards.

3. Increased personalization

Besides enabling customers to make secure, simple payments, big data can cultivate a
more personalized shopping experience. 86% of consumers say that personalization plays
an important role in their buying decisions. Millennials are especially interested in
purchasing online, and assume they will receive personalized suggestions.
Using big data analytics, e-commerce companies can establish a 360-degree view of the
customer. This view allows e-commerce companies to segment customers based on their
gender, location, and social media presence. With this information, companies can create
and send emails with customized discounts, use different marketing strategies for different
target audiences, and launch products that speak directly to specific groups of consumers.
Many retailers cash in on this strategy, giving members loyalty points that can be used on
future purchases. Sometimes, e-commerce companies will pick several dates throughout
the year to give loyalty members extra bonus points on all purchases. This is done during
a slow season, and increases customer engagement, interest, and spending. Not only do
loyalty members feel like VIPs, they give information companies can use to deliver
personalized shopping recommendations.

4. Increased focus on Micro Moments

“Micro moments” is the latest e-commerce trend. Customers generally seek quick actions —
I want to go, I want to know, I want to buy, and so on — and they look at accessing what
they want on their smartphones. E-commerce retailers use these micro-moments to
foresee customers’ tendencies and action patterns. Smartphone technologies help big
data analytics to a large extent.
5. Optimized pricing and increased sales

Beyond loyalty programs, secure payments, and seamless shopping experiences,


customers appreciate good deals. E-commerce companies are starting to use big data
analytics to pinpoint the fairest price for specific customers to bring in increased sales
from online purchases. Consumers with long-standing loyalty to a company may
receive early access to sales, and customers may pay higher or lower prices depending
on where they live and work.

6. Dynamic customer service

Customer satisfaction is key to customer retention. Even companies with the most
competitive prices and products suffer without exceptional customer service.
Business.com states that acquiring new customers costs 5 to 10 times more than
selling to an existing customer. Loyal customers spend up to 67% more than new
customers. Companies focused on providing the best customer service increase their
chances of good referrals and sustain recurring revenue. Keeping customers happy
and satisfied should be a priority for every e-commerce company.
How does big data improve customer service?
1. Big data can reveal problems in product delivery, customer satisfaction
levels, and even brand perception in social media.
2. Big data analytics can identify the exact points in time when customer
perception or satisfaction changed.
3. It is easier to make sustainable change to customer service when companies
have defined areas for improvement.

7. Generate increased sales

Big data helps e-retailers customize their recommendations and coupons to fit
customer desires. High traffic results from this personalized customer experience,
yielding higher profit. Big data about consumers can also help e-commerce businesses
run precise marketing campaigns, give appropriate coupons, and remind people
that they still have something sitting in their cart.
8. Predict trends and forecast demand

Catering to a customer’s needs is not just a present-state issue. E-commerce depends on


stocking the correct inventory for the future. Big data can help companies prepare for
emerging trends, slow or potentially booming parts of the year, or plan marketing
campaigns around large events. E-commerce companies compile huge datasets. By
evaluating data from previous years, e-retailers can plan inventory accordingly, stock up
to anticipate peak periods, streamline overall business operations, and forecast demand.
For example, e-commerce sites can advertise large markdowns on social media during
peak shopping times to get rid of excess product. To optimize pricing decisions, e-
commerce sites can also give extremely limited time discounts. Understanding when to
offer discounts, how long discounts should last, and what discounted price to offer is
much more accurate and precise with big data analytics and machine learning.

Big data for Blogs

Definition :
A blog (a shortening of “weblog”) is an online journal or informational website displaying
information in reverse chronological order, with the latest posts appearing first. It is a
platform where a writer, or even a group of writers, shares their views on an individual
subject.
Purpose of a blog:
The main purpose of a blog is to connect you to the relevant audience. The more
frequent and better your blog posts are, the higher the chances for your website to get
discovered and visited by your target audience.

Blogs and websites:

❖Blogs need frequent updates. Good examples include a food blog sharing meal recipes
or a company writing about its industry news. Blogs promote reader engagement:
readers get a chance to comment and voice their concerns to the writer.
❖Static websites, on the other hand, consist of content presented on static pages.
Static website owners rarely update their pages, whereas blog owners update their site
with new blog posts on a regular basis.
5.8 Review of Basic Data Analytic Methods using R

Introduction to R
R is a programming language and software framework for statistical analysis and
graphics, available for use under the GNU General Public License.
As a running example, the annual sales in U.S. dollars for 10,000 retail customers have
been provided in the form of a comma-separated-value (CSV) file. The read.csv()
function is used to import the CSV file, and the resulting dataset is stored in the R
variable sales using the assignment operator <-.

In this example, the data file is imported using the read.csv() function. Once the file has
been imported, it is useful to examine the contents to ensure that the data was loaded
properly as well as to become familiar with the data. In the example, the head()
function, by default, displays the first six records of sales.
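
A minimal sketch of the import and inspection steps described above (the file name yearly_sales.csv is an assumption; substitute the actual CSV file provided):

sales <- read.csv("yearly_sales.csv")   # import the CSV file into a data frame named sales
head(sales)                             # display the first six records
summary(sales)                          # descriptive statistics for each column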

Reference Video
https://www.youtube.com/watch?v=_V8eKsto3Ug
The summary() function provides some descriptive statistics, such as the mean and
median, for each data column. Additionally, the minimum and maximum values as
well as the 1st and 3rd quartiles are provided. Because the gender column contains
two possible characters, an “F” (female) or “M” (male), the summary() function
provides the count of each character’s occurrence.

Plotting a dataset’s contents can provide information about the relationships
between the various columns. In this example, the plot() function generates a
scatterplot of the number of orders (sales$num_of_orders) against the annual sales
(sales$sales_total). The $ is used to reference a specific column in the dataset sales.
The resulting plot is shown in the figure below.

Graphically Examining the data


Each point corresponds to the number of orders and the total sales for each
customer. The plot indicates that the annual sales are proportional to the number of
orders placed. Although the observed relationship between these two variables is
not purely linear, the analyst decided to apply linear regression using the lm()
function as a first step in the modeling process.

The resulting intercept and slope values are –154.1 and 166.2, respectively, for the
fitted linear equation. However, the results object stores considerably more information
that can be examined with the summary() function. Details of the contents of results
can be examined by applying the attributes() function.
The summary() function is an example of a generic function. A generic function is
a group of functions sharing the same name but behaving differently depending on
the number and the type of arguments they receive. Utilized previously, plot() is
another example of a generic function; the plot is determined by the passed
variables. The following R code uses the generic function hist() to generate a
histogram of the residuals stored in results. The function call illustrates that
optional parameter values can be passed; in this case, the number of breaks is
specified to observe the large residuals.
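
A hedged sketch of the plotting and regression steps described above (column names follow the text; the number of histogram breaks is an arbitrary choice):

plot(sales$num_of_orders, sales$sales_total)             # scatterplot of orders vs. annual sales
results <- lm(sales$sales_total ~ sales$num_of_orders)   # fit a linear model as a first step
results                                                  # intercept and slope of the fitted line
summary(results)                                         # detailed regression output
attributes(results)                                      # components stored in the results object
hist(results$residuals, breaks = 100)                    # histogram of residuals to spot large ones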

R Graphical User Interfaces

R software uses a command-line interface (CLI) that is similar to the BASH shell in
Linux or the interactive versions of scripting languages such as Python. UNIX and
Linux users can enter the command R at the terminal prompt to use the CLI. For
Windows installations, R comes with RGui.exe, which provides a basic graphical user
interface (GUI). However, to improve the ease of writing, executing, and debugging
R code, several additional GUIs have been written for R. Popular GUIs include R
Commander, Rattle, and RStudio.

The four main window panes of RStudio are as follows.

Scripts: Serves as an area to write and save R code

Workspace: Lists the datasets and variables in the R environment

Plots: Displays the plots generated by the R code and provides a straightforward
mechanism to export the plots

Console: Provides a history of the executed R code and the output.

Attribute and Data Types


Attributes can be categorized into four types: nominal, ordinal, interval, and ratio
(NOIR).

Numeric, Character, and Logical Data Types


Like other programming languages, R supports the use of numeric, character, and
logical (Boolean) values. Examples of such variables are given in the following R
code.
R provides several functions, such as class() and typeof(), to examine the
characteristics of a given variable. The class() function represents the abstract
class of an object. The typeof() function determines the way an object is stored
in memory. Although i appears to be an integer, i is internally stored using
double precision. To improve the readability of the code segments in this
section, the inline R comments are used to explain the code or to provide the
returned values.
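
A short illustrative sketch of such variables and of class() and typeof() (the variable i follows the text; the other names are assumptions):

i <- 1                    # appears to be an integer
sport <- "football"       # character value (hypothetical)
flag <- TRUE              # logical (Boolean) value (hypothetical)
class(i)                  # "numeric"
typeof(i)                 # "double" -- i is internally stored using double precision
class(sport)              # "character"
typeof(flag)              # "logical"
i_int <- as.integer(1)    # force integer storage
typeof(i_int)             # "integer"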
Vectors

Vectors are a basic building block for data in R. As seen previously, simple R
variables are actually vectors. A vector can only consist of values of the same class.
Tests for vectors can be conducted using the is.vector() function.

R provides functionality that enables the easy creation and manipulation of vectors.
The following R code illustrates how a vector can be created using the combine
function, c() or the colon operator, :, to build a vector from the sequence of integers
from 1 to 5.

It is sometimes necessary to initialize a vector of a specific length and then populate its
contents later. The vector() function, by default, creates a logical vector. A vector of a
different type can be specified by using the mode parameter. An empty vector (for
example, an integer vector of length 0) may be useful when the number of elements is
not initially known and new elements will later be added to the end of the vector as the
values become available.
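
A brief sketch of the vector operations described above (variable names are illustrative):

u <- c(1, 2, 3, 4, 5)                             # combine function c()
v <- 1:5                                          # colon operator builds the same sequence
is.vector(u)                                      # TRUE
w <- vector(mode = "logical", length = 3)         # default mode is logical; here length 3
c_empty <- vector(mode = "integer", length = 0)   # empty integer vector, to be grown later
c_empty <- c(c_empty, 10L)                        # append a new element as it becomes available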
Arrays and Matrices
The array() function can be used to restructure a vector as an array. For example,
the following R code builds a three-dimensional array to hold the quarterly sales for
three regions over a two-year period and then assigns the sales amount of $158,000
to the second region for the first quarter of the first year.

A two-dimensional array is known as a matrix. The following code initializes a


matrix to hold the quarterly sales for the three regions. The parameters nrow and
ncol define the number of rows and columns, respectively, for the sales_matrix.
R provides the standard matrix operations such as addition, subtraction, and
multiplication, as well as the transpose function t() and the inverse matrix function
matrix.inverse() included in the matrixcalc package.

The following R code builds a 3 × 3 matrix, M, and multiplies it by its inverse to
obtain the identity matrix.
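
A hedged sketch of the array and matrix examples described above (the dimension order and the contents of M are assumptions):

# 3 regions x 4 quarters x 2 years, then assign $158,000 to region 2, Q1, year 1
quarterly_sales <- array(0, dim = c(3, 4, 2))
quarterly_sales[2, 1, 1] <- 158000

# a two-dimensional array (matrix) for the three regions' quarterly sales
sales_matrix <- matrix(0, nrow = 3, ncol = 4)

library(matrixcalc)                                            # provides matrix.inverse()
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)  # an invertible 3 x 3 matrix
M %*% matrix.inverse(M)                                        # identity matrix (up to rounding)
t(M)                                                           # transpose of M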

Data Frames

Similar to the concept of matrices, data frames provide a structure for storing and
accessing several variables of possibly different data types. The is.data.frame()
function can be used to confirm that the object created by the read.csv() function is
indeed a data frame.

Lists

Lists can contain any type of objects, including other lists. Using the vector v and
the matrix M created in earlier examples, the following R code creates assortment, a
list of different object types.
In displaying the contents of assortment, the use of the double brackets, [[]], is of
particular importance. As the following R code illustrates, the use of the single set
of brackets only accesses an item in the list, not its content.

The str() function offers details about the structure of a list.
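
A short sketch tying these pieces together; it reuses objects defined in the earlier sketches (the vector u, the matrix M, and the sales data frame), which is itself an assumption:

is.data.frame(sales)              # TRUE: read.csv() returned a data frame
assortment <- list(u, M, sales)   # a list holding a vector, a matrix, and a data frame
assortment[[2]]                   # double brackets return the stored matrix itself
assortment[2]                     # single brackets return a one-item list, not its content
str(assortment)                   # details about the structure of the list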


Descriptive Statistics

The summary() function provides several descriptive statistics, such as the mean
and median, about a variable such as the sales data frame.

The following code provides some common R functions that include descriptive
statistics.

The IQR() function provides the difference between the third and the first quartiles.
The other functions are fairly self-explanatory by their names.
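
A few of the common descriptive-statistics functions, sketched against a column of the sales data frame (the column name sales_total follows the earlier plotting example):

x <- sales$sales_total
mean(x);  median(x)            # measures of central tendency
min(x);   max(x);   range(x)   # extremes
sd(x);    var(x)               # spread
IQR(x)                         # difference between the third and first quartiles
summary(x)                     # several of the above in one call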

Exploratory Data Analysis

Functions such as summary() can help analysts easily get an idea of the magnitude
and range of the data, but other aspects such as linear relationships and
distributions are more difficult to see from descriptive statistics. For example, the
following code shows a summary view of a data frame data with two columns x and
y. The output shows the range of x and y, but it’s not clear what the relationship
may be between these two variables.
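
A hypothetical stand-in for the data frame described above, so the later examples can be run; the actual values and the quadratic relationship are assumptions made purely for illustration:

set.seed(42)
x <- runif(75, 0, 10)                              # 75 random x values
data <- data.frame(x = x, y = x^2 + rnorm(75, sd = 5))   # y related nonlinearly to x, plus noise
summary(data)                                      # shows ranges but hides the relationship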
A useful way to detect patterns and anomalies in the data is through the exploratory
data analysis with visualization. Visualization gives a succinct, holistic view of the
data that may be difficult to grasp from the numbers and summaries alone.
Variables x and y of the data frame data can instead be visualized in a scatterplot
(Figure 3-5), which easily depicts the relationship between two variables. An
important facet of the initial data exploration, visualization assesses data cleanliness
and suggests potentially important relationships in the data prior to the model
planning and building phases.

FIGURE 3-5 A scatterplot can easily show if x and y share a relation

Exploratory data analysis is a data analysis approach to reveal the important


characteristics of a dataset, mainly through visualization.
Visualization Before Analysis

To illustrate the importance of visualizing data, consider Anscombe’s quartet.


Anscombe’s quartet consists of four datasets, as shown in Figure 3-6. It was
constructed by statistician Francis Anscombe in 1973 to demonstrate the importance
of graphs in statistical analyses.

The four datasets in Anscombe’s quartet have nearly identical statistical properties,
as shown in Table 3-3.

Based on the nearly identical statistical properties across each dataset, one might
conclude that these four datasets are quite similar. However, the scatterplots in
Figure 3-7 tell a different story. Each dataset is plotted as a scatterplot, and the
fitted lines are the result of applying linear regression models. The estimated
regression line fits Dataset 1 reasonably well. Dataset 2 is definitely nonlinear.
Dataset 3 exhibits a linear trend, with one apparent outlier at x = 13. For Dataset 4,
the regression line fits the dataset quite well. However, with only points at two x
values, it is not possible to determine that the linearity assumption is proper.

The R code requires the R package ggplot2, which can be installed simply by
running the command install.packages("ggplot2"). The anscombe dataset for the
plot is included in the standard R distribution. Enter data() for a list of datasets
included in the R base distribution, and enter data(DatasetName) to make a dataset
available in the current workspace. In the code that follows, the variable levels is
created using the gl() function, which generates factors of four levels (1, 2, 3, and
4), each repeating 11 times. The variable mydata is created using the with(data,
expression) function, which evaluates an expression in an environment
constructed from data. In this example, the data is the anscombe dataset, which
includes eight attributes: x1, x2, x3, x4, y1, y2, y3, and y4. The expression
part in the code creates a data frame from the anscombe dataset that only
includes three attributes: x, y, and the group each data point belongs to (mygroup).
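
A hedged reconstruction of the plotting code described above (the variable names levels, mydata, and mygroup follow the text; the exact plot aesthetics are assumptions):

# install.packages("ggplot2")    # run once if the package is not yet installed
library(ggplot2)
data(anscombe)                   # the quartet is included in the base R distribution
levels <- gl(4, 11)              # factor of four levels (1, 2, 3, 4), each repeating 11 times
mydata <- with(anscombe,
               data.frame(x = c(x1, x2, x3, x4),
                          y = c(y1, y2, y3, y4),
                          mygroup = levels))
ggplot(mydata, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +   # fitted regression line per group
  facet_wrap(~ mygroup)                      # one scatterplot per dataset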

Dirty Data
In general, analysts should look for anomalies, verify the data with domain
knowledge, and decide the most appropriate approach to clean the data. In R, the
is.na() function provides tests for missing values. The following example creates a
vector x where the fourth value is not available (NA). The is.na() function returns
TRUE at each NA value and FALSE otherwise.

x <- c(1, 2, 3, NA, 4)

is.na(x)

[1] FALSE FALSE FALSE TRUE FALSE

Visualizing a Single Variable

Using visual representations of data is a hallmark of exploratory data analyses:


letting the data speak to its audience rather than imposing an interpretation on the
data a priori. R has many functions available to examine a single variable. Some of
these functions are listed in Table 3-4.
Examining Multiple Variables
A scatterplot is a simple and widely used visualization for finding the relationship
among multiple variables. A scatterplot can represent data with up to five variables
using x-axis, y-axis, size, color, and shape. But usually only two to four variables are
portrayed in a scatterplot to minimize confusion. When examining a scatterplot, one
needs to pay close attention to the possible relationship between the variables. If
the functional relationship between the variables is somewhat pronounced, the data
may roughly lie along a straight line, a parabola, or an exponential curve. If variable
y is related exponentially to x, then the plot of x versus log(y) is approximately
linear. If the plot looks more like a cluster without a pattern, the corresponding
variables may have a weak relationship.

The scatterplot in Figure 3.13 portrays the relationship of two variables: x and y.
The red line shown on the graph is the fitted line from the linear regression. Figure
3.13 shows that the regression line does not fit the data well. This is a case in which
linear regression cannot model the relationship between the variables. Alternative
methods such as the loess() function can be used to fit a nonlinear line to the data.
The blue curve shown on the graph represents the LOESS curve, which fits the data
better than linear regression.
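
A hedged sketch of comparing a linear fit with a LOESS fit, using the data frame data with columns x and y assumed from the earlier exploratory sketch:

plot(data$x, data$y)                                # scatterplot of x versus y
abline(lm(y ~ x, data = data), col = "red")         # straight line from linear regression
lo <- loess(y ~ x, data = data)                     # nonlinear LOESS fit
ord <- order(data$x)
lines(data$x[ord], predict(lo)[ord], col = "blue")  # LOESS curve follows the data more closely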
Statistical Methods for Evaluation

Visualization is useful for data exploration and presentation, but statistics is crucial
because it may exist throughout the entire Data Analytics Lifecycle. Statistical
techniques are used during the initial data exploration and data preparation, model
building, evaluation of the final models, and assessment of how the new models
improve the situation when deployed in the field.

In particular, statistics can help answer the following questions for data analytics:

● Model Building and Planning
  ● What are the best input variables for the model?
  ● Can the model predict the outcome given the input?
● Model Evaluation
  ● Is the model accurate?
  ● Does the model perform better than an obvious guess?
  ● Does the model perform better than another candidate model?
● Model Deployment
  ● Is the prediction sound?
  ● Does the model have the desired effect (such as reducing the cost)?

This section discusses some useful statistical tools that may answer these questions.

Hypothesis Testing
When comparing populations, such as testing or evaluating the difference of the
means from two samples of data (Figure 3-22), a common technique to assess the
difference or the significance of the difference is hypothesis testing.
The basic concept of hypothesis testing is to form an assertion and test it with data.
When performing hypothesis tests, the common assumption is that there is no
difference between two samples. This assumption is used as the default position for
building the test or conducting a scientific experiment. Statisticians refer to this as
the null hypothesis (H0). The alternative hypothesis (HA) is that there is a
difference between two samples. For example, if the task is to identify the effect of
drug A compared to drug B on patients, the null hypothesis and alternative
hypothesis would be as follows.

● H0: Drug A and drug B have the same effect on patients.

● HA: Drug A has a greater effect than drug B on patients.

If the task is to identify whether advertising Campaign C is effective in reducing customer churn, the null hypothesis and alternative hypothesis would be as follows.

● H0: Campaign C does not reduce customer churn better than the current
campaign method.

● HA: Campaign C does reduce customer churn better than the current campaign.

It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely to undermine the subsequent steps of the hypothesis testing process. A hypothesis test leads to either rejecting the null hypothesis in favor of the alternative or not rejecting the null hypothesis. Table 3-5 includes some examples of null and alternative hypotheses that should be answered during the analytic lifecycle.
Once a model is built over the training data, it needs to be evaluated over the
testing data to see if the proposed model predicts better than the existing model
currently being used. The null hypothesis is that the proposed model does not
predict better than the existing model. The alternative hypothesis is that the
proposed model indeed predicts better than the existing model. In an accuracy forecast, the null model could be that the sales of the next month are the same as those of the prior month.

Difference of Means

Hypothesis testing is a common approach to draw inferences on whether or not the two populations, denoted pop1 and pop2, are different from each other. This section provides two hypothesis tests to compare the means of the respective populations based on samples randomly drawn from each population.

Specifically, the two hypothesis tests in this section consider the following null and alternative hypotheses:

● H0: μ1 = μ2

● HA: μ1 ≠ μ2

The μ1 and μ2 denote the population means of pop1 and pop2, respectively.

The basic testing approach is to compare the observed sample means, X̄1 and X̄2, corresponding to each population. If the values of X̄1 and X̄2 are approximately equal to each other, the distributions of X̄1 and X̄2 overlap substantially (Figure 3-23), and the null hypothesis is supported. A large observed difference between the sample means indicates that the null hypothesis should be rejected. Formally, the difference in means can be tested using Student's t-test or Welch's t-test.
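
A hedged sketch of both tests in R, using two hypothetical samples x and y whose values are simulated here only for illustration:

set.seed(100)
x <- rnorm(30, mean = 100, sd = 5)   # hypothetical sample drawn from pop1
y <- rnorm(30, mean = 105, sd = 8)   # hypothetical sample drawn from pop2
t.test(x, y, var.equal = TRUE)       # Student's t-test (assumes equal population variances)
t.test(x, y)                         # Welch's t-test (default; variances may differ)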
Wilcoxon Rank-Sum Test
The Wilcoxon rank-sum test [15] is a nonparametric hypothesis test that checks
whether two populations are identically distributed.

Let the two populations again be pop1 and pop2, with independent random samples of size n1 and n2 respectively. The total number of observations is then N = n1 + n2. The first step of the Wilcoxon test is to rank the set of observations from
the two groups as if they came from one large group. The smallest observation
receives a rank of 1, the second smallest observation receives a rank of 2, and so on
with the largest observation being assigned the rank of N. Ties among the
observations receive a rank equal to the average of the ranks they span. The test
uses ranks instead of numerical outcomes to avoid specific assumptions about the
shape of the distribution. After ranking all the observations, the assigned ranks are
summed for at least one population’s sample.
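
For instance, R's rank() function applies this average-rank convention to ties:

obs <- c(5.1, 3.2, 3.2, 7.8, 5.1, 5.1)
rank(obs)
# [1] 4.0 1.5 1.5 6.0 4.0 4.0   (the two 3.2s share ranks 1-2; the three 5.1s share ranks 3-5)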

If the distribution of pop1 is shifted to the right of the other distribution, the rank-
sum corresponding to pop1’s sample should be larger than the rank-sum of pop2.
The Wilcoxon rank-sum test determines the significance of the observed rank-sums.
The following R code performs the test on the same dataset used for the previous t-
test.
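
The original listing is not reproduced in this copy; a minimal equivalent call, reusing the hypothetical samples x and y from the t-test sketch above, would be the following. Note that the p-value of 0.04903 quoted below comes from the textbook's own dataset, not from these simulated samples.

wilcox.test(x, y, conf.int = TRUE)   # rank-sum test on the same two samples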
The wilcox.test() function ranks the observations, determines the respective rank-sums corresponding to each population's sample, and then determines the probability of rank-sums of such magnitude being observed, assuming that the population distributions are identical.

In this example, the probability is given by the p-value of 0.04903. Thus, the null
hypothesis would be rejected at a 0.05 significance level.

Type I and Type II Errors

A hypothesis test may result in two types of errors, depending on whether the test
accepts or rejects the null hypothesis. These two errors are known as type I and
type II errors.

● A type I error is the rejection of the null hypothesis when the null hypothesis is
TRUE. The probability of the type I error is denoted by the Greek letter α.

● A type II error is the acceptance of a null hypothesis when the null hypothesis is
FALSE. The probability of the type II error is denoted by the Greek letter β.
Assignments

1. Write an R program to create a 5 × 4 matrix, a 3 × 3 matrix with labels filled by rows, and a 2 × 2 matrix with labels filled by columns. (CO6, K4)

2. Write an R program to create a data frame that contains the details of 5 employees and display the details. (CO6, K4)

Part-A Questions and Answers
1. Define NoSQL? (CO5,K2)
NoSQL is a non-relational database that stores and accesses data using key-values.
Instead of storing data in rows and columns like a traditional database, a
NoSQL DBMS stores each item individually with a unique key. A NoSQL database does not require a structured schema that defines each table and the related columns. This provides a much more flexible approach to storing data than a relational database.

2. Differentiate RDBMS Vs NoSQL? (CO5,K2)

RDBMS
● Structured and organized data
● Structured Query Language (SQL)
● Data and its relationships are stored in separate tables
● Data Manipulation Language, Data Definition Language
● Tight consistency
● ACID transactions

NoSQL
● Stands for Not Only SQL
● No declarative query language
● No predefined schema
● Key-value pair storage, column store, document store, graph databases
● Eventual consistency (BASE) rather than the ACID property
● Unstructured and unpredictable data
● CAP theorem
● Prioritizes high performance, high availability and scalability
3. What Hive is NOT? (CO5,K2)
Hive is not designed for online transaction processing. It is best used for traditional
data warehousing tasks. Hive is layered on top of the file system and execution
framework for Hadoop and enables applications and users to organize data in a
structured data warehouse and therefore query the data using a query language
called HiveQL that is similar to SQL (the standard Structured Query Language used
for most modern relational database management systems). The Hive system
provides tools for extracting/transforming/loading data (ETL) into a variety of
different data formats.
4. What is meant by Hbase. Mention some basic operations on it? (CO5,K2)
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable. Apache HBase is
capable of providing real-time read and write access to datasets with billions of rows
and millions of columns. HBase is derived from Google's BigTable; it is a column-oriented data layout layered on top of Hadoop that provides a fault-tolerant method for storing and manipulating large data tables.
There are some basic operations for HBase:
Get (which accesses a specific row in the table),
Put (which stores or updates a row in the table),
Scan (which iterates over a collection of rows in the table), and
Delete (which removes a row from the table).
5. Compare Hbase and HDFS. (CO5,K4)
▪ HBase provides low-latency access, while HDFS provides high-latency operations.
▪ HBase supports random read and write while HDFS supports Write once Read
Many times.
▪ HBase is accessed through shell commands, Java API, REST, Avro or Thrift API
while HDFS is accessed through MapReduce jobs.
6. Mention the key terms representing the table schema in Hbase. (CO5,K2)
Table: Collection of rows.
Row: Collection of column families.
Column Family: Collection of columns.
Column: Collection of key-value pairs.
Namespace: Logical grouping of tables.
Cell: A {row, column, version} tuple exactly specifies a cell definition in HBase.
7. Define Sharding. (CO5, K2)
Sharding (also known as Data Partitioning) is the process of splitting a large dataset
into many small partitions which are placed on different machines. Each partition is
known as a "shard".
Each shard has the same database schema as the original database. Most data is
distributed such that each row appears in exactly one shard. The combined data from
all shards is the same as the data from the original database.
8. Write the benefit of the Vertical sharding scheme. (CO5,K2)
The main benefit of this scheme is that you can handle the critical part of your data (for example, user profiles) differently from the not-so-critical part of your data (for example, blog posts) and build different replication and consistency models around it.
9. What are the disadvantages of the vertical sharding scheme? (CO5,K2)
The two main disadvantages of the vertical sharding scheme are as follows:
Depending on your system, the application layer might need to combine data from multiple shards to answer a query. For example, a profile view request will need to combine data from the User Profile, Connections and Articles shards. This increases the development and operational complexity of the system.
If the site/system experiences additional growth, it may be necessary to further shard a feature-specific database across multiple servers.
10. Define Sentiment Analysis. (CO5,K2)
Sentiment analysis is the automated process of identifying and classifying subjective
information in text data. This might be an opinion, a judgment, or a feeling about a
particular topic or product feature. The most common type of sentiment analysis is
‘polarity detection’ and involves classifying statements as positive, negative or neutral.
11. Define CAP Theorem. (CO5, K2)
CAP theorem states that there are three basic requirements which exist in a special
relation when designing applications for a distributed architecture.
Consistency - The data in the database remains consistent after the execution of an
operation.
Availability - The system is always on (service guarantee availability), no downtime.
Partition Tolerance - The system continues to function even when the communication among the servers is unreliable.
12. What is the zookeeper? (CO5,K2)
❖ Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
❖ Zookeeper has ephemeral nodes representing different region servers.
❖ Master servers use these nodes to discover available servers. In addition to availability,
the nodes are also used to track server failures or network partitions.
❖ Clients communicate with region servers via zookeeper.
❖ In pseudo and standalone modes, HBase itself will take care of zookeeper.
13. How Region Servers works? (CO5,K2)
The region servers have regions that:
❖ Communicate with the client and handle data-related operations.
❖ Handle read and write requests for all the regions under it.
❖ Decide the size of the region by following the region size thresholds.
14. What is the work of Master Server? (CO5,K2)
❖ Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
❖ Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
❖ Maintains the state of the cluster by negotiating the load balancing.
❖ Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
15. What is the use of Vector? (CO6,K2)
Vectors are a basic building block for data in R; simple R variables are actually vectors.
❖ A vector can only consist of values in the same class.
❖ The tests for vectors can be conducted using the is.vector() function, for example:
i <- 1; flag <- TRUE; sport <- "football"   # example values (assumed; any scalar works)
is.vector(i)      # returns TRUE
is.vector(flag)   # returns TRUE
is.vector(sport)  # returns TRUE
16. Write few points about R Programming Language? (CO6,K2)
R is a programming language and software framework for statistical analysis and graphics. It is available for use under the GNU General Public License. R software uses a command-line interface (CLI) that is similar to the BASH shell in Linux or the interactive versions of scripting languages such as Python. UNIX and Linux users can enter the command R at the terminal prompt to use the CLI. For Windows installations, R comes with RGui.exe, which provides a basic graphical user interface (GUI). However, to improve the ease of writing, executing, and debugging R code, several additional GUIs have been written for R. Popular GUIs include R Commander, Rattle, and RStudio.
17. What is meant by data frames in R Programming? (CO6,K2)
Data frames provide a structure for storing and accessing several variables of possibly different data types. The is.data.frame() function can be used to confirm that an object is a data frame; for example, the object returned by the read.csv() function is a data frame.
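
A minimal sketch (the employee values below are purely illustrative):

employees <- data.frame(
    name   = c("Asha", "Ravi", "Meena"),   # names of the employees
    age    = c(28, 34, 41),                # ages in years
    salary = c(52000, 61000, 75000))       # salaries
is.data.frame(employees)   # returns TRUE
str(employees)             # shows each variable and its data type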

18. What is meant by hypothesis testing? (CO6,K2)
Hypothesis testing is a common technique to assess the difference between two samples, or the significance of that difference. The basic concept of hypothesis testing is to form an assertion and test it with data. When performing hypothesis tests, the common assumption is that there is no difference between the two samples; this assumption is used as the default position for building the test or conducting a scientific experiment and is referred to as the null hypothesis (H0). The alternative hypothesis (HA) is that there is a difference between the two samples.
19. Define Wilcoxon Rank-Sum Test? (CO6,K2)
The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks whether
two populations are identically distributed.
20. What is meant by Type I and Type II Errors? (CO6,K2)
A hypothesis test may result in two types of errors, depending on whether the test
accepts or rejects the null hypothesis. These two errors are known as type I and type
II errors. A type I error is the rejection of the null hypothesis when the null
hypothesis is TRUE. The probability of the type I error is denoted by the Greek letter
α. A type II error is the acceptance of a null hypothesis when the null hypothesis is
FALSE. The probability of the type II error is denoted by the Greek letter β.
Part-B Questions

1. Explain in detail about schema-less models. (CO5, K2)

2. Discuss in detail about NoSQL databases. (CO5, K2)

3. Explain in detail about Hive architecture and Hive Query Language (HQL) with its use cases. (CO5, K2)

4. Explain in detail about HBase with its architecture. (CO5, K2)

5. Discuss in detail about sharding. (CO5, K2)

6. Explain briefly the R programming language. (CO6, K2)

7. Explain in detail about exploratory data analysis in R. (CO6, K2)

8. Discuss in detail about analyzing big data with Twitter. (CO5, K2)

9. Discuss the following: (i) Big data for e-commerce, (ii) Big data for blogs. (CO5, K2)
Supportive Online Courses

1. Big Data Computing (Swayam)
https://onlinecourses.nptel.ac.in/noc21_cs86/preview

2. Python for Data Science (Swayam)
https://onlinecourses.nptel.ac.in/noc21_cs78/preview

3. Getting Started with R (Coursera)
https://www.coursera.org/projects/getting-started-with-r

4. R Programming (Coursera)
https://www.coursera.org/learn/r-programming

5. Modelling Data Warehouses using Apache Hive (Coursera)
https://www.coursera.org/projects/data-warehousing-with-apache-hive
REAL TIME APPLICATIONS IN DAY TO DAY LIFE

Companies using HBase

i. Mozilla
Mozilla uses HBase to store all of its crash data.
ii. Facebook
To store real-time messages, Facebook uses HBase storage.
iii. Infolinks
Infolinks, an in-text ad provider, uses HBase to process advertisement selection and user events for its in-text ad network. Moreover, to optimize ad selection, it uses the reports that HBase generates as feedback for its production system.
iv. Twitter
Twitter also runs HBase across its entire Hadoop cluster. For Twitter, HBase offers a distributed, read/write backup of all MySQL tables in its production backend.
That helps engineers to run MapReduce jobs over the data while maintaining the ability to
apply periodic row updates.
v. Yahoo!
Yahoo! also uses HBase, where it helps to store document fingerprints in order to detect near-duplicates.
Content Beyond the Syllabus

APACHE PIG
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators using which programmers can
develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce tasks.
Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts
as input and converts those scripts into MapReduce jobs.
Why Do We Need Apache Pig?
Programmers who are not proficient in Java often struggle to work with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can be done in as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.
Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
Apache Pig provides many built-in operators to support data operations like joins,
filters, ordering, etc. In addition, it also provides nested data types like tuples,
bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features
Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you
are good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.
UDFs − Pig provides the facility to create user-defined functions in other programming languages such as Java and invoke or embed them in Pig scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well
as unstructured. It stores the results in HDFS.

Apache Pig Vs Hive


Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases Hive operates on HDFS in a similar way to Apache Pig. The following points set Apache Pig apart from Hive.

● Apache Pig uses a language called Pig Latin, originally created at Yahoo; Hive uses a language called HiveQL, originally created at Facebook.

● Pig Latin is a data flow language; HiveQL is a query processing language.

● Pig Latin is a procedural language that fits the pipeline paradigm; HiveQL is a declarative language.

● Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.

Reference Video
https://www.youtube.com/watch?v=Hve24pRW_Ps
Assessment Schedule

Proposed Date: 28.05.2021
Actual Date: 28.05.2021
Text & Reference Books

1. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012. (Text Book)
http://infolab.stanford.edu/~ullman/mmds/bookL.pdf

2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph", Morgan Kaufmann/Elsevier Publishers, 2013. (Text Book)
http://digilib.stmik-banjarbaru.ac.id/data.bc/5.%20Computer%20Graphic/2013%20Big%20Data%20Analytics%20From%20Strategic%20Planning%20to%20Enterprise%20Integration%20with%20Tools%2C%20Techniques%2C%20NoSQL%2C%20and%20Graph.pdf

3. EMC Education Services, "Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data", Wiley Publishers, 2015. (Text Book)
https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_book.pdf

4. Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and its Applications", Wiley Publishers, 2015. (Text Book)

5. Dietmar Jannach and Markus Zanker, "Recommender Systems: An Introduction", Cambridge University Press, 2010. (Text Book)
https://drive.google.com/file/d/1Wr4fllOj03X72rL8CHgVJ1dGxG58N63S/view?usp=sharing

6. Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Practical Guide for Managers", CRC Press, 2015. (Reference Book)

7. Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce", Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, Pages 1-177, Morgan Claypool Publishers, 2010. (Reference Book)
Mini Project Suggestions

1. Upload the sales dataset from the given URL https://www.kaggle.com/kyanyoga/sample-sales-data and perform various analytics using R programming. (CO6, K4)
Thank you

