
Big Data Analytics

Teaching Scheme and Credits Assigned

Subject Code: ECCDLO 7032
Subject Name: Big Data Analytics

                         Theory  Practical  Tutorial  Total
Teaching Scheme (Hrs.)   04      --         --
Credits Assigned         04      --         --        04

Examination Scheme

Subject Code: ECCDLO 7032
Subject Name: Big Data Analytics

Theory Marks
  Internal assessment
    Test 1                       20
    Test 2                       20
    Avg. of Test 1 and Test 2    20
  End Sem. Exam                  80
Term Work                        --
Practical & Oral                 --
Oral                             --
Total                            100
Module Unit
No.    No.  Topics                                          Hrs.

1.0         Introduction to Big Data Analytics              06
       1.1  Introduction to Big Data, Big Data characteristics, types of Big Data,
            Traditional vs. Big Data business approach.
       1.2  Technologies Available for Big Data, Infrastructure for Big Data, Big Data
            Challenges, Case Study of Big Data Solutions.
2.0         Hadoop                                          06
       2.1  Introduction to Hadoop, Core Hadoop Components, Hadoop Ecosystem,
            Physical Architecture, Hadoop limitations.
3.0         NoSQL                                           08
       3.1  Introduction to NoSQL, NoSQL business drivers, NoSQL case studies.
       3.2  NoSQL data architecture patterns: Key-value stores, Graph stores,
            Column family (Bigtable) stores, Document stores, Variations of NoSQL
            architectural patterns.
       3.3  Using NoSQL to manage big data: What is a big data NoSQL solution?
            Understanding the types of big data problems; Analyzing big data with a
            shared-nothing architecture; Choosing distribution models: master-slave
            versus peer-to-peer; Four ways that NoSQL systems handle big data problems.
4.0         MapReduce                                       08
       4.1  MapReduce and The New Software Stack: Distributed File Systems,
            Physical Organization of Compute Nodes, Large Scale File-System Organization.
       4.2  MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks,
            Combiners, Details of MapReduce Execution, Coping With Node Failures.
       4.3  Algorithms Using MapReduce: Matrix-Vector Multiplication by MapReduce,
            Relational-Algebra Operations by MapReduce, Matrix Operations,
            Matrix Multiplication by MapReduce.
5.0         Techniques in Big Data Analytics                12
       5.1  Finding Similar Items: Nearest Neighbor Search, Similarity of Documents.
       5.2  Mining Data Streams: Data Stream Management Systems, Data Stream Model,
            Examples of Data Stream Applications: Sensor Networks, Network Traffic Analysis.
       5.3  Link Analysis: PageRank Definition, Structure of the Web, Dead Ends,
            Using PageRank in a Search Engine, Efficient Computation of PageRank:
            PageRank Implementation Using MapReduce.
       5.4  Frequent Itemset Mining: Market-Basket Model, Apriori Algorithm,
            Algorithm of Park-Chen-Yu.
6.0         Big Data Analytics Applications                 08
       6.1  Recommendation Systems: Introduction, A Model for Recommendation
            Systems, Collaborative-Filtering System: Nearest-Neighbor Technique, Example.
       6.2  Mining Social-Network Graphs: Social Networks as Graphs, Types of
            Social Networks. Clustering of Social Graphs: Applying Standard Clustering
            Techniques, Counting Triangles Using MapReduce.

Text Books:
1. Radha Shankarmani and M Vijayalakshmi, "Big Data Analytics", Wiley.
2. Alex Holmes, "Hadoop in Practice", Manning Press, Dreamtech Press.
3. Dan McCreary and Ann Kelly, "Making Sense of NoSQL: A guide for managers and the rest of
us", Manning Press.
What is Big Data
Big data is data that is too large, complex and
dynamic for conventional data tools to capture,
store, analyze and manage for optimized decision
making.

The three Vs of the incoming data, i.e. its Volume,
Variety and Velocity, are what create the challenge.
VOLUME
Amount of data stored across the world, in petabytes:
⬥ North America: >3,500
⬥ Europe: >2,000
⬥ Japan: >400
⬥ Middle East: >200
⬥ India: >50
⬥ Latin America: >50

VARIETY
⬥ People to people: communications, social networking
⬥ People to machine: medical devices, digital TV, e-commerce, smart and bank cards
⬥ Machine to machine: sensors, GPS, barcode scanners, surveillance cameras

VELOCITY
⬥ 2.9 million emails sent every second
⬥ 20 hours of video uploaded every minute
⬥ 50 million tweets per day
How long do we need to collect the data?
It depends on the following three dimensions:

⬥ The length is the time dimension
⬥ Discovery and integration form the breadth
⬥ Analysis and insights form the depth
Big Data Analytics
Big data analytics is the often complex process of examining
big data to uncover information, such as hidden patterns,
correlations, market trends and customer preferences, that can
help organizations make business decisions.

Big data analytics is the use of advanced analytic techniques
against very large, diverse data sets that include structured,
semi-structured and unstructured data, from different sources
and in different sizes.
Big Data Analytics (continued)
Big data analytics is the result of three major trends: mobile
computing using handheld devices such as smartphones and
tablets; social networking such as Facebook and Pinterest; and
cloud computing, through which one can obtain space for storage
and computing.

Big data analytics reduces cost by storing large amounts of data,
speeds up decision making because data is accessed directly in
memory, and helps identify customer needs.
Advantages of Big Data Analytics
⬥ Scalability.

⬥ No pre-processing required.

⬥ Unstructured data can be stored.

⬥ No limit and time deadline.

⬥ Protection against hardware failure.

Case Study
1. Feedback mail or review links we get from banks, colleges
and food apps.

2. Clickstream and pickup analysis.

3. Manufacturers and distributors analysis.

4. Smart city.
Big Data Analytics Applications
⬥ Machine Learning.
⬥ Data Management.
⬥ Hadoop.
⬥ Predictive analytics.
⬥ In-memory analytics.
⬥ Statistical computing.

Hadoop
Hadoop is a framework that allows you to store big
data in a distributed system, so that it can be processed
in parallel. It has two main parts:

1. HDFS (Hadoop Distributed File System)

2. YARN (Yet Another Resource Negotiator)

In big data analytics Hadoop is preferred because it is
cheaper, it provides parallel data processing, and it is
suitable for all domains.
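The idea of storing data in pieces and processing the pieces in parallel can be illustrated with a small Python sketch. This is not Hadoop code: the block size, the thread pool standing in for cluster nodes, and the names `split_into_blocks`, `count_errors` and `parallel_error_count` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Cut a list of records into fixed-size blocks (a stand-in for distributed storage)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def count_errors(block):
    """Work done independently on one block: count records that contain 'ERROR'."""
    return sum(1 for record in block if "ERROR" in record)


def parallel_error_count(records, block_size=2, workers=3):
    """Process every block in parallel, then merge the partial results."""
    blocks = split_into_blocks(records, block_size)
    # Threads play the role of worker nodes here; this is an illustrative assumption.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_errors, blocks))
    return sum(partial_counts)


logs = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "INFO end"]
total_errors = parallel_error_count(logs)
```

Each block is handled without reference to the others, which is what lets the work spread across many machines in a real cluster.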
Hadoop
Problems with the traditional approach:

1. Storing large amounts of data.

2. Storing heterogeneous data.

3. Accessing and processing speed.

Where is Hadoop used?

⬥ Search – Yahoo, Amazon
⬥ Log processing – Facebook, Yahoo
⬥ Data Warehouse – Facebook, AOL
⬥ Video and Image Analysis – New York Times
Hadoop
It provides a method to access data that is
distributed among multiple clustered computers,
process the data, and manage the computing and
network resources that are involved.

Hadoop commonly refers to the core technology
that consists of four main components:

1. HDFS

2. YARN

3. MapReduce

4. Hadoop Common
Hadoop
Examples of Hadoop:

1. Financial services companies use analytics to
assess risk, build investment models, and create
trading algorithms; Hadoop has been used to
help build and run those applications.
2. Retailers use it to help analyze structured and
unstructured data to better understand and serve
their customers.
3. Telecommunications companies can use
Hadoop-powered analytics to execute
predictive maintenance on their infrastructure.
Big data analytics can also plan efficient
network paths and recommend optimal
locations. To support customer-facing
operations, telcos can analyze customer
behavior and billing statements to inform new
service offerings.
[Figure: Hadoop Architecture]

[Figure: HDFS Architecture]

[Figure: HDFS Architecture, continued]

[Figure: YARN Architecture]
How YARN Works?
1. When a client submits an application, the ResourceManager
(RM) instructs a NodeManager to start an Application Master
for this request, which is then started in a container.
2. The Application Master registers itself with the RM. It
then contacts the HDFS NameNode to determine the location of
the needed data blocks and calculates the number of map and
reduce tasks needed to process the data.
3. The Application Master then requests the needed resources
from the RM and continues to communicate its resource
requirements throughout the life cycle of the container.
4. The RM schedules the resources along with the requests
from all the other Application Masters and queues their
requests.
5. The Application Master contacts the NodeManager for the
slave node on which resources were granted and requests it to
create a container by providing variables, authentication
tokens, and the command string for the process.
6. The Application Master then monitors the process and
reacts in the event of failure by restarting the process on the
next available slot. If it fails after four different attempts, the
entire job fails.
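The failure handling in the last step can be sketched in plain Python. This is a toy model under stated assumptions, not the YARN API: slots are plain strings, a failure is signalled by `RuntimeError`, "next available slot" is modelled as round-robin, and the names `run_with_retries` and `flaky_task` are hypothetical.

```python
MAX_ATTEMPTS = 4  # the slides state that the job fails after four failed attempts


def run_with_retries(task, slots, max_attempts=MAX_ATTEMPTS):
    """Run `task` on successive slots; return (slot, result) or raise when all attempts fail."""
    failures = []
    for attempt in range(max_attempts):
        # Pick the "next available slot"; round-robin is an assumption of this sketch.
        slot = slots[attempt % len(slots)]
        try:
            return slot, task(slot)
        except RuntimeError as exc:
            failures.append(f"{slot}: {exc}")
    raise RuntimeError(f"job failed after {max_attempts} attempts: {failures}")


def flaky_task(slot):
    """Hypothetical task that fails on the first two nodes and succeeds elsewhere."""
    if slot in {"node1", "node2"}:
        raise RuntimeError("container lost")
    return f"done on {slot}"


slot, result = run_with_retries(flaky_task, ["node1", "node2", "node3"])
```

Here the task fails on the first two slots and completes on the third, so the job as a whole succeeds; a task that failed on all four attempts would raise and fail the job.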
[Figure: MapReduce]

[Figure: Map Phase]

[Figure: Reduce Phase]
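As a minimal sketch of the map, group-by-key and reduce steps, the classic word-count example can be written in plain Python. It runs in a single process and uses no Hadoop API; the function names and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def group_by_key(pairs):
    """Grouping by key: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group the pairs, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in group_by_key(pairs).items())


counts = word_count(["big data", "big data analytics", "Hadoop stores big data"])
```

In a real cluster the map calls run on the nodes that hold the input blocks and the grouped pairs are shuffled across the network to the reducers; the sketch keeps only the data flow.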
Big Data Approach
1. Traditional Approach

2. Big Data Business Approach

⬥ Accuracy and confidentiality.

⬥ Data relationship.

⬥ Data storage size.

⬥ Types of data.
Technologies Available for Big Data
⬥ Apache Hadoop

⬥ Microsoft HDInsight

⬥ NoSQL

⬥ Hive

⬥ Sqoop

⬥ PolyBase

⬥ Presto
Case Study
⬥ Solve Advertisers' Problems and Offer Marketing Insights

⬥ Risk Management

⬥ Driver of Innovations and Product Development

⬥ Supply Chain Management

⬥ Big Data for Customer Acquisition and Retention
⬥ www.cloudera.com
⬥ Cloudera
⬥ Downloads
⬥ 5.4

⬥ VMware or Oracle VirtualBox
⬥ Insurance company
⬥ Manufacturers and distributors
⬥ Firm such as hotel
⬥ Public traffic service
⬥ Health industry
⬥ Supply management
