Selected Topics in Computer Science 2
Contents
• Understand the basics of Big Data
Examples of Big Data
Walmart handles more than 2.5 petabytes of data.
How can you avoid big data?
Pay cash for everything!
Never go online!
Characteristics of Big Data (5V)
Volume
Velocity
Variety
Veracity
Value
Volume (Scale)
Refers to the quantity of generated and stored data.
Velocity (Speed)
Refers to the speed at which data is generated.
Variety (Complexity)
The type and nature of the data.
Veracity/Validity
It refers to the quality and accuracy of data.
Gathered data could have missing pieces, may be inaccurate, or may not be able to provide real, valuable insight.
Veracity, overall, refers to the level of trust in the collected data.
Data can sometimes become messy and difficult to use, and a large amount of data can cause more confusion than insight if it is incomplete.
• For example, in the medical field, if the data about what drugs a patient is taking is incomplete, the patient's life may be endangered.
Value
This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data.
Organizations can use the same big data tools to gather and
analyse the data, but how they derive value from that data should
be unique to them.
Big data is all data
Big Data sources
Users
Applications
Systems
Sensors
Who’s Generating Big Data
Social media and networks (all of us are generating data)
Benefits of Big Data
To make better decisions and take meaningful actions at the right time.
Increased productivity
Nature of Data
Cont…
Traditional systems are useful for structured data, but they cannot manage such a large amount of unstructured data.
About 80% of the data across the globe is unstructured or available in widely varying structures, which are difficult to analyse with traditional systems.
So, how can we process big data at a reasonable cost and in a reasonable time?
To better address the high storage and computational needs of big data, computer clusters are a better fit.
A computer cluster is a set of computers that work together so that they can be viewed as a single system.
Do we have a framework for cluster computing?
What is Hadoop?
Hadoop is a collection of open-source software utilities that
facilitate using a network of many computers to solve problems
involving massive amounts of data and computation.
Cont…
Commodity computers are cheap and widely available.
Core Components of Hadoop
1. Hadoop Distributed File System (HDFS)
2. MapReduce
1. Hadoop Distributed File System (HDFS)
A type of distributed file system; it is the storage part of Hadoop.
Cont.…
Stores multiple copies of data on different nodes
HDFS daemons
Has three services:
1. NameNode
2. DataNode
3. Secondary NameNode
1. Name/Master node
HDFS has only one NameNode, called the master node, which keeps track of the files in the cluster.
It manages all file system metadata, i.e., the metadata about all the data stored in HDFS.
Cont.…
Because there is only one NameNode, it is a single point of failure.
It receives a heartbeat and a block report from all the DataNodes.
2. Data node
Stores the actual data as blocks.
Cont.…
In this way, when the NameNode does not receive a heartbeat from a DataNode for 2 minutes, it marks that DataNode as dead and starts replicating its blocks on other DataNodes.
3. Secondary name node
This is only to take care of checkpointing the file system metadata held in the NameNode, e.g.:
• File attributes
• Creation time
HDFS Blocks
Data files in HDFS are broken into block-sized chunks, which are stored as independent units.
For example, with a block size of 128 MB (the Hadoop 2.x default), a 300 MB file is stored as two full 128 MB blocks plus one 44 MB block.
File write operation on HDFS
To write a file to HDFS, the client needs to interact with the NameNode. The write proceeds in three stages, sketched below:
1. Setting up the HDFS pipeline
2. Writing to the pipeline
3. Acknowledgement in the HDFS write
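As a concrete illustration of the write path, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are hypothetical placeholders, not values from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to set up the DataNode pipeline;
        // the client then streams data down the pipeline and waits for acks.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}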
File read operation on HDFS
To read a file from HDFS, the client needs to interact with the NameNode.
HDFS file reading mechanisms
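Again as a sketch, here is the read path with the same Java FileSystem API: open() asks the NameNode for the block locations, and the client then reads the blocks directly from the nearest DataNodes. The address and path are the same hypothetical placeholders as above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);
        // open() fetches block locations from the NameNode, then reads
        // block data directly from the DataNodes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/hello.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}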
HDFS Features
1. Distributed
2. Scalable
3. Cost effective
4. Fault tolerant
5. High throughput
6. Others
1. Distributed
Data is stored in a distributed fashion across the nodes of the cluster.
2. Scalable
Because HDFS is distributed, storage and processing capacity can be scaled by simply adding nodes to the cluster.
3. Economical/cost effective
HDFS runs on clusters of cheap, widely available commodity computers.
4. Fault tolerance
It stores copies of the data on different machines and is resistant to hardware failure.
This also covers failures of a main switch or an entire rack: keeping a copy of each block on a different rack is called rack awareness.
5. High throughput
HDFS stores data in a distributed fashion, which allows data to be processed in parallel on a cluster of nodes.
6. Others
Unlimited data storage
2. MapReduce
MapReduce is the processing part of Hadoop: a programming model for processing large data sets in parallel across the cluster.
The Mapper
1. The input data is split and each split is sent to a worker node.
2. Each mapper processes its split and emits intermediate key/value pairs.
Shuffling
Before the reducer runs there is a shuffle step: the intermediate outputs from the map tasks are exchanged and moved to the nodes where they are required by the reducers.
Reducer
Reduces the set of intermediate values that share a key to a smaller set of values.
All of the values with the same key are presented to a single reducer together.
Example 01: sum of squares
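The original slide shows this example as a diagram. As a stand-in, here is a minimal single-machine sketch of the same map/reduce idea using Java streams; the input numbers are made up for illustration.

import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5); // made-up input
        int sumOfSquares = input.stream()
                .map(n -> n * n)          // map phase: square each number
                .reduce(0, Integer::sum); // reduce phase: sum the squares
        System.out.println(sumOfSquares); // 1 + 4 + 9 + 16 + 25 = 55
    }
}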
Example: square of even and odd numbers
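This example also appears only as a diagram on the original slide. The sketch below keys each square by parity ("even" or "odd"); grouping by key mirrors how the shuffle brings values with the same key to the same reducer. The input is again made up.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EvenOddSquares {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5, 6); // made-up input
        // Map phase: square each number and key it by parity; grouping by key
        // plays the role of the shuffle; collecting per key plays the reduce.
        Map<String, List<Integer>> squaresByParity = input.stream()
                .collect(Collectors.groupingBy(
                        n -> n % 2 == 0 ? "even" : "odd",
                        Collectors.mapping(n -> n * n, Collectors.toList())));
        System.out.println(squaresByParity); // odd=[1, 9, 25], even=[4, 16, 36]
    }
}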
Example: square of even, odd, and prime numbers
Example: the word count process
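The word-count flow on the original slide is a diagram, so here is the classic WordCount job from the Hadoop MapReduce tutorial, lightly condensed. The mapper emits (word, 1) pairs, the shuffle groups them by word, and the reducer sums the counts; input and output paths come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reducer: all counts for the same word arrive together; sum them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would be run with something like: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are illustrative).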
MapReduce Engine
1. Job Tracker
Responsible for accepting jobs from clients, dividing those
jobs into tasks, and assigning those tasks to be executed by
worker nodes.
Cont.…
2. Task tracker
Runs MapReduce tasks and periodically reports their status.
It is the slave node of the Job Tracker and takes its tasks from the Job Tracker.
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested (transferred) into Hadoop from various sources such as relational databases (RDBMS), other systems, or local files.
Cont.…
The data is stored in HDFS or HBase, and MapReduce performs the data processing.
Assignment two
1. List and describe the Hadoop ecosystem components.
2. Write about applications of Big Data Analytics.
3. What is the Network File System (NFS)?
4. Define the following terms:
RPC
SSH
TCP/IP
5. Compare a traditional RDBMS and HBase.
6. Give the advantages and disadvantages of Hadoop.
The end