
Selected Topics in Computer Science (CoSc4181)

Lecture 02: Basic concepts of Big Data

Department of Computer Science


Dilla University

By: Tsegalem G/hiwot


2022 G.C

1
Contents
• Understand the basics of Big Data

• Evolution of Big Data

• Characteristics of Big Data

• Big Data sources

• Benefits of Big Data

• Challenges of Big Data and solution

• The main concepts of Hadoop

• Hadoop components and ecosystem


2
What Is Big Data?
 Big Data is a term used to describe a collection of data that is
huge in size and yet growing exponentially with time.

 Value is generated from different sources by processing very large quantities of digital information that cannot be analyzed with traditional computing techniques.

 Having bigger data requires different approaches:


– Techniques, tools and architecture

 Big data makes it possible to store, process, and analyze data that was previously ignored due to the limitations of traditional data management technologies.

 Big data is all data (structured, semi-structured and unstructured).


3
Factors for Evolution of Big Data
 Evolution of technology
• The world changed from the telephone to the cellphone

• From stand-alone computers to networked computers (the Internet)

 IoT (an estimated 50 billion IoT devices by 2020)

 Social media, e.g., Instagram, Facebook, YouTube, etc.

 Others, e.g., Amazon, Flipkart, etc.

4
Examples of Big Data
 Walmart handles more than 2.5 petabytes of data.

 The NASA Center for Climate Simulation stores about 32 PB of climate data.

 Google processes 20 PB a day (2008)

 Facebook has over 2.5 PB of user data.

 eBay has 6.5 PB of user data + 50 TB/day.

 Twitter produces over 90 million tweets per day.

5
How can you avoid big data?
 Pay cash for everything!

 Never go online!

 Don’t use a Cellphone!

 Don’t fill any online prescriptions!

 Never leave your house!

6
Characteristics of Big Data (5V)
 Volume

 Velocity

 Variety

 Veracity

 Value

7
Volume(Scale)
 Refers to the quantity of generated and stored data.

 The name "Big Data" itself relates to its enormous size.

 The size of the data plays a crucial role in determining the value that can be extracted from it.

 Hence, volume is one characteristic that needs to be considered while dealing with big data.

8
Velocity(Speed)
 Refers to the speed of generation of data.

 Data is live streaming or in motion.

• Example: medical devices that monitor patients need to collect data, send it to its destination, and have it analysed quickly.

 Big data is often available in real-time.

 Compared to small data, big data is produced more continuously.

9
Variety (Complexity)
 The type and nature of the data.

 Refers to the varied sources and nature of data, both structured and unstructured.

 Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure. Nowadays, however, about 80% of data is unstructured and cannot easily be put into tables.

10
Veracity/Validity
 It refers to the quality and accuracy of data.
 Gathered data could have missing pieces, may be inaccurate or may
not be able to provide real, valuable insight.
 Veracity, overall, refers to the level of trust there is in the collected data.
 Data can sometimes become messy and difficult to use.
 A large amount of data can cause more confusion than insights if it's
incomplete.
• For example, concerning the medical field, if data about what drugs
a patient is taking is incomplete, then the patient's life may be
endangered.

11
Value
 This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data.

 Being able to pull value from big data is a requirement, as the value of big data increases significantly depending on the insights that can be gained from it.

 Organizations can use the same big data tools to gather and
analyse the data, but how they derive value from that data should
be unique to them.

12
Big data is all data

13
Big Data sources
 Users

 Application

 Systems

 Sensors

14
Who’s Generating Big Data
 Social media and networks (all of us are generating data)

 Scientific instruments (measuring all kinds of data)

 Mobile devices (tracking all objects all the time)

 Sensor technology and networks (collecting all sorts of data)

15
Benefits of Big Data
 To make better decisions and take meaningful actions at the
right time.

 Reduce costs of business processes

 Fraud Detection: Financial companies, in particular, use big


data to detect fraud. Data analysts use machine learning
algorithms and artificial intelligence to detect anomalies and
transaction patterns.

 Increased productivity

 Improved customer service

 Increase innovation and development of next-generation products.

16
Challenges of Big Data
 Lack of proper understanding of Big Data.

 Data growth issues or Data storage growth.

 Insufficient budget/Costs increase too fast.

 Nature of Data.

 Lack of trained professionals: experts in using the new technology and dealing with big data, and a lack of analytic skills.

 Confusion during Big Data tool selection.

 Integrating data from a variety of sources.

17
Cont…
 Traditional systems are useful for structured data but they can’t
manage such a large amount of unstructured data.
 80% of the data over the globe is unstructured or available in widely
varying structures, which are difficult to analyse through the
traditional systems.
 So, how can we process big data with reasonable cost and time?
 To better address the high storage and computational needs of big
data, computer clusters are a better fit.
 A computer cluster is a set of computers that work together so that
they can be viewed as a single system.
 Do we have a framework for cluster computing?

18
What is Hadoop?
 Hadoop is a collection of open-source software utilities that
facilitate using a network of many computers to solve problems
involving massive amounts of data and computation.

 Allows for distributed processing of large datasets across


clusters of commodity computers using a simple programming
model.

 Originally designed for computer clusters built from commodity


hardware

19
Cont…
 Commodity computers are cheap and widely available.

 These are mainly useful for achieving greater computational power at low cost.

 Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system called HDFS.

20
Core-Components of Hadoop
1. Hadoop Distributed File System (HDFS)

2. MapReduce

21
1. Hadoop Distributed File System (HDFS)
 A type of distributed file system; it is the storage part of Hadoop.

 Splits files into a number of blocks and distributes them across nodes in a cluster.

 It then transfers packaged code to the nodes to process the data in parallel.

 This approach takes advantage of data locality, where nodes manipulate the data they have local access to.

 Hadoop splits files into blocks and distributes them across


nodes in a cluster.

22
Cont.…
 Stores multiple copies of data on different nodes

 Typically has a single Namenode and a number of Datanodes.

 HDFS works on master/slave architecture.

 Master services can communicate with each other, and in the same way slave services can communicate with each other.

 The Namenode is the master node and the Datanodes are the corresponding slave nodes, and they can talk to each other.

23
HDFS daemons
HDFS has three services:

1. Name node

2. Secondary name node

3. Data node

24
1. Name/Master node
 HDFS has only one Namenode, which we call the master node; it keeps track of the files.

 Manages all file system metadata; it holds metadata about all the data in the cluster.

 Contains details such as the file name, ownership, permissions, number of blocks, on which Datanodes the data is stored, where the replicas are stored, and other details (a small sketch of such a record follows).
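
As a rough illustration only (not the actual HDFS implementation), the per-file metadata a Namenode-like service keeps can be pictured as a small record; all field and variable names below are assumptions made for this sketch.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileMetadata:
    """Illustrative sketch of the per-file metadata a Namenode-like service might keep."""
    name: str                    # file name (full path)
    owner: str                   # ownership
    permission: str              # e.g. "rw-r--r--"
    block_ids: List[int] = field(default_factory=list)                   # blocks making up the file
    block_locations: Dict[int, List[str]] = field(default_factory=dict)  # block id -> Datanodes holding replicas

# Example record for a file split into two blocks, each replicated on 3 Datanodes.
meta = FileMetadata(
    name="/data/sales.csv",
    owner="hadoop",
    permission="rw-r--r--",
    block_ids=[1, 2],
    block_locations={1: ["dn1", "dn4", "dn7"], 2: ["dn2", "dn5", "dn8"]},
)
print(meta.name, "has", len(meta.block_ids), "blocks")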

25
Cont.…
 Because there is only one Namenode, it is a single point of failure.

 It has a direct connection with the client.

 Receives heartbeats and block reports from all the Datanodes.

 Handles authentication and authorization.

26
2. Data node
 Stores the actual data as blocks.

 Known as slave nodes; they are responsible for serving client reads and writes.

 Receives data as instructed by the Namenode and reports back with acknowledgements.

 Stores multiple copies of each block.

 Every Datanode sends a heartbeat message to the Namenode every 3 seconds to convey that it is alive.

27
Cont.…
 In this way, when the Namenode does not receive a heartbeat from a Datanode for 2 minutes, it takes that Datanode as dead and starts re-replicating its blocks on some other Datanode (see the sketch after this list).

 Serves read and write requests from clients.

 Has no knowledge of the overall HDFS file system namespace.

 Receives data from the Namenode or from peer Datanodes.
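
The heartbeat and dead-node logic described above can be sketched as a toy simulation. It uses the 3-second heartbeat interval and 2-minute timeout stated on these slides; the function names are illustrative, not real Hadoop code.

import time

HEARTBEAT_INTERVAL = 3   # seconds, as stated on the slide
DEAD_TIMEOUT = 120       # 2 minutes, as stated on the slide

last_heartbeat = {}      # Datanode id -> timestamp of its last heartbeat

def receive_heartbeat(datanode_id: str) -> None:
    """Namenode-side bookkeeping when a heartbeat arrives."""
    last_heartbeat[datanode_id] = time.time()

def dead_datanodes() -> list:
    """Datanodes whose last heartbeat is older than the timeout."""
    now = time.time()
    return [dn for dn, ts in last_heartbeat.items() if now - ts > DEAD_TIMEOUT]

# Toy usage: dn1 and dn2 report in; later the Namenode checks who is overdue.
receive_heartbeat("dn1")
receive_heartbeat("dn2")
print(dead_datanodes())   # [] -- both nodes reported recently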

28
3. Secondary name node
 Its only job is to take care of checkpoints of the file system metadata held by the Namenode.

 It is also known as the checkpoint node.

 It is the helper Node for the Name Node.


 The checkpointed metadata includes:

• List of files

• List of blocks for each file

• List of Data Nodes for each block

• File attributes

• Creation time

• A record of every change in the metadata

29


HDFS Master/Slave Architecture

30
HDFS Blocks

 Data files in HDFS are broken into block-sized chunks, which are stored as independent units.

• The default block size is 128 MB in Apache Hadoop 2.0 (64 MB in Apache Hadoop 1.0).

 Blocks are replicated for reliability.

• Multiple copies of data blocks are created and distributed on nodes throughout the cluster to keep data highly available even when a node failure occurs.

31
Cont.…

 One copy is stored on the local node and a second copy on a node in a remote rack.

 The third copy is stored on a different node of that same remote rack; additional replicas are placed randomly.

 The default replication factor is 3 (a placement sketch follows).
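
A minimal sketch of rack-aware placement for 3 replicas, assuming the cluster is described as a simple rack-to-nodes mapping; this only illustrates the policy described above and is not HDFS's actual placement code.

import random

def place_replicas(writer_node, nodes_by_rack):
    """Pick 3 nodes: the writer's node, one node on a remote rack, and another node on that same rack."""
    local_rack = next(rack for rack, nodes in nodes_by_rack.items() if writer_node in nodes)
    remote_racks = [r for r in nodes_by_rack if r != local_rack]
    remote_rack = random.choice(remote_racks)
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [writer_node, second, third]

cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas("n2", cluster))   # e.g. ['n2', 'n5', 'n4']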

32
Cont…

E.g., a 420 MB file is split into 128 MB + 128 MB + 128 MB + 36 MB (four blocks).
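
A quick way to check this split, assuming the 128 MB default block size (illustration only, not Hadoop code):

BLOCK_SIZE_MB = 128   # Hadoop 2.x default block size

def split_into_blocks(file_size_mb: int) -> list:
    """Return the sizes of the blocks a file of the given size would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

print(split_into_blocks(420))   # [128, 128, 128, 36] -> 4 blocks; the last holds only 36 MB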

33
File write operation on HDFS
 To write a file to HDFS, the client needs to interact with the Namenode.

 The Namenode provides the address of the slave (Datanode) on which the client will start writing the data.

 As soon as the client finishes writing a block, that slave starts copying the block to another slave, which in turn copies it to yet another slave (3 replicas by default).

 After the required replicas are created, an acknowledgement is sent back to the client.

34
1. Setting up HDFS pipeline.

35
2. Write pipeline

36
3. Acknowledgement in HDFS write
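
The three stages above (pipeline setup, writing through the pipeline, and acknowledgement) can be summarised with a toy simulation; the function and variable names are illustrative assumptions, not the HDFS client API.

def write_block(block: bytes, pipeline: list) -> str:
    """Simulate an HDFS-style write: each node stores the block, then forwards it downstream."""
    stored_on = []
    for node in pipeline:                 # stage 2: data flows node -> node along the pipeline
        stored_on.append(node)            # pretend the node persisted the block locally
    return f"ack: block of {len(block)} bytes stored on {stored_on}"   # stage 3: ack back to the client

# Stage 1: the Namenode hands the client a pipeline of 3 Datanodes (replication factor 3).
pipeline = ["dn1", "dn4", "dn7"]
print(write_block(b"some file data", pipeline))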

37
File read operation on HDFS
 To read a file from HDFS, the client needs to interact with the Namenode.

 The Namenode provides the addresses of the slaves (Datanodes) where the file is stored.

 The client will interact with the respective Datanodes to read the file.

 The Namenode also provides the client with a token, which the client shows to the Datanode for authentication (see the sketch below).
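
A matching sketch of the read path (illustrative only, not the HDFS API): the client first asks a Namenode-like service for the block locations, then fetches each block from one of its replicas.

# Toy metadata a Namenode-like service would return: block id -> Datanodes holding replicas
block_locations = {1: ["dn1", "dn4", "dn7"], 2: ["dn2", "dn5", "dn8"]}

# Toy block contents held by the "Datanodes"
block_data = {1: b"Hello, ", 2: b"HDFS!"}

def read_file(locations: dict) -> bytes:
    """Read the blocks in order, picking the first listed replica for each block."""
    data = b""
    for block_id in sorted(locations):
        replica = locations[block_id][0]          # real HDFS prefers the closest replica
        print(f"reading block {block_id} from {replica}")
        data += block_data[block_id]              # pretend the bytes came from that replica
    return data

print(read_file(block_locations).decode())        # Hello, HDFS!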

38
HDFS file reading mechanisms

39
HDFS Features
 Distributed

 Scalable

 Cost effective

 Fault tolerant.

 High throughput

 Others

40
1. Distributed

 Stores huge files in a distributed manner across a network of machines.

 This is done by combining commodity (cheap) computers into a cluster.

 The client uses the cluster as if it were a single computer.

41
2. Scalable
 As discussed above, HDFS is distributed over a cluster of machines.

 It is easily scalable, both horizontally and vertically.

 A few extra nodes help in scaling up the framework.

42
3. Economical/cost effective

 One important feature is that there is no need to buy expensive server machines, because it is possible to combine cheap machines.

 The system is highly economical, as ordinary computers can be used for data processing.

43
4. Fault tolerance
 It stores copies of the data on different machines and is
resistant to hardware failure

 This also holds for failures of a main switch or a whole rack: keeping a copy on another rack is called rack awareness.

 Replication is expensive: with 3-fold replication, only 1/3 of the total storage holds unique data.

 If a Namenode failure happens, a backup Namenode is the solution.

44
5. High throughput
 HDFS stores data in a distributed fashion, which allows data to be processed in parallel on a cluster of nodes.

 This decreases the processing time and thus provides high


throughput.

 Latency: the time to get the first record.

 Throughput: the number of records processed per unit of time.

45
6. Others
 Unlimited data storage

 High speed processing system

 Processing of all varieties of data:


1. Structured
2. Unstructured
3. Semi-structured

46
2. MapReduce

 MapReduce is the processing part of Hadoop.

 It processes data in parallel in a distributed environment.

 Hadoop distributes the computation over the cluster.

 A programming framework (library and runtime) for analyzing data sets stored in HDFS.

 MapReduce jobs are composed of two functions: map and reduce.

47
The Mapper
1. Data is split and sent to worker nodes.

2. Maps are individual tasks that transform input into


intermediate records.

3. Each block is processed in isolation by a map task called a mapper.

4. The following diagram shows a simplified flow for a MapReduce program.

48
Shuffling
 Before the reducer runs there is a shuffling phase, in which the intermediate outputs of the map tasks are exchanged and moved to the nodes where they are required by the reducers.

49
Reducer
 Reduces the set of intermediate values that share a key to a smaller set of values.

 All of the values with the same key are presented to a single reducer together.

 Produces the final output.

50
Example 01: sum of squares.
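
A minimal sketch of this example in plain Python (not Hadoop code): the map step squares each number independently, and the reduce step sums the squared values.

from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: each input number is transformed independently into its square.
squared = list(map(lambda x: x * x, numbers))

# Reduce: the intermediate values are combined into a single result.
total = reduce(lambda a, b: a + b, squared)

print(squared)   # [1, 4, 9, 16, 25]
print(total)     # 55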

51
Example: square of even and odd numbers

52
Example: squares of even, odd, and prime numbers.

53
Example: the word count process.
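
A word count walk-through in plain Python (a sketch of the MapReduce stages, not the Hadoop API): the mapper emits (word, 1) pairs, the shuffle groups the pairs by key, and the reducer sums the counts per word. The input lines are made up for illustration.

from collections import defaultdict

lines = ["deer bear river", "car car river", "deer car bear"]

# Map: emit (word, 1) for every word in every input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the intermediate pairs by key (the word).
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}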

54
MapReduce Engine

1. Job Tracker
 Responsible for accepting jobs from clients, dividing those
jobs into tasks, and assigning those tasks to be executed by
worker nodes.

 The Job Tracker talks to the NameNode to find out the location of the data that needs to be processed.

 In response, the NameNode gives the metadata to the Job Tracker.

55
Cont.…
2. Task tracker
 Runs MapReduce tasks.

 It is the slave node for the Job Tracker, and it takes tasks from the Job Tracker.

 It also receives the code to run from the Job Tracker.

 The process of applying that code to the file is known as the map task (mapper).

56
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
 The data is ingested/transferred to Hadoop from various sources such as relational databases (RDBMS), other systems, or local files.

 Sqoop transfers data from RDBMS to HDFS, whereas


Flume transfers event data.

2. Processing the data in storage


 The second stage is Processing.

 In this stage, the data is stored and processed.

57
Cont.…
 The data is stored in HDFS or HBase, and MapReduce performs the data processing.

3. Computing and analyzing data


 The third stage is to Analyze.

 Here, the data is analyzed by processing frameworks such as Pig and Hive.

4. Visualizing the results


 In this stage, the analyzed data can be accessed by users.

58
Assignment two
1. List and describe Hadoop ecosystem
2. Write Application of Big Data Analytics
3. What is Network File System?
4. Define the following terms
 RPC
 SSH
 TCP/IP
5. Compare traditional RDBMS and HBase
6. Advantages and disadvantages of Hadoop.

59
The end

60
