
Big Data Overview



Big Data is a term that applies to data that can't be analyzed or processed using traditional means.



Increasingly, organizations are facing challenges related to Big Data. They have access
to a wealth of data, but they don't know how to get value out of it, because it sits in its
rawest form or in an unstructured format.
Characteristics of Big Data

 Volume: Addressing the rapidly increasing volume of data available.

 Velocity (the speed of data generation): Addressing the analysis and handling of
streaming data, that is, data that arrives very fast and changes constantly.

 Variety: Addressing the different forms of data available: videos, tweets,
Facebook posts, etc.

 Veracity: Another way to think of veracity is "accuracy," "fidelity," or
"truthfulness." In fact, it is the thing that matters most when you want to actually DO
something with all the data you have collected.



Characteristics of Big Data

 V4 = Volume, Velocity, Variety, Veracity

 Volume: cost-efficiently processing the growing volume of data
(from 2010 to 2020, an estimated 50x growth to 35 ZB)

 Velocity: responding to the increasing velocity of data
(30 billion RFID sensors and counting)

 Variety: collectively analyzing the broadening variety of data
(80% of the world's data is unstructured)

 Veracity: establishing the veracity of big data sources
(1 in 3 business leaders don't trust the information they use to make decisions)
There are five main Big Data use cases that we feel represent the majority of
activity in this space that is of commercial interest.



The 5 Key Big Data Use Cases

 Big Data Exploration: find, visualize, and understand all big data to improve
decision making.

 Enhanced 360° View of the Customer: extend existing customer views by
incorporating additional internal and external data sources.

 Security/Intelligence Extension: lower risk, detect fraud, and monitor cyber
security in real time.

 Operations Analysis: analyze a variety of machine data for improved business results.

 Data Warehouse Augmentation: integrate big data and data warehouse capabilities
to increase operational efficiency.



Traditional RDBMSs can't handle the huge volume of data in the BIG DATA explosion,
and that data is not only BIG but also unstructured.
That is why Hadoop exists. It was built on technology started by Google: the Google
File System (GFS), a distributed file system in which data lives on different nodes of a
cluster, and whenever you want to add capacity you simply add a new node.



Hadoop
IBM introduces BigInsights
based on Hadoop

What is Apache Hadoop?
A flexible framework for storing and processing large volumes of data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Initiated at Yahoo



Hardware improvements through the years...

 CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz

 RAM
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)

 Disk Capacity
– 1990 – 20MB
– 2010 – 1TB

 Disk transfer speed (reads and writes) – not much improvement in the last 7–10 years,
currently around 70–80 MB/sec

How long will it take to read 1TB of data?


1 TB (at 80 MB/sec):
1 disk - 3.4 hours
10 disks - 20 min
100 disks - 2 min
1,000 disks - 12 sec
Parallel Data Processing is the answer!
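
As a quick check of these figures (assuming the 80 MB/sec sequential read rate quoted
above and perfectly parallel reads): 1 TB ÷ 80 MB/sec = 1,000,000 MB ÷ 80 MB/sec =
12,500 seconds ≈ 3.5 hours on a single disk. With N disks reading in parallel, the time
drops to roughly 12,500 / N seconds: about 21 minutes for 10 disks, about 2 minutes for
100 disks, and about 12.5 seconds for 1,000 disks.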
 The GRID computing idea is to increase processing power by combining multiple
computers and bringing the data to wherever processing capacity is available.

 Distributed workload is the opposite: bringing the processing to the data. This is great,
but writing applications that do this yourself is very hard. You need to worry about how
to exchange data between the different nodes, and you need to think about what
happens if one of those nodes goes down. You end up spending more time coding the
data movement than dealing with the problem itself.

- Hadoop, as we will see next, is the answer to parallel data processing without the
issues of hand-built GRID or distributed-workload systems.



 You may be familiar with OLTP (Online Transaction Processing), where structured data,
such as a relational database, is accessed randomly: for example, when you access your
bank account.
 You may also be familiar with OLAP (Online Analytical Processing), or DSS (Decision
Support Systems), where structured data such as a relational database is accessed
multidimensionally to generate reports that provide business intelligence.
 Now, you may not be that familiar with the concept of "Big Data".
 Big Data is a term used to describe large collections of data (also known as datasets)
that may be unstructured and that grow so large and so quickly that they are difficult
to manage with a regular database.
 Hadoop is optimized to handle massive amounts of data, which can be structured or
unstructured, using commodity hardware, that is, relatively inexpensive computers.
 Hadoop replicates its data across different computers, so that if one goes down, the
data is processed on one of the replicated computers.



Hadoop is used not for OLTP or OLAP, but for Big Data.
So Hadoop is NOT a replacement for an RDBMS.



Hadoop is not for all types of work
 Not good when work cannot be parallelized

 Not good for processing lots of small files

 Not good for intensive calculations with little data


What is Hadoop?

 Consists of 3 subprojects:

−Hadoop Distributed File System (HDFS)
−MapReduce
−Hadoop Common
Two Key Aspects of Hadoop

 Hadoop Distributed File System = HDFS


– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them
into one big file system
– Hadoop replicates its data across different computers, so that if one goes
down, the data is processed on one of the replicated computers.
 MapReduce framework
– MapReduce is a software framework that supports distributed processing of large
data sets across clusters of computers.
– It is how Hadoop understands and assigns work to the nodes (machines).
What is the Hadoop Distributed File System?
 Manages the storage of data by spreading it across multiple nodes

 HDFS assumes nodes will fail, so it achieves reliability by replicating data
across multiple nodes.
 The degree of replication can be customized by the Hadoop administrator; the default
is to replicate every chunk of data across 3 nodes: 2 on the same rack and 1 on a
different rack.
 Because of this replication, HDFS does not require RAID storage on its hosts.
DataNodes can talk to each other to rebalance data and to move copies around.

 The file system is built from a cluster of data nodes, each of which serves
up blocks of data over the network using a block protocol specific to HDFS.
 HDFS enables applications to work with thousands of nodes and petabytes of data
in a highly cost-effective manner.
 CPU + storage disks = "node"; each node runs a Linux OS.
 Nodes can be combined into clusters, and new nodes can be added as needed.
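
To make the client's view concrete, here is a minimal sketch (not from the original deck)
of reading a file from HDFS with the Hadoop Java FileSystem API. The path
/user/demo/input/sample.txt is hypothetical, and the configuration is assumed to point
at a running cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS there points the client at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used purely for illustration.
        Path file = new Path("/user/demo/input/sample.txt");

        // The client asks the NameNode for block locations, then streams
        // the blocks directly from the DataNodes that hold them.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```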



 Data is stored across the entire cluster (the DFS)
 The Distributed File System (DFS) is responsible for spreading data across
the cluster, by making the entire cluster look like one giant file system.
When a file is written to the cluster, blocks of the file are spread out and
replicated across the whole cluster.
– The entire cluster participates in the file system
– Blocks of a single file are replicated across the cluster

[Diagram: a logical file is split into blocks 1–4, and each block is replicated on several
different nodes of the cluster.]
(Very quick) Introduction to MapReduce
 Driving principles
– Data is stored across the entire cluster.
– Programs are brought to the data, not the data to the program: rather than bringing the
data to your program, as you do in traditional programming, you write your program in a
specific way that allows it to be moved to the data.

 The Distributed File System (DFS) is at the heart of MapReduce.

 MapReduce chooses which node in the cluster a given block of data is processed on;
the block is also replicated on other nodes in the cluster.
[Diagram repeated: blocks 1–4 of a logical file distributed and replicated across the cluster.]
 In a traditional data warehouse you already know what you can ask, while in
Hadoop you dump all the raw data into HDFS and then start asking the questions.

 Traditional warehouses are mostly ideal for analyzing structured data from various
systems and producing insights. A Hadoop-based platform is well suited to deal
with unstructured data.
 In addition, when you consider where data should be stored, you need to
understand how data is stored today. When data is stored in a traditional data
warehouse, it typically must shine with respect to quality (good data): it is
cleaned up via cleansing, matching, modeling, and other services before it is
ready for analysis. The data that lands in the warehouse is of high value and has
a broad purpose: it is going to be used in reports and dashboards where the
accuracy of the data is key.



 In contrast, big data repositories rarely undergo the full quality control applied to
data before it is injected into a warehouse. We could say that data warehouse data
is trusted enough to be "public," while Hadoop data isn't as trusted.



Big Difference:

 Regular database: raw data → schema to filter → storage (pre-filtered data)

 Big Data (Hadoop): raw data → storage (unfiltered, raw data) → schema to filter → output



RDBMS vs Hadoop
Aspect             RDBMS                                  Hadoop
Data sources       Structured data with known schemas     Unstructured and structured data
Data type          Records, long fields, objects, XML     Files
Data updates       Updates allowed                        Only inserts and deletes
Language           SQL & XQuery                           Pig (Pig Latin), Hive (HiveQL), Jaql
Processing type    Quick response, random access          Batch processing
Compression        Sophisticated data compression         Simple file compression
Hardware           Enterprise hardware                    Commodity hardware
Data access        Random access (indexing)               Access files only (streaming)
History            ~40 years of innovation                < 5 years old
Community          Widely used, abundant resources        Not widely adopted yet
Two Key Aspects of Hadoop

HDFS: distributed, reliable, runs on commodity gear
MapReduce: parallel programming, fault tolerant



Hadoop Distributed File System (HDFS)
 Distributed, scalable, fault tolerant, high throughput
 Data access through MapReduce
 Files split into blocks
 3 replicas for each piece of data by default
 Can create, delete, copy, but NOT update
 Designed for streaming reads, not random access
 Data locality: processing data on the physical storage to decrease
transmission of data
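
As a sketch of the write-once model described above (files can be created but not
updated in place), assuming a running cluster and a hypothetical output path:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output path. With overwrite = false, create() fails if the
        // file already exists; there is no in-place update of existing content.
        Path out = new Path("/user/demo/output/notes.txt");
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                fs.create(out, /* overwrite = */ false), StandardCharsets.UTF_8))) {
            writer.write("HDFS files are written once and read many times.");
            writer.newLine();
        }
    }
}
```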

HDFS – Architecture
 Master / Slave architecture
 CPU + storage disks = "node"; each node runs a Linux OS

 Master: NameNode
– manages the file system namespace and metadata
• FsImage
• EditLog
– regulates client access to files
– holds the replication factor and the properties and addresses of the data blocks

 Slave: DataNode
– many per cluster
– manages the storage attached to its node
– periodically reports its status to the NameNode
– each DataNode holds blocks of data

[Diagram: the NameNode maps File1 to blocks a–d; each DataNode in the cluster stores a
subset of the replicated blocks.]
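
To illustrate the NameNode's metadata role, here is a small sketch (not from the deck)
that asks, via the standard Hadoop Java API, which DataNodes hold the blocks of a
hypothetical file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, for illustration only.
        Path file = new Path("/user/demo/input/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query from its metadata: for each block
        // of the file, which DataNodes hold a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d: offset=%d length=%d hosts=%s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    String.join(",", blocks[i].getHosts()));
        }
    }
}
```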
Hadoop Distributed File System (HDFS)
 Files split into blocks

 Data in a Hadoop cluster is broken down into smaller pieces (called blocks).

 Data locality: processing data on the physical storage to decrease
transmission of data

HDFS – Replication
 Blocks of data are replicated to multiple nodes
– Behavior is controlled by replication factor, configurable per file
– Default is 3 replicas

Common case:
 one replica on one node in the
local rack
 another replica on a different
node in the local rack
 and the last on a different node
in a different rack
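
Besides the cluster-wide default, the replication factor can be changed per file from a
client program. A minimal sketch using the Hadoop FileSystem API, with a hypothetical
path and a replication factor of 5 chosen purely for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; raise its replication factor from the cluster
        // default (3) to 5. The NameNode then schedules the extra copies.
        Path file = new Path("/user/demo/input/sample.txt");
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```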

Summary
 The Hadoop Distributed File System (HDFS) is where Hadoop stores
its data. This file system spans all the nodes in a cluster. Effectively,
HDFS links together the data that resides on many local nodes, making
the data part of one big file system. Furthermore, HDFS assumes nodes
will fail, so it replicates a given chunk of data across multiple nodes to
achieve reliability. The degree of replication can be customized by the
Hadoop administrator or programmer; by default, every chunk of data is
replicated across 3 nodes: 2 on the same rack and 1 on a different rack.



MapReduce


MapReduce Explained
 MapReduce is a software framework introduced by Google to support
distributed computing on large data sets across clusters of computers.
 It is essentially a divide-and-conquer processing model: your input is split into many
small pieces, which are processed in parallel on the Hadoop nodes (the map step).
 Once these pieces are processed, the results are distilled down to a single answer
(the reduce step).
• "Map" step: each worker node applies the map() function to its local data and
writes the output to temporary storage. A master node ensures that, of the
redundant copies of the input data, only one is processed.
• "Shuffle" step: worker nodes redistribute data based on the output keys
(produced by the map() function), so that all data belonging to one key is located
on the same worker node.
• "Reduce" step: worker nodes now process each group of output data, per key,
in parallel.
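
To make the three steps concrete, here is a sketch of the classic word-count job written
against the standard Hadoop Java MapReduce API (not taken from the original deck): the
mapper emits <word, 1> pairs from its local block of input, and after the shuffle the
reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: runs on each node that holds a block of the input file and
    // emits a <word, 1> pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // the mapper's <key, value> stream
            }
        }
    }

    // Reduce step: after the shuffle has grouped all values for a given word
    // on one node, sum them to produce that word's total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```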
Introduction to MapReduce
 Driving principles
– Data is stored across the entire cluster.
– Programs are brought to the data, not the data to the program.
 Data is stored across the entire cluster (the DFS) and replicated.
– Blocks of a single file are distributed across the cluster.
– The HDFS block size is large by default: 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.

[Diagram repeated: blocks 1–4 of a logical file distributed and replicated across the cluster.]

MapReduce Overview

[Diagram: data flows from the distributed file system (HDFS, stored in blocks) through
the Map, Shuffle, and Reduce phases; results can be written to HDFS or to a database.]

"if you write your programs in a special way" the programs can be
brought to the data. This special way is called MapReduce, and
involves breaking your program down into two discrete parts: Map
and Reduce.

A mapper is typically a relatively small program with a relatively


simple task: it is responsible for reading a portion of the input data,
interpreting, filtering or transforming the data as necessary and then
finally producing a stream of <key, value>.
As shown in the diagram, the MapReduce environment will
automatically take care of taking your small "map" program (the blue
boxes) and pushing that program out to every machine that has a
block of the file you are trying to process. This means that the bigger
the file, the bigger the cluster, the more mappers get involved in
processing the data! That's a pretty powerful idea.
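
A matching driver (again a sketch, assuming the hypothetical WordCount mapper and
reducer classes shown earlier and the Hadoop 2.x Job API) shows how the job is packaged
and submitted so the framework can push the map code to the machines holding the
input blocks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // Package the job; the framework ships this jar to the nodes that
        // hold the input blocks, so the map code runs next to the data.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = HDFS input directory, args[1] = HDFS output directory
        // (the output directory must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; the exit code reflects success or failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```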



Notes
 The main objective of Big Data is distributed processing, and distributed storage
follows from it, not the other way around.
 Normally, we bring data to the mainframe processor and process it there; here we
keep the data on the nodes (commodity machines, typically running a Linux OS) and
process it in place, by sending the program over to the data (MapReduce).
 Afterwards come tools for reporting and analysis ("to describe what happened"),
such as Cognos, and tools for predictive analysis ("to describe what may happen"),
such as IBM SPSS and IBM SPSS Modeler.



Questions?
