BITS Pilani, Pilani Campus
Email: kanantharaman@wilp.bits-pilani.ac.in
➢ Motivation
– Why do modern enterprises need to work with data?
– What is Big Data and data classification
– Scaling RDBMS
➢ Architecture
– Data warehouse
– High level architecture of Big Data solutions
Source: https://www.guru99.com/what-is-big-data.html
➢ Structured data
➢ Semi-structured data
➢ Unstructured data
[Figure: semi-structured data (e.g., XHTML, SGML and other markup languages) contrasted with unstructured data]
• Volatility
• Volatility of data deals with how long the data is valid and how long it should be stored.
• Variability
• Data flows can be highly inconsistent, with periodic peaks.
• Data mining
✓ Association rule mining, e.g. market basket or affinity analysis
✓ Regression, e.g. predict dependent variable from independent variables
✓ Collaborative filtering, e.g. predict a user preference from group preferences
✓ NLP - e.g. Human to Machine interaction, conversational systems
✓ Text Analytics - e.g. sentiment analysis, search
✓ Noisy text analytics - e.g. spell correction, speech to text
• Network Processing
✓ Analyze data as it streams over network
✓ Enables real-time analytics on remote data
✓ Network latency affects speed
• Hybrid Strategies
✓ Combine in-memory, disk, and network processing
✓ Leverage benefits of each approach
✓ Balance real-time and batch needs
✓ Mitigate individual limitations
✓ Optimal mix depends on:
✓ Data sizes, analytics needs
✓ Infrastructure costs
✓ Real-time vs batch requirements
Disk Access Patterns: In big data systems, when processing a large dataset,
algorithms often read data from disk. These algorithms can benefit from
spatial locality if data on the same disk blocks or in nearby regions are
frequently accessed. This can be observed in data processing frameworks
like Hadoop, which are optimized for processing data with spatial locality.
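To make the idea concrete, here is a minimal Python sketch (illustrative only; the block size and file paths are assumptions, not Hadoop code) contrasting a block-sized sequential scan, which exploits spatial locality, with scattered random reads:

BLOCK_SIZE = 64 * 1024 * 1024  # assumed 64 MB, mirroring a classic HDFS block size

def sequential_scan(path):
    # Reads the file front to back in block-sized chunks, so each disk seek
    # fetches a large run of adjacent data (good spatial locality).
    with open(path, 'rb') as f:
        chunk = f.read(BLOCK_SIZE)
        while chunk:
            yield chunk
            chunk = f.read(BLOCK_SIZE)

def random_reads(path, offsets, size=4096):
    # Jumps to scattered offsets; every read may incur a fresh seek
    # (poor spatial locality).
    with open(path, 'rb') as f:
        for off in offsets:
            f.seek(off)
            yield f.read(size)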
Impact of Latency (2)
• Algorithms and Data Structures: Big data systems leverage algorithms and data
structures that take advantage of locality of reference to optimize performance.
For example, B-trees are used in databases to efficiently access data based on
spatial locality, reducing the number of disk I/O operations.
[Figure: three VM architectures in (b), (c), and (d), compared with the traditional physical machine shown in (a)]
1. What is big data analytics and what it isn't?
a) To understand the significance of big data analytics.
• MS Excel
https://support.office.microsoft.com/en-in/article/Whats-new-in-Excel-2013-1cbc42cd-bfaf-43d7-9031-5688ef1392fd?CorrelationId=1a2171cc-191f-47de-8a55-08a5f2e9c739&ui=en-US&rs=en-IN&ad=IN
• SAS
http://www.sas.com/en_us/home.htm
• IBM SPSS Modeler
http://www-01.ibm.com/software/analytics/spss/products/modeler/
• Should you be storing all of your big data? If “Yes”, where are you going to store it? If “No”, how will you know what to store and what to discard?
• How will you sieve through your massive data to separate the relevant from the irrelevant?
• How long will you store this data?
• How will you accommodate the peaks (variability in terms of data influx) in your data?
• How will you analyze? Will you analyze all the data that is stored, or analyze a sample?
• What will you do with the insights generated from this analysis?
• NoSQL
❖ What is it?
❖ Types of NoSQL Databases
❖ Why NoSQL?
❖ Advantages of NoSQL
❖ NoSQL Vendors
❖ SQL versus NoSQL
❖ NewSQL
❖ Comparison of SQL, NoSQL and NewSQL
• Hadoop
– Features of Hadoop
– Key Advantages of Hadoop
– Versions of Hadoop
For example: Cassandra, HBase, etc.
Pig
It is a high-level scripting platform used with Hadoop. It serves as an alternative to writing MapReduce code directly. It has two parts:
Pig Latin: a SQL-like scripting language.
Pig runtime: the runtime environment in which Pig Latin scripts are executed.
Hive:
Hive is a data warehouse software project built on top of Hadoop. The three main tasks performed by Hive are summarization, querying and analysis.
Impala:
It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis and has very low latency, measured in milliseconds. It supports a dialect of SQL called Impala SQL.
Data Connectors: Various connectors and APIs are used to fetch data
from sources like databases, cloud storage, and web services.
• Data ingestion sets the stage for the rest of the Big Data
Lifecycle, making it imperative to gather, clean, and
organize data from different sources for subsequent
storage, processing, and analysis.
• Definition:
• Data storage involves the management of ingested data and its organization into a format that is accessible for analysis. This stage addresses the challenge of managing vast amounts of data efficiently.
§ Example:
§ Some common scenarios where divide-and-conquer is used include sorting, searching, and matrix multiplication; a minimal sorting sketch follows below.
§ It's like a team of chefs in a restaurant kitchen; each chef has a specific job to prepare a delicious meal.
Hadoop is:
• Ever wondered why Hadoop has been, and remains, one of the most sought-after technologies?
• The key consideration (the rationale behind its huge popularity) is:
• its capability to handle massive amounts of data, across different categories of data, fairly quickly.
• The other considerations are:
• Commodity hardware
• Distributed computing
• Scalable horizontally
• Flexible storage
• Failsafe
MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
NameNode:
• Single NameNode per cluster.
• Keeps the metadata details (file system namespace and block locations).
DataNode:
• Multiple DataNodes per cluster.
• Serves block read/write operations.
SecondaryNameNode:
• Housekeeping daemon; periodically merges the namespace image with the edit log.
• As per the Hadoop replica placement strategy, the first replica is placed on the same node as the client. The second replica is placed on a node on a different rack. The third replica is placed on the same rack as the second, but on a different node in that rack. Once the replica locations have been set, a pipeline is built. This strategy provides good reliability.
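The placement policy can be sketched in a few lines of Python (a toy model with made-up node and rack names, not actual HDFS code):

import random

def place_replicas(client_node, racks):
    # racks: dict mapping rack name -> list of node names (toy topology)
    client_rack = next(r for r, nodes in racks.items() if client_node in nodes)
    # 1st replica: on the same node as the client
    first = client_node
    # 2nd replica: on a node in a different rack
    remote_rack = random.choice([r for r in racks if r != client_rack])
    second = random.choice(racks[remote_rack])
    # 3rd replica: same rack as the 2nd, but a different node
    third = random.choice([n for n in racks[remote_rack] if n != second])
    return [first, second, third]

racks = {'rack1': ['n1', 'n2', 'n3'], 'rack2': ['n4', 'n5', 'n6']}
print(place_replicas('n1', racks))  # e.g. ['n1', 'n5', 'n4']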
1 TB = 10^6 MB
Transfer speed = 100 MB/sec
Time taken = 10^6 / 100 = 10^4 sec = 10,000 sec ≈ 167 minutes
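The same calculation in Python, for checking the numbers (the size and transfer speed are the assumed values above):

total_mb = 10**6            # 1 TB expressed in MB
speed_mb_per_sec = 100      # assumed sequential transfer speed
seconds = total_mb / speed_mb_per_sec
print(seconds, 'sec =', round(seconds / 60), 'min')  # 10000.0 sec = 167 min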
Hadoop Design Principles
# Mapper
def mapper(document):
    # Split document string into words
    words = document.split()
    for word in words:
        # Emit each word with a count of 1
        print('%s\t%d' % (word, 1))
• Intermediate results
# Sample data mapped to words and counts
words = {'the': [1, 1],
'cow': [1, 1],
'jumps': [1, 1],
'over': [1, 1],
'moon': [1],
'again': [1]}
Define Reducer function
# Reducer
def reducer(word, counts):
    # Sum the counts for each instance of a word
    total = 0
    for count in counts:
        total += count
    # Emit word with final count
    print('%s\t%d' % (word, total))
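To see the map and reduce steps working together, here is a small end-to-end sketch that simulates the shuffle step Hadoop would normally perform between them (the map and reduce logic is inlined for clarity, and the two toy input documents are chosen to reproduce the sample intermediate data shown above):

from collections import defaultdict

documents = ['the cow jumps over', 'the cow jumps over moon again']

# Map phase: emit (word, 1) pairs from every document
pairs = []
for doc in documents:
    for word in doc.split():
        pairs.append((word, 1))

# Shuffle phase: group the counts by word (done by the framework in Hadoop)
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
for word, counts in grouped.items():
    print('%s\t%d' % (word, sum(counts)))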
TestDFSIO:
• Benchmark tool to test HDFS throughput for large file reads/writes.
• It measures aggregate read/write throughput across the cluster by testing under varied conditions and file sizes.
Expected Operational Metrics:
• Read throughput (MB/sec)
• Write throughput (MB/sec)
• Average I/O rate
MRBench :
• Helpful for testing incremental changes in the system.
• Micro-benchmark suite to test various aspects of MapReduce operation.
• Includes tests like sort, grep, aggregation, join, statistics metrics, etc.
Expected Operational Metrics:
• Completion time for sort, word count etc
• Processor utilization percentage
• Mapper/reducer task numbers
• Map input/output records
Benchmark Programs (4)
GridMix
• Creates a controllable MapReduce workload mix to
simulate production load.
• Workloads can be customized to match variety of use
cases based on needs.
• Helps test cluster performance for production
environments.
Expected Operational Metrics:
• MapReduce job latency distribution
• Overall execution time
• Network utilization
• CPU/memory usage per node
Benchmark Programs (5)
YARN benchmarking
Not a failover node
• Let us assume that the file “Sample.txt” is of size 192 MB. As per the default block size (64 MB), it will be split into three blocks and replicated across the nodes of the cluster based on the default replication factor.
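A quick Python check of the block arithmetic (the block size and replication factor are the defaults assumed above):

import math

file_size_mb = 192
block_size_mb = 64      # default block size assumed above
replication = 3         # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(blocks, 'blocks,', blocks * replication, 'stored block replicas')  # 3 blocks, 9 stored block replicas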
HDFS Replication (Tunable)
Replication factor = 3
File size is 3.2 MB
Ø help command:
Ø hadoop dfs -help | more
HDFS 2 Features
1. Horizontal scalability.
2. High availability.
• HDFS Federation uses multiple independent
NameNodes for horizontal scalability. NameNodes are
independent of each other.
➢ Running Pig
➢ Execution Modes of Pig
➢ Relational Operators
➢ Eval Function
➢ Piggy Bank
➢ When to use Pig?
➢ When NOT to use Pig?
➢ Pig versus Hive
Source: analyticsvidhya.
✓ LimitOptimizer: If the limit operator is applied just after a load or sort operator, then Pig converts these operators into a limit-sensitive implementation, which avoids processing the whole data set (see the sketch after this list).
✓ ColumnPruner: This function omits the columns that are never used, and hence reduces the size of the record. It can be applied after each operator to prune the fields aggressively and frequently.
✓ MapKeyPruner: This function omits the map keys that are never used, hence reducing the size of the record.
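The effect of LimitOptimizer can be mimicked in Python with lazy evaluation (an illustrative sketch, not Pig internals; the file name is made up): because records are produced on demand, applying the limit right after the load means only the first few records are ever read.

from itertools import islice

def load(path):
    # Lazily yield one record at a time instead of materializing the data set
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

# Limit applied directly after load: only 10 records are read from disk;
# the rest of the file is never processed.
first_ten = list(islice(load('input.txt'), 10))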
Source: analyticsvidhya.
Source: Medium
1. Interactive Mode.
2. Batch Mode.
➢ Once you get the grunt prompt, you can type the Pig Latin statement as
shown below.
➢ Syntax
✓ pig filename
1. Pig Latin statements are the basic constructs used to process data in Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions.
5. Pig Latin statements end with a semicolon (;).
Reading Data
To read data in Pig, we need to move the data from the local file system into Hadoop. Let's see the steps:
Step 1: Create a file using the cat command in the local file system.
Step 2: Transfer the file into HDFS using the put command.
Step 3: Read the data from HDFS into Pig using the load command.
Relation − We have to provide the name of the relation into which the file content will be loaded.
Input file path − We have to provide the path of the HDFS directory where the file is stored.
load_function − Apache Pig provides a variety of load functions like BinStorage, JsonLoader, PigStorage and TextLoader; we need to choose one from this set. PigStorage is the most commonly used function, as it is suited to loading structured text files.
Schema − We need to define the schema of the data in the passed file in parentheses.
Source: Edureka.
Piggy Bank
2. When there is a time constraint because Pig is slower than MapReduce jobs.
Source: geeksforgeeks
Introduction to Hive
➢ What is Hive?
➢ Hive Architecture
➢ Hive Data Types
➢ Primitive Data Types
➢ Collection Data Types
➢ Hive File Format
➢ Text File
➢ Sequential File
➢ RCFile (Record Columnar File)
➢ SerDe
• ETL and Data warehousing tool developed on top of Hadoop Distributed File
System (HDFS)
• Provides an SQL-like interface between the user and the Hadoop distributed
file system (HDFS)
• Hive makes job easy for performing operations like
✓ Data encapsulation
✓ Ad-hoc queries
✓ Analysis of huge datasets
✓ Support for data query and analysis using SQL
✓ Processing of structured and semi-structured data
• HQL translates SQL-like queries into MapReduce jobs, much as Pig Latin does, and uses HDFS for storage
✓ No need to learn Java to work with Hadoop Hive
Source : datadog
Source : davidscoding
Applications of Apache Hive
You can use Apache Hive mainly for
✓Data Warehousing
✓Ad-hoc Analysis
Source: geeksforgeeks
History of Hive and Recent Releases of Hive
RPC Protocol
Source : guru99
Embedded metastore
✓ is a simple way to get started with Hive
✓ only one embedded Derby database can access the database files on disk at any
one time
✓ means you can only have one Hive session open at a time that shares the same
metastore
Local Metastore
✓ The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database
✓ Metastore service still runs in the same process as the Hive service, but connects to
a database running in a separate process
✓ either on the same machine or on a remote machine.
Remote metastore
✓ One or more metastore servers run in separate processes to the Hive service
✓ This brings better manageability and security, since the database tier can be
completely firewalled off
Major Components of Hive Architecture (4)
Driver
✓ Receives HiveQL statements and works like a controller
✓ Monitors the progress and life cycle of various executions by creating sessions
✓ Stores the metadata that is generated while executing the HiveQL statement
✓ Collects the data points and query results when the reduce operation is completed by the MapReduce job
Source : AnalyticsVidhya
Source : geeksforgeeks
Request Flow - Detailed
Explanation of the workflow: Hourly Log Data can be stored directly into HDFS and then data cleansing
is performed on the log file. Finally, Hive table(s) can be created to query the log file.
[Figure: Hive data model: Database → Table → Partition → Bucket]
• Databases
–The namespace for tables.
• Tables
–Set of records that have similar schema.
• Partitions
–Logical separations of data based on classification of the given information as per specific attributes. Once Hive has partitioned the data based on a specified key, it assembles the records into specific folders as the records are inserted (e.g., partition by year, country).
• Buckets (Clusters)
– Similar to partitions but uses hash function to segregate data and determines the cluster or bucket into which
the record should be placed
– “CLUSTERED BY (customer_id) INTO XX BUCKETS”;
DYNAMIC PARTITION: The user simply states the column on the basis of which partitioning will take place. Hive then creates partitions based on the unique values in that column.
• This can be avoided using Bucketing in which you can limit the number of buckets that will be created.
• A bucket is a file whereas a partition is a directory.
• Bucketing works well when the field has high cardinality (cardinality is the
number of values a column or field can have) and data is evenly distributed
among buckets.
• Partitioning works best when the cardinality of the partitioning field is not too
high.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, grade;
String Types
STRING
VARCHAR Only available starting with Hive 0.12.0
CHAR Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (‘) or double quotes (“)
Miscellaneous Types
BOOLEAN
BINARY Only available starting with Hive
Hive Datatypes (2)
• Text File
• Sequential file
• RCFile (Record Columnar File)
• The default file format is text file. In this format, each record is a line in the file. In a text file, different control characters are used as delimiters (e.g., “,”, ‘\t’, ^A (\001)).
• The supported text files are CSV and TSV. JSON or XML documents, too, can be stored as text files.
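For instance, a row stored in Hive's default text format uses \001 (^A) between fields; a small Python sketch (with a made-up record) shows how such a line splits:

# A made-up text-format row with ^A (\x01) as the field delimiter
row = '101\x01Alice\x01A'
fields = row.split('\x01')
print(fields)  # ['101', 'Alice', 'A']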
• Sequential files are flat files that store binary key–value pairs. They include compression support, which reduces the CPU and I/O requirements.
• RCFile stores the data in a column-oriented manner, so it is efficient for column-based queries.
• IF NOT EXISTS: It is an optional clause. The CREATE DATABASE statement with the IF NOT EXISTS clause creates a database only if it does not already exist. If a database with the same name already exists, the statement notifies the user and does not raise an error.
• We have not specified the location where the Hive database will be created.
• By default, all Hive databases are created under the default warehouse directory (set by the property hive.metastore.warehouse.dir) as /user/hive/warehouse/database_name.db.
• But if we want to specify our own location, then the optional LOCATION clause can be used.
➢ Show Databases
➢ Objective: To display a list of all databases.
To describe a database.
➢ Shows only DB name, comment, and DB directory.
To drop a database.
• The DROP DATABASE command in Hive is used to delete an existing database along with all its tables, partitions and data.