
BDA MTech-1 SEM 2024 QUESTIONS:

UNIT-I
1a) Define big data and explain how it differs from traditional data sets; discuss the
convergence of key trends that have led to the rise of big data. 8M
1b) Describe the role of unstructured data in big data analytics. Provide an example of how
unstructured data is used in one industry. 7M
OR
2a) What do you mean by linear and non-linear data structures? Specify whether sets come
under linear or non-linear structures and explain the various types of sets supported by Java. 7M
2b) What are the advantages of object serialization in Java? Explain serializing and
de-serializing an object with suitable examples. 8M
UNIT-II
3a) Discuss the architecture and data model of Cassandra. How does it differ from other NoSQL
databases? 8M
3b) Describe the process of creating and managing tables in Cassandra; include an example of
table creation and data manipulation. 7M
OR
4a) Explain the Hadoop Distributed File System architecture with a neat sketch. 8M
4b) Define DataNode. How does the NameNode tackle DataNode failure? 7M
UNIT-III
5a) What is the Hive metastore? Which classes are used by Hive to read and write HDFS
files? 7M
5b) Explain the following:
a) Logical joins b) Window functions 8M
OR
6a) Explain how Hive facilitates big data analytics; discuss its data types, file formats and
HiveQL. 8M
6b) How can we install Apache Hive on a system? Explain. 7M
UNIT-IV
7a) List and explain the important features of Hadoop. 7M
7b) Explain the architecture and building blocks of Hadoop. 8M
OR
8a) Draw a neat diagram and explain the components of the Apache Hive architecture. 8M
8b) Describe how Spark handles DataFrames and complex data types; include
an example of working with JSON data in Spark. 7M
UNIT-V
9a) Explain event time and stateful processing. 7M
9b) Discuss Structured Streaming. 8M
OR
10a) Define streaming and explain handling duplicates in a stream. 8M
10b) Explain Structured Streaming in action and transformations on streams, and how they are
useful in the real world. 7M
UNIT-I
1a) Define big data and explain how it differs from traditional data sets; discuss the
convergence of key trends that have led to the rise of big data. 8M

Answer:
Big data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing tools or methods.
The term is often characterized by the three Vs: Volume (the sheer size of the data),
Velocity (the speed at which data is generated and processed), and Variety (the
diverse types of data, including structured, semi-structured, and unstructured).
Here's a breakdown of the differences between big data and traditional data sets:
Volume:
Big Data: Involves datasets that are too large to be processed and analyzed by
traditional database systems. These datasets can range from terabytes to petabytes
and beyond.
Traditional Data: Typically involves smaller datasets that can be easily handled by
conventional database systems.
Velocity:
Big Data: Refers to the speed at which data is generated, collected, and processed.
This is crucial for real-time analytics and decision-making.
Traditional Data: Usually involves data that is generated and processed at a slower
pace compared to big data environments.
Variety:
Big Data: Encompasses a wide range of data types, including structured data (e.g.,
relational databases), semi-structured data (e.g., XML, JSON), and unstructured data
(e.g., text, images, videos).
Traditional Data: Primarily deals with structured data that fits neatly into tables and
relational databases.
Convergence of Key Trends:
The convergence of key trends has played a significant role in the rise of big data.
Several factors have contributed to this convergence:

1. The rise of machine learning


 Machine learning has been around for a while, but we are only now beginning to realize its true
potential.
 It’s no longer just about artificial intelligence; it’s about how computers can learn from their data
and make predictions on their own.
 It is the most crucial component of big data because it can process and analyse massive
amounts of data in a short period.
 It accomplishes this by employing algorithms that have been trained to recognise patterns in
your data and then use those patterns to predict what will happen next.
 Machine learning enables organizations to more easily identify patterns and detect anomalies in
large data sets and to support predictive analytics and other advanced data analysis capabilities.
 With the help of AI, companies are using their big data environments to provide deeper
customer support through intelligent and more personalized interactions without requiring
significant increases in customer support staff.
2. The need for better security
 Data breaches are more common than ever before, and there is no sign that they will stop
anytime soon. Organizations that want to stay ahead of the curve must invest heavily in security.
 Businesses place a high value on this topic because revealing sensitive customer data to the
public without their consent would harm their brand and jeopardise their ability to retain
customers.
3. More cloud adoption / cloud and hybrid cloud computing:
 Moving to the cloud can greatly benefit organizations because it allows them to reduce costs,
increase efficiency, and rely on outside services to address security concerns.
 To deal with the inexorable increase in data generation, organizations are spending more of their
resources storing this data in a range of cloud-based and hybrid cloud systems optimized for the
V's of big data.

 The move to cloud computing changed that dynamic.
 By shifting these responsibilities to cloud infrastructure providers such as AWS, Google,
Microsoft, Oracle, and IBM, organizations can obtain storage and compute capacity on demand
without having to maintain their own large and complex data centers.
 One of the most important big data trends is the continued push for cloud migration and a
reduction in reliance on on-premises data centres.
4. Advanced big data tools
 Organizations require advanced tools that leverage cognitive technologies such as Artificial
Intelligence and Machine Learning to facilitate big data management and help them gain more
insights to properly handle big data.
 Business intelligence software companies are investing heavily in their technology to provide
more powerful tools that will fundamentally change the way big data is handled.
5. Data lakes
 Data lakes are a new type of architecture that is revolutionising how businesses store and
analyse data.
 Organizations used to store their data in relational databases. The issue with this type of
storage is that it is too structured to store a variety of data types such as images, audio files,
video files, and so on.
 A data lake can store data in its native format and process any variety of it, without size limits.
Increased Data Generation: The proliferation of digital technologies, social
media, sensors, and IoT devices has led to an exponential increase in the generation
of data.
Advancements in Storage and Processing Technologies: Improved storage
solutions, such as distributed file systems and cloud storage, along with parallel
processing frameworks like Apache Hadoop and Apache Spark, have made it
feasible to store and process massive amounts of data.
Cost Reduction in Storage: The decreasing costs of storage have made it more
economical to store and retain large volumes of data for extended periods.

Open Source Technologies: The development and widespread adoption of open-source
technologies, such as Hadoop, Spark, and NoSQL databases, have provided scalable and
cost-effective solutions for big data processing.

Advanced Analytics and Machine Learning: The increasing demand for advanced
analytics and machine learning applications has driven the need for large and
diverse datasets to train and improve models.

Real-time Processing Requirements: The need for real-time or near-real-time data processing
has become crucial in various industries, leading to the development of technologies that can
handle streaming data.
In summary, the convergence of these trends has given rise to the era of big data,
where organizations can leverage massive and diverse datasets to gain valuable
insights, make informed decisions, and uncover patterns that were previously
challenging to discover with traditional data processing methods.

1b) Describe the role of unstructured data in big data analytics. Provide an
example of how unstructured data is used in one industry. 7M
Unstructured data plays a crucial role in big data analytics, contributing valuable insights
that complement the structured data traditionally used in analytics processes.
Unstructured data refers to information that doesn't fit neatly into a traditional relational
database or structured format. This type of data includes text, images, videos, social
media posts, emails, and other content that doesn't follow a predefined data model.

 Any form of data that has no proper structure or an unknown form is unstructured data. This
type of data is challenging to derive valuable insights from because of the raw nature of the
data.
 Unstructured data refers to the data that lacks any specific form or structure whatsoever.
This makes it very difficult and time-consuming to process and analyze unstructured data.
Email is an example of unstructured data. Structured and unstructured are two important
types of big data.
 Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of rules.
Its arrangement is unplanned and haphazard.
 Photos, videos, text documents, and log files can be generally considered unstructured data.
Even though the metadata accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.

 Additionally, Unstructured data is also known as “dark data” because it cannot be analyzed
without the proper software tools.
Role of Unstructured Data in Big Data Analytics:
Rich Information Source: Unstructured data contains rich information that is often
contextually significant. Textual data, for example, can include sentiments, opinions, and
nuances that are important for understanding customer feedback, social trends, or market
sentiment.
Diversity and Variety: Unstructured data contributes to the variety aspect of big data. It
encompasses a wide range of data formats, including audio, video, and text. This
diversity allows organizations to gain a more comprehensive view of their operations,
customers, and market conditions.
Real-world Context: Unstructured data often reflects real-world scenarios and human
interactions, providing a more holistic and realistic representation of the situations being
analyzed. This context is valuable for decision-making and gaining deeper insights.

Advanced Analytics and Machine Learning: Unstructured data is a key component in
advanced analytics and machine learning applications. Natural Language Processing
(NLP) techniques, image recognition, and sentiment analysis are examples of how
unstructured data is leveraged to extract meaningful information and patterns.

Example of Unstructured Data in Healthcare:


In the healthcare industry, unstructured data is abundant and can significantly impact
patient care, research, and operational efficiency. Electronic Health Records (EHRs)
often contain unstructured data in the form of clinical notes, medical reports, and
diagnostic images.
Scenario: Radiology Reports for Cancer Diagnosis
In the context of cancer diagnosis, radiology reports contain a wealth of unstructured
information. These reports describe the findings of medical imaging tests such as CT
scans, MRIs, and X-rays. Extracting valuable insights from these reports involves
analyzing the unstructured text to identify key information, such as tumor location, size,
and characteristics.
How Unstructured Data is Used:
Natural Language Processing (NLP): NLP algorithms can be applied to analyze
radiology reports and extract relevant information about cancer diagnoses.
Pattern Recognition: Machine learning models trained on a large dataset of radiology
reports can learn to recognize patterns and trends indicative of specific types of cancer.
Decision Support Systems: Insights derived from unstructured data can be integrated into
decision support systems, assisting healthcare professionals in making more informed and
timely decisions about patient care and treatment plans.
By harnessing unstructured data in this way, the healthcare industry can improve the
accuracy and efficiency of cancer diagnoses, leading to better patient outcomes and
advancements in medical research. This example illustrates how unstructured data
contributes to the broader goal of leveraging big data analytics to enhance decision-making
in various industries.
OR
2a) What do you mean by linear and non-linear data structures? Specify whether
sets come under linear or non-linear structures and explain the various
types of sets supported by Java. 7M
Linear and Non-linear Data Structures:
Linear and non-linear data structures refer to the ways in which data elements are organized
and connected within a data structure.
Linear Data Structures: Elements are arranged in a linear or sequential order.
Each element has a unique predecessor and successor, except for the first and last elements.
Examples include arrays, linked lists, stacks, and queues.
Non-linear Data Structures:
Elements are not arranged in a sequential order.
Each element can have multiple predecessors and successors, forming a hierarchical or
interconnected structure.
Examples include trees and graphs.
Java Sets:
In the context of Java programming and big data analytics, sets are a type of collection that
represents an unordered collection of unique elements. Java provides several set
implementations in the java.util package. These sets are not specifically categorized as
linear or non-linear structures in the context of big data analytics, as they are more related
to the organization of data within a programming language. However, sets can be used in
various big data scenarios for managing unique elements efficiently.
Here are some common types of sets supported by Java:
HashSet:
Unordered collection of unique elements.
Implements the Set interface.
Uses a hash table for storage, providing constant-time complexity for basic operations
(add, remove, contains).
LinkedHashSet:
Similar to HashSet but maintains the order of insertion.
Implements the Set interface and also extends the HashSet class.
TreeSet:
Implements the Set interface and navigable interface.
Maintains elements in sorted order (natural order or according to a specified comparator).
Uses a Red-Black tree for storage, providing log(n) time complexity for basic operations.

EnumSet:
Specialized implementation for sets where elements are enum constants.
Highly efficient and compact representation of enum values.
BitSet:
Represents a set of bits or flags.
Used for efficient manipulation of sets of flags or boolean values.
Implements the Set interface.
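
To make the differences concrete, here is a minimal, self-contained Java sketch (class and
element names are illustrative, not from the syllabus) contrasting HashSet, LinkedHashSet
and TreeSet:

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetDemo {
    public static void main(String[] args) {
        // HashSet: unordered, silently ignores duplicate elements
        Set<String> hash = new HashSet<>();
        hash.add("spark");
        hash.add("hive");
        hash.add("spark");                 // duplicate, not stored again
        System.out.println(hash.size());   // prints 2

        // LinkedHashSet: preserves insertion order
        Set<String> linked = new LinkedHashSet<>();
        linked.add("spark");
        linked.add("hive");
        System.out.println(linked);        // [spark, hive]

        // TreeSet: keeps elements in sorted (natural) order
        Set<String> tree = new TreeSet<>();
        tree.add("spark");
        tree.add("hive");
        System.out.println(tree);          // [hive, spark]
    }
}
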
Usage in Big Data Analytics:
In big data analytics, sets can be used to handle unique identifiers, filter out duplicate
records, and manage distinct values efficiently.
The choice of a specific set implementation depends on the requirements of the analytics
process, such as the need for ordering, fast membership checks, or memory efficiency.
While sets themselves may not be directly tied to the linear or non-linear distinction, the
algorithms and data structures used in big data analytics (e.g., graphs, trees) may exhibit
characteristics of linear or non-linear organization depending on the nature of the data and
the analytical tasks at hand.

2b) What are the advantages of object serialization in Java? Explain
serializing and de-serializing an object with suitable examples. 8M
Object serialization in Java continues to offer advantages similar to those in general
programming. However, its application in big data scenarios often revolves around
distributed computing environments, data storage, and data interchange between different
systems. Here are the advantages and an example of serializing and deserializing objects in
the context of big data analytics:
Advantages of Object Serialization in Big Data Analytics:
Distributed Data Processing:
Big data analytics often involve distributed computing environments where data is
processed across multiple nodes. Serialization facilitates the efficient transmission of
objects between nodes, enabling seamless communication and collaboration.
Data Storage:Serialized objects can be stored in various data storage solutions, including
distributed file systems like Hadoop Distributed File System (HDFS) or cloud storage.
This allows for the preservation of complex data structures and their states.
Interoperability:
In big data analytics, systems may use different programming languages or frameworks.
Object serialization provides a standardized format for data representation, promoting
interoperability between different technologies.
Efficient Data Transfer:
Serialized objects can be transmitted over networks more efficiently than raw, unstructured
data. This efficiency is crucial in scenarios where large volumes of data need to be
transferred between nodes in a distributed environment.
State Preservation:
Serialization enables the preservation of object states, which is valuable in applications
where the state of an object is critical for analysis. This is particularly relevant when
dealing with complex data structures and machine learning models.
Example: Serializing and Deserializing Objects in a Big Data Context:
Consider a scenario where you have a complex data structure representing a machine learning
model, and you want to distribute this model across a cluster for parallel processing
import java.io.*;

public class MLModel implements Serializable {

    private String modelName;
    private double[] coefficients;

    public MLModel(String modelName, double[] coefficients) {
        this.modelName = modelName;
        this.coefficients = coefficients;
    }

    public void train(double[][] trainingData, double[] labels) {
        // Training logic here
    }

    public double predict(double[] input) {
        // Prediction logic here
        return 0.0;
    }

    public static void main(String[] args) {
        // Serialize the machine learning model
        serializeMLModel();

        // Deserialize the machine learning model
        MLModel deserializedModel = deserializeMLModel();
        System.out.println("Deserialized Model: " + deserializedModel.modelName);
    }

    private static void serializeMLModel() {
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("mlmodel.ser"))) {
            double[] initialCoefficients = {0.5, -0.3, 0.8};
            MLModel model = new MLModel("LinearRegression", initialCoefficients);
            oos.writeObject(model);
            System.out.println("Machine Learning Model serialized successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static MLModel deserializeMLModel() {
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream("mlmodel.ser"))) {
            return (MLModel) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            e.printStackTrace();
            return null;
        }
    }
}
In this example:
The MLModel class represents a simple machine learning model with serialization
capabilities.
The serializeMLModel method serializes an instance of the MLModel class and writes it to
a file.
The deserializeMLModel method reads the serialized model from the file and returns a
new instance.
In big data analytics, this kind of serialization allows for the distribution of machine
learning models across a cluster of nodes, enabling parallel processing and efficient
utilization of computational resources. It also facilitates the interchange of models between
different components of a big data ecosystem.
Unit II
3a) Discuss the architecture and data model of Cassandra. How does it differ from other
NoSQL databases? 8M
Cassandra Architecture:
Cassandra is a distributed NoSQL database designed for scalability, high availability, and
fault tolerance. Its architecture is decentralized and follows a peer-to-peer model, allowing
for linear scalability by adding more nodes to the cluster. Key components of the Cassandra
architecture include:
Node:
Nodes in a Cassandra cluster are distributed across multiple machines.
Each node is responsible for a portion of the data, and all nodes are equal.
Data Distribution:

Data is distributed across nodes using a partitioning strategy (e.g., random, Murmur3).
Each node is responsible for a range of data, and a consistent hashing mechanism helps
route requests to the appropriate node.
Gossip Protocol:
Nodes communicate with each other using a gossip protocol to share information about
the cluster's state.
Gossip ensures that every node eventually learns about changes in the cluster, supporting
decentralized coordination.
Replication:
Cassandra replicates data across multiple nodes to ensure fault tolerance and high
availability.
Replication factor and strategy are configurable, allowing users to define how many copies
of data are stored and on which nodes.
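
As a hedged illustration of how the replication factor and strategy are configured (the keyspace
and data-centre names here are hypothetical), a keyspace declares them at creation time:

-- Keep 3 replicas of every row in data centre 'dc1' and 2 in 'dc2'
CREATE KEYSPACE IF NOT EXISTS analytics
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
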
Data Model:
Cassandra's data model is based on column-family, resembling a tabular structure.
Data is organized into keyspaces (similar to databases in traditional systems), which
contain column families (analogous to tables).
Commit Log and Memtable:
Write operations are first recorded in a commit log for durability.
Data is then written to an in-memory structure called a memtable, providing fast write
performance.
SSTables and Compaction:
Data from memtables is periodically flushed to on-disk storage in SSTables (Sorted String
Tables).
Compaction processes merge and discard obsolete data from SSTables to maintain
efficiency.
Read Path:
Cassandra supports fast read operations by using an index to locate data efficiently.
Read requests can be served from memory, SSTables, or a combination of both.
Cassandra Data Model:
Keyspace:
The outermost container for data in Cassandra. It defines the replication strategy and
contains column families.
Column Families (Tables):
A column family is a container for rows, similar to a table in relational databases.
Rows are identified by a primary key, and each row can have different columns.
Wide Rows:
Rows in Cassandra can be wide, allowing for dynamic column addition without altering
the table schema.
Useful for scenarios with evolving data structures.
CQL (Cassandra Query Language):
CQL is a SQL-like language for interacting with Cassandra databases.
Provides a familiar syntax for querying and manipulating data.
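
The sketch below (table and column names are hypothetical) shows how this model is expressed
in CQL: the partition key decides which node owns a row, while the clustering column orders the
cells inside each wide row.

-- sensor_id is the partition key; reading_time is a clustering column that
-- orders the readings stored within each sensor's partition
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id UUID,
    reading_time timestamp,
    temperature double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
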
Differences from Other NoSQL Databases in Big Data Analytics:
Cassandra's architecture and data model differentiate it from other NoSQL databases,
especially in the context of big data analytics:

Decentralized and Peer-to-Peer:
Cassandra's decentralized architecture is distinctive, and its peer-to-peer model contributes to
its scalability and fault tolerance, making it suitable for large-scale big data scenarios.
Linear Scalability:
Cassandra's linear scalability allows it to efficiently handle growing workloads by adding more
nodes to the cluster. This is crucial in big data analytics, where the volume of data can be massive.
Tunable Consistency:
Cassandra provides tunable consistency, enabling users to balance between consistency and
availability based on specific analytics requirements.
Write-Optimized:
Cassandra is well-suited for write-intensive workloads, making it favorable for scenarios
involving continuous data ingestion and real-time analytics.
Wide-Column Model:
The wide-column data model allows for flexible schema design, accommodating the
evolving nature of big data analytics requirements.
In big data analytics, where distributed computing, scalability, and fault tolerance are
critical, Cassandra's architecture and data model make it a suitable choice for applications
that demand high write throughput, fast query performance, and continuous availability.
It is often used in conjunction with other big data technologies to build robust and scalable
data processing pipelines.

3b) Describe the process of creating and managing tables in Cassandra; include
an example of table creation and data manipulation. 7M

Creating and managing tables in Cassandra involves defining keyspaces, specifying replication
strategies, creating column families (tables), and performing data manipulation using the Cassandra
Query Language (CQL). Below is a step-by-step guide along with an example of table creation and
data manipulation in Cassandra.
1. Connect to Cassandra:
Use a CQL shell (cqlsh) or connect programmatically to a Cassandra cluster.
2. Create a Keyspace:
A keyspace is the top-level container for tables in Cassandra. It defines the replication strategy and
other configuration options.

CREATE KEYSPACE IF NOT EXISTS mykeyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
3. Use the Keyspace:
Switch to the keyspace you created.
USE mykeyspace;
4. Create a Table:
Define the structure of your table, including columns, primary key, and any other relevant options.

CREATE TABLE IF NOT EXISTS users (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT,
    age INT
);

5. Insert Data into the Table:
Add records to your table.
INSERT INTO users (user_id, username, email, age) VALUES (
    uuid(), 'john_doe', 'john.doe@example.com', 25
);

INSERT INTO users (user_id, username, email, age) VALUES (
    uuid(), 'jane_smith', 'jane.smith@example.com', 30
);

6. Query Data:
Retrieve data from the table.
SELECT * FROM users;
7. Update Data
Modify existing records.
UPDATE users SET age = 26 WHERE user_id = <UUID>;
8. Delete Data:
Remove records from the table.
DELETE FROM users WHERE user_id = <UUID>;

Example: Table Creation and Data Manipulation in Big Data Analytics

Suppose you are managing user data in a big data analytics application. You can create a table to store user
information and manipulate the data as follows:

-- Step 2: Create a Keyspace
CREATE KEYSPACE IF NOT EXISTS bigdata
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Step 3: Use the Keyspace
USE bigdata;

-- Step 4: Create a Table
CREATE TABLE IF NOT EXISTS user_data (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT,
    age INT
);

-- Step 5: Insert Data into the Table
INSERT INTO user_data (user_id, username, email, age) VALUES (
    uuid(), 'john_doe', 'john.doe@example.com', 25
);

INSERT INTO user_data (user_id, username, email, age) VALUES (
    uuid(), 'jane_smith', 'jane.smith@example.com', 30
);

-- Step 6: Query Data
SELECT * FROM user_data;

-- Step 7: Update Data
UPDATE user_data SET age = 26 WHERE user_id = <UUID>;

-- Step 8: Delete Data
DELETE FROM user_data WHERE user_id = <UUID>;
In this example:
 We create a keyspace named 'bigdata'.
 We define a table named 'user_data' with columns for user ID, username, email, and age.
 We insert two records into the table.
 We query the data, update the age of a user, and then delete a user's record.
This example demonstrates the basic steps of creating and managing tables in Cassandra
using CQL. In a big data analytics context, such tables can store and process vast amounts of
user data efficiently, and the operations can be scaled horizontally as the data grows.

OR
4a) Explain the Hadoop Distributed File System architecture with a neat sketch. 8M
Architecture of Hadoop distributed file system (HDFS):
The Hadoop Distributed File System was developed using a distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.

HDFS holds very large amounts of data and provides easy access. To store such huge
data, the files are stored across multiple machines. These files are stored in a redundant
fashion to rescue the system from possible data losses in case of failure. HDFS also
makes applications available for parallel processing.

Features of HDFS

• It is suitable for the distributed storage and processing.


• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of namenode and datanode help users to easily check the status of cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture

Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.
Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware. The
system having the namenode acts as the master server and it does the following tasks −
• Manages the file system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as renaming, closing, and opening files and
directories.

Datanode

The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode.
These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Fault Tolerance:
1. Definition

Fault tolerance in HDFS refers to the ability of the system to keep working under
unfavorable conditions and to how it handles such situations. HDFS is highly fault
tolerant. It handles faults through replica creation: replicas of the user's data are
created on different machines in the HDFS cluster, so whenever any machine in the
cluster goes down, the data can still be accessed from another machine that holds
the same copy of the data.

HDFS also maintains the replication factor by creating replicas of the data on other
available machines in the cluster if one machine suddenly fails.

2. How is HDFS fault tolerance achieved?

HDFS achieves fault tolerance through replication. In HDFS, whenever a file is stored
by the user, it is first divided into blocks, and these blocks of data are distributed
across the different machines in the HDFS cluster. After this, a replica of each block
is created on other machines in the cluster. By default, HDFS creates three copies of
each block on machines in the cluster. So if for some reason any machine in the HDFS
cluster goes down or fails, the user can still easily access that data from the other
machines in the cluster that hold a replica of the file. HDFS thereby provides a fast
file read and write mechanism, due to its distributed storage.

3. Example of HDFS Fault Tolerance

Suppose there is a user file named FILE. This file is divided into blocks, say B1, B2,
and B3, and sent to the master. The master then sends these blocks to the slaves,
say S1, S2, and S3. The slaves create replicas of these blocks on the other slaves in
the cluster, say S4, S5, and S6, so that multiple copies of each block exist: S1 holds
B1 and B2, S2 holds B2 and B3, S3 holds B3 and B1, S4 holds B2 and B3, S5 holds B3
and B1, and S6 holds B1 and B2. Now suppose slave S4 crashes, so the data present
on S4 (B2 and B3) becomes unavailable. There is still no data loss, because B2 and B3
can be read from another slave such as S2. Hence, even in unfavorable conditions the
data is not lost, and HDFS is highly fault tolerant.

Data Replication:
● HDFS is designed to store very large files across machines in a large cluster.

● All blocks in the file except the last are of the same size.

● Blocks are replicated for fault tolerance.

● Block size and replicas are configurable per file.

● The Namenode receives a Heartbeat and a BlockReport from each


DataNode in the cluster.

● BlockReport contains all the blocks on a Datanode.
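
The block and replication behaviour described above can also be exercised from a client
program. Below is a minimal, hedged Java sketch (the NameNode address and file path are
assumptions) that writes a file through the HDFS FileSystem API and requests three replicas
of its blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // NameNode address (assumed)

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");

        // The client streams bytes to DataNodes; the NameNode only records block metadata.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello hdfs\n");
        }

        // Ask HDFS to keep 3 replicas of this file's blocks.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}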

Replica Placement:

● The placement of the replicas is critical to HDFS reliability and performance.


● Optimizing replica placement distinguishes HDFS from other distributed file systems.

● Rack-aware replica placement: the goal is to improve reliability, availability and
network bandwidth utilization.

● A cluster spans many racks, and communication between racks goes through switches.

● Network bandwidth between machines on the same rack is greater than between
machines in different racks.

● The Namenode determines the rack id for each DataNode.

● Placing every replica on a unique rack is simple but non-optimal, because writes
become expensive (with the default replication factor of 3).

● Instead, replicas are placed: one on a node in the local rack, one on a different
node in the local rack, and one on a node in a different rack.

● As a result, one third of the replicas are on one node, two thirds of the replicas are on
one rack, and the remaining third are distributed evenly across the remaining racks.

Replica Selection:

● For READ operations, HDFS tries to minimize bandwidth consumption and latency.

● If there is a replica on the reader's node, then that replica is preferred.

● An HDFS cluster may span multiple data centers: a replica in the local data
center is preferred over a remote one.
High availability

The loss of NameNodes can crash the cluster in both Hadoop 1.x as well as
Hadoop 2.x. In Hadoop 1.x, there was no easy way to recover, whereas Hadoop
2.x introduced high availability (active-passive setup) to help recover from
NameNode failures.

The following diagram shows how high availability works:


In Hadoop 3.x you can have two passive NameNodes along with the active
node, as well as five JournalNodes to assist with recovery from catastrophic
failures:
NameNode machines: The machines on which you run the active and standby
NameNodes. They should have equivalent hardware to each other and to what
would be used in a non-HA ...
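
As a hedged sketch of what such an active/standby setup looks like in configuration (the
nameservice id, host names and ports below are illustrative, not prescriptive), hdfs-site.xml
typically declares the nameservice, its NameNodes, and the shared JournalNode edits directory:

<!-- Illustrative hdfs-site.xml fragment for NameNode HA -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2:8020</value>
</property>
<property>
  <!-- Quorum of JournalNodes through which both NameNodes share edit logs -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>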

4b) Define DataNode. How does the NameNode tackle DataNode failure? 7M
DataNode in Hadoop Distributed File System (HDFS):

In Hadoop Distributed File System (HDFS), a DataNode is a component that runs on each
individual machine (node) in the Hadoop cluster. Its primary responsibility is to store and
manage the actual data blocks that make up the files stored in HDFS. The DataNode is
responsible for performing read and write operations, as well as managing the replication of data
blocks across the cluster for fault tolerance.
Key responsibilities of a DataNode include:
Storage: Storing and managing data blocks on the local file system of the node it resides on.
Replication: Replicating data blocks to other DataNodes to ensure fault tolerance and data
availability. The default replication factor is three, meaning each block is stored on three
different DataNodes.
Heartbeat and Block Report: Periodically sending heartbeat signals to the NameNode to
confirm its availability. Additionally, sending block reports to provide information about the
list of blocks it is currently storing.
NameNode's Handling of DataNode Failure:
Since the NameNode manages the metadata and namespace of the HDFS, it needs to be
aware of the health and status of each DataNode in the cluster. When a DataNode fails or
becomes unreachable, the NameNode employs several mechanisms to handle the situation:
Heartbeat Monitoring:
The NameNode expects regular heartbeat signals from each DataNode.
If the NameNode does not receive a heartbeat within a specified time period, it marks the
DataNode as dead or unreliable.
Block Report: DataNodes periodically send block reports to the NameNode, listing all the
blocks they are currently storing.
The NameNode uses this information to track the health and status of each block and identify
missing or corrupt blocks.
Replication Factor Maintenance:
If a DataNode fails or is marked as unreliable, the NameNode takes corrective actions to
maintain the desired replication factor for each block.
It schedules the replication of the missing or under-replicated blocks to other healthy
DataNodes.
Decommissioning:
If a DataNode is identified as consistently failing or unreliable, it may be decommissioned by
the administrator.
Decommissioning involves removing the node from the active set of DataNodes, preventing
it from receiving new blocks. Existing blocks are replicated to other nodes.
Block Re-replication:
The NameNode continuously monitors the replication factor of each block.
If the replication factor falls below the desired value due to DataNode failures, the
NameNode triggers re-replication to ensure fault tolerance.
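
For illustration, the heartbeat cadence and the recheck interval that feed into the dead-node
timeout are configurable in hdfs-site.xml; the values below mirror common defaults but may
differ between Hadoop versions, so treat them as an assumption rather than a fixed specification:

<!-- Illustrative hdfs-site.xml values governing how quickly a dead DataNode is detected -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>          <!-- seconds between DataNode heartbeats -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>     <!-- milliseconds; used when computing the dead-node timeout -->
</property>
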
By actively monitoring heartbeats, block reports, and taking corrective actions, the
NameNode ensures the health and availability of the DataNodes in the HDFS cluster. This
proactive approach helps maintain fault tolerance and data reliability in the face of individual
DataNode failures in a distributed environment.
UNIT –III

5a) What is the Hive metastore? Which classes are used by Hive to read and
write HDFS files? 7M
Hive Metastore in Big Data Analytics:
In big data analytics, the Hive Metastore is a critical component that facilitates metadata
management for large-scale data processing using Apache Hive. It allows users to define
and manage tables, databases, and associated metadata in a relational database, enabling
efficient querying and processing of data stored in Hadoop Distributed File System
(HDFS) or other storage systems. The separation of metadata from data storage is essential
for scalability and better integration with various data processing tools and frameworks.
The Hive Metastore is responsible for managing metadata related to Hive tables, including
their schemas, partitions, and storage location. It stores this metadata in a relational
database and allows Hive to decouple the storage of metadata from the data itself,
facilitating metadata management and enabling better integration with other tools.
Key functions of the Hive Metastore include:
Storing metadata about Hive tables, databases, partitions, and column statistics.
Managing the schema and structure of Hive tables.
Facilitating access to metadata for query planning and execution.
Enabling compatibility with various storage formats and systems.

Classes Used by Hive to Read and Write HDFS Files:


Hive uses several classes to interact with HDFS for reading and writing data. Here are some
key classes involved in these processes:
InputFormat and OutputFormat Classes:
org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapred.OutputFormat
classes define how data is read from and written to HDFS in the MapReduce framework.
Examples include TextInputFormat for plain text, SequenceFileInputFormat for Hadoop
sequence files, and corresponding output formats.
SerDe (Serializer/Deserializer) Classes:
SerDe classes define how Hive tables' data is serialized for storage and deserialized for
processing during queries.
Examples include org.apache.hadoop.hive.serde2.avro.AvroSerDe for Avro format and
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe for simple text-based serialization.
StorageHandler Interface:
The org.apache.hadoop.hive.ql.metadata.StorageHandler interface is implemented by custom
storage handlers, defining how data is stored and retrieved from a particular storage system.
It includes methods for initializing tables, obtaining input and output formats, and working
with SerDe.
HiveStorageHandler Classes:
Hive includes built-in storage handlers for various storage formats, such as
org.apache.hadoop.hive.ql.io.orc.OrcStorageHandler for ORC (Optimized Row Columnar)
format and org.apache.hadoop.hive.hbase.HBaseStorageHandler for Apache HBase.
HiveFileFormatUtils:
The org.apache.hadoop.hive.ql.io.HiveFileFormatUtils class provides utility methods for
determining the default file format based on table properties.

It is used to infer the file format when the format is not explicitly specified.
HiveMetaStoreClient:
The org.apache.hadoop.hive.metastore.HiveMetaStoreClient class is used to interact with the
Hive Metastore service programmatically.
It provides methods for managing metadata, including creating and altering tables, databases,
and partitions.
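
A minimal Java sketch of the last of these, assuming a metastore service reachable at
thrift://localhost:9083 (the URI and database name are assumptions), which lists each table
together with the HDFS location and input format recorded in its metadata:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreDemo {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.set("hive.metastore.uris", "thrift://localhost:9083");  // assumed metastore location

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        for (String tableName : client.getAllTables("default")) {
            Table t = client.getTable("default", tableName);
            // The storage descriptor records where the data lives and which
            // input format Hive will use to read it from HDFS.
            System.out.println(tableName
                    + " -> " + t.getSd().getLocation()
                    + " (" + t.getSd().getInputFormat() + ")");
        }
        client.close();
    }
}
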
In big data analytics workflows, these classes play a crucial role in managing the interaction
between Hive and HDFS. They handle the intricacies of reading and writing data in different
formats, enabling efficient data processing, querying, and analysis across large-scale
distributed datasets. The Hive Metastore ensures that metadata about tables and databases is
well-managed, providing a foundation for the integration of Hive with other big data tools
and analytics frameworks.

5b) Explain the following


a) Logical Joins b) Window Functions 8M

Answer:
Logical joins:

Hive Join: HiveQL SELECT Joins and the Types of Join in Hive

1. Apache Hive Join – Objective
In Apache Hive, to combine specific fields from two tables using values common to each, we
use the Hive Join (HiveQL SELECT ... JOIN) query. To implement it we need to know the join
syntax, so this section covers the syntax of joins in Hive along with an example. There are
several types of Hive join: inner join, left outer join, right outer join, and full outer join.
2. Apache Hive Join – HiveQL Select Joins Query
Basically, for combining specific fields from two tables by using values common
to each one we use Hive JOIN clause. In other words, to combine records from
two or more tables in the database we use JOIN clause. However, it is more or
less similar to SQL JOIN. Also, we use it to combine rows from multiple tables.
Moreover, there are some points we need to observe about Hive Join:
• In joins, only equality conditions are allowed.
• However, more than two tables can be joined in the same query.
• LEFT, RIGHT and FULL OUTER joins exist in order to offer more control over the ON
clause for rows that have no match.
• Also, note that Hive joins are not commutative.
• Joins are left-associative, whether they are LEFT or RIGHT joins.

1. Apache Hive Join Syntax

Following is the syntax of a Hive join in the HiveQL SELECT clause:

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]


2. Example of Join in Hive
Example of Hive Join – HiveQL Select Clause.
However, to understand well let’s suppose the following table named CUSTOMERS.

Table.1 Hive Join Example


ID Name Age Address Salary

1 Ross 32 Ahmedabad 2000

2 Rachel 25 Delhi 1500

3 Chandler 23 Kota 2000

4 Monika 25 Mumbai 6500

5 Mike 27 Bhopal 8500

6 Phoebe 22 MP 4500

7 Joey 24 Indore 10000

Also, suppose another table ORDERS as follows:


Table.2 – Hive Join Example
OID Date Customer_ID Amount

102 2016-10-08 00:00:00 3 3000

100 2016-10-08 00:00:00 3 1500

101 2016-11-20 00:00:00 2 1560

103 2015-05-20 00:00:00 4 2060
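
For context, here is a hedged sketch of DDL that could back these two sample tables; the
column types and the comma delimiter are assumptions inferred from the columns used in the
queries below, and `date` is backquoted because DATE is a reserved word in recent Hive versions:

-- Plain text storage by default; adjust formats as needed
CREATE TABLE customers (
  id INT, name STRING, age INT, address STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE orders (
  oid INT, `date` TIMESTAMP, customer_id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';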

3. Types of Joins in Hive

Basically, there are 4 types of Hive join:
1. Inner join in Hive
2. Left Outer Join in Hive
3. Right Outer Join in Hive
4. Full Outer Join in Hive
So, let’s discuss each Hive join in detail.
1. Inner Join
Basically, to combine and retrieve the records from multiple tables we use the Hive
JOIN clause. In HiveQL, JOIN is the same as INNER JOIN in SQL. The JOIN condition is
raised using the primary keys and foreign keys of the tables.
Furthermore, the below query joins the CUSTOMERS and ORDERS tables and then
retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN
ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Moreover, we get to see the following response, on the successful execution of the query:
Table.3 – Inner Join in Hive
ID Name Age Amount

3 Chandler 23 1300

3 Chandler 23 1500

2 Rachel 25 1560
4 Monika 25 2060

2. Left Outer Join


On defining HiveQL Left Outer Join, even if there are no matches in the right table it
returns all the rows from the left table. To be more specific, even if the ON clause
matches 0 (zero) records in the right table, then also this Hive JOIN still returns a row in
the result. Although, it returns with NULL in each column from the right table.
In addition, it returns all the values from the left table. Also, the matched values from the
right table, or NULL in case of no matching JOIN predicate.
However, the below query shows LEFT OUTER JOIN between CUSTOMER as well as
ORDER tables:
1. hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
2. FROM CUSTOMERS c
3. LEFT OUTER JOIN ORDERS o
4. ON (c.ID = o.CUSTOMER_ID);

Moreover, we get to see the following response, on the successful execution of the HiveQL
Select query:
Table.4 – Left Outer Join in Hive

ID Name Amount Date

1 Ross NULL NULL

2 Rachel 1560 2016-11-20 00:00:00

3 Chandler 3000 2016-10-08 00:00:00

3 Chandler 1500 2016-10-08 00:00:00


4 Monika 2060 2015-05-20 00:00:00

5 Mike NULL NULL

6 Phoebe NULL NULL

7 Joey NULL NULL


3. Right Outer Join
Basically, even if there are no matches in the left table, HiveQL Right Outer Join returns
all the rows from the right table. To be more specific, even if the ON clause matches 0
(zero) records in the left table, then also this Hive JOIN still returns a row in the result.
Although, it returns with NULL in each column from the left table
In addition, it returns all the values from the right table. Also, the matched values from
the left table or NULL in case of no matching join predicate.
However, the below query shows RIGHT OUTER JOIN between the CUSTOMER as
well as ORDER tables
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS
c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
Moreover, we get to see the following response, on the successful execution of the
HiveQL Select query:
Table.5 – Right Outer Join in Hive
ID Name Amount Date

3 Chandler 1300 2016-10-08 00:00:00

3 Chandler 1500 2016-10-08 00:00:00

2 Rachel 1560 2016-11-20 00:00:00

4 Monika 2060 2015-05-20 00:00:00

4.Full Outer Join


The major purpose of this HiveQL Full outer Join is it combines the records of both the left
and the right outer tables which fulfills the Hive JOIN condition. Moreover, this joined table
contains either all the records from both the tables or fills in NULL values for missing
matches on either side.
However, the below query shows FULL OUTER JOIN between CUSTOMER as well as
ORDER tables:
1. hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
2. FROM CUSTOMERS c
3. FULL OUTER JOIN ORDERS o
4. ON (c.ID = o.CUSTOMER_ID);

Moreover, we get to see the following response, on the successful execution of the query:
Table.6 – Full Outer Join in Hive

ID Name Amount Date

1 Ross NULL NULL

2 Rachel 1560 2016-11-20 00:00:00

3 Chandler 3000 2016-10-08 00:00:00

3 Chandler 1500 2016-10-08 00:00:00

4 Monika 2060 2015-05-20 00:00:00

5 Mike NULL NULL

6 Phoebe NULL NULL

7 Joey NULL NULL


This was all about the HiveQL SELECT joins in Apache Hive.
4. Conclusion
As a result, we have seen what an Apache Hive join is and the possible types of join in
Hive (HiveQL SELECT).

Window functions:
Windowing Functions In Hive

Windowing allows you to create a window on a set of data further allowing


aggregation surrounding that data. Windowing in Hive is introduced from Hive
0.11. In this blog, we will be giving a demo on the windowing functions available in

Hive.
Windowing in Hive includes the following functions
• Lead
 The number of rows to lead can optionally be specified. If the number of rows to
lead is not specified, the lead is one row.
 Returns null when the lead for the current row extends beyond the end of the
window
• Lag
 The number of rows to lag can optionally be specified. If the number of rows to lag
is not specified, the lag is one row.
 Returns null when the lag for the current row extends before the beginning of the
window.
• FIRST_VALUE
• LAST_VALUE

The OVER clause


● OVER with standard aggregates:
 COUNT

 SUM
 MIN
 MAX
 AVG
OVER with a PARTITION BY statement with one or more partitioning columns.
• OVER with PARTITION BY and ORDER BY with one or more partitioning and/or
ordering columns.
Analytics functions
 RANK
 ROW_NUMBER
 DENSE_RANK
 CUME_DIST
 PERCENT_RANK
 NTILE
To give you a brief idea of these windowing functions in Hive, we will be using stock
market data loaded into a stocks table.

A table to hold this stock market data can be created, for example, as sketched below.
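
This is a hedged sketch only; the schema and delimiter are assumptions inferred from the
columns referenced in the queries that follow.

CREATE DATABASE IF NOT EXISTS acadgild;

-- Column names mirror those used in the window-function queries below.
CREATE TABLE IF NOT EXISTS acadgild.stocks (
  ticker STRING,
  date_ STRING,
  open DOUBLE,
  high DOUBLE,
  low DOUBLE,
  close DOUBLE,
  volume_for_the_day BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
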
Let us dive deeper into the window functions in Hive.
Lag
This function returns the values of the previous row. You can specify an integer offset
which designates the row position else it will take the default integer offset as 1.
Here is the sample function for lag

select ticker,date_,close,lag(close,1) over(partition by ticker) as yesterday_price from


acadgild.stocks

Here using lag we can display the previous day's closing price of the ticker. Lag is to be
used with the OVER function; inside the OVER clause you can use PARTITION BY or ORDER BY clauses.
In the query output, you can see the closing price of the stock for the day and the
previous day's price.

Lead
This function returns the values from the following rows. You can specify an integer offset
which designates the row position else it will take the default integer offset as 1.
Here is the sample function for lead
Now using the lead function, we will find whether the following day's closing price is
higher or lesser than today's; that can be done as follows.

select ticker,date_,close,case(lead(close,1) over(partition by ticker)-close)>0 when true then


"higher" when false then "lesser" end as Changes from acadgild.stocks

In the query output, you can see the result.


FIRST_VALUE

It returns the value of the first row from that window. With the below query, you can see the
first row high price of the ticker for all the days.
select ticker,first_value(high) over(partition by ticker) as first_high from acadgild.stocks

LAST_VALUE

It is the reverse of FIRST_VALUE. It returns the value of the last row from that
window. With the below query, you can see the last row high price value of the ticker for
all the days.

select ticker,last_value(high) over(partition by ticker) as last_high from acadgild.stocks

Let us now see the usage of the aggregate function using Over.

Count

It returns the count of all the values for the expression written in the over clause. From
the below query, we can find the number of rows present for each ticker.
select ticker,count(ticker) over(partition by ticker) as cnt from acadgild.stocks

For each partition, the count of the ticker's rows is calculated; you can see the same in
the query output.

Sum

It returns the sum of all the values for the expression written in the over clause. From the
below query, we can find the sum of all the closing stock prices for that particular ticker.

select ticker,sum(close) over(partition by ticker) as total from acadgild.stocks

For each ticker, the sum of all the closing prices is calculated; you can see the same
in the query output.

Suppose you want the running total of volume_for_the_day across all the days for
every ticker; then you can do this with the below query.

select ticker,date_,volume_for_the_day,sum(volume_for_the_day) over(partition by ticker order


by date_) as running_total from acadgild.stocks
Finding the percentage of each row value

Now let's take a scenario where you need to find the percentage of volume_for_the_day
out of the total volume for that particular ticker; that can be done as follows.

select ticker,date_,volume_for_the_day,(volume_for_the_day*100/(sum(volume_for_the_day)
over(partition by ticker))) from acadgild.stocks
In the query output, you can see that the percentage contribution of each day's
volume is found based on the total volume for that ticker.

Min
It returns the minimum value of the column for the rows in that over clause. From
the below query, we can find the minimum closing stock price for each particular
ticker.

select ticker, min(close) over(partition by ticker) as minimum from acadgild.stocks

Max
It returns the maximum value of the column for the rows in that over clause. From
the below query, we can find the maximum closing stock price for each particular
ticker.

select ticker, max(close) over(partition by ticker) as maximum from acadgild.stocks


AVG

It returns the average value of the column for the rows that over clause returns.
From the below query, we can find the average closing stock price for each particular
ticker.

select ticker, avg(close) over(partition by ticker) as average from acadgild.stocks

Now let us work on some Analytic functions.


Rank

The rank function will return the rank of the values as per the result set of the over
clause. If two values are the same, it will give the same rank to both of them, and the
subsequent rank will then be skipped for the next value.
The below query will rank the closing prices of the stock for each ticker; the same
you can see in the query output.

select ticker,close,rank() over(partition by ticker order by close) as closing from acadgild.stocks


Row_number
Row number will return the continuous sequence of numbers for all the rows of
the result set of the over clause.
From the below query, you will get the ticker, closing price and its row number for
each ticker.

select ticker,close,row_number() over(partition by ticker order by close) as num from


acadgild.stocks

Dense_rank

It is the same as the rank() function, but the difference is that if any duplicate value is
present then the rank will not be skipped for the subsequent rows; each unique value
gets a rank in sequence.
The below query will rank the closing prices of the stock for each ticker; the same
you can see in the query output.

select ticker,close,dense_rank() over(partition by ticker order by close) as closing from


acadgild.stocks
Cume_dist

It returns the cumulative distribution of a value, as a result between 0 and 1. For example,
if the total number of records is 10, then for the 1st row the cume_dist will be 1/10,
for the second 2/10, and so on up to 10/10.
This cume_dist is calculated in accordance with the result set returned by the
over clause. The below query gives the cumulative distribution of each record for every
ticker.

select ticker,cume_dist() over(partition by ticker order by close) as cummulative from


acadgild.stocks
Percent_rank

It returns the percentage rank of each row within the result set of the over clause.
Percent_rank is calculated in accordance with the rank of the row, as follows:
(rank - 1) / (total_rows_in_group - 1). If the result set has only one row, then the
percent_rank will be 0.
The below query will calculate the percent_rank for every row in each partition, and
you can see the same in the query output.

select ticker,close,percent_rank() over(partition by ticker order by close) as closing from


acadgild.stocks

Ntile

It returns the bucket number of a particular value. For example, if you say NTILE(5),
it will create 5 buckets based on the result set of the over clause and then place the
first 20% of the records in the 1st bucket, and so on up to the 5th bucket.
The below query will create 5 buckets for every ticker; the first 20% of records for
every ticker will be in the 1st bucket, and so on.

select ticker,ntile(5) over(partition by ticker order by close ) as bucket from acadgild.stocks

In the query output, you can see that 5 buckets are created for every ticker: the
lowest 20% of closing prices fall in the first bucket, the next 20% in the second
bucket, and so on up to the 5th bucket, for all the tickers.

This is how we can perform windowing operations in Hive.

6a) Explain how Hive facilitates big data analytics; discuss its data
types, file formats and HiveQL. 8M
What is Hive?

Before looking at the Hive data types, we will first introduce Hive itself. Hive is a data
warehousing technology built on Hadoop, and Hadoop is the data storage and processing
segment of the big data platform. Hive holds its position for SQL-style data processing
techniques: like other SQL environments, Hive can be reached through SQL-like queries.
The major offerings of Hive are data analysis, ad-hoc querying and summarization of the
stored data; from a latency perspective, the queries take a greater amount of time.
Hive data types are classified into two categories:

• Primitive Data Types
• Collection Data Types

1. Primitive Data Types
Primitive data types are the basic, built-in types that each hold a single value. The important primitive data types are listed below:

Type        Size (bytes)              Example

TinyInt     1                         20
SmallInt    2                         20
Int         4                         20
BigInt      8                         20
Boolean     true/false                FALSE
Double      8                         10.2222
Float       4                         10.2222
String      sequence of characters    ABCD
Timestamp   integer/float/string      2/3/2012 12:34:56:1
Date        integer/float/string      2/3/2019

Hive data types are implemented using Java.

Ex: the Java int is used for implementing the Int data type here.

• Character arrays are not supported in Hive.
• Hive relies on delimiters to separate its fields; this, together with Hadoop, allows Hive to improve both write and read performance.
• Specifying the length of each column is not expected in a Hive table definition.
• String literals can be expressed within either double quotes (") or single quotes (').
• In newer versions of Hive, Varchar types are introduced; they take a length specifier (between 1 and 65535), which acts as the maximum length of the character string the column can hold. When a value exceeding this length is inserted, the rightmost characters of the value are truncated. Character length is determined by the number of code points contained in the character string.
• All integer literals (TINYINT, SMALLINT, BIGINT) are treated as INT by default; only when the value exceeds the INT range is it interpreted as a BIGINT or another appropriate type.
• Decimal literals provide exact values and greater precision for numeric values than the DOUBLE type: DECIMAL stores numeric values in their exact form, whereas DOUBLE does not store them exactly. (A small table sketch using VARCHAR and DECIMAL follows this list.)
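As a minimal sketch of the VARCHAR and DECIMAL bullets above (the table and column names are only illustrative), the following HiveQL creates a table whose code column is limited to 20 characters and whose price column is stored as an exact numeric value with two digits after the decimal point:

create table product_prices (
  product_code varchar(20),
  unit_price   decimal(10,2)
);

Inserting a product_code longer than 20 characters causes the value to be truncated on the right, while unit_price keeps its exact decimal value rather than a floating-point approximation.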
Date Value Casting Process
Date types can only be converted to/from Date, Timestamp, or String types. The valid casts to and from the Date type, and their results, are listed below.

Valid casts to/from Date type     Result

cast(date as date)                Same date value.

cast(timestamp as date)           The year/month/day of the timestamp is determined, based on the local timezone, and returned as a date value.

cast(string as date)              If the string is in the form 'YYYY-MM-DD', then a date value corresponding to that year/month/day is returned. If the string value does not match this format, then NULL is returned.

cast(date as timestamp)           A timestamp value is generated corresponding to midnight of the year/month/day of the date value, based on the local timezone.

cast(date as string)              The year/month/day represented by the date is formatted as a string in the form 'YYYY-MM-DD'.
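A short query illustrating three of these casts (assuming a Hive version that allows SELECT without a FROM clause, i.e. 0.13 or later; the literal values are only illustrative):

select cast('2019-02-03' as date)                              as str_to_date,
       cast(cast('2019-02-03 12:34:56' as timestamp) as date)  as ts_to_date,
       cast(cast('2019-02-03' as date) as string)              as date_to_str;

The first column returns the date 2019-02-03, the second truncates the timestamp to its date part, and the third formats the date back into the string '2019-02-03'.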

2. Collection Data Types


There are four collection data types in Hive; they are also termed complex data types.

• ARRAY
• MAP
• STRUCT
• UNIONTYPE

1. ARRAY: A sequence of elements of a common type that

can be indexed and the index value starts from zero.


Code:

array('anand', 'balaa', 'praveeen');

2. MAP: These are elements that are declared and retrieved using key-value pairs.

Code:

'first' -> 'balakumaran', 'last' -> 'pradeesh' is represented as map('first', 'balakumaran', 'last', 'pradeesh'). Now 'balakumaran' can be retrieved with map['first'].

3. STRUCT: As in C, the struct is a datatype that

accumulates a set of fields that are labeled and can be of any

other data type.

Code:

For a column D of type STRUCT {Y INT; Z INT} the Y field can be retrieved by the

expression D.Y

4. UNIONTYPE: A union can hold any one of the specified data types.

Code:

CREATE TABLE test (col1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>>);
The delimiters used in complex data types are listed below.

Delimiter      Code    Description
\n             \n      Record or row delimiter
^A (Ctrl+A)    \001    Field delimiter
^B (Ctrl+B)    \002    Delimiter for elements of STRUCTs and ARRAYs
^C (Ctrl+C)    \003    Delimiter for MAP key-value pairs

Below is an example that uses the complex data types:

1. TABLE CREATION

Code:
create table store_complex_type (

emp_id int,

name string,

local_address STRUCT<street:string,
city:string,country:string,zipcode:bigint>,

country_address MAP<STRING,STRING>,

job_history array<STRING>)
row format delimited fields terminated by ','

collection items terminated by ':'

map keys terminated by '_';

2. SAMPLE TABLE DATA

Code:

100,Shan,4th:CHN:IND:600101,CHENNAI_INDIA,SI:CSC
101,Jai,1th:THA:IND:600096,THANJAVUR_INDIA,HCL:TM
102,Karthik,5th:AP:IND:600089,RENIKUNDA_INDIA,CTS:HCL

3. LOADING THE DATA


Code:
load data local inpath
'/home/cloudera/Desktop/Hive_New/complex_type.txt'
overwrite into table store_complex_type;

4. VIEWING THE DATA

Code:
select emp_id, name, local_address.city, local_address.zipcode,
       country_address['CHENNAI'], job_history[0]
from store_complex_type
where emp_id = 100;
Although Hive is not a relational database, its SQL-like interface offers the key properties of the usual SQL databases in a sophisticated manner, which makes it one of the more efficient structured data processing tools in Hadoop.

FILE FORMATS:

The Apache Hive data file formats:

The following are the main Hive data file formats that can be used for processing a variety of data within the Hive and Hadoop ecosystems. A short example of choosing a file format at table-creation time follows the format descriptions.

• TEXTFILE
• SEQUENCEFILE
• RCFILE
• ORC
• AVRO
• PARQUET

TEXTFILE
• The simplest data format to use, with whatever delimiters you prefer.
• Also the default format, equivalent to creating a table with the clause STORED AS TEXTFILE.
• Data can be shared with other Hadoop-related tools, such as Pig.
• Can also be used with Unix text tools like grep, sed, awk, etc.
• Also convenient for viewing or editing files manually.
• It is not space efficient compared to binary formats, which offer other advantages over TEXTFILE besides simplicity.

SEQUENCEFILE
• A first alternative to the Hive default file format.
• Can be specified using the "STORED AS SEQUENCEFILE" clause during table creation.
• Files have a flat structure consisting of binary key-value pairs.
• At run time, Hive queries are processed into MapReduce jobs, during which records are generated with the appropriate key-value pairs.
• It is a standard format supported by Hadoop itself, thus
becomes native or acceptable while sharing files between
Hive and other Hadoop-related tools.
• It’s less suitable for use with tools outside the Hadoop
ecosystem.
• When needed, the sequence files can be compressed at the
block and record level, which is very useful for optimizing disk
space utilization and I/O, while still supporting the ability to
split files on block for parallel processing.

RCFILE
• RCFile = Record Columnar File
• An efficient internal (binary) hive format and natively
supported by Hive.
• Used when Column-oriented organization is a good storage
option for certain types of data and applications.
• If data is stored by column instead of by row, then only the data for the desired columns has to be read, which in turn improves performance.
• Makes columns compression very efficient, especially for low
cardinality columns.
• Also, some column-oriented stores do not physically need to
store null columns.
• Helps storing columns of a table in a record columnar way.
• It first partitions rows horizontally into row splits and then it
vertically partitions each row split in a columnar way.
• It first stores the metadata of a row split, as the key part of a
record, and all the data of a row split as value part.
• The rcfilecat tool can be used to display the contents of RCFiles from the Hive command line, since RCFiles cannot be viewed with simple editors.
ORC
• ORC = Optimized Row Columnar
• Designed to overcome limitations of other Hive file formats
and has highly efficient way to store Hive data.
• Stores data as groups of row data called stripes, along with
auxiliary information in a file footer.
• A postscript section at the end of the file holds the compression parameters and the size of the compressed footer.

AVRO

• One of the relatively newer Apache Hadoop-related projects.
• A language-neutral data serialization system.
• Handles multiple data formats that can be processed by multiple languages.
• Relies on a schema.
• Uses JSON for defining data structure schemas, types, and protocols.
• Stores data structure definitions along with the data, in an easy-to-process form.
• It includes support for integers, numeric types, arrays, maps, enums, variable and fixed-length binary data, and strings.
• It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.
• Data structures can specify a sort order, so faster sorting is possible without deserialization, and data created in one programming language can be sorted by another.
• When data is read, the schema used for writing it is always present and available, permitting faster record serialization and deserialization with minimal per-record overhead.
• It serializes data in a compact binary format.
• It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
• An additional advantage of storing the full data structure definition with the data is that it permits the data to be written faster and more compactly without a need to process metadata separately.
• Avro can be used as a file format to store data in a predefined schema and can be used by any of the Hadoop tools, like Pig and Hive, and by other programming languages like Java, Python, and more.
• It also lets one define Remote Procedure Call (RPC) protocols; although the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.

PARQUET
• A columnar storage format, widely supported across the Hadoop ecosystem, that serves a similar role to ORC for column-oriented, compressed analytical storage.
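As a brief sketch of how a file format is chosen in practice (the table and column names are only illustrative, and the data is assumed to already exist in the text-format table), a table can be created with an explicit STORED AS clause and data can then be copied from a text table into a columnar one:

-- plain text with custom delimiters (the default behaviour)
create table stocks_txt (ticker string, close double)
row format delimited fields terminated by ','
stored as textfile;

-- columnar, compressed storage for analytical queries
create table stocks_orc (ticker string, close double)
stored as orc;

-- copy the text data into the ORC table
insert into table stocks_orc select ticker, close from stocks_txt;

The same pattern applies to the other formats by replacing ORC with SEQUENCEFILE, RCFILE, AVRO, or PARQUET in the STORED AS clause (subject to the Hive version in use).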
3. HiveQL (Hive Query Language) in Big Data Analytics:

HiveQL is a SQL-like language used to interact with Hive for


defining schema, querying data, and managing metadata.
Key features of HiveQL in big data analytics include:
Data Definition Language (DDL):

Allows users to define and manage tables, databases, and


views.
Example: CREATE TABLE, ALTER TABLE, CREATE
DATABASE.
Data Manipulation Language (DML):

Enables users to perform operations like inserting, updating,


and deleting data.
Example: INSERT INTO, UPDATE, DELETE.
Query Language:

Supports querying large datasets using SQL-like syntax.


Example: SELECT, JOIN, GROUP BY, ORDER BY.
UDFs (User-Defined Functions):

Users can define their custom functions to extend the


functionality of HiveQL.
Example: CREATE FUNCTION.
Built-in Functions:

Hive provides a rich set of built-in functions for common data


manipulations and transformations.
Example: SUM, AVG, MAX, MIN, etc.
HiveQL abstracts the complexities of distributed data processing and allows users to express complex analytical queries using familiar SQL constructs, as illustrated by the short example below. This is particularly advantageous in big data analytics, where many data analysts and business intelligence professionals are already well-versed in SQL.
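A minimal HiveQL sketch tying the DDL, DML, and query features together (the database, table, and column names are hypothetical, and INSERT ... VALUES assumes Hive 0.14 or later):

-- DDL: define a database and a table
create database if not exists sales_db;
create table sales_db.orders (order_id int, amount double, region string);

-- DML: insert a sample row
insert into table sales_db.orders values (1, 250.0, 'south');

-- Query: aggregate with familiar SQL constructs
select region, sum(amount) as total_amount
from sales_db.orders
group by region
order by total_amount desc;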

In summary, Hive plays a significant role in big data analytics


by providing a SQL-like interface to interact with large
datasets stored in HDFS. Its support for various data types,
file formats, and a familiar query language makes it an
essential tool for analysts and data scientists working on big
data platforms.

6b) How can we install the Apache Hive on the System-Explain 7M

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. Let


us verify java installation using the following command:

$ java -version
If Java is already installed on your system, you get to see the
following response:

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the steps given
below for installing java.

Installing Java
Step I:

Download java (JDK <latest version> - X64.tar.gz) by visiting the


following
link http://www.oracle.com/technetwork/java/javase/downloads/j
dk7-downloads-1880260.html.

Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your


system.

Step II:

Generally you will find the downloaded java file in the Downloads
folder. Verify it and extract the jdk-7u71-linux-x64.gz file using
the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:

To make java available to all the users, you have to move it to


the location “/usr/local/”. Open root, and type the following
commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:

For setting up PATH and JAVA_HOME variables, add the following


commands to ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc
Step V:

Use the following commands to configure java alternatives:

# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2

# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar

Now verify the installation using the command java -version from
the terminal as explained above.

Step 2: Verifying Hadoop Installation

Hadoop must be installed on your system before installing Hive.


Let us verify the Hadoop installation using the following
command:

$ hadoop version

If Hadoop is already installed on your system, then you will get


the following response:
Hadoop 2.4.1 Subversion
https://svn.apache.org/repos/asf/hadoop/common -r
1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum
79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, then proceed with the


following steps:

Downloading Hadoop

Download and extract Hadoop 2.4.1 from Apache Software


Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit
Installing Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo


distributed mode.

Step I: Setting up Hadoop

You can set Hadoop environment variables by appending the


following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export
HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.

$ source ~/.bashrc
Step II: Hadoop Configuration

You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable
changes in those configuration files according to your Hadoop
infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using java, you have to


reset the java environment variables in hadoop-env.sh file by
replacing JAVA_HOME value with the location of java in your
system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below are the list of files that you have to edit to configure
Hadoop.

core-site.xml

The core-site.xml file contains information such as the port


number used for Hadoop instance, memory allocated for the file
system, memory limit for storing the data, and the size of
Read/Write buffers.

Open the core-site.xml and add the following properties in


between the <configuration> and </configuration> tags.

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

</configuration>

hdfs-site.xml
The hdfs-site.xml file contains information such as the value of
replication data, the namenode path, and the datanode path of
your local file systems. It means the place where you want to
store the Hadoop infra.

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path /hadoop/ is the user name.


hadoopinfra/hdfs/namenode is the directory created by
hdfs file system.)

namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by


hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the
<configuration>, </configuration> tags in this file.

<configuration>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>

</configuration>
Note: In the above file, all the property values are user-defined
and you can make changes according to your Hadoop
infrastructure.

yarn-site.xml

This file is used to configure yarn into Hadoop. Open the yarn-
site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.

<configuration>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template named mapred-site.xml.template. First of all, you need to copy the file mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open mapred-site.xml file and add the following properties in


between the <configuration>, </configuration> tags in this file.

<configuration>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

</configuration>
Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.


Step I: Name Node Setup

Set up the namenode using the command “hdfs namenode -


format” as follows.

$ cd ~
$ hdfs namenode -format

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:


/******************************************************
******
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage
directory
/home/hadoop/hadoopinfra/hdfs/namenode has been
successfully formatted.
10/24/14 21:30:56 INFO
namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with
status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/******************************************************
******
SHUTDOWN_MSG: Shutting down NameNode at
localhost/192.168.1.11

*******************************************************
*****/
Step II: Verifying Hadoop dfs

The following command is used to start dfs. Executing this


command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:


10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-
localhost.out
localhost: starting datanode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-
localhost.out
Starting secondary namenodes [0.0.0.0]
Step III: Verifying Yarn Script

The following command is used to start the yarn script. Executing


this command will start your yarn daemons.

$ start-yarn.sh

The expected output is as follows:

starting yarn daemons


starting resourcemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
resourcemanager-localhost.out
localhost: starting nodemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-
localhost.out
Step IV: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the


following url to get Hadoop services on your browser.

http://localhost:50070/
Step V: Verify all applications for cluster

The default port number to access all applications of cluster is


8088. Use the following url to visit this service.

http://localhost:8088/

Step 3: Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it by


visiting the following link http://apache.petsads.us/hive/hive-
0.14.0/. Let us assume it gets downloaded onto the /Downloads
directory. Here, we download Hive archive named “apache-hive-
0.14.0-bin.tar.gz” for this tutorial. The following command is used
to verify the download:

$ cd Downloads
$ ls

On successful download, you get to see the following response:


apache-hive-0.14.0-bin.tar.gz
Step 4: Installing Hive

The following steps are required for installing Hive on your


system. Let us assume the Hive archive is downloaded onto the
/Downloads directory.

Extracting and verifying Hive Archive

The following command is used to verify the download and


extract the hive archive:

$ tar zxvf apache-hive-0.14.0-bin.tar.gz


$ ls

On successful download, you get to see the following response:

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
Copying files to /usr/local/hive directory

We need to copy the files as the super user ("su -"). The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.

$ su -
passwd:

# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
Setting up environment for Hive

You can set up the Hive environment by appending the following


lines to ~/.bashrc file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

The following command is used to execute ~/.bashrc file.


$ source ~/.bashrc
Step 5: Configuring Hive

To configure Hive with Hadoop, you need to edit the hive-


env.sh file, which is placed in the $HIVE_HOME/conf directory. The
following commands redirect to Hive config folder and copy the
template file:

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:

export HADOOP_HOME=/usr/local/hadoop

Hive installation is completed successfully. Now you require an


external database server to configure Metastore. We use Apache
Derby database.

Step 6: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache


Derby:

Downloading Apache Derby

The following command is used to download Apache Derby. It


takes some time to download.

$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-
derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

The following command is used to verify the download:

$ ls

On successful download, you get to see the following response:

db-derby-10.4.2.0-bin.tar.gz
Extracting and verifying Derby archive

The following commands are used for extracting and verifying the
Derby archive:

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz


$ ls

On successful download, you get to see the following response:

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying files to /usr/local/derby directory

We need to copy the files as the super user ("su -"). The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory:

$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
Setting up environment for Derby

You can set up the Derby environment by appending the following


lines to ~/.bashrc file:

export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

The following command is used to execute ~/.bashrc file:

$ source ~/.bashrc
Create a directory to store Metastore

Create a directory named data in $DERBY_HOME directory to


store Metastore data.
$ mkdir $DERBY_HOME/data

Derby installation and environmental setup is now complete.

Step 7: Configuring Metastore of Hive

Configuring Metastore means specifying to Hive where the


database is stored. You can do this by editing the hive-site.xml
file, which is in the $HIVE_HOME/conf directory. First of all, copy
the template file using the following command:

$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and append the following lines between the


<configuration> and </configuration> tags:

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
</property>

Create a file named jpox.properties and add the following lines


into it:

javax.jdo.PersistenceManagerFactoryClass =

org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName =
org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL =
jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Step 8: Verifying Hive Installation

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission (chmod g+w) for these newly created folders in HDFS before verifying Hive. Use the following commands:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

The following commands are used to verify Hive installation:

$ cd $HIVE_HOME
$ bin/hive

On successful installation of Hive, you get to see the following


response:

Logging initialized using configuration in


jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-
0.9.0.jar!/hive-log4j.properties
Hive history
file=/tmp/hadoop/hive_job_log_hadoop_201312121621_14949
29084.txt
………………….
hive>

The following sample command is executed to display all the


tables:
hive> show tables;
OK
Time taken: 2.798 seconds
hive>
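As an additional, optional smoke test (the table name is hypothetical), a throwaway table can be created and dropped from the hive> prompt to confirm that the metastore and the warehouse directory are both working:

hive> create table test_install (id int, name string);
hive> show tables;
hive> describe test_install;
hive> drop table test_install;

If these statements complete without errors, the Hive installation and its Derby-backed metastore are functioning correctly.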

UNIT –IV

7a) list and explain the important features of hadoop 7M


Hadoop is a powerful open-source framework designed for distributed storage and
processing of large datasets across clusters of commodity hardware. Here are some
important features of Hadoop:
Features of Hadoop Which Makes It Popular
Let’s discuss the key features which make Hadoop more reliable to use, an
industry favorite, and the most powerful Big Data tool.
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-
source project the source-code is available online for anyone to understand it
or make some modifications as per their industry requirement.
2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational Database Management System), the system cannot be scaled to approach such large amounts of data.
3. Fault Tolerance is Available:
Hadoop uses commodity hardware (inexpensive systems) which can fail at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of the data even if one of the systems crashes: if a machine faces a technical issue, the same data can be read from other nodes in the cluster, because the data is copied, or replicated, by default. By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed through the replication property in the hdfs-site.xml file.
4. High Availability is Provided:
Fault tolerance provides high availability in the Hadoop cluster. High availability means the data remains available on the Hadoop cluster: if any DataNode goes down, the same data can be retrieved from any other node where it is replicated. A highly available Hadoop cluster also has two or more NameNodes, i.e. an Active NameNode and a Passive (standby) NameNode. If the Active NameNode fails, the Passive NameNode takes over its responsibility and serves the same data, which can easily be utilized by the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing massive volumes of data is not cost-effective, so companies started discarding raw data, which may leave them without a complete picture of their business. Hadoop thus provides two main cost benefits: it is open-source and free to use, and it runs on commodity hardware, which is also inexpensive.
6. Hadoop Provide Flexibility:
Hadoop is designed in such a way that it can deal with any kind of dataset like
structured(MySql Data), Semi-Structured(XML, JSON), Un-structured (Images
and Videos) very efficiently. This means it can easily process any kind of data
independent of its structure which makes it highly flexible. It is very much
useful for enterprises as they can process large datasets easily, so the
businesses can use Hadoop to analyze valuable insights of data from sources
like social media, email, etc. With this flexibility, Hadoop can be used with log
processing, Data Warehousing, Fraud detection, etc.
7. Easy to Use:
Hadoop is easy to use since developers need not worry about the details of distributed processing, which are managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.
8. Hadoop uses Data Locality:
The concept of data locality is used to make Hadoop processing fast. With data locality, the computation logic is moved close to the data rather than moving the data to the computation logic. Moving data across the cluster is the most expensive operation, and with the help of the data locality concept the bandwidth utilization in the system is minimized.
9. Provides Faster Data Processing:
Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. In a distributed file system, a large file is broken into small file blocks that are distributed among the nodes available in a Hadoop cluster. This massive number of file blocks is processed in parallel, which makes Hadoop fast and gives it high performance compared to traditional database management systems.
10. Support for Multiple Data Formats:
Hadoop supports multiple data formats like CSV, JSON, Avro, and more,
making it easier to work with different types of data sources. This makes it
more convenient for developers and data analysts to handle large volumes of
data with different formats.
11. High Processing Speed:
Hadoop’s distributed processing model allows it to process large amounts of
data at high speeds. This is achieved by distributing data across multiple
nodes and processing it in parallel. As a result, Hadoop can process data
much faster than traditional database systems.
12. Machine Learning Capabilities:
Hadoop offers machine learning capabilities through its ecosystem tools like
Mahout, which is a library for creating scalable machine learning
applications. With these tools, data analysts and developers can build
machine learning models to analyze and process large datasets.
13. Integration with Other Tools:
Hadoop integrates with other popular tools like Apache Spark, Apache Flink,
and Apache Storm, making it easier to build data processing pipelines. This
integration allows developers and data analysts to use their favorite tools
and frameworks for building data pipelines and processing large datasets.
14. Secure:
Hadoop provides built-in security features like authentication, authorization,
and encryption. These features help to protect data and ensure that only
authorized users have access to it. This makes Hadoop a more secure
platform for processing sensitive data.
15. Community Support:
Hadoop has a large community of users and developers who contribute to its
development and provide support to users. This means that users can
access a wealth of resources and support to help them get the most out of
Hadoop.

7b) Explain the architecture of building blocks of hadoop 8M

Answer:

Hadoop works on MapReduce Programming Algorithm that was introduced


by Google. Today lots of Big Brand Companies are using Hadoop in their
Organization to deal with big data, eg. Facebook, Yahoo, Netflix, eBay, etc.
The Hadoop Architecture Mainly consists of 4 components.

 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

Let’s understand the role of each one of this component in detail.

1. MapReduce

MapReduce is essentially an algorithm, or a processing model, that runs on the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with Big Data, serial processing is of little use. MapReduce has two tasks, which are divided phase-wise: in the first phase Map is utilized, and in the next phase Reduce is utilized.

The input is provided to the Map() function, its output is used as the input to the Reduce() function, and after that we receive the final output. Let's understand what Map() and Reduce() do.

The input to Map() is a set of data blocks. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples based on their key value, forms a new set of tuples, and performs operations such as sorting and summation, which are then sent to the final output node. Finally, the output is obtained.

The data processing done in the Reducer depends on the business requirement of the industry. This is how first Map() and then Reduce() are utilized, one after the other.

Let’s understand the Map Task and Reduce Task in detail.

Map Task:

 RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function; the key is the record's location information and the value is the data associated with it.
 Map: A map is a user-defined function whose work is to process the tuples obtained from the RecordReader. The Map() function may generate zero, one, or multiple key-value pairs for each input tuple.
 Combiner: The combiner is used for grouping the data in the Map workflow. It is similar to a local reducer: the intermediate key-value pairs generated in the Map phase are combined with the help of this combiner. Using a combiner is optional.
 Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the Mapper phase and generating the shards (partitions) corresponding to each reducer. The hashcode of each key is fetched, and the partitioner computes its modulus with the number of reducers (key.hashCode() % (number of reducers)).

Reduce Task

 Shuffle and Sort: The Task of Reducer starts with this step, the process in
which the Mapper generates the intermediate key-value and transfers
them to the Reducer task is known as Shuffling. Using the Shuffling
process the system can sort the data using its key value.

Once some of the Mapping tasks are done Shuffling begins that is why it
is a faster process and does not wait for the completion of the task
performed by Mapper.
 Reduce: The main function or task of the Reduce is to gather the Tuple
generated from Map and then perform some sorting and aggregation sort
of process on those key-value depending on its key element.
 OutputFormat: Once all the operations are performed, the key-value pairs
are written into the file with the help of record writer, each record in a new
line, and the key and value in a space-separated manner.
2. HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly designed for working on commodity hardware devices (inexpensive devices) and follows a distributed file system design. HDFS is designed in such a way that it prefers storing data in a few large blocks rather than in many small blocks.

HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:

 NameNode(Master)
 DataNode(Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. The metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.

The metadata also includes the name of the file, its size, and the information about the location (block number, block ids) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.

DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advised that DataNodes have high storage capacity to hold a large number of file blocks.
High Level Architecture Of Hadoop

File Blocks in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each, which is the default size and can also be changed manually.

Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS; the file is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, i.e. 4 blocks are created, each of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux file system the size of a file block is about 4 KB, which is very much less than the default block size in the Hadoop file system. Hadoop is mainly configured for storing data at very large scale, up to petabytes; this is what makes the Hadoop file system different from other file systems, as it can be scaled, and nowadays block sizes of 128 MB to 256 MB are used in Hadoop.

Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of times you make a copy of that particular thing is its replication factor. As we saw with file blocks, HDFS stores the data in the form of blocks, and Hadoop is also configured to make copies of those file blocks.

By default, the replication factor for Hadoop is set to 3, which is configurable and can be changed manually as per your requirement. In the example above we made 4 file blocks, and with 3 replicas, or copies, of each file block, a total of 4 x 3 = 12 blocks are stored for backup purposes.

This is because for running Hadoop we are using commodity hardware (inexpensive system hardware), which can fail at any time; we are not using supercomputers for our Hadoop setup. That is why we need a feature in HDFS that makes copies of the file blocks for backup purposes, and this is known as fault tolerance.

Note that after making so many replicas of our file blocks we use a lot of extra storage, but for large organizations the data is far more important than the storage, so nobody minds this extra storage. You can configure the replication factor in your hdfs-site.xml file.

Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode, achieving maximum performance while performing reads and writes and reducing network traffic.

HDFS Architecture
3. YARN(Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between the jobs, and other information like job timing. The resource manager is used to manage all the resources that are made available for running the Hadoop cluster.

Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility
4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is the set of Java libraries and scripts needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically in software by the Hadoop framework.

8a) Write a neat diagram and explain the components of apache Hive
Architecture 8M

Hive Client:

Hive drivers support applications written in any language like Python, Java,
C++, and Ruby, among others, using JDBC, ODBC, and Thrift drivers, to
perform queries on the Hive. Therefore, one may design a hive client in any
language of their choice.

The three types of Hive clients are:

1. Thrift client: The Hive server can handle requests from a Thrift client by using Apache Thrift.
2. JDBC client: A JDBC driver connects to Hive using the Thrift framework. The Hive server communicates with Java applications through the JDBC driver.
3. ODBC client: The Hive ODBC driver is similar to the JDBC driver in that it also uses Thrift to connect to Hive; the difference is that it serves ODBC-based applications instead of JDBC-based ones.

Hive Services:

Hive provides numerous services, including the Hive server2, Beeline,


etc. The services offered by Hive are:

1. Beeline: HiveServer2 supports Beeline, a command shell to which the user can submit commands and queries. It is a JDBC client based on the SQLLine CLI (a pure-Java console utility for connecting to relational databases and executing SQL queries).
2. Hive Server 2: HiveServer2 is the successor to HiveServer1. It provides
clients with the ability to execute queries against the Hive. Multiple
clients may submit queries to Hive and obtain the desired results. Open
API clients such as JDBC and ODBC are supported by HiveServer2.

Note: HiveServer1, which is also known as the Thrift server, was used to communicate with Hive across platforms; different client applications could submit requests to Hive and receive the results using this server. However, HiveServer1 could not handle concurrent requests from more than one client, so it was replaced by HiveServer2.

Hive Driver: The Hive driver receives the HiveQL statements submitted by
the user through the command shell and creates session handles for the
query.

Hive Compiler: The Hive compiler uses the metadata stored in the metastore to perform semantic analysis and type checking on the different query blocks and query expressions. The execution plan generated by the Hive compiler is based on the parse results.

The compiler creates the execution plan as a DAG (Directed Acyclic Graph), in which each step is a map/reduce job, an operation on HDFS file metadata, or a data manipulation step.

Optimizer: The optimizer applies transformation operations to the execution plan and splits the tasks so that efficiency and scalability are improved.

Execution Engine: After the compilation and optimization steps, the execution engine uses Hadoop to execute the execution plan prepared by the compiler.

Metastore: The metastore stores metadata information about tables and partitions, including column and column-type information, which the compiler uses for planning and validating queries.

The metastore also stores information about the serializer and deserializer required to read and write data, as well as the HDFS files where the data is stored. It is usually a relational database. Hive metadata can be queried and modified through the metastore, as sketched below.
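From the Hive shell, the metadata held in the metastore can be inspected with standard HiveQL commands (the table name is hypothetical, and SHOW PARTITIONS applies only to partitioned tables):

show databases;
show tables;
-- columns, storage location, file format, and other table metadata:
describe formatted orders;
-- lists partitions (only valid if the table is partitioned):
show partitions orders;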

We can configure the metastore in either of two modes:

1. Remote: In remote mode, the metastore runs as a separate service and is accessed over Thrift, which also allows non-Java applications to use it.
2. Embedded: In embedded mode, the client directly accesses the metastore via JDBC.

HCatalog: HCatalog is the table and storage management layer of Hadoop. It provides users of different data processing tools, such as Pig and MapReduce, with simple access to read and write data on the grid.

It is built on top of the Hive metastore and exposes the tabular data of the metastore to the other data processing tools.
WebHCat: WebHCat is the REST API for HCatalog. It provides an HTTP interface to perform Hive metadata operations and allows the user to run Hadoop MapReduce (or YARN), Pig, and Hive jobs.

Processing and Resource Management:

Hive uses the MapReduce framework as its default engine for executing queries.

MapReduce is a framework for writing large-scale applications that process a huge quantity of data in parallel on large clusters of commodity hardware. A MapReduce job splits the data into chunks, which are processed by the map and reduce tasks.

Distributed Storage: Hive is based on Hadoop, which means that it uses the Hadoop Distributed File System (HDFS) for distributed storage.

Working with Hive:We will now look at how to use Apache Hive to process
data.

1. The driver calls the user interface's execute function to perform a query.
2. The driver accepts the query, creates a session handle for the query, and passes it to the compiler for generating the execution plan.
3. The compiler sends a metadata request to the metastore, and the metastore sends the requested metadata back to the compiler.
4. The compiler uses this metadata for type checking and semantic analysis of the expressions in the query tree. It then generates the execution plan (a Directed Acyclic Graph) of MapReduce jobs, which includes the map operator trees (operators used by mappers) as well as the reduce operator trees (operators used by reducers).
5. The compiler then transmits the generated execution plan to the driver.
6. After the compiler provides the execution plan to the driver, the driver passes the plan to the execution engine for execution.
7. The execution engine passes the stages of the DAG to the suitable components. For each table or intermediate output, the associated deserializer is used to read the rows from the HDFS files, and these rows are passed through the operator tree. The output is serialized by the serializer and written to a temporary HDFS file, and these temporary files are used to provide data to the subsequent MapReduce stages of the plan. Finally, the temporary file is moved to the table's final location.
8. When the driver issues a fetch call, the Hive interface reads the contents of the temporary files from HDFS and sends the results to the driver.

Different Modes of Hive:

Hive can operate in two modes, depending on the number of data nodes in Hadoop:

1. Local mode
2. MapReduce mode

When using Local mode:

1. We can run Hive in local mode if Hadoop is installed in pseudo-distributed mode with a single data node.
2. In this mode, the data is limited to what a single machine can hold, so it is suitable only for smaller data sets.
3. Smaller data sets are processed rapidly on the local machine because no distributed job has to be launched.

When using MapReduce mode:

1. In this type of setup there are multiple data nodes, and the data is distributed across different nodes; this is the scenario in which Hive is normally used.
2. It can handle large amounts of data and run queries in parallel, so that they execute in a timely fashion.
3. Turning on this mode gives better performance when processing large data sets. A configuration sketch for influencing the mode follows this list.
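A hedged sketch of how the execution mode can be influenced from the Hive shell. The property names below are standard Hive configuration properties, but the exact defaults and behaviour depend on the installed Hive version:

-- let Hive decide automatically whether a small query can run in local mode
set hive.exec.mode.local.auto=true;

-- thresholds that control when automatic local mode is chosen
-- (values shown are assumed defaults; adjust them for your cluster)
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.input.files.max=4;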

8b) Describe how Spark handles data frames and complex data types; include an example working with JSON data in Spark 7M

Data Frames:
A DataFrame is a distributed collection of data, which is
organized into named columns. Conceptually, it is equivalent to
relational tables with good optimization techniques.
A DataFrame can be constructed from an array of different
sources such as Hive tables, Structured Data files, external
databases, or existing RDDs. This API was designed for
modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas
in Python.

Features of DataFrame

Here is a set of few characteristic features of DataFrame −


• Ability to process data ranging in size from kilobytes to petabytes, on a single-node cluster or on a large cluster.
• Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
• State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework).
• Can be easily integrated with all Big Data tools and frameworks via Spark Core.
• Provides APIs for Python, Java, Scala, and R programming.

SQLContext

SQLContext is a class and is used for initializing the


functionalities of Spark SQL. SparkContext class object (sc) is
required for initializing SQLContext class object.
The following command is used for initializing the SparkContext through
spark-shell.
$ spark-shell
By default, the SparkContext object is initialized with the name
sc when the spark-shell starts.
Use the following command to create SQLContext.
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Example

Let us consider an example of employee records


in a JSON file named employee.json. Use the following
commands to create a DataFrame (df) and read a JSON
document named employee.json with the following content.

employee.json − Place this file in the directory where the


current scala> pointer is located.
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}

DataFrame Operations

DataFrame provides a domain-specific language for structured


data manipulation. Here, we include some basic examples of
structured data processing using DataFrames.
Follow the steps given below to perform DataFrame operations −

Read the JSON Document

First, we have to read the JSON document. Based on this,


generate a DataFrame named (dfs).
Use the following command to read the JSON document named
employee.json. The data is shown as a table with the fields −
id, name, and age.
scala> val dfs = sqlContext.read.json("employee.json")
Output − The field names are taken automatically from employee.json.
dfs: org.apache.spark.sql.DataFrame = [age: string, id: string,
name: string]

Show the Data

If you want to see the data in the DataFrame, then use the following
command.
scala> dfs.show()
Output − You can see the employee data in a tabular format.
<console>:22, took 0.052610 s
+----+------+ ------------------ +
|age | id | name |
+----+------+ ------------------ +
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
| 23 | 1204 | javed |
| 23 | 1205 | prudvi |
+----+------+ ------------------ +

Use printSchema Method

If you want to see the Structure (Schema) of the


DataFrame, then use the following command.
scala> dfs.printSchema()
Output
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Use Select Method

Use the following command to fetch name-column among


three columns from the DataFrame.
scala> dfs.select("name").show()
Output − You can see the values of the name column.
<console>:22, took 0.044023 s
+ ------------+
| name |
+ ------------+
| satish |
| krishna|
| amith |
| javed |
| prudvi |
+ ------------+

Use Age Filter

Use the following command for finding the employees whose


age is greater than 23 (age > 23).
scala> dfs.filter(dfs("age") > 23).show()
Output
<console>:22, took 0.078670 s
+----+------+ ------------------ +
|age | id | name |
+----+------+ ------------------ +
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
+----+------+ ------------------ +

Use groupBy Method


Use the following command for counting the number of
employees who are of the same age.
scala> dfs.groupBy("age").count().show()
Output − two employees are having age 23.
<console>:22, took 5.196091 s
+----+---------- +
|age |count|
+----+---------- +
| 23 | 2 |
| 25 | 1 |
| 28 | 1 |
| 39 | 1 |
+----+---------- +

Complex Data Types in Spark DataFrames:


Spark supports various complex data types:

Arrays:

Represents a collection of elements of the same type.


Useful for scenarios where a field contains multiple values, such as tags or
a list of items.
Structs:

Represents a structure with named fields, allowing the grouping of related


attributes.
Useful for dealing with nested or hierarchical data structures.
Example Working with JSON Data in Spark:
Let's consider an example where we work with JSON data using Spark

# Import required Spark libraries


from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a Spark session
spark = SparkSession.builder.appName("JsonExample").getOrCreate()

# JSON data
json_data = '''
{
"id": 1,
"name": "John Doe",
"age": 25,
"skills": ["Python", "Spark", "SQL"],
"address": {
"city": "New York",
"zipcode": "10001"
}
}
'''

# Read JSON data into a DataFrame


df = spark.read.json(spark.sparkContext.parallelize([json_data]))

# Show the DataFrame


df.show()

# Extracting data using DataFrame operations


result_df = df.select(
col("id"),
col("name"),
col("age"),
col("address.city").alias("city"),
col("address.zipcode").alias("zipcode"),
col("skills")
)
# Show the result DataFrame
result_df.show(truncate=False)

# Stop the Spark session


spark.stop()

Explanation:

Spark Session Creation:
The SparkSession is created, serving as the entry point for Spark functionality.
Define JSON Data:
A JSON string (json_data) is defined, representing a sample JSON record.
Read JSON Data into a DataFrame:
The spark.read.json() method reads the JSON data into a DataFrame (df).
Show the Original DataFrame:
The original DataFrame is displayed to illustrate the structure of the loaded JSON data.
DataFrame Operations:
DataFrame operations are used to select specific columns (id, name, age, address.city, address.zipcode, skills) and alias some of them.
Show Result DataFrame:
The result DataFrame is displayed, showing the extracted and manipulated data.
Stop Spark Session:
The Spark session is stopped to release resources.


In big data analytics, this example demonstrates how Spark can efficiently
handle and process JSON data with nested structures using its DataFrame
API. The support for complex data types allows for flexible and powerful
data manipulations, making Spark well-suited for diverse big data scenarios.
UNIT-V

9a) Explain Event time and state full processing 7M


Having covered the core concepts and basic APIs, this section dives into event-time and stateful processing. Event-time processing is a hot topic because we analyze information with respect to the time that it was created, not processed. The key idea behind this style of processing is that over the lifetime of the job, Spark will maintain relevant state that it can update over the course of the job before outputting it to the sink.
Let's cover these concepts in greater detail before we begin working with code to show how they work.

Event Time

Event time is an important topic to cover discretely because Spark's DStream API does not support processing information with respect to event time. At a higher level, in stream-processing systems there are effectively two relevant times for each event: the time at which it actually occurred (event time), and the time that it was processed or reached the stream-processing system (processing time).

Event time

Event time is the time that is embedded in the data itself. It is most
often, though not required to be, the time that an event actually occurs.
This is important to use because it provides a more robust way of
comparing events against one another. The challenge here is that
event data can be late or out of order. This means that the stream-processing system must be able to handle out-of-order or late data.

Processing time
Processing time is the time at which the stream-processing system actually receives the data.
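
To make the two timestamps concrete, here is a small sketch (assuming a streaming DataFrame named events whose records carry an eventTime column embedded in the data; the column names are illustrative):

import org.apache.spark.sql.functions.current_timestamp

// eventTime is embedded in the data itself (event time); current_timestamp()
// reflects when the engine actually processes the record (processing time).
val withBothTimes = events
  .withColumn("processingTime", current_timestamp())
  .select("eventTime", "processingTime")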
Stateful Streaming in Apache Spark

Apache Spark is a general processing engine built on top of the Hadoop ecosystem. Spark has a complete setup and a unified framework to process any kind of data. Spark can do batch processing as well as stream processing. Spark has a powerful SQL engine to run SQL queries on the data; it also has an integrated machine learning library called MLlib and a graph processing library called GraphX. Because it integrates so many capabilities, we identify Spark as a unified framework rather than just a processing engine.

Now coming to the real-time stream processing engine of Spark: Spark doesn't process the data in strictly real time; it does near-real-time processing, handling the data in small micro-batches that each complete within milliseconds to seconds.

With Spark's StreamingContext the data is processed in micro-batches, but by default this processing is stateless. Suppose the StreamingContext is defined with a 10-second batch interval: it processes only the data that arrived within those 10 seconds. To look back at earlier data there is the window concept, but windows cannot give accumulated results from the very start of the job.

But what if you need to accumulate the results from the start of the streaming job? That means you need to check the previous state of the RDD in order to compute its new state. This is what is known as stateful streaming in Spark.
Spark provides two APIs to perform stateful streaming: updateStateByKey and mapWithState.
Now we will see how to perform a stateful word count using updateStateByKey. updateStateByKey is a function on DStreams in Spark that accepts an update function as its parameter. The update function receives the new values seen for a key in the current batch (a Seq of values) and the previous state of that key (an Option).
Let's take a word count program. Say that in the first 10 seconds we feed it the line "hello every one from acadgild". The word count result will be
(one,1)
(hello,1)
(from,1)
(acadgild,1)
(every,1)
Now, without using updateStateByKey, if we feed the same line "hello every one from acadgild" during the next 10 seconds, we get exactly the same result again for that batch, i.e.,
(one,1)
(hello,1)
(from,1)
(acadgild,1)
(every,1)
With updateStateByKey, by contrast, the counts accumulate across batches, so after the second batch each word would have a count of 2. A sketch of such a stateful word count is shown below.
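
The following is a minimal sketch, assuming a socket source on localhost:9999, a 10-second batch interval, and a checkpoint directory (required for stateful DStream operations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batches
ssc.checkpoint("/tmp/wordcount-checkpoint")         // state requires a checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// newValues: counts for the key seen in the current batch;
// runningCount: the accumulated count from earlier batches (None the first time).
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()

ssc.start()
ssc.awaitTermination()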

9b) Discuss about Structured Streaming 8M


Structured Streaming is a high-level API for stream processing that
became production-ready in Spark 2.2. Structured Streaming allows you to
take the same operations that you perform in batch mode using Spark’s
structured APIs, and run them in a streaming fashion. This can reduce
latency and allow for incremental processing. The best thing about
Structured Streaming is that it allows you to rapidly get value
out of streaming systems with virtually no code changes. It also makes it
easy to reason about because you can write your batch job as a way to
prototype it and then you can convert it to a streaming job. The way all of
this works is by incrementally processing that data.
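
As a quick illustration, here is a minimal sketch of the batch-style word count run as a stream (assuming a socket source on localhost:9999, for example one started with nc -lk 9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations used in batch mode work on the stream.
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")   // emit the full updated counts on each trigger
  .format("console")
  .start()

query.awaitTermination()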

Structured Streaming in big data analytics refers to the application of Apache Spark's Structured Streaming API for handling and processing real-time data at scale. It is designed to provide a unified and high-level API for stream processing, making it more accessible and intuitive for developers familiar with Spark's DataFrame and SQL API. Here are key aspects of Structured Streaming in the context of big data analytics:

Unified Programming Model:

Structured Streaming unifies batch and streaming processing, allowing developers to use the same DataFrame and SQL API for both static (batch) and streaming data. This brings consistency and simplifies the development process.

Declarative SQL-Like API:

Developers can express their stream processing logic in a declarative SQL-like syntax using the DataFrame API. This abstraction simplifies the coding process and allows for easier integration of stream processing tasks into existing Spark applications.

Incremental Processing:

One of the key features of Structured Streaming is its support for incremental
processing. It processes only the new data that arrives in the stream since the
last batch was processed. This enables low-latency and efficient stream
processing.

Fault Tolerance:

Structured Streaming provides end-to-end exactly-once semantics, ensuring that every record is processed exactly once, even in the presence of failures. This is achieved through mechanisms like checkpointing and the use of write-ahead logs.

Continuous Processing Model:

Beyond the default micro-batch execution, Structured Streaming also offers a continuous processing mode. This enables lower end-to-end latencies and a more natural handling of event time in the context of real-time analytics.

Event-Time Processing:

Structured Streaming supports event-time processing, allowing developers to work with the timestamp of events. This is essential for applications where the order of events matters, and handling late-arriving data is crucial.
Integration with Spark Ecosystem:

Structured Streaming seamlessly integrates with other components of the Spark ecosystem, such as Spark SQL, MLlib, and GraphX. This allows organizations to build unified data processing pipelines for both batch and streaming workloads.

Stateful Processing:

Structured Streaming supports stateful processing, enabling the maintenance of intermediate state across batches. This is particularly useful for scenarios where aggregations need to be carried over time, such as maintaining a running count or sum.

Integration with External Systems:

The API allows for integration with external systems for various
functionalities, such as connecting to external databases, calling external APIs,
or incorporating custom logic using user-defined functions (UDFs).

Source and Sink Agnostic:

It supports a variety of data sources and sinks, including popular ones like
Apache Kafka, HDFS, Amazon S3, and more. This flexibility allows
organizations to ingest data from various sources and output the results to
different sinks.
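
For example, reading a Kafka topic as a streaming DataFrame might look like the following sketch (assuming an existing SparkSession named spark, a broker at localhost:9092, a topic named events, and the spark-sql-kafka connector package on the classpath; all of these are assumptions for illustration):

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Kafka delivers key/value as binary, so cast them to strings before processing.
val messages = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

messages.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/kafka-console-checkpoint")
  .start()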

In big data analytics, Structured Streaming enables organizations to build robust, scalable, and fault-tolerant real-time data processing pipelines, making it well-suited for applications ranging from fraud detection and monitoring to live dashboarding and continuous ETL (Extract, Transform, Load) processes.

10a) Define Streaming explain duplicates in a streaming 8M

Streaming refers to the processing of data continuously and in real time, typically in the context of data arriving as a continuous flow rather than in discrete batches. In a streaming scenario, data is generated and processed continuously, and results are produced as soon as the data becomes available.

Duplicates in Streaming:
In the context of streaming data, duplicates refer to the occurrence of identical
records or events within the data stream. Duplicate records can arise due to various
reasons, and handling them appropriately is essential for ensuring the accuracy and
reliability of streaming analytics.

Common scenarios leading to duplicates in streaming data include:

Reprocessing or Retrying:

In a distributed and fault-tolerant streaming system, it's possible for a record to be processed more than once. This can happen if there are failures during processing, and the system retries to process the same record.
Network Delays or Glitches:

Network delays or glitches can cause the same record to be transmitted more than
once. The receiving end of the streaming system may interpret these retransmissions
as duplicate records.
Out-of-Order Arrival:

In some cases, records may arrive out of order due to network latency or delays. This
can result in the same record being processed multiple times if the system is not
designed to handle out-of-order arrivals.
Data Source Characteristics:

The characteristics of the data source itself, such as the way data is produced and
transmitted, can contribute to duplicates. For example, in some streaming scenarios,
records may be emitted periodically, leading to the generation of identical records.
Handling Duplicates in Streaming:

Managing duplicates in a streaming environment is crucial for maintaining the integrity of analytics and preventing inaccuracies in results. Several techniques can be employed to handle duplicates:

Deduplication:

Deduplication involves identifying and removing duplicate records from the stream. This can be achieved by maintaining state and checking for duplicates before processing each record; a Structured Streaming sketch appears after this list.
Windowing:

Windowing involves grouping records within a specified time window and processing
them collectively. This can help identify and handle duplicates within the window.
Timestamp-Based Processing:

Processing records based on their timestamp can help identify and discard duplicates
by considering the temporal order of events.
Idempotent Operations:

Designing operations to be idempotent ensures that processing the same record multiple times has the same effect as processing it once. This approach is effective in mitigating the impact of duplicates.
Buffering and Caching:

Buffering and caching records for a short period can help identify and eliminate
duplicates by comparing incoming records with those in the buffer.
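
The deduplication technique mentioned above can be sketched in Structured Streaming as follows (assuming a streaming DataFrame named events with an eventId column and an eventTime timestamp column; the watermark bounds how long old duplicate-tracking state is kept):

// Drop records whose (eventId, eventTime) pair has already been seen within the watermark.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")

deduped.writeStream
  .outputMode("append")
  .format("console")
  .option("checkpointLocation", "/tmp/dedup-checkpoint")
  .start()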
Handling duplicates in streaming is a critical aspect of building robust and reliable
real-time data processing systems. By implementing appropriate strategies and
techniques, organizations can ensure the accuracy of analytics results and maintain
the integrity of their streaming applications.

10b) explain structured streaming in action and transforms on streaming how it is useful in real world 8M

Structured Streaming in action involves processing streaming data using a declarative SQL-like API or the DataFrame/Dataset API in Apache Spark. This approach offers benefits in big data analytics, enabling real-time processing and analysis of large datasets. Let's delve deeper into the application of Structured Streaming and transforms in the context of big data analytics:

Real-time Data Processing:

One of the primary use cases for Structured Streaming is real-time data
processing. With the ability to process data continuously in micro-batches,
organizations can gain insights and make decisions in near real-time. This is
crucial for applications where up-to-date information is essential, such as fraud
detection, monitoring, and alerting systems.
Transformations on Streaming Data:

Structured Streaming allows for various transformations on streaming data, similar to batch processing. This includes filtering, aggregating, joining, and applying custom transformations. These transformations enable analysts and data scientists to extract meaningful information from streaming data and derive valuable insights.
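
For instance, ordinary DataFrame operations apply directly to a stream. This is a short sketch, assuming a streaming DataFrame named readings with device, temperature, and eventTime columns (illustrative names):

import org.apache.spark.sql.functions.col

val hotReadings = readings
  .filter(col("temperature") > 40)                        // filtering
  .withColumn("tempF", col("temperature") * 9 / 5 + 32)   // a derived column
  .select("device", "tempF", "eventTime")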
Windowed Aggregations:

Windowed aggregations are powerful in big data analytics for understanding trends and patterns over specified time intervals. Structured Streaming supports window functions, allowing you to perform aggregations over sliding or tumbling time windows. This is beneficial in applications such as financial analytics, where understanding trends over specific time periods is crucial.
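
A sketch of a sliding-window aggregation with a watermark, assuming the same illustrative readings stream with device and eventTime columns:

import org.apache.spark.sql.functions.{col, window}

val windowedCounts = readings
  .withWatermark("eventTime", "15 minutes")
  .groupBy(
    window(col("eventTime"), "10 minutes", "5 minutes"),  // 10-minute windows sliding every 5 minutes
    col("device"))
  .count()

windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/window-checkpoint")
  .start()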
Integration with Machine Learning:

By integrating Structured Streaming with Spark's machine learning library (MLlib), organizations can perform real-time machine learning on streaming data. This is useful for scenarios like predictive maintenance, where machine learning models can be continuously updated based on incoming data, leading to more accurate predictions.
Dynamic Data Enrichment:

Streaming data often requires enrichment with additional information from external sources. Structured Streaming supports dynamic data enrichment by allowing joins with static datasets or external streaming sources. This capability is valuable in scenarios like real-time recommendation systems where user profiles are dynamically enriched with the latest information.
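
A minimal enrichment sketch, assuming a streaming DataFrame named clicks with a userId column, an existing SparkSession named spark, and a static profile table at an illustrative path:

// Static lookup table read once; the path is only an illustration.
val userProfiles = spark.read.parquet("/data/user_profiles")

// Stream-static join: each incoming click is enriched with the user's profile.
val enriched = clicks.join(userProfiles, Seq("userId"), "left_outer")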
Complex Event Processing (CEP):

CEP involves identifying complex patterns and relationships within streaming data. Structured Streaming provides support for defining and identifying complex events using SQL-like queries, making it easier to express and execute complex event processing logic. This is beneficial in applications such as network monitoring or IoT analytics.
Fault Tolerance and Scalability:

Structured Streaming inherits Spark's fault-tolerance and scalability features. In a big data analytics environment, where data volumes can be massive, the ability to scale horizontally across a cluster of machines ensures that the system can handle the processing demands. Additionally, the fault-tolerance mechanisms in Spark ensure that processing can recover gracefully from failures.
Diverse Output Sinks:

Processed data in Structured Streaming can be written to a variety of output sinks, including databases, data lakes, dashboards, or external systems. This flexibility is crucial in big data analytics where the processed insights may need to be consumed by different downstream applications or stored in different formats.
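
As one hedged sketch of such a sink, foreachBatch (available from Spark 2.4) can write each micro-batch with the ordinary batch JDBC writer; the connection URL, table name, and credentials below are placeholders, and results stands for any streaming DataFrame produced earlier:

import org.apache.spark.sql.DataFrame

// Write each micro-batch to an external JDBC table.
val writeBatchToJdbc: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/analytics")   // placeholder URL
    .option("dbtable", "click_events")                             // placeholder table
    .option("user", "spark")
    .option("password", "secret")
    .mode("append")
    .save()
}

val query = results.writeStream
  .foreachBatch(writeBatchToJdbc)
  .option("checkpointLocation", "/tmp/jdbc-checkpoint")
  .start()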
In summary, Structured Streaming and transforms play a crucial role in big data
analytics by enabling real-time data processing, complex event processing,
dynamic data enrichment, and seamless integration with machine learning. The
combination of fault tolerance, scalability, and diverse output sinks makes
Structured Streaming a valuable tool for organizations dealing with large
volumes of streaming data in their big data analytics workflows.
