UNIT-I
1a) Define big data and explain how it differs from traditional data sets. Discuss the
convergence of key trends that have led to the rise of big data. 8M
1b) Describe the role of unstructured data in big data analytics. Provide an example of how
unstructured data is used in one industry. 7M
OR
2a) What do you mean by linear and non-linear data structures? Specify whether sets come
under linear or non-linear structures and explain the various types of sets supported by Java. 7M
2b) What are the advantages of object serialization in Java? Explain serializing and
deserializing an object with suitable examples. 8M
UNIT-II
3a) Discuss the architecture and data model of Cassandra. How does it differ from other NoSQL
databases? 8M
3b) Describe the process of creating and managing tables in Cassandra. Include an example of
table creation and data manipulation. 7M
OR
4a) Explain the Hadoop Distributed File System architecture with a neat sketch. 8M
4b) Define DataNode. How does the NameNode tackle DataNode failure? 7M
UNIT-III
5a) What is the Hive metastore? Which classes are used by Hive to read and write HDFS
files? 7M
5b) Explain the following
a) Logical joins b) Window functions 8M
OR
6a) Explain how Hive facilitates big data analytics. Discuss its data types, file formats, and
HiveQL. 8M
6b) How can we install Apache Hive on a system? Explain. 7M
UNIT-IV
7a) List and explain the important features of Hadoop. 7M
7b) Explain the architecture of the building blocks of Hadoop. 8M
OR
8a) Draw a neat diagram and explain the components of the Apache Hive architecture. 8M
8b) Describe how Spark handles DataFrames and complex data types. Include
an example of working with JSON data in Spark. 7M
UNIT-V
9a) Explain event time and stateful processing. 7M
9b) Discuss structured streaming. 8M
OR
10a) Define streaming and explain duplicates in a stream. 8M
10b) Explain structured streaming in action and transformations on streams. How is it useful
in the real world? 7M
I M.TECH. I Semester Regular Examination (MR23), March 2024
ADVANCED DATA STRUCTURES AND ALGORITHMS
(COMPUTER SCIENCE AND ENGINEERING)
UNIT-I
1a) Define big data and explain how it differs from traditional data sets. Discuss the
convergence of key trends that have led to the rise of big data. 8M
Answer:
Big data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing tools or methods.
The term is often characterized by the three Vs: Volume (the sheer size of the data),
Velocity (the speed at which data is generated and processed), and Variety (the
diverse types of data, including structured, semi-structured, and unstructured).
Here's a breakdown of the differences between big data and traditional data sets:
Volume:
Big Data: Involves datasets that are too large to be processed and analyzed by
traditional database systems. These datasets can range from terabytes to petabytes
and beyond.
Traditional Data: Typically involves smaller datasets that can be easily handled by
conventional database systems.
Velocity:
Big Data: Refers to the speed at which data is generated, collected, and processed.
This is crucial for real-time analytics and decision-making.
Traditional Data: Usually involves data that is generated and processed at a slower
pace compared to big data environments.
Variety:
Big Data: Encompasses a wide range of data types, including structured data (e.g.,
relational databases), semi-structured data (e.g., XML, JSON), and unstructured data
(e.g., text, images, videos).
Traditional Data: Primarily deals with structured data that fits neatly into tables and
relational databases.
Convergence of Key Trends:
The convergence of key trends has played a significant role in the rise of big data.
Several factors have contributed to this convergence:
Advanced Analytics and Machine Learning: The increasing demand for advanced
analytics and machine learning applications has driven the need for large and
diverse datasets to train and improve models.
Proliferation of Data Sources: Smartphones, social media, sensors, and IoT devices
generate data continuously and at unprecedented scale.
Cheaper Storage and Cloud Computing: Falling storage costs and elastic cloud or
commodity-cluster computing make it economical to retain and process massive datasets.
Open-Source Distributed Frameworks: Technologies such as Hadoop and Spark have made
large-scale distributed storage and processing widely accessible.
Together, these trends converged to make collecting, storing, and analyzing data at big data
scale both technically feasible and economically attractive.
1b) Describe the role of unstructured data in big data analytics. Provide an example of how
unstructured data is used in one industry. 7M
Answer:
Any form of data that has no proper structure or an unknown form is unstructured data. This
type of data is challenging to derive valuable insights from because of the raw nature of the
data.
Unstructured data refers to the data that lacks any specific form or structure whatsoever.
This makes it very difficult and time-consuming to process and analyze unstructured data.
Email is an example of unstructured data. Structured and unstructured are two important
types of big data.
Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of rules.
Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered unstructured data.
Even though the metadata accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
Additionally, Unstructured data is also known as “dark data” because it cannot be analyzed
without the proper software tools.
Role of Unstructured Data in Big Data Analytics:
Rich Information Source: Unstructured data contains rich information that is often
contextually significant. Textual data, for example, can include sentiments, opinions, and
nuances that are important for understanding customer feedback, social trends, or market
sentiment.
Diversity and Variety: Unstructured data contributes to the variety aspect of big data. It
encompasses a wide range of data formats, including audio, video, and text. This
diversity allows organizations to gain a more comprehensive view of their operations,
customers, and market conditions.
Real-world Context: Unstructured data often reflects real-world scenarios and human
interactions, providing a more holistic and realistic representation of the situations being
analyzed. This context is valuable for decision-making and gaining deeper insights.
Industry example: In healthcare, unstructured clinical notes, radiology images, and discharge
summaries are analyzed with natural language processing and image analytics to support
diagnosis, identify at-risk patients, and improve treatment decisions.
OR
2a) What do you mean by linear and non-linear data structures? Specify whether sets come
under linear or non-linear structures and explain the various types of sets supported by Java. 7M
Answer:
Linear data structures (arrays, linked lists, stacks, queues) arrange elements sequentially,
whereas non-linear data structures (trees, graphs) do not. A set stores unique elements without
any positional ordering; since its elements are not arranged sequentially, it is usually classified
as a non-linear structure, and its common implementations are backed by hash tables or
balanced trees. Java's collections framework provides several set types, including HashSet
(hash-table based, no ordering guarantees), LinkedHashSet (preserves insertion order), and
TreeSet (keeps elements in sorted order), along with the following specialized types:
EnumSet:
Specialized implementation for sets where elements are enum constants.
Highly efficient and compact representation of enum values.
BitSet:
Represents a set of bits or flags.
Used for efficient manipulation of sets of flags or boolean values.
Note that java.util.BitSet is a specialized bit-vector class and does not itself implement the
Set interface. A short sketch of these set types is shown below.
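A minimal, illustrative Java sketch of the common set types (the values used are arbitrary):
import java.util.*;

public class SetTypesDemo {
    enum Day { MON, TUE, WED }

    public static void main(String[] args) {
        Set<String> hashSet = new HashSet<>(Arrays.asList("spark", "hive", "spark")); // duplicate ignored
        Set<String> treeSet = new TreeSet<>(hashSet);             // kept in sorted order
        Set<String> linkedSet = new LinkedHashSet<>(hashSet);     // keeps insertion order
        EnumSet<Day> days = EnumSet.of(Day.MON, Day.WED);         // compact set of enum constants
        BitSet flags = new BitSet();                              // bit vector of boolean flags
        flags.set(3);
        System.out.println(hashSet + " " + treeSet + " " + linkedSet + " " + days + " " + flags);
    }
}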
Usage in Big Data Analytics:
In big data analytics, sets can be used to handle unique identifiers, filter out duplicate
records, and manage distinct values efficiently.
The choice of a specific set implementation depends on the requirements of the analytics
process, such as the need for ordering, fast membership checks, or memory efficiency.
While sets themselves may not be directly tied to the linear or non-linear distinction, the
algorithms and data structures used in big data analytics (e.g., graphs, trees) may exhibit
characteristics of linear or non-linear organization depending on the nature of the data and
the analytical tasks at hand.
2b) What are the advantages of object serialization in Java? Explain serializing and
deserializing an object with suitable examples. 8M
In big data analytics, object serialization in Java continues to offer advantages similar to those in general
programming. However, its application in big data scenarios often revolves around
distributed computing environments, data storage, and data interchange between different
systems. Here are the advantages and an example of serializing and deserializing objects in
the context of big data analytics:
Advantages of Object Serialization in Big Data Analytics:
Distributed Data Processing:
Big data analytics often involve distributed computing environments where data is
processed across multiple nodes. Serialization facilitates the efficient transmission of
objects between nodes, enabling seamless communication and collaboration.
Data Storage:
Serialized objects can be stored in various data storage solutions, including
distributed file systems like Hadoop Distributed File System (HDFS) or cloud storage.
This allows for the preservation of complex data structures and their states.
Interoperability:
In big data analytics, systems may use different programming languages or frameworks.
Object serialization provides a standardized format for data representation, promoting
interoperability between different technologies.
Efficient Data Transfer:
Serialized objects can be transmitted over networks more efficiently than raw, unstructured
data. This efficiency is crucial in scenarios where large volumes of data need to be
transferred between nodes in a distributed environment.
State Preservation:
Serialization enables the preservation of object states, which is valuable in applications
where the state of an object is critical for analysis. This is particularly relevant when
dealing with complex data structures and machine learning models.
Example: Serializing and Deserializing Objects in a Big Data Context:
Consider a scenario where you have a complex data structure representing a machine learning
model, and you want to distribute this model across a cluster for parallel processing. A sketch
of serializing and deserializing such an object:
import java.io.*;
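// A minimal, illustrative sketch (the ModelState class and file name below are assumptions,
// not from the original answer): a serializable model object is written to disk and read back.
class ModelState implements Serializable {
    private static final long serialVersionUID = 1L;
    String modelName;
    double[] weights;
    ModelState(String modelName, double[] weights) {
        this.modelName = modelName;
        this.weights = weights;
    }
}

public class SerializationDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ModelState model = new ModelState("regression-v1", new double[]{0.4, 1.7, 2.3});

        // Serialize: write the object's state as a byte stream to a file
        // (or to HDFS / a socket in a distributed setting).
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
            out.writeObject(model);
        }

        // Deserialize: reconstruct the object from the byte stream, e.g. on another node.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("model.ser"))) {
            ModelState restored = (ModelState) in.readObject();
            System.out.println("Restored model: " + restored.modelName);
        }
    }
}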
UNIT-II
3a) Discuss the architecture and data model of Cassandra. How does it differ from other NoSQL
databases? 8M
Answer:
Cassandra Architecture:
Data Partitioning:
Data is distributed across nodes using a partitioning strategy (e.g., random, Murmur3).
Each node is responsible for a range of data, and a consistent hashing mechanism helps
route requests to the appropriate node.
Gossip Protocol:
Nodes communicate with each other using a gossip protocol to share information about
the cluster's state.
Gossip ensures that every node eventually learns about changes in the cluster, supporting
decentralized coordination.
Replication:
Cassandra replicates data across multiple nodes to ensure fault tolerance and high
availability.
Replication factor and strategy are configurable, allowing users to define how many copies
of data are stored and on which nodes.
Data Model:
Cassandra's data model is based on column-family, resembling a tabular structure.
Data is organized into keyspaces (similar to databases in traditional systems), which
contain column families (analogous to tables).
Commit Log and Memtable:
Write operations are first recorded in a commit log for durability.
Data is then written to an in-memory structure called a memtable, providing fast write
performance.
SSTables and Compaction:
Data from memtables is periodically flushed to on-disk storage in SSTables (Sorted String
Tables).
Compaction processes merge and discard obsolete data from SSTables to maintain
efficiency.
Read Path:
Cassandra supports fast read operations by using an index to locate data efficiently.
Read requests can be served from memory, SSTables, or a combination of both.
Cassandra Data Model:
Keyspace:
A keyspace is the outermost container for data, analogous to a database in relational systems;
it defines the replication factor and strategy for the column families (tables) it contains.
Column Families (Tables):
The wide-column data model allows for flexible schema design, accommodating the
evolving nature of big data analytics requirements.
In big data analytics, where distributed computing, scalability, and fault tolerance are
critical, Cassandra's architecture and data model make it a suitable choice for applications
that demand high write throughput, fast query performance, and continuous availability.
It is often used in conjunction with other big data technologies to build robust and scalable
data processing pipelines.
3b) Describe the process of creating and managing tables in Cassandra. Include an example of
table creation and data manipulation. 7M
Creating and managing tables in Cassandra involves defining keyspaces, specifying replication
strategies, creating column families (tables), and performing data manipulation using the Cassandra
Query Language (CQL). Below is a step-by-step guide along with an example of table creation and
data manipulation in Cassandra.
1. Connect to Cassandra:
Use a CQL shell (cqlsh) or connect programmatically to a Cassandra cluster.
2. Create a Keyspace:
A keyspace is the top-level container for tables in Cassandra. It defines the replication strategy and
other configuration options.
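The intermediate steps (roughly steps 3-5 of the original numbering: creating a table and
inserting data) might look like the following sketch; the keyspace name, table definition, and
sample values are assumptions chosen to match the users table queried below.
CREATE KEYSPACE analytics
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE analytics;

-- 3. Create a table (column family):
CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  name text,
  email text,
  age int
);

-- 4-5. Insert data:
INSERT INTO users (user_id, name, email, age)
VALUES (uuid(), 'Alice', 'alice@example.com', 25);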
6. Query Data:
Retrieve data from the table.
SELECT * FROM users;
7. Update Data
Modify existing records.
UPDATE users SET age = 26 WHERE user_id = <UUID>;
8. Delete Data:
Remove records from the table.
DELETE FROM users WHERE user_id = <UUID>;
OR
4a) Explain the Hadoop Distributed File System architecture with a neat sketch. 8M
Architecture of Hadoop distributed file system (HDFS):
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and
designed to run on low-cost hardware.
HDFS holds very large amount of data and provides easier access. To store such huge
data, the files are stored across multiple machines. These files are stored in redundant
fashion to rescue the system from possible data losses in case of failure. HDFS also
makes applications available to parallel processing.
Features of HDFS
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware. The
system having the namenode acts as the master server and it does the following tasks −
• Manages the file system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode.
These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Fault Tolerance:
HDFS maintains the replication factor by creating replicas of data on other available
machines in the cluster if one machine suddenly fails.
Suppose there is user data named FILE. This FILE is divided into blocks, say B1, B2, and
B3, and sent to the master. The master sends these blocks to the slaves, say S1, S2, and S3.
The slaves then create replicas of these blocks on the other slaves present in the cluster, say
S4, S5, and S6, so that multiple copies of each block exist. For example, S1 holds B1 and B2,
S2 holds B2 and B3, S3 holds B3 and B1, S4 holds B2 and B3, S5 holds B3 and B1, and S6
holds B1 and B2. Now, if slave S4 crashes for some reason, the blocks B2 and B3 stored on
it become unavailable, but we do not lose data because B2 and B3 can still be obtained from
another slave such as S2. Hence, even under unfavourable conditions the data is not lost, and
HDFS is highly fault tolerant.
Data Replication:
● HDFS is designed to store very large files across machines in a large cluster.
● All blocks in the file except the last are of the same size.
Replica Placement:
● When the replication factor is 3, HDFS places one replica on the local node, the second
on a node in a different (remote) rack, and the third on a different node in that same
remote rack. In effect, one third of the replicas are on one node, two thirds are on one
rack, and the rest are distributed evenly across the remaining racks.
Replica Selection:
● To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a
read request from the replica that is closest to the reader.
● An HDFS cluster may span multiple data centers; a replica in the local data center is
preferred over a remote one.
High availability
The loss of NameNodes can crash the cluster in both Hadoop 1.x as well as
Hadoop 2.x. In Hadoop 1.x, there was no easy way to recover, whereas Hadoop
2.x introduced high availability (active-passive setup) to help recover from
NameNode failures.
4b) Define DataNode. How does the NameNode tackle DataNode failure? 7M
DataNode in Hadoop Distributed File System (HDFS):
In Hadoop Distributed File System (HDFS), a DataNode is a component that runs on each
individual machine (node) in the Hadoop cluster. Its primary responsibility is to store and
manage the actual data blocks that make up the files stored in HDFS. The DataNode is
responsible for performing read and write operations, as well as managing the replication of data
blocks across the cluster for fault tolerance.
Key responsibilities of a DataNode include:
Storage: Storing and managing data blocks on the local file system of the node it resides on.
Replication: Replicating data blocks to other DataNodes to ensure fault tolerance and data
availability. The default replication factor is three, meaning each block is stored on three
different DataNodes.
Heartbeat and Block Report: Periodically sending heartbeat signals to the NameNode to
confirm its availability. Additionally, sending block reports to provide information about the
list of blocks it is currently storing.
NameNode's Handling of DataNode Failure:
Since the NameNode manages the metadata and namespace of the HDFS, it needs to be
aware of the health and status of each DataNode in the cluster. When a DataNode fails or
becomes unreachable, the NameNode employs several mechanisms to handle the situation:
Heartbeat Monitoring:
The NameNode expects regular heartbeat signals from each DataNode.
If the NameNode does not receive a heartbeat within a specified time period, it marks the
DataNode as dead or unreliable.
Block Report:DataNodes periodically send block reports to the NameNode, listing all the
blocks they are currently storing.
The NameNode uses this information to track the health and status of each block and identify
missing or corrupt blocks.
Replication Factor Maintenance:
If a DataNode fails or is marked as unreliable, the NameNode takes corrective actions to
maintain the desired replication factor for each block.
It schedules the replication of the missing or under-replicated blocks to other healthy
DataNodes.
Decommissioning:
If a DataNode is identified as consistently failing or unreliable, it may be decommissioned by
the administrator.
Decommissioning involves removing the node from the active set of DataNodes, preventing
it from receiving new blocks. Existing blocks are replicated to other nodes.
Block Re-replication:
The NameNode continuously monitors the replication factor of each block.
If the replication factor falls below the desired value due to DataNode failures, the
NameNode triggers re-replication to ensure fault tolerance.
By actively monitoring heartbeats, block reports, and taking corrective actions, the
NameNode ensures the health and availability of the DataNodes in the HDFS cluster. This
proactive approach helps maintain fault tolerance and data reliability in the face of individual
DataNode failures in a distributed environment.
UNIT –III
5a) What is the Hive metastore? Which classes are used by Hive to read and write HDFS
files? 7M
Hive Metastore in Big Data Analytics:
In big data analytics, the Hive Metastore is a critical component that facilitates metadata
management for large-scale data processing using Apache Hive. It allows users to define
and manage tables, databases, and associated metadata in a relational database, enabling
efficient querying and processing of data stored in Hadoop Distributed File System
(HDFS) or other storage systems. The separation of metadata from data storage is essential
for scalability and better integration with various data processing tools and frameworks.
The Hive Metastore is responsible for managing metadata related to Hive tables, including
their schemas, partitions, and storage location. It stores this metadata in a relational
database and allows Hive to decouple the storage of metadata from the data itself,
facilitating metadata management and enabling better integration with other tools.
Key functions of the Hive Metastore include:
Storing metadata about Hive tables, databases, partitions, and column statistics.
Managing the schema and structure of Hive tables.
Facilitating access to metadata for query planning and execution.
Enabling compatibility with various storage formats and systems.
Classes used by Hive to read and write HDFS files:
TextInputFormat / HiveIgnoreKeyTextOutputFormat: read and write data in plain text file format.
SequenceFileInputFormat / SequenceFileOutputFormat: read and write data in the Hadoop
SequenceFile format.
The file format classes configured for a table are also used to infer the file format when the
format is not explicitly specified.
HiveMetaStoreClient:
The org.apache.hadoop.hive.metastore.HiveMetaStoreClient class is used to interact with the
Hive Metastore service programmatically.
It provides methods for managing metadata, including creating and altering tables, databases,
and partitions.
In big data analytics workflows, these classes play a crucial role in managing the interaction
between Hive and HDFS. They handle the intricacies of reading and writing data in different
formats, enabling efficient data processing, querying, and analysis across large-scale
distributed datasets. The Hive Metastore ensures that metadata about tables and databases is
well-managed, providing a foundation for the integration of Hive with other big data tools
and analytics frameworks.
5b) Explain the following: a) Logical joins b) Window functions. 8M
Answer:
Logical joins:
Hive Join:
A join in HiveQL combines records from two (or more) tables based on a common column,
using the JOIN clause of the SELECT statement.
Hive supports the following types of logical joins:
• JOIN (inner join): returns only the rows that have matching values in both tables.
• LEFT OUTER JOIN: returns all rows from the left table; right-table columns are NULL
where there is no match (sample output is shown in Table.4 – Left Outer Join in Hive).
• RIGHT OUTER JOIN: returns all rows from the right table; left-table columns are NULL
where there is no match.
• FULL OUTER JOIN: returns all rows from both tables, with NULLs wherever either side
has no match (sample output is shown in Table.6 – Full Outer Join in Hive).
A sample join query is sketched below.
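A minimal, illustrative query (the customer and order table and column names are assumptions,
not from the original answer):
select c.id, c.name, o.amount
from customers c
left outer join orders o on (c.id = o.customer_id);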
Window functions:
Windowing Functions in Hive
Windowing functions compute a value for each row over a group of rows (a window)
defined with the OVER clause. Windowing in Hive includes the following functions:
• Lead
The number of rows to lead can optionally be specified. If the number of rows to
lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the
window
• Lag
The number of rows to lag can optionally be specified. If the number of rows to lag
is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the
window.
• FIRST_VALUE
• LAST_VALUE
The aggregate functions COUNT, SUM, MIN, MAX, and AVG can also be used as windowing
functions with the OVER clause:
• OVER with a PARTITION BY statement with one or more partitioning columns.
• OVER with PARTITION BY and ORDER BY with one or more partitioning and/or
ordering columns.
Analytics functions:
• RANK
• ROW_NUMBER
• DENSE_RANK
• CUME_DIST
• PERCENT_RANK
• NTILE
To give you a brief idea of these windowing functions in Hive, we will be using stock
market data. Load the sample stocks data into a stocks table; a possible definition of that
table is sketched below.
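A plausible definition of the stocks table used in the examples (the column names are inferred
from the queries below and assume an acadgild database already exists):
create table if not exists acadgild.stocks (
  ticker string,
  date_ string,
  open double,
  high double,
  low double,
  close double,
  volume_for_the_day bigint
)
row format delimited fields terminated by ',';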
Let us dive deeper into the window functions in Hive.
Lag
This function returns the values of the previous row. You can specify an integer offset
which designates the row position else it will take the default integer offset as 1.
Lag is to be used with the OVER function; inside the OVER clause you can use PARTITION BY
or ORDER BY. Using lag, we can display the previous day's closing price of the ticker next to
the current day's closing price, as in the following sample query.
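A representative query (reusing the ticker, date_, and close columns of the stocks table above):
select ticker, date_, close,
       lag(close, 1) over (partition by ticker order by date_) as yesterday_price
from acadgild.stocks;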
Lead
This function returns the values from the following rows. You can specify an integer offset
which designates the row position else it will take the default integer offset as 1.
Using the lead function, we can find whether the following day's closing price is higher or
lower than today's, as in the following sample query.
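An illustrative query along the same lines:
select ticker, date_, close,
       lead(close, 1) over (partition by ticker order by date_) as next_day_price
from acadgild.stocks;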
FIRST_VALUE
It returns the value of the first row from that window. With the below query, you can see the
first-row high price of the ticker for all the days.
select ticker,first_value(high) over(partition by ticker) as first_high from acadgild.stocks
LAST_VALUE
It is the reverse of FIRST_VALUE. It returns the value of the last row from that
window. With the below query, you can see the last row high price value of the ticker for
all the days.
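A query of the same shape (assumed, mirroring the FIRST_VALUE example above):
select ticker, last_value(high) over (partition by ticker) as last_high from acadgild.stocks;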
Let us now see the usage of the aggregate function using Over.
Count
It returns the count of all the values for the expression written in the over clause. From
the below query, we can find the number of rows present for each ticker.
select ticker,count(ticker) over(partition by ticker) as cnt from acadgild.stocks
For each partition, the count of ticker will be calculated, you can see the same in
the below screen shot.
Sum
It returns the sum of all the values for the expression written in the over clause. From the
below query, we can find the sum of all the closing stock prices for that particular ticker.
For each ticker, the sum of all the closing prices will be calculated, you can see the same
in the below screen shot.
Suppose you want to get a running total of volume_for_the_day across all the days for every
ticker; you can do this with the query sketched below.
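An illustrative running-total query (column names as used elsewhere in this answer):
select ticker, date_, volume_for_the_day,
       sum(volume_for_the_day) over (partition by ticker order by date_) as running_total
from acadgild.stocks;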
Now let's take a scenario where you need to find the percentage of volume_for_the_day
out of the total volume for that particular ticker; that can be done as follows.
select ticker,date_,volume_for_the_day,(volume_for_the_day*100/(sum(volume_for_the_day)
over(partition by ticker))) from acadgild.stocks
In the above screenshot, you can see that the percentage contribution of the
volumes for the day is found based on the total volume for that ticker.
Min
It returns the minimum value of the column for the rows in that over clause. From
the below query, we can find the minimum closing stock price for each particular
ticker.
Max
It returns the maximum value of the column for the rows in that over clause. From
the below query, we can find the maximum closing stock price for each particular
ticker.
Avg
It returns the average value of the column for the rows that the over clause returns.
From the below query, we can find the average closing stock price for each particular
ticker.
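One illustrative query covering the Min, Max, and Avg cases together (column names as above):
select ticker,
       min(close) over (partition by ticker) as min_close,
       max(close) over (partition by ticker) as max_close,
       avg(close) over (partition by ticker) as avg_close
from acadgild.stocks;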
Rank
The rank function returns the rank of the values as per the result set of the over clause.
If two values are the same, they get the same rank, and the subsequent rank is skipped
for the next value.
The below query will rank the closing prices of the stock for each ticker. The same
you can see in the below screenshot.
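A sample ranking query (assumed, using the same stocks table):
select ticker, close,
       rank() over (partition by ticker order by close) as closing_rank
from acadgild.stocks;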
Dense_rank
It is the same as the rank() function, but the difference is that if a duplicate value is present,
the rank is not skipped for the subsequent rows; each unique value gets a rank in sequence.
The below query will rank the closing prices of the stock for each ticker. The same
you can see in the below screenshot.
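The corresponding dense_rank variant of the query above:
select ticker, close,
       dense_rank() over (partition by ticker order by close) as closing_rank
from acadgild.stocks;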
Ntile
It distributes the rows of an ordered partition into a specified number of roughly equal
buckets and returns the bucket number of each row. For example,
ntile(4) over (partition by ticker order by close) assigns each closing price to one of four quartiles.
OR
6a) Explain how Hive facilitates big data analytics. Discuss its data types, file formats, and
HiveQL. 8M
What is Hive?
Hive is a data warehouse infrastructure tool built on top of Hadoop. It facilitates big data
analytics by letting users summarize, query, and analyze very large datasets stored in HDFS
through an SQL-like language called HiveQL, which Hive translates into MapReduce jobs;
it is designed to handle data volumes of terabytes and greater.
Hive data types:
Primitive numeric data types (type – size in bytes – example value):
• TINYINT – 1 – 20
• SMALLINT – 2 – 20
• INT – 4 – 20
• BIGINT – 8 – 20
• FLOAT – 4 – 10.2222
(For example, Java's int is used for implementing the Hive INT data type.)
Date/timestamp conversions (casting):
• cast(timestamp as date): the year/month/day of the timestamp is determined, based on the
local timezone, and returned as a date value.
• cast(string as date): if the string is in the form 'YYYY-MM-DD', a date value corresponding
to that year/month/day is returned; if the string does not match this format, NULL is returned.
• cast(date as timestamp): a timestamp value is generated corresponding to midnight of the
year/month/day of the date value, based on the local timezone.
• cast(date as string): the year/month/day represented by the date is formatted as a string in
the form 'YYYY-MM-DD'.
Complex (collection) data types:
• ARRAY
• MAP
• STRUCT
• UNIONTYPE
For a column D of type STRUCT {Y INT, Z INT}, the field Y can be retrieved with the
expression D.Y.
1. TABLE CREATION
Code:
create table store_complex_type (
emp_id int,
name string,
local_address STRUCT<street:string,
city:string,country:string,zipcode:bigint>,
country_address MAP<STRING,STRING>,
job_history array<STRING>)
row format delimited fields terminated by ',';
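An assumed query showing how the complex columns of store_complex_type can be accessed
(the map key and array index below are illustrative):
select name,
       local_address.city as city,            -- STRUCT field access
       country_address['USA'] as usa_address, -- MAP lookup by key
       job_history[0] as first_job            -- ARRAY element by index
from store_complex_type;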
FILE FORMATS:
Below are the Hive data file formats most commonly used in the Hadoop ecosystem:
• TEXTFILE
• SEQUENCEFILE
• RCFILE
• ORC
• AVRO
• PARQUET
TEXTFILE
• The simplest data format to use, with whatever delimiters you
prefer.
• Also a default format, equivalent to creating a table with the
clause STORED AS TEXTFILE.
• Can be shared data with other Hadoop related tools, such as
Pig.
• Can also be used within Unix text tools like grep, sed, and
awk, etc.
• Also convenient for viewing or editing files manually.
• It is not space-efficient compared to binary formats, which offer
other advantages over TEXTFILE beyond simplicity.
SEQUENCEFILE
• A first alternative to the Hive default file format.
• Can be specified using the "STORED AS SEQUENCEFILE" clause
during table creation.
• Files are in flat files structure consisting of binary key-value
pairs.
• In a runtime Hive queries processed into MapReduce jobs,
during which records are assigned/generated with the
appropriate key-value pairs.
• It is a standard format supported by Hadoop itself, thus
becomes native or acceptable while sharing files between
Hive and other Hadoop-related tools.
• It’s less suitable for use with tools outside the Hadoop
ecosystem.
• When needed, the sequence files can be compressed at the
block and record level, which is very useful for optimizing disk
space utilization and I/O, while still supporting the ability to
split files on block for parallel processing.
RCFILE
• RCFile = Record Columnar File
• An efficient internal (binary) hive format and natively
supported by Hive.
• Used when Column-oriented organization is a good storage
option for certain types of data and applications.
• If data is stored by column instead of by row, then only the
data for the desired columns has to be read; this in turn
improves performance.
• Makes columns compression very efficient, especially for low
cardinality columns.
• Also, some column-oriented stores do not physically need to
store null columns.
• Helps storing columns of a table in a record columnar way.
• It first partitions rows horizontally into row splits and then it
vertically partitions each row split in a columnar way.
• It first stores the metadata of a row split, as the key part of a
record, and all the data of a row split as value part.
• The rcfilecat tool can be used to display the contents of RCFiles
from the Hive command line, since RCFiles cannot be viewed with
simple editors.
ORC
• ORC = Optimized Row Columnar
• Designed to overcome limitations of other Hive file formats
and has highly efficient way to store Hive data.
• Stores data as groups of row data called stripes, along with
auxiliary information in a file footer.
• Holds compression parameters and the size of the compressed
footer in a postscript section at the end of the file.
AVRO
• A row-oriented binary format that stores the schema together with
the data, which makes it compact, self-describing, and well suited
to schema evolution and data exchange between Hadoop tools.
6b) How can we install Apache Hive on a system? Explain. 7M
Answer:
Step 1: Verifying Java installation
Java must be installed on your system before installing Hive. Verify
the Java installation with the following command:
$ java -version
If Java is already installed on your system, you get to see the
following response:
If java is not installed in your system, then follow the steps given
below for installing java.
Installing Java
Step I:
Download Java (JDK 7u71, the jdk-7u71-linux-x64.gz archive used in
the steps below) from the Oracle downloads page.
Step II:
Generally you will find the downloaded java file in the Downloads
folder. Verify it and extract the jdk-7u71-linux-x64.gz file using
the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Use the following commands to configure the Java alternatives:
# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2
Now verify the installation using the command java -version from
the terminal as explained above.
$ hadoop version
Downloading Hadoop
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
Installing Hadoop in Pseudo Distributed Mode
Step I: Setting up Hadoop
You can set the Hadoop environment variables by appending the
following commands to the ~/.bashrc file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export
HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step II: Hadoop Configuration
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable
changes in those configuration files according to your Hadoop
infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In hadoop-env.sh, reset the JAVA_HOME value to the location of Java
on your system:
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below are the list of files that you have to edit to configure
Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number
used for the Hadoop instance, memory allocated for the file system,
and the size of read/write buffers. Open this file and add the
following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of
replication data, the namenode path, and the datanode path of
your local file systems. It means the place where you want to
store the Hadoop infra.
Open this file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode
</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode
</value >
</property>
</configuration>
Note: In the above file, all the property values are user-defined
and you can make changes according to your Hadoop
infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-
site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using.
By default, Hadoop contains a template of mapred-site.xml. First,
copy the file from the template using the following command, and
then add the following properties in between the <configuration>,
</configuration> tags in this file.
$ cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying Hadoop Installation
Step I: Name Node setup
$ cd ~
$ hdfs namenode -format
Step II: Verifying Hadoop dfs
$ start-dfs.sh
Step III: Verifying the YARN script
$ start-yarn.sh
Step IV: Accessing Hadoop on the browser
The default port number to access Hadoop is 50070. Use the following
URL:
http://localhost:50070/
Step V: Verify all applications for cluster
The default port number to access all applications of the cluster is
8088. Use the following URL:
http://localhost:8088/
Downloading and extracting Hive
Download apache-hive-0.14.0-bin.tar.gz and extract it; the following
commands verify the download and the extracted directory:
$ cd Downloads
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
Copying files to /usr/local/hive directory
We need to copy the files from the super user “su -”. The
following commands are used to copy the files from the extracted
directory to the /usr/local/hive” directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
Setting up environment for Hive
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
Configuring Hive
To configure Hive with Hadoop, copy the hive-env.sh template file in
the conf directory and append the HADOOP_HOME line to it:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/usr/local/hadoop
Downloading Apache Derby
Apache Derby is used as the database for the Hive metastore in this
setup. The following commands download it:
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin.tar.gz
Extracting and verifying Derby archive
The following commands are used for extracting and verifying the
Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying files to /usr/local/derby directory
We need to copy from the super user “su -”. The following
commands are used to copy the files from the extracted directory
to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
Setting up environment for Derby
You can set up the Derby environment by appending the following
lines to the ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ source ~/.bashrc
Create a directory to store the Metastore data:
$ mkdir $DERBY_HOME/data
Configuring the Metastore of Hive
Configuring the Metastore means specifying to Hive where the
database is stored. Copy hive-default.xml.template to hive-site.xml
in the conf directory, then append the following lines between the
<configuration> and </configuration> tags of hive-site.xml:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=
true </value>
<description>JDBC connect string for a JDBC
metastore </description>
</property>
Create a file named jpox.properties and add the following lines to it:
javax.jdo.PersistenceManagerFactoryClass =
org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName =
org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL =
jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Step 8: Verifying Hive Installation
Before running Hive, you need to create the /tmp folder and a
separate Hive folder in HDFS. Here, we use
the /user/hive/warehouse folder. You need to set write permission
for these newly created folders as shown below:
chmod g+w
Now set them in HDFS before verifying Hive. Use the following
commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify the Hive installation:
$ cd $HIVE_HOME
$ bin/hive
UNIT-IV
7b) Explain the architecture of the building blocks of Hadoop. 8M
Answer:
The Hadoop architecture consists of four major building blocks:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
An input (a set of data blocks) is provided to the Map() function. The Map() function
breaks these data blocks into tuples, which are nothing but key-value pairs. These
key-value pairs are then sent as input to the Reduce() function. The Reduce() function
combines the tuples based on their key value, performs operations such as sorting or
summation, and forms a smaller set of tuples, which is sent to the final output node to
obtain the final output.
Map Task:
The map task reads the input split through a RecordReader, applies the user-defined
Map() function to each record, optionally combines the intermediate output locally, and
partitions it for the reducers.
Reduce Task:
Shuffle and Sort: The Task of Reducer starts with this step, the process in
which the Mapper generates the intermediate key-value and transfers
them to the Reducer task is known as Shuffling. Using the Shuffling
process the system can sort the data using its key value.
Shuffling begins as soon as some of the map tasks are done, which makes the overall
process faster because it does not wait for all map tasks to complete.
Reduce: The main function or task of the Reduce is to gather the Tuple
generated from Map and then perform some sorting and aggregation sort
of process on those key-value depending on its key element.
OutputFormat: Once all the operations are performed, the key-value pairs
are written into the file with the help of record writer, each record in a new
line, and the key and value in a space-separated manner.
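The Map() and Reduce() phases described above can be illustrated with the classic word-count
job; this is a minimal sketch (class names and input/output paths are illustrative, not part of
the original answer):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map(): breaks each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reduce(): after shuffle and sort, sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}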
2. HDFS
NameNode(Master)
DataNode(Slave)
Metadata includes the name of the file, its size, and information about the location
(block number, block IDs) of the DataNodes; the NameNode stores this metadata to find
the closest DataNode for faster communication. The NameNode instructs the DataNodes
with operations such as delete, create, and replicate.
File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is
divided into multiple blocks of 128 MB each (the default size, which can also be changed
manually).
Note that making so many replicas of the file blocks consumes extra storage, but for
most organizations the data is far more important than the storage cost. You can
configure the replication factor in the hdfs-site.xml file.
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster
(typically 30 to 40). A large Hadoop cluster consists of many racks. With the help of this
rack information, the NameNode chooses the closest DataNode, achieving maximum
performance for read/write operations while reducing network traffic.
HDFS Architecture
3. YARN(Yet Another Resource Negotiator)
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop common or Common Utilities
Hadoop Common (common utilities) is the set of Java libraries and scripts required by
all the other components in a Hadoop cluster. These utilities are used by HDFS, YARN,
and MapReduce to run the cluster. Hadoop Common assumes that hardware failure in a
Hadoop cluster is common, so failures need to be handled automatically in software by
the Hadoop framework.
OR
8a) Draw a neat diagram and explain the components of the Apache Hive
architecture. 8M
Hive Client:
Hive drivers support applications written in any language like Python, Java,
C++, and Ruby, among others, using JDBC, ODBC, and Thrift drivers, to
perform queries on the Hive. Therefore, one may design a hive client in any
language of their choice.
1. Thrift Clients: The Hive server can handle requests from a thrift client
by using Apache Thrift.
2. JDBC client: A JDBC driver connects to Hive using the Thrift
framework. Hive Server communicates with the Java applications using
the JDBC driver.
3. ODBC client: The Hive ODBC driver is similar to the JDBC driver in
that it uses Thrift to connect to Hive; however, applications connect
through the ODBC API rather than JDBC.
Hive Services:
Hive Server: HiveServer1 could not handle concurrent requests from more than one
client, so it was replaced by HiveServer2, which supports multi-client concurrency
and authentication.
Hive Driver: The Hive driver receives the HiveQL statements submitted by
the user through the command shell and creates session handles for the
query.
Hive Compiler: Metastore and hive compiler both store metadata in order to
support the semantic analysis and type checking performed on the different
query blocks and query expressions by the hive compiler. The execution plan
generated by the hive compiler is based on the parse results.
The execution plan is a DAG (Directed Acyclic Graph) created by the compiler, in which
each step is a map/reduce job, an operation on HDFS file metadata, or a data
manipulation step.
Optimizer: The optimizer splits the execution plan before performing the
transformation operations so that efficiency and scalability are improved.
Metastore: The metastore stores metadata about tables and partitions, including
information about the serializer and deserializer (SerDe) and the HDFS files where the
data is stored. It is usually a relational database. Hive metadata can be queried and
modified through the Metastore.
HCatalog: It is built on top of the Hive metastore and exposes the tabular data of the
Hive metastore to other data processing tools.
WebHCat: The REST API for HCatalog provides an HTTP interface to
perform Hive metadata operations. WebHCat is a service that users can use to run
Hadoop MapReduce (or YARN), Pig, and Hive jobs.
Working with Hive:We will now look at how to use Apache Hive to process
data.
1. The driver calls the user interface’s execute function to perform a query.
2. The driver answers the query, creates a session handle for the query,
and passes it to the compiler for generating the execution plan.
3. The compiler sends a metadata request to the metastore.
4. The metastore sends the metadata to the compiler as a response. The compiler uses
this metadata for type-checking and semantic analysis on the expressions in the
query tree. It then generates the execution plan (a directed acyclic graph) whose
stages are MapReduce jobs, including map operator trees (operators used by
mappers) and reduce operator trees (operators used by reducers).
5. The compiler then transmits the generated execution plan to the driver.
6. After the compiler provides the execution plan to the driver, the driver
passes the implemented plan to the execution engine for execution.
7. The execution engine then passes these stages of the DAG to the appropriate
components. For each table or intermediate output, the associated deserializer is
used to read the rows from HDFS files, and these rows are passed through the
operator tree. Intermediate output is serialised with the serializer and written to
temporary HDFS files, which are then used to provide data to the subsequent
MapReduce stages of the plan. For DML operations, the final temporary file is
finally moved to the table's location.
8. The driver stores the contents of the temporary files in HDFS as part of
a fetch call from the driver to the Hive interface. The Hive interface
sends the results to the driver.
Hive can run queries in one of two execution modes, depending on the size of the data
and the number of data nodes in the Hadoop cluster:
1. Local mode
2. Map-reduce mode
8b) Describe how Spark handles DataFrames and complex data types. Include
an example of working with JSON data in Spark. 7M
Data Frames:
A DataFrame is a distributed collection of data, which is
organized into named columns. Conceptually, it is equivalent to
relational tables with good optimization techniques.
A DataFrame can be constructed from an array of different
sources such as Hive tables, Structured Data files, external
databases, or existing RDDs. This API was designed for
modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas
in Python.
Features of DataFrame:
• Ability to process data ranging from kilobytes on a single machine to petabytes
on a large cluster.
• Supports different data formats (such as Avro, CSV, and JSON) and storage systems
(HDFS, Hive tables, external databases).
• State-of-the-art optimization and code generation through the Spark SQL Catalyst
optimizer.
• APIs for Python, Java, Scala, and R.
SQLContext:
SQLContext is the entry point for working with structured data (DataFrames) in Spark;
it is used to create DataFrames and execute SQL queries.
Example
The following shows an employee DataFrame (dfs) that has been loaded from a JSON file.
If you want to see the data in the DataFrame, then use the following
command.
scala> dfs.show()
Output − You can see the employee data in a tabular format.
<console>:22, took 0.052610 s
+----+------+ ------------------ +
|age | id | name |
+----+------+ ------------------ +
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
| 23 | 1204 | javed |
| 23 | 1205 | prudvi |
+----+------+ ------------------ +
Arrays:
# JSON data
json_data = '''
{
"id": 1,
"name": "John Doe",
"age": 25,
"skills": ["Python", "Spark", "SQL"],
"address": {
"city": "New York",
"zipcode": "10001"
}
}
'''
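A minimal PySpark sketch of working with this JSON record (the SparkSession setup is
assumed; json_data is the string defined above):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("json-example").getOrCreate()

# Build a DataFrame directly from the JSON string; Spark infers a schema with
# a struct column (address) and an array column (skills).
df = spark.read.json(spark.sparkContext.parallelize([json_data]))
df.printSchema()

# Access a nested struct field and explode the array column into one row per element.
df.select(col("name"), col("address.city").alias("city")).show()
df.select(col("name"), explode(col("skills")).alias("skill")).show()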
UNIT-V
9a) Explain event time and stateful processing. 7M
Answer:
Event time
Event time is the time that is embedded in the data itself. It is most
often, though not required to be, the time that an event actually occurs.
This is important to use because it provides a more robust way of
comparing events against one another. The challenge here is that event
data can be late or out of order. This means that the stream
processing system must be able to handle out-of-order or late data.
Processing time
Processing time is the time at which the stream-processing system actually receives the
data. It is usually less important than event time because it is largely an implementation
detail of the processing system.
Stateful Streaming in Apache Spark
But what if you need to accumulate the results from the start of the streaming job?
That means you need to check the previous state of the RDD in order to update the
new state of the RDD. This is what is known as stateful streaming in Spark.
Spark provides two APIs to perform stateful streaming: updateStateByKey and
mapWithState.
Now we will see how to perform stateful streaming of word count
using updateStateByKey. updateStateByKey is a function on DStreams in
Spark that accepts an update function as its parameter. The update function receives
the new values for a key (a Seq of values from the current batch) and the previous state
of the key (an Option), and returns the updated state; a sketch is shown after the
example below.
Let's take a word count program. Say for the first 10 seconds we have given
this data: hello every one from acadgild. The word count result will be
(one,1)
(hello,1)
(from,1) (acadgild,1)(every,1)
Now, without writing the updateStateByKey function, if you give some other
data in the next 10 seconds, i.e. let's assume we give the same line hello
every one from acadgild, we will get the same result in the next 10
seconds also, i.e.,
(one,1)
(hello,1)
(from,1) (acadgild,1)(every,1)
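With updateStateByKey, the counts from each batch are instead added to the previously
stored state, so the running totals keep accumulating. A minimal Scala sketch (the socket
source, port, and checkpoint path are assumptions, not from the original answer):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint")   // checkpointing is required for stateful operations

val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// newValues: counts for this key in the current batch; runningCount: previously stored state.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = words.updateStateByKey[Int](updateFunction _)
runningCounts.print()

ssc.start()
ssc.awaitTermination()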
9b) Discuss structured streaming. 8M
Answer:
Structured Streaming is Spark's stream-processing engine built on the Spark SQL engine:
streaming computations are expressed in the same way as batch computations on static
data, and Spark runs them incrementally as new data arrives. Its key features include the
following.
Incremental Processing:
One of the key features of Structured Streaming is its support for incremental
processing. It processes only the new data that arrives in the stream since the
last batch was processed. This enables low-latency and efficient stream
processing.
Fault Tolerance:
Structured Streaming provides end-to-end, exactly-once fault-tolerance guarantees
through checkpointing and write-ahead logs.
Event-Time Processing:
It supports processing based on the event time embedded in the data, including window
operations and watermarks for handling late data.
Stateful Processing:
It can maintain and update arbitrary state across micro-batches, which enables
aggregations, deduplication, and sessionization over streams.
The API allows for integration with external systems for various
functionalities, such as connecting to external databases, calling external APIs,
or incorporating custom logic using user-defined functions (UDFs).
It supports a variety of data sources and sinks, including popular ones like
Apache Kafka, HDFS, Amazon S3, and more. This flexibility allows
organizations to ingest data from various sources and output the results to
different sinks.
OR
10a) Define streaming and explain duplicates in a stream. 8M
Answer:
Streaming refers to the continuous processing of data records as they arrive, rather than
processing them in large, finite batches.
Duplicates in Streaming:
In the context of streaming data, duplicates refer to the occurrence of identical
records or events within the data stream. Duplicate records can arise due to various
reasons, and handling them appropriately is essential for ensuring the accuracy and
reliability of streaming analytics.
Reasons for Duplicates:
Reprocessing or Retrying:
When a failure occurs, a source or an upstream system may resend or reprocess records that
were already delivered, producing duplicates.
Network Retransmission:
Network delays or glitches can cause the same record to be transmitted more than
once. The receiving end of the streaming system may interpret these retransmissions
as duplicate records.
Out-of-Order Arrival:
In some cases, records may arrive out of order due to network latency or delays. This
can result in the same record being processed multiple times if the system is not
designed to handle out-of-order arrivals.
Data Source Characteristics:
The characteristics of the data source itself, such as the way data is produced and
transmitted, can contribute to duplicates. For example, in some streaming scenarios,
records may be emitted periodically, leading to the generation of identical records.
Handling Duplicates in Streaming:
Deduplication:
Deduplication involves identifying and removing duplicate records from the stream.
This can be achieved by maintaining state and checking for duplicates before
processing each record.
Windowing:
Windowing involves grouping records within a specified time window and processing
them collectively. This can help identify and handle duplicates within the window.
Timestamp-Based Processing:
Processing records based on their timestamp can help identify and discard duplicates
by considering the temporal order of events.
Idempotent Operations:
Designing operations so that processing the same record more than once produces the same
result makes the pipeline tolerant of duplicates.
Buffering and Caching:
Buffering and caching records for a short period can help identify and eliminate
duplicates by comparing incoming records with those in the buffer.
Handling duplicates in streaming is a critical aspect of building robust and reliable
real-time data processing systems. By implementing appropriate strategies and
techniques, organizations can ensure the accuracy of analytics results and maintain
the integrity of their streaming applications.
10b) Explain structured streaming in action and transformations on streams. How is it
useful in the real world? 7M
Answer:
One of the primary use cases for Structured Streaming is real-time data
processing. With the ability to process data continuously in micro-batches,
organizations can gain insights and make decisions in near real-time. This is
crucial for applications where up-to-date information is essential, such as fraud
detection, monitoring, and alerting systems.
Transformations on Streaming Data:
Most of the standard DataFrame transformations (select, filter, groupBy aggregations, and
joins with static data) can be applied to a streaming DataFrame exactly as they are to a
static one; Spark executes them incrementally on each micro-batch of new data.