
Big Data Analytics

Suggestions for Final Semester

1. How does a Secondary NameNode differ from the NameNode in HDFS?
Ans. The Secondary NameNode is responsible for periodically merging the NameNode's edit logs into the fsimage file, after which the edit logs are cleared. This periodic activity helps minimize the size of the edit log files (since changes are flushed to the fsimage held on the Secondary NameNode).

The Checkpoint Node periodically fetches the fsimage and edit logs from the NameNode and merges them. The resulting state is called a checkpoint. After this, it uploads the result to the NameNode.

The main difference between the Secondary NameNode and the Checkpoint Node is that the Secondary NameNode does not upload the merged fsimage (with edit logs applied) to the active NameNode, whereas the Checkpoint Node uploads the merged new image back to the active NameNode. So with a Secondary NameNode, the NameNode has to fetch the checkpointed state from the Secondary NameNode itself.
2. Define the role of combiner and partitioner in a map reduce application.
Ans. Combiner: The combiner in MapReduce is also known as a 'mini-reducer'. Its primary job is to process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.
Partitioner: decides which key goes to which reducer, by default using a hash function. All records with the same key are sent to the same reducer for the final output computation, as shown in the sketch below.
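
A minimal Java sketch of how a combiner and a custom partitioner can be wired into a job (class names such as WordCountMapper, WordCountReducer and WordPartitioner are hypothetical; Text keys and IntWritable values are assumed):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner: a hash of the key decides which reducer receives the record,
// so all records with the same key end up at the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the (hypothetical) driver:
// job.setCombinerClass(WordCountReducer.class);   // optional mini-reducer, runs on mapper output
// job.setPartitionerClass(WordPartitioner.class); // decides which key goes to which reducer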
3. Define the three key design principles of pig latin.
Ans. Pig Latin is the programming language provided by the Pig platform for writing Pig programs. Pig converts Pig Latin scripts into MapReduce tasks that can be run within a Hadoop cluster. When designing Pig Latin, the development team followed three core design principles:

Keep it simple.
Pig Latin provides streamlined functions for communicating with Java MapReduce. It is, in simple words, an abstraction that offers an easy way to create parallel data flow and analysis programs on the Hadoop cluster. Complex tasks may need a series of interrelated data transformations, and such series are encoded as data flow sequences. Implementing data transformations and flows as Pig Latin scripts rather than Java MapReduce programs makes them much simpler and easier to write, understand, and maintain since:

1) We don’t have to write the job in Java

2) We don’t have to think in terms of MapReduce, and

3) We don’t need to come up with custom code to support rich data types.

Pig Latin provides a simpler language to exploit our Hadoop cluster, thus making
it easier for more people to leverage the power of Hadoop and become
productive sooner.

Make it smart.
We may recall that the Pig Latin compiler does the work of transforming a Pig Latin program into a series of Java MapReduce jobs. The trick is that the compiler can optimize the execution of these Java MapReduce jobs automatically, allowing the user to focus on semantics rather than on how to optimize and access the data. For SQL users, this explanation will sound familiar. SQL is set up as a declarative query used to access structured data stored in an RDBMS. The RDBMS engine first translates the query into a data access method, then inspects the statistics and builds a series of data access strategies. The cost-based optimizer selects the most efficient strategy for execution.

Don’t limit development.


Make Pig extensible so that developers can contribute and add customized functions to address their specific business problems.

Traditional RDBMS data warehouses use the ETL data processing pattern, where we extract data from outside sources, transform it to fit our operational needs, and then load it into the end target, whether that is an operational data store, a data warehouse, or another variant of database. However, with big data we typically want to reduce the amount of data being moved about, so we end up bringing the processing to the data itself. That is why the Pig data flow language takes a pass on the old ETL approach and goes with ELT instead: extract the data from the various sources, load it into HDFS, and then transform it as necessary to prepare the data for further analysis.
4. How to create a table by using HIVEQL.
Ans. A table in Hive is a set of data that uses a schema to sort the data by given
identifiers.

The general syntax for creating a table in Hive is:


CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(col_name data_type [COMMENT 'col_comment'], ...)
[COMMENT 'table_comment']
[ROW FORMAT DELIMITED [FIELDS TERMINATED BY char]]
[STORED AS file_format];

Follow the steps below to create a table in Hive.


Step 1: Create a Database
1. Create a database named “company” by running the create command:
create database company;
The terminal prints a confirmation message and the time needed to perform the
action.
2. Next, verify the database is created by running the show command:
show databases;
3. Find the “company” database in the list:
4. Open the “company” database by using the following command:
use company;
Step 2: Create a Table in Hive
The “company” database does not contain any tables after initial creation. Let’s
create a table whose identifiers will match the .txt file you want to transfer data
from.
1. Create an “employees.txt” file in the /hdoop directory. The file should contain data about employees.
2. Arrange the data from the “employees.txt” file in columns. The column names
in our example are:
ID
Name
Country
Department
Salary
3. Use the column names when creating the table. Start the statement by running:
create table employees (id int, name string, country string, department string,
salary int)
4. Create a logical schema that maps data from the .txt file to the corresponding columns. In the “employees.txt” file, data is separated by a '-'. Complete the same statement with this logical schema by typing:
row format delimited fields terminated by '-';
The terminal prints out a confirmation message.
5. Verify if the table is created by running the show command:
show tables;
Step 3: Load Data From a File
You have created a table, but it is empty because data is not loaded from the
“employees.txt” file located in the /hdoop directory.
1. Load data by running the load command:
load data inpath '/hdoop/employees.txt' overwrite into table employees;
2. Verify if the data is loaded by running the select command:
select * from employees;
The terminal prints out the data imported from the employees.txt file.
Display Hive Data
You have several options for displaying data from the table. By using the following
options, you can manipulate large amounts of data more efficiently.
5. Explain the basic building blocks of Hadoop with a neat sketch.
Ans. The Hadoop architecture is a package of the file system, MapReduce
engine and the HDFS (Hadoop Distributed File System). The MapReduce engine
can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the JobTracker and the NameNode, whereas the slave nodes run the DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.

Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is developed in Java, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode
It is the single master server in the HDFS cluster.
As it is a single node, it can become a single point of failure.
It manages the file system namespace by executing operations such as opening, renaming and closing files.
It simplifies the architecture of the system.
DataNode
The HDFS cluster contains multiple DataNodes.
Each DataNode contains multiple data blocks.
These data blocks are used to store data.
It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
The role of the JobTracker is to accept MapReduce jobs from clients and process the data with the help of the NameNode.
In response, the NameNode provides metadata to the JobTracker.
Task Tracker
It works as a slave node for Job Tracker.
It receives tasks and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce processing begins when the client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.
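
A minimal Java driver sketch of this submission flow (the job name, mapper/reducer class names and paths are hypothetical), using the standard org.apache.hadoop.mapreduce API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "demo job");   // client-side description of the job
        job.setJarByClass(SubmitJob.class);
        job.setMapperClass(WordCountMapper.class);     // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);   // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submission hands the job to the scheduler (JobTracker in MR1, ResourceManager in YARN),
        // which distributes tasks to TaskTrackers/NodeManagers and reschedules failed or timed-out tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}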
6. Explain the various operational modes of Hadoop cluster configuration.
Ans. Hadoop Mainly works on 3 different Modes:
Standalone Mode
Pseudo-distributed Mode
Fully-Distributed Mode
1. Standalone Mode
In Standalone Mode none of the daemons run, i.e. NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker. (The JobTracker and TaskTracker are used for processing in Hadoop 1; in Hadoop 2 the ResourceManager and NodeManager are used instead.) Standalone Mode also means that Hadoop is installed on only a single system. By default, Hadoop runs in this Standalone Mode, which can also be called Local Mode. Hadoop is mainly used in this mode for learning, testing and debugging.
2. Pseudo Distributed Mode (Single Node Cluster)
In Pseudo-distributed Mode we also use only a single node, but the key point is that the cluster is simulated: all the processes inside the cluster run independently of each other. All the daemons, i.e. NameNode, DataNode, Secondary NameNode, ResourceManager, NodeManager, etc., run as separate processes in separate JVMs (Java Virtual Machines), that is, as different Java processes, which is why it is called pseudo-distributed.

One thing to remember is that, since only a single node is used, all the master and slave processes are handled by the same system. The NameNode and ResourceManager act as masters, and the DataNode and NodeManager act as slaves. A Secondary NameNode is also used as a master; its purpose is simply to keep a periodic (roughly hourly) checkpoint of the NameNode metadata. In this mode,
Hadoop is used both for development and for debugging purposes.
HDFS (Hadoop Distributed File System) is used for managing the input and output processes.
The configuration files mapred-site.xml, core-site.xml and hdfs-site.xml need to be changed to set up the environment.
3. Fully Distributed Mode (Multi-Node Cluster)
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, the NameNode and ResourceManager, and the rest run the slave daemons, the DataNode and NodeManager. Here Hadoop runs on a cluster of machines (nodes), and the data being used is distributed across the different nodes. This is the production mode of Hadoop. To understand it in physical terms: in the single-node modes you download Hadoop as a tar or zip file, install it on one system, and run all the processes on that system; in fully distributed mode the tar or zip file is extracted on each node of the Hadoop cluster and a particular node is used for a particular process. Once the processes are distributed among the nodes, you define which nodes work as masters and which work as slaves.
7. Distinguish between the old and new versions of Hadoop API for Map
Reduce framework.
ans.
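In brief, the old API lives in the org.apache.hadoop.mapred package (Mapper and Reducer are interfaces, output goes through an OutputCollector plus a Reporter, and jobs are configured with JobConf), while the new API lives in org.apache.hadoop.mapreduce (Mapper and Reducer are abstract classes, output and status go through a Context object, and jobs are driven by the Job class). A minimal Java sketch of the same (illustrative) mapper written against both APIs:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API: implement the Mapper interface, emit through an OutputCollector.
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
                    org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        output.collect(value, new IntWritable(1));
    }
}

// New API: extend the Mapper abstract class, emit through the Context.
class NewApiMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new IntWritable(1));
    }
}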
8. Explain about the implementation of map reduce concept with a small
example.
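For reference, the classic small example is word count; a minimal Java sketch using the new org.apache.hadoop.mapreduce API (class names are illustrative, and the driver wiring is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every input line, emit (word, 1) for each word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: all (word, 1) pairs with the same word arrive together; sum them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}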
9. Explain the architecture of a pig with a neat sketch.
Ans. The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script using the Pig Latin language and execute it using one of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.

Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.

Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.

Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.

10. Explain the syntax of a pig program with a suitable example.


11. Explain about the various data types supported by HIVEQL with an example
Ans. All the data types in Hive are classified into four types, given as follows:

 Column Types
 Literals
 Null Values
 Complex Types

Column types are used as the column data types of Hive. They are as follows:

 Integral Types
Integer type data can be specified using integral data types, INT. When the data
range exceeds the range of INT, you need to use BIGINT and if the data range is
smaller than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

 String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.

 Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It
supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format
“yyyy-mm-dd hh:mm:ss.ffffffffff”.

 Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.

 Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for representing immutable arbitrary-precision numbers. The syntax and an example are as follows:

DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance
using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
 Literals
The following literals are used in Hive:

 Floating Point Types


Floating point types are nothing but numbers with decimal points. Generally, this
type of data is composed of DOUBLE data type.

 Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of decimal type is approximately -10^-308 to 10^308.
 Null Value
Missing values are represented by the special value NULL.

 Complex Types
The Hive complex data types are as follows:

 Arrays
Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>
 Maps
Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>


 Structs
Structs in Hive are similar to using complex data with comments.

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>


12. What happens in the map phase and reduce phase of a Hadoop MapReduce framework?
13. What is a pig and specify its role in Hadoop?
Ans. Pig Hadoop is basically a high-level programming language that is helpful
for the analysis of huge datasets. Pig Hadoop was developed by Yahoo! and is
generally used with Hadoop to perform a lot of data administration operations.
For writing data analysis programs, Pig provides a high-level programming language called Pig Latin. Pig Latin provides several operators with which programmers can develop their own functions for reading, writing, and processing data.
14. Define the various file formats supported by HIVE
Ans. Hive and Impala tables in HDFS can be created using four different Hadoop file formats:
 Text files
 Sequence File
 Avro data files
 Parquet file format

1. Text files
A text file is the most basic and a human-readable file. It can be read or written in
any programming language and is mostly delimited by comma or tab.

The text file format consumes more space when a numeric value needs to be
stored as a string. It is also difficult to represent binary data such as an image.

2. Sequence File
The SequenceFile format can be used to store an image in binary format. Sequence files store key-value pairs in a binary container format and are more efficient than text files. However, sequence files are not human-readable.
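
A short Java sketch of writing key-value pairs to a sequence file (the path and records are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/pairs.seq");   // hypothetical output path
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            // Each record is stored as a binary (key, value) pair.
            writer.append(new Text("record-001"), new IntWritable(42));
            writer.append(new Text("record-002"), new IntWritable(7));
        } finally {
            writer.close();
        }
    }
}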

3. Avro Data Files


The Avro file format has efficient storage due to optimized binary encoding. It is
widely supported both inside and outside the Hadoop ecosystem.

The Avro file format is ideal for long-term storage of important data. It can be read from and written to in many languages, such as Java, Scala and so on. Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution can accommodate changes. The Avro file format is considered the best choice for general-purpose storage in Hadoop.
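
A short Java sketch of writing an Avro data file with its schema embedded (the schema and values are made up for illustration):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // A made-up record schema; Avro stores it in the file so readers can always decode the data.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 1);
        rec.put("name", "Alice");

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("employee.avro"));   // schema metadata is embedded here
        writer.append(rec);
        writer.close();
    }
}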

4. Parquet File Format


Parquet is a columnar format developed by Cloudera and Twitter. It is supported
in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on. Like Avro, schema
metadata is embedded in the file.

The Parquet file format uses advanced optimizations described in Google’s Dremel paper. These optimizations reduce storage space and increase performance. The Parquet file format is considered the most efficient for adding multiple records at a time, and some of its optimizations rely on identifying repeated patterns.

15. Explain the hadoop distributed file system architecture with a neat sketch.
Ans.
The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications. HDFS employs a NameNode and DataNode
architecture to implement a distributed file system that provides high-performance
access to data across highly scalable Hadoop clusters.

Hadoop itself is an open source distributed processing framework that manages data processing and storage for big data applications. HDFS is a key part of the Hadoop ecosystem of technologies. It provides a reliable means for managing pools of big data and supporting related big data analytics applications.

How does HDFS work?


HDFS enables the rapid transfer of data between compute nodes. At its outset, it
was closely coupled with MapReduce, a framework for data processing that filters
and divides up work among the nodes in a cluster, and it organizes and
condenses the results into a cohesive answer to a query. Similarly, when HDFS
takes in data, it breaks the information down into separate blocks and distributes
them to different nodes in a cluster.

With HDFS, data is written on the server once, and read and reused numerous
times after that. HDFS has a primary NameNode, which keeps track of where file
data is kept in the cluster.

HDFS also has multiple DataNodes on a commodity hardware cluster -- typically one per node in a cluster. The DataNodes are generally organized within the same rack in the data center. Data is broken down into separate blocks and distributed among the various DataNodes for storage. Blocks are also replicated across nodes, enabling highly efficient parallel processing.

The NameNode knows which DataNode contains which blocks and where the
DataNodes reside within the machine cluster. The NameNode also manages
access to the files, including reads, writes, creates, deletes and the data block
replication across the DataNodes.

The NameNode operates in conjunction with the DataNodes. As a result, the cluster can dynamically adapt to server capacity demand in real time by adding or subtracting nodes as necessary.
The DataNodes are in constant communication with the NameNode to determine
if the DataNodes need to complete specific tasks. Consequently, the NameNode
is always aware of the status of each DataNode. If the NameNode realizes that
one DataNode isn't working properly, it can immediately reassign that DataNode's
task to a different node containing the same data block. DataNodes also
communicate with each other, which enables them to cooperate during normal file
operations.

Moreover, HDFS is designed to be highly fault-tolerant. The file system replicates -- or copies -- each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the other copies.
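
A minimal Java client sketch of this write-once, read-many flow using the HDFS FileSystem API (the path is hypothetical; the client asks the NameNode for metadata while block data flows to and from the DataNodes):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // handle backed by the NameNode

        Path file = new Path("/user/demo/sample.txt");  // hypothetical path

        // Write once: the NameNode picks DataNodes and the client streams blocks to them.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read many times: block locations come from the NameNode, bytes from the DataNodes.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
    }
}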

16. What is HDFS? List all the components of HDFS and explain.
Ans.
Hadoop is a framework that uses distributed storage and parallel processing to
store and manage big data. It is the software most used by data analysts to
handle big data, and its market size continues to grow. There are three
components of Hadoop:

Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce - Hadoop MapReduce is the processing unit.
Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource
management unit.
17. With block diagram discuss the various frameworks that run under YARN.
18. What are the characteristics of Big Data?
19. Explain hadoop Architectural Model.
20. Explain NoSQL data Architecture patterns.
Ans.
Architecture Pattern is a logical way of categorizing data that will be stored on the
Database. NoSQL is a type of database which helps to perform operations on big
data and store it in a valid format. It is widely used because of its flexibility and a
wide variety of services.

Architecture Patterns of NoSQL:


The data is stored in NoSQL in any of the following four data architecture
patterns.

1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained below.

1. Key-Value Store Database:


This model is one of the most basic models of NoSQL databases. As the name
suggests, the data is stored in form of Key-Value Pairs. The key is usually a
sequence of strings, integers or characters but can also be a more advanced data
type. The value is typically linked or co-related to the key. The key-value pair
storage databases generally store data as a hash table where each key is unique.
The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc).
This type of pattern is usually used in shopping websites or e-commerce
applications.
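
A toy Java sketch of this access pattern (an in-memory map for illustration only, not a real key-value database such as DynamoDB; keys and values are made up):

import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
    public static void main(String[] args) {
        // Each unique key maps to an arbitrary value; lookups go straight to the key.
        Map<String, Object> store = new HashMap<>();
        store.put("cart:user42", "[{\"item\":\"book\",\"qty\":2}]"); // value can be a JSON string...
        store.put("session:abc", 1699999999L);                       // ...or any other type
        System.out.println(store.get("cart:user42"));                // retrieval by key only
    }
}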
Advantages:

Can handle large amounts of data and heavy load.
Easy retrieval of data by keys.
Limitations:

Complex queries that involve multiple key-value pairs may suffer from degraded performance.
Data involving many-to-many relationships may collide.
Examples:

DynamoDB
Berkeley DB
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work only on columns. They store large amounts of data in columns together. The format and titles of the columns can diverge from one row to another. Every column is treated separately, but each individual column may still contain multiple other columns, as in traditional databases. Basically, columns are the mode of storage in this type.

Advantages:

Data is readily available.
Queries like SUM, AVERAGE and COUNT can be easily performed on columns.
Examples:

HBase
Bigtable by Google
Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be regarded as a complex data structure and can take the form of text, arrays, strings, JSON, XML or any such format. The use of nested documents is also very common. This model is very effective, as most of the data created is usually in the form of JSON and is unstructured.

Advantages:

This type of format is very useful and apt for semi-structured data.
Storage retrieval and managing of documents is easy.
Limitations:

Handling multiple documents is challenging.
Aggregation operations may not work accurately.
Examples:

MongoDB
CouchDB

4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics which are connected by edges. The relational database pattern has tables that are loosely connected, whereas graphs are often very strong and rigid in nature.

Advantages:

Fastest traversal because of connections.
Spatial data can be easily handled.
Limitations:
Wrong connections may lead to infinite loops.

Examples:

Neo4J
FlockDB (used by Twitter)

21. What is Big data? Why we need Big Data? What are the challenges of Big
Data?
22. What is HDFS? What is PIG? What is hive used for ?
Ans.
Pig is an open-source, high-level data flow system. It provides a simple language called Pig Latin for queries and data manipulation, which are then compiled into MapReduce jobs that run on Hadoop.
Hive allows users to read, write, and manage petabytes of data using SQL. Hive
is built on top of Apache Hadoop, which is an open-source framework used to
efficiently store and process large datasets.

23. What are the 3 V’s of Big Data? Explain with the help of two big data case studies.
24. How can we examine the HIVE Clients? Explain.

25. What is Hadoop API? Explain Hadoop API for MapReduce framework.
