
Big Data Assignment – 2

1. Write a note on NameNode High Availability


With early Hadoop installations, the NameNode was a single point of failure that could bring down
the entire Hadoop cluster. NameNode hardware often employed redundant power supplies and
storage to guard against such problems, but it was still susceptible to other failures. The solution was
to implement NameNode High Availability (HA) as a means to provide true failover service.

As shown in Figure 3.3, an HA Hadoop cluster has two (or more) separate NameNode machines.
Each machine is configured with exactly the same software. One of the NameNode machines is in
the Active state, and the other is in the Standby state. Like a single NameNode cluster, the Active
NameNode is responsible for all client HDFS operations in the cluster. The Standby NameNode
maintains enough state to provide a fast failover (if required).
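The Standby keeps its state current by tailing a shared edit log (typically a set of JournalNodes), and ZooKeeper-based failover controllers can provide automatic failover. A minimal sketch of the HA-related settings in hdfs-site.xml, assuming a hypothetical nameservice mycluster with NameNodes nn1 and nn2 and hypothetical host names (exact values depend on the deployment):

<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1-host:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2-host:8020</value></property>
<property><name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>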

2. HDFS user commands:


1. List Files in HDFS:
a. To list the files in the root HDFS directory, enter the following command:
Syntax: $ hdfs dfs -ls /
Output: Found 2 items
drwxrwxrwx   - yarn   hadoop  0 2015-04-29 16:52 /app-logs
drwxr-xr-x   - hdfs   hdfs    0 2015-04-21 14:28 /apps

b. To list files in your home directory, enter the following command:


Syntax: $ hdfs dfs -ls
Output: Found 2 items
drwxr-xr-x   - hdfs   hdfs   0 2015-05-24 20:06 bin
drwxr-xr-x   - hdfs   hdfs   0 2015-04-29 16:52 examples

2. Make a Directory in HDFS


To make a directory in HDFS, use the following command. As with the -ls command, when no path is
supplied, the user's home directory is used.
Syntax: $ hdfs dfs -mkdir stuff

3. Copy Files to HDFS


To copy a file from your current local directory into HDFS, use the following command. If a full path is not
supplied, your home directory is assumed. In this case, the file test is placed in the directory stuff that was
created previously.
Syntax: $ hdfs dfs -put test stuff

The file transfer can be confirmed by using the -ls command:


Syntax: $ hdfs dfs -ls stuff
Output: Found 1 items
-rw-r--r--   2 hdfs hdfs  12857 2015-05-29 13:12 stuff/test
4. Copy Files from HDFS
Files can be copied back to your local file system using the following command. In this case, the file we
copied into HDFS, test, will be copied back to the current local directory with the name test-local.
Syntax: $ hdfs dfs -get stuff/test test-local

5. Copy Files within HDFS


The following command will copy a file in HDFS:
Syntax: $ hdfs dfs -cp stuff/test test.hdfs

6. Delete a File within HDFS


The following command will delete the HDFS file test.hdfs:
Syntax: $ hdfs dfs -rm test.hdfs

3. List the functions of YARN.

YARN is a resource management platform. It manages computer resources. The platform is responsible for
providing the computational resources, such as CPUs, memory and network I/O, which are needed when an
application executes. An application task has a number of sub-tasks. YARN manages the scheduling of these
sub-tasks, and each sub-task uses the resources in allotted time intervals.

YARN separates the resource management and processing components. YARN stands for Yet Another Resource
Negotiator. An application consists of a number of tasks. Each task can consist of a number of sub-tasks (threads),
which run in parallel at the nodes in the cluster. YARN enables running of multi-threaded applications. YARN
manages and allocates the resources for the application sub-tasks on the Hadoop system.
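For example, the standard YARN command-line client can be used to inspect the node resources and the applications that the ResourceManager is handling (output omitted here):
Syntax: $ yarn node -list
Syntax: $ yarn application -list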

4. What are the activities which Mahout supports in the Hadoop system?
Mahout is an Apache project that provides a library of scalable machine learning algorithms. Apache implemented Mahout
on top of Hadoop.

Mahout supports four main areas:


1. Collaborative filtering, which mines user behavior and makes product recommendations.
2. Clustering, which takes data items in a particular class and organizes them into naturally occurring groups, such
that items belonging to the same group are similar to each other.
3. Classification, which means learning from existing categorizations and then assigning future items to the best
category.
4. Frequent item-set mining, which analyzes items in a group and then identifies which items usually occur
together.

5. Explain when will you use the following: MongoDB, Cassandra, CouchDB, Oracle NoSQL and Riak.
NoSQL databases are needed for Big Data solutions and play an important role in handling Big Data challenges.
The table gives examples of widely used NoSQL data stores.
6. How does CAP theorem hold in NoSQL databases?
Brewer's CAP (Consistency, Availability and Partition tolerance) theorem states that a distributed
system cannot guarantee C, A and P together.
1. Consistency - All nodes observe the same data at the same time.
2. Availability - Each request receives a response indicating success or failure.
3. Partition tolerance - The system continues to operate as a whole even in case of message loss,
node failure or a node not being reachable.

Partition tolerance cannot be overlooked for achieving reliability in a distributed database system. Thus, in case
of a network failure, the choice can be:
a) The database answers, even though that answer may be old or stale data (AP).
b) The database does not answer unless it has received the latest copy of the data (CP).

The CAP theorem implies that when a network partition occurs, the choice between consistency and availability
is mutually exclusive. CA means consistency and availability, AP means availability and partition tolerance, and CP
means consistency and partition tolerance. Figure 3.1 shows the CAP theorem usage in Big Data solutions.

7. How do ACID and BASE properties differ?


ACID properties:
a) Transactions on SQL databases exhibit ACID properties.
b) ACID stands for atomicity, consistency, isolation and durability.
c) Atomicity of a transaction means all operations in the transaction must complete, and if interrupted,
they must be undone (rolled back).
d) Consistency in transactions means that a transaction must maintain the integrity constraints and
follow the consistency principle.
e) Isolation of transactions means two transactions of the database must be isolated from each other
and done separately.
f) Durability means a transaction must persist once completed.

BASE properties:
a) BASE is a flexible model for NoSQL data stores. The provisions of BASE increase flexibility.
b) BA stands for basic availability, S stands for soft state and E stands for eventual consistency.
c) Basic availability is ensured by distributing shards across many data nodes with a high degree of
replication, so that a segment failure does not necessarily mean complete data-store unavailability.
d) Soft state ensures processing even in the presence of inconsistencies, while achieving consistency
eventually.
e) Eventual consistency means the consistency requirement in NoSQL databases is met at some point
of time in the future; the data converges eventually to a consistent state, with no time frame specified
for achieving it.

8. When will you use the document data store?


Document databases are considered to be non-relational (or NoSQL) databases. Instead of storing data
in fixed rows and columns, document databases use flexible documents. NoSQL document databases
are based on a model that does not require SQL and tables, unlike relational databases. Instead of
using tables with the data types, columns, rows, schemas, and tabular relations used in relational
databases, NoSQL databases use documents with data type descriptions and values.
Characteristics of a document data store are high performance and flexibility. Scalability varies,
depending on the stored content. Complexity is low compared to tabular, object and graph data stores.
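For example, a single document in such a store might look like the following (a hypothetical JSON record; the field names are purely illustrative):

{
  "_id": "student001",
  "name": "Asha",
  "courses": ["Big Data", "DBMS"],
  "address": { "city": "Mumbai", "pin": 400001 }
}

Each document carries its own field names and value types, so two documents in the same collection need not share a schema.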

9. List and compare the features of BigTable, RC, ORC and Parquet data stores.
BigTable
a) A massively scalable NoSQL data store; BigTable scales up to hundreds of petabytes.
b) Integrates easily with Hadoop and Hadoop-compatible systems.
c) Compatible with MapReduce and the HBase API, which are open-source Big Data platforms.
d) Handles millions of operations per second.
e) Handles large workloads with low latency and high throughput.
f) Provides consistent low latency and high throughput.
g) APIs include security and permissions.
h) BigTable, being Google's cloud service, has global availability and its service is seamless.
RC
Hive uses Record Columnar (RC) file-format records for querying. RC is a good choice for intermediate
tables, providing a fast column-family store in HDFS with Hive. The advantage of RC is the serializability of
table column data; an RC file can be deserialized back into column data. A table such as that shown in
Example 3.6 can be partitioned into row groups, and the values of each column of a row group are stored
as an RC record. The RC file records thus store the data of a column within its row group. (Serializability
means a query or transaction is executable as a series of instructions such that execution ensures correct results.)
The following example illustrates the use of row groups in the RC file format:
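A small hypothetical illustration (not the textbook's Example 3.6): consider a table with columns C1, C2 and C3 and four rows, split into two row groups of two rows each. Within each row group the values are laid out column by column:

Row group 1:  C1: r1c1, r2c1 | C2: r1c2, r2c2 | C3: r1c3, r2c3
Row group 2:  C1: r3c1, r4c1 | C2: r3c2, r4c2 | C3: r3c3, r4c3

Each column of a row group is stored contiguously, so a query that needs only C2 can read just the C2 portion of each row group instead of entire rows.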

ORC
a) ORC (Optimized Row Columnar) is an intelligent Big Data file format for HDFS and Hive.
b) An ORC file stores a collection of rows as a row group, and the data of each row group is stored in
columnar format.
c) This enables parallel processing of multiple row groups in an HDFS cluster.
d) A mapped column has the contents required by the query.
e) The columnar layout in each ORC file thus optimizes for compression and enables skipping of
data in columns.
f) This reduces the read and decompression load.
g) ORC thus optimizes serial reading of the column fields in an HDFS environment.
h) Throughput increases because only the required content columns are read and the rest are skipped.
i) Reading a smaller number of ORC file content columns reduces the workload on the NameNode.

Parquet data stores


a) Apache Parquet is a columnar storage file format available to any project in the Hadoop
ecosystem (Hive, HBase, MapReduce, Pig, Spark).
b) To understand the Parquet file format in Hadoop better, first consider what a columnar
format is.
c) In a column-oriented format, the values of each column across the records are stored together.
d) Row-wise storage format: all the values of a record are stored together, record after record.
e) Column-oriented storage format: all the values of a column are stored together, column after
column, as illustrated below.
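As a small hypothetical example, consider a table with three records and columns ID, Name and Age:

ID: 1, Name: Asha, Age: 21
ID: 2, Name: Ravi, Age: 23
ID: 3, Name: Meena, Age: 22

Row-wise layout (records stored one after another):
1, Asha, 21 | 2, Ravi, 23 | 3, Meena, 22

Column-oriented layout (each column stored together, as in Parquet):
ID: 1, 2, 3 | Name: Asha, Ravi, Meena | Age: 21, 23, 22

A query that needs only the Age column can then read just the Age block instead of scanning every record.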

10. What are the Characteristics of Big Data NoSQL solution?


Characteristics of Big Data NoSQL solution are:

1. High and easy scalability: NoSQL data stores are designed to expand horizontally. Horizontal
scaling means scaling out by adding more machines as data nodes (servers) to the pool of
resources (processing, memory, network connections). The design scales out using multi-utility
cloud services.
2. Support for replication: Multiple copies of data are stored across multiple nodes of a cluster. This
ensures high availability, partition tolerance, reliability and fault tolerance.
3. Distributability: Big Data solutions permit sharding and distribution of shards on multiple clusters,
which enhances performance and throughput.
4. Use of less expensive servers: NoSQL data stores require less management effort. They support
features such as automatic repair, easier data distribution and simpler data models, which make
database administration (DBA) and tuning requirements less stringent.
5. Use of open-source tools: NoSQL data stores are cheap and open source. Database implementation
is easy and typically uses cheap servers to manage the exploding data and transactions, whereas
RDBMS databases are expensive and use big servers and storage systems. The cost per gigabyte of
data storage and processing can therefore be many times less than with an RDBMS.
6. Support for a schema-less data model: A NoSQL data store is schema-less, so data can be inserted
without any predefined schema. The format or data model can therefore be changed at any time,
without disruption of the application. Managing such changes is a difficult problem in SQL.
7. Support for integrated caching: NoSQL data stores support caching in system memory, which
increases output performance. An SQL database needs a separate infrastructure for that.
8. No inflexibility: Unlike SQL/RDBMS, NoSQL databases are flexible (not rigid) and do not impose
a structured way of storing and manipulating data. SQL stores data in the form of tables consisting
of rows and columns, whereas NoSQL data stores are flexible in how strictly they follow ACID rules.

11. List the benefits of peer-to-peer nodes data distribution model.


The peer-to-peer distribution (PPD) model with replication has the following characteristics:
(1) All replication nodes accept read requests and send responses.
(2) All replicas function equally, supporting both reads and writes.
(3) Node failures do not cause loss of write capability, as another replicated node responds.
Cassandra adopts the PPD model.
Benefits:
a) Performance can be further enhanced by adding nodes.
b) Since every node handles both reads and writes, a replicated node also carries updated data. The
main concern in the model, however, is consistency: when writes occur on different nodes at the
same time, write inconsistencies can occur.

12. List the functions of MongoDB query language and database commands.
13. List the components in Cassandra and their uses.

14. Write the syntax to create keyspace in Cassandra. State when the ALTER keyspace is used.
Create Keyspace command syntax:

CREATE KEYSPACE <Keyspace Name> WITH replication = {'class': '<Strategy name>',
'replication_factor': '<No. of replicas>'} AND durable_writes = '<TRUE/FALSE>';

The CREATE KEYSPACE statement has the attribute replication, with the options class and
replication_factor, and the attribute durable_writes.

The ALTER KEYSPACE command changes (alters) the properties of a keyspace, such as the number of replicas and
the durable_writes setting:
ALTER KEYSPACE <Keyspace Name> WITH replication = {'class': '<Strategy name>',
'replication_factor': '<No. of replicas>'};
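For example, a keyspace for a hypothetical student database could be created and later altered as follows (the keyspace name, strategy and replica counts are illustrative):

CREATE KEYSPACE studentdb WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'} AND durable_writes = true;

ALTER KEYSPACE studentdb WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '2'};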

15. Explain function of Group By, partitioning and combining using one example for each.

a) Grouping by key: When a map task completes, the Shuffle process aggregates (combines) all the
Mapper outputs by grouping the key-value pairs on the intermediate keys, and the values v2 are appended
into a list of values. A "Group By" operation on the intermediate keys creates this list for each key.
b) Partitioning: The Partitioner does the partitioning. Partitioner is an optional class, which the
MapReduce driver class can specify. The Partitioner processes the output of the map tasks before it is
submitted to the Reducer tasks, deciding which Reducer receives each intermediate key. The partition
function executes on each machine that performs a map task.
c) Combiners: Combiners are semi-reducers in MapReduce. Combiner is an optional class, which the
MapReduce driver class can specify. The combiner() executes on each machine that performs a map
task. Combiners optimize the MapReduce task by aggregating locally before the shuffle-and-sort phase.
Typically, the same code implements both the combiner and the reduce functions: combiner() on the
map node and reducer() on the reducer node. (A sketch of a custom Partitioner and a combiner follows
this list.)
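A minimal Java sketch (hypothetical classes, assuming the standard Hadoop org.apache.hadoop.mapreduce API) of a custom Partitioner and a combiner for a word-count style job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Partitioner: decides which Reducer task receives each intermediate (key, value) pair.
// Here the first letter of the key selects the partition (an illustrative rule only).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) return 0;
        return Character.toLowerCase(k.charAt(0)) % numPartitions;
    }
}

// Combiner: a "semi-reducer" that sums the counts locally on the map node before the
// shuffle-and-sort phase; the same class can also serve as the Reducer.
class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();   // the grouped list of values for this key
        context.write(key, new IntWritable(sum));
    }
}

The driver would register these with job.setPartitionerClass(FirstLetterPartitioner.class) and job.setCombinerClass(SumCombiner.class).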

16. Show MapReduce process diagrammatically to depict a client submitting a job, the workflow
of JobTracker and TaskTracker, and TaskTrackers creating the outputs.
A user application specifies the locations of the input/output data and supplies the map and reduce
functions by implementing appropriate interfaces and/or abstract classes. These, and
other job parameters, together comprise the job configuration. The Hadoop job client then submits
the job (jar/executable, etc.) and the configuration to the JobTracker, which then assumes the
responsibility of distributing the software/configuration to the slaves, scheduling the tasks,
monitoring them, and providing status and diagnostic information to the job client. Figure 4.3 shows
the MapReduce process when a client submits a job, and the succeeding actions by the JobTracker and
TaskTrackers.
JobTracker and TaskTracker: MapReduce consists of a single master JobTracker and one slave
TaskTracker per cluster node. The master is responsible for scheduling the component tasks in a job
onto the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as
directed by the master.

The data for a MapReduce task is initially in input files, which typically reside in HDFS.
The files may be line-based log files, binary format files, multi-line input records, or something else
entirely. These input files are usually very large, hundreds of terabytes or even more.

Most importantly, the MapReduce framework operates entirely on key-value pairs. The framework
views the input to a task as a set of (key, value) pairs and produces a set of (key, value) pairs as the
output of the task.
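A minimal Java driver sketch (using the standard Hadoop Job API; WordCountMapper and SumReducer are hypothetical classes defined elsewhere) showing how the job client configures a job and submits it to the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // the job configuration
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);              // hypothetical Mapper class
        job.setCombinerClass(SumReducer.class);                 // optional local aggregation
        job.setReducerClass(SumReducer.class);                  // hypothetical Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input files in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        // Submit to the cluster (JobTracker in MRv1, ResourceManager under YARN) and wait.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}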

17. How does MapReduce program implement counting, filtering and parsing?
a) Counting and summing: Assume that the number of alerts or messages generated during a specific
maintenance activity of vehicles needs counting for a month. The pseudocode uses emit() in the map()
of the Mapper class: the Mapper emits 1 for each message generated, and the Reducer goes through the
list of 1s and sums them (a code sketch follows this answer). Counting is used in data-querying
applications, for example, the count of messages generated, the word count in a file, the number of cars
sold, and analysis of logs, such as the number of tweets per month. It also has applications in the
business-analytics field.
b) Filtering or parsing: Filtering or parsing collects only those items which satisfy some condition, or
transforms each item into some other representation. Filtering/parsing includes tasks such as text
parsing, value extraction and conversion from one format to another. Examples of applications of
filtering are found in data validation, log analysis and querying of datasets. The Mapper takes items one
by one and accepts only those items which satisfy the conditions, emitting the accepted items or their
transformed versions. The Reducer obtains all the emitted items, saves them into a list and outputs them
to the application.
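A minimal Java sketch of these two patterns (the class names and the CSV/"ALERT" record formats are hypothetical), using the standard Hadoop Mapper and Reducer API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Counting: the Mapper emits (messageType, 1) for every input record.
class MessageCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String messageType = line.toString().split(",")[0];    // assume CSV, first field = type
        context.write(new Text(messageType), ONE);             // emit(k2, 1)
    }
}

// Summing: the Reducer adds up the list of 1s for each key.
class MessageCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable one : ones) count += one.get();
        context.write(key, new IntWritable(count));
    }
}

// Filtering: the Mapper emits only those records that satisfy a condition; the Reducer
// (or a zero-reducer job) simply collects the accepted records.
class AlertFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains("ALERT")) {               // keep only alert records
            context.write(line, NullWritable.get());
        }
    }
}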

18. What are the features of Hive? What are the components of the Hive architecture?

Figure shows the main features of Hive.


Figure shows the Hive architecture.

Components of Hive architecture are:

a) Hive Server (Thrift) - An optional service that allows a remote client to submit requests to Hive
and retrieve results. Requests can use a variety of programming languages. Thrift Server exposes
a very simple client API to execute HiveQL statements.

b) Hive CLI (Command Line Interface) - A popular interface to interact with Hive. Hive runs in
local mode, which uses local storage rather than HDFS, when running the CLI on a Hadoop cluster.

c) Web Interface - Hive can also be accessed using a web browser. This requires the HWI Server
running on a designated node. The URL http://hadoop:<port no.>/hwi can be used to access Hive
through the web.

d) Metastore - It is the system catalog. All other components of Hive interact with the Metastore.
It stores the schema or metadata of tables, databases, columns in a table, their data types and
HDFS mapping.

e) Hive Driver - It manages the life cycle of a HiveQL statement during compilation, optimization
and execution.
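As an illustration of the Hive Server component, a minimal Java JDBC sketch (assuming HiveServer2 and the Hive JDBC driver on the classpath; the host name, port and user are hypothetical) that submits a HiveQL statement from a remote client:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveThriftClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");            // Hive JDBC driver
        String url = "jdbc:hive2://hadoop-master:10000/default";     // hypothetical host and port
        try (Connection con = DriverManager.getConnection(url, "hdfs", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {      // HiveQL sent to the server
            while (rs.next()) {
                System.out.println(rs.getString(1));                 // print each table name
            }
        }
    }
}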

19. How does Hive use text, sequential, RC and ORC files? How does Hive use the collection data
types STRUCT, MAP and ARRAY?

File formats and their descriptions


a) Text file: The default file format; a line represents a record, and the delimiting characters
separate the fields. Text file examples are CSV, TSV, JSON and XML.
b) Sequential file: A flat file which stores binary key-value pairs and supports compression.
c) RC file: A record columnar file is one that can be partitioned into rows (row groups) and then
partitioned by columns. Partitioning in this way enables serialization.
d) ORC file: ORC stands for Optimized Row Columnar, which stores data in a more optimized
way than the other file formats.

Collection data-types and their descriptions


a. STRUCT: Similar to a 'C' struct, a collection of fields of different data types. A field is
accessed using dot notation. For example, struct('a', 'b')
b. MAP: A collection of key-value pairs. Fields are accessed using [] notation. For example,
map('key1', 'a', 'key2', 'b')
c. ARRAY: An ordered sequence of elements of the same type. Fields are accessed using an array
index. For example, array('a', 'b')
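A small HiveQL sketch (the table and column names are hypothetical) that combines the collection data types with the ORC file format:

CREATE TABLE employee (
  name       STRING,
  skills     ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address    STRUCT<street:STRING, city:STRING, pin:INT>
)
STORED AS ORC;

-- accessing the collection fields
SELECT address.city, deductions['tax'], skills[0] FROM employee;
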
20. Differences between Pig and MapReduce
