1. Volume
2. Velocity
3. Variety (1M)
3. The system is overengineered
a. Messy data visualization
b. The system is overengineered
4. Long system response time
a. Inefficient data organization
b. Problems with the big data analytics infrastructure and resource utilization
5. Expensive maintenance
a. Outdated technologies
b. Non-optimal infrastructure
c. The chosen system is overengineered
3. Hadoop is a framework that allows distributed processing of large data sets across clusters of
commodity computers using simple programming models.
Four key characteristics of Hadoop:
1. Economical: ordinary computers can be used for data processing.
2. Reliable: stores copies of the data on different machines and is resistant to hardware
failure.
3. Scalable: supports both horizontal and vertical scaling.
4. Flexible: can store as much data as needed and decide how to use it later. (2M)
YARN
YARN also allows different data processing engines, such as graph processing,
interactive processing, and stream processing, as well as batch processing, to run and
process data stored in HDFS (Hadoop Distributed File System), thus making the
system much more efficient. (2M)
HDFS
⮚ HDFS is a filesystem designed for large-scale distributed data processing under
frameworks such as MapReduce.
⮚ As HDFS isn’t a native Unix filesystem, standard Unix file tools such as ls and cp
don’t work on it, nor do standard file read/write operations such as fopen()
and fread().
On the other hand, Hadoop does provide a set of command line utilities that work
similarly to the Linux file commands.
The most common file management tasks in Hadoop include the following (a Java API sketch of these tasks appears after the list):
■ Adding files and directories
■ Retrieving files
■ Deleting files (2M)
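The same three tasks can also be performed programmatically. Below is a minimal sketch using Hadoop’s Java FileSystem API; the class name HdfsFileTasks and the paths are illustrative, not from the text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml for the HDFS address
        FileSystem fs = FileSystem.get(conf);      // handle to the configured filesystem

        // Adding files and directories
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"),
                             new Path("/user/chuck/example.txt"));

        // Retrieving files
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"),
                           new Path("example_copy.txt"));

        // Deleting files (false = do not delete recursively)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}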
MAPREDUCE
⮚ In scaling our distributed word-counting program in the last section, we also had
to write the partitioning and shuffling functions.
⮚ Partitioning and shuffling are common design patterns that go along with mapping
and reducing.
⮚ In order for mapping, reducing, partitioning, and shuffling to seamlessly work
together, we need to agree on a common structure for the data being processed.
⮚ MapReduce uses lists and (key/value) pairs as its main data primitives. The keys
and values are often integers or strings but can also be dummy values to be
ignored or complex object types. (The general form of the two functions is sketched below.)
(2M)
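Written in this notation, the two phases have the following general form; this is the standard formulation, in which k3 and v3 (the reducer’s output key and value types) are not named in the text above:

map: <k1, v1> → list(<k2, v2>)
reduce: <k2, list(v2)> → list(<k3, v3>)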
4.a. YARN also allows different data processing engines, such as graph processing, interactive
processing, and stream processing, as well as batch processing, to run and process data stored
in HDFS (Hadoop Distributed File System), thus making the system much more efficient. (4M)
Let’s look at the complete data flow in the MapReduce framework (a word-count sketch of this flow follows the list).
1. The input to your application must be structured as a list of (key/value) pairs,
list(<k1, v1>). The input format for processing multiple files is usually list(<String
filename, String file_content>).
2. The list of (key/value) pairs is broken up and each individual (key/value) pair,
<k1, v1>, is processed by calling the map function of the mapper. In practice, the
key k1 is often ignored by the mapper (for instance, it may be the line number of
the incoming text in the value). The mapper transforms each <k1, v1> pair into a
list of <k2, v2> pairs.
3. The outputs of all the mappers are (conceptually) aggregated into one giant list of
<k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new
(key/value) pair, <k2, list(v2)>. The framework asks the reducer to process each
one of these aggregated (key/value) pairs individually. (2M)
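A minimal sketch of this flow as the classic word count, written against Hadoop’s Java MapReduce API (the class names are illustrative; the Mapper/Reducer base classes and context.write() are the real API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // <k1, v1> = <byte offset of the line, line text>; <k2, v2> = <word, 1>
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // k1 (the byte offset) is ignored, as noted in step 2
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // emit <k2, v2> = <word, 1>
                }
            }
        }
    }

    // Receives <k2, list(v2)> = <word, list(1, 1, ...)> after the shuffle (step 3)
    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // final <word, count>
        }
    }
}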
OR
5. MapReduce is also a data processing model; its greatest advantage is the easy scaling of
data processing over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers.
MapReduce programs are executed in two main phases, called mapping and reducing.
Each phase is defined by a data-processing function, and these functions are called mapper
and reducer, respectively. (2M)
In the original work at Google, the task was to create search indexes that contain vectors of
document URLs for each word on the web; the pages were tokenized, and the
combined lists were then aggregated together, much like the word counter presented here.
10M
Partitioning and shuffling are common design patterns that go along with mapping and
reducing.
In order for mapping, reducing, partitioning, and shuffling to seamlessly work together, we
need to agree on a common structure for the data being processed. MapReduce uses lists
and (key/value) pairs as its main data primitives. The keys and values are often integers or
strings but can also be dummy values to be ignored or complex object types.
(2M)
1. The input to your application must be structured as a list of (key/value) pairs, list(<k1,
v1>). The input format for processing multiple files is usually list(<String filename, String
file_content>).
2. The list of (key/value) pairs is broken up and each individual (key/value) pair, <k1, v1>,
is processed by calling the map function of the mapper. In practice, the key k1 is often
ignored by the mapper (for instance, it may be the line number of the incoming text in the
value). The mapper transforms each <k1, v1> pair into a list of <k2, v2> pairs.
3. The outputs of all the mappers are (conceptually) aggregated into one giant list of
<k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new (key/value)
pair, <k2, list(v2)>. The framework asks the reducer to process each one of these
aggregated (key/value) pairs individually. (These three steps are simulated in the sketch below.)
(3M)
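As an illustration only (plain Java, not part of Hadoop), the three steps can be simulated on a single machine; a TreeMap plays the role of the partition-and-shuffle stage by grouping values by key:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSimulation {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog");

        // Step 2 (map): each <k1, v1> line becomes a list of <k2, v2> = <word, 1>
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Step 3a (shuffle): group all values sharing the same k2 into <k2, list(v2)>
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

        // Step 3b (reduce): each <k2, list(v2)> is reduced to a final <word, count>
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}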
6M
When you create a table in Hive, by default Hive will manage the data, which
means that Hive moves the data into its warehouse directory.
The difference between the two table types, managed and external, is seen in the LOAD and DROP
semantics.
When you load data into a managed table, it is moved into Hive’s warehouse
directory.
For example, this:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for
the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
(2M)
DROP TABLE managed_table;
The table, including its metadata and its data, is deleted. It bears repeating that
since the initial LOAD performed a move operation, and the DROP performed a
delete operation, the data no longer exists anywhere.
An external table behaves differently: you control the creation and deletion of the
data.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
With the EXTERNAL keyword, Hive knows that it is not managing the data, so it
doesn’t move it to its warehouse directory.
Indeed, it doesn’t even check whether the external location exists at the time it is
defined.
This is a useful feature because it means you can create the data lazily after
creating the table. Conversely, when an external table is dropped, Hive leaves the
data alone and deletes only the metadata.
(2M)
OR
7. HDFS is a filesystem designed for large-scale distributed data processing under
frameworks such as MapReduce. It can store a big data set of 100 TB as a single file in
HDFS. Hadoop provides a set of command-line utilities that work similarly to the
Linux file commands. (10M)
hadoop fs -cmd <args> (1M)
You can use the Hadoop cat command to show the content of that file:
hadoop fs -cat hdfs://localhost:9000/user/chuck/example.txt (1M)
RETRIEVING FILES
1. To retrieve a file from HDFS (the trailing dot is the local destination directory): hadoop fs -get example.txt .
2. If the file is huge and you’re interested in a quick check of its content, you can pipe
the output of Hadoop’s cat into a Unix head:
hadoop fs -cat example.txt | head
3. To look at the last kilobyte of a file: hadoop fs -tail example.txt (3M)
DELETING FILES
The Hadoop command for removing files is rm:
hadoop fs -rm example.txt (1M)
*********************************************************************************************