
USN

DAYANANDA SAGAR COLLEGE OF ENGINEERING


(An Autonomous Institute Affiliated to VTU, Belagavi)
Shavige Malleshwara Hills, Kumaraswamy Layout, Bengaluru-560078

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Continuous Internal Assessment –1


Scheme and solution
Course: Big Data Analytics    Course Code: 20AI5DCIBDA
Semester and Section: 5(A)    Maximum marks: 50
1. (10M)
a. ii. It has a predefined organization
b. iii. When it is processed and analyzed
c. ii. MapReduce
d. iii. ByteInputFormat
e. ii. abcedf
f. iii. Namenode stores metadata
g. i. DataNode is the slave/worker node and holds the user data in the form of Data Blocks.
h. ii. put
i. i. a subdirectory under the database directory
j. ii. Show Index
2. a. (5M)
Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications. Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software.
(1M)

Big data is often characterized by the three V's:

1. Volume
2. Velocity
3. Variety
(1M)

1. The large volume of data in many environments;


2. The wide variety of data types frequently stored in big data systems; and
3. The velocity at which much of the data is generated, collected and processed. (3M)
b. (5M)
In today’s digital world, companies embrace big data business analytics to improve
decision-making, increase accountability, raise productivity, make better predictions,
monitor performance, and gain a competitive advantage.
1. Business analytics solution fails to provide new or timely insights.
a. Lack of data
b. Long data response time
c. Old approaches applied to a new system
2. Inaccurate analytics
a. Poor quality of source data
b. System defects related to the data flow
3. Using data analytics is complicated
a. Messy data visualization
b. The system is overengineered
4. Long system response time
a. Inefficient data organization
b. Problems with big data analytics infrastructure and resource utilization
5. Expensive maintenance
a. Outdated technologies
b. Non-optimal infrastructure
c. The system that you have chosen is overengineered
3. Hadoop is a framework that allows distributed processing of large data sets across clusters of
commodity computers using simple programming models.
Four key characteristics of Hadoop:
1. Economical: ordinary computers can be used for data processing.
2. Reliable: stores copies of data on different machines and is resistant to hardware failure.
3. Scalable: can follow both horizontal and vertical scaling.
4. Flexible: can store as much data as needed and decide how to use it later. (2M)

(2M)
YARN
YARN also allows different data processing engines, such as graph processing,
interactive processing, stream processing and batch processing, to run and
process data stored in HDFS (Hadoop Distributed File System), thus making the
system much more efficient.

Scalability: The scheduler in the Resource Manager of the YARN architecture allows
Hadoop to extend and manage thousands of nodes and clusters.
Compatibility: YARN supports existing MapReduce applications without
disruptions, thus making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of clusters in Hadoop,
which enables optimized cluster utilization.
Multi-tenancy: It allows multiple data processing engines to access the same data, giving
organizations the benefit of multi-tenancy. (2M)

HDFS
⮚ HDFS is a filesystem designed for large-scale distributed data processing under
frameworks such as MapReduce.
⮚ As HDFS isn’t a native Unix filesystem, standard Unix file tools such as ls and cp
don’t work on it; neither do standard file read/write operations such as fopen()
and fread().
On the other hand, Hadoop does provide a set of command line utilities that work
similarly to the Linux file commands.
The most common file management tasks in Hadoop include:
■ Adding files and directories
■ Retrieving files
■ Deleting files (2M)

MAPREDUCE
⮚ In scaling our distributed word-counting program in the last section, we also had
to write the partitioning and shuffling functions.
⮚ Partitioning and shuffling are common design patterns that go along with mapping
and reducing.
⮚ In order for mapping, reducing, partitioning, and shuffling to seamlessly work
together, we need to agree on a common structure for the data being processed.
⮚ MapReduce uses lists and (key/value) pairs as its main data primitives. The keys
and values are often integers or strings but can also be dummy values to be
ignored or complex object types.

(2M)

4. a. (04M)
YARN allows different data processing engines, such as graph processing, interactive
processing, stream processing and batch processing, to run and process data stored
in HDFS (Hadoop Distributed File System), thus making the system much more efficient.
(1M)

Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to
extend and manage thousands of nodes and clusters.
Compatibility: YARN supports existing MapReduce applications without disruptions,
thus making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of clusters in Hadoop, which
enables optimized cluster utilization.
Multi-tenancy: It allows multiple data processing engines to access the same data, giving
organizations the benefit of multi-tenancy.
(3M)
b. (06M)
⮚ In scaling our distributed word-counting program in the last section, we also had
to write the partitioning and shuffling functions.
⮚ Partitioning and shuffling are common design patterns that go along with mapping
and reducing.
⮚ In order for mapping, reducing, partitioning, and shuffling to seamlessly work
together, we need to agree on a common structure for the data being processed.
⮚ MapReduce uses lists and (key/value) pairs as its main data primitives. The keys
and values are often integers or strings but can also be dummy values to be
ignored or complex object types. (3M)

(1M)
Let’s look at the complete data flow in the MapReduce framework.
1. The input to your application must be structured as a list of (key/value) pairs,
list(<k1, v1>). The input format for processing multiple files is usually list(<String
filename, String file_content>).
2. The list of (key/value) pairs is broken up and each individual (key/value) pair,
<k1,v1>, is processed by calling the map function of the mapper. In practice, the
key k1 is often ignored by the mapper (for instance, it may be the line number of
the incoming text in the value). The mapper transforms each <k1, v1> pair into a
list of <k2, v2>pairs.
3. The output of all the mappers is (conceptually) aggregated into one giant list of
<k2,v2> pairs. All pairs sharing the same k2 are grouped together into a new
(key/value) pair, <k2, list(v2)>. The framework asks the reducer to process each
one of these aggregated (key/value) pairs individually. (2M)
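
As an illustrative sketch of how this data flow maps onto the Hadoop Java MapReduce API (the class names LineMapper and SumReducer are assumptions for illustration only):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// <k1, v1> is <byte offset of the line, line text>; the offset (k1) is usually ignored.
// The last two type parameters give the <k2, v2> output pairs described in step 2.
class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map(LongWritable k1, Text v1, Context context) would emit <k2, v2> pairs here.
}

// The reducer receives each aggregated <k2, list(v2)> pair as
// reduce(Text k2, Iterable<IntWritable> values, Context context), as in step 3.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
}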
OR
5. (10M)
MapReduce is a data processing model whose greatest advantage is the easy scaling of
data processing over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers .
MapReduce programs are executed in two main phases, called mapping and reducing.
Each phase is defined by a data-processing function, and these functions are called mapper
and reducer, respectively. (2M)
In the original work at Google, the task was to create search indexes that contain vectors of
document URLs for each word in the web; the pages were tokenized and then the
combined lists aggregated together, much like the word counter presented here.
Partitioning and shuffling are common design patterns that go along with mapping and
reducing.
In order for mapping, reducing, partitioning, and shuffling to seamlessly work together, we
need to agree on a common structure for the data being processed. MapReduce uses lists
and (key/value) pairs as its main data primitives. The keys and values are often integers or
strings but can also be dummy values to be ignored or complex object types.

(2M)
1. The input to your application must be structured as a list of (key/value) pairs, list(<k1,
v1>). The input format for processing multiple files is usually list(<String filename, String
file_content>).
2. The list of (key/value) pairs is broken up and each individual (key/value) pair, <k1,v1>,
is processed by calling the map function of the mapper. In practice, the key k1 is often
ignored by the mapper (for instance, it may be the line number of the incoming text in the
value). The mapper transforms each <k1, v1> pair into a list of <k2, v2>pairs.
3. The output of all the mappers is (conceptually) aggregated into one giant list of
<k2,v2> pairs. All pairs sharing the same k2 are grouped together into a new (key/value)
pair, <k2, list(v2)>. The framework asks the reducer to process each one of these
aggregated (key/value) pairs individually.
(3M)

The word-counting program in MapReduce


map(String filename, String document) {
List<String> T = tokenize(document);
for each token in T {
emit ((String)token, (Integer) 1);
}
}
reduce(String token, List<Integer> values) {
Integer sum = 0;
for each value in values {
sum = sum + value;
}
emit ((String)token, (Integer) sum);
} (3M)
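
For comparison, a minimal hedged sketch of the same word count written against the Hadoop Java MapReduce API (the class name WordCount and the command-line input/output paths are assumptions for illustration):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: tokenize each input line and emit (token, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // Reducer: sum the counts for each token.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}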
6. a. (6M)

b. (4M)
Two types of tables:
1. Managed tables
2. External table
When you create a table in Hive, by default Hive will manage the data, which
means that Hive moves the data into its warehouse directory.
The difference between the two table types is seen in the LOAD and DROP
semantics.
When you load data into a managed table, it is moved into Hive’s warehouse
directory.
For example, this:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;

will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for
the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
(2M)
DROP TABLE managed_table;
The table, including its metadata and its data, is deleted. It bears repeating that
since the initial LOAD performed a move operation, and the DROP performed a
delete operation, the data no longer exists anywhere.

An external table behaves differently: you control the creation and deletion of the
data yourself.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
With the EXTERNAL keyword, Hive knows that it is not managing the data, so it
doesn’t move it to its warehouse directory.
Indeed, it doesn’t even check whether the external location exists at the time it is
defined.
This is a useful feature because it means you can create the data lazily after
creating the table.
(2M)
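
For context only, a minimal hedged sketch of running the same statements through the Hive JDBC driver (assumes HiveServer2 is reachable at localhost:10000 and hive-jdbc is on the classpath; the class name HiveExternalTableDemo is an illustrative assumption):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveExternalTableDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // External table: Hive records only metadata; the data stays at the given LOCATION.
        stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS external_table (dummy STRING) "
                   + "LOCATION '/user/tom/external_table'");
        stmt.execute("LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table");
        // DROP removes only the metadata; the files under /user/tom/external_table remain in HDFS.
        stmt.execute("DROP TABLE external_table");
        stmt.close();
        con.close();
    }
}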

OR
7. (10M)
HDFS is a filesystem designed for large-scale distributed data processing under
frameworks such as MapReduce. It can store a big data set of 100 TB as a single file in
HDFS. Hadoop provides a set of command line utilities that work similarly to the
Linux file commands.
hadoop fs -cmd <args> (1M)

The most common file management tasks in Hadoop include:


1. Adding files and directories
2. Retrieving files
3. Deleting files
A URI pinpoints the location of a specific file or directory. The full URI format is
scheme://authority/path, for example
hdfs://localhost:9000/user/chuck/example.txt.

You can use the Hadoop cat command to show the content of that file:
hadoop fs -cat hdfs://localhost:9000/user/chuck/example.txt (1M)

ADDING FILES AND DIRECTORIES


HDFS has a default working directory of /user/$USER, where $USER is your login user
name.
1. Create a directory: hadoop fs -mkdir /user/chuck
2. To see the list of subdirectories: hadoop fs -lsr /
3. Create a text file on the local filesystem called example.txt and copy it into HDFS:
hadoop fs -put example.txt .
4. Re-execute the recursive file listing command to see that the new file has been added
to HDFS: hadoop fs -lsr /
(4M)

RETRIEVING FILES
1. To retrieve a file from HDFS: hadoop fs -get example.txt .
2. If the file is huge and you’re interested in a quick check of its content, you can pipe
the output of Hadoop’s cat into a Unix head:
hadoop fs -cat example.txt | head
3. To look at the last kilobyte of a file: hadoop fs -tail example.txt (3M)
DELETING FILES
The Hadoop command for removing files is rm:
hadoop fs -rm example.txt (1M)
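
The same file operations can also be done programmatically; a minimal hedged sketch using the HDFS Java FileSystem API (assumes fs.defaultFS is configured, e.g. hdfs://localhost:9000; the class name HdfsFileOps is an illustrative assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/chuck"));                        // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("example.txt"),
                             new Path("/user/chuck/example.txt")); // hadoop fs -put
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"),
                           new Path("example.copy.txt"));          // hadoop fs -get
        fs.delete(new Path("/user/chuck/example.txt"), false);     // hadoop fs -rm
        fs.close();
    }
}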

*********************************************************************************************
