Table of Contents

Sr. No.  Experiment
1.  To Study Big Data Analytics and Hadoop Architecture.
2.  To Understand the Overall Programming Architecture of the MapReduce API. Implement MapReduce Programming.
3.  To Study HDFS Commands.
4.  To Study how Hadoop serializes and deserializes data of integer type.
5.  To run a basic Word Count MapReduce program to understand the MapReduce Paradigm.
6.  Basic CRUD operations in MongoDB.
7.  Store the basic information about students such as roll no and name using various collection types (Map).
8.  To run a Grep program on Hadoop to understand the MapReduce Paradigm: to count words in a given file, to view the output file, and to calculate execution time.
9.  Installation of the SPARK framework with or without the Hadoop framework.
10. To Study the Hive commands using HQL (DDL and DML).
THEORY:
➢ Data sources. All big data solutions start with one or more data sources.
➢ Batch processing. Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading
source files, processing them, and writing the output to new files.
➢ Analytical data store. Many big data solutions prepare data for analysis and
then serve the processed data in a structured format that can be queried using
analytical tools.
➢ Analysis and reporting. The goal of most big data solutions is to provide
insights into the data through analysis and reporting.
➢ Hadoop Common – the libraries and utilities used by other Hadoop modules
➢ Hadoop Distributed File System (HDFS) – the Java-based scalable system that
stores data across multiple machines without prior organization.
➢ YARN – (Yet Another Resource Negotiator) provides resource management
for the processes running on Hadoop.
➢ MapReduce – a parallel processing software framework. It comprises two
steps. In the Map step, a master node takes the inputs, partitions them into
smaller sub-problems, and distributes them to worker nodes. After the Map
step has taken place, the master node takes the answers to all of the
sub-problems and combines them to produce the output.
EXERCISE:
1) What do you know about the term “Big Data”, and what are the applications of
Big Data?
2) What is the Internet of Things?
3) What are the challenges of using Hadoop?
4) Tell us how Big Data and Hadoop are related to each other.
5) How would you transform unstructured data into structured data?
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
THEORY:
❖ Architecture of the MapReduce API
➢ A MapReduce job usually splits the input data set into independent chunks
which are processed by the map tasks in a completely parallel manner.
➢ The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
➢ Typically, both the input and the output of the job are stored in a file system.
➢ The framework takes care of scheduling and monitoring tasks, then re-
executes the failed tasks.
➢ The main class of the JobContext interface is the Job class, which helps with
the implementation.
➢ Job class: The most important class in the MapReduce API is the Job class. It
allows the user to configure the job, submit it, control its execution, and
query its state. The set methods work only until the job is submitted; after
that they throw an IllegalStateException.
Method: The most important method of the Mapper class is map. Its syntax is
map(KEYIN key, VALUEIN value,
org.apache.hadoop.mapreduce.Mapper.Context context)
➢ Reducer class: It defines the reduce job in MapReduce. It reduces a set of
intermediate values that share a key to a smaller set of values. Via the
JobContext.getConfiguration() method we can access the configuration for the job.
The three phases of a reducer are:
• Shuffle: the sorted output of every mapper is copied to the reducer over
HTTP across the network.
• Sort: while the outputs are being fetched, the shuffle and sort phases occur
simultaneously and the framework merges the sorted data.
• Reduce: the syntax of this phase is reduce(Object, Iterable, Context).
Method: The most important method of the Reducer class is reduce. Its syntax is
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
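The map → shuffle/sort → reduce flow described above can be sketched in plain Java without a Hadoop installation. This is only a simulation of the paradigm; the class and method names below are illustrative and are not part of the real MapReduce API.

```java
import java.util.*;

public class MiniMapReduce {
    // "Map" step: emit (word, 1) pairs from one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Shuffle/sort" step: group the values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // "Reduce" step: sum the grouped values for each key.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> groups) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"a b a", "b a"}) pairs.addAll(map(line));
        System.out.println(reduce(shuffle(pairs))); // {a=3, b=2}
    }
}
```

In the real framework, map and reduce run on different nodes and the shuffle happens over the network; here all three phases run in one process purely to show the data flow.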
EXERCISES:
1) The following data set contains the monthly electrical consumption and the annual
average for various years. Find the maximum and the minimum electrical
consumption for each year.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
2) The table below shows the monthly visitors of a website page and the annual
average for five years.
JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC AVG
2008 23 23 2 43 24 25 26 26 26 25 26 26 25
2009 26 27 28 28 28 30 31 31 31 30 30 30 29
2010 31 32 32 32 33 34 35 36 36 34 34 34 34
2014 39 38 39 39 39 41 42 43 40 39 39 38 40
2016 38 39 39 39 39 41 41 41 00 40 40 39 45
Find the maximum and the minimum number of visitors for each year.
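A plain-Java sketch of the per-year max/min logic for these exercises. The real exercise runs this as a MapReduce job; here the rows are parsed locally, and the row format (year, twelve monthly values, then the annual average) is assumed from the tables above.

```java
import java.util.*;

public class MaxMinPerYear {
    // Each input row: year, 12 monthly values, annual average.
    // Returns {year -> [min, max]} computed over the 12 monthly values only.
    static Map<String, int[]> maxMin(List<String> rows) {
        Map<String, int[]> result = new LinkedHashMap<>();
        for (String row : rows) {
            String[] f = row.trim().split("\\s+");
            int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
            for (int i = 1; i <= 12; i++) { // skip year (index 0) and average (index 13)
                int v = Integer.parseInt(f[i]);
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            result.put(f[0], new int[] {min, max});
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "1979 23 23 2 43 24 25 26 26 26 26 25 26 25",
            "1980 26 27 28 28 28 30 31 31 31 30 30 30 29");
        for (Map.Entry<String, int[]> e : maxMin(rows).entrySet())
            System.out.println(e.getKey() + " min=" + e.getValue()[0]
                               + " max=" + e.getValue()[1]);
        // 1979 min=2 max=43
        // 1980 min=26 max=31
    }
}
```

In the Hadoop version, the mapper would emit (year, value) pairs and the reducer would fold them with the same min/max logic.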
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
THEORY:
❖ HDFS Commands
1) touchz
HDFS Command to create a file in HDFS with file size 0 bytes.
Usage: hdfs dfs -touchz /directory/filename
Command: hdfs dfs -touchz /new_edureka/sample
2) text
HDFS Command that takes a source file and outputs the file in text format.
Usage: hdfs dfs -text /directory/filename
Command: hdfs dfs -text /new_edureka/test
3) cat
HDFS Command that reads a file on HDFS and prints the content of that file to the
standard output.
Usage: hdfs dfs -cat /path/to/file_in_hdfs
Command: hdfs dfs -cat /new_edureka/test
4) copyFromLocal
HDFS Command to copy a file from the local file system to HDFS.
Usage: hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
Command: hdfs dfs -copyFromLocal /home/edureka/test /new_edureka
5) copyToLocal
HDFS Command to copy a file from HDFS to the local file system.
Usage: hdfs dfs -copyToLocal <hdfs source> <localdst>
Command: hdfs dfs -copyToLocal /new_edureka/test /home/edureka
6) put
HDFS Command to copy single or multiple sources from the local file system to the
destination file system.
Usage: hdfs dfs -put <localsrc> <destination>
Command: hdfs dfs -put /home/edureka/test /user
7) get
HDFS Command to copy files from HDFS to the local file system.
Usage: hdfs dfs -get <src> <localdst>
8) count
HDFS Command to count the number of directories, files, and bytes under the paths
that match the specified file pattern.
Usage: hdfs dfs -count <path>
Command: hdfs dfs -count /user
9) rm
HDFS Command to remove a file from HDFS.
Usage: hdfs dfs -rm <path>
Command: hdfs dfs -rm /new_edureka/test
10) cp
HDFS Command to copy files from source to destination. This command allows
multiple sources as well, in which case the destination must be a directory.
Usage: hdfs dfs -cp <src> <dest>
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
11) expunge
HDFS Command that empties the trash.
Command: hdfs dfs -expunge
12) usage
HDFS Command that returns the help for an individual command.
Usage: hdfs dfs -usage <command>
Command: hdfs dfs -usage mkdir
13) fsck
HDFS Command to check the health of the Hadoop file system.
Command: hdfs fsck /
14) ls
HDFS Command to display the list of files and directories in HDFS.
Command: hdfs dfs -ls /
15) mkdir
HDFS Command to create a directory in HDFS.
Usage: hdfs dfs -mkdir /directory_name
Command: hdfs dfs -mkdir /new_edureka
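As an aside, the directories/files/bytes tally that hdfs dfs -count produces can be mimicked on the local file system with plain Java NIO. This is only an analogy to the command's semantics, not the HDFS client API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;

public class LocalCount {
    // Returns {directories, files, total bytes} under root,
    // like the DIR_COUNT / FILE_COUNT / CONTENT_SIZE columns of hdfs dfs -count.
    static long[] count(Path root) throws IOException {
        long[] totals = new long[3];
        Files.walk(root).forEach(p -> {
            if (Files.isDirectory(p)) {
                totals[0]++; // the root directory itself is counted too
            } else {
                totals[1]++;
                try {
                    totals[2] += Files.size(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        });
        return totals;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("count_demo");
        Files.write(dir.resolve("a.txt"), new byte[10]);
        long[] t = count(dir);
        System.out.println(t[0] + " " + t[1] + " " + t[2]); // 1 1 10
    }
}
```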
EXERCISES:
1) tail 2) setrep 3) chgrp 4) chown 5) df 6) du 7) test 8) mv 9) getmerge 10) rmdir
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
THEORY:
➢ Serialization is the process of converting object data into byte-stream data
for transmission over a network across different nodes in a cluster, or for
persistent data storage.
➢ Deserialization is the reverse of serialization: it converts byte-stream data
back into object data when reading from HDFS. Hadoop provides Writables
for serialization and deserialization purposes.
➢ Hadoop provides classes that wrap the Java primitive types and implement
the WritableComparable and Writable interfaces. They are provided in the
org.apache.hadoop.io package.
➢ All the Writable wrapper classes have a get() and a set() method for
retrieving and storing the wrapped value. Below is the list of primitive
Writable data types available in Hadoop.
• BooleanWritable
• ByteWritable
• IntWritable
• VIntWritable
• FloatWritable
• LongWritable
• VLongWritable
• DoubleWritable
➢ In the above list, VIntWritable and VLongWritable are used for
variable-length integer and long types respectively.
➢ The serialized size of each of the above primitive Writable data types is the
same as the size of the actual Java data type. So the size of an IntWritable is
4 bytes and a LongWritable is 8 bytes.
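The fixed 4-byte encoding used by IntWritable can be reproduced with plain java.io streams. This is a sketch of the idea only; the real class lives in org.apache.hadoop.io, and the method names here are illustrative.

```java
import java.io.*;

public class IntSerDe {
    // Serialize an int the way IntWritable.write() does: 4 big-endian bytes.
    static byte[] serialize(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeInt(value);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for an in-memory stream
        }
    }

    // Deserialize: read the 4 bytes back into an int, as IntWritable.readFields() does.
    static int deserialize(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] b = serialize(42);
        System.out.println(b.length + " bytes, value " + deserialize(b)); // 4 bytes, value 42
    }
}
```

The round trip shows why an IntWritable always occupies exactly 4 bytes on the wire, whereas VIntWritable would use between 1 and 5 bytes depending on the magnitude.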
4. Array Writable Classes
➢ ArrayWritable
➢ TwoDArrayWritable
➢ Map Writable Classes
Hadoop provides the below MapWritable data types, which implement the
java.util.Map interface:
➢ AbstractMapWritable – This is abstract or base class for other
MapWritable classes.
➢ MapWritable – This is a general purpose map mapping Writable keys to
Writable values.
➢ SortedMapWritable – This is a specialization of the MapWritable class
that also implements the SortedMap interface.
➢ Other Writable Classes
5. NullWritable
➢ NullWritable is a special type of Writable representing a null value. No
bytes are read or written when a data type is specified as NullWritable.
So, in Mapreduce, a key or a value can be declared as a NullWritable
when we don’t need to use that field.
6. ObjectWritable
➢ This is a general-purpose generic object wrapper which can store any
objects like Java primitives, String, Enum, Writable, null, or arrays.
Text
➢ Text can be used as the Writable equivalent of java.lang.String, and its
maximum size is 2 GB. Unlike Java's String data type, Text is mutable in
Hadoop.
7. BytesWritable
➢ BytesWritable is a wrapper for an array of binary data.
8. GenericWritable
➢ GenericWritable is similar to ObjectWritable, but instead of serializing the
full class name with every record it stores a compact index into a fixed,
user-declared set of Writable classes.
EXERCISE:
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
Input: a set of data – Bus, Car, bus, car, train, car, bus, car, train, bus,
TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output: converted into a smaller set of tuples – (BUS,7), (CAR,7), (TRAIN,4)
➢ Reduce Function – Takes the output from Map as an input and combines
those data tuples into a smaller set of tuples.
EXERCISE:
1) Write a program to count the words in a given file, view the output file, and
calculate the execution time.
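The counting logic itself can be sketched in plain Java. The exercise proper wraps this in Hadoop's Mapper and Reducer classes; here words are folded to lower case, as the sample input/output above implies, and the class name is illustrative.

```java
import java.util.*;

public class WordCount {
    // Count case-insensitive word frequencies, as in the sample above.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : text.toLowerCase().split("[,\\s]+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String data = "Bus, Car, bus, car, train, car, bus, car, train, bus, "
                    + "TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN";
        System.out.println(count(data)); // {bus=7, car=7, train=4}
    }
}
```

To measure execution time, wrap the call between two System.nanoTime() readings; in the Hadoop version the job counters and the console log report the elapsed time instead.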
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
➢ To retrieve (Select) the inserted document, run the below command. The
find() command will retrieve all the documents of the given collection.
db.collection_name.find()
➢ If a record is to be retrieved based on some criteria, the find() method should
be called passing parameters, then the record will be retrieved based on the
attributes specified.
db.collection_name.find({"fieldname":"value"})
➢ For example, let us retrieve the record from the student collection where the
attribute regNo is 3014; the query for the same is shown below:
db.students.find({"regNo":"3014"})
➢ Note that after running the remove() method, the entry has been deleted from
the student collection.
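The criteria-based find() above behaves like a filter over a set of documents. In plain Java terms it can be pictured like this (an analogy only, not the MongoDB driver API; the collection contents are illustrative):

```java
import java.util.*;

public class FindByField {
    // Mimic db.collection.find({field: value}) over in-memory "documents".
    static List<Map<String, String>> find(List<Map<String, String>> docs,
                                          String field, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(field))) out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> students = Arrays.asList(
            Map.of("regNo", "3014", "name", "A"),
            Map.of("regNo", "3015", "name", "B"));
        System.out.println(find(students, "regNo", "3014").size()); // 1
    }
}
```

Calling find with no criteria would correspond to db.collection_name.find(), which returns every document.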
EXERCISE:
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
THEORY:
List is an interface; among the classes which implement List are LinkedList
and ArrayList.
A hash function is any function that can be used to map data of arbitrary size to
fixed-size values. The values returned by a hash function are called hash values,
hash codes, digests, or simply hashes.
The values are used to index a fixed-size table called a hash table. Use of a hash
function to index a hash table is called hashing or scatter storage addressing.
import java.util.*;
Map<Integer, String> map = new HashMap<>();
map.put(1, "Amit"); map.put(2, "Priya"); // sample entries
Iterator<Map.Entry<Integer, String>> it = map.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry<Integer, String> m = it.next();
    System.out.println(m.getKey() + " " + m.getValue());
}
EXERCISE:
1) Implement operations such as add, update, and delete with the use of a hash
collection for a production table.
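A minimal sketch of the exercise's add/update/delete operations backed by a hash collection. The table name, field types, and sample values are illustrative assumptions, not a prescribed schema.

```java
import java.util.*;

public class ProductionTable {
    private final Map<Integer, String> rows = new HashMap<>();

    void add(int id, String item)    { rows.put(id, item); }     // create
    void update(int id, String item) { rows.replace(id, item); } // update existing key only
    void delete(int id)              { rows.remove(id); }        // delete
    String get(int id)               { return rows.get(id); }    // read

    public static void main(String[] args) {
        ProductionTable t = new ProductionTable();
        t.add(1, "bolts");
        t.update(1, "nuts");
        System.out.println(t.get(1)); // nuts
        t.delete(1);
        System.out.println(t.get(1)); // null
    }
}
```

Note the design choice of replace() over put() for update: replace() only changes a row that already exists, which matches update semantics in a table.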
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
-Dmapred.job.queue.name=default submits the job to the specified queue. The queues
here are sqoop and default.
Found 7 items
-rw-r--r-- 1 bcampbell supergroup 507105 2014-09-07 15:5 input/Milton_ParadiseLost.txt
-rw-r--r-- 1 bcampbell supergroup 246679 2014-09-07 15:55 input/WilliamYeats.txt
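The core of the Grep job over input files like those listed above is counting matches of a pattern, and the exercise also asks for the execution time. Both can be sketched without Hadoop in plain Java (illustrative only; the real job uses the framework's Grep example):

```java
import java.util.*;
import java.util.regex.*;

public class MiniGrep {
    // Count occurrences of the given regex across the input lines.
    static int countMatches(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        int count = 0;
        for (String line : lines) {
            Matcher m = p.matcher(line);
            while (m.find()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("paradise lost", "paradise regained");
        long start = System.nanoTime();
        int n = countMatches(lines, "paradise");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " matches in " + elapsedMs + " ms");
    }
}
```

On Hadoop, the same pattern-matching happens in the mappers, the counts are summed in the reducer, and the execution time is read from the job's console output.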
EXERCISE:
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
$ java -version
If Java is already installed on your system, you will see its version string in
response. In case you do not have Java installed on your system, install Java
before proceeding to the next step.
Step 2: Verifying Scala installation
You need the Scala language to implement Spark. So let us verify the Scala
installation using the following command.
$ scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to the next step
for Scala installation.
Step 3: Downloading Scala
Download the latest version of Scala by visiting the following link: Download Scala.
For this tutorial, we are using the scala-2.11.6 version. After downloading, you will
find the Scala tar file in the download folder.
Step 4: Installing Scala
Follow the below given steps for installing Scala.
Step 5: Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Step 6: Move Scala software files
Use the following commands for moving the Scala software files, to respective
directory (/usr/local/scala).
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Step 7: Set PATH for Scala
Use the following command for setting PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
Step 8: Verifying Scala Installation
After installation, it is better to verify it. Use the following command for verifying
Scala installation.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Step 9: Downloading Apache Spark
Download the latest version of Spark by visiting the following link Download
Spark. For this tutorial, we are using spark-1.3.1-bin-hadoop2.6 version. After
downloading it, you will find the Spark tar file in the download folder.
Step 10: Installing Spark
Follow the steps given below for installing Spark.
Step 11: Extracting Spark tar
Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Step 12: Moving Spark software files
Use the following commands to move the Spark software files to the respective
directory (/usr/local/spark).
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Step 13: Setting up the environment for Spark
Add the following line to the ~/.bashrc file. This adds the location where the
Spark software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Step 14: Verifying the Spark Installation
Write the following command for opening Spark shell.
$spark-shell
If spark is installed successfully then you will find the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication
disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify
permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on
port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks
What is HQL?
Hive defines a simple SQL-like query language for querying and managing large datasets,
called HiveQL (HQL). It’s easy to use if you’re familiar with SQL. Hive also allows
programmers who are familiar with MapReduce to plug in custom mappers and reducers
to perform more sophisticated analysis.
Hive Commands:
DDL statements are used to build and modify the tables and other objects in the database.
Example :
Go to the Hive shell by giving the command sudo hive, and enter the
command ‘create database <database name>’ to create a new database in Hive.
To list the databases in the Hive warehouse, enter the command ‘show databases’.
Copy the input data to HDFS from local by using the copyFromLocal command.
When we create a table in Hive, it is created in the default location of the Hive
warehouse – “/user/hive/warehouse”. After creation of the table we can move the
data from HDFS to the Hive table.
DML statements are used to retrieve, store, modify, delete, insert and update data in the
database.
Example :
Syntax :
The Load operation is used to move data into the corresponding Hive table. If the
keyword local is specified, the load command takes the local file system path; if the
keyword local is not specified, we have to use the HDFS path of the file.
Here are some examples for the LOAD data LOCAL command
After loading the data into the Hive table, we can apply the Data Manipulation
Statements or aggregate functions to retrieve the data.
The count aggregate function is used to count the total number of records in a table.
The create external keyword is used to create a table and provide a location where the
table will be created, so that Hive does not use its default location for this table. An
EXTERNAL table points to any HDFS location for its storage, rather than the default storage.
Insert Command:
The insert command is used to load data into a Hive table. Inserts can be done to a
table or a partition.
• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
• INSERT INTO is used to append data to the existing data in a table. (Note: the
INSERT INTO syntax works from version 0.8 onwards.)
The ‘Drop Table’ statement deletes the data and metadata for a table. In the case of external
tables, only the metadata is deleted.
Load data local inpath ‘aru.txt’ into table tablename, and then check the employee1
table by using the Select * from tablename command.
To count the number of records in the table, use Select count(*) from txnrecords;
Aggregation :
This command counts the different categories of the ‘cate’ table. Here there are 3
different categories.
Grouping :
Join Command :
Join Operation:
A join operation is performed to combine fields from two tables by using values
common to each.
The result of a left outer join (or simply left join) for tables A and B always contains all
records of the “left” table (A), even if the join-condition does not find any matching record in
the “right” table (B).
A right outer join (or right join) closely resembles a left outer join, except with the treatment
of the tables reversed. Every row from the “right” table (B) will appear in the joined table at
least once.
Full Join:
The joined table will contain all records from both tables, and fill in NULLs for missing
matches on either side.
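The full-join behaviour described above can be simulated in plain Java over two key-to-value maps. This is only an illustration of the row-matching semantics, not HQL, and the table contents are illustrative.

```java
import java.util.*;

public class OuterJoin {
    // Full outer join of a and b on their keys; a missing side becomes null,
    // like the NULL-filled rows of a SQL/HQL full join.
    static Map<Integer, String[]> fullJoin(Map<Integer, String> a, Map<Integer, String> b) {
        Map<Integer, String[]> joined = new TreeMap<>();
        Set<Integer> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (int k : keys) joined.put(k, new String[] {a.get(k), b.get(k)});
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> emp  = Map.of(1, "Alice", 2, "Bob");
        Map<Integer, String> dept = Map.of(2, "Sales", 3, "HR");
        for (Map.Entry<Integer, String[]> e : fullJoin(emp, dept).entrySet())
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        // 1 [Alice, null]   (left-only row, as in a left outer join)
        // 2 [Bob, Sales]    (matched row)
        // 3 [null, HR]      (right-only row, as in a right outer join)
    }
}
```

A left outer join would keep only keys present in a; a right outer join only keys present in b; the full join keeps the union of both, which is exactly what the key set built above represents.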
Once done with Hive, we can use the quit command to exit from the Hive shell.
EVALUATION:
Involvement: 4 marks
Understanding / Problem solving: 3 marks
Timely Completion: 3 marks
Total: 10 marks