
Big Data & Web Intelligence


Laboratory Manual
Table of Contents

Sr. No.  Experiment                                                                          Date    Marks    Signature

1.   To Study Big Data Analytics and Hadoop Architecture.
2.   To Understand the Overall Programming Architecture of the MapReduce API. Implement MapReduce Programming.
3.   To Study HDFS Commands.
4.   To Study Serialization and Deserialization of Integer-Type Data in Hadoop.
5.   To Run a Basic Word Count MapReduce Program to Understand the MapReduce Paradigm.
6.   Basic CRUD Operations in MongoDB.
7.   Store Basic Information about Students, such as Roll No. and Name, Using the Map Collection Type.
8.   To Run a Grep Program on Hadoop to Understand the MapReduce Paradigm: To Count Words in a Given File, View the Output File, and Calculate Execution Time.
9.   Installation of the Spark Framework with or without the Hadoop Framework.
10.  To Study the Hive Commands Using HQL (DDL and DML).

EXPERIMENT NO: 1 DATE: / /


TITLE: To Study Big Data Analytics and Hadoop Architecture.
OBJECTIVES: On completion of this experiment, the student will be able to…
➢ Know the concept of big data architecture.
➢ Know the concept of Hadoop architecture.

THEORY:

❖ Introduction to Big Data Architecture:


➢ A big data architecture is designed to handle the ingestion, processing, and
analysis of data that is too large or complex for traditional database systems.

Components of Big Data Architecture

➢ Data sources. All big data solutions start with one or more data sources.

➢ Data storage. Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats.

➢ Batch processing. Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading
source files, processing them, and writing the output to new files.

➢ Real-time message ingestion. If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.


➢ Stream processing. After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis.

➢ Analytical data store. Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.

➢ Analysis and reporting. The goal of most big data solutions is to provide insights into the data through analysis and reporting.

➢ Orchestration. Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard.

❖ Introduction to Hadoop Architecture:


➢ Apache Hadoop offers a scalable, flexible and reliable distributed computing
big data framework for a cluster of systems with storage capacity and local
computing power by leveraging commodity hardware.

➢ Hadoop follows a master-slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm. The following Hadoop components play a vital role in the Hadoop architecture:


➢ Hadoop Common – the libraries and utilities used by other Hadoop modules
➢ Hadoop Distributed File System (HDFS) – the Java-based scalable system that
stores data across multiple machines without prior organization.
➢ YARN – (Yet Another Resource Negotiator) provides resource management
for the processes running on Hadoop.
➢ MapReduce – a parallel processing software framework. It comprises two steps: in the map step, a master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes; after the map step has taken place, the master node takes the answers to all of the sub-problems and combines them to produce the output.
EXERCISE:
1) What do you know about the term “Big Data”, and what are the applications of Big Data?
2) What is the Internet of Things?
3) What are the challenges of using Hadoop?
4) Explain how big data and Hadoop are related to each other.
5) How would you transform unstructured data into structured data?

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:


EXPERIMENT NO: 2 DATE: / /

TITLE: To Understand the Overall Programming Architecture of the MapReduce API. Implement MapReduce Programming.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of the MapReduce architecture.
➢ Know the concept of programming methods.

THEORY:
❖ Architecture of the MapReduce API
➢ A MapReduce job usually splits the input data set into independent chunks
which are processed by the map tasks in a completely parallel manner.
➢ The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
➢ Typically, both the input and the output of the job are stored in a file system.
➢ The framework takes care of scheduling and monitoring tasks, then re-
executes the failed tasks.

➢ The main classes and interfaces involved in MapReduce programming are the JobContext interface, the Job class, the Mapper class and the Reducer class.

➢ JobContext interface

It is the super-interface for all the classes that define different jobs in MapReduce. While a job is running, it gives the tasks a read-only view of the job. The JobContext sub-interfaces are:
• MapContext: defines the context that is given to the mapper.
MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
• ReduceContext: defines the context that is passed to the reducer.
ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

➢ The main class implementing the JobContext interface is the Job class.

➢ Job class: The most important class in the MapReduce API is the Job class. The Job class allows the user to configure the job, submit it, control its execution, and query its state. The set methods only work until the job is submitted; after that they throw an IllegalStateException.

➢ Constructors of the Job class

• Job()
• Job(Configuration conf)
• Job(Configuration conf, String jobName)

➢ Methods of the Job class

• getJobName(): returns the job name specified by the user
• getJobState(): returns the current state of the job
• isComplete(): checks whether the job has finished or not
• setInputFormatClass(): sets the input format for the job
• setJobName(String name): sets the job name specified by the user
• setOutputFormatClass(): sets the output format for the job
• setMapperClass(Class): sets the mapper for the job
• setReducerClass(Class): sets the reducer for the job
• setPartitionerClass(Class): sets the partitioner for the job
• setCombinerClass(Class): sets the combiner for the job
➢ Mapper class: It defines the map job; it maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. A given input pair may map to zero or more output pairs.


Method: The most important method of the Mapper class is map. The syntax is:
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)

➢ Reducer class: It defines the reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. The configuration for a job can be accessed via the JobContext.getConfiguration() method.
The three phases of the reducer are:
• Shuffle: The reducer copies the sorted output from every mapper over HTTP across the network.
• Sort: While the outputs are being fetched, the shuffle and sort phases occur at the same time and the data is merged.
• Reduce: The syntax of this phase is reduce(Object, Iterable, Context).

Method: The most important method of the Reducer class is reduce. The syntax is:
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
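To see how these classes fit together, the following is a minimal sketch of a complete MapReduce program for Exercise 1 below (finding the maximum monthly electrical consumption per year). The class and variable names are illustrative, and the input is assumed to consist of whitespace-separated lines of the form "year m1 m2 ... m12 average"; swapping Math.max for Math.min in the same structure gives the minimum consumption.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxConsumption {

    /* Mapper: for each line "year m1 m2 ... m12 avg", emit (year, maximum monthly value) */
    public static class ConsumptionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length < 3) return;                      // skip malformed lines
            int max = Integer.MIN_VALUE;
            for (int i = 1; i < parts.length - 1; i++) {       // skip the year (first) and the average (last)
                max = Math.max(max, Integer.parseInt(parts[i]));
            }
            context.write(new Text(parts[0]), new IntWritable(max));
        }
    }

    /* Reducer: keep the maximum of all values emitted for a year */
    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    /* Driver: configures the job using the Job class methods described above */
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max consumption");
        job.setJarByClass(MaxConsumption.class);
        job.setMapperClass(ConsumptionMapper.class);           // setMapperClass(Class)
        job.setReducerClass(MaxReducer.class);                 // setReducerClass(Class)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}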

EXERCISES:
1) The following table contains the monthly electrical consumption and the annual average for various years. Find the maximum and minimum electrical consumption for each year.

1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

2) The following table contains the monthly visitors of a website page and the annual average for five years. Find the maximum and minimum number of visitors for each year.

     JAN FEB MAR APR MAY JUN JULY AUG SEP OCT NOV DEC AVG
2008 23  23  2   43  24  25  26   26  26  25  26  26  25
2009 26  27  28  28  28  30  31   31  31  30  30  30  29
2010 31  32  32  32  33  34  35   36  36  34  34  34  34
2014 39  38  39  39  39  41  42   43  40  39  39  38  40
2016 38  39  39  39  39  41  41   41  00  40  40  39  45

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 3 DATE: / /

TITLE: To Study HDFS Commands.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Understand the HDFS commands.

THEORY:

❖ HDFS Commands
1) touchz
HDFS Command to create a file in HDFS with file size 0 bytes.
Usage: hdfs dfs -touchz /directory/filename
Command: hdfs dfs -touchz /new_edureka/sample

2) text
HDFS Command that takes a source file and outputs the file in text format.
Usage: hdfs dfs -text /directory/filename
Command: hdfs dfs -text /new_edureka/test

3) cat
HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.
Usage: hdfs dfs -cat /path/to/file_in_hdfs
Command: hdfs dfs -cat /new_edureka/test

4) copyFromLocal
HDFS Command to copy a file from the local file system to HDFS.
Usage: hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
Command: hdfs dfs -copyFromLocal /home/edureka/test /new_edureka

5) copyToLocal
HDFS Command to copy a file from HDFS to the local file system.
Usage: hdfs dfs -copyToLocal <hdfs source> <localdst>
Command: hdfs dfs -copyToLocal /new_edureka/test /home/edureka

6) put
HDFS Command to copy a single source or multiple sources from the local file system to the destination file system.
Usage: hdfs dfs -put <localsrc> <destination>
Command: hdfs dfs -put /home/edureka/test /user

7) get
HDFS Command to copy files from HDFS to the local file system.
Usage: hdfs dfs -get <src> <localdst>
Command: hdfs dfs -get /user/test /home/edureka

8) count
HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
Usage: hdfs dfs -count <path>
Command: hdfs dfs -count /user

9) rm
HDFS Command to remove a file from HDFS.
Usage: hdfs dfs -rm <path>
Command: hdfs dfs -rm /new_edureka/test

10) cp
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Usage: hdfs dfs -cp <src> <dest>
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

11) expunge
HDFS Command that makes the trash empty.
Command: hdfs dfs -expunge

12) usage
HDFS Command that returns the help for an individual command.
Usage: hdfs dfs -usage <command>
Command: hdfs dfs -usage mkdir

13) fsck
HDFS Command to check the health of the Hadoop file system.
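For example, a typical invocation (shown here for illustration) is:
Usage: hdfs fsck <path>
Command: hdfs fsck /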

14) ls
HDFS Command to display the list of files and directories in HDFS.
Command: hdfs dfs -ls /

15) mkdir
HDFS Command to create a directory in HDFS.
Usage: hdfs dfs -mkdir /directory_name
Command: hdfs dfs -mkdir /new_edureka

EXERCISES:
1) tail 2) setrep 3) chgrp 4) chown 5) df 6) du 7) test 8) mv 9) getmerge 10) rmdir

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 4 DATE: / /

TITLE: To Study Serialization and Deserialization of Integer-Type Data in Hadoop.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of data types in Hadoop.

THEORY:
➢ Serialization is the process of converting object data into byte stream data
for transmission over a network across different nodes in a cluster or for
persistent data storage.
➢ Deserialization is the reverse process of serialization; it converts byte stream data back into object data, for example when reading data from HDFS. Hadoop provides Writables for serialization and deserialization purposes.

1. Hadoop Data Types

➢ Hadoop provides classes that wrap the Java primitive types and implement
the WritableComparable and Writable Interfaces. They are provided in the
org.apache.hadoop.io package.

➢ All the Writable wrapper classes have a get() and a set() method for
retrieving and storing the wrapped value.

2. Primitive Writable Classes


➢ These are Writable Wrappers for Java primitive data types and they hold a
single primitive value that can be set either at construction or via a setter
method.

➢ All these primitive writable wrappers have get() and set() methods to read or
write the wrapped value. Below is the list of primitive writable data types
available in Hadoop.
• BooleanWritable
• ByteWritable
• IntWritable
• VIntWritable
• FloatWritable
• LongWritable
• VLongWritable
• DoubleWritable

➢ In the above list, VIntWritable and VLongWritable are used for variable-length integer types and variable-length long types respectively.

➢ The serialized sizes of the above primitive writable data types are the same as the sizes of the actual Java data types. So, the size of IntWritable is 4 bytes and that of LongWritable is 8 bytes.
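As a small illustration of how a Writable such as IntWritable is serialized to and deserialized from a byte stream (useful for the exercise at the end of this experiment), here is a minimal sketch; the class and helper method names are illustrative.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class IntWritableSerDe {

    /* Serialize: Writable.write(DataOutput) turns the wrapped int into bytes */
    public static byte[] serialize(IntWritable writable) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writable.write(new DataOutputStream(bytes));
        return bytes.toByteArray();                    // 4 bytes for an IntWritable
    }

    /* Deserialize: Writable.readFields(DataInput) rebuilds the object from bytes */
    public static IntWritable deserialize(byte[] data) throws IOException {
        IntWritable writable = new IntWritable();
        writable.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return writable;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = serialize(new IntWritable(42));
        System.out.println("Serialized length : " + data.length);             // prints 4
        System.out.println("Deserialized value: " + deserialize(data).get()); // prints 42
    }
}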

3. Array Writable Classes


➢ Hadoop provides two array writable classes, one for single-dimensional and another for two-dimensional arrays. The elements of these arrays must be other Writable objects like IntWritable or LongWritable, not Java native data types like int or float.

• ArrayWritable
• TwoDArrayWritable

4. Map Writable Classes

➢ Hadoop provides the following MapWritable data types, which implement the java.util.Map interface:
• AbstractMapWritable – the abstract or base class for the other MapWritable classes.
• MapWritable – a general-purpose map, mapping Writable keys to Writable values.
• SortedMapWritable – a specialization of the MapWritable class that also implements the SortedMap interface.

➢ The other Writable classes are described below.

5. NullWritable

➢ NullWritable is a special type of Writable representing a null value. No bytes are read or written when a data type is specified as NullWritable. So, in MapReduce, a key or a value can be declared as NullWritable when we don’t need to use that field.

6. ObjectWritable

➢ This is a general-purpose generic object wrapper which can store any object, such as Java primitives, String, Enum, Writable, null, or arrays.

7. Text

➢ Text can be used as the Writable equivalent of java.lang.String, and its maximum size is 2 GB. Unlike Java’s String data type, Text is mutable in Hadoop.

8. BytesWritable

➢ BytesWritable is a wrapper for an array of binary data.

9. GenericWritable

➢ It is similar to ObjectWritable but supports only a few types. The user needs to subclass GenericWritable and specify the types to support.

EXERCISE:

1) Implement a program that serializes an integer data type.

2) Implement a program that deserializes an integer data type.

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 5 DATE: / /

TITLE: To Run a Basic Word Count MapReduce Program to Understand the MapReduce Paradigm.

OBJECTIVES: On completion of this experiment, the student will be able to…
➢ Know the concept of a MapReduce program.
THEORY:
➢ Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
➢ Example (Map function in Word Count):

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of data as (key, value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Example (Reduce function in Word Count):

Input (set of tuples, the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)

➢ Reduce Function – It takes the output from the Map function as input and combines those data tuples into a smaller set of tuples, as shown in the Reduce example above.
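A minimal sketch of the Word Count program (useful for the exercise below) is shown here. It follows the standard Hadoop MapReduce WordCount pattern; the class names are illustrative, and words are converted to upper case so that the output matches the (BUS,7), (CAR,7), (TRAIN,4) example above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /* Map: emit (WORD, 1) for every word in the input line */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), " ,\t\n");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toUpperCase());   // upper-case so Bus, bus and BUS count together
                context.write(word, one);
            }
        }
    }

    /* Reduce: sum all the 1s emitted for each word */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    /* Driver */
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner does a local sum on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

After packaging the classes into a jar, the program is typically run as hadoop jar <jar> WordCount <input path> <output path>, and the output can then be viewed with hdfs dfs -cat <output path>/part-r-00000.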

EXERCISE:

1) Write a program to count the words in a given file, view the output file, and calculate the execution time.

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 6 DATE: / /

TITLE: Basic CRUD operations in MongoDB.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of CRUD operations.
➢ Know the concept of MongoDB.
THEORY:
➢ CRUD operations refer to the basic Create (Insert), Read, Update and Delete operations.

➢ Inserting a document into a collection (Create)
➢ The command db.collection.insert() will perform an insert operation, inserting a document into a collection.
➢ Let us insert a document into a student collection. You must be connected to a database before doing any insert. It is done as follows:
db.student.insert({
    regNo: "3014",
    name: "Test Student",
    course: {
        courseName: "MCA",
        duration: "3 Years"
    },
    address: {
        city: "Bangalore",
        state: "KA",
        country: "India"
    }
})
➢ An entry has been made into the collection called student.

➢ Querying a document from a collection (Read)



➢ To retrieve (select) the inserted document, run the command below. The find() command will retrieve all the documents of the given collection.
db.collection_name.find()
➢ If a record is to be retrieved based on some criteria, the find() method should be called with parameters; the record will then be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
➢ For example, let us retrieve the record from the student collection where the attribute regNo is 3014. The query for the same is shown below:
db.student.find({"regNo":"3014"})

➢ Updating a document in a collection (Update)


In order to update specific field values of a document in a collection in MongoDB, run the query below.
db.collection_name.update()
➢ The update() method specified above takes the selection criteria and the new value as arguments to update a document.
➢ Let us update the attribute name of the collection student for the document with regNo 3014.
db.student.update(
    { "regNo": "3014" },
    { $set: { "name": "Viraj" } }
)

➢ Removing an entry from the collection (Delete)


➢ Let us now look into deleting an entry from a collection. In order to delete an entry from a collection, run the command as shown below:
db.collection_name.remove({"fieldname":"value"})
➢ For example: db.student.remove({"regNo":"3014"})

➢ Note that after running the remove() method, the entry has been deleted from
the student collection.

EXERCISE:

1) Create a collection for production with prod_no, prod_name, prod_price and prod_qty.

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 7 DATE: / /


TITLE: Store Basic Information about Students, such as Roll No. and Name, Using the Map Collection Type.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of the Map collection type.

THEORY:

List is an interface, and the classes which implement List are LinkedList and ArrayList.

Set is an interface which is implemented by the HashSet, TreeSet and LinkedHashSet classes.

Map interface is implemented by HashMap, TreeMap and LinkedHashMap.

A hash function is any function that can be used to map data of arbitrary size to
fixed-size values. The values returned by a hash function are called hash values,
hash codes, digests, or simply hashes.

The values are used to index a fixed-size table called a hash table. Use of a hash
function to index a hash table is called hashing or scatter storage addressing.
import java.util.*;

public class HashMapDemo {

    public static void main(String[] args) {
        /* add key-value pairs in the HashMap where
         * key -> Roll No. and value -> Name
         */
        HashMap<Integer, String> hm = new HashMap<>();
        hm.put(1, "XYZ");
        hm.put(2, "PQR");
        hm.put(3, "ABC");

        /* find the student with roll no. 2 */
        String stud = hm.get(2);
        System.out.println("Student name : " + stud);

        /* find all the students and their roll no. */
        System.out.println("Students data :- ");
        Set<Map.Entry<Integer, String>> s = hm.entrySet();
        Iterator<Map.Entry<Integer, String>> it = s.iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, String> m = it.next();
            System.out.println(m.getKey() + " " + m.getValue());
        }

        hm.remove(3);     /* Remove the student with roll no. 3 */
        hm.put(2, "HJM"); /* Replace the student with roll no. 2 */

        System.out.println("Modified Students data :- ");
        it = s.iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, String> m = it.next();
            System.out.println(m.getKey() + " " + m.getValue());
        }
    }
}

EXERCISE:

1) Create operations like add, update and delete using a hash collection for a production table.

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 8 DATE: / /


TITLE: To Run a Grep Program on Hadoop to Understand the MapReduce Paradigm:
To Count Words in a Given File, View the Output File, and Calculate Execution Time.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of the MapReduce paradigm.
➢ Learn about MapReduce and execution time in Hadoop.


To run the Grep application in Hadoop, the following commands need to be executed in the terminal:

hduser@master:~$ bin/hadoop dfs -copyFromLocal /home/hduser/Desktop/test /hdfiles/test

The -copyFromLocal command copies the input data file test into HDFS.

hduser@master:~$ bin/hadoop jar hadoop*examples*.jar grep /hdfiles/test /hdfiles/output Vinci -Dmapred.job.queue.name=default

The above command runs the Grep example, which searches the input for the pattern "Vinci". Two directories are used in HDFS, an input directory and an output directory: /hdfiles/test is the input directory and /hdfiles/output is the output directory.

The option -Dmapred.job.queue.name=default submits the job to a particular queue; here the available queues are sqoop and default.

To get the job output and statistics, the command is:

$ hadoop job -history /hdfiles/output

Here, /hdfiles/output is the output directory in which the output file is generated. The data locality rate requires the number of data-local map tasks and the total number of map tasks launched; these values can be obtained from the terminal or the JobTracker interface.
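To measure the wall-clock execution time of the job from the shell, one simple approach (assuming a Linux terminal) is to prefix the job command with time, for example:

time hadoop jar hadoop*examples*.jar grep /hdfiles/test /hdfiles/output Vinci

The "real" value printed by time is the total execution time of the job.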

Another example of loading input data into HDFS and listing it:

hadoop fs -put /etc/hadoop/conf/*.xml input

[bcampbell@localhost ~]$ hadoop fs -ls input
Found 7 items
-rw-r--r-- 1 bcampbell supergroup 507105 2014-09-07 15:5 input/Milton_ParadiseLost.txt
-rw-r--r-- 1 bcampbell supergroup 246679 2014-09-07 15:55 input/WilliamYeats.txt

EXERCISE:

1) Find out the execution time on another dataset.

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 9 DATE: / /


TITLE: Installation of the Spark Framework with or without the Hadoop Framework.

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of the Spark framework.

The following steps show how to install Apache Spark.


Step 1: Verifying Java Installation
Java installation is one of the mandatory things in installing Spark. Try the
following command to verify the JAVA version.

$java -version
If Java is already installed on your system, you get to see the following response −

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, install Java before proceeding to the next step.
Step 2: Verifying Scala Installation
You need the Scala language to implement Spark. So let us verify the Scala installation using the following command.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to the next step for Scala installation.
Step 3: Downloading Scala
Download the latest version of Scala by visiting the following link: Download Scala.
For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find
the Scala tar file in the download folder.
Step 4: Installing Scala
Follow the below given steps for installing Scala.
Step 5: Extract the Scala tar file

Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Step 6: Move Scala software files
Use the following commands for moving the Scala software files to the respective
directory (/usr/local/scala).
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Step 7: Set PATH for Scala
Use the following command for setting PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
Step 8: Verifying Scala Installation
After installation, it is better to verify it. Use the following command for verifying
Scala installation.
$scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Step 9: Downloading Apache Spark
Download the latest version of Spark by visiting the following link Download
Spark. For this tutorial, we are using spark-1.3.1-bin-hadoop2.6 version. After
downloading it, you will find the Spark tar file in the download folder.
Step 10: Installing Spark
Follow the steps given below for installing Spark.
Step 11: Extracting Spark tar
Type the following command for extracting the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Step 12: Moving Spark software files
Use the following commands for moving the Spark software files to the respective
directory (/usr/local/spark).

$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Step 13: Setting up the environment for Spark
Add the following line to the ~/.bashrc file. This adds the location where the Spark
software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Step 14: Verifying the Spark Installation
Write the following command for opening the Spark shell.
$spark-shell
If Spark is installed successfully, then you will find the following output.

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication
disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify
permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on
port 43292.
Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:



EXPERIMENT NO: 10 DATE: / /


TITLE: To Study the Hive Commands Using HQL (DDL and DML).

OBJECTIVES: On completion of this experiment, the student will be able to…

➢ Know the concept of Hive commands.

What is HQL?

Hive defines a simple SQL-like query language for querying and managing large datasets, called Hive-QL (HQL). It is easy to use if you are familiar with SQL. Hive also allows programmers who are familiar with MapReduce to write custom MapReduce programs to perform more sophisticated analysis.

Hive Commands:

Data Definition Language (DDL )

DDL statements are used to build and modify the tables and other objects in the database.

Example :

CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE Statements.

Go to the Hive shell by giving the command sudo hive, and enter the
command ‘create database <database name>;’ to create a new database in Hive.

To list the databases in the Hive warehouse, enter the command ‘show databases;’.

The command to use a database is: USE <database name>;



Copy the input data to HDFS from the local file system by using the copyFromLocal command.

When we create a table in Hive, it is created in the default location of the Hive warehouse, “/user/hive/warehouse”. After creating the table, we can move the data from HDFS into the Hive table.

The following command creates a table within the location “/user/hive/warehouse/retail.db”.

Note: retail.db is the database created in the Hive warehouse.

Describe provides information about the schema of the table.



Data Manipulation Language (DML )

DML statements are used to retrieve, store, modify, delete, insert and update data in the
database.

Example :

LOAD, INSERT Statements.

Syntax :

LOAD data <LOCAL> inpath <file path> into table [tablename]

The LOAD operation is used to move data into the corresponding Hive table. If the keyword LOCAL is specified, the load command takes a local file system path; if the keyword LOCAL is not specified, we have to use the HDFS path of the file.

Here are some examples for the LOAD data LOCAL command.

After loading the data into the Hive table, we can apply the Data Manipulation Statements or aggregate functions to retrieve the data.

Example to count the number of records:

The COUNT aggregate function is used to count the total number of records in a table.
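For illustration, a typical sequence of the DDL and DML statements described above might look as follows (the database, table and column names are hypothetical, and the fields are assumed to be comma-separated):

create database retail;
use retail;
create table txnrecords(txnno INT, txndate STRING, custno INT, amount DOUBLE,
  category STRING, product STRING, city STRING, state STRING, spendby STRING)
  row format delimited fields terminated by ',' stored as textfile;
describe txnrecords;
load data local inpath '/home/hduser/txns.txt' into table txnrecords;
select count(*) from txnrecords;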

‘create external’ Table:

The CREATE EXTERNAL keyword is used to create a table and provide a location where the table will be created, so that Hive does not use its default location for this table. An EXTERNAL table points to any HDFS location for its storage, rather than the default storage.

Insert Command:

The INSERT command is used to load data into a Hive table. Inserts can be done to a table or a partition.

• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.

• INSERT INTO is used to append data to the existing data in a table. (Note: the INSERT INTO syntax works from version 0.8.)

The ‘Drop Table’ statement deletes the data and metadata for a table. In the case of external tables, only the metadata is deleted.

Load data local inpath ‘aru.txt’ into table tablename, and then check the employee1 table by using the Select * from tablename command.

To count the number of records in a table, use Select count(*) from txnrecords;

Aggregation:

Select count(DISTINCT category) from tablename;

Suppose there is another table cate where f1 is the field name of the category. This command will count the different categories of the ‘cate’ table; here there are 3 different categories.

Grouping:

The GROUP BY command is used to group the result set by one or more columns.

Select category, sum(amount) from txnrecords group by category;

It calculates the total amount for each category.



The result of one table can be stored into another table.

Create table newtablename as select * from oldtablename;

Join Command:

Here one more table is created with the name ‘mailid’.

Join Operation:

A join operation is performed to combine fields from two tables by using values common to each.

Left Outer Join:

The result of a left outer join (or simply left join) for tables A and B always contains all
records of the “left” table (A), even if the join-condition does not find any matching record in
the “right” table (B).

Right Outer Join:

A right outer join (or right join) closely resembles a left outer join, except with the treatment
of the tables reversed. Every row from the “right” table (B) will appear in the joined table at
least once.

Full Join:

The joined table will contain all records from both tables, and fill in NULLs for missing
matches on either side.
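For illustration, the HiveQL syntax for these join variants might look as follows (the table and column names are hypothetical):

select c.custno, c.name, m.mail from customers c join mailid m on (c.custno = m.custno);
select c.custno, c.name, m.mail from customers c left outer join mailid m on (c.custno = m.custno);
select c.custno, c.name, m.mail from customers c right outer join mailid m on (c.custno = m.custno);
select c.custno, c.name, m.mail from customers c full outer join mailid m on (c.custno = m.custno);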

Once done with Hive, we can use the quit command to exit from the Hive shell.

EVALUATION:

Understanding / Problem solving (4) | Involvement (3) | Timely Completion (3) | Total (10)

Signature with date:
