
G PULLAIAH COLLEGE OF ENGINEERING AND

TECHNOLOGY
PASUPULA(V), NANDIKOTKUR ROAD, KURNOOL - 518452

INTRODUCTION TO BIG DATA


LECTURE NOTES FOR III CSE I SEMESTER

SUBJECT CODE: 15A05506

PREPARED BY

A DAVID DONALD
ASST. PROFESSOR
DEPARTMENT OF CSE

TEXT BOOKS:

1. Java in a Nutshell, 4th Edition.


2. Hadoop: The Definitive Guide by Tom White, 3rd Edition, O'Reilly.

UNIT-I
DISTRIBUTED PROGRAMMING USING JAVA: QUICK RECAP AND
ADVANCED JAVA PROGRAMMING:

Generics in Java

Java Generics was introduced in J2SE 5 to provide type-safe objects. Before generics, a collection could store any type of object, i.e. it was non-generic. With generics, the Java programmer is forced to store only a specific type of object in a collection.
Java Generic methods and generic classes enable programmers to specify, with a single
method declaration, a set of related methods, or with a single class declaration, a set of related
types, respectively.
Why Use Generics?
In a nutshell, generics enable types (classes and interfaces) to be parameters when
defining classes, interfaces and methods. Much like the more familiar formal parameters used
in method declarations, type parameters provide a way for you to
re-use the same code with different inputs. The difference is that the inputs to formal
parameters are values, while the inputs to type parameters are types.
Advantages of Java Generics:
There are many advantages of generics. They are as follows:
1) Type-safety: We can hold only a single type of object in a generic collection. It does not allow other types of objects to be stored.
2) Type casting is not required (individual type casting is not needed): There is no need to typecast the object.
Before generics, we needed to type cast. Programs that use generics have many benefits over non-generic code. If we do not use generics, then, as in the example below, every time we retrieve data from an ArrayList we have to typecast it. Typecasting at every retrieval operation is a big headache. If we already know that our list only holds String data, then we do not need to typecast it every time.


Before Generics, we had to use typecasting:


List list = new ArrayList( );
list.add("hello");
String s = (String) list.get(0); // typecasting

After Generics, we don't need to typecast the object


List<String> list = new ArrayList<String>( );
list.add("hello");
String s = list.get(0);

3) Compile-Time Checking: It is checked at compile time, so problems will not occur at runtime. Good programming strategy says it is far better to handle a problem at compile time than at runtime.
Generics make errors appear at compile time rather than at run time (it is always better to know about problems in your code at compile time than to have your code fail at run time).
List<String> list = new ArrayList<String>( );
list.add("hello");
list.add(32); // Compile Time Error

4) Code Reuse: We can write a method/class/interface once and use it for any type we want.

GENERIC METHODS:
Like a generic class, we can create a generic method that can accept any type of argument. Using the Java generics concept, we might write a generic method for sorting an array of objects, then invoke the generic method with Integer arrays, Double arrays, String arrays and so on, to sort the array elements.
You can write a single generic method declaration that can be called with arguments of
different types. Based on the types of the arguments passed to the generic method, the
compiler handles each method call appropriately. Following are the rules to define Generic
Methods −
· All generic method declarations have a type parameter section delimited by angle
brackets (< and >) that precedes the method's return type ( < E > in the next example).
· Each type parameter section contains one or more type parameters separated by
commas. A type parameter, also known as a type variable, is an identifier that specifies a
generic type name.
· The type parameters can be used to declare the return type and act as
placeholders for the types of the arguments passed to the generic method, which are known as
actual type arguments.
· A generic method's body is declared like that of any other method. Note that type
parameters can represent only reference types, not primitive types (like int, double and char).
Example:
The following example illustrates how we can print arrays of different types using a single generic method.
LIST:
A List (or sequence) is an abstract data type that represents a countable number of ordered values, where the same value may occur more than once.
public class GenericMethodTest {
    // generic method printArray
    public static <E> void printArray(E[] inputArray) {
        // Display array elements
        for (E element : inputArray) {
            System.out.printf("%s ", element);
        }
        System.out.println();
    }

    public static void main(String[] args) {
        // Create arrays of Integer, Double and Character
        Integer[] intArray = { 1, 2, 3, 4, 5 };
        Double[] doubleArray = { 1.1, 2.2, 3.3, 4.4 };
        Character[] charArray = { 'H', 'E', 'L', 'L', 'O' };

        System.out.println("Array integerArray contains:");
        printArray(intArray);      // pass an Integer array

        System.out.println("\nArray doubleArray contains:");
        printArray(doubleArray);   // pass a Double array

        System.out.println("\nArray characterArray contains:");
        printArray(charArray);     // pass a Character array
    }
}
This will produce the following result:
Output
Array integerArray contains:
1 2 3 4 5

Array doubleArray contains:
1.1 2.2 3.3 4.4

Array characterArray contains:
H E L L O


Type Parameters
The type parameter naming conventions are important for learning generics thoroughly. The commonly used type parameters are as follows:
1. T - Type
2. E - Element
3. K - Key
4. N - Number
5. V - Value
Let's see a simple example of a Java generic method to print array elements. We are using E here to denote the element type.
public class TestGenerics4 {
    public static <E> void printArray(E[] elements) {
        for (E element : elements) {
            System.out.println(element);
        }
        System.out.println();
    }

    public static void main(String[] args) {
        Integer[] intArray = { 10, 20, 30, 40, 50 };
        Character[] charArray = { 'J', 'A', 'V', 'A', 'T', 'P', 'O', 'I', 'N', 'T' };

        System.out.println("Printing Integer Array");
        printArray(intArray);

        System.out.println("Printing Character Array");
        printArray(charArray);
    }
}

Output:
Printing Integer Array
10
20
30
40
50
Printing Character Array
J
A
V
A
T
P
O
I
N
T
Generic class:
A class that can refer to any type is known as a generic class. Here, we are using the T type parameter to create a generic class of a specific type.
Let's see a simple example of creating and using a generic class.
Creating generic class:
class MyGen<T> {
    T obj;
    void add(T obj) { this.obj = obj; }
    T get() {
        return obj;
    }
}
The T type indicates that it can refer to any type (like String, Integer, Employee, etc.). The type you specify for the class will be used to store and retrieve the data.
Using generic class:
Let’s see the code to use the generic class.
class TestGenerics3 {
    public static void main(String[] args) {
        MyGen<Integer> m = new MyGen<Integer>();
        m.add(2);
        System.out.println(m.get());
    }
}
Output:
2
Full Example of Generics in Java using ArrayList class
Here, we are using the ArrayList class, but you can use any generic collection class such as ArrayList, LinkedList, HashSet, TreeSet, or HashMap, or a generic interface such as Comparator.

import java.util.*;
class TestGenerics1 {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<String>();
        list.add("rahul");
        list.add("jai");
        String s = list.get(1);   // type casting is not required
        System.out.println("element is: " + s);
        Iterator<String> itr = list.iterator();
        while (itr.hasNext()) {
            System.out.println(itr.next());
        }
    }
}
Output:
element is: jai
rahul
jai

Example of Java Generics using Map class


· A Map (also called an associative array, symbol table, or dictionary) is an abstract data type composed of a collection of <key, value> pairs, such that each possible key appears at most once in the collection.
· A Set is an abstract data type that can store certain values, without any particular order, and with no repeated values. An abstract data structure is a collection, or aggregate, of data.
· In computer science, an abstract data type (ADT) is a mathematical model for data types.
· Now we are going to access map elements using generics. Here, we need to pass a key and a value. Let us understand it with a simple example:
import java.util.*;
class TestGenerics2 {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<Integer, String>();
        map.put(1, "vijay");
        map.put(4, "umesh");
        map.put(2, "ankit");
        // Now use Map.Entry for Set and Iterator
        Set<Map.Entry<Integer, String>> set = map.entrySet();
        Iterator<Map.Entry<Integer, String>> itr = set.iterator();
        while (itr.hasNext()) {
            Map.Entry<Integer, String> e = itr.next();   // no need to typecast
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}

Output:
1 vijay
2 ankit
4 umesh

THREADS
Introduction:
The java.lang.Thread class represents a thread of execution in a program. The Java Virtual Machine allows an application to have multiple threads of execution running concurrently. Following are the important points about Thread:
· Every thread has a priority. Threads with higher priority are executed in preference to threads with lower priority.
· Each thread may or may not also be marked as a daemon.
· There are two ways to create a new thread of execution. One is to declare a class to be a subclass of Thread; the other is to declare a class that implements the Runnable interface.
Class declaration:
Following is the declaration for java.lang.Thread class:

public class Thread


extends Object
implements Runnable
Following are some of the fields of the java.lang.Thread class (see the short sketch below):
· static int MAX_PRIORITY -- This is the maximum priority that a thread can have.
· static int MIN_PRIORITY -- This is the minimum priority that a thread can have.
· static int NORM_PRIORITY -- This is the default priority that is assigned to a thread.
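
As a quick illustration of these constants, the following is a minimal sketch (the class name PriorityDemo is illustrative, not from the notes) of setting a thread's priority and marking it as a daemon before starting it:

public class PriorityDemo extends Thread {
    public void run() {
        System.out.println("Running with priority " + getPriority());
    }
    public static void main(String[] args) throws InterruptedException {
        PriorityDemo t = new PriorityDemo();
        t.setPriority(Thread.MAX_PRIORITY);   // highest allowed priority (10)
        t.setDaemon(true);                    // the JVM does not wait for daemon threads to finish
        t.start();
        t.join();                             // wait here so the daemon thread's output is seen
    }
}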
Life Cycle of Thread
A thread can be in any one of the following five states:


1. Newborn State: When a thread object is created, a new thread is born and is said to be in the Newborn state.
2. Runnable State: If a thread is in this state, it means that the thread is ready for execution and is waiting for the availability of the processor. If all threads in the queue are of the same priority, then they are given time slots for execution in a round-robin fashion.
3. Running State: It means that the processor has given its time to the thread for execution. A thread keeps running until one of the following conditions occurs:
a. The thread gives up its control on its own, which can happen in the following situations:
i. A thread gets suspended using the suspend( ) method, which can only be revived with the resume( ) method.
ii. A thread is made to sleep for a specified period of time using the sleep(time) method, where time is in milliseconds.
iii. A thread is made to wait for some event to occur using the wait( ) method. In this case the thread can be scheduled to run again using the notify( ) method.
b. The thread is pre-empted by a higher-priority thread.
4. Blocked State: If a thread is prevented from entering the runnable state, and subsequently the running state, then the thread is said to be in the Blocked state.
5. Dead State: A runnable thread enters the Dead or terminated state when it completes its task or otherwise terminates.


Fig: Life Cycle of Thread

Main Thread
Every time a Java program starts up, one thread begins running immediately; it is called the main thread of the program because it is the one that is executed when your program begins.
· Child threads are produced from the main thread.
· Often it is the last thread to finish execution, as it performs various shutdown operations.
Creating a Thread
Java defines two ways in which this can be accomplished:
· You can implement the Runnable interface.
· You can extend the Thread class, itself.
Create Thread by Implementing Runnable
The easiest way to create a thread is to create a class that implements the Runnable interface.
To implement Runnable, a class need only implement a single method called run( ), which is
declared like this:
public void run( )
You will define the code that constitutes the new thread inside the run( ) method. It is important to understand that run( ) can call other methods, use other classes, and declare variables, just like the main thread can.


After you create a class that implements Runnable, you will instantiate an object of type Thread
from within that class. Thread defines several constructors. The one that we will use is shown
here:
Thread(Runnable threadOb, String threadName);
Here threadOb is an instance of a class that implements the Runnable interface and the name
of the new thread is specified by threadName. After the new thread is created, it will not start
running until you call its start( ) method, which is declared within Thread. The start( ) method is
shown here:
void start( );
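
Putting these pieces together, the following is a minimal sketch (class and thread names are illustrative) of creating and starting a thread by implementing Runnable:

// A class that implements Runnable supplies the thread's code in run( ).
class MyRunnable implements Runnable {
    public void run() {
        for (int i = 1; i <= 3; i++) {
            System.out.println(Thread.currentThread().getName() + ": " + i);
        }
    }
}

public class RunnableDemo {
    public static void main(String[] args) {
        // Thread(Runnable threadOb, String threadName), as described above
        Thread t = new Thread(new MyRunnable(), "child-thread");
        t.start();   // the new thread does not run until start( ) is called
        System.out.println("main thread continues...");
    }
}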

Class constructors:

S.N.  Constructor & Description

1. Thread( ) -- Allocates a new Thread object.

2. Thread(Runnable target) -- Allocates a new Thread object.

3. Thread(Runnable target, String name) -- Allocates a new Thread object.

4. Thread(String name) -- Allocates a new Thread object.

5. Thread(ThreadGroup group, Runnable target) -- Allocates a new Thread object.

6. Thread(ThreadGroup group, Runnable target, String name) -- Allocates a new Thread object so that it has target as its run object, has the specified name as its name, and belongs to the thread group referred to by group.

7. Thread(ThreadGroup group, Runnable target, String name, long stackSize) -- Allocates a new Thread object so that it has target as its run object, has the specified name as its name, belongs to the thread group referred to by group, and has the specified stack size.

8. Thread(ThreadGroup group, String name) -- Allocates a new Thread object.

SOCKETS
What Is a Socket?
Normally, a server runs on a specific computer and has a socket that is bound to a specific port
number. The server just waits, listening to the socket for a client to make a connection request.
On the client-side: The client knows the hostname of the machine on which the server is
running and the port number on which the server is listening. To make a connection request,
the client tries to rendezvous with the server on the server's machine and port. The client also
needs to identify itself to the server so it binds to a local port number that it will use during this
connection. This is usually assigned by the system.
If everything goes well, the server accepts the connection. Upon acceptance, the server gets a
new socket bound to the same local port and also has its remote endpoint set to the address
and port of the client. It needs a new socket so that it can continue to listen to the original
socket for connection requests while tending to the needs of the connected client.
On the client side, if the connection is accepted, a socket is successfully created and the client
can use the socket to communicate with the server.
The client and server can now communicate by writing to or reading from their sockets.


Definition:
A socket is one endpoint of a two-way communication link between two programs running on
the network. A socket is bound to a port number so that the TCP layer can identify the
application that data is destined to be sent to.

 An endpoint is a combination of an IP address and a port number. Every TCP connection


can be uniquely identified by its two endpoints. That way you can have multiple
connections between your host and the server.
 The java.net package in the Java platform provides a class, Socket, that implements one
side of a two-way connection between your Java program and another program on the
network. The Socket class sits on top of a platform-dependent implementation, hiding the
details of any particular system from your Java program. By using the java.net.Socket class
instead of relying on native code, your Java programs can communicate over the network
in a platform-independent fashion.
 Additionally, java.net includes the ServerSocket class, which implements a socket that
servers can use to listen for and accept connections to clients. This lesson shows you how
to use the Socket and ServerSocket classes.
 If you are trying to connect to the Web, the URL class and related classes
(URLConnection, URLEncoder) are probably more appropriate than the socket classes. In
fact, URLs are a relatively high-level connection to the Web and use sockets as part of the
underlying implementation. See Working with URLs for information about connecting to
the Web via URLs.
 Java socket programming is used for communication between applications running on different JREs.
 Java socket programming can be connection-oriented or connection-less.
 The Socket and ServerSocket classes are used for connection-oriented socket programming, and the DatagramSocket and DatagramPacket classes are used for connection-less socket programming (a minimal connection-less sketch follows below).
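
The notes do not include a connection-less example, so the following is a minimal sketch (the port 9876 and the class names are illustrative assumptions) of sending and receiving one UDP datagram with DatagramSocket and DatagramPacket:

import java.net.*;

// Receiver: bind to a port and wait for one datagram.
class UDPReceiver {
    public static void main(String[] args) throws Exception {
        DatagramSocket socket = new DatagramSocket(9876);
        byte[] buffer = new byte[1024];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        socket.receive(packet);   // blocks until a datagram arrives
        System.out.println("received: " + new String(packet.getData(), 0, packet.getLength()));
        socket.close();
    }
}

// Sender: no connection is established; each datagram carries the destination address and port.
class UDPSender {
    public static void main(String[] args) throws Exception {
        DatagramSocket socket = new DatagramSocket();
        byte[] data = "Hello UDP".getBytes();
        DatagramPacket packet =
            new DatagramPacket(data, data.length, InetAddress.getByName("localhost"), 9876);
        socket.send(packet);
        socket.close();
    }
}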
The client in socket programming must know two pieces of information:


1. IP Address of Server, and


2. Port number.
Reading from and Writing to a Socket
 Let's look at a simple example that illustrates how a program can establish a connection to
a server program using the Socket class and then, how the client can send data to and
receive data from the server through the socket.
 The example program implements a client, EchoClient, that connects to an echo server. The
echo server receives data from its client and echoes it back. The
example EchoServer implements an echo server. (Alternatively, the client can connect to
any host that supports the Echo Protocol.)
 The EchoClient example creates a socket, thereby getting a connection to the echo server.
It reads input from the user on the standard input stream, and then forwards that text to
the echo server by writing the text to the socket. The server echoes the input back through
the socket to the client. The client program reads and displays the data passed back to it
from the server.
 Note that the EchoClient example both writes to and reads from its socket, thereby sending
data to and receiving data from the echo server.
 Let's walk through the program and investigate the interesting parts. The following
statements in the try-with-resources statement in the EchoClient example are critical.
These lines establish the socket connection between the client and the server and open a
PrintWriter and a BufferedReader on the socket:
String hostName = args[0];
int portNumber = Integer.parseInt(args[1]);

try (
    Socket echoSocket = new Socket(hostName, portNumber);
    PrintWriter out =
        new PrintWriter(echoSocket.getOutputStream(), true);
    BufferedReader in = new BufferedReader(
        new InputStreamReader(echoSocket.getInputStream()));
    BufferedReader stdIn = new BufferedReader(
        new InputStreamReader(System.in))
)
 The first statement in the try-with-resources statement creates a new Socket object and names it echoSocket. The Socket constructor used here requires the name of the computer and the port number to which you want to connect.
 The example program uses the first command-line argument as the name of the computer (the host name) and the second command-line argument as the port number.
 When you run this program on your computer, make sure that the host name you use is the fully qualified IP name of the computer to which you want to connect.
For example, if your echo server is running on the computer echoserver.example.com and it is listening on port number 7, first run the following command from the computer echoserver.example.com if you want to use the EchoServer example as your echo server: java EchoServer 7
Afterward, run the EchoClient example with the following command:
java EchoClient echoserver.example.com 7
 The second statement in the try-with-resources statement gets the socket's output stream and opens a PrintWriter on it. Similarly, the third statement gets the socket's input stream and opens a BufferedReader on it. The example uses readers and writers so that it can write Unicode characters over the socket.
 To send data through the socket to the server, the EchoClient example needs to write to the PrintWriter. To get the server's response, EchoClient reads from the BufferedReader object in. To read the user's input, it uses the BufferedReader object stdIn, which is created in the fourth statement in the try-with-resources statement. If you are not yet familiar with the Java platform's I/O classes, you may wish to read Basic I/O.
 The next interesting part of the program is the while loop. The loop reads a line at a time
from the standard input stream and immediately sends it to the server by writing it to
the PrintWriter connected to the socket:


String userInput;
while ((userInput = stdIn.readLine( )) != null) {
out.println(userInput);
System.out.println("echo: " + in.readLine( ));
}

 The last statement in the while loop reads a line of information from the BufferedReader connected to the socket. The readLine method waits until the server echoes the information back to EchoClient. When readLine returns, EchoClient prints the information to the standard output.
 The while loop continues until the user types an end-of-input character. That is, the EchoClient example reads input from the user, sends it to the Echo server, gets a response from the server, and displays it, until it reaches the end-of-input. (You can type an end-of-input character by pressing Ctrl-C.)
 The while loop then terminates, and the Java runtime automatically closes the readers and
writers connected to the socket and to the standard input stream, and it closes the socket
connection to the server.
 The Java runtime closes these resources automatically because they were created in the try-with-resources statement. The Java runtime closes these resources in the reverse of the order in which they were created. (This is good because streams connected to a socket should be closed before the socket itself is closed.)
 This client program is straightforward and simple because the echo server implements a
simple protocol. The client sends text to the server, and the server echoes it back. When
your client programs are talking to a more complicated server such as an HTTP server, your
client program will also be more complicated. However, the basics are much the same as
they are in this program:
1. Open a socket.
2. Open an input stream and output stream to the socket.
3. Read from and write to the stream according to the server's protocol.


4. Close the streams.


5. Close the socket.
Only step 3 differs from client to client, depending on the server. The other steps remain largely
the same.
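
The notes refer to an EchoServer but do not list its code. The following is a minimal sketch (assuming the single-client echo protocol described above, with the port number taken from the command line) of what such a server might look like:

import java.io.*;
import java.net.*;

public class EchoServer {
    public static void main(String[] args) throws IOException {
        int portNumber = Integer.parseInt(args[0]);
        try (
            ServerSocket serverSocket = new ServerSocket(portNumber);
            Socket clientSocket = serverSocket.accept();   // wait for one client to connect
            PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(clientSocket.getInputStream()))
        ) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                out.println(inputLine);   // echo each line straight back to the client
            }
        }
    }
}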

Socket class :
A socket is simply an endpoint for communications between the machines. The Socket class can
be used to create a socket.
Important methods

Method -- Description

1) public InputStream getInputStream( ) -- returns the InputStream attached to this socket.

2) public OutputStream getOutputStream( ) -- returns the OutputStream attached to this socket.

3) public synchronized void close( ) -- closes this socket.

ServerSocket class:
The ServerSocket class can be used to create a server socket. This object is used to establish
communication with the clients.
Important methods

Method -- Description

1) public Socket accept( ) -- returns a Socket and establishes a connection between server and client.

2) public synchronized void close( ) -- closes the server socket.

Simple client-server programming using java:


Let's see a simple example of Java socket programming in which the client sends a text and the server receives it.
File: MyServer.java


import java.io.*;
import java.net.*;
public class MyServer {
    public static void main(String[] args) {
        try {
            ServerSocket ss = new ServerSocket(6666);
            Socket s = ss.accept();   // establishes connection
            DataInputStream dis = new DataInputStream(s.getInputStream());
            String str = (String) dis.readUTF();
            System.out.println("message= " + str);
            ss.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

File: MyClient.java
import java.io.*;
import java.net.*;
public class MyClient {
    public static void main(String[] args) {
        try {
            Socket s = new Socket("localhost", 6666);
            DataOutputStream dout = new DataOutputStream(s.getOutputStream());
            dout.writeUTF("Hello Server");
            dout.flush();
            dout.close();
            s.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

DIFFICULTIES IN DEVELOPING DISTRIBUTED PROGRAMS FOR LARGE SCALE CLUSTERS


The World Wide Web is used by millions of people every day for various purposes, including email, reading news, downloading music, online shopping, or simply accessing information about anything.
Using a standard web browser, the user can access information stored on Web servers situated anywhere on the globe.
Generally, a distributed system is defined as "a system in which hardware or software components located at networked computers communicate and coordinate their actions only by message passing".
Tanenbaum defines distributed systems as "a collection of independent computers that appear to the users of the system as a single computer". Leslie Lamport, a famous researcher on timing, message ordering, and clock synchronization in distributed systems, once said that "A distributed system is one on which I cannot get any work done because some machine I have never heard of has crashed", reflecting on the huge number of challenges faced by distributed system designers.
Despite these challenges, the benefits of distributed systems and applications are many, making them worthwhile to pursue. Various types of distributed systems and applications have been developed and are being used extensively in the real world.
Various kinds of distributed systems operate today, each aimed at solving different kinds of problems. The challenges faced in building a distributed system vary depending on the requirements of the system. In general, however, most systems will need to handle the following issues:
· Heterogeneity:
Various entities in the system must be able to interoperate with one another, despite differences in hardware architectures, operating systems, communication protocols, programming languages, software interfaces, security models, and data formats.
· Transparency:
The entire system should appear as a single unit, and the complexity and interactions between the components should typically be hidden from the end user.
· Fault tolerance and failure management:
Failure of one or more components should not bring down the entire system, and failures should be isolated.
· Scalability:
The system should work efficiently with an increasing number of users, and the addition of a resource should enhance the performance of the system.
· Concurrency:
Shared access to resources should be made possible.
· Openness and Extensibility:
Interfaces should be cleanly separated and publicly available to enable easy extensions to existing components and the addition of new components.
· Migration and load balancing:
Allow the movement of tasks within a system without affecting the operation of users or applications, and distribute load among the available resources to improve performance.
· Security:
Access to resources should be secured to ensure that only known users are able to perform allowed operations.
INTRODUCTION TO CLOUD COMPUTING
An Overview
 Cloud computing is a computing paradigm in which a large pool of systems is connected in private or public networks to provide dynamically scalable infrastructure for application, data and file storage. With the advent of this technology, the cost of computation, application hosting, content storage and delivery is reduced significantly.
 Cloud computing is a practical approach to experience direct cost benefits, and it has the potential to transform a data center from a capital-intensive setup to a variable-priced environment.
 The idea of cloud computing is based on a very fundamental principle of 'reusability of IT capabilities'. The difference that cloud computing brings compared to traditional concepts of "grid computing", "distributed computing", "utility computing", or "autonomic computing" is to broaden horizons across organizational boundaries.
Cloud computing is defined as
"A pool of abstracted, highly scalable, and managed compute infrastructure capable of hosting end-customer applications and billed by consumption."

Figure 1: Conceptual view of cloud computing


Cloud Computing Models
Cloud providers offer services that can be grouped into three categories.
1. Software as a Service (SaaS): In this model, a complete application is offered to the customer as a service on demand. A single instance of the service runs on the cloud and multiple end users are serviced. On the customer's side, there is no need for upfront investment in servers or software licenses, while for the provider the costs are lowered, since only a single application needs to be hosted and maintained. Today SaaS is offered by companies such as Google, Salesforce, Microsoft, Zoho, etc.
2. Platform as a Service (PaaS): Here, a layer of software, or a development environment, is encapsulated and offered as a service, upon which other higher levels of service can be built. The customer has the freedom to build his own applications, which run on the provider's infrastructure. To meet manageability and scalability requirements of the applications, PaaS providers offer a predefined combination of OS and application servers, such as the LAMP platform (Linux, Apache, MySQL and PHP), restricted J2EE, Ruby, etc. Google's App Engine, Force.com, etc. are some of the popular PaaS examples.
3. Infrastructure as a Service (IaaS): IaaS provides basic storage and computing capabilities as standardized services over the network. Servers, storage systems, networking equipment, data centre space, etc. are pooled and made available to handle workloads. The customer would typically deploy his own software on the infrastructure. Some common examples are Amazon, GoGrid, 3Tera, etc.

Figure 2: Cloud
Understanding Public and Private Clouds


Enterprises can choose to deploy applications on Public, Private or Hybrid clouds. Cloud integrators can play a vital part in determining the right cloud path for each organization.
1. Public Cloud:
Public clouds are owned and operated by third parties; they deliver superior economies of scale to customers, as the infrastructure costs are spread among a mix of users, giving each individual client an attractive low-cost, "pay-as-you-go" model. All customers share the same infrastructure pool with limited configuration, security protections, and availability variances. These are managed and supported by the cloud provider. One of the advantages of a public cloud is that it may be larger than an enterprise's cloud, thus providing the ability to scale seamlessly, on demand.

2. Private Cloud:
Private clouds are built exclusively for a single enterprise. They aim to address concerns on data security and offer greater control, which is typically lacking in a public cloud. There are two variations of a private cloud:
 On-premise Private Cloud: On-premise private clouds, also known as internal clouds, are hosted within one's own data center. This model provides a more standardized process and protection, but is limited in aspects of size and scalability. IT departments would also need to incur the capital and operational costs for the physical resources. This is best suited for applications which require complete control and configurability of the infrastructure and security.
 Externally hosted Private Cloud: This type of private cloud is hosted externally with a cloud provider, where the provider facilitates an exclusive cloud environment with a full guarantee of privacy. This is best suited for enterprises that don't prefer a public cloud due to the sharing of physical resources.
3. Hybrid Cloud:
Hybrid clouds combine both public and private cloud models. With a hybrid cloud, service providers can utilize third-party cloud providers in a full or partial manner, thus increasing the flexibility of computing. The hybrid cloud environment is capable of providing on-demand, externally provisioned scale. The ability to augment a private cloud with the resources of a public cloud can be used to manage any unexpected surges in workload.
Cloud Computing Benefits:
Enterprises would need to align their applications so as to exploit the architecture models that cloud computing offers. Some of the typical benefits are listed below:
1. Reduced Cost
There are a number of reasons to attribute cloud technology with lower costs. The billing model is pay-as-per-usage; the infrastructure is not purchased, thus lowering maintenance. Initial and recurring expenses are much lower than with traditional computing.

2. Increased Storage
With the massive infrastructure that is offered by cloud providers today, storage and maintenance of large volumes of data is a reality. Sudden workload spikes are also managed effectively and efficiently, since the cloud can scale dynamically.
3. Flexibility
This is an extremely important characteristic. With enterprises having to adapt, even more rapidly, to changing business conditions, speed to deliver is critical. Cloud computing stresses getting applications to market very quickly, by using the most appropriate building blocks necessary for deployment.
Cloud Computing Challenges
Despite its growing influence, concerns regarding cloud computing still remain. In our opinion, the benefits outweigh the drawbacks and the model is worth exploring. Some common challenges are:
1. Data Protection
Data security is a crucial element that warrants scrutiny. Enterprises are reluctant to buy an assurance of business data security from vendors. They fear losing data to the competition and losing the data confidentiality of consumers. In many instances, the actual storage location is not disclosed, adding to the security concerns of enterprises. In the existing models, firewalls across data centers (owned by enterprises) protect this sensitive information. In the cloud model, service providers are responsible for maintaining data security and enterprises would have to rely on them.
2. Data Recovery and Availability
All business applications have service level agreements that are stringently followed. Operational teams play a key role in the management of service level agreements and the runtime governance of applications. In production environments, operational teams support:
· Appropriate clustering and failover
· Data replication
· System monitoring (transaction monitoring, log monitoring and others)
· Maintenance (runtime governance)
· Disaster recovery
· Capacity and performance management
If any of the above-mentioned services is under-served by a cloud provider, the damage and impact could be severe.
3. Management Capabilities
Despite there being multiple cloud providers, the management of platform and infrastructure is still in its infancy. Features like 'auto-scaling', for example, are a crucial requirement for many enterprises. There is huge potential to improve on the scalability and load balancing features provided today.
4. Regulatory and Compliance Restrictions
In some of the European countries, government regulations do not allow customers' personal information and other sensitive information to be physically located outside the state or country. In order to meet such requirements, cloud providers need to set up a data center or a storage site exclusively within the country to comply with regulations. Having such an infrastructure may not always be feasible and is a big challenge for cloud providers.


UNIT-II
DISTRIBUTED FILE SYSTEMS LEADING TO HADOOP
FILE SYSTEM
Big Data :

'Big Data' is also data, but with a huge size.

'Big Data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.

Data which is very large in size is called Big Data.

Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data.

It is stated that almost 90% of today's data has been generated in the past 3 years.

In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.

Examples Of 'Big Data'

Following are some examples of 'Big Data':

The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media Impact

Statistics show that 500+ terabytes of new data get ingested into the databases of social media sites such as Facebook, Google and LinkedIn every day. This data is mainly generated in terms of photo and video uploads, message exchanges, posting comments, etc.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the generation of data reaches many petabytes.

E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.

Weather stations: All the weather stations and satellites give very huge data which are stored and manipulated to forecast the weather.

Telecom companies: Telecom giants like Airtel and Vodafone study user trends and accordingly publish their plans, and for this they store the data of their millions of users.

Categories Of 'Big Data'


'Big Data' can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data.

Do you know? 10^21 bytes equal 1 zettabyte, i.e. one billion terabytes form a zettabyte.

Data stored in a relational database management system is one example of 'structured' data.

Examples Of Structured Data

An 'Employee' table in a database is an example of Structured Data

Employee_ID Employee_Name Gender Department Salary_In_lacs

2365 Rajesh Kulkarni Male Finance 650000

3398 Pratibha Joshi Female Admin 650000

7465 Shushil Roy Male Admin 500000

7500 Shubhojit Das Male Finance 500000

7699 Priya Sane Female Finance 550000

Unstructured

Any data with an unknown form or structure is classified as unstructured data.


A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.

Examples Of Un-structured Data

Output returned by 'Google Search' .

Semi-structured

Semi-structured data can contain both forms of data.

An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file:

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>


Characteristics of big data:

Volume
Variety
Velocity

Volume

'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'. The size of data plays a very crucial role in determining the value of data.

Variety

Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured.

In earlier days, spreadsheets and databases were the only sources of data considered by most applications.

Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications.

Velocity

The term 'velocity' refers to the speed of generation of data.

Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks and social media sites, sensors, Mobile devices, etc.
The flow of data is massive and continuous.

Benefits of Big Data Processing

Ability to process 'Big Data' brings in multiple benefits, such as-

• Businesses can utilize outside intelligence while taking decisions.

• Improved customer service.


• Early identification of risk to the product/services, if any.

• Better operational efficiency.

Issues

A huge amount of unstructured data needs to be stored, processed and analyzed.

Solution:

Apache Hadoop is the most important framework for working with Big Data. The biggest
strength of Hadoop is scalability.

Background of Hadoop

• With an increase in the penetration and usage of the internet, the data captured by Google increased exponentially year on year.
• Just to give you an estimate of this number, in 2007 Google collected on average 270 PB of data every month.
• The same number increased to 20,000 PB every day in 2009.
• Obviously, Google needed a better platform to process such enormous data.
• Google implemented a programming model called MapReduce, which could process this 20,000 PB per day. Google ran these MapReduce operations on a special file system called the Google File System (GFS). Sadly, GFS is not open source.

Doug Cutting and Yahoo! reverse engineered the GFS model and built a parallel Hadoop Distributed File System (HDFS).

The software or framework that supports HDFS and MapReduce is known as Hadoop.

What is Hadoop

Hadoop is an open source framework from Apache software foundation.

Hadoop is written in Java and is not OLAP (online analytical processing).


It is used for batch/offline processing.

It is used to store, process and analyze data which is very huge in volume.

It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.

Why to use Hadoop

Apache Hadoop is not only a storage system but is a platform for data storage as well as
processing.

It is scalable (as we can add more nodes on the fly).

It is fault tolerant (even if nodes go down, the data is processed by another node).

It efficiently processes large volumes of data on a cluster of commodity hardware.

Hadoop is for processing of huge volume of data.

Commodity hardware is low-end hardware; these are cheap devices which are very economical. Hence, Hadoop is very economical.

The idea of Apache Hadoop was actually born out of a Google project called the
MapReduce, which is a framework for breaking down an application into smaller chunks
that can then be parsed on a much smaller and granular level. Each of the smaller blocks is
individually operated on nodes which are then connected to the main cluster.

Hadoop works in master-slave fashion. There is a master node and there are n numbers of
slave nodes where n can be 1000s.

Master manages, maintains and monitors the slaves.

Slaves are the actual worker nodes. The master should be deployed on good-configuration hardware, not just commodity hardware, as it is the centerpiece of the Hadoop cluster.

Master stores the metadata (data about data) while slaves are the nodes which store the
data.


Data is stored in a distributed manner across the cluster.

The client connects with the master node to perform any task.

Hadoop components

1) HDFS - storage

HDFS: Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion.

It works on the write-once, read-many-times principle.

2) MapReduce - processing

The MapReduce paradigm is applied to data distributed over the network to find the required output.

HDFS architecture:

Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System); Hadoop-based applications make use of HDFS.

HDFS is designed for storing very large data files, running on clusters of commodity
hardware.

It is fault tolerant, scalable, and extremely simple to expand.

Hadoop HDFS has a Master/Slave architecture in which the Master is the NameNode and the Slaves are DataNodes.

The HDFS architecture consists of a single NameNode, and all the other nodes are DataNodes.


HDFS NameNode

It is also known as Master node.

HDFS Namenode stores meta-data i.e. number of data blocks, replicas and other details.

This meta-data is available in memory in the master for faster retrieval of data. The NameNode maintains and manages the slave nodes, and assigns tasks to them. It should be deployed on reliable hardware as it is the centerpiece of HDFS.

Task of NameNode

 Manages the file system namespace.
 Regulates clients' access to files.
 It also executes file system operations such as naming, closing, and opening files/directories.
 All DataNodes send a heartbeat and block report to the NameNode in the Hadoop cluster.
 It ensures that the DataNodes are alive. A block report contains a list of all blocks on a DataNode.
 The NameNode is also responsible for taking care of the replication factor of all the blocks.


Files present in the NameNode metadata are as follows-

FsImage –

It is an "Image file". FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system.

It also contains a serialized form of all the directories and file inodes in the filesystem.

Each inode is an internal representation of file or directory’s metadata.

EditLogs –

It contains all the recent modifications made to the file system on the most recent FsImage.

The NameNode receives a create/update/delete request from the client.

After that, the request is first recorded in the edits file.

HDFS DataNode

It is also known as Slave.

In Hadoop HDFS Architecture, DataNode stores actual data in HDFS.

It performs read and write operations as per the request of the client.

DataNodes can be deployed on commodity hardware.

Task of DataNode

 Block replica creation, deletion, and replication according to the instruction of


Namenode.
 DataNode manages data storage of the system.
 DataNodes send heartbeat to the NameNode to report the health of HDFS. By default,
this frequency is set to 3 seconds.


Secondary Namenode: Secondary NameNode downloads the FsImage and EditLogs from
the NameNode.

And then merges EditLogs with the FsImage (FileSystem Image).

It keeps edits log size within a limit.

It stores the modified FsImage into persistent storage.

And we can use it in the case of NameNode failure.

Secondary NameNode performs a regular checkpoint in HDFS.

Checkpoint Node

The Checkpoint node is a node which periodically creates checkpoints of the namespace.

Checkpoint Node in Hadoop first downloads FsImage and edits from the Active Namenode.

Then it merges them (FsImage and edits) locally, and at last, it uploads the new image back
to the active NameNode.

Backup Node

In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system
namespace.

The Backup node checkpoint process is more efficient as it only needs to save the
namespace into the local FsImage file and reset edits.

NameNode supports one Backup node at a time.

Blocks

HDFS in Apache Hadoop splits huge files into small chunks known as blocks. These are the smallest units of data in a filesystem. We (client and admin) do not have any control over the block, such as the block location; the NameNode decides all such things.

The default size of an HDFS block is 64 MB (128 MB in Hadoop 2.x and later), which we can configure as per the need.

All blocks of the file are of the same size except the last block, which can be the same size or
smaller.

The major advantages of storing data in such block size are that it saves disk seek time.

Replication Management

Block replication provides fault tolerance. If one copy is not accessible or is corrupted, then we can read data from another copy.

The number of copies or replicas of each block of a file is the replication factor. The default replication factor is 3, which is again configurable. So each block is replicated three times and stored on different DataNodes.

If we are storing a file of 128 MB in HDFS using the default configuration, we will end up occupying a space of 384 MB (3 * 128 MB).

The NameNode receives a block report from each DataNode periodically to maintain the replication factor.

When a block is over-replicated or under-replicated, the NameNode adds or deletes replicas as needed.
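
As a hedged illustration of block size and the replication factor (the path below is reused from the command examples later in these notes and is only illustrative), a Java client can inspect and change these values through the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/saurzcode/dir1/abc.txt");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize());    // e.g. 64 MB or 128 MB
        System.out.println("Replication : " + status.getReplication());  // e.g. 3

        // Request a different replication factor for this one file; the NameNode re-replicates.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}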

Rack Awareness

In a large Hadoop cluster, in order to improve network traffic while reading/writing HDFS files, the NameNode chooses a DataNode which is on the same rack or a nearby rack to serve the read/write request. The NameNode obtains rack information by maintaining the rack IDs of each DataNode. Rack Awareness in Hadoop is the concept of choosing DataNodes based on this rack information.

In the HDFS architecture, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack. It follows the Rack Awareness algorithm to reduce latency as well as to provide fault tolerance. We know that the default replication factor is 3.


According to the Rack Awareness algorithm, the first replica of a block is stored on a local rack. The next replica is stored on another DataNode within the same rack. The third replica is stored on a different rack.

Rack Awareness is important to improve:

 Data high availability and reliability.


 The performance of the cluster.
 To improve network bandwidth.

Hadoop Architecture Overview

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. There are mainly five building blocks inside this runtime environment (from bottom to top):

 The cluster is the set of host machines (nodes). Nodes may be partitioned in racks.
This is the hardware part of the infrastructure.
 YARN Infrastructure (Yet Another Resource Negotiator) is the framework
responsible for providing the computational resources (e.g., CPUs, memory, etc.)
needed for application executions.


Two important elements are:

 Resource Manager (one per cluster) is the master. It knows where the slaves are located (Rack Awareness) and how many resources they have. It runs several services, the most important being the Resource Scheduler, which decides how to assign the resources.

 Node Manager (many per cluster) is the slave of the infrastructure.

o When it starts, it announces itself to the Resource Manager. Periodically, it sends a heartbeat to the Resource Manager.

o Each Node Manager offers some resources to the cluster.

o Its resource capacity is the amount of memory and the number of vcores.

o At run time, the Resource Scheduler will decide how to use this capacity: a Container is a fraction of the NM capacity and it is used by the client for running a program.


 HDFS Federation is the framework responsible for providing permanent, reliable and
distributed storage. This is typically used for storing inputs and output (but not
intermediate ones).

 Other alternative storage solutions. For instance, Amazon uses the Simple Storage
Service (S3).

 The MapReduce Framework is the software layer implementing the MapReduce


paradigm.

The YARN infrastructure and the HDFS federation are completely decoupled and
independent:

The first one provides resources for running an application, while the second one provides storage.

The MapReduce framework is only one of many possible frameworks which can run on top of YARN (although currently it is the only one implemented).

YARN: Application Startup

In YARN, there are at least three actors:

 the Job Submitter (the client)


 the Resource Manager (the master)
 the Node Manager (the slave)


The application startup process is the following:

1. a client submits an application to the Resource Manager


2. the Resource Manager allocates a container
3. the Resource Manager contacts the related Node Manager
4. the Node Manager launches the container
5. the Container executes the Application Master

The Application Master is responsible for the execution of a single application. It asks the Resource Scheduler (Resource Manager) for containers and executes specific programs (e.g., the main of a Java class) on the obtained containers.

The Application Master knows the application logic and thus it is framework-specific.

The MapReduce framework provides its own implementation of an Application Master.

The Resource Manager is a single point of failure in YARN.

Using Application Masters, YARN spreads the metadata related to running applications over the cluster.

This reduces the load on the Resource Manager and makes it quickly recoverable.

Read Operation In HDFS

Data read requests are served by HDFS, the NameNode and the DataNodes. Let's call the reader a 'client'. The diagram below depicts the file read operation in Hadoop.


1. The client initiates a read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of that block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams wherein the client invokes the 'read()' method repeatedly. This process of read() operations continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method.
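
The client-side steps above correspond to a few calls on the Hadoop FileSystem API. The following is a minimal sketch (the path is illustrative) of reading an HDFS file and copying its contents to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // a DistributedFileSystem when fs.defaultFS points to HDFS
        FSDataInputStream in = fs.open(new Path("/user/saurzcode/dir1/abc.txt"));  // open(), step 1
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);  // repeated read() calls under the hood
        } finally {
            IOUtils.closeStream(in);                         // close(), final step
        }
    }
}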


Write Operation In HDFS

In this section, we will understand how data is written into HDFS through files.

1. Client initiates write operation by calling 'create()' method of DistributedFileSystem


object which creates a new file - Step no. 1 in above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not already exist and that the client has correct permissions to create the new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once new record in NameNode is created, an object of type FSDataOutputStream is
returned to the client. Client uses it to write data into the HDFS. Data write method
is invoked (step 3 in diagram).


4. FSDataOutputStream contains a DFSOutputStream object, which looks after the communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream keeps creating packets with this data. These packets are en-queued into a queue called the DataQueue.
5. There is one more component, called the DataStreamer, which consumes this DataQueue. The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores the packets it receives and forwards them to the next DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgement from the DataNodes.
10. Once acknowledgement for a packet in the queue is received from all DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
11. After the client is done with writing data, it calls the close() method (step 9 in the diagram). The call to close() results in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgement.
12. Once the final acknowledgement is received, the NameNode is contacted to tell it that the file write operation is complete.
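
From the client side, the write path looks like this. A minimal sketch, with an illustrative target path and sample data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Step 1: create() asks the NameNode to create the file record
        FSDataOutputStream out = fs.create(new Path("/user/saurzcode/dir3/output.txt"));
        // The written bytes are packetized and pushed through the DataNode pipeline
        out.writeBytes("hello hdfs\n");
        // close() flushes the remaining packets and waits for the acknowledgements
        out.close();
    }
}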

Hadoop HDFS Commands

Hadoop HDFS commands are discussed below along with their usage, description, and
examples.

Hadoop file system shell commands are used to perform various HDFS operations
and to manage the files present on HDFS clusters.

All the Hadoop file system shell commands are invoked by the bin/hdfs script.


1. Create a directory in HDFS at given path(s).

Syntax: hadoop fs -mkdir <paths>


Example: hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. List the contents of a directory.


Usage: hadoop fs -ls <args>
Example: hadoop fs -ls /user/saurzcode

3. Upload and download a file in HDFS.

Upload:

hadoop fs -put:

Copy single src file, or multiple src files from local file system to the Hadoop data file system
Syntax: hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example: hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Download:

hadoop fs -get:

Copies/Downloads files to the local file system

Usage: hadoop fs -get <hdfs_src> <localdst>


Example: hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. See contents of a file

Same as unix cat command:

Usage: hadoop fs -cat <path[filename]>


Example: hadoop fs -cat /user/saurzcode/dir1/abc.txt


5. Copy a file from source to destination

This command allows multiple sources as well in which case the destination must be a
directory.

Usage: hadoop fs -cp <source> <dest>


Example: hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

6. Copy a file from/To Local file system to HDFS

copyFromLocal

Usage: hadoop fs -copyFromLocal <localsrc> URI


Example: hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

Similar to put command, except that the source is restricted to a local file reference.

copyToLocal

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to get command, except that the destination is restricted to a local file reference.

7. Move file from source to destination.

Note: Moving files across filesystems is not permitted.

Usage: hadoop fs -mv <src> <dest>


Example: hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

8. Remove a file or directory in HDFS.

Removes the files specified as arguments. Deletes a directory only when it is empty.

Usage: hadoop fs -rm <arg>


Example: hadoop fs -rm /user/saurzcode/dir1/abc.txt


Recursive version of delete.

Usage: hadoop fs -rmr <arg>


Example: hadoop fs -rmr /user/saurzcode/

9. Display last few lines of a file.

Similar to tail command in Unix.

Usage: hadoop fs -tail <path[filename]>


Example: hadoop fs -tail /user/saurzcode/dir1/abc.txt

10. Display the aggregate length of a file.


Usage: hadoop fs -du <path>
Example: hadoop fs -du /user/saurzcode/dir1/abc.txt


Hadoop Daemons

Daemons are the processes that run in the background. There are mainly 4 daemons which
run for Hadoop.


 Namenode – It runs on master node for HDFS.


 Datanode – It runs on slave nodes for HDFS.
 ResourceManager – It runs on master node for Yarn.
 NodeManager – It runs on slave node for Yarn.

These four daemons must run for Hadoop to be functional. Apart from these, there can be a secondary
NameNode, a standby NameNode, a Job HistoryServer, etc.

Hadoop Flavors

Below are the various flavors of Hadoop.

 Apache – Vanilla flavor, as the actual code is residing in Apache repositories.


 Hortonworks – Popular distribution in the industry.
 Cloudera – It is the most popular in the industry.
 MapR – It has rewritten HDFS and its HDFS is faster as compared to others.
 IBM – Proprietary distribution is known as Big Insights.

All the major databases provide native connectivity with Hadoop for fast data transfer, because to transfer data from, say, Oracle to Hadoop, you need a connector.

All flavors are almost the same; if you know one, you can easily work on the other flavors as
well.

Hadoop Ecosystem

In this section of the Hadoop tutorial, we will cover the Hadoop ecosystem components. Let us see
which components form the Hadoop ecosystem:


 Hadoop HDFS – Distributed storage layer for Hadoop.


 Yarn – Resource management layer introduced in Hadoop 2.x.
 Hadoop Map-Reduce – Parallel processing layer for Hadoop.
 HBase – It is a column-oriented database that runs on top of HDFS. It is a NoSQL
database which does not understand the structured query. For sparse data set, it suits
well.
 Hive – Apache Hive is a data warehousing infrastructure based on Hadoop and it enables
easy data summarization, using SQL queries.
 Pig – A high-level scripting language used with Hadoop. Pig enables writing
complex data processing without Java programming.
 Flume – It is a reliable system for efficiently collecting large amounts of log data from
many different sources in real time.
 Sqoop – It is a tool designed to transfer huge volumes of data between Hadoop and
RDBMS.
 Oozie – It is a Java web application used to schedule Apache Hadoop jobs. It combines
multiple jobs sequentially into one logical unit of work.
 Zookeeper – A centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
 Mahout – A library of scalable machine-learning algorithms, implemented on top of
Apache Hadoop and using the MapReduce paradigm.

UNIT III
MAP-REDUCE PROGRAMMING

Why MapReduce in Hadoop?

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce is a framework using which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.

MapReduce is a programming model suitable for processing huge data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.

MapReduce programs are parallel in nature, and thus are very useful for performing large-scale
data analysis using multiple machines in the cluster.

Advantages of MapReduce programming

1) Scalability

2) Cost-effective solution

3) Flexibility

4) Fast

5) Security and authentication

6) Parallel processing

7) Availability and resilient nature

8) Simple model of programming.

Map Reduce programs work in two phases:

1) Map phase

2) Reduce phase

First is the map job, where a block of data is read and processed to produce key-value pairs
as intermediate outputs.

The output of a Mapper or map job (key-value pairs) is the input to the Reducer.

The reducer receives key-value pairs from multiple map jobs.

Then, the reducer aggregates those intermediate key-value pairs into a smaller set of key-value
pairs, which is the final output.

How Map Reduce works

Input Splits:

Input to a MapReduce job is divided into fixed-size pieces called input splits.

Input split is a chunk of the input that is consumed by a single map.

Mapper

This is the very first phase in the execution of a map-reduce program.

In this phase, the data in each split is passed to a mapping function to produce output values.

The input to the mapper is in the form of key-value pairs.

The output from the Mapper is also in the form of key-value pairs.

The output from all the Mappers together is the intermediate output.

Shuffling

This phase consumes the output of the mapping phase.

Its task is to consolidate the relevant records from the mapping phase output.

Shuffling groups together all the values belonging to the same key, so duplicate keys are merged; once shuffling is done, it is very easy to sort
the keys.

Reducer

In this phase, output values from Shuffling phase are aggregated. This phase combines
values from Shuffling phase and returns a single output value.

In short, this phase summarizes the complete dataset.

How Many Maps/Mapper

The number of maps is usually driven by the total size of the input, that is, the total number
of blocks of the input files.

Number of input splits = number of mappers.

If the input data size is 200 MB and the default HDFS block size is 64 MB, then we
get 4 input splits, i.e. 3 blocks of 64 MB each and one block of 8 MB.

A map task is then performed on each input split.

Example of word count.

Let us understand how a MapReduce works by taking an example where I have a text
file called example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose, we have to perform a word count on example.txt using MapReduce.

So, we will be finding the unique words and the number of occurrences of those unique words.

 First, we divide the input in three splits as shown in the figure. This will distribute
the work among all the map nodes.

 Then, we tokenize the words in each of the mapper and give a hardcoded value (1) to
each of the words. The rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.

 Now, a list of key-value pairs will be created where the key is nothing but the
individual words and the value is one.

 So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1;
River, 1. The mapping process remains the same on all the nodes.

 After mapper phase, a partition process takes place where sorting and shuffling
happens so that all the tuples with the same key are sent to the corresponding
reducer.

 So, after the sorting and shuffling phase, each reducer will have a unique key and a list
of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.

 Now, each Reducer counts the values which are present in that list of values. As shown
in the figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts
the number of ones in the very list and gives the final output as – Bear, 2.

 Finally, all the output key/value pairs are then collected and written in the output file.

Explain the conceptual understanding of MapReduce programming, or: how does MapReduce organize work?

The complete execution process (execution of Map and Reduce tasks, both) is controlled by
two types of entities called a

1. Jobtracker : Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers : Acts like slaves, each of them performing the job

For every job submitted for execution in the system, there is one Jobtracker that
resides on Namenode and there are multiple tasktrackers which reside on Datanode.

The job execution starts when the client program submits to the JobTracker a job
configuration, which specifies the map and reduce functions, as well as the input and
output paths of the data.

 The JobTracker will first determine the number of splits (each split is configurable, ~16-
64MB) from the input path, and select some TaskTrackers based on their network
proximity to the data sources; then the JobTracker sends the task requests to those
selected TaskTrackers.

 Each TaskTracker will start the map phase processing by extracting the input data
from the splits.

 For each record parsed by the "InputFormat", it invokes the user-provided "map"
function, which emits a number of key/value pairs in the memory buffer.

 When the map task completes (all splits are done), the TaskTracker will notify
the JobTracker.

 When all the TaskTrackers are done, the JobTracker will notify the
selected TaskTrackers for the reduce phase.

 Each TaskTracker will read the region files remotely. It sorts the key/value pairs and,
for each key, it invokes the "reduce" function, which collects the key/aggregatedValue
into the output file (one per reducer node).

The Map/Reduce framework is resilient to crashes of any of its components.

 The JobTracker keeps track of the progress of each phase and periodically pings
the TaskTrackers for their health status.

 When any of the map phase TaskTracker crashes, the JobTracker will reassign the
map task to a different TaskTracker node, which will rerun all the assigned splits.

 If the reduce phase TaskTracker crashes, the JobTracker will rerun the reduce at
a different TaskTracker.

 After both phases complete, the JobTracker will unblock the client program.

 A job is divided into multiple tasks which are then run on multiple data nodes in
a cluster.
 It is the responsibility of the jobtracker to coordinate the activity by scheduling tasks to
run on different data nodes.
 Execution of an individual task is then looked after by the tasktracker, which resides on every
data node executing its part of the job.
 The tasktracker's responsibility is to send the progress report to the jobtracker.
 In addition, the tasktracker periodically sends a 'heartbeat' signal to the Jobtracker so as
to notify it of the current state of the system.

Developing Map-Reduce programs in Java

We only have to supply the Map function logic and the Reduce function logic; the rest of the
steps will execute automatically.

Make sure that Hadoop is installed on your system along with the Java JDK.

Steps
Step 1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish

Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish

Step 3. Right Click on Package > New > Class (Name it - WordCount)

Step 4. Add the following Reference Libraries –

Right Click on Project > Build Path > Add External Archives

/usr/lib/hadoop-0.20/hadoop-core.jar
/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar

Step 5. Type the following program:

Driver code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount extends Configured {

public static void main(String[] args) throws Exception {

Job j = new Job();

j.setJarByClass(WordCount.class);

FileInputFormat.addInputPath(j, new Path(args[0]));

FileOutputFormat.setOutputPath(j, new Path(args[1]));

j.setMapperClass(NewWordMapper.class);

j.setReducerClass(NewWordReducer.class);

j.setMapOutputKeyClass(Text.class);

j.setMapOutputValueClass(IntWritable.class);

j.setOutputKeyClass(Text.class);

j.setOutputValueClass(IntWritable.class);

System.exit(j.waitForCompletion(true) ? 0 : 1);
}

Mapper class

public static class NewWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException {

String line = value.toString();

String[] words = line.split(",");

for (String word : words) {

Text outputKey = new Text(word.toUpperCase().trim());

IntWritable outputValue = new IntWritable(1);

con.write(outputKey, outputValue);

}
}
}

Reducer class

public static class NewWordReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException,
InterruptedException {

int sum = 0;

for(IntWritable value : values)

{
sum += value.get();
}

con.write(word, new IntWritable(sum));

}} }

Step 6. Make Jar File

Right Click on Project> Export> Select export destination as Jar File > next> Finish.

Step 7: Take a text file and move it in HDFS

To Move this into Hadoop directly, open the terminal and enter the following commands:

[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile


Step 8 . Run Jar file

(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)

[training@localhost ~]$ hadoop jar MRProgramsDemo.jar


PackageDemo.WordCount wordCountFile MRDir1
Step 9. Open Result

[training@localhost ~]$ hadoop fs -ls MRDir1


Found 3 items
-rw-r--r-- 1 training supergroup 0 2016-02-23 03:36
/user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup 0 2016-02-23 03:36
/user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup 20 2016-02-23 03:36
/user/training/MRDir1/part-r-00000

[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000


BUS 7
CAR 4
TRAIN 6

Setting up the cluster with HDFS:

There are two ways to install Hadoop, i.e. Single node and Multi node.

Single node cluster means only one DataNode running and setting up all the
NameNode, DataNode, ResourceManager and NodeManager on a single machine.

For example, let us consider a sample data set inside a healthcare industry. So, for testing
whether the Oozie jobs have scheduled all the processes like collecting, aggregating, storing
and processing the data in a proper sequence, we use single node cluster.

It can easily and efficiently test the sequential workflow in a smaller environment as compared
to large environments which contain terabytes of data distributed across hundreds of
machines.

Requirement

VIRTUAL BOX: it is used for installing the operating system.

OPERATING SYSTEM: You can install Hadoop on Linux based operating systems.
Ubuntu and CentOS are very commonly used. In this tutorial, we are using CentOS.
JAVA: You need to install the Java 8 package on your system.
HADOOP: You require Hadoop 2.7.3 package.

Install Hadoop

Step 1: Download the Java 8 package. Save this file in your home directory.

Step 2: Extract the Java tar file.

Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Step 3: Download the Hadoop 2.7.3 package.

Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Step 4: Extract the Hadoop tar File.

Command: tar -xvf hadoop-2.7.3.tar.gz

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).

Open the .bashrc file. Now, add the Hadoop and Java paths as shown below.

Command: vi .bashrc

Fig: Hadoop Installation – Setting Environment Variable
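
The figure is not reproduced here. As a rough sketch, the entries added to .bashrc typically look like the following; the install paths are illustrative and depend on where you extracted the Java and Hadoop packages:

export JAVA_HOME=/home/training/jdk1.8.0_101
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/training/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin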

Then, save the bash file and close it.

For applying all these changes to the current Terminal, execute the source command.

Command: source .bashrc

To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the Terminal, execute the java -version and hadoop version commands.

Command: java -version

Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.

Command: cd hadoop-2.7.3/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in hadoop-2.7.3/etc/hadoop directory as you
can see in the snapshot below:

Step 7: Open core-site.xml and edit the property mentioned below inside configuration tag:

core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster. It contains configuration settings of Hadoop core such as I/O settings that are common to HDFS & MapReduce.

Command: vi core-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:

hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode,

DataNode, Secondary NameNode). It also includes the replication factor and block size of

HDFS.

Command: vi hdfs-site.xml

<?xml version="1.0" encoding="UTF -8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:

mapred-site.xml contains configuration settings of the MapReduce application, like the number of JVMs that can run in parallel, the size of the mapper and the reducer process, CPU cores available for a process, etc.

In some cases, mapred-site.xml file is not available. So, we have to create the mapred-
site.xml file using mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside configuration tag:
yarn-site.xml contains configuration settings of ResourceManager and NodeManager
like application memory management size, the operation needed on program &
algorithm, etc.

Command: vi yarn-site.xml

<?xml version="1.0">
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:

hadoop-env.sh contains the environment variables that are used in the script to run Hadoop
like Java home path, etc.

Command: vi hadoop-env.sh

Step 12: Go to Hadoop home directory and format the NameNode.

Command: cd
Command: cd hadoop-2.7.3
Command: bin/hadoop namenode -format

Fig: Hadoop Installation – Formatting NameNode

This formats the HDFS via the NameNode. This command is executed only the first time.
Formatting the file system means initializing the directory specified by the dfs.name.dir
variable.

Never format an up-and-running Hadoop filesystem: you will lose all the data stored in
HDFS.

Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all
the daemons.

Command: cd hadoop-2.7.3/sbin

Either you can start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh & mr-jobhistory-daemon.sh

Or you can run all the services individually as below:

Start NameNode:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files
stored in HDFS and tracks all the files stored across the cluster.

Command: ./hadoop-daemon.sh start namenode

Fig: Hadoop Installation – Starting NameNode

Start DataNode:

On startup, a DataNode connects to the Namenode and it responds to the requests from
the Namenode for different operations.

Command: ./hadoop-daemon.sh start datanode

Fig: Hadoop Installation – Starting DataNode

Start ResourceManager:

ResourceManager is the master that arbitrates all the available cluster resources and thus
helps in managing the distributed applications running on the YARN system.

Its work is to manage each NodeManager and each application's ApplicationMaster.

Command: ./yarn-daemon.sh start resourcemanager

Fig: Hadoop Installation – Starting ResourceManager

Start NodeManager:
The NodeManager is the per-machine framework agent which is responsible for
managing containers, monitoring their resource usage and reporting the same to the
ResourceManager.
Command: ./yarn-daemon.sh start nodemanager

Fig: Hadoop Installation – Starting NodeManager

Start JobHistoryServer:

JobHistoryServer is responsible for servicing all job history related requests from client.

Command: ./mr-jobhistory-daemon.sh start historyserver

Step 14: To check that all the Hadoop services are up and running, run the below command.

Command: jps


Fig: Hadoop Installation – Checking Daemons

Step 15: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check
the NameNode interface.

Running a simple word count MapReduce program on the cluster

Open Eclipse -> right click in the Package Explorer -> New Java Project

Project name: wordcountjob

Next, add the required jar file to the project.

Which jar file do we have to add?

Java Settings -> Libraries -> Add External JARs -> JAR selection: File System -> usr -> lib -> hadoop-0.20 -> hadoop-core.jar -> OK -> Finish.

Referenced libraries: hadoop-core.jar

Right click on the project name (wordcountjob) -> New Class

Create the classes WordCount, WordMapper and WordReducer.

Driver Code:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

public int run(String[] args) throws Exception {

if (args.length != 2) {
System.out.println("Please give the input and output paths");
return -1;
}

JobConf conf = new JobConf(WordCount.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

conf.setMapperClass(WordMapper.class);
conf.setReducerClass(WordReducer.class);

conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

JobClient.runJob(conf);

return 0;
}

public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new WordCount(), args);
System.exit(exitCode);
}
}

WordMapper class

public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter r) throws IOException {

String s = value.toString();

for (String word : s.split(" ")) {

if (word.length() > 0) {
output.collect(new Text(word), new IntWritable(1));
}
}
}
}

WordReducer class:

public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter r) throws IOException {

int count = 0;

while (values.hasNext()) {
IntWritable i = values.next();
count += i.get();
}

output.collect(key, new IntWritable(count));
}
}

Right click on the project name -> Export -> expand Java -> choose JAR file -> Next -> name it wordcount.jar -> Finish.

Location

To know where the jar file is stored: it is generally in the Eclipse workspace.

Press Alt+Enter on the project (wordcountjob) -> Resources -> the location path is shown there.

Path: /home/training/workspace/wordcountjob

Open the linux terminal

training@localhost-]$ cat >file.txt

hai how are you

I am fine

What are you doing

How is your job

How is your family

Save the file by pressing Ctrl+D.

To clear the screen, use the clear command:

training@localhost ~]$ clear

Put the file into HDFS using the -put command:

training@localhost ~]$ hadoop fs -put file.txt ram/file.txt
training@localhost ~]$ hadoop fs -ls
training@localhost ~]$ hadoop fs -ls ram

training@localhost ~]$ cd workspace

training@localhost ~]$ hadoop jar wordcount.jar WordCount ram/file.txt wordcountoutput

Type the URLs:

http://localhost:50070

http://localhost:50030

Go to name node->Browse file system

Wordcountoutput

Or

training@localhost ~]$ hadoop fs -cat wordcountoutput/part-00000

The output will be displayed.

Understanding How Map- Reduce works on HDFS.

HDFS

HDFS stands for Hadoop Distributed File System, which is the storage system used by Hadoop.


The following is a high-level architecture that explains how HDFS works.

1) In the above diagram, there is one NameNode, and multiple DataNodes (servers). b1,
b2, indicates data blocks.

2) When you dump a file (or data) into the HDFS, it stores them in blocks on the various
nodes in the hadoop cluster.

3) HDFS creates several replicas of the data blocks and distributes them across the
cluster in a way that is reliable and allows faster retrieval.

4) The typical HDFS block size is 128MB. Each and every data block is replicated to multiple
nodes across the cluster.

Hadoop will internally make sure that any node failure never results in data loss.

There will be one NameNode that manages the file system metadata

There will be multiple DataNodes (These are the real cheap commodity servers) that will
store the data blocks

When you execute a query from a client, it will reach out to the NameNode to get the file
metadata information, and then it will reach out to the DataNodes to get the real data
blocks

Hadoop provides a command line interface for administrators to work on HDFS

The NameNode comes with an in-built web server from where you can browse the
HDFS filesystem and view some basic cluster statistics

MapReduce

MapReduce is a parallel programming model that is used to retrieve the data from the
Hadoop cluster.

This splits the tasks and executes them on the various nodes in parallel, thus speeding up
the computation and retrieving the required data from a huge dataset in a fast manner.

This provides a clear abstraction for programmers. They have to just implement (or use)
two functions: map and reduce

The data are fed into the map function as key value pairs to produce intermediate key/value
pairs

Once the mapping is done, all the intermediate results from various nodes are reduced to
create the final output

JobTracker keeps track of all the MapReduce jobs that are running on various nodes.

This schedules the jobs, keeps track of all the map and reduce jobs running across the nodes.

If any one of those jobs fails, it reallocates the job to another node, etc. In simple terms,
JobTracker is responsible for making sure that the query on a huge dataset runs successfully
and the data is returned to the client in a reliable manner.

TaskTracker performs the map and reduce tasks that are assigned by the JobTracker.

TaskTracker also constantly sends a heartbeat message to JobTracker, which helps JobTracker
to decide whether to delegate a new task to this particular node or not.

In this model, the library handles a lot of messy details that programmers don't need to
worry about. For example, the library takes care of parallelization, fault tolerance, data
distribution, load balancing, etc.

Additional MapReduce programming model

Problem Statement:

Find out the number of products sold in each country.

The input record fields are: transaction date, product, price, payment type, name, city, state, country, account created, last login, latitude, longitude.

SalesCountryDriver code:

package SalesCountry;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.*;

import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {

public static void main(String[] args) {

JobClient my_client = new JobClient();

// Create a configuration object for the job

JobConf job_conf = new JobConf(SalesCountryDriver.class);

// Set a name of the Job

job_conf.setJobName("SalePerCountry");

// Specify data type of output key and value

job_conf.setOutputKeyClass(Text.class);

job_conf.setOutputValueClass(IntWritable.class);

// Specify names of Mapper and Reducer Class

job_conf.setMapperClass(SalesCountry.SalesMapper.class);

job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

// Specify formats of the data type of input and output

job_conf.setInputFormat(TextInputFormat.class);

job_conf.setOutputFormat(TextOutputFormat.class);

// Set input and output directories using command line arguments,
// arg[0] = name of input directory on HDFS, and arg[1] = name of output directory to be created to store the output file.

FileInputFormat.setInputPaths(job_conf, new Path(args[0]));

FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

my_client.setConf(job_conf);

try {// Run the job

JobClient.runJob(job_conf);

} catch (Exception e) {

e.printStackTrace();}}}

SalesMapper code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,


Reporter reporter) throws IOException {

String valueString = value.toString();

String[] SingleCountryData = valueString.split(",");

output.collect(new Text(SingleCountryData[7]), one);
}
}

SalesCountryReducer code

import java.io.IOException;

import java.util.*;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

Text key = t_key;
int frequencyForCountry = 0;
while (values.hasNext()) {
// replace type of value with the actual type of our value
IntWritable value = (IntWritable) values.next();
frequencyForCountry += value.get();
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}

Explain about the Combiner in Hadoop

The Combiner class is used in between the Map class and the Reduce class to reduce the
volume of data transfer between Map and Reduce

When a MapReduce(MR) job is run on a large dataset, Map task generates huge chunks of
intermediate data, which is passed on to Reduce task.

During this phase, the output from Mapper has to travel over the network to the node
where Reducer is running.

This data movement may cause network congestion if the data is huge.

To reduce this network congestion, the MR framework provides a function called a 'Combiner', which is also called a 'Mini-Reducer'.

The role of the Combiner is to take the output of the Mapper as its input, process it, and send its
output to the reducer.

The Combiner reads each key-value pair, combines all the values for the same key, and sends this
as input to the reducer, which reduces the data movement in the network. The Combiner uses the
same class as the Reducer.

MapReduce Combiner Implementation

The following example provides a theoretical idea about combiners. Let us assume we have
the following input text file named input.txt for MapReduce.

What do you mean by Object

What do you know about Java

What is Java Virtual Machine

How Java enabled High Performance

Record Reader

This is the first phase of MapReduce where the Record Reader reads every line from the
input text file as text and yields output as key-value pairs.

Input − Line by line text from the input file.

Output − Forms the key-value pairs. The following is the set of expected key-value pairs.

<1, What do you mean by Object>

<2, What do you know about Java>

<3, What is Java Virtual Machine>

<4, How Java enabled High Performance>

Map Phase

The Map phase takes input from the Record Reader, processes it, and produces the output
as another set of key-value pairs.

Input − The following key-value pair is the input taken from the Record Reader.

<1, What do you mean by Object>

<2, What do you know about Java>

<3, What is Java Virtual Machine>

<4, How Java enabled High Performance>

The Map phase reads each key-value pair, divides each word from the value
using StringTokenizer, treats each word as key and the count of that word as
value.

Output − The expected output is as follows −

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>

<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>

<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>

<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

Combiner Phase

The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces the output as key-value collection pairs.

Input − The following key-value pair is the input taken from the Map phase.

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>

<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>

<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>

<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

The Combiner phase reads each key-value pair and combines the common words as the key and their values as a collection.

Usually, the code and operation for a Combiner is similar to that of a Reducer.

Output − The expected output is as follows −

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>

<know,1> <about,1> <Java,1,1,1> <is,1> <Virtual,1>

<Machine,1>

<How,1> <enabled,1> <High,1> <Performance,1>

Reducer Phase

The Reducer phase takes each key-value collection pair from the Combiner phase, processes
it, and passes the output as key-value pairs. Note that the Combiner functionality is the same as
the Reducer.

Input − The following key-value pair is the input taken from the Combiner phase.

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1> <know,1>

<about,1> <Java,1,1,1>

<is,1> <Virtual,1> <Machine,1>

<How,1> <enabled,1> <High,1> <Performance,1>

The Reducer phase reads each key-value pair and sums up the values for each key; the code is shown in the example program below.

Output − The expected output from the Reducer phase is as follows –

<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>

<know,1> <about,1> <Java,3> <is,1> <Virtual,1>

<Machine,1>

<How,1> <enabled,1> <High,1> <Performance,1>

Record Writer

This is the last phase of MapReduce where the Record Writer writes every key-value pair
from the Reducer phase and sends the output as text.

Input − Each key-value pair from the Reducer phase along with the Output format.

Output − It gives you the key-value pairs in text format. Following is the expected output.

What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1

Performance 1

Example Program

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException,
InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one); } } }

Combiner code (the same class is also used as the Reducer)

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws


IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get(); }

result.set(sum);

context.write(key, result); } }

Driver Code

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount.class);

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1); }}

Advantages of Combiner in MapReduce

As we have discussed what the Hadoop MapReduce Combiner is in detail, we will now discuss
some advantages of the MapReduce Combiner.

The Combiner reduces the time taken for data transfer between mapper and reducer.

It decreases the amount of data that needs to be processed by the reducer.


The Combiner improves the overall performance of the reducer.

Explain about Partitioner in Hadoop

It gives better performance and readability for your application.

The Partitioner in MapReduce controls the partitioning of the key of the intermediate
mapper output.

By hash function, key (or a subset of the key) is used to derive the partition.

Default partitioner in MapReduce Hadoop is Hash Partitioner which computes a hash value
for the key and assigns the partition based on this result.

It takes the responsibility of distributing the key-value pairs to the different reducers, depending on the hash code of the key.
By itself, it is not aware of which specific key-value pair is sent to which specific reducer.

If you want to send specific keys to a specific reducer, you have to design your own
partitioner code.

For example, to send specific keys to specific reducers (if the key length is one, send to
reducer 1; if the key length is two, send to reducer 2; and so on), a custom partitioner is needed, as shown in the code below.

The total number of partitions depends on the number of reduce tasks.

The Partitioner class determines which partition a given (key, value) pair will go to.

The partition phase takes place after the map phase and before the reduce phase.

Partitioning of the map output takes place on the basis of the key, and the output is sorted within each partition.

The total number of Partitioners that run in Hadoop is equal to the number of reducers,
i.e. the Partitioner will divide the data according to the number of reducers, which is set by the
JobConf.setNumReduceTasks() method.

A partitioner is needed only when there are multiple reducers.

public class MyPartitioner implements Partitioner<Text, IntWritable> {

public void configure(JobConf conf) { }

public int getPartition(Text key, IntWritable value, int numReduceTasks) {

String s = key.toString();

if (s.length() == 1)
{ return 0; }
if (s.length() == 2)
{ return 1; }
if (s.length() == 3)
{ return 2; }
else
return 3;
}
}

We can work with a maximum of about one lakh reducers; the minimum we can work with is one reducer.
Specify these statements in the driver code:

job.setMapperClass(PartitionerMapper.class);
job.setPartitionerClass(AgePartitioner.class);
job.setReducerClass(PartitionerReducer.class);
job.setNumReduceTasks(3);

To overcome a poor partitioner in Hadoop MapReduce, we can create a custom partitioner, which allows sharing the workload uniformly across the different reducers.

Flow of MapReduce in Hadoop

InputSplit

It is the logical representation of data.

It describes a unit of work that contains a single map task in a MapReduce program.

InputSplits convert the physical representation of the blocks into a logical representation for the mapper.

To read a 100MB file (with a 64MB block size), two InputSplits are required.

One InputSplit is created for each block, and one RecordReader and one mapper are created
for each InputSplit.

InputSplits do not always depend on the number of blocks; we can customize the number of
splits for a particular file by setting the mapred.max.split.size property during job execution.
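
A minimal sketch of customizing the split size from a driver is shown below. The 32 MB value and the class name are illustrative; newer releases also accept the property name mapreduce.input.fileinputformat.split.maxsize.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the split size so a single large file produces more splits (and hence more mappers)
        conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);
        Job job = Job.getInstance(conf, "split size demo");
        // ... set the mapper, reducer, input and output paths as in the WordCount driver ...
        System.out.println("Max split size: " + conf.get("mapred.max.split.size"));
    }
}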

RecordReader

RecordReader communicates with the InputSplit and converts the split into records.

Records are in form of Key-value pairs that are suitable for reading by the Mapper.

RecordReader communicates with the InputSplit until the complete split has been read.

Each record is read as a key-value pair of the form (byte offset, entire line).

By default, it uses TextInputFormat for converting data into key-value pairs.

TextInputFormat provides 2 types of RecordReader:

LineRecordReader and SequenceFileRecordReader.

LineRecordReader - LineRecordReader in Hadoop is the default RecordReader that TextInputFormat provides. Hence, each line of the input file is the value and the key is its byte offset.

SequenceFileRecordReader - It reads data specified by the header of a sequence file.

The Mapper task is the first phase of processing; it processes each input record
(from the RecordReader) and generates an intermediate key-value pair. The Hadoop Mapper stores the
intermediate output on the local disk.

Shuffle and sort

The map phase guarantees that the input to the reducer will be sorted on its key. The process
by which output of the mapper is sorted and transferred across to the reducers is known as
the shuffle

Reducer

Reducer takes a set of an intermediate key-value pair produced by the mapper as the input
and runs a Reducer function on each of them.

The Reducer process the output of the mapper. After processing the data, it produces a new
set of output. At last HDFS stores this output data.

All the grouping will be done here and the value is passed as input to Reducer phase.

The reducers then finally combine each key-value pair and pass those values to HDFS via record
writer.

Record Writer:

RecordWriter is a class whose implementation is provided by the OutputFormat; it collects the output key-value pairs from the Reducer and writes them into the output file.

The way these output key-value pairs are written in output files by the RecordWriter is
determined by the OutputFormat.
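
As a minimal sketch (the job name and the driver skeleton are illustrative), the OutputFormat that supplies the RecordWriter is chosen in the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output format demo");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextOutputFormat is the default; its RecordWriter writes each pair as "key<TAB>value"
        job.setOutputFormatClass(TextOutputFormat.class);
        // ... the mapper, reducer and paths are set as in the WordCount driver ...
    }
}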

Fig: Flow of MapReduce for 200 MB of input data – the data is divided into input splits; one RecordReader and one Mapper run for each split; the intermediate key-value pairs from the mappers are then reduced and the final output is written by the RecordWriter.
UNIT –IV
ANATOMY OF MAP-REDUCE JOBS

Anatomy of Map-Reduce Jobs: Understanding how Map- Reduce program works, tuning
Map-Reduce jobs, Understanding different logs produced by Map-Reduce jobs and
debugging the Map- Reduce jobs.

Anatomy of a MapReduce Job Run


You can run a MapReduce job with a single method call: submit() on a Job object (note that you
can also call waitForCompletion(), which will submit the job if it hasn’t been submitted already,
then wait for it to finish).

This method call conceals a great deal of processing behind the scenes.
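
Before looking at what happens behind the scenes, here is a minimal sketch of the two submission styles from a driver; the job configuration details are omitted and the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit demo");
        // ... set the jar, mapper, reducer, input and output paths here ...

        // Option 1: submit() returns immediately; the caller can poll the job itself
        // job.submit();

        // Option 2: waitForCompletion() submits the job if needed and blocks until it finishes,
        // printing progress to the console because of the 'true' argument
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}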

In Hadoop, mapred.job.tracker determines the means of execution.

If this configuration property is set to local, the default, then the local job runner is used.
This runner runs the whole job in a single JVM.

It’s designed for testing and for running MapReduce programs on small datasets.

In Hadoop 0.23.0 a new MapReduce implementation was introduced.

The new implementation (called MapReduce 2) is built on a system called YARN.


The framework used for execution is set by the mapreduce.framework.name property, which takes
the values local (for the local job runner), classic (for the "classic" MapReduce framework, i.e. MapReduce 1), and yarn (for the new framework, MapReduce 2).
Classic MapReduce (MapReduce 1)

A job run in classic MapReduce is illustrated in the figure below.

At the highest level, there are four independent entities:

• The client, which submits the MapReduce job.


• The jobtracker, which coordinates the job run. The jobtracker is a
Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers
are Java applications whose main class is TaskTracker.
• The distributed filesystem (HDFS), which is used for sharing job files between the
other entities.
Job Submission

Job Initialization

Task Assignment

Task Execution

Streaming and Pipes.

Progress and Status Updates

Job Completion

Job submission
The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1).

Having submitted the job, waitForCompletion() polls the job's progress once a second
and reports the progress to the console if it has changed since the last report.

When the job is complete, if it was successful, the job counters are displayed. Otherwise,
the error that caused the job to fail is logged to the console.

The job submission process implemented by JobSubmitter does the following:

• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

• Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to the
MapReduce program.

• Computes the input splits for the job. If the splits cannot be computed, because the input
paths don’t exist, for example, then the job is not submitted and an error is thrown to the
MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration
file, and the computed input splits, to the jobtracker’s filesystem in a directory named after
the job ID.

The job JAR is copied with a high replication factor (controlled by the
mapred.submit.replication property, which defaults to 10) so that there are lots of copies
across the cluster for the tasktrackers to access when they run tasks for the job (step 3)

Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker)
(step 4).

Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it up and initialize it.

Initialization involves creating an object to represent the job being run, which encapsulates
its tasks, and bookkeeping information to keep track of the tasks’ status and progress (step
5).

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by
the client from the shared file system (step 6).

It then creates one map task for each split. The number of reduce tasks to create is determined
by the mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks()
method, and the scheduler simply creates this number of reduce tasks to be run.

Tasks are given IDs at this point.

In addition to the map and reduce tasks, two further tasks are created: a job setup task and a
job cleanup task.

These are run by task trackers and are used to run code to setup the job before any map tasks
run, and to clean up after all the reduce tasks are complete.

The OutputCommitter that is configured for the job determines the code to be run, and by
default this is a FileOutputCommitter.

Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.

Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel
for messages.

As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task,
and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using
the heartbeat return value (step 7).

Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the
task from.

There are various scheduling algorithms, but the default one simply maintains a priority list
of jobs.

Having chosen a job, the job tracker now chooses a task for the job.

Task trackers have a fixed number of slots for map tasks and for reduce tasks:

For example, a task tracker may be able to run two map tasks and two reduce
tasks simultaneously.

The default scheduler fills empty map task slots before reduce task slots, so if the task tracker
has at least one empty map task slot, the job tracker will select a map task; otherwise, it will
select a reduce task.

Task Execution:
When the tasktracker has been assigned a task, the next step is for it to run the task.

First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's
filesystem (step 8).

Second, it creates a local working directory for the task, and un-jars the contents of the JAR
into this directory.

Third, it creates an instance of Task Runner to run the task. Task Runner launches a new Java
Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined
map and reduce functions don’t affect the task tracker.

The child process communicates with its parent through the umbilical interface.

This way it informs the parent of the task’s progress every few seconds until the task
is complete.

Streaming and Pipes

Both Streaming and Pipes run special map and reduce tasks for the purpose of launching
the user-supplied executable and communicating with it.

In the case of Streaming, the Streaming task communicates with the process using standard
input and output streams.

The Pipes task, on the other hand, listens on a socket and passes the C++ process a port
number in its environment, so that on startup, the C++ process can establish a persistent
socket connection back to the parent Java Pipes task.

In both cases, during execution of the task, the Java process passes input key-value pairs to
the external process, which runs it through the user-defined map or reduce function and
passes the output key-value pairs back to the Java process.

From the tasktracker’s point of view, it is as if the tasktracker child process ran the map
or reduce code itself.

Progress and Status Updates
Map Reduce jobs are long-running batch jobs, taking anything from minutes to hours to run.

Because this is a significant length of time, it’s important for the user to get feedback on how
the job is progressing.

A job and each of its tasks have a status, which includes such things as the state of the job or
task (e.g., running, successfully completed, failed), the progress of maps and reduces, the
values of the job’s counters, and a status message or description (which may be set by user
code).

These statuses change over the course of the job, so how do they get communicated back to
the client?

When a task is running, it keeps track of its progress, that is, the proportion of the
task completed.

For map tasks, this is the proportion of the input that has been processed.
For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of
the reduce input processed.

It does this by dividing the total progress into three parts, corresponding to the three
phases of the shuffle (see "Shuffle and Sort").

If a task reports progress, it sets a flag to indicate that the status change should be sent to
the tasktracker.

The flag is checked in a separate thread every three seconds, and if set it notifies the
tasktracker of the current task status.

Meanwhile, the tasktracker is sending heartbeats to the jobtracker every five seconds (this is
a minimum, as the heartbeat interval is actually dependent on the size of the cluster: for
larger clusters, the interval is longer), and the status of all the tasks being run by the
tasktracker is sent in the call.

Counters are sent less frequently than every five seconds, because they
can be relatively high-bandwidth.

The jobtracker combines these updates to produce a global view of the status of all the jobs
being run and their constituent tasks. Finally, as mentioned earlier, the Job receives the latest
status by polling the jobtracker every second.

Clients can also use Job’s getStatus() method to obtain a JobStatus instance, which contains
all of the status information for the job.

Job Completion

When the job tracker receives a notification that the last task for a job is complete (this will
be the special job cleanup task), it changes the status for the job to "successful."

Then, when the Job polls for status, it learns that the job has completed successfully, so it prints
a message to tell the user and then returns from the waitForCompletion() method.

The job tracker also sends an HTTP job notification if it is configured to do so.

This can be configured by clients wishing to receive callbacks, via the job.end.notification.url
property.

Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do
the same (so intermediate output is deleted).

YARN (Map Reduce-2)

Map Reduce on YARN involves more entities than classic MapReduce.

• The client, which submits the MapReduce job.


• The YARN resource manager, which coordinates the allocation of compute resources on
the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines
in the cluster.

• The MapReduce application master, which coordinates the tasks running the MapReduce
job. The application master and the MapReduce tasks run in containers that are scheduled by
the resource manager, and managed by the node managers.

• The distributed filesystem (normally HDFS), which is used for sharing job files between the
other entities.

The process of running a job is shown in Figure 6-4, and described in the following sections.

Job Submission
Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1
(step1). MapReduce 2 has an implementation of ClientProtocol that is activated
when mapreduce.framework.name is set to yarn.

The submission process is very similar to the classic implementation.

The new job ID is retrieved from the resource manager (rather than
the jobtracker), although in the nomenclature of YARN it is an application ID (step 2).

The job client checks the output specification of the job; computes input splits (although there
is an option to generate them on the cluster, yarn.app.mapreduce.am.compute-splits-in-
cluster, which can be beneficial for jobs with many splits); and copies job resources (including
the job JAR, configuration, and split information) to HDFS (step 3).

Finally, the job is submitted by calling submitApplication() on the resource manager (step 4).

Job Initialization

When the resource manager receives a call to its submitApplication(), it hands off the request
to the scheduler.

The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management
(steps 5a and 5b).

The application master for MapReduce jobs is a Java application whose main class
is MRAppMaster.

It initializes the job by creating a number of bookkeeping objects to keep track of the job’s
progress, as it will receive progress and completion reports from the tasks (step 6).

Next, it retrieves the input splits computed in the client from the
shared filesystem (step 7).

It then creates a map task object for each split, and a number of reduce task objects
determined by the mapreduce.job.reduces property.

The next thing the application master does is decide how to run the tasks that
make up the MapReduce job.

If the job is small, the application master may choose to run them in the same JVM as
itself, since it judges the overhead of allocating new containers
and running tasks in them as outweighing the gain to be had in running them in parallel,
compared to running them sequentially on one node. (This is different to MapReduce 1,
where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or
run as an uber task.

What qualifies as a small job? By default, one that has fewer than 10 mappers, only one
reducer, and an input size that is less than the size of one HDFS block. (These values may be
changed for a job by setting mapreduce.job.ubertask.maxmaps,
mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.)

It’s also possible to disable uber tasks entirely (by setting mapreduce.job.ubertask.enable
to false).
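A minimal sketch of adjusting these thresholds from the driver code (the values shown are purely illustrative):

import org.apache.hadoop.conf.Configuration;

public class UberTaskConfig {
    public static void apply(Configuration conf) {
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow uber (small-job) tasks
        conf.setInt("mapreduce.job.ubertask.maxmaps", 5);       // small means at most 5 map tasks
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // and at most 1 reduce task
        // Illustrative 64 MB limit on total input size; the default is one HDFS block
        conf.setLong("mapreduce.job.ubertask.maxbytes", 64L * 1024 * 1024);
    }
}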

Before any tasks can be run, the job setup method is called (for the job’s OutputCommitter),
to create the job’s output directory.

In contrast to MapReduce 1, where it is called in a special task that is run by the tasktracker,
in the YARN implementation the method is called directly by the application master.
Task Assignment
If the job does not qualify for running as an uber task, then the application master
requests containers for all the map and reduce tasks in the job from the resource
manager (step 8).

Each request, which is piggybacked on a heartbeat call, includes information about each map
task’s data locality, in particular the hosts and corresponding racks that the input split resides on.

The scheduler uses this information to make scheduling decisions (just like a jobtracker’s
scheduler does): it attempts to place tasks on data-local nodes in the ideal case, but if this is
not possible the scheduler prefers rack-local placement to non-local placement.

Requests also specify memory requirements for tasks. By default both map and reduce tasks
are allocated 1024 MB of memory, but this is configurable by setting
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

The way memory is allocated is different to MapReduce 1, where tasktrackers have a


fixed number of “slots”, set at cluster configuration time, and each task runs in a single
slot.

Slots have a maximum memory allowance, which again is fixed for a cluster, and which leads
both to problems of underutilization when tasks use less memory (since other waiting tasks
are not able to take advantage of the unused memory) and to problems of job failure when a
task can’t complete because it can’t get enough memory to run correctly.
In YARN, resources are more fine-grained, so both these problems can be avoided.

In particular, applications may request a memory capability that is anywhere between the
minimum allocation and a maximum allocation, and which must be a multiple of the minimum
allocation. Default memory allocations are scheduler-specific; for the Capacity Scheduler,
the default minimum is 1024 MB (set by yarn.scheduler.capacity.minimum-allocation-mb)
and the default maximum is 10240 MB (set by yarn.scheduler.capacity.maximum-allocation-mb).

Thus, tasks can request any memory allocation between 1 and 10 GB (inclusive), in multiples
of 1 GB (the scheduler will round to the nearest multiple if needed), by setting
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb appropriately.
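For example, a job that needs larger containers might set these properties on its configuration (the sizes are illustrative):

import org.apache.hadoop.conf.Configuration;

public class MemoryRequests {
    public static void apply(Configuration conf) {
        // Ask for 2 GB containers for map tasks and 4 GB for reduce tasks;
        // the scheduler rounds each request to a multiple of its minimum allocation.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
    }
}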

Task Execution

Once a task has been assigned a container by the resource manager’s scheduler, the
application master starts the container by contacting the node manager (steps 9a and 9b).

The task is executed by a Java application whose main class is YarnChild.

Before it can run the task it localizes the resources that the task needs, including the
job configuration and JAR file, and any files from the distributed cache (step 10).

Finally, it runs the map or reduce task (step 11).

The YarnChild runs in a dedicated JVM, for the same reason that tasktrackers spawn new JVMs
for tasks in MapReduce 1: to isolate user code from long-running system daemons.

Unlike MapReduce 1, however, YARN does not support JVM reuse so each task runs in a new
JVM.

Streaming and Pipes programs work in the same way as MapReduce 1. The Yarn Child launches
the Streaming or Pipes process and communicates with it using standard input/output or a
socket (respectively), as shown in Figure (except the child subprocesses run on node managers,
not tasktrackers).


Progress and Status Updates

When running under YARN, the task reports its progress and status (including counters) back
to its application master every three seconds (over the umbilical interface), which has an
aggregate view of the job.

The process is illustrated in the figure. Contrast this with MapReduce 1, where progress updates
flow from the child through the tasktracker to the jobtracker for aggregation.

The client polls the application master every second (set via
mapreduce.client.progressmonitor.pollinterval) to receive progress updates, which are usually
displayed to the user.

Job Completion

As well as polling the application master for progress, every five seconds the client
checks whether the job has completed when using the waitForCompletion() method on
Job.

The polling interval can be set via the mapreduce.client.completion.pollinterval configuration
property.

Notification of job completion via an HTTP callback is also supported, as it is in MapReduce 1;
in MapReduce 2, the application master initiates the callback.

On job completion, the application master and the task containers clean up their working state,
and the OutputCommitter’s job cleanup method is called.

Job information is archived by the job history server to enable later interrogation by
users if desired.

Failures
In the real world, user code is buggy, processes crash, and machines fail. One of the major
benefits of using Hadoop is its ability to handle such failures and allow your job to
complete.

Failures in Classic MapReduce

In the MapReduce 1 runtime there are three failure modes to consider: failure of the
running task, failure of the tasktracker, and failure of the jobtracker.

Task Failure

Consider first the case of the child task failing.

The most common way that this happens is when user code in the map or reduce task
throws a runtime exception. If this happens, the child JVM reports the error back to its parent
tasktracker before it exits.

The tasktracker marks the task attempt as failed, freeing up a slot to run another task.

Another failure mode is the sudden exit of the child JVM—perhaps there is a JVM bug that
causes the JVM to exit for a particular set of circumstances exposed by the MapReduce
user code.

In this case, the tasktracker notices that the process has exited and marks the attempt as failed.

Hanging tasks are dealt with differently.

The tasktracker notices that it hasn’t received a progress update for a while and proceeds to
mark the task as failed.

The child JVM process will be automatically killed after this period.

The timeout period after which tasks are considered failed is normally 10 minutes and can
be configured on a per-job basis (or a cluster basis) by setting the mapred.task.timeout
property to a value in milliseconds.

When the jobtracker is notified of a task attempt that has failed (by the
tasktracker’s heartbeat call), it will reschedule execution of the task.

The jobtracker will try to avoid rescheduling the task on a tasktracker where it has
previously failed.

Furthermore, if a task fails four times (or more), it will not be retried further.

This value is configurable: the maximum number of attempts to run a task is controlled by
the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts
for reduce tasks.

By default, if any task fails four times (or whatever the maximum number of attempts is
configured to), the whole job fails. For some applications this is undesirable, so you can instead
set the maximum percentage of tasks that are allowed to fail without triggering job failure.

Map tasks and reduce tasks are controlled independently, using the
mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.
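For example, a driver could relax these limits on its job configuration; this is a sketch with illustrative values, using the MapReduce 1 property names mentioned above:

import org.apache.hadoop.conf.Configuration;

public class TaskFailureConfig {
    public static void apply(Configuration conf) {
        // Allow up to 6 attempts per task instead of the default 4
        conf.setInt("mapred.map.max.attempts", 6);
        conf.setInt("mapred.reduce.max.attempts", 6);
        // Tolerate up to 5% of map or reduce tasks failing before the whole job fails
        conf.setInt("mapred.max.map.failures.percent", 5);
        conf.setInt("mapred.max.reduce.failures.percent", 5);
    }
}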

A task attempt may also be killed, which is different from it failing. A task attempt may be killed
because it is a speculative duplicate (for more, see “SpeculativeExecution” ), or because the
tasktracker it was running on failed, and the jobtracker marked all the task
attempts running on it as killed.

Tasktracker Failure

Failure of a tasktracker is another failure mode.

If a tasktracker fails by crashing, or running very slowly, it will stop sending heartbeats to
the jobtracker (or send them very infrequently).

The jobtracker will notice a tasktracker that has stopped sending heartbeats
(if it hasn’t received one for 10 minutes, configured via the
mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool
of tasktrackers to schedule tasks on.

The jobtracker arranges for map tasks that were run and completed successfully on that
tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output
residing on the failed tasktracker’s local filesystem may not be accessible to the reduce task.

Any tasks in progress are also rescheduled.

A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed.

If more than four tasks from the same job fail on a particular tasktracker (set
by mapred.max.tracker.failures), then the jobtracker records this as a fault.

A tasktracker is blacklisted if the number of faults is over some minimum threshold (four, set
by mapred.max.tracker.blacklists) and is significantly higher than the average number of faults
for tasktrackers in the cluster.

Blacklisted tasktrackers are not assigned tasks, but they continue to communicate with
the jobtracker.

Jobtracker Failure

Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism for
dealing with failure of the jobtracker—it is a single point of failure—so in this case the job fails.

However, this failure mode has a low chance of occurring, since the chance of a
particular machine failing is low.

The good news is that the situation is improved in YARN, since one of its design goals is to
eliminate single points of failure in MapReduce.

After restarting a jobtracker, any jobs that were running at the time it was stopped will need to
be re-submitted.

There is a configuration option that attempts to recover any running jobs
(mapred.jobtracker.restart.recover, turned off by default); however, it is known not to
work reliably, so it should not be used.

Failures in YARN
For MapReduce programs running on YARN, we need to consider the failure of any of
the following entities: the task, the application master, the node manager, and the
resource manager.

Task Failure

Failure of the running task is similar to the classic case. Runtime exceptions and
sudden exits of the JVM are propagated back to the application master and the task
attempt is marked as failed.

The configuration properties for determining when a task is considered to be failed are
the same as the classic case: a task is marked as failed after four attempts (set by
mapreduce.map.maxattempts for map tasks and mapreduce.reduce.maxattempts for
reducer tasks).

Application Master Failure

By default, applications are marked as failed if they fail once, but this can be increased by
setting the property yarn.resourcemanager.am.max-retries.

An application master sends periodic heartbeats to the resource manager, and in the event of
application master failure, the resource manager will detect the failure and start a new
instance of the master running in a new container (managed by a node manager).

When the ApplicationMaster fails, the ResourceManager simply starts another container with
a new ApplicationMaster running in it for another application attempt.

It is the responsibility of the new ApplicationMaster to recover the state of the old
ApplicationMaster, and this is possible only if ApplicationMasters persist their state in an
external location so that it can be used later.

If no state has been persisted, the new ApplicationMaster simply runs the application again
from scratch rather than recovering and resuming it.

Node Manager Failure

If a node manager fails, then it will stop sending heartbeats to the resource manager, and
the node manager will be removed from the resource manager’s pool of available nodes.

The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults
to 600000 (10 minutes), determines the minimum time the resource manager waits before
considering a node manager that has sent no heartbeat in that time as failed.

Any task or application master running on the failed node manager will be recovered using
the mechanisms described in the previous two sections.

Node managers may be blacklisted if the number of failures for the application is high.

Blacklisting is done by the application master, and for MapReduce the application master will
try to reschedule tasks on different nodes if more than three tasks fail on a node manager.

The threshold may be set with mapreduce.job.maxtaskfailures.per.tracker.

Resource Manager Failure


Failure of the resource manager is serious, since without it neither jobs nor task containers can
be launched.

The resource manager was designed from the outset to be able to recover from crashes, by
using a checkpointing mechanism to save its state to persistent storage, although at the time
of writing the latest release did not have a complete implementation.

After a crash, a new resource manager instance is brought up (by an administrator) and it
recovers from the saved state.

The state consists of the node managers in the system as well as the running applications.

(Note that tasks are not part of the resource manager’s state, since they are managed by
the application. Thus the amount of state to be stored is much more manageable than that
of the jobtracker.)

The storage used by the resource manager is configurable via the
yarn.resourcemanager.store.class property.

The default is org.apache.hadoop.yarn.server.resourcemanager.recovery.MemStore, which keeps
the store in memory and is therefore not highly available.

However, there is a ZooKeeper-based store in the works that will support reliable recovery
from resource manager failures in the future.

Job Scheduling
Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they ran in
order of submission, using a FIFO scheduler.

Typically, each job would use the whole cluster, so jobs had to wait their turn.

Although a shared cluster offers great potential for offering large resources to many users,
the problem of sharing resources fairly between users requires a better scheduler.

Production jobs need to complete in a timely manner, while allowing users who are
making smaller ad hoc queries to get results back in a reasonable time.

Later on, the ability to set a job’s priority was added, via the mapred.job.priority property or
the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH,
HIGH, NORMAL, LOW, VERY_LOW).

When the job scheduler is choosing the next job to run, it selects one with the highest
priority. However, with the FIFO scheduler, priorities do not support preemption, so a high-
priority job can still be blocked by a long-running low priority job that started before the
high-priority job was scheduled.
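For example, using the old-API JobConf, a client could raise a job's priority before submitting it. This is only a sketch; the same effect can be had by setting the mapred.job.priority property directly:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PriorityExample {
    public static JobConf highPriority(JobConf conf) {
        // One of VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
        conf.setJobPriority(JobPriority.HIGH);
        return conf;
    }
}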

MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce 1 is


the original FIFO queue-based scheduler, and there are also multiuser schedulers called the
Fair Scheduler and the Capacity Scheduler.

MapReduce 2 comes with the Capacity Scheduler (the default) and the FIFO scheduler.
The Fair Scheduler
The Fair Scheduler aims to give every user a fair share of the cluster capacity over time. If
a single job is running, it gets the entire cluster.

As more jobs are submitted, free task slots are given to the jobs in such a way as to give
each user a fair share of the cluster.
It is also an easy way to share a cluster between multiple users.

Jobs are placed in pools, and by default, each user gets their own pool.

A user who submits more jobs than a second user will not get any more cluster resources than
the second, on average. It is also possible to define custom pools with guaranteed minimum
capacities defined in terms of the number of map and reduce slots, and to set weightings for
each pool.

The Fair Scheduler supports preemption, so if a pool has not received its fair share for a
certain period of time, then the scheduler will kill tasks in pools running over capacity in
order to give the slots to the pool running under capacity.

The Fair Scheduler is a “contrib” module. To enable it, place its JAR file on Hadoop’s classpath,
by copying it from Hadoop’s contrib/fairscheduler directory to the lib directory.

Then set the mapred.jobtracker.taskScheduler property to:


org.apache.hadoop.mapred.FairScheduler.

The Fair Scheduler can also limit the number of concurrently running jobs per user and per pool.
Jobs submitted beyond this limit have to wait until some of the pool’s earlier jobs finish.
Jobs to run from each pool are chosen in order of priority and then submission time.

The Capacity Scheduler


The Capacity Scheduler takes a slightly different approach to multiuser scheduling.
A cluster is made up of a number of queues (like the Fair Scheduler’s pools), which may
be hierarchical (so a queue may be the child of another queue), and each queue has an
allocated capacity.

This is like the Fair Scheduler, except that within each queue, jobs are scheduled using FIFO
scheduling (with priorities).

In effect, the Capacity Scheduler allows users or organizations (defined using queues) to
simulate a separate MapReduce cluster with FIFO scheduling for each user or organization. The
Fair Scheduler, by contrast, (which actually also supports FIFO job scheduling within pools as an
option, making it like the Capacity Scheduler) enforces fair sharing within each pool, so running
jobs share the pool’s resources.

Shuffling & sorting
As we know, broadly MapReduce follows a simple mechanism:

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)

But in practice a lot happens inside the two main phases, known as map and reduce
(especially sorting and shuffling).

Shuffle: the process of distributing the map outputs to the reducers as their inputs is known
as shuffling.

Let’s start with the map side.

1) Map Side:

When the map function starts producing output, it does not write it straight to disk. Instead
the process takes advantage of buffering writes in memory, which is more efficient.

Each map task has a circular memory buffer that it writes its output to. The buffer is 100 MB
by default, and its size can be changed with the io.sort.mb property.

When the contents of the buffer reach a threshold size (io.sort.spill.percent, 80% by default),
a new thread starts spilling the output records to local disk. Writing map output to the buffer
and spilling to disk then take place in parallel. Spills are written in a round-robin manner to
the directories specified by the mapred.local.dir property.

Before writing a spill to disk, the thread first divides the data into partitions corresponding to
the reducers that the data will ultimately be sent to. Within each partition, the data is sorted
by key in memory.

If there is a combiner function, it is run on the output of the sort, which produces a more
compact map output, so less data needs to be written to local disk and transferred to the
reducers.

Every time the buffer reaches the spill threshold a new spill file is generated, so after the map
task has written its last output record there may be several spill files. Before the task
completes, the thread merges the spill files (streams) into a single partitioned and sorted
output file. The io.sort.factor property sets the maximum number of spill files that can be
merged at once.

It is a good idea to compress the map output as it is written, since this saves disk space and
reduces the amount of data to be transferred to the reducers. Compression is not enabled by
default; enable it by setting mapred.compress.map.output to true.

The map output file (you can also call it the partitioned file) is then made available to the
reducers over HTTP. The maximum number of worker threads used to serve the file partitions
is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not
per map task.
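As a sketch of the compression settings just described (the GzipCodec choice is only an example; any available CompressionCodec could be configured):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class MapOutputCompression {
    public static void enable(Configuration conf) {
        // Compress intermediate map output to save disk space and shuffle traffic
        conf.setBoolean("mapred.compress.map.output", true);
        // Which codec to use for the compressed map output (illustrative choice)
        conf.setClass("mapred.map.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);
    }
}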

2) The Reduce Side:

At this point we have the output of each map task, written to that map task’s local disk.

The reduce task now needs to fetch the map output partitions that belong to it. Map tasks
generally complete at different times, but a reduce task starts copying their partitioned
outputs as soon as each one completes. This is known as the copy phase.

The reduce task uses multiple threads to fetch map outputs in parallel; the default number of
copier threads is 5, and it can be changed with the mapred.reduce.parallel.copies property.

Map outputs are copied directly into the reduce task’s JVM memory if they are small enough;
otherwise they are copied to disk. As data accumulates on disk, a background thread merges it
into larger, sorted files, which saves time merging later on. (Note that any map output that
was compressed must be decompressed first.)

When all the map outputs have been copied, the reduce task moves into the sort phase (which
should properly be called the merge phase, since the sorting was carried out on the map side).
This phase merges the map outputs, maintaining their sort order, and proceeds in rounds.

The merge factor plays an important role here. For example, if there are 50 map outputs and
the merge factor is 10 (the default, changeable via io.sort.factor), there will be 5 rounds;
each round merges 10 files into one, leaving 5 intermediate files at the end.

Rather than merging these five files into a single sorted file, the merge feeds them directly to
the reduce function, saving a trip to disk. The reduce function then performs its operation and
writes its output to the filesystem, typically HDFS.

Note :-

How do reducers know which tasktrackers to fetch map output from? As map
tasks complete successfully, they notify their parent tasktracker of the status
update, which in turn notifies the jobtracker. These notifications are
transmitted over the heartbeat communication mechanism described
earlier. Therefore, for a given job, the jobtracker knows the mapping
between map outputs and tasktrackers. A thread in the reducer
periodically asks the jobtracker for map output locations until it has
retrieved them all.
Tasktrackers do not delete map outputs from disk as soon as the
first reducer has retrieved them, as the reducer may fail. Instead,
they wait until they are told to delete them by the jobtracker, which
is after the job has completed.
This, then, is the internal process of sorting and shuffling in the MapReduce mechanism.


Tuning of Map Reduce job

We are now in a better position to understand how to tune the shuffle to improve MapReduce
performance.

The general principle is to give the shuffle as much memory as possible.

However, there is a trade-off, in that you need to make sure that your map and reduce
functions get enough memory to operate.

This is why it is best to write your map and reduce functions to use as little memory as possible.

The amount of memory given to the JVMs in which the map and reduce tasks run is set by
the mapred.child.java.opts property. You should try to make this as large as possible for the
amount of memory on your task nodes.

On the map side, the best performance can be obtained by avoiding multiple spills to disk; one
is optimal.

If you can estimate the size of your map outputs, then you can set the
io.sort.* properties appropriately to minimize the number of spills.

In particular, you should increase io.sort.mb if you can.

There is a MapReduce counter that counts the total number of records that
were spilled to disk over the course of a job, which can be useful for tuning.

Note that the counter includes both map and reduce side spills.

On the reduce side, the best performance is obtained when the intermediate data
can reside entirely in memory.

By default, this does not happen, since for the general case all the memory is
reserved for the reduce function.

But if your reduce function has light memory requirements, then setting
mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent to 1.0
may bring a performance boost.

Hadoop uses a buffer size of 4 KB by default, which is low, so you should increase this across
the cluster (by setting io.file.buffer.size).
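For example, a driver might apply a few of these shuffle-related settings to its configuration; the values below are illustrative rather than recommendations:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static void apply(Configuration conf) {
        conf.set("mapred.child.java.opts", "-Xmx1024m"); // heap for each task JVM
        conf.setInt("io.sort.mb", 200);            // larger map-side sort buffer (default 100 MB)
        conf.setInt("io.sort.factor", 50);         // merge more streams at once (default 10)
        conf.setInt("io.file.buffer.size", 65536); // 64 KB I/O buffer instead of the 4 KB default
    }
}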

The Task Execution Environment

Hadoop provides information to a map or reduce task about the environment in which it
is running.

For example, a map task can discover the name of the file it is processing,
and a map or reduce task can find out the attempt number of the task. These properties
can be accessed from the job’s configuration, obtained in the old MapReduce API by providing
an implementation of the configure() method for Mapper or Reducer, where the configuration
is passed in as an argument.

In the new API these properties can be accessed from the context object passed to all methods
of the Mapper or Reducer.
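A minimal sketch of a new-API mapper that reads this kind of information from its context (the class name and key/value types are arbitrary choices for illustration, and the FileSplit cast assumes a file-based input format):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class EnvAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Name of the file this map task is processing (for file-based splits)
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // The attempt number is part of the task attempt ID
        System.err.println("Running attempt " + context.getTaskAttemptID()
                + " over " + fileName);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(fileName), key);  // tag each record with its source file
    }
}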

Speculative Execution
Apache Hadoop does not fix or diagnose slow-running tasks. Instead, it tries to detect when a
task is running slower than expected and launches another, equivalent task as a backup (the
backup task is called a speculative task). This process is called speculative execution in Hadoop,
and it is a key feature for improving job completion time.
What is Speculative Execution in Hadoop?
In Hadoop, MapReduce breaks jobs into tasks and these tasks run in parallel rather than
sequentially, thus reducing overall execution time.
This model of execution is sensitive to slow tasks (even if they are few in number), as they
slow down the overall execution of a job.
There may be various reasons for the slowdown of tasks, including hardware degradation or
software misconfiguration, but it may be difficult to detect causes since the tasks still
complete successfully, although more time is taken than the expected time.
Hadoop doesn’t try to diagnose and fix slow running tasks; instead, it tries to detect them and
runs backup tasks for them. This is called speculative execution in Hadoop. These backup tasks
are called Speculative tasks in Hadoop.

How Speculative Execution works in Hadoop?

First, all the tasks for the job are launched in Hadoop MapReduce.

Speculative tasks are then launched for those tasks that have been running for some time (at
least one minute) and have not made as much progress, on average, as the other tasks from
the job.
The speculative task is killed if the original task completes before it; conversely, the original
task is killed if the speculative task finishes first.
How to Enable or Disable Speculative Execution?
Speculative execution is a MapReduce job optimization technique in Hadoop that is enabled
by default. You can disable speculative execution for mappers and reducers in mapred-
site.xml as shown below:
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
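The same settings can also be applied per job from the driver code, for example (a sketch using the MapReduce 1 property names shown above):

import org.apache.hadoop.conf.Configuration;

public class DisableSpeculation {
    public static void apply(Configuration conf) {
        // Per-job equivalent of the mapred-site.xml entries above
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    }
}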

What is the need to turn off Speculative Execution?
The main purpose of speculative execution is to reduce job execution time; however, cluster
efficiency is affected by the duplicate tasks. Since speculative execution runs redundant
tasks, it can reduce overall throughput. For this reason, some cluster administrators prefer to
turn off speculative execution in Hadoop.

Task JVM reuse


Hadoop runs tasks in their own Java Virtual Machine to isolate them from other running tasks.

The overhead of starting a new JVM for each task can take around a second, which for jobs
that run for a minute or so is insignificant.

However, jobs that have a large number of very short-lived tasks (these are usually map
tasks), or that have lengthy initialization, can see performance gains when the JVM is reused
for subsequent tasks.

Note that, with task JVM reuse enabled, tasks are not run concurrently in a single JVM;
rather, the JVM runs tasks sequentially.

Tasktrackers can, however, run more than one task at a time, but this is always done in
separate JVMs.

The number of map task slots and reduce task slots on a tasktracker is controlled by separate configuration properties.

The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it specifies
the maximum number of tasks to run for a given job for each JVM launched; the default is 1.

No distinction is made between map and reduce tasks; however, tasks from different jobs
are always run in separate JVMs.

The method setNumTasksToExecutePerJvm() on JobConf can also be used to configure this
property.
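For example, a driver using the old API might enable unlimited JVM reuse for a job (a sketch; -1 is the conventional "no limit" value):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuse {
    public static void enableUnlimitedReuse(JobConf conf) {
        // -1 means no limit: a JVM is reused for as many tasks of this job as it is given
        conf.setNumTasksToExecutePerJvm(-1);
        // Equivalent property form: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    }
}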

Another place where a shared JVM is useful is for sharing state between the tasks of a job.
By storing reference data in a static field, tasks get rapid access to the shared data .

Skipping Bad Records


Large datasets are messy. They often have corrupt records. They often have records that are
in a different format. They often have missing fields.

Depending on the analysis being performed, if only a small percentage of records are
affected, then skipping them may not significantly affect the result.

The best way to handle corrupt records is in your mapper or reducer code. You can detect
the bad record and ignore it, or you can abort the job by throwing an exception.

You can also count the total number of bad records in the job using counters to see
how widespread the problem is.

In rare cases, though, you can’t handle the problem because there is a bug in a third party
library that you can’t work around in your mapper or reducer.

In these cases, you can use Hadoop’s optional skipping mode for automatically skipping
bad records.

When skipping mode is enabled, tasks report the records being processed back to the
tasktracker.

When the task fails, the tasktracker retries the task, skipping the records that caused the
failure. Because of the extra network traffic and bookkeeping to maintain the failed record
ranges, skipping mode is turned on for a task only after it has failed twice.

Thus, for a task consistently failing on a bad record, the tasktracker runs the following
task attempts with these outcomes:
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed in
the previous attempt.

Skipping mode is off by default; you enable it independently for map and reduce tasks using
the SkipBadRecords class. It’s important to note that skipping mode can detect only one bad
record per task attempt, so this mechanism is appropriate only for detecting occasional bad
records (a few per task, say). You may need to increase the maximum number of task
attempts (via mapred.map.max.attempts and mapred.reduce.max.attempts) to give skipping
mode enough attempts to detect and skip all the bad records in an input split.
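A sketch of enabling skipping mode from the driver using the SkipBadRecords class (the thresholds are illustrative; consult the SkipBadRecords javadoc for the exact semantics of each setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingModeConfig {
    public static void apply(Configuration conf) {
        // Bound how many records may be skipped around each bad record in map tasks
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
        // Start trying to skip only after two failed attempts of the same task
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Give the task extra attempts so skipping mode has room to work
        conf.setInt("mapred.map.max.attempts", 8);
    }
}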

Bad records that have been detected by Hadoop are saved as sequence files in the job’s
output directory under the _logs/skip subdirectory. These can be inspected for diagnostic
purposes after the job has completed (using hadoop fs -text, for example).

Understanding different logs produced by Map-Reduce jobs.

Debugging Map Reduce jobs in eclipse


New -> project name -> right click -> Build Path -> Configure Build Path -> Add External JARs ->
user -> lib -> hadoop-0.20 -> choose all library files.

Again Add External JARs -> hadoop-0.20 -> lib -> choose all JAR files -> OK.

Right click on the source code of the driver code->run as->run configuration

Double click on Java Application -> Arguments -> specify the input and output paths.

Input path: home/training/file.txt

Output path: home/training/desktop/wordcountoutput -> Apply -> Close

How to set a breakpoint

Breakpoints can be placed in the mapper code, reducer code, or driver code.

Press Ctrl+Shift+B, or double-click on the line, to place a breakpoint.

Right click on the source code (driver code) -> Debug As -> Java Application -> Yes.

F8 is used to exit from one breakpoint and move to the next breakpoint.

F6 is used for stepping through the internal flow of the three code components (mapper, reducer, driver).

(or)
When we run Map-Reduce code in Eclipse, Hadoop runs in a special mode
called LocalJobRunner, under which all the Hadoop daemons run in a single JVM (Java Virtual
Machine) instead of several different JVMs.
The default file paths are set to local file paths, not HDFS paths.
Step1: Get the code from Git repository by issuing clone command as follows:
$ git clone https://github.com/prasadkhode/wordcount-Debug.git
Step2: Import the project into your Eclipse workspace

Step3: Setting Breakpoints:
To set breakpoints in your source code right-click in the small left margin in your source code
editor and select Toggle Breakpoint. Alternatively you can double-click on the line of the code
to debug.

Step4: Starting the Debugger:


To debug our application, select a Java file which can be executed (WordCountDriver.java),
right-click on it and select Debug As → Debug Configuration.
Go to “Arguments” tab and pass the input arguments, input file name and output folder name
as follows:

Click on Apply and then Debug

Step5: Controlling the program execution:


Eclipse provides buttons in the toolbar for controlling the execution of the program that we
are debugging. It is easier to use the corresponding keys to control this execution.
We can use the F5, F6, F7 and F8 keys to go through our code. Action of each of the four keys is
as presented below:
Key Description

F5: Executes the currently selected line and goes to the next line in our program. If the selected line is a method call, the debugger steps into the associated code.

F6: Steps over the call, i.e., it executes a method without stepping into it with the debugger.

F7: Steps out to the caller of the currently executed method. It finishes executing the current method and returns to the caller of this method.

F8: Tells the Eclipse debugger to resume execution of the program code until it reaches the next breakpoint.

UNIT-V
CASE STUDIES OF BIG DATA ANALYTICS USING MAP-
REDUCE PROGRAMMING

Introduction To K-Means Clustering :


K-means clustering is a type of unsupervised learning, which is used when you have
unlabeled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and
analyze the groups that have formed organically. The "Choosing K" section below
describes how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting
groups. Examining the centroid feature weights can be used to qualitatively interpret
what kind of group each cluster represents.
This introduction to the K-means clustering algorithm covers:
 Common business cases where K-means is used
 The steps involved in running the algorithm
 A Python example using delivery fleet data
Business Uses :
The K-means clustering algorithm is used to find groups which have not been explicitly
labeled in the data. This can be used to confirm business assumptions about what types
of groups exist or to identify unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data can be easily
assigned to the correct group.

This is a versatile algorithm that can be used for any type of grouping. Some examples of
use cases are:
 Behavioral segmentation:
 Segment by purchase history
 Segment by activities on application, website, or platform
 Define personas based on interests
 Create profiles based on activity monitoring
 Inventory categorization:
 Group inventory by sales activity
 Group inventory by manufacturing metrics
 Sorting sensor measurements:
 Detect activity types in motion sensors
 Group images
 Separate audio
 Identify groups in health monitoring
 Detecting bots or anomalies:
 Separate valid activity groups from bots
 Group valid activity to clean up outlier detection
In addition, monitoring if a tracked data point switches between groups over time can be
used to detect meaningful changes in the data.
Algorithm :
The Κ-means clustering algorithm uses iterative refinement to produce a final result. The
algorithm inputs are the number of clusters Κ and the data set. The data set is a
collection of features for each data point. The algorithm starts with initial estimates for
the Κ centroids, which can either be randomly generated or randomly selected from the
data set.

The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its
nearest centroid, based on the squared Euclidean distance. More formally, if c_i is a centroid
in the set C, then each data point x is assigned to the cluster whose centroid minimizes
dist(c_i, x)^2, where dist(·) is the standard (L2) Euclidean distance. Let the set of data point
assignments for the i-th cluster centroid be S_i.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data
points assigned to that centroid's cluster.
The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no
data points change clusters, the sum of the distances is minimized, or some maximum
number of iterations is reached).
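In symbols (using the notation above, where S_i is the set of points currently assigned to centroid c_i), the two steps can be written as:

\[
S_i = \{\, x : \operatorname{dist}(c_i, x)^2 \le \operatorname{dist}(c_j, x)^2 \ \text{for all } j \,\}
\qquad \text{(assignment step)}
\]
\[
c_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
\qquad \text{(centroid update step)}
\]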
This algorithm is guaranteed to converge to a result. The result may be a local optimum
(i.e. not necessarily the best possible outcome), meaning that assessing more than one
run of the algorithm with randomized starting centroids may give a better outcome.
Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen
K. To find the number of clusters in the data, the user needs to run the K-means
clustering algorithm for a range of K values and compare the results. In general, there is
no method for determining exact value of K, but an accurate estimate can be obtained
using the following techniques.
One of the metrics that is commonly used to compare results across different values
of K is the mean distance between data points and their cluster centroid. Since increasing
the number of clusters will always reduce the distance to data points,
increasing K will always decrease this metric, to the extreme of reaching zero when K is
the same as the number of data points.

Page | 123
Thus, this metric cannot be used as the sole target. Instead, mean distance to the
centroid as a function of K is plotted and the "elbow point," where the rate of decrease
sharply shifts, can be used to roughly determine K.
A number of other techniques exist for validating K, including cross-validation,
information criteria, the information theoretic jump method, the silhouette method, and
the G-means algorithm. In addition, monitoring the distribution of data points across
groups provides insight into how the algorithm is splitting the data for each K.

Example: Applying K-Means Clustering To Delivery Fleet Data


As an example, we'll show how the K-means algorithm works with a sample dataset of
delivery fleet driver data. For the sake of simplicity, we'll only be looking at two driver
features: mean distance driven per day and the mean percentage of time a driver was >5
mph over the speed limit. In general, this algorithm can be used for any number of
features, so long as the number of data samples is much greater than the number of
features.

Step 1: Clean and Transform Your Data
For this example, we've already cleaned and completed some simple data
transformations. A sample of the data as a pandas DataFrame is shown below.

The chart below shows the dataset for 4,000 drivers, with the distance feature on the x -
axis and speeding feature on the y-axis.

Step 2: Choose K and Run the Algorithm
Start by choosing K=2. For this example, use the Python packages
scikit-learn and NumPy for computations as shown below:
import numpy as np
from sklearn.cluster import KMeans
### For the purposes of this example, we store feature data from our
### dataframe `df`, in the `f1` and `f2` arrays. We combine this into
### a feature matrix `X` before entering it into the algorithm.
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X = np.column_stack((f1, f2))  # 2-column feature matrix; avoids np.matrix(zip(...)), which fails on Python 3
kmeans = KMeans(n_clusters=2).fit(X)
The cluster labels are returned in kmeans.labels_.
Step 3: Review the Results
The chart below shows the results. Visually, you can see that the K-means algorithm
splits the two groups based on the distance feature. Each cluster centroid is marked with
a star.
 Group 1 Centroid = (50, 5.2)
 Group 2 Centroid = (180.3, 10.5)
Using domain knowledge of the dataset, we can infer that Group 1 is urban drivers and
Group 2 is rural drivers.

Step 4: Iterate Over Several Values of K
Test how the results look for K=4. To do this, all you need to change is the target number
of clusters in the KMeans() function.
kmeans = KMeans(n_clusters=4).fit(X)
The chart below shows the resulting clusters. We see that four distinct groups have been
identified by the algorithm; now speeding drivers have been separated from those who
follow speed limits, in addition to the rural vs. urban divide. The threshold for speeding is
lower with the urban driver group than for the rural drivers, likely due to urban drivers
spending more time in intersections and stop-and-go traffic.

Mahout - Introduction
We are living in a day and age where information is available in abundance. The information
overload has scaled to such heights that sometimes it becomes difficult to manage our little
mailboxes! Imagine the volume of data and records some of the popular websites (the likes of
Facebook, Twitter, and Youtube) have to collect and manage on a daily basis. It is not
uncommon even for lesser known websites to receive huge amounts of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk data to identify trends and
draw conclusions. However, no data mining algorithm can be efficient enough to process very
large datasets and provide outcomes in quick time, unless the computational tasks are run on
multiple machines distributed over the cloud.

We now have new frameworks that allow us to break down a computation task into multiple
segments and run each segment on a different machine. Mahout is such a data mining
framework; it normally runs coupled with the Hadoop infrastructure in the background to
manage huge volumes of data.

What is Apache Mahout?


A mahout is one who drives an elephant as its master. The name comes from its close
association with Apache Hadoop which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows you to store and process big data
in a distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:

 Recommendation

 Classification

 Clustering

Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout became
a top level project of Apache.

Features of Mahout
The primitive features of Apache Mahout are listed below.

 The algorithms of Mahout are written on top of Hadoop, so it works well in distributed
environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.

 Mahout offers the coder a ready-to-use framework for doing data mining tasks on large
volumes of data.

 Mahout lets applications analyze large sets of data effectively and in quick time.

 Includes several MapReduce enabled clustering implementations such as k-means,


fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.

 Supports Distributed Naive Bayes and Complementary Naive Bayes classification


implementations.

 Comes with distributed fitness function capabilities for evolutionary programming.

 Includes matrix and vector libraries.

Applications of Mahout
 Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.

 Foursquare helps you in finding out places, food, and entertainment available in a
particular area. It uses the recommender engine of Mahout.

 Twitter uses Mahout for user interest modelling.

 Yahoo! uses Mahout for pattern mining.

Mahout - Environment
This chapter teaches you how to set up Mahout. Java and Hadoop are the prerequisites of
Mahout. Given below are the steps to download and install Java, Hadoop, and Mahout.

Pre-Installation Setup
Before installing Hadoop into Linux environment, we need to set up Linux using ssh (Secure
Shell). Follow the steps mentioned below for setting up the Linux environment.

Creating a User
It is recommended to create a separate user for Hadoop to isolate the Hadoop file system
from the Unix file system. Follow the steps given below to create a user:

 Open root using the command “su”.

 Create a user from the root account using the command “useradd username”.
 Now you can open an existing user account using the command “su username”.

 Open the Linux terminal and type the following commands to create a user.

$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd

SSH Setup and Key Generation


SSH setup is required to perform different operations on a cluster such as starting, stopping,
and distributed daemon shell operations. To authenticate different users of Hadoop, it is
required to provide public/private key pair for a Hadoop user and share it with different users.

The following commands are used to generate a key value pair using SSH, copy the public keys
form id_rsa.pub to authorized_keys, and provide owner, read and write permissions to
authorized_keys file respectively.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Verifying ssh
ssh localhost

Installing Java
Java is the main prerequisite for Hadoop and Mahout. First of all, you should verify the existence
of Java in your system using "java -version". The syntax of the Java version command is given
below.

$ java -version

It should produce the following output.

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you don’t have Java installed in your system, then follow the steps given below for installing
Java.

Step 1

Download java (JDK <latest version> - X64.tar.gz) by visiting the following link: Oracle

Then jdk-7u71-linux-x64.tar.gz is downloaded onto your system.

Step 2

Generally, you find the downloaded Java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step 3

To make Java available to all the users, you need to move it to the location “/usr/local/”. Open
root, and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now, verify the java -version command from terminal as explained above.

Downloading Hadoop
After installing Java, you need to install Hadoop. First, verify the existence of Hadoop using
the "hadoop version" command as shown below.

hadoop version

It should produce the following output:

Hadoop 2.6.0
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using
/home/hadoop/hadoop/share/hadoop/common/hadoopcommon-2.6.0.jar

If your system is unable to locate Hadoop, then download Hadoop and have it installed on
your system. Follow the commands given below to do so.

Download and extract hadoop-2.6.0 from apache software foundation using the following
commands.

$ su
password:
# cd /usr/local
# wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-
2.6.0/hadoop-2.6.0-src.tar.gz
# tar xzf hadoop-2.6.0-src.tar.gz
# mv hadoop-2.6.0/* hadoop/
# exit

Installing Hadoop
Install Hadoop in any of the required modes. Here, we are demonstrating Mahout functionalities
in pseudo-distributed mode, therefore install Hadoop in pseudo-distributed mode.

Follow the steps given below to install Hadoop 2.4.1 on your system.

Step 1: Setting up Hadoop


You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now, apply all changes into the currently running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration


You can find all the Hadoop configuration files at the location
“$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files
according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you need to reset the Java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of Java in
your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below are the list of files which you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for Hadoop instance,
memory allocated for file system, memory limit for storing data, and the size of Read/Write
buffers.

Open core-site.xml and add the following property in between the <configuration>,
</configuration> tags:

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, namenode
path, and datanode paths of your local file systems. It means the place where you want to
store the Hadoop infrastructure.

Let us assume the following data:

dfs.replication (data replication value) = 1

(In the below given path /hadoop/ is the user name.


hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)


datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.name.dir</name>

<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>

</property>

</configuration>

Note: In the above file, all the property values are user defined. You can make changes
according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the
following property in between the <configuration>, </configuration> tags in this file.

<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of mapred-site.xml. First of all, it is required to copy the file from mapred-
site.xml.template to mapred-site.xml file using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration> tags in this file.

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup


Set up the namenode using the command “hdfs namenode -format” as follows:

$ cd ~
$ hdfs namenode -format

The expected result is as follows:

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:


/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain
1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2: Verifying Hadoop dfs


The following command is used to start dfs. This command starts your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying Yarn Script


The following command is used to start the yarn script. Executing this command will start your
YARN daemons.

$ start-yarn.sh

The expected output is as follows:

starting yarn daemons


starting resource manager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-
hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4: Accessing Hadoop on Browser


The default port number to access hadoop is 50070. Use the following URL to get Hadoop
services on your browser.

http://localhost:50070/

Step 5: Verify All Applications for Cluster
The default port number to access all applications of the cluster is 8088. Use the following URL to
visit this service.

http://localhost:8088/

Downloading Mahout
Mahout is available on the Apache Mahout website. Download Mahout from the link provided
on the website.

Step 1
Download Apache mahout from the link http://mirror.nexcess.net/apache/mahout/ using the
following command.

[Hadoop@localhost ~]$ wget
http://mirror.nexcess.net/apache/mahout/0.9/mahout-distribution-0.9.tar.gz

Then mahout-distribution-0.9.tar.gz will be downloaded in your system.

Step 2
Browse through the folder where mahout-distribution-0.9.tar.gz is stored and extract the
downloaded archive as shown below.

[Hadoop@localhost ~]$ tar zxvf mahout-distribution-0.9.tar.gz

Maven Repository
Given below are the pom.xml dependencies to build Apache Mahout using Eclipse.

<dependency>

<groupId>org.apache.mahout</groupId>

<artifactId>mahout-core</artifactId>

<version>0.9</version>

</dependency>

<dependency>

<groupId>org.apache.mahout</groupId>

<artifactId>mahout-math</artifactId>

<version>${mahout.version}</version>

</dependency>

<dependency>

<groupId>org.apache.mahout</groupId>

<artifactId>mahout-integration</artifactId>

<version>${mahout.version}</version>

</dependency>
