
NAME:

ROLLNO:

YEAR/SEM: I YEAR / II SEM

BRANCH: M.TECH IN COMPUTER SCIENCE & ENGINEERING

SUBJECT: BIG DATA ANALYTICS LAB


EXERCISE-1:-

AIM:-

Implement the following Data Structures in Java

a) Linked Lists b) Stacks c) Queues d) Sets e) Maps

DESCRIPTION:

The java.util package contains all the classes and interfaces of the Collection framework.

Methods of the Collection interface

Many methods are declared in the Collection interface. They are as follows:

No.  Method                                     Description

1.   public boolean add(Object element)         is used to insert an element into this collection.
2.   public boolean addAll(Collection c)        is used to insert the elements of the specified collection into the invoking collection.
3.   public boolean remove(Object element)      is used to delete an element from this collection.
4.   public boolean removeAll(Collection c)     is used to delete all the elements of the specified collection from the invoking collection.
5.   public boolean retainAll(Collection c)     is used to delete all the elements of the invoking collection except those in the specified collection.
6.   public int size()                          returns the total number of elements in the collection.
7.   public void clear()                        removes all elements from the collection.
8.   public boolean contains(Object element)    is used to search for an element.
9.   public boolean containsAll(Collection c)   is used to search for the specified collection in this collection.
10.  public Iterator iterator()                 returns an iterator.
11.  public Object[] toArray()                  converts the collection into an array.
12.  public boolean isEmpty()                   checks whether the collection is empty.
SKELETON OF JAVA.UTIL.COLLECTION INTERFACE

public interface Collection<E> extends Iterable<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    Object[] toArray();
    <T> T[] toArray(T[] a);
    boolean add(E e);
    boolean remove(Object o);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
    boolean equals(Object o);
    int hashCode();
}
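The following short Java sketch (illustrative only; the list contents are made up) exercises the interface methods listed above through the ArrayList implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;

public class CollectionMethodsDemo {
    public static void main(String[] args) {
        Collection<String> branches = new ArrayList<>();
        branches.add("CSE");                                          // add(E e)
        branches.addAll(Arrays.asList("ECE", "EEE", "MECH"));         // addAll(Collection c)
        branches.remove("MECH");                                      // remove(Object o)
        branches.retainAll(Arrays.asList("CSE", "ECE"));              // retainAll(Collection c)
        System.out.println("size = " + branches.size());              // size()
        System.out.println("contains CSE? " + branches.contains("CSE")); // contains()
        Iterator<String> it = branches.iterator();                    // iterator()
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        Object[] asArray = branches.toArray();                        // toArray()
        System.out.println("array length = " + asArray.length);
        branches.clear();                                             // clear()
        System.out.println("isEmpty after clear? " + branches.isEmpty()); // isEmpty()
    }
}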
ALGORITHM for All Collection Data Structures:-

Steps of Creation of a Collection

1. Create an Object of Generic Type E, T, K or V
2. Create a Model class or Plain Old Java Object (POJO) of that type.
3. Generate Setters and Getters
4. Create a Collection Object of type Set, List, Map or Queue
5. Add Objects to the collection: boolean add(E e)
6. Add a Collection to the Collection: boolean addAll(Collection c)
7. Remove or retain data from the Collection: removeAll(Collection c), retainAll(Collection c)
8. Iterate Objects using Enumeration, Iterator or ListIterator: Iterator iterator(), ListIterator listIterator()
9. Display Objects from the Collection (see the sketch below)
10. END
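A minimal Java sketch of these steps is given below. The Employee POJO is a hypothetical model based on the sample employee.txt record layout shown under SAMPLE INPUT; only the first two fields are modeled here.

import java.util.*;

// Steps 2-3: Model class (POJO) with setters and getters.
class Employee {
    private String id;
    private String name;
    public Employee(String id, String name) { this.id = id; this.name = name; }
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    @Override public String toString() { return id + "," + name; }
}

public class CollectionDemo {
    public static void main(String[] args) {
        // Steps 4-6: create List, Stack, Queue, Set and Map objects and add elements.
        List<Employee> list = new LinkedList<>();        // a) Linked List
        Deque<Employee> stack = new ArrayDeque<>();      // b) Stack
        Queue<Employee> queue = new LinkedList<>();      // c) Queue
        Set<String> set = new HashSet<>();               // d) Set
        Map<String, Employee> map = new HashMap<>();     // e) Map

        Employee e100 = new Employee("e100", "james");
        Employee e101 = new Employee("e101", "jack");
        list.add(e100);
        list.addAll(Arrays.asList(e101));
        stack.push(e100);
        queue.offer(e101);
        set.add(e100.getId());
        map.put(e101.getId(), e101);

        // Steps 8-9: iterate and display objects from the collections.
        Iterator<Employee> it = list.iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        System.out.println("stack top: " + stack.peek());
        System.out.println("queue head: " + queue.peek());
        System.out.println("set: " + set);
        System.out.println("map lookup e101: " + map.get("e101"));
    }
}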
SAMPLE INPUT:

Sample Employee Data Set: (employee.txt)

e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3

EXPECTED OUTPUT:-

Prints the information of each employee with all of its attributes
EXERCISE-2:-

AIM:-

i) Perform setting up and installing Hadoop in its three operating modes:

 Standalone
 Pseudo-Distributed
 Fully Distributed

DESCRIPTION:

Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.

Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-

1. Command for installing ssh is "sudo apt-get install ssh".
2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract the java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract the eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract the hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move the java to /usr/lib/jvm/ and eclipse to /opt/ paths. Configure the java path in the eclipse.ini file.
8. Export the java path and hadoop path in .bashrc.
9. Check whether the installation is successful by checking the java version and hadoop version.
10. Check whether the hadoop instance in standalone mode is working correctly by using the built-in hadoop example jar named wordcount.
11. If the word count is displayed correctly in the part-r-00000 file, it means that standalone mode is installed successfully.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED MODE:-

1. In order to install pseudo distributed mode we need to configure the hadoop configuration files residing in the directory /home/ACEIT/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the java path.
3. Configure core-site.xml, which contains a property tag with a name and a value: name as fs.defaultFS and value as hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh, which start the daemons NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all daemons. Create a directory in hadoop by using the command hdfs dfs -mkdir /csedir, enter some data into ACEIT.txt using the command nano ACEIT.txt, copy it from the local directory to hadoop using the command hdfs dfs -copyFromLocal ACEIT.txt /csedir/ and run the sample jar file wordcount to check whether pseudo distributed mode is working or not.
10. Display the contents of the file by using the command hdfs dfs -cat /newdir/part-r-00000.

FULLY DISTRIBUTED MODE INSTALLATION:

ALGORITHM

1. Stop all single node clusters

$ stop-all.sh

2. Decide one machine as the NameNode (Master) and the remaining ones as DataNodes (Slaves).

3. Copy the public key to all three hosts to get password-less SSH access

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub ACEIT@l5sys24

4. Configure all configuration files to name the Master and Slave Nodes.

$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml

5. Add the host names to the file slaves and save it.

$ nano slaves

6. Configure yarn-site.xml

$ nano yarn-site.xml

7. Do in the Master Node

$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh

8. Format the NameNode
9. Daemons start in the Master and Slave Nodes
10. END
INPUT

ubuntu@localhost> jps

OUTPUT:

DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager

II) Using Web Based Tools to Manage a Hadoop Set-up

DESCRIPTION

A Hadoop set-up can be managed by different web based tools, which make it easy for the user to identify the running daemons. A few of the tools used in the real world are:

a) Apache Ambari

b) HortonWorks

c) Apache Spark

LIST OF CLUSTERS IN HADOOP

Apache Hadoop Running at Local Host

AMBARI Admin Page for Managing Hadoop Clusters

AMBARI Admin Page for Viewing Hadoop MapReduce Jobs

HortonWorks Tool for Managing MapReduce Jobs in Apache Pig

Running MapReduce Jobs in HortonWorks for a Pig Latin Script
EXERCISE-3:-

AIM:-

Implement the following file management tasks in Hadoop:

 Adding files and directories
 Retrieving files
 Deleting files

DESCRIPTION:-

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS keeps track of where the data resides in a network by associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain the data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work similarly to the Linux file commands, and serve as your primary interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line. We will take a look at the most common file management tasks in Hadoop, which include:

 Adding files and directories to HDFS
 Retrieving files from HDFS to the local filesystem
 Deleting files from HDFS

ALGORITHM:-

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1
Adding Files and Directories to HDFS

Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck

Step-2

Retrieving Files from HDFS

The Hadoop get command copies files from HDFS back to the local filesystem, while cat prints their contents to the console. To display example.txt, we can run the following command:

hadoop fs -cat example.txt

Step-3

Deleting Files from HDFS

hadoop fs -rm example.txt

 Command for creating a directory in hdfs is "hdfs dfs -mkdir /ACEITcse".
 Adding a directory is done through the command "hdfs dfs -put ACEIT_english /".

Step-4
Copying Data from NFS to HDFS

Copying from a local directory is done with the command "hdfs dfs -copyFromLocal /home/ACEIT/Desktop/shakes/glossary /ACEITcse/"

 View the file by using the command "hdfs dfs -cat /ACEIT_english/glossary"
 Command for listing of items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
 Command for deleting files is "hdfs dfs -rmr /kartheek".
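The same add, retrieve and delete tasks can also be performed programmatically through the org.apache.hadoop.fs.FileSystem API. The sketch below is illustrative only; the paths follow the shell examples above, and fs.defaultFS is assumed to be available from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file (equivalent to -mkdir and -put).
        Path dir = new Path("/user/chuck");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // Retrieving a file back to the local filesystem (equivalent to -get).
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example.copy.txt"));

        // Deleting a file (equivalent to -rm); 'false' means non-recursive.
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}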

SAMPLE INPUT:

Input as any data format of type structured, unstructured or semi-structured

EXPECTED OUTPUT:
EXERCISE-4:-

AIM:-

Run a basic WordCount MapReduce program to understand the MapReduce paradigm

DESCRIPTION:-

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

ALGORITHM

MAPREDUCE PROGRAM

WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:

1. Mapper

2. Reducer

3. Driver
Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper" which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.

The input value of the WordCount Map task will be a line of text from the input data file and the key would be the line number <line_number, line_of_text>. The Map task outputs <word, one> for each word in the line of text.

Pseudo-code

void Map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}
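A corresponding Java Mapper might look like the following sketch (class and field names are our own; with TextInputFormat the key supplied to map() is the byte offset of the line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = offset of the line, value = one line of text.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}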
Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs of the form <word, occurrence>.

Pseudo-code

void Reduce(keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
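A matching Java Reducer sketch (class names are our own):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();             // add up the 1s emitted for this word
        }
        result.set(sum);
        context.write(key, result);     // emit <word, occurrence>
    }
}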
Step-3. Write a Driver

The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:

Job Name: name of this Job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: class which overrides the "map" function. For here, Map.
Reducer: class which overrides the "reduce" function. For here, Reduce.
OutputKey: type of output key. For here, Text.
OutputValue: type of output value. For here, IntWritable.
File Input Path
File Output Path
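A possible Java Driver, wiring together the Mapper and Reducer sketched above (the class names and the two command-line arguments for the input and output paths are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // Job Name
        job.setJarByClass(WordCountDriver.class);                     // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);                    // Mapper Class
        job.setReducerClass(WordCountReducer.class);                  // Reducer Class
        job.setOutputKeyClass(Text.class);                            // OutputKey
        job.setOutputValueClass(IntWritable.class);                   // OutputValue
        FileInputFormat.addInputPath(job, new Path(args[0]));         // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));       // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}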
INPUT:-

Set of data related to Shakespeare's comedies, glossary and poems
OUTPUT:-
EXERCISE-5:-

AIM:-

Write a MapReduce program that mines weather data.

DESCRIPTION:

Climate change has been attracting a lot of attention for a long time. The adverse effects of this changing climate are being felt in every part of the earth. There are many examples of these, such as rising sea levels, less rainfall and an increase in humidity. The proposed system overcomes some issues that occur when using other techniques. In this project we use the concept of Big Data and Hadoop. In the proposed architecture we are able to process offline data stored by the National Climatic Data Center (NCDC). Through this we are able to find out the maximum temperature and the minimum temperature of a year, and are able to predict the future weather forecast. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month of a particular year to visualize the temperature. Based on the previous years' weather data, the weather of the coming year is predicted.

ALGORITHM:-

MAPREDUCE PROGRAM

The weather-mining program follows the same structure as WordCount: a Mapper extracts the temperature readings from each input record and a Reducer aggregates them to obtain the maximum and minimum temperatures. Our implementation consists of three main parts:

1. Mapper

2. Reducer

3. Main program (Driver)
Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper" which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.

The input value of the Map task will be a line of text from the weather data file and the key would be the line number <line_number, line_of_text>. The Map task outputs a pair for each maximum and minimum temperature reading found in the line of text.

Pseudo-code

void Map(key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}

void Map(key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
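As a concrete illustration, the Java Mapper sketch below emits <year, temperature> pairs. It assumes a simplified input format in which each record is a comma-separated line of the form year,temperature; the real NCDC records need their own parsing logic.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: "year,temperature", e.g. "1950,22".
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            return;                       // skip malformed records
        }
        String year = fields[0].trim();
        try {
            int temperature = Integer.parseInt(fields[1].trim());
            context.write(new Text(year), new IntWritable(temperature));
        } catch (NumberFormatException e) {
            // skip records with a missing or corrupt temperature reading
        }
    }
}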

Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program will aggregate the temperature readings into pairs of the form <temperature, occurrence>.

Pseudo-code

void Reduce(max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}

void Reduce(min_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
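The pseudo-code above follows the WordCount pattern of summing occurrences; for the stated aim of finding the maximum and minimum temperature per year, a reducer would instead track the extremes, as in this sketch (it pairs with the Mapper sketched in Step-1):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxMinTempReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        for (IntWritable t : temperatures) {
            max = Math.max(max, t.get());
            min = Math.min(min, t.get());
        }
        // emit <year, "max,min">
        context.write(year, new Text(max + "," + min));
    }
}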

Step-3. Write a Driver

The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:

Job Name: name of this Job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: class which overrides the "map" function. For here, Map.
Reducer: class which overrides the "reduce" function. For here, Reduce.
OutputKey: type of output key. For here, Text.
OutputValue: type of output value. For here, IntWritable.
File Input Path
File Output Path

INPUT:-

Set of weather data over the years

OUTPUT:-
EXERCISE-6:-

AIM:-

Write a MapReduce program that implements Matrix Multiplication.

DESCRIPTION:

We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix can be represented as a record (i, j, value). As an example let us consider the following matrix and its representation. It is important to understand that this relation is a very inefficient relation if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30 values. But if you consider the above relation we are storing 30 row_ids, 30 col_ids and 30 values; in other words we are tripling the data. So a natural question arises: why do we need to store it in this format? In practice most matrices are sparse matrices. In sparse matrices not all cells have values, so we don't have to store those cells in the DB, and this turns out to be very efficient for storing such matrices.

MapReduce Logic

The logic is to send the calculation part of each output cell of the result matrix to a reducer. So in matrix multiplication the first cell of the output, (0,0), is the multiplication and summation of elements from row 0 of matrix A and elements from column 0 of matrix B. To do the computation of the value in output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should carry the values from row 0 of matrix A and column 0 of matrix B. So in this algorithm the output from the map phase should be a <key, value> pair, where the key represents the output cell location, (0,0), (0,1) etc., and the value will be the list of all values required by the reducer to do its computation. Let us take the example of calculating the value at output cell (0,0). Here we need to collect values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
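A minimal Java sketch of this map phase is given below. The matrix dimensions are assumed to be passed through the job Configuration under hypothetical keys matrix.I and matrix.J, the sparse input records are assumed to be lines of the form i,j,value, and whether a record belongs to A or B is taken from the input file name; all of these are illustrative assumptions, not part of the algorithm text below.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits every (row, col) output cell that an input element contributes to,
// so that a single reducer can compute one cell of C = A * B.
public class MatrixCellMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int I;   // rows of A (and C)
    private int J;   // columns of B (and C)

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        I = conf.getInt("matrix.I", 0);   // assumed job parameters
        J = conf.getInt("matrix.J", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed sparse record layout: "i,j,value".
        String[] rec = value.toString().split(",");
        int i = Integer.parseInt(rec[0].trim());
        int j = Integer.parseInt(rec[1].trim());
        String v = rec[2].trim();
        // Which matrix this record belongs to is taken from the input file name.
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        if (file.startsWith("A")) {
            // A(i,k) contributes to every cell in output row i.
            for (int col = 0; col < J; col++) {
                context.write(new Text(i + "," + col), new Text("A," + j + "," + v));
            }
        } else {
            // B(k,j) contributes to every cell in output column j.
            for (int row = 0; row < I; row++) {
                context.write(new Text(row + "," + j), new Text("B," + i + "," + v));
            }
        }
    }
}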
ALGORITHM

We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i, j) and each value is the corresponding matrix element value. The output files for the matrix C = A*B are in the same format.

We have the following input parameters:

The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

In the pseudo-code for the individual strategies below, we have intentionally avoided factoring out common code for the purposes of clarity.

Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.

Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our focus here is on mastering the MapReduce complexities, not on optimizing the sequential matrix multiplication algorithm for the individual blocks.
Steps

1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key = (i,k) and value = a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key = (k,j) and value = b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:

11. r = ((ib*JB + jb)*KB + kb) mod R

12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
Reduce(key, valueList)

17. if key is (ib, kb, jb, 0)
18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList: A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return  // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList: B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
    a. sum += A(i,k) * B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum

INPUT:-

Sets of data over different clusters are taken as rows and columns
OUTPUT:-
EXERCISE-7:-

AIM:-

Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data.

DESCRIPTION

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL users can specify that data from two tables must be joined, but not what join implementation to use (you can specify the implementation of JOIN in SQL, thus "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.
ALGORITHM

STEPS FOR INSTALLING APACHE PIG

1) Extract pig-0.15.0.tar.gz and move it to the home directory

2) Set the environment of PIG in the bashrc file.

3) Pig can run in two modes, Local Mode and Hadoop Mode:
pig -x local and pig

4) Grunt Shell
grunt>

5) LOADING Data into the Grunt Shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) as (ATTRIBUTE : DataType1, ATTRIBUTE : DataType2 ...)

6) Describe Data

DESCRIBE DATA;

7) DUMP Data

DUMP DATA;

8) FILTER Data

FDATA = FILTER DATA BY ATTRIBUTE = VALUE;

9) GROUP Data

GDATA = GROUP DATA BY ATTRIBUTE;

10) Iterating Data

FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN, ATTRIBUTE = <VALUE>

11) Sorting Data

SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;

12) LIMIT Data

LIMIT_DATA = LIMIT DATA COUNT;

13) JOIN Data

JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2 ...), DATA2 BY (ATTRIBUTE3 ... ATTRIBUTE N)

INPUT:

Input as Website Click Count Data

OUTPUT:
EXERCISE-8:-

AIM:-

Install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.

DESCRIPTION

Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

ALGORITHM:

Apache HIVE INSTALLATION STEPS

1) Install MySQL Server
sudo apt-get install mysql-server

2) Configure the MySQL user name and password

3) Create a user and grant all privileges
mysql -uroot -proot
create user <USER_NAME> identified by <PASSWORD>

4) Extract and configure Apache Hive
tar xvfz apache-hive-1.0.1.bin.tar.gz

5) Move Apache Hive from the local directory to the home directory

6) Set CLASSPATH in bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin

7) Configure hive-default.xml by adding the MySQL server credentials

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>

8) Copy mysql-java-connector.jar to the hive/lib directory.

SYNTAX for HIVE Database Operations

DATABASE Creation

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Drop Database Statement

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Creating and Dropping a Table in HIVE

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]

Loading Data into table log_data

Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

Altering a Table in HIVE

Syntax

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Creating and Dropping a View

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...

Dropping a View

Syntax:

DROP VIEW view_name

Functions in HIVE

String and Numeric Functions:- round(), ceil(), substr(), upper(), regexp() etc.

Date and Time Functions:- year(), month(), day(), to_date() etc.

Aggregate Functions:- sum(), min(), max(), count(), avg() etc.

INDEXES

CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
  [ROW FORMAT ...] STORED AS ...
  | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Creating an Index

CREATE INDEX index_ip ON TABLE log_data(ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

Altering and Inserting an Index

ALTER INDEX index_ip_address ON log_data REBUILD;

Storing Index Data in the Metastore

SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;

SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

Dropping an Index

DROP INDEX INDEX_NAME ON TABLE_NAME;

INPUT

Input as Web Server Log Data
OUTPUT
EXERCISE-10:-

AIM: Write a program to implement combining and partitioning in Hadoop by implementing a custom Partitioner and Combiner

DESCRIPTION:

COMBINERS:

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. One can think of combiners as mini-reducers that run on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and the values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners.

PARTITIONERS

A common misconception among first-time MapReduce programmers is to use only a single reducer. It is easy to see that such a constraint rarely makes sense and that using more than one reducer is necessary most of the time; otherwise the map/reduce concept would not be very useful. With multiple reducers, we need some way to determine the appropriate one to send a (key/value) pair outputted by a mapper to. The default behavior is to hash the key to determine the reducer. The partitioning phase takes place after the map phase and before the reduce phase. The number of partitions is equal to the number of reducers. The data gets partitioned across the reducers according to the partitioning function. This approach improves the overall performance and allows mappers to operate completely independently. For all of its output key/value pairs, each mapper determines which reducer will receive them. Because all the mappers use the same partitioning for any key, regardless of which mapper instance generated it, the destination partition is the same. Hadoop uses an interface called Partitioner to determine which partition a key/value pair will go to. A single partition refers to all key/value pairs that will be sent to a single reduce task. You can configure the number of reducers in a job driver by setting the number of reducers on the job object (job.setNumReduceTasks). Hadoop comes with a default partitioner implementation, HashPartitioner, which hashes a record's key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job. When the map function starts producing output, it is not simply written to disk. Each map task has a circular memory buffer that it writes the output to. When the contents of the buffer reach a certain threshold size, a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
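For example, because addition is associative and commutative, the WordCount reducer can be registered directly as the combiner. The driver sketch below reuses the hypothetical WordCountMapper and WordCountReducer classes sketched in Exercise 4:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDemoDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerDemoDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Addition is associative and commutative, so the reducer can also
        // serve as the combiner; Hadoop may run it zero, one or many times.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}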
ALGORITHM:

COMBINING

1) Divide the data source (the data files) into fragments or blocks which are sent to a mapper. These are called splits.
2) These splits are further divided into records, and these records are provided one at a time to the mapper for processing. This is achieved through a class called a RecordReader.
   Create a class that extends TextInputFormat to create our own NLinesInputFormat.
   Then create our own RecordReader class called NLinesRecordReader, where we will implement the logic of feeding 3 lines/records at a time.
   Make a change in the driver program to use the new NLinesInputFormat class.
   To prove that we are really getting 3 lines at a time, instead of actually counting words (which we already know how to do), emit the number of lines received in the input at a time as a key and 1 as a value, which, after going through the reducer, will give the frequency of each unique number of lines given to the mappers.

PARTITIONING

1. First, the key that appears most often will be sent to one partition.

2. Second, all other keys will be sent to partitions according to their hashCode() (a minimal sketch of such a custom partitioner is given below).

3. Now suppose your hashCode() method does not uniformly distribute the other keys' data over the partition range. Then the data is not evenly distributed across partitions, and hence across reducers, since each partition is processed by one reducer. Some reducers will therefore have more data than others, and the other reducers will wait for the one reducer (the one handling the user-defined keys) because of the workload it carries.
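A minimal sketch of such a custom Partitioner is shown below; the "hot" key that gets its own partition is a hypothetical example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends one heavily occurring key to its own reducer and spreads the
// remaining keys over the other reducers by hashCode().
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        if (key.toString().equals("hadoop")) {   // hypothetical hot key
            return 0;                            // dedicated partition
        }
        // Remaining keys go to partitions 1 .. numPartitions-1.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

In the driver it would be wired in with job.setPartitionerClass(SkewAwarePartitioner.class), and job.setNumReduceTasks(3) would produce three output partitions, matching the three part-r-0000x files shown in the sample output below.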
INPUT:

Datasets from different sources as input

OUTPUT:

hduser@ACEIT-3446:/usr/local/hadoop/bin$ hadoop fs -ls /partitionerOutput/
14/12/01 17:50:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 hduser supergroup    0 2014-12-01 17:49 /partitionerOutput/_SUCCESS
-rw-r--r--   1 hduser supergroup   10 2014-12-01 17:48 /partitionerOutput/part-r-00000
-rw-r--r--   1 hduser supergroup   10 2014-12-01 17:48 /partitionerOutput/part-r-00001
-rw-r--r--   1 hduser supergroup    9 2014-12-01 17:49 /partitionerOutput/part-r-00002
