
NAME:

ROLLNO:

YEAR/SEM: I YEAR / II SEM

BRANCH: M.TECH IN COMPUTER SCIENCE & ENGINEERING

SUBJECT: BIG DATA ANALYTICS LAB


EXERCISE-1:-

AIM:-

Implement the following Data Structures in Java

a) Linked Lists b) Stacks c) Queues d) Sets e) Maps

DESCRIPTION:

The java.util package contains all the classes and interfaces of the Collection framework.

Methods of the Collection interface

Many methods are declared in the Collection interface. They are as follows:

No.  Method                                     Description

1.   public boolean add(Object element)         is used to insert an element into this collection.
2.   public boolean addAll(Collection c)        is used to insert the elements of the specified collection into the invoking collection.
3.   public boolean remove(Object element)      is used to delete an element from this collection.
4.   public boolean removeAll(Collection c)     is used to delete all the elements of the specified collection from the invoking collection.
5.   public boolean retainAll(Collection c)     is used to delete all the elements of the invoking collection except those in the specified collection.
6.   public int size()                          returns the total number of elements in the collection.
7.   public void clear()                        removes all elements from the collection.
8.   public boolean contains(Object element)    is used to search for an element.
9.   public boolean containsAll(Collection c)   is used to search for the specified collection in this collection.
10.  public Iterator iterator()                 returns an iterator.
11.  public Object[] toArray()                  converts the collection into an array.
12.  public boolean isEmpty()                   checks whether the collection is empty.
SKELETON OF JAVA.UTIL.COLLECTION INTERFACE

public interface Collection<E> extends Iterable<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    Object[] toArray();
    <T> T[] toArray(T[] a);
    boolean add(E e);
    boolean remove(Object o);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
    boolean equals(Object o);
    int hashCode();
}
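The following short Java sketch (illustrative only; the list contents are made up) exercises the interface methods listed above through the ArrayList implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;

public class CollectionMethodsDemo {
    public static void main(String[] args) {
        Collection<String> branches = new ArrayList<>();
        branches.add("CSE");                                          // add(E e)
        branches.addAll(Arrays.asList("ECE", "EEE", "MECH"));         // addAll(Collection c)
        branches.remove("MECH");                                      // remove(Object o)
        branches.retainAll(Arrays.asList("CSE", "ECE"));              // retainAll(Collection c)
        System.out.println("size = " + branches.size());              // size()
        System.out.println("contains CSE? " + branches.contains("CSE")); // contains()
        Iterator<String> it = branches.iterator();                    // iterator()
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        Object[] asArray = branches.toArray();                        // toArray()
        System.out.println("array length = " + asArray.length);
        branches.clear();                                             // clear()
        System.out.println("isEmpty after clear? " + branches.isEmpty()); // isEmpty()
    }
}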
ALGORITHM for All Collection Data Structures:-

Steps of Creation of a Collection

1. Create an Object of Generic Type E, T, K or V
2. Create a Model class or Plain Old Java Object (POJO) of that type.
3. Generate Setters and Getters
4. Create a Collection Object of type Set, List, Map or Queue
5. Add Objects to the collection: boolean add(E e)
6. Add a Collection to the Collection: boolean addAll(Collection c)
7. Remove or retain data from the Collection: removeAll(Collection c), retainAll(Collection c)
8. Iterate Objects using Enumeration, Iterator or ListIterator: Iterator iterator(), ListIterator listIterator()
9. Display Objects from the Collection (see the sketch below)
10. END
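A minimal Java sketch of these steps is given below. The Employee POJO is a hypothetical model based on the sample employee.txt record layout shown under SAMPLE INPUT; only the first two fields are modeled here.

import java.util.*;

// Steps 2-3: Model class (POJO) with setters and getters.
class Employee {
    private String id;
    private String name;
    public Employee(String id, String name) { this.id = id; this.name = name; }
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    @Override public String toString() { return id + "," + name; }
}

public class CollectionDemo {
    public static void main(String[] args) {
        // Steps 4-6: create List, Stack, Queue, Set and Map objects and add elements.
        List<Employee> list = new LinkedList<>();        // a) Linked List
        Deque<Employee> stack = new ArrayDeque<>();      // b) Stack
        Queue<Employee> queue = new LinkedList<>();      // c) Queue
        Set<String> set = new HashSet<>();               // d) Set
        Map<String, Employee> map = new HashMap<>();     // e) Map

        Employee e100 = new Employee("e100", "james");
        Employee e101 = new Employee("e101", "jack");
        list.add(e100);
        list.addAll(Arrays.asList(e101));
        stack.push(e100);
        queue.offer(e101);
        set.add(e100.getId());
        map.put(e101.getId(), e101);

        // Steps 8-9: iterate and display objects from the collections.
        Iterator<Employee> it = list.iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        System.out.println("stack top: " + stack.peek());
        System.out.println("queue head: " + queue.peek());
        System.out.println("set: " + set);
        System.out.println("map lookup e101: " + map.get("e101"));
    }
}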
SAMPLE INPUT:

Sample Employee Data Set: (employee.txt)

e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3

EXPECTED OUTPUT:-

Prints the information of each employee with all of its attributes
EXERCISE-2:-

AIM:-

i) Perform setting up and installing Hadoop in its three operating modes:

 Standalone
 Pseudo-Distributed
 Fully Distributed

DESCRIPTION:

Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.

Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-

1. Command for installing ssh is "sudo apt-get install ssh".
2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract the java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract the eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract the hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move the java to /usr/lib/jvm/ and eclipse to /opt/ paths. Configure the java path in the eclipse.ini file.
8. Export the java path and hadoop path in .bashrc.
9. Check whether the installation is successful by checking the java version and hadoop version.
10. Check whether the hadoop instance in standalone mode is working correctly by using the built-in hadoop example jar named wordcount.
11. If the word count is displayed correctly in the part-r-00000 file, it means that standalone mode is installed successfully.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED MODE:-

1. In order to install pseudo distributed mode we need to configure the hadoop configuration files residing in the directory /home/ACEIT/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the java path.
3. Configure core-site.xml, which contains a property tag with a name and a value: name as fs.defaultFS and value as hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh, which start the daemons NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all daemons. Create a directory in hadoop by using the command hdfs dfs -mkdir /csedir, enter some data into ACEIT.txt using the command nano ACEIT.txt, copy it from the local directory to hadoop using the command hdfs dfs -copyFromLocal ACEIT.txt /csedir/ and run the sample jar file wordcount to check whether pseudo distributed mode is working or not.
10. Display the contents of the file by using the command hdfs dfs -cat /newdir/part-r-00000.

FULLY DISTRIBUTED MODE INSTALLATION:

ALGORITHM

1. Stop all single node clusters

$ stop-all.sh

2. Decide one machine as the NameNode (Master) and the remaining ones as DataNodes (Slaves).

3. Copy the public key to all three hosts to get password-less SSH access

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub ACEIT@l5sys24

4. Configure all configuration files to name the Master and Slave Nodes.

$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml

5. Add the host names to the file slaves and save it.

$ nano slaves

6. Configure yarn-site.xml

$ nano yarn-site.xml

7. Do in the Master Node

$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh

8. Format the NameNode
9. Daemons start in the Master and Slave Nodes
10. END
INPUT

ubuntu@localhost> jps

OUTPUT:

DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager

II) Using Web Based Tools to Manage a Hadoop Set-up

DESCRIPTION

A Hadoop set-up can be managed by different web based tools, which make it easy for the user to identify the running daemons. A few of the tools used in the real world are:

a) Apache Ambari

b) HortonWorks

c) Apache Spark

LIST OF CLUSTERS IN HADOOP

Apache Hadoop Running at Local Host

AMBARI Admin Page for Managing Hadoop Clusters

AMBARI Admin Page for Viewing Hadoop MapReduce Jobs

HortonWorks Tool for Managing MapReduce Jobs in Apache Pig

Running MapReduce Jobs in HortonWorks for a Pig Latin Script
EXERCISE-3:-

AIM:-

Implement the following file management tasks in Hadoop:

 Adding files and directories
 Retrieving files
 Deleting files

DESCRIPTION:-

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS keeps track of where the data resides in a network by associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain the data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work similarly to the Linux file commands, and serve as your primary interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line. We will take a look at the most common file management tasks in Hadoop, which include:

 Adding files and directories to HDFS
 Retrieving files from HDFS to the local filesystem
 Deleting files from HDFS

ALGORITHM:-

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1
Adding Files and Directories to HDFS

Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck

Step-2

Retrieving Files from HDFS

The Hadoop get command copies files from HDFS back to the local filesystem, while cat prints their contents to the console. To display example.txt, we can run the following command:

hadoop fs -cat example.txt

Step-3

Deleting Files from HDFS

hadoop fs -rm example.txt

 Command for creating a directory in hdfs is "hdfs dfs -mkdir /ACEITcse".
 Adding a directory is done through the command "hdfs dfs -put ACEIT_english /".

Step-4
Copying Data from NFS to HDFS

Copying from a local directory is done with the command "hdfs dfs -copyFromLocal /home/ACEIT/Desktop/shakes/glossary /ACEITcse/"

 View the file by using the command "hdfs dfs -cat /ACEIT_english/glossary"
 Command for listing of items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
 Command for deleting files is "hdfs dfs -rmr /kartheek".
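The same add, retrieve and delete tasks can also be performed programmatically through the org.apache.hadoop.fs.FileSystem API. The sketch below is illustrative only; the paths follow the shell examples above, and fs.defaultFS is assumed to be available from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file (equivalent to -mkdir and -put).
        Path dir = new Path("/user/chuck");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // Retrieving a file back to the local filesystem (equivalent to -get).
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example.copy.txt"));

        // Deleting a file (equivalent to -rm); 'false' means non-recursive.
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}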

SAMPLE INPUT:

Input as any data format of type structured, unstructured or semi-structured

EXPECTED OUTPUT:
EXERCISE-4:-

AIM:-

Run a basic WordCount MapReduce program to understand the MapReduce paradigm

DESCRIPTION:-

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

ALGORITHM

MAPREDUCE PROGRAM

WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:

1. Mapper

2. Reducer

3. Driver
Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper" which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.

The input value of the WordCount Map task will be a line of text from the input data file and the key would be the line number <line_number, line_of_text>. The Map task outputs <word, one> for each word in the line of text.

Pseudo-code

void Map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}
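A corresponding Java Mapper might look like the following sketch (class and field names are our own; with TextInputFormat the key supplied to map() is the byte offset of the line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = offset of the line, value = one line of text.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}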
Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs of the form <word, occurrence>.

Pseudo-code

void Reduce(keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
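A matching Java Reducer sketch (class names are our own):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();             // add up the 1s emitted for this word
        }
        result.set(sum);
        context.write(key, result);     // emit <word, occurrence>
    }
}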
Step-3. Write a Driver

The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:

Job Name: name of this Job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: class which overrides the "map" function. For here, Map.
Reducer: class which overrides the "reduce" function. For here, Reduce.
OutputKey: type of output key. For here, Text.
OutputValue: type of output value. For here, IntWritable.
File Input Path
File Output Path
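A possible Java Driver, wiring together the Mapper and Reducer sketched above (the class names and the two command-line arguments for the input and output paths are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // Job Name
        job.setJarByClass(WordCountDriver.class);                     // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);                    // Mapper Class
        job.setReducerClass(WordCountReducer.class);                  // Reducer Class
        job.setOutputKeyClass(Text.class);                            // OutputKey
        job.setOutputValueClass(IntWritable.class);                   // OutputValue
        FileInputFormat.addInputPath(job, new Path(args[0]));         // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));       // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}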
INPUT:-

Set of data related to Shakespeare's comedies, glossary and poems
OUTPUT:-
EXERCISE-5:-

AIM:-

Write a MapReduce program that mines weather data.

DESCRIPTION:

Climate change has been attracting a lot of attention for a long time. The adverse effects of this changing climate are being felt in every part of the earth. There are many examples of these, such as rising sea levels, less rainfall and an increase in humidity. The proposed system overcomes some issues that occur when using other techniques. In this project we use the concept of Big Data and Hadoop. In the proposed architecture we are able to process offline data stored by the National Climatic Data Center (NCDC). Through this we are able to find out the maximum temperature and the minimum temperature of a year, and are able to predict the future weather forecast. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month of a particular year to visualize the temperature. Based on the previous years' weather data, the weather of the coming year is predicted.

ALGORITHM:-

MAPREDUCE PROGRAM

The weather-mining program follows the same structure as WordCount: a Mapper extracts the temperature readings from each input record and a Reducer aggregates them to obtain the maximum and minimum temperatures. Our implementation consists of three main parts:

1. Mapper

2. Reducer

3. Main program (Driver)
Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper" which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.

The input value of the Map task will be a line of text from the weather data file and the key would be the line number <line_number, line_of_text>. The Map task outputs a pair for each maximum and minimum temperature reading found in the line of text.

Pseudo-code

void Map(key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}

void Map(key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
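As a concrete illustration, the Java Mapper sketch below emits <year, temperature> pairs. It assumes a simplified input format in which each record is a comma-separated line of the form year,temperature; the real NCDC records need their own parsing logic.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: "year,temperature", e.g. "1950,22".
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            return;                       // skip malformed records
        }
        String year = fields[0].trim();
        try {
            int temperature = Integer.parseInt(fields[1].trim());
            context.write(new Text(year), new IntWritable(temperature));
        } catch (NumberFormatException e) {
            // skip records with a missing or corrupt temperature reading
        }
    }
}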

Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program will aggregate the temperature readings into pairs of the form <temperature, occurrence>.

Pseudo-code

void Reduce(max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}

void Reduce(min_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
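The pseudo-code above follows the WordCount pattern of summing occurrences; for the stated aim of finding the maximum and minimum temperature per year, a reducer would instead track the extremes, as in this sketch (it pairs with the Mapper sketched in Step-1):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxMinTempReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        for (IntWritable t : temperatures) {
            max = Math.max(max, t.get());
            min = Math.min(min, t.get());
        }
        // emit <year, "max,min">
        context.write(year, new Text(max + "," + min));
    }
}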

Step-3. Write a Driver

The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:

Job Name: name of this Job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: class which overrides the "map" function. For here, Map.
Reducer: class which overrides the "reduce" function. For here, Reduce.
OutputKey: type of output key. For here, Text.
OutputValue: type of output value. For here, IntWritable.
File Input Path
File Output Path

INPUT:-

Set of weather data over the years

OUTPUT:-
EXERCISE-6:-

AIM:-

Write a MapReduce program that implements Matrix Multiplication.

DESCRIPTION:

We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix can be represented as a record (i, j, value). As an example let us consider the following matrix and its representation. It is important to understand that this relation is a very inefficient relation if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30 values. But if you consider the above relation we are storing 30 row_ids, 30 col_ids and 30 values; in other words we are tripling the data. So a natural question arises: why do we need to store it in this format? In practice most matrices are sparse matrices. In sparse matrices not all cells have values, so we don't have to store those cells in the DB, and this turns out to be very efficient for storing such matrices.

MapReduce Logic

The logic is to send the calculation part of each output cell of the result matrix to a reducer. So in matrix multiplication the first cell of the output, (0,0), is the multiplication and summation of elements from row 0 of matrix A and elements from column 0 of matrix B. To do the computation of the value in output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should carry the values from row 0 of matrix A and column 0 of matrix B. So in this algorithm the output from the map phase should be a <key, value> pair, where the key represents the output cell location, (0,0), (0,1) etc., and the value will be the list of all values required by the reducer to do its computation. Let us take the example of calculating the value at output cell (0,0). Here we need to collect values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
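A minimal Java sketch of this map phase is given below. The matrix dimensions are assumed to be passed through the job Configuration under hypothetical keys matrix.I and matrix.J, the sparse input records are assumed to be lines of the form i,j,value, and whether a record belongs to A or B is taken from the input file name; all of these are illustrative assumptions, not part of the algorithm text below.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits every (row, col) output cell that an input element contributes to,
// so that a single reducer can compute one cell of C = A * B.
public class MatrixCellMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int I;   // rows of A (and C)
    private int J;   // columns of B (and C)

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        I = conf.getInt("matrix.I", 0);   // assumed job parameters
        J = conf.getInt("matrix.J", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed sparse record layout: "i,j,value".
        String[] rec = value.toString().split(",");
        int i = Integer.parseInt(rec[0].trim());
        int j = Integer.parseInt(rec[1].trim());
        String v = rec[2].trim();
        // Which matrix this record belongs to is taken from the input file name.
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        if (file.startsWith("A")) {
            // A(i,k) contributes to every cell in output row i.
            for (int col = 0; col < J; col++) {
                context.write(new Text(i + "," + col), new Text("A," + j + "," + v));
            }
        } else {
            // B(k,j) contributes to every cell in output column j.
            for (int row = 0; row < I; row++) {
                context.write(new Text(row + "," + j), new Text("B," + i + "," + v));
            }
        }
    }
}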
ALGORITHM

We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i, j) and each value is the corresponding matrix element value. The output files for the matrix C = A*B are in the same format.

We have the following input parameters:

The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

In the pseudo-code for the individual strategies below, we have intentionally avoided factoring out common code for the purposes of clarity.

Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.

Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our focus here is on mastering the MapReduce complexities, not on optimizing the sequential matrix multiplication algorithm for the individual blocks.
Steps

1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key = (i,k) and value = a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key = (k,j) and value = b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:

11. r = ((ib*JB + jb)*KB + kb) mod R

12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
Reduce(key, valueList)

17. if key is (ib, kb, jb, 0)
18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList: A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return  // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList: B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
    a. sum += A(i,k) * B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum

INPUT:-

Sets of data over different clusters are taken as rows and columns
OUTPUT:-
EXERCISE-7:-

AIM:-

Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data.

DESCRIPTION

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL users can specify that data from two tables must be joined, but not what join implementation to use (you can specify the implementation of JOIN in SQL, thus "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.
ALGORITHM

STEPS FOR INSTALLING APACHE PIG

1) Extract pig-0.15.0.tar.gz and move it to the home directory

2) Set the environment of PIG in the bashrc file.

3) Pig can run in two modes, Local Mode and Hadoop Mode:
pig -x local and pig

4) Grunt Shell
grunt>

5) LOADING Data into the Grunt Shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) as (ATTRIBUTE : DataType1, ATTRIBUTE : DataType2 ...)

6) Describe Data

DESCRIBE DATA;

7) DUMP Data

DUMP DATA;

8) FILTER Data

FDATA = FILTER DATA BY ATTRIBUTE = VALUE;

9) GROUP Data

GDATA = GROUP DATA BY ATTRIBUTE;

10) Iterating Data

FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN, ATTRIBUTE = <VALUE>

11) Sorting Data

SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;

12) LIMIT Data

LIMIT_DATA = LIMIT DATA COUNT;

13) JOIN Data

JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2 ...), DATA2 BY (ATTRIBUTE3 ... ATTRIBUTE N)

INPUT:

Input as Website Click Count Data

OUTPUT:
EXERCISE-8:-

AIM:-

Install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.

DESCRIPTION

Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

ALGORITHM:

Apache HIVE INSTALLATION STEPS

1) Install MySQL Server
sudo apt-get install mysql-server

2) Configure the MySQL user name and password

3) Create a user and grant all privileges
mysql -uroot -proot
create user <USER_NAME> identified by <PASSWORD>

4) Extract and configure Apache Hive
tar xvfz apache-hive-1.0.1.bin.tar.gz

5) Move Apache Hive from the local directory to the home directory

6) Set CLASSPATH in bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin

7) Configure hive-default.xml by adding the MySQL server credentials

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>

8) Copy mysql-java-connector.jar to the hive/lib directory.

SYNTAX for HIVE Database Operations

DATABASE Creation

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Drop Database Statement

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Creating and Dropping a Table in HIVE

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]

Loading Data into table log_data

Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

Altering a Table in HIVE

Syntax

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Creating and Dropping a View

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...

Dropping a View

Syntax:

DROP VIEW view_name

Functions in HIVE

String and Numeric Functions:- round(), ceil(), substr(), upper(), regexp() etc.

Date and Time Functions:- year(), month(), day(), to_date() etc.

Aggregate Functions:- sum(), min(), max(), count(), avg() etc.

INDEXES

CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
  [ROW FORMAT ...] STORED AS ...
  | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Creating an Index

CREATE INDEX index_ip ON TABLE log_data(ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

Altering and Inserting an Index

ALTER INDEX index_ip_address ON log_data REBUILD;

Storing Index Data in the Metastore

SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;

SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

Dropping an Index

DROP INDEX INDEX_NAME ON TABLE_NAME;

INPUT

Input as Web Server Log Data
OUTPUT
EXERCISE-10:-

AIM: Write a program to implement combining and partitioning in Hadoop by implementing a custom Partitioner and Combiner

DESCRIPTION:

COMBINERS:

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. One can think of combiners as mini-reducers that run on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and the values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners.

PARTITIONERS

A common misconception among first-time MapReduce programmers is to use only a single reducer. It is easy to see that such a constraint rarely makes sense and that using more than one reducer is necessary most of the time; otherwise the map/reduce concept would not be very useful. With multiple reducers, we need some way to determine the appropriate one to send a (key/value) pair outputted by a mapper to. The default behavior is to hash the key to determine the reducer. The partitioning phase takes place after the map phase and before the reduce phase. The number of partitions is equal to the number of reducers. The data gets partitioned across the reducers according to the partitioning function. This approach improves the overall performance and allows mappers to operate completely independently. For all of its output key/value pairs, each mapper determines which reducer will receive them. Because all the mappers use the same partitioning for any key, regardless of which mapper instance generated it, the destination partition is the same. Hadoop uses an interface called Partitioner to determine which partition a key/value pair will go to. A single partition refers to all key/value pairs that will be sent to a single reduce task. You can configure the number of reducers in a job driver by setting the number of reducers on the job object (job.setNumReduceTasks). Hadoop comes with a default partitioner implementation, HashPartitioner, which hashes a record's key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job. When the map function starts producing output, it is not simply written to disk. Each map task has a circular memory buffer that it writes the output to. When the contents of the buffer reach a certain threshold size, a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
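For example, because addition is associative and commutative, the WordCount reducer can be registered directly as the combiner. The driver sketch below reuses the hypothetical WordCountMapper and WordCountReducer classes sketched in Exercise 4:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDemoDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerDemoDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Addition is associative and commutative, so the reducer can also
        // serve as the combiner; Hadoop may run it zero, one or many times.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}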
ALGORITHM:

COMBINING

1) Divide the data source (the data files) into fragments or blocks which are sent to a mapper. These are called splits.
2) These splits are further divided into records, and these records are provided one at a time to the mapper for processing. This is achieved through a class called a RecordReader.
   Create a class that extends TextInputFormat to create our own NLinesInputFormat.
   Then create our own RecordReader class called NLinesRecordReader, where we will implement the logic of feeding 3 lines/records at a time.
   Make a change in the driver program to use the new NLinesInputFormat class.
   To prove that we are really getting 3 lines at a time, instead of actually counting words (which we already know how to do), emit the number of lines received in the input at a time as a key and 1 as a value, which, after going through the reducer, will give the frequency of each unique number of lines given to the mappers.

PARTITIONING

1. First, the key that appears most often will be sent to one partition.

2. Second, all other keys will be sent to partitions according to their hashCode() (a minimal sketch of such a custom partitioner is given below).

3. Now suppose your hashCode() method does not uniformly distribute the other keys' data over the partition range. Then the data is not evenly distributed across partitions, and hence across reducers, since each partition is processed by one reducer. Some reducers will therefore have more data than others, and the other reducers will wait for the one reducer (the one handling the user-defined keys) because of the workload it carries.
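A minimal sketch of such a custom Partitioner is shown below; the "hot" key that gets its own partition is a hypothetical example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends one heavily occurring key to its own reducer and spreads the
// remaining keys over the other reducers by hashCode().
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        if (key.toString().equals("hadoop")) {   // hypothetical hot key
            return 0;                            // dedicated partition
        }
        // Remaining keys go to partitions 1 .. numPartitions-1.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

In the driver it would be wired in with job.setPartitionerClass(SkewAwarePartitioner.class), and job.setNumReduceTasks(3) would produce three output partitions, matching the three part-r-0000x files shown in the sample output below.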
INPUT:

Datasets from different sources as input

OUTPUT:

hduser@ACEIT-3446:/usr/local/hadoop/bin$ hadoop fs -ls /partitionerOutput/
14/12/01 17:50:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 hduser supergroup    0 2014-12-01 17:49 /partitionerOutput/_SUCCESS
-rw-r--r--   1 hduser supergroup   10 2014-12-01 17:48 /partitionerOutput/part-r-00000
-rw-r--r--   1 hduser supergroup   10 2014-12-01 17:48 /partitionerOutput/part-r-00001
-rw-r--r--   1 hduser supergroup    9 2014-12-01 17:49 /partitionerOutput/part-r-00002
