
Experiment 1: INSTALL VMWARE

OBJECTIVE:

To install VMware Workstation.
RESOURCES:

VMware stack, 4 GB RAM, web browser, 80 GB hard disk.

PROGRAM LOGIC:

STEP 1. First of all, go to the official VMware site and download VMware Workstation:
https://www.vmware.com/tryvmware/?p=workstation-w

STEP 2. After downloading VMware Workstation, install it on your PC.


STEP 3. Setup will open the Welcome screen.
Click on the Next button and choose the Typical option.

STEP 4. Continue by clicking the Next buttons; to begin the installation, click on the Install button at the end.
STEP 5. This will install the VMware Workstation software on your PC. After the installation completes, click on the
Finish button. Then restart your PC. Then open this software.
STEP 6. In this step we create a new virtual machine. Go to the File menu, then New -> Virtual Machine.

Click on the Next button, then check the Typical option as below.
Then click the Next button and check your OS version. In this example, as we are going to set up an
Oracle server on CentOS, we will check the Linux option and, from the "version" option, we will check Red Hat
Enterprise Linux 4.

By clicking the Next button, we will give a name to our virtual machine and choose the directory in which to create this new virtual machine.

Then select the Use bridged networking option and click Next.

Then you have to define the size of the hard disk by entering its size. I will give 15 GB of hard disk space; please check the
Allocate all disk space now option.
Here, you can delete the Sound Adapter, Floppy and USB Controller by entering "Edit virtual machine settings".
If you are going to set up Oracle Server, please make sure you have increased your memory (RAM) to 1 GB.
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:

1. What is the VMware stack?
2. List out various data formats.
3. List out the characteristics of big data.

LAB ASSIGNMENT:
1. Install Pig.
2. Install Hive.

POST-LAB VIVA QUESTIONS:
1. List out various terminologies in Big Data environments.
2. Define big data analytics.
WEEK-2

HADOOP MODES
OBJECTIVE:

1) Set up and install Hadoop in its three operating modes:
Standalone
Pseudo-distributed
Fully distributed

2) Use web-based tools to monitor your Hadoop setup.

RESOURCES:

VMware stack, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:

a) STANDALONE MODE:
 Installation of JDK 7

Command: sudo apt-get install openjdk-7-jdk
 Download and extract Hadoop

Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop
 Set the path for Java and Hadoop

Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
 Check the Java and Hadoop installations

Command: java -version
Command: hadoop version
b) PSEUDO-DISTRIBUTED MODE:
A Hadoop single-node cluster runs on a single machine; the namenode and datanode both run on the same
machine. The installation and configuration steps are given below:
 Installation of secure shell

Command: sudo apt-get install openssh-server
 Create an SSH key for passwordless SSH configuration

Command: ssh-keygen -t rsa -P ""
 Move the key to the authorized keys file

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
/************** RESTART THE COMPUTER ********************/
 Check the secure shell login

Command: ssh localhost
 Add the JAVA_HOME directory in the hadoop-env.sh file

Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
 Create namenode and datanode directories for Hadoop

Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode
 Configure core-site.xml

Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
 Configure hdfs-site.xml

Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>
 Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
 Format the namenode

Command: hadoop namenode -format
 Start the namenode and datanode

Command: start-dfs.sh
 Start the tasktracker and jobtracker

Command: start-mapred.sh
 To check whether Hadoop started correctly

Command: jps
namenode
secondarynamenode
datanode
jobtracker
tasktracker

c) FULLY DISTRIBUTED MODE:

All the daemons, such as the namenode and datanodes, run on different machines. The data is replicated
according to the replication factor across the slave machines. The secondary namenode periodically stores a mirror
image of the namenode's metadata. The namenode holds the metadata describing where the blocks are stored and
the number of replicas. The slaves and the master communicate with each other periodically.
The configuration of the multinode cluster is given below:

 Configure the hosts in all nodes/machines

Command: sudo gedit /etc/hosts
192.168.1.58 pcetcse1
pcetcse2
pcetcse3
pcetcse4
pcetcse5

 Passwordless SSH Configuration

Create an SSH key on the namenode/master.
Command: ssh-keygen -t rsa -P ""
Copy the generated public key to all datanodes/slaves.

Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse2
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse3
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse4
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse5
/************** RESTART ALL NODES/COMPUTERS/MACHINES ************/
NOTE: Verify the passwordless SSH environment from the namenode to all datanodes as the "huser" user.
 Log in to the nodes

Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5

 Add the JAVA_HOME directory in the hadoop-env.sh file in all nodes/machines

Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Create the namenode directory on the namenode/master

Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
 Create the datanode directory on the datanodes/slaves

Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

Use web-based tools to monitor your Hadoop setup.

HDFS Namenode UI
http://localhost:50070/

INPUT/OUTPUT:
ubuntu@localhost> jps

Datanode, Namenode,
SecondaryNamenode, NodeManager, ResourceManager
HDFS Jobtracker
http://localhost:50030/

HDFS Logs
http://localhost:50070/logs/
PRE-LAB VIVA QUESTIONS:

1. What does the 'jps' command do?
2. How do you restart the Namenode?
3. Differentiate between structured and unstructured data.

LAB ASSIGNMENT:

1. How to configure the daemons in the browser.

POST-LAB VIVA QUESTIONS:

1. What are the main components of a Hadoop application?
2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
WEEK-3
USING LINUX OPERATING SYSTEM

OBJECTIVE:
1. Implementing the basic commands of the LINUX operating system: file/directory creation,
deletion, and update operations.
RESOURCES:
VMware stack, 4 GB RAM, 80 GB hard disk.
PROGRAM LOGIC:

1. cat > filename
2. Add content.
3. Press 'Ctrl+D' to return to the command prompt.
To remove a file, use the syntax: rm filename

INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. What is the ls command?
2. What are the attributes of the ls command?

LAB ASSIGNMENT:
1. Write the Linux commands for sed operations.
2. Write the Linux command for renaming a file.
POST-LAB VIVA QUESTIONS:
1. What is the purpose of the rm command?
2. What is the difference between Linux and Windows commands?
WEEK-4
FILE MANAGEMENT IN HADOOP

OBJECTIVE:
Implement the following file management tasks in Hadoop:
i. Adding files and directories
ii. Retrieving files
iii. Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS
using one of the above command line utilities.

RESOURCES:
VMware stack, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck
hadoop fs -put

hadoop fs -put example.txt /user/chuck

Retrieving Files from HDFS

The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints a file's
contents to the terminal. To view example.txt, we can run the following command:
hadoop fs -cat example.txt

Deleting files from HDFS

hadoop fs -rm example.txt
 The command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
 Adding a directory is done through the command "hdfs dfs -put lendi_english /".
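
The same tasks can also be performed programmatically through the Java FileSystem API. The sketch below is only an illustration (the class name HdfsFileOps and the example paths are assumptions, not part of the lab script); it expects the cluster's core-site.xml to be on the classpath so that FileSystem.get() resolves to the running HDFS instance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // i. Adding files and directories
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // ii. Retrieving files (copy back to the local filesystem)
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example.copy.txt"));

        // iii. Deleting files (second argument: recursive delete)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}

After compiling against the Hadoop libraries, such a class could, for example, be packaged into a jar and run with the hadoop jar command.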
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1) Define Hadoop.
2) List out the various use cases of Hadoop.

LAB ASSIGNMENT:
1) What command is used to list out the directories of the DataNode through the web tool?

POST-LAB VIVA QUESTIONS:
1. Describe the Hadoop ecosystem.
2. Demonstrate the divide-and-conquer philosophy in a Hadoop cluster.
WEEK-5
MAPREDUCE PROGRAM 1
OBJECTIVE:

Run a basic word count MapReduce program to understand the MapReduce paradigm.

RESOURCES:

VMware stack, 4 GB RAM, web browser, 80 GB hard disk.

PROGRAM LOGIC:

WordCount is a simple program which counts the number of occurrences of each word in a
given text input data set. WordCount fits very well with the MapReduce programming model,
making it a great example to understand the Hadoop Map/Reduce programming style. Our
implementation consists of three main parts:

1. Mapper
2. Reducer
3. Driver

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.

The input value of the WordCount Map task will be a line of text from the input data file, and the key
would be the line number <line_number, line_of_text>. The Map task outputs <word, one> for each word
in the line of text.

Pseudo-code

void Map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}
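
A minimal Java Mapper corresponding to this pseudo-code might look like the sketch below. It uses the org.apache.hadoop.mapreduce.Mapper class named above; the class name WordCountMapper is only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line of text into words and emit <word, 1> for each one.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}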
Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result.
Here, the WordCount program will sum up the occurrences of each word and output pairs as
<word, occurrence>.

Pseudo-code

void Reduce(keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
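
A matching Java Reducer and Driver are sketched below. The driver uses the older new Job(conf, name) constructor available in Hadoop 1.x (the version this manual installs); the class names and the command-line input/output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text keyword, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted by the mappers for this word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(keyword, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");          // Hadoop 1.x style job creation
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/chuck
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job could then be run as, for example: hadoop jar wordcount.jar WordCount /user/chuck output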

INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. Justify how Hadoop technology satisfies the business insight needs of today.
2. Define a file system.
LAB ASSIGNMENT:
Run a basic word count MapReduce program to understand the MapReduce paradigm.

POST-LAB VIVA QUESTIONS:
1. Define what a block is in HDFS.
2. Why is a block in HDFS so large?
WEEK-6
MAPREDUCE PROGRAM 2
OBJECTIVE:

Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.

RESOURCES:

VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:

Like WordCount, the weather-mining program fits very well with the MapReduce programming model,
making it a good example of the Hadoop Map/Reduce programming style. It finds the maximum and
minimum temperature for each year from the weather log records. Our implementation consists of three
main parts:

1. Mapper
2. Reducer
3. Main program (Driver)

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.
The input value of the Map task will be a line of text from the input data file, and the key
would be the line number <line_number, line_of_text>. The Map task outputs a <temperature, one> pair
for each temperature reading in the line of text.

Pseudo-code
void Map(key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map(key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a
single result. Here, the program will sum up the occurrences of each temperature reading and output pairs as
<temperature, occurrence>.
Pseudo-code
void Reduce(max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce(min_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
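
As a concrete illustration of the max-temperature case, the sketch below assumes each input line holds tab-separated year, temperature, and quality fields (the NCDC-style sample also used in the Pig lab later in this manual) and treats 9999 as a missing reading; the class names and this record layout are assumptions for illustration. A minimum-temperature job is obtained by replacing Math.max with Math.min.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: year <TAB> temperature <TAB> quality
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed records
            }
            try {
                int temperature = Integer.parseInt(fields[1].trim());
                if (temperature != 9999) { // 9999 marks a missing reading
                    context.write(new Text(fields[0]), new IntWritable(temperature));
                }
            } catch (NumberFormatException e) {
                // skip records whose temperature field is not a number
            }
        }
    }

    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keep the largest temperature observed for this year.
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(year, new IntWritable(max));
        }
    }
}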

Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform
basic configurations such as:
Job Name: the name of this job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: the class which overrides the "map" function. For here, Map.
Reducer: the class which overrides the "reduce" function. For here, Reduce.
Output Key: the type of the output key. For here, Text.
Output Value: the type of the output value. For here, IntWritable.
File Input Path
File Output Path
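
A minimal Driver wiring these configurations together might look like the following sketch. It assumes the MaxTempMapper and MaxTempReducer classes from the previous sketch (rather than the WordCount names kept in the list above) and takes the input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "max temperature");                // Job name
        job.setJarByClass(MaxTemperatureDriver.class);             // Executable (jar) class
        job.setMapperClass(MaxTemperature.MaxTempMapper.class);    // Mapper class
        job.setReducerClass(MaxTemperature.MaxTempReducer.class);  // Reducer class
        job.setOutputKeyClass(Text.class);                         // Output key type
        job.setOutputValueClass(IntWritable.class);                // Output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));      // File input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // File output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}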
INPUT/OUTPUT:
Set of weather data over the years

PRE-LAB VIVA QUESTIONS:

1) Explain the function of the MapReduce partitioner.
2) What is the difference between an InputSplit and an HDFS block?
3) What is the sequence file input format?

LAB ASSIGNMENT:
1. Use a MapReduce job to identify the language by merging multi-language dictionary files into a
single dictionary file.
2. Join multiple datasets using a MapReduce job.

POST-LAB VIVA QUESTIONS:
1) In Hadoop, what is an InputSplit?
2) Explain what a sequence file is in Hadoop.
WEEK-7
MAPREDUCE PROGRAM 3

OBJECTIVE:

Implement matrix multiplication with Hadoop MapReduce.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
We assume that the input files for A and B are streams of (key, value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the
corresponding matrix element value. The output files for matrix C = A*B are in the same
format.

We have the following input parameters:

The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity. Note that in all the strategies the memory footprint of
both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices
we do not emit zero elements. That said, the simple pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our
focus here is on mastering the MapReduce complexities, not on optimizing the sequential matrix
multiplication algorithm for the individual blocks.

Steps
1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key = (i,k) and value = a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key = (k,j) and value = b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m.
Note that m = 0 for A data and m = 1 for B data.
The partitioner maps an intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A
block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
reduce(key, valueList)
17. if key is (ib, kb, jb, 0)
18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
a. sum += A(i,k) * B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum
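
To make the emission in steps 5-10 concrete, here is a simplified sketch of the map side only. It encodes the intermediate key (ib, kb, jb, m) and the value as plain comma-separated Text, which side-steps the custom key type, sorting, and partitioning described in steps 11-12; the class name, the per-matrix flag passed via the configuration, and the "i,j,value" record format are all assumptions for illustration, not part of the original strategy.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits block keys (ib, kb, jb, m) for sparse matrix records of the form "row,col,value".
public class BlockMatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int IB, KB, JB, NIB, NJB;
    private boolean isMatrixA;

    @Override
    protected void setup(Context context) {
        IB = context.getConfiguration().getInt("IB", 1);
        KB = context.getConfiguration().getInt("KB", 1);
        JB = context.getConfiguration().getInt("JB", 1);
        NIB = (context.getConfiguration().getInt("I", 1) - 1) / IB + 1;
        NJB = (context.getConfiguration().getInt("J", 1) - 1) / JB + 1;
        // Assumed flag telling the mapper whether this input split belongs to matrix A or B.
        isMatrixA = context.getConfiguration().getBoolean("input.is.matrix.A", true);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        int r = Integer.parseInt(f[0].trim());
        int c = Integer.parseInt(f[1].trim());
        String v = f[2].trim();
        if (isMatrixA) {
            // A element a(i,k): replicate to every jb block column (steps 7-8).
            for (int jb = 0; jb < NJB; jb++) {
                context.write(new Text(r / IB + "," + c / KB + "," + jb + ",0"),
                              new Text(r % IB + "," + c % KB + "," + v));
            }
        } else {
            // B element b(k,j): replicate to every ib block row (steps 9-10).
            for (int ib = 0; ib < NIB; ib++) {
                context.write(new Text(ib + "," + r / KB + "," + c / JB + ",1"),
                              new Text(r % KB + "," + c % JB + "," + v));
            }
        }
    }
}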
A set of datasets over different clusters is taken as rows and columns.
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. Explain what "map" and "reducer" are in Hadoop.
2. Mention what daemons run on a master node and slave nodes.
3. Mention what the use of the Context object is.

LAB ASSIGNMENT:
1. Implement matrix addition with Hadoop MapReduce.

POST-LAB VIVA QUESTIONS:
1. What is a partitioner in Hadoop?
2. Explain the RecordReader in Hadoop.
WEEK-8
PIG LATIN LANGUAGE - PIG

OBJECTIVE:
1. Installation of Pig.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
STEPS FOR INSTALLING APACHE PIG
1) Extract pig-0.15.0.tar.gz and move it to the home directory.
2) Set the environment for Pig in the .bashrc file.
3) Pig can run in two modes:
Local Mode and Hadoop Mode
pig -x local  and  pig
4) Grunt Shell
grunt>
5) Loading data into the Grunt shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE:
DataType1, ATTRIBUTE: DataType2 .....)
6) Describe data
DESCRIBE DATA;
7) Dump data
DUMP DATA;

INPUT/OUTPUT:
Input: website click count data
PRE-LAB VIVA QUESTIONS:
1) What do you mean by a bag in Pig?
2) Differentiate between Pig Latin and HiveQL.
3) How will you merge the contents of two or more relations and divide a single relation into
two or more relations?

LAB ASSIGNMENT:
1. Process baseball data using Apache Pig.

POST-LAB VIVA QUESTIONS:
1. What is the usage of the foreach operation in Pig scripts?
2. What does FLATTEN do in Pig?
WEEK-9
PIG COMMANDS

OBJECTIVE:
Write Pig Latin scripts to sort, group, join, project, and filter your data.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
FILTER Data
FDATA = FILTER DATA BY ATTRIBUTE = VALUE;
GROUP Data
GDATA = GROUP DATA BY ATTRIBUTE;
Iterating Data
FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN,
ATTRIBUTE = <VALUE>
Sorting Data
SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;
LIMIT Data
LIMIT_DATA = LIMIT DATA COUNT;
JOIN Data
JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2....), DATA2 BY
(ATTRIBUTE3, ATTRIBUTE....N)

INPUT/OUTPUT:
PRE-LAB VIVA QUESTIONS:
1. How will you merge the contents of two or more relations and divide a single relation into
two or more relations?
2. What is the usage of the foreach operation in Pig scripts?
3. What does FLATTEN do in Pig?

LAB ASSIGNMENT:
1. Use Apache Pig to develop User Defined Functions for student data.

POST-LAB VIVA QUESTIONS:
1. What do you mean by a bag in Pig?
2. Differentiate between Pig Latin and HiveQL.
WEEK-10
PIG LATIN MODES, PROGRAMS
OBJECTIVE:
a. Run the Pig Latin scripts to find the word count.
b. Run the Pig Latin scripts to find the max temperature for each and every year.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
Run the Pig Latin script to find the word count.

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Run the Pig Latin script to find the max temperature for each and every year.

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

INPUT/OUTPUT:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

PRE-LAB VIVA QUESTIONS:
1. List out the benefits of Pig.
2. Classify the Pig Latin commands in Pig.
LAB ASSIGNMENT:
1. Analyze the average stock price from the stock data using Apache Pig.

POST-LAB VIVA QUESTIONS:
1. Discuss the modes of Pig scripts.
2. Explain the Pig Latin application flow.
WEEK-11
HIVE

OBJECTIVE:
Installation of Hive.

RESOURCES:
VMware, web browser, 1 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
Install MySQL Server

1) sudo apt-get install mysql-server
2) Configure the MySQL user name and password.
3) Create a user and grant all privileges:
mysql -uroot -proot
CREATE USER <USER_NAME> IDENTIFIED BY <PASSWORD>
4) Extract and configure Apache Hive:
tar xvfz apache-hive-1.0.1.bin.tar.gz
5) Move Apache Hive from the local directory to the home directory.
6) Set CLASSPATH in .bashrc:
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL server credentials:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
8) Copy the mysql-connector-java.jar to the hive/lib directory.
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. In Hive, explain the term 'aggregation' and its uses.
2. List out the data types in Hive.

LAB ASSIGNMENT:
1. Analyze Twitter data using Apache Hive.

POST-LAB VIVA QUESTIONS:
1. Explain the built-in functions in Hive.
2. Describe the various Hive data types.
WEEK-12
HIVE OPERATIONS

OBJECTIVE:
Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

RESOURCES:
VMware, XAMPP Server, web browser, 1 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
SYNTAX for HIVE Database Operations
DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping a Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]
Loading Data into table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax:

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping a View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...)] [COMMENT table_comment] AS SELECT ...
Dropping a View
Syntax:
DROP VIEW view_name
Functions in HIVE
String and math functions: round(), ceil(), substr(), upper(), reg_exp(), etc.
Date and time functions: year(), month(), day(), to_date(), etc.
Aggregate functions: sum(), min(), max(), count(), avg(), etc.
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...) AS
'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating an Index
CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;
Altering and Rebuilding an Index
ALTER INDEX index_ip_address ON log_data REBUILD;
Storing Index Data in the Metastore
SET
hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping an Index
DROP INDEX index_name ON table_name;
INPUT/OUTPUT:
PRE-LAB VIVA QUESTIONS:
1. How many types of joins are there in Pig Latin? Give examples.
2. Write the Hive command to create a table with four columns: first name,
last name, age, and income.

LAB ASSIGNMENT:
1. Analyze stock data using Apache Hive.

POST-LAB VIVA QUESTIONS:
1. Write a shell command in Hive to list all the files in the current directory.
2. List the collection types provided by Hive, for a start-up
company that wants to use Hive for storing its data.
