
Experiment 1: INSTALL VMWARE

OBJECTIVE:

To install VMware Workstation.
RESOURCES:

VMware stack, 4 GB RAM, web browser, 80 GB hard disk.

PROGRAM LOGIC:

STEP 1. First of all, go to the official VMware site and download VMware Workstation:
https://www.vmware.com/tryvmware/?p=workstation-w

STEP 2. After downloading VMware Workstation, install it on your PC.


STEP 3. Setup will open the Welcome screen.
Click on the Next button and choose the Typical option.

STEP 4. Continue by clicking the Next buttons; to begin the installation, click on the Install button at the end.
STEP 5. This will install the VMware Workstation software on your PC. After the installation completes, click on the
Finish button. Then restart your PC. Then open this software.
STEP 6. In this step we create a new virtual machine. Go to the File menu, then New -> Virtual Machine.

Click on the Next button, then check the Typical option as below.
Then click the Next button and check your OS version. In this example, as we are going to set up an
Oracle server on CentOS, we will check the Linux option and, from the "version" option, we will check Red Hat
Enterprise Linux 4.

By clicking the Next button, we will give a name to our virtual machine and choose the directory in which to create this new virtual machine.

Then select the Use bridged networking option and click Next.

Then you have to define the size of the hard disk by entering its size. I will give 15 GB of hard disk space; please check the
Allocate all disk space now option.
Here, you can delete the Sound Adapter, Floppy and USB Controller by entering "Edit virtual machine settings".
If you are going to set up Oracle Server, please make sure you have increased your memory (RAM) to 1 GB.
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:

1. What is the VMware stack?
2. List out various data formats.
3. List out the characteristics of big data.

LAB ASSIGNMENT:
1. Install Pig.
2. Install Hive.

POST-LAB VIVA QUESTIONS:
1. List out various terminologies in Big Data environments.
2. Define big data analytics.
WEEK-2

HADOOP MODES
OBJECTIVE:

1) Set up and install Hadoop in its three operating modes:
Standalone
Pseudo-distributed
Fully distributed

2) Use web-based tools to monitor your Hadoop setup.

RESOURCES:

VMware stack, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:

a) STANDALONE MODE:
 Installation of JDK 7

Command: sudo apt-get install openjdk-7-jdk
 Download and extract Hadoop

Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop
 Set the path for Java and Hadoop

Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
 Check the Java and Hadoop installations

Command: java -version
Command: hadoop version
b) PSEUDO-DISTRIBUTED MODE:
A Hadoop single-node cluster runs on a single machine; the namenode and datanode both run on the same
machine. The installation and configuration steps are given below:
 Installation of secure shell

Command: sudo apt-get install openssh-server
 Create an SSH key for passwordless SSH configuration

Command: ssh-keygen -t rsa -P ""
 Move the key to the authorized keys file

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
/************** RESTART THE COMPUTER ********************/
 Check the secure shell login

Command: ssh localhost
 Add the JAVA_HOME directory in the hadoop-env.sh file

Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
 Create namenode and datanode directories for Hadoop

Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode
 Configure core-site.xml

Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
 Configure hdfs-site.xml

Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>
 Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
 Format the namenode

Command: hadoop namenode -format
 Start the namenode and datanode

Command: start-dfs.sh
 Start the tasktracker and jobtracker

Command: start-mapred.sh
 To check whether Hadoop started correctly

Command: jps
namenode
secondarynamenode
datanode
jobtracker
tasktracker

c) FULLY DISTRIBUTED MODE:

All the daemons, such as the namenode and datanodes, run on different machines. The data is replicated
according to the replication factor across the slave machines. The secondary namenode periodically stores a mirror
image of the namenode's metadata. The namenode holds the metadata describing where the blocks are stored and
the number of replicas. The slaves and the master communicate with each other periodically.
The configuration of the multinode cluster is given below:

 Configure the hosts in all nodes/machines

Command: sudo gedit /etc/hosts
192.168.1.58 pcetcse1
pcetcse2
pcetcse3
pcetcse4
pcetcse5

 Passwordless SSH Configuration

Create an SSH key on the namenode/master.
Command: ssh-keygen -t rsa -P ""
Copy the generated public key to all datanodes/slaves.

Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse2
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse3
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse4
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse5
/************** RESTART ALL NODES/COMPUTERS/MACHINES ************/
NOTE: Verify the passwordless SSH environment from the namenode to all datanodes as the "huser" user.
 Log in to the nodes

Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5

 Add the JAVA_HOME directory in the hadoop-env.sh file in all nodes/machines

Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Create the namenode directory on the namenode/master

Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
 Create the datanode directory on the datanodes/slaves

Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

Use web-based tools to monitor your Hadoop setup.

HDFS Namenode UI
http://localhost:50070/

INPUT/OUTPUT:
ubuntu@localhost> jps

Datanode, Namenode,
SecondaryNamenode, NodeManager, ResourceManager
HDFS Jobtracker
http://localhost:50030/

HDFS Logs
http://localhost:50070/logs/
PRE-LAB VIVA QUESTIONS:

1. What does the 'jps' command do?
2. How do you restart the Namenode?
3. Differentiate between structured and unstructured data.

LAB ASSIGNMENT:

1. How to configure the daemons in the browser.

POST-LAB VIVA QUESTIONS:

1. What are the main components of a Hadoop application?
2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
WEEK-3
USING LINUX OPERATING SYSTEM

OBJECTIVE:
1. Implementing the basic commands of the LINUX operating system: file/directory creation,
deletion, and update operations.
RESOURCES:
VMware stack, 4 GB RAM, 80 GB hard disk.
PROGRAM LOGIC:

1. cat > filename
2. Add content.
3. Press 'Ctrl+D' to return to the command prompt.
To remove a file, use the syntax: rm filename

INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. What is the ls command?
2. What are the attributes of the ls command?

LAB ASSIGNMENT:
1. Write the Linux commands for sed operations.
2. Write the Linux command for renaming a file.
POST-LAB VIVA QUESTIONS:
1. What is the purpose of the rm command?
2. What is the difference between Linux and Windows commands?
WEEK-4
FILE MANAGEMENT IN HADOOP

OBJECTIVE:
Implement the following file management tasks in Hadoop:
i. Adding files and directories
ii. Retrieving files
iii. Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS
using one of the above command line utilities.

RESOURCES:
VMware stack, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck
hadoop fs -put

hadoop fs -put example.txt /user/chuck

Retrieving Files from HDFS

The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints a file's
contents to the terminal. To view example.txt, we can run the following command:
hadoop fs -cat example.txt

Deleting files from HDFS

hadoop fs -rm example.txt
 The command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
 Adding a directory is done through the command "hdfs dfs -put lendi_english /".
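
The same tasks can also be performed programmatically through the Java FileSystem API. The sketch below is only an illustration (the class name HdfsFileOps and the example paths are assumptions, not part of the lab script); it expects the cluster's core-site.xml to be on the classpath so that FileSystem.get() resolves to the running HDFS instance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // i. Adding files and directories
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // ii. Retrieving files (copy back to the local filesystem)
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example.copy.txt"));

        // iii. Deleting files (second argument: recursive delete)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}

After compiling against the Hadoop libraries, such a class could, for example, be packaged into a jar and run with the hadoop jar command.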
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1) Define Hadoop.
2) List out the various use cases of Hadoop.

LAB ASSIGNMENT:
1) What command is used to list out the directories of the DataNode through the web tool?

POST-LAB VIVA QUESTIONS:
1. Describe the Hadoop ecosystem.
2. Demonstrate the divide-and-conquer philosophy in a Hadoop cluster.
WEEK-5
MAPREDUCE PROGRAM 1
OBJECTIVE:

Run a basic word count MapReduce program to understand the MapReduce paradigm.

RESOURCES:

VMware stack, 4 GB RAM, web browser, 80 GB hard disk.

PROGRAM LOGIC:

WordCount is a simple program which counts the number of occurrences of each word in a
given text input data set. WordCount fits very well with the MapReduce programming model,
making it a great example to understand the Hadoop Map/Reduce programming style. Our
implementation consists of three main parts:

1. Mapper
2. Reducer
3. Driver

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.

The input value of the WordCount Map task will be a line of text from the input data file, and the key
would be the line number <line_number, line_of_text>. The Map task outputs <word, one> for each word
in the line of text.

Pseudo-code

void Map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}
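
A minimal Java Mapper corresponding to this pseudo-code might look like the sketch below. It uses the org.apache.hadoop.mapreduce.Mapper class named above; the class name WordCountMapper is only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line of text into words and emit <word, 1> for each one.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}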
Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result.
Here, the WordCount program will sum up the occurrences of each word and output pairs as
<word, occurrence>.

Pseudo-code

void Reduce(keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
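
A matching Java Reducer and Driver are sketched below. The driver uses the older new Job(conf, name) constructor available in Hadoop 1.x (the version this manual installs); the class names and the command-line input/output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text keyword, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted by the mappers for this word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(keyword, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");          // Hadoop 1.x style job creation
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/chuck
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job could then be run as, for example: hadoop jar wordcount.jar WordCount /user/chuck output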

INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. Justify how Hadoop technology satisfies the business insight needs of today.
2. Define a file system.
LAB ASSIGNMENT:
Run a basic word count MapReduce program to understand the MapReduce paradigm.

POST-LAB VIVA QUESTIONS:
1. Define what a block is in HDFS.
2. Why is a block in HDFS so large?
WEEK-6
MAPREDUCE PROGRAM 2
OBJECTIVE:

Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.

RESOURCES:

VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:

Like WordCount, the weather-mining program fits very well with the MapReduce programming model,
making it a good example of the Hadoop Map/Reduce programming style. It finds the maximum and
minimum temperature for each year from the weather log records. Our implementation consists of three
main parts:

1. Mapper
2. Reducer
3. Main program (Driver)

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.
The input value of the Map task will be a line of text from the input data file, and the key
would be the line number <line_number, line_of_text>. The Map task outputs a <temperature, one> pair
for each temperature reading in the line of text.

Pseudo-code
void Map(key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map(key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a
single result. Here, the program will sum up the occurrences of each temperature reading and output pairs as
<temperature, occurrence>.
Pseudo-code
void Reduce(max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce(min_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
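
As a concrete illustration of the max-temperature case, the sketch below assumes each input line holds tab-separated year, temperature, and quality fields (the NCDC-style sample also used in the Pig lab later in this manual) and treats 9999 as a missing reading; the class names and this record layout are assumptions for illustration. A minimum-temperature job is obtained by replacing Math.max with Math.min.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: year <TAB> temperature <TAB> quality
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed records
            }
            try {
                int temperature = Integer.parseInt(fields[1].trim());
                if (temperature != 9999) { // 9999 marks a missing reading
                    context.write(new Text(fields[0]), new IntWritable(temperature));
                }
            } catch (NumberFormatException e) {
                // skip records whose temperature field is not a number
            }
        }
    }

    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keep the largest temperature observed for this year.
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(year, new IntWritable(max));
        }
    }
}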

Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform
basic configurations such as:
Job Name: the name of this job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: the class which overrides the "map" function. For here, Map.
Reducer: the class which overrides the "reduce" function. For here, Reduce.
Output Key: the type of the output key. For here, Text.
Output Value: the type of the output value. For here, IntWritable.
File Input Path
File Output Path
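
A minimal Driver wiring these configurations together might look like the following sketch. It assumes the MaxTempMapper and MaxTempReducer classes from the previous sketch (rather than the WordCount names kept in the list above) and takes the input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "max temperature");                // Job name
        job.setJarByClass(MaxTemperatureDriver.class);             // Executable (jar) class
        job.setMapperClass(MaxTemperature.MaxTempMapper.class);    // Mapper class
        job.setReducerClass(MaxTemperature.MaxTempReducer.class);  // Reducer class
        job.setOutputKeyClass(Text.class);                         // Output key type
        job.setOutputValueClass(IntWritable.class);                // Output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));      // File input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // File output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}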
INPUT/OUTPUT:
Set of weather data over the years

PRE-LAB VIVA QUESTIONS:

1) Explain the function of the MapReduce partitioner.
2) What is the difference between an InputSplit and an HDFS block?
3) What is the sequence file input format?

LAB ASSIGNMENT:
1. Use a MapReduce job to identify the language by merging multi-language dictionary files into a
single dictionary file.
2. Join multiple datasets using a MapReduce job.

POST-LAB VIVA QUESTIONS:
1) In Hadoop, what is an InputSplit?
2) Explain what a sequence file is in Hadoop.
WEEK-7
MAPREDUCE PROGRAM 3

OBJECTIVE:

Implement matrix multiplication with Hadoop MapReduce.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
We assume that the input files for A and B are streams of (key, value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the
corresponding matrix element value. The output files for matrix C = A*B are in the same
format.

We have the following input parameters:

The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity. Note that in all the strategies the memory footprint of
both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices
we do not emit zero elements. That said, the simple pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our
focus here is on mastering the MapReduce complexities, not on optimizing the sequential matrix
multiplication algorithm for the individual blocks.

Steps
1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key = (i,k) and value = a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key = (k,j) and value = b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m.
Note that m = 0 for A data and m = 1 for B data.
The partitioner maps an intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A
block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
reduce(key, valueList)
17. if key is (ib, kb, jb, 0)
18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
a. sum += A(i,k) * B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum
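
To make the emission in steps 5-10 concrete, here is a simplified sketch of the map side only. It encodes the intermediate key (ib, kb, jb, m) and the value as plain comma-separated Text, which side-steps the custom key type, sorting, and partitioning described in steps 11-12; the class name, the per-matrix flag passed via the configuration, and the "i,j,value" record format are all assumptions for illustration, not part of the original strategy.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits block keys (ib, kb, jb, m) for sparse matrix records of the form "row,col,value".
public class BlockMatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int IB, KB, JB, NIB, NJB;
    private boolean isMatrixA;

    @Override
    protected void setup(Context context) {
        IB = context.getConfiguration().getInt("IB", 1);
        KB = context.getConfiguration().getInt("KB", 1);
        JB = context.getConfiguration().getInt("JB", 1);
        NIB = (context.getConfiguration().getInt("I", 1) - 1) / IB + 1;
        NJB = (context.getConfiguration().getInt("J", 1) - 1) / JB + 1;
        // Assumed flag telling the mapper whether this input split belongs to matrix A or B.
        isMatrixA = context.getConfiguration().getBoolean("input.is.matrix.A", true);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        int r = Integer.parseInt(f[0].trim());
        int c = Integer.parseInt(f[1].trim());
        String v = f[2].trim();
        if (isMatrixA) {
            // A element a(i,k): replicate to every jb block column (steps 7-8).
            for (int jb = 0; jb < NJB; jb++) {
                context.write(new Text(r / IB + "," + c / KB + "," + jb + ",0"),
                              new Text(r % IB + "," + c % KB + "," + v));
            }
        } else {
            // B element b(k,j): replicate to every ib block row (steps 9-10).
            for (int ib = 0; ib < NIB; ib++) {
                context.write(new Text(ib + "," + r / KB + "," + c / JB + ",1"),
                              new Text(r % KB + "," + c % JB + "," + v));
            }
        }
    }
}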
A set of datasets over different clusters is taken as rows and columns.
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. Explain what "map" and "reducer" are in Hadoop.
2. Mention what daemons run on a master node and slave nodes.
3. Mention what the use of the Context object is.

LAB ASSIGNMENT:
1. Implement matrix addition with Hadoop MapReduce.

POST-LAB VIVA QUESTIONS:
1. What is a partitioner in Hadoop?
2. Explain the RecordReader in Hadoop.
WEEK-8
PIG LATIN LANGUAGE - PIG

OBJECTIVE:
1. Installation of Pig.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
STEPS FOR INSTALLING APACHE PIG
1) Extract pig-0.15.0.tar.gz and move it to the home directory.
2) Set the environment for Pig in the .bashrc file.
3) Pig can run in two modes:
Local Mode and Hadoop Mode
pig -x local  and  pig
4) Grunt Shell
grunt>
5) Loading data into the Grunt shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE:
DataType1, ATTRIBUTE: DataType2 .....)
6) Describe data
DESCRIBE DATA;
7) Dump data
DUMP DATA;

INPUT/OUTPUT:
Input: website click count data
PRE-LAB VIVA QUESTIONS:
1) What do you mean by a bag in Pig?
2) Differentiate between Pig Latin and HiveQL.
3) How will you merge the contents of two or more relations and divide a single relation into
two or more relations?

LAB ASSIGNMENT:
1. Process baseball data using Apache Pig.

POST-LAB VIVA QUESTIONS:
1. What is the usage of the foreach operation in Pig scripts?
2. What does FLATTEN do in Pig?
WEEK-9
PIG COMMANDS

OBJECTIVE:
Write Pig Latin scripts to sort, group, join, project, and filter your data.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
FILTER Data
FDATA = FILTER DATA BY ATTRIBUTE = VALUE;
GROUP Data
GDATA = GROUP DATA BY ATTRIBUTE;
Iterating Data
FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN,
ATTRIBUTE = <VALUE>
Sorting Data
SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;
LIMIT Data
LIMIT_DATA = LIMIT DATA COUNT;
JOIN Data
JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2....), DATA2 BY
(ATTRIBUTE3, ATTRIBUTE....N)

INPUT/OUTPUT:
PRE-LAB VIVA QUESTIONS:
1. How will you merge the contents of two or more relations and divide a single relation into
two or more relations?
2. What is the usage of the foreach operation in Pig scripts?
3. What does FLATTEN do in Pig?

LAB ASSIGNMENT:
1. Use Apache Pig to develop User Defined Functions for student data.

POST-LAB VIVA QUESTIONS:
1. What do you mean by a bag in Pig?
2. Differentiate between Pig Latin and HiveQL.
WEEK-10
PIG LATIN MODES, PROGRAMS
OBJECTIVE:
a. Run the Pig Latin scripts to find the word count.
b. Run the Pig Latin scripts to find the max temperature for each and every year.

RESOURCES:
VMware, web browser, 4 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
Run the Pig Latin script to find the word count.

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Run the Pig Latin script to find the max temperature for each and every year.

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

INPUT/OUTPUT:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

PRE-LAB VIVA QUESTIONS:
1. List out the benefits of Pig.
2. Classify the Pig Latin commands in Pig.
LAB ASSIGNMENT:
1. Analyze the average stock price from the stock data using Apache Pig.

POST-LAB VIVA QUESTIONS:
1. Discuss the modes of Pig scripts.
2. Explain the Pig Latin application flow.
WEEK-11
HIVE

OBJECTIVE:
Installation of Hive.

RESOURCES:
VMware, web browser, 1 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
Install MySQL Server

1) sudo apt-get install mysql-server
2) Configure the MySQL user name and password.
3) Create a user and grant all privileges:
mysql -uroot -proot
CREATE USER <USER_NAME> IDENTIFIED BY <PASSWORD>
4) Extract and configure Apache Hive:
tar xvfz apache-hive-1.0.1.bin.tar.gz
5) Move Apache Hive from the local directory to the home directory.
6) Set CLASSPATH in .bashrc:
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL server credentials:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
8) Copy the mysql-connector-java.jar to the hive/lib directory.
INPUT/OUTPUT:

PRE-LAB VIVA QUESTIONS:
1. In Hive, explain the term 'aggregation' and its uses.
2. List out the data types in Hive.

LAB ASSIGNMENT:
1. Analyze Twitter data using Apache Hive.

POST-LAB VIVA QUESTIONS:
1. Explain the built-in functions in Hive.
2. Describe the various Hive data types.
WEEK-12
HIVE OPERATIONS

OBJECTIVE:
Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

RESOURCES:
VMware, XAMPP Server, web browser, 1 GB RAM, 80 GB hard disk.

PROGRAM LOGIC:
SYNTAX for HIVE Database Operations
DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping a Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]
Loading Data into table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax:

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping a View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...)] [COMMENT table_comment] AS SELECT ...
Dropping a View
Syntax:
DROP VIEW view_name
Functions in HIVE
String and math functions: round(), ceil(), substr(), upper(), reg_exp(), etc.
Date and time functions: year(), month(), day(), to_date(), etc.
Aggregate functions: sum(), min(), max(), count(), avg(), etc.
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...) AS
'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating an Index
CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;
Altering and Rebuilding an Index
ALTER INDEX index_ip_address ON log_data REBUILD;
Storing Index Data in the Metastore
SET
hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping an Index
DROP INDEX index_name ON table_name;
INPUT/OUTPUT:
PRE-LAB VIVA QUESTIONS:
1. How many types of joins are there in Pig Latin? Give examples.
2. Write the Hive command to create a table with four columns: first name,
last name, age, and income.

LAB ASSIGNMENT:
1. Analyze stock data using Apache Hive.

POST-LAB VIVA QUESTIONS:
1. Write a shell command in Hive to list all the files in the current directory.
2. List the collection types provided by Hive, for a start-up
company that wants to use Hive for storing its data.
