ROLL NO:
AIM:-
Implement the following Data Structures in Java
e) Map
DESCRIPTION:
Methods of the Collection interface
3. public boolean remove(Object element): is used to delete an element from this collection.
10. public Iterator iterator(): returns an iterator.
11. public Object[] toArray(): converts the collection into an array.
12. public boolean isEmpty(): checks if the collection is empty.
SKELETON OF JAVA.UTIL.COLLECTION INTERFACE

public interface Collection<E> extends Iterable<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    Object[] toArray();
    <T> T[] toArray(T[] a);
    boolean add(E e);
    boolean remove(Object o);
    boolean containsAll(Collection<?> c);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
    boolean equals(Object o);
    int hashCode();
}
ALGORITHM for All Collection Data Structures:-
Steps of Creation of a Collection
1. Create an object of generic type E, T, K or V.
2. Create a Model class or Plain Old Java Object (POJO) of that type.
3. Generate setters and getters.
4. Add the object to the collection: boolean add(E e)
5. Add a collection to the collection: boolean addAll(Collection)
6. Remove objects from the collection: boolean remove(Object), removeAll(Collection), retainAll(Collection)
7. Iterate objects using Enumeration, Iterator or ListIterator: Iterator iterator(), ListIterator listIterator()
8. END
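The steps above can be sketched in plain Java. This is only a minimal illustration; the Employee class and its fields are assumptions modeled on the sample employee.txt records below (id, name, designation, department and pay columns).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the creation steps: a POJO with getters/setters,
// stored in a List and indexed in a Map. Field names are assumptions
// based on the sample employee.txt records.
public class EmployeeDemo {
    static class Employee {
        private String id;
        private String name;
        private String designation;
        private double salary;

        Employee(String id, String name, String designation, double salary) {
            this.id = id;
            this.name = name;
            this.designation = designation;
            this.salary = salary;
        }
        public String getId() { return id; }
        public String getName() { return name; }
        public String getDesignation() { return designation; }
        public double getSalary() { return salary; }
        public void setSalary(double salary) { this.salary = salary; }
    }

    // Steps 4-7: add the objects to a collection and iterate over them.
    public static Map<String, Employee> buildIndex(List<Employee> employees) {
        Map<String, Employee> byId = new HashMap<>();
        for (Employee e : employees) {   // iterate with the enhanced for loop
            byId.put(e.getId(), e);      // Map: key = employee id, value = POJO
        }
        return byId;
    }

    public static void main(String[] args) {
        List<Employee> list = new ArrayList<>();
        list.add(new Employee("e100", "james", "asst.prof", 8000));
        list.add(new Employee("e104", "john", "prof", 30000));
        Map<String, Employee> byId = buildIndex(list);
        System.out.println(byId.get("e104").getName()); // prints "john"
    }
}
```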
SAMPLE INPUT:
Sample Employee Data Set: (employee.txt)
e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3
EXPECTED OUTPUT:-
Prints the information of each employee with all its attributes.
EXERCISE-2:-
AIM:-
i) Perform setting up and installing Hadoop in its three operating modes:
Standalone
Pseudo-Distributed
Fully Distributed
DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.
ALGORITHM
1. Command for installing ssh is "sudo apt-get install ssh".
2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract the Java archive by using the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the eclipse.ini file.
8. Export the Java path and Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the Java version and Hadoop version.
10. Check whether the Hadoop instance in standalone mode is working correctly by using the example Hadoop jar named wordcount.
11. If the word count is displayed correctly in the part-r-00000 file, it means that standalone mode is installed successfully.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:-
1. In order to install pseudo-distributed mode, we need to configure the Hadoop configuration files residing in the directory /home/ACEIT/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the Java path.
3. Configure core-site.xml, which contains a property tag with a name and a value: the name is fs.defaultFS and the value is hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh to start the daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all daemons. Create a directory in Hadoop by using the command hdfs dfs -mkdir /csedir, enter some data into ACEIT.txt using the command nano ACEIT.txt, copy it from the local directory to Hadoop using the command hdfs dfs -copyFromLocal ACEIT.txt /csedir/, and run the sample jar file wordcount to check whether pseudo-distributed mode is working or not.
10. Display the contents of the file by using the command hdfs dfs -cat /newdir/part-r-00000.
FULLY DISTRIBUTED MODE INSTALLATION:
ALGORITHM
1. Stop all single-node clusters
$ stop-all.sh
2. Decide one node as NameNode (Master) and the remaining as DataNodes (Slaves).
3. Copy the public key to the slave nodes
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub ACEIT@l5sys24
4. Configure core-site.xml and hdfs-site.xml
$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml
5. Add hostnames to the file slaves and save it.
$ nano slaves
6. Configure yarn-site.xml
$ nano yarn-site.xml
7. Do in the Master node
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
8. Format the NameNode
9. Daemons start in the Master and Slave nodes
10. END
INPUT
ubuntu@localhost> jps
OUTPUT:
DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager
DESCRIPTION
a) Apache Ambari
b) HortonWorks
c) Apache Spark
LIST OF CLUSTERS IN HADOOP
HortonWorks tool for managing MapReduce jobs in Apache Pig
Running MapReduce jobs in HortonWorks for a Pig Latin script
EXERCISE-3:-
AIM:-
Implement the following file management tasks in Hadoop:
DESCRIPTION:-
Adding files and directories to HDFS
Retrieving files from HDFS to the local file system
Deleting files from HDFS
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
Step-2
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays a file's contents. To view example.txt, we can run the following command:
hadoop fs -cat example.txt
Step-3
Deleting FilesfromHDFS
hadoopfs-rmexample.txt
Commandforcreatingadirectoryinhdfsis“hdfsdfs–mkdir /ACEITcse”.
Addingdirectoryisdonethroughthe command “hdfsdfs –putACEIT_english/”.
Step-4
CopyingDatafromNFStoHDFS
Copyingfromdirectorycommandis“hdfsdfs–copyFromLocal
/home/ACEIT/Desktop/shakes/glossary/ACEITcse/”
Viewthefilebyusingthecommand“hdfsdfs–cat/ACEIT_english/glossary”
CommandforlistingofitemsinHadoopis“hdfsdfs–lshdfs://localhost:9000/”.
CommandforDeletingfilesis “hdfsdfs –rmr/kartheek”.
SAMPLE INPUT:
Input as any data format: structured, unstructured or semi-structured.
EXPECTED OUTPUT:
EXERCISE-4:-
AIM:-
Run a basic WordCount MapReduce program to understand the MapReduce paradigm.
DESCRIPTION:-
ALGORITHM
MAPREDUCE PROGRAM
1. Mapper
2. Reducer
3. Driver
Step-1. Write a Mapper
A Mapper overrides the "map" function from the class org.apache.hadoop.mapreduce.Mapper, which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the WordCount map task will be a line of text from the input data file, and the key would be the line number <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
Pseudo-code
void Reduce(keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
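The map and reduce pseudo-code above can be simulated in plain Java to see the data flow: the map step emits <word, 1> pairs and the reduce step sums the values per key. This is only an illustrative sketch, not the actual Hadoop Mapper/Reducer classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the WordCount <key,value> flow described above.
// A real job would use org.apache.hadoop.mapreduce.Mapper/Reducer and emit
// pairs through a Context object; this sketch shows only the logic.
public class WordCountSketch {

    // "Map" step: emit <word, 1> for each word in the line of text.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // "Reduce" step: sum the list of values collected for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("to be or not to be"));
        System.out.println(reduce(pairs).get("to")); // prints 2
    }
}
```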
Step-3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as the Mapper and Reducer classes and the input and output paths.
INPUT:-
Set of data related to Shakespeare: Comedies, Glossary, Poems
OUTPUT:-
EXERCISE-5:-
AIM:-
Write a MapReduce program that mines weather data.
DESCRIPTION:
Climate change has been attracting a lot of attention for a long time. The adverse effect of this climate change is being felt in every part of the earth. There are many examples of this, such as rising sea levels, less rainfall, and increasing humidity. The proposed system overcomes some issues that occur when using other techniques. In this project we use the concept of Big Data and Hadoop. In the proposed architecture we are able to process offline data stored by the National Climatic Data Center (NCDC). Through this we are able to find the maximum and minimum temperature of a year, and to predict the future weather forecast. Finally, we plot a graph of the obtained MAX and MIN temperature for each month of the particular year to visualize the temperature. Based on the previous year's weather data, the coming year's weather is predicted.
ALGORITHM:-
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Main program
Step-1. Write a Mapper
A Mapper overrides the "map" function from the class org.apache.hadoop.mapreduce.Mapper, which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the map task will be a line of text from the input data file, and the key would be the line number <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map(key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map(key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs as <word, occurrence>.
Pseudo-code
void Reduce(max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
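The weather-mining goal stated in the description (finding the maximum and minimum temperature of a year) can be sketched in plain Java as a reduce step over the per-year readings. The record layout here (year mapped to a list of integer readings) is an assumption for illustration; real NCDC records would need parsing in the map step first.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the weather-mining reduce logic: for each year (key),
// scan the list of temperature readings (values) and keep the maximum and
// minimum. The year -> readings layout is an assumption for illustration.
public class WeatherSketch {

    // Reduce step: collapse all readings of one year to its maximum.
    static int reduceMax(List<Integer> temps) {
        int max = Integer.MIN_VALUE;
        for (int t : temps) {
            if (t > max) max = t;
        }
        return max;
    }

    // Reduce step: collapse all readings of one year to its minimum.
    static int reduceMin(List<Integer> temps) {
        int min = Integer.MAX_VALUE;
        for (int t : temps) {
            if (t < min) min = t;
        }
        return min;
    }

    public static void main(String[] args) {
        // intermediate output of the map phase: year -> temperature readings
        Map<String, List<Integer>> byYear = new HashMap<>();
        byYear.put("1950", List.of(22, 11, 35, -3));
        for (Map.Entry<String, List<Integer>> e : byYear.entrySet()) {
            System.out.println(e.getKey() + " max=" + reduceMax(e.getValue())
                    + " min=" + reduceMin(e.getValue())); // 1950 max=35 min=-3
        }
    }
}
```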
Step-3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Reducer: the class which overrides the "reduce" function; for here, Map.Reducer.
OutputValue: the type of the output value; for here, IntWritable.
File input path
File output path
INPUT:-
OUTPUT:-
EXERCISE-6:-
AIM:-
Write a MapReduce program that implements matrix multiplication.
DESCRIPTION:
MapReduce Logic
The logic is to send the calculation part of each output cell of the result matrix to a reducer. In matrix multiplication, the first cell of the output, (0,0), is the multiplication and summation of elements from row 0 of matrix A and elements from column 0 of matrix B. To do the computation of the value in the output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should have the array of values from row 0 of matrix A and column 0 of matrix B. So in this algorithm the output from the map phase should be a <key, value> pair, where the key represents the output cell location (0,0), (0,1), etc. and the value is the list of all values required for the reducer to do the computation. Let us take an example of calculating the value at output cell (0,0). Here we need to collect values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
ALGORITHM
We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix element value. The output files for matrix C = A*B are in the same format.
In the pseudo-code for the individual strategies below, we have intentionally avoided factoring common code for the purposes of clarity.
1. setup()
2.   var NIB = (I-1)/IB + 1
3.   var NKB = (K-1)/KB + 1
4.   var NJB = (J-1)/JB + 1
5. map(key, value)
6.   if from matrix A with key = (i,k) and value = a(i,k)
7.     for 0 <= jb < NJB
8.       emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9.   if from matrix B with key = (k,j) and value = b(k,j)
10.    for 0 <= ib < NIB
         emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order, first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
reduce(key, valueList)
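The simpler per-cell strategy from the description above can be simulated in plain Java: the map phase keys every partial product by its output cell (i,j), and each "reducer" sums the products for its cell. This sketch illustrates only the key design; the blocked algorithm above additionally groups cells into IB x KB blocks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of matrix multiplication with one "reducer" per
// output cell: map emits ((i,j), a(i,k)*b(k,j)); reduce sums per key.
public class MatrixMultiplySketch {

    // Multiply A (n x m) by B (m x p) by grouping products under (i,j) keys.
    static int[][] multiply(int[][] a, int[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        // map phase: key "i,j" -> list of partial products a(i,k)*b(k,j)
        Map<String, List<Integer>> partials = new HashMap<>();
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
                for (int j = 0; j < p; j++)
                    partials.computeIfAbsent(i + "," + j, x -> new ArrayList<>())
                            .add(a[i][k] * b[k][j]);
        // reduce phase: one summation per output-cell key
        int[][] c = new int[n][p];
        for (Map.Entry<String, List<Integer>> e : partials.entrySet()) {
            String[] ij = e.getKey().split(",");
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            c[Integer.parseInt(ij[0])][Integer.parseInt(ij[1])] = sum;
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        int[][] c = multiply(a, b);
        System.out.println(c[0][0] + " " + c[0][1]); // prints "19 22"
    }
}
```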
INPUT:-
Sets of data over different clusters are taken as rows and columns.
OUTPUT:-
EXERCISE-7:-
AIM:-
Install and run Pig; then write Pig Latin scripts to sort, group, join, project and filter the data.
DESCRIPTION
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users can specify that data from two tables must be joined, but not what join implementation to use (you can specify the implementation of JOIN in SQL, but "...for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.
SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.
ALGORITHM
STEPS FOR INSTALLING APACHE PIG
1) Extract pig-0.15.0.tar.gz and move it to the home directory.
2) Set the environment of PIG in the bashrc file.
3) Pig can run in two modes: Local Mode and Hadoop Mode.
pig -x local and pig
4) Grunt shell
grunt>
5) Loading data into the Grunt shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE : DataType1, ATTRIBUTE : DataType2, …..);
6) Describe data
DESCRIBE DATA;
7) Dump data
DUMP DATA;
8) Filter data
FDATA = FILTER DATA BY ATTRIBUTE = VALUE;
9) Group data
10) Iterating data
FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN, ATTRIBUTE = <VALUE>;
11) Sorting data
SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;
12) Limit data
LIMIT_DATA = LIMIT DATA COUNT;
13) Join data
JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ….), DATA2 BY (ATTRIBUTE3, …., ATTRIBUTEN);
INPUT:
Input as website click-count data
OUTPUT:
EXERCISE-8:-
AIM:-
DESCRIPTION
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; now you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
ALGORITHM:
APACHE HIVE INSTALLATION STEPS
1) Install MySQL Server
sudo apt-get install mysql-server
2) Configure the MySQL user name and password
3) Create a user and grant all privileges
mysql -u root -proot
CREATE USER <USER_NAME> IDENTIFIED BY <PASSWORD>;
4) Extract and configure Apache Hive
tar xvfz apache-hive-1.0.1.bin.tar.gz
5) Move Apache Hive from the local directory to the home directory
6) Set CLASSPATH in bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL server credentials
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>
8) Copy mysql-java-connector.jar to the hive/lib directory.
SYNTAX FOR HIVE DATABASE OPERATIONS
DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping a Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format];
Loading Data into Table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax
Creating and Dropping a View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...;
Dropping a View
Syntax:
DROP VIEW view_name;
Functions in HIVE
String functions: round(), ceil(), substr(), upper(), reg_exp(), etc. Date functions.
INDEXES
ALTER INDEX index_ip_address ON log_data REBUILD;
Storing Index Data in the Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping an Index
DROP INDEX INDEX_NAME ON TABLE_NAME;
INPUT
Input as web server log data
OUTPUT
EXERCISE-10:-
AIM: Write a program to implement combining and partitioning in Hadoop, i.e., implement a custom Partitioner and Combiner.
DESCRIPTION:
COMBINERS:
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. One can think of combiners as mini-reducers that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners.
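The mini-reduce behavior described above can be sketched in plain Java: a combiner pre-sums one mapper's <word, 1> pairs locally so that fewer pairs cross the network. This is an illustrative sketch, not the Hadoop Combiner API itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a combiner as a "mini-reduce" on one mapper's output.
// Because summation is associative and commutative, running this combiner
// zero, one or many times leaves the final reduce result unchanged.
public class CombinerSketch {

    // Local aggregation over a single mapper's <key, 1> output pairs.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new HashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // one mapper emitted six pairs; the combiner shrinks them to two
        List<String> mapOutput = List.of("a", "b", "a", "a", "b", "a");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(combined.get("a")); // prints 4
        System.out.println(combined.size());   // prints 2 (pairs sent to reduce)
    }
}
```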
PARTITIONERS
Acommonmisconceptionforfirst-
timeMapReduceprogrammersistouseonlyasinglereducer.Itiseasytounderstandthatsuchaco
nstraintisanonsenseandthat
using more than one reducer is most of the time necessary, else the map/reduce concept would
notbe very useful. With multiple reducers, we need some way to determine the appropriate one
tosenda(key/value)pairoutputtedby amapper.Thedefaultbehavioristohashthekey todetermine the
reducer. The partitioning phase takes place after the map phase and before
thereducephase.Thenumberofpartitionsisequaltothenumberofreducers.Thedatagetspartitioned
across the reducers according to the partitioningfunction. This approach improves
theoverallperformanceandallowsmapperstooperate
completelyindependently.Forallofitsoutputkey/valuepairs,eachmapperdetermines
which reducer will receive them. Because all the mappers are using the same
partitioningfor any key, regardless of which mapper instance generated it, the destination
partition isthe same. Hadoop uses an interface called Partitioner to determine which
partition akey/valuepairwillgoto.Asinglepartitionreferstoallkey/valuepairsthatwillbesentto
a single reduce task. You can configure the number of reducers in a job driver bysetting a
number of reducers on the job object (job.setNumReduceTasks). Hadoop
comeswithadefaultpartitionerimplementationi.e.HashPartitioner,whichhashesarecord‘ske
y to determine which partition the record belongs in. Each partition is processed by
areduce task, so the number of partitions is equal to the number of reduce tasks for the
job,When the map function starts producing output, it is not simply written to disk. Each
maptask has a circular memory buffer that it writes the output to. When the contents of
thebuffer reach a certain threshold size, a background thread will start tospill the contents
todisk. Map outputs will continue to be written to the buffer while the spill takes place,
butif the buffer fills up during this time, the map will block until the spill is complete.
Beforeit writes to disk, the thread first divides the data into partitions corresponding to
thereducers that they will ultimately be sent to. Within each partition, the background
threadperformsanin-memorysortbykey,andifthereisacombinerfunction,itisrunonthe
output of thesort.
ALGORITHM:
COMBINING
1) Divide the data source (the data files) into fragments or blocks which are sent to a mapper. These are called splits.
2) These splits are further divided into records, and these records are provided one at a time to the mapper for processing. This is achieved through a class called a RecordReader.
Create a class and extend from the TextInputFormat class to create our own NLinesInputFormat.
Then create our own RecordReader class called NLinesRecordReader, where we will implement the logic of feeding 3 lines/records at a time.
Make a change in the driver program to use the new NLinesInputFormat class.
To prove that we are really getting 3 lines at a time, instead of actually counting words (which we already know how to do), we emit the number of lines we get in the input at a time as the key and 1 as the value, which after going through the reducer will give the frequency of each unique number of lines given to the mappers.
PARTITIONING
3. Now suppose your hashCode() method does not uniformly distribute the other keys' data over the partition range. Then the data is not evenly distributed across partitions, and hence across reducers, since each partition is equivalent to a reducer. Here some reducers will have more data than others, so the other reducers will wait for the one reducer (the one with the user-defined keys) due to the workload it shares.
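The hash-based assignment discussed above can be sketched in plain Java. The formula mirrors what Hadoop's default HashPartitioner computes; a custom partitioner would replace only this function, and a skewed hashCode() produces exactly the uneven reducer load described.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of hash partitioning in the style of Hadoop's default
// HashPartitioner: partition = (key.hashCode() & Integer.MAX_VALUE) % numReducers.
// The bit-mask clears the sign bit so negative hash codes still map to a
// valid partition index.
public class PartitionerSketch {

    // Mirrors the shape of Partitioner.getPartition(key, value, numPartitions).
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 3;
        Map<Integer, Integer> load = new HashMap<>(); // partition -> key count
        for (String key : new String[] {"apple", "ball", "cat", "dog", "egg"}) {
            int p = getPartition(key, numReducers);
            load.merge(p, 1, Integer::sum);
        }
        // every key lands in [0, numReducers); a skewed hashCode skews the load
        System.out.println(load);
    }
}
```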
INPUT:
Datasets from different sources as input
OUTPUT:
hduser@ACEIT-3446:/usr/local/hadoop/bin$ hadoop fs -ls /partitionerOutput/
14/12/01 17:50:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 hduser supergroup   0 2014-12-01 17:49 /partitionerOutput/_SUCCESS
-rw-r--r--   1 hduser supergroup  10 2014-12-01 17:48 /partitionerOutput/part-r-00000
-rw-r--r--   1 hduser supergroup  10 2014-12-01 17:48 /partitionerOutput/part-r-00001
-rw-r--r--   1 hduser supergroup   9 2014-12-01 17:49 /partitionerOutput/part-r-00002