Hadoop Notes

Hadoop I/O
Why Not Use Java Object Serialization? (p. 101)
Serialization with Thrift:
    There is limited support for these as MapReduce formats (p. 102):
    http://wiki.apache.org/hadoop/Hbase/ThriftApi
SequenceFile can use 'any' serialization framework.
    In contrast, MapFile can only use Writables.
    I heard HBase used to use MapFile. Does it mean we can't use it outside Java?
    NO! Thrift or Stargate REST Connector (ALPHA).
MapFile is an indexed and sorted SequenceFile.
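To make "indexed and sorted" concrete, here is a minimal sketch in plain Java of a sorted key list with a sparse index. It is a toy model, not the Hadoop MapFile API: the class name, the `interval` parameter, and the in-memory arrays are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Toy model of a sorted file with a sparse index; NOT the Hadoop MapFile API. */
class SparseIndexSketch {
    private final String[] keys;      // sorted keys, standing in for the data file
    private final String[] values;    // values aligned with keys
    private final int interval;       // hypothetical knob: index every interval-th key
    private final String[] indexKeys; // standing in for the small index file

    SparseIndexSketch(String[] sortedKeys, String[] values, int interval) {
        this.keys = sortedKeys;
        this.values = values;
        this.interval = interval;
        int n = (sortedKeys.length + interval - 1) / interval;
        this.indexKeys = new String[n];
        for (int i = 0; i < n; i++) {
            indexKeys[i] = sortedKeys[i * interval];
        }
    }

    /** Binary-search the index, then scan at most `interval` data entries. */
    String get(String key) {
        int pos = Arrays.binarySearch(indexKeys, key);
        int slot = pos >= 0 ? pos : -pos - 2; // last index entry <= key
        if (slot < 0) {
            return null; // key sorts before the first entry
        }
        int start = slot * interval;
        int end = Math.min(start + interval, keys.length);
        for (int i = start; i < end; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] k = {"a", "c", "e", "g", "i"};
        String[] v = {"1", "3", "5", "7", "9"};
        SparseIndexSketch file = new SparseIndexSketch(k, v, 2);
        System.out.println(file.get("g"));
        System.out.println(file.get("b"));
    }
}
```

The point of the sketch: because the keys are sorted, only a fraction of them needs to be held in the index, and a lookup touches a short run of the data instead of the whole file.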
Map & Reduce: other languages
Unix standard streams are the interface between Hadoop and your program (p. 32).
The Java API is geared toward processing your map function one record at a time.
    Records are pushed, but it is still possible to consider multiple lines at a time by accumulating previous lines in an instance variable in the Mapper.
    Or use the new pull style (p. 25).
Whereas with Streaming, the map program can decide how to process the input.
    What's the penalty for that? Data is copied from the Java process space to the other process space [1].
    Is Remote Debugging (p. 144) possible? Don't think so.

[1]: http://www.cloudera.com/hadooptrainingprogrammingwithhadoop 49'50
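The "accumulate previous lines in an instance variable" idea from this slide can be sketched roughly as follows. This is a stand-in class in plain Java, not a real Hadoop Mapper: the class name, the `windowSize` parameter, and the `List<String>` output collector are assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Stand-in for a Mapper; real Hadoop classes are deliberately not used here. */
class AccumulatingMapperSketch {
    private final int windowSize;                               // hypothetical parameter
    private final Deque<String> previous = new ArrayDeque<>();  // state kept between map() calls

    AccumulatingMapperSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Called once per pushed record; emits a joined window once enough lines are buffered. */
    void map(String line, List<String> output) {
        previous.addLast(line);
        if (previous.size() > windowSize) {
            previous.removeFirst();
        }
        if (previous.size() == windowSize) {
            output.add(String.join("|", previous));
        }
    }

    /** Drives the mapper the way a framework would: one record at a time. */
    static List<String> run(List<String> lines, int windowSize) {
        AccumulatingMapperSketch mapper = new AccumulatingMapperSketch(windowSize);
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            mapper.map(line, output);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("l1", "l2", "l3"), 2));
    }
}
```

One caveat worth keeping in mind: in a real job the buffer would only ever see the records of that mapper's own input split, so a window cannot span split boundaries.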
Map & Reduce: C++
Hadoop Pipes
    Uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.
Implement:
    HadoopPipes::Mapper
    HadoopPipes::Reducer
Main:
    int main(int argc, char *argv[]) {
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<MaxTemperatureMapper, MapTemperatureReducer>());
    }

    % hadoop pipes \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -input sample.txt \
        -output output \
        -program bin/max_temperature

HDFS concepts
A file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
A file can be larger than any single disk in the network. The file will be spread across more than one node.
A block is typically replicated to three physically separate machines.
    Some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster.
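A small arithmetic sketch of the points above. The 64 MB block size and the replication factor of 3 are assumptions (common defaults for Hadoop of that era); the class and method names are made up for this example.

```java
/** Back-of-the-envelope block arithmetic; block size and replication are assumptions. */
class BlockMathSketch {
    static final long MB = 1024L * 1024L;

    /** Number of blocks a file is split into; the last block may be partial. */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Raw bytes stored across the cluster: a partial block only uses what it holds. */
    static long rawStorageBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB; // assumed block size
        int replication = 3;      // assumed replication factor

        long small = 1 * MB;
        System.out.println("1 MB file:   blocks=" + blockCount(small, blockSize)
            + ", raw MB=" + rawStorageBytes(small, replication) / MB);

        long large = 200 * MB;
        System.out.println("200 MB file: blocks=" + blockCount(large, blockSize)
            + ", raw MB=" + rawStorageBytes(large, replication) / MB);
    }
}
```

Under these assumptions the 1 MB file occupies one block but only 3 MB of raw disk in total, not 3 x 64 MB.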
File permissions in HDFS (p. 47)
Interfaces: FUSE, Thrift, C
    The FUSE interface allows HDFS to be mounted as a standard filesystem.
    It makes it possible to use any Unix utility like ls, cat...
    However, it doesn't mean you should use it as a general-purpose FS.

HDFS concepts
HDFS stores small files inefficiently [1].
    They eat up a lot of the Namenode's memory.
    However, a small file won't take up any more disk space than is required to store its raw contents.
HDFS allows only sequential writes to an open file, or appends to an already written file.
[1]: http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
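To put a number on "eats up a lot of Namenode's memory", here is a rough estimate in Java. The figure of about 150 bytes per file or block object is a rule of thumb taken from the post cited in [1] and should be treated as an assumption, as should the simplification that directories are ignored.

```java
/** Rough Namenode heap estimate; the 150-byte figure is a rule of thumb, not a measurement. */
class NamenodeMemorySketch {
    static final long BYTES_PER_OBJECT = 150; // assumption, per file object and per block object

    /** One object per file plus one per block; directories are ignored in this sketch. */
    static long estimateBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L, 1);
        System.out.println("10 million single-block files: ~" + bytes / 1_000_000_000.0 + " GB");
    }
}
```

With these assumptions, ten million small files cost on the order of 3 GB of Namenode memory regardless of how little data they actually hold.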

HDFS: client read
For each block, the Namenode returns the datanodes that hold a copy, sorted by proximity to the client (p. 64).
HDFS coherency model
    Path p = new Path("p");
    FSDataOutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    out.sync();
    assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
After a successful sync(), the data written so far is guaranteed to be visible to new readers of the FS.
With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure.
Trade-off between data robustness and throughput.
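As a loose analogy only, the same write / flush / sync sequence can be tried against the local filesystem with plain JDK classes. This is not the HDFS API, and a local filesystem typically shows the new length even without the sync, so the sketch illustrates the shape of the calls rather than HDFS visibility semantics. The class and method names are made up for this example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Local-filesystem analogy for write + flush + sync; NOT the HDFS client API. */
class LocalSyncSketch {
    /** Writes content, forces it to the device, and returns the length visible afterwards. */
    static long writeSyncAndMeasure(Path file, String content) {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.flush();           // push any buffered bytes to the OS
            out.getFD().sync();    // ask the OS to commit them to the device
            return Files.size(file); // measured while the stream is still open
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the sequence against a temporary file and cleans up afterwards. */
    static long demo() {
        try {
            Path file = Files.createTempFile("sync-sketch", ".txt");
            try {
                return writeSyncAndMeasure(file, "content");
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each sync is a round trip to durable storage, which is where the robustness versus throughput trade-off comes from: calling it after every record is safest and slowest.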

Questions
What happens if the Namenode dies? [1]
What if a Datanode fails during a write? (p. 67)
In a Job, if one task keeps failing, Hadoop will give up and by default mark the Job as failed. However, you can specify a quality factor (the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties) and say that if only 99% of my input is mapped, it's good enough for me. [3]
Since everything is stored as String, how much space are we losing when storing binary data as base64?
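The base64 question can at least be answered with arithmetic: base64 maps every 3 raw bytes to 4 characters, so the overhead is roughly one third, plus padding. A minimal check using the JDK's own encoder (the class and method names are made up for this example):

```java
import java.util.Base64;

/** Measures how much larger binary data becomes once base64-encoded. */
class Base64OverheadSketch {
    /** Length in characters of the base64 encoding of rawBytes bytes (no line breaks). */
    static int encodedLength(int rawBytes) {
        return Base64.getEncoder().encodeToString(new byte[rawBytes]).length();
    }

    public static void main(String[] args) {
        for (int raw : new int[] {1, 3, 100, 3000}) {
            int encoded = encodedLength(raw);
            System.out.println(raw + " raw bytes -> " + encoded + " base64 chars ("
                + (100.0 * (encoded - raw) / raw) + "% overhead)");
        }
    }
}
```

For anything beyond a few bytes the overhead settles at about 33%, before any compression is applied.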

config
    <property>
      <name>size-weight</name>
      <value>${size},${weight}</value>
      <description>Size and weight</description>
    </property>
Tool interface
MiniDFSCluster and MiniMRCluster: a programmatic way of creating in-process clusters. Unlike the local job runner, these allow testing against the full HDFS and MapReduce machinery. Bear in mind too that tasktrackers in a mini cluster launch separate JVMs to run tasks in, which can make debugging more difficult.
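The `${size},${weight}` value in the property above relies on variable expansion. Here is a minimal sketch of that idea in plain Java. It is not Hadoop's Configuration class: it does a single pass, leaves unknown names untouched, and the sample values for `size` and `weight` are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Single-pass ${name} substitution; a sketch, NOT Hadoop's Configuration class. */
class VariableExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value from props; unknown names are left as-is. */
    static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = props.getOrDefault(m.group(1), m.group());
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("size", "12");      // assumed sample value
        props.put("weight", "heavy"); // assumed sample value
        System.out.println(expand("${size},${weight}", props));
    }
}
```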

debugging
    System.err.println("Temperature over 100 degrees for input: " + value);
    reporter.setStatus("Detected possibly corrupt record: see logs.");
    reporter.incrCounter(Temperature.OVER_100, 1);

debugging
Logs:
    System daemons (p. 256)
    HDFS audit (p. 280)
    MapReduce job history (p. 135)
    MapReduce task (p. 143)

Administration
    distcp
    Balancer
    Metrics: JMX / Ganglia
Fault tolerance: when a Task fails
When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a task fails four times, it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails four times (or whatever the maximum number of attempts is configured to), the whole job fails. (p. 160)
If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM that launched it will be killed), so you should take precautions to monitor for this scenario and kill orphaned processes by some other means.
Currently, Hadoop has no mechanism for dealing with failure of the jobtracker; it is a single point of failure. (p. 161)
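The retry rule described above boils down to a bounded loop. A minimal sketch in plain Java, where the `attemptSucceeds` predicate is a hypothetical stand-in for actually running a task attempt; it does not model rescheduling on a different tasktracker.

```java
import java.util.function.IntPredicate;

/** Bounded retry loop mirroring the max-attempts rule; not Hadoop scheduler code. */
class TaskRetrySketch {
    /**
     * Returns the number of the attempt that succeeded,
     * or -1 if all maxAttempts attempts failed (the job would then fail by default).
     */
    static int runWithRetries(int maxAttempts, IntPredicate attemptSucceeds) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A task that only works on its third attempt still succeeds with 4 attempts allowed.
        System.out.println(runWithRetries(4, attempt -> attempt == 3));
        // A task that never works exhausts its attempts.
        System.out.println(runWithRetries(4, attempt -> false));
    }
}
```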
Fault tolerance: bad data
When you have a bad record, these are the options:
You can detect the bad record and ignore it. Additionally, you can use a custom counter.
You can abort the job by throwing an exception.
Automatic mechanism for skipping bad records (for when you can't handle the problem because there is a bug in a 3rd-party library that you can't work around in your mapper or reducer):
    1. Task fails.
    2. Task fails.
    3. Skipping mode is enabled. Task fails, but the failed record is stored by the tasktracker.
    4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed in the previous attempt.
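The four steps above can be simulated in a few lines of plain Java. This is a sketch of the idea only, not Hadoop's skipping implementation: the `process` function, the "BAD" prefix convention, and the constant for when skipping mode turns on are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simulation of skipping mode across task attempts; NOT Hadoop's implementation. */
class SkippingModeSketch {
    static final int ATTEMPTS_BEFORE_SKIPPING = 2; // mirrors steps 1 and 2 above

    /** Hypothetical record processor that blows up on a bad record. */
    static String process(String record) {
        if (record.startsWith("BAD")) {
            throw new IllegalStateException("corrupt record: " + record);
        }
        return "ok:" + record;
    }

    /** Returns the output of the first successful attempt, or null if all attempts fail. */
    static List<String> runTask(List<String> records, int maxAttempts) {
        Set<Integer> skipped = new HashSet<>(); // record positions remembered between attempts
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean skipping = attempt > ATTEMPTS_BEFORE_SKIPPING;
            List<String> output = new ArrayList<>();
            int failedAt = -1;
            for (int i = 0; i < records.size(); i++) {
                if (skipping && skipped.contains(i)) {
                    continue; // skip a record that killed an earlier attempt
                }
                try {
                    output.add(process(records.get(i)));
                } catch (RuntimeException e) {
                    failedAt = i; // the attempt dies at this record
                    break;
                }
            }
            if (failedAt < 0) {
                return output;
            }
            if (skipping) {
                skipped.add(failedAt); // step 3: remember the offending record
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(runTask(Arrays.asList("a", "BAD1", "b"), 4));
        System.out.println(runTask(Arrays.asList("a", "BAD1", "BAD2"), 4));
    }
}
```

In this simulation only one bad record is learned per failed attempt, so an input with two bad records runs out of attempts at the default of four. That suggests the mechanism suits occasional bad records, or needs a higher attempt limit.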

Tuning
It is often a good idea to compress the map output as it is written to disk, since doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer.
The amount of memory given to the JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property. You should try to make this as large as possible for the amount of memory on your task nodes; the discussion in "Memory" on page 254 goes through the constraints to consider.
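To get a feel for why compressing map output pays off, here is a small size comparison using the JDK's gzip stream. It is only an illustration: Hadoop uses its own configurable codecs, the sample key/value lines are invented, and the sketch measures size only, not write speed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Compares raw and gzip-compressed sizes of made-up map output; not Hadoop codec code. */
class MapOutputCompressionSketch {
    /** Gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Builds invented "year<TAB>temperature" lines, similar in shape to map output. */
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("1950\t").append(i % 50).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

Repetitive key/value text like this compresses very well, which is typical of intermediate map output and is what reduces both disk writes and the data shuffled to the reducers.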
Incomplete presentation :~(
Sorry, this was a presentation that I was making but did not have the time to finish. Nevertheless, I felt like sharing it...
Based on the O'Reilly book Hadoop: The Definitive Guide (06-2009).
