CCD-410 Cloudera Certified Developer for Apache Hadoop (CCDH)

Ver 2.20 Q&A 60

QUESTION NO: 1
When is the earliest point at which the reduce method of a given Reducer can be called?

A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.

Answer: C

Explanation:
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

Note: The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data, since it is only a data transfer. Sort and reduce, on the other hand, can only start once all the mappers are done.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which helps if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data. Another job that starts later and would actually use the reduce slots can't use them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis.

Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When is the reducers are started in a MapReduce job?
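The effect of the slow-start threshold can be sketched in a few lines of Java. This is a simplified model rather than Hadoop code: the class and method names are hypothetical, and rounding the fraction up to a whole number of tasks is an assumption made for illustration.

```java
// Simplified model of the reducer slow-start threshold (not Hadoop code).
class SlowstartSketch {
    // Number of map tasks that must finish before reducers may be scheduled.
    // Assumption: the fraction is applied to the total and rounded up.
    static int mapsRequired(double slowstartFraction, int totalMaps) {
        return (int) Math.ceil(slowstartFraction * totalMaps);
    }

    // True once enough maps have completed for reducers to begin copying data.
    static boolean reducersMayStart(double slowstartFraction, int totalMaps, int completedMaps) {
        return completedMaps >= mapsRequired(slowstartFraction, totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 100;
        for (double fraction : new double[] {0.0, 0.5, 1.0}) {
            System.out.println("fraction " + fraction + " -> reducers may start after "
                    + mapsRequired(fraction, totalMaps) + " completed maps");
        }
    }
}
```

Whatever the fraction, this only governs when reducers begin copying map output; the reduce method itself still waits for every mapper to finish.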

QUESTION NO: 2
Which describes how a client reads a file from HDFS?

A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.

Answer: A

Explanation:
Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. The NameNode does not query the DataNodes on the client's behalf, and file data never passes through it.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client communicates with HDFS?

QUESTION NO: 3
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?

A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>

Answer: D

Explanation:
Hadoop has no separate Combiner interface. A combiner is written as a Reducer, with type parameters in the order input key, input value, output key, output value.

QUESTION NO: 4
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?

A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred

Answer: D

Explanation:
Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)

QUESTION NO: 5
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?

A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Answer: A

Explanation:
Reducer has 3 primary phases:

1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.

3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION NO: 6
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:

A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.

Answer: D

Explanation:
Reducer has 3 primary phases:

1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.

3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION NO: 7
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.

Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:

A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts

Answer: E

Explanation:
There will be four failed task attempts for each of the five file splits. Each attempt fails at the first control character it meets, so the number of control characters in a split does not matter.
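The arithmetic behind that count can be checked with a toy retry model. This is a sketch and not Hadoop code: the class and method names are hypothetical, and it assumes every task is allowed to use up all of its attempts before the job is abandoned.

```java
// Toy model of map task retries (not Hadoop code). Every attempt on a split
// that contains a control character throws, so that task fails on each retry.
class RetrySketch {
    static int failedAttempts(int[] controlCharsPerSplit, int maxAttempts) {
        int failed = 0;
        for (int controlChars : controlCharsPerSplit) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (controlChars > 0) {
                    failed++;   // the attempt dies at the first control character
                } else {
                    break;      // a clean split succeeds on its first attempt
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        int[] splits = {2, 2, 2, 2, 4};   // control characters in each of the five splits
        System.out.println("failed attempts: " + failedAttempts(splits, 4));
    }
}
```

Under these assumptions the five splits with four attempts each give 5 x 4 = 20 failed attempts, regardless of how the twelve characters are distributed.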

QUESTION NO: 8
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.

Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?

A. combine
B. map
C. init
D. configure

Answer: D

Explanation:
See 3) below. Here is an illustrative example of how to use the DistributedCache:

    // Setting up the cache for the application

    1. Copy the requisite files to the FileSystem:

        $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
        $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
        $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
        $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
        $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
        $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

    2. Set up the application's JobConf:

        JobConf job = new JobConf();
        DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
        DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
        DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

    3. Use the cached files in the Mapper or Reducer:

        public static class MapClass extends MapReduceBase
                implements Mapper<K, V, K, V> {

            private Path[] localArchives;
            private Path[] localFiles;

            // Called once per task, before any call to map()
            public void configure(JobConf job) {
                // Get the cached archives/files
                localArchives = DistributedCache.getLocalCacheArchives(job);
                localFiles = DistributedCache.getLocalCacheFiles(job);
            }

            public void map(K key, V value,
                            OutputCollector<K, V> output, Reporter reporter)
                    throws IOException {
                // Use data from the cached archives/files here
                // ...
                // ...
                output.collect(k, v);
            }
        }

Reference: org.apache.hadoop.filecache, Class DistributedCache

QUESTION NO: 9
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?

A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner

Answer: F

Explanation:
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on the output of each individual mapper. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
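The saving a combiner brings can be illustrated with plain Java collections. The sketch below is not Hadoop code: the class name and the word-count scenario are assumptions chosen for illustration, and summing is used because it is commutative and associative.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of local aggregation on one mapper's output.
class CombinerSketch {
    // Each emitted word stands for a (word, 1) pair; the combiner sums the
    // counts per word before anything is sent over the network.
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : mapperOutputKeys) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "fox", "the", "dog", "the");
        Map<String, Integer> combined = combine(emitted);
        System.out.println("pairs before combining: " + emitted.size());
        System.out.println("pairs after combining:  " + combined.size() + " " + combined);
    }
}
```

Here five intermediate pairs shrink to three, and the reducer still arrives at the same totals because addition can be applied in any order and grouping.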

QUESTION NO: 10
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.

A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.

Answer: A

Explanation:
Note:
- Join algorithms in MapReduce
  A) Reduce-side join
  B) Map-side join
  C) In-memory join
     - Striped variant
     - Memcached variant
- Which join to use?
  - In-memory join > map-side join > reduce-side join
  - Limitations of each?
    In-memory join: memory
    Map-side join: sort order and partitioning
    Reduce-side join: general purpose
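The general-purpose reduce-side join named above can be sketched with plain Java collections. This is a model of the idea rather than a Hadoop job: the class name, the "L"/"R" tags, and the assumption that every line has a join key in its first column followed by at least one more column are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a reduce-side inner join on the first CSV column.
class ReduceSideJoinSketch {
    // "Map" step: key each record on its first column and tag it with its table.
    static void mapTable(String tag, List<String> csvLines, Map<String, List<String[]>> shuffled) {
        for (String line : csvLines) {
            String[] fields = line.split(",", 2);   // fields[0] = join key, fields[1] = rest
            shuffled.computeIfAbsent(fields[0], k -> new ArrayList<>())
                    .add(new String[] {tag, fields[1]});
        }
    }

    // "Reduce" step: for each key, pair every left record with every right record.
    static List<String> join(List<String> left, List<String> right) {
        Map<String, List<String[]>> shuffled = new TreeMap<>();   // TreeMap models sorted keys
        mapTable("L", left, shuffled);
        mapTable("R", right, shuffled);
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : shuffled.entrySet()) {
            for (String[] l : group.getValue()) {
                if (!l[0].equals("L")) continue;
                for (String[] r : group.getValue()) {
                    if (r[0].equals("R")) {
                        joined.add(group.getKey() + "," + l[1] + "," + r[1]);
                    }
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("1,alice", "2,bob");
        List<String> orders = Arrays.asList("1,book", "1,pen", "3,cup");
        System.out.println(join(users, orders));
    }
}
```

Neither table has to fit in one mapper's memory for this approach, since all records for a key meet only at the reducer. That is why it is the general-purpose option, at the cost of shuffling both tables.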

QUESTION NO: 11
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?

A. Intermediate data is streamed across the network from Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

Answer: C

Explanation:
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output (intermediate key-value data) stored?

QUESTION NO: 12
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?

A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users' clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.

Answer: A

Explanation:
Flume is built to collect log data from many servers and push it into HDFS, which fits a farm of 200 web servers. The web servers are not part of the Hadoop cluster, so they cannot act as mappers. Once the logs are in HDFS they can be parsed with MapReduce.

Hadoop MapReduce for Parsing Weblogs
Here are the steps for parsing a log file using Hadoop MapReduce:

Load log files into the HDFS location using this Hadoop command:

    hadoop fs -put <local file path of weblogs> <hadoop HDFS location>

The Opencsv2.3.jar framework is used for parsing log records.

Below is the Mapper program for parsing the log file from the HDFS location.

    public static class ParseMapper
            extends Mapper<Object, Text, NullWritable, Text> {

        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the log line on spaces, honouring double-quoted fields
            CSVParser parse = new CSVParser(' ', '\"');
            String[] sp = parse.parseLine(value.toString());
            int spSize = sp.length;
            // Re-join the fields as one comma-separated record
            StringBuffer rec = new StringBuffer();
            for (int i = 0; i < spSize; i++) {
                rec.append(sp[i]);
                if (i != (spSize - 1))
                    rec.append(",");
            }
            word.set(rec.toString());
            context.write(NullWritable.get(), word);
        }
    }

The command below is the Hadoop-based log parse execution. You can add extra parsing methods in the class. Be sure to create a new JAR with any change and move it to the Hadoop distributed job tracker system.

    hadoop jar <path of logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>

The output file is stored in the HDFS location, and the output file name starts with "part-".

QUESTION NO: 13
MapReduce v2 (MRv2/YARN) is designed to address which two issues?

A. Single point of failure in the NameNode.
B. Resource pressure on the JobTracker.
C. HDFS latency.
D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API.

Answer: B, D

Explanation:
YARN (Yet Another Resource Negotiator), as an aspect of Hadoop, has two major kinds of benefits:
- (D) The ability to use programming frameworks other than MapReduce.
  - MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative.
- Scalability, no matter what programming framework you use.

Note:
- The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
- (B) The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) the JobTracker:
  - Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
  - Managing the parallelized execution of any specific job. Under YARN, this will be done separately for each job.

The current Hadoop MapReduce system is fairly scalable: Yahoo runs 5000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5 to 2 million jobs/cluster/month. Still, YARN will remove scalability bottlenecks.

Reference: Apache Hadoop YARN - Concepts & Applications

QUESTION NO: 14
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.

Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?

A. hadoop "mapred.job.name=Example" MyDriver input output
B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty ("mapred.job.name=Example") MyDriver input output

Answer: C

Explanation:
Configure the property using the -D key=value notation:

    -D mapred.job.name='My Job'

You can list a whole bunch of options by calling the streaming jar with just the -info argument.

Reference: Python hadoop streaming : Setting a job name

QUESTION NO: 15
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).

Identify what determines the data types used by the Mapper for a given job.

A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
B. The data types specified in HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determine the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.

Answer: D

Explanation:
The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.

Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.

Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD

QUESTION NO: 16
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?

A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker

Answer: B

Explanation:
The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The NodeManager is the per-machine agent that launches containers, monitors their resource usage (CPU, memory, disk, network) and reports it to the ResourceManager. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the tasks.

Reference: Apache Hadoop YARN - Concepts & Applications

QUESTION NO: 17
Which best describes how TextInputFormat processes input files and line breaks?

A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

Answer: A

Explanation:
As the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it will affect seek time, it will be split into several splits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. Then a new map task is created per FileSplit.

When an individual map task starts it will open a new output writer per configured reduce task. It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also handle records that may be split on the FileSplit boundary. For example, TextInputFormat will read the last line of the FileSplit past the split boundary and, when reading any FileSplit other than the first, TextInputFormat ignores the content up to the first newline. A broken line is therefore read once, by the reader of the split in which it begins.

Reference: How Map and Reduce operations are actually carried out
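The boundary rule described in the explanation can be modelled in a few lines of Java. This is a toy sketch and not Hadoop's record reader: the class and method names are hypothetical, an ASCII string stands in for the file's bytes, and stepping back one position before skipping is an assumption made so that a split starting exactly at a line start keeps that line.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of line-oriented split reading (not Hadoop code).
// Offsets are character positions in an ASCII string standing in for file bytes.
class SplitLineSketch {
    // Returns the lines owned by the split covering [start, end).
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: discard everything up to and including the
            // first newline at or after start - 1, because the previous
            // split's reader is responsible for that (possibly partial) line.
            int newline = data.indexOf('\n', start - 1);
            pos = (newline < 0) ? data.length() : newline + 1;
        }
        List<String> lines = new ArrayList<>();
        // Read every line that begins before the split's end, running past
        // the boundary if needed to finish the last one.
        while (pos < end && pos < data.length()) {
            int newline = data.indexOf('\n', pos);
            int lineEnd = (newline < 0) ? data.length() : newline;
            lines.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\n";
        int[][] splits = {{0, 8}, {8, 16}, {16, 20}};
        for (int[] s : splits) {
            System.out.println("split [" + s[0] + "," + s[1] + ") -> "
                    + readSplit(data, s[0], s[1]));
        }
    }
}
```

With these boundaries, "bravo" straddles the first boundary and is returned only by the first split, while "charlie" straddles the second and is returned only by the second split, so each line is read exactly once.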

QUESTION NO: 18
For each input key-value pair, mappers can emit:

A. As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.

Answer: E

Explanation:
Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

Reference: Hadoop Map-Reduce Tutorial

QUESTION NO: 19
You have the following key-value pairs as output from your Map task:

(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)

How many keys will be passed to the Reducer's reduce method?

A. Six
B. Five
C. Four
D. Two
E. One
F. Three

Answer: B

Explanation:
The two (the, 1) pairs share a key, so they are grouped and passed to reduce as a single key with two values. That leaves five distinct keys: the, fox, faster, than, dog.
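Grouping the six pairs by key can be sketched with a sorted map. This is plain Java rather than Hadoop code, and the class name is hypothetical; a TreeMap is used only to model keys arriving in sorted order with their values collected together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java model of how map output is grouped by key before reduce is called.
class GroupByKeySketch {
    static SortedMap<String, List<Integer>> group(String[] keys, int[] values) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] keys = {"the", "fox", "faster", "than", "the", "dog"};
        int[] values = {1, 1, 1, 1, 1, 1};
        SortedMap<String, List<Integer>> grouped = group(keys, values);
        // One reduce call per entry: the key plus all of its values.
        System.out.println(grouped.size() + " keys: " + grouped);
    }
}
```

Each map entry corresponds to one call to reduce, so the size of the map is the number of keys the reducer sees.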

QUESTION NO: 20
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?

A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming

Answer: C

Explanation:
Sqoop imports data from relational databases, such as the OLTP system holding the user profiles, into HDFS, where the records can be joined with the web logs. Pig's LOAD command only reads data that is already available to Hadoop.

On the web log side, Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs. We use Pig scripts for sifting through the data and to extract useful information from the Web logs. We load the log file into Pig using the LOAD command.

    raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);

Note 1: Data Flow and Components
- Content will be created by multiple Web servers and logged on local hard discs. This content will then be pushed to HDFS using the Flume framework. Flume has agents running on Web servers; these are machines that collect data intermediately using collectors and finally push that data to HDFS.
- Pig scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job solution). These scripts analyze the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use storage implementations for other repositories as well, such as HBase, MongoDB, etc. We have also tried the solution with HBase. Pig scripts can either push this data to HDFS, in which case MR jobs will be required to read and push this data into HBase, or Pig scripts can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework's applicability for log analysis at large scale.
- The database HBase will have the data processed by Pig scripts ready for reporting and further slicing and dicing.
- The data-access Web service is a REST-based service that eases access and integration with data clients. The client can be written in any language that can access a REST-based API. These clients could be BI- or UI-based clients.

Note 2: The Log Analysis Software Stack
- Hadoop is an open source framework that allows users to process very large data in parallel. It's based on the framework that supports the Google search engine. The Hadoop core is mainly divided into two modules:
  1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.
  2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bonded with HDFS.
- The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets; HBase can save very large tables having millions of rows. It's a distributed database and can also keep multiple versions of a single row.
- The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language.
- Flume is a distributed, reliable and available service for collecting, aggregating and moving a large amount of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for further processing. It's a data flow solution, where there is an originator and destination for each node, and is divided into Agent and Collector tiers for collecting logs and pushing them to destination storage.

Reference: Hadoop and Pig for Large-Scale Web Log Analysis

QUESTION NO: 21
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?

A. You will not be able to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
D. There are no concerns with this approach. It is always advisable to use multiple reducers.

Answer: C

Explanation:
Multiple reducers and total ordering
If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file. To do this you'll need total ordering.

Reference: Sorting text files with MapReduce
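Why hash partitioning breaks global order can be seen in a small plain-Java model. The class and method names are hypothetical, and String.hashCode stands in for the hash of a real Hadoop key type; the partition formula (hash code masked to non-negative, modulo the number of reducers) follows the approach of the default HashPartitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Plain-Java model of hash partitioning across reducers (not Hadoop code).
class HashPartitionSketch {
    // Mask off the sign bit so the result is a valid reducer index.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Each reducer sorts its own keys; the output files are then concatenated
    // in reducer order (part 0, part 1, ...).
    static List<String> concatenatedOutput(List<String> keys, int numReducers) {
        List<TreeSet<String>> parts = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            parts.add(new TreeSet<>());
        }
        for (String key : keys) {
            parts.get(partition(key, numReducers)).add(key);
        }
        List<String> out = new ArrayList<>();
        for (TreeSet<String> part : parts) {
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d");
        System.out.println("1 reducer:  " + concatenatedOutput(keys, 1));
        System.out.println("2 reducers: " + concatenatedOutput(keys, 2));
    }
}
```

Each part is sorted on its own, but with two reducers the keys interleave across parts, so joining the files end to end does not give a sorted result. A total-order partitioner avoids this by assigning contiguous key ranges to reducers.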

QUESTION NO: 22
Given a directory of files with the following structure: line number, tab character, string:

Example:
1abialkjfjkaoasdfjksdlkjhqweroij
2kadfjhuwqounahagtnbvaswslmnbfgy
3kjfteiomndscxeqalkzhtopedkfsikj

You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?

A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat

Answer: C

Explanation:
The files are plain text with a key and a value separated by a tab on each line, so a text-based key/value input format is the fit: each line becomes one record, split at the tab into key and value. In Hadoop the class that does this is named KeyValueTextInputFormat.

Note: SequenceFileInputFormat reads the binary SequenceFile format, not text. It is the right choice when the output format of a first MR job is SequenceFileOutputFormat, which stores the keys/values output from the reducer in a binary format that can then be read back in by a second MR job using SequenceFileInputFormat.

Reference: How to parse CustomWritable from text in Hadoop
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop (see answer 1 and then see comment #1 for it)

QUESTION NO: 23
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?

A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip

Answer: C

Explanation:
The usage of the jar command is like this:

    Usage: hadoop jar <jar> [mainClass] args...

If you want the commons-math3.jar to be available for all the tasks you can do any one of these:
1. Copy the jar file into the $HADOOP_HOME/lib dir, or
2. Use the generic option -libjars.

QUESTION NO: 24
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously and the results of the task that finishes first are used. This is called:

A. Combine
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution

Answer: E

Explanation:
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Reference: Apache Hadoop, Module 4: MapReduce

Note:
- Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed. Failed tasks are tasks that error out.
- There are a few reasons Hadoop can kill tasks on its own initiative:
  a) The task does not report progress during the timeout (default is 10 minutes).
  b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).
  c) Speculative execution makes the results of the task unnecessary, since it has already completed elsewhere.

Reference: Difference failed tasks vs killed tasks

QUESTION NO: 25
For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Answer: C
Explanation: The Reducer reduces a set of intermediate values which share a key to a smaller set of values. Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list and combines these values together. In the simplest case it returns a single output value, but the reduce method may emit zero, one, or many key-value pairs for each key. The output key and value types are fixed for the whole job (they are declared as the job's output key and value classes) and need not match the intermediate types.
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce

QUESTION NO: 26
What data does a Reducer's reduce method process?
A. All the data in a single input file.
B. All data produced by a single mapper.
C. All data for a given key, regardless of which mapper(s) produced it.
D. All data for a given value, regardless of which mapper(s) produced it.
Answer: C
Explanation: Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. All values with the same key are presented to a single reduce task.
Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce

QUESTION NO: 27
All keys used for intermediate output from mappers must:
A. Implement a splittable compression algorithm.
B. Be a subclass of FileInputFormat.
C. Implement WritableComparable.
D. Override isSplitable.
E. Implement a comparator for speedy sorting.
Answer: C
Explanation: The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Reference: MapReduce Tutorial
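As a rough illustration of what the WritableComparable contract asks of a key, the sketch below pairs serialization (write/readFields) with ordering (compareTo) using only JDK types. WordKey and roundTrip are hypothetical names chosen for this example, and the class deliberately avoids importing Hadoop so that it runs standalone; a real key would instead declare that it implements org.apache.hadoop.io.WritableComparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type. A real Hadoop key would declare
// "implements WritableComparable<WordKey>"; plain Comparable is used here
// (an assumption of this sketch) so it compiles without Hadoop.
class WordKey implements Comparable<WordKey> {
    private String word = "";

    public WordKey() { }                          // no-arg constructor, used when deserializing
    public WordKey(String word) { this.word = word; }

    public String get() { return word; }

    // Serialize this key's fields to a binary stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    // Repopulate this instance from a binary stream.
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    // Ordering applied when keys are sorted.
    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof WordKey && word.equals(((WordKey) o).word);
    }

    @Override
    public int hashCode() {
        return word.hashCode();
    }

    // Demo helper: serialize a key, then read it back into a fresh instance.
    public static WordKey roundTrip(WordKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordKey copy = new WordKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordKey apple = new WordKey("Apple");
        System.out.println(roundTrip(apple).get());
        System.out.println(apple.compareTo(new WordKey("Banana")) < 0);
    }
}
```

The no-argument constructor matters because a deserializer first creates an empty instance and then fills it through readFields.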

QUESTION NO: 28
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker assigns each map task to a TaskTracker?
A. The amount of RAM installed on the TaskTracker node.
B. The amount of free disk space on the TaskTracker node.
C. The number and speed of CPU cores on the TaskTracker node.
D. The average system load on the TaskTracker node over the past fifteen (15) minutes.
E. The location of the InputSplit to be processed in relation to the location of the node.
Answer: E
Explanation: The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if none is available, it looks for an empty slot on a machine in the same rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How JobTracker schedules a task?
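The locality preference described in the explanation (same node first, then same rack, then anywhere) can be sketched in plain Java. This is a simplified model and not the JobTracker's actual code: LocalityScheduler, PendingTask, and pickTask are hypothetical names, and real schedulers weigh further factors.

```java
import java.util.Arrays;
import java.util.List;

class LocalityScheduler {
    // Hypothetical description of a pending map task: the hosts and racks
    // that hold replicas of the task's input split.
    public static final class PendingTask {
        final String name;
        final List<String> replicaHosts;
        final List<String> replicaRacks;

        public PendingTask(String name, List<String> replicaHosts, List<String> replicaRacks) {
            this.name = name;
            this.replicaHosts = replicaHosts;
            this.replicaRacks = replicaRacks;
        }
    }

    // Returns the name of the task to give to the tracker that reported a
    // free slot, or null when nothing is pending.
    public static String pickTask(List<PendingTask> pending, String trackerHost, String trackerRack) {
        for (PendingTask task : pending) {          // 1. data-local: split stored on this node
            if (task.replicaHosts.contains(trackerHost)) {
                return task.name;
            }
        }
        for (PendingTask task : pending) {          // 2. rack-local: split stored in this rack
            if (task.replicaRacks.contains(trackerRack)) {
                return task.name;
            }
        }
        return pending.isEmpty() ? null : pending.get(0).name;   // 3. any remaining task
    }

    public static void main(String[] args) {
        List<PendingTask> pending = Arrays.asList(
                new PendingTask("map-1", Arrays.asList("hostA"), Arrays.asList("rack1")),
                new PendingTask("map-2", Arrays.asList("hostC"), Arrays.asList("rack2")));
        System.out.println(pickTask(pending, "hostC", "rack2"));   // data-local match
        System.out.println(pickTask(pending, "hostB", "rack1"));   // rack-local match
    }
}
```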

QUESTION NO: 29
Identify which best defines a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D
Explanation: SequenceFile is a flat file consisting of binary key/value pairs. There are 3 different SequenceFile formats:
* Uncompressed key/value records.
* Record compressed key/value records - only 'values' are compressed here.
* Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile

QUESTION NO: 30
A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C?
A. The file will be marked as corrupted if data node B fails during the creation of the file.
B. Each data node locks the local file to prohibit concurrent readers and writers of the file.
C. Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
D. The file can be accessed if at least one of the data nodes storing the file is available.
Answer: D
Explanation: HDFS keeps three copies of a block on three different datanodes to protect against true data corruption. HDFS also tries to distribute these three replicas on more than one rack to protect against data availability issues. The fact that HDFS actively monitors any failed datanode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes is sufficient to avoid corrupted files.
Note: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS: 2 copies are stored on datanodes in the same rack and the 3rd copy on a different rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the HDFS Blocks are replicated?

QUESTION NO: 31
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplitable to always return false.
Answer: D
Explanation: FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure input files are not split up and are processed as a whole by Mappers.
Reference: org.apache.hadoop.mapreduce.lib.input, Class FileInputFormat<K,V>

QUESTION NO: 32
Which process describes the lifecycle of a Mapper?
A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.
Answer: B
Explanation: For each map task that runs, the TaskTracker creates a new instance of your mapper, and that instance processes every record in the task's input split (see the loop in run() below).
Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of Extraction and Transformation functions on the Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different, Key/Value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:

    /**
     * Expert users can override this method for more complete control over the
     * execution of the Mapper.
     * @param context
     * @throws IOException
     */
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }

setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Reference: Hadoop/MapReduce/Mapper

QUESTION NO: 33
Determine which best describes when the reduce method is first called in a MapReduce job?
A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
Answer: B
Explanation:
* In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.
* The progress calculation also takes into account the data transfer done by the reduce process; therefore the reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is still called only after all the mappers have finished.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When are the reducers started in a MapReduce job?

QUESTION NO: 34
You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:

    output.collect(new Text("Apple"), new Text("Red"));
    output.collect(new Text("Banana"), new Text("Yellow"));
    output.collect(new Text("Apple"), new Text("Yellow"));
    output.collect(new Text("Cherry"), new Text("Red"));
    output.collect(new Text("Apple"), new Text("Green"));

How many times will the Reducer's reduce method be invoked?
A. 6
B. 3
C. 1
D. 0
E. 5
Answer: B
Explanation: reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called:

    out.collect(new Text("Car"), new Text("Subaru"));
    out.collect(new Text("Car"), new Text("Honda"));
    out.collect(new Text("Car"), new Text("Ford"));
    out.collect(new Text("Truck"), new Text("Dodge"));
    out.collect(new Text("Truck"), new Text("Chevy"));

Then reduce() would be called twice, with the pairs:

    reduce(Car, <Subaru, Honda, Ford>)
    reduce(Truck, <Dodge, Chevy>)

Reference: Mapper output.collect()?
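To check the count for the question's own five collect calls, the grouping step can be modeled in plain Java. This is only a sketch of the grouping done during shuffle and sort, not Hadoop code: ReduceCallCounter and groupByKey are hypothetical names, and a TreeMap stands in for the framework's key sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ReduceCallCounter {
    // Groups emitted (key, value) pairs by key, in sorted key order.
    // Each resulting entry corresponds to one reduce() invocation.
    public static SortedMap<String, List<String>> groupByKey(String[][] pairs) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[][] emitted = {
            {"Apple", "Red"}, {"Banana", "Yellow"}, {"Apple", "Yellow"},
            {"Cherry", "Red"}, {"Apple", "Green"}
        };
        SortedMap<String, List<String>> grouped = groupByKey(emitted);
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            System.out.println("reduce(" + entry.getKey() + ", " + entry.getValue() + ")");
        }
        System.out.println("reduce calls: " + grouped.size());
    }
}
```

The five pairs collapse to the three distinct keys Apple, Banana, and Cherry, so reduce is invoked three times. The values here happen to keep emission order; Hadoop itself makes no such promise.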

QUESTION NO: 35
To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this?
A. Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
Answer: D
Explanation: Hadoop has a distributed cache mechanism to make files that may be needed by Map/Reduce jobs available locally.
Use Case
Let's understand our use case in a bit more detail so that we can follow the code snippet. We have a key-value file that we need to use in our map tasks. For simplicity, let's say we need to replace all keywords that we encounter during parsing with some other value. So what we need is:
* A key-values file (let's use a Properties file)
* The Mapper code that uses it
The mapper below uses the new API, where setup() plays the role of the old API's configure() method:

    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Properties;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

        Properties cache;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
            Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

            if (localCacheFiles != null) {
                // expecting only a single file here
                for (int i = 0; i < localCacheFiles.length; i++) {
                    Path localCacheFile = localCacheFiles[i];
                    cache = new Properties();
                    // load once per task, and close the reader when done
                    try (FileReader reader = new FileReader(localCacheFile.toString())) {
                        cache.load(reader);
                    }
                }
            } else {
                // do your error handling here
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // use the cache here
            // if value contains some attribute, cache.get(<value>)
            // do some action or replace with something else
        }
    }

Note:
* Distribute application-specific large, read-only files efficiently. DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications. Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem at the path specified by the URL.
Reference: Using Hadoop Distributed Cache

QUESTION NO: 36
In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?
A. The values are in sorted order.
B. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
C. The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.
D. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.
Answer: B
Explanation: Only the keys are sorted; the framework makes no guarantee about the order of the values for a given key.
Note:
* Input to the Reducer is the sorted output of the mappers.
* The framework calls the application's Reduce function once for each unique key in the sorted order.
* Example: For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

QUESTION NO: 37
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
A. Processor and network I/O
B. Disk I/O and network I/O
C. Processor and RAM
D. Processor and disk I/O
Answer: B
Explanation: Every input character becomes its own key-value record, so the intermediate data is much larger than the input. That data is written to local disk on the map side and then copied across the network to the reducers.

QUESTION NO: 38
You want to count the number of occurrences of each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case, and why or why not?
A. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
B. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.
C. No, because the Reducer and Combiner are separate interfaces.
D. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value.
E. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.
Answer: A
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or may not execute a combiner. Also, if required, it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
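The claim that a summing reducer can double as a combiner can be checked with a small plain-Java model. The sketch below is not Hadoop code; CombinerSketch and its methods are hypothetical names, each input string stands for one mapper's split, and the combiner is applied exactly once per mapper (Hadoop may apply it zero or several times).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Map step: emit (word, 1) for every whitespace-separated token.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sum per key. Because addition is associative and commutative, the same
    // function serves as both the combiner and the reducer.
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Records that would cross the network to the reducers.
    public static List<Map.Entry<String, Integer>> shuffle(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                shuffled.addAll(sumByKey(mapOutput).entrySet());   // local pre-aggregation
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    // Full job: shuffle, then reduce.
    public static Map<String, Integer> run(List<String> splits, boolean useCombiner) {
        return sumByKey(shuffle(splits, useCombiner));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println("without combiner: " + shuffle(splits, false).size() + " records shuffled");
        System.out.println("with combiner:    " + shuffle(splits, true).size() + " records shuffled");
        System.out.println("same result: " + run(splits, true).equals(run(splits, false)));
    }
}
```

The final counts are identical either way; only the number of shuffled records changes, which is why a job must not rely on the combiner running.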

QUESTION NO: 39
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
A. TaskTracker
B. NameNode
C. DataNode
D. JobTracker
E. Secondary NameNode
Answer: D
Explanation: JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster. The JobTracker runs in its own JVM process. In a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop Wiki):
* Client applications submit jobs to the JobTracker.
* The JobTracker talks to the NameNode to determine the location of the data.
* The JobTracker locates TaskTracker nodes with available slots at or near the data.
* The JobTracker submits the work to the chosen TaskTracker nodes.
* The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
* A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
* When the work is completed, the JobTracker updates its status.
* Client applications can poll the JobTracker for information.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?

QUESTION NO: 40
Which project gives you a distributed, scalable data store that allows you random, realtime read/write access to hundreds of terabytes of data?
A. HBase
B. Hue
C. Pig
D. Hive
E. Oozie
F. Flume
G. Sqoop
Answer: A
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
* Linear and modular scalability.
* Strictly consistent reads and writes.
* Automatic and configurable sharding of tables.
* Automatic failover support between RegionServers.
* Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
* Easy to use Java API for client access.
* Block cache and Bloom Filters for real-time queries.
* Query predicate push down via server side Filters.
* Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
* Extensible JRuby-based (JIRB) shell.
* Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)

QUESTION NO: 41
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
A. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
B. They would see the current state of the file, up to the last bit written by the command.
C. They would see the current state of the file through the last completed block.
D. They would see no content until the whole file is written and closed.
Answer: D
Explanation:
Note:
* put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.

QUESTION NO: 42
Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?
A. Oozie
B. Flume
C. Pig
D. Hue
E. Hive
F. Sqoop
G. fuse-dfs
Answer: F
Explanation: Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
* Imports individual tables or entire databases to files in HDFS
* Generates Java classes to allow you to interact with your imported data
* Provides the ability to import from SQL databases straight into your Hive data warehouse
Note: Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function.
Note:
* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/ (Data Movement between hadoop and relational databases, second paragraph)

QUESTION NO: 43
You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it's given a path object representing this directory?
A. Four, all files will be processed
B. Three, the pound sign is an invalid character for HDFS file names
C. Two, file names with a leading period or underscore are ignored
D. None, the directory cannot be named jobdata
E. One, no special characters can prefix the name of an input file
Answer: C
Explanation: Files starting with '_' are considered 'hidden', like Unix files starting with '.'. The '#' character is allowed in HDFS file names.
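The hidden-file rule from the explanation is easy to model. The sketch below applies it to the question's four file names in plain Java; HiddenFileFilterSketch, isVisible, and visibleFiles are hypothetical names and stand in for the filtering that FileInputFormat performs internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HiddenFileFilterSketch {
    // A name is treated as hidden when it starts with '_' or '.'.
    public static boolean isVisible(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    // Keeps only the names that would be picked up as job input.
    public static List<String> visibleFiles(List<String> fileNames) {
        List<String> visible = new ArrayList<>();
        for (String name : fileNames) {
            if (isVisible(name)) {
                visible.add(name);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> jobdata = Arrays.asList("_first.txt", "second.txt", ".third.txt", "#data.txt");
        System.out.println(visibleFiles(jobdata));
    }
}
```

Only second.txt and #data.txt survive the filter, which matches answer C.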

QUESTION NO: 44
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
A. There is no difference in output between the two settings.
B. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
C. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
D. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
Answer: D
Explanation:
* It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
Note: Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
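A small model makes the difference in output layout concrete: with zero reducers each map task writes its own file, and otherwise each reducer does. OutputFileNames and outputFiles are hypothetical names; the part-m-NNNNN and part-r-NNNNN patterns follow the new-API naming convention, and marker files such as _SUCCESS are ignored in this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class OutputFileNames {
    // Lists the data files a job would leave in its output directory.
    public static List<String> outputFiles(int mapTasks, int reduceTasks) {
        List<String> names = new ArrayList<>();
        if (reduceTasks == 0) {
            // Map-only job: every map task writes its output directly.
            for (int i = 0; i < mapTasks; i++) {
                names.add(String.format("part-m-%05d", i));
            }
        } else {
            // Otherwise each reduce task writes exactly one file.
            for (int i = 0; i < reduceTasks; i++) {
                names.add(String.format("part-r-%05d", i));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumes one map task per input file, i.e. 100 small files.
        System.out.println(outputFiles(100, 0).size() + " files with zero reducers");
        System.out.println(outputFiles(100, 1) + " with one reducer");
    }
}
```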

QUESTION NO: 45
A combiner reduces:
A. The number of values across different keys in the iterator supplied to a single reduce method call.
B. The amount of intermediate data that must be transferred between the mapper and reducer.
C. The number of input files a mapper must process.
D. The number of output files a reducer must produce.
Answer: B
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or may not execute a combiner. Also, if required, it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?

QUESTION NO: 46
In a MapReduce job with 500 map tasks, how many map task attempts will there be?
A. It depends on the number of reducers in the job.
B. Between 500 and 1000.
C. At most 500.
D. At least 500.
E. Exactly 500.
Answer: D
Explanation: From the Cloudera Training Course:
A task attempt is a particular instance of an attempt to execute a task.
- There will be at least as many task attempts as there are tasks.
- If a task attempt fails, another will be started by the JobTracker.
- Speculative execution can also result in more task attempts than completed tasks.

QUESTION NO: 47
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
A. Health states checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Answer: B, C
Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
Note: The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) the JobTracker:
* Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
* Managing the parallelized execution of any specific job. Under YARN, this will be done separately for each job.
Reference: Apache Hadoop YARN - Concepts & Applications

QUESTION NO: 48
What types of algorithms are difficult to express in MapReduce v1 (MRv1)?
A. Algorithms that require applying the same mathematical function to large numbers of individual binary records.
B. Relational operations on large amounts of structured and semi-structured data.
C. Algorithms that require global, shared state.
D. Large-scale graph algorithms that require one-step link traversal.
E. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
Answer: C
Explanation: See 3) below.
Limitations of MapReduce - where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not suited, and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the sum of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire MapReduce process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data or there will be a serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).
Reference: Limitations of MapReduce - where not to use MapReduce

QUESTION NO: 49
In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
A. It returns a reference to a different Writable object each time.
B. It returns a reference to a Writable object from an object pool.
C. It returns a reference to the same Writable object each time, but populated with different data.
D. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
E. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
Answer: C
Explanation: Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next value.
Reference: manipulating iterator in mapreduce
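The practical consequence of answer C is that references saved from the iterator all point at one object. The following is a minimal plain-Python sketch of that pitfall, not Hadoop code: the `IntWritable` class and the `reduce_values` generator are hypothetical stand-ins that only imitate the reuse behavior described in the explanation.

```python
class IntWritable:
    """Hypothetical stand-in for a mutable Hadoop IntWritable."""

    def __init__(self, value=0):
        self.value = value

    def set(self, value):
        self.value = value

    def get(self):
        return self.value


def reduce_values(raw_values):
    """Yield the SAME IntWritable instance each time, refilled with the next value."""
    shared = IntWritable()
    for v in raw_values:
        shared.set(v)
        yield shared


# Pitfall: keeping the references keeps one object, seen many times.
kept_refs = [w for w in reduce_values([1, 2, 3])]

# Safer: copy the contents out before asking for the next value.
kept_copies = [IntWritable(w.get()) for w in reduce_values([1, 2, 3])]

buggy = [w.get() for w in kept_refs]
safe = [w.get() for w in kept_copies]
```

Under this model, `buggy` ends up holding the last value three times, while `safe` preserves each value because a fresh object was created per element.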

QUESTION NO: 50
Table metadata in Hive is:
A. Stored as metadata on the NameNode.
B. Stored along with the data in HDFS.
C. Stored in the Metastore.
D. Stored in ZooKeeper.
Answer: C
Explanation: By default, Hive uses an embedded Derby database to store metadata information. The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa. The Hive developers chose this approach, as opposed to storing this information in HDFS, because they need the Metastore to be very low latency. The DataNucleus layer allows them to plug in many different RDBMS technologies.
Note:
* By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
* Features of Hive include: metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
Reference: Store Hive Metadata into RDBMS

QUESTION NO: 51
Analyze each scenario below and identify which best describes the behavior of the default partitioner.
A. The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
B. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
C. The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
D. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
E. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.
Answer: D
Explanation: The default partitioner computes a hash value for the key and assigns the partition based on this result. The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the total number of partitions to determine which partition to send a given (key, value) pair to. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record belongs in. The number of partitions is then equal to the number of reduce tasks for the job.
Reference: Getting Started With (Customized) Partitioning
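The hash-then-modulo rule in answer D can be sketched in a few lines. This is a plain-Python model, not the Hadoop class itself: `java_string_hash` is a helper written for this illustration that imitates Java's `String.hashCode()` for keys made of ordinary (BMP) characters, and `get_partition` masks off the sign bit before taking the modulo so that a negative hash cannot produce a negative partition number.

```python
def java_string_hash(s):
    """Model of Java's String.hashCode() as a signed 32-bit integer.

    Assumes every character fits in one UTF-16 code unit.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def get_partition(key, num_reduce_tasks):
    """Partition index: non-negative hash of the key, modulo the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


# Every record with the same key lands on the same reducer.
partitions = {k: get_partition(k, 4) for k in ["hello", "world", "hadoop"]}
```

Note that only the key is hashed; the value plays no part, which is what separates answer D from answer E.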

QUESTION NO: 52
You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
A. Increase the block size on all current files in HDFS.
B. Increase the block size on your remaining files.
C. Decrease the block size on your remaining files.
D. Increase the amount of memory for the NameNode.
E. Increase the number of disks (or size) for the NameNode.
F. Decrease the block size on all current files in HDFS.
Answer: C
Explanation: Note:
* -put localSrc dest: Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
* What is HDFS block size? How is it different from traditional file system block size? In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS uses the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size.

QUESTION NO: 53
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
A. m x n (i.e., m multiplied by n)
B. n
C. m
D. m+n (i.e., m plus n)
E. m^n (i.e., m to the power of n)
Answer: A
Explanation: A MapReduce job with m mappers and r reducers (r = n here) involves up to m x r distinct copy operations, since each mapper may have intermediate output going to every reducer.
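The count in the explanation is an upper bound: one potential fetch per (mapper, reducer) pair. As a small illustrative sketch in plain Python (the function name `shuffle_copies` is made up for this example), enumerating the pairs makes the product explicit:

```python
def shuffle_copies(m, n):
    """List every (mapper, reducer) pair; each pair is at most one copy operation."""
    return [(mapper, reducer) for mapper in range(m) for reducer in range(n)]


copies = shuffle_copies(3, 2)
copy_count = len(copies)
```

With 3 mappers and 2 reducers there are 6 pairs. A real job may perform fewer copies if some mapper produces no output for a given reducer.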

QUESTION NO: 54
Workflows expressed in Oozie can contain:
A. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
C. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
D. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
Answer: A
Explanation: An Oozie workflow is a collection of actions (i.e., Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence of action execution. This graph is specified in hPDL (an XML Process Definition Language).
hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork and join nodes).
Note: Oozie is a Java Web-Application that runs in a Java servlet-container (Tomcat) and uses a database to store:
* Workflow definitions
* Currently running workflow instances, including instance states and variables
Reference: Introduction to Oozie

QUESTION NO: 55
Which best describes what the map method accepts and emits?
A. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
B. It accepts a single key-value pair as input and can emit only one key-value pair as output.
C. It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
Answer: D
Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
Reference: org.apache.hadoop.mapreduce Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
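The "zero or many output pairs" behavior is easy to see with a generator. The following plain-Python sketch is only a model of the idea, not the Hadoop Mapper API: `word_count_map` is a hypothetical map function that takes one (offset, line) pair and emits one (word, 1) pair per word.

```python
def word_count_map(key, value):
    """Take one input pair (byte offset, line of text); emit (word, 1) per word.

    The key is ignored here, as is common for text input.
    """
    for word in value.split():
        yield (word, 1)


no_output = list(word_count_map(0, ""))             # empty line: zero pairs
many_output = list(word_count_map(14, "a b a"))     # three words: three pairs
```

Note also that the output types (string, integer) differ from the input types (integer, string), matching the statement that intermediate records need not share the input's types.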

QUESTION NO: 56
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
A. When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value and when the reduce operation is both commutative and associative.
B. When the signature of the reduce method matches the signature of the combine method.
C. Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
D. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
E. Never. Combiners and reducers must be implemented separately because they serve different purposes.
Answer: A
Explanation: You can use your reducer code as a combiner if the operation performed is commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
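To see why commutativity and associativity matter, it helps to compare a sum with a mean. The sketch below is a toy in-memory model written for this illustration (the `run_job` helper and its arguments are assumptions, not a Hadoop API): it applies an optional combiner to each mapper's local output before the final reduce.

```python
from collections import defaultdict


def run_job(map_outputs, reduce_fn, combine_fn=None):
    """Toy model: map_outputs is a list (one entry per mapper) of (key, value) lists."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        local = defaultdict(list)
        for k, v in pairs:
            local[k].append(v)
        for k, vs in local.items():
            if combine_fn is not None:
                vs = [combine_fn(vs)]  # pre-aggregate on the map side
            grouped[k].extend(vs)
    return {k: reduce_fn(vs) for k, vs in grouped.items()}


def total(values):
    return sum(values)


def mean(values):
    return sum(values) / len(values)


map_outputs = [[("k", 1), ("k", 2)], [("k", 6)]]

sum_plain = run_job(map_outputs, total)
sum_combined = run_job(map_outputs, total, total)   # same result as sum_plain
mean_plain = run_job(map_outputs, mean)
mean_combined = run_job(map_outputs, mean, mean)    # differs from mean_plain
```

Summation gives 9 with or without the combiner. The mean gives 3.0 without it but 3.75 with it, because a mean of partial means is not the overall mean. This is why a reducer that computes an average cannot be reused as a combiner unchanged.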

QUESTION NO: 57
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
A. SequenceFiles
B. Avro
C. JSON
D. HTML
E. XML
F. CSV
Answer: A
Explanation: Using Hadoop Sequence Files
So what should we do in order to deal with a huge amount of images? Use Hadoop sequence files! Those are map files that inherently can be read by MapReduce applications (there is an input format especially for sequence files) and are splittable by MapReduce, so we can have one huge file that will be the input of many map tasks. By using those sequence files we are letting Hadoop use its advantages. It can split the work into chunks so the processing is parallel, but the chunks are big enough that the process stays efficient.
Since the sequence files are map files, the desired format will be that the key will be text and hold the HDFS filename, and the value will be BytesWritable and will contain the image content of the file.
Reference: Hadoop binary files processing introduced by image duplicates finder

QUESTION NO: 58
You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
A. Run all the nodes in your production cluster as virtual machines on your development workstation.
B. Run the hadoop command with the -jt local and the -fs file:/// options.
C. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
D. Run simDoop, the Apache open-source software for simulating Hadoop clusters.
Answer: A
Explanation: Hosting on local VMs
As well as large-scale cloud infrastructures, there is another deployment pattern: local VMs on desktop systems or other development machines. This is a good tactic if your physical machines run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate the complexity of a small Hadoop cluster.
* Have enough RAM for the VM to not swap.
* Don't try to run more than one VM per physical host; it will only make things slower.
* Use file: URLs to access persistent input and output data.
* Consider making the default filesystem a file: URL so that all storage is really on the physical host. It's often faster and preserves data better.

QUESTION NO: 59
Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run.
A. 64
B. 100
C. 200
D. 640
Answer: C
Explanation: Each file would be split into two, as the block size (64 MB) is less than the file size (100 MB), so 200 mappers would be running.
Note: If you're not compressing the files then Hadoop will process your large files (say 10 GB) with a number of mappers related to the block size of the file. Say your block size is 64 MB; then you will have ~160 mappers processing this 10 GB file (160 * 64 MB ~= 10 GB). Depending on how CPU-intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256, or 512 MB; the actual size depends on how you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-input-files-size (first answer, second paragraph)
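The arithmetic behind both figures in the explanation (200 mappers for the question, about 160 for the 10 GB note) can be checked with a short calculation. This is a simplified sketch that assumes one mapper per split and a split size equal to the block size; it ignores configurable minimum and maximum split sizes and any tolerance Hadoop may apply to a short final split. The function name `num_splits` is made up for this example.

```python
import math


def num_splits(file_size_mb, block_size_mb):
    """Simplified model: one split per block, the last one possibly partial."""
    return math.ceil(file_size_mb / block_size_mb)


# Question 59: 100 files of 100 MB each, 64 MB blocks.
splits_per_file = num_splits(100, 64)
total_mappers = 100 * splits_per_file

# Note in the explanation: one uncompressed 10 GB file, 64 MB blocks.
big_file_mappers = num_splits(10 * 1024, 64)

# Doubling the block size halves the mapper count for the large file.
big_file_mappers_128 = num_splits(10 * 1024, 128)
```

Each 100 MB file yields one full 64 MB split and one 36 MB split, so splits never span files and the 100 files produce 200 mappers in total.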

QUESTION NO: 60
What is a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D
Explanation: SequenceFile is a flat file consisting of binary key/value pairs.
There are 3 different SequenceFile formats:
* Uncompressed key/value records.
* Record compressed key/value records - only 'values' are compressed here.
* Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile