Apache HTTP Server Log Analysis

Business Analytics Using Hadoop - Project

Objectives
This program enables the participants to implement the learnings of the Big Data Technology - Hadoop course.
The primary objective of the project is to enhance the participant's skills in HDFS, Pig, Hive, and Sqoop.

Procedure
• View Apache Sample Log
  Refer to apache_sample.pdf
• Understand Apache Logs
  Refer to apache_desc.pdf
  Source: http://httpd.apache.org/docs/2.2/logs.html
• Use Dataset as given
  Refer to section Apache Data Sets
• Parse & Analyze
  Refer to section Procedure
• Analytics Requirement
  Refer to section Analytics Requirement
• Generate Project Report
  Refer to section Project Report

Apache Data Sets
• apache_workset.log - small Apache log to create your prototype
• usask_access_log.gz - compressed file containing "UofS_access_log", an Apache log of approx. 233 MB
Note:
• "UofS_access_log" to be renamed as "apache_dataset.log"
• Site: Web logs from NASA Web Site
• Source: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

Procedure
• Copy the log file to Linux & then to an HDFS folder of your choice.
• Parse the log file using Pig. The format of the data frame should match "apache_http.xlsx".
• Store the parsed data in the HDFS folder "apache_http".
• The CSV file should have the following fields: Host-IP, Date (dd-mmm-yyyy format), Time (hh:mm:ss format), Protocol, URL, HttpVers, Status, Bytes. Time Zone to be ignored.
• Import into Hive by creating Hive warehouse data.
• Provide analysis results as per section Analytics Requirement below.
• Copy results to the local file system or local MySQL as per requirement.
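The record layout above can be checked on the prototype log before writing the Pig script. The following is a minimal Python sketch, not the Pig/piggybank solution the project asks for. It assumes the log is in Common Log Format, that "Protocol" means the request method (for example GET), and that "HttpVers" is the protocol token of the request line. The function name parse_clf_line, the dictionary keys, and the sample line are hypothetical; the authoritative layout is "apache_http.xlsx".

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date:time zone] "method url version" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]\s]+)(?: [+-]\d{4})?\] '          # time zone is matched but ignored
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<version>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

# Illustrative sample line (an assumption, not taken from the project data set).
SAMPLE_LINE = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
               '"GET /history/apollo/ HTTP/1.0" 200 6245')


def parse_clf_line(line):
    """Return the eight required fields as a dict, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group("timestamp"), "%d/%b/%Y:%H:%M:%S")
    size = match.group("bytes")
    return {
        "host": match.group("host"),
        "date": stamp.strftime("%d-%b-%Y"),        # dd-mmm-yyyy
        "time": stamp.strftime("%H:%M:%S"),        # hh:mm:ss
        "protocol": match.group("method"),         # assumption: "Protocol" = request method
        "url": match.group("url"),
        "http_version": match.group("version") or "",
        "status": int(match.group("status")),
        "bytes": 0 if size == "-" else int(size),  # "-" means no body was sent
    }


if __name__ == "__main__":
    print(parse_clf_line(SAMPLE_LINE))
```

Lines that do not match the pattern return None, so a prototype run can count them separately instead of silently dropping them.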

500188032.docx Page: 1/4



Jar File
• Please use "piggybank-apache.jar" provided in "apache folder" for parsing the Apache log in Pig.

Analytics Requirement

Required As A Hive Table
• How many times has each individual host connected to our server? Store data sorted by highest count first.
• How many times has each individual page been requested from our server? Store data sorted by highest count first.
• How much data has been downloaded by each individual host that has connected to our server? Store data sorted by highest count first.
• How much data was sent out as each individual page was downloaded from our server? Store data sorted by highest count first.
Note - get all the above Hive table based information into MySQL tables.
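The four tables above are two aggregations (a request count and a sum of bytes) over two keys (host and page), each sorted in descending order. The following is a small Python sketch of that logic for checking prototype results; it is not the Hive query the project requires. The function name summarize, the record keys, and the sample records are hypothetical.

```python
from collections import Counter

# Made-up records for illustration; real records come from the parsed log.
sample_records = [
    {"host": "a.example", "url": "/index.html", "bytes": 100},
    {"host": "b.example", "url": "/index.html", "bytes": 100},
    {"host": "a.example", "url": "/logo.gif", "bytes": 2500},
]


def summarize(records, key):
    """Return (hits, volume) for the given key, each sorted highest first.

    hits   : list of (key value, number of requests)
    volume : list of (key value, total bytes sent)
    """
    hits = Counter()
    volume = Counter()
    for rec in records:
        hits[rec[key]] += 1
        volume[rec[key]] += rec["bytes"]
    # most_common() returns the pairs ordered from the largest value to the smallest
    return hits.most_common(), volume.most_common()


if __name__ == "__main__":
    for key in ("host", "url"):
        hits, volume = summarize(sample_records, key)
        print(key, "hits:", hits)
        print(key, "bytes:", volume)
```

Calling summarize once with "host" and once with "url" yields the content of all four required tables, which gives a reference to compare the Hive output against on the small work set.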
Answers Required
Using the above results and also carrying out any other analysis as may be required, provide answers to the following questions.
• Which host has connected the maximum number of times to our server? Give the host name & count of connections from that host.
• Which page has been requested the maximum number of times from our server? Give the page name & count of the times the page was requested.
• How many unique hosts have connected to our server? Give counts.
• How many unique pages have been requested from our server? Give counts.
• Which host has caused maximum data transfer from our server? Give host name & the data transfer for the host.
• Which page has caused maximum data transfer from our server? Give page name & the data transfer for the page.
• Which page has maximum download size from our server? Give page name & the size for the page.
• What is the download count of the page that has maximum download size from our server? Give page name & download count.
• Which page has minimum download size from our server? Give page name & the size for the page.
• What is the download count of the page that has minimum download size from our server? Give page name & download count.
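The size-related questions above can be prototyped as follows. This Python sketch assumes that the "download size" of a page is the largest single response size recorded for that page, which is one possible reading of the questions and not something the handout states. The function name page_size_extremes, the record keys, and the sample records are hypothetical.

```python
# Made-up records for illustration; real records come from the parsed log.
sample_records = [
    {"host": "a.example", "url": "/index.html", "bytes": 100},
    {"host": "b.example", "url": "/index.html", "bytes": 100},
    {"host": "a.example", "url": "/logo.gif", "bytes": 2500},
]


def page_size_extremes(records):
    """Return unique host/page counts and the largest and smallest pages.

    Each page entry is (page name, download size, download count).
    Assumes records is non-empty.
    """
    size_by_page = {}
    count_by_page = {}
    hosts = set()
    for rec in records:
        url = rec["url"]
        hosts.add(rec["host"])
        # assumption: download size = largest response seen for the page
        size_by_page[url] = max(size_by_page.get(url, 0), rec["bytes"])
        count_by_page[url] = count_by_page.get(url, 0) + 1
    largest = max(size_by_page, key=size_by_page.get)
    smallest = min(size_by_page, key=size_by_page.get)
    return {
        "unique_hosts": len(hosts),
        "unique_pages": len(size_by_page),
        "largest": (largest, size_by_page[largest], count_by_page[largest]),
        "smallest": (smallest, size_by_page[smallest], count_by_page[smallest]),
    }


if __name__ == "__main__":
    print(page_size_extremes(sample_records))
```

If several pages share the same extreme size, max and min return the first one encountered, so ties should be reported explicitly in the project answers.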
Additional Analytics
Please provide additional analysis if you deem it important. Also provide the reason why the same is considered significant.

Project Report
• Project Overview
• Commands / Code Section
• Results Section
• Summary - How you used Hadoop for Data Analytics

Project Overview
• Brief Overview Of The Project
• Learning Objective

Commands / Code Section
• Should contain all HDFS commands used to transfer data to HDFS
• Should contain all Pig commands used
• Should contain all Hive commands used
• Should contain all Sqoop commands used to transfer data to the Linux file system

Result Section (As HIVE TABLE)
• For each table give the MySQL command "select * from <table-name> limit 10".
• Copy and paste the result into the above report.

Result Section (As HDFS FILE)
• Get all the above HDFS based information to the Linux file system
• Prepare a report answering all the questions given above
• Also give your understanding (translate statistics to analytics) of the same

Summary
• Describe your experience of using Hadoop for Data Analytics.

Evaluation Methodology
The students will be evaluated as per the following:

Same For All Group Members
  Data Parsing (PIG)          15
  Required As Data Frame      20
  Answers Required            10
  Output To MySQL (Sqoop)     15
Individual Queries
  Queries On Project          20   Explain some commands from code
  Queries On Hadoop           20   Any question related to Hadoop

References - How To Parse The Apache Log
http://blog.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/
http://ossec-docs.readthedocs.org/en/latest/log_samples/apache/apache.html
http://hadooptutorial.info/processing-logs-in-pig/#Example_Use_case_of_CommonLogLoader
http://kickstarthadoop.blogspot.in/2011/06/analyzing-apache-logs-with-pig.html
