The 9th International Conference on Computer Supported Cooperative Work in Design Proceedings

Web Robot Detection Techniques Based on Statistics of Their Requested URL Resources

Weigang Guo(1,2), Shiguang Ju(2), Yi Gu(2)
(1) Information Center, Foshan University, Foshan, Guangdong, 528000, P.R. China, wgguo@fosu.edu.cn
(2) School of Computer Science, Jiangsu University, Jiangsu, 212013, P.R. China, jushig@ujs.edu.cn
Abstract

Following the wide use of search engines, the impact Web robots have on Web sites should not be ignored. After analyzing the navigational patterns of Web robots in Web logs, two new algorithms are proposed. One is based on classification and statistics of requested URL resources: it classifies the URL resources into eight types and counts the number of sessions of each client and the number of visiting records of the same type. The other is based on a Web page member list: it constructs one member list for every Web page and one ShowTable for every visitor. Experiments show that the two new algorithms can detect unknown robots and unfriendly robots that do not obey the Standard for Robot Exclusion.

Keywords: Search Engine, Web Robot Detection, Content Classification, Webpage Member List, Web Log

1. Introduction

A Web robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced. Web robots are often used as resource discovery and retrieval tools for Web search engines such as Google, Lycos, etc. But robots' automatic visits to Web sites also cause many problems. First, for reasons of business secrecy, many e-commerce Web sites do not want unauthorized robots to retrieve information from their sites. Second, many e-commerce Web sites need to analyze their visitors' browsing behavior, but such analysis can be severely distorted by the presence of Web robots [1]. Third, many government Web sites likewise do not want their information collected and indexed by robots. Fourth, poorly designed robots often consume large amounts of network and server resources, affecting the visits of normal customers. So it is necessary for Web site managers to detect Web robots among all visitors, and to take proper measures to redirect the robots or stop responding to HTTP requests coming from unauthorized robots.

The commonly used detection method is to set up a database of known robots [2] and compare the IP address and User-Agent fields of HTTP request messages against the known robots. But this method can detect only well-known robots. There are three simple techniques to detect unknown robots from Web logs. (1) According to the SRE (Standard for Robot Exclusion) [3], whenever a robot visits a Web site it should first request a file called robots.txt. So, by examining the user sessions generated from Web logs, new robots that follow the SRE can be found. However, the standard is voluntary, and many robots do not obey it. (2) Most robots do not assign any value to the "referrer" field of their HTTP request messages, so in the Web log the referrer field is empty ("-"). If a user session contains a large number of requests with empty referrer fields, the visitor is a "suspicious" robot. But, as Web browsers can sometimes generate HTTP messages with empty referrer values, this method is not reliable either. (3) When checking the validity of the hyperlink structure, most robots use the HEAD request method to reduce the burden on Web servers. One can therefore examine user sessions with many HEAD requests to discover potential robots. Again, as Web browsers can also generate HEAD requests, this method is not reliable. To solve the problem, Pang-Ning Tan and Vipin Kumar [1] adopted the C4.5 decision tree algorithm to classify robot visits and human visits based on the characteristics of robots' access patterns. Their method can effectively detect unknown robots, but it is somewhat complicated. In this paper, after analyzing the access patterns of Web robots, we propose two new and simple algorithms to detect Web robots based on the statistics of their requested URL resources.


Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on March 09,2010 at 00:48:51 EST from IEEE Xplore. Restrictions apply.

2. The differences between robot visits and human visits

There are great differences between robot visits and human visits.

(1) When a person inputs a URL address in a browser (e.g. Microsoft Internet Explorer), the browser sends an HTTP request to the target server. On the server side, after receiving the request, the server checks whether it has the document specified by the URL. If it does, it sends out that document; otherwise, it returns an error message. If the document is a single object, such as a picture, the browser shows it directly. If it is an HTML document, the browser parses it, analyzes the embedded and linked objects in the document (image files, script files, cascading style sheet files, frames, etc.), and then continuously and automatically sends HTTP requests to the server until all the embedded objects have been requested. According to the HTTP protocol [4], the server sends out all the requested documents in order after receiving the client's requests. When the browser has received all the embedded objects, it "assembles" them and generates a complete Web page from the human point of view. So one request by a person may generate several records in the Web server logs. The file type of the first requested document is usually the Web page type (.htm, .asp, .php, etc.), and there is no obvious characteristic in the file types of the URL fields of the other records, because the embedded objects of a Web page are arbitrary.

(2) The robot is different. After getting a URL (assume it is an HTML document) from the URL list that is waiting to be visited, the robot sends an HTTP request to the target server. After receiving the server's response, the robot analyzes the embedded objects and hyperlinks within the received HTML document, and then adds the embedded hyperlinks to the waiting list according to its visiting rules. The method of treating embedded objects (image files, animation files, script files, cascading style sheet files, frames, etc.) may differ from robot to robot: some search engine robots add the URLs of these objects to the waiting list as well, while others give them up directly, or modify the links of the objects in the HTML document rather than request them. But one thing is the same: robots do not send the requests for the embedded objects to the server at once. Therefore one request of a robot visit leaves only one record in the server logs, which exactly represents the request of the robot. Usually, all the file types of the requested URLs of a robot visit are the webpage type. If the purpose of a robot is to collect image files or music files, the file types of the requested URLs of the current session are all jpg, gif, png (an image search engine robot) or mp3, wma, rm (a music search engine robot).

3. The algorithm based on classification and statistics of requested URL resources

The algorithm classifies the URL resources into eight types and counts the number of sessions of each client and the number of visiting records of the same type.

3.1 The classification of URL resources

The contents of Web sites can be classified into eight major types (see Table 1). It is worth pointing out that, as many robots retrieve non-text documents (e.g. PDF files) as well, we can regard "document" and "webpage" as the same type.

Table 1. The classification of the content of Web sites

Type name   Description (file types)
webpage     Web pages (htm, html, asp, php, ...)
document    Non-text documents (doc, ppt, pdf, ...)
script      Script files (js, vbs, css, ...)
image       Images (jpg, gif, png, bmp, ...)
music       Music (mid, mp3, wma, rm, ...)
video       Video and animation files (swf, avi, mpeg, ...)
download    Compressed files (zip, rar, tgz, exe, ...)
others      Any other file types

3.2 The detection algorithm

Step 1: Data preprocessing. Sort the Web logs by the IP field, agent field and time field as the first, second and third keys, and then treat the records with the same IP and agent fields as one visitor's visiting records. All the records form the user visiting record set. Each record of the log files can be processed as R = <IP, agent, time, url, t>, where IP is the address of the client, agent is the user agent field of the HTTP message, time is the request time, url is the requested URL resource, and t is the type of the requested content, computed from the url field according to Table 1. In this way, each record of the Web logs is assigned one of the eight types.

Step 2: Generating the user session set S. If the time interval between two visits is more than a fixed time length T, the two visits can be regarded as two different sessions. Usually, T is between 15 and 30 minutes. A user session is represented as S = <IP, agent, t1, t2, ..., tk>, where t1, t2, ..., tk are the types of the requested contents of each record in the session, and k is the total number of visiting records.
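As a concrete illustration, the Section 3 pipeline can be sketched in Python. This is a minimal sketch under our own assumptions: the names (TYPE_BY_EXT, classify, sessionize, merge) are ours, the extension table abbreviates Table 1, and the /24-prefix merge approximates the paper's Type-C IP address grouping.

```python
import os
from collections import defaultdict

# Illustrative extension table abbreviating Table 1.
TYPE_BY_EXT = {
    ".htm": "webpage", ".html": "webpage", ".asp": "webpage", ".php": "webpage",
    ".doc": "document", ".ppt": "document", ".pdf": "document",
    ".js": "script", ".vbs": "script", ".css": "script",
    ".jpg": "image", ".gif": "image", ".png": "image", ".bmp": "image",
    ".mid": "music", ".mp3": "music", ".wma": "music", ".rm": "music",
    ".swf": "video", ".avi": "video", ".mpeg": "video",
    ".zip": "download", ".rar": "download", ".tgz": "download", ".exe": "download",
}

def classify(url):
    """Step 1: map a requested URL to one of the eight content types."""
    ext = os.path.splitext(url.split("?")[0])[1].lower()
    return TYPE_BY_EXT.get(ext, "others")

def sessionize(records, gap=20 * 60):
    """Step 2: split one visitor's time-sorted (time, url) records into
    sessions whenever two requests lie more than `gap` seconds apart."""
    sessions, current, last = [], [], None
    for time, url in records:
        if last is not None and time - last > gap:
            sessions.append(current)
            current = []
        current.append(classify(url))
        last = time
    if current:
        sessions.append(current)
    return sessions

def merge(visitors, s_thresh=2, r_thresh=5):
    """Steps 3-6: keep single-type sessions (m == n), merge them by
    Type-C (/24) IP prefix plus user agent, and flag merged entries whose
    session count (Snumber) or record count (Rnumber) reaches a threshold.
    Occasional visitors with Snumber = Rnumber = 1 fail the thresholds
    automatically, mirroring the deletion in Step 5."""
    stats = defaultdict(lambda: [0, 0])            # key -> [Snumber, Rnumber]
    for ip, agent, session in visitors:
        if len(set(session)) != 1:                  # Step 4: all records same type
            continue
        key = (ip.rsplit(".", 1)[0], agent, session[0])
        stats[key][0] += 1
        stats[key][1] += len(session)
    return {k: v for k, v in stats.items()
            if v[0] >= s_thresh or v[1] >= r_thresh}
```

A visitor retrieving nothing but images, for instance, produces single-type "image" sessions that survive Step 4 and accumulate under one merged key, whereas a human's mixed webpage-plus-embedded-object sessions are discarded.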

Step 3: Computing the number of each type of requested content in every session, and finding the content type with the largest count. A session is now represented as S = <IP, agent, t, m, n>, where t is one of the eight types defined in Table 1, m is the total number of records in the session, and n is the number of records of type t.

Step 4: Sifting out all the sessions with m = n, i.e. sessions in which every record has the same type, and forming the robot candidate set C, represented as C = <IP, agent, t, n>.

Step 5: Merging the sessions and forming the merged-session set M. Because search engine robots usually use several hosts within the same Type-C IP address range to retrieve information, the visitors coming from the same Type-C IP address range and having the same user agent field can be regarded as one visitor. Set M is represented as M = <IP, agent, t, Snumber, Rnumber>, where Snumber is the total number of sessions of type t generated by such a merged visitor, and Rnumber is the corresponding total number of visiting records. Considering that there are many occasional visitors, the items in set M whose Snumber and Rnumber both equal 1 are deleted.

Step 6: Checking Snumber and Rnumber. If they exceed a certain threshold, the client can be regarded as a robot. The threshold can differ from Web site to Web site; for example, the threshold of Snumber can be set to 2 while that of Rnumber can be set to 5.

4. The algorithm based on the Web page member list

The algorithm first constructs a member list for every Web page, and then generates a ShowTable for every visitor's requested URLs. Robots can be detected from the ShowTable.

4.1 The construction of the Web page member list

Definition 1: A Web page is a collection of information, consisting of one or more Web resources, intended to be rendered simultaneously, and identified by a single URL. More specifically, a Web page consists of a Web resource with zero, one, or more embedded Web resources intended to be rendered as a single unit, and is referred to by the URL of the one Web resource which is not embedded [5].

The embedded objects include: 1) multimedia files (images, sounds, animations, etc.) defined by the SRC attribute of the IMG, BGSOUND, EMBED and OBJECT tags of HTML; 2) frames defined by the SRC attribute of the FRAME and IFRAME tags; 3) cascading style sheet files linked by the HREF attribute of the LINK tag; 4) script files linked by the SRC attribute of the SCRIPT tag; 5) Java applet class files linked by the CODE attribute of the APPLET tag.

Definition 2: The member list of a Web page is a three-tuple t = <webpage, memberset, n>, where webpage is the URL of the Web page, memberset is the aggregate of all the URLs of the embedded objects, represented as {m1, m2, ..., mn}, mi (i = 1, ..., n) is the URL of an embedded object, and n is the number of members.

For example, suppose a Web page index.htm consists of three frames a.htm, b.htm and c.htm, every frame contains one picture (t1.jpg, t2.jpg and t3.jpg respectively), there is a cascading style sheet file style.css in a.htm, and all the files are in the root directory of the Web site. Then the member list of index.htm is <index.htm, {a.htm, b.htm, c.htm, t1.jpg, t2.jpg, t3.jpg, style.css}, 7>. The number of members of a multimedia file is 0.

We developed an HTML analyzer to analyze the HTML tags and their attributes and generate the Web page member list for every file in the Web site. All the member lists form the member list set of the Web site.

4.2 The detection algorithm

Step 1: Data preprocessing. Sort the Web logs by the IP field, agent field and time field as the first, second and third keys, treat the records with the same IP and agent fields as one visitor's visiting records, and assign each visitor a label uid. Each record of the log files can be processed as R = <uid, url, time>, where uid is the label of a distinct visitor, url is the requested URL resource, and time is the request time. All the records of one visitor form his visiting record set S.

Step 2: Constructing a ShowTable (see Table 2), which records the actual attendance of the members of the visited Web pages, for every visitor. In the ShowTable, url is the URL of a visited Web page, NumberOfMember is the number of members of the Web page, obtained from the Web site's member list set, and ShowNumber is the number of those members that actually appear in the visitor's visiting record set.
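A member-list builder in the spirit of the authors' HTML analyzer can be sketched with Python's standard html.parser. The tag-to-attribute table follows the five categories of Section 4.1 (note the paper names the SRC attribute of OBJECT, though modern HTML uses DATA); the class and function names are our own, and URL resolution against the site root is omitted.

```python
from html.parser import HTMLParser

class MemberListBuilder(HTMLParser):
    """Collect the embedded members of one Web page (Section 4.1)."""

    # tag -> attribute that references the embedded object
    EMBED_ATTRS = {
        "img": "src", "bgsound": "src", "embed": "src", "object": "src",
        "frame": "src", "iframe": "src",
        "script": "src", "applet": "code",
        "link": "href",
    }

    def __init__(self):
        super().__init__()
        self.members = []

    def handle_starttag(self, tag, attrs):
        attr = self.EMBED_ATTRS.get(tag)
        if attr is None:
            return
        attrs = dict(attrs)
        # A LINK tag contributes a member only when it is a style sheet.
        if tag == "link" and attrs.get("rel", "").lower() != "stylesheet":
            return
        if attrs.get(attr):
            self.members.append(attrs[attr])

def member_list(url, html):
    """Return the three-tuple <webpage, memberset, n> of Definition 2."""
    parser = MemberListBuilder()
    parser.feed(html)
    memberset = set(parser.members)
    return (url, memberset, len(memberset))
```

Running this over every file of the site yields the member list set; a multimedia file contains no such tags, so its member count is naturally 0.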

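The ShowTable computation of Section 4.2 might then be sketched as follows. The record layout and function names are illustrative assumptions; the 30-second close-succeeding window reflects the paper's observation that browsers fetch embedded objects within seconds while robots, if they fetch them at all, do so much later.

```python
def show_table(records, member_lists, window=30):
    """Compute ShowNumber per visited URL for one visitor.

    records: time-sorted list of (time, url) for this visitor.
    member_lists: {page_url: set of member URLs} from the member list set.
    For each requested page, count how many of its members appear among
    the requests that follow within `window` seconds (the close
    succeeding sequence); each member is counted at most once."""
    table = {}
    for i, (t0, url) in enumerate(records):
        # Multimedia files have no member list, so their ShowNumber stays 0.
        members = set(member_lists.get(url, ()))
        shown = 0
        for t1, later_url in records[i + 1:]:
            if t1 - t0 > window:
                break
            if later_url in members:
                shown += 1
                members.discard(later_url)
        table[url] = table.get(url, 0) + shown
    return table

def is_robot(table):
    """Step 3: a visitor whose ShowNumbers are all zero is judged a robot."""
    return all(v == 0 for v in table.values())
```

A human fetching index.htm followed seconds later by its frames and images accumulates nonzero ShowNumbers, while a robot requesting only the pages themselves leaves every ShowNumber at 0.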
The algorithm for computing ShowNumber is as follows:

    for each r in S do
        if the URL type of r is a multimedia file (images, sounds) then
            ShowNumber := 0
        else
            for each member of r do
                judge whether this member appears in the close succeeding sequence of r;
                if it does appear then { ShowNumber := ShowNumber + 1; delete this member's record }
            next member
    next r

Here, the close succeeding sequence of r means all the visiting records behind r within a certain time interval in the visiting record set S. If the visitor is a person, the browser usually requests the embedded objects within 0-5 seconds; the interval can range from 0 to 30 seconds, otherwise the visitor would become impatient and give up or exit. For robot visits, if the robot requests the embedded objects at all, it does so according to its retrieving strategy, and the time interval is usually larger than 30 seconds.

Table 2. ShowTable: records the actual attendance of the members of the visited Web pages for every visitor

url            NumberOfMember   ShowNumber
Index.htm      7                0
Camera.htm     3                0
Computer.htm   0                0
Apple.jpg      0                0

Step 3: To judge whether one uid is a robot, the simple method is to check the ShowNumber fields of its ShowTable. If all the ShowNumber fields of its visited URLs equal 0, we can conclude that the uid is a robot.

5. Experiments

Our experiments were performed on the Foshan University server logs (http://202.192.168.245) collected from January 21st to February 7th, 2004. These days fall within the winter holidays and the Chinese Spring Festival, so only a few people visited the university Web site during this period. The Web log of January 21st, 2004 contains a total of 50740 records. We first adopt the algorithm described in Section 3 to detect robots, and then adopt the algorithm described in Section 4 to verify the results.

With the time interval T = 20 minutes, 7432 sessions are created. There are 6742 sessions in the robot candidate set C, of which 424 are "webpage" type sessions, 128 are "image" type sessions, 6165 are "music" type sessions and 7 are "animation" type sessions. We did not process the "music" sessions, because the requests of different clients generate completely different server log records: for example, when Microsoft Windows Media Player requests an MP3 file, many records (sometimes exceeding 30) may be produced in the server log, and the agent fields may differ. So only a total of 559 "webpage", "image" and "animation" type sessions in the robot candidate set C are used for further processing. The final result is that there are 253 clients (with distinct IP address and agent) in the merged-session set M. The average session number of the 253 clients is 2; the maximal session number is 96 and the minimal is 1. The average number of visiting records (requested resources) is 5; the minimal is 1 and the maximal is 249.

5.1 The completeness of detection

In the original logs, there are 24 different clients (with distinct IP address and agent) that requested robots.txt, among which 20 appear in the merged-session set M. By checking the original log files, it is discovered that the other 4 clients, which do not appear in the merged-session set M, requested nothing but robots.txt. As they are filtered out in Step 5, it is reasonable that they do not appear in M. So the completeness of detection of well-behaved robots (those that request robots.txt) is 100%.

5.2 The accuracy of detection

Which of the 253 clients in the merged-session set M are real robots? The criterion is to check their Snumber and Rnumber. We can choose appropriate Snumber and Rnumber thresholds based on the following assumptions: 1) when a robot visits a Web site, according to its retrieving strategy, it usually divides its retrieving task into several sub-tasks and may produce somewhat more sessions than a human user; 2) it usually requests somewhat more contents than a human user. We obtained 28 clients from the merged-session set M by setting Snumber >= 2 or Rnumber >= 5 (the average values of set M). Of the 28 clients, 20 requested robots.txt while the other 8 did not. The eight clients are shown in Table 3.

Table 3. The robots found in the experiment that did not request robots.txt (eight clients; their User-Agent fields include HTML-GET-APP, MSIE-compatible agents, Openfind data gatherer Openbot/3.0 (robot-response@openfind.com.tw), Inktomi Slurp (slurp@inktomi.com, http://www.inktomi.com/slurp.html) and ZyBorg/1.0 (zyborg@looksmart.net, http://www.WISEnutbot.com); their session types are "webpage" and "image")

When adopting the algorithm described in Section 4 to verify the results, we found that all the ShowNumber values in the ShowTables of these robots' requested URLs are 0. Whether the remaining clients of the 253 in the merged-session set M can be regarded as robots will be determined according to their later visits: if the number of their visit sessions and visit records exceeds the thresholds, they can be detected as well. This is confirmed by our experiments using the following days' server logs.

6. Conclusions

Our detection algorithms are simple, but they are effective, have high accuracy, and need only a few records to decide whether a visitor is a robot. The weakness of the two algorithms is that if the Web pages of a site contain only plain text and no images, sounds, etc., the algorithms may mistake a human visit for a robot visit; likewise, if a person uses a very simple browser, or sets his browser not to display images and not to play sounds, the algorithms may regard the human visitor as a robot. For future work, we would like to take the hyperlinks of the Web pages into account to detect search engine robots more effectively.

References

[1] Pang-Ning Tan, Vipin Kumar. "Discovery of Web Robot Sessions Based on Their Navigational Patterns". Data Mining and Knowledge Discovery, 2002, 6(1): 9-35.
[2] The Web Robots Database. http://www.robotstxt.org/wc/active.html
[3] A Standard for Robot Exclusion. http://www.robotstxt.org/wc/exclusion.html
[4] Hypertext Transfer Protocol, HTTP/1.1.
[5] Web Characterization Terminology & Definitions Sheet. http://www.w3.org/1999/05/WCA-terms/
