
Design and Implementation of an Effective Web Server Log Preprocessing System

Saritha Vemulapalli1, Dr. Shashi M2
1 Associate Professor, Department of Information Technology, VNR Vignana Jyothi Inst. of Engg. & Tech, Hyderabad, A.P, India. saritha_vemulapalli@yahoo.com
2 Professor, Department of CS & SE, Andhra University College of Engg (A), Visakhapatnam, A.P, India. smogalla2000@yahoo.com

Abstract. The WWW constitutes a huge, distributed and dynamically growing hypermedium supporting access to information and services. As more organizations rely on the WWW to conduct business, analyzing user behavior in web-based applications becomes increasingly difficult. Information about users' interactions with a website is stored in server logs, which serve as a large electronic survey of the website. Web usage mining deals with discovering usage patterns from server logs in order to understand and better serve the needs of web users. The raw information contained in a log file is noisy, so preprocessing (cleaning, user identification, sessionization, path completion and structurization) is a prerequisite for improving the accuracy and efficiency of the subsequent mining process. This paper presents an effective web log preprocessing system. Experimental results show that the proposed system reduces the size of the log file down to 12% of its original size and improves the performance of preprocessing in identifying users and sessions, completing paths and structurizing the data.

Keywords: Data Mining, Web Log Mining, Web Usage Mining, Preprocessing, Cleaning, User Identification, Sessionization, Path Completion.

1 Introduction
Since 1991 the WWW has grown rapidly in popularity. It has become a vast distributed information source comprising about 8.75 million websites, 2.5 billion web pages and a great many users [1]. The WWW constitutes a huge repository, a widely distributed and dynamically growing hypermedium supporting access to information and services. With the explosive growth of information sources available on the WWW, providing web users with exactly the information they need is becoming a critical issue in web-based applications, and it has become necessary for users to employ automated tools in order to find, extract, filter and evaluate the desired information. Web-based applications generate and collect large volumes of data in their day-to-day activities. The majority of this data is generated automatically by web servers and collected in server logs in an unstructured format. As a result, web usage mining has attracted a lot of attention in recent times [2].

Web mining is the application of data mining that deals with the extraction of interesting knowledge from WWW documents and services, expressed in the form of textual, linkage or usage information [3]. Web mining can be divided into web content mining, web structure mining and web usage mining. Web content mining is the process of discovering useful knowledge from the raw data (text, image, audio or video data) available in web pages. Web structure mining is the process of analyzing the links between the pages of a website using the web topology. Cooley et al. [4] introduced the term web usage mining in 1997; it is defined as the process of extracting useful information from server logs (i.e. the users' history) to improve web services and performance. The discovered user access patterns can be used in a variety of applications, such as identifying the typical behavior of users [5]. Typical applications are website design and management, adaptive websites, web personalization, recommendation systems, cross marketing strategies, promotional campaigns and user behavior analysis, for example by making clusters of users with similar access patterns and by adding navigational links [6].

The paper is organized as follows. Section 2 gives an overview of web usage mining. The design and implementation of the proposed preprocessing system and the related algorithms are presented in Section 3. Section 4 covers the experimental results, which prove the effectiveness and efficiency of our algorithms. Conclusions are given in Section 5.

2 Web Usage Mining Process

Web usage mining is the discovery of user access patterns from server logs. The web usage mining process consists of data collection, preprocessing, pattern discovery and analysis, and visualization [7]. The data used for the mining process can be collected from the server side, the client side, proxy servers, the website topology, web page contents and user profile information. Server logs, which are collected as a result of users' interactions with the website and are represented in standard formats (e.g. Common Log Format [8] and Extended Common Log Format [9]), are the primary source of data for web usage mining. The raw information in a web server log file does not represent structured, complete, reliable and consistent data, and low-quality data will lead to low-quality mining results. Preprocessing techniques can improve the quality of the data; preprocessing involves cleaning, user identification, session identification, path completion and data structurization [10]. Statistical and data mining techniques can then be applied to the preprocessed web log data in order to discover statistics and user access patterns, which are represented using visualization techniques such as charts, graphs and reports.

2.1 Data Preprocessing

As web server logs are not designed for data mining, preprocessing must be carried out in order to obtain reliable and accurate data. Low-quality data will lead to low-quality mining results; hence, data preprocessing techniques are applied to improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Nearly 80% of the mining effort is required to improve the quality of the data [12]. Most researchers consider the web server log file to be the most reliable and accurate source for the web usage mining (WUM) process. The following are some of the drawbacks of using server logs:
• A web cache keeps track of pages that are requested and saves a copy of these pages for a certain period; requests served from the cache are not recorded in the log files.
• The browser's back button is "the second most used feature" on the web; it accounts for 41% of all user interaction requests for web documents [11].
• IP addresses are easily misinterpreted because of shared computers.
• Since HTTP is stateless, web server logs do not identify sessions or users.
Our proposed system addresses all the above issues.

2.2 Common Log Format

Each line in a log file represented in the Common Log Format has the following syntax, where a "-" in a field indicates missing data:

[Host/IP Rfcname Userid [DD/MMM/YYYY:HH:MM:SS -0000] "Method /Path HTTP/1.x" Code Bytes]

2.3 Extended Common Log Format

The Extended Common Log Format (ECLF) is an extension to the Common Log Format, carrying additional information such as the user_agent, cookie and referrer. User_agent records the visitor's browser version and operating system, and referrer records the URL from which the visitor came. Each line in a log file represented in the Extended Common Log Format has the following syntax:

[s-computername s-ip s-port c-ip rfcname cs-userid date time cs-method cs-uri-stem cs-uri-query cs-version sc-status time-taken sc-bytes cs(user-agent) cs(cookie) cs(referrer)]
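To make the record layout concrete, the following Java sketch splits one ECLF line into the fields used by the later preprocessing steps (Java being the language our system is implemented in, cf. Section 4). The class name and the simple whitespace split are illustrative assumptions only; they rely on the server encoding spaces inside fields such as the user agent (as IIS does with '+'), and are not the exact parser used in our experiments.

    // Minimal sketch (not the exact implementation): one parsed ECLF record.
    // Field order follows the ECLF layout listed above; the line is assumed to be
    // space-delimited, with spaces inside fields (e.g. the user agent) already
    // encoded by the server.
    public class EclfRecord {
        public final String clientIp, dateTime, method, uriStem, version,
                status, userAgent, referrer;

        private EclfRecord(String[] f) {
            clientIp  = f[3];                  // c-ip
            dateTime  = f[6] + " " + f[7];     // date, time
            method    = f[8];                  // cs-method
            uriStem   = f[9];                  // cs-uri-stem
            version   = f[11];                 // cs-version
            status    = f[12];                 // sc-status
            userAgent = f[15];                 // cs(user-agent)
            referrer  = f[17];                 // cs(referrer)
        }

        public static EclfRecord parse(String line) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 18)
                throw new IllegalArgumentException("incomplete ECLF record: " + line);
            return new EclfRecord(fields);
        }
    }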

3 Design and Implementation of Proposed Preprocessing System

The proposed preprocessing system uses the server logs of www.vnrvjiet.ac.in, which are implicitly generated as a result of users' interactions with the website and are represented in the Extended Common Log Format (ECLF). The system consists of components for cleaning, user identification, session identification, path completion and data structurization, as shown in Fig. 1. The implementation issues are explained below.

Fig. 1. Proposed Preprocessing System

Data Cleaning: The process of removing entries that are irrelevant and redundant for pattern discovery. HTTP is a stateless and connectionless protocol that requires a separate connection for every file requested from the web server. In general a user does not explicitly request all of the graphics, scripts and style sheet files on a web page; they are downloaded automatically because of the embedded HTML tags. Since the main intent of web usage mining is to get a picture of the user's behavior, file requests that the user did not explicitly make should be removed. In real-world data, such irrelevant files are found up to a ratio of 10:1, depending on how many graphics and other files the web pages contain [10], so removing these entries decreases the memory usage and improves performance. Spiders are widely used in web search engine tools to update their search indexes [13]; spider requests can be identified by looking at a) all hosts that have requested the page "robots.txt" and b) the user agent field of the log, since many crawlers voluntarily declare themselves there with a URL or an email address. The following rules are used for data cleaning in our proposed system:
i) Removing all the attributes that contain no data at all or are not essential for the analysis.
ii) Removing log entries covering image, sound, video, flash animation, script and style sheet files, frames, pop-up pages, etc.
iii) Removing access records generated by automatic search engine agents such as crawlers, robots and spiders, by checking whether the user agent field contains either a URL or an email address.
iv) Removing log entries that have a status of "error" or "failure"; all entries with a status code outside the 200 range are removed.
v) Removing log entries that have an HTTP request method other than GET or POST.
The following is an algorithm for data cleaning; a sketch of how rules ii) to v) can be applied to parsed records is given after the algorithm.

Input: Records of the server log file, represented as log_file R = {R1, R2, ..., Ri, ..., Rn}, where Ri = <F1, F2, ..., Fj, ..., Fn> is a record in log_file and is defined as <s-computername, s-ip, s-port, c-ip, rfcname, cs-userid, date, time, cs-method, cs-uri-stem, cs-uri-query, cs-version, sc-status, time-taken, sc-bytes, cs(user-agent), cs(cookie), cs(referrer)>.
Output: log_information and data_cleaning, which are database objects.
Algorithm:
Begin
1. Remove the attributes that do not contain data in any record of log_file.
2. Remove attributes that are not essential for the analysis, such as <s-computername, s-ip, s-port, rfcname, cs-uri-query, time-taken, sc-bytes>, from log_file.
3. FOR each record Ri in log_file  // 1<=i<=n
   DO Insert Ri into log_information
   END FOR
4. FOR each record Ri in log_information
   DO IF (Ri does not represent an image, sound, video, flash animation, frame, pop-up page, script or style sheet file, a crawler request, an error request, or a request other than GET or POST)
      THEN Insert Ri into data_cleaning
      END IF
   END FOR
End
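The following Java sketch illustrates how cleaning rules ii) to v) can be applied to one parsed record. It reuses the illustrative EclfRecord type from Section 2; the suffix list and crawler hints are assumed examples for illustration, not the exact lists used in our experiments. A record for which keep() returns true corresponds to a row inserted into the data_cleaning object in step 4 of the algorithm above.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative filter for cleaning rules ii)-v): keep a record only if it is not an
    // embedded/irrelevant file, not a crawler request, has a 2xx status code and uses
    // the GET or POST method. Suffixes and crawler hints below are assumed examples.
    public class DataCleaner {
        private static final List<String> IRRELEVANT_SUFFIXES = Arrays.asList(
                ".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js", ".swf", ".wav", ".avi");
        private static final List<String> CRAWLER_HINTS = Arrays.asList(
                "bot", "crawler", "spider", "slurp", "http://", "www.", "@");

        public static boolean keep(EclfRecord r) {
            String uri   = r.uriStem.toLowerCase();
            String agent = r.userAgent.toLowerCase();
            boolean irrelevantFile = IRRELEVANT_SUFFIXES.stream().anyMatch(uri::endsWith);  // rule ii
            boolean crawler = uri.endsWith("robots.txt")
                    || CRAWLER_HINTS.stream().anyMatch(agent::contains);                    // rule iii
            boolean success = r.status.startsWith("2");                                     // rule iv
            boolean getOrPost = r.method.equalsIgnoreCase("GET")
                    || r.method.equalsIgnoreCase("POST");                                   // rule v
            return !irrelevantFile && !crawler && success && getOrPost;
        }
    }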

User Identification: The process of identifying the unique users who interact with a website through a web browser. The analysis of web usage does not require knowledge of a user's identity; however, it is necessary to distinguish among different users. The following rules are used for user identification in our proposed system:
i) If the IP address is different, a new user is assumed.
ii) If the IP address is the same but the operating system or browser software differs, a new user is assumed.
iii) If the IP address, operating system and browser software are the same but the HTTP version differs, a new user is assumed.
The following is an algorithm for user identification; an illustrative code sketch of the same grouping rule is given after the algorithm.

Input: Set of records "R" = {R1, R2, ..., Ri, ..., Rn} of data_cleaning.
Output: Set of records "U" = {U1, U2, ..., Uj, ..., Un} of users_info, where Uj = <F1, F2, ..., Fk, ..., Fn> is a record in users_info, which is a database object.
Algorithm:
Begin
1. Select the IP, user_agent and version fields of the records of data_cleaning.
2. Insert R1 into users_info  // R1 is the first record in R
3. FOR each record Ri in data_cleaning  // 1<=i<=n
   DO FOR each record Uj in users_info
      IF ((the IP is different) OR (the IP is the same but the operating system or browser software differs) OR (the IP, operating system and browser software are the same but the HTTP version differs))
      THEN Insert Ri into users_info
      END IF
      END FOR
   END FOR
End
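As a rough illustration of rules i) to iii), the sketch below groups cleaned records by the combination of IP address, user agent and HTTP version; every distinct key corresponds to one entry of users_info. It again reuses the illustrative EclfRecord type, and the class and method names are our own, hypothetical choices.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative user identification: two requests belong to the same user only if the
    // client IP, the user agent (browser + operating system) and the HTTP version all match.
    public class UserIdentifier {
        public static Map<String, List<EclfRecord>> identifyUsers(List<EclfRecord> cleaned) {
            Map<String, List<EclfRecord>> users = new LinkedHashMap<>();
            for (EclfRecord r : cleaned) {
                String key = r.clientIp + "|" + r.userAgent + "|" + r.version;
                users.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
            }
            return users;   // one map entry per identified user, as in users_info
        }
    }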

User Session Identification: Web logs span long periods of time, and it is very likely that users will visit the website more than once. Session identification is the process of identifying the sequence of activities of a single user during a single visit of a defined duration [14]. Since the HTTP protocol is stateless and connectionless, discovering the user sessions from the server log is a complex task. The following rules are used for session identification in our proposed system:
i) A new session begins each time there is a new user.
ii) A new session begins each time the time gap between consecutive requests made by the same user exceeds the threshold ∆t = 10 minutes and the referrer is "-".
iii) A new session is identified if the URL in the referrer field has never been accessed before in the current session.

Path Completion: Some important page requests are not recorded in the server log due to caching, thus causing the problem of incomplete paths. Path completion is the process of reconstructing the user's navigation path by appending the missed page requests (page requests that are not recorded in the server log) within the identified sessions. The following rule is used for path completion in our proposed system:
i) Within the identified user sessions, if the URL in the referrer field of a page request is not equivalent to the URL of the last page the user has requested, and the URL in the referrer field is in the user's history, it is assumed that the user used the "back" button. Missing page references that are inferred through this rule are added to the user's session file.

The following is an algorithm for sessionization and path completion; an illustrative code sketch of the same rules is given after the algorithm.

Input: Records "R" = {R1, R2, ..., Rn} of data_cleaning, sorted in ascending order of date and time, and records "U" = {U1, U2, ..., Un} of users_info.
Output: Records "S" = {S1, ..., Sk, ..., Sn} of users_sessions, where Sk = {Uj, pathi} is a session in S, Uj is a record in users_info and pathi is defined as urli1 urli2 ... urlin // 1<=i<=n; and records "RS" = {RS1, RS2, ..., RSk, ..., RSn} of users_sessions_path, where RSk = {Uj, pathi} is a reconstructed session in RS. Both are database objects.
Algorithm:
Begin
Set S = { }, RS = { }.
FOR each record Uj in users_info
DO
  Create a new session Sk and a new reconstructed session RSk.
  FOR each record Ri in data_cleaning
  DO
    IF (the values of IP, user_agent and version are the same)
    THEN
      IF ((the referrer is '-' AND the time gap between consecutive requests by the same user > 10 min) OR (the URL in the referrer field has never been accessed before in the current session))
      THEN
        Create a new session Sk and a new reconstructed session RSk.
        Add the uri-stem field to pathi of the current session Sk and to pathi of the current reconstructed session RSk.
      ELSE
        Add the uri-stem field to pathi of the current session Sk.
        IF (the URL in the referrer field is not equivalent to the URL of the last page the user has requested)
        THEN Add the missing page references to pathi of the current reconstructed session RSk.
        END IF
        Add the uri-stem field to pathi of the current reconstructed session RSk.
      END IF
    END IF
  END FOR
  FOR each session Sk and reconstructed session RSk
  DO
    Insert Sk into users_sessions.
    Insert RSk into users_sessions_path.
  END FOR
END FOR
End
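The sketch below applies the same session identification and path completion rules to one user's chronologically ordered requests. It is a simplified reading of the algorithm: it returns only the reconstructed paths, with the referrer re-inserted whenever a back-button navigation is inferred; the raw sessions are the same lists minus those re-inserted references. The 10-minute threshold follows rule ii); the Request class and its field names are illustrative stand-ins for a cleaned log record.

    import java.time.Duration;
    import java.time.LocalDateTime;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sessionization + path completion for one user's requests, already
    // sorted by date and time. A new session starts when the referrer is "-" and the gap
    // to the previous request exceeds 10 minutes, or when the referrer was never seen in
    // the current session. If the referrer differs from the previously requested page but
    // is already in the session, a back-button navigation is assumed and the referrer is
    // re-inserted into the reconstructed path.
    public class Sessionizer {
        public static class Request {                     // minimal stand-in for a cleaned record
            final LocalDateTime time; final String uri; final String referrer;
            public Request(LocalDateTime t, String u, String r) { time = t; uri = u; referrer = r; }
        }

        public static List<List<String>> reconstructSessions(List<Request> requests) {
            List<List<String>> sessions = new ArrayList<>();
            List<String> path = null;
            Request prev = null;
            for (Request r : requests) {
                boolean timeout = prev != null && "-".equals(r.referrer)
                        && Duration.between(prev.time, r.time).toMinutes() > 10;   // rule ii
                boolean unseenReferrer = path != null && !"-".equals(r.referrer)
                        && !path.contains(r.referrer);                             // rule iii
                if (path == null || timeout || unseenReferrer) {
                    path = new ArrayList<>();                                      // start a new session
                    sessions.add(path);
                } else if (!"-".equals(r.referrer) && !r.referrer.equals(prev.uri)) {
                    path.add(r.referrer);                                          // path completion
                }
                path.add(r.uri);
                prev = r;
            }
            return sessions;
        }
    }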

Data Structurization: The process of transforming the data identified in the various stages of preprocessing and storing it in a form suitable as input to pattern discovery. Different tables are designed in the relational database for each of the objects (log_information, data_cleaning, users_info, users_sessions and users_sessions_path).
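For illustration, structurized sessions can be persisted with plain JDBC as sketched below. The users_sessions_path table corresponds to the object of the same name used in the algorithm above; the column names, the JDBC URL and the way the path is encoded as a string are assumptions made only for this example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    // Illustrative structurization step: store each reconstructed session as one row of a
    // relational table so that the pattern discovery phase can query it directly.
    public class SessionStore {
        public static void store(String jdbcUrl, int userId,
                                 List<List<String>> sessionPaths) throws Exception {
            String sql = "INSERT INTO users_sessions_path (user_id, session_no, path) VALUES (?, ?, ?)";
            try (Connection con = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = con.prepareStatement(sql)) {
                int sessionNo = 1;
                for (List<String> path : sessionPaths) {
                    ps.setInt(1, userId);
                    ps.setInt(2, sessionNo++);
                    ps.setString(3, String.join(" ", path));   // navigation path urli1 urli2 ... urlin
                    ps.executeUpdate();
                }
            }
        }
    }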

4 Experimental Results

The proposed system was developed for IIS web server logs represented in ECLF, using the Java programming language. Experimental analysis is carried out to validate the effectiveness and efficiency of the proposed preprocessing system. The server log file of www.vnrvjiet.ac.in of 15th Nov 2010, having 10,375 records, is selected for analysis. The results of preprocessing are shown in Table 1. After cleaning, the number of records reduces down to 1,220 (12% of the original records), and 235 unique users and 589 user sessions are identified.

Table 1. The Results of Data Preprocessing
Records in log file: 10,375
Records after cleaning: 1,220
No. of unique users: 235
Sessions: 589

Table 2 shows the results of the data cleaning process. Row 1 represents the records in the raw server log file; row 2 the records after removing image, sound, video, flash animation, frame, pop-up page, script and style sheet files; row 3 the records after further removing crawler requests; and row 4 the records after further removing error requests.

Table 2. Results of Data Cleaning Process
Cleaning process    No. of records
1                   10,375
2                   1,581
3                   1,230
4                   1,220

Table 3 shows the results of user identification. Row 1 represents the unique users identified using the IP address, row 2 the unique users identified using the IP address and user_agent, and row 3 the unique users identified using the IP address, user_agent and version. User identification using the IP address alone is not sufficient or reliable: although an IP address may represent one person only, an IP address is in most cases shared by more than one person (at a library, an internet cafe, or when one user uses multiple computers), so different users sharing the same host cannot be distinguished. This can result in several users being erroneously grouped together as one user. The rationale behind our rules is that a user, when navigating the website, rarely employs more than one browser, much less more than one operating system. Our experimental results proved that unique users can be identified more effectively using the user_agent field along with the IP address.

Table 3. Results of User Identification Process
User identification process    No. of users
1                              215
2                              234
3                              235

Table 4 shows the results of the user session identification process. Row 1 represents the sessions identified based on the time gap between two consecutive page requests exceeding 10 minutes, row 2 the sessions identified based on a session duration of 30 minutes, row 3 the sessions identified based on whether the URL in the referrer field has been accessed before in the current session, and row 4 the sessions identified based on the proposed session identification rules. Time based methods are not reliable because users may be involved in some other activities after opening a web page, and factors such as a busy communication line, the loading time of components in a web page and the content size of web pages are not considered. The referrer based method introduces confusion when the user types a URL directly or uses a bookmark to reach pages not connected via links, and the identified sessions may contain more than one visit by the same user at different times. Our experimental results proved that sessions can be identified more effectively using our proposed session identification rules.

Table 4. Results of Session Identification Process
Session identification process    No. of sessions
1                                 269
2                                 253
3                                 818
4                                 589

ac. preprocessing must be carried out to improve the accuracy and efficiency of the subsequent mining process. content size of web pages are not considered.: Grouping Web page references into transactions for mining World Wide Web browsing patterns. IEEE. 4 represents identified sessions based on proposed session identification rules.10min’s.in are analyzed using the proposed preprocessing system in order to identify unique users. l(5) (1997) 58-68. Knowledge and Data Engineering Workshop. loading time of components in web page. 3 represents identified sessions based on if the URL in the referrer field has accessed before in a current session. (1997) 2-9. International Conference on Tools with Artificial Intelligence.vnrvjiet. reliable & consistent data.: Data preparation for mining World Wide Web browsing patterns.. 2 represents identified sessions based on session duration as 30min’s. path completion and structurization. (1997) 558-567. Mobasher B. The proposed system can be enhanced in future for more accurate session identification & path completion. Experimental results proved that the proposed system reduces the size of log file down to 12% and improves the performance of preprocessing in identifying users. Robert Cooley. IEEE Internet Computing. and Jaideep Srivastava. Our experimental results proved that session can be identified more effectively using our proposed session identification rules. Low-quality data will lead to low-quality mining results. 2. complete. Bamshad Mobasher. Server logs of www. Gudivada V N: Information retrieval on the World Wide Web. Journal of Knowledge and Information System..IEEE. 4. References 1. CA. sessions. Cooley R. . Robert Cooley. which play a major role in web usage mining process in order to discover useful hidden patterns reflecting the typical behavior of users. Time based methods are not reliable because users may involve in some other activities after opening the web page and factors such as busy communication line. Referrer based method introduces the confusion when user types URL directly or uses bookmark to reach pages not connected via links and identified sessions may contains more than one visit by the same user at different time. 5 Conclusion & Future Enhancements The raw information contained in a web server log file as a result of user’s interactions with a website doesn’t represent a structured. As the web server logs are not designed for data mining. Bam shad Mobasher. user sessions & path completion and data structurization. 3. (1999) 1-27. and Srivastava J: Web mining: information and pattern discovery on the World Wide Web. Newport Beach. and Jaideep Srivastava. New port Beach.

12. B. 10. Robert Cooley.: Characterizing browsing strategies in the world-wide web. L. vol. Ophir Frieder. .: Using data mining techniques on Web access logs to dynamically improve Hypertext structure. (2007) 79-104. Grossman: Information Retrieval: Algorithms and Heuristics.org/TR/WD-logfile. Cooley. and Pitkow.1 (2000) 12-23. http://www. Tanasa D.Catledge. 1. 13. Journal of Intelligent Informatin Systems. Vol. Computer Networks and ISDN Systems. Vol.. Srivastava.w3.. Masseglia F. Jaideep Srivastava. Poncelet P. R. 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases PKDD’99. Mukund Deshpande. vol.w3. 2nd Edition. Prague. E. (1999) 13-19. In ACM SigWeb Letters.org/Daemon/User/Config/ (1995). Intelligent Systems. (1999). 28. and Pang-Ning Tan: Web usage mining: Discovery and applications of usage patterns from Web data.html (1996). http://www. Z. 9. D.. 6. 1. W3C Extended Log File Format. IEEE. Pabarskaite and A. Trousse B. Raudys: A process of knowledge discovery from web log data: Systematization and critical review. 8.: Data Preparation for Mining World Wide Web Browsing Patterns.. Knowledge and Information Systems. Mobasher. No. and David A. Spiliopoulou.5. Vol 19 (2004) 59 – 65. 27 (1995) 1065-1073. no. 3. J. : Advanced data preprocessing for intersites Web usage mining. and Teisseire M. J. N. 14. The Information Retrieval Series. Czech Republic: Springer-Verlag. 8. Configuration file of W3C httpd. Vol.: Managing interesting rules in sequence mining. 11. (2004). M.. 7. and J. 1 (1999) 5–32. SIGKDD Explorations.