
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN 2249-6831, Vol. 3, Issue 4, Oct 2013, 1-8
TJPRC Pvt. Ltd.

A COMPLETE PREPROCESSING METHODOLOGY FOR WUM


CHANDRASHEKAR HC 1, NEELAMBIKE S 2, VEERAGANGADHARA SWAMY TM 3 & B. S. THEERTHARAJU 4

1,2 Research Scholar, Department of CSE, VTU, Karnataka, India
3 Assistant Professor, Department of ISE, VTU, Karnataka, India
4 Associate Professor, Department of ISE, VTU, Karnataka, India

ABSTRACT
The popularity of the Web has resulted in heavy Internet traffic, and this intense increase has caused a significant rise in user-perceived latency. Web Usage Mining (WUM) is potentially useful for reducing it. The exponential growth of the Web, in terms of Web sites and their users, has generated a huge amount of data related to the users' interactions with those sites. This data is recorded in the Web access log files of Web servers and is usually referred to as Web Usage Data (WUD). In this paper, a comprehensive preprocessing methodology is proposed as a prerequisite and first stage for Web mining applications. It has four steps: data collection, data cleaning, identification of users and sessions, and finally data aggregation. An attempt is made to reduce the quantity of the WUD and thereby improve its quality for effective mining of usage patterns. Several heuristics are proposed for cleaning the WUD, which is then aggregated and recorded in a relational database. Experimental results show that the proposed methodology reduces the volume of the WUD effectively.

KEYWORDS: Web Usage Mining, Data Preprocessing, Server Logs, Users and User Sessions

INTRODUCTION
Data preprocessing has a fundamental role in Web Usage Mining (WUM) applications. A significant problem with most pattern discovery methods is their difficulty in handling very large volumes of WUD. Although most WUM processing is done off-line, the size of the WUD is orders of magnitude larger than that met in common machine learning applications. Rushing to analyze usage data without a proper preprocessing method will lead to poor results or even to failure, yet preprocessing methodology has not received enough analysis effort. Managing the continuously increasing quantity of data, and the great diversity of pages on a Web site, has therefore become critical for WUM applications.

In this paper, we propose a comprehensive preprocessing methodology that allows the analyst to transform any collection of Web server log files into a structured collection of tables in a relational database for further use in the WUM process. The log file is cleaned by removing all unnecessary requests, such as implicit requests for the objects embedded in the Web pages and the requests generated by non-human clients of the Web site (i.e. Web robots) [4], [5], [16]. Then, the remaining requests are grouped by user, user session, page view, and visit. Finally, the cleaned and transformed collections of requests are saved into a relational database model. We provide filters to remove the unwanted, irrelevant, and unused data; the analyst can select the log file of a Web server and decide which entries he/she is interested in (e.g. HTML, PDF, and TXT). The objective of this paper is to considerably reduce the large quantity of Web usage data available and, at the same time, to increase its quality by structuring it and providing additional aggregated variables for the data mining analysis.


RELATED WORK
In recent years, there has been much research on Web usage mining [13,14,15,16,17,18,19,20,21,22,23]. However, as described below, data preprocessing in WUM has received far less attention than it deserves. Methods for user identification, sessionizing, page view identification, path completion, and episode identification are presented in [3]. However, some of the proposed heuristics are not appropriate for larger and more complex Web sites. For example, the authors propose to use the site topology in conjunction with the ECLF [2] file for what they call user identification. The proposed heuristic aims to distinguish between users with the same IP address, OS, and browser by checking every requested page in chronological order: if a requested page is not referred to by any previously requested page, it belongs to a new user session. The drawback of this approach is that it considers only one way of navigating a Web site, namely following links. In order to change the current page, however, users can, for instance, type the new URL in the address bar (most browsers have an auto-completion feature that facilitates this) or select it from their bookmarks. In another work [15], the authors compared time-based and referrer-based heuristics for visit reconstruction. They found that a heuristic's appropriateness depends on the design of the Web site (i.e. whether the site is frame-based or frame-free) and on the length of the visits (the referrer-based heuristic performs better for shorter visits). In [16], Marquardt et al. addressed the application of WUM in the e-learning area with a focus on the preprocessing phase; in this context, they redefined the notion of visit from the e-learning point of view. A Web log data preprocessing algorithm based on collaborative filtering is discussed in [17]. Preprocessing methods for Web log files that can be used for the task of session identification are discussed in [18].

DATA PREPROCESSING
Data preprocessing of Web logs is usually complex and time-demanding. It comprises the following steps: data collection, data cleaning, data structuration, and data aggregation, as the following figure depicts.

Figure 1: Phases of Data Preprocessing in WUM

Data Collection
At the beginning of data preprocessing, we have the log containing the Web server log files collected from a Web server. First, we anonymize the log file for privacy reasons: we remove the host names or IP addresses by replacing the original host name with an identifier that keeps the information about the domain extension (i.e. the country code or organization type, such as .com, .edu, and .org). These retained parameters can also be used later in the analysis.
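As an illustration of this anonymization step, the following Python sketch replaces a host field with an opaque identifier while keeping the domain extension. The hashing scheme and the "unresolved" label for raw IP addresses are our own assumptions for illustration, not part of the paper's tool.

```python
import hashlib

def anonymize_host(host):
    """Replace a host name or IP with an opaque identifier, keeping only the
    domain extension (.com, .edu, country code, ...) for later analysis.
    Illustrative sketch; the identifier scheme is an assumption."""
    parts = host.rsplit(".", 1)
    # Keep the extension only when it is non-numeric (i.e. not a raw IP address).
    extension = parts[-1] if len(parts) == 2 and not parts[-1].isdigit() else "unresolved"
    token = hashlib.sha1(host.encode("utf-8")).hexdigest()[:10]
    return "host-%s.%s" % (token, extension)

print(anonymize_host("dialup42.example.edu"))  # e.g. host-1a2b3c4d5e.edu
print(anonymize_host("192.168.10.7"))          # e.g. host-9f8e7d6c5b.unresolved
```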


Data Cleaning
The second step of data preprocessing consists of removing useless requests from the log files. Since not all log entries are valid, we need to eliminate the irrelevant ones. Usually, this process removes requests for non-analyzed resources such as images, multimedia files, and page style files: for example, requests for graphical page content (*.jpg and *.gif images), requests for any other file that might be embedded in a Web page, and even navigation sessions performed by robots and Web spiders. By filtering out useless data, we reduce the log file size, using less storage space and facilitating the upcoming tasks; filtering out image requests alone, for example, can reduce Web server log files to less than 50% of their original size. Thus, data cleaning eliminates irrelevant entries such as:

- Requests executed by automated programs, such as Web robots, spiders, and crawlers; these programs generate traffic to Web sites, can dramatically bias the site statistics, and are not the category that WUM investigates.
- Requests for image files associated with requests for particular pages; a user's request to view a particular page often results in several log entries because that page includes graphics, while we are only interested in what the user explicitly requested, which is usually a text file.
- Entries with unsuccessful HTTP status codes; HTTP status codes indicate the success or failure of a requested event, and we only consider successful entries with codes between 200 and 299.
- Entries with request methods other than GET and POST.
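These criteria can be sketched as a single filter. The Python below assumes Common Log Format entries; the regular expression and the extension list are illustrative choices rather than the paper's implementation.

```python
import re

# Assumed Common Log Format: host ident user [date] "method url proto" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

# Illustrative list of non-analyzed resource extensions.
NON_ANALYZED = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js", ".ico", ".swf")

def keep_entry(line):
    """Return True if a log line survives the cleaning criteria above."""
    m = LOG_LINE.match(line)
    if not m:
        return False                       # malformed entry
    host, date, method, url, status, size = m.groups()
    if method not in ("GET", "POST"):
        return False                       # other request methods
    if not 200 <= int(status) <= 299:
        return False                       # unsuccessful HTTP status codes
    if url.lower().split("?")[0].endswith(NON_ANALYZED):
        return False                       # embedded, non-analyzed resources
    return True

sample = '157.55.1.9 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(keep_entry(sample))  # True
```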

Removing Requests for Non-Analyzed Resources

Nowadays, most Web pages contain images. Whether these images serve design purposes (such as lines and colored buttons) or carry information (such as graphics and maps), the decision to keep or remove their requests from the Web usage log files depends on the purpose of the mining. For Web caching or prefetching applications, the log analyst should not remove the entries referring to images and multimedia files: predicting requests for such files matters more to a Web cache than requests for other (text) files, because images are usually larger than HTML documents. Conversely, analysts who want to find flaws in the structure of a Web site, or to provide visitors with personalized dynamic links, should keep only the requests that represent users' actual actions. In addition to image files, Web pages can trigger requests for other embedded file types, such as page style files, script (JavaScript) files, and applet (Java object code) files. Except for the resources needing explicit requests (like some applet files), the requests for these files should be removed, as they will not bring any new knowledge to the pattern discovery phase.

Removing Web Robots' Requests

A Web robot (WR), also called a spider or bot, is a software tool that periodically scans a Web site to extract its content. WRs automatically follow all the hyperlinks from a Web page. Search engines such as Google periodically use WRs to gather all the pages from a Web site in order to update their search indexes. The number of requests from one WR may be equal to the number of the Web site's URIs; if the Web site does not attract many visitors, the number of requests coming from all the WRs that have visited the site might exceed that of human-generated requests. Eliminating WR-generated log entries not only simplifies the mining task that follows, but also removes uninteresting sessions from the log file. Usually, a WR has a breadth-first (or depth-first) search strategy and follows all the links from a Web page; therefore, a WR will generate a huge number of requests on a Web site. Moreover, the requests of a WR are out of the analysis scope, as the analyst is interested in discovering knowledge about users' behaviour. Most Web robots identify themselves through the user agent field of the log file, and several reference databases of known robots are maintained [4, 5, 6, 7]. However, these databases are not exhaustive: every day new WRs appear or are renamed, making the WR identification task more difficult.

To identify Web robot requests, the data cleaning module implements two different techniques. In the first technique, all records containing the name robots.txt in the requested resource name (URL) are identified and directly removed. The second technique is based on the fact that crawlers retrieve pages in an automatic and exhaustive manner, so they are distinguished by a very high browsing speed. Therefore, the browsing speed is calculated for each distinct IP address, and all requests from addresses whose speed exceeds a threshold are regarded as made by robots and are consequently removed. The value of the threshold is set by analyzing the browsing behaviour arising from the considered log files. Performing this cleaning before identifying the user-interested patterns helps detect those patterns accurately, since only the relevant Web log entries are passed to the final identification phase.
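A minimal sketch of the two robot-detection techniques follows, assuming already-parsed entries with host, URL, and timestamp fields; the threshold of 3 requests per second is purely illustrative and, as noted above, must be tuned against the actual log.

```python
from collections import defaultdict

SPEED_THRESHOLD = 3.0  # requests per second; illustrative, tuned per log file

def remove_robot_requests(entries):
    """entries: list of dicts with 'host', 'url' and 'time' (datetime) keys.
    Technique 1: drop every host that ever requested robots.txt.
    Technique 2: drop hosts whose overall browsing speed exceeds the threshold."""
    robot_hosts = {e["host"] for e in entries if "robots.txt" in e["url"]}
    times = defaultdict(list)
    for e in entries:
        times[e["host"]].append(e["time"])
    for host, ts in times.items():
        ts.sort()
        span = (ts[-1] - ts[0]).total_seconds()
        # Guard against zero span; a burst of requests in under a second is robot-like.
        if len(ts) > 1 and len(ts) / max(span, 1.0) > SPEED_THRESHOLD:
            robot_hosts.add(host)
    return [e for e in entries if e["host"] not in robot_hosts]
```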

Data Structuration
This step groups the unstructured requests of a log file by user and user session, and completes the navigation paths. At the end of this step, the log file will be a set of transactions, where by transaction we refer to a user session with its completed path.

User Identification

In most cases, the log file provides only the computer address (name or IP) and the user agent (for ECLF [2] log files). For Web sites requiring user registration, the log file also contains the user login (as the third record in a log entry), and in this case we use this information for user identification. When the user login is not available, we consider (if necessary) each IP address as a user, although we know that an IP address can be shared by several users. In this paper, we approximate users in terms of IP address, type of OS, and user agent.

User Session Identification

Identifying the user sessions from the log file is not a simple task, due to proxy servers, dynamic addresses, and cases where multiple users access the same computer (at a library, an Internet cafe, etc.) or one user uses multiple browsers or computers. A user session is defined as a sequence of requests made by a single user over a certain navigation period, and a user may have a single session or multiple sessions during a period of time. Session identification is the process of segmenting the access log of each user into individual access sessions [5]. Two time-oriented heuristic methods, one based on session duration and one based on page stay time, have been proposed for session identification in [6, 7, 8]. In this paper, we use a timeout threshold to define the user sessions. The rules we use to identify user sessions in our experiment are: if there is a new user, there is a new session; and if the time between page requests exceeds a certain limit (30 or 25.5 minutes), it is assumed that the user is starting a new session.

Path Completion

Client- or proxy-side caching can often result in missing access references to pages or objects that have been cached. For instance, if a user goes back to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side; therefore, no request is made to the server, and the second reference to A is not recorded in the server logs. Path completion is the task of inferring such missing user references due to caching. Effective path completion requires extensive knowledge of the link structure within the site. Referrer information in the server logs can also be used to disambiguate the inferred paths; the referrer plays an important role in determining the path for a particular request.
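The user and session identification rules above reduce to a short grouping procedure. In this sketch, a user is approximated by the (IP address, user agent) pair and a 30-minute timeout splits sessions; the field names are assumptions.

```python
from collections import defaultdict
from datetime import timedelta

TIMEOUT = timedelta(minutes=30)  # the experiment also uses a 25.5-minute threshold

def sessionize(entries):
    """entries: cleaned log records with 'host', 'agent' and 'time' (datetime) keys.
    Returns a list of sessions, each a chronological list of requests by one user."""
    by_user = defaultdict(list)
    for e in entries:                      # approximate a user by (IP, user agent)
        by_user[(e["host"], e["agent"])].append(e)
    sessions = []
    for requests in by_user.values():      # a new user opens a new session
        requests.sort(key=lambda e: e["time"])
        current = [requests[0]]
        for prev, cur in zip(requests, requests[1:]):
            if cur["time"] - prev["time"] > TIMEOUT:
                sessions.append(current)   # timeout exceeded: start a new session
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions
```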

Figure 2: Path Navigation

Data Aggregation
This is the last step of data preprocessing. In this step, we first transfer the structured file containing sessions and visits to a relational database. Afterwards, we apply data generalization at the request level (for URLs) and compute the aggregated data for visits and user sessions to completely fill in the database.


Table 1: Table Structures for Log Data and Session Data

Log Data
    Hostip       char       Not Null
    Accessdate   datetime   Not Null
    url          char       Not Null
    Bytes        int        Not Null
    Statuscode   int        Not Null
    Referrer     char       Null
    Useragent    char       Null

Session Data
    SessionId    int        Not Null
    Hostip       char       Not Null
    Accessdate   datetime   Not Null
    url          char       Not Null
    Referrer     char       Null
    Useragent    char       Null
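Table 1 maps directly onto relational DDL. The sketch below uses SQLite purely for illustration (the paper does not name a database engine); since SQLite lacks native char/datetime types, TEXT and INTEGER affinities stand in for them.

```python
import sqlite3

conn = sqlite3.connect("wum.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS log_data (
    hostip     TEXT    NOT NULL,   -- char in Table 1
    accessdate TEXT    NOT NULL,   -- datetime in Table 1
    url        TEXT    NOT NULL,
    bytes      INTEGER NOT NULL,
    statuscode INTEGER NOT NULL,
    referrer   TEXT,               -- nullable
    useragent  TEXT                -- nullable
);
CREATE TABLE IF NOT EXISTS session_data (
    sessionid  INTEGER NOT NULL,
    hostip     TEXT    NOT NULL,
    accessdate TEXT    NOT NULL,
    url        TEXT    NOT NULL,
    referrer   TEXT,
    useragent  TEXT
);
""")
conn.commit()
```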

EXPERIMENT RESULTS
Table 2: Results of Data Preprocessing

    Total entries in Web log                  92168
    Entries after data cleaning               26584
    Images and unsuccessful status codes      60605
    robots.txt requests                        2151
    Robots (by user agent)                     5294

After data cleaning, the number of requests declined from 92168 to 26584. Figure 3 shows the detailed changes from data cleaning.

Figure 3: Preprocessing Result Bar Chart

In general, preprocessing can take up to 60-80% of the time spent analyzing the data, and an incomplete preprocessing task can easily result in invalid patterns and wrong conclusions. The size of the original log file before preprocessing is 37765942 bytes (36.01 MB), and after preprocessing it is 4251968 bytes (4.06 MB), so the reduction in log file size is 88.74%. Finally, we identified 3546 unique users and, on the basis of the user identification results, 4319 sessions, using a threshold of 25.5 minutes and path completion.
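The reported reduction follows directly from the two file sizes:

```python
before, after = 37765942, 4251968   # bytes, from the experiment
print(100 * (1 - after / before))   # -> 88.7412... percent reduction
```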


CONCLUSIONS
In this paper, we have presented a preprocessing methodology as a prerequisite for WUM. The experimental results presented above illustrate the importance of the data preprocessing step and the effectiveness of our methodology: it reduces the size of the log file and also increases the quality of the available data through the new data structures obtained. Although the presented preprocessing methodology allows us to reassemble most of the initial visits, the process itself does not fully guarantee that we identify all the transactions (i.e. user sessions and visits) correctly. This can be due to the poor quality of the initial log file as well as to other factors involved in the log collection process (e.g. different browsers, Web servers, cache servers, proxies, etc.). Such misidentification errors will affect the data mining, resulting in erroneous Web access patterns. Therefore, we need a solid procedure that guarantees the quality and accuracy of the data obtained at the end of data preprocessing. In conclusion, our methodology is more complete because:

- It offers the possibility of jointly analyzing multiple Web server logs;
- It employs effective heuristics for detecting and eliminating Web robot requests;
- It proposes a complete relational database model for storing the structured information about the Web site, its usage, and its users.

REFERENCES
1. Bamshad Mobasher, Web Usage Mining, http://maya.cs.depaul.edu/~mobasher/webminer/survey/node6.html, 1997.
2. Li Chaofeng, "Research and Development of Data Preprocessing in Web Usage Mining."
3. Rajni Pamnani, Pramila Chawan, "Web Usage Mining: A Research Area in Web Mining."
4. Andrew Shen, HTTP User Agent List, http://www.httpuseragent.org/list/
5. Andreas Staeding, User-Agents (Spiders, Robots, Crawlers, Browsers), http://www.user-agents.org/
6. Robots IP Addresses, http://chceme.info/ips/
7. Volatile Graphix, Inc., http://www.iplists.com/nw/
8. Configuration files of W3C httpd, http://www.w3.org/Daemon/User/Config/, 1995.
9. W3C Extended Log File Format, http://www.w3.org/TR/WD-logfile.html, 1996.

10. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web usage mining: discovery and applications of usage patterns from Web data, SIGKDD Explorations, 1(2), 2000, 12-23.
11. R. Kosala, H. Blockeel, Web mining research: a survey, SIGKDD Explorations: newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2(1), 2000, 1-15.
12. R. Kohavi, R. Parekh, Ten supplementary analyses to improve e-commerce web sites, in: Proceedings of the Fifth WEBKDD Workshop, 2003.
13. B. Mobasher, R. Cooley, and J. Srivastava, Creating adaptive Web sites through usage-based clustering of URLs, in: IEEE Knowledge & Data Engineering Workshop (KDEX'99), 1999.


14. Bettina Berendt, Web usage mining, site semantics, and the support of navigation, in: Proceedings of the Workshop WEBKDD 2000 - Web Mining for E-Commerce - Challenges and Opportunities, 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.
15. B. Berendt and M. Spiliopoulou, Analysis of navigation behaviour in Web sites integrating multiple information systems, VLDB Journal, 9(1), 2000, 56-75.
16. J. Vellingiri, A novel technique for Web log mining with better data cleaning and transaction identification, Science Publications, 2011, ISSN 1549-3636.
17. Jiang Chang-bin, Chen Li, Web log data preprocessing based on collaborative filtering, in: 2010 Second International Workshop on Education Technology and Computer Science (ETCS), vol. 2, pp. 118-121, 6-7 March 2010.
18. Thanakorn Pamutha, Siriporn Chimphlee, Chom Kimpan et al., Data preprocessing on Web server log files for mining users access patterns, International Journal of Research and Reviews in Wireless Communications (IJRRWC), 2(2), June 2012, ISSN 2046-6447, Science Academy Publisher, United Kingdom.
19. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web usage mining: discovery and applications of usage patterns from Web data, SIGKDD Explorations, 1(2), 2000, 12-23.
20. R. Kosala, H. Blockeel, Web mining research: a survey, SIGKDD Explorations, ACM, 2(1), 2000, 1-15.
21. R. Kohavi, R. Parekh, Ten supplementary analyses to improve e-commerce web sites, in: Proceedings of the Fifth WEBKDD Workshop, 2003.
22. B. Mobasher, R. Cooley, and J. Srivastava, Creating adaptive Web sites through usage-based clustering of URLs, in: IEEE Knowledge & Data Engineering Workshop (KDEX'99), 1999.
23. Bettina Berendt, Web usage mining, site semantics, and the support of navigation, in: Proceedings of the Workshop WEBKDD 2000 - Web Mining for E-Commerce - Challenges and Opportunities, 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.
