
A COMPLETE PREPROCESSING METHODOLOGY FOR WUM

CHANDRASHEKAR HC¹, NEELAMBIKE S², VEERAGANGADHARA SWAMY TM³ & B. S. THEERTHARAJU⁴

¹,²Research Scholar, Department of CSE, VTU, Karnataka, India
³Assistant Professor, Department of ISE, VTU, Karnataka, India
⁴Associate Professor, Department of ISE, VTU, Karnataka, India

ABSTRACT
The popularity of the Web has resulted in heavy Internet traffic, and this intense increase in traffic has caused a significant rise in user-perceived latency. Web Usage Mining (WUM) is potentially useful in reducing this latency. The exponential growth of the Web in terms of Web sites and their users has generated a huge amount of data related to the users' interactions with Web sites. This data is recorded in the Web access log files of Web servers and is usually referred to as Web Usage Data (WUD). In this paper, a comprehensive preprocessing methodology is proposed as a prerequisite and first stage for Web mining applications. It has four steps: data collection, data cleaning, identification of users and sessions, and finally data aggregation. An attempt is made to reduce the quantity of the WUD and thereby improve its quality for effective mining of usage patterns. Several heuristics are proposed for cleaning the WUD, which is then aggregated and recorded in a relational database. Experimental results show that the proposed methodology reduces the volume of the WUD effectively.
KEYWORDS: Web Usage Mining, Data Preprocessing, Server Logs, Users and User Sessions
INTRODUCTION
Data preprocessing has a fundamental role in Web Usage Mining (WUM) applications. A significant problem with most pattern discovery methods is their difficulty in handling very large volumes of WUD. Although most WUM processing is done off-line, the WUD is orders of magnitude larger than the datasets met in common machine learning applications. Rushing to analyze usage data without a proper preprocessing method will lead to poor results or even to failure, yet preprocessing methodology has not received enough research attention. Managing the continuously increasing quantity of data and the great diversity of pages on a Web site has therefore become critical for WUM applications. In this paper, we propose a comprehensive preprocessing methodology that allows the analyst to transform any collection of Web server log files into a structured collection of tables in a relational database for further use in the WUM process. The log file is cleaned by removing all unnecessary requests, such as implicit requests for the objects embedded in Web pages and the requests generated by non-human clients of the Web site (i.e., Web robots) [4], [5], [16]. Then, the remaining requests are grouped by user, user session, page view, and visit. Finally, the cleaned and transformed collections of requests are saved into a relational database. We provide filters for the unwanted, irrelevant, and unused data: the analyst can select the log file of a Web server and decide which entries are of interest (e.g., HTML, PDF, and TXT). The objective of this paper is to considerably reduce the large quantity of Web usage data available and, at the same time, to increase its quality by structuring it and providing additional aggregated variables for the data mining analysis.
RELATED WORK
In recent years, there has been much research on Web usage mining [13,14,15,16,17,18,19,20,21,22,23]. However, as described below, data preprocessing in WUM has received far less attention than it deserves. Methods for user identification, sessionizing, page view identification, path completion, and episode identification are presented in [3]. However, some of the proposed heuristics are not appropriate for larger and more complex Web sites. For example, the authors propose to use the site topology in conjunction with the ECLF [2] file for what they call user identification. The proposed heuristic aims to distinguish between users with the same IP address, OS, and browser by checking every requested page in chronological order: if a requested page is not referred to by any previously requested page, it belongs to a new user session. The drawback of this approach is that it considers only one way of navigating a Web site, namely following links. In order to change the current page, however, users can, for instance, type the new URL in the address bar (most browsers have an auto-completion feature that facilitates this) or select it from their bookmarks. In another work [15], the authors compared time-based and referrer-based heuristics for visit reconstruction. They found that a heuristic's appropriateness depends on the design of the Web site (i.e., whether the site is frame-based or frame-free) and on the length of the visits (the referrer-based heuristic performs better for shorter visits). In [16], Marquardt et al. addressed the application of WUM in the e-learning area with a focus on the preprocessing phase; in this context, they redefined the notion of a visit from the e-learning point of view. A Web log data preprocessing algorithm based on collaborative filtering is discussed in [17]. Methods for preprocessing Web log files for the task of session identification are discussed in [18].
DATA PREPROCESSING
Data preprocessing of Web logs is usually complex and time-demanding. It comprises the following steps: data collection, data cleaning, data structuration, and data aggregation, as Figure 1 depicts.

Figure 1: Phases of Data Preprocessing in WUM
Data Collection
At the beginning of data preprocessing, we have the log files collected from a Web server. First, we anonymize the log file for privacy reasons: the host names or IP addresses are removed by replacing the original host name with an identifier that keeps the information about the domain extension (i.e., the country code or organization type, such as .com, .edu, and .org). These extensions can still be used later in the analysis.
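A minimal sketch of this anonymization step is given below, assuming Common Log Format entries whose first whitespace-separated field is the remote host; the identifier scheme (hostN.ext) is an illustrative assumption, not part of the methodology itself.

import itertools

_ids = {}                      # original host -> anonymized identifier
_counter = itertools.count(1)

def anonymize_host(host):
    """Replace a host name or IP address with a stable identifier that
    keeps only the domain extension (country code or organization type)."""
    if host not in _ids:
        suffix = host.rsplit('.', 1)[-1].lower()
        ext = suffix if not suffix.isdigit() else 'ip'  # raw IPs have no extension
        _ids[host] = 'host{}.{}'.format(next(_counter), ext)
    return _ids[host]

def anonymize_line(line):
    # In Common Log Format the remote host is the first field.
    host, rest = line.split(' ', 1)
    return anonymize_host(host) + ' ' + rest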
Data Cleaning
The second step of data preprocessing consists of removing useless requests from the log files. Since not all log entries are valid, we need to eliminate the irrelevant ones. Usually, this process removes requests concerning non-analyzed resources such as images, multimedia files, and page style files: for example, requests for graphical page content (*.jpg and *.gif images), requests for any other file that might be embedded in a Web page, and even navigation sessions performed by robots and Web spiders. By filtering out useless data, we reduce the log file size, use less storage space, and facilitate the upcoming tasks; for example, by filtering out image requests alone, the size of Web server log files is reduced to less than 50% of the original. Thus, data cleaning eliminates irrelevant entries such as the following (a filtering sketch is given after the list):

- Requests executed by automated programs such as Web robots, spiders, and crawlers; these programs generate traffic to Web sites, can dramatically bias the site statistics, and are not the category of visitor that WUM investigates.
- Requests for image files associated with requests for particular pages; a user's request to view a particular page often results in several log entries, because that page includes other graphics, while we are only interested in what the users explicitly request, which are usually text files.
- Entries with unsuccessful HTTP status codes; HTTP status codes indicate the success or failure of a request, and we only keep successful entries with codes between 200 and 299.
- Entries with request methods other than GET and POST.
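The sketch below illustrates these cleaning rules on Apache Common/Combined Log Format lines; the regular expression and the list of ignored extensions are assumptions that a real deployment would tune to the site being analyzed.

import re

# Extract the fields the cleaning rules need: method, URL and status code.
LOG_RE = re.compile(r'"(?P<method>\w+) (?P<url>\S+) [^"]*" (?P<status>\d{3})')

IGNORED_EXTENSIONS = ('.jpg', '.jpeg', '.gif', '.png', '.css', '.js', '.ico')

def is_relevant(line):
    """Keep only successful GET/POST requests for analyzed resources."""
    m = LOG_RE.search(line)
    if m is None:
        return False                                 # malformed entry
    if m.group('method') not in ('GET', 'POST'):
        return False                                 # other request methods
    if not 200 <= int(m.group('status')) <= 299:
        return False                                 # unsuccessful status code
    url = m.group('url').split('?')[0].lower()
    return not url.endswith(IGNORED_EXTENSIONS)      # embedded resources

def clean_log(lines):
    return [line for line in lines if is_relevant(line)]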
Removing Requests for Non-Analyzed Resources
Nowadays, most Web pages contain images, whether for design purposes (such as lines and colored buttons) or to convey information (such as graphics and maps). The decision to keep or remove the corresponding entries in the Web usage log files depends on the purpose of the mining. For Web caching or prefetching support, the log analyst should not remove the entries referring to images and multimedia files: predicting requests for such files is important for a Web cache application, usually because images are larger than HTML documents. Conversely, analysts who want to find flaws in the structure of a Web site, or to provide its visitors with personalized dynamic links, should remove these requests, because the requests they keep should represent the users' deliberate actions.

In addition to image files, other file types embedded in Web pages can cause requests, such as page style files, script (JavaScript) files, applet (Java object code) files, and so on. Except for the resources needing explicit requests (like some applet files), the requests for these files need to be removed, as they will not bring any new knowledge to the pattern discovery phase.
Removing Web Robots' Requests
A Web robot (WR) (also called a spider or bot) is a software tool that scans the website regularly to take its
contents. WRS automatically follow all links on a website. Search engines like Google, WRS regularly used to collect all
the pages from the website to update their search indexes. Number of queries from one WR may be equal to the number of
URI's website. If the Web site does not attract many visitors, the number of inquiries come from all WRS who have visited
the site could be more than human-generated requests.
Removing WR-generated log entries not only simplifies the mining task that will follow, but it also removes
uninteresting sessions from the log file. Usually, a WR has a breadth (or depth) first search strategy and follows all the
4 Chandrashekar HC, Neelambike S

& Veeragangadhara Swamy TM & B. S. Theertharaju

links from a Web page. Therefore, a WR will generate a huge number of requests on a Web site. Moreover, the requests of
a WR are out of the analysis scope, as the analyst is interested in discovering knowledge about users' behaviour.
Most of the web bots identify themselves with the user agent field of the log file. Several reference databases
known robot is maintained. However, these databases are not exhaustive and every day new WRS show or a new name,
making the WR identify task more difficult.
Removing Web Robots' Requests
A Web robot (WR), also called a spider or bot, is a software tool that periodically scans a Web site to extract its content. Web robots automatically follow all the hyperlinks from a Web page. Search engines (Yamin and Ramayah, 2011), such as Google, periodically use WRs to gather all the pages from a Web site in order to update their search indexes. The number of requests from one WR may be equal to the number of the Web site's URIs. If the Web site does not attract many visitors, the number of requests coming from all the WRs that have visited the site might exceed that of human-generated requests.
Eliminating WR-generated log entries not only simplifies the mining task that follows, but also removes uninteresting sessions from the log file. Usually, a WR has a breadth-first (or depth-first) search strategy and follows all the links from a Web page; therefore, a WR will generate a huge number of requests on a Web site. Moreover, the requests of a WR are out of the analysis scope, as the analyst is interested in discovering knowledge about users' behavior.
Most Web robots identify themselves through the user agent field of the log file, and several databases referencing the known robots are maintained [Kos, ABC]. However, these databases are not exhaustive, as new WRs appear or are renamed each day, making the WR identification task more difficult. To identify Web robot requests, the data cleaning module implements two different techniques (see the sketch after this list):

- In the first technique, all records containing the name robots.txt in the requested resource name (URL) are identified and directly removed.
- The second technique is based on the fact that crawlers retrieve pages in an automatic and exhaustive manner, so they are distinguished by a very high browsing speed. For each distinct IP address, the browsing speed is calculated, and all requests from addresses whose speed exceeds a threshold are regarded as made by robots and are consequently removed. The value of the threshold is set by analyzing the browsing behavior observed in the log files under consideration. Performing this cleaning before identifying the patterns of interest improves the accuracy of the later phases, since only the relevant Web log entries remain.
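Both techniques are sketched below; the entries are assumed to be parsed into dicts with 'ip', 'url', and 'time' (a datetime) fields, and the threshold of 2 requests per second is purely illustrative, since the threshold is actually set by inspecting the log at hand.

from collections import defaultdict

SPEED_THRESHOLD = 2.0   # assumed: requests/second above which an IP is treated as a robot

def remove_robot_entries(entries):
    # Technique 1: any IP that requested robots.txt belongs to a robot.
    robot_ips = {e['ip'] for e in entries if 'robots.txt' in e['url']}

    # Technique 2: a very high browsing speed betrays a crawler.
    times_by_ip = defaultdict(list)
    for e in entries:
        times_by_ip[e['ip']].append(e['time'])
    for ip, times in times_by_ip.items():
        times.sort()
        duration = (times[-1] - times[0]).total_seconds()
        if len(times) > 1 and duration > 0 and len(times) / duration > SPEED_THRESHOLD:
            robot_ips.add(ip)

    return [e for e in entries if e['ip'] not in robot_ips]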
Data Structuration
This step groups the unstructured requests of a log file by user and user session, and performs path completion. At the end of this step, the log file becomes a set of transactions, where by transaction we refer to a user session with its completed paths.
User Identification
In most cases, the log file provides only the computer address (name or IP) and the user agent (for ECLF [2] log files). For Web sites requiring user registration, the log file also contains the user login (as the third field of a log entry); in this case, we use this information for user identification. When the user login is not available, we consider (if necessary) each IP as a user, although we know that an IP address can be shared by several users. In this paper, we approximate users in terms of IP address, type of OS, and user agent.
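A minimal sketch of this approximation is shown below; since the OS is carried inside the user agent string, keying on the (IP, user agent) pair covers all three criteria (the parsed-entry fields are assumptions carried over from the earlier sketches).

def identify_users(entries):
    """Assign a user id to each parsed log entry; users are approximated
    by the (IP address, user agent) pair, which also distinguishes
    different OS/browser combinations behind a shared IP."""
    user_ids = {}
    for e in entries:
        key = (e['ip'], e['useragent'])
        e['user'] = user_ids.setdefault(key, len(user_ids) + 1)
    return entries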
User Session Identification
Identifying user sessions from the log file is not a simple task, due to proxy servers, dynamic addresses, cases where multiple users access the same computer (at a library, an Internet cafe, etc.), and cases where one user uses multiple browsers or computers. A user session is defined as a sequence of requests made by a single user over a certain navigation period, and a user may have one or more sessions during a period of time. Session identification is the process of segmenting the access log of each user into individual access sessions [5]. Two time-oriented heuristic methods, one based on session duration and one based on page stay time, have been proposed in [6, 7, 8] for session identification. In this paper, we use a timeout threshold to delimit user sessions, applying the following rules in our experiment (a sketch follows the list):

- If there is a new user, there is a new session.
- If the time between page requests exceeds a certain limit (e.g., 30 or 25.5 minutes), it is assumed that the user is starting a new session.
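A sketch of timeout-based session identification under these two rules, using the 25.5-minute threshold from our experiment (the entries are assumed to carry the 'user' and 'time' fields produced by the earlier sketches):

from datetime import timedelta

TIMEOUT = timedelta(minutes=25.5)   # threshold used in our experiment

def identify_sessions(entries):
    """Rule 1: a new user starts a new session.  Rule 2: a gap longer
    than the timeout between two requests starts a new session."""
    session_id = 0
    last_seen = {}   # user id -> time of that user's previous request
    for e in sorted(entries, key=lambda x: (x['user'], x['time'])):
        previous = last_seen.get(e['user'])
        if previous is None or e['time'] - previous > TIMEOUT:
            session_id += 1
        e['session'] = session_id
        last_seen[e['user']] = e['time']
    return entries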
Path Completion
Client- or proxy-side caching can often result in missing access references for pages or objects that have been cached. For instance, if a user goes back to a page A during the same session, the second access to A will likely be served from the previously downloaded version of A cached on the client side, and therefore no request is made to the server. This results in the second reference to A not being recorded in the server logs. Path completion addresses how to infer such missing user references due to caching:

- Effective path completion requires extensive knowledge of the link structure within the site.
- Referrer information in the server logs can also be used to disambiguate the inferred paths.
- The referrer plays an important role in determining the path for a particular request (a simplified sketch follows Figure 2).

Figure 2: Path Navigation
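The sketch below shows a simplified, referrer-only form of path completion for one session; real path completion would also consult the site's link structure, and the entry fields are the same assumed dicts as in the earlier sketches.

def complete_path(session_entries):
    """If a request's referrer is not the previous page of the session,
    the user probably backtracked through cached pages; re-insert the
    backtracked pages so the reconstructed path is contiguous."""
    path = []
    for e in session_entries:          # entries in chronological order
        ref = e.get('referrer')
        if path and ref and ref != path[-1] and ref in path:
            idx = path.index(ref)
            # Inferred cached views: back from the current page to the referrer.
            path.extend(reversed(path[idx:-1]))
        path.append(e['url'])
    return path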
Data Aggregation
This is the last step of data preprocessing. First, we transfer the structured file containing sessions and visits to a relational database. Afterwards, we apply data generalization at the request level (for URLs) and compute the aggregated data for visits and user sessions to completely fill in the database.
Table 1: Table Structures for Log Data and Session Data

Log Data (column, type, constraint):
Hostip      char      Not Null
Accessdate  datetime  Not Null
Url         char      Not Null
Bytes       int       Not Null
Statuscode  int       Not Null
Referrer    char      Null
Useragent   char      Null

Session Data (column, type, constraint):
SessionId   int       Not Null
Hostip      char      Not Null
Accessdate  datetime  Not Null
Url         char      Not Null
Referrer    char      Null
Useragent   char      Null
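A sketch of the transfer step using SQLite follows; the column types mirror Table 1, while sqlite3 itself is only an illustrative choice, since the methodology does not mandate a particular relational database.

import sqlite3

def create_tables(conn):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS log_data (
            hostip     TEXT      NOT NULL,
            accessdate TIMESTAMP NOT NULL,
            url        TEXT      NOT NULL,
            bytes      INTEGER   NOT NULL,
            statuscode INTEGER   NOT NULL,
            referrer   TEXT,
            useragent  TEXT
        );
        CREATE TABLE IF NOT EXISTS session_data (
            sessionid  INTEGER   NOT NULL,
            hostip     TEXT      NOT NULL,
            accessdate TIMESTAMP NOT NULL,
            url        TEXT      NOT NULL,
            referrer   TEXT,
            useragent  TEXT
        );
    """)

def save_sessions(conn, entries):
    # Entries carry the fields accumulated by the preprocessing sketches above.
    conn.executemany(
        "INSERT INTO session_data VALUES (?, ?, ?, ?, ?, ?)",
        [(e['session'], e['ip'], e['time'].isoformat(), e['url'],
          e.get('referrer'), e.get('useragent')) for e in entries])
    conn.commit()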

EXPERIMENTAL RESULTS
Table 2: Results of Data Preprocessing

Total entries in Web log              92168
Entries after data cleaning           26584
Images and unsuccessful status codes  60605
robots.txt requests                    2151
Robot user agents                      5294

After data cleaning, the number of requests declined from 92168 to 26584. Figure 3 shows the detailed changes achieved by data cleaning.

Figure 3: Preprocessing Result Bar Chart
In general, preprocessing can take up to 60-80% of the time spent analyzing the data, and an incomplete preprocessing task can easily result in invalid patterns and wrong conclusions. The size of the original log file before preprocessing was 37765942 bytes (36.01 MB), and after preprocessing it was 4251968 bytes (4.06 MB), a reduction of (37765942 - 4251968) / 37765942 ≈ 88.74%. Finally, we identified 3546 unique users; on the basis of the user identification results, we identified 4319 sessions using a threshold of 25.5 minutes and path completion.
CONCLUSIONS
In this paper, we have presented a preprocessing methodology as a prerequisite for WUM. The experimental results presented in the previous section illustrate the importance of the data preprocessing step and the effectiveness of our methodology: it not only reduces the size of the log file but also increases the quality of the available data through the new data structures obtained. Although the presented preprocessing methodology allows us to reassemble most of the initial visits, the process itself does not fully guarantee that we correctly identify all the transactions (i.e., user sessions and visits). This can be due to the poor quality of the initial log file as well as to other factors involved in the log collection process (e.g., different browsers, Web servers, cache servers, proxies, etc.). These misidentification errors will affect the data mining, resulting in erroneous Web access patterns. Therefore, we need a solid procedure that guarantees the quality and the accuracy of the data obtained at the end of data preprocessing. In conclusion, our methodology is more complete because:

- It offers the possibility of jointly analyzing multiple Web server logs;
- It employs effective heuristics for detecting and eliminating Web robot requests;
- It proposes a complete relational database model for storing the structured information about the Web site, its usage, and its users.
REFERENCES
1. Bamshad Mobasher, Web Usage Mining, http://maya.cs.depaul.edu/~mobasher/webminer/survey/node6.html, 1997.
2. Li Chaofeng, Research and Development of Data Preprocessing in Web Usage Mining.
3. Rajni Pamnani, Pramila Chawan, Web Usage Mining: A Research Area in Web Mining.
4. Andrew Shen, HTTP User Agent List, http://www.httpuseragent.org/list/
5. Andreas Staeding, User-Agents (Spiders, Robots, Crawler, Browser), http://www.user-agents.org/
6. Robots IP Address, http://chceme.info/ips/
7. Volatile Graphix, Inc., http://www.iplists.com/nw/
8. Configuration files of W3C httpd, http://www.w3.org/Daemon/User/Config/, 1995.
9. W3C Extended Log File Format, http://www.w3.org/TR/WD-logfile.html, 1996.
10. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web usage mining: discovery and applications of usage patterns from web data, SIGKDD Explorations, 1(2), 2000, 12-23.
11. R. Kosala, H. Blockeel, Web mining research: a survey, SIGKDD Explorations: newsletter of the special interest group (SIG) on knowledge discovery & data mining, ACM, 2(1), 2000, 1-15.
12. R. Kohavi, R. Parekh, Ten supplementary analyses to improve e-commerce web sites, in: Proceedings of the Fifth WEBKDD Workshop, 2003.
13. B. Mobasher, R. Cooley, and J. Srivastava, Creating adaptive Web sites through usage-based clustering of URLs, in: IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), 1999.
14. Bettina Berendt, Web usage mining, site semantics, and the support of navigation, in: Proceedings of the Workshop WEBKDD 2000 - Web Mining for E-Commerce - Challenges and Opportunities, 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.
15. B. Berendt and M. Spiliopoulou, Analysis of navigation behaviour in Web sites integrating multiple information systems, VLDB Journal, 9(1), 2000, 56-75.
16. J. Vellingiri, A Novel Technique for Web Log Mining with Better Data Cleaning and Transaction Identification, Science Publications, 2011, ISSN 1549-3636.
17. Jiang Chang-bin, Chen Li, Web log data preprocessing based on collaborative filtering, in: 2010 Second International Workshop on Education Technology and Computer Science (ETCS), vol. 2, pp. 118-121, March 2010.
18. Thanakorn Pamutha, Siriporn Chimphlee, Chom Kimpan et al., Data preprocessing on Web server log files for mining users' access patterns, International Journal of Research and Reviews in Wireless Communications (IJRRWC), 2(2), June 2012, ISSN 2046-6447, Science Academy Publisher, United Kingdom.
19. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web usage mining: discovery and applications of usage patterns from web data, SIGKDD Explorations, 1(2), 2000, 12-23.
20. R. Kosala, H. Blockeel, Web mining research: a survey, SIGKDD Explorations: newsletter of the special interest group (SIG) on knowledge discovery & data mining, ACM, 2(1), 2000, 1-15.
21. R. Kohavi, R. Parekh, Ten supplementary analyses to improve e-commerce web sites, in: Proceedings of the Fifth WEBKDD Workshop, 2003.
22. B. Mobasher, R. Cooley, and J. Srivastava, Creating adaptive Web sites through usage-based clustering of URLs, in: IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), 1999.
23. Bettina Berendt, Web usage mining, site semantics, and the support of navigation, in: Proceedings of the Workshop WEBKDD 2000 - Web Mining for E-Commerce - Challenges and Opportunities, 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.
