
International Journal of Computational Intelligence and Information Security, November 2011, Vol. 2, No. 11

Web Usage Pattern Discovery Using Various Contextual Factors


Ms. Ravita Mishra
Lecturer, Computer Technology Department, VJTI, Matunga (E), Mumbai, India
E-mail: m.ravita@gmail.com

Mr. Jayant Gadge
Head of Computer Engineering Department, Thadomal Shahani Engineering College, Bandra, Mumbai, India
E-mail: jayantrg@hotmail.com

Abstract
This paper presents a study in the web usage mining area and a new pattern discovery technique based on the analysis of contextual factors. Web usage mining is a main research area in web mining, focused on learning about web users and their interactions with web sites. The motive of mining is to find users' access patterns automatically and quickly from the vast web log data, such as frequent access paths, frequent access page groups, and user clusters. Through web usage mining, the server log, registration information, and other related information left by user access can be mined for the user access mode, which provides a foundation for organizational decision making. It is important to analyze users' web usage logs for developing personalized web services. However, there are several inherent difficulties in analyzing usage logs, because the kinds of available logs are very limited and the logs show uncertain patterns due to the influence of various contextual factors. Therefore, on the speculation that it is necessary to find which contextual factors exert influence on the usage logs prior to designing personalized services, several experiments were conducted in series, not only in situations of performing designed tasks during short time periods but also in users' natural web environments over a period of several days. The results of the experiments identify the influential contextual factors: interest levels, credibility levels, page types, task types, and languages all shape usage logs in a natural web environment. Moreover, some historical and experiential patterns that could not be observed in short-time analysis were discovered in the results of the long-time experiments. The proposed techniques are used to find the credibility of web pages. These findings will be useful for other researchers, practitioners, and especially for developers of adaptive personalization services and of sites made up of dynamically generated pages. Using this approach, one can find other contextual factors easily and improve the performance of the system. The findings are useful for adaptive services as well as for site modification.

Keywords: Web usage mining, Web page classification, Task classification, Contextual factors, Credibility.

1. Introduction
Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. Web usage mining supports web site design, personalization services, and other business decision making. It is important to analyze users' web usage logs for developing any personalized web service. The available web usage logs present several difficulties, and the logs show uncertain patterns. Pattern discovery is a challenging task in web usage mining, but with contextual parameter analysis pattern discovery becomes simpler and can be used for online help, suggestions to new site developers, site improvement, and modification of the structure of a current web site. With the help of long-term experiments, one can easily find which contextual factors affect the web usage log and why the log shows uncertain patterns. The proposed pattern discovery technique uses a web page classification algorithm to classify the user's current task and the time spent on each web page, and clustering algorithms to cluster users' IP addresses and identify user types. The result of the experiments is a set of contextual factors such as user interest, complexity, difficulty, and task type [1]. Web usage mining supports web site design, personalization services, and other business decision making [2]. In addition, pattern discovery can also distinguish which contextual factors are useful for further analysis, calculate the credibility of web pages, and find other interesting factors, e.g. written language, complexity, and difficulty. This paper presents an algorithm for preprocessing a web log file and a new approach for clustering user sessions and identifying users based on classification of IP addresses. For the clustering of user sessions, a unique identification number is created for each user. This number is used for subsequent access time calculation and for classifying the user's pages and tasks; it is based on the connectivity between referrer and URI pages, the access time spent on each web page, and the classification of web pages. The results show that this approach can improve the quality of clustering for user session creation and contextual factor calculation in web usage mining systems [4]. These results can be used for predicting users' next requests in large web sites, for site improvement, for modification of existing sites, and as guidelines for building new sites. The rest of this paper is organized as follows: Section 2 reviews research advances in pattern discovery techniques. Section 3 describes the design considerations and implementation of the pattern discovery technique. Results are shown in Section 4. Finally, Section 5 summarizes the paper and introduces future work.

2. Related work
Recently, several web usage mining algorithms have been proposed to discover users' navigational patterns. The following reviews some of the most significant pattern discovery techniques and algorithms in web usage mining that can be compared with the proposed system. Pattern discovery using classification and clustering is one of the earliest methods used in web usage mining. Pattern discovery using the self-organizing map (SOM) was one of the earliest clustering approaches in web usage mining, used by Etminani, Rezaeian Delui et al. [7]. The Kohonen self-organizing map is a clustering technique for discovering users' navigational patterns; it embodies the clustering concept by grouping similar data together. It reduces the data dimension and displays similarities among the data. Once the data have been entered into the system, the network of artificial neurons is trained by providing information about the input; the unit whose weight vector is closest to the current object becomes the winning or active unit. SOM does not require a target vector and classifies the training data without any sequential pattern [7]. SOM provides data visualization techniques that help to understand high-dimensional data by reducing the dimension of the data to a map. It is simple and does not involve complex mathematical relations. It does not depend on pattern length and is able to extract patterns of any length. Different patterns can be extracted depending on the occurrence of a page group in a cluster. It does not need to know the number of clusters before running the clustering phase. The results could help web site owners attract new customers, retain current customers, improve cross marketing/sales and the effectiveness of promotional campaigns, track leaving customers, etc. The pattern discovery presented in this paper is based on contextual factor analysis. The core elements of this system are the web page classification and task classification algorithms and contextual parameter evaluation [1]. The web page classification algorithm takes users' sessions and access times as input and classifies the web pages into two categories: content pages and index pages. Task classification is another important technique applied in pattern discovery. A user's web task is classified into two main groups: careful and casual. In careful searching, the user wants to find precise and credible information; in casual searching, the credibility and accuracy of the search results are not important. The task classification algorithm classifies the user's current task into these two categories. Frequently visited URLs are considered as indicator URLs [2]. If a URL contains more credible links and a paid user, the assigned task is a careful task; if a URL contains no credible links, the assigned task is casual.

3. Design Consideration and Implementation

Figure 1: Steps for the Pattern Discovery Technique (log file → data pre-processing → page classification → task classification → contextual parameter analysis → knowledge base)

Step 1: Data sets consisting of web log records for 5446 users are collected from the DePaul University website. The web log is an unprocessed text file recorded by the IIS web server. The web log consists of 17 attributes, with the data values in the form of records. Fields: date / time / c-ip / cs-username / s-sitename / s-computername / s-ip / s-port / cs-method / cs-uri-stem / cs-uri-query / sc-status / time-taken / cs-version / cs-host / cs(User-Agent) / cs(Referer).
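As an illustration, such a log can be read into per-request records as in the minimal Python sketch below. The field list mirrors the #Fields directive above; the function name, file handling, and the assumption that records are strictly space-delimited are ours, not the paper's.

```python
# Hedged sketch: parse an IIS W3C extended log into dicts keyed by field name.
FIELDS = ["date", "time", "c-ip", "cs-username", "s-sitename",
          "s-computername", "s-ip", "s-port", "cs-method",
          "cs-uri-stem", "cs-uri-query", "sc-status", "time-taken",
          "cs-version", "cs-host", "cs(User-Agent)", "cs(Referer)"]

def parse_log(path):
    """Yield one dict per request record, skipping directive lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip #Fields/#Date directives and blank lines
            values = line.split()  # assumes strictly space-delimited records
            if len(values) == len(FIELDS):
                yield dict(zip(FIELDS, values))
```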

Step 2: Generally, several preprocessing tasks need to be done before performing web mining algorithms on the web server logs. Data preprocessing in a web usage mining model (web log preprocessing) aims to reformat the original web logs so as to identify users' access sessions. The web server usually registers all users' access activities on the website as web server logs. Due to different server setting parameters, there are many types of web logs, but typically the log files share the same basic information, such as client IP address, request time, requested URL, HTTP status code, referrer, etc. [8]. Data preprocessing is done using the following steps.

2.1. Data Cleansing: Irrelevant records are eliminated during data cleansing. Since the target of web usage mining is to obtain traversal patterns, the following two kinds of records are unnecessary and should be removed: records of graphics, video, and format information, and records with failed HTTP status codes. A minimal filter implementing these two rules is sketched below.
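In this sketch the suffix list is an illustrative assumption, since the paper does not enumerate the exact file types it removes.

```python
# Suffixes treated as graphics/format resources (assumed list), plus the
# rule that records with failed (4xx/5xx) HTTP status codes are dropped.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def clean(records):
    for r in records:
        if r["cs-uri-stem"].lower().endswith(IRRELEVANT_SUFFIXES):
            continue              # graphics / format information
        if int(r["sc-status"]) >= 400:
            continue              # failed HTTP status codes
        yield r
```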

2.2. User and Session Identification: The task of user and session identification is to find the different user sessions in the original web access log. A referrer-based method is used for identifying sessions. Different IP addresses distinguish different users. If the IP addresses are the same, different browsers or operating systems indicate different users; if the IP address, browser, and operating system are all the same, the referrer field is used to separate sessions. A sessionization sketch follows.
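A possible sessionization routine consistent with this description is sketched below; the 30-minute timeout is a common heuristic assumed here, not a value stated in the paper, and each session receives the unique identification number mentioned in the introduction.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed session timeout, not from the paper

def sessionize(records):
    """Group cleaned records into sessions keyed by (IP, user agent)."""
    sessions = {}   # session id -> list of records
    current = {}    # (ip, agent) -> (last timestamp, open session id)
    next_sid = 0
    for r in sorted(records, key=lambda r: (r["date"], r["time"])):
        user = (r["c-ip"], r["cs(User-Agent)"])
        t = datetime.strptime(r["date"] + " " + r["time"], "%Y-%m-%d %H:%M:%S")
        if user not in current or t - current[user][0] > TIMEOUT:
            next_sid += 1              # new unique session identifier
            sessions[next_sid] = []
            current[user] = (t, next_sid)
        else:
            current[user] = (t, current[user][1])
        sessions[current[user][1]].append(r)
    return sessions
```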

2.3. Content Retrieval: Content retrieval extracts the requested content from the user's query request, i.e. the cs-uri-query field. E.g., for the request http/1.1 www.cs.depaul.edu /courses/syllabus.asp?course=323-21-603&q=3&y=2002&id=671, the retrieved content is /courses/syllabus.asp, which helps in fast searching of unvisited pages by other users with similar interests.
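As a small illustration (assuming records parsed as above), the query portion can be stripped from a request like so:

```python
from urllib.parse import urlsplit

def retrieve_content(record):
    """Return the requested resource path with any query string removed."""
    # cs-uri-stem normally excludes the query already in IIS logs; urlsplit
    # also covers the case where a full URL with ?query= was logged.
    return urlsplit(record["cs-uri-stem"]).path

# e.g. retrieve_content({"cs-uri-stem": "/courses/syllabus.asp?course=323"})
# -> "/courses/syllabus.asp"
```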

2.4. Path Completion: Path completion is used to acquire the complete user access path. The incomplete access path of each user session is recognized based on user session identification. If at the start of a user session both the Referrer and the URI have data values, the Referrer value is deleted by replacing it with "-". Web log preprocessing [8] helps in removing unwanted click-streams from the log file and also reduces the size of the original file by 40-50%.
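One literal reading of the rule above can be sketched as follows; whether this captures the authors' full path-completion procedure is uncertain.

```python
def complete_paths(session):
    """Blank out the referrer of a session's first record (section 2.4)."""
    if session and session[0]["cs(Referer)"] not in ("-", ""):
        # The first referrer points outside the session, so it is not a
        # usable step in the reconstructed access path.
        session[0]["cs(Referer)"] = "-"
    return session
```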

Step 3: Access time calculation: In this step, the access statistics for the pages are collected from the sessions. The sessions obtained in server log preprocessing are scanned and the access statistics are computed. The following statistics are computed for each page: the number of sessions in which the page was accessed, the total time spent on the page, and the number of times the page is the last requested page of a session.
Algorithm steps
Input: sessionized log file
Output: time spent on each page
The sessions created in the previous step are used to calculate the access time, one session at a time. First find the timestamp of the last page of the session (here ni is the last visited page), then find the timestamp of the first page, and subtract the first page's timestamp from the last page's to obtain the time spent.
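A sketch of a per-page viewing-time computation along these lines is given below. Treating the session's final page via the time-taken field is our assumption, since the last page has no successor timestamp; the paper instead subtracts the first-page timestamp from the last-page timestamp for the session total.

```python
from datetime import datetime

def access_times(session):
    """Per-page viewing time within one session (step 3)."""
    stamps = [datetime.strptime(r["date"] + " " + r["time"],
                                "%Y-%m-%d %H:%M:%S") for r in session]
    times = {}
    for i, r in enumerate(session):
        if i + 1 < len(session):
            # viewing time = next request's timestamp minus this one's
            spent = (stamps[i + 1] - stamps[i]).total_seconds()
        else:
            # last page: fall back to the server's time-taken field
            # (milliseconds in IIS logs) -- our assumption
            spent = int(r["time-taken"]) / 1000.0
        times[r["cs-uri-stem"]] = times.get(r["cs-uri-stem"], 0.0) + spent
    return times
```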


Step 4: Web page classification algorithm: The web page classification algorithm classifies the pages into two categories, index pages and content pages [3].
Algorithm steps
Input: 1. Cleaned, filtered, and sessionized log file. 2. Count threshold and link threshold.
Output: Classified pages (index page or content page)
Two thresholds are set: a count threshold and a link threshold. Set λ = 1 / (mean reference length of all pages) and the cutoff t = -ln(1 - γ) / λ, where γ is the assumed proportion of index-page references. For each page P on the web site: if P's file type is not HTML, or P's end-of-session count > count_threshold, mark P as a content page; else if P's number of links > link_threshold, mark P as an index page; else if P's reference length < t, mark P as an index page; else mark P as a content page.
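This resembles the reference-length method of classifying pages; a hedged Python rendering follows, where gamma and the two thresholds are illustrative tuning values and the per-page statistics are assumed to have been aggregated beforehand.

```python
import math

def classify_pages(pages, gamma=0.5, count_threshold=10, link_threshold=20):
    """Reference-length page classification (step 4).

    `pages` maps URL -> dict with keys: mean_ref_length (seconds),
    is_html (bool), end_of_session_count, num_links. All three tuning
    parameters here are illustrative assumptions, not the paper's values.
    """
    mean_ref = sum(p["mean_ref_length"] for p in pages.values()) / len(pages)
    lam = 1.0 / mean_ref               # rate of assumed exponential distribution
    t = -math.log(1.0 - gamma) / lam   # cutoff reference length

    labels = {}
    for url, p in pages.items():
        if not p["is_html"] or p["end_of_session_count"] > count_threshold:
            labels[url] = "content"
        elif p["num_links"] > link_threshold:
            labels[url] = "index"
        elif p["mean_ref_length"] < t:
            labels[url] = "index"
        else:
            labels[url] = "content"
    return labels
```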

Step 5: Task classification algorithm: A user's main task is classified into two main groups. The first is careful searching, in which the user wants to find precise and credible information; the second is casual searching, in which the credibility and accuracy of results are not important [2].
Algorithm steps
Input: 1. Cleaned, filtered, and sessionized log file. 2. Frequently visited URLs.
Output: Task type (careful or casual user)
The user's task is identified from the top-level URL, using frequently visited URLs (the cs-uri-stem field) as indicators for task type classification. A web task is assumed to persist for some period of time (5 minutes). Sort all the frequently visited URLs and count how many times each was visited. If a frequently visited URL was visited five or more times, the user's task is set to careful; otherwise the task is casual. If a frequently visited URL has a query (cs-uri-query) and that query is repeated unchanged, the user's task is set to casual; otherwise it is careful. The URL count in casual searching was higher than in careful searching.
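A possible rendering of these rules is sketched below; this is one reading of the informal steps, and the five-minute persistence window is not modeled.

```python
from collections import Counter

def classify_task(session, min_visits=5):
    """Label a session 'careful' or 'casual' from its frequent URLs (step 5)."""
    visits = Counter(r["cs-uri-stem"] for r in session)
    frequent = {u for u, n in visits.items() if n >= min_visits}
    if not frequent:
        return "casual"          # no URL visited often enough
    # Repeating the identical query against a frequent URL suggests casual
    # browsing; distinct queries suggest careful, credibility-driven search.
    queries = Counter((r["cs-uri-stem"], r["cs-uri-query"])
                      for r in session if r["cs-uri-stem"] in frequent)
    repeated_same_query = any(n > 1 for n in queries.values())
    return "casual" if repeated_same_query else "careful"
```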


Step 6: Discovery of contextual factors: In the pattern discovery stage, various contextual factors are applied and the extracted patterns are analyzed. In this paper, five contextual factors are considered for implementation purposes; the results of the implementation are given below.

6.1 Difficulty: The difficulty contextual factor is calculated by observing the status field of the log file (sc-status), which indicates whether the page executed properly or not. If there is some problem, the server returns a 302 code, which indicates the difficulty factor. If the page opened properly without any problem, the server code 200 is returned.

6.2 Search interest: The search interest contextual parameter is calculated by observing the cs-method, s-port, time-taken, and sc-status fields. The sc-status field gives the status of the page, i.e. whether the page loaded properly or not. The cs-method field gives the GET or PUT method; the GET method means a query was submitted and a result returned. In the given log file the predominant method is GET. The time-taken field reports how much time was needed to serve the current page. Based on the observation of these four fields, the search interest contextual factor is calculated; for the log file used, the search interest percentage is 14%.

6.3 Credibility: Credibility is one of the most important factors. Credibility, or believability, indicates whether the accessed page, or the site, is credible or not. Credibility is calculated by observing the cs-method, s-port, and sc-status fields of the log file. In the log file used, all these fields are valid, which gives an accurate result; based on this analysis, the credibility percentage is calculated. The credibility factor is mainly applicable in designing new web sites and sometimes in e-business to implement business value. The total credibility count is 3783 and the percentage is 74%. The difficulty of assessing web site credibility manifests itself in several problematic phenomena [5]. Credibility can be categorized into four types. Presumed credibility is based on general assumptions in the user's mind (e.g., the trustworthiness of domain identifiers like .gov). Surface credibility is derived from inspection of a site, is often based on a user's first impression of the site, and is often influenced by how professional the site's design appears. Earned credibility refers to trust established over time and is often influenced by a site's ease of use and its ability to consistently provide trustworthy information. Reputed credibility refers to third-party opinions of the site, such as any certificates or awards the site has won.
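For illustration, the three percentage-style factors of sections 6.1-6.3 might be computed from parsed records as below; the exact weighting of the fields is simplified relative to the paper, so this is a sketch rather than the authors' formula.

```python
def factor_percentages(records):
    """Illustrative computation of the percentage factors in 6.1-6.3."""
    records = list(records)
    n = len(records)
    # 6.1 difficulty: share of requests that returned a 302 code
    difficulty = sum(r["sc-status"] == "302" for r in records) / n * 100
    # 6.2 search interest: successful GET requests with non-zero serve time
    # (the paper also weighs s-port; that detail is omitted here)
    interest = sum(r["cs-method"].lower() == "get"
                   and r["sc-status"] == "200"
                   and int(r["time-taken"]) > 0 for r in records) / n * 100
    # 6.3 credibility: share of requests that completed with status 200
    credibility = sum(r["sc-status"] == "200" for r in records) / n * 100
    return {"difficulty": difficulty, "search_interest": interest,
            "credibility": credibility}
```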

Page Selection: Hand-labelling web pages for credibility is a time-consuming process. A credible web page is one whose information one can accept as the truth without needing to look elsewhere [5]. If one can accept information on a page as true at face value, then the page is credible; if one needs to go elsewhere to check the validity of the information on the page, then it is less credible. Page selection is done in two ways: on-page features and off-page features [5].
On-Page Feature Selection: On-page features are present on a page but are difficult or time-consuming for a person to quantify or attend to. Checking spelling, advertising, and domain type is applied in on-page feature selection.
Off-Page Feature Selection: Off-page features require the user to leave the target page and look elsewhere for supplementary data. Awards, PageRank, and sharing techniques are applied in off-page feature selection.

6.4 Browser dependence: Browser dependence is another useful contextual factor. It is obtained by observing the cs(Referer) and cs(User-Agent) fields of the log file, counting how many users used the Mozilla browser versus the Lynx browser. In the given log file, most users used the Mozilla browser and the fewest used other browsers. The Mozilla browser therefore appears to be the more user-friendly and faster browser, performing actions faster than the Lynx browser. The total count for the Mozilla browser is 4342, for the Lynx browser 6, and for Windows-based browsers 546.

6.5 Page frequency: Page frequency is another contextual parameter, similar to document frequency. In the given log file the observed field is cs-uri-stem. In this field we observe whether a page contains any link or not; if it does, the page is executed n times, and we also count how many times each page is executed. The total page count is 152, and these links are executed repeatedly. A sample page frequency result: page 124, /programs/2002/gradse2002.asp, was visited 7 times.
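Both factors reduce to counting, as in this sketch; matching browser families by user-agent prefix is an approximation we assume here.

```python
from collections import Counter

def browser_and_page_counts(records):
    """Tally user agents (6.4) and per-page request frequencies (6.5)."""
    browsers = Counter()
    page_freq = Counter()
    for r in records:
        agent = r["cs(User-Agent)"].lower()
        if agent.startswith("mozilla"):
            browsers["mozilla"] += 1
        elif agent.startswith("lynx"):
            browsers["lynx"] += 1
        else:
            browsers["other"] += 1
        page_freq[r["cs-uri-stem"]] += 1
    return browsers, page_freq

# e.g. page_freq.most_common(10) lists the ten most frequently visited pages.
```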

4. Results and Discussion


Step-wise results are shown below for 5048 records.
Step 1: Collection of web logs, which are in raw or unprocessed form. The 17 attributes are shown below:
#Fields: date time c-ip cs-username s-sitename s-computername s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status time-taken cs-version cs-host cs(User-Agent) cs(Referer)

2002-04-01 00:00:10 1cust62.tnt40.chi5.da.uu.net - w3svc3 bach bach.cs.depaul.edu 80 get /courses/syllabus.asp course=323-21-603&q=3&y=2002&id=671 200 156 http/1.1 www.cs.depaul.edu mozilla/4.0+(compatible;+msie+5.5;+windows+98;+win+9x+4.90;+msn+6.1;+msnbmsft;+msnmenus;+msnc21) http://www.cs.depaul.edu/courses/syllabilist.asp

2002-04-01 00:00:26 ac9781e5.ipt.aol.com - w3svc3 bach bach.cs.depaul.edu 80 get /advising/default.asp - 200 16 http/1.1 www.cs.depaul.edu mozilla/4.0+(compatible;+msie+5.0;+msnia;+windows+98;+digext) http://www.cs.depaul.edu/news/news.asp?theid=573

2002-04-01 00:00:29 alpha1.csd.uwm.edu - w3svc3 bach bach.cs.depaul.edu 80 get /default.asp - 302 0 http/1.1 www.cs.depaul.edu mozilla/4.0+(compatible;+msie+6.0;+msn+2.5;+windows+98;+luc+user)


Step 2: The unprocessed log file is converted into an Oracle/MS Access database. A snapshot of the converted database is shown in Figure 2:

Figure 2: Text file converted into database

Step 3, 4: Preprocessing is done on 5048 web log records from the CTI dataset. 34 records are removed from the dataset, and 270 sessions are created from the 5048 records. The access time is calculated for each record as well as for the whole database. The total time spent on web pages is 4 hours, 3 minutes, and 3 seconds.

Step 5: The page classification technique is applied, and index and content pages are classified: 44 index pages and 4988 content pages result. The task classification technique classifies tasks into casual and careful users, and these results are used for contextual factor analysis. In total, 2193 are careful users and 2839 are casual users.

Step 6: Experimental results for the different contextual factors and their graphical results are shown below.

Figure 3: Result of Different Contextual Factors and Viewing Time

X-axis: Mean value of viewing time (total time 6 hours)
Y-axis: Contextual factor value in percentage (%)
Here WBP and LBP stand for: WBP, Window-based browser preference; LBP, Linux-based browser preference.

Table 1: Contextual Factors and their Percentages

Contextual Factor Name              Percentage (%)
Difficulty                          20%
Search Interest                     14%
Credibility                         76%
Window-based browser preference     99.89%
Linux-based browser preference      0.011%
Page Frequency                      36%

Graphical results of the contextual factors and their conclusions follow.

Difficulty

Figure 4: Graphical result of contextual factor difficulty

The graphical result for the contextual factor difficulty increases slightly in the first hour; after 65 minutes the factor increases very rapidly and reaches 35%. In the next hour the graph decreases rapidly and the difficulty falls to 24%. Thus the difficulty contextual factor increases for the first four hours and decreases for the next two hours. The total difficulty obtained from the log file analysis is approximately 20%.


Search Interest

Figure 5: Graphical result of contextual factor search interest

The search interest percentage starts at 35% and decreases to 20% in the next hour. In the following hour the percentage increases again, and over the last two hours it rises to 45%. Thus the searching percentage increases and most searches are successful. The total search interest for the accessed log file is 35%.

Credibility

Figure 6: Graphical result of contextual factor credibility

The credibility percentage starts at 65% and decreases in the next hour, but it increases in the third hour and continues to rise up to the sixth hour. Thus most pages are credible pages, and the credibility percentage is more than 75%.


Browser Preference

Figure 7: Graphical result of Browser Preferences Contextual Factor

For browser preference, the Mozilla browser has the maximum usage percentage: it starts at 90% and continues to increase over the six hours, exceeding 94%. The Linux-based (Lynx) browser starts at 10% and increases continuously, but over the last two hours it decreases and reaches 3%.

Page Frequency

Figure 8: Graphical result of Page Frequency Contextual Factor

The page frequency percentage starts at 25% and increases continuously over the six hours. The percentage of page executions rises rapidly to more than 35%. Users' page requests are executed rapidly without the percentage value dropping.

Application of discovered patterns: Applications of web usage mining fall into two main categories: those dedicated to the discovery of web usage patterns (behavior), and hence to customizing web sites, and those dedicated to the architectural topology (related to the site topology) and to improving web sites effectively. These two applications are discussed below.

Web site improvement: The logs are used to infer user interaction and hence provide implicit user ratings. Browsers can provide more sophisticated indicators such as bookmarking, printing, number of visits, saving, and more precise time calculation. Moreover, the analysis of web site usage may give important hints for site redesign, so as to enhance the accessibility and effectiveness of the site.

Performance improvement: Another important application of web usage mining is to improve web performance. Here too, intelligent system techniques are needed to identify, characterize, and understand user behavior. In fact, user navigation activity determines the network traffic and, as a consequence, influences web server performance. A workload characterization can also be used to create a synthetic workload in order to benchmark a site, while a monitoring agent checks the resource usage metrics during the test to search for system and network bottlenecks.

5. Conclusion and Future Research


The pattern discovery system using various contextual factors helps webmasters preprocess the web access log and also finds the problems users encounter when searching for and accessing the site. The system also captures the different types of pages, which is helpful for determining credibility or trustworthiness. The proposed system helps reduce the time users spend searching for pages on the web site, thus increasing the website's usability and providing better services to webmasters and users. This paper also makes a modest attempt to provide an up-to-date survey of the rapidly growing area of web usage mining and of how the various pattern discovery techniques help in developing business plans, especially in the area of e-business. The article has aimed at describing different pattern discovery techniques and addressing the different problems users face while accessing web pages. The WWW will keep growing, even if in a somewhat different form than we know it today. Therefore, there will always be a need to discover new methods and techniques to handle the amounts of data existing in this universal framework, which will help in maintaining trust between customers and traders.

6. References
[1] J. Choi, J. Seo, and G. Lee, "Analysis of Web Usage Patterns in Consideration of Various Contextual Factors," Proceedings of the 7th Workshop on Intelligent Techniques for Web Personalization & Recommender Systems at IJCAI'09, pp. 1-12, July 2009.
[2] Jinhyuk Choi and Geehyuk Lee, "New Techniques for Data Preprocessing Based on Usage Logs for Efficient Web User Profiling at Client Side," Proceedings of the International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, pp. 54-57, 2009.
[3] Yongjian Fu, Ming-Yi Shih, Mario Creado, and Chunhua Ju, "Reorganizing Web Sites Based on User Access Patterns," Proceedings of the 10th International Conference on Information and Knowledge Management, Atlanta, Georgia, USA, ACM, pp. 1-15, 2001.
[4] Peter I. Hofgesang, "Methodology for Preprocessing and Evaluating the Time Spent on Web Pages," Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 1-8, 2006.
[5] Julia Schwarz and Meredith Ringel Morris, "Augmenting Web Pages and Search Results to Support Credibility Assessment," Proceedings of the ACM Conference on Human Factors in Computing Systems, pp. 1-10, 2011.
[6] J. Choi, G. Lee, and Y. Um, "Analysis of Internet Users' Interests Based on Windows GUI Messages," Proceedings of the 12th International Conference on Human-Computer Interaction, Lecture Notes in Computer Science, vol. 4553, pp. 881-888, Springer Berlin/Heidelberg, 2007.
[7] Kobra Etminani, Amin Rezaeian Delui, Noorali Raeeji Yaneshari, et al., "Web Usage Mining: Discovery of the Users' Navigational Patterns Using SOM," pp. 244-248, 2009.
[8] Rajni Pamnani and Pramila Chavan, "Web Usage Mining: A Research Area in Web Mining," VJTI, University of Mumbai, pp. 1-5, 2010.
