Categories Of Unstructured Data Processing And Their Enhancement



Published by: ijcsis on Nov 02, 2010
Copyright: Attribution Non-commercial







Prof. (Dr.) Vinodani Katiyar
Sagar Institute of Technology and Management, Barabanki, U.P., India (drvinodini@gmail.com)
Hemant Kumar Singh
Azad Institute of Engineering & Technology, Lucknow, U.P., India (hemantbib@gmail.com)
Web mining is an area of data mining which deals with the extraction of interesting knowledge from the World Wide Web. The central goal of the paper is to review past and current work and provide an update on each of the three types of web mining, i.e. web content mining, web structure mining, and web usage mining, and to outline key future research directions.
Keywords: Web mining; web content mining; web usage mining; web structure mining
The amount of data kept in computer files and databases is growing at a phenomenal rate. At the same time, users of these data expect more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts but wants detailed information about customers' past purchases as well as predictions of future purchases. Simple Structured Query Language (SQL) queries are not adequate to support these increased demands for information. Data mining steps in to meet these needs. Data mining is defined as finding hidden information in a database; alternatively, it has been called exploratory data analysis, data-driven discovery, and deductive learning [7].
In the data mining community there are three types of mining: data mining, web mining, and text mining. There are many challenging problems [1] in data/web/text mining research. Data mining mainly deals with structured data organized in a database, while text mining mainly handles unstructured data/text. Web mining lies in between and copes with semi-structured and/or unstructured data. Web mining calls for creative use of data mining and/or text mining techniques as well as its own distinctive approaches. Mining web data is one of the most challenging tasks for data mining and data management scholars because the data available on the web is huge, heterogeneous, and less structured, and one can easily be overwhelmed by it [2].
According to Oren Etzioni [6], web mining is the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services. Web mining research can be classified into three categories: web content mining (WCM), web structure mining (WSM), and web usage mining (WUM) [3]. Web content mining refers to the discovery of useful information from web contents, including text, image, audio, video, etc. Web structure mining tries to discover the model underlying the link structures of the web. The model is based on the topology of hyperlinks, with or without a description of the links. This model can be used to categorize web pages and is useful for generating information such as the similarity and relationship between different websites.
Web usage mining refers to the discovery of user access patterns from web servers. Web usage data include data from web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, or any other data resulting from interaction.
Minos N. Garofalakis, Rajeev Rastogi, et al. [4] present a survey of web mining research [1999] and observe that today's search tools are plagued by the following four problems: (1) the abundance problem, that is, the phenomenon of hundreds of irrelevant documents being returned in response to a search query; (2) limited coverage of the web; (3) a limited query interface that is based on syntactic keyword-oriented search; and (4) limited customization to individual users. They also list research issues that still remain to be addressed in the area of web mining.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, October 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500
Bin Wang and Zhijing Liu [5] present a survey [2003] of web mining research. With the explosive growth of information sources available on the World Wide Web, it has become more and more necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. In addition, with the transformation of the web into the primary tool for electronic commerce, it is essential for organizations and companies that have invested millions in Internet and intranet technologies to track and analyze user access patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular web localities.
The purpose of this paper is to review past and current work and provide an update on each of the three types of web mining, i.e. web content mining, web structure mining, and web usage mining, and to outline key future research directions.
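Of the three categories, web usage mining is the one most directly tied to a concrete data source: the web server access logs mentioned earlier. Before access patterns can be mined, log entries are usually grouped into user sessions. The following is a minimal sketch of that preprocessing step, not a method taken from the surveyed papers. It assumes log lines in Common Log Format, identifies a user by host address alone, and uses a 30-minute inactivity timeout; the sample log lines, the helper names `parse_line` and `sessionize`, and the timeout value are all illustrative assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (host, timestamp, path) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("path")


def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group requests per host into sessions split by gaps longer than `timeout`."""
    hits_by_host = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            host, timestamp, path = record
            hits_by_host[host].append((timestamp, path))

    sessions = []
    for host, hits in hits_by_host.items():
        hits.sort()  # chronological order within one host
        current = [hits[0][1]]
        for (previous_time, _), (timestamp, path) in zip(hits, hits[1:]):
            if timestamp - previous_time > timeout:
                sessions.append((host, current))  # gap too long: close the session
                current = []
            current.append(path)
        sessions.append((host, current))
    return sessions


# Hypothetical log lines, invented for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [02/Nov/2010:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [02/Nov/2010:10:05:00 +0000] "GET /papers.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [02/Nov/2010:10:06:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [02/Nov/2010:11:30:00 +0000] "GET /contact.html HTTP/1.0" 200 512',
]

if __name__ == "__main__":
    for host, pages in sessionize(SAMPLE_LOG):
        print(host, pages)
```

Identifying users by host address is a simplification; real logs often need cookies or registration data, both of which the text lists as further sources of usage data.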
Both Etzioni [6] and Kosala and Blockeel [3] decompose web mining into four subtasks: (a) resource finding; (b) information selection and preprocessing; (c) generalization; and (d) analysis. Qingyu Zhang and Richard S. Segall [2] divided the web mining process into the following five subtasks:
(1) Resource finding and retrieving;
(2) Information selection and preprocessing;
(3) Patterns analysis and recognition;
(4) Validation and interpretation;
(5) Visualization.
The literature in this paper is classified into the three types of web mining: web content mining, web usage mining, and web structure mining. We put the literature into five sections: (2.1) literature review for web content mining; (2.2) literature review for web usage mining; (2.3) literature review for web structure mining; (2.4) literature review for web mining survey; and (2.5) literature review for semantic web.
2.1 Web Content Mining-
Margaret H. Dunham [7] stated that web content mining can be thought of as extending the work performed by basic search engines. Web content mining analyzes the content of web resources. Recent advances in multimedia data mining promise to widen access also to the image, sound, video, etc. content of web resources. The primary web resources that are mined in web content mining are individual pages. Information retrieval is one of the research areas that provides a range of popular and effective, mostly statistical methods for web content mining. They can be used to group, categorize, analyze, and retrieve documents. Content mining methods can also be used for ontology learning, mapping and merging ontologies, and instance learning [8].
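One common example of the statistical information retrieval methods mentioned above is TF-IDF weighting combined with cosine similarity, which lets documents be compared, grouped, and retrieved by their term distributions. The following is a minimal sketch and is not drawn from the cited works: the whitespace tokenizer, the IDF formula log(N/df), the function names, and the three toy documents are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, invented for illustration only.
DOCS = [
    "web content mining extracts information from web pages",
    "web structure mining analyzes hyperlinks between web pages",
    "image retrieval uses low level image features",
]

if __name__ == "__main__":
    vectors = tfidf_vectors(DOCS)
    print("doc0 vs doc1:", cosine(vectors[0], vectors[1]))
    print("doc0 vs doc2:", cosine(vectors[0], vectors[2]))
```

In this sketch the first two documents share terms and so receive a positive similarity, while the first and third share none and score zero; a clustering or retrieval step would build on such pairwise scores.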
To reduce the gap between the low-level image features used to index images and the high-level semantic contents of images in content-based image retrieval (CBIR) systems or search engines, Zhang et al. [9] suggest applying relevance feedback to refine the query or similarity measures in the image search process. They present a framework of relevance feedback and semantic learning in which low-level features and keyword explanation are integrated in image retrieval and in feedback processes to improve retrieval performance. They developed a prototype system that performs better than traditional approaches.
The dynamic nature and size of the Internet can make it difficult to find relevant information. Most users express their information need via short queries to search engines, and they often have to sift manually through the search results based on the relevance ranking set by the search engines, which makes the process of relevance judgment time-consuming. Chen et al. [10] describe a novel representation technique which makes use of the web structure together with summarization techniques to better represent knowledge in actual web documents. They named the proposed technique Semantic Virtual Document (SVD). The proposed SVD can be used together with a suitable clustering algorithm to achieve an automatic content-based categorization of similar web documents. Together with a tree-like graphical user interface for post-retrieval document browsing, this enhances the relevance judgment process for Internet users. They also introduce a cluster-biased automatic query expansion technique to interpret short queries accurately. They present a prototype of Intelligent Search and Review of Cluster Hierarchy (iSEARCH) for web content mining.
Typically, search engines have low precision in response to a query, retrieving many useless web pages and missing other important ones. Ricardo Campos et al. [11] study the problem of hierarchical clustering of web pages and propose an architecture for a meta-search engine called WISE that automatically builds clusters of related web pages, each embodying one meaning of the query. These clusters are then hierarchically organized and labeled with a phrase representing the key concept of the cluster and the corresponding web documents.
Mining search engine query logs is a new method for evaluating web site link structure and information architecture. Mehdi Hosseini and Hassan Abolhassani [12] propose a new query-URL co-clustering for a web site that is useful for evaluating information architecture and link structure. First, all queries and clicked URLs corresponding to a particular web site are collected from a query log as a bipartite graph, with one side for queries and the other for URLs. Then a new content-free clustering is applied to cluster queries and URLs concurrently. Afterwards, based on information entropy, the clusters of URLs and queries are used for evaluating link structure and information architecture, respectively.
Data available on the web is classified as structured, semi-structured, and unstructured data. Kshitija Pol, Nita Patil, et al. [13] presented a survey on web content mining that describes various problems of web content mining and techniques for mining web pages, including structured and semi-structured data.
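The first step of the query-URL approach described above, collecting queries and clicked URLs as a bipartite graph, can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it does not implement the co-clustering of [12], and the Shannon entropy of a query's click distribution is used here merely as one simple entropy-style measure, not as the evaluation measure of that paper. The click log, the URL paths, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def build_bipartite(click_log):
    """Build both sides of a query-URL bipartite graph with click counts as edge weights."""
    query_to_urls = defaultdict(lambda: defaultdict(int))
    url_to_queries = defaultdict(lambda: defaultdict(int))
    for query, url in click_log:
        query_to_urls[query][url] += 1
        url_to_queries[url][query] += 1
    return query_to_urls, url_to_queries


def click_entropy(counts):
    """Shannon entropy (in bits) of a click-count distribution.

    Illustrative measure only: 0 means all clicks go to one neighbour,
    larger values mean clicks are spread over many neighbours.
    """
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical (query, clicked URL) pairs, invented for illustration only.
CLICK_LOG = [
    ("web mining survey", "/papers/survey.html"),
    ("web mining survey", "/papers/survey.html"),
    ("data mining", "/papers/survey.html"),
    ("data mining", "/courses/dm.html"),
]

if __name__ == "__main__":
    query_to_urls, url_to_queries = build_bipartite(CLICK_LOG)
    for query, urls in query_to_urls.items():
        print(query, dict(urls), click_entropy(urls))
```

A query whose clicks all land on one URL has entropy 0, while a query whose clicks are split evenly over two URLs has entropy 1 bit.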
2.2 Web Structure Mining-
Web information retrieval tools make use of only the text on pages, ignoring valuable information contained in links. Web structure mining aims to generate a structural summary about web sites and web pages. The focus of structure mining is on link information [14]. Through an original algorithm for hyperlink analysis called HITS (Hypertext Induced Topic Search), Kleinberg [15] introduced the concepts of hubs (pages that refer to many pages) and authorities (pages that are referred to by many pages) [16].
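The hub and authority scores introduced above reinforce each other: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The following is a minimal sketch of that iteration, assuming a small hypothetical link graph, a fixed number of iterations, and L2 normalization after each update; the graph, the page names, and the function names are illustrative and not taken from the cited papers.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: value / norm for node, value in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    `links` maps each page to the pages it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    out_links = {n: set(links.get(n, ())) for n in nodes}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages that link to n.
        auth = _normalize({
            n: sum(hub[src] for src in nodes if n in out_links[src])
            for n in nodes
        })
        # Hub update: sum the authority scores of all pages that n links to.
        hub = _normalize({
            n: sum(auth[target] for target in out_links[n])
            for n in nodes
        })
    return hub, auth


# Hypothetical link graph, invented for illustration only.
LINKS = {
    "portal": ["paper_a", "paper_b", "paper_c"],
    "blog": ["paper_a", "paper_b"],
    "paper_a": [],
    "paper_b": [],
    "paper_c": ["paper_a"],
}

if __name__ == "__main__":
    hub, auth = hits(LINKS)
    print("best hub:", max(hub, key=hub.get))
    print("best authority:", max(auth, key=auth.get))
```

In this toy graph "portal" links to the most pages and so comes out as the strongest hub, while "paper_a" is linked from the most pages and comes out as the strongest authority.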
Apart from search ranking, hyperlinks are also useful for finding web communities. A web community is a collection of web pages that are focused on a particular topic or theme. Most community mining approaches are based on the assumption that each member of a community has more hyperlinks within than outside its community. In this context, many graph clustering algorithms may be used for mining the community structure of a graph, as they adopt the same assumption, i.e. they assume that a cluster is a vertex subset such that, for each of its vertices, the number of links connecting the vertex to its cluster is higher than the number of links connecting it outside its cluster [17].
Furnkranz [18] described how the web may be viewed as a (directed) graph whose nodes are the documents and whose edges are the hyperlinks between them, and exploited the graph structure of the World Wide Web for improved retrieval performance and classification accuracy. Many search engines use graph properties in ranking their query results.
The continuous growth in the size and use of the Internet is creating difficulties in the search for information. To help users search for information and organize information layout, Smith and Ng [19] suggest using a self-organizing map (SOM) to mine web data and provide a visual tool to assist user navigation. Based on the users' navigation behavior, they develop LOGSOM, a system that utilizes SOM to organize web pages into a two-dimensional map. The map provides a meaningful navigation tool and serves as a visual tool to better understand the structure of the web site and the navigation behaviors of web users.
As the size and complexity of websites expand dramatically, it has become increasingly challenging to design websites on which web surfers can easily find the information they seek. Fang and Sheng [20] address the design of the portal page of a web site. They try to maximize the efficiency, effectiveness, and usage of a web site's portal page by selecting a limited number of hyperlinks from a large set for the inclusion in a