Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Asst. Professor, Computer Engineering, MCOER, Nasik, University of Pune (MS), India
clusters, decision trees, decision rules, etc. Among them, association rules have proved particularly effective in discovering interesting relations in large amounts of data. The idea of mining association rules to provide summarized representations of XML documents has been the focus of many proposals, either by using languages (e.g., XQuery) and techniques developed in the XML context, or by implementing graph- or tree-based algorithms. This paper uses the technique of mining and storing Tree-Based Association Rules (TARs) as a means to represent intentional knowledge in XML format. TARs are extracted for two main purposes: to get a concise idea of both the structure and the content of an XML document, and to use them for intentional query answering, which allows the user to query the extracted TARs rather than the original document. The paper is organized as follows. Section 2 overviews related work on mining XML documents and the different algorithms for generating tree-based association rules. The following sections present the proposed system, which mines XML documents and gives intentional answers to queries. The paper then discusses the experimental results and draws conclusions.
Keywords: Extensible Markup Language (XML), approximate query answering, data mining, intentional information, Tree-Based Association Rules.
1. INTRODUCTION
Currently, XML is penetrating virtually all areas of internet application programming. With the continuous growth in XML data sources, the ability to extract knowledge from them for decision support becomes increasingly important and desirable. Compared with the success achieved in mining well-structured data such as relational and object-oriented databases, mining XML documents is still at a basic level, and it poses more challenges because of the undefined characteristics of XML in both structure and semantics. XML data have a more complex hierarchical structure than a database record. Unlike the fields of a database record, elements in XML data have a position that is fixed by their context, so XML data tend to be much larger than traditional data. Querying such documents is quite difficult, since users are not able to specify a probable structure in the query conditions and are very often confused by the large amount of information available. To meet these needs, traditional data mining technology has to be reworked for extracting knowledge from XML structures. The aim of XML mining is to integrate the emerging XML technology into data mining technology. This knowledge can be represented in many different ways, such as
2. RELATED WORK
Association rule mining was first introduced in Fast Algorithms for Mining Association Rules in Large Databases, by R. Agrawal and R. Srikant, in 1994 [2], and has since been applied to provide summarized representations of XML documents. In 2003, Wan and Dobbie [3] proposed a set of functions, written in XQuery, which implement the Apriori algorithm [2] and require prior knowledge about the document. Their approach uses Apriori to generate the large itemsets of the document, and then selects from them all association rules whose support is greater than or equal to the minimum support. This approach performs well on simple XML documents but not on complex XML documents with an irregular structure. This limitation is overcome in Discovering Interesting Information in XML Data with Association Rules, also from 2003 [4], where Braga et al. introduce XMINE RULE, an operator for mining association rules from native XML documents.
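The two Apriori steps described above (generate the large itemsets, then select rules from them) can be sketched in a few lines of Python. This is only an illustration on flat transactions, not the XQuery implementation of Wan and Dobbie; the function names `apriori` and `select_rules`, the confidence filter, and the toy data are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every distinct item in the data.
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates that occur in enough transactions.
        level = {}
        for cand in candidates:
            support = sum(1 for b in baskets if cand <= b) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates and prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def select_rules(frequent, min_confidence):
    """Derive (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for items in combinations(itemset, size):
                antecedent = frozenset(items)
                # Every subset of a frequent itemset is itself frequent.
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules


# Hypothetical toy data: four transactions over the items a, b, c.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
frequent = apriori(baskets, min_support=0.5)
strong = select_rules(frequent, min_confidence=0.9)
```

With these inputs the only rule that survives the confidence filter is c => b, since every transaction containing c also contains b.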
4. EXTRACTING TARS
Tree-Based Association Rules are obtained by selecting the items whose support and confidence values are above the user-defined support and confidence thresholds. For this, frequent subtrees are generated using an extension of the CMTreeMiner algorithm. These are the subtrees whose support is above the user-defined support. From the subtrees obtained, tree-based association rules (TARs) are generated. TARs are of two types. Content-based TARs, called instance TARs (iTARs), capture values or text in the XML documents. Structure TARs (sTARs) capture the structure of the knowledge mined from the XML documents. Each rule is saved inside a <rule> element, which carries three attributes for the ID, support, and confidence of the rule. The extracted rules are stored in a file called the rule file, which can then be used as another source for getting an idea of the original XML document. The answer to a given query is found from this rule file by matching the condition in the query.
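As a rough illustration of how one rule might be written to the rule file, the sketch below builds a <rule> element with Python's standard `xml.etree.ElementTree` module. Several things here are assumptions, not taken from the paper: the attribute names `id`, `support`, and `confidence`, the child elements `antecedent` and `consequent`, the way support and confidence are computed from occurrence counts, and the sample subtrees and counts.

```python
import xml.etree.ElementTree as ET


def build_rule(rule_id, body_xml, head_xml, body_count, head_count, total):
    """Build a <rule> element for one TAR.

    Assumed definitions: support is the share of the document covered by
    occurrences of the whole rule tree, and confidence is the ratio of
    whole-rule occurrences to antecedent occurrences.
    """
    support = head_count / total
    confidence = head_count / body_count
    rule = ET.Element("rule", {
        "id": str(rule_id),
        "support": f"{support:.4f}",
        "confidence": f"{confidence:.4f}",
    })
    # Store both subtrees as XML so the rule file stays queryable.
    ET.SubElement(rule, "antecedent").append(ET.fromstring(body_xml))
    ET.SubElement(rule, "consequent").append(ET.fromstring(head_xml))
    return rule


def passes(rule, minsupp, minconf):
    """True when the rule meets both user-defined thresholds."""
    return (float(rule.get("support")) >= minsupp
            and float(rule.get("confidence")) >= minconf)


# Hypothetical rule: books by publisher P1 usually have category C1.
rule = build_rule(
    1,
    "<book><publisher>P1</publisher></book>",
    "<book><publisher>P1</publisher><category>C1</category></book>",
    body_count=40, head_count=38, total=1000,
)
rule_text = ET.tostring(rule, encoding="unicode")
```

Keeping the thresholds outside the stored element means the same rule file can be filtered again later with a stricter support or confidence.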
6. QUERY ANSWERING
Once the rule files and index files are saved as a data set to be queried, the user enters a query written against the original document. This query is transformed into an intentional query and then run on the extracted data sets, so the user gets an intentional answer to the query. This answer is concise and describes properties that are frequently satisfied. The condition in the query is matched against the nodes in the index file to obtain references to rules. Only the rules whose IDs are returned from the index file are searched for the answer. The answer is thus returned from the mined knowledge, not from the original document, which is also useful when the original document is unavailable. Queries are of different types, such as count queries and top-k queries, which are frequently asked.
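The lookup path described above (query condition, then index nodes, then rule IDs, then rules) can be sketched with plain dictionaries. The structure of the index, the rule records, and the helper names `candidate_rule_ids` and `answer` are all hypothetical; the paper does not specify the file layout.

```python
def candidate_rule_ids(index, conditions):
    """Return IDs of rules referenced by every condition node in the index."""
    id_sets = [index.get(cond, set()) for cond in conditions]
    return set.intersection(*id_sets) if id_sets else set()


def answer(index, rules, conditions, k=None):
    """Search only the rules the index points to, best confidence first."""
    ids = candidate_rule_ids(index, conditions)
    hits = sorted((rules[i] for i in ids),
                  key=lambda r: (-r["confidence"], r["id"]))
    # k=None returns every match; otherwise this acts as a top-k query.
    return hits if k is None else hits[:k]


# Hypothetical index: (element name, value) -> IDs of rules containing that node.
index = {
    ("publisher", "P1"): {1, 2},
    ("category", "C1"): {2, 3},
}
# Hypothetical rule records, as they might be loaded from the rule file.
rules = {
    1: {"id": 1, "support": 0.04, "confidence": 0.96},
    2: {"id": 2, "support": 0.03, "confidence": 0.97},
    3: {"id": 3, "support": 0.05, "confidence": 0.99},
}

both = answer(index, rules, [("publisher", "P1"), ("category", "C1")])
top1 = answer(index, rules, [("publisher", "P1")], k=1)
count = len(answer(index, rules, [("category", "C1")]))
```

A count query reduces to the length of the match list, and a top-k query to a slice of the sorted matches, so all three query types share the same index lookup.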
7. RULES UPDATING
XML documents on the web keep changing. For documents that undergo frequent changes, the previously obtained rule and index files are updated instead of creating new rules. This is done by creating dummy nodes while generating the frequent subtrees. Dummy nodes are nodes whose confidence is less than minconf but close to it. When the document changes and the confidence value becomes greater than or equal to minconf, only the respective dummy node is activated, and the rules as well as the indexes are updated.
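One possible reading of the dummy-node mechanism is sketched below. The paper does not say how close to minconf a node must be, so the `margin` parameter is an assumption, as are the `Candidate` class, the count-based confidence, and the numbers in the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    rule_id: int
    body_count: int   # occurrences of the antecedent
    head_count: int   # occurrences of the whole rule
    active: bool = False

    @property
    def confidence(self):
        return self.head_count / self.body_count if self.body_count else 0.0


def classify(candidates, minconf, margin):
    """Split candidates into active rules and dummy nodes; drop the rest."""
    rules, dummies = [], []
    for cand in candidates:
        if cand.confidence >= minconf:
            cand.active = True
            rules.append(cand)
        elif cand.confidence >= minconf - margin:
            dummies.append(cand)  # kept, but not yet written to the rule file
    return rules, dummies


def apply_update(cand, body_delta, head_delta, minconf):
    """Adjust counts after a document change; return True if newly activated."""
    cand.body_count += body_delta
    cand.head_count += head_delta
    if not cand.active and cand.confidence >= minconf:
        cand.active = True
        return True
    return False


# Hypothetical candidates: confidences of 0.97, 0.93, and 0.50.
cands = [Candidate(1, 100, 97), Candidate(2, 100, 93), Candidate(3, 100, 50)]
rules, dummies = classify(cands, minconf=0.95, margin=0.05)
# The document gains 50 new matches of rule 2, lifting it to 143/150.
activated = apply_update(dummies[0], body_delta=50, head_delta=50, minconf=0.95)
```

Only the dummy whose counts changed is re-evaluated, which is the point of the technique: the frequent subtrees do not have to be mined again from scratch.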
8. EXPERIMENTAL RESULTS
Four types of experiments are performed using the method.
8.1 Extraction Time
This is the time required to extract the intentional knowledge from an XML database. As the number of nodes increases, the extraction time increases initially, remains stable for a while, and then rises very quickly once the number of nodes becomes very large.
Figure 4. Extraction time with constant support = 0.02
8.4 Accuracy
The accuracy of the intentional answer is measured in terms of precision and recall. Query answering depends on the support threshold: when support is high, the chance of a correct answer is high, as fewer rules need to be accessed.
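Precision and recall here follow their standard definitions. A minimal sketch, assuming the intentional answer and the extensional answer (taken as ground truth) are both available as sets of result identifiers; the function name and the sample sets are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Compare an intentional answer with the extensional one.

    precision = |returned & relevant| / |returned|
    recall    = |returned & relevant| / |relevant|
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical answers: the intentional answer returns four items,
# three of which appear among the five items of the extensional answer.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

For these sets the precision is 3/4 and the recall is 3/5.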
9. CONCLUSION
This paper presents a method for deriving intentional knowledge from XML documents in the form of TARs, and then storing these TARs as an alternative, synthetic data set to be queried for quick and summarized answers. The procedure is characterized by the following key aspects: it works directly on the XML documents, without transforming the data into any intermediate format; it looks for general association rules, without the need to impose what should be contained in the antecedent and consequent of a rule; it stores association rules in XML format; and it translates queries on the original data set into queries on the TAR set. The aim of this project is to provide a way to use intentional knowledge as a substitute for the original document during querying, not to improve the execution time of queries over the original XML data set. The method can be developed further by optimizing the mining algorithms.
Figure 2. Extraction time w.r.t. no. of nodes in XML document (x-axis: no. of nodes)
8.2 Answering Time
The time needed to get an intentional answer is comparatively less than that for an extensional answer, because the mined rule file is used to answer the query instead of the original document.
8.3 Comparison with Support and Confidence
The extraction time for generating rules from XML documents changes according to support and confidence. This can be shown in a graph by first keeping confidence constant while varying support, and then keeping support constant while varying confidence. Figure 3 shows that a higher support threshold leaves fewer frequent data items, and hence the extraction time is lower.
References
[1] G. Marchionini, Exploratory Search: From Finding to Understanding, Comm. ACM, vol. 49, no. 4, pp. 41-46, 2006.
[2] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proc. 20th Int'l Conf. Very Large Data Bases, pp. 478-499, 1994.
[3] J.W.W. Wan and G. Dobbie, Extracting Association Rules from XML Documents Using XQuery, Proc. Fifth ACM Int'l Workshop Web Information and Data Management, pp. 94-97, 2003.
[4] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi, Discovering Interesting Information in XML
Figure 3. Extraction time with constant confidence = 0.95