International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856

Mining Tree-Based Association Rules for XML Query Answering


Arundhati Birari1, Prof. Ranjit Gawande2

1 ME II, Computer Engineering, MCOER, Nasik, University of Pune (MS), India

2 Asst. Professor, Computer Engineering, MCOER, Nasik, University of Pune (MS), India

Abstract: Database research has concentrated on the Extensible Markup Language (XML) because of its flexible hierarchical nature, which can be used to represent huge amounts of data. XML has no absolute, fixed schema, and documents may have an irregular and incomplete structure. Extracting information from semi-structured documents is a hard task, and it will become more difficult as the amount of digital information available on the Internet grows. Because documents are often very large, the data set returned as the answer to a query may be too big to convey interpretable knowledge. This paper describes an approach based on Tree-Based Association Rules (TARs), which provide approximate, intentional information about both the structure and the contents of XML documents and can themselves be stored in XML format. The mined knowledge is used to provide a concise idea of both the structure and the content of the XML document, and quick, approximate answers to queries whenever required.

Key words: Extensible Markup Language (XML), approximate query answering, data mining, intentional information, Tree-Based Association Rules.

1. INTRODUCTION
Currently, XML is penetrating virtually all areas of internet application programming. With the continuous growth in XML data sources, the ability to extract knowledge from them for decision support becomes increasingly important and desirable. Compared with the success achieved in mining well-structured data such as relational and object-oriented databases, mining XML documents is still at a basic level and faces more challenges because of the loosely defined characteristics of XML in both structure and semantics. XML data have a more complex hierarchical structure than a database record. Unlike the fields of a database record, elements in XML data have a position fixed by their context, and XML data tend to be much larger than traditional data. Querying such documents is quite difficult: users are often not able to specify a probable structure in the query conditions, and they are very often confused by the large amount of information available. To meet these needs, traditional data mining techniques have to be adapted for extracting knowledge from XML structures. The aim of XML mining is to integrate emerging XML technology into data mining technology. The extracted knowledge can be represented in many different ways, such as clusters, decision trees, decision rules, etc. Among them, association rules have been more effective in discovering interesting relations in a large amount of data. The idea of mining association rules to provide summarized representations of XML documents has been the focus of many proposals, either by using languages (e.g., XQuery) and techniques developed in the XML context, or by implementing graph- or tree-based algorithms. Here, the technique of mining and storing Tree-Based Association Rules (TARs) is used as a means to represent intentional knowledge in native XML. TARs are extracted for two main purposes: to get a concise idea of both the structure and the content of an XML document, and to use them for intentional query answering, which allows the user to query the extracted TARs rather than the original document.

The paper is organized as follows. Section 2 overviews related work on mining XML documents and the different algorithms for generating tree-based association rules. The sections that follow present the proposed system, which mines XML documents and gives intentional answers to queries. The paper then discusses the experimental results and draws conclusions.

2. RELATED WORK
Association rule mining, described in "Fast Algorithms for Mining Association Rules in Large Databases" by R. Agrawal and R. Srikant in 1994 [2], has been used to provide summarized representations of XML documents. In 2003, Wan and Dobbie [3] proposed a set of functions, written in XQuery, which implement the Apriori algorithm [2] and require prior knowledge about the document. Their approach generates the large itemsets of the document and then selects all association rules from the large itemsets whose support is greater than or equal to the minimum support. This approach performs well on simple XML documents but not on complex XML documents with an irregular structure. This limitation is overcome in "Discovering Interesting Information in XML Data with Association Rules" (2003) [4], where Braga et al. introduce a proposal that uses XMINE RULE, an operator for mining association rules from native XML documents.

In 2005, Paik et al. [5] introduced HoPS, an algorithm for extracting association rules from a set of XML documents. Such rules are called XML association rules. The idea of using association rules as summarized representations of XML documents was also introduced in 2007 [7] by E. Baralis, P. Garza, E. Quintarelli, and L. Tanca, where the XML summary is based on the extraction of rules both on the structure and on the content of XML data sets. The limitations of this approach are that the root of the rule is established a priori and that the patterns are less accurate and reliable.

Various algorithms have been proposed for generating frequent tree-based association rules. In "Efficient Substructure Discovery from Large Semi-Structured Data", Asai et al. [8] presented an efficient algorithm called FREQT for finding all frequent ordered tree patterns in a collection of semi-structured data. The drawback of this algorithm is that it does not deal with more complex components such as attributes and text, and it works only on ordered and induced trees. Xiao et al. [9] proposed PATHJOIN, an algorithm that discovers all maximal frequent subtrees given some minimum support threshold. Zaki [10] introduced the mining of embedded subtrees in a (forest) database of trees, together with a novel algorithm, TREEMINER, for tree mining. TREEMINER uses depth-first search and the scope-list vertical representation of trees. The limitation of this algorithm is that it cannot find the structure of an XML document. Chi, Yang, Xia, and Muntz [11] presented CMTreeMiner, a computationally efficient algorithm that discovers all closed and maximal frequent subtrees in a database of rooted unordered trees. In [12], Termier et al. show that DryadeParent is currently the fastest tree mining algorithm. However, DryadeParent extracts embedded subtrees, which are trees that maintain the ancestor relationship between nodes but do not distinguish between ancestor-descendant pairs and parent-child pairs.

4. EXTRACTING TARS
Tree-Based Association Rules are obtained by selecting the items whose support and confidence are above the user-defined support and confidence thresholds. For this, frequent subtrees, i.e., subtrees whose support is above the user-defined support, are generated using an extension of the CMTreeMiner algorithm. From the frequent subtrees obtained, tree-based association rules (TARs) are generated by a rule-generation function. TARs are of two types. A content-based TAR, called an instance TAR (iTAR), describes values or text in the XML documents. A structure TAR (sTAR) describes the structure of the knowledge mined from the XML documents. Each rule is saved inside a <rule> element which contains three attributes for the ID, support, and confidence of the rule. The extracted rules are stored in a file called the rule file, which can further be used as another source to get an idea of the original XML document. The answer to a given query is found from this rule file by matching the condition in the query.
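The storage format described above can be illustrated with a minimal Python sketch. The text only states that each rule sits in a <rule> element with attributes for ID, support, and confidence; the attribute names (id, support, confidence), the <rules> root, the <antecedent> and <consequent> children, and the representation of trees as path strings are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def make_rule(rule_id, support, confidence, antecedent, consequent):
    """Build one <rule> element carrying ID, support and confidence attributes.

    The attribute names and the <antecedent>/<consequent> children are
    assumptions made for this sketch, not the paper's exact layout.
    """
    rule = ET.Element(
        "rule",
        {
            "id": str(rule_id),
            "support": f"{support:.2f}",
            "confidence": f"{confidence:.2f}",
        },
    )
    ET.SubElement(rule, "antecedent").text = antecedent
    ET.SubElement(rule, "consequent").text = consequent
    return rule


def build_rule_file(rules):
    """Collect all rules under a single (hypothetical) <rules> root."""
    root = ET.Element("rules")
    for rule in rules:
        root.append(make_rule(*rule))
    return root


# Hypothetical mined rules: (id, support, confidence, antecedent, consequent).
mined = [
    (1, 0.05, 0.96, "book/author", "book/author/name"),
    (2, 0.03, 0.98, "book", "book/title"),
]
rule_file = build_rule_file(mined)
rule_file_text = ET.tostring(rule_file, encoding="unicode")
```

Writing rule_file_text to disk would give a rule file that can be queried in place of the original document.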

5. ASSIGNING INDEX TO TARS
TARs provide an intentional answer to a query, which is more concise. Rather than returning the data items themselves, the answer gives the properties that the data frequently satisfy. An index is assigned to each path present in at least one rule. The index file is an XML document containing a set of references to each node in the rules.
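A path index of this kind can be sketched in a few lines of Python. This is an illustration under assumptions: rule trees are modelled as nested (label, children) tuples, the index is an in-memory dictionary rather than the XML index file described above, and the names paths_of and build_index are hypothetical.

```python
from collections import defaultdict


def paths_of(tree, prefix=""):
    """Yield every root-to-node path of a nested (label, children) tuple."""
    label, children = tree
    path = f"{prefix}/{label}" if prefix else label
    yield path
    for child in children:
        yield from paths_of(child, path)


def build_index(rule_trees):
    """Map each path that occurs in at least one rule to the IDs of those rules."""
    index = defaultdict(set)
    for rule_id, tree in rule_trees.items():
        for path in paths_of(tree):
            index[path].add(rule_id)
    return dict(index)


# Hypothetical rule trees keyed by rule ID.
rule_trees = {
    1: ("book", [("author", [("name", [])])]),
    2: ("book", [("title", [])]),
}
path_index = build_index(rule_trees)
```

Each key of path_index plays the role of an indexed path, and each value is the set of rule references for that path.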

6. QUERY ANSWERING
Once the rule files and index files are saved as a data set to be queried, the user enters a query written against the original document. This query is transformed into an intentional query and then run on the extracted data set, so the user gets an intentional answer to the query. This answer is precise and gives the properties that are frequently satisfied. The condition in the query is matched with the nodes in the index file, and references to rules are obtained. Only the rules whose IDs are returned from the index file are searched for the answer. Thus the answer is returned from the mined knowledge, not from the original document, which is also useful when the original document is unavailable. Frequently asked queries are of different types, including count queries and top-k queries.
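The lookup step described in this section (match the query condition against the index, then search only the referenced rules) can be sketched as follows. The data layout, the function name answer_query, and the choice to rank hits by confidence for a top-k cut are assumptions made for illustration; the translation of a user query into an intentional query is not shown.

```python
def answer_query(query_path, index, rules, k=None):
    """Return (rule_id, rule) pairs referenced by the index for query_path.

    Only rule IDs found in the index are looked up in the rule set.
    Ranking by confidence and the optional top-k cut are assumptions.
    """
    rule_ids = index.get(query_path, set())
    hits = [(rid, rules[rid]) for rid in rule_ids]
    # Highest confidence first; ties broken by rule ID for a stable order.
    hits.sort(key=lambda item: (-item[1]["confidence"], item[0]))
    return hits if k is None else hits[:k]


# Hypothetical rule set and path index.
rules = {
    1: {"consequent": "book/author/name", "support": 0.05, "confidence": 0.96},
    2: {"consequent": "book/title", "support": 0.03, "confidence": 0.99},
    3: {"consequent": "book/author/email", "support": 0.02, "confidence": 0.97},
}
index = {
    "book": {1, 2, 3},
    "book/author": {1, 3},
    "book/title": {2},
}

best_two = [rid for rid, _ in answer_query("book", index, rules, k=2)]
author_hits = [rid for rid, _ in answer_query("book/author", index, rules)]
no_hits = answer_query("book/price", index, rules)
```

A path that appears in no rule yields an empty answer, which reflects that the mined knowledge only covers frequent patterns.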

3. ASSOCIATION RULE MINING
Association rules describe the co-occurrence of data items in a large amount of collected data and are represented as implications of the form X => Y, where X and Y are two arbitrary sets of data items such that X ∩ Y = ∅. The quality of an association rule is measured by means of support and confidence. Support corresponds to the frequency of the set X ∪ Y in the data set, while confidence corresponds to the conditional probability of finding Y, having found X, and is given by supp(X ∪ Y)/supp(X). Association rule mining is one of the most important and well-researched techniques of data mining. Its goal is to find the association rules that satisfy predefined minimum support and confidence thresholds in a given database.
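The two measures can be made concrete with a small Python sketch over flat item sets. The transactions and the function names support and confidence are hypothetical; the sketch works on plain sets rather than on XML trees.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """supp(X ∪ Y) / supp(X) for the rule X => Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical data set of four transactions.
transactions = [
    {"author", "title"},
    {"author", "title", "year"},
    {"author"},
    {"title"},
]

supp_xy = support({"author", "title"}, transactions)    # 2 of 4 transactions
conf_xy = confidence({"author"}, {"title"}, transactions)  # 0.5 / 0.75
```

Here the rule author => title holds in two of the three transactions that contain author, so it would be kept only if the minimum confidence were at most 2/3.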

7. RULES UPDATING
XML documents on the web keep changing. For documents that change frequently, the previously obtained rule and index files are updated instead of creating new rules. This is done by creating dummy nodes while generating frequent subtrees. Dummy nodes are nodes whose confidence is less than minconf but close to it. When the document changes and the confidence value becomes greater than or equal to minconf, only the respective dummy node is activated, and the rule as well as the indexes are updated.
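The dummy-node idea can be sketched in Python as below. The text does not say how close to minconf a confidence must be, so the margin parameter is an assumption, as are the Candidate class and the function names. Deactivating rules whose confidence later drops is not described in the text and is left out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    rule_id: int
    confidence: float
    active: bool = False  # False marks a dormant "dummy" candidate


def classify(rule_id, confidence, minconf, margin):
    """Keep active rules and near-miss dummies; drop everything else.

    margin (how close to minconf counts as "near") is an assumption.
    """
    if confidence >= minconf:
        return Candidate(rule_id, confidence, active=True)
    if confidence >= minconf - margin:
        return Candidate(rule_id, confidence, active=False)
    return None


def refresh(candidates, new_confidences, minconf):
    """Apply recomputed confidences and activate dummies that reach minconf."""
    activated = []
    for cand in candidates:
        cand.confidence = new_confidences.get(cand.rule_id, cand.confidence)
        if not cand.active and cand.confidence >= minconf:
            cand.active = True
            activated.append(cand.rule_id)
    return activated


MINCONF, MARGIN = 0.95, 0.05  # hypothetical thresholds
kept = [
    c
    for c in (
        classify(1, 0.97, MINCONF, MARGIN),  # active rule
        classify(2, 0.93, MINCONF, MARGIN),  # dummy, close to minconf
        classify(3, 0.50, MINCONF, MARGIN),  # too far below, discarded
    )
    if c is not None
]
newly_active = refresh(kept, {2: 0.96}, MINCONF)
```

Only the rule IDs in newly_active would need to be written to the rule file and added to the index, which is what makes the update cheaper than mining again.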

Figure 1. System Architecture

8. EXPERIMENTAL RESULTS
Four types of experiments are performed using the method.

8.1 Extraction Time
This is the time required to extract the intentional knowledge from an XML database. As the number of nodes increases, extraction time increases initially, remains stable for a while, and then increases very fast again when the number of nodes becomes very high.

Figure 2. Extraction time w.r.t. no. of nodes in XML doc.

8.2 Answering Time
The time needed to get an intentional answer is comparatively less than that for an extensional answer, because the mined rule file is used to answer the query instead of the original document.

8.3 Comparison with Support and Confidence
The time to extract rules from XML documents changes according to support and confidence. This is shown in the graphs by first keeping confidence constant and varying support (Figure 3), and then keeping support constant (Figure 4). Figure 3 shows that with a higher support there are fewer frequent data items, so extraction time is lower. Similarly, when confidence is higher, fewer data items qualify, so fewer rules are extracted and less time is required.

Figure 3. Extraction time with constant confidence = 0.95

Figure 4. Extraction time with constant support = 0.02

8.4 Accuracy
The accuracy of the intentional answer is measured in terms of precision and recall. Query answering depends on the support threshold. When support is high, the chance of a correct answer is high, as fewer rules have to be accessed.

9. CONCLUSION
This paper presents a method for deriving intentional knowledge from XML documents in the form of TARs, and then storing these TARs as an alternative, synthetic data set to be queried for providing quick and summarized answers. The procedure is characterized by the following key aspects: it works directly on the XML documents, without transforming the data into any intermediate format; it looks for general association rules, without the need to impose what should be contained in the antecedent and consequent of the rule; it stores association rules in XML format; and it translates the queries on the original data set into queries on the TARs set. The aim of this project is to provide a way to use intentional knowledge as a substitute for the original document during querying, not to improve the execution time of queries over the original XML data set. The method can be developed further to optimize the mining algorithms.

References
[1] G. Marchionini, "Exploratory Search: From Finding to Understanding," Comm. ACM, vol. 49, no. 4, pp. 41-46, 2006.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases, pp. 478-499, 1994.
[3] J.W.W. Wan and G. Dobbie, "Extracting Association Rules from XML Documents Using XQuery," Proc. Fifth ACM Int'l Workshop Web Information and Data Management, pp. 94-97, 2003.
[4] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi, "Discovering Interesting Information in XML Data with Association Rules," Proc. ACM Symp. Applied Computing, pp. 450-454, 2003.
[5] J. Paik, H.Y. Youn, and U.M. Kim, "New Method for Mining Association Rules from a Collection of XML Documents," Proc. Int'l Conf. Computational Science and Its Applications, pp. 936-945, 2005.
[6] K. Wong, J.X. Yu, and N. Tang, "Answering XML Queries Using Path-Based Indexes: A Survey," World Wide Web, vol. 9, no. 3, pp. 277-299, 2006.
[7] E. Baralis, P. Garza, E. Quintarelli, and L. Tanca, "Answering XML Queries by Means of Data Summaries," ACM Trans. Information Systems, vol. 25, no. 3, p. 10, 2007.
[8] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, "Efficient Substructure Discovery from Large Semi-Structured Data," Proc. SIAM Int'l Conf. Data Mining, 2002.
[9] Y. Xiao, J.F. Yao, Z. Li, and M.H. Dunham, "Efficient Data Mining for Maximal Frequent Subtrees," Proc. IEEE Third Int'l Conf. Data Mining, pp. 379-386, 2003.
[10] M.J. Zaki, "Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 8, pp. 1021-1035, Aug. 2005.
[11] Y. Chi, Y. Yang, Y. Xia, and R.R. Muntz, "CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees," Proc. Eighth Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 63-73, 2004.
[12] A. Termier, M. Rousset, M. Sebag, K. Ohara, T. Washio, and H. Motoda, "DryadeParent, an Efficient and Robust Closed Attribute Tree Mining Algorithm," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 3, pp. 300-320, Mar. 2008.
[13] M. Mazuran, E. Quintarelli, and L. Tanca, "Data Mining for XML Query-Answering Support," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 8, Aug. 2012.
