One of the main advantages of the Semantic Web is that it gives data a well-defined meaning and links between data by using the RDF ontology language. Today most data are stored in relational databases. In order to reuse and infer over this data on the Semantic Web, relational databases need to be converted into RDF. Some approaches have been proposed; however, most of them transform a single table into RDF triples. This paper presents RDB2RDF, a complete method to transform all table
RDB2RDF: Completed Transformation from Relational Database into RDF Ontology
Pham Thi Thu Thuy, Faculty of Information Technology, Nha Trang University, Vietnam, thuthuypht@gmail.com
Nguyen Duc Thuan, Faculty of Information Technology, Nha Trang University, Vietnam, ngducthuan@gmail.com
Yongkoo Han, Kyung Hee University, South Korea, ykhan@khu.ac.kr
Kisung Park, Kyung Hee University, South Korea, kspark@khu.ac.kr
Young-Koo Lee, Kyung Hee University, South Korea, yklee@khu.ac.kr
ABSTRACT

One of the main advantages of the Semantic Web is that it augments data with a well-defined meaning and with links between data by using the RDF ontology language. Today most data are stored in relational databases. In order to reuse and infer over this data on the Semantic Web, the data stored in relational databases need to be converted into RDF. Some approaches have been proposed; however, most of them transform a single table into RDF triples. This paper presents RDB2RDF, a complete method to transform all tables in a relational database into an RDF ontology. The transformation makes it possible to reverse the RDF ontology back into relational tables. Above all, every step in RDB2RDF is done automatically, without any user intervention.

Categories and Subject Descriptors
I.2.4 [Knowledge Representation Formalisms and Method]: Representation languages. I.5.3 [Clustering]: Similarity measures.

General Terms
Standardization, Languages.

Keywords
Semantic Web, relational databases, RDF, transformation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMCOM (ICUIMC)'14, January 9–11, 2014, Siem Reap, Cambodia.
Copyright 2014 ACM 978-1-4503-2644-5 …$15.00.

1. INTRODUCTION

The Semantic Web is an extension of the current Web in which data are augmented with a well-defined meaning and with relationships between data by using RDF (Resource Description Framework) together with the vocabulary supported by RDF Schema (RDFS). RDF data can be understood by computers and can then be shared, exchanged, or integrated into a data repository, enabling applications to use the data in different contexts [7].

RDF data can be presented as triples (Subject - Predicate - Object) or as RDF/XML, which stores RDF in the form of an XML file [5]. The main advantage of RDF/XML is that it can reuse existing XML tools. Moreover, each RDF format has an internet content type [5], passed by the server, so the client knows how to parse the data. Therefore, in this paper we use the RDF/XML format to store the results.

Most formatted data today are stored in relational databases, which are excellent tools for storing and querying data but lack the ability to describe the semantics of the data. In order to use relational data in a semantic context, we should transform those data into RDF, the data format of the Semantic Web. There are several proposals that move relational data into an RDF dataset. Typical approaches are proposed by Edgard Marx et al. [4], Huajun Chen et al. [6], and Kate Byrne [9]. However, most of the proposed approaches perform simple, equivalent matching. They map some tuples of the relational data to some triples of the RDF dataset without considering the RDFS semantic constraints or the complex queries needed to extract important information.

The main goal of RDB2RDF is to allow flexible mappings of complex relational structures into an RDF ontology without changing the existing database. The flexibility is achieved by employing SQL statements directly in the transformation steps. The resulting record sets are grouped afterwards and the data are mapped to RDF triples. Our contributions are threefold:

• RDB2RDF can transform all tables in the relational database into an RDF ontology.
• The transformation keeps track of the attribute keys in the tables so that the algorithm can be extended to reverse the RDF ontology into relational tables.
• All the steps in RDB2RDF are done automatically without any user intervention.

The rest of the paper is organized as follows. Section 2 presents some specific methods from the related work. In Section 3, we describe the RDB2RDF architecture and the details of each step. Section 4 presents an illustrating example for RDB2RDF. The evaluation is presented in Section 5. Finally, Section 6 summarizes the paper and mentions future research.

2. RELATED WORK

Many approaches investigate the transformation of a relational database into an RDF dataset. The approach most similar to ours is D2R [3]. However, D2R only extracts the attributes of interest in tables and then transforms them into RDF triples. Our proposed method transforms all attributes in a database.

More work has addressed the issue of explicitly defining semantics in database schemas [2], [13], extracting semantics out of a database schema, and transforming a relational model into an object-oriented model [1], which is close to an ontological theory.

Other well-known approaches are RDB2RDF [4], Juan Sequeda [8], Huajun Chen et al. [6], and Kate Byrne [9]. The RDB2RDF [4] method uses the mapping language R2RML [15] to convert tuples of the relational data to RDF triples. However, this is a direct mapping which does not consider RDFS semantic constraints such as rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range. Juan Sequeda [8] and Huajun Chen et al. [6] are also direct mappings, using an RDF query and a SPARQL query, respectively, to extract some specific information from the relational data. Kate Byrne [9] defines new RDF relationships and maps cultural heritage data to those relationships.

In a broader sense, our approach can be seen as the reverse of methods that store RDF in a relational database. Agrawal et al. [12] use only one "universal" table in the database. Every individual (instance) falls into one record in the table. While the data model is simple, this approach has drawbacks such as a large number of columns and limits on property values. The "Generic Representation" [14] has a single table where each record corresponds to an RDF triple. However, this design means that any query has to search the whole database, and many queries involve joins, so the cost is especially high.

3. THE RDB2RDF DESCRIPTION AND PROCESS

3.1 RDB2RDF description

The source of the transformation should be a primary table in a relational database. This table has to be generated from an entity in the conceptual data model. We can check this requirement by looking at the table: are there attributes which are primary keys and are not foreign keys? If yes, the table is generated from an entity. Otherwise, it is generated from a relationship. The target of the transformation is an RDF file which has the name of the source table (with a different extension).

The extracted table must have at least one attribute which is a primary key. This primary key is used to create the URI for the resource. Each extracted attribute has a domain and a range as described in Table 1.

Table 1. Domain and range for each extracted attribute

Attribute type               | Domain                | Range                             | Description
-----------------------------|-----------------------|-----------------------------------|------------
Primary key in primary table | Name of primary table | Name of primary table             | primaryKey
Foreign key in primary table | Name of primary table | Name of referenced table          | foreignKey
Primary key in other table   | Name of primary table | Name of table containing this key | primaryKey
Foreign key in other table   | Name of primary table | Name of referenced table          | foreignKey
Normal attribute (not key)   | Name of primary table | Name of datatype (in XML Schema)  | attribute

3.2 RDB2RDF architecture

The details of our approach are presented in Figure 1.

[Figure 1. RDB2RDF architecture. Components shown: Database, Description of attributes in DB, Select query, Attributes, Execute Select query, XML, RDFS, RDF; the arrows are labeled with the step numbers 1, 2, 3.1, 3.2, 4, 5.1, and 5.2.]

As shown in Figure 1, our approach has five small steps:

• Step 1: Describe all attributes in a database. The description result is a text file stored in secondary memory.
• Step 2: Use the SELECT command to extract the data describing the resource. Each resource should belong to a primary table. The attributes of the primary table must be extracted first.
• Step 3: Generate the RDF Schema (RDFS) file based on the description file (Step 1) and the attributes extracted in Step 2.
• Step 4: Execute the SELECT query to extract instances from the relational database. The results are stored in XML format.
• Step 5: Generate the RDF dataset from the RDFS and XML files.

Details of each step are presented in the next sections.

3.3 SELECT syntax for extracting data

The SELECT command must contain the primary key of the primary table. This primary key is used as the URI for instances of the resource. The SELECT syntax is as follows:

    SELECT tableName.ID, attribute1 AS AliasName1, attribute2 AS AliasName2....
    FROM tableName, table2Name ....
    WHERE .....
    ORDER BY tableName.ID
    FOR XML AUTO, ELEMENTS

We note that the attribute key of the primary table must be extracted in the SELECT command.

3.4 Generating RDFS

We assume that the extraction of data from the relational database is not redundant. This means that if a foreign key is extracted, the primary key referenced by that foreign key is not extracted, and vice versa. The RDFS file contains the classes and properties, which are described as follows:

• Description of classes: Each table in a database is transformed into a class. The description of a class is based on the key attribute (primary key or foreign key). The class name is a value in the range column. If the parent class contains values, the class in the range column is a child class of the parent class.
• Description of properties: The domain of all the attributes is the name of the primary table. The range of the attributes is taken from the values in the range column.

3.5 Generating RDF

This step produces an RDF file from the XML and RDFS files generated in Sections 3.3 and 3.4. The SELECT command to generate the RDF data in the form of an XML file is as follows:

    SELECT property1, property2.....
    FROM table1, table2, ....
    WHERE [Where conditions]
    ORDER BY property1
    FOR XML AUTO, ELEMENTS

where property1 is the attribute key in the primary table and table1 is the name of the primary table.

The algorithm to generate the RDF/XML file can be described by the following pseudocode:

Algorithm 1. GenerateRDF
Input: an XML file Fxml, a primary table Tp, an RDFS file Fs
Output: an RDF file F
     1: Collect all the child elements of the root element in the XML file Fxml
     2: FOR each child element ec in Fxml
     3:     read the value ID of the attribute key in Tp
     4:     create a resource having URI = baseURI + ResourceName + # + ID
     5:     FOR each property p in Fs
     6:         take a list of elements (listEle) in an instance
     7:         IF |listEle| > 0 THEN
     8:             FOR each child element ec in listEle
     9:                 IF ec is the attribute key THEN
    10:                     create a corresponding property
    11:                 generate a property having ec's value
    12:         append predicate to the resource
    13: RETURN F

Algorithm 1 generates an RDF file from the primary table, the RDFS file, and the XML file. First, we collect all the child elements of the root element in the XML file (line 1). Second, we read the value ID of the attribute key in the primary table, and then we iteratively create a resource whose URI consists of the base URI, the resource name, and the ID, for each child element in the XML file (lines 2-4). Third, for each property we take a list of elements (listEle) in an instance (lines 5-6). If the number of elements in listEle is greater than 0 and the element is an attribute key, we create a corresponding property which references the resource having that URI (lines 7-10). We generate a property whose value is the value of the child element in listEle, and then we put the predicate between containers (line 11). Finally, we append the predicate to the resource. If not all the properties in the RDFS file have been traversed, we return to the attribute key checking step (line 12). Through all these steps, we obtain the resulting RDF file.

4. ILLUSTRATING EXAMPLE

The following example illustrates the use of RDB2RDF to transform data about authors and their documents from a database into RDF. Because authors usually have more than one document and documents can be created by multiple authors, the information can be stored in three database tables: one for the authors, one for their documents, and a third one for the n:m relationship between authors and documents:

    Author (AuthorID, AuthorName, AuthorEmail, AuthorORG)
    Document (DocumentID, DocName, DocFormat, DocLink)
    Doc_Author (AuthorID, DocumentID)

The instances of the three tables above are as follows.

Author:

AuthorID | AuthorName | AuthorEmail   | AuthorORG
---------|------------|---------------|------------
Author01 | Anderson   | anv@yahoo.com | MinhKhaiPub
Author02 | Thomas     | btv@yahoo.com | MinhKhaiPub

Document:

DocumentID | DocName         | DocFormat | DocLink
-----------|-----------------|-----------|-------------------------
Doc01      | C++ programming | pdf       | http://www.somewhere/Doc
Doc02      | Semantic Web    | chm       | http://www.somewhere/Doc
Doc03      | MSSQL 2000      | pdf       | http://www.somewhere/Doc
Doc04      | ASP & ASP.NET   | pdf       | http://www.somewhere/Doc

Doc_Author:

AuthorID | DocumentID
---------|-----------
Author01 | Doc01
Author02 | Doc01
Author01 | Doc02
Author01 | Doc03
Author02 | Doc04

For example, suppose we would like to know the detailed information about the author. The extracted information is as follows:

Attribute name      | Domain    | Range     | Attribute type
--------------------|-----------|-----------|---------------
Author.AuthorID     | #Author   | #Author   | primaryKey
Author.AuthorName   | #Author   | Literal   | Attribute
Author.AuthorEmail  | #Author   | Literal   | Attribute
Author.AuthorORG    | #Author   | Literal   | Attribute
Document.DocumentID | #Document | #Document | primaryKey

For easier understanding, we can replace the attribute Doc_Author.Author by the alias name "Created_By".

The RDFS file that stores the required information is generated as follows:

    <?xml version="1.0" ?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
      <rdfs:Class rdf:ID="Document" />
      <rdfs:Class rdf:ID="Author" />
      <rdf:Property rdf:ID="DocName">
        <rdfs:domain rdf:resource="#Document" />
        <rdfs:range rdf:resource="&rdf;Literal" />
      </rdf:Property>
      <rdf:Property rdf:ID="DocFormat">
        <rdfs:domain rdf:resource="#Document" />
        <rdfs:range rdf:resource="&rdf;Literal" />
      </rdf:Property>
      <rdf:Property rdf:ID="DocLink">
        <rdfs:domain rdf:resource="#Document" />
        <rdfs:range rdf:resource="&rdf;Literal" />
      </rdf:Property>
      <rdf:Property rdf:ID="Created_By">
        <rdfs:domain rdf:resource="#Document" />
        <rdfs:range rdf:resource="#Author" />
      </rdf:Property>
    </rdf:RDF>

To create the RDF file, we use the following SELECT command:

    SELECT Document.DocumentID AS [Document_RESOURCEURI_], DocName, DocFormat, DocLink, AuthorID AS [Created_By]
    FROM Document, Doc_Author
    WHERE Document.DocumentID = Doc_Author.DocumentID
    ORDER BY Document.DocumentID
    FOR XML AUTO, ELEMENTS

The resulting RDF/XML file is shown below:

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:res="http://www.oracle.com/technology/pub/articles/res#"
             xmlns:db="http://http://www.w3.org/2001/XMLSchema-instance/mine#">
      <db:result>
        <db:Document>
          <db:Document_RESOURCEURI_>Doc01</db:Document_RESOURCEURI_>
          <db:DocName>C++ Programming</db:DocName>
          <db:DocFormat>pdf</db:DocFormat>
          <db:DocLink>http://www.somewhere/Doc</db:DocLink>
          <db:Doc_Author>
            <db:Created_By>Author01</db:Created_By>
          </db:Doc_Author>
          <db:Doc_Author>
            <db:Created_By>Author02</db:Created_By>
          </db:Doc_Author>
        </db:Document>
        <db:Document>
          <db:Document_RESOURCEURI_>Doc02</db:Document_RESOURCEURI_>
          <db:DocName>Semantic Web</db:DocName>
          <db:DocFormat>chm</db:DocFormat>
          <db:DocLink>http://www.somewhere/Doc</db:DocLink>
          <db:Doc_Author>
            <db:Created_By>Author01</db:Created_By>
          </db:Doc_Author>
        </db:Document>
        <db:Document>
          <db:Document_RESOURCEURI_>Doc03</db:Document_RESOURCEURI_>
          <db:DocName>MSSQL 2000</db:DocName>
          <db:DocFormat>pdf</db:DocFormat>
          <db:DocLink>http://www.somewhere/Doc</db:DocLink>
          <db:Doc_Author>
            <db:Created_By>Author01</db:Created_By>
          </db:Doc_Author>
        </db:Document>
        <db:Document>
          <db:Document_RESOURCEURI_>Doc04</db:Document_RESOURCEURI_>
          <db:DocName>ASP and ASP.NET</db:DocName>
          <db:DocFormat>chm</db:DocFormat>
          <db:DocLink>http://www.somewhere/Doc</db:DocLink>
          <db:Doc_Author>
            <db:Created_By>Author02</db:Created_By>
          </db:Doc_Author>
        </db:Document>
      </db:result>
    </rdf:RDF>

RDB2RDF can transform an arbitrary relational database into RDF formats, which include an RDF Schema file (*.rdfs) and an RDF file (*.rdf). The RDF Schema file describes the classes and the relationships between properties and classes. The RDF file stores all instances of the relational database. The transformation does not depend on the database and does not make any change to the database.

RDB2RDF is implemented in the C# language with the support of the .NET 2.0 library. The relational databases are stored in SQL Server 2012. Therefore, before executing the transformation, we must connect to the SQL database. Our program supports two authentication modes (Windows and server) for connecting to the database.

Our RDB2RDF program provides two kinds of transformations. The first one allows users to press a button to convert all relational tables into an RDF ontology. The second one lets users enter SQL commands to specify the needed information from some tables. All steps in the RDB2RDF program are automated and thus inexpensive and fast.

5. EVALUATION

We evaluate the proposed transformation strategies by matching a relational database with an RDF file to determine the true matches, and we compare our results with related methods. To assess the quality of the matching system, we use precision and recall [16]. Given the set of expected matching pairs R (produced by a human) and the set of alignment pairs T (produced by the matching system for the proposed methods), precision is computed as:

$$\mathrm{precision}(R,T) = \frac{|R \cap T|}{|T|} \tag{1}$$

Recall specifies the share of real correspondences that are found:

$$\mathrm{recall}(R,T) = \frac{|R \cap T|}{|R|} \tag{2}$$

Although precision and recall are the most widely used measures, when comparing matching systems one may prefer to have only a single measure. For this reason, the F-measure [16] is introduced to aggregate precision and recall:

$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{3}$$

To obtain practical evidence, we applied our transformation to two sample databases produced by Microsoft, namely Northwind [10] and Pubs [11].

We compare the precision, recall, and F-measure values of our proposed method with those of the most closely related work: D2R [3], RDB2RDF [4], Juan Sequeda [8], and Huajun Chen et al. [6]. The matching system is also implemented in Visual C#. The comparison results are shown in the following figures.

[Figure 2. Matching comparison between our method and related work on Northwind database]

[Figure 3. Matching comparison between our method and related work on Pubs sample database]

Figure 2 and Figure 3 show that our matching quality is the highest compared with the related work. RDB2RDF [4] is ranked second, followed by J. Sequeda [8], H. Chen et al. [6], and D2R [3]. The main reason is that our method and RDB2RDF [4] transform all relational data into RDF, whereas the other three methods extract only some relational tuples. Moreover, our method maintains the relationships between foreign keys and primary keys among relations, whereas RDB2RDF [4] does not. Among the D2R [3], J. Sequeda [8], and H. Chen et al. [6] methods, J. Sequeda [8] gives the highest matching values, since this method retains the connections between foreign keys and primary keys. Moreover, when extracting some portions of the relational data, those three methods change some of the data structure, so their matching scores are lower.

There are some small differences between Figure 2 and Figure 3 because of the differences between the Northwind and Pubs databases. The Northwind database has 13 relations, compared to 11 relations in the Pubs database. Among those relations, there are relationships between foreign keys and primary keys. In this experiment, the total number of such relationships in the Northwind database is higher than in the Pubs database. Therefore, for those methods which do not maintain the foreign key and primary key relationships, the matching results on the Northwind database are lower than those on the Pubs database. For instance, D2R's F-measure score in Figure 2 is only 58%, compared with 66% in Figure 3.

6. CONCLUSIONS

Transformation from relational databases into RDF ontologies plays a critical role in realizing the Semantic Web as well as in many data sharing problems. Many approaches address this transformation. However, most of those approaches directly transform relational tuples into RDF triples without keeping the foreign key and primary key relationships. Other methods transform some relational tuples into RDF triples and do not consider the RDFS semantic constraints or the structure of the relational data. Our proposed RDB2RDF method can transform all data from the relations, or can extract any required information, while keeping the relationships between primary keys and foreign keys and improving the semantics of the relational data by using RDFS vocabularies. The experimental results show that our proposed method outperforms other related work for these reasons. Moreover, all the steps in our proposed method can be executed automatically without any human intervention. The algorithm can also be implemented as an intermediate module between any relational database and a Semantic Web page. The extracted information can be selected by the users. Our future direction is to transform relational databases into OWL ontologies, which support more semantics for the data than RDF.

7. ACKNOWLEDGMENTS

This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2013-(H0301-13-2001)).

8. REFERENCES

[1] Behm A., Geppert A., Dittrich, K. 1997. On the Migration of Relational Schemas and Data to Object-Oriented Database Systems. In Proceeding of the 5th Int. Conference on Re-Technologies for Information Systems (Klagenfurt, December 1997), pp. 13-33.
[2] Chiang R., Barron T., Storey V. 1994. Reverse engineering of relational databases: Extraction of an EER model from a relational database. Journal of Data and Knowledge Engineering, Vol. 12, No. 2, pp. 107–142.
[3] Christian Bizer. 2003. D2R Map - A Database to RDF Mapping Language. WWW 2003, Hungary.
[4] Edgard Marx, Percy Salas, Karin Breitman, José Viterbo, Marco A. Casanova. 2013. RDB2RDF: A relational to RDF plug-in for Eclipse. Software: Practice and Experience, Vol. 43, No. 4, pp. 435-447, doi:10.1002/spe.2145
[5] Graham Klyne, Jeremy Carroll. 2002. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Working Draft (work in progress). http://www.w3.org/TR/2002/WD-rdf-concepts-20021108/.
[6] Huajun Chen, Zhaohui Wu, Heng Wang and Yuxin Mao. 2006. RDF/RDFS-based relational database integration. In Proceedings of the 22nd International Conference on Data Engineering, pp. 94-104.
[7] James Hendler, Tim Berners-Lee, Eric Miller. 2002. Integrating Applications on the Semantic Web. Journal of the Institute of Electrical Engineers of Japan, Vol. 122(10), pp. 676-680.
[8] Juan Sequeda, Marcelo Arenas, Daniel P. Miranker. 2012. On directly mapping relational databases to RDF and OWL. WWW 2012, pp. 649-658.
[9] Kate Byrne. 2006. Tethering cultural data with RDF. In Proceedings of the Jena user conference 2006 (JUC2006), UK.
[10] Microsoft. 2011. Northwind database. http://northwinddatabase.codeplex.com/
[11] Microsoft. 2013. Pubs sample database. http://technet.microsoft.com/en-us/library/aa238305%28v=sql.80%29.aspx.
[12] R. Agrawal, A. Somani, and Y. Xu. 2001. Storage and Querying of E-Commerce Data. In Proceedings of VLDB.
[13] Rishe N. 1992. Database Design: The Semantic Modeling Approach. McGraw-Hill.
[14] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis and K. Tolle. 2001. On Storing Voluminous RDF Description: The case of Web Portal Catalogs. In Proc. of WebDB2001 in conjunction with ACM SIGMOD'01 Conference.
[15] W3C. 2012. R2RML: RDB to RDF mapping language. http://www.w3.org/TR/r2rml/
[16] Wikipedia, "Precision and recall", http://en.wikipedia.org/wiki/Precision_and_recall
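
As a supplementary illustration of the check described in Section 3.1 (a table is generated from an entity if it has a primary-key attribute that is not also a foreign key), the rule can be sketched in Python as below. The `TableInfo` class and the `is_entity_table` function are hypothetical names introduced only for this sketch; they are not part of the RDB2RDF implementation, which the paper states is written in C#.

```python
from dataclasses import dataclass, field


@dataclass
class TableInfo:
    """Hypothetical description of one table: its key attributes only."""
    name: str
    primary_keys: set
    foreign_keys: set = field(default_factory=set)


def is_entity_table(table: TableInfo) -> bool:
    """Return True if some primary-key attribute is not also a foreign key.

    Following Section 3.1, such a table is treated as generated from an
    entity; otherwise it is treated as generated from a relationship.
    """
    return bool(table.primary_keys - table.foreign_keys)


# Tables from the example in Section 4 (key metadata assumed from the schema).
author = TableInfo("Author", {"AuthorID"})
doc_author = TableInfo("Doc_Author",
                       {"AuthorID", "DocumentID"},
                       {"AuthorID", "DocumentID"})
```

Under these assumed key definitions, `Author` is classified as an entity table and `Doc_Author` as a relationship table, which matches the n:m role the paper gives it.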
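
The core loop of Algorithm 1 (Section 3.5) can be illustrated with a minimal Python sketch using the standard `xml.etree.ElementTree` module. It is a simplification under stated assumptions, not the authors' implementation: the input XML is assumed to have a single root element with no namespaces, the property names are passed in as a list instead of being read from the RDFS file, the attribute-key test of line 9 is omitted, and the function name, `base_uri`, and `vocab_ns` defaults are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def generate_rdf(xml_text, key_tag, resource_name, properties,
                 base_uri="http://example.org/",
                 vocab_ns="http://example.org/vocab#"):
    """Build RDF/XML text from rows of a query result (simplified sketch)."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("db", vocab_ns)
    rdf_root = ET.Element(f"{{{RDF_NS}}}RDF")
    # Lines 1-2: visit every child element of the XML root.
    for row in ET.fromstring(xml_text):
        # Line 3: read the value of the attribute key.
        key = row.findtext(key_tag)
        if key is None:
            continue
        # Line 4: URI = baseURI + ResourceName + # + ID.
        uri = f"{base_uri}{resource_name}#{key.strip()}"
        resource = ET.SubElement(rdf_root, f"{{{RDF_NS}}}Description",
                                 {f"{{{RDF_NS}}}about": uri})
        # Lines 5-12: for each property, copy every matching element's value,
        # including elements nested inside joined-table wrappers.
        for prop in properties:
            for elem in row.iter(prop):
                value = (elem.text or "").strip()
                if value:
                    ET.SubElement(resource, f"{{{vocab_ns}}}{prop}").text = value
    return ET.tostring(rdf_root, encoding="unicode")
```

Calling `generate_rdf(xml_text, "Document_RESOURCEURI_", "Document", ["DocName", "DocFormat", "DocLink", "Created_By"])` on a namespace-free version of the query result from Section 4 would yield one `rdf:Description` per document, with one `Created_By` property per author.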
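
The measures of Equations (1)-(3) in Section 5 can be computed directly from two sets of matching pairs. The following Python sketch is only an illustration: the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than something the paper specifies.

```python
def precision(expected: set, found: set) -> float:
    """Equation (1): |R ∩ T| / |T|, with R = expected and T = found."""
    return len(expected & found) / len(found) if found else 0.0


def recall(expected: set, found: set) -> float:
    """Equation (2): |R ∩ T| / |R|."""
    return len(expected & found) / len(expected) if expected else 0.0


def f_measure(expected: set, found: set) -> float:
    """Equation (3): harmonic mean of precision and recall."""
    p, r = precision(expected, found), recall(expected, found)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For instance, with four expected pairs and three found pairs of which two are correct, precision is 2/3 and recall is 2/4.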