BUILDING ONTOLOGIES OVER RELATIONAL DATABASES

Relational Databases are typically created to fulfil the information requirements of a community of users generally belongs to a single organization. Data stored in these databases were typically accessed by using Structured Query Languages or through customized interfaces. With the popularity of the World Wide Web and the availability of large number of Relational Databases for public access there is a need for users to retrieve data from these databases by using a text-based queries, possibly by using the terms that they are familiar with. However, the inherent limitations of Structured Query Languages used to create and access data in relational Data Bases does not allow uses to access data by using text-based queries. Also, the terms used in queries should be limited to those used during the construction of the databases. This paper proposes an architecture to generated ontologies over relation databases and show how they could be enhanced semantically by using available domain-specific or top-level ontologies so that the data managed by the DBs can be accessed by using text-based queries. The feasibility of the proposed architecture was demonstrated by building a prototype system over a sample MySQL database


Introduction
Today large volumes of data are managed in relations databases (RDBs). These RDBs are created by different organizations for different purposes and accessed through the Structured Query Language (SQL) provided by the database management systems (DBMSs) or through customized interfaces build over the databases (DBs). With the popularity of the World Wide Web (WWW), there is an increasing trend to provide Hyper Text Markup Language (HTML) form-based interfaces over DBs to provide data through web pages. When using SQL to access data in a DBs users should have a fair knowledge on the syntax and semantics of the DB content. However, complete knowledge of RDBs is usually unavailable with the DBs (Astrova, 2005), thus cannot be extracted from the DBs. When using web-based interface users are limited to the data provided by the interfaces and they cannot extract and combine data as they required from the available DBs. Thus, different approaches are proposed to link DBs with different types of ontologies to enrich the semantics so that the DB contents can be understood easily (Astrova, 2005 Alonso-Calvo, et al., 2007) and can be queried by using languages other than SQL. In this context, ontologies are considered as semantically rich knowledge structures that have the potential to enrich the semantics of data and meta-data in data repositories.
In this paper, we present a mechanism to construct ontologies over RDBs in third normal form. The proposed mechanism is based on a set of rules. Also, the mechanism make use of ontologies to enrich the constructed ontologies so that the users can query the content of RDBs by using simple text-based queries. The feasibility of the proposed mechanism for building ontologies over the existing RDBs is then demonstrated by developing a prototype system. The rest of the paper is organized as follows. Section 2 describes briefly, the major differences and limitation of databases and ontologies and then presents a selected collection of approaches reported to link databases with ontologies. In section 3 the overall architecture of the proposed system and the rules proposed to construct ontologies over RDBS were presented. Section 4 elaborates on how the porotype system was constructed to demonstrates the feasibility of the proposed architecture. Finally, section 5 presents the conclusions and future enhancements.

Differences Between Databases and Ontologies
Database and Ontologies are constructed by using different approaches and are used by different community of users for different objectives (Martinez-Cruz, Blanco, & Vila, 2012; Sir, Bradac, & Petr, 2015;Studer, Benjamins, & Fensel, 1998). The main intension of DBs to store and manage large volumes of data efficiently while ontologies are targeted at sharing knowledge of selected domains in a formal and structural manner among large community of users (Gruber , 1993;Guarino, 1995). Also, different languages and methods were developed to construct DBs and ontologies. For example, DBs are constructed and queried by using SQL whereas ontologies are constructed and queries by using a logic based language such as description logic.
It is important to understand the evolution of databases in order to understand the limitations of relational database technology. The Database Management Systems (DBMSs) available in late sixties and early sentries did not provide a sharp distinction between the logical view of the data and the physical representation of data in storage. As a result, application programmers were burdened with knowing irrelevant details on how data are organized and stored to develop applications over the DBs. This requirement sharply reduces the productivity of application programmers. The Relational Database Management Systems (RDBMS) was proposed by E. F Code (Codd, 1982) as a solution to this problem. Thus, the main objective of the RDBMS is to increase the productivity of application programmers within an organization who uses a RDB to develop applications. Consequently, much emphasis was not placed on the semantic expressiveness of data elements managed in data dictionaries of RDBMSs since one of the main premises made during the DB construction is that its usage is limited to a user community within a single organization. During the DB design stage conceptual schemas such as ER diagrams or UML are typically constructed to model the real world. These models are then used to create DB schemas by using SQL provided by the RDBMSs. The models like ER models are considered as intensional models (Guarino, 1998) at the conceptual level whereas DBs are considered as extensional models at the implementation level. In the process of converting a model at the intensional level to a model at the extensional level some explicit semantics associated with conceptual models would become implicit are some are lost. For example, the cardinality between entities in a ER diagram become implicit and name of relationships between entities are lost. Thus, the extensional models and the corresponding intensional models do not encode the same semantics. Also, there is no standard in assigning names for DB components. As a result, DB creators were free to assign their own invented names for tables and attributes in the DB schemas, which may not reflect the actual real-world entities and their properties they represent. In addition, new tables may be constructed to represent many-to-many relationships among entities in the ER models and such tables do not correspond to any real-world entities. These problems would cause serious limitations on usability, with respect to the semantics of DB elements, when the DBs are exposed to a larger community of users.
The study of ontologies initially started as a discipline in philosophy and integrated gradually into information technology during the last few decades (Sir, Bradac, & Petr, 2015). The word ontology has two different meanings based on the context on which it is used (Gómez-Pérez, Fernández-López, & Corcho, 2004; Guarino, 1995). In the first context, the term ontology refers to the study of things and their attributes and how the attributes belong to the things because of their very nature (Corazzon, 2017;Guizzardi, 2005). In the second context it is used to model the structure of a system formally. The system could be any area of interest that needs to be represented in a structural manner. In Computer Science the concept ontology is used in this latter context.
The main components of an ontology are concepts. A concept may be assigned with one or more names and may have linked with each other through relationships. The relationships give a structure for an ontology. Also, axioms may be assigned with both concepts and relationships. The names, relationships and axioms assigned for concepts together define the intended meaning of concepts in an ontology. Ability to represent data in languages such as OWL and RDF opens up the possibilities and reduces the cost and complexity associated with building ontology from scratch (Yu, 2007).
Since a DB schema is an explicit specification of a conceptualization it also can be considered as an ontology according to the definition given by Gruber (Gruber , 1993). However, a schema defines mostly the structure of a DB but not the semantics of the DB content formally as required by a typical ontology. Thus, a DB schema by itself cannot be used as an ontology to share knowledge of the DB content in a formal and structural manner among large community of users. This means that a DB schema is a primitive ontology with capability to define only some of the semantics of the DB content. Thus, if a RDB has to be shared among a large community of users with the intension of allowing them to query data by using their own queries two components must be provided. A knowledge layer such as ontologies on top of the BDs to expose the semantics of DB content to the intended user community and a user query to SQL query engine to convert the user queries to native SQL queries understood by the RDBMSs.

Linking Databases with Ontologies
The approaches reported in the literature aiming at linking databases with ontologies generally take two different approaches. In the first approach databases are mapped with existing ontologies whereas in the second approach a new skeletal ontology is built for the DB by constructing a data model for the DB through reverse engineering techniques and then enrich the data model into a new ontology. Some research adopted this approach were based only on the schema information while others have used both schema information as well as properties of data for this process. The former approach has several drawbacks. One of them is the semantic mismatch between available ontologies and DBs. Since different ontologies and DBs may have constructed by different communities for different purposes with different assumption linking them together would result in unacceptable semantic inconsistencies. The latter approach provides many benefits. Firstly, it allows meta-data extracted from the DBs to be semantically enriched prior to the construction of the ontology. Secondly, the ontology constructed contains only those concepts required to describe the semantics of the associated DB. Thus, many research work on linking DBs to ontologies had taken the second approach.

Similar Work
OntoFusion (Alonso-Calvo, et al., 2007), is a system based on the second approach. It was designed to provide unified access to multiple, heterogeneous biological and medical data sources that are publicly available over Internet by constructing ontologies over data sources by using terms from already available domain ontologies, and then unifying similar ontologies. Concepts in the domain ontologies are mapped with the data items in the data source so that data sources can be queried by using the concepts defined in the domain ontologies.
Datagenie (Zhou, Ling, Han, & Zhang, 2010) , is one of the earliest attempts reported to integrate a database with an Ontology. The ontologies of Datagenie are constructed by using the ontology construction toolkit Protégé (PROTÉGÉ, 2018). Consequently, terms used to describe the system are tightly coupled with the Protégé vocabulary. Datagenie was built as a plug-in and has the following simple rules to link DB elements with the components in an ontology. The Datagenie is too simplistic since the generated ontology is very primitive with respect to its semantics and the names in ontology may not make sense to a larger community as the names of ontology items are taken directly from the names of schema items. Also, it is not realistic to map tables representing many-to-many relations to classes as representing real-world concepts. The project qualegDB (Astrova, Korda, & Kalja, 2007) reported a similar approach but it has additional rules to extract inheritance relationships between tables in relational databases. Case 2: focuses on tables which represent specialization. These tables are considered as subclasses in the Ontology of the tables from which they derive.
Case 3: is the default case, where the tables which do not meet the above two are converted into Ontology classes unconditionally.
The process of conversion is first done with tables belong to case 3 followed by case 2, and then in case 1. Keys (except keys involved in case 2) are converted to object type properties in the Ontology and the other columns to data type properties. This approach also assume that the vocabulary used by the DB constructors are sufficient enough to expose the semantics of related items. Also, many of the research reported ignore the fact that primary keys and foreign keys defined in a RDB may comprises of multiple attributes.
Most of the approaches reported in the literature do not specify the rules formally, that they have used to construct ontologies and the structure and the composition of the ontologies built. Also, no clear explanation is reported on how the mapping between the components of the ontologies and components of DBs are recorded so that they can be used in subsequent SQL generation to access data in the DBs. Thus, it is not easy to identify their strengths and weaknesses easily.

Methodology
The approach we propose to build an Ontology comprises of two phases namely Meta-data Extraction phase and Ontology Generation Phase as depicted in figure 4.1. The extraction process is responsible for extracting the metadata from the relational database schema and to build a database model to represent the data in the DB. This process can be viewed as an attempt to build an intensional model, an intermediate ontology, from the extensional model described by the data in the DB. This process uses a set of rules to build an intensional model for a DB. In the next step the intermediate ontology created is enhanced with semantics by using an existing domain or a top-level ontology. The purpose of this stage is to ensure that the final ontology constructed is semantically rich enough to cater for a wide range of search words (to be entered by the user) than the meta-data items found in the database schema.

Construction of the Intentional Model
In our research an ontology O, is defined as below.
O = {C1,C2,…,Cm} , where C1,C2,…,Cm are the concepts in the ontology. Each concept Ci is assigned with one or more names and may have zero or more attributes. Each attribute of a concept is assigned with one or more names and has a domain and a range. Concepts may be related with each other by using sub_class/super_class or named relationships. An ontology can be viewed as Figure 1.

Implementation and Evaluation
In our prototype system Java programming language and SQL are used to implement the extraction process. The generated ontology is in OWL and the consistency of the generated ontology is checked by using the tool PROTÉGÉ (PROTÉGÉ, 2018). The general purpose top-level ontology WordNet (Fellbaum, 1998;MIller, Beckwith, Fellbaum, GRoss, & Miller, 1990) is used for the ontology enhancement process. The following figure 4.1 shows the meta-model of our ontology. The types of relationships defined between concepts are is_a, attribute and user named relationships. In our ontology attribute of tables are also represented as concepts. The attribute relationship is used to link attributes with their data types and is_a relationship represents subclass/super-class relationship. Each concept and attribute are assigned with an optional id to point to the relevant synsets in the WordNet database. Currently identifying the relevant concepts from WordNet and assigning their ids with the corresponding concepts in the ontology is done manually.   ); Figure 4.3: SQL commands used to create the database in a MYSQL server The concept hierarchy of the OWL ontology generated by our prototype system for the database is given in Figure 4

Discussion and Future Work
The proposed solution was evaluated by building a prototype and applying it on a set of sample database schemas. The results of the evaluation show that the proposed system is technically feasible and has the potential to build semantically rich ontology on top of any relational DB. The ontologies built by the proposed approach can subsequently be used to support text-based queries. One of the main limitations of the proposed approach is the limitations in the enhancement process. Currently, only the synonyms of the WordNet are used to enhance the ontologies. We are exploring how hypernyms and hyponyms defined in WordNet could be integrated into the ontologies to further enhance the semantics of the generated ontologies. Also, during our research we found that the terms to represent the intended meanings of some the items (table names/attribute names) in the database could not be found in the WordNet.