Implementing an Improved Mediator Wrapper Paradigm for Heterogeneous Database Integration

This study developed and implemented an improved mediator wrapper approach to addressing the challenges of integration of semantic heterogeneous databases. It employed Local as View (LaV) paradigm of database integration so as to reduce its cost as well as offer the local sources a degree of independence. The developed model was implemented as web service using JEE and some software development tools with a number of heterogeneous databases as a case study.


Introduction
*The rapid growth of database technologies and networking have had a major impact on information processing requirement and methods in organizations. Information systems for large organizations today are mostly implemented on a distributed architecture, using a number of different computers interconnected through the Internet. Different data are usually stored in various databases based on different departments in the organization. For these reasons, organizations may have incompatible databases and their associated database management systems. A good example is the group of national databases in Nigeria. Each of these databases have unique functions and are usually created and managed internally by one organization established by the Federal Government. For example, the national identity card database which is used to give each registered citizen a national identity number is managed by the National Identity Management Commission (NIMC), SIM registration databases is managed by Nigerian Communication Commission (NCC), and BVN (Biometric Verification Number) database managed by Central Bank of Nigeria (CBN), just to mention a few. Integrating these databases will offer many benefits to the nation like access to accurate and complete information about citizens, aid in criminal investigation as well as aid planning for national growth.
Database integration is the process of taking databases (data sources) and making the data in them work together as if they were a single database. The aim of this is to achieve easy access to information from different databases so as to avoid duplication of data acquisition and processing efforts, and enable efficient use of resources. Integration models for heterogeneous databases are challenged with how autonomous the source databases are, resolving data semantics from various databases and the cost of implementing and managing an integrated system (Shi, 2002); hence this study. The study sought to tackle these problems by using the mediatorwrapper approach to harness data from different sources and platforms for the purpose of management.

Review of Related Literature
Most database systems form foundations were designed without integration in mind. Whenever there is desire for integration, the sources and their data (that do not fit together) need to be combined using additional adaptation and reconciliation functionality; hence the development of research in database integration.
According to Almanhy and Salim (2013) database integration is a process involving multiple steps in finding similar objects in two or more databases to create a non-redundant, unified view of the databases. The databases are often unrelated and this make the task of integration difficult. Dissimilar sources means that the databases may differ in some respect such as the Database Management application each database is using (Ali, 2009). Heterogeneities between data sources (databases) are broadly classified into syntactic, platform, schematic and semantic heterogeneities (Xiao and Cruz, 2004). Syntactic heterogeneity is caused by the use of different models or languages. In today's computing there are large amounts of unstructured data in form of XM and pdf in different formats, thus different type of data models are used to store these data. Some may be document oriented, graph based arts, relational data models, or object oriented, every data model having its own query syntax, query language and no knowledge about other databases (Garg and Kaur, 2015). Therefore harnessing of data in this is a big problem. Schematic heterogeneity results from the different data organizations or conceptual design. This implies different data values can be used for the same information and kept at multiple sites but of different values. Sometimes data is repeated on multiple sites but due to synchronization problems same copy has different values. Semantic heterogeneity is caused by different meanings or interpretations of data, for example in representing gender as male or female, one database design may use m or f and another may use 0 or 1. Sometimes a database design can use Naira as currency while the other use Dollar, Euro and so on. Furthermore Ceruti, Kemal and Thuraisingham (1986) pointed out that platform heterogeneity, constraint heterogeneity and transaction processing heterogeneity were issues that have to be resolved to achieve the goal of integration.
Database integration can be accomplished either by physical or virtual approach. In the physical approach, copies of the original heterogeneous databases are combined to form a single database called the integrated database or data warehouse, while the virtual approach deploys a multi-database or mediation system to support queries on an integrated view constructed upon the local (original) databases (Mohd, Yazid and Wan, 2013). This approach solely depends on local databases thus providing a closely coupled paradigm and consequently making the approach very slow. Though on this warehouse, data can be extracted, transferred and loaded as in data mining, it has a big setback in that the data will not always be up to date. The other approaches are Federated and mediator-wrapper system which are virtual forms of integration (Ullman, Molina andWidom, 2002, Almahy andSalim, 2013).
A federated database achieve uniform access solution by logically integrating data from underlying local DBMS. The component databases are integrated by each having its own database management system (DBMS) and a management system that controls the coordination of the databases, called a federated database management system. The databases themselves and the associated DBMS's play a pivotal role in deciding the overall structure of the federated system, since these can vary greatly. Federated database systems are fully-fledged DBMS in that they implement their own data model, support global queries, global transactions, and global access control. However queries may be costly and more difficult to optimize (Ullman, Molina and Widom, 2002).
In the mediator wrapper approach, queries can be performed directly on original databases. Here a query is posted to global schema called a mediator which reformulates it according to formats of local systems with the assistance of wrappers and then carry out a direct query on databases. This is a loosely coupled architecture and faster approach which overcome the up to date data challenge of data warehousing. The mediator exploit encoded knowledge about certain set of data to create information for higher layer of applications (Herschel, 2014) and processes queries, merge results and keeps track of available data sources. It consists of a query processor for constructing query trees and merging data. It is used to determine relationships between different objects which may be located in different data sources. Here each object must also contain information about the data source it belongs to. A wrapper is a software component that establish communication and the data flow between data sources and mediators. It has a homogeneous interface for working with heterogeneous data sources and provides frameworks to extract the schema from a data source in a format that is usable by the repository (Garg and Kaur, 2015). It also uses a parser to execute queries in its data source, using one wrapper for each data source and integrating its schema with the repository. A mediator may either be designed using object oriented language or XML schema and XQuery (Ulman, Molina and Widom, 2002). Either way, a database integration system I based on mediatorwrapper approach can be defined as a tripple (G,S,M) (Lerzerini, 2002), where: i.
G is the global schema ii. S is the source schema(s) iii. M is the mapping between G and S.
The global schema is usually developed using a process called schema matching and the process of mapping is referred to as schema mapping. Schema matching is the process of generating correspondence between elements of two schema while schema mapping determines how data from the local or different sources can be mapped to the global schema while keeping data consistency (Almahy and Salim, 2013).
An integration system based on mediation technique is also defined as a system characterized by an architecture based on a global schema that provide a reconciled and integrated view of the underlying sources Xiao and Cruz (2004). This means the integrated system is concerned with two key issues: system modeling and query processing.
System modeling deals with modeling the relation between the sources and the global schema. Two basic approaches have been proposed for system modeling as listed below: i.
Global as View (GAV): This approach expresses the global schema in terms of the local sources. ii. Local as View (LAV): This approach, requires the global schema to be specified independently from the sources, and the relationships between the global schema and the sources are established by defining every source as a view over the global schema Q u e r y p r o c e s s i n g r e q u i r e a q u e r y reformulation step: query over the global schema has to be reformulated in terms of a set of queries over the sources. In the GaV approach, every entity in the global schema is associated with a view over the source local schema, therefore query processing in this case uses a simple "unfolding" strategy. In contrast, query processing in LaV can be complex, since the local sources may contain incomplete information. Query processing in LaV is called view-based query processing. Common mediator wrapper systems include TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) and GARLIC System. (Shi, 2002).

Introduction to National Database Integration Project
In Nigeria, there are a number of databases in use. These include: National Identity database that is managed by the National Identity Management Commission (NIMC), SIM registration database an initiative of Nigerian Communication Commission where data is collected by various telecommunication companies (MTN, Airtel, GlobaCom, 9mobile) in the country for submission to the SIM registration database, BVN (Bank Verification Number) database meant to unify bank accounts belonging to an individual, Tax identification database managed by Federal Inland Revenue service just to mention a few. These databases resulted from polices by the government to gather information about citizens so as to enable smooth functioning of the activities of these agencies and commission. Since these databases are managed by different organizations, there is need for them to be autonomous so that the various applications that use them can still function properly. All the information gathered about citizens are from strategic areas and as such one outcome is glaring: the need for access in order to have detailed complete and accurate data about a person in such a way that information gathering will not be disrupted. A commission, the National Identity Management Commission (NIMC) was given the mandate to carry out integration of national databases by the government. An attempt is made in this study to adapt the mediator wrapper model of database integration to solve this problem. The results will be relevant to developers of integrated database systems.

Methodology
This section discussed the methodology adopted in undertaking this study. It presented detailed analysis of the proposed systems and the model which was later implemented using Java Programming language and some database managers.

Analysis of the integration system
An integration system should be able to combine data from different sources to form a single record such that it appears to the user as if the data is from a single source. In the case of this system, the sources were databases. The integration system behaved like a database management system in the sense that it receives query from applications requesting for data. This implies that the integration system will be a server program. The system was designed to enable the s o u r c e s t o b e a u t o n o m o u s a n d r e s o l v e discrepancies in data about individual record from the various sources. To be able to perform this function, the system was developed with the following: I. a virtual database also called global schema that can be queried, ii. a language for encoding ontologies, iii. a protocol for query and a means of processing queries, iv. a way to acquire data from different data sources, v.
a technique of harnessing data from each database, and vi. a means of sending back report in a comprehensible form to the requesting application.

Description of the Proposed System
This study adapted the Mediator Wrapper paradigm (Seksun, Djamal, Champin and Thairin, 2002) of data integration and uses Local as View approach of schema mapping. The mediator provides a virtual database in the form of classes to applications and the wrappers. The applications send query asking for objects of a class at a particular time. The mediator does not need to understand the underlying structure of each database. All it does is to know which local table(s) or entity from a database form its class. This information is encoded in a global ontology. The global ontology then enable the mediator to know which wrapper to call in the event of a query. When a mediator gets a query, it simply decodes it to form a query message and send it to the wrappers. The wrappers interpret the mediator schema and query message from the mediator via the use of a local ontology. When a wrapper gets a query, it converts it into a local query that its source database can understand. Afterwards, it connects to the database, execute the query and get the result. The result it receives is converted to a format the mediator can understand. To do this, the wrapper uses the local ontology to convert the data into objects of the mediator class. As such the result from the wrapper is an array of object of a particular mediator class. The mediator receives the various results from the wrappers, use the integration system to combine them and send the information to the application. To be able to communicate effectively as proposed by the model, the integration system will be a web service. The proposed integration model is illustrated figure1.
The architecture was implemented using JEE platform for the reasons that: i.
It is a single standard that sits on top of a wide range of existing enterprise systems, and database management systems. In addition, JEE specification ensures Web services interoperability, ii. it can be deployed on most web application servers, many operating systems and can also be used with many Relational Database Management System (DBMS), and iii. its implementations on multiple hardware and software platforms have been successful in high performance and throughput system space.
The integration system was developed as a web service; a collection of open protocols and standards used for exchanging data between applications or systems which will be available over the Internet or private networks using standardized XML messaging for communication. Web services allow various applications to talk to each other and share data and services among themselves, share documents such as XML making it most suitable choice for implementing the proposed model. It also use standardized protocols like SOAP over HTTP protocol. This is low cost because existing low-cost internet services is used for implementing web services.

Data Design
The Database Management Systems (DBMS) used were MYSQL, ORACLE and POSTGREL. These database management systems are robust and reliable with widespread usage. Although all of them make use of Structured Query Language (SQL), there were some variations in the syntax for writing SQL in each of the DBMS. Also the underlying data structure as well as data types used in each of them are different. Hence making the system heterogeneous. The data in the five databases were simulated in this project. To do this, data collection forms used by the various agencies were gathered and studied. The tool used for the designs is MYSQL workbench as such the data types are of MYSQL. Each database was implemented randomly in one of the three database managers mentioned earlier, for example National ID management database implemented in Oracle, BVN database implemented in MYQSL, SIM registration database implemented in POSTGREL, National Health Insurance database (NHIS) implemented in MYSQL and TAX database implemented in POSTGREL.

Global Virtual Database
A structure-based schema matching was used to design the virtual database. The process involved determining the relationship between entities within a database or between databases: equivalence, synonymy, containment, homonymy (Ali, 2009) and antonym. Customer and Customer from BVN and SIM registration databases respectively were equivalent. Person, Taxpayer, Personal Data from NIN, Tax and NHIS databases respectively were synonymous. Taxpayer contains workdetails and Taxdue contained tax type from TAX database. Secondly a schema matching was performed using operators. The operators applied here were union, generalization, specialization and import (Ali, 2009). The rules used were that: i.
All equivalence and synonymy can use union. ii. Generalization can be used for equivalence, synonymy or containment where possible. iii. Specialization can be used for equivalence, synonymy or containment where possible. iv. Use import for homonymy and antonymy (Ali, 2009).
The algorithm for the global database is as Repeat (iv) until i> n iv.
Identify the relationship between the entities in each database. The relationships could be binary or n-ary in degree. v. let i=1 vi.
Repeat ( repeat (xvi to xvii) until 1>NewEntities xvi. Compare and identify the relationship between the new entities created. xvii. perform (x to xii) xviii. Repeat (xiii-xviii) until the integrated schema cannot be merged any further The global ontology was specified using xml document. It was used by the mediator to know which wrapper a query message is to be sent. The global class name is written in the tag <classname> </classname>. The <database> </database> tag contains data required by the mediator to be able to connect to a database wrapper. Each wrapper has its own local ontology. This is a well formed XML document and is meant to enable the wrapper convert a query message from the mediator to local query. The local ontology is also used by its wrapper to convert the result from its database to the required mediator class. The local ontology help solve semantic problems. Figure 2 shows the data flow diagram of the model.
The class diagram also has a class called query message which defines how query to be sent to the Wrappers will be. An ID was given to the query message. Parameters such as command, classname, fieldname, operator and value were the arguments list respectively. This enabled the wrapper to easily parse and convert a message from the mediator. The query processor validates a query and provides the list of wrappers to which message will be sent. The message is sent to the wrappers. On receipt of the result, the mediator perform integration using methods from the integration system. Upon receipt of records from each wrapper, the mediator calls the third component which is the integration system. This component is responsible for combining the The query received from the application is first decoded by the mediator to get the name of class requested. The selector is a simple "where" clause records from each wrapper to form a single result set. The wrapper receives query messages from the mediator and executes them. It does so by converting the query message to its local query with the help of the local ontology, connect to its local database and execute the local query, get the result and convert it to the mediator class specified in the query message by using the local ontology then send records to mediator.

Results and Discussion
The integrated system used the concept of distributed processing to perform its task. In this regard, it is assumed that the source databases are always running on servers at different locations and internet connection is available at the that specifies only one field value as search criteria. The syntax of the query from an application is shown in figure 4: integration system site at all times. The design was fully implemented as a web service in JAVA and tested using the proposed client application and test samples from different databases on different platforms. The client application is assumed to be a search application. At the client application home page, an authorized personnel was presented with a search box requesting for the name of the entity to search for. The user inputs the required data and on click the search button, the client application connects to the integration system requesting for information about the specified individual using the proposed name.

Records in the Test Table of the Individual Databases
The records in each table from each of database are shown in Figure 5. This class relates to Person, Customer, Customer, Taxpayer and Personal Data tables of NIN, BVN, SIM, Tax and NHIS databases respectively.    Looking at the records and the fields of the table from each database, they exhibit different levels of heterogeneity. The field names are different and the number of fields in the table of each database differ depending on the kind of information required about a person by the designers of the database. For example: sname, surname, lname, lastname refer to a person's surname. Also gender, sex and dob, dateofbirth. Figure 10 showed the result of searching for a person with surname "Anyebe" initiated by the client application. After comparing with the records in the various databases, only the records of individuals with surname "Anyebe" were displayed. Also, the system was able to merge the record of "Anyebe Evelyn Onyi" found in all the databases and "Anyebe Emani Mary" in all the databases except in tax database and as such there was no TIN number. Also "Anyebe joy's" record which was found in only NHIS databases was displayed. The next search is done using national identity number as shown in figure 11. Only one user exist in all the data sources with national id " 746909AKTE ". As such the application correctly retrieved its data. The national id field has the property of uniqueness thereby making search more efficient.
The application correctly assessed the record for individual "AGADA ZAINAB" (figure 12) from the databases using her national id and was able to highlight a disparity in her title. With the use of the national id, unique records were retrieved and combined efficiently because the attribute has the property of uniqueness.
The integration system was able to combine data from different sources to form a single record such that it appeared to the user as if the data was from a single source. The system facilitated easy access to information from different databases of diverse platforms.

Conclusion and Recommendation
The study attempted to address the challenges of autonomy in database integration using the different databases as case studies. A modified mediator wrapper using Local as View method of integration was adopted. The paradigm when implemented was able to combine data from different databases, offering the local sources a degree of independence and some level of cost effectiveness for the reason that that it requires little technical knowledge to be maintained and due to the absence of a central database, there is reduced cost of data center setup and management. The system can be easily developed by programmers and deployed for usage. It can be made efficient by enforcing a unique attribute in all the local sources to be integrated. It also required very little storage capacity due to the absence of a central database. During the implementation of this work, it was observed that achieving an efficient integration system require the use of a field with the property of uniqueness for search. Therefore the use of unique attributes about an individual to search is recommended. Also network instability might make the developed system have too many down time. To minimize this, the use of a central database to carry out periodical update via the integration system is recommended.
Adding the use of a central database and agents to the integration system, using unique attributes like biometric information and images of people for search parameters and using pattern matching and natural languages for resolving discrepancies in the records acquired during merging are suggestions for future studies in this work.