Here the Researcher describes the navigational approach, the data warehousing approach and the mediator approach. This document provides further details on biological data sources and on the integration approaches now widely used by database administrators and consultants.
BIOLOGICAL DATA SOURCES
Table 1: Heterogeneity of biological data sources.
Table 1 shows the heterogeneity of the data sources available to biologists for their research. The differences among the data sources appear in their data formats and user interfaces. These data sources store information about nucleotide sequences, protein sequences, 3D structures of macromolecules and protein families. This information can usually be retrieved via a web interface, FTP or email. The data models underlying these data sources are the flat-file model, the relational model and the object-relational model (P. Lambrix and V. Jakoniene, 2000).
According to Lambrix and Jakoniene (2000), most databanks allow data to be queried based on the occurrence of text within a data item (full-text search), and the databanks also support queries based on the occurrence of a text string within certain predefined fields. Users of these databanks are usually guided by the retrieval interface; some systems also support queries written in a command-line query language. Most of the systems allow Boolean queries (using and, or, not).
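The two query styles described above (full-text search and search within predefined fields, combined with Boolean operators) can be illustrated with a minimal Python sketch. The `Entry` class, the field names and the sample records are hypothetical and are not taken from any real databank.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str
    fields: dict  # field name -> text (hypothetical layout)


def contains(entry, term, field=None):
    """True if term occurs in the given field, or in any field when field is None."""
    texts = [entry.fields.get(field, "")] if field else entry.fields.values()
    return any(term.lower() in text.lower() for text in texts)


def search(entries, all_of=(), any_of=(), none_of=(), field=None):
    """Boolean query: all_of is AND, any_of is OR, none_of is NOT."""
    hits = []
    for entry in entries:
        if not all(contains(entry, t, field) for t in all_of):
            continue
        if any_of and not any(contains(entry, t, field) for t in any_of):
            continue
        if any(contains(entry, t, field) for t in none_of):
            continue
        hits.append(entry.accession)
    return hits


# Made-up sample entries for illustration only.
ENTRIES = [
    Entry("P1", {"description": "Insulin precursor", "organism": "Homo sapiens"}),
    Entry("P2", {"description": "Insulin receptor kinase", "organism": "Mus musculus"}),
    Entry("P3", {"description": "Hemoglobin subunit alpha", "organism": "Homo sapiens"}),
]

# Full-text search: "insulin" anywhere in the entry.
full_text_hits = search(ENTRIES, all_of=["insulin"])
# Field search with Boolean NOT: "insulin" but not "kinase" in the description field.
field_hits = search(ENTRIES, all_of=["insulin"], none_of=["kinase"], field="description")
```

Passing `field=None` gives the full-text behaviour, while naming a field restricts the match to that predefined field.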
There are some problems in using these different data sources. To query a data source, users need to have some knowledge of the data source they want to query. Users also need to know the query language and the user interface of the system. A user might need to consult more than one data source, and learning each data source to be consulted can be a tedious job. It can also take a long time for users to gain enough knowledge of a data source to use it.
BIOLOGICAL DATABASE INTEGRATION APPROACHES
Researchers have come up with several approaches that integrate the various sources of biological data. Three integration approaches are used to address the issue of interoperability among biological databases: link-based or navigational integration, mediator-based integration (T. Hernandez and S. Kambhampati, 2004), and data warehousing (Z. B. Miled, N. Li, G. L. Kellett, B. Sipes and O. Bukhres, 2002).
INTEGRATION APPROACHES FROM UNDERSTANDING USER REQUEST
Figure 1: General Integration approaches on different Architecture Level from User Request
Figure 1 describes the most common approaches to biological database integration; here the Researcher considers three of them, as follows.
NAVIGATIONAL APPROACH
Figure 2: Navigational Approach Example
In the navigational approach, the system provides static links between data or records in different data sources. SRS (L. Wong, 2002; Z. B. Miled, N. Li, G. L. Kellett, B. Sipes and O. Bukhres, 2002) provides functionality to search public, in-house and licensed databases. SRS basically works on flat files or databases that contain structured text with field names. It creates and stores an index for each field and uses these local indexes at query time to retrieve relevant entries. SRS parses the files or databases and captures the information using its own parsing component, known as Icarus (a special built-in wrapper programming language (L. Wong, 2002)).
The result of this approach is a simple aggregation of the records that match the search constraint. These records may contain links that the user can follow to get more information about the results.
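The per-field indexing and link following described above can be sketched as follows. This is only an illustration under stated assumptions: the flat-file layout, the field codes (`ID`, `DE`, `DR`) and every function name are hypothetical, and the code does not reproduce SRS or Icarus.

```python
from collections import defaultdict

# Hypothetical flat file: one "FIELD: value" pair per line, entries closed by "//".
FLAT_FILE = """\
ID: P1
DE: insulin precursor
DR: EMBL:X1
//
ID: P2
DE: insulin receptor
DR: EMBL:X2
//
"""


def parse_entries(text):
    """Split the flat file into entries, each a dict of field name -> value."""
    entries = []
    for block in text.split("//"):
        entry = {}
        for line in block.strip().splitlines():
            name, _, value = line.partition(":")
            entry[name.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def build_indexes(entries):
    """Build one inverted index per field: field -> word -> entry positions."""
    indexes = defaultdict(lambda: defaultdict(set))
    for position, entry in enumerate(entries):
        for field, value in entry.items():
            for word in value.lower().split():
                indexes[field][word].add(position)
    return indexes


def query(entries, indexes, field, word):
    """Answer a query from the local index instead of rescanning the file."""
    positions = sorted(indexes[field].get(word.lower(), set()))
    return [entries[p] for p in positions]


def follow_link(entry, linked_sources):
    """Follow the static cross-reference stored in the DR field."""
    source, _, key = entry["DR"].partition(":")
    return linked_sources[source][key]


entries = parse_entries(FLAT_FILE)
indexes = build_indexes(entries)
hits = query(entries, indexes, "DE", "insulin")
```

The result is a plain list of matching entries; any further information has to be fetched by following the link of each entry, one at a time.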
DATA WAREHOUSE APPROACH
Figure 3: Traditional Data warehousing process Architecture
The data warehouse approach is taken by integration systems such as GUS (S. B. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, C. Overton and C. Stoeckert, 2001) and DiscoveryLink (L. Wong, 2002). This approach uses a data warehouse that provides a single point of access to a collection of data obtained from a set of distributed, heterogeneous sources. The data in the remote heterogeneous databases is copied to a local server, and the user works through a single interface that allows multi-database queries to be issued against this single store.
This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Therefore, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
Data warehousing arises from the need for reliable, consolidated, unique and integrated information for the analysis of data at different levels of aggregation.
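The extract, transform and load steps mentioned above can be sketched with Python's built-in `sqlite3` module. The two sources, their record layouts and the `protein` table are assumptions made for illustration; they do not describe GUS or any other real system.

```python
import sqlite3

# Two hypothetical remote sources with different formats (made-up data).
SOURCE_A = [{"acc": "P1", "name": " Insulin ", "organism": "Homo sapiens"}]
SOURCE_B = [("P2", "hemoglobin", "HOMO SAPIENS"), ("P1", "insulin", "Homo sapiens")]


def extract():
    """Extract: read every source and emit rows in one common shape."""
    for row in SOURCE_A:
        yield "A", row["acc"], row["name"], row["organism"]
    for acc, name, organism in SOURCE_B:
        yield "B", acc, name, organism


def transform(rows):
    """Transform: clean whitespace and normalise letter case."""
    for source, acc, name, organism in rows:
        yield source, acc, name.strip().lower(), organism.strip().capitalize()


def load(rows):
    """Load: copy the cleaned rows into a local store (in memory here)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE protein (source TEXT, acc TEXT, name TEXT, organism TEXT)")
    db.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return db


warehouse = load(transform(extract()))

# A single SQL query now spans data that originated in both sources.
counts = warehouse.execute(
    "SELECT acc, COUNT(*) FROM protein WHERE organism = ? GROUP BY acc ORDER BY acc",
    ("Homo sapiens",),
).fetchall()
```

Because the query runs entirely against the local copy, no network access is needed at query time; the price is that the copy must be refreshed whenever the sources change.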
MEDIATOR APPROACH
Figure 4: Architecture for Biological Database Integration (Bio Meta Search Engine)
Another approach to the integration of biological databases is the mediator approach. Systems such as DiscoveryLink (L. Wong, 2002; C. Goble, 2000), K2/Kleisli (S. B. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, C. Overton and C. Stoeckert, 2001), TAMBIS (N. W. Paton, R. Stevens, P. Baker, C. A. Goble, S. Bechhofer and A. Brass, 1999) and BACIIS (Z. B. Miled, N. Li, G. L. Kellett, B. Sipes and O. Bukhres, 2002) use this approach. The mediator-based approach does not store any data, but provides a virtual view of the integrated sources (M. Kazemian, B. Moshiri, H. Nikhbakh and C. Lucas, 2005). The mediator basically translates the user's query into queries that are understood by the sources in the system. The mapping describes the relationship between the source descriptions and the mediator, and therefore allows queries on the mediator to be translated into queries on the data sources.
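The query translation described above can be sketched as a mediator that holds no data and delegates to one wrapper per source. All class names, field names and sample records below are hypothetical; the sketch does not reproduce the architecture of DiscoveryLink, K2/Kleisli, TAMBIS or BACIIS.

```python
class Wrapper:
    """Maps mediator field names onto the native field names of one source."""

    def __init__(self, name, field_map, records):
        self.name = name
        self.field_map = field_map  # mediator field -> native field
        self.records = records      # stands in for a remote data source

    def can_answer(self, conditions):
        return all(field in self.field_map for field in conditions)

    def translate(self, conditions):
        """Rewrite a mediator-level query into the vocabulary of the source."""
        return {self.field_map[f]: v for f, v in conditions.items()}

    def run(self, conditions):
        native = self.translate(conditions)
        inverse = {v: k for k, v in self.field_map.items()}
        for record in self.records:
            if all(record.get(f, "").lower() == v.lower() for f, v in native.items()):
                # Translate the answer back into mediator field names.
                yield {inverse[f]: v for f, v in record.items() if f in inverse}


class Mediator:
    """Stores no data; offers one virtual view over all wrapped sources."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, **conditions):
        results = []
        for wrapper in self.wrappers:
            if wrapper.can_answer(conditions):
                for row in wrapper.run(conditions):
                    results.append({"source": wrapper.name, **row})
        return results


# Made-up sources that use different native field names.
protein_db = Wrapper(
    "protein_db",
    {"name": "DE", "organism": "OS"},
    [{"DE": "insulin", "OS": "Homo sapiens"}, {"DE": "insulin", "OS": "Mus musculus"}],
)
sequence_db = Wrapper(
    "sequence_db",
    {"name": "definition", "organism": "source"},
    [{"definition": "insulin", "source": "Homo sapiens"}],
)

mediator = Mediator([protein_db, sequence_db])
answers = mediator.query(name="insulin", organism="Homo sapiens")
```

The user states the query once in the mediator's vocabulary; each wrapper rewrites it for its own source, and the query always reaches the current contents of that source.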
DISCUSSIONS
The navigational approach is used in systems such as SRS and Entrez. It is popular for its ease of use, since it only involves pointing and clicking. This approach also allows the representation of cases in which the page containing the desired information is accessible only through particular navigation paths through other pages.
One weakness of the link-based approach is that the user must specify which data sources should be used to answer a particular query. Another weakness is that, when a user is interested in a join between two data sources in the system, the user must perform the join manually by clicking on each entry in the first data source and following all of its links to the second data source. This approach does not really integrate the data sources; it only offers users a means of retrieving information.
The data warehouse approach is somewhat different from the other two approaches. The warehouse copies the data from the data sources involved in the system, and a high-level query language such as SQL is used to query it.
The main advantage of the data warehouse approach is that system performance tends to be much better, because query optimization can be performed locally and the communication latency between data sources is removed. The approach is also reliable, because there is less dependency on network connectivity. Its most important advantage is that, since the underlying data sources may contain errors, the warehouse can maintain a separate, cleaned copy of the data.
The data warehouse approach also has its disadvantages. The reliability of the results and the overall maintenance of the system are questionable, as there is a chance that the results become outdated. Changes in the data sources mean that the data in the warehouse must also be changed, so changes in the data sources need to be detected and the updating of the data warehouse needs to be automated.
The mediator approach is based on translating user queries into queries that the data sources understand. This approach fits the description of heterogeneous integration, even though it only provides a virtual view of the databases.
The mediator approach has two variants, global-as-view (GAV) and local-as-view (LAV). In the GAV approach, the mediator relations are defined in terms of the source relations, which makes query reformulation easy, while in the LAV approach the source relations are defined in terms of the mediator schema. The drawback of GAV lies in adding or removing sources, since doing so requires the mediator schema to be modified. With LAV, adding or removing sources is much simpler; however, LAV makes query reformulation more complicated.
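The contrast between GAV and LAV can be made concrete with a deliberately simplified sketch. All source names, attribute names and records are hypothetical. In particular, the LAV part only checks whether a source description covers the query; real LAV reformulation relies on view-based query rewriting, which this sketch does not implement.

```python
# Made-up sources with different local attribute names.
SOURCES = {
    "src_a": [{"acc": "P1", "name": "insulin"}],
    "src_b": [{"id": "P2", "title": "hemoglobin"}],
}


# GAV: the global relation "protein" is defined as a view over the sources.
def gav_protein(sources):
    rows = [{"accession": r["acc"], "name": r["name"]} for r in sources["src_a"]]
    rows += [{"accession": r["id"], "name": r["title"]} for r in sources["src_b"]]
    return rows


def gav_query(sources, name):
    """Reformulation is plain unfolding of the view definition."""
    return [row for row in gav_protein(sources) if row["name"] == name]


# LAV: each source is described on its own, in terms of the global schema.
LAV_DESCRIPTIONS = {
    "src_a": {"relation": "protein", "attrs": {"accession": "acc", "name": "name"}},
    "src_b": {"relation": "protein", "attrs": {"accession": "id", "name": "title"}},
}


def lav_query(sources, descriptions, relation, attrs, where):
    """Decide at query time which source descriptions can answer the query."""
    needed = set(attrs) | set(where)
    results = []
    for source_name, desc in descriptions.items():
        if desc["relation"] != relation or not needed <= set(desc["attrs"]):
            continue  # this source cannot contribute to the query
        for record in sources[source_name]:
            row = {g: record[local] for g, local in desc["attrs"].items()}
            if all(row[g] == value for g, value in where.items()):
                results.append({g: row[g] for g in attrs})
    return results


# Adding a source under LAV only needs one more description ...
sources_plus = dict(SOURCES, src_c=[{"pid": "P3", "label": "insulin"}])
descriptions_plus = dict(
    LAV_DESCRIPTIONS,
    src_c={"relation": "protein", "attrs": {"accession": "pid", "name": "label"}},
)
lav_hits = lav_query(sources_plus, descriptions_plus, "protein", ["accession"], {"name": "insulin"})
# ... while under GAV the new source stays invisible until gav_protein is edited.
gav_hits = gav_query(sources_plus, "insulin")
```

In this toy setting the trade-off is visible: `gav_query` is trivial but `gav_protein` must be rewritten for every new source, whereas `lav_query` picks up `src_c` from its description alone at the cost of extra work at query time.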
Apart from these drawbacks, the mediator-based approach has more strengths than the other two approaches. Compared with the navigational approach, the mediator-based approach is much more advanced, because it retrieves information from the data sources rather than providing static links. The data warehouse approach works on a local server without involving the network, which gives it an advantage in performance, with no congestion or unavailability of services. However, the long time needed to update the warehouse from its data sources is a serious weakness. The mediator-based approach does not have this update problem, as the query goes directly to the original source. It can also be seen as more economical and efficient, since it integrates schemas or views instead of requiring storage for the data copied from all the data sources involved.
In general terms, the mediator-based approach can overcome the problem of data source heterogeneity by using metadata, in the form of a vocabulary or ontology, to represent domain knowledge explicitly. Ontologies have in fact already been used in a mediator-based system called Transparent Access to Multiple Bioinformatics Information Sources, or TAMBIS.
The TAMBIS system uses an ontology because it addresses the semantic aspect of heterogeneous data sources. TAMBIS has its own ontology, TaO, which contains about 2000 concepts that describe both molecular biology and bioinformatics tasks. The TAMBIS interface helps users navigate through the ontology to build queries. According to Wong, a TAMBIS query is built by starting from a concept and then navigating to related concepts and to the bioinformatics tasks that apply to them in the ontology.
An ontology (R. Stevens, C. A. Goble and S. Bechhofer, 2000) makes an integration system much easier to query. It is easier in the sense that users can pose logical and sensible queries within the domain of the biological data. An ontology makes the mediator approach far more useful than the other integration approaches for ad hoc queries.
However, despite the strengths of ontologies, there are also issues to be examined when they are applied in a mediator-based system. An ontology is very subjective and is only one way of representing a domain. Therefore, more ontologies are needed, so that users can choose the domain ontology they need to help them query the system.
An ontology that is already stored in a system may not describe the domain in the specific way a user needs. Alternative ontologies are necessary to let users express their point of view from different perspectives. A feature that allows users to add a new ontology must be built into the mediator-based system. In this way, new and better ontologies can enter the system and give the user more support in querying it.
Additional or alternative ontologies are also useful for the exploratory analysis of additional data. The user must also be able to choose the ontology with which to query the data sources.
SUGGESTIONS FOR FURTHER WORK
The proposal is to incorporate user-defined biological ontologies into the mediator-based scheme, which would help users query different data sources at an abstract level and in different contexts. Users can also adapt an existing ontology that is already in the system, or use existing external ontologies such as the Gene Ontology, the RiboWeb Ontology, the TAMBIS Ontology, the EcoCyc Ontology and the Schulze-Kremer Ontology.
Therefore, there is a need to combine ontologies within a database integration approach. Further work will focus on merging different ontologies to help users formulate queries. The steps to be taken in merging the ontologies are:
1) Ontology preprocessing
2) Concept selection
3) Similarity calculation
4) Hierarchy reconstruction
The approach above has previously been applied to ontologies based on WordNet. In our situation, however, biological ontologies will be used. Before merging, the user-defined ontology will be constructed using an ontology construction tool.
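The four merging steps listed above can be sketched in Python as follows. This is a minimal sketch under several assumptions: an ontology is reduced to a mapping from each concept to its parent, the similarity measure is plain string similarity from the standard `difflib` module (the source does not specify one), and the threshold value and sample concepts are made up.

```python
from difflib import SequenceMatcher


def preprocess(ontology):
    """Step 1: normalise concept labels (lower case, underscores to spaces)."""
    def norm(label):
        return label.lower().replace("_", " ").strip()
    return {norm(c): (norm(p) if p else None) for c, p in ontology.items()}


def select_concepts(a, b):
    """Step 2: choose the concept pairs to compare (here simply all cross pairs)."""
    return [(x, y) for y in b for x in a]


def similarity(x, y):
    """Step 3: string similarity in [0, 1]; an assumed stand-in measure."""
    return SequenceMatcher(None, x, y).ratio()


def merge(ontology_a, ontology_b, threshold=0.85):
    a, b = preprocess(ontology_a), preprocess(ontology_b)
    # Keep, for each concept of b, its best match in a if it is similar enough.
    best = {}
    for x, y in select_concepts(a, b):
        score = similarity(x, y)
        if score >= threshold and score > best.get(y, (0.0, None))[0]:
            best[y] = (score, x)
    alias = {y: x for y, (score, x) in best.items()}
    # Step 4: rebuild one hierarchy, attaching unmatched concepts of b
    # under the merged counterpart of their original parent.
    merged = dict(a)
    for concept, parent in b.items():
        if concept not in alias:
            merged[concept] = alias.get(parent, parent)
    return merged


# Made-up ontologies, given as concept -> parent (None marks the root).
ONT_A = {"Protein": None, "Enzyme": "Protein", "Kinase": "Enzyme"}
ONT_B = {"protein": None, "enzymes": "protein", "receptor": "protein",
         "protein_kinase": "enzymes"}

merged = merge(ONT_A, ONT_B)
```

Here "enzymes" is recognised as the same concept as "enzyme", so "protein kinase" is attached under "enzyme" in the rebuilt hierarchy. A purely lexical measure like this one would miss synonyms, which is why a semantic similarity measure would be needed in practice.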
The rebuilt ontologies are stored and used by the mediator to help the user build the query. The query is translated by the mediator and sent to the incorporated data sources. In this research, the incorporated data sources will be sources such as SWISS-PROT and GenBank. The result obtained with the merged ontology is compared with the results obtained using the ontologies individually. To start with, the level of abstraction is focused on the level of biological taxonomy.
CONCLUSION
The report begins with a description of the problem of heterogeneous biological data sources, and the following sections describe the approaches to integrating data sources. Three approaches have been presented; of these, the mediator approach has the advantage over the others in integrating biological data sources. The mediator approach is also one in which the integration system can offer ontology-based user assistance.
Specifically, a biological ontology could help users construct meaningful queries on the system. Alternative ontologies should also be included, which means that the user can add ontologies to the system. The combination of biological ontologies in a mediator approach requires further investigation. Using ontologies in this way would make the formulation of queries over heterogeneous data sources much more useful.
From this research, the Researcher concludes that much work has been performed using ontologies. Although a few tools exist for specific data sources, there is no fixed global tool for various data sources. The new trends in ontology are XML (XML-based integration) and space technology. However, there are still few investigations of hybrid ontology approaches. Finally, the Researcher concludes that ontology integration is possible using the methods described in this document, even though the research so far seems to be ad hoc.
The Researcher learned the concept of ontology and of integration using three database integration approaches. The Researcher has also given references to studies and ongoing research on the same approaches.
The Researcher wishes to thank the University for the opportunity to carry out this investigation and to learn from this work, and is also thankful to Mr. Cyril Anthony for motivating this work and contributing helpful comments.
REFERENCES
A.-P. Tsou, Y.-M. Sun, Chia-Lin, Liu, H.-D. Huang, J.-T. Horng and M.-F. Tsai, (2005), A Biological Data Warehousing System for Identifying Transcriptional Regulatory Sites from Gene Expressions of Microarray Data. Retrieved on May 12, 2010 from http://www.ncbi.nlm.nih.gov/pubmed/16871724
C. Goble, (2000), Supporting Web based Biology with Ontologies, IEEE. Retrieved on May 8, 2010 from http://ieeexplore.ieee.org/iel5/7174/19297/00892423.pdf
H. Kong, M. Hwang and P. Kim, (2005), A New Methodology for Merging the Heterogeneous Domain Ontologies based on the WordNet. Retrieved on May 9, 2010 from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.9332&rep=rep1&type=pdf
K. U. Sattler, I. Geist and E. Schallehn, (2004), Concept-based querying in mediator systems, VLDB Journal. Retrieved on May 12, 2010 from http://portal.acm.org/citation.cfm?id=1053474.1053480
L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice and W. C. Swope, (2001), DiscoveryLink: A system for integrated access to life science data sources, IBM Systems Journal. Retrieved on May 8, 2010 from http://domino.watson.ibm.com/tchjr/journalindex.nsf/2733206779564b3d85256bd500483abf/db0fd95986b44dce85256bfa00685da4!OpenDocument
L. Wong, (2002), Technologies for Integrating Biological Data, Briefings in Bioinformatics. Retrieved on May 14, 2010 from http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf
M. Kazemian, B. Moshiri, H. Nikhbakh and C. Lucas, (2005), Architecture for Biological Database Integration, Artificial Intelligence and Machine Learning 2005 Conference. Retrieved on May 9, 2010 from http://www.icgst.com/AIML05/conference/participants.html
N. W. Paton, R. Stevens, P. Baker, C. A. Goble, S. Bechhofer and A. Brass, (1999), Query Processing in the TAMBIS Bioinformatics Source Integration System, Scientific and Statistical Database Management. Retrieved on May 10, 2010 from http://portal.acm.org/citation.cfm?id=831544
P. Lambrix and V. Jakoniene, (2000), Towards transparent access to multiple biological databanks, Proceedings of the 1st Asia-Pacific Bioinformatics Conference. Retrieved on May 8, 2010 from http://portal.acm.org/citation.cfm?id=820196
R. Stevens, C. A. Goble and S. Bechhofer, (2000), Ontology-based Knowledge Representation for Bioinformatics. Retrieved on May 8, 2010 from http://www.ncbi.nlm.nih.gov/pubmed/11465057
S. B. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, C. Overton and C. Stoeckert, (2001), K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources, IBM Systems Journal. Deep Computing for the Life Sciences. Retrieved on May 8, 2010 from http://www.cis.upenn.edu/~db/sue/IBMSystemsJournal.pdf
T. Hernandez and S. Kambhampati, (2004), Integration of Biological Sources: Current Systems and Challenges Ahead. Retrieved on May 8, 2010 from http://rakaposhi.eas.asu.edu/biosurvey03.pdf
V. Honavar, C. Andorf, D. Caragea, A. Silvescu, J. Reinoso-Castillo and D. Dobbs, (2010), Ontology-Driven Information Extraction and Knowledge Acquisition from Heterogenous, Distributed, Autonomous Biological Data Sources. Retrieved on May 8, 2010 from http://www.cs.iastate.edu/~honavar/Papers/ijcaiworkshoppaper.pdf
Z. B. Miled, N. Li, G. L. Kellett, B. Sipes and O. Bukhres, (2002), Complex Life Science Multidatabase Queries. Retrieved on May 12, 2010 from http://ieeexplore.ieee.org/iel5/5/22437/01046954.pdf?arnumber=1046954