In this paper, we propose to investigate suitable methodologies and tools in the area of data management and analysis to develop a well-defined storage system for DNA microarray data. This system will allow laboratories to share microarray experimental results by linking related data from different public microarray sources, integrating them to provide a consistent view of the data to users, and resolving important issues in microarray data integration, such as:
- The lack of a common shared microarray ontology, which leads to naming conflicts when different names are used to represent the same information.
- Because microarray technology is in its infancy, microarray data sources tend to have dynamic data representations. The related components, which reflect these modifications, need to be updated; this currently requires considerable time and effort from database administrators.
The paper begins by describing current approaches to database interoperation in the bioinformatics community in Section 2. Section 3 presents our system architecture and its components. Section 4 discusses the schema matching technique applied in our architecture to construct a global schema. Finally, in Section 5 we give our concluding remarks.
2. RELATED WORK
The past work on database interoperation in the bioinformatics community has taken one of four technical approaches: 1) Hypertext navigation, 2) Data warehousing, 3) Unmediated multidatabase queries, and 4) Federated databases.
The first method, hypertext navigation, also called the indexed data sources approach, allows users to interactively navigate from a result of one query in one member database to a result in another database, using indexes and links between the two databases. This approach achieves a basic level of integration with minimal effort; however, it provides no mechanism to directly integrate data from relational databases, nor to perform data cleansing and transformation for complex data mining. Representative systems of this method include Entrez by NCBI (Benson, et al., 1994), a search and retrieval system that integrates information from databases at NCBI (National Center for Biotechnology Information) through PubMed, and SRS (Etzold and Argos, 1993), which provides an integrated browser interface and a basic query language for a range of important information sources.
The second method is the data warehousing approach, in which data from a set of heterogeneous databases are exported into a single database, called the data warehouse. Translators are needed to transform this exported data, which differs in format and conceptualisation, into the format and conceptualisation of the warehouse database. This method simplifies data access and querying, allows automated data mining, and extends quickly as new data sources are added, without affecting the original data source applications. On the other hand, the extraction, cleaning, transformation, and loading process can take considerable time and effort, which is the major drawback of data warehousing. TSIMMIS (Garcia-Molina, et al., 1995) and DataFoundry (Critchlow, et al., 2000) are examples of data warehousing systems.
The next approach is unmediated multidatabase queries, in which users construct complex queries that are evaluated against multiple heterogeneous databases. Generally, a query specifies both the set of databases it applies to and the tables and attributes (or classes and entities) to be queried within each database. This approach provides format and access transparency, but lacks schema transparency and reconciliation. The CPL/Kleisli project (Overton, et al., 1997) is a representative system; it provides integrated access to multiple data sources but does not present an integrated schema across the source databases. Hence, users are required to directly specify the rules and constraints involved in queries for integrated access to the databases, which implies that only users familiar with the details of the individual data sources can fully utilize the resource.
The last approach, federated databases, resembles the two previously mentioned approaches. It is similar to the data warehousing method in that it requires mappings between a single federated schema and the schemas of the member databases. Like the unmediated multidatabase queries approach, the member databases are not physically integrated within one database management system. The federated approach is the traditional approach within the computer science community for reasons such as an easily understandable architecture and a basic level of integration with minimal effort. Surprisingly, however, it has received little attention in the bioinformatics community; one explanation is that the federated approach is considered too complex to implement. The EasyQuery program (Shin, et al., 1997) from CyberConnect and P/FDM (Kemp and Gray, 1996) are representative systems that have applied the federated database method.
Many other data integration systems have applied one of these four technical approaches to interact with scientific data, for example the Object Protocol Model (OPM) (Chen and Markowitz, 1995) and TAMBIS (Paton, et al., 1999). However, none of these systems integrates information while taking into account dynamic data representations and the semantic conflicts caused by the lack of a common shared ontology.
3. MICROARRAY INTEGRATED INFORMATION MANAGEMENT ARCHITECTURE
- System Architecture Design
In our database design, we desire a system that meets the following requirements. Firstly, the system should provide a consistent view of microarray data to the user, allowing a user to pose a single query and receive a single unified answer. Secondly, a new microarray data source should be able to be wrapped and plugged in as it comes into existence. Thirdly, the system should support query optimization across multiple systems. Fourthly, the system should be able to deal with dynamic data representations, owing to the infancy of microarray technology. Lastly, the system should resolve the semantic conflicts and contradictions caused by the lack of commonly shared ontologies in the microarray community.
To serve these requirements, the system architecture will be based on a federated database approach. A global microarray schema is constructed to represent a virtual microarray database, combining microarray data from each participating microarray source into a single, consistent representation. Because the microarray community lacks a shared common ontology, an intelligent approach to information integration (Bergamaschi, et al., 1998) will be applied to create the microarray global schema; this method explicitly addresses the semantic conflicts and contradictions that such a lack causes. Queries posed against the microarray global schema will be translated into individual queries against the microarray source databases, and their results combined before being returned to the user.
Dynamic representations can be handled by modifying the wrapper to read the new format when the source schema undergoes minor changes, or by creating a new mediator interface when significant changes occur. Adding a new microarray data source is performed in two main steps: 1) mapping the new source's attributes to the microarray global schema attributes using the mapping rules, transformations, and database descriptions, and 2) creating the mediator interface to the new microarray data source.
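The two registration steps above can be sketched in Python as follows. The class names, attribute names, and mapping dictionaries are hypothetical illustrations, not actual system code; the real mapping rules and database descriptions are supplied by the DBA.

```python
# Hedged sketch of plugging a new microarray source into the federation.
# All names here (MediatorInterface, register_source, "NewChipDB", "sid",
# "species") are assumptions for illustration.

class MediatorInterface:
    """Routes global-schema requests to one wrapped source."""
    def __init__(self, source_name, attribute_map):
        self.source_name = source_name
        self.attribute_map = attribute_map  # global attribute -> source attribute

    def rewrite(self, global_attrs):
        """Rewrite a list of global attributes into source-level names,
        dropping attributes the source does not provide."""
        return [self.attribute_map[a] for a in global_attrs
                if a in self.attribute_map]

def register_source(mediators, source_name, attribute_map):
    """Step 1: map the new source's attributes to the global schema;
    Step 2: create the mediator interface for the new source."""
    mediators[source_name] = MediatorInterface(source_name, attribute_map)
    return mediators[source_name]

mediators = {}
m = register_source(mediators, "NewChipDB",
                    {"Sample_ID": "sid", "Organism": "species"})
```

With this sketch, a global request for attributes the new source lacks simply omits them, leaving reconciliation to the mediator.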
- System components
The components of our database architecture can be seen in Fig 1. It consists of the following parts.
Wrappers. They are placed on top of the information sources and are responsible for translating the schema of each source into the Microarray global object model language, which is written in the form of an ontology representation language. The wrapper also translates queries expressed in the global object model language into local requests executable by the query processor of the corresponding source.
Mediator. It is composed of three modules: the Microarray global schema builder, the Query manager, and the Mediator class.
The Microarray global schema builder processes and integrates Microarray global object model language descriptions received from wrappers to derive the integrated representation of the information sources. This module will involve the schema matching approach as explained below.
The Query manager performs query processing and optimization. In particular, it takes a Microarray global object model language query formulated by the user on the global schema and translates it into different sub-queries, one for each involved local source.
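The sub-query generation step of the Query manager can be sketched as follows. The query representation (a plain list of requested global attributes) is a simplifying assumption; real queries would also carry predicates and joins.

```python
# Hedged sketch of decomposing a global query into per-source sub-queries.
# The source names and attribute maps are illustrative assumptions.

def split_query(requested_attrs, source_maps):
    """source_maps: source name -> {global attribute: source attribute}.
    Returns one sub-query (a list of source-level attributes) for each
    source that can answer part of the request."""
    sub_queries = {}
    for source, amap in source_maps.items():
        attrs = [amap[a] for a in requested_attrs if a in amap]
        if attrs:  # skip sources that contribute nothing
            sub_queries[source] = attrs
    return sub_queries

subs = split_query(
    ["Sample_ID", "Strain"],
    {"ChipDB": {"Sample_ID": "Sample_ID", "Strain": "Strain"},
     "RAD": {"Sample_ID": "sample_id"}})
```

The mediator would then execute each sub-query through the corresponding wrapper and combine the partial results before returning them to the user.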
Fig.1. Architecture of the Information Management for Microarray Experimental
The Mediator class is a particular module for resolving problems that arise due to adding a new microarray data source to a system. It consists of two parts, the Mediator interface and the Transformation call.
The Transformation call is an important part of incorporating a new data source. The DBA is required to describe the data source, to map source attributes to corresponding global schema attributes, and to convert between different representations of the same characteristic. Once the data transformation has been performed, the Mediator interface for the new microarray data source will be created.
- Example of schema integration
Consider the representations shown in Fig 2(a) and (b). They both include Sample, Experimental sample, Treatment, and Researcher, although occasionally under different names. The first also contains Strains, while the second includes Labels, Hybridization condition, Control Gene, and Experimental Control Gene. If these concepts are overlaid, the resulting composite representation is shown in Fig 2(c). While this is a reasonable representation of the concept, problems may arise in practice due to the implicit relationships between attributes from different data sources. This type of issue is common in both business and scientific domains; the important distinction is that, while in business there is a single correct value, this is not always the case in scientific domains. Here, we will use intelligent techniques for the extraction and integration of heterogeneous information to resolve these problems, as explained below.
4. SCHEMA MATCHING APPROACH
The matching approach to our schema integration system can be described in the following phases.
Evaluation of schema class affinity. This step evaluates the level of affinity between schema classes for subsequent integration. In conventional databases, two classes are identical if all of their attributes are identical. Therefore, we measure the affinity of schema classes based on their attributes, defining affinity coefficients in the range 0 to 1. Here, we evaluate similarity at the level of the schema, not of instance data; the available schema information includes the usual properties of schema elements, such as name, description, and constraints. We compute a best match between the attributes of two schema classes in a greedy manner (Parberry, 1995). Our algorithm (see Table 1 below) first finds the pair of corresponding attributes between the two schema classes that are most similar; those attributes are then removed. These steps are repeated until only a single pair of corresponding attributes remains.
The degree of similarity between two schema classes C1 and C2, denoted Sim(C1, C2), is the similarity coefficient of the last pair of corresponding attributes.
Here, we consider two attributes from different data sources similar only if their similarity coefficient exceeds a threshold (θ = 0.25).
Table 1 The pseudo code of our algorithm
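The greedy matching step given as pseudo code in Table 1 can be rendered in Python as follows. The attribute names and similarity values below are illustrative stand-ins, chosen so that the class similarity works out to 0.8 as in the worked example; they are not the actual Table 2 entries.

```python
# Hedged sketch of the greedy attribute matching of Table 1.
# sim maps (attribute of C1, attribute of C2) -> coefficient in [0, 1].

def greedy_match(sim, threshold=0.25):
    """Greedily pair the most similar attributes of two schema classes.
    Returns the matched pairs and Sim(C1, C2), defined as the similarity
    coefficient of the last pair matched above the threshold."""
    remaining = dict(sim)
    matches = []
    while remaining:
        # pick the most similar remaining pair of attributes
        (a1, a2), s = max(remaining.items(), key=lambda kv: kv[1])
        if s <= threshold:  # no remaining pair is similar enough
            break
        matches.append(((a1, a2), s))
        # remove every entry that reuses either matched attribute
        remaining = {(x, y): v for (x, y), v in remaining.items()
                     if x != a1 and y != a2}
    class_sim = matches[-1][1] if matches else 0.0
    return matches, class_sim

# Illustrative similarity matrix between two 'Sample' classes:
sim = {
    ("Sample_ID", "SampleID"): 0.95,
    ("Sample_ID", "Organism"): 0.10,
    ("Organism", "SampleID"): 0.05,
    ("Organism", "Organism"): 0.90,
    ("Treatment", "Treat"): 0.80,
}
pairs, s = greedy_match(sim)  # s is 0.8 in this constructed example
```

Removing both attributes of each matched pair guarantees termination, and the similarity of the last pair matched above θ = 0.25 becomes Sim(C1, C2).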
Consider the similarity matrix shown in Table 2. By following the steps in Table 1, we can calculate the similarity measure between the Samples and Sample classes from different data sources. Tables 3 and 4 show the results of each step of our algorithm. From Table 4, we obtain 0.8 as the similarity measure between the Sample and Samples classes. The affinity coefficients for all possible pairs of classes are computed by the algorithm described above and kept in a matrix M.
Cluster generation. In this phase, we use the technique of Bergamaschi, et al. (1998), based on hierarchical clustering, to group classes with affinity together.
Table 2 Similarity matrix of the partial attributes of two schema classes (the Samples and Sample classes from the ChipDB and RAD public microarray databases, respectively)
Table 3 The result of an implementation in the first phase
Table 4 The result of an implementation in the second, third, and fourth phases
It starts by placing each class in a cluster by itself; then, iteratively, the two clusters having the greatest affinity coefficient in M are merged. With each merging operation, a newly defined cluster is obtained and M is updated: the affinity values between the newly defined cluster and each remaining cluster are computed. The result of clustering is an affinity tree.
Mediator schema generation. Unification of the affinity clusters leads to the construction of the global schema of the mediator. An integrated class is defined for each cluster; it is representative of all the cluster's classes and is characterized by the union of their attributes. The global schema for the analysed sources is composed of all integrated classes derived from the clusters, and is the basis for posing queries against the sources. For example, the attribute unification process for the cluster of 'Sample and Samples' produces the following set of global 'Sample' attributes:
Global Sample = (Sample_ID, Grow_condition, Strain, Researcher, Organism, Treatment, Dev_stage, Sex, Modification_date)
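The clustering phase above can be sketched in Python as follows. The linkage rule (taking the maximum affinity between cluster members when updating M) and the class names and affinity values are assumptions for illustration; they are one plausible reading of the technique, not the exact procedure of Bergamaschi, et al. (1998).

```python
# Hedged sketch of affinity-based agglomerative clustering of schema classes.
# affinity maps frozenset({class1, class2}) -> coefficient in [0, 1].

def cluster_classes(affinity, threshold=0.25):
    """Each class starts in its own cluster; the two clusters with the
    greatest affinity are merged until no pair exceeds the threshold."""
    classes = set()
    for pair in affinity:
        classes |= pair
    clusters = [frozenset({c}) for c in classes]

    def link(c1, c2):
        # assumed update rule: maximum affinity over member pairs
        return max(affinity.get(frozenset({a, b}), 0.0)
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        best = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda p: link(*p))
        if link(*best) <= threshold:
            break
        clusters.remove(best[0])
        clusters.remove(best[1])
        clusters.append(best[0] | best[1])
    return clusters

# Illustrative affinity values (not the actual matrix M):
affinity = {
    frozenset({"Sample", "Samples"}): 0.8,
    frozenset({"Treatment", "Treatments"}): 0.7,
    frozenset({"Sample", "Treatment"}): 0.1,
}
result = cluster_classes(affinity)
```

Each resulting cluster then yields one integrated class whose attribute set is the union of its members' attributes.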
The next step is to build a mapping table that relates the attributes of the global class to the attributes of the classes in the associated cluster. A mapping table between the global class and the 'Sample and Samples' cluster can be seen in Table 2.
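The shape of such a mapping table can be sketched as follows. The source class and attribute names below are assumptions for illustration, not copied from the ChipDB or RAD schemas; a None entry marks a global attribute that a source does not provide.

```python
# Illustrative mapping table for the global 'Sample' class:
# global attribute -> {source class: source attribute or None}.
mapping_table = {
    "Sample_ID": {"ChipDB.Samples": "Sample_ID", "RAD.Sample": "sample_id"},
    "Organism":  {"ChipDB.Samples": "Organism",  "RAD.Sample": "organism"},
    "Strain":    {"ChipDB.Samples": "Strain",    "RAD.Sample": None},
    "Dev_stage": {"ChipDB.Samples": None,        "RAD.Sample": "dev_stage"},
}

def translate(global_attr, source_class):
    """Resolve a global attribute to its source-level name, if any."""
    return mapping_table.get(global_attr, {}).get(source_class)
```

At query time, the mediator consults this table to rewrite global attribute names into the names used by each member database.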
5. CONCLUSION
Microarray technology is a high-throughput method for obtaining gene expression data from thousands of genes simultaneously. It is helpful to researchers in many fields, such as cancer research and toxicology. With the huge amount of data generated by different microarray experiments, effective data storage is urgently needed. Current efforts in designing data storage are focused on a more traditional approach, whereas the stored information should also be able to be shared with other laboratories and combined with other experimental results. To achieve this, three main tasks will be implemented: the development of a commonly shared microarray ontology, the implementation of schema matching to map elements between different data sources, and the construction of an integrated microarray database.
This research aims to apply methodologies and techniques such as ontologies, schema matching, greedy algorithms, and database integration to develop a standardized and interoperable microarray database, collecting and representing data from several microarray databases in a unified format. It will allow users to browse, search, and perform complex queries on microarray data, facilitating the sharing of information between different laboratories and the combination of data with other experimental results.
Fig 2: Simple schema integration
REFERENCES
Etzold, T. and P. Argos (1993). SRS – an indexing and retrieval tool for flat file data libraries. Computer Applications in the Biosciences, 9(1): 49-57.
Benson, D. A., M. Boguski, D. J. Lipman, and J. Ostell (1994). Genbank. Nucleic Acids Research, 22:3441-3444.
Chen, A., and V. Markowitz (1995). An overview of the object protocol model (OPM) and the OPM data management tools. Inform. Syst., 20(5).
Garcia-Molina, H., J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom (1995). Integrating and accessing heterogeneous information sources in TSIMMIS. Proc. AAAI Symp. Information Gathering, Stanford, CA, pp. 61-64.
Kemp, G., and P. Gray (1996). Using the Functional Data Model to Integrate Distributed Biological Data Sources. In P. Svensson and J. French, editors, Proc. SSDBM: 176-185. IEEE Press.
Overton, G. C., S. B. Davidson, and P. Buneman (1997). Database transformations for biological applications. In DOE HGP Contractor-Grantee Workshop VI Santa Fe, NM.
Shin, D. G., et al. (1997). Graphical ad hoc query interfaces for federated genome databases. Computer Science & Engineering, University of Connecticut, Storrs, CT. In DOE HGP Contractor-Grantee Workshop VI, Santa Fe, NM.
Bergamaschi, S., S. Castano, S. De Capitani di Vimercati, S. Montanari, M. Vincini (1998). An Intelligent Approach to Information Integration. In International Conference on Formal Ontology in Information Systems (FOIS'98), Trento, Italy.
Paton, N.W., R. Stevens, P. Baker, C. A. Goble, S. Bechhofer, and A. Brass (1999). Query Processing in the TAMBIS Bioinformatics Source Integration System. In Proc. SSDBM: 138-147. IEEE Press.
Brazma, A., A. Robinson, G. Cameron, and M. Ashburner (2000). One-stop shop for microarray data. Nature, 403: 699-700.
Critchlow, T., K. Fidelis, M. Ganesh, R. Musick, and T. Slezak (2000). DataFoundry: information management for scientific data. IEEE Transactions on Information Technology in Biomedicine, 4(1): 52-57.
Muhlrad, P. (2001). DNA microarray technology to identify genes controlling spermatogenesis. Available from http://www.mcb.arizona.edu/wardlab/microarray.html, accessed on 27-August-2002.
Altruis Biomedical Network (2002). The Web's Premier Site For DNA Arrays, Available from http://www.dna-arrays.com, accessed on 27-August-2002.
Murphy, D. (2002). Gene expression studies using microarrays: principles, problems, and prospects. Advances in Physiology Education, 26(4).