In this paper, we propose to investigate suitable methodologies and tools in the area of data management and analysis to develop a well-defined storage system for DNA microarray data. This system will allow laboratories to share microarray experimental results by linking related data from different public microarray sources, integrating them to provide a consistent view of the data to users, and resolving important issues in microarray data integration, such as:
- The lack of a common shared microarray ontology, which leads to naming conflicts when different names are used to represent the same information.
- Because microarray technology is in its infancy, microarray data sources tend to have dynamic data representations. The related components, which reflect these modifications, need to be updated; this currently requires considerable time and effort from database administrators.
The paper begins by describing current approaches to database interoperation in the bioinformatics community in Section 2. Section 3 presents our system architecture and its components. Section 4 discusses the schema matching technique applied in our architecture to construct a global schema. Finally, in Section 5 we give our concluding remarks.
2. RELATED WORK
The past work on database interoperation in the bioinformatics community has taken one of four technical approaches: 1) Hypertext navigation, 2) Data warehousing, 3) Unmediated multidatabase queries, and 4) Federated databases.
The first method, hypertext navigation, also called the indexed data sources approach, allows users to interactively navigate from a result of one query in one member database to a result in another database, using indexes and links between the two databases. This approach achieves a basic level of integration with minimal effort; however, it provides no mechanism to directly integrate data from relational databases, nor to perform data cleansing and transformation for complex data mining. Representative systems of this method include Entrez by NCBI (Benson, et al., 1994), a search and retrieval system that integrates information from databases at NCBI (National Center for Biotechnology Information) through PubMed, and SRS (Etzold and Argos, 1993), which provides an integrated browser interface and a basic query language for a range of important information sources.
The second method is the data warehousing approach, in which data from a set of heterogeneous databases are exported into a single database, called the data warehouse. Translators are needed to transform this exported data, which differs in format and conceptualisation, into the format and conceptualisation of the warehouse database. This method simplifies data access and querying, allows automated data mining, and extends quickly as new data sources are added, without affecting the original data source applications. On the other hand, the extraction, cleaning, transformation, and loading process can take considerable time and effort, which is the major drawback of data warehousing. TSIMMIS (Garcia-Molina, et al., 1995) and DataFoundry (Critchlow, et al., 2000) are examples of data warehousing systems.
The next approach is unmediated multidatabase queries, in which users construct complex queries that are evaluated against multiple heterogeneous databases. Generally, a query specifies both the set of databases it applies to and the tables and attributes (or classes and entities) to be queried within each database. This approach provides format and access transparency, but lacks schema transparency and reconciliation. The CPL/Kleisli project (Overton, et al., 1997) is a representative system; it provides integrated access to multiple data sources but does not present an integrated schema across the source databases. Hence, users are required to directly specify the rules and constraints involved in queries for integrated access to the databases, which implies that only users familiar with the details of the individual data sources can fully utilize the resource.
The last approach, federated databases, resembles the two previously mentioned approaches. It is similar to the data warehousing method in that it requires mappings between a single federated schema and the schemas of the member databases. Like the unmediated multidatabase queries approach, the member databases are not physically integrated within one database management system. The federated approach is the traditional approach within the computer science community for reasons such as an easily understandable architecture and a basic level of integration with minimal effort. Surprisingly, however, it has received little attention in the bioinformatics community; one explanation is that the federated approach is considered too complex to implement. The EasyQuery program (Shin, et al., 1997) from CyberConnect and P/FDM (Kemp and Gray, 1996) are representative systems that have applied the federated database method.
Many other data integration systems have applied one of these four technical approaches to interact with scientific data, for example the Object Protocol Model (OPM) (Chen and Markowitz, 1995) and TAMBIS (Paton, et al., 1999). However, none of these systems integrates information while taking into account dynamic data representations and the semantic conflicts caused by the lack of a common shared ontology.
3. MICROARRAY INTEGRATED INFORMATION MANAGEMENT ARCHITECTURE
- System Architecture Design
In our database design, we desire a system that meets the following requirements. Firstly, the system should provide a consistent view of microarray data to the user, allowing a user to pose a single query and receive a single unified answer. Secondly, a new microarray data source should be able to be wrapped and plugged in as it comes into existence. Thirdly, the system should support query optimization across multiple systems. Fourthly, the system should be able to deal with dynamic data representations, owing to the infancy of microarray technology. Lastly, the system should resolve the semantic conflicts and contradictions caused by the lack of commonly shared ontologies in the microarray community.
To serve these requirements, the system architecture will be based on a federated database approach. A global microarray schema is constructed to represent a virtual microarray database, combining microarray data from each participating microarray source into a single, consistent representation. Because the microarray community lacks a shared common ontology, an intelligent approach to information integration (Bergamaschi, et al., 1998) will be applied to create the microarray global schema; this method explicitly addresses the semantic conflicts and contradictions that such a lack causes. Queries posed against the microarray global schema will be translated into individual queries against the microarray source databases, and their results combined before being returned to the user.
Dynamic representations can be handled by modifying the wrapper to read the new format when the source schema undergoes minor changes, or by creating a new mediator interface when significant changes occur. Adding a new microarray data source is performed in two main steps: 1) mapping the new source's attributes to the microarray global schema attributes using the mapping rules, transformations, and database descriptions, and 2) creating the mediator interface to the new microarray data source.
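The two registration steps above can be sketched in Python as follows. The class names, attribute names, and mapping dictionaries are hypothetical illustrations, not actual system code; the real mapping rules and database descriptions are supplied by the DBA.

```python
# Hedged sketch of plugging a new microarray source into the federation.
# All names here (MediatorInterface, register_source, "NewChipDB", "sid",
# "species") are assumptions for illustration.

class MediatorInterface:
    """Routes global-schema requests to one wrapped source."""
    def __init__(self, source_name, attribute_map):
        self.source_name = source_name
        self.attribute_map = attribute_map  # global attribute -> source attribute

    def rewrite(self, global_attrs):
        """Rewrite a list of global attributes into source-level names,
        dropping attributes the source does not provide."""
        return [self.attribute_map[a] for a in global_attrs
                if a in self.attribute_map]

def register_source(mediators, source_name, attribute_map):
    """Step 1: map the new source's attributes to the global schema;
    Step 2: create the mediator interface for the new source."""
    mediators[source_name] = MediatorInterface(source_name, attribute_map)
    return mediators[source_name]

mediators = {}
m = register_source(mediators, "NewChipDB",
                    {"Sample_ID": "sid", "Organism": "species"})
```

With this sketch, a global request for attributes the new source lacks simply omits them, leaving reconciliation to the mediator.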
- System components
The components of our database architecture can be seen in Fig 1. It consists of the following parts.
Wrappers. They are placed on top of the information sources and are responsible for translating the schema of each source into the Microarray global object model language, which is written in the form of an ontology representation language. The wrapper also translates queries expressed in the global object model language into local requests executable by the query processor of the corresponding source.
Mediator. It is composed of three modules: the Microarray global schema builder, the Query manager, and the Mediator class.
The Microarray global schema builder processes and integrates Microarray global object model language descriptions received from wrappers to derive the integrated representation of the information sources. This module will involve the schema matching approach as explained below.
The Query manager performs query processing and optimization. In particular, it takes a Microarray global object model language query formulated by the user on the global schema and translates it into different sub-queries, one for each involved local source.
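The sub-query generation step of the Query manager can be sketched as follows. The query representation (a plain list of requested global attributes) is a simplifying assumption; real queries would also carry predicates and joins.

```python
# Hedged sketch of decomposing a global query into per-source sub-queries.
# The source names and attribute maps are illustrative assumptions.

def split_query(requested_attrs, source_maps):
    """source_maps: source name -> {global attribute: source attribute}.
    Returns one sub-query (a list of source-level attributes) for each
    source that can answer part of the request."""
    sub_queries = {}
    for source, amap in source_maps.items():
        attrs = [amap[a] for a in requested_attrs if a in amap]
        if attrs:  # skip sources that contribute nothing
            sub_queries[source] = attrs
    return sub_queries

subs = split_query(
    ["Sample_ID", "Strain"],
    {"ChipDB": {"Sample_ID": "Sample_ID", "Strain": "Strain"},
     "RAD": {"Sample_ID": "sample_id"}})
```

The mediator would then execute each sub-query through the corresponding wrapper and combine the partial results before returning them to the user.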
Fig.1. Architecture of the Information Management for Microarray Experimental
The Mediator class is a particular module for resolving problems that arise due to adding a new microarray data source to a system. It consists of two parts, the Mediator interface and the Transformation call.
The Transformation call is an important part of incorporating a new data source. The DBA is required to describe the data source, to map source attributes to corresponding global schema attributes, and to convert between different representations of the same characteristic. Once the data transformation has been performed, the Mediator interface for the new microarray data source will be created.
- Example of schema integration
Consider the representations shown in Fig 2(a) and (b). They both include Sample, Experimental sample, Treatment, and Researcher, although occasionally under different names. The first also contains Strains, while the second includes Labels, Hybridization condition, Control Gene, and Experimental Control Gene. If these concepts are overlaid, the resulting composite representation is shown in Fig 2(c). While this is a reasonable representation of the concept, problems may arise in practice due to the implicit relationships between attributes from different data sources. This type of issue is common in both business and scientific domains; the important distinction is that, while in business there is a single correct value, this is not always the case in scientific domains. Here, we will use intelligent techniques for the extraction and integration of heterogeneous information to resolve these problems, as explained below.
4. SCHEMA MATCHING APPROACH
The matching approach to our schema integration system can be described in the following phases.
Evaluation of schema class affinity. This step evaluates the level of affinity between schema classes for subsequent integration. In conventional databases, two classes are identical if all of their attributes are identical. Therefore, we measure the affinity of schema classes based on their attributes, defining affinity coefficients in the range 0 to 1. Here, we evaluate similarity at the level of the schema, not of instance data; the available schema information includes the usual properties of schema elements, such as name, description, and constraints. We compute a best match between the attributes of two schema classes in a greedy manner (Parberry, 1995). Our algorithm (see Table 1 below) first finds the pair of corresponding attributes between the two schema classes that are most similar; those attributes are then removed. These steps are repeated until only a single pair of corresponding attributes remains.
The degree of similarity between two schema classes C1 and C2, denoted Sim(C1, C2), is the similarity coefficient of the last pair of corresponding attributes.
Here, we consider two attributes from different data sources similar only if their similarity coefficient exceeds a threshold (θ = 0.25).
Table 1 The pseudo code of our algorithm
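The greedy matching step given as pseudo code in Table 1 can be rendered in Python as follows. The attribute names and similarity values below are illustrative stand-ins, chosen so that the class similarity works out to 0.8 as in the worked example; they are not the actual Table 2 entries.

```python
# Hedged sketch of the greedy attribute matching of Table 1.
# sim maps (attribute of C1, attribute of C2) -> coefficient in [0, 1].

def greedy_match(sim, threshold=0.25):
    """Greedily pair the most similar attributes of two schema classes.
    Returns the matched pairs and Sim(C1, C2), defined as the similarity
    coefficient of the last pair matched above the threshold."""
    remaining = dict(sim)
    matches = []
    while remaining:
        # pick the most similar remaining pair of attributes
        (a1, a2), s = max(remaining.items(), key=lambda kv: kv[1])
        if s <= threshold:  # no remaining pair is similar enough
            break
        matches.append(((a1, a2), s))
        # remove every entry that reuses either matched attribute
        remaining = {(x, y): v for (x, y), v in remaining.items()
                     if x != a1 and y != a2}
    class_sim = matches[-1][1] if matches else 0.0
    return matches, class_sim

# Illustrative similarity matrix between two 'Sample' classes:
sim = {
    ("Sample_ID", "SampleID"): 0.95,
    ("Sample_ID", "Organism"): 0.10,
    ("Organism", "SampleID"): 0.05,
    ("Organism", "Organism"): 0.90,
    ("Treatment", "Treat"): 0.80,
}
pairs, s = greedy_match(sim)  # s is 0.8 in this constructed example
```

Removing both attributes of each matched pair guarantees termination, and the similarity of the last pair matched above θ = 0.25 becomes Sim(C1, C2).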
Consider the similarity matrix shown in Table 2. By following the steps in Table 1, we can calculate the similarity measure between the Samples and Sample classes from different data sources. Tables 3 and 4 show the results of each step of our algorithm. From Table 4, we obtain 0.8 as the similarity measure between the Sample and Samples classes. The affinity coefficients for all possible pairs of classes are computed by the algorithm described above and kept in a matrix M.
Cluster generation. In this phase, we use the technique of Bergamaschi, et al. (1998), based on hierarchical clustering, to group classes with affinity together.
Table 2 Similarity matrix of the partial attributes of two schema classes (the Samples and Sample classes from the ChipDB and RAD public microarray databases, respectively)
Table 3 The result of an implementation in the first phase
Table 4 The result of an implementation in the second, third, and fourth phases
It starts by placing each class in a cluster by itself; then, iteratively, the two clusters having the greatest affinity coefficient in M are merged. With each merging operation, a newly defined cluster is obtained and M is updated: the affinity values between the newly defined cluster and each remaining cluster are computed. The result of clustering is an affinity tree.
Mediator schema generation. Unification of the affinity clusters leads to the construction of the global schema of the mediator. An integrated class is defined for each cluster; it is representative of all the cluster's classes and is characterized by the union of their attributes. The global schema for the analysed sources is composed of all integrated classes derived from the clusters, and is the basis for posing queries against the sources. For example, the attribute unification process for the cluster of 'Sample and Samples' produces the following set of global 'Sample' attributes:
Global Sample = (Sample_ID, Grow_condition, Strain, Researcher, Organism, Treatment, Dev_stage, Sex, Modification_date)
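The clustering phase above can be sketched in Python as follows. The linkage rule (taking the maximum affinity between cluster members when updating M) and the class names and affinity values are assumptions for illustration; they are one plausible reading of the technique, not the exact procedure of Bergamaschi, et al. (1998).

```python
# Hedged sketch of affinity-based agglomerative clustering of schema classes.
# affinity maps frozenset({class1, class2}) -> coefficient in [0, 1].

def cluster_classes(affinity, threshold=0.25):
    """Each class starts in its own cluster; the two clusters with the
    greatest affinity are merged until no pair exceeds the threshold."""
    classes = set()
    for pair in affinity:
        classes |= pair
    clusters = [frozenset({c}) for c in classes]

    def link(c1, c2):
        # assumed update rule: maximum affinity over member pairs
        return max(affinity.get(frozenset({a, b}), 0.0)
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        best = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda p: link(*p))
        if link(*best) <= threshold:
            break
        clusters.remove(best[0])
        clusters.remove(best[1])
        clusters.append(best[0] | best[1])
    return clusters

# Illustrative affinity values (not the actual matrix M):
affinity = {
    frozenset({"Sample", "Samples"}): 0.8,
    frozenset({"Treatment", "Treatments"}): 0.7,
    frozenset({"Sample", "Treatment"}): 0.1,
}
result = cluster_classes(affinity)
```

Each resulting cluster then yields one integrated class whose attribute set is the union of its members' attributes.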
The next step is to build a mapping table that relates the attributes of the global class to the attributes of the classes in the associated cluster. A mapping table between the global class and the 'Sample and Samples' cluster can be seen in Table 2.
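The shape of such a mapping table can be sketched as follows. The source class and attribute names below are assumptions for illustration, not copied from the ChipDB or RAD schemas; a None entry marks a global attribute that a source does not provide.

```python
# Illustrative mapping table for the global 'Sample' class:
# global attribute -> {source class: source attribute or None}.
mapping_table = {
    "Sample_ID": {"ChipDB.Samples": "Sample_ID", "RAD.Sample": "sample_id"},
    "Organism":  {"ChipDB.Samples": "Organism",  "RAD.Sample": "organism"},
    "Strain":    {"ChipDB.Samples": "Strain",    "RAD.Sample": None},
    "Dev_stage": {"ChipDB.Samples": None,        "RAD.Sample": "dev_stage"},
}

def translate(global_attr, source_class):
    """Resolve a global attribute to its source-level name, if any."""
    return mapping_table.get(global_attr, {}).get(source_class)
```

At query time, the mediator consults this table to rewrite global attribute names into the names used by each member database.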
5. CONCLUSION
Microarray technology is a high-throughput method for obtaining gene expression data from thousands of genes simultaneously. It is helpful to researchers in many fields, such as cancer research and toxicology. With the huge amount of data generated by different microarray experiments, effective data storage is urgently needed. Current efforts in designing data storage are focused on a more traditional approach, whereas the stored information should also be able to be shared with other laboratories and combined with other experimental results. To achieve this, three main tasks will be implemented: the development of a commonly shared microarray ontology, the implementation of schema matching to map elements between different data sources, and the construction of an integrated microarray database.
This research aims to apply methodologies and techniques such as ontologies, schema matching, greedy algorithms, and database integration to develop a standardized and interoperable microarray database, collecting and representing data from several microarray databases in a unified format. It will allow users to browse, search, and perform complex queries on microarray data, facilitating the sharing of information between different laboratories and the combination of data with other experimental results.
Fig 2: Simple schema integration
REFERENCES
Etzold, T. and P. Argos (1993). SRS – an indexing and retrieval tool for flat file data libraries. Computer Applications in the Biosciences, 9(1): 49-57.
Benson, D. A., M. Boguski, D. J. Lipman, and J. Ostell (1994). Genbank. Nucleic Acids Research, 22:3441-3444.
Chen, A., and V. Markowitz (1995). An overview of the object protocol model (OPM) and the OPM data management tools. Inform. Syst., 20(5).
Garcia-Molina, H., J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom (1995). Integrating and accessing heterogeneous information sources in TSIMMIS. Proc. AAAI Symp. Information Gathering, Stanford, CA, pp. 61-64.
Kemp, G., and P. Gray (1996). Using the Functional Data Model to Integrate Distributed Biological Data Sources. In P. Svensson and J. French, editors, Proc. SSDBM: 176-185. IEEE Press.
Overton, G. C., S. B. Davidson, and P. Buneman (1997). Database transformations for biological applications. In DOE HGP Contractor-Grantee Workshop VI Santa Fe, NM.
Shin, D. G., et al. (1997). Graphical ad hoc query interfaces for federated genome databases. Computer Science & Engineering, University of Connecticut, Storrs, CT. In DOE HGP Contractor-Grantee Workshop VI, Santa Fe, NM.
Bergamaschi, S., S. Castano, S. De Capitani di Vimercati, S. Montanari, M. Vincini (1998). An Intelligent Approach to Information Integration. In International Conference on Formal Ontology in Information Systems (FOIS'98), Trento, Italy.
Paton, N.W., R. Stevens, P. Baker, C. A. Goble, S. Bechhofer, and A. Brass (1999). Query Processing in the TAMBIS Bioinformatics Source Integration System. In Proc. SSDBM: 138-147. IEEE Press.
Brazma, A., A. Robinson, G. Cameron, and M. Ashburner (2000). One-stop shop for microarray data. Nature, 403: 699-700.
Critchlow, T., K. Fidelis, M. Ganesh, R. Musick, and T. Slezak (2000). DataFoundry: information management for scientific data. IEEE Transactions on Information Technology in Biomedicine, 4(1): 52-57.
Muhlrad, P. (2001). DNA microarray technology to identify genes controlling spermatogenesis. Available from http://www.mcb.arizona.edu/wardlab/microarray.html, accessed on 27-August-2002.
Altruis Biomedical Network (2002). The Web's Premier Site For DNA Arrays, Available from http://www.dna-arrays.com, accessed on 27-August-2002.
Murphy, D. (2002). Gene expression studies using microarrays: principles, problems, and prospects. Advances in Physiology Education, 26(4).