Issues and Applications of Grid Computing
A thesis submitted in partial fulfillment of the requirements for the degree of
Bachelor of Science (Computer Science)
The intention of this study is to analyze and explore the emerging field of grid technology. It delves into how the grid is being used to enhance the capabilities of existing distributed systems and data resources. The characteristics of virtual organizations and their participation in implementing a grid structure are observed. The issues surfacing in grid implementation and their possible solutions are discussed. Enhancements and modifications are proposed for existing frameworks for database integration with the grid. A basic grid structure for the Department of Computer Science, University of Karachi has been planned out. The Globus Toolkit, which is used as grid middleware, is tested and run on available resources.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
FUNDAMENTALS OF GRID COMPUTING
GRID APPLICATIONS
THE GRID ARCHITECTURE
ISSUES IN GRID COMPUTING
DATABASES AND THE GRID
PROPOSED GRID DEVELOPMENT FOR THE UNIVERSITY OF KARACHI
OVERVIEW OF GRAM
OVERVIEW OF MDS
OVERVIEW OF GRIDFTP
STARTING GRAM
LIST OF FIGURES
Figure 1.1: Virtual Organizations
Figure 2.2: The Compact Muon Solenoid Experiment
Figure 2.3: An I-WAY Point of Presence (I-POP) Machine
Figure 3.4: The Layered Grid Architecture
Figure 3.5: The Layered Grid Architecture with respect to Services and APIs
Figure 3.6: The Layered Grid Architecture and its Relationship to the Internet Protocol Architecture
Figure 3.7: The Core Elements of the Open Grid Services Architecture (shaded)
Figure 3.8: Services Involved in the Example
Figure 3.9: The Three Layered Semantic Grid Architecture
Figure 3.10: Comparison of Peer-to-Peer and Grid Computing Styles
Figure 3.11: Middleware Peer (MP) Groups of Services at the edge of the Grid
Figure 4.12: Authentication, Authorization through Proxy
Figure 5.13: A Virtual Database System on the Grid
Figure 5.14: Separate Interaction with Databases on the Grid
Commercial applications are also gradually beginning to realize the importance of grids. Some engineering design and pharmaceutical research and development problems are similar to the above-mentioned scientific applications in that they also involve huge amounts of data and require more computing power.
Data mining is also being used in some business applications as this field evolves. In the financial sector it is starting to play a significant role in areas such as fraud detection and the analysis of purchasing behavior.
Another important class of commercial applications being explored focuses on integrated data access both within and among enterprises. This can be viewed as a form of distributed databases [PDD]. Structured data is available at various nodes and there is a need to govern the access, maintenance and security of not only the distributed data but also transactions carried out on it.
The Data
In most of the existing architectures the data management services are restricted to the handling of files. However, in principle Data Grids should be able to handle data elements from single bits to complex collections of data and even virtual data, which must be generated upon request. All kinds of data need to be identifiable through some mechanism that is able to uniquely identify and locate the data. In OGSA terminology this identifier is called Grid Data Handle (GDH).
The following are some types of data defined in [GCM2003] that Data Grids deal with:
For many Virtual Organizations the data granularity level is a data file. Access to files is simple and well understood, hence data management is easier. Security can be implemented through many well-known security mechanisms (Unix permissions, Access Control Lists, etc.).
There may be a requirement to assign a single name to multiple files (or a collection of files) so that they can be treated as one in grid operations. Semantically there are two different types of file collections. Confined Collections are those in which all files that make up the collection are always kept on the same resource and are treated as one, as in a zip or a tar file. Free Collections are composed of files and other collections, which may not be on the same resource. Hence free collections provide greater flexibility in the movement of files across resources, but the availability of any particular file in the collection is not guaranteed at any given time.
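The distinction between the two collection types can be illustrated with a minimal Python sketch. The class names, the resource identifiers and the idea of passing in a set of currently reachable resources are assumptions made for illustration only; they are not taken from [GCM2003] or from any grid toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass(frozen=True)
class GridFile:
    name: str
    resource: str  # hypothetical identifier of the hosting resource


@dataclass
class ConfinedCollection:
    """All member files live on one resource and are handled as a unit."""
    name: str
    resource: str
    files: List[str] = field(default_factory=list)


@dataclass
class FreeCollection:
    """Members may be files or other collections, on any resource."""
    name: str
    members: List[Union[GridFile, ConfinedCollection, "FreeCollection"]] = field(
        default_factory=list
    )

    def unavailable(self, online: Set[str]) -> List[str]:
        """Names of members whose hosting resource is not currently reachable."""
        missing: List[str] = []
        for member in self.members:
            if isinstance(member, FreeCollection):
                # Nested free collections are checked member by member.
                missing.extend(member.unavailable(online))
            elif member.resource not in online:
                missing.append(member.name)
        return missing
```

A confined collection is either wholly reachable or wholly unreachable, whereas a free collection has to be checked member by member, which is the availability caveat described above.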
The data defined by a GDH may correspond to the data in a Relational Database or to any table, row, view or other granularity level [DSP1990].
XML Database and Semi-structured Data
Data with loosely defined or irregular structure can be defined using the semi structured data model, which is represented using XML [BXM2000]. The GDH may correspond to any XML data object in an XML database.
This is the most generic form of a single data instance. The structure of an object is completely arbitrary and thus the grid services should be able to handle objects in which the type varies from object to object.
Virtual data denotes secondary data or data that is generated at run time from primary data stored at a location. Additional services are required to handle such data.
Data sets differ from Free Collections only in that they can contain any kind of the above mentioned data types in addition to files. Such data is useful for archiving, logging and debugging purposes.
3.7.2 SEMANTIC GRIDS
A Semantic Grid is characterized as an open system in which users, software components and computational resources come and go on a continual basis. There should be a high degree of automation and flexible communication and collaboration between the resources, which are all owned by different stakeholders.
The computing infrastructure is characterized in [GCM2003] as consisting of three layers:
This layer deals with the allocation and scheduling of computing resources and the transfer of data between resources in order to carry out a processing task. It deals with large volumes of data corresponding to heavy computation. This layer is built upon the Fabric layer of the Grid Layered Model, which may interconnect scientific equipment.
This layer deals with the way information is stored, retrieved, shared and maintained. It includes the semantics and meanings of data units, for example that an integer denotes the current pressure in a cylinder.
Knowledge services use the information provided to solve scientific problems or to make a decision. This layer deals with how knowledge is acquired, used, retrieved, published and maintained.
Figure 3.9: The Three Layered Semantic Grid Architecture
This is just a conceptual model and direct implementation is not feasible. However, all grids have some element of these three layers in them. The service-oriented view is applicable at all three layers, i.e. at each layer there are services together with producers and consumers that use them.
The key components of a service-oriented architecture are as follows:
- Service Owners
- Service Consumers
- Contracts (between Service Producers and Consumers)
- Market Owner
All services have an owner (or a set of owners). The owner is the body that is responsible for offering the service for use by others. The owner sets the terms and conditions under which the service can be accessed and used. Thus, the owner may decide to make the service free for all and universally available, or may decide to limit access to the service, for example by restricting it to a specific class of users, making it priority based or putting a price on its usage.
The relationship between the service owner and consumer is defined by a contract, which specifies the terms and conditions under which the owner agrees to provide the service to the consumer. This contract may define the price of the service, the expected output, the expected time taken and the penalties that apply if the service provider fails to deliver.
The service owners and service consumers interact with one another in a specific environmental context. This may be open to all services, i.e. all services may interact in a common environment. There may be cases in which the environment is closed, i.e. membership may be limited according to some attributes. A particular environment is called a ‘Marketplace’ and the entity that establishes and runs the Marketplace is called the ‘Market Owner’. The market owner may be one of the entities in the marketplace, i.e. a producer or a consumer, or it may be a neutral third party.
A Service Life Cycle as defined in [GCM2003] for e-science applications consists of the following steps:
The service needs to be defined using an appropriate Service Description Language. Service creation should be seen as a continuous activity. New services may come into the environment and existing ones may be removed at any time. Hence, no steady state is ever reached. A number of services may also be combined to form a new service.
Meta information associated with the service needs to be specified, such as who can access the service and other contract options related to it. The service also needs to be advertised and possibly registered so that it is available in a marketplace.
This phase occurs in a particular marketplace and involves a service owner and service consumer establishing a connection on the basis of a contract for the enabling of a particular service. This may fail if the two parties cannot reach a mutually acceptable agreement. This negotiation may be carried out offline by the respective service owner and consumer or it may be carried out dynamically at run time.
After establishing a connection and agreeing on a contract the service owner has to undertake the necessary actions in order to fulfill the obligation as specified in the contract.
3.7.3 PEER-TO-PEER GRIDS
Peer-to-peer systems are Internet applications that harness resources of a large number of autonomous participants.
P2P and grid computing are both concerned with resource sharing within distributed environments. However, differences exist between the two. P2P technologies focus on resource sharing in environments characterized by millions of users with mutual distrust, most with homogeneous desktop systems and low bandwidth connections to the Internet. The emphasis is on massive scalability and fault tolerance. Grid systems generally connect smaller groups of users, who are better connected and have a more diverse range of resources to share.
Figure 3.10: Comparison of Peer-to-Peer and Grid Computing Styles
However, the long-term objectives of the P2P and the grid seem to converge as both take on a broader view of scale and resource sharing. The relationship between the two has been presented in [DTC2003].
A P2P grid computer could combine the varied resources, services and power of grid computing with the global-scale, resilient and self-organizing properties of P2P systems. A P2P system provides lower-level services on top of which grid services infrastructure can be built to enable global distribution and resource sharing.
Figure 3.11: Middleware Peer (MP) Groups of Services at the edge of the Grid
An example architecture of a peer-to-peer grid is one in which peer groups are managed locally and are then arranged in a global system supported by core servers. The grid controls central services, whereas services at the edge are grouped into less organized middleware peer groups.
ISSUES IN GRID COMPUTING
This chapter discusses the issues and problems pertaining to grid computing and grid-enabled applications. As this technology matures, solutions to these matters are emerging, but further issues are also surfacing which will have to be dealt with in the future.
A Grid environment should be flexible, robust, coordinated and measurable while the resources themselves should be interoperable, manageable, available and extensible. In a Data Grid the data should be accessible from anywhere at any time; however, the user is not necessarily interested in the exact location of the data. Users should also not have to be concerned with issues and problems related to data conversion.
With respect to P2P systems, work is required in two areas in order to broaden the range of computational tasks that can be treated with a massive distributed P2P system. First, the infrastructure needs to handle tightly coupled distributed computation better. There is a need for exploiting self-organizing properties, better timing and using innovative data transfer schemes to minimize communication overhead. Second, algorithms should be designed specifically to exploit P2P properties.
Some of the issues arising in grid computing are:
4.1 AVAILABILITY AND FAULT TOLERANCE
The grid should be able to handle failures and unpredictable behavior of nodes. There should be a graceful fault tolerance mechanism so that the reliability of the whole system is not compromised. Policies need to be created that solve issues such as what would happen if a service is unavailable for a particular time, how service overload is dealt with, what happens if the Registry (in which all services are registered) becomes unavailable, how to deal with network partitioning and other network problems.
One of the main factors behind the increasing popularity of the grid is its easy scalability. This is mentioned in the Requirements for OGSA [GCM2003, GSS2002, TPG]. It should be possible to add new services at run time and the grid environment should be able to handle the increased load while still remaining flexible. As such, there should be no limit to scalability.
4.3 AUTHENTICATION AND AUTHORIZATION
Authentication, authorization and policy are among the most challenging issues related to grid computing. There is a difference between the traditional authorization tactics currently deployed and the requirements of the grid.
In a client-server architecture, the client is the one requesting a service from the server. The server machine determines whether the client is genuine and whether authorization can be granted. In a grid environment the distinction between client and server tends to disappear. If machine A requests computing power from machine B, machine A is the client and machine B the server. However, at any other time machine B may become the client requesting, for example, storage space on machine A. Hence, authorization mechanisms are essential on both sides and request processing occurs only when both machines have agreed upon some contract parameters.
Authentication in a grid environment can be called Two-way authentication. The resource providers need some sort of assurance that they can enforce local policies and are able to block malicious users from attempting any harmful activity on their system. This should be possible locally and without the need to invoke some remote service. On the other side, users connecting to a resource need to be assured that their data cannot be compromised by local site administrators.
Grid security issues differ from the current security practices because the following features must be provided in a grid [TGB2004, GCM2003]:
Users must be able to log on or authenticate to the grid just once and then have access to multiple resources. Requiring a user to reauthenticate on each occasion is impractical. Authentication may require typing in a password.
A job entered by a user may need to start subprograms, which would themselves need to access resources. Hence there should exist the ability to delegate rights to programs. This can be done through the creation of a proxy credential.
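Delegation through a proxy credential can be pictured with the following toy Python model. It is only a sketch under stated assumptions: the Credential class, the delegate and is_authorized functions and the use of plain numbers for time are all hypothetical, and no cryptography is performed. A real grid security infrastructure would use signed certificates rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Credential:
    subject: str
    rights: FrozenSet[str]
    expires_at: float                      # seconds on the caller's clock
    issuer: Optional["Credential"] = None  # None for the user's own credential


def delegate(parent: Credential, subject: str, rights: Iterable[str],
             lifetime: float, now: float) -> Credential:
    """Create a proxy holding a subset of the parent's rights for a limited time."""
    requested = frozenset(rights)
    if not requested <= parent.rights:
        raise PermissionError("a proxy cannot hold rights its issuer lacks")
    # The proxy never outlives the credential that issued it.
    expires = min(parent.expires_at, now + lifetime)
    return Credential(subject, requested, expires, parent)


def is_authorized(cred: Credential, right: str, now: float) -> bool:
    """Every link in the delegation chain must be unexpired and hold the right."""
    link: Optional[Credential] = cred
    while link is not None:
        if now >= link.expires_at or right not in link.rights:
            return False
        link = link.issuer
    return True
```

Because the proxy carries its own short lifetime and a reduced set of rights, a subprogram can act for the user without the user authenticating again, which also supports the single sign-on requirement described above.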
Integration with Local Security Solutions
Each site of a grid may have its own security solutions in place. One site may be using Kerberos while the other may have employed Unix solutions. Grid security and authorization policies must be able to interoperate with these existing solutions.
Figure 4.12: Authentication, Authorization through Proxy
User-based Trust Relationships
A user may require access to and usage of resources from more than one resource provider. In such cases it must not be required that the security systems of the respective resource providers interact with each other.
There is a need to check how existing web service security mechanisms might interoperate with grid security infrastructures.
Several questions arise about grid security. How are VO-wide security policies to be applied? How are local security policies enforced, and what relation exists between the global grid security mechanisms and those at the local site? Is it possible for a user to belong to different VOs and use the resources of both even if their security mechanisms differ? Should there be a way for one VO to authenticate another VO and, if so, how should it be implemented across heterogeneous platforms with different security mechanisms?
4.4 INTEROPERABILITY AND COMPATIBILITY
Interoperability is an explicit requirement of the grid and is one of the driving concepts behind OGSA [GSD2002]. Web services, as well as grid services, are designed such that the modules are highly interoperable. No uniform protocol is required that each service has to speak. Instead, WSDL [WSD2001] descriptions are used to ensure interoperability.
Interoperability is very closely related to discovery because services that need to interoperate have to discover common protocols that they can use and agree on other parameters so that compatibility is ensured.
Service owners and consumers can be conceptualized as autonomous agents. Characterization of agents has been researched in [ABS1997]. Interaction between such agents requires that they be able to interoperate in a meaningful way. Such interoperation is difficult to obtain in grids because the different agents will typically have their own individual information models.
It should be possible for any agent to establish a marketplace (a particular resource sharing environment). In order to create a marketplace the owner needs a representation scheme for describing the various entities that are allowed to participate in the marketplace, a means of describing how the various entities are allowed to interact with one another and a definition of what monitoring mechanisms are to be put in place, if any are needed.
4.5 RESOURCE MANAGEMENT AND SCHEDULING
The fundamental ability of the grid is to discover, allocate and negotiate the use of network-accessible capabilities. Resource management in traditional computing systems is a well-defined problem. Resource managers such as batch schedulers and operating systems exist that are local to a system and have complete control of a resource.
In a grid environment resource management is different and comparatively difficult for several reasons [TGB2004]. The first is that the managed resources span multiple administrative domains. Heterogeneity also presents problems and there is a need for standard resource management protocols and standard mechanisms for expressing resource and task requirements. Different organizations operate their resources under different policies, and utilizing a resource means following the local policy in place. A task may require the use of multiple resources simultaneously, which may belong to different virtual organizations, and so a mutual agreement will need to be established. A resource may also be shared among different virtual organizations. There is also the scenario of on-demand access, in which resource capability is made available at a specified point in time and for a specified duration. This is especially important if one wishes to coordinate the use of two or more resources. Co-scheduling for the grid involves scheduling multiple individual and heterogeneous resources so that multiple processes can be executed at the same time and can communicate and coordinate with each other [GCM2003].
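The core of co-scheduling, finding a time at which all required resources are free together, can be sketched in a few lines of Python. This is a simplified illustration and not a method taken from [TGB2004] or [GCM2003]: it assumes each resource publishes its free time windows as (start, end) pairs, and it ignores policies, negotiation and reservation failures. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Window = Tuple[float, float]  # (start, end) of a free interval


def intersect(a: List[Window], b: List[Window]) -> List[Window]:
    """Intersect two sorted lists of non-overlapping windows."""
    out: List[Window] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever window ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_common_slot(free: Dict[str, List[Window]],
                         duration: float) -> Optional[Window]:
    """Earliest slot of the given duration during which every resource is free."""
    common: Optional[List[Window]] = None
    for windows in free.values():
        ordered = sorted(windows)
        common = ordered if common is None else intersect(common, ordered)
    for start, end in common or []:
        if end - start >= duration:
            return (start, start + duration)
    return None
```

In practice each resource manager would also have to confirm the reservation under its own local policy, so a slot found this way is only a candidate for the mutual agreement mentioned above.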
There is a need to integrate services not just within but also across Virtual Organizations. Standards need to be defined so that services are integrated across VOs. For Data Grids there is also the issue of data integration. Different Virtual Organizations should be able to have secure and reliable data access.
4.7 ACCOUNTING AND PAYMENT
P2P systems such as SETI@home [SHA2002] rely on users simply volunteering their CPU resources. Introducing an economic model whereby resources are rented out adds a new complication of accounting for their use.
There is a difference between file and CPU resource sharing. File sharers have some degree of separation that allows them to upload and download files independently. Thus cooperation is of no disadvantage to them. However, interactive computing is of a bursty nature and so it may affect any local job being carried out. This is the reason accounting mechanisms of resource sharing are required to limit, monitor and prioritize sharing.
There should be a reliable and accurate method for accounting the usage of resources by a client and then the calculation of payment for it. Statistics need to be kept, published and monitored. Payment may be in the form of money paid to the service providers, or there may be provision for a client to share an equal amount of its own resources in return.
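As a minimal sketch of such accounting, the following Python fragment keeps usage records and derives both forms of payment mentioned above: a monetary charge and a net balance for reciprocal sharing. The record fields, the function names and the flat price per CPU hour are assumptions chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UsageRecord:
    consumer: str
    provider: str
    cpu_hours: float


def amount_due(records: List[UsageRecord], consumer: str,
               price_per_cpu_hour: float) -> float:
    """Monetary model: charge a consumer for all CPU hours it used."""
    used = sum(r.cpu_hours for r in records if r.consumer == consumer)
    return used * price_per_cpu_hour


def net_balances(records: List[UsageRecord]) -> Dict[str, float]:
    """Reciprocal model: positive means more CPU hours provided than consumed."""
    balance: Dict[str, float] = defaultdict(float)
    for r in records:
        balance[r.provider] += r.cpu_hours
        balance[r.consumer] -= r.cpu_hours
    return dict(balance)
```

Publishing the balances gives each participant a simple statistic to monitor, and a site with a persistently negative balance could be given lower priority.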
The payment methods and models will vary in the academic and business domains. If the market economy model is applied, in which every peer is free to set its own prices for resources and a stable global equilibrium is reached based on supply and demand, will global optimization be achieved with respect to resource supply and utilization? Methods to ensure fairness are yet to be determined. Lessons from economics and distributed algorithmic mechanism design will play an increasingly large part in the design of such systems.
4.8 MONITORABILITY (QoS METRICS)
Each Virtual Organization may implement different levels of QoS. However, Virtual Organizations need to interact and interoperate; hence VOs should be able to fulfill many different QoS requirements. There may be many different parameters of Quality of Service.
There need to be not only agreed metrics of QoS but also definitions from each service of how it will enhance or decrease certain QoS metrics. Another important property of a grid is ‘Measurability’. It is essential to have QoS metrics by which the Virtual Organization can measure itself and by which it can be measured by others. This plays an important role when it comes to billing and payment. However, OGSA does not elaborate on QoS metrics. They are not mentioned in the Requirements for OGSA [GCM2003, GSS2002, TPG].
The grid should be transparent to the users. Virtual Organizations [TAG2001] may be dynamic, i.e. they may change over time in their members and also their capabilities. A grid should be able to adjust to these changes such that they are transparent to the users.
4.10 USER CONNECTIVITY
When considering the larger view of the grid it is essential to consider the network connections of the different types of nodes connected. Not only can the connection quality be low in some cases, but there are also differences between dial-up, broadband and connections from academic or corporate networks. Hence, applications must consider the heterogeneity of their peers’ connections.
There may also be nodes that cannot accept incoming connections, perhaps because they do not have any externally recognized IP address or because they are behind a separately administered firewall. These factors and others contribute towards complicating routing behaviors in real deployments.
Existing grid environments tend to comprise participants that are connected by well-administered and reliable academic networks. However, as more diverse nodes are connected, these issues may become more important. Work is still needed to ensure congestion-free networks and guaranteed performance in large-scale distributed systems. Solutions may be obtained through localized traffic engineering and scheduling algorithms.
4.11 LOAD BALANCING
In an environment of heterogeneous resources and competing job requirements, load balancing is difficult. It involves a trade-off between the best allocation option of a job to a resource and the rate at which job and resource properties are distributed.
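One simple way to picture the allocation side of this trade-off is a greedy heuristic that places each job on the resource expected to finish it earliest. The Python sketch below is an illustrative assumption, not a technique proposed in this study: it presumes that job sizes and resource speeds are known exactly, which is precisely the information that may be stale in a real grid.

```python
from typing import Dict, List, Tuple


def assign_jobs(jobs: List[Tuple[str, float]],
                speeds: Dict[str, float]) -> Dict[str, str]:
    """Greedy placement.

    jobs   -- (job id, amount of work) pairs
    speeds -- work units per second for each resource
    Returns a mapping from job id to the chosen resource.
    """
    busy_until = {name: 0.0 for name in speeds}
    placement: Dict[str, str] = {}
    # Placing the largest jobs first usually gives a more even finish time.
    for job_id, work in sorted(jobs, key=lambda job: -job[1]):
        best = min(speeds, key=lambda r: busy_until[r] + work / speeds[r])
        busy_until[best] += work / speeds[best]
        placement[job_id] = best
    return placement
```

The more often resource properties are redistributed, the closer such a heuristic gets to a good allocation, but at the price of extra communication.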
Some P2P systems offer privacy by masking user identities. Some go further and also mask content so that peers exchanging data do not know who delivered or stored which data. Research is needed to determine whether this would compromise grid security and whether it can be implemented on top of applications or middleware.
4.13 INDUSTRY SUPPORT
As yet, the grid is still seen more as a scientific and academic technology than as a commercial one. Broad industry support is required in order to fully capitalize on the grid’s potential.
DATABASES AND THE GRID
The previous chapters provided an overview of grid computing and the issues related to it. With respect to those factors, a framework for database integration with the grid has been proposed in this chapter. It discusses enhancements that can be made to existing database integration propositions regarding major database concepts such as Metadata (section 5.3.1), Query (section 5.3.2), Transaction (section 5.3.3) and grid concepts such as Scheduling (section 5.3.6) and Accounting (section 5.3.7).
5.1 BACKGROUND: DATABASES AND THE GRID
A database is a single large repository of data [DSP1990]. A distributed database is defined in [PDD] as:
‘A collection of multiple, logically interrelated databases distributed over a computer network.’
A Distributed Database Management System (DDBMS) is then defined as [PDD]:
‘The software system that permits the management of the Distributed Database System and makes the distribution transparent to the users.’
Support for databases in a grid environment is becoming essential because of the gains that can be achieved by combining data from various sources. Users can search for data relevant to specific projects or subjects and be returned data sources from all over the world. Imagine scientific data at geographically dispersed sites A and B. Each data set is incomplete by itself and does not lead to any new theory. However, it is possible that combining data from the two would yield results of more scientific value. An example of this is the data in astronomical laboratories all around the world. The amount of data is huge but it corresponds to a fairly uniform set of metrics, units and other vocabulary [ADA1998]. Integrating this data can lead to exciting new discoveries. However, this data is stored in databases that not only follow different database models but are also managed by different DBMSs.
Currently almost all grid applications are file-based, so very little has been done to integrate databases with the grid [DTG]. Complete standards have not been defined by the OGSA for database integration. Oracle 10g [ORA] claims support for the grid: ‘Oracle Database 10g is the first relational database designed for Enterprise Grid Computing’ [ODG2003].
The Globus Toolkit [TGP] is used for building computational grids. With reference to databases the documentation for version 2.4 states that:
‘The Grid Resource Information Service (GRIS) provides a uniform means of querying resources on a computational grid for their current configuration, capabilities and status. Such resources include but are not limited to:
- Computational nodes
- Data storage systems
- Scientific instruments
- Network links
The Grid Index Information Service (GIIS) provides a means for identifying interesting resources where ‘interesting’ can be defined arbitrarily.’
Spitfire [PST2002], a European Data Grid project, has developed an infrastructure that allows a client to query a relational database over GSI-enabled HTTP. The Open Grid Services Architecture-Data Access and Integration (OGSA-DAI) [ODA] project is both a framework and a tool to grid-enable existing structured data resources and provide a uniform interface to access distributed and heterogeneous data sources. It is a reference implementation of the GGF DAIS specifications [TGG].
5.2 ISSUES IN THE ACCESS AND INTEGRATION OF DATABASES INTO THE GRID
There are two main dimensions of complexity to the problem of integrating databases into the grid [DTG]: implementation differences between server products within a database paradigm and the variety of database paradigms.
Existing DBMSs do not provide Grid functionality, except for Oracle 10g whose grid implementation is controversial. [ROW] claims that the grid element of Oracle 10g is mainly an enhancement of the existing cluster features of Oracle 9i. In the article [INE], competitor IBM questions the grid implementation by Oracle, which is markedly different from that of IBM. The IBM approach [IAG] provides a virtual view of information, whereas in the Oracle version the servers are controlled by Oracle.
The current DBMSs [DSP1990] have been constructed after years of research and simply discarding them and creating new ones for grid-enabled databases is not feasible. Rather some changes should be made so that they are able to manage databases on the grid. As the grid becomes commercial, database vendors would themselves wish to provide grid support according to the emerging grid standards.
The integration of databases with the grid can be done in two different ways: either by separately providing support for every type of database that exists or by providing a middleware with common support for all types of databases. The latter method is more favorable and is being researched so that existing databases need not be changed; rather, a wrapper is implemented on top of them to provide additional grid functionality. If separate integration of every type of database with the grid is carried out, all the effort put into database research until now will be in vain because new models would have to be made from scratch. It would also take too much time, whereas the need for database access and integration with the grid is emerging now.
5.2.1 DATABASE REQUIREMENTS OF GRID APPLICATIONS
There are two sets of requirements that must be met [DTG]: firstly those that are generic across all components of grid applications and allow databases to be used within applications, and secondly, those that are specific to databases and allow database functionality to be exploited by grid applications.
A set of standards needs to be defined which would be implemented by all grid components so that there is uniform access to databases. Work being done by the Global Grid Forum [TGG] suggests that security, accounting, performance monitoring and scheduling will be important. It must be possible to specify all combinations of access restrictions (read, write, insert, delete, etc.) and to have fine-grained control over the granularity of the data (table, row, column, etc.) through grid applications. Role-based access also needs to be provided, in which access permissions are based on the role that the user adopts.
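Role-based access with control over data granularity can be sketched as a lookup from a role to the operations it may perform on a table or on a single column. The roles, table and column names below are invented for illustration, and row-level control is omitted to keep the sketch short.

```python
from typing import Dict, Optional, Set, Tuple

# (table, column) -> allowed operations; a column of None means "whole table".
Grant = Dict[Tuple[str, Optional[str]], Set[str]]

# Hypothetical roles and grants, for illustration only.
ROLE_GRANTS: Dict[str, Grant] = {
    "analyst": {("observations", None): {"read"}},
    "curator": {
        ("observations", None): {"read", "insert"},
        ("observations", "quality_flag"): {"write"},
    },
}


def is_allowed(role: str, op: str, table: str,
               column: Optional[str] = None) -> bool:
    """Check table-level grants first, then any column-level grant."""
    grants = ROLE_GRANTS.get(role, {})
    if op in grants.get((table, None), set()):
        return True
    if column is not None and op in grants.get((table, column), set()):
        return True
    return False
```

Because permissions are attached to the role rather than to the individual, a user who adopts a different role for a different project is checked against a different set of grants.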
There can also be many ways to execute a query. Results can be returned one by one in the form of a stream, or they can be returned in the form of a block. This would depend on the further analysis that needs to be done on the resultant data set.
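The two delivery modes can be illustrated with a minimal Python sketch. The function names and the sample rows below are hypothetical and are not taken from [DTG]; the sketch only contrasts returning rows one at a time with returning them as a single block.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, float]  # hypothetical row shape: (name, value)


def stream_results(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield rows one by one, so the consumer can start work immediately."""
    for row in rows:
        yield row


def block_results(rows: Iterable[Row]) -> List[Row]:
    """Materialise the whole result set before handing it over."""
    return list(rows)


# Illustrative data only.
sample = [("mars", 1.5), ("venus", 0.7), ("earth", 1.0)]

first_row = next(stream_results(sample))  # only one row is consumed so far
whole_block = block_results(sample)       # all rows are available at once
```

Streaming suits analyses that process each row independently, while a block suits analyses that need the complete result set, such as sorting or aggregation.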
Internally DBMSs make decisions on how to best execute a query through the use of cost models that are based on estimates of the costs of operations used within queries, data sizes and access costs. In a grid environment the DBMS needs to be provided with cost information related to resources as well so that it can decide not only which resource to run the query on, but also what mode of communication will be best in the transfer of data.
Grid applications will use the functionality provided by current databases, but integrating databases into the grid adds further requirements. These include scalability, handling of unpredictable usage, metadata-driven access and support for heterogeneous databases. Grid applications can have extremely demanding performance and capacity requirements: low response times and high access throughput are desired because there will be a large number of clients. Current databases are accessed by users in standard, predictable ways, whereas in a grid environment there will be open, ad-hoc access to databases. There will therefore be a need to manage load as well as to prevent accidental or intentional damage. Current DBMSs provide little support for the control of related resources such as CPU, disk I/O and cache storage, so monitoring and accounting services for these resources will also need to be defined.
5.3 THE PROPOSED FRAMEWORK
The proposed framework in [DTG] is service-based. A service-based distributed query processor for the grid has been described in [ODS]. The objective here is to describe and enhance these infrastructures with specific regard to distributed cost processing in a grid environment.
Following are the services described in [DTG] along with the proposed changes:
This service provides access to technical metadata about the database and the set of services that it offers for Grid applications.
Metadata is data about data. It adds context to the data, aiding in its location, identification and interpretation. Key metadata includes the name and location of the data sources, the structure of the data held within them, and data item names and descriptions [DTG]. Metadata is very valuable in exploiting the full potential of grid-enabled databases. When databases are published on the grid, their metadata is installed in a catalogue. Users search these catalogues for the relevant data they require, and the metadata provides them with the location of the database storing that data. Standards for metadata thus become important: all users accessing the data need to do so in a uniform manner, and the results also need to be returned in a uniform manner.
The information provided by metadata should include the following:
Physical and logical name of the database, ownership and version numbers.
A description of the contents of the database. There should be a standard string defined along with other detailed description so that querying for relevant data becomes easier. It can follow a pattern such as AREA_FIELD_TYPE_DETAILS. For example, for astronomical data the contents of a specific database holding information about the Gamma Rays with respect to the planet Mars can be described by SCIENCE_ASTRONOMY_NUMERIC_MARS-GAMMA-RAYS.
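As a rough illustration of the AREA_FIELD_TYPE_DETAILS pattern, the following Python sketch builds and parses such a content string. The helper names are hypothetical, and the rule that an individual part may not contain an underscore is an assumption made here so that the string can be split unambiguously.

```python
from typing import Dict

PART_NAMES = ("area", "field", "type", "details")


def make_descriptor(area: str, field: str, data_type: str, details: str) -> str:
    """Join the four parts into one upper-case content string."""
    parts = [area, field, data_type, details]
    for part in parts:
        # Assumption: underscores are reserved as the separator.
        if not part or "_" in part:
            raise ValueError(f"invalid descriptor part: {part!r}")
    return "_".join(part.upper() for part in parts)


def parse_descriptor(descriptor: str) -> Dict[str, str]:
    """Split a content string back into its four named parts."""
    parts = descriptor.split("_")
    if len(parts) != len(PART_NAMES):
        raise ValueError(f"expected 4 parts, got {len(parts)}")
    return dict(zip(PART_NAMES, parts))


mars = make_descriptor("science", "astronomy", "numeric", "mars-gamma-rays")
parsed = parse_descriptor(mars)
```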
[DTG] defines provenance as a type of metadata that provides information on the history of data. It includes the data’s creation, source and owner, what processing has taken place, the software versions, what analyses have been carried out, what results have been produced and the level of accuracy of the information. This can also be used by grid applications in narrowing down the search for required data. Provenance data should be stored in a separate structure such as:
Date of creation,
Source of the data,
Owner of the data,
What processing has taken place (with software versions),
What analyses the data has been used in,
What results have been produced (with links to databases where the results can be viewed),
Level of accuracy.
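One possible way to hold this structure in code is sketched below in Python. The class names, field names and sample values are hypothetical; only the seven items themselves come from the list above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ProcessingStep:
    """One processing operation together with the software that performed it."""
    operation: str
    software: str
    version: str


@dataclass
class Provenance:
    """Provenance record mirroring the seven items listed above."""
    created: date                                                     # date of creation
    source: str                                                       # source of the data
    owner: str                                                        # owner of the data
    processing: List[ProcessingStep] = field(default_factory=list)    # processing, with versions
    analyses: List[str] = field(default_factory=list)                 # analyses the data was used in
    results: List[str] = field(default_factory=list)                  # links to result databases
    accuracy: str = "unknown"                                         # level of accuracy


# Illustrative record; every value here is invented for the example.
record = Provenance(
    created=date(2003, 5, 1),
    source="ObservatoryA",
    owner="OwnerB",
    processing=[ProcessingStep("Normalization", "ABCSoftware", "1.0")],
)
```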
Referring again to the example of astronomy, a user may search for data about Mars using the following pattern:
Where ? denotes a wild card.
The GIIS using the grid catalogue should return a list of databases that satisfy this content query. The user can narrow down the search results by specifying the following provenance structure:
>2002, ?, ?, Normalization_ABCSoftware, ?
This string means that the user requires data that should have been created after 2002 and on which normalization has been carried out using ABCSoftware.
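A positional matcher for query strings of this kind might look like the Python sketch below. It is only an illustration under stated assumptions: the helper names are hypothetical, a field starting with > or < is assumed to be compared as an integer (for example a year), and the candidate records are invented. The sketch does not decide which provenance item each position corresponds to; it simply compares position by position.

```python
from typing import Sequence

WILDCARD = "?"


def field_matches(pattern: str, value: str) -> bool:
    """Compare one query field against one record field."""
    pattern = pattern.strip()
    if pattern == WILDCARD:
        return True                      # wild card matches anything
    if pattern.startswith(">"):
        return int(value) > int(pattern[1:])
    if pattern.startswith("<"):
        return int(value) < int(pattern[1:])
    return pattern == value              # otherwise require an exact match


def record_matches(query: str, record: Sequence[str]) -> bool:
    """Return True when every comma-separated query field matches the record."""
    patterns = [p.strip() for p in query.split(",")]
    if len(patterns) != len(record):
        return False
    return all(field_matches(p, v) for p, v in zip(patterns, record))


query = ">2002, ?, ?, Normalization_ABCSoftware, ?"

# Invented candidate records with the same number of positions as the query.
newer = ["2003", "ObservatoryA", "OwnerB", "Normalization_ABCSoftware", "high"]
older = ["2001", "ObservatoryA", "OwnerB", "Normalization_ABCSoftware", "high"]

newer_ok = record_matches(query, newer)
older_ok = record_matches(query, older)
```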
The database schema is also defined as metadata. There are two ways to implement this: the schema can be defined using the local database model or one specific standard can be defined and all schemas should follow that model. [DTG] notes that the inclusion of a service federation middleware for heterogeneous databases seems to be a better option. So, the schema definition in metadata should be defined using the current model and the middleware should have support for all models so that it can translate between them.
Metadata should also state what functionality is offered by the database. Some databases may provide only read access, while others may provide read and write access. Permissions may depend on the role of the user, and payment details can also be tied to roles. For example, if a specific database provides three roles (Guest, Admin and Analyzer) and four kinds of permission (Read, Write, Execute and Delete), there may exist rules that state:
Guest: read access.
Admin: read, write, delete and execute access.
Analyzer: read and execute access.
Here, execute means that the user can run some operation on the data using software or analysis capability provided by the resource. Because the Analyzer role makes use of that analysis capability, the payment for Analyzer would be greater than that for Guest.
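The role rules above can be written down as a small lookup table. The Python sketch below is one hypothetical encoding; the names Permission, ROLE_PERMISSIONS and is_allowed are introduced here for illustration, and denying unknown roles is an assumption.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()
    DELETE = auto()


# Role-to-permission rules taken from the example above.
ROLE_PERMISSIONS = {
    "Guest": Permission.READ,
    "Admin": Permission.READ | Permission.WRITE | Permission.DELETE | Permission.EXECUTE,
    "Analyzer": Permission.READ | Permission.EXECUTE,
}


def is_allowed(role: str, requested: Permission) -> bool:
    """Return True only if the role holds every requested permission."""
    granted = ROLE_PERMISSIONS.get(role)
    if granted is None:
        return False  # assumption: unknown roles get no access
    return (granted & requested) == requested


guest_can_write = is_allowed("Guest", Permission.WRITE)
analyzer_can_run = is_allowed("Analyzer", Permission.READ | Permission.EXECUTE)
```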
Metadata should also specify what type of query language the database supports. The middleware services should provide support for all the major query languages. This is feasible to the extent that these languages are based on SQL.
The query service also needs to support query evaluation with respect to communication and scheduling. The service-based distributed query processor (OGSA-DQP) [ODS] supports the evaluation of queries expressed in a declarative language over Grid Data Services [ODA]. OGSA-DQP provides two services to fulfill its functions: the Grid Distributed Query Service (GDQS) and the Grid Query Evaluation Service (GQES). The GDQS provides the primary interaction interfaces for the user and acts as a coordinator between the underlying compiler/optimizer engine and the GQES instances. Each GQES instance executes a query sub-plan assigned to it by the GDQS. The query optimizer decides where to create the GQES instances, which are created and scheduled dynamically and whose interaction is coordinated by the GDQS.
All databases that satisfy a user’s requirements can be used to create a Virtual Database System (DBS) [DTG]. This would present to the user a single integrated schema for the virtual DBS and queries will be accepted against it.
There are two steps in evaluating an input query. First, the resource on which the query is to be run has to be decided, which requires determining the communication and execution costs. Second, the best possible way to execute the query on that resource has to be deduced. Current DBMSs already support the internal cost evaluation of a query, so support needs to be defined only for the first step. With respect to communication costs, the factors that play a part are:
- The size of the input data.
- Node to node travel cost of the input data.
- The estimate of the size of the resultant data set.
- Node to node travel cost of the resultant data set.
- Reliability of the network link.
- Payment cost of the network link.
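To make these factors concrete, the Python sketch below combines them into a single figure per candidate resource. The formula is an assumption made purely for illustration and is not given in [DTG] or [ODS]: transfer cost is taken as size times per-megabyte travel cost for both input and result, divided by link reliability so that unreliable links look more expensive, plus the payment cost. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LinkEstimate:
    resource: str
    input_mb: float            # size of the input data
    input_cost_per_mb: float   # node-to-node travel cost of the input data
    result_mb: float           # estimated size of the resultant data set
    result_cost_per_mb: float  # node-to-node travel cost of the result
    reliability: float         # reliability of the link, 0 < r <= 1
    payment: float             # payment cost of the link


def communication_cost(e: LinkEstimate) -> float:
    """Illustrative cost: transfer cost scaled up by unreliability, plus payment."""
    if not 0 < e.reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    transfer = e.input_mb * e.input_cost_per_mb + e.result_mb * e.result_cost_per_mb
    return transfer / e.reliability + e.payment


def cheapest(estimates: Iterable[LinkEstimate]) -> LinkEstimate:
    """Pick the candidate resource with the lowest communication cost."""
    return min(estimates, key=communication_cost)


# Invented candidates.
node_a = LinkEstimate("node-a", 100.0, 0.5, 10.0, 0.5, 1.0, 5.0)
node_b = LinkEstimate("node-b", 100.0, 0.25, 10.0, 0.25, 0.5, 0.0)

best = cheapest([node_a, node_b])
```

In a fuller treatment the execution cost on each resource would be added to this figure before the resources are compared.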
[DIN1999] Nakada, Sato and Sekiguchi. Design and Implementations of Ninf: Towards a Global Computing Infrastructure, Future Generation Computing Systems.
[DTC2003] Foster and Iamnitchi. On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing. 2nd International Workshop on Peer-to-Peer Systems, Berkeley, CA. LNCS. 2003.
[DTG] Paul Watson. Databases and the Grid. Version 3.1.
[GEM1998] Foster and Karonis. A Grid-Enabled MPI: Message Passing in Heterogeneous Computing Environment. Proceedings of the SC’98, 1998.
[GIS2001] Czajkowski, Fitzgerald, Foster and Kesselman. Grid Information Services for Distributed Resource Sharing. 2001.
[GSD2002] Foster, Kesselman, Nick and Tuecke. Grid Services for Distributed System Integration. IEEE Computer. 2002.
[GSS2002] Tuecke, Czajkowski, Foster, Frey, Graham and Kesselman. Grid Service Specifications. 2002. http://www.globus.org/ogsa.
[NSN1997] Casanova and Dongarra. NetSolve: A Network Server for Solving Computational Science Problems. International Journal of Supercomputer Applications and High Performance Computing. 1997.
[ODG2003] Penny Avril. Oracle Database 10g: A Revolution in Database Technology. An Oracle White Paper. December 2003.
[ODS] Alpdemir, Mukherjee, Gounaris, Paton, Watson, Fernandes and Smith. OGSA-DQP: A Service-Based Distributed Query Processor for the Grid.
[OGS2003] Tuecke, Czajkowski, Foster, Frey, Graham, Kesselman, Maguire, Sandholm, Snelling and Vanderbilt. Open Grid Services Infrastructure (OGSI), Version 1.0. Technical Report, Open Grid Services Infrastructure WG, Global Grid Forum. 2003.
[OIW] DeFanti, Foster, Papka, Stevens and Kuhfuss. Overview of the I-WAY: Wide Area Visual Supercomputing.
[SED2001] Allcock, et al. Secure, Efficient Data Transport and Replica Management for High Performance Data-Intensive Computing. Mass Storage Conference. 2001.
[SHA2002] Anderson, Cobb, Korpela, Lebofsky and Werthimer. SETI@home: An Experiment in Public-Resource Computing. Communications of the ACM. 2002.
[TAG2001] Foster, Kesselman and Tuecke. The Anatomy of the Grid – Enabling Scalable Virtual Organizations. International Journal of High Performance Computing Applications, Sage Publications Inc. 2001.
[TPG] Foster, Kesselman, Nick and Tuecke. The Physiology of the Grid, An Open Grid Services Architecture for Distributed Systems Integration. http://www.globus.org/ogsa/.
[TSP2002] Laszewski, Su, Foster and Kesselman. The Sourcebook of Parallel Computing. Morgan Kaufmann Publishers. 2002.
[WSD2001] Christensen, Curbera, Meredith and Weerawarana. Web Services Description Language (WSDL) 1.1. 2001. www.w3.org/TR/wsdl.