Distributed Computing


Parallel Applications for Distributed Systems

Introduction

Distributed systems and computational Grids (Foster and Kesselman) exhibit such large system dynamics that it is highly desirable to reconfigure executing applications in response to changes in the environment. Since parallel applications execute on large numbers of shared systems, their performance degrades when external load on the resources from other applications increases. It is also difficult for users of parallel applications to determine the right amount of parallelism for their applications, so they may want to determine it by trial-and-error experiments. Due to the large number of machines involved in distributed computing systems, the mean single-processor failure rate, and hence the failure rate of the set of machines on which a parallel application executes, is fairly high (Beguelin et al.). Hence, for long-running applications involving large numbers of machines, the probability of successful completion is low. Machines may also be removed from the executing environment for maintenance.

In the above situations, it is helpful for the users or the scheduling system to stop an executing parallel application and continue it, possibly with a new configuration in terms of the number of processors used for the execution. In the case of application failure due to non-deterministic events, restarting the application on a possibly new configuration also provides a measure of fault tolerance. This paper defines the following terms, commonly used in the literature, to describe parallel applications with different capabilities.

1.  Moldable applications - Parallel applications that can be stopped at any point of execution but can be restarted only on the same number of processors.

2.  Malleable applications - Parallel applications that can be stopped at any point of execution and can be restarted on a different number of processors. These applications are also called reconfigurable applications.

3.  Migratable applications - Parallel applications that can be stopped at any point of execution and can be restarted on processors in a different site, cluster or domain.

Reconfigurable or malleable and migratable applications provide added functionality and flexibility to the scheduling and resource management systems for distributed computing.

In order to achieve stopping and restarting of parallel applications, the state of the applications has to be checkpointed. Several checkpointing strategies for sequential and parallel applications have been surveyed (Elnozahy; Plank). Checkpointing systems have been built for sequential (Tannenbaum and Litzkow) and parallel applications (Dikken et al.). Checkpointing systems are of different types depending on their transparency to the user and the portability of the checkpoints. Transparent and semi-transparent checkpointing systems (Tannenbaum and Litzkow) hide the details of checkpointing and restoration of saved states from the users, but are not portable. Non-transparent checkpointing systems (Geist et al.) require the users to make some modifications to their programs, but are highly portable across systems. Checkpointing can also be implemented at the kernel level or at user level.

This paper describes a checkpointing infrastructure that supports the development and execution of malleable and migratable parallel applications for distributed systems. The infrastructure consists of a user-level semi-transparent checkpointing library called SRS (Stop Restart Software) and a Runtime Support System (RSS). The SRS library is semi-transparent because the user of a parallel application has to insert calls in the program to specify the data to be checkpointed and to restore the application state in the event of a restart; the actual storing of checkpoints and the redistribution of data in the event of a reconfiguration, however, are handled internally by the library. Though a few checkpointing systems allow changing the parallelism of parallel applications (Geist et al.), the SRS system is unique in that it allows applications to be migrated to distributed locations with different file systems without requiring the users to manually move the checkpoint data. This is achieved by the use of a distributed storage infrastructure called IBP (Plank et al.) that allows applications to access checkpoint data remotely.

SRS Checkpointing Library

SRS (Stop Restart Software) is a user-level checkpointing library that helps make iterative parallel MPI message-passing applications reconfigurable. Iterative parallel applications cover a broad range of important applications, including linear solvers, heat and wave equation solvers, and partial differential equation (PDE) applications. The SRS library has been implemented in both C and Fortran, and hence SRS functions can be called from both C and Fortran MPI programs. The SRS library consists of six main functions:


1.  SRS_Init,

2.  SRS_Restart_Value,

3.  SRS_Read,

4.  SRS_Register,

5.  SRS_Check_Stop and

6.  SRS_Finish.

The user calls SRS_Init after calling MPI_Init. SRS_Init is a collective operation and initializes the various data structures used internally by the library. SRS_Init also reads various parameters from a user-supplied configuration file. These parameters include the location of the Runtime Support System (RSS) and a flag indicating if the application needs periodic checkpointing. SRS_Init, after reading these parameters, contacts the RSS and sends the current number of processes that the application is using. It also receives the previous configuration of the application from the RSS if ...
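The excerpt does not show the user-supplied configuration file itself, so the fragment below is only an illustrative sketch: the field names are hypothetical, and it covers just the two parameters the text names (the RSS location and the periodic-checkpointing flag).

```
# hypothetical SRS configuration file -- field names are illustrative
RSS_LOCATION        rss.host.example.edu:4567
PERIODIC_CHECKPOINT yes
```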
