By offering a shared address space across a number of processors connected by a local area network, the distributed shared memory model offers an attractive way of programming parallel-distributed applications. Such p...
详细信息
The implementation of the pending event set (PES) is crucial to the execution speed of discrete event simulation programs. This paper studies the implementation of the PES in the context of simulations executing on pa...
详细信息
ISBN:
(纸本)1565550552
The implementation of the pending event set (PES) is crucial to the execution speed of discrete event simulation programs. This paper studies the implementation of the PES in the context of simulations executing on parallel computers using the Time Warp mechanism. We present a scheme for implementing Time Warp's PES based on well-known data structures for priority queues. This scheme supports efficient management of future and past events, especially for rollback and fossil collection operations. A comparative study of several queue implementations is presented. Experiments with a Time Warp system executing on a Kendall Square Research multiprocessor (KSR1) demonstrate that the implementation of the input queue can have a dramatic impact on performance, as large as an order of magnitude, that is much greater than what can be accounted for by simply the reduced execution time to access the data structure. In particular, it is demonstrated that an efficient input queue implementation can also significantly reduce the number of rollbacks, and the efficiency of memory management policies such as Jefferson's cancelback protocol. In the context of this work we also present an improved version of the skew heap that allows dequeueing of arbitrary elements at low cost. In particular, the possibility of dequeueing arbitrary elements will improve memory utilization. This ability is also important in applications where frequent rescheduling may occur, as in ready queues used to select the next logical process to execute.
This work presents results from an experimental evaluation of the space-time tradeoffs in Time Warp augmented with the cancelback protocol for memory management. An implementation of the cancelback protocol on Time Wa...
详细信息
ISBN:
(纸本)1565550552
This work presents results from an experimental evaluation of the space-time tradeoffs in Time Warp augmented with the cancelback protocol for memory management. An implementation of the cancelback protocol on Time Warp is described that executes on a shared memory multiprocessor, a 32 processor Kendall Square Research Machine (KSR1). The implementation supports canceling back more than one object when memory has been exhausted. The limited memory performance of the system is evaluated for three different workloads with varying degrees of symmetry. These workloads provide interesting stress cases for evaluating limited memory behavior. We, however, make certain simplifying assumptions (e.g, uniform memory requirement by all the events in the system) to keep the experiments tractable. The experiments are extensively monitored to determine the extent to which various overheads affect performance. It is observed that (i) depending on the available memory and asymmetry in the workload, canceling back several (called the salvage parameter) events at one time may improve performance significantly, by reducing certain overheads, (ii) a performance nearly equivalent to that with unlimited memory can be achieved with only a modest amount of memory depending on the degree of asymmetry in the workload.
Synchronization is a significant cost in many parallel programs, and can be a major bottleneck if it is handled in a centralized fashion using traditional shared-memory constructs such as barriers. In a parallel time-...
ISBN:
(纸本)9781565550551
Synchronization is a significant cost in many parallel programs, and can be a major bottleneck if it is handled in a centralized fashion using traditional shared-memory constructs such as barriers. In a parallel time-stepped simulation, the use of global synchronization primitives limits scalability, increases the sensitivity to load imbalance, and reduces the potential for exploiting locality to improve cache *** paper presents the results of an initial one-application study quantifying the costs and performance benefits of distributed, nearest neighbors synchronization. The application studied, MP3D, is a particle-based wind tunnel simulation. Our results for this one application on current shared-memory multiprocessors show a significant decrease in synchronization time using these techniques. We prototyped an application-independent library that implements distributed synchronization. The library allows a variety of parallelsimulations to exploit these techniques without increasing the application programming beyond that of conventional approaches.
By offering a shared address space across a number of processors connected by a local area network, the distributed shared memory model offers an attractive way of programming parallel-distributed applications. Such p...
详细信息
By offering a shared address space across a number of processors connected by a local area network, the distributed shared memory model offers an attractive way of programming parallel-distributed applications. Such programming can be done either using a memory model based on objects or linear memory. Very few performance studies have been made of such systems. The author describes the motivation and the methodology for a project which compares performance of the object model to the linear memory model. Execution-driven simulation is used to analyze the performance and scalability of the systems for appearing fast processors and new highspeed networks.< >
The major goal of this work has been to develop an implementation of a parallel partitioning algorithm which is suitable for use in a conservatively synchronized parallel Discrete Event simulation (PDES) environment. ...
ISBN:
(纸本)9781565550551
The major goal of this work has been to develop an implementation of a parallel partitioning algorithm which is suitable for use in a conservatively synchronized parallel Discrete Event simulation (PDES) environment. Effective partitioning is essential for performance and capacity consideration, for any PDES problem. The performance of the partitioning algorithm is very important, to the overall simulation performance. There are two possible approaches to improve performance for the partitioning step: algorithm modifications; and parallelize the partitioning algorithm (Fiduccia and Mattheyses, 1982) is developed. The basic algorithm has been modified, first for parallel execution with a similar quality of final partition; and then further modified to increase the parallelism of the algorithm, at the expense of partition quality.
Recent experiments have shown that conservative methods can achieve good performance by exploiting the characteristics of the system being simulated. In this paper we focus on the interrelationship between run time an...
ISBN:
(纸本)9781565550551
Recent experiments have shown that conservative methods can achieve good performance by exploiting the characteristics of the system being simulated. In this paper we focus on the interrelationship between run time and synchronization requirements of a distributedsimulation. A metric that considers the effect of lookahead and the physical rate of transmission of messages, and an arrival approximation that models the effect of synchronization requirements on the run time are developed. It is shown that even when good lookahead is exploited in the system, poor run-time performance is achieved if an inefficient mapping of LPs to processors is used.
An approach for high performance parallel logic simulation on a local area network of workstation computers is discussed in this paper. The single, shared transmission medium often found in such networks places limita...
ISBN:
(纸本)9781565550551
An approach for high performance parallel logic simulation on a local area network of workstation computers is discussed in this paper. The single, shared transmission medium often found in such networks places limitations on parallel execution, hence a reduction in the frequency of synchronization is pursued by combining a circuit partitioning methodology with a specific synchronization constraint. A consequence of the partitioning methodology is replication of objects between blocks of a partition. A partitioning procedure based on iterative improvement is described for reducing replication while preserving load balance. Two interprocessor synchronization techniques for parallelsimulation are studied: conservative and optimistic synchronization. Experiments conducted on three large sequential circuits indicate that reasonable speedup is achievable for well-balanced partitions, and that optimistic synchronization provides a modest improvement in performance over conservative synchronization.
暂无评论