ISBN: 1565550552 (print)
This work presents results from an experimental evaluation of the space-time tradeoffs in Time Warp augmented with the cancelback protocol for memory management. An implementation of the cancelback protocol on Time Warp is described that executes on a shared-memory multiprocessor, a 32-processor Kendall Square Research machine (KSR1). The implementation supports canceling back more than one object when memory has been exhausted. The limited-memory performance of the system is evaluated for three different workloads with varying degrees of symmetry. These workloads provide interesting stress cases for evaluating limited-memory behavior. Certain simplifying assumptions are made (e.g., a uniform memory requirement for all events in the system) to keep the experiments tractable. The experiments are extensively monitored to determine the extent to which various overheads affect performance. It is observed that (i) depending on the available memory and the asymmetry in the workload, canceling back several events at one time (the number is called the salvage parameter) may improve performance significantly by reducing certain overheads, and (ii) performance nearly equivalent to that with unlimited memory can be achieved with only a modest amount of memory, depending on the degree of asymmetry in the workload.
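The cancelback step with a salvage parameter can be illustrated with a minimal sketch. It is not the KSR1 implementation from the paper: the `Event` and `EventPool` names, the single shared pool, and the choice of victims (the stored events with the largest send timestamps, as in the usual description of cancelback) are assumptions made for illustration, and each event is assumed to occupy exactly one buffer, matching the uniform-memory simplification in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sender: str
    receiver: str
    send_time: float
    recv_time: float


class EventPool:
    """Hypothetical fixed-capacity event store; every event uses one buffer."""

    def __init__(self, capacity, salvage):
        self.capacity = capacity  # number of event buffers available
        self.salvage = salvage    # events canceled back per exhaustion (>= 1)
        self.events = []

    def cancelback(self):
        """Return the `salvage` events with the largest send timestamps.

        The result maps each affected sender to the earliest send time
        among its returned events, i.e. the time it must roll back to.
        """
        self.events.sort(key=lambda e: e.send_time)
        k = min(self.salvage, len(self.events))
        victims = self.events[len(self.events) - k:]
        del self.events[len(self.events) - k:]
        rollbacks = {}
        for e in victims:
            previous = rollbacks.get(e.sender, e.send_time)
            rollbacks[e.sender] = min(previous, e.send_time)
        return rollbacks

    def insert(self, event):
        """Store an event; trigger cancelback if memory is exhausted."""
        self.events.append(event)
        if len(self.events) > self.capacity:
            return self.cancelback()
        return {}


pool = EventPool(capacity=3, salvage=2)
pool.insert(Event("A", "B", 1.0, 2.0))
pool.insert(Event("B", "A", 2.0, 3.0))
pool.insert(Event("A", "B", 3.0, 4.0))
rollbacks = pool.insert(Event("B", "A", 2.5, 3.5))  # pool is now over capacity
print(rollbacks)  # {'B': 2.5, 'A': 3.0}
```

With `salvage=1` only the event sent at time 3.0 would be returned, and the next insertion would exhaust memory again. A larger salvage value frees more buffers per invocation at the cost of rolling back more work, which is the tradeoff the abstract reports measuring.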
ISBN: 9781565550551 (print)
Synchronization is a significant cost in many parallel programs, and can be a major bottleneck if it is handled in a centralized fashion using traditional shared-memory constructs such as barriers. In a parallel time-stepped simulation, the use of global synchronization primitives limits scalability, increases the sensitivity to load imbalance, and reduces the potential for exploiting locality to improve cache performance. This paper presents the results of an initial one-application study quantifying the costs and performance benefits of distributed, nearest-neighbor synchronization. The application studied, MP3D, is a particle-based wind tunnel simulation. Our results for this one application on current shared-memory multiprocessors show a significant decrease in synchronization time using these techniques. We prototyped an application-independent library that implements distributed synchronization. The library allows a variety of parallel simulations to exploit these techniques without increasing the application programming beyond that of conventional approaches.
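The sensitivity of a global barrier to load imbalance, compared with nearest-neighbor synchronization, can be illustrated with a small timing model. This is a sketch under stated assumptions, not the library or the MP3D measurements from the paper: the `finish_times` function, the chain topology, and the per-step costs are hypothetical, and the cost of the synchronization operations themselves is ignored.

```python
def finish_times(costs, neighbors=None):
    """Model when each worker finishes a time-stepped run.

    costs[s][i] is the compute time of worker i in step s.
    neighbors[i] lists the workers that i must wait for before starting
    its next step; neighbors=None models a global barrier after each step.
    """
    n = len(costs[0])
    done = [0.0] * n
    for step in costs:
        if neighbors is None:
            # Global barrier: everyone waits for the slowest worker.
            start = [max(done)] * n
        else:
            # Local rule: wait only for yourself and your neighbors.
            start = [
                max([done[i]] + [done[j] for j in neighbors[i]])
                for i in range(n)
            ]
        done = [start[i] + step[i] for i in range(n)]
    return done


# Four workers in a chain; the slow worker differs from step to step.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
costs = [
    [4, 1, 1, 1],
    [1, 1, 1, 4],
]
barrier_total = max(finish_times(costs))
neighbor_total = max(finish_times(costs, chain))
print(barrier_total, neighbor_total)  # 8.0 5.0
```

In this toy case the barrier forces worker 3 to wait for worker 0 even though the two share no data, so the two slow steps add up. Under the local rule the delay only propagates one neighbor per step.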
By offering a shared address space across a number of processors connected by a local area network, the distributed shared memory model offers an attractive way of programming parallel-distributed applications. Such programming can be done using a memory model based either on objects or on linear memory. Very few performance studies have been made of such systems. The author describes the motivation and the methodology for a project that compares the performance of the object model with that of the linear memory model. Execution-driven simulation is used to analyze the performance and scalability of the systems for emerging fast processors and new high-speed networks.
ISBN: 9781565550551 (print)
The major goal of this work has been to develop an implementation of a parallel partitioning algorithm that is suitable for use in a conservatively synchronized parallel discrete event simulation (PDES) environment. Effective partitioning is essential for performance and capacity considerations in any PDES problem, and the performance of the partitioning algorithm itself is very important to overall simulation performance. There are two possible approaches to improving the performance of the partitioning step: modifying the algorithm and parallelizing it. A parallel version of the partitioning algorithm of Fiduccia and Mattheyses (1982) is developed. The basic algorithm has been modified, first for parallel execution with a similar quality of final partition, and then further modified to increase the parallelism of the algorithm at the expense of partition quality.
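For orientation, the quantity at the core of the sequential Fiduccia and Mattheyses heuristic is the gain of a cell: the reduction in cut nets if that cell alone moves to the other block. The sketch below computes it directly for a two-way partition. The function names and the tiny netlist are hypothetical, and the paper's parallel modifications are not reproduced here.

```python
def cut_size(nets, side):
    """Count nets that have cells in both blocks of a two-way partition."""
    return sum(1 for net in nets if len({side[c] for c in net}) > 1)


def fm_gain(cell, nets, side):
    """Reduction in cut size if `cell` alone moves to the other block."""
    gain = 0
    for net in nets:
        if cell not in net or len(net) < 2:
            continue
        same = sum(1 for c in net if side[c] == side[cell])
        if same == 1:
            gain += 1  # cell is alone on its side: the move uncuts the net
        elif same == len(net):
            gain -= 1  # net lies entirely on cell's side: the move cuts it
    return gain


# Hypothetical netlist: each net is a tuple of distinct cell names.
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
side = {"a": 0, "b": 1, "c": 1, "d": 0}

before = cut_size(nets, side)
gains = {c: fm_gain(c, nets, side) for c in side}
best = max(gains, key=gains.get)

moved = dict(side)
moved[best] = 1 - moved[best]
after = cut_size(nets, moved)
print(before, best, gains[best], after)  # 3 a 2 1
```

The full heuristic also enforces a balance constraint, locks each cell after it moves, and keeps the best prefix of a pass. Those parts are omitted to keep the gain definition in focus.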
ISBN: 9781565550551 (print)
Recent experiments have shown that conservative methods can achieve good performance by exploiting the characteristics of the system being simulated. In this paper we focus on the interrelationship between the run time and the synchronization requirements of a distributed simulation. We develop a metric that considers the effect of lookahead and the physical rate of transmission of messages, and an arrival approximation that models the effect of synchronization requirements on the run time. It is shown that even when good lookahead is exploited in the system, poor run-time performance results if an inefficient mapping of LPs to processors is used.
ISBN: 9781565550551 (print)
An approach for high-performance parallel logic simulation on a local area network of workstation computers is discussed in this paper. The single, shared transmission medium often found in such networks places limitations on parallel execution; hence a reduction in the frequency of synchronization is pursued by combining a circuit partitioning methodology with a specific synchronization constraint. A consequence of the partitioning methodology is replication of objects between blocks of a partition. A partitioning procedure based on iterative improvement is described for reducing replication while preserving load balance. Two interprocessor synchronization techniques for parallel simulation are studied: conservative and optimistic synchronization. Experiments conducted on three large sequential circuits indicate that reasonable speedup is achievable for well-balanced partitions, and that optimistic synchronization provides a modest improvement in performance over conservative synchronization.
ISBN: 9781565550551 (print)
Time Warp has evolved into a common technique for distributed simulation. Speedup in Time Warp simulation systems mainly depends on two overhead factors: first, the load on the simulators has to be well balanced, and second, communication and rollbacks have to be kept to a minimum. Both of these factors are influenced by the partitioning of the simulated system. In this paper, we focus on various static partitioning schemes used to partition digital circuits for distributed simulation. A new hierarchical partitioning approach is presented and compared with other partitioning schemes by evaluating benchmark circuits. Partitioning is done in two steps: a fine-grained clustering step based on corollas and a coarse-grained step that forms partitions using the connectivity matrix. The corolla approach yields very good partitioning results even for a large number of partitions. The achieved speedups are almost linear (up to 12 partitions for larger circuits), as long as the partition sizes are large enough that communication between the simulators is not a bottleneck. The results reveal the great impact of partitioning on the acceleration of distributed logic simulation and show the effectiveness of the presented corolla partitioning scheme.
The authors describe the design of a new parallel image understanding machine, RTA/1, based on the recursive torus architecture, and propose a data-level parallel processing scheme using parallel data structures. Various type...