Presented is a dynamic load balancing algorithm developed for Clustered Time Warp, a hybrid approach which makes use of Time Warp between clusters of LPs and a sequential mechanism within the clusters. The load balanc...
详细信息
Presented is a dynamic load balancing algorithm developed for Clustered Time Warp, a hybrid approach which makes use of Time Warp between clusters of LPs and a sequential mechanism within the clusters. The load balancing algorithm focuses on distributing the load of the simulation evenly among the processors and then tries to reduce interprocessor communications. A triggering technique is used that is based on the throughput of the simulation system. The algorithm was implemented and its performance was measured using two of the largest benchmark digital circuits of the ISCAS '89 series. Results show that by dynamically balancing the load, the throughput was improved by 40-100% when compared to Time Warp.
We present an execution model for parallelsimulation of a distributed shared memory architecture. The model captures the processor-memory interaction and abstracts the memory subsystem. Using this model we show how p...
详细信息
We present an execution model for parallelsimulation of a distributed shared memory architecture. The model captures the processor-memory interaction and abstracts the memory subsystem. Using this model we show how parallel, on-line, partially-ordered memory traces can be correctly predicted without interacting with the memory subsystem. We also outline a parallel optimistic memory simulator that uses these traces, finds a global order among all events, and returns correct data and timing to each processor. A first evaluation of the amount of concurrency that our model can extract for an ideal multiprocessor shows that processors may execute relatively long instruction sequences without violating the causality constraints. However, parallelsimulation efficiency is highly dependent on the memory consistency model and the application characteristics.
Discrete-event simulation is an important tool used for the performance evaluation of parallel systems. The space of tradeoffs is large however, when attempting to balance model fidelity and simulation execution time....
详细信息
Discrete-event simulation is an important tool used for the performance evaluation of parallel systems. The space of tradeoffs is large however, when attempting to balance model fidelity and simulation execution time. This paper describes a simulator - TAPS (Threaded Application parallel System Simulator) - that, in the context of threaded parallel computations, provides a spectrum of possibilities in this tradeoff space. TAPS is specifically designed to be parallelized;we discuss some crucial considerations regarding its parallelization.
The use of optimistic algorithms to perform parallelsimulations of parallel machines is examined. It is first shown that one can make optimistic algorithms work correctly even with Wisconsin Wind Tunnel's direct ...
详细信息
The use of optimistic algorithms to perform parallelsimulations of parallel machines is examined. It is first shown that one can make optimistic algorithms work correctly even with Wisconsin Wind Tunnel's direct execution of program executables. Process registers (integer, floating-point, and condition codes) are checkpointed and executable editing are used to log the value of memory words just before they are overwritten by stores. Second the performance of two optimistic algorithms is considered. The first executes programs optimistically, but performs protocol events conservatively. The second executes everything optimistically and is similar to Time Warp with lazy message cancellation. Unfortunately, both approaches make parallelsimulation performance worse for the default WWT assumptions. This paper concludes by speculating on the performance of optimistic simulation when simulating (1) target network of details, and (2) on hosts with high message latencies and no synchronization hardware.
A flexible simulator has been developed to simulate a two-level metropolitan area network which uses wormhole routing. To accurately model the nature of wormhole routing, the simulator performs discrete-byte rather th...
详细信息
A flexible simulator has been developed to simulate a two-level metropolitan area network which uses wormhole routing. To accurately model the nature of wormhole routing, the simulator performs discrete-byte rather than discrete-packet simulation. Despite the increased computational workload that this implies, it has been possible to create a simulator with acceptable performance by writing it in Maisie, a parallel discrete-event simulation language. The simulator provides an accurate model of an actual high-speed, source-routing, wormhole network (the Murinet) and is the first such simulator. The paper describes the simulator and reports on the performance of parallel implementations of the simulator on a 24-node IBM SP 2 multicomputer. The parallel implementations yielded reasonable speedups. For instance, on 12 nodes, the conservative algorithm yielded a speed-up of about 6 whereas an optimistic algorithm yielded a speed-up of about 4.
Presents a new approach to perform distributed event driven simulation that we have named the 'deblocking event algorithm'. This algorithm adopts the conservative paradigm, but takes into account the structura...
详细信息
Synchronization is often the dominant cost in conservative parallelsimulation, particularly in simulations of parallel computers, in which low-latency simulated communication requires frequent synchronization. We pre...
详细信息
Synchronization is often the dominant cost in conservative parallelsimulation, particularly in simulations of parallel computers, in which low-latency simulated communication requires frequent synchronization. We present and evaluate local barriers and predictive barrier scheduling, two techniques for reducing synchronization overhead in the simulation of message-passing multicomputers. Local barriers use nearest-neighbor synchronization to reduce waiting time at synchronization points. Predictive barrier scheduling, a novel technique that schedules synchronizations using both compile-time and runtime analysis, reduces the frequency of synchronization operations. In contrast to other work in this area, both techniques reduce synchronization overhead without decreasing the accuracy of network simulation. These techniques were evaluated by comparing their performance to that of periodic global synchronization. Experiments show that local barriers improve performance by up to 24% for communication-bound applications, while predictive barrier scheduling improves performance by up to 65% for applications with long local computation phases. Because the two techniques are complementary, we advocate a combined approach. This work was done in the context of parallel Proteus, a new parallel simulator of message-passing multicomputers.
Over-optimistic execution has long been identified as a major performance bottleneck in Time Warp based parallelsimulation systems. An appropriate throttle or control of optimism can improve performance by reducing t...
详细信息
Over-optimistic execution has long been identified as a major performance bottleneck in Time Warp based parallelsimulation systems. An appropriate throttle or control of optimism can improve performance by reducing the number of rollbacks. However, the design of an appropriate throttle is a difficult task, as correct computations on the critical path may be blocked, thus increasing the overall execution time. In this paper we build a cost model for throttled execution that involves both rollback probability and probability for an event computation being on the critical path. The model can estimate an appropriate size of time window for a throttled execution using statistics collected from the purely optimistic execution. The model is validated by an experimental study with a set of synthetic workloads.
The Utilitarian parallel Simulator (U.P.S.) extends parallelism to the CSIM sequential simulation tool by providing several new modeling constructs. Using conservative synchronization techniques, these constructs auto...
详细信息
The Utilitarian parallel Simulator (U.P.S.) extends parallelism to the CSIM sequential simulation tool by providing several new modeling constructs. Using conservative synchronization techniques, these constructs automatically support time-synchronized communications between CSIM submodels running on different processors. This paper describes extensions to U.P.S. that allow the user to assist U.P.S. by providing additional 'process lookahead,' thereby reducing the frequency of synchronizations. The use and effect on performance of process lookahead is described for several models. In a mobile cellular communications model, the use of process lookahead results in up to a 60% improvement in speedup on 32 nodes of the IBM SP2. A factor of 3 improvement is obtained on a closed queueing network simulation running on 32 nodes of the Intel Paragon.
This paper describes two forms of feedback in the simulation runtime of VHDL circuits that greatly influences performance. While circuit feedback and strongly connected components have been observed and documented as ...
详细信息
This paper describes two forms of feedback in the simulation runtime of VHDL circuits that greatly influences performance. While circuit feedback and strongly connected components have been observed and documented as detrimental influences to conservative parallel discrete event simulation (PDES) efficiency, that influence has never been quantified. Moreover, in this study, the phenomenon of induced feedback [1] was observed to diminish speedup to the same degree as explicit feedback. In this paper the influence of feedback on simulation runtime is analyzed and an O(n) algorithm for its elimination is presented. In addition, a metric for the quantification of feedback is introduced. By measuring feedback, it is possible to balance its influence on simulation runtime with that of other factors (e.g. load balance, number of processors, machine granularity, etc.) through the use of a cost-based partitioning approach. This paper reports significant improvements in runtime for three circuits due to the prevention of feedback using the partitioning algorithm presented. In addition, strong correlation between the feedback metric and conservative parallelsimulation overhead is demonstrated.
暂无评论