Based on a linear ordering of vertices in a directed graph, a linear-time partitioning algorithm for parallel logic simulation is presented. Unlike most other partitioning algorithms, the proposed algorithm preserves circuit concurrency by assigning to processors circuit gates that can be evaluated at about the same time. As a result, the concurrency preserving partitioning (CPP) algorithm can provide better load balancing throughout the period of a parallel simulation. This is especially important when the algorithm is used together with a Time Warp simulation, where a high degree of concurrency can lead to fewer rollbacks and better performance. The algorithm consists of three phases, and three conflicting goals can be considered separately in each phase so as to reduce computational complexity. A parallel gate-level circuit simulator is implemented on an Intel Paragon machine to evaluate the performance of the CPP algorithm. The results are compared with two other partitioning algorithms to show that reasonable speedup may be achieved with the algorithm.
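To illustrate the concurrency-preserving idea, the sketch below levelizes gates by topological depth and spreads each level's gates round-robin across processors, so every processor holds gates that become evaluable at about the same simulated time. This is an assumed simplification for illustration only; the function names and the level-based round-robin assignment are not the paper's exact three-phase CPP procedure.

# Hypothetical sketch of concurrency-preserving partitioning: gates that can be
# evaluated at about the same simulated time (same topological level) are spread
# across processors so every processor has work in every phase.
from collections import defaultdict, deque

def levelize(gates, fanin):
    """Assign each gate a topological level (linear time in gates + edges)."""
    level = {}
    indegree = {g: len(fanin[g]) for g in gates}
    fanout = defaultdict(list)
    for g in gates:
        for src in fanin[g]:
            fanout[src].append(g)
    ready = deque(g for g in gates if indegree[g] == 0)
    while ready:
        g = ready.popleft()
        level[g] = 1 + max((level[s] for s in fanin[g]), default=-1)
        for succ in fanout[g]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return level

def partition(gates, fanin, num_procs):
    """Round-robin the gates of each level across processors to preserve concurrency."""
    level = levelize(gates, fanin)
    by_level = defaultdict(list)
    for g in gates:
        by_level[level[g]].append(g)
    assignment = {}
    for lvl in sorted(by_level):
        for i, g in enumerate(by_level[lvl]):
            assignment[g] = i % num_procs
    return assignment

# Example: a tiny 4-gate circuit partitioned onto 2 processors.
gates = ["g1", "g2", "g3", "g4"]
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g3"]}
print(partition(gates, fanin, 2))   # {'g1': 0, 'g2': 1, 'g3': 0, 'g4': 0}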
Presented is a dynamic load balancing algorithm developed for Clustered Time Warp, a hybrid approach which makes use of Time Warp between clusters of LPs and a sequential mechanism within the clusters. The load balancing algorithm focuses on distributing the load of the simulation evenly among the processors and then tries to reduce interprocessor communications. A triggering technique is used that is based on the throughput of the simulation system. The algorithm was implemented and its performance was measured using two of the largest benchmark digital circuits of the ISCAS '89 series. Results show that by dynamically balancing the load, the throughput was improved by 40-100% when compared to Time Warp.
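As a rough illustration of the triggering idea, the sketch below (with assumed names and thresholds) fires a rebalancing step when committed-event throughput drops, then moves one cluster from the most loaded processor toward the least loaded one; the paper's algorithm additionally weighs interprocessor communication, which is omitted here.

# Illustrative sketch of a throughput-triggered load balancing decision for a
# clustered Time Warp simulator. Names and thresholds are assumptions.
def should_rebalance(prev_throughput, curr_throughput, drop_threshold=0.1):
    """Trigger rebalancing when committed-event throughput drops noticeably."""
    if prev_throughput == 0:
        return False
    return (prev_throughput - curr_throughput) / prev_throughput > drop_threshold

def pick_cluster_to_move(proc_load, cluster_load, cluster_owner):
    """Move one cluster from the busiest processor toward the least busy one."""
    busiest = max(proc_load, key=proc_load.get)
    idlest = min(proc_load, key=proc_load.get)
    candidates = [c for c, p in cluster_owner.items() if p == busiest]
    if not candidates or busiest == idlest:
        return None
    # Smallest cluster on the busiest processor: a cheap, conservative move.
    cluster = min(candidates, key=lambda c: cluster_load[c])
    return cluster, busiest, idlest

# Example: processor 0 carries two clusters and most of the load.
proc_load = {0: 120, 1: 40}
cluster_load = {"A": 70, "B": 50, "C": 40}
cluster_owner = {"A": 0, "B": 0, "C": 1}
if should_rebalance(prev_throughput=1000, curr_throughput=850):
    print(pick_cluster_to_move(proc_load, cluster_load, cluster_owner))   # ('B', 0, 1)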
We present an execution model for parallel simulation of a distributed shared memory architecture. The model captures the processor-memory interaction and abstracts the memory subsystem. Using this model, we show how parallel, on-line, partially-ordered memory traces can be correctly predicted without interacting with the memory subsystem. We also outline a parallel optimistic memory simulator that uses these traces, finds a global order among all events, and returns correct data and timing to each processor. A first evaluation of the amount of concurrency that our model can extract for an ideal multiprocessor shows that processors may execute relatively long instruction sequences without violating the causality constraints. However, parallel simulation efficiency is highly dependent on the memory consistency model and the application characteristics.
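The ordering step can be pictured with a small sketch: each processor emits a timestamped, locally ordered memory-reference trace, and the memory simulator merges the traces into a single global order. The record layout below is an assumption for illustration and leaves out the optimistic rollback machinery and the consistency model.

# Minimal sketch of imposing a global order on per-processor memory traces.
# Each processor emits (timestamp, processor_id, op, address) records in local
# order; the memory simulator merges them by timestamp.
import heapq

trace_p0 = [(3, 0, "LD", 0x100), (7, 0, "ST", 0x104)]
trace_p1 = [(2, 1, "ST", 0x100), (9, 1, "LD", 0x104)]

global_order = list(heapq.merge(trace_p0, trace_p1))
for ts, pid, op, addr in global_order:
    print(f"t={ts} P{pid} {op} {hex(addr)}")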
Discrete-event simulation is an important tool used for the performance evaluation of parallel systems. The space of tradeoffs is large, however, when attempting to balance model fidelity and simulation execution time. This paper describes a simulator - TAPS (Threaded Application Parallel System Simulator) - that, in the context of threaded parallel computations, provides a spectrum of possibilities in this tradeoff space. TAPS is specifically designed to be parallelized; we discuss some crucial considerations regarding its parallelization.
The use of optimistic algorithms to perform parallel simulations of parallel machines is examined. It is first shown that one can make optimistic algorithms work correctly even with Wisconsin Wind Tunnel's direct execution of program executables. Process registers (integer, floating-point, and condition codes) are checkpointed, and executable editing is used to log the value of memory words just before they are overwritten by stores. Second, the performance of two optimistic algorithms is considered. The first executes programs optimistically, but performs protocol events conservatively. The second executes everything optimistically and is similar to Time Warp with lazy message cancellation. Unfortunately, both approaches make parallel simulation performance worse for the default WWT assumptions. This paper concludes by speculating on the performance of optimistic simulation when simulating (1) target networks in detail, and (2) on hosts with high message latencies and no synchronization hardware.
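The checkpoint-and-log mechanism can be sketched as follows, with register checkpointing reduced to copying a small dictionary and store logging recording a word's old value just before it is overwritten; the class and method names are illustrative and do not model WWT's executable editing of direct-executed binaries.

# Hedged sketch of the store-logging idea: before a simulated store overwrites a
# memory word, record its old value so an optimistic rollback can restore it.
class OptimisticProcessState:
    def __init__(self):
        self.regs = {"r0": 0, "r1": 0}
        self.mem = {}
        self.checkpoints = []        # stack of (register snapshot, store log)

    def checkpoint(self):
        self.checkpoints.append((dict(self.regs), []))

    def store(self, addr, value):
        # Log the word's previous value just before it is overwritten.
        _, store_log = self.checkpoints[-1]
        store_log.append((addr, self.mem.get(addr)))
        self.mem[addr] = value

    def rollback(self):
        regs_snap, store_log = self.checkpoints.pop()
        for addr, old in reversed(store_log):
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.regs = regs_snap

# Example: speculate past a checkpoint, then undo the speculative work.
p = OptimisticProcessState()
p.checkpoint()
p.store(0x100, 42)
p.regs["r0"] = 7
p.rollback()
print(p.mem, p.regs)   # {} {'r0': 0, 'r1': 0}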
A flexible simulator has been developed to simulate a two-level metropolitan area network which uses wormhole routing. To accurately model the nature of wormhole routing, the simulator performs discrete-byte rather than discrete-packet simulation. Despite the increased computational workload that this implies, it has been possible to create a simulator with acceptable performance by writing it in Maisie, a parallel discrete-event simulation language. The simulator provides an accurate model of an actual high-speed, source-routing, wormhole network (the Murinet) and is the first such simulator. The paper describes the simulator and reports on the performance of parallel implementations of the simulator on a 24-node IBM SP 2 multicomputer. The parallel implementations yielded reasonable speedups. For instance, on 12 nodes, the conservative algorithm yielded a speed-up of about 6 whereas an optimistic algorithm yielded a speed-up of about 4.
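A back-of-the-envelope view of why byte-level modeling matters: under wormhole routing the header reserves each link as it advances and the remaining bytes stream behind it in pipeline fashion, so (assuming one byte per link per time step, a simplification) the latency grows with hops plus packet length rather than their product.

def wormhole_latency(num_hops, packet_bytes, byte_time=1):
    """Header traverses num_hops links, then the rest of the packet streams out."""
    return (num_hops + packet_bytes - 1) * byte_time

print(wormhole_latency(num_hops=3, packet_bytes=16))   # 18 time steps, not 3 * 16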
Presents a new approach to perform distributed event driven simulation that we have named the 'deblocking event algorithm'. This algorithm adopts the conservative paradigm, but takes into account the structura...
Synchronization is often the dominant cost in conservative parallel simulation, particularly in simulations of parallel computers, in which low-latency simulated communication requires frequent synchronization. We present and evaluate local barriers and predictive barrier scheduling, two techniques for reducing synchronization overhead in the simulation of message-passing multicomputers. Local barriers use nearest-neighbor synchronization to reduce waiting time at synchronization points. Predictive barrier scheduling, a novel technique that schedules synchronizations using both compile-time and runtime analysis, reduces the frequency of synchronization operations. In contrast to other work in this area, both techniques reduce synchronization overhead without decreasing the accuracy of network simulation. These techniques were evaluated by comparing their performance to that of periodic global synchronization. Experiments show that local barriers improve performance by up to 24% for communication-bound applications, while predictive barrier scheduling improves performance by up to 65% for applications with long local computation phases. Because the two techniques are complementary, we advocate a combined approach. This work was done in the context of Parallel Proteus, a new parallel simulator of message-passing multicomputers.
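The nearest-neighbor idea behind local barriers can be sketched as follows: a node only needs to wait for the simulated-time progress of the nodes that can actually send it a message within the link latency, rather than for a global barrier. The function and parameter names are illustrative, not Parallel Proteus' actual interface.

# Sketch of nearest-neighbor synchronization: a node may advance its local
# simulated time up to the earliest message its neighbors could still send.
def safe_advance_time(neighbor_times, link_latency):
    """Bound on local simulated time given neighbors' reported times."""
    if not neighbor_times:
        return float("inf")
    return min(neighbor_times) + link_latency

# With neighbors at simulated times 100 and 130 and a 20-cycle link latency,
# this node can safely advance to time 120 without a global synchronization.
print(safe_advance_time(neighbor_times=[100, 130], link_latency=20))   # 120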
Over-optimistic execution has long been identified as a major performance bottleneck in Time Warp based parallel simulation systems. An appropriate throttle or control of optimism can improve performance by reducing the number of rollbacks. However, the design of an appropriate throttle is a difficult task, as correct computations on the critical path may be blocked, thus increasing the overall execution time. In this paper we build a cost model for throttled execution that involves both the rollback probability and the probability of an event computation being on the critical path. The model can estimate an appropriate time window size for a throttled execution using statistics collected from the purely optimistic execution. The model is validated by an experimental study with a set of synthetic workloads.
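A sketch of the window-selection step: for each candidate window size, combine an expected rollback cost (which grows with the window) with an expected cost of blocking critical-path events (which grows as the window shrinks), and pick the minimizer. The cost terms below are made-up stand-ins for the statistics the paper collects from a purely optimistic run, not the paper's actual cost model.

# Hedged sketch of choosing a throttle window from two competing cost terms.
def expected_cost(window, rollback_prob, critical_prob, rollback_cost, block_cost):
    # Larger windows admit more speculative events, so more rollback work;
    # smaller windows block more events, some of which are on the critical path.
    rollback_term = rollback_prob * window * rollback_cost
    blocking_term = critical_prob * (1.0 / window) * block_cost
    return rollback_term + blocking_term

def best_window(candidates, **stats):
    return min(candidates, key=lambda w: expected_cost(w, **stats))

print(best_window([1, 2, 4, 8, 16, 32],
                  rollback_prob=0.2, critical_prob=0.3,
                  rollback_cost=5.0, block_cost=50.0))   # 4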
The Utilitarian Parallel Simulator (U.P.S.) extends parallelism to the CSIM sequential simulation tool by providing several new modeling constructs. Using conservative synchronization techniques, these constructs automatically support time-synchronized communication between CSIM submodels running on different processors. This paper describes extensions to U.P.S. that allow the user to assist U.P.S. by providing additional 'process lookahead,' thereby reducing the frequency of synchronizations. The use of process lookahead and its effect on performance are described for several models. In a mobile cellular communications model, the use of process lookahead results in up to a 60% improvement in speedup on 32 nodes of the IBM SP2. A factor of 3 improvement is obtained on a closed queueing network simulation running on 32 nodes of the Intel Paragon.
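The effect of process lookahead can be sketched in a few lines: each incoming channel's clock plus the lookahead promised by its sender bounds the earliest message that can still arrive, so more lookahead raises the horizon up to which events can be processed safely and synchronization happens less often. The names below are illustrative, not the U.P.S./CSIM interface.

# Minimal sketch of how process lookahead raises the safe simulation horizon in
# a conservative simulator.
def safe_horizon(channel_clocks, lookahead):
    """Earliest time a future message could carry; events before it are safe."""
    return min(clock + lookahead.get(ch, 0.0) for ch, clock in channel_clocks.items())

# With 10 units of lookahead promised on each channel, the receiver can process
# events up to time 52 instead of time 42, synchronizing less often.
channel_clocks = {"cellA": 42.0, "cellB": 47.0}
lookahead = {"cellA": 10.0, "cellB": 10.0}
print(safe_horizon(channel_clocks, lookahead))   # 52.0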