A flexible simulator has been developed to simulate a two-level metropolitan area network which uses wormhole routing. To accurately model the nature of wormhole routing, the simulator performs discrete-byte rather than discrete-packet simulation. Despite the increased computational workload that this implies, it has been possible to create a simulator with acceptable performance by writing it in Maisie, a parallel discrete-event simulation language. The simulator provides an accurate model of an actual high-speed, source-routing, wormhole network (the Myrinet) and is the first such simulator. The paper describes the simulator and reports on the performance of parallel implementations of the simulator on a 24-node IBM SP2 multicomputer. The parallel implementations yielded reasonable speedups. For instance, on 12 nodes, the conservative algorithm yielded a speedup of about 6, whereas an optimistic algorithm yielded a speedup of about 4.
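To put the reported speedups in context, they can be converted to parallel efficiency (speedup divided by node count). The sketch below is only that arithmetic applied to the approximate figures quoted in the abstract; the function name `parallel_efficiency` is an illustrative choice and not something taken from the paper.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return speedup per node, i.e. the fraction of ideal linear speedup."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Approximate figures quoted in the abstract for 12 nodes.
conservative = parallel_efficiency(6.0, 12)
optimistic = parallel_efficiency(4.0, 12)

print(f"conservative: {conservative:.2f}, optimistic: {optimistic:.2f}")
```

On these rounded figures the conservative run reaches about half of ideal linear speedup and the optimistic run about a third.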
Presents a new approach to perform distributed event driven simulation that we have named the 'deblocking event algorithm'. This algorithm adopts the conservative paradigm, but takes into account the structura...
Synchronization is often the dominant cost in conservative parallel simulation, particularly in simulations of parallel computers, in which low-latency simulated communication requires frequent synchronization. We present and evaluate local barriers and predictive barrier scheduling, two techniques for reducing synchronization overhead in the simulation of message-passing multicomputers. Local barriers use nearest-neighbor synchronization to reduce waiting time at synchronization points. Predictive barrier scheduling, a novel technique that schedules synchronizations using both compile-time and runtime analysis, reduces the frequency of synchronization operations. In contrast to other work in this area, both techniques reduce synchronization overhead without decreasing the accuracy of network simulation. These techniques were evaluated by comparing their performance to that of periodic global synchronization. Experiments show that local barriers improve performance by up to 24% for communication-bound applications, while predictive barrier scheduling improves performance by up to 65% for applications with long local computation phases. Because the two techniques are complementary, we advocate a combined approach. This work was done in the context of parallel Proteus, a new parallel simulator of message-passing multicomputers.
Over-optimistic execution has long been identified as a major performance bottleneck in Time Warp based parallel simulation systems. An appropriate throttle or control of optimism can improve performance by reducing the number of rollbacks. However, the design of an appropriate throttle is a difficult task, as correct computations on the critical path may be blocked, thus increasing the overall execution time. In this paper we build a cost model for throttled execution that involves both the rollback probability and the probability that an event computation lies on the critical path. The model can estimate an appropriate size of time window for a throttled execution using statistics collected from the purely optimistic execution. The model is validated by an experimental study with a set of synthetic workloads.
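As a rough illustration of what a time-window throttle does, the sketch below lets a logical process execute only those pending events whose timestamps fall below a bound of GVT plus the window size. Measuring the window from global virtual time (GVT) is an assumption made here for illustration, as are the names `throttled_batch`, `gvt`, and `window`; the abstract does not say how the window is anchored, and the cost model that picks the window size is not reproduced.

```python
import heapq


def throttled_batch(pending, gvt, window):
    """Pop and return events with timestamp < gvt + window (assumed bound).

    `pending` is a heap of (timestamp, payload) tuples; events at or beyond
    the bound stay queued until the bound advances.
    """
    bound = gvt + window
    runnable = []
    while pending and pending[0][0] < bound:
        runnable.append(heapq.heappop(pending))
    return runnable


events = [(12.0, "c"), (3.0, "a"), (7.5, "b"), (20.0, "d")]
heapq.heapify(events)

# With gvt=2.0 and window=6.0 the bound is 8.0.
batch = throttled_batch(events, gvt=2.0, window=6.0)
print(batch)
```

A small window blocks more events, which lowers rollback risk but may stall work on the critical path; a large window approaches unthrottled Time Warp. That trade-off is what the cost model in the abstract is meant to balance.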
The Utilitarian parallel Simulator (U.P.S.) extends parallelism to the CSIM sequential simulation tool by providing several new modeling constructs. Using conservative synchronization techniques, these constructs automatically support time-synchronized communications between CSIM submodels running on different processors. This paper describes extensions to U.P.S. that allow the user to assist U.P.S. by providing additional 'process lookahead,' thereby reducing the frequency of synchronizations. The use of process lookahead and its effect on performance are described for several models. In a mobile cellular communications model, the use of process lookahead results in up to a 60% improvement in speedup on 32 nodes of the IBM SP2. A factor of 3 improvement is obtained on a closed queueing network simulation running on 32 nodes of the Intel Paragon.
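The effect of lookahead on synchronization frequency can be seen in the generic conservative rule that a process may safely advance to the minimum, over its input channels, of the sender's clock plus that sender's lookahead. The sketch below shows that general rule only; it is not the U.P.S. or CSIM API, and `safe_horizon`, `channel_clocks`, and `lookahead` are hypothetical names.

```python
def safe_horizon(channel_clocks, lookahead):
    """Latest simulation time a process can reach without risking a
    causality error, given each sender's clock and promised lookahead."""
    return min(clock + lookahead[src] for src, clock in channel_clocks.items())


clocks = {"A": 10.0, "B": 14.0}

# Small lookahead on A: A limits progress to 12.0.
small = safe_horizon(clocks, {"A": 2.0, "B": 0.5})

# User-supplied extra lookahead on A moves the horizon out to 14.5.
large = safe_horizon(clocks, {"A": 5.0, "B": 0.5})

print(small, large)
```

A further horizon means the process can execute more local events before it must wait for its neighbors again, which is how added lookahead reduces the number of synchronizations.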
This paper describes two forms of feedback in the simulation runtime of VHDL circuits that greatly influence performance. While circuit feedback and strongly connected components have been observed and documented as detrimental to conservative parallel discrete event simulation (PDES) efficiency, that influence has never been quantified. Moreover, in this study, the phenomenon of induced feedback [1] was observed to diminish speedup to the same degree as explicit feedback. In this paper the influence of feedback on simulation runtime is analyzed and an O(n) algorithm for its elimination is presented. In addition, a metric for the quantification of feedback is introduced. By measuring feedback, it is possible to balance its influence on simulation runtime with that of other factors (e.g. load balance, number of processors, machine granularity, etc.) through the use of a cost-based partitioning approach. This paper reports significant improvements in runtime for three circuits due to the prevention of feedback using the partitioning algorithm presented. In addition, strong correlation between the feedback metric and conservative parallel simulation overhead is demonstrated.
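Explicit feedback of the kind mentioned above shows up as cycles in the directed graph of circuit elements, which can be located by finding strongly connected components. The sketch below uses Tarjan's algorithm for that purpose; it is a standard technique swapped in for illustration and is not the O(n) elimination algorithm or the feedback metric from the paper. The graph representation (a dict from node to successor list) and the names `strongly_connected_components` and `feedback_groups` are assumptions.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to a list of successors."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                low[node] = min(low[node], low[succ])
            elif succ in on_stack:
                low[node] = min(low[node], index[succ])
        if low[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def feedback_groups(graph):
    """Components that contain a cycle (size > 1, or a self-loop)."""
    return [
        sorted(c)
        for c in strongly_connected_components(graph)
        if len(c) > 1 or c[0] in graph.get(c[0], ())
    ]


# Hypothetical netlist: g1 -> g2 -> g3 -> g1 forms a loop; g4 is feed-forward.
circuit = {"g1": ["g2"], "g2": ["g3"], "g3": ["g1", "g4"], "g4": []}
loops = feedback_groups(circuit)
print(loops)
```

Keeping every element of such a group on one processor is one simple way to avoid cross-processor feedback, though the recursion here limits the sketch to modest graph depths.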
One of the promises of parallelized discrete-event simulation is that it might provide significant speedups over sequential simulation. In reality, high performance cannot be achieved unless the system is fine-tuned to balance computation, communication, and synchronization requirements. In this paper, we discuss our experiments in automated load balancing using the SPEEDES simulation framework. Specifically, we examine three mapping algorithms that use run-time measurements. Using simulation models of queuing networks and the National Airspace System, we investigate (i) the use of run-time data to guide mapping, (ii) the utility of considering communication costs in a mapping algorithm, (iii) the degree to which computational 'hot-spots' ought to be broken up in the linearization, and (iv) the relative execution costs of the different algorithms. We compare the performance of the three algorithms using results from the Intel Paragon.
The simulation of computational fluid dynamics problems in two and more dimensions involves computations of multiple degrees of freedom, such as the components of velocity, which are an obvious source of parallelism. ...
A load distribution system is proposed to enable a single Time Warp program to execute in the background, spreading over a collection of possibly heterogeneous workstations (including multiprocessor hosts) and utilizing whatever otherwise unused CPU cycles are available. The system uses a simple processor allocation policy to dynamically add or delete hosts from the set of processors utilized by the Time Warp program during its execution. A load balancing algorithm allocates logical processes (LPs) to processors, taking into account other computations executing on the host from the system or other user applications. A clustering mechanism groups collections of logical processes together, reducing process migration overheads and helping to retain locality of communication for simulations containing a large number of LPs. An initial prototype implementation of the load distribution system is described that executes on a homogeneous network of Silicon Graphics workstations. Initial experiments indicate this approach shows promise in enabling efficient execution of Time Warp programs 'in background' on distributed computing platforms.