In this paper we present the application of an approach for the performance prediction of message passing programs to a PVM code implementing an iterative solver based on the Successive OverRelaxation (SOR) method. The app...
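Since the abstract is truncated, only the solver it names can be illustrated. The sketch below shows one sequential SOR sweep on a 2-D grid; the grid size, relaxation factor, and sweep count are placeholder assumptions, and the PVM message-passing decomposition the paper actually studies is not shown.

```cpp
// Minimal sequential sketch of Successive OverRelaxation (SOR); all
// parameters are placeholders, and the parallel PVM version is omitted.
#include <vector>
#include <iostream>

int main() {
    const int n = 8;              // grid dimension (placeholder)
    const double omega = 1.5;     // relaxation factor, 1 < omega < 2 (placeholder)
    std::vector<std::vector<double>> u(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) { u[i][0] = 1.0; u[i][n - 1] = 1.0; }  // boundary values

    for (int sweep = 0; sweep < 100; ++sweep)          // fixed sweep count for brevity
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j) {
                double gauss_seidel = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                            + u[i][j - 1] + u[i][j + 1]);
                u[i][j] += omega * (gauss_seidel - u[i][j]);   // over-relaxed update
            }

    std::cout << "u[n/2][n/2] = " << u[n / 2][n / 2] << '\n';
}
```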
The paper examines issues that recur in consideration of simulation time-stamps, in the context of building very large simulation models from components developed by different groups at different times. A key problem here is "safety", loosely defined to mean that unintended model behavior does not occur due to unpredictable behavior of timestamp generation and comparisons. We revisit the problems of timestamp format and simultaneity, and then turn to the new problem of timestamp interoperability. We describe how a C++ simulation kernel can support the concurrent evaluation of submodels that internally use heterogeneous timestamps, and evaluate the execution time costs of doing so. We find that use of a safe timestamp format that explicitly allows different time scales costs less than 10% over a stock 64-bit integer format, whereas support for completely heterogeneous timestamps can cost as much as 50% in execution speed.
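As a rough illustration of the "safe" format described above, the sketch below shows a timestamp that carries an explicit time scale and normalizes before comparison. The struct layout and names are assumptions for illustration, not the paper's actual C++ kernel interface.

```cpp
// Hypothetical timestamp with an explicit time scale; comparisons normalize
// the coarser operand to the finer scale so mixed scales compare safely.
#include <algorithm>
#include <cstdint>
#include <iostream>

struct ScaledTime {
    std::int64_t ticks;   // integer tick count (avoids floating-point drift)
    int          scale;   // power-of-ten exponent: 1 tick == 10^scale seconds

    std::int64_t toScale(int target) const {
        std::int64_t t = ticks;
        for (int s = scale; s > target; --s) t *= 10;   // coarser -> finer
        return t;                                        // overflow unchecked in this sketch
    }

    friend bool operator<(const ScaledTime& a, const ScaledTime& b) {
        int fine = std::min(a.scale, b.scale);
        return a.toScale(fine) < b.toScale(fine);
    }
};

int main() {
    ScaledTime a{1500, -3};  // 1.5 s expressed in milliseconds
    ScaledTime b{2,    0};   // 2 s expressed in seconds
    std::cout << std::boolalpha << (a < b) << '\n';  // true: 1.5 s < 2 s
}
```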
We have parallelized the Iowa Logic Simulator, a gate-level, fine-grained discrete-event simulator, by employing an optimistic algorithm framework based on a global event queue implemented as a parallel heap. The original code and the basic data structures of the serial simulator remained unchanged. Wrapper data structures for the logical processes (gates) and the events are created to allow roll-backs, all the earliest events at each logical process are stored in the parallel heap, and multiple earliest events are simulated repeatedly by invoking the simulate function of the serial simulator. The parallel heap allowed extraction of hundreds to thousands of earliest events in each queue access. On a bus-based shared-memory multiprocessor, simulation of synthetic circuits with 250,000 gates yielded speedups of 3.3 with five processors compared to the serial execution time of the Iowa Logic Simulator, and limited the number of roll-backs to within 2,000. The basic steps of the parallelization are well-defined and general enough to be employable on other discrete-event simulators.
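The wrapper idea can be sketched as follows: the serial gate state is checkpointed before each event so it can be rolled back, while the earliest events are drawn from a shared queue standing in for the parallel heap. All names (GateState, simulate_gate, EventWrapper) are assumptions for illustration, not the Iowa Logic Simulator's real interfaces.

```cpp
// Illustrative wrapper around an unchanged serial simulate() to permit rollback.
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct GateState { int output = 0; };              // stand-in for the serial gate state
void simulate_gate(GateState& g, int input) {      // stand-in for the serial simulate()
    g.output = !input;                              // e.g. an inverter
}

struct EventWrapper {
    std::int64_t timestamp;
    int          gate_id;
    int          input;
    GateState    saved_state;                       // checkpoint taken before processing
    bool operator>(const EventWrapper& o) const { return timestamp > o.timestamp; }
};

int main() {
    std::vector<GateState> gates(4);
    // Min-ordered queue standing in for the parallel heap of earliest events.
    std::priority_queue<EventWrapper, std::vector<EventWrapper>, std::greater<>> heap;
    heap.push({10, 0, 1, {}});
    heap.push({5,  1, 0, {}});

    std::vector<EventWrapper> processed;            // kept so straggler events can roll back
    while (!heap.empty()) {
        EventWrapper ev = heap.top(); heap.pop();
        ev.saved_state = gates[ev.gate_id];         // checkpoint, enables rollback
        simulate_gate(gates[ev.gate_id], ev.input); // reuse the serial simulator unchanged
        processed.push_back(ev);
    }
    // Rollback (e.g. on a straggler): restore the checkpointed states in reverse order.
    for (auto it = processed.rbegin(); it != processed.rend(); ++it)
        gates[it->gate_id] = it->saved_state;
}
```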
We discuss new synchronization algorithms for parallel and distributed discrete event simulations (PDES) which exploit the capabilities and behavior of the underlying communications network. Previous work in this area has assumed the network to be a black box which provides a one-to-one, reliable and in-order message passing paradigm. In our work, we utilize the broadcast capability of the ubiquitous Ethernet for synchronization computations, and both unreliable and reliable protocols for message passing, to achieve more efficient communications between the participating systems. We describe two new algorithms for computation of a distributed snapshot of global reduction operations on monotonically increasing values. The algorithms require O(N) messages (where N is the number of systems participating in the snapshot) in the normal case. We specifically target the use of this algorithm for distributed discrete event simulations to determine a global lower bound on time-stamp (LETS), but expect the algorithm has applicability outside the simulation community.
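The core reduction can be pictured as below: every node broadcasts its local value once, and each node reduces the N values it hears to a lower bound. This is only a sketch of the general idea under that assumption; it omits the handling of messages in transit that the actual algorithms must account for, and all names are illustrative.

```cpp
// Sketch: reduce N broadcast reports to a global lower bound on time-stamp.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Value reported by one node: its next local event time (monotonically increasing).
struct Report { int node; std::int64_t local_min; };

std::int64_t reduce_lets(const std::vector<Report>& heard, std::size_t n_nodes) {
    if (heard.size() < n_nodes) return -1;          // snapshot incomplete, keep waiting
    std::int64_t lets = heard.front().local_min;
    for (const Report& r : heard) lets = std::min(lets, r.local_min);
    return lets;                                    // safe bound once all N reports arrive
}

int main() {
    // One broadcast per node => O(N) messages observed by every participant.
    std::vector<Report> heard = {{0, 42}, {1, 37}, {2, 55}};
    std::cout << "LETS = " << reduce_lets(heard, 3) << '\n';   // prints 37
}
```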
Real-time distributed simulations, such as on-line gaming or military training simulations, are normally considered to be non-deterministic. Analysis of these simulations is therefore difficult when relying solely on logging and runtime observations. This paper explores an approach for removing one major source of non-determinism in these simulations, thereby allowing repeatable executions. Specifically, we use a synchronization protocol to ensure repeatable delivery of messages. Through limited instrumentation of the simulation code, we maintain a virtual time clock, by which message delivery is governed. The additional overhead imposed by the scheme is shown to be reasonable, although further reductions in this overhead are anticipated. The results are demonstrated in the context of a simple combat model, whose only source of non-determinism is communications latency. The simulation is shown to be made repeatable, and the perturbation of the execution, compared to the non-repeatable execution, is small. The paper is one step in bridging the gap between the traditional PDES perspective and the real-time simulation world.
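A minimal sketch of the delivery rule, assuming a fixed virtual latency: each message is stamped with its virtual send time plus that latency, and the receiver only delivers messages whose virtual delivery time has been reached, so the order is identical across runs. The class and constant names are illustrative, not the paper's protocol.

```cpp
// Delivery governed by a virtual time clock for repeatable message ordering.
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Message {
    std::int64_t deliver_at;   // virtual send time + fixed virtual latency
    std::string  payload;
    bool operator>(const Message& o) const { return deliver_at > o.deliver_at; }
};

class RepeatableChannel {
    std::priority_queue<Message, std::vector<Message>, std::greater<>> pending_;
public:
    void send(std::int64_t virtual_now, std::string payload) {
        constexpr std::int64_t kVirtualLatency = 50;           // fixed, so every run agrees
        pending_.push({virtual_now + kVirtualLatency, std::move(payload)});
    }
    // Deliver only messages whose virtual delivery time has been reached.
    std::vector<Message> receive_up_to(std::int64_t virtual_now) {
        std::vector<Message> out;
        while (!pending_.empty() && pending_.top().deliver_at <= virtual_now) {
            out.push_back(pending_.top());
            pending_.pop();
        }
        return out;
    }
};

int main() {
    RepeatableChannel ch;
    ch.send(0,  "fire");
    ch.send(10, "move");
    for (const Message& m : ch.receive_up_to(100))
        std::cout << m.deliver_at << ": " << m.payload << '\n';  // same order every run
}
```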
In a large-scale distributed simulation with thousands of dynamic objects, efficient communication of data among these objects is an important issue. The broadcasting mechanism specified by the Distributed Interactive Simulation (DIS) standards is not suitable for large-scale distributed simulations. In the High Level Architecture (HLA) paradigm, the Runtime Infrastructure (RTI) provides a set of services, such as data distribution management (DDM), among federates. The goal of the DDM module in the RTI is to make data communication more efficient by sending data only to those federates that need it, as opposed to the broadcasting mechanism employed by DIS. Several DDM schemes have appeared in the literature. We discuss grid-based DDM and develop a DDM model that uses grids for matching the publishing/subscription regions and for data filtering. We show that an appropriate choice of the grid-cell size is crucial to obtaining good performance. We develop an analytical model and derive a formula for identifying the optimal cell size in grid-based DDM.
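The grid-matching mechanism itself can be sketched simply: each publication or subscription region is mapped to the set of cells it overlaps, and a publisher matches a subscriber if they share at least one cell. The 1-D simplification and the names below are assumptions made for brevity; the example also hints at why cell size matters.

```cpp
// Grid-based region matching: overlap of covered cells decides data routing.
#include <iostream>
#include <set>

struct Region { double lo, hi; };                       // 1-D region for simplicity

std::set<int> cells_of(const Region& r, double cell_size) {
    std::set<int> cells;
    for (int c = static_cast<int>(r.lo / cell_size);
         c <= static_cast<int>(r.hi / cell_size); ++c)
        cells.insert(c);
    return cells;
}

bool regions_match(const Region& pub, const Region& sub, double cell_size) {
    std::set<int> p = cells_of(pub, cell_size), s = cells_of(sub, cell_size);
    for (int c : p) if (s.count(c)) return true;        // any shared cell => send data
    return false;
}

int main() {
    Region pub{0.0, 4.0}, sub{9.0, 12.0};
    // A coarse grid reports a spurious match; a finer grid filters it out, at the
    // cost of more cells to track -- hence the optimal-cell-size question above.
    std::cout << regions_match(pub, sub, 10.0) << ' '
              << regions_match(pub, sub, 2.0)  << '\n';  // prints "1 0"
}
```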
ISBN (print): 3540678794
Parallel cluster computing projects use a large number of commodity PCs to provide cost-effective computational power to run parallel applications. Because properly load-balanced distributed parallel applications tend to send messages synchronously, minimizing blocking is as crucial a requirement for the network fabric as high bandwidth and low latency. We consider the selection of an optimal, commodity-based interconnect network technology and topology to provide high bandwidth, low latency, and reliable delivery. Since our network design goal is to facilitate the performance of real applications, we evaluated the performance of Myrinet and Gigabit Ethernet technologies in the context of working algorithms, using modeling and simulation tools developed for this work.
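The kind of comparison involved can be pictured with a minimal analytical model (not the authors' simulation tools): message transfer time as latency plus size over bandwidth, evaluated for two candidate fabrics. The parameter values below are placeholders, not measurements from the paper, and blocking effects are not modeled.

```cpp
// Toy latency/bandwidth model for comparing two interconnect fabrics.
#include <iostream>

struct Fabric {
    const char* name;
    double latency_us;        // one-way latency in microseconds
    double bandwidth_MBps;    // sustained bandwidth in MB/s
};

double transfer_time_us(const Fabric& f, double bytes) {
    return f.latency_us + bytes / f.bandwidth_MBps;   // bytes / (MB/s) == microseconds
}

int main() {
    Fabric a{"fabric A", 10.0, 120.0};   // placeholder numbers only
    Fabric b{"fabric B", 60.0, 100.0};
    for (double bytes : {64.0, 4096.0, 1048576.0})
        std::cout << bytes << " B: " << transfer_time_us(a, bytes) << " us vs "
                  << transfer_time_us(b, bytes) << " us\n";
}
```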
In this paper, we introduce a new Time Warp system called ROSS: Rensselaer's Optimistic Simulation System. ROSS is an extremely modular kernel that is capable of achieving event rates as high as 1,250,000 events per second when simulating a wireless telephone network model (PCS) on a quad-processor PC server. In a head-to-head comparison, we observe that ROSS outperforms the Georgia Tech Time Warp (GTW) system on the same computing platform by up to 180%. ROSS requires only a small, constant number of memory buffers beyond what the sequential simulation needs, for a constant number of processors. The driving force behind this high performance and low memory utilization is the coupling of an efficient pointer-based implementation framework, Fujimoto's fast GVT algorithm for shared-memory multiprocessors, reverse computation, and the introduction of Kernel Processes (KPs). KPs lower fossil collection overheads by aggregating processed event lists. This allows fossil collection to be done more frequently, thus lowering the overall memory needed to sustain stable, efficient parallel execution.
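The KP aggregation idea can be sketched as follows: several LPs share one processed-event list held by their KP, so fossil collection walks one list per KP rather than one per LP and frees everything older than GVT. The structures and names below are illustrative assumptions, not ROSS's actual implementation.

```cpp
// Sketch of a Kernel Process aggregating processed events for fossil collection.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>

struct Event { std::int64_t timestamp; int lp_id; };

struct KernelProcess {
    std::list<Event> processed;                       // aggregated across the KP's LPs

    void record(const Event& ev) { processed.push_back(ev); }  // kept in processed order

    // Reclaim every event older than GVT; such events can never be rolled back.
    std::size_t fossil_collect(std::int64_t gvt) {
        std::size_t freed = 0;
        while (!processed.empty() && processed.front().timestamp < gvt) {
            processed.pop_front();                    // a real kernel returns buffers to a free list
            ++freed;
        }
        return freed;
    }
};

int main() {
    KernelProcess kp;                                 // one KP standing in for several LPs
    for (std::int64_t t : {5, 9, 14, 21})
        kp.record({t, static_cast<int>(t) % 3});
    std::cout << "freed " << kp.fossil_collect(/*gvt=*/12) << " events\n";  // freed 2
}
```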
The evaluation of network performance under real application loads is carried out by detailed, time-intensive, and resource-intensive simulations. Moreover, the use of ILP (instruction-level parallel) processors in cc-N...