In this paper we discuss new synchronization algorithms for parallel and distributed Discrete Event simulations (PDES) which exploit the capabilities and behavior of the underlying communications network. Previous wor...
详细信息
In this paper we discuss new synchronization algorithms for parallel and distributed Discrete Event simulations (PDES) which exploit the capabilities and behavior of the underlying communications network. Previous work in this area has assumed the network to be a Black Box which provides a one-to-one, reliable and in-order message passing paradigm. In our work, we utilize the Broadcast capability of the ubiquitous Ethernet for synchronization computations, and both unreliable and reliable protocols for message passing, to achieve more efficient communications between the participating systems. We describe two new algorithms for computation of a distributed snapshot of global reduction operations on monotonically increasing values. The algorithms require O(N) messages (where N is the number of systems participating in the snapshot) in the normal case. We specifically target the use of this algorithm for distributed discrete event simulations to determine a global lower bound on time-stamp (LBTS), but expect the algorithm has applicability outside the simulation community.
This paper investigates issues concerning federations of sequential and/or parallel simulators. An approach is proposed for creating federated simulations by defining a global conceptual model of the entire simulation...
详细信息
This paper investigates issues concerning federations of sequential and/or parallel simulators. An approach is proposed for creating federated simulations by defining a global conceptual model of the entire simulation, and then mapping individual entities of the conceptual model to implementations within individual federates. Proxy entities are defined as a means for linking entities that are mapped to different federates. Using this approach, an implementation of a federation of optimistic simulators is examined. Issues concerning the adaptation of optimistic simulators to a federated system are discussed. The performance of the federated system utilizing runtime infrastructure (RTI) software executing on a shared memory multiprocessor (SMP) is compared with a native (non-federated) SMP-based optimistic parallel simulator. It is demonstrated that a well designed federated simulation system can yield performance comparable to a native, parallelsimulation engine, but important implementation issues must be properly addressed.
Rapid progress in the design of fast CPU chips has outstripped progress in memory and cache performance. Optimistic algorithms would seem to be more vulnerable to poor memory performance because they require extra mem...
详细信息
Rapid progress in the design of fast CPU chips has outstripped progress in memory and cache performance. Optimistic algorithms would seem to be more vulnerable to poor memory performance because they require extra memory for state saving and anti-messages. We examine the performance of both optimistic and conservative protocols in controlled experiments to evaluate the effects of memory speed and cache size, using a variety of applications.
Many synchronous algorithms have been proposed for parallel and discrete simulations. However, the actual performance of these algorithms have been far from ideal, especially when event granularity is small. Barring t...
详细信息
Many synchronous algorithms have been proposed for parallel and discrete simulations. However, the actual performance of these algorithms have been far from ideal, especially when event granularity is small. Barring the case of low parallelism in the given simulation models, one of the main reasons of low speedups is in the uneven load distribution among processors. We present several new locality-preserving load balancing mechanisms for synchronous simulations on shared-memory multiprocessors. We show both theoretically and empirically that some of these mechanisms incur very low overhead. The results confirm that one of the new mechanisms is indeed more efficient and scalable than common existing approaches.
Circuit simulation has proven to be one of the most important computer aided design (CAD) methods for the analysis and validation of integrated circuit designs. A popular approach to describing circuits for simulation...
详细信息
Circuit simulation has proven to be one of the most important computer aided design (CAD) methods for the analysis and validation of integrated circuit designs. A popular approach to describing circuits for simulation purposes is to use a hardware description language such as VHDL. Similar efforts have also been carried out in the analog domain that has led to tools such as SPICE. However, with the growing trend of hardware designs that contain both analog and digital components, design environments that seamlessly integrate analog and digital circuitry are needed. simulation of such circuits is however, exacerbated by the higher resource (CPU and memory) demands that arise when analog and digital models are integrated in a mixed-mode (analog and digital) simulation. One solution to this problem is to use PDES algorithms on a distributed platform. However, a synchronization interface between the analog and digital simulation environment is required to achieve integrated mixed-mode simulation. In this paper, we present the issues involved in the construction of synchronization protocols which support mixed-mode simulation in a distributedsimulation environment. The proposed synchronization protocols provide an interface between an optimistic (Time Warp based) discrete-event simulation kernel and any continuous time simulation kernel. Empirical and formal analyses were conducted to ensure correctness and completeness of the protocols and the results of these analyses are also presented.
We have parallelized the Iowa Logic Simulator, a gate-level fine-grained discrete-event simulator, by employing an optimistic algorithm framework based on a global event queue implemented as a parallel heap. The origi...
详细信息
We have parallelized the Iowa Logic Simulator, a gate-level fine-grained discrete-event simulator, by employing an optimistic algorithm framework based on a global event queue implemented as a parallel heap. The original code and the basic data structures of the serial simulator remained unchanged. Wrapper data structures for the logical processes (gates) and the events are created to allow rollbacks, all the earliest events at each logical processes are stored into the parallel heap, and multiple earliest events are simulated repeatedly by invoking the simulate function of the serial simulator. The parallel heap allowed extraction of hundreds to thousands of earliest events in each queue access. On a bus-based shared-memory multiprocessor, simulation of synthetic circuits with 250,000 gates yielded speedups of 3.3 employing five processors compared to the serial execution time of the Iowa Logic Simulator, and limited the number of rollbacks to within 2,000. The basic steps of parallelization are well-defined and general enough to be employable on other discrete-event simulators.
Load balancing is a crucial factor in achieving good performance for parallel discrete event simulations. In this paper, we present a load balancing scheme that combines both static partitioning and dynamic load balan...
详细信息
Load balancing is a crucial factor in achieving good performance for parallel discrete event simulations. In this paper, we present a load balancing scheme that combines both static partitioning and dynamic load balancing. The static partitioning scheme maps simulation objects to logical processes before simulation starts while the dynamic load balancing scheme attempts to balance the load during runtime. The static scheme involves two steps. First, the simulation objects that contribute to small lookahead are merged together by using a merging algorithm. Then a partitioning algorithm is applied. The merging is needed to ensure a consistent performance for our dynamic scheme. Our dynamic scheme is tailor-made for an asynchronous simulation protocol that does not rely on null messages. The performance study on a supply-chain simulation shows that the partitioning algorithm and dynamic load balancing are important in achieving good performance.
In this paper we present the application of an approach for the performance prediction of message passing programs, to a PVM code implementing an iterative solver based on the Successive OverRelaxation method. The app...
详细信息
parallel cluster computing projects use a large number of commodity PCs to provide cost-effective computational power to run parallel applications. Because properly load-balanced distributedparallel applications tend...
详细信息
ISBN:
(纸本)3540678794
parallel cluster computing projects use a large number of commodity PCs to provide cost-effective computational power to run parallel applications. Because properly load-balanced distributedparallel applications tend to send messages synchronously, minimizing blocking is as crucial a requirement for the network fabric as are those of high bandwidth and low latency. We consider the selection of an optimal, commodity-based, interconnect network technology and topology to provide high bandwidth, low latency, and reliable delivery. Since our network design goal is to facilitate the performance of real applications, we evaluated the performance of myrinet and gigabit ethernet technologies in the context of working algorithms using modeling and simulation tools developed for this work.
In this paper, we introduce a new Time Warp system called ROSS: Rensselaer's Optimistic simulation System. ROSS is an extremely modular kernel that is capable of achieving event rates as high as 1,250,000 events p...
详细信息
In this paper, we introduce a new Time Warp system called ROSS: Rensselaer's Optimistic simulation System. ROSS is an extremely modular kernel that is capable of achieving event rates as high as 1,250,000 events per second when simulating a wireless telephone network model (PCS) on a quad processor PC server. In a head-to-head comparison, we observe that ROSS out performs the Georgia Tech Time Warp (GTW) system on the same computing platform by up to 180%. ROSS only requires a small constant amount of memory buffers greater than the amount needed by the sequential simulation for a constant number of processors. The driving force behind these high-performance and low memory utilization results is the coupling of an efficient pointer-based implementation framework, Fujimoto's fast GVT algorithm for shared memory multiprocessors, reverse computation and the introduction of Kernel Processes (KPs). KPs lower fossil collection overheads by aggregating processed event lists. This aspect allows fossil collection to be done with greater frequency, thus lowering the overall memory necessary to sustain stable, efficient parallel execution.
暂无评论