In High Performance Fortran (HPF), array redistribution can be described explicitly using directives (REDISTRIBUTE or REALIGN) which specify where new distributions become active or implicitly by calling functions whi...
Several techniques currently exist for estimating the power dissipation of combinational and sequential circuits using exhaustive simulation, Monte Carlo sampling, and probabilistic estimation. Exhaustive simulation and Monte Carlo sampling techniques can be highly reliable but often require long runtimes. This paper presents a comprehensive study of pattern-partitioning and circuit-partitioning parallelization schemes for those two methodologies in the context of distributed-memory multiprocessing systems. Issues in pipelined event-driven simulation and dynamic load balancing are addressed. Experimental results are presented for an IBM SP-2 system and a network of HP-9000 workstations. For instance, runtimes have been reduced from over 3 hours to under 20 minutes in one case.
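As a rough illustration of the pattern-partitioning idea mentioned above, the sketch below splits a sequence of input patterns into contiguous chunks and counts node transitions per chunk. It is not the paper's implementation: the three-input `toy_circuit`, the use of toggle counts as a stand-in for power, and the thread pool (in place of distributed-memory processors) are all assumptions made for the example. Adjacent chunks overlap by one pattern so that no transition at a chunk boundary is lost.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_circuit(bits):
    """Hypothetical 3-input circuit; returns the values of its nodes."""
    a, b, c = bits
    n1 = a & b
    n2 = b | c
    out = n1 ^ n2
    return (n1, n2, out)

def count_toggles(patterns):
    """Count node transitions between consecutive patterns
    (used here as a simple proxy for switching activity)."""
    toggles = 0
    prev = toy_circuit(patterns[0])
    for p in patterns[1:]:
        cur = toy_circuit(p)
        toggles += sum(x != y for x, y in zip(prev, cur))
        prev = cur
    return toggles

def partitioned_toggles(patterns, workers):
    """Pattern partitioning: each worker simulates one contiguous chunk.
    Chunks overlap by one pattern so boundary transitions are counted once."""
    n = len(patterns)
    size = max(1, -(-(n - 1) // workers))  # ceiling of (n - 1) / workers
    chunks = [patterns[i:i + size + 1] for i in range(0, n - 1, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_toggles, chunks))

patterns = [(i & 1, (i >> 1) & 1, (i >> 2) & 1) for i in [0, 3, 5, 6, 1, 7, 2, 4, 0]]
# The partitioned count matches the serial count for any worker count.
assert partitioned_toggles(patterns, 3) == count_toggles(patterns)
```

The one-pattern overlap only works because this toy circuit is combinational; a sequential circuit would also need its state at each chunk boundary.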
ISBN: 9780818679650 (print)
This paper studies the problem of load balancing for conservative parallel simulations executed on a multicomputer. The synchronization protocol makes use of Chandy-Misra null messages. We propose a dynamic load balancing algorithm which assumes no compile-time knowledge of the workload parameters. It is based upon a process migration mechanism and the notion of CPU-queue length, which indicates the workload at each processor. We examine two variations of the algorithm, which we refer to as the centralized and multi-level hierarchical methods, in the context of queueing network simulation of a torus. The torus was chosen because its many cycles aid in the formation of deadlock, making it a stress test for any conservative synchronization protocol. Our experiments indicate that our dynamic load balancing schemes significantly reduce the run time of an optimized version of the Chandy-Misra null-message approach, and decrease the synchronization overhead by 30-40% when compared to the use of a static partitioning algorithm. Significantly, the results obtained also indicate that the multi-level scheme always outperforms both the centralized load balancing approach and the static partitioning algorithm.
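The centralized variant described above can be pictured with a minimal sketch like the following. It is an illustration under stated assumptions, not the paper's algorithm: the number of processes assigned to a processor stands in for the CPU-queue length, and the `threshold` parameter and the "move one process from the busiest to the idlest processor" policy are hypothetical choices made for the example.

```python
def rebalance(queues, threshold=2):
    """Centralized balancing sketch.

    queues: dict mapping a processor name to the list of simulation
            processes it currently hosts (list length is the assumed
            stand-in for CPU-queue length).
    Returns the list of migrations as (process, source, destination).
    """
    if threshold < 2:
        # With a gap of 1, moving a process only swaps the imbalance,
        # so the loop below would never terminate.
        raise ValueError("threshold must be at least 2")
    migrations = []
    while True:
        busiest = max(queues, key=lambda p: len(queues[p]))
        idlest = min(queues, key=lambda p: len(queues[p]))
        if len(queues[busiest]) - len(queues[idlest]) < threshold:
            break
        proc = queues[busiest].pop()   # migrate one process
        queues[idlest].append(proc)
        migrations.append((proc, busiest, idlest))
    return migrations

queues = {"p0": [1, 2, 3, 4, 5, 6], "p1": [7], "p2": []}
moves = rebalance(queues)
# After balancing, no two processors differ by 2 or more processes.
assert max(map(len, queues.values())) - min(map(len, queues.values())) < 2
```

A multi-level hierarchical scheme would, presumably, apply the same kind of decision within groups of processors first and only then across groups, which avoids a single coordinator handling every load report.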
ISBN: 9780769515823 (print)
More and more parallel applications are running in distributed environments to take advantage of easily available and inexpensive commodity resources. For data-intensive applications, employing multiple distributed storage resources has many advantages. In this paper, we present a Multi-Storage I/O System (MS-I/O) that can not only effectively manage various distributed storage resources in the system, but also provide novel high-performance storage access schemes. MS-I/O employs many state-of-the-art I/O optimizations, such as collective I/O and asynchronous I/O, and a number of new techniques, such as data location, data replication, subfile, superfile and data access history. In addition, many MS-I/O optimization schemes can work simultaneously within a single data access session, greatly improving performance. Although I/O optimization techniques can help improve performance, they also complicate the I/O system. In addition, most optimization techniques have their limitations. Therefore, selecting accurate optimization policies requires expert knowledge, which is not suitable for end users who may have little knowledge of I/O techniques. The task of making I/O optimization decisions should therefore be left to the I/O system itself, that is, it should be automatic from the user's point of view. We present a User Access Pattern data structure, associated with each dataset, that can help MS-I/O easily make accurate I/O optimization decisions.
In this paper, we present our Grid-based decision tree architecture, with the intention of applying it to both parallel and sequential algorithms. Also, we show that, based on the scope and model of data mining applie...
We propose an efficient dynamic load balancing scheme in cellular networks for managing a teletraffic hot spot in which channel demand exceeds a certain threshold. A hot spot, depicted as a stack of hexagonal 'rings' of cells, is classified as complete if all cells within it are hot. The rings containing only cold cells outside the hot spot are called 'peripheral rings'. Our load balancing scheme migrates channels through a structured borrowing mechanism from the cold cells within the 'rings' or 'peripheral rings' to the hot cells in the hot spot. For the more general case of an incomplete hot spot, a cold cell is further classified as cold safe, cold semi-safe or cold unsafe, and a demand graph is constructed from the channel demand of each hot cell from its adjacent cells in the next outer ring. The channel borrowing algorithm works on the demand graph in a bottom-up fashion, satisfying the demands of the cells in each subsequent inner ring. Markov chain models are developed for a hot cell, and detailed simulation experiments are conducted to evaluate the performance of our load balancing scheme. Comparison with an existing load balancing strategy under moderate and heavy teletraffic conditions shows a performance improvement of 12% in terms of call blockade by our load balancing scheme.
In this paper, a parallel loop self-scheduling scheme for heterogeneous PC cluster systems is proposed. Though the proposed scheme does allow users to choose parameters before the execution initialization phase, there...
Internet computing and grid technologies promise to change the way we tackle complex problems. They will enable large-scale aggregation and sharing of computational, data and other resources across institutional bound...
We propose two new asynchronous parallel algorithms for test set partitioned fault simulation. The algorithms are based on a new two-stage approach to parallelizing fault simulation for sequential VLSI circuits in which the test set is partitioned among the available processors. These algorithms provide the same result as the previous synchronous two-stage approach. However, due to the dynamic characteristics of these algorithms and the fact that there is very little redundant work, they run faster than the previous synchronous approach. A theoretical analysis comparing the various algorithms is also given to provide insight into these algorithms. The implementations were done in MPI and are therefore portable to many parallel platforms. Results are shown for a shared memory multiprocessor.
Important applications, including those in computational chemistry, computational fluid dynamics, structural analysis and sparse matrix applications, usually consist of a mixture of regular and irregular accesses. While current state-of-the-art run-time library support for such applications handles the irregular accesses reasonably well, the efficacy of the optimizations at run-time for the regular accesses is yet to be proven. This paper aims to find a better approach to handling the above applications in a unified compiler and run-time framework. Specifically, this paper considers only regular applications and evaluates the performance of two approaches: a run-time approach using PILAR and a compile-time approach using a commercial HPF compiler. This study shows that, using a particular representation of regular accesses, the performance of regular code using run-time libraries can come close to the performance of code generated by a compiler. It also determines the operations that usually contribute most to the run-time overhead in the case of regular accesses. Experimental results are reported for three regular applications on a 16-processor IBM SP-2.