PM-PVM is a portable implementation of PVM designed to work on SMP architectures supporting multithreading. PM-PVM portability is achieved through the implementation of the PVM functionality on top of a reduced set of...
详细信息
PM-PVM is a portable implementation of PVM designed to work on SMP architectures supporting multithreading. PM-PVM portability is achieved through the implementation of the PVM functionality on top of a reduced set of parallel programming primitives. Within PM-PVM, PVM tasks are mapped onto threads and the message passing functions are implemented using shared memory. three implementation approaches of the PVM message passing functions have been adopted. In the first one, a single message copy in memory is shared by all destination tasks. the second one replicates the message for every destination task but requires less synchronization. Finally, the third approach uses a combination of features from the two previous ones. Experimental results comparing the performance of PM-PVM and PVM applications running on a 4-processor Sparcstation 20 under Solaris 2.5 show that PM-PVM can produce execution times up to 54% smaller than PVM.
PM-PVM is a portable implementation of PVM designed to work on SMP architectures supporting multithreading. PM-PVM portability is achieved through the implementation of the PVM functionality on top of a reduced set of...
详细信息
PM-PVM is a portable implementation of PVM designed to work on SMP architectures supporting multithreading. PM-PVM portability is achieved through the implementation of the PVM functionality on top of a reduced set of parallel programming primitives. Within PM-PVM, PVM tasks are mapped onto threads and the message passing functions are implemented using shared memory. three implementation approaches of the PVM message passing functions have been adopted. In the first one, a single message copy in memory is shared by all destination tasks. the second one replicates the message for every destination task but requires less synchronization. Finally, the third approach uses a combination of features from the two previous ones. Experimental results comparing the performance of PM-PVM and PVM applications running on a 4-processor Sparcstation 20 under Solaris 2.5 show that PM-PVM can produce execution times up to 54% smaller than PVM.
Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However wormhole switching is not well suited for NOWs. the long wires required in this environment lead to large buff...
详细信息
Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However wormhole switching is not well suited for NOWs. the long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. Moreover, the traditional disadvantages of VCT switching, as buffer requirements and packetizing overhead, disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole ones, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in NOWs. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost.
Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However, wormhole switching is not well suited for NOWs. the long wires required in this environment lead to large buf...
详细信息
Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However, wormhole switching is not well suited for NOWs. the long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. Moreover, the traditional disadvantages of VCT switching, as buffer requirements and packetizing overhead, disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole ones, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in NOWs. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost.
In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. the complex organization of today's multi-processors with several memory hierarch...
详细信息
In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. the complex organization of today's multi-processors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely complex code that does not port to other architectures. this paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model, for detecting data parallelism and efficiently executing data-parallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs.
Discretization of image restoration problems often leads to a discrete inverse ill-posed problem: the discretized operator is so badly conditioned that it can be actually considered as undetermined. In this case one s...
详细信息
Discretization of image restoration problems often leads to a discrete inverse ill-posed problem: the discretized operator is so badly conditioned that it can be actually considered as undetermined. In this case one should single out the solution which is the nearest to the desired solution. the usual way to do it is to regularize the problem. In this paper we focus on the computational aspects of the Wiener filter within the framework of the regularization methods. the emphasis is on its reliability and its efficiency, both of which become more and more important as the size and the complexity of the real problem grow and the demand for advanced real-time processing increases.
In this paper, we propose a Vertically-partitioned parallel Signature File (VPSF) method which can partition a signature file vertically. Our VPSF method uses an extendable hashing technique for dynamic environment an...
详细信息
In order to extract high levels of performance from modern parallelarchitectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilizati...
详细信息
In order to extract high levels of performance from modern parallelarchitectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilization of the memory hierarchy, compiler-directed locality enhancement techniques are also important. In this paper we propose a locality improvement technique that uses data space (array layout) transformations in contrast to most of the previous work based on iteration space (loop) transformations. In other words, rather than changing the order of loop iterations, our technique modifies the memory layouts of multi-dimensional arrays. In comparison with previous work on data transformations it brings two novelties. First, we formulate the problem on a special graph structure called the layout graph (LG) and use integer linear programming (ILP) methods to determine optimal layouts. Second, in addition to static layout detection, our approach also enables the compiler to determine optimal dynamic layouts;that is, the layouts that can be changed across loop nest boundaries. We believe that this is the first attempt to determine optimal dynamic memory layouts. We also present preliminary experimental results on the SGI Origin 2000 distributed shared memory multiprocessor. Our results so far are encouraging and indicate that the additional compilation time taken by the solver is tolerable.
In order to extract high levels of performance from modern parallelarchitectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilizati...
详细信息
In order to extract high levels of performance from modern parallelarchitectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilization of the memory hierarchy, compiler-directed locality enhancement techniques are also important. In this paper we propose a locality improvement technique that uses data space (array layout) transformations in contrast to most of the previous work based on iteration space (loop) transformations. In other words, rather than changing the order of loop iterations, our technique modifies the memory layouts of multi-dimensional arrays. In comparison with previous work on data transformations it brings two novelties. First, we formulate the problem on a special graph structure called the layout graph (LG) and use integer linear programming (ILP) methods to determine optimal layouts. Second, in addition to static layout detection, our approach also enables the compiler to determine optimal dynamic layouts; that is, the layouts that can be changed across loop nest boundaries. We believe that this is the first attempt to determine optimal dynamic memory layouts. We also present preliminary experimental results on the SGI Origin 2000 distributed shared memory multiprocessor. Our results so far are encouraging and indicate that the additional compilation time taken by the solver is tolerable.
In the framework of object-based video coding morphological filtering has became an important technique for image simplification. Morphological filters are used in a pre-processing step to improve the chances of obtai...
详细信息
In the framework of object-based video coding morphological filtering has became an important technique for image simplification. Morphological filters are used in a pre-processing step to improve the chances of obtaining meaningful segmentation results. However the computation cost of these filters limits their practical use. An improvement of the processing time is then required for their use in real-time applications. In this paper the implementation issue is addressed. Improvement of the processing time is achieved by exploiting the local parallelism of these filters and by reducing the number of clock cycles used to access a specific data structure called the hierarchical queue. As various morphological filters can be used, an architecture based on hardware software partitioning is proposed. Performance results showing the benefits of the proposed architecture are provided.
暂无评论