We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, were each processor has O(n/p) much greater than O(1) local memo...
详细信息
We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, were each processor has O(n/p) much greater than O(1) local memory and all processors are connected via some arbitrary interconnection network (e.g. mesh, hypercube, fat tree). We present O(T-sequential/p + T-s(n,p)) time scalable parallelalgorithms for several computational geometry problems. T-s(n,p) refers to the time of a global sort operation. Our results are independent of the multicomputer's interconnection network. their time complexities become optimal when T-sequential/p dominates T-s(n,p) or when T-s(n,p) is optimal. this is the case for several standard architectures, including meshes and hypercubes, and a wide range of ratios n/p that include many of the currently available machine configurations. Our methods also have some important practical advantages: For interprocessor communication, they use only a small fixed number of one global routing operation, global sort, and all other programming is in the sequential domain. Furthermore, our algorithms use only a small number of very large messages, which greatly reduces the overhead for the communication protocol between processors. (Note however, that our time complexities account for the lengths of messages.) Experiments show that our methods are easy to implement and give good timing results.
this paper presents the design of a dedicated parallel architecture for connected component analysis. Categorized in one-dimensional array processors, for an image of n/spl times/n pixels, the proposed architecture ha...
详细信息
this paper presents the design of a dedicated parallel architecture for connected component analysis. Categorized in one-dimensional array processors, for an image of n/spl times/n pixels, the proposed architecture has n-1 linear processing elements (PEs), n/sup 2/ CAM memory modules, and a tree structure of (n/2)-1 switches allowing communication through the global bus in O(log n) unit of propagation time. Well suited for low and intermediate-level vision, this architecture allows sequential processingthrough its line structure which is perfectly adapted to real time image analysis from any interlaced-mode video signal. this paper presents the algorithms for connected component labeling, area and perimeter determination, all of which are in O(n log n). the performance of the proposed architecture is compared with another architecture types. the simulation results, the possibility of implementation, and future work are discussed.
parallel simulation has been an active research area for more than a decade, and several parallel simulation algorithms have been proposed. To evaluate parallel simulation environments, there is a need for a common be...
详细信息
parallel simulation has been an active research area for more than a decade, and several parallel simulation algorithms have been proposed. To evaluate parallel simulation environments, there is a need for a common benchmark suite. Such benchmarks should allow the designer of simulation kernels to: (i) evaluate how efficiently the simulation kernel runs on specific architectures; and (ii) evaluate how simulation problems scale on the kernel. A vast majority of benchmarks suggested in the literature focus on the latter problem. the first requirement is however equally important, as such benchmarks are needed to implement efficient simulation systems. In this report, we advocate an incremental benchmark methodology which is primarily intended for performance tuning. the benchmark suite is based on a small set of ping models with which it is possible to effectively isolate and estimate various overheads, contention and latencies encountered in simulation kernels. the use of the benchmark suite is illustrated by its application to performance tuning and evaluation of a time warp kernel.
Combinational logic synthesis is a very important but computationally expensive phase of VLSI system design. parallelprocessing offers an attractive solution to reduce this design cycle time. In this paper, we descri...
详细信息
Combinational logic synthesis is a very important but computationally expensive phase of VLSI system design. parallelprocessing offers an attractive solution to reduce this design cycle time. In this paper, we describe ProperMIS, a portable parallel algorithm for logic synthesis based on the MIS multi-level logic synthesis system. As part of this work, we have developed novel parallelalgorithms for the different logic transformations of the MIS system. Our algorithm uses an asynchronous message-driven computing model with no synchronizing barriers separating phases of parallel computation. the algorithm is portable across a wide variety of parallelarchitectures, and is built around a well-defined sequential algorithm interface, so that we can benefit from future expansion of the sequential algorithm. We present results on several MCNC and ISCAS benchmark circuits for a variety of shared memory and distributed processingarchitectures. Our implementation produces speedups of an average of 4 on 8 processors.
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages workload migrations within its neighborhood. this paper compares a couple of fairly wel...
详细信息
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages workload migrations within its neighborhood. this paper compares a couple of fairly well-known nearest neighbor algorithms, the dimension exchange and the diffusion methods and their variants in terms of their performances in both one-port and all-port communication architectures. It turns out that the dimension exchange method outperforms the diffusion method in the one-port communication model, and that the strength of the diffusion method is in asynchronous implementations in the all-port communication model. the underlying communication networks considered assume the most popular topologies, the mesh and the torus and their special cases: the hypercube and the k-ary n-cube.
the cost of interprocessor communication has a substantial impact on execution time when implementating parallelalgorithms on physical parallel computers. We study these implementation costs, examining the number of ...
详细信息
the cost of interprocessor communication has a substantial impact on execution time when implementating parallelalgorithms on physical parallel computers. We study these implementation costs, examining the number of inter-processor messages, the cost of routing these messages on various architectures, and the number of communication phases. We provide an improved direct routing algorithm for realizing h-relations on crossbar networks. We also introduce a round-robin message-delivery algorithm which reduces the number of times a communication link is established between a pair of processors (by delivering all messages of that phase for the pair in order without interruption.) We summarize criteria sufficient for a parallel algorithm to be implemented optimally on several common networks. We also describe a log n-phase optimal parallel list-ranking algorithm.
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation is based on the recent new algorithm of Goemans and Williamson for this ...
详细信息
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation is based on the recent new algorithm of Goemans and Williamson for this problem. However, our work aims for an efficient, practical formulation of the algorithm with close to optimal parallelization. Our theoretical analysis and an implementation on the Connection Machine CM5 show that our parallelization achieves linear speedup. We test our implementation on several large graphs (more than 9000 vertices), particularly on large instances of the Ising model.
this paper explores the fundamental impacts of pipeline technology on massively parallel SIMD architectures. the potential for performance improvement in the instruction delivery is explored, and stall penalties assoc...
详细信息
this paper explores the fundamental impacts of pipeline technology on massively parallel SIMD architectures. the potential for performance improvement in the instruction delivery is explored, and stall penalties associated with reduction operations are derived. Scheduling mechanisms to mitigate stall cycles are also presented. In addition, the design of pipelined processing elements is considered, and formula for stall penalties, and area costs are constructed. these results indicate a 5-10 fold improvement in SIMD performance is well within technological limits.
the prototype system OOXSDAR VISIS was implemented in VisualWorks/Smalltalk and Distributed Smalltalk respectively. To achieve distribution in a heterogeneous network a common object request broker architecture (CORBA...
详细信息
the prototype system OOXSDAR VISIS was implemented in VisualWorks/Smalltalk and Distributed Smalltalk respectively. To achieve distribution in a heterogeneous network a common object request broker architecture (CORBA)-based architecture was chosen. the architecture consists of three layers: knowledge client level, knowledge domain agent server level, and persistent knowledge storage level. the architecture is based on the semantic/presentation split of logical knowledge objects. this architecture combines the advantages of standardized communication protocols such as CORBA withthe power and expressivity of OODBMS. First studies withthe prototype system OOXSDAR VISIS were carried out for performance analysis. the results allowed significant improvement of the distributed inference process. Well known principles such as increasing intra-modul cohesion and minimizing inter-modul dependencies were applied for restructuring the distributed knowledge bases.
Presented are the results of a study conducted to evaluate the performance of parallel I/O on a massively parallel processor (MPP). the network traversal and total processing times are calculated for I/O reads and wri...
详细信息
暂无评论