the objective of this research is to propose a low-complexity static scheduling and allocation algorithm for message-passing architectures by considering factors such as communication delays, link contention, message ...
详细信息
the objective of this research is to propose a low-complexity static scheduling and allocation algorithm for message-passing architectures by considering factors such as communication delays, link contention, message routing and network topology. As opposed to the conventional list-scheduling approach, our technique works by first serializing the task graph and 'injecting' all the tasks to one processor. the parallel tasks are then 'bubbled up' to other processors and are inserted at appropriate time slots. the edges among the tasks are also scheduled by treating communication links between the processors as resources. the proposed approach takes into account the link contention and underlying communication routing strategy, and can self-adjust on regular as well as arbitrary network topologies. To reduce the complexity, our scheduling algorithm is itself parallelized. To our knowledge, this is the first attempt in designing a parallel algorithm for scheduling. the proposed approach implemented on an iPSC/860 hypercube, while yielding a high speedup in its execution, performs considerably better under a wide range of parameters including the task graph size, communication-to-computation ratio, and the target system topology. Comparisons are made with two other approaches.
this paper describes a fault tolerant model for a functional language parallel machine. the model is transparent to the user and ensures successful execution of programs in the presence of hardware failure. the model ...
详细信息
this paper describes a fault tolerant model for a functional language parallel machine. the model is transparent to the user and ensures successful execution of programs in the presence of hardware failure. the model is based on data replication. It takes advantage of the properties of the functional languages. the recovery scheme can be carried out simultaneously on all processors, and occurs while 'normal' program execution is in progress. thus normal execution suffers less performance degradation than with other approaches.
As computers become less expensive, interest in using CPUs to build novel parallel systems has increased. the goal is to achieve high parallelism at low cost. In this article, we first examine the reasons why we need ...
详细信息
As computers become less expensive, interest in using CPUs to build novel parallel systems has increased. the goal is to achieve high parallelism at low cost. In this article, we first examine the reasons why we need graph reduction machine for the support of declarative languages to achieve high parallelism. the next is to survey recent development in graph reduction machines, to point out their strengths and weaknesses. then we introduce a dual unit processing node architecture for the graph reduction machine that carefully separates the work load, as such one unit is responsible for network message handling and the other for supercombinator execution. Finally, we demonstrate that a highly parallel novel computer architecture is feasible at low cost.
In this paper we present the bottlenecks involved in the real-time visualization of volumetric data using volume rendering technique. Unlike the current trend of best-fitting the algorithm to a fixed architecture, we ...
详细信息
In this paper we present the bottlenecks involved in the real-time visualization of volumetric data using volume rendering technique. Unlike the current trend of best-fitting the algorithm to a fixed architecture, we analyzed the performance of our architecture independent algorithm over a wide range of data-parallel architecture using dpsim - a data-parallel simulator. the result led to a best-fit architecture that meets the 'near' real-time requirements. We chose the data parallel paradigm because of the overall simplicity it offers., and because is expected to be a component in the next generation of high end workstations.
Speed is a crucial factor for a practical computer system. With an existing system, parallelism of the processing is the way to achieve the necessary speedup withthe current computing resource. the Method to perform ...
详细信息
Speed is a crucial factor for a practical computer system. With an existing system, parallelism of the processing is the way to achieve the necessary speedup withthe current computing resource. the Method to perform an effective model-based object identification system in a transputer network is described in this paper. the speed achieved is close to that required in industrial applications.
Combinational logic synthesis is a very important but computationally expensive phase of VLSI system design. parallelprocessing offers an attractive solution to reduce this design cycle rime. In this paper we describ...
详细信息
Combinational logic synthesis is a very important but computationally expensive phase of VLSI system design. parallelprocessing offers an attractive solution to reduce this design cycle rime. In this paper we describe ProperMIS, a portable parallel algorithm for logic synthesis based on the MIS multi-level logic synthesis system. As part of this work, we have developed novel parallelalgorithms for the different logic transformations of the MIS system. Our algorithm uses art asynchronous message-driven computing model with no synchronizing barriers separating phases of parallel computation. the algorithm is portable across a wide variety of parallelarchitectures, and is built around a well-defined sequential algorithm interface, so that we can benefit from future expansion of the sequential algorithm. We present results on several MCNC and ISCAS benchmark circuits for a variety of shared memory and distributed processingarchitectures. Our implementation produces speedups of an average of 4 on 8 processors.< >
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation is based on the recent new algorithm of Goemans and Williamson for this ...
详细信息
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation is based on the recent new algorithm of Goemans and Williamson for this problem. However, our work aims for an efficient, practical formulation of the algorithm with close to optimal parallelization. Our theoretical analysis and an implementation on the Connection Machine CM5 show that our parallelization achieves linear speedup. We test our implementation on several large graphs (more than 9000 vertices), particularly on large instances of the Ising model.< >
the proceedings contain 49 papers. the topics discussed include: efficient resolution of sparse indirection in data-parallel compilers;extending high performance Fortran for the support of unstructured computations;a ...
ISBN:
(纸本)0897917286
the proceedings contain 49 papers. the topics discussed include: efficient resolution of sparse indirection in data-parallel compilers;extending high performance Fortran for the support of unstructured computations;a compiler-directed distributed shared memory system;the DMBC: architecture and fundamental operations;optimum modulo schedules for minimum register requirements;multiprocessor scalability predictions through detailed program execution analysis;upper time bounds for executing pram-programs on the LogP-machine;on the utility of threads for data parallel programming;reducing communication by honoring multiple alignments;compiler cache optimization for banded matrix problems;Petri net modeling of intemonnection networks for massively parallelarchitectures;decoupling synchronization and data transfer in message passing systems of parallel computers;techniques for reducing overheads of shared-memory multiprocessing;unified compilation techniques for shared and distributed address space machines;run-time methods for parallelizing partially parallel loops;a near-optimal broadcasting algorithm in all-port wormhole-routed hypercubes;and a comparative study of single hop WDM interconnections for multiprocessors.
A new systolic array without matrix transportation hardware is proposed to compute the two-dimensional discrete cosine transform (2-D DCT) based on the row-column decomposition. this architecture uses N2 multipliers t...
详细信息
A new systolic array without matrix transportation hardware is proposed to compute the two-dimensional discrete cosine transform (2-D DCT) based on the row-column decomposition. this architecture uses N2 multipliers to evaluate N×N-point DCT's at a rate of one complete transform per N clock cycles, where N is even. It possesses the features of regularity and modularity, and is thus well suited to VLSI implementation. As compared to existing pipelined regular architectures for the 2-D DCT, the proposed one has better throughput performance, smaller area-time complexity, and lower communication complexity. the new idea for the 2-D DCT is also extended to derive a similar systolic array for the 2-D inverse discrete cosine transform (IDCT). Simulation results demonstrate that the proposed 2-D DCT and IDCT architectures have good fixed-point error performance for both real image and random data. As a consequence, they are useful for applications where very high throughput rates are required.
the cost of interprocessor communication has a substantial impact on execution time when implementing parallelalgorithms on physical parallel computers. We study these implementation costs, examining the number of in...
详细信息
the cost of interprocessor communication has a substantial impact on execution time when implementing parallelalgorithms on physical parallel computers. We study these implementation costs, examining the number of inter-processor messages, the cost of routing these messages on various architectures, and the number of communication phases. We provide an improved direct routing algorithm for realizing h-relations on crossbar networks. We also introduce a round-robin message-delivery algorithm which reduces the number of times a communication lint is established between a pair of processors (by delivering all messages of that phase for the pair in order without interruption) We summarize criteria sufficient for a parallel algorithm to be implemented optimally on several common networks. We also describe a log n-phase optimal parallel list-ranking algorithm.< >
暂无评论