In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. the FFT problem has been formulated using two distinct approaches based on the dataflow ...
详细信息
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. the FFT problem has been formulated using two distinct approaches based on the dataflow concepts. the first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. the second approach, referred to as the sender-initiated algorithm, follows a data-flow model based on the producer-consumer style of programming and can be adopted to different architectural parameters for achieving high performance. the implementations of the proposed algorithms have been carried out on the EARth (Efficient Architecture for Running threads) platform. For boththe algorithms, we analyze the ratio of remote vs local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For large number of processors, boththe algorithms perform well, yielding execution times of only 10 msec for an input of 16 K data points on a 64 processor machine, assuming each processor running at 140 MHz clock speed.
Network of workstation (NOW) is a cost-effective alternative to massively parallel supercomputers. As commercially available off-the-shelf processors become cheaper and faster, it is now possible to build a PC or work...
详细信息
Network of workstation (NOW) is a cost-effective alternative to massively parallel supercomputers. As commercially available off-the-shelf processors become cheaper and faster, it is now possible to build a PC or workstation cluster that provides high computing power within a limited budget. However, a cluster may consist of different types of processors and this heterogeneity within a cluster complicates the design of efficient collective communication protocols. this paper shows that a simple heuristic called fastest-node-first (FNF) is very effective in reducing broadcast time for heterogeneous cluster systems. Despite the fact that FNF heuristic fails to give the optimal broadcast time for a general heterogeneous network of workstation, we prove that FNF always gives the optimal broadcast time in several special cases of clusters. Based on these special case results, we show that FNF is an approximation algorithm that guarantees a competitive ratio of 2. From these theoretical results we also derive techniques to speed up the branch-and-bound search for the optimal broadcast schedule in HNOW.
A hierarchical bus network T = (V, E) uses hierarchically, tree-like connected buses as a communication network. New communication technologies like SCI (Scalable Coherent Interface) make such networks very attractive...
详细信息
A hierarchical bus network T = (V, E) uses hierarchically, tree-like connected buses as a communication network. New communication technologies like SCI (Scalable Coherent Interface) make such networks very attractive, because they allow their easy construction and guarantee reasonable communication performance. Such networks can be modeled as tree networks: leaves correspond to processors, inner nodes to buses, edges to switches, and bandwidths of inner nodes and edges are related to bandwidths of buses and switches, respectively. In this paper we address the problem of static data management. Given a set of shared data objects X and the read and write frequencies from the processors to the shared data objects, the goal is to compute a (maybe redundant) placement of the shared data objects to the processors, such that the congestion (the maximum over the load of all edges and inner nodes, induced by the read and write frequencies, divided by the bandwidth of the edge or inner node, respectively) is minimized. It is known that this problem can be solved optimally in linear time, if inner nodes are allowed to hold copies of shared data objects. In our model, inner nodes correspond to buses and therefore cannot store copies of shared data objects. We show that this restriction increases the complexity of the placement problem drastically: It becomes NP-hard. On the other hand, the main contribution of our paper is an approximation algorithm with runtime O(|X|·|V|·height(T)·log(degree(T))) that increases the congestion by a factor of at most 7.
In recent years, the task of allocating jobs to servers has been studied withthe `balls and bins' abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each...
详细信息
In recent years, the task of allocating jobs to servers has been studied withthe `balls and bins' abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each job (ball) a little freedom in choosing its destination server (bin). In this paper we examine an infinite and parallel allocation process (see [ABS98]) which is related to the `balls and bins' abstraction. the simple process can be used to model many problems arising in applications like load balancing, data accesses for parallel data servers, hashing, and PRAM simulations. Unfortunately, the parallel allocation process behaves in a highly non-uniform manner which makes its analysis challenging. Even the typically simple question of for which arrival rates the process is stable, is highly non-trivial. In order to cope withthis non-uniform behavior we introduce a new sequential process and show (via simulations) that the sequential process models the behavior of the parallel one very accurately. We develop a system of ordinary differential equations in order to describe the behavior of our sequential process and present a thorough analysis of the performance this process. For example, we show that the queue length distribution decreases double-exponentially. Finally, we present simulation results indicating that the solutions to the differential equations very well predict the queue length distribution of our sequential process and the largest injection rate for which it is stable. Summarizing, we can conclude that in all the performance characteristics we have measured experimentally, the parallel and the sequential process are closely related. this indicates that the obtained solution of the differential equations and the results presented above are applicable to the parallel process, too.
We describe a family of reconfigurable parallelarchitectures for logic emulation. they are supposed to be applicable like conventional FPGAs, while covering a larger range of circuit sizes and clock frequencies. In o...
详细信息
ISBN:
(纸本)3540643591
We describe a family of reconfigurable parallelarchitectures for logic emulation. they are supposed to be applicable like conventional FPGAs, while covering a larger range of circuit sizes and clock frequencies. In order to evaluate the performance of such programmable designs, we also need software methods for code generation from circuit descriptions. We propose a combination of scheduling and routing algorithms for embedding calculations into the target architecture.
Mining association rules from large databases is an important problem in data mining. there is a need to develop parallel algorithm for this problem because it is a very costly computation process. However, all propos...
详细信息
ISBN:
(纸本)9780897919890
Mining association rules from large databases is an important problem in data mining. there is a need to develop parallel algorithm for this problem because it is a very costly computation process. However, all proposed parallelalgorithms for mining association rules follow the conventional level-wise approach. On a shared-memory multi-processors, they will impose a synchronization in every iteration which degrades greatly their performance. the deficiency comes from the contention on the shared I/O channel when all processors are accessing their database partitions in the shared storage synchronously. An asynchronous algorithm APM has been proposed for mining association rules on shared-memory multiprocessors. All participating processors in APM generate candidates and count their supports independently without synchronization. Furthermore, it can finish the computation with less I/O than required in the level-wise approach. the algorithm has been implemented on a Sun Enterprise 4000 multi-processors with12 nodes. the experiments show that APM has super performance than other proposed synchronous algorithms.
this work deals withthe scheduling problem of a directed acyclic graph with interprocessor communication delays. the objective is to minimize the makespan, taking into account the contention in the network induced by...
详细信息
ISBN:
(纸本)3540643591
this work deals withthe scheduling problem of a directed acyclic graph with interprocessor communication delays. the objective is to minimize the makespan, taking into account the contention in the network induced by the message routing. We propose two heuristics for solving the scheduling and routing problems onto arbitrary networks, taking into consideration the access conflicts to links during the task scheduling. Both heuristics significantly improve the performance of the algorithms which do not consider the contention in the network. the comparison of these heuristics is done on problems with different granularity levels in regard to execution times and number of needed processors.
the proceedings contain 118 papers. the special focus in this conference is on parallel and Distributed Processing. the topics include: Dynamic reconfiguration of a PMMLA for high-throughput applications;a parallel al...
ISBN:
(纸本)3540643591
the proceedings contain 118 papers. the special focus in this conference is on parallel and Distributed Processing. the topics include: Dynamic reconfiguration of a PMMLA for high-throughput applications;a parallel algorithm for minimum cost path computation on polymorphic processor array;a performance modeling and analysis environment for reconfigurable computers;an integrated partitioning and synthesis system for dynamically reconfigurabte multi-FPGA architectures;temporal partioning for partially-reconfigurable-field-programmable gate;a java development and runtime environment for reconfigurable computing;synthesizing reconfigurable sequential machines using tabular models;evaluation of a low-power reconfigurable DSP architecture;a reconfigurable hardware-monitor for communication analysis in distributed real-time systems;a mathematical benefit analysis of context switching reconfigurable computing;a configurable computing approach towards real-time target tracking;hardware reconfigurable neural networks;a simulator for the reconfigurable mesh architecture;processor architectures for circuit emulation;an empirical comparison of runtime systems for conservative parallel simulation;synchronizing operations on multiple objects;migration and rollback transparency for arbitrary distributed applications in workstation clusters;a topology based approach to coordinated multicast operations;a parallel evolutionary algorithm for the vehicle routing problem with heterogeneous fleet;artificial neural networks on reconfigurable meshes;a molecular quasi-random model of computations applied to evaluate collective intelligence;replicated shared object model for edge detection with spiral architecture and scheduling tasks of a parallel program in two-processor systems with use of cellular automata.
this paper shows the power of randomization in designing efficient parallelalgorithms for the problems of routing and PRAM emulation. We show that with randomization techniques optimal routing can be obtained for a l...
详细信息
We show that deterministic algorithms using bounded lookahead cannot fully exploit the potential of a parallel I/O system. Randomization can be used to significantly improve the performance of parallel prefetching and...
详细信息
暂无评论