This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of whic...
详细信息
ISBN:
(纸本)9780897918091
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criterion are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general and realistic model of inter-task communication, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
The authors address the apparently difficult problem of doing parallel transitive closure when the (directed) graph is sparse and/or, only single-source information is desired. O(e) work is their target for the single...
详细信息
ISBN:
(纸本)0897913701
The authors address the apparently difficult problem of doing parallel transitive closure when the (directed) graph is sparse and/or, only single-source information is desired. O(e) work is their target for the single-source problem. When the graph is sparse, then the all-pairs transitive closure problem can be solved by performing a depth-first search from each node, taking O(ne) time;that is their target for the all-pairs problem. The authors do not reach either target, except for the all-pairs case when e is fairly large. However, they make significant progress, in the sense that they have the first algorithms that simultaneously use time much less than linear and use work that is less than M(n).
A perfect hash function for a (multi)set X of n integers is an infective function h : X → {1,., s}, where s = O(n), that can be stored in O(n) space and evaluated in constant time by a single processor. We show that ...
详细信息
The proceedings contain 54 papers. The topics discussed include: expediting hazard pointers with bounded RCU critical sections;Alock: asymmetric lock primitive for RDMA systems;when is parallelism fearless and zero-co...
ISBN:
(纸本)9798400704161
The proceedings contain 54 papers. The topics discussed include: expediting hazard pointers with bounded RCU critical sections;Alock: asymmetric lock primitive for RDMA systems;when is parallelism fearless and zero-cost with rust?;efficient parallel reinforcement learning framework using the reactor model;parallel best arm identification in heterogeneous environments;brief announcement: lock-free learned search data structure;brief announcement: LIT: lookup interlocked table for range queries;brief announcement: a fast scalable detectable unrolled lock-based linked list;scheduling out-trees online to optimize maximum flow;optimizing dynamic data center provisioning through speed scaling: a primal-dual perspective;scheduling jobs with work-inefficient parallel solutions;and multi bucket queues: efficient concurrent priority scheduling.
We consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation along the lines of a distributed-memory model such as the BSP. As a candidate for a shared-me...
详细信息
We consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation along the lines of a distributed-memory model such as the BSP. As a candidate for a shared-memory bridging model, we introduce the Queuing Shared Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP, and on a related model, the (d,x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios in which the high-level QSM is more suited for analyzing algorithms on certain machines than the more detailed BSP and LogP models. Finally, we present algorithmic results for the QSM, as well as general strategies for mapping algorithms designed for the BSP or PRAM models onto the QSM model. Our main conclusion is that shared-memory models can potentially serve as viable alternatives to existing message-passing, distributed-memory bridging models.
Some parallelalgorithms have the property that, as they are allowed to take more time, the total work that they do is reduced. This paper describes three such algorithms that find the strongly connected components of...
详细信息
A parallel variant of breadth-first search for distributed computing is presented. The variant allows exhaustive enumeration of elements of a search space (implicitly defined graph) in which the representation of all ...
详细信息
ISBN:
(纸本)9780897918909
A parallel variant of breadth-first search for distributed computing is presented. The variant allows exhaustive enumeration of elements of a search space (implicitly defined graph) in which the representation of all graph nodes would otherwise require more than the total available memory. This algorithm requires the use of a tadpole data structure to partition the search space into connected subgraphs, with each subgraph stored within the memory of a single processor. Thus, the graph of the nodes are stored in a way that adds greater spatial locality, thereby reducing communication among processors. The algorithm enumerates the tadpoles in a breadth-first manner while executing depth-first search within each tadpole. The search within each tadpole is reduced to tree search. A parameter is defined that allows a linear tradeoff in which memory and communication are each reduced linearly as CPU time grows. The result appears to fill a gap in the literature, which has concentrated on parallel depth-first search, but has been relatively sparse in the case of parallel breadth-first search.
The design and use of scalable parallel computers is discussed. Much of the knowledge gained in the study of distributd systems is relevant, although it needs to be used in new manners. The author argues that the chal...
详细信息
ISBN:
(纸本)0897916131
The design and use of scalable parallel computers is discussed. Much of the knowledge gained in the study of distributd systems is relevant, although it needs to be used in new manners. The author argues that the challenge is to achieve maximum reuse of existing technologies, microprocessor architectures, I/O systems, programming languages, operating systems, application codes, without compromising too much parallel performance and scalability. Such skillful compromises are essential to the success of the parallel processing industry, and pose even more intellectual challenges than the 'start from scratch' approach.
In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multit...
详细信息
In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER coherence algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), then the expected execution time of a `fully strict' multithreaded computation on P processors, each with an LRU cache of C pages, is O(T1(C)/P+mCT∞), where T1(C) is the total work of the computation including page faults, T∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number of page faults incurred by a computation executed on P processors, each with an LRU cache of C pages, is F1(C)+O(CPT∞), where F1(C) is the number of serial page faults. Finally, we give simple bounds on the number of page faults and the space requirements for `regular' divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.
Mixed task and data parallelism exists naturally in many applications, but utilizing it may require sophisticated scheduling algorithms and software support. Recently, significant research effort has been applied to e...
详细信息
ISBN:
(纸本)9780897917179
Mixed task and data parallelism exists naturally in many applications, but utilizing it may require sophisticated scheduling algorithms and software support. Recently, significant research effort has been applied to exploiting mixed parallelism in both theory and systems communities. In this paper, we ask how much mixed parallelism will improve performance in practice, and how architectural evolution impacts these estimates. First, we build and validate a performance model for a class of mixed task and data parallel problems based on machine and problem parameters. Second, we use this model to estimate the gains from mixed parallelism for some scientific applications on current machines. This quantifies our intuition that mixed parallelism is best when either communication is slow or the number of processors is large. Third, we show that, for balanced divide and conquer trees, a simple one-time switch between data and task parallelism gets most of the benefit of general mixed parallelism. Fourth, we establish upper bounds to the benefits of mixed parallelism for irregular task graphs. Apart from these detailed analyses, we provide a framework in which other applications and machines can be evaluated.
暂无评论