This paper is a contribution to the already existing series of work on the algorithmic principles of interprocedural analysis. We consider the generalization to the case of parallel programs. We give algorithms that c...
详细信息
ISBN:
(纸本)9781581131253
This paper is a contribution to the already existing series of work on the algorithmic principles of interprocedural analysis. We consider the generalization to the case of parallel programs. We give algorithms that compute the sets of backward resp. forward reachable configurations for parallel flow graph systems in linear time in the size of the graph viz. the program. These operations are important in dataflow analysis and in model checking. In our method, we first model configurations as terms (viz. trees) in the process algebra PA that can express call stack operations and parallelism. We then give a `declarative' Horn-clause specification of the sets of predecessors resp. successors. The `operational' computation of these sets is carried out using the Dowling-Gallier procedure for HornSat.
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages...
详细信息
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances which we call the minimum phase remapping problem. We first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, we show how the problem can be phrased as an instance of multi-commodity flow. Next, we study a continuous approximation to the problem. We show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, we devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The parallel Vec...
详细信息
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word interleaved and cache-line inter-leaved memory systems.
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, ...
详细信息
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. This problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new more accurate analysis of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilk platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. This scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context.' A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
We study the problem of scheduling parallel machines online, allowing preemptions while disallowing migration of jobs that have been scheduled on one machine to another. For a given job, we measure the quality of serv...
详细信息
ISBN:
(纸本)0898714532
We study the problem of scheduling parallel machines online, allowing preemptions while disallowing migration of jobs that have been scheduled on one machine to another. For a given job, we measure the quality of service provided by an algorithm by the stretch of the job, defined as the ratio of the amount of time the job spends in the system (also known as the response time) to its processing time. For a sequence of jobs, we measure the performance of an algorithm by the average stretch achieved over all the jobs. The scheduling goal is to minimize the average stretch. This problem is of relevance in many applications, e.g., wireless data servers and distributed server systems In wired networks. We prove an O(1) competitive ratio for this problem. The algorithm for which we prove this result is the one proposed in [2] that has (tight) logarithmic competitive ratio for minimizing the average response time. Thus the algorithm in [2] simultaneously performs well for average response time as well as average stretch. We prove the O(1) competitive ratio against an adversary who not only knows the entire input ahead of time, but is also allowed to migrate jobs. Thus our result shows that migration is not necessary to be competitive for minimizing average stretch;in contrast, we prove that preemption is essential, even if randomization is allowed. We also establish a constant-factor lower bound on the competitive ratio of any online algorithm that minimizes average stretch without migration.
We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs, obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunity. Altho...
详细信息
We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs, obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunity. Although computing optimal schedules usually requires the use of general (often inefficient) function-optimizing methods, we show how to compute optimal schedules efficiently for the broad class of opportunities whose durations come from a concave probability distribution. Even when no such efficient computation of an optimal schedule is available, our refined model always suggests a natural notion of approximately optimal schedule, which may be efficiently computable. We illustrate such efficient approximability via the important class of cycle-stealing opportunities whose durations come from a heavy-tailed distribution. Such opportunities do not admit any optimal schedule - nor even a natural notion of approximately optimal schedule - within the model of [5, 16]. Within our refined model, though, we derive computationally simple schedules for heavy-tailed opportunities, which can be `tuned' to have expected work-output that is arbitrarily close to optimal.
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow ...
详细信息
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a data-flow model based on the producer-consumer style of programming and can be adopted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both the algorithms, we analyze the ratio of remote vs local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For large number of processors, both the algorithms perform well, yielding execution times of only 10 msec for an input of 16 K data points on a 64 processor machine, assuming each processor running at 140 MHz clock speed.
Network of workstation (NOW) is a cost-effective alternative to massively parallel supercomputers. As commercially available off-the-shelf processors become cheaper and faster, it is now possible to build a PC or work...
详细信息
Network of workstation (NOW) is a cost-effective alternative to massively parallel supercomputers. As commercially available off-the-shelf processors become cheaper and faster, it is now possible to build a PC or workstation cluster that provides high computing power within a limited budget. However, a cluster may consist of different types of processors and this heterogeneity within a cluster complicates the design of efficient collective communication protocols. This paper shows that a simple heuristic called fastest-node-first (FNF) is very effective in reducing broadcast time for heterogeneous cluster systems. Despite the fact that FNF heuristic fails to give the optimal broadcast time for a general heterogeneous network of workstation, we prove that FNF always gives the optimal broadcast time in several special cases of clusters. Based on these special case results, we show that FNF is an approximation algorithm that guarantees a competitive ratio of 2. From these theoretical results we also derive techniques to speed up the branch-and-bound search for the optimal broadcast schedule in HNOW.
In some scenarios involving on-line transaction processing within a distributed database, it is desirable to synchronize transactions in a manner that guarantees conflict equivalence with a serial schedule ordered by ...
详细信息
ISBN:
(纸本)076950728X
In some scenarios involving on-line transaction processing within a distributed database, it is desirable to synchronize transactions in a manner that guarantees conflict equivalence with a serial schedule ordered by original transaction start times while providing each transaction with an anomaly serializable isolation. Few theoretical concurrency control algorithms guarantee such a conflict equivalence, and we are unaware of any protocol that accomplishes this while supporting real-world issues such as out-of-order transaction messages, out-of-order operation executions, and out-of-order transaction committals without the burden of explicit readset and writeset declarations We describe an algorithm that provides this guarantee mid supports these issues while requiring only table-level writeset declarations.
We propose an optoelectronic parallel-matching architecture (PMA) that provides powerful processing capability for distributed algorithms comparing with traditional parallel computing architectures. The PMA is compose...
详细信息
ISBN:
(纸本)354067442X
We propose an optoelectronic parallel-matching architecture (PMA) that provides powerful processing capability for distributed algorithms comparing with traditional parallel computing architectures. The PMA is composed of a parallel-matching (PM) module and multiple processing elements (PE's). The PM module is implemented by a large-fan-out free-space optical interconnection and parallel-matching smart-pixel array (PM-SPA). In the proposed architecture, each PE can monitor the other PE's by utilizing several kinds of global processing by the PM module. The PE's can execute concurrent data matching among the others as well as inter-processor communication, Based on the state-of-the-art optoelectronic devices and a diffractive optical element, a prototype of the PM module is constructed. The prototype is assumed to be used in a multiple processor system composed of 4 x 4 processing elements, which are completely connected via 1-bit optical communication channels. On the prototype demonstrator, the fundamental operations of the PM module such as parallel-matching operations and inter-processor communication were virified at 15MHz.
暂无评论