Von zur Gathen proposed an efficient parallel exponentiation algorithm in finite fields using normal basis representations. In this paper we present a processor-efficient parallel exponentiation algorithm in GF(2^n) which improves upon von zur Gathen's algorithm. We also show that exponentiation in GF(2^n) can be done in O(log n) time using n/(log n)^2 processors. Hence we get a processor × time bound of O(n/log n), which is optimal. Finally, we present an on-line processor assignment scheme, which was missing in von zur Gathen's algorithm, and show that its time complexity is negligible.
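Not the paper's algorithm or von zur Gathen's: the sketch below only illustrates the exponentiation pattern the abstract builds on, with integers modulo a prime standing in for GF(2^n) elements and the function name parallel_style_pow invented for the example. The point is that in a normal-basis representation each squaring a^(2^i) is just a cyclic shift of the coefficient vector, so the cost is dominated by the multiplications in the balanced product tree, whose levels are independent and could each be done in one parallel round.

```python
# Hedged sketch, not the paper's algorithm: square-and-multiply with a balanced
# product tree, over integers modulo a prime as a stand-in for GF(2^n).  In a
# normal basis of GF(2^n) each squaring is a cyclic shift of the coefficient
# vector, so only the tree multiplications would remain as real work.

def parallel_style_pow(a, e, p):
    """Compute a**e mod p: repeated squarings, then a balanced product tree."""
    # All repeated squarings a^(2^i); in a normal-basis representation these
    # would be cyclic shifts rather than modular squarings.
    squarings = []
    x = a % p
    for _ in range(e.bit_length()):
        squarings.append(x)
        x = (x * x) % p

    # Keep the factors a^(2^i) for the set bits of e.
    factors = [squarings[i] for i in range(e.bit_length()) if (e >> i) & 1]
    if not factors:
        return 1 % p

    # Balanced product tree: all products within one level are independent,
    # so each level could be computed in a single parallel round.
    while len(factors) > 1:
        nxt = [factors[i] * factors[i + 1] % p for i in range(0, len(factors) - 1, 2)]
        if len(factors) % 2:
            nxt.append(factors[-1])
        factors = nxt
    return factors[0]

if __name__ == "__main__":
    p = 2**61 - 1
    assert parallel_style_pow(3, 1000003, p) == pow(3, 1000003, p)
    print("ok")
```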
We present the design and implementation of a parallel out-of-core sorting algorithm, which is based on Leighton's columnsort algorithm. We show how to relax some of the steps of the original columnsort algorithm to permit a faster out-of-core implementation. Our algorithm requires only 4 passes over the data, and a 3-pass implementation is possible. Although there is a limit on the number of records that can be sorted, as a function of the memory used per processor, this upper limit need not be a severe restriction, and it increases superlinearly with the per-processor memory. To the best of our knowledge, our implementation is the first out-of-core multiprocessor sorting algorithm whose output is in the order assumed by the Parallel Disk Model. We define several measures of sorting efficiency and demonstrate that our implementation's sorting efficiency is competitive with that of NOW-Sort, a sorting algorithm developed to sort large amounts of data quickly on a cluster of workstations.
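For context, the eight steps of Leighton's original in-memory columnsort are short enough to write down; the sketch below is a minimal in-core rendering of them (not the relaxed out-of-core variant developed in the paper), assuming r rows and s columns with s dividing r and r ≥ 2(s-1)².

```python
import numpy as np

def columnsort(values, r, s):
    """Minimal in-core sketch of Leighton's 8-step columnsort (not the relaxed
    out-of-core variant).  Sorts r*s values; the result is the column-major
    reading of the final r x s matrix."""
    assert len(values) == r * s and r % s == 0 and r >= 2 * (s - 1) ** 2
    m = np.asarray(values, dtype=float).reshape(s, r).T        # column-major r x s

    m = np.sort(m, axis=0)                                     # 1: sort columns
    m = m.flatten(order='F').reshape(r, s)                     # 2: "transpose"
    m = np.sort(m, axis=0)                                     # 3: sort columns
    m = m.flatten(order='C').reshape(r, s, order='F')          # 4: "untranspose"
    m = np.sort(m, axis=0)                                     # 5: sort columns
    half = r // 2                                              # 6: shift by r/2,
    shifted = np.concatenate(([-np.inf] * half,                #    padding with +-inf
                              m.flatten(order='F'),
                              [np.inf] * half))
    m2 = np.sort(shifted.reshape(r, s + 1, order='F'), axis=0) # 7: sort columns
    return m2.flatten(order='F')[half:-half]                   # 8: unshift

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r, s = 32, 4                                               # 32 >= 2*(4-1)^2
    data = rng.permutation(r * s)
    assert np.array_equal(columnsort(data, r, s), np.sort(data))
    print("columnsort output matches np.sort")
```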
Massive data sets often arise as physically distributed, parallel data streams. We present algorithms for estimating simple functions on the union of such data streams, while using only logarithmic space per stream. Each processor observes only its own stream, and communicates with the other processors only after observing its entire stream. This models the set-up in current network monitoring products. Our algorithms employ a novel coordinated sampling technique to extract a sample of the union; this sample can be used to estimate aggregate functions on the union. The technique can also be used to estimate aggregate functions over the distinct "labels" in one or more data streams, e.g., to determine the zeroth frequency moment (i.e., the number of distinct labels) in one or more data streams. Our space and time bounds are the best known for these problems, and our logarithmic space bounds for coordinated sampling contrast with polynomial lower bounds for independent sampling. We relate our distributed streams model to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms.
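A minimal sketch of what coordinated sampling can look like, assuming a shared hash function and a per-stream sample capacity; the class name, seed, and capacity below are made up for the example, and the estimator shown only targets the number of distinct labels in the union.

```python
import hashlib

def h(label, seed="shared-seed"):
    """Shared hash into [0, 1); every stream uses the same seed, which is what
    makes the sampling coordinated."""
    d = hashlib.sha256((seed + ":" + str(label)).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class CoordinatedSampler:
    """Per-stream sample of distinct labels whose hash falls below 2**-level;
    the level grows whenever the sample overflows its capacity, so the space
    used per stream stays O(capacity)."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.level = 0
        self.sample = set()

    def observe(self, label):
        if h(label) < 2.0 ** -self.level:
            self.sample.add(label)
            while len(self.sample) > self.capacity:
                self.level += 1
                self.sample = {x for x in self.sample if h(x) < 2.0 ** -self.level}

def estimate_union_distinct(samplers):
    """After the streams end: combine the samples at the deepest level reached
    and scale up, estimating the number of distinct labels in the union."""
    level = max(s.level for s in samplers)
    union = {x for s in samplers for x in s.sample if h(x) < 2.0 ** -level}
    return len(union) * 2 ** level

if __name__ == "__main__":
    a, b = CoordinatedSampler(), CoordinatedSampler()
    for x in range(0, 6000):
        a.observe(x)                  # stream A
    for x in range(3000, 9000):
        b.observe(x)                  # stream B, overlapping A
    print("estimated distinct labels in the union:",
          estimate_union_distinct([a, b]), "(true value: 9000)")
```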
When processes execute on parallel computer systems, inter-processor communication is a major bottleneck. One way to address this problem is to minimize the communication between processes that are mapped to different processors. This translates to the k-partitioning problem of the corresponding process graph, where k is the number of processors. The classical spectral lower bound of (|V|/(2k)) · ∑_{i=1}^{k} λ_i for the k-section width of a graph is well-known. We show new relations between the structure and the eigenvalues of a graph and present a new method to obtain tighter lower bounds on the k-section width. This method makes use of the level structure defined by the k-section. We define a global expansion property and prove that, for graphs with the same k-section width, the spectral lower bound increases with this global expansion. We also present examples of graphs for which our new bounds are tight up to a constant factor.
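As a concrete illustration of the classical bound quoted above (not the paper's tighter method), the snippet below evaluates (|V|/(2k)) · ∑_{i=1}^{k} λ_i from the Laplacian eigenvalues of a small example graph; the choice of graph and of k is arbitrary.

```python
import numpy as np

def laplacian(edges, n):
    """Graph Laplacian L = D - A of an undirected graph on n vertices."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1
        L[v, v] += 1
        L[u, v] -= 1
        L[v, u] -= 1
    return L

def classical_k_section_bound(edges, n, k):
    """(|V|/(2k)) times the sum of the k smallest Laplacian eigenvalues
    (lambda_1 = 0 is part of the sum)."""
    eig = np.sort(np.linalg.eigvalsh(laplacian(edges, n)))
    return n / (2 * k) * eig[:k].sum()

if __name__ == "__main__":
    # Example: the cycle C_16 partitioned into k = 4 equal parts.  Cutting it into
    # 4 contiguous arcs cuts 4 edges, so the true 4-section width is 4; the
    # classical spectral bound below is weaker (roughly 1.8).
    n, k = 16, 4
    edges = [(i, (i + 1) % n) for i in range(n)]
    print("classical spectral lower bound:",
          round(classical_k_section_bound(edges, n, k), 3))
```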
We present an efficient, randomized, online scheduling algorithm for a large class of programs with write-once synchronization variables. The algorithm combines the work-stealing paradigm with the depth-first scheduling technique, resulting in high space efficiency and good time complexity. By automatically increasing the granularity of the work scheduled on each processor, our algorithm achieves good locality, low contention and low scheduling overhead, improving upon a previous depth-first scheduling algorithm [6] published in SPAA'97. Moreover, it is provably efficient for the general class of multithreaded computations with write-once synchronization variables (as studied in [6]), improving upon algorithm DFDeques (published in SPAA'99 [24]), which applies only to the more restricted class of nested parallel computations. More specifically, consider such a computation with work T1, depth T∞ and σ synchronizations, and suppose that space S1 suffices to execute the computation on a single-processor computer. Then, on a P-processor shared-memory parallel machine, the expected space complexity of our algorithm is at most S1 + O(P T∞ log(P T∞)), and its expected time complexity is O(T1/P + σ log(P T∞)/P + T∞ log(P T∞)). Moreover, for any ε > 0, the space complexity of our algorithm is S1 + O(P(T∞ + ln(1/ε)) log(P(T∞ + ln(P(T∞ + ln(1/ε))/ε)))) with probability at least 1 - ε. Thus, even for values of ε as small as e^(-T∞), the space complexity of our algorithm is at most S1 + O(P T∞ log(P T∞)) with probability at least 1 - e^(-T∞). These bounds include all time and space costs for both the computation and the scheduler.
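Not the paper's scheduler: the toy, single-process simulation below only illustrates the plain work-stealing ingredient (per-processor deques, idle processors stealing from a random victim), with no write-once synchronization variables and no depth-first ordering; the task sizes and processor count are made up for the example.

```python
import random
from collections import deque

def simulate_work_stealing(task_sizes, num_procs=4, seed=0):
    """Toy simulation of plain work stealing: each processor owns a deque of
    tasks, pops work from its own end, and when idle steals from the opposite
    end of a random non-empty victim.  Returns the number of parallel steps
    (each running task makes one unit of progress per step)."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_procs)]
    for i, size in enumerate(task_sizes):          # arbitrary initial placement
        deques[i % num_procs].append(size)
    current = [0] * num_procs                      # remaining work of the running task
    steps = 0
    while any(current) or any(deques):
        for p in range(num_procs):
            if current[p] == 0:
                if deques[p]:
                    current[p] = deques[p].pop()           # local work, own end
                else:
                    victims = [q for q in range(num_procs) if q != p and deques[q]]
                    if victims:
                        current[p] = deques[rng.choice(victims)].popleft()  # steal
            if current[p]:
                current[p] -= 1                    # one unit of progress
        steps += 1
    return steps

if __name__ == "__main__":
    tasks = [3] * 8 + [20]                         # several short tasks, one long one
    print("parallel steps on 4 processors:", simulate_work_stealing(tasks))
    print("total serial work:", sum(tasks))
```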
Two important emerging trends are influencing the design, implementation and deployment of high performance parallel systems. The first is on the architectural end, where both economic and technological factors are compelling the use of off-the-shelf computing elements (workstations/PCs and networks) to put together high performance systems called clusters. The second is from the user community that is finding an increasing number of applications to benefit from such high performance systems. Apart from the scientific applications that have traditionally needed supercomputing power, a large number of graphics, visualization, database, web service and e-commerce applications have started using clusters because of their high processing and storage requirements. These applications have diverse characteristics and can place different Quality-of-Service (QoS) requirements on the underlying system (low response time, high throughput, high I/O demands, guaranteed response/throughput, etc.). Further, clusters running such applications need to cater to a potentially large number of users (or other applications) in a time-shared manner. The underlying system needs to accommodate the requirements of each application, while ensuring that they do not interfere with each other. This paper focuses on the CPU resources of a cluster and investigates scheduling mechanisms to meet the responsiveness, throughput and guaranteed service requirements of different applications. Specifically, we propose and evaluate three different scheduling mechanisms. These mechanisms have been drawn from traditional solutions on parallel systems (gang scheduling and dynamic coscheduling), and have been extended to accommodate the new criteria under consideration. These mechanisms have been investigated using detailed simulation and workload models to show their pros and cons for different performance metrics.
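As a point of reference for one of the traditional mechanisms mentioned above, the sketch below packs jobs into an Ousterhout-style gang-scheduling matrix (rows are time slices, columns are processors, and all processes of a job land in the same row); the job mix and cluster size are invented for the example, and this is not the extended mechanisms evaluated in the paper.

```python
def build_gang_matrix(jobs, num_procs):
    """First-fit packing of parallel jobs into an Ousterhout-style scheduling
    matrix: each row is one time slice, and every job gets all of the processors
    it needs within a single row, so its processes are always coscheduled.
    jobs: list of (name, processors_needed)."""
    rows = []                                      # each row: list of (name, procs)
    for name, need in jobs:
        if need > num_procs:
            raise ValueError(f"job {name} needs more processors than the cluster has")
        for row in rows:
            if sum(p for _, p in row) + need <= num_procs:
                row.append((name, need))
                break
        else:
            rows.append([(name, need)])
    # The scheduler then cycles through the rows round-robin, one quantum per row.
    return rows

if __name__ == "__main__":
    jobs = [("viz", 4), ("db", 2), ("sim", 6), ("web", 2), ("batch", 3)]
    for slot, row in enumerate(build_gang_matrix(jobs, num_procs=8)):
        used = sum(p for _, p in row)
        print(f"time slice {slot}: " +
              ", ".join(f"{name} ({p} procs)" for name, p in row) +
              f"  [{used}/8 processors busy]")
```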
We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for optimal parallel-disk scheduling. Traditional buffer management algorithms that minimize the number of disk I/Os are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. We present a new algorithm, SUPERVISOR, for parallel-disk I/O scheduling. We show that in the off-line case, where a priori knowledge of all the requests is available, SUPERVISOR performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably off-line optimal. In the on-line case, we study SUPERVISOR in the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of SUPERVISOR, with global L-block lookahead, is Θ(M - L + D) when L ≤ M, and Θ(MD/L) when L > M, where the number of disks is D and the buffer size is M.
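Not SUPERVISOR: the sketch below is a naive greedy baseline that only illustrates the cost model (D disks, at most one block fetched per disk per parallel I/O step, a buffer of M blocks, Belady-style eviction); the reference string and disk layout are made up for the example.

```python
def parallel_io_steps(refs, disk_of, D, M):
    """Naive greedy baseline in the parallel disk model (not SUPERVISOR).
    refs: the order in which blocks are consumed; disk_of[block] = its disk.
    Each parallel I/O step fetches the earliest still-missing blocks, at most one
    per disk, evicting the buffered block whose next use is farthest away."""
    buffer, steps, pos = set(), 0, 0
    while pos < len(refs):
        while pos < len(refs) and refs[pos] in buffer:
            pos += 1                               # serve hits from the buffer
        if pos == len(refs):
            break
        future = refs[pos:]
        def next_use(block):                       # position of the next reference
            return future.index(block) if block in future else len(future)
        busy = set()                               # disks already used this step
        for blk in future:
            if len(busy) == D:
                break
            if blk in buffer or disk_of[blk] in busy:
                continue
            if len(buffer) >= M:                   # make room: Belady-style eviction
                buffer.remove(max(buffer, key=next_use))
            buffer.add(blk)
            busy.add(disk_of[blk])
        steps += 1                                 # one parallel I/O completed
    return steps

if __name__ == "__main__":
    refs = ["a", "b", "c", "d", "a", "e", "f", "b"]
    disk_of = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}
    print("parallel I/O steps:", parallel_io_steps(refs, disk_of, D=2, M=3))
```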
A deterministic, oblivious permutation routing algorithm on 2D meshes of constant queue size is presented. An optimal, linear-time bound for oblivious permutation routing is obtained for these meshes. A new algorithm based on the bit-reversal permutation for oblivious routing is proposed, running in (2.954+ε)n steps, where ε is any positive constant and the queue size is a function of ε. The running time of the algorithm is less than 1.5 times the network diameter of the mesh. The n×n mesh is considered, and the model on these meshes is presented. The problems in controlling the pipelining of packet flows are discussed.
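The routing pattern named above is simple to state on its own: the sketch below just computes the bit-reversal permutation on 2^k indices (index i is sent to the index whose k-bit binary representation is the reverse of i's); it is not the routing algorithm itself.

```python
def bit_reversal_permutation(k):
    """Destination of index i is the integer whose k-bit binary representation is
    the reverse of i's.  For example, with k = 3, index 3 = 011 maps to 110 = 6."""
    def reverse_bits(i):
        r = 0
        for _ in range(k):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r
    return [reverse_bits(i) for i in range(2 ** k)]

if __name__ == "__main__":
    print(bit_reversal_permutation(3))   # [0, 4, 2, 6, 1, 5, 3, 7]
```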
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy-wait for another thread. A thread can proceed until its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive, in the total number of operations they perform, with the fastest serial algorithm for the same problem? Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallel algorithms; applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic expressions. To put things in context, we also discuss our `PRAM-on-chip' vision (actually a small update to it), presented at SPAA'98.
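A small illustration of why heap building fits a no-busy-wait tree paradigm, using the textbook bottom-up construction rather than the paper's paradigm: the nodes of one tree level root disjoint subtrees, so their sift-downs are mutually independent, and the levels are processed from the leaves toward the root; a parallel version could run each level as one round of threads that never wait on each other.

```python
def sift_down(heap, i, n):
    """Standard min-heap sift-down of node i within heap[:n]."""
    while True:
        left, right, smallest = 2 * i + 1, 2 * i + 2, i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def build_heap_level_by_level(values):
    """Bottom-up heap construction grouped by tree level.  The nodes of one level
    root disjoint subtrees, so their sift-downs are independent; a parallel version
    could run each level as one round of threads with no waiting inside a round."""
    heap, n = list(values), len(values)
    if n < 2:
        return heap
    deepest = (n // 2).bit_length() - 1            # level of the last internal node
    for level in range(deepest, -1, -1):           # from the leaves' parents to the root
        first = 2 ** level - 1
        last = min(2 ** (level + 1) - 2, n // 2 - 1)
        for i in range(first, last + 1):           # independent within one level
            sift_down(heap, i, n)
    return heap

if __name__ == "__main__":
    h = build_heap_level_by_level([9, 4, 7, 1, 0, 8, 5, 2, 6, 3])
    assert all(h[i] <= h[c] for i in range(len(h))
               for c in (2 * i + 1, 2 * i + 2) if c < len(h))
    print("min-heap:", h)
```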