ISBN: 9781581138405 (print)
An acyclic set in an undirected graph G = (V, E) is a set A ⊂ V(G) such that the graph induced on A is acyclic. A maximal acyclic set (MAS) has the further property that no vertex can be added to it without creating a cycle in the induced graph. We present the first NC algorithm that finds an MAS in any graph, running in O(log^4 n) time using O((m^2 + n)/log n) processors on an EREW PRAM, answering questions raised by Chen and He. Our NC algorithm for finding an MAS in a graph relies on a novel NC algorithm for finding a maximal forest in a hypergraph, which may be of independent interest.
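The MAS property itself can be illustrated with a simple sequential greedy pass (this is not the paper's NC algorithm, which is parallel): a vertex may join the set only if it does not connect two vertices already lying in the same tree of the induced forest, which a union-find structure detects cheaply. A minimal Python sketch:

```python
# Greedy sequential construction of a maximal acyclic set (MAS).
# Illustrates the MAS property only -- NOT the paper's NC algorithm.
# Adding v creates a cycle iff two of v's neighbours already in A
# belong to the same tree of the induced forest (same DSU root).

class DSU:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def maximal_acyclic_set(vertices, edges):
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    A, dsu = set(), DSU()
    for v in vertices:
        nbrs_in_A = [u for u in adj[v] if u in A]
        roots = {dsu.find(u) for u in nbrs_in_A}
        # distinct roots == neighbours means no two share a tree: safe to add.
        if len(roots) == len(nbrs_in_A):
            dsu.parent[v] = v
            for r in roots:
                dsu.union(r, v)
            A.add(v)
    return A
```

On a triangle, the pass admits the first two vertices and must reject the third, since it would close the cycle.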
ISBN: 9781581138405 (print)
We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) that can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together: the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to represent both user-oriented and system-administrator objectives. We propose an algorithm based on a batch policy with increasing batch sizes and a smart selection of jobs in each batch. The algorithm is assessed by intensive simulations, compared against a new lower bound (obtained by an ILP relaxation) on the optimal schedules for each criterion separately. It is currently implemented on an actual real-size cluster platform.
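As an illustration only, a toy single-processor sketch of the batch idea — doubling batch sizes, with Smith's rule (sort by processing time over weight) inside each batch — might look as follows. The paper's actual algorithm schedules moldable parallel tasks and balances both criteria jointly, which this sketch does not capture:

```python
# Toy sketch of a batch policy with increasing (doubling) batch sizes.
# Simplified to one processor; inside each batch jobs are ordered by
# Smith's rule (processing time / weight), which minimizes the weighted
# completion time within the batch. Illustrative only.

def batch_schedule(jobs):
    """jobs: list of (weight, processing_time).
    Returns (order, makespan, weighted_minsum)."""
    order, size, i = [], 1, 0
    while i < len(jobs):
        batch = jobs[i:i + size]
        batch.sort(key=lambda j: j[1] / j[0])  # Smith's rule in the batch
        order.extend(batch)
        i += size
        size *= 2                              # increasing batch sizes
    t, minsum = 0, 0
    for w, p in order:
        t += p                                 # job completes at time t
        minsum += w * t
    return order, t, minsum
```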
ISBN: 9781581138405 (print)
The parallel packet switch (PPS) is extensively used as the core of contemporary commercial switches. This paper investigates the inherent queuing delay and delay jitter introduced by the PPS's demultiplexing algorithm, relative to an optimal work-conserving switch. We show that the inherent queuing delay and delay jitter of a symmetric and fault-tolerant N × N PPS, in which every demultiplexing algorithm dispatches cells to all the middle-stage switches, are Ω(N) if there are no buffers in the PPS input-ports. If the demultiplexing algorithms dispatch cells to only part of the middle-stage switches, the queuing delay and delay jitter are Ω(N/S), where S is the PPS speedup. These lower bounds hold unless the demultiplexing algorithm has full and immediate knowledge of the switch status. When the PPS has buffers in its input-ports, an Ω(N/S) lower bound holds if the demultiplexing algorithm uses only local information, or if the input buffers are small relative to the time an input-port needs to learn the global switch state.
ISBN: 9781581138405 (print)
A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one-port memories) in order to minimize contention; however, this also minimizes memory sharing. Ideal sharing occurs with a single memory shared by all processors, but this maximizes contention. Instead, in this paper we show that perfect sharing of memory can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories. We show that the allocation problem is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2 of optimal. The proof utilizes a new bin-packing model, which is interesting in its own right. Further, for important special cases that arise in practice, the approximation algorithm is in fact optimal. We also describe an incremental memory allocation algorithm that provides good memory utilization while allowing fast updates.
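To convey the flavour of the allocation problem, here is a hypothetical greedy sketch that pours per-processor memory demands into fixed-size memories while respecting a two-port constraint (at most two processors per memory, one per port). The greedy rule and all names are illustrative assumptions; this is not the paper's 3/2-approximation:

```python
# Hypothetical greedy allocation of per-processor memory demands to
# fixed-size two-port memories. Constraint modelled here: each memory
# may be shared by at most two processors (one per port). Illustrative
# sketch only -- not the paper's approximation algorithm.

def allocate(demands, mem_size):
    """demands: list of per-processor memory needs.
    Returns a list of memories, each a dict {processor: amount}."""
    memories = [{}]
    # Largest demands first, a common bin-packing heuristic.
    for proc, need in sorted(enumerate(demands), key=lambda d: -d[1]):
        while need > 0:
            cur = memories[-1]
            free = mem_size - sum(cur.values())
            if free == 0 or len(cur) == 2:   # memory full or both ports taken
                memories.append({})
                continue
            take = min(free, need)
            cur[proc] = cur.get(proc, 0) + take
            need -= take
    return memories
```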
ISBN: 9781581138405 (print)
Parallel disks provide a cost-effective way of speeding up I/Os in applications that work with large amounts of data. The main challenge is to achieve as much parallelism as possible, using prefetching to avoid bottlenecks in disk access. Efficient algorithms have been developed for some particular patterns of accessing the disk blocks. In this paper, we consider general request sequences. When the request sequence consists of unique block requests, the problem is called prefetching and is a well-solved problem for arbitrary request sequences. When the reference sequence can have repeated references to the same block, we need to devise an effective caching policy as well. While optimal offline algorithms have recently been designed for the problem, in the online case no effective algorithm was previously known. Our main contribution is a deterministic online algorithm, threshold-LRU, which achieves an O((MD/L)^(2/3)) competitive ratio, and a randomized online algorithm, threshold-MARK, which achieves an O(√(MD/L) log(MD/L)) competitive ratio for the caching/prefetching problem on the parallel disk model (PDM), where D is the number of disks, M is the size of the fast memory buffer, and M + L is the amount of lookahead available in the request sequence. The best-known lower bound on the competitive ratio is Ω(√(MD/L)) for lookahead L ≥ M in both models. We also show that if the deterministic online algorithm is allowed twice the memory of the offline algorithm, then a tight competitive ratio of Θ(√(MD/L)) can be achieved. This problem generalizes the well-known paging problem on a single disk to the parallel disk model.
ISBN: 9781581138405 (print)
A key capability of data-race detectors is to determine whether one thread executes logically in parallel with another or whether the threads must operate in series. This paper provides two algorithms, one serial and one parallel, to maintain series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. The serial SP-order algorithm runs in O(1) amortized time per operation. In contrast, the previously best algorithm requires a time per operation that is proportional to Tarjan's functional inverse of Ackermann's function. SP-order employs an order-maintenance data structure that allows us to implement a more efficient "English-Hebrew" labeling scheme than was used in earlier race detectors, which immediately yields an improved determinacy-race detector. In particular, any fork-join program running in T1 time on a single processor can be checked on the fly for determinacy races in O(T1) time. Corresponding improved bounds can also be obtained for more sophisticated data-race detectors, for example, those that use locks. By combining SP-order with Feng and Leiserson's serial SP-bags algorithm, we obtain a parallel SP-maintenance algorithm, called SP-hybrid. Suppose that a fork-join program has n threads, T1 work, and a critical-path length of T∞. When executed on P processors, we prove that SP-hybrid runs in O((T1/P + PT∞) lg n) expected time. To understand this bound, consider that the original program obtains linear speed-up over a 1-processor execution when P = O(T1/T∞). In contrast, SP-hybrid obtains linear speed-up when P = O(√(T1/T∞)), but the work is increased by a factor of O(lg n).
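The English-Hebrew idea can be illustrated on a static SP parse tree (the paper maintains these orders on the fly, which this sketch does not attempt): the English order visits threads left-to-right, the Hebrew order reverses the children of parallel nodes only, and two threads are logically parallel exactly when the two orders disagree about them. A minimal Python sketch with illustrative names:

```python
# Static "English-Hebrew" labelling on an SP parse tree (illustration
# only, not the paper's on-the-fly SP-order data structure).
# English order: P-node children visited left-to-right.
# Hebrew order:  P-node children visited right-to-left.
# Two threads are parallel iff the two orders disagree on them.

class S:   # series composition: left must finish before right starts
    def __init__(self, left, right): self.kids = (left, right)

class P:   # parallel composition: left and right may interleave
    def __init__(self, left, right): self.kids = (left, right)

def label(node, order, flip_parallel):
    if isinstance(node, str):            # leaf = thread name
        order[node] = len(order)
        return
    kids = node.kids
    if flip_parallel and isinstance(node, P):
        kids = kids[::-1]                # Hebrew reverses only P-nodes
    for k in kids:
        label(k, order, flip_parallel)

def parallel(tree, a, b):
    eng, heb = {}, {}
    label(tree, eng, flip_parallel=False)
    label(tree, heb, flip_parallel=True)
    return (eng[a] < eng[b]) != (heb[a] < heb[b])
```

For the program "run a, then fork b and c", b and c are parallel while a serially precedes both.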
ISBN: 0769520979 (print)
Block (cyclic) channel coding standards for third-generation cellular networks require the implementation of high-performance burst-error detection and correction algorithms. Galois field (GF) arithmetic is commonly used in these architectures for encoding and decoding error codes; however, many architectures still do not support dedicated functional units. This paper presents the design of a generic parallel finite-field GF(2^m) multiplier targeted at DSP and embedded processors. As opposed to previous research, this design can take different primitive polynomials as an input and is therefore reprogrammable. Moreover, a design is presented that is a combined binary and finite-field GF(2^m) multiplier. Area, delay, and power dissipation results are presented from several ASIC libraries.
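A bit-serial software model of such a reprogrammable multiplier is easy to sketch: the reduction polynomial becomes a runtime parameter rather than being hard-wired. The sketch below (plain Python, not the paper's parallel hardware design) is exercised with the AES field GF(2^8) and polynomial x^8 + x^4 + x^3 + x + 1:

```python
# Bit-serial GF(2^m) multiplication with a programmable reduction
# polynomial, mirroring the "reprogrammable" design goal above.
# poly includes the x^m term: for GF(2^8) with x^8 + x^4 + x^3 + x + 1,
# poly = 0x11B and m = 8. Software model only, not the hardware design.

def gf_mul(a, b, m, poly):
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # add (XOR) the current shifted copy of a
        b >>= 1
        a <<= 1
        if a & (1 << m):         # degree reached m: reduce modulo poly
            a ^= poly
    return result
```

In GF(2^8) with the AES polynomial, x · (x + 1) = x^2 + x, i.e. gf_mul(2, 3, 8, 0x11B) yields 6, and 0x53 · 0xCA yields 1 (they are multiplicative inverses in that field).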
The proceedings contain 48 papers from the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures. Topics discussed include: compact routing schemes; simple on-line algorithms for the maximum disjoint paths problem; competitive buffer management for shared-memory switches; attack propagation in networks; computational power of pipelined memory hierarchies; and a data tracking scheme for general networks.
ISBN: 0780387791 (print)
The one-to-one strategy of mapping each single data item to a graphical marker, adopted in many visualization techniques, has limited usefulness when the number of records and/or the dimensionality of the data set is very high. In this situation, the strong overlapping of graphical markers severely hampers the user's ability to identify patterns in the data from its visual representation. We tackle this problem with a strategy that computes frequency or density information from the data set and uses this information in Parallel Coordinates visualizations to filter the information presented to the user, thus reducing visual clutter and allowing the analyst to observe relevant patterns in the data. The algorithms to construct such visualizations, and the supported interaction mechanisms, inspired by traditional image processing techniques such as grayscale manipulation and thresholding, are also presented. We also illustrate how such algorithms can assist users in effectively identifying clusters in very noisy large data sets.
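One simple way to realize the frequency idea is to histogram values per axis and keep only records whose per-axis bins are all sufficiently populated. The sketch below is illustrative only — the binning scheme and thresholding rule are assumptions, not the paper's algorithms:

```python
# Frequency-based filtering for Parallel Coordinates (illustrative sketch).
# Each record's value on each axis is mapped to a histogram bin; records
# whose bins are all populated above a threshold survive, reducing clutter.

def frequency_filter(records, bins, threshold):
    """records: list of equal-length numeric tuples (one value per axis)."""
    dims = len(records[0])
    lo = [min(r[d] for r in records) for d in range(dims)]
    hi = [max(r[d] for r in records) for d in range(dims)]

    def cell(d, x):
        if hi[d] == lo[d]:
            return 0
        # map x linearly into [0, bins-1] on axis d
        return min(int((x - lo[d]) / (hi[d] - lo[d]) * bins), bins - 1)

    counts = {}
    for r in records:
        for d, x in enumerate(r):
            key = (d, cell(d, x))
            counts[key] = counts.get(key, 0) + 1
    return [r for r in records
            if all(counts[(d, cell(d, x))] >= threshold
                   for d, x in enumerate(r))]
```

With a low threshold this drops isolated outlier polylines while dense clusters remain visible.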
ISBN: 0769521266 (print)
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands, but then require expensive logic to track large numbers of outstanding cache misses to sustain peak bandwidth from memory. We present refill/access decoupling, which augments the vector processor with a Vector Refill Unit (VRU) to quickly pre-execute vector memory commands and issue any needed cache line refills ahead of regular execution. The VRU reduces costs by eliminating much of the outstanding-miss state required in traditional vector architectures and by using the cache itself as a cost-effective prefetch buffer. We also introduce vector segment accesses, a new class of vector memory instructions that efficiently encode two-dimensional access patterns. Segments reduce address bandwidth demands and enable more efficient refill/access decoupling by increasing the information contained in each vector memory command. Our results show that refill/access decoupling achieves better performance with fewer resources than more traditional decoupling methods. Even with a small cache and memory latencies as long as 800 cycles, refill/access decoupling can sustain several kilobytes of in-flight data with minimal access management state and no need for expensive reserved element buffering.