Most recent research in instruction-level parallelism has focused on general-purpose applications such as the SPEC benchmarks. Many quantitative experiments have been performed over the years measuring the impact of d...
详细信息
ISBN:
(纸本)0818679778
Most recent research in instruction-level parallelism has focused on general-purpose applications such as the SPEC benchmarks. Many quantitative experiments have been performed over the years measuring the impact of different execution models and optimization techniques on these applications. Recently, however, researchers have been developing various ILP architectures far media processors in order to exploit parallelism in audio, video, and graphics applications. It has been assumed that these applications contain far more potential parallelism than general-purpose code, but there have been few attempts to quantify the available parallelism. In this paper, we present a linear complexity, global scheduling algorithm that can process very long traces up to I billion operations. Therefor-e, traces of video applications such as MPEG1, MPEG2, MPEG4 and H.263 encoders and decoders can be analyzed. Using an idealized execution model, speedups of over 1000 have been Sound in some applications. The experiment shows that eliminating currently identifiable bottlenecks can allow the exploitation of huge amounts of ILP in audio and video applications.
DataScalar architectures improve memory system performance by running computation redundantly across multiple processors, which are each tightly coupled with an associated memory. The program data set (and/or text) is...
详细信息
ISBN:
(纸本)9780897919012
DataScalar architectures improve memory system performance by running computation redundantly across multiple processors, which are each tightly coupled with an associated memory. The program data set (and/or text) is distributed across these memories. In this execution model, each processor broadcasts operands it loads from its local memory to all other units. In this paper, we describe the benefits, costs, and problems associated with the DataScalar model. We also present simulation results of one possible implementation of a DataScalar system. In our simulated implementation, six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable, more traditional memory system. Our intuition and results show that DataScalar architectures work best with codes for which traditional parallelization techniques fail. We conclude with a discussion of how DataScalar systems may accommodate traditional parallel processing, thus improving performance over a much wider range of applications than is currently possible with either model.
作者:
Hariharan, RINDIAN INST SCI
DEPT COMP SCI & AUTOMAT BANGALORE 560012 KARNATAKA INDIA NYU
COURANT INST MATH SCI NEW YORK NY USA
An O(m)-work, O(m)-space, O(log(4) m)-time CREW-PRAM algorithm for constructing the suffix tree of a string s of length m drawn from any fixed alphabet set is obtained. This is the first known work and space optimal p...
详细信息
ISBN:
(纸本)9780897916639
An O(m)-work, O(m)-space, O(log(4) m)-time CREW-PRAM algorithm for constructing the suffix tree of a string s of length m drawn from any fixed alphabet set is obtained. This is the first known work and space optimal parallel algorithm for this problem. It can be generalized to a string s drawn from any general alphabet set to perform in O(log(4) m) time and O(m log \Sigma\) work and space, after the characters in s have been sorted alphabetically, where \Sigma\ is the number of distinct characters in s. In this case too, the algorithm is work-optimal. (C) 1997 Academic Press.
The proceedings contains 30 papers from the 24th annual International symposium on Computer Architecture. Topics discussed include: dynamic code sequence;recovery based routing algorithms;multidestination worms;switch...
详细信息
The proceedings contains 30 papers from the 24th annual International symposium on Computer Architecture. Topics discussed include: dynamic code sequence;recovery based routing algorithms;multidestination worms;switch based parallel systems;disk arrays;distributed shared memory multiprocessors;cache architecture;incremental prefetching techniques;stream buffers;sequential and release consistency;data dependence speculation;dynamic instruction reuse;microarchitecture;coherence controller;data prefetching;target cache;indirect jumps;branch prediction;conflict aliasing;DataScalar architecture;and adaptive cache hierarchy management.
This paper describes an application in high-performance signal processing using reconfigurable computing engines. The application is a 250 MHz cross-correlator for radio astronomy and was developed using the fastest a...
详细信息
ISBN:
(纸本)9780897918015
This paper describes an application in high-performance signal processing using reconfigurable computing engines. The application is a 250 MHz cross-correlator for radio astronomy and was developed using the fastest available Xilinx FPGA's. We will report experimental results on the operation of CMOS FPGA's at 250 MHz, and describe the architectural innovations required to build a 250 MHz reconfigurable signal processor. Extensions of the technique to a variety of high-performance real-time signal processing algorithms are discussed. The results of this design work provide important clues as to how to improve FPGA architectures to better support real-time signal processing at hundreds of MHz. In particular, direct routing resources between logic elements are critical to preserving high performance. These routing resources need to be symmetric in order to allow for two-way communications between logic elements. Four-way symmetry and regularity would allow for orthogonal transformations of processing elements in a hierarchical fashion. Finally, experimental results indicate that clock buffering is frequently the cause of ultimate failure in speed and performance tests. Wave pipelining techniques may be suitable in clock distribution to improve performance to match that of other elements in the system.
We consider the problem of optimizing the total flow time a stream of jobs that are released over time in a multi-processor setting. This problem is NP-hard even when we allow preemption, and have only two machines. A...
详细信息
ISBN:
(纸本)9780897918886
We consider the problem of optimizing the total flow time a stream of jobs that are released over time in a multi-processor setting. This problem is NP-hard even when we allow preemption, and have only two machines. Although the total (or average) flow time is widely accepted as a good measurement of the overall quality of service, no approximation algorithms were known for this basic scheduling problem. This paper contains two main results. We first prove that when preemption is allowed, Shortest Remaining Processing Time (SRPT) is an O(log(min{n/m, P})) approximation algorithm for the total flow time, where n is the number of jobs, m is the number of machines, and P is the ratio between the maximum and the minimum processing time of a job. We also provide an Ω(log n/m) and an Ω(log P) lower bounds on the (worst case) competitive ratio of any randomized algorithm for the on-line problem in which jobs are known at their release times. Thus, we show that up to a constant factor SRPT is an optimal on-line algorithm. Our second main result addresses the non-preemptive case. We present a general technique that allows to transform any preemptive solution into a non-preemptive solution at the expense of an O(√n/m) factor in the approximation ratio of the total flow time. Combining this technique with our previous result yields an O(√n/m log n/m) approximation algorithm for this case. We also show an Ω(n1/3-Ε) lower bound on the approximability of this problem (assuming P≠NP).
Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based...
详细信息
Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is non-trivial. In this paper we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such implementation (deadlock-freedom, replication mechanisms, header encoding, and routing). Next, we demonstrate how an existing central-buffer-based switch architecture supporting unicast message passing can be enhanced to accommodate multidestination message passing. Similarly, implementing multidestination worms on an input-buffer-based switch architecture is discussed. Both of these implementations are evaluated against each other as well as against a software-based scheme using the central buffer organization. Simulation experiments under a range of traffic (multiple multicast, bimodal, varying degree of multicast, and message length) and system size are used for evaluation. The study demonstrates the superiority of the central-buffer-based switch architecture. It also indicates that under bimodal traffic the central-buffer-based hardware multicast implementation affects background unicast traffic less adversely compared to a software-based multicast implementation. Thus, multidestination message passing can easily be applied to switch-based parallel systems to deliver good collective communication performance.
The proceedings contain 42 papers. The topics discussed include: managing dependencies - a key problem in fault-tolerant distributed algorithms;an approach to fault-tolerant parallel processing on intermittently idle,...
ISBN:
(纸本)0818678313
The proceedings contain 42 papers. The topics discussed include: managing dependencies - a key problem in fault-tolerant distributed algorithms;an approach to fault-tolerant parallel processing on intermittently idle, heterogeneous workstations;renegotiable quality of service - a new scheme for fault tolerance in wireless networks;evaluation of a 32-bit microprocessor with built-in concurrent error-detection;probabilistic checkpointing;portable checkpointing for heterogenous architectures;a communication-induced checkpointing protocol that ensures rollback-dependency trackability;a method to automate user interface testing using variable finite state machines;towards a statistical approach to testing object-oriented programs;and experimental evaluation of failure-detection schemes in real-time communication networks.
暂无评论