Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose...
详细信息
ISBN:
(纸本)9780818679773
Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
Previous work (J.S. McCaskill et al., 1996; 1997) has shown the power of massively parallel configurable hardware (NGEN) in conjunction with dataflow architectures for the simulation of evolving populations. NGEN is a...
详细信息
Previous work (J.S. McCaskill et al., 1996; 1997) has shown the power of massively parallel configurable hardware (NGEN) in conjunction with dataflow architectures for the simulation of evolving populations. NGEN is a flexible computer hardware for rapid custom circuit simulation of fine grained physical processes via a massively parallel architecture, e.g. 144 hardware configurable field programmable gate arrays (FPGAs, XC4008, Xilinx). NGEN is optimized to implement dataflow architectures and systolic algorithms for large problems and is confectioned with high speed distributed SRAM, 144*8*256 kBit, 15ns access time, on the chip to chip interconnect. Microconfigurable FPGAs allow a further step to close the gap between micro electronics and biology on the information processing area. A design for a massively parallel microconfigurable computer (POLYP) is presented. It is designed to allow online evolution in hardware with significant locally controllable memory resources. It is also designed for high throughput dataflow applications with large problem size. Additionally, an evolvable interface between high rate measurement devices is provided to allow adaptive processing coupled with real time experimental environments. The computer represents the next logical step towards evolvable hardware interacting with biology beyond the massively parallel computer NGEN.
Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the impleme...
详细信息
Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. The paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column oriented processors were used throughout the design to achieve high performance with limited nearest neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
This paper focuses on the interaction between software prefetching (both binding and nonbinding) and software pipelining for VLIW machines. First, it is shown that evaluating software pipelined schedules without consi...
详细信息
This paper focuses on the interaction between software prefetching (both binding and nonbinding) and software pipelining for VLIW machines. First, it is shown that evaluating software pipelined schedules without considering memory effects can be rather inaccurate due to stalls caused by dependences with memory instructions (even if a lockup-free cache is considered). It is also shown that the penalty of the stalls is in general higher than the effect of spill code. Second, we show that in general binding-schemes are more powerful than nonbinding ones for software pipelined schedules. Finally, the main contribution of this paper is an heuristic scheme that schedules some memory operations according to the locality estimated at compile time and other attributes of the dependence graph. The proposed scheme is shown to outperform other heuristic approaches since it achieves a better trade-off between compute and stall time than the others.
Most recent research in instruction-level parallelism has focused on general-purpose applications such as the SPEC benchmarks. Many quantitative experiments have been performed over the years measuring the impact of d...
详细信息
Most recent research in instruction-level parallelism has focused on general-purpose applications such as the SPEC benchmarks. Many quantitative experiments have been performed over the years measuring the impact of different execution models and optimization techniques on these applications. Researchers have been developing various ILP architectures for media processors in order to exploit parallelism in audio, video, and graphics applications. It has been assumed that these applications contain far more potential parallelism than general-purpose code, but there have been few attempts to quantify the available parallelism. We present a linear complexity global scheduling algorithm that can process very long traces up to 1 billion operations. Therefore, traces of video applications such as MPEG1, MPEG2, MPEG4 and H.263 encoders and decoders can be analyzed. Using an idealized execution model, speedups of over 1000 have been found in some applications. The experiment shows that eliminating currently identifiable bottlenecks can allow the exploitation of huge amounts of ILP in audio and video applications.
The proceedings contain 38 papers. The topics discussed include: efficient compilation of high-level data parallelalgorithms;improved abstract parity-declustered layouts for disk arrays;experiences with parallel n-bo...
ISBN:
(纸本)0897916719
The proceedings contain 38 papers. The topics discussed include: efficient compilation of high-level data parallelalgorithms;improved abstract parity-declustered layouts for disk arrays;experiences with parallel n-body simulation;a comparison of parallelalgorithms for connected components;studying overheads in massively parallel min/max-tree evaluation;scheduling trees using FIFO queues: a control-memory tradeoff;SIMD instruction cache;dynamic parallel tree contraction;improved bounds for routing and sorting on multi-dimensional meshes;parallel sorting by overpartitioning;an optimal randomized logarithmic time connectivity algorithm for the EREW PRAM;list ranking and list scan on the CRAY C-90;modeling communication in parallelalgorithms: a fruitful interaction between theory and systems?;on testing cache-coherent shared memories;diffracting trees;programming abstract DEC-alpha based multiprocessors the easy way;scheduling parallelizable tasks to minimize average response time;and an analysis of diffusive load-balancing.
Two novel variations on sample sort, one using only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead and another using regular sam...
详细信息
Two novel variations on sample sort, one using only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead and another using regular sampling for choosing splitters, were studied. The two were coded in Split-C and were run on a variety of platforms. Results were consistent with theoretical analysis and illustrated the scalability and efficiency of the algorithms.
A library, called PAD, of basic parallelalgorithms and data structures for the PRAM is currently being implemented using the PRAM programming language Fork95. Main motivations of the PAD project is to study the PRAM ...
详细信息
A library, called PAD, of basic parallelalgorithms and data structures for the PRAM is currently being implemented using the PRAM programming language Fork95. Main motivations of the PAD project is to study the PRAM as a practical programming model, and to provide an organized collection of basic PRAM algorithms for the SB-PRAM under completion at the University of Saarbruecken. We give a brief survey of Fork95, and describe the main components of PAD. Finally we report on the status of the language and library and discuss further developments.
暂无评论