Author:
Mackie, Ian
École Polytechnique, 91128 Palaiseau Cedex, France
Interaction nets can be seen as both a programming language and an intermediate language for the implementation of other paradigms of computation. One of their principal advantages is that the reduction process is bot...
ISBN: 0818679778 (print)
Most recent research in instruction-level parallelism has focused on general-purpose applications such as the SPEC benchmarks. Many quantitative experiments have been performed over the years measuring the impact of different execution models and optimization techniques on these applications. Recently, however, researchers have been developing various ILP architectures for media processors in order to exploit parallelism in audio, video, and graphics applications. It has been assumed that these applications contain far more potential parallelism than general-purpose code, but there have been few attempts to quantify the available parallelism. In this paper, we present a linear-complexity global scheduling algorithm that can process very long traces of up to 1 billion operations. Therefore, traces of video applications such as MPEG1, MPEG2, MPEG4 and H.263 encoders and decoders can be analyzed. Using an idealized execution model, speedups of over 1000 have been found in some applications. The experiment shows that eliminating currently identifiable bottlenecks can allow the exploitation of huge amounts of ILP in audio and video applications.
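As a rough illustration of what trace-based ILP measurement under an idealized model can look like, the sketch below makes one linear pass over a trace and places each operation in the earliest cycle after its register inputs are ready. The trace format, the unit latency, and the name `ideal_ilp` are assumptions made for illustration; this is not the scheduling algorithm from the paper.

```python
def ideal_ilp(trace, latency=1):
    """Return (cycles, speedup) for a trace under an idealized dataflow model.

    trace: iterable of (dest, sources) tuples (an assumed format), where dest
    is the register written (or None) and sources are the registers read.
    Only true (read-after-write) dependences constrain the schedule.
    """
    ready = {}   # register -> cycle at which its latest value is available
    cycles = 0   # length of the schedule so far
    count = 0    # number of operations seen
    for dest, sources in trace:
        # An operation starts once all of its inputs have been produced.
        start = max((ready.get(s, 0) for s in sources), default=0)
        finish = start + latency
        if dest is not None:
            ready[dest] = finish
        cycles = max(cycles, finish)
        count += 1
    # Speedup relative to executing one operation at a time.
    speedup = count * latency / cycles if cycles else 0.0
    return cycles, speedup


# Hypothetical five-operation trace.
trace = [
    ("r1", []),            # r1 = constant
    ("r2", []),            # r2 = constant
    ("r3", ["r1", "r2"]),  # r3 = r1 + r2
    ("r4", ["r1"]),        # r4 = r1 * 2
    ("r5", ["r3", "r4"]),  # r5 = r3 - r4
]
cycles, speedup = ideal_ilp(trace)
```

Because the pass keeps only one ready time per register, memory use does not grow with trace length, which is what makes very long traces tractable in a sketch like this.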
Nowadays we are becoming increasingly dependent on parallel or distributed computer systems for many safety-critical applications. Therefore, in order to avoid software-precipitated catastrophes, we must look for ways...
ISBN: 9780897918015 (print)
In this paper, we give a necessary and sufficient condition for the existence of partially-dependent functional decomposition and develop new algorithms to compute such decompositions. We apply our method to synthesis and mapping for Xilinx XC4000 FPGAs, whose architecture contains LUTs of non-uniform sizes. We develop a new mapping algorithm named PDDMAP which uses CLBs to cover nodes on critical paths for depth minimization and uses LUTs to cover non-critical nodes for area minimization. On average, PDDMAP is able to reduce the depth by 13% with only a 1% increase in area compared with the results of FlowMap followed by a CLB generation procedure, match_4k. We also develop a post-mapping procedure named PDDSYN which resynthesizes mapping solutions to reduce the mapping area. On average, PDDSYN is able to improve PDDMAP mapping solutions by 5% in depth and 7% in CLB count, and achieves 8% smaller depth and an 11% lower CLB count compared with FlowSyn followed by match_4k.
Interactive program steering is a promising technique for improving the performance of parallel and distributed applications. Steering decisions are typically based on visual presentations of some subset of the computation's current state, a historical view of the computation's behavior, or views of metrics based on the program's performance. As in any endeavor, good decisions require accurate information. However, the distributed nature of the collection process may result in distortions in the portrayal of the program's execution. These distortions stem from the merging of streams of information from distributed collection points into a single stream without enforcing the ordering relationships that held among the program components that produced the information. An ordering filter placed at the point at which the streams are merged can ensure a valid ordering, leading to more accurate visualizations and better informed steering decisions. In this paper we describe the implementation of such filters in the Falcon interactive steering toolkit, and present a methodology for their specification for automated generation.
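One way to picture an ordering filter at the merge point is the sketch below, which holds back an event until every event that causally precedes it has been released. It assumes each event carries a vector timestamp and that every event is reported to the filter; the abstract does not say how Falcon encodes ordering relationships, so the class `OrderingFilter` and its interface are hypothetical.

```python
class OrderingFilter:
    """Release merged events in an order consistent with causality.

    Assumption: each event from process p carries a vector clock in which
    entry p counts p's own events and the other entries record the events
    of other processes that it causally depends on.
    """

    def __init__(self, num_procs):
        self.delivered = [0] * num_procs  # events released so far, per process
        self.pending = []                 # events held back for now

    def _deliverable(self, proc, clock):
        # The next event from its own process, with no missing predecessors.
        return all(
            (c == self.delivered[q] + 1) if q == proc else (c <= self.delivered[q])
            for q, c in enumerate(clock)
        )

    def push(self, proc, clock, payload):
        """Accept one event; return the events released as a result, in order."""
        self.pending.append((proc, tuple(clock), payload))
        released = []
        progress = True
        while progress:
            progress = False
            for event in self.pending:
                p, clk, _ = event
                if self._deliverable(p, clk):
                    self.pending.remove(event)
                    self.delivered[p] += 1
                    released.append(event)
                    progress = True
                    break  # restart the scan: the pending list has changed
        return released


# A receive on process 1 reaches the merge point before the matching send
# on process 0; the filter holds it until the send has been released.
ordering_filter = OrderingFilter(2)
first = ordering_filter.push(1, (1, 1), "recv")
second = ordering_filter.push(0, (1, 0), "send")
```

The rescan after each release is quadratic in the number of held events, which is acceptable for a sketch but would need an indexed structure in a real monitoring pipeline.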
Several methods have been proposed in the literature for the distribution of data on distributed memory machines, oriented to either dense or sparse structures. Many real applications, however, deal with both kinds of data jointly. The paper presents techniques for integrating dense and sparse array accesses in a way that optimizes locality and further allows an efficient loop partitioning within a data-parallel compiler. The approach is evaluated through an experimental survey with several compilers and parallel platforms. The results prove the benefits of the BRS sparse distribution when combined with CYCLIC in mixed algorithms and the poor efficiency achieved by well-known distribution schemes when sparse elements arise in the source code.
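To make the idea of aligning dense and sparse data concrete, the following sketch shows CYCLIC ownership (global index modulo processor count) and applies the same rule to the nonzeros of a sparse matrix in coordinate form. The abstract does not define BRS, so treating it as "cyclic on the dense index space, then keep only local nonzeros" is an assumption here, and the function names are hypothetical.

```python
def cyclic_owner(i, nprocs):
    """Processor that owns global index i under a CYCLIC distribution."""
    return i % nprocs


def distribute_sparse(nonzeros, prow, pcol):
    """Assign coordinate-format nonzeros (i, j, value) to a prow x pcol grid.

    The matrix is treated as if it were dense and CYCLIC in both dimensions;
    each processor keeps only its nonzeros, stored with local indices.
    """
    local = {(r, c): [] for r in range(prow) for c in range(pcol)}
    for i, j, value in nonzeros:
        owner = (cyclic_owner(i, prow), cyclic_owner(j, pcol))
        local[owner].append((i // prow, j // pcol, value))
    return local


# Hypothetical 4 x 4 sparse matrix with four nonzeros on a 2 x 2 grid.
nonzeros = [(0, 0, 1.0), (1, 2, 2.0), (3, 1, 3.0), (2, 2, 4.0)]
local = distribute_sparse(nonzeros, 2, 2)
```

Under this assumption, a dense vector distributed CYCLIC over the same processor dimension has element `j` on the processor column that holds the nonzeros of matrix column `j`, which is the kind of locality the abstract refers to.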
Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed shared-memory (DSM) systems. We demonstrate such a system by combining the SUIF parallelizing compiler and the CVM software DSM. Innovations of the system include compiler-directed techniques that: (1) combine synchronization and parallelism information communication on parallel task invocation, (2) employ customized routines for evaluating reduction operations, and (3) select a hybrid update protocol that pre-sends data by flushing updates at barriers. For applications with sufficient granularity of parallelism, these optimizations yield very good eight-processor speedups on an IBM SP-2 and DEC Alpha cluster, usually matching or exceeding the speedup of equivalent HPF and message-passing versions of each program. Flushing updates, in particular, eliminates almost all nonlocal memory misses and improves performance by 13% on average.
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java. We developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment.
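The host, broker, and client roles can be pictured with a toy model such as the one below, in which hosts register CPU time and the broker maps a client request onto whatever has been registered. The `Broker` class, its method names, and the first-fit policy are all assumptions for illustration; the abstract does not describe SuperWeb's actual interfaces or its mapping policy.

```python
class Broker:
    """Toy resource broker: hosts register CPU time, clients request it."""

    def __init__(self):
        self.hosts = {}  # host name -> CPU seconds still on offer

    def register(self, host, cpu_seconds):
        """A host offers a fraction of its CPU time to the broker."""
        self.hosts[host] = self.hosts.get(host, 0) + cpu_seconds

    def allocate(self, cpu_seconds):
        """Map a client request onto registered hosts (first fit, may split).

        Returns a list of (host, seconds) grants, or None when the total
        registered capacity cannot cover the request.
        """
        if sum(self.hosts.values()) < cpu_seconds:
            return None
        grants = []
        remaining = cpu_seconds
        for host in list(self.hosts):
            if remaining == 0:
                break
            take = min(self.hosts[host], remaining)
            if take:
                self.hosts[host] -= take
                grants.append((host, take))
                remaining -= take
        return grants


broker = Broker()
broker.register("host-a", 10)
broker.register("host-b", 5)
granted = broker.allocate(12)   # spans both hosts
rejected = broker.allocate(4)   # only 3 seconds remain on offer
```

A pricing rule, such as the economic model the abstract mentions, would slot into `allocate` where the first-fit loop chooses hosts.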
This paper describes the Transmogrifier-2, a second-generation multi-FPGA system. The largest version of the system will comprise 16 boards that each contain two Altera 10K50 FPGAs, four I-cube interconnect chips, and up to 8 Mbytes of memory. The inter-FPGA routing architecture of the TM-2 uses a novel interconnect structure, a non-uniform partial crossbar, that provides a constant delay between any two FPGAs in the system. The TM-2 architecture is modular and scalable, meaning that various sized systems can be constructed from the same board, while maintaining routability and the constant delay feature. Other features include a system-level programmable clock that allows single-cycle access to off-chip memory, and programmable clock waveforms with resolution to 10 ns. The first Transmogrifier-2 boards have been manufactured and are functional. They have recently been used successfully in some simple graphics acceleration applications.
This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote sensing image database applications. An application can consist of one program or multiple programs that use different libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using High Performance Fortran, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pC++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries efficiently and that Meta-Chaos provides effective support for complex applications.