Increasing demand for both greater parallelism and faster clocks dictates that future-generation architectures will need to decentralize their resources and eliminate primitives that require single-cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instruction-level parallelism (ILP) requires both spatial instruction partitioning and traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results that demonstrate that, although Raw machines provide no mechanisms for global communication, the Raw compiler can schedule to achieve speedups that scale with the number of available functional units.
The structural specification and modeling of time-critical real-time systems has become a major area of recent research. This is particularly relevant for computer music when sound computation is realized invo...
ISBN: 3540649522 (print)
This paper describes a performance-oriented environment for the design of portable parallel software. The environment consists of a graphical design tool based on the PVM communication library for building parallel algorithms, a state-of-the-art simulation engine, a CPU characteriser, and a visualisation tool for animation of program execution and visualisation of platform and network performance measures and statistics. The toolset is used to model a virtual machine composed of a cluster of workstations interconnected by a local area network. The simulation model is modular and its components are interchangeable, which allows easy re-configuration of the platform. Both the communication and CPU models are validated.
This paper describes a combined approach for improving thread locality that uses the hardware performance monitors of modern processors and program-centric code annotations to guide thread scheduling on SMPs. The approach relies on a shared state cache model to compute expected thread footprints in the cache on-line. The accuracy of the model has been analyzed by simulations involving a set of parallel applications. We demonstrate how the cache model can be used to implement several practical locality-based thread scheduling policies with little overhead. Active Threads, a portable, high-performance thread system, has been built and used to investigate the performance impact of locality scheduling for several applications.
ISBN: 3540649522 (print)
The Numerical Algorithms Group Ltd is currently participating in the European HPCN Fourth Framework project on Parallel Industrial NumErical Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the existing NAG Parallel Library for dealing with computationally intensive industrial applications by appropriately extending the range of library routines. Additionally, several industrial applications are being ported onto parallel computers within the PINEAPL project by replacing sequential code sections with calls to appropriate parallel library routines. A substantial part of the library material being developed is concerned with the solution of PDE problems using parallel sparse linear algebra modules. This talk provides a number of performance results which demonstrate the efficiency and scalability of core computational routines: in particular, the iterative solver, the preconditioner, and the matrix-vector multiplication routines. Most of the software described in this talk has been incorporated into the recently launched Release 1 of the PINEAPL Library.
ISBN: 9781581131079 (print)
Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP. This support is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium-grained loop-level parallelism in the application. When the granularity of parallelism is too small or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative-thread parallelism. Overall, thread-level speculation still appears to be a promising approach for expanding the class of applications that can be automatically parallelized, but more hardware-intensive implementations for managing speculation control are required to achieve performance improvements on a wide class of integer applications.
ISBN: 9781581131079 (print)
This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root ...) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations and using the output if the input is encountered again. This technique is especially suitable for Multi-Media (MM) processing. In MM applications the local entropy of the data tends to be low, which results in repeated operations on the same datum. The inputs and outputs of assembly-level operations are stored in cache-like lookup tables and accessed in parallel with the conventional computation. A successful lookup gives the result of a multi-cycle computation in a single cycle, and a failed lookup does not incur a penalty in computation time. Simulation results show that, on average, for a modestly sized memo-table, about 40% of the floating-point multiplications and 50% of the floating-point divisions in Multi-Media applications can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
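The memoing idea in this abstract can be illustrated with a minimal software sketch. It is not the paper's hardware design: the class name `MemoTable`, the direct-mapped organisation, the table size, and the hit/miss counters are all assumptions made for illustration. In hardware the lookup would proceed in parallel with the functional unit, whereas this sketch simply checks the table first.

```python
import operator


class MemoTable:
    """Hypothetical direct-mapped memo table keyed on (operation, operands)."""

    def __init__(self, entries=32):
        self.entries = entries
        self.slots = [None] * entries  # each slot holds (key, result) or None
        self.hits = 0
        self.misses = 0

    def compute(self, op, a, b):
        """Return op(a, b), reusing a stored result when the same inputs recur."""
        key = (op.__name__, a, b)
        index = hash(key) % self.entries
        slot = self.slots[index]
        if slot is not None and slot[0] == key:
            self.hits += 1  # successful lookup: the multi-cycle operation is skipped
            return slot[1]
        self.misses += 1  # failed lookup: compute as usual and remember the result
        result = op(a, b)
        self.slots[index] = (key, result)
        return result


# Low-entropy data, as in multimedia streams: neighbouring samples repeat.
table = MemoTable(entries=8)
samples = [0.5, 0.5, 0.5, 0.25, 0.25]
scaled = [table.compute(operator.mul, s, 2.0) for s in samples]
print(scaled, table.hits, table.misses)
```

Because each repeated sample immediately follows its first occurrence, the repeats hit in the table regardless of how the keys map to slots; with less regular data a small direct-mapped table would also suffer conflict evictions.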
We describe the use of a calibrated emulator to simulate a parallel computer architecture. The emulator has a virtual clock, but unlike the virtual clock of a simulator, the emulator clock is bound to a fixed fraction of real time. Individual processors time their actions independently, without the need for a globally synchronised clock value. Each component of the emulator is calibrated (by slowing it down artificially) so that the balance of the speeds of all components reflects the balance of the system under consideration. Unlike an ordinary simulator, a calibrated emulator is inherently parallel. The technique has been applied in the form of a parallel transputer-based emulator developed to evaluate the DDM, a scalable virtual shared memory architecture. The emulator provides performance results of a hardware implementation of the DDM using a calibrated virtual clock. A large transputer platform is used to run experiments. A couple of hours are sufficient to emulate the execution of a realistic application on a large DDM.
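One way to read the calibration step is as a small arithmetic exercise, sketched below. This is an illustrative assumption rather than the authors' procedure: the function names `calibrate` and `virtual_time`, the component names, and the cost figures are all hypothetical. The idea is to pick a single slowdown factor that every component can meet, then pad the faster components with artificial delay so that all of them run at the same fraction of the target machine's speed.

```python
def calibrate(emulated_cost, target_cost):
    """Return (scale, padding) for a hypothetical calibrated emulator.

    emulated_cost: time each component's action takes on the emulator host.
    target_cost:   time the same action takes on the machine being modelled.
    scale is the common slowdown factor; padding is the artificial delay each
    component must add so that emulated time == scale * target time.
    """
    # The slowest component relative to its target sets the common slowdown.
    scale = max(emulated_cost[c] / target_cost[c] for c in target_cost)
    padding = {c: scale * target_cost[c] - emulated_cost[c] for c in target_cost}
    return scale, padding


def virtual_time(real_elapsed, scale):
    """Virtual clock bound to a fixed fraction (1/scale) of real time."""
    return real_elapsed / scale


# Hypothetical per-action costs, in arbitrary but consistent time units.
emulated = {"cpu": 20.0, "network": 120.0, "memory": 10.0}
target = {"cpu": 1.0, "network": 4.0, "memory": 2.0}

scale, padding = calibrate(emulated, target)
print(scale)                       # common slowdown factor
print(padding)                     # extra delay per component
print(virtual_time(3600.0, scale)) # target-machine time covered by 3600 real units
```

Since every component derives virtual time from its own local clock and the shared factor, no globally synchronised clock value is needed, which matches the property the abstract emphasises.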
Field Programmable Gate Array (FPGA) architectures have emerged as an alternative means of implementing complex logic circuits, providing rapid manufacturing turnaround time and low prototyping costs. This paper presents a new FPGA architecture suitable for application-specific signal processing algorithms and Wafer-Scale Integration (WSI) technology. The architecture must be designed for versatility, flexibility, high speed, improved logic density, and defect tolerance. The proposed FPGA architecture consists of a two-dimensional array of programmable logic elements based on look-up tables, interconnection resources, and input/output (I/O) blocks. The architectural style is similar to the one used in the XILINX FPGA architecture. A key variation from commonly used FPGAs is the dual switching scheme employed in the proposed architecture. The design methodology, the design tools, and the results obtained by using a Segmented Channel Routing algorithm to map a 16-bit parallel multiplier onto the architecture are presented.
ISBN: 0819424323 (print)
The DSP architecture PRISMA for object-based video signal processing is presented in this paper. Considering the specific hardware requirements of object-based algorithms, a parallel architecture has been developed which consists of 8 programmable data paths. To utilize the processing power provided by these data paths, a new controlling scheme is employed by the PRISMA processor. This Dynamic Associative Controlling distributes 3 independent instruction streams to the 8 data paths and combines the advantages of alternative controlling approaches, like SIMD and MIMD. It allows efficient execution of data-dependent operations as well as flexible partitioning of the processing resources at runtime, which is advantageous for parallel processing of concurrent objects with different performance requirements.