ISBN: 0818684038 (print)
Important applications, including those in computational chemistry, computational fluid dynamics, structural analysis, and sparse matrix applications, usually consist of a mixture of regular and irregular accesses. While current state-of-the-art run-time library support for such applications handles the irregular accesses reasonably well, the efficacy of run-time optimizations for the regular accesses is yet to be proven. This paper aims to find a better approach to handling such applications in a unified compiler and run-time framework. Specifically, this paper considers only regular applications and evaluates the performance of two approaches: a run-time approach using PILAR and a compile-time approach using a commercial HPF compiler. This study shows that, using a particular representation of regular accesses, the performance of regular code using run-time libraries can come close to the performance of code generated by a compiler. We also determine the operations that usually contribute most to the run-time overhead in the case of regular accesses. Experimental results are reported for three regular applications on a 16-processor IBM SP-2.
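The abstract does not say which representation of regular accesses the run-time approach uses. As an illustration only, the sketch below assumes one common compact form, a (start, stop, stride) triplet, and a BLOCK distribution of the array. The names `Section`, `local_section`, and `enumerate_section` are hypothetical and are not taken from PILAR or from any HPF compiler. The point is that the part of a regular access owned by each processor can be computed in closed form, without building the per-element index lists that irregular accesses require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """Regular access: indices start, start+stride, ... below stop (stride > 0)."""
    start: int
    stop: int
    stride: int


def local_section(sec, n, nprocs, rank):
    """Intersect a strided section with the block owned by `rank`.

    Assumes a BLOCK distribution of n elements over nprocs processors,
    with block size ceil(n / nprocs). This is an illustrative assumption.
    """
    block = -(-n // nprocs)            # ceiling division
    lo = rank * block                  # first global index owned by rank
    hi = min(lo + block, n)            # one past the last owned index
    if sec.start >= lo:
        first = sec.start
    else:
        # Smallest section element that is >= lo.
        steps = -(-(lo - sec.start) // sec.stride)
        first = sec.start + steps * sec.stride
    last = min(sec.stop, hi)
    # An empty intersection is returned as a section with stop == start.
    return Section(first, max(first, last), sec.stride)


def enumerate_section(sec):
    """Expand a section into an explicit index list (for checking only)."""
    return list(range(sec.start, sec.stop, sec.stride))


if __name__ == "__main__":
    sec = Section(1, 16, 3)
    for rank in range(4):
        print(rank, enumerate_section(local_section(sec, 16, 4, rank)))
```

Each processor stores three integers per access instead of one index per element, which is the kind of saving a compact regular representation is meant to provide.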
ISBN: 0818684038 (print)
This paper presents a mathematical analysis and empirical evaluation of the solid state equation $P_{\mathrm{CMOS}} = C \cdot V^{2} \cdot f \cdot N \cdot \%N$, which identifies a measurable metric for evaluating the relative advantages of ASIC, DSP, and RISC architectures for embedded applications. Relationships are examined which can help predict relative future architecture performance as new generations of CMOS solid state technology become available. In particular, Performance/Watt is shown to be an Architecture-Technology Metric which can be used to calibrate ASIC, DSP, and RISC performance density potential relative to solid state technology generations, measure and evaluate architectural changes, and project an architecture performance density roadmap.
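A small numeric sketch may make the power equation and the Performance/Watt metric concrete. The abstract does not define the symbols, so the reading used here is an assumption: C is the switched capacitance per gate, V the supply voltage (entering quadratically, as in the standard CMOS dynamic power model), f the clock frequency, N the gate count, and %N the fraction of gates switching per cycle. All numbers and function names are hypothetical.

```python
def cmos_dynamic_power(c_farads, v_volts, f_hz, n_gates, activity):
    """P = C * V^2 * f * N * %N, with %N given as a fraction in [0, 1]."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be a fraction between 0 and 1")
    return c_farads * v_volts ** 2 * f_hz * n_gates * activity


def performance_per_watt(ops_per_second, power_watts):
    """Performance/Watt: operations per second delivered per watt consumed."""
    return ops_per_second / power_watts


# Hypothetical design: 1 fF per gate, 100 MHz, one million gates, 20% activity.
base = cmos_dynamic_power(1e-15, 3.3, 100e6, 1_000_000, 0.2)
# Same design moved to a hypothetical lower-voltage technology generation.
scaled = cmos_dynamic_power(1e-15, 2.5, 100e6, 1_000_000, 0.2)

if __name__ == "__main__":
    ops = 100e6  # assumed: one operation per cycle
    print(f"power at 3.3 V: {base:.4f} W, {performance_per_watt(ops, base):.3e} ops/s/W")
    print(f"power at 2.5 V: {scaled:.4f} W, {performance_per_watt(ops, scaled):.3e} ops/s/W")
```

Under these assumptions, lowering the supply voltage at a fixed clock rate improves Performance/Watt by the square of the voltage ratio, which is why the metric ties architecture to technology generation.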
ISBN: 3540653872 (print)
We present a C++ template run-time library, PROMOTER, and discuss run-time support for data-parallel applications. The PROMOTER run-time library provides a uniform framework for data-parallel applications, covering a broad spectrum of granularity, regularity and dynamicity. It supports user-defined data structures ranging from dense to sparse arrays, regular to irregular index structures and data distributions. The object-oriented design and implementation of the PROMOTER run-time library not only provides an easy data-parallel programming environment, but also leads to an efficient implementation of data-parallel applications through object reuse and object specialization.
ISBN: 0818684038 (print)
In this paper, we propose several deadlock-free protocols for implementing the generalized alternative construct, where a process non-deterministically chooses between sending or receiving among various synchronous channels. We consider general many-to-many channels and examine in detail the special case of fan (many-to-one and one-to-many) channels, which are common and can be implemented much more efficiently. We propose a protocol that achieves an optimal number of message cycles per user-level communication, significantly improving on previous results. We propose several other "less aggressive" protocols, which may be more suitable for some applications and networks, and demonstrate how to adaptively switch between them and modify protocol parameters.
ISBN: 0818684038 (print)
There has been relatively little analytical work on processor optimizations for multimedia applications. With the introduction of MMX by Intel, it is clear that this is an area of increasing importance. Building on previous work [4, 5, 6, 7, 13, 14], we propose optimizations for multimedia architectures that support independent parallel execution of instructions within dynamically assembled traces, resulting in dramatic performance improvements. Specifically, we propose simplified instruction scheduling and register renaming algorithms due to constraints on trace formation. In addition, we suggest specific instruction pool and trace cache parameters. We constructed a simulator in order to measure the benefits of these processor optimizations for multimedia applications. The simulated machine, which could fetch/decode 2 instructions per cycle, performed better than a superscalar machine that could fetch/decode 8 instructions per cycle. Execution rates as high as 7.3 instructions per cycle were achieved for the benchmarks simulated, assuming 16 instructions per trace.
ISBN: 0818684038 (print)
In this paper, we present an adaptive version of our previously proposed quality equalizing (QE) load balancing strategy that attempts to maximize the performance of parallel branch-and-bound (B&B) by adapting to application and target computing system characteristics. Adaptive QE (AQE) incorporates the following salient adaptive features: (1) Anticipatory quantitative and qualitative load balancing mechanisms. (2) Regulation of load information exchange overhead. (3) Deterministic load balancing in extended neighborhoods instead of just immediate neighborhoods as in non-adaptive QE. (4) Randomized global load balancing to fetch work from outside the extended neighborhood. AQE yields speedup improvements of up to 80%, and 15% on average, compared to those provided by QE for several real-world mixed-integer programming (MIP) problems, and near-ideal speedups for two of the largest problems in the MIPLIB benchmark suite on an IBM SP2 system.
ISBN: 0818684038 (print)
LAPI is a low-level, high-performance communication interface available on the IBM RS/6000 SP system. It provides an active-message-like interface along with remote memory copy and synchronization functionality. It is designed primarily for use by experienced programmers in developing parallel subsystems, libraries and tools, but we also expect power programmers to use it in end-user applications. IBM developed LAPI as part of a project with Pacific Northwest National Laboratory (PNNL) to optimize the performance of the Global Arrays (GA) toolkit and its applications on the IBM RS/6000 SP. We provide an overview of LAPI characteristics and discuss its differences from other models such as MPI-2. We present some base performance parameters of LAPI, including latency and bandwidth, and compare them with the performance of MPI/MPL. The Global Arrays library from PNNL was ported to LAPI to exploit the performance benefits of this new interface. Experience using LAPI to implement GA and the performance of the resulting library are presented.
ISBN: 3540641408 (print)
The proceedings contain 18 papers. The special focus in this conference is on Network-Based Parallel Computing. The topics include: the remote enqueue operation on networks of workstations; the HAL interconnect PCI card; implementing protected multi-user communication for Myrinet; a configurable environment for a local optical; the design of a parallel programming system for a network of workstations; remote subpaging across a fast network; improved functional imaging through network based parallel processing; tools for communicating complex and dynamic data-structures using MPI; analysis of a programmed backoff method for parallel processing on Ethernets; improving dynamic token-based distributed synchronization performance via optimistic broadcasting; fast barrier synchronization on shared fast Ethernet; parallel routing table computation for scalable IP routers; a tool for the analysis of reconfiguration and routing algorithms in irregular networks; real-time traffic management on optical networks using dynamic control of cycle duration; and a comparative characterization of communication patterns in applications using MPI and shared memory on an IBM SP2.
ISBN: 0818684046 (print)
Although shared memory multiprocessors are becoming increasingly popular in the commercial marketplace, the applications used to evaluate such systems in both academia and industry are still predominantly technical applications such as the Stanford SPLASH2 benchmarks. The difficulty in using commercial parallel shared memory applications such as transaction processing, decision support and web server applications has been in simulating the operating system functions that are heavily used by these applications. In this paper we describe the design of an execution-driven simulation tool called COMPASS (COMmercial PArallel Shared memory Simulator). We have used COMPASS at IBM to study the behavior of decision support applications and are currently studying the behavior of transaction processing applications and web servers.
ISBN: 3540643591 (print)
The increasing size and complexity of high-performance applications have motivated a new round of innovation related to the configuration, build, and launch of applications for large computing platforms, especially heterogeneous multicomputers. This paper describes the software technology of the Talaris(TM) Environment, created by Mercury Computer Systems, Inc. to enable a new generation of tools that construct and initiate applications for large distributed and parallel computer systems. The Talaris Environment provides an extensible framework for cooperating tools that share application configuration information. Tools developed by Mercury for the Environment focus on high-performance embedded DSP applications that run on Mercury's RACE(R) series multicomputer systems. Additional tools under development by Mercury and other organizations support other target systems and programming interfaces that include UNIX workstation networks, the IBM SP/2, real-time DSP platforms, the Message Passing Interface (MPI), and POSIX. Development of the Talaris Environment has been funded in part by the Defense Advanced Research Projects Agency (DARPA) under the "Bridging the Gap" and "Three Steps" programs. The Talaris Environment is currently available in connection with these DARPA programs.