ISBN: 0769500595 (print)
One of the reasons why parallel programming is considered a difficult task is that users frequently cannot predict the performance impact of implementation decisions prior to program execution. This results in a cycle of incremental performance improvements based on run-time performance data. While gathering and analyzing performance data is supported by a large number of tools, typically interactive, the task of performance analysis is still too complex for users. This article illustrates this fact based on the current analysis support on the Cray T3E. As a consequence, we are convinced that automatic analysis tools are required to identify frequently occurring and well-defined performance problems. This article describes the novel design of a generic automatic performance analysis environment called KOJAK. Besides its structure, we also outline the first component, EARL, a new meta-tool designed and implemented as a programmable interface to calculate more abstract metrics from existing trace files and to locate complex patterns describing performance problems.
ISBN: 0769500595 (print)
The DIR net (detection-isolation-recovery net) is the main module of a software framework for the development of embedded supercomputing applications. This framework provides a set of functional elements, collected in a library, to improve the dependability attributes of the applications (especially availability). The DIR net enables these functional elements to cooperate and enhances their efficiency by controlling and coordinating them. As a supervisor and the main executor of the fault tolerance strategy, it is the backbone of the framework, of which the application developer is the architect. Moreover, it provides an interface to which all detection and recovery tools should conform. Although the DIR net is meant to be used within this fault tolerance framework, the adopted concepts and design decisions have a more general value and can be applied in a wide range of parallel systems.
ISBN: 0769500595 (print)
The aim of this paper is to present an easy and efficient method to implement alternating-line processes on current parallel computers. First we show how data locality has an important impact on global efficiency, which leads us to the conclusion that one-dimensional decompositions are the most convenient ones for 2D problems. Once this is asserted, a parallel algorithm is presented for the solution of the distributed tridiagonal systems along the partitioned direction. The key idea is to pipeline the simultaneous resolution of many systems of equations, not to parallelise each resolution separately. This approach presents good numerical and architectural properties in terms of memory usage and data locality, and high parallel efficiencies are obtained. For the case of alternating-line processes, the selection of the optimal decomposition is studied. The experimental results have been obtained on a Cray T3E.
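The pipelining idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: it uses the standard Thomas algorithm, splits the rows of each system into contiguous blocks (one per hypothetical processor), and lets block p work on system k at step k + p. The names `forward_block` and `pipelined_solve` are made up for this example, the back substitution is done sequentially for brevity, and each block is assumed to own at least one row.

```python
def forward_block(a, b, c, d, lo, hi, carry):
    """Thomas forward elimination for rows lo..hi-1 of one system.

    a, b, c are the sub-, main and super-diagonal (a[0] and c[-1] must be 0),
    d is the right-hand side, carry is (c', d') of row lo-1.
    """
    prev_c, prev_d = carry
    cp, dp = [], []
    for i in range(lo, hi):
        denom = b[i] - a[i] * prev_c
        prev_c = c[i] / denom
        prev_d = (d[i] - a[i] * prev_d) / denom
        cp.append(prev_c)
        dp.append(prev_d)
    return cp, dp


def pipelined_solve(systems, n_blocks):
    """Solve many tridiagonal systems; returns (solutions, schedule).

    schedule[step] lists the (block, system) pairs that are active at that
    step, i.e. the work that could run concurrently on separate processors.
    """
    m, n = len(systems), len(systems[0][1])
    bounds = [(p * n // n_blocks, (p + 1) * n // n_blocks) for p in range(n_blocks)]
    fwd = [[None] * n_blocks for _ in systems]
    schedule = []
    for step in range(m + n_blocks - 1):
        active = []
        for p in range(n_blocks):
            k = step - p  # system handled by block p at this step
            if 0 <= k < m:
                a, b, c, d = systems[k]
                if p == 0:
                    carry = (0.0, 0.0)
                else:
                    # boundary values produced by block p-1 in the previous step
                    cp_prev, dp_prev = fwd[k][p - 1]
                    carry = (cp_prev[-1], dp_prev[-1])
                fwd[k][p] = forward_block(a, b, c, d, *bounds[p], carry)
                active.append((p, k))
        schedule.append(active)
    solutions = []
    for k in range(m):
        cp = [v for blk in fwd[k] for v in blk[0]]
        dp = [v for blk in fwd[k] for v in blk[1]]
        x = dp[:]
        for i in range(n - 2, -1, -1):  # sequential back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        solutions.append(x)
    return solutions, schedule
```

With m systems and P blocks, the forward sweep takes m + P - 1 steps instead of m * P, since every block is busy once the pipeline has filled.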
ISBN: 0769500595 (print)
In this work, we propose a heuristic algorithm based on Genetic Algorithms for the task-to-processor mapping problem in the context of local-memory multiprocessors with a hypercube interconnection topology. Hypercube multiprocessors have offered a cost-effective and feasible approach to supercomputing through parallelism at the processor level by directly connecting a large number of low-cost processors with local memory, which communicate by message passing instead of shared variables. We use concepts from graph theory (task precedence graphs to represent parallel programs, graph partitioning to solve the program decomposition problem, etc.) to model the problem. This problem is NP-complete, which means heuristic approaches must be adopted. We develop a heuristic algorithm based on Genetic Algorithms to solve it.
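As a rough illustration of the approach, the following sketch evolves task-to-processor mappings on a hypercube. Everything specific here is an assumption made for the example and is not taken from the paper: the cost function (communication weight times Hamming distance between processor labels, plus a penalty on the most loaded processor), the one-point crossover, the mutation rate, and the names `mapping_cost` and `evolve`.

```python
import random
from collections import Counter


def hamming(p, q):
    """Hop count between two hypercube nodes: differing bits of their labels."""
    return bin(p ^ q).count("1")


def mapping_cost(mapping, edges, balance_weight=10):
    """Assumed cost: weighted communication distance plus a load penalty.

    mapping[t] is the processor of task t; edges holds (task_u, task_v, weight).
    """
    comm = sum(w * hamming(mapping[u], mapping[v]) for u, v, w in edges)
    max_load = max(Counter(mapping).values())
    return comm + balance_weight * max_load


def evolve(n_tasks, dim, edges, pop_size=30, generations=60, seed=0):
    """Small elitist genetic algorithm; returns the best mapping found."""
    rng = random.Random(seed)
    n_procs = 1 << dim  # a hypercube of dimension dim has 2**dim nodes
    pop = [[rng.randrange(n_procs) for _ in range(n_tasks)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: mapping_cost(m, edges))
        survivors = pop[: pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:  # occasional mutation of one gene
                child[rng.randrange(n_tasks)] = rng.randrange(n_procs)
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda m: mapping_cost(m, edges))
```

The load penalty is what keeps the search from collapsing every task onto a single processor, which would otherwise minimise communication trivially.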
ISBN: 0769500595 (print)
This paper introduces Simultaneous Speculation Scheduling, a new compiler technique that enables speculative execution of alternative program paths. In our approach, concurrently executed threads are generated that represent alternative program paths. Each thread is the result of a speculation on the outcome of one or more branches. All threads are executed simultaneously, although only one of them follows the eventually correct program path. Our technique goes beyond the capabilities of usual global instruction scheduling algorithms because we overcome most of the restrictions on speculative code motion. The architectural requirements are the ability to run two or more threads in parallel and an enhanced instruction set to control threads. Our technique aims at multithreaded architectures, in particular simultaneous multithreaded, nanothreaded, and microthreaded processors, but can be modified for multiscalar, datascalar, and trace processors. We evaluate our approach using program kernels from the SPECint benchmark suite.
ISBN: 0769500595 (print)
Debuggers are critical tools for software development. The design and implementation of a source-level debugging system that enables the HPF programmer to observe the behavior of the program at the level at which the program has been developed present unique challenges. The main requirement put on an HPF debugger is to observe and control the state of many processors, and to summarize and present distributed information in a concise and clear way, in terms of the source program. To be practical, the debugger has to support interactive source-level debugging of large-scale applications on large machines. In this paper we define design goals for HPF debuggers and present the architecture of an advanced HPF debugging system, DeHiFo, which addresses several of the challenges involved and provides significant contributions to existing debugging technology. An HPF debugger is a rather complex system. Its development requires systematic cooperation between several partners. DeHiFo is an excellent example of cooperation and technology transfer among research teams working at different universities.
ISBN: 0769500595 (print)
DASUD (Diffusion Algorithm Searching Unbalanced Domains) is a totally distributed load-balancing algorithm which belongs to the nearest-neighbors class. DASUD detects unbalanced domains (a processor and its immediate neighbors) and corrects this situation by allowing load movements between non-connected processors. DASUD has been evaluated by comparison with two well-known nearest-neighbors load-balancing strategies, namely GDE (Generalized Dimension Exchange) and SID (Sender Initiated Diffusion), by considering a large set of initial load distributions. These distributions were applied to ring, torus and hypercube topologies, and the number of processors ranged from 8 to 128. From these experiments we have observed that DASUD outperforms the other strategies used in the comparison, as it provides the best trade-off between the balance degree obtained at the final state and the number of iterations required to reach this state.
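For readers unfamiliar with the nearest-neighbors class, the sketch below shows one synchronous step of a generic first-order diffusion scheme with integer loads on a ring. It is not DASUD, GDE or SID as defined in the paper; the function names and the `divisor` parameter are assumptions chosen for illustration. The divisor should exceed the node degree so that no processor sends more load than it holds.

```python
def ring_neighbors(n):
    """Neighbor lists for a ring of n processors (n >= 3)."""
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]


def diffusion_step(loads, neighbors, divisor=3):
    """One synchronous diffusion step with integer load units.

    Every processor compares its load with each neighbor and, if it holds
    more, sends floor(difference / divisor) units. All flows are computed
    from the loads at the start of the step.
    """
    new_loads = list(loads)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if loads[i] > loads[j]:
                flow = (loads[i] - loads[j]) // divisor
                new_loads[i] -= flow
                new_loads[j] += flow
    return new_loads


def balance(loads, neighbors, max_iters=100):
    """Iterate until no load moves; returns (final_loads, iterations_used)."""
    for it in range(max_iters):
        nxt = diffusion_step(loads, neighbors)
        if nxt == loads:
            return loads, it
        loads = nxt
    return loads, max_iters
```

Because flows are truncated to whole units, this simple scheme can stop while neighboring processors still differ by up to `divisor - 1` units, so the final state is not necessarily perfectly balanced.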
We introduce our SimUTC toolkit, a fault-tolerant distributed systems simulation built upon the discrete event simulation package C++SIM. SimUTC has been developed in the course of our project SynUTC and targets distributed algorithms for high-accuracy fault-tolerant clock synchronization. This application domain requires detailed simulation models for network transmission and local clock devices, fault-injection capabilities, flexible system configuration facilities, and customized data capture and analysis tools. We explain how SimUTC addresses those issues and provide a few samples of simulation results gathered from the evaluation of the well-known Fault-Tolerant Average clock synchronization algorithm.
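The core of the Fault-Tolerant Average convergence function, as it is commonly described, fits in a few lines: sort the clock readings, discard the f largest and f smallest, and average the rest. The sketch below shows only that function, under the assumption that readings are plain numbers and f is the number of faulty clocks to tolerate; it does not model SimUTC, network delays or clock drift, and the function name is chosen for this example.

```python
def fault_tolerant_average(readings, f):
    """Average the readings after dropping the f lowest and f highest values.

    readings: clock values (or clock differences) collected from all nodes.
    f: number of possibly faulty readings to tolerate on each side.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if len(readings) <= 2 * f:
        raise ValueError("need more than 2*f readings")
    ordered = sorted(readings)
    kept = ordered[f:len(ordered) - f]
    return sum(kept) / len(kept)
```

For example, with the readings 10, 12, 11, 1000 and 9 and f = 1, the outlier 1000 and the lowest value 9 are discarded, and the remaining values average to 11.0.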
ISBN: 0769500870 (print)
The proceedings contain 35 papers. The topics discussed include: scalability analysis of multidimensional wavefront algorithms on large-scale SMP clusters; a system for evaluating performance and cost of SIMD array designs; design trade-offs of low-cost multicomputer network switches; the Cactus computational collaboratory: enabling technologies for relativistic astrophysics, and a toolkit for solving PDEs by communities in science and engineering; the PETSc library for scientific software; distributed applet-based certifiable processing in client/server environments; large-scale distributed computational fluid dynamics on the Information Power Grid using Globus; a framework for generating task parallel programs; HPF implementation of ARC3D; efficient VLSI layouts of hypercubic networks; adapting to load on workstation clusters; parallel simulation of two-phase flow problems using the finite element method; a data-parallel algorithm for iterative tomographic image reconstruction; parallel rendering of 3D AMR data on the SGI/Cray T3E; a recursive PVM implementation of an image segmentation algorithm with performance results comparing the HIVE and the Cray T3E; Delphi: an integrated, language-directed performance prediction, measurement and analysis environment; POEMS: end-to-end performance models for dynamic parallel and distributed systems; MPI: the only programming model for managing memory; distributed control parallelism for multidisciplinary design of a high speed civil transport; implementing MM5 on NASA Goddard Space Flight Center computing systems: a performance study; and material science electronic structure calculations on massively parallel systems: an algorithmic and computational challenge.
ISBN: 0769500870 (print)
A massively parallel processor called JUMP-I has been developed to build an efficient cache-coherent distributed shared memory (DSM) on a large system with more than 1000 processors. Here, the dedicated processor called MBP-light (Memory Based Processor light), which manages the DSM of JUMP-I, is introduced, and its preliminary performance with two protocol policies (update and invalidate) is evaluated. The simulation results indicate that, under both policies, simple operations such as the tag check and the collection and generation of acknowledgment packets are mostly processed by the hardware mechanisms in MBP-light without the aid of the core processor. Also, the buffer-register architecture adopted by the core processor in MBP-light is exploited well enough to process a protocol transaction for both policies.