Passing messages through shared memory plays an important role in symmetric multiprocessors and on Clumps. Managing concurrent access to message queues is an important aspect of the design of shared-memory message-passing systems. Using both microbenchmarks and applications, the paper compares the performance of concurrent access algorithms for passing active messages on a Sun Enterprise 5000 server. The paper presents a new lock-free algorithm that provides many of the advantages of non-blocking algorithms while avoiding the overhead of true non-blocking behavior. The lock-free algorithm couples synchronization tightly to the data structure and demonstrates application performance superior to all others studied. The success of this algorithm implies that other practical problems might also benefit from a reexamination of the non-blocking literature.
This paper analyzes the consequences of existing network structure for the design of a protocol for a radical COMA (Cache Only Memory Architecture). Parallel computing today faces two significant challenges: the difficulty of programming and the need to leverage existing "off-the-shelf" hardware. The difficulty of programming parallel computers can be split into two problems: distributing the data and distributing the computation. Parallelizing compilers address both problems, but have limited application outside the domain of loop-intensive "scientific" code. Conventional COMAs provide an adaptive, self-distributing solution to data distribution, but do not address computation distribution. Our proposal leverages parallelizing compilers, and then extends COMA to provide adaptive self-distribution of both data and computation. The radical COMA protocols can be implemented in hardware, software, or a combination of both. When, however, the implementation is constrained to operate in a cluster computing environment (that is, to use only existing, already installed hardware), the protocols have to be reengineered to accommodate the deficiencies of the hardware. This paper identifies the critical quantities of various existing network structures, and discusses their repercussions for protocol design. A new protocol is presented in detail.
Desktop computing remains indispensable in scientific exploration, largely because it provides people with devices for human interaction and environments for interactive job execution. However, with today's rapidly growing data volume and task complexity, it is increasingly hard for individual workstations to meet the demands of interactive scientific data processing. The increasing cost of such interactive processing is hindering the productivity of end-to-end scientific computing workflows. While existing distributed computing systems allow people to aggregate desktop workstation resources for parallel computing, the burden of explicit parallel programming and parallel job execution often prevents scientists from taking advantage of such platforms. In this paper, we discuss the need for transparent desktop parallel computing in scientific data processing. As an initial step toward this goal, we present our ongoing work on the automatic parallelization of the scripting language R, a popular tool for statistical computing. Our preliminary results suggest that a reasonable speedup can be achieved on real-world sequential R programs without requiring any code modification.
ISBN: (print) 9781728151267
High-performance embedded computing is developing rapidly since applications in most domains require a large and increasing amount of computing power. On the hardware side, this requirement is met by the introduction of heterogeneous systems, with highly parallel accelerators that are designed to take care of the computation-heavy parts of an application. There is today a plethora of accelerator architectures, including GPUs, many-cores, FPGAs, and domain-specific architectures such as AI accelerators. They all have their own programming models, which are typically complex, low-level, and involve explicit parallelism. This yields error-prone software that puts functional safety at risk, which is unacceptable for safety-critical embedded applications. In this position paper we argue that high-level executable modelling languages tailored for parallel computing can help in the software design for high-performance embedded applications. In particular, we consider the data-parallel model to be a suitable candidate, since it allows very abstract parallel algorithm specifications free from race conditions. Moreover, we promote the Action Language for fUML (and thereby fUML) as a suitable host language.
Presents a collection of slides covering the following: NVIDIA CUDA; CUDA toolkit; CUDA libraries; closely coupled CPU-GPU; CUDA many-core and multi-core support; nvcc CUDA compiler; CUBLAS; and CUFFT.
The authors report their experiences with the Gauss elimination algorithm on several parallel machines. Several different software designs are demonstrated, ranging from a simple shared memory implementation to the use of a message passing programming model. It is found that the efficient use of local memory is critical to obtaining good performance on scalable machines. Machines with large coherent caches appear to require the least software effort in order to obtain effective performance.
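For orientation, the sequential kernel that such studies parallelize can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `gauss_solve` is hypothetical, and the use of partial pivoting is an assumption that the abstract does not mention.

```python
def gauss_solve(a, b):
    """Solve a x = b by Gaussian elimination (partial pivoting is assumed)."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Choose the row with the largest absolute entry in column k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k below the pivot. The row updates are mutually
        # independent, which is the loop a parallel version would distribute.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

Every row update in step k reads the pivot row, so in a distributed version that row has to reach each process holding rows below it; this is one way to see why local memory use matters on scalable machines.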
We review the evolution of DSP architectures and compiler technology, and describe how compiler techniques are being used to optimize emerging DSP architectures. Such new architectures are characterized by the exploitation of data and instruction level parallelism while being an amenable target for a compiler, thereby reducing or eliminating the need to rely on assembly language programming and/or architecture-specific compiler intrinsics to achieve highly efficient code. We also summarize our research results on an ultra low power compilable DSP architecture.
The experiments and analysis of a reconfigurable multiprocessor simulation on a cluster of workstations connected by Ethernet are presented. The system model and simulation environment are described. The monitoring/debugging tool and the concept of SPP, a proposed parallel programming paradigm which can effectively reduce the synchronization operations, are described. The structure of the modules that make up the system software model is also described. The sequential and parallel versions of a computationally intensive program were executed on different network topologies, and their speedup ratios are analyzed and discussed. The crucial issues in realizing reconfigurable multiprocessor simulation in a distributed environment are considered.
Describes the architecture of a development environment for computer-aided parallel software engineering. The environment comprises tools for program design, simulation, run-time support and behaviour analysis. Tools are invariably interactive, depending in large part on graphical and visualisation support. SEPP (Software Engineering for Parallel Processing) is an EU-funded consortium of nine partners in Eastern and Western Europe, whose aim is to realise the architecture through the development of practical tools.
Many real-world applications feature data accesses on periodic domains. Manually implementing the synchronizations and communications associated with the data dependences in each case is cumbersome and error-prone. There is growing interest in supporting these applications in high-level parallel programming languages or parallelizing compilers. In this paper, we present a technique that, for distributed-memory systems, calculates the specific communications derived from data-parallel codes with or without periodic boundary conditions on affine access expressions. It makes the management of aggregated communications for the chosen data partition transparent to the programmer. Our technique moves to runtime part of the compile-time analysis typically used to generate the communication code for affine expressions, introducing a completely new technique that also supports periodic boundary conditions. We present an experimental study that evaluates our proposal on several case studies. Our experimental results show that our approach can automatically obtain communication codes as efficient as those found in MPI reference codes, reducing the development effort.
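To make concrete what "calculating the communications" for a periodic affine access involves, the sketch below works out which remote elements a process needs in order to evaluate a[i + offset] over its own block of a one-dimensional array. It is only an illustration under stated assumptions, not the technique of the paper: the block partition, the brute-force enumeration, and the names `block_bounds`, `owner`, and `halo_requests` are all hypothetical.

```python
def block_bounds(rank, nprocs, n):
    """Half-open range [lo, hi) owned by `rank` under an assumed block partition."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)


def owner(idx, nprocs, n):
    """Rank that owns global index `idx`."""
    for r in range(nprocs):
        lo, hi = block_bounds(r, nprocs, n)
        if lo <= idx < hi:
            return r
    raise IndexError(idx)


def halo_requests(rank, nprocs, n, offset, periodic=True):
    """Map each remote owner to the global indices `rank` must receive
    in order to evaluate a[i + offset] for every locally owned i."""
    lo, hi = block_bounds(rank, nprocs, n)
    needed = {}
    for i in range(lo, hi):
        j = i + offset
        if periodic:
            j %= n  # wrap around the periodic domain
        elif not 0 <= j < n:
            continue  # out-of-domain access: nothing to fetch
        src = owner(j, nprocs, n)
        if src != rank:
            needed.setdefault(src, []).append(j)
    # One sorted index list per source rank, i.e. one aggregated message each.
    return {src: sorted(idx) for src, idx in needed.items()}
```

With 8 elements on 4 processes and offset -1, rank 0 needs element 7 from rank 3 only when the domain is periodic; without periodicity it needs nothing from other ranks.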