The use of a network of workstations as a single unit for speeding up computationally intensive applications is becoming a cost-effective alternative to traditional parallel computers. We present the implementation of an application-driven parallel platform for solving partial differential equations (PDEs) in this computing environment. The platform provides a general and efficient parallel solution for time-dependent PDEs and an easy-to-use interface that allows the inclusion of a wide range of parallel programming tools. We have used two different parallelization methods in this platform. The first is a two-phase algorithm that uses the conventional technique of alternating computation and communication phases. The second uses a novel pre-computation technique that allows computation and communication to overlap. Both methods yield significant speedup; however, the pre-computation technique is shown to be more efficient and scalable.
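The pre-computation technique itself is not detailed in the abstract; the following minimal sketch only illustrates the general contrast it describes, using a 1-D Jacobi-style time step with MPI. step_two_phase alternates a halo-exchange phase and a computation phase, while step_overlapped posts non-blocking exchanges and updates the interior while the boundary data are in flight. All names, sizes, and the stencil are assumptions for illustration.

```c
/* Hypothetical sketch (not the paper's platform code). */
#include <mpi.h>
#include <stdlib.h>

#define N_LOCAL 1024          /* interior points owned by this rank (assumed) */

static void step_two_phase(double *u, double *unew, int rank, int size)
{
    MPI_Status st;
    int left = rank - 1, right = rank + 1;
    /* Phase 1: exchange the halo cells u[0] and u[N_LOCAL+1]. */
    if (left  >= 0)   MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                                   &u[0], 1, MPI_DOUBLE, left, 0,
                                   MPI_COMM_WORLD, &st);
    if (right < size) MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 0,
                                   &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                                   MPI_COMM_WORLD, &st);
    /* Phase 2: compute every point. */
    for (int i = 1; i <= N_LOCAL; i++)
        unew[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);
}

static void step_overlapped(double *u, double *unew, int rank, int size)
{
    MPI_Request req[4];
    int nreq = 0, left = rank - 1, right = rank + 1;
    /* Start the halo exchange, then compute the interior while it proceeds. */
    if (left >= 0) {
        MPI_Irecv(&u[0], 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(&u[1], 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req[nreq++]);
    }
    if (right < size) {
        MPI_Irecv(&u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(&u[N_LOCAL], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[nreq++]);
    }
    for (int i = 2; i <= N_LOCAL - 1; i++)          /* interior only */
        unew[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    /* Boundary points last, once the halo data have arrived. */
    unew[1]       = 0.5 * u[1]       + 0.25 * (u[0] + u[2]);
    unew[N_LOCAL] = 0.5 * u[N_LOCAL] + 0.25 * (u[N_LOCAL - 1] + u[N_LOCAL + 1]);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double *u = calloc(N_LOCAL + 2, sizeof *u), *v = calloc(N_LOCAL + 2, sizeof *v);
    for (int t = 0; t < 100; t++) {                 /* alternate the two variants */
        step_two_phase(u, v, rank, size);
        step_overlapped(v, u, rank, size);
    }
    free(u); free(v);
    MPI_Finalize();
    return 0;
}
```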
ISBN (Print): 0780329120
This paper discusses the design of a hybrid artificial neural network (ANN) system and its implementation in the Parallel Virtual Machine (PVM) environment. First, the PVM functions for supporting parallel applications and communication among multiple processes and multiple machines are investigated. Then, the design and construction of a hybrid ANN simulator are proposed. It comprises user-interface, control, and SPMD computing levels. The software can be used to support parallel simulation of different kinds of learning algorithms and neural computing models.
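As a hedged illustration of the kind of PVM structure described (not the paper's simulator), the sketch below uses standard PVM 3 calls to spawn SPMD workers, scatter weight/input chunks, and gather partial sums for one neuron's activation. The executable name "annsim", the worker count, and the layer size are assumptions.

```c
#include <stdio.h>
#include <math.h>
#include <pvm3.h>

#define NWORK 4
#define NIN   64          /* inputs to the neuron (assumed) */

int main(void)
{
    int mytid  = pvm_mytid();
    int parent = pvm_parent();

    if (parent == PvmNoParent) {                    /* ----- control level ----- */
        int tids[NWORK];
        double w[NIN], x[NIN], total = 0.0;
        for (int i = 0; i < NIN; i++) { w[i] = 0.01 * i; x[i] = 1.0; }

        pvm_spawn("annsim", NULL, PvmTaskDefault, "", NWORK, tids);
        int chunk = NIN / NWORK;
        for (int k = 0; k < NWORK; k++) {           /* scatter one chunk per worker */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&chunk, 1, 1);
            pvm_pkdouble(&w[k * chunk], chunk, 1);
            pvm_pkdouble(&x[k * chunk], chunk, 1);
            pvm_send(tids[k], 1);
        }
        for (int k = 0; k < NWORK; k++) {           /* gather the partial sums */
            double part;
            pvm_recv(-1, 2);
            pvm_upkdouble(&part, 1, 1);
            total += part;
        }
        printf("activation = %f\n", 1.0 / (1.0 + exp(-total)));   /* sigmoid */
    } else {                                        /* ----- SPMD computing level ----- */
        int chunk;
        double wloc[NIN], xloc[NIN], part = 0.0;
        pvm_recv(parent, 1);
        pvm_upkint(&chunk, 1, 1);
        pvm_upkdouble(wloc, chunk, 1);
        pvm_upkdouble(xloc, chunk, 1);
        for (int i = 0; i < chunk; i++) part += wloc[i] * xloc[i];
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(&part, 1, 1);
        pvm_send(parent, 2);
    }
    pvm_exit();
    return 0;
}
```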
We consider the generation of mixed task and data parallel programs and discuss how a clear separation into a task level and a data parallel level can support the development of efficient programs. Program development starts with a specification of the maximum degree of task and data parallelism and proceeds by performing several derivation steps in which the degree of parallelism is adapted to a specific parallel machine. We show how the final message-passing programs are generated and how the interaction between the task and data parallel levels can be established. We demonstrate the usefulness of the approach with examples from numerical analysis that offer the potential of a mixed task and data parallel execution, but for which it is not a priori clear how this potential should be used for an implementation on a specific parallel machine.
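A minimal sketch of how such a separation can surface in the final message-passing program, assuming MPI as the target: MPI_Comm_split establishes the task-parallel level as disjoint process groups, and each group runs a data-parallel reduction internally. The two-task split and the dummy data are assumptions, not the paper's derivation steps.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Task level: processes in the lower half run task 0, the rest task 1. */
    int task = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm task_comm;
    MPI_Comm_split(MPI_COMM_WORLD, task, world_rank, &task_comm);

    /* Data-parallel level: each process contributes a local partial value. */
    int grp_rank, grp_size;
    MPI_Comm_rank(task_comm, &grp_rank);
    MPI_Comm_size(task_comm, &grp_size);
    double local = (task == 0) ? 1.0 : 2.0;   /* stand-in for real data-parallel work */
    double sum = 0.0;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, task_comm);

    /* Interaction between the levels: each group root reports its result. */
    if (grp_rank == 0)
        printf("task %d computed %f on %d processes\n", task, sum, grp_size);

    MPI_Comm_free(&task_comm);
    MPI_Finalize();
    return 0;
}
```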
ISBN (Print): 9781457705564
Future space applications will require High Performance Computing (HPC) capabilities to be available on board future spacecraft. To cope with this requirement, multi- and many-core processor technologies have to be integrated into the computing platforms of the spacecraft. One of the most important requirements, stemming from the nature of space applications, is efficiency in terms of performance per Watt. In order to improve the efficiency of such systems, algorithms and applications have to be optimized and scaled to the number of cores available in the computing platform. In this paper we describe the parallelization techniques applied to a Synthetic Aperture Radar (SAR) application based on the 2-Dimensional Fourier Matched Filtering and Interpolation (2DFMFI) Algorithm. In addition to sequential optimizations, we applied parallelization techniques for shared memory, distributed shared memory, and distributed memory environments, using parallel programming models such as OpenMP and MPI. It turns out that parallelizing this type of algorithm is not an easy or straightforward task, but with some effort performance and scalability can be improved, increasing the level of efficiency.
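As a hedged sketch of the hybrid decomposition (not the actual 2DFMFI implementation), the code below distributes rows of the data matrix across MPI ranks and lets OpenMP threads process a rank's rows in parallel; process_row is a placeholder for the real per-row work such as FFTs and matched-filter multiplications, and the matrix dimensions are assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

#define NROWS 2048
#define NCOLS 1024

static void process_row(double *row, int ncols)
{
    for (int j = 0; j < ncols; j++)       /* placeholder per-row transform */
        row[j] = row[j] * 2.0 + 1.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows_local = NROWS / size;                         /* block row distribution */
    double *block = malloc((size_t)rows_local * NCOLS * sizeof *block);
    for (long i = 0; i < (long)rows_local * NCOLS; i++) block[i] = 1.0;

    /* Shared-memory level: one OpenMP thread per core works on the local rows. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows_local; i++)
        process_row(&block[(long)i * NCOLS], NCOLS);

    /* Distributed-memory level: e.g. a corner turn would use MPI_Alltoall here. */
    MPI_Barrier(MPI_COMM_WORLD);

    free(block);
    MPI_Finalize();
    return 0;
}
```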
The IA-64 architecture provides new opportunities and challenges for implementing an improved set of transcendental functions. Using several novel polynomial-based table-driven techniques, we are able to provide new algorithms for the transcendental functions. Major improvements include an accuracy level of about 0.6 ulps (units in the last place) and forward trigonometric functions that have a period of 2π. The accuracy enhancements are achieved at improved speed, yet without an increase in the table size. In this paper, we highlight the key IA-64 architectural features that influenced our designs, and explain the main ideas used in our new algorithms.
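The paper's IA-64 algorithms are not reproduced here; the following sketch only illustrates the generic polynomial-based table-driven idea for exp: reduce the argument against a small table of 2^(j/32) values and evaluate a short polynomial on the remainder. The table size, the Taylor polynomial, and the lack of special-case handling are assumptions, and the accuracy of this sketch is nowhere near the 0.6-ulp level reported.

```c
#include <math.h>

#define TBL_SIZE 32
static double tbl[TBL_SIZE];          /* tbl[j] = 2^(j/32), filled once */
static int tbl_ready = 0;

static void init_tbl(void)
{
    for (int j = 0; j < TBL_SIZE; j++)
        tbl[j] = pow(2.0, (double)j / TBL_SIZE);
    tbl_ready = 1;
}

/* Table-driven exp(x): write x = (m + j/32)*ln2 + r with |r| <= ln2/64. */
double exp_table(double x)
{
    if (!tbl_ready) init_tbl();
    const double ln2 = 0.6931471805599453;
    double t = x / ln2 * TBL_SIZE;          /* x in units of ln2/32       */
    int    n = (int)floor(t + 0.5);         /* nearest multiple           */
    int    m = n / TBL_SIZE, j = n % TBL_SIZE;
    if (j < 0) { j += TBL_SIZE; m -= 1; }
    double r = x - (double)n * (ln2 / TBL_SIZE);
    /* Short polynomial for exp(r) on the small reduced interval. */
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6.0 + r * (1.0/24.0))));
    return ldexp(tbl[j] * p, m);            /* 2^m * 2^(j/32) * exp(r)    */
}
```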
Techniques for automatic program comprehension can play a crucial role in overcoming limitations of existing tools for the automatic parallelization of programs for distributed-memory architectures. Uses of a program recognition-based parallelization procedure could range from the automatic selection of a data distribution, through the automatic selection of sequences of optimizing transformations of the sequential code and the replacement of code with optimized parallel libraries, up to the automatic selection of the parallel execution model that is best suited to the algorithm to be parallelized and to the target parallel architecture. This paper presents the implementation of a prototype tool for the recognition of parallelizable algorithmic patterns (PAP Recognizer), which has been integrated into the Vienna Fortran Compilation System, an interactive compilation system for scalable architectures. The distinctive features of the approach are discussed, and the way the recognizer works is described with respect to a working example.
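The PAP Recognizer operates on Fortran within VFCS; the C fragment below is only an illustration of the kind of algorithmic pattern such a recognizer targets, showing a sequential loop nest recognizable as a dense matrix-vector product and one possible parallel replacement (sketched here with OpenMP rather than the tool's actual output).

```c
#include <stddef.h>

/* Sequential pattern a recognizer might identify as y = A*x. */
void matvec_seq(size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j] * x[j];
        y[i] = s;
    }
}

/* One possible replacement after recognition: row-parallel execution. */
void matvec_par(size_t n, const double *a, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {       /* signed index for the OpenMP loop */
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[(size_t)i * n + j] * x[j];
        y[i] = s;
    }
}
```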
ISBN (Digital): 9798350381993
ISBN (Print): 9798350382006
The main objective of this work is to bring supercomputing and parallel processing closer to non-specialized audiences by building a Raspberry Pi cluster, called Clupiter, which emulates the operation of a supercomputer. It consists of eight Raspberry Pi devices interconnected so that they can run jobs in parallel. To make it easier to show how the cluster works, a web application has been developed. It allows launching parallel applications and accessing a monitoring system to see the resource usage while these applications are running. The NAS Parallel Benchmarks (NPB) are used as demonstration applications. From this web application a couple of educational videos can also be accessed; they deal, in a very accessible way, with the concepts of supercomputing and parallel programming.
In this paper we propose a methodology for adapting Systolic Algorithms to the hardware selected for their implementation. The resulting Systolic Algorithms can be efficiently implemented using Pipelined Functional Units. The methodology is based on two transformation rules, which are applied to an initial Systolic Algorithm, possibly obtained through one of the design methodologies proposed by other authors. Parameters for these transformations are obtained from the specification of the hardware to be used. The methodology has been particularized for the case of one-dimensional Systolic Algorithms with data contraflow.
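As background for the kind of algorithm the methodology starts from (not the paper's transformation rules), the sketch below simulates a classic one-dimensional systolic correlation array with data contraflow: weights stay in the cells, inputs flow right-to-left, partial results flow left-to-right, and valid data occupy every other cycle. Sizes and coefficients are arbitrary assumptions.

```c
#include <stdio.h>

#define NTAPS 3                      /* cells / weights  */
#define NX    8                      /* input samples    */

int main(void)
{
    double w[NTAPS]  = {1.0, 2.0, 3.0};
    double x[NX]     = {1, 2, 3, 4, 5, 6, 7, 8};
    double xc[NTAPS] = {0}, yc[NTAPS] = {0};          /* per-cell registers */
    double xn[NTAPS], yn[NTAPS];

    /* y[i] = sum_k w[k]*x[i+k] leaves the rightmost cell at cycle 2*i + 2*NTAPS - 2. */
    for (int t = 0; t < 2 * NX + 2 * NTAPS; t++) {
        for (int k = 0; k < NTAPS; k++) {
            /* x arrives from the right neighbour (or the input stream at the right end);
               real samples enter only on even cycles, zeros fill the gaps. */
            double xin = (k == NTAPS - 1)
                         ? ((t % 2 == 0 && t / 2 < NX) ? x[t / 2] : 0.0)
                         : xc[k + 1];
            /* a fresh partial sum (0) enters the leftmost cell every cycle. */
            double yin = (k == 0) ? 0.0 : yc[k - 1];
            xn[k] = xin;                              /* pass x on to the left  */
            yn[k] = yin + w[k] * xin;                 /* accumulate, pass right */
        }
        for (int k = 0; k < NTAPS; k++) { xc[k] = xn[k]; yc[k] = yn[k]; }
        if (t >= 2 * NTAPS - 2 && (t - (2 * NTAPS - 2)) % 2 == 0) {
            int i = (t - (2 * NTAPS - 2)) / 2;
            if (i <= NX - NTAPS) printf("y[%d] = %g\n", i, yc[NTAPS - 1]);
        }
    }
    return 0;
}
```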
The authors propose a new parallel language, ACLAN (Array C LANguage), that allows the programming of any array processor. In particular, ACLAN allows a hypercube computer to be programmed as a synchronous system. The language is specially designed for numeric applications in which processing time and/or occupied memory are critical factors to optimize, because it allows direct manipulation of the structural elements of the nodes of the system. Such applications include pattern recognition, image processing, and matrix algebra. The authors describe the basic ACLAN features and present, as an example of programming a particular hypercube computer, the implementation of ACLAN on the NCube/10 system. Finally, they include an example of an ACLAN program for this computer.
Grain packing is an important problem in the development of efficient parallel programs. It is desirable that grain packing can be performed automatically, so that the programmer can write parallel programs without being troubled by the details of parallel-programming languages and parallel architectures, and so that the same parallel program can be executed efficiently on different machines. This paper presents a 2D Compression (2DC) grain packing method for determining optimal grain size and inherent parallelism concurrently. This ability stems mainly from 2DC's continual balancing of conflicting objectives. Experimental results demonstrate that 2DC increases solution effectiveness in comparison with state-of-the-art approaches that optimize either speedup or resource utilization alone. Additionally, 2DC can determine inherent parallelism, which means that users are no longer required to specify the number of processors before the compilation stage.
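The 2DC method itself is not described in the abstract; the sketch below only illustrates the underlying grain-packing trade-off with a naive greedy pass over a small task graph, merging two connected tasks into one grain when the communication removed outweighs a rough estimate of the parallelism lost (here, the smaller of the two computation times). The cost model and the example graph are assumptions.

```c
#include <stdio.h>

#define NTASK 4
#define NEDGE 3

typedef struct { int src, dst; double comm; } Edge;

int main(void)
{
    double comp[NTASK] = {4.0, 1.0, 6.0, 2.0};      /* per-task computation cost */
    Edge edge[NEDGE]   = {{0, 1, 3.0}, {1, 2, 0.5}, {2, 3, 5.0}};
    int grain[NTASK];                                /* grain id of each task     */
    for (int i = 0; i < NTASK; i++) grain[i] = i;    /* start with one task/grain */

    for (int e = 0; e < NEDGE; e++) {
        int a = grain[edge[e].src], b = grain[edge[e].dst];
        if (a == b) continue;                        /* already packed together   */
        double lost = comp[edge[e].src] < comp[edge[e].dst]
                    ? comp[edge[e].src] : comp[edge[e].dst];
        if (edge[e].comm > lost) {                   /* merge pays off            */
            for (int i = 0; i < NTASK; i++)
                if (grain[i] == b) grain[i] = a;
        }
    }
    for (int i = 0; i < NTASK; i++)
        printf("task %d -> grain %d\n", i, grain[i]);
    return 0;
}
```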