The parallel processing requirements of many computer applications, such as machine vision, radar, sonar, and signal processing, are reviewed. The major hardware architectural features for optimizing parallel processing performance (interconnect topology, memory locality, and synchronization facilities) are discussed. The various parallel processing models available are also discussed. These include job-level parallelism, data-level parallelism, algorithm-level parallelism, loop-level parallelism, and compute clusters.
The development of an algebra-algorithmic methodology and tools for the automated design and generation of programs for graphics processing units is proposed. A particular feature of the proposed approach is the use of high-level specifications that are close to natural-language specifications, together with a method that ensures the syntactical correctness of the algorithms and programs being designed. The approach was implemented in a toolkit intended for interactively designing algorithm schemes and generating programs. The use of this toolkit is illustrated by the development of a parallel program in the field of meteorology.
Message-passing programs are efficient, but fall short on convenience and portability. ZPL is a high-level language that offers competitive performance and portability, as well as programming conveniences lacking in low-level approaches.
Shared-memory parallel architectures have evolved, and new programming frameworks have appeared to exploit them: OpenMP, TBB, Cilk Plus, ArBB and OpenCL. This article focuses on the most widely used of these frameworks in commercial and scientific areas, presenting a comparative study and an evaluation. The study covers several capabilities, such as task deployment, scheduling techniques, and programming language abstractions. The evaluation measures three dimensions: code development complexity, performance, and efficiency, measured as speedup per watt. For this evaluation, several parallel benchmarks have been implemented with each framework. These benchmarks are designed to cover certain scenarios, such as regular memory access or irregular computation. The conclusions highlight that some frameworks (OpenMP, Cilk Plus) are better for quickly transforming sequential code, others (TBB) have a small footprint, which is ideal for small problems, and others (OpenCL) are suited to heterogeneous architectures but require a very complex development process. The conclusions also show that vectorization support is more critical than multitasking for achieving efficiency in those problems where this approach fits.
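The "speedup per watt" efficiency metric named in this abstract can be illustrated with a minimal sketch. The abstract does not say how power was measured, so the code assumes an average power draw over the parallel run; the function names and the sample numbers are hypothetical and are not taken from the paper.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Classic speedup: serial run time divided by parallel run time."""
    return serial_seconds / parallel_seconds


def speedup_per_watt(serial_seconds: float,
                     parallel_seconds: float,
                     avg_power_watts: float) -> float:
    """Speedup normalized by average power (assumed measurement method)."""
    return speedup(serial_seconds, parallel_seconds) / avg_power_watts


# Hypothetical measurements: 12 s serial, 3 s parallel, 80 W average draw.
example = speedup_per_watt(12.0, 3.0, 80.0)
```

With these made-up inputs the speedup is 4 and the metric is 4 / 80 = 0.05 per watt; a framework that reaches the same speedup at lower power scores higher.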
Scientific computing is usually associated with compiled languages for maximum efficiency. However, in a typical application program, only a small part of the code is time-critical and requires the efficiency of a compiled language. It is often advantageous to use interpreted high-level languages for the remaining tasks, adopting a mixed-language approach. This will be demonstrated for Python, an interpreted object-oriented high-level language that is well suited for scientific computing. Particular attention is paid to high-level parallel programming using Python and the BSP model. We explain the basics of BSP and how it differs from other parallel programming tools like MPI. Thereafter we present an application of Python and BSP for solving a partial differential equation from computational science, utilizing high-level design of libraries and mixed-language (Python-C or Python-Fortran) programming. (c) 2004 Published by Elsevier B.V.
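As a rough illustration of the BSP (bulk synchronous parallel) model mentioned in this abstract, the sketch below runs a two-superstep program on threads: each processor computes locally, sends messages, and waits at a barrier before reading what it received. This is a generic sketch built on Python's standard `threading` module, not the Python BSP library the paper describes; `bsp_total` and its parameters are hypothetical names. Unlike typical MPI point-to-point code, there are no matched send/receive pairs here: communication is simply delivered by the end of the superstep.

```python
import threading


def bsp_total(values, num_procs=4):
    """Sum `values` with a two-superstep BSP-style program run on threads."""
    barrier = threading.Barrier(num_procs)
    inboxes = [[] for _ in range(num_procs)]  # inboxes[p]: messages sent to p
    lock = threading.Lock()
    results = [None] * num_procs

    def worker(pid):
        # Superstep 1, computation: sum this processor's slice of the data.
        local = sum(values[pid::num_procs])
        # Superstep 1, communication: send the partial sum to every processor.
        with lock:
            for dest in range(num_procs):
                inboxes[dest].append(local)
        # Barrier: messages may be read only after all processors arrive.
        barrier.wait()
        # Superstep 2: combine the messages received in superstep 1.
        results[pid] = sum(inboxes[pid])

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


totals = bsp_total(list(range(10)), num_procs=4)
```

After the second superstep every simulated processor holds the same global sum, so `totals` contains one identical entry per processor.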
In the past, the persistent semiconductor problems of operating temperature and power consumption limited performance growth for single-core microprocessors. Microprocessor vendors therefore adopted multicore chip organizations with parallel processing, because the new technology promises higher speed and lower power consumption. In a short time, this trend spread first through CPU development and then to other components such as GPUs. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex graphical algorithms. However, multicore technology brought both a revolution and an unavoidable upheaval for programmers. Multicore processors offer high performance; however, parallel processing brings not only an opportunity but also a challenge. Efficiency, and the way the programmer or compiler explicitly parallelizes the software, are the keys to enhancing performance on multicore chips. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming. Two verification experiments are presented. In the first, we verify the availability and correctness of the auto-parallel tools and discuss the performance issues on CPU, GPU, and embedded systems. In the second, we verify how hybrid programming can improve performance. Copyright (C) 2016 John Wiley & Sons, Ltd.
ISBN: 9781424441563 (print)
Horde is a general programming framework for writing parallel applications on clusters. A computing task is modeled as a graph in Horde: each sub-task maps to a vertex, and data channels map to edges. Programming with Horde is simple: the programmer writes sequential code for the vertices and adds edges to link them. Horde can tolerate transient faults and provides support for writing code that tolerates permanent faults. Horde is portable and supports various cluster job managers. We evaluate Horde's communication efficiency through micro-benchmarks and demonstrate its ease of use by implementing a MapReduce engine. Tests on a small-scale cluster show that our implementation outperforms Hadoop.
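The graph model in this abstract (sequential code in vertices, data channels as edges) can be sketched in a few lines. The code below is not Horde's API, which the abstract does not show; `TaskGraph`, `add_vertex`, `add_edge` and `run` are hypothetical names, threads stand in for cluster nodes, and fault tolerance is omitted.

```python
import queue
import threading


class TaskGraph:
    """Toy dataflow graph: vertices run sequential code, edges carry data."""

    def __init__(self):
        self.vertices = {}  # name -> sequential function
        self.inputs = {}    # name -> incoming channels, in edge order
        self.outputs = {}   # name -> outgoing channels

    def add_vertex(self, name, func):
        self.vertices[name] = func
        self.inputs[name] = []
        self.outputs[name] = []

    def add_edge(self, src, dst):
        channel = queue.Queue()  # one data channel per edge
        self.outputs[src].append(channel)
        self.inputs[dst].append(channel)

    def run(self):
        results = {}

        def execute(name):
            # Block until every incoming channel has delivered its value.
            args = [ch.get() for ch in self.inputs[name]]
            value = self.vertices[name](*args)
            results[name] = value
            for ch in self.outputs[name]:
                ch.put(value)

        threads = [threading.Thread(target=execute, args=(name,))
                   for name in self.vertices]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results


def build_and_run():
    graph = TaskGraph()
    graph.add_vertex("source", lambda: [1, 2, 3, 4])
    graph.add_vertex("square", lambda xs: [x * x for x in xs])
    graph.add_vertex("total", lambda xs: sum(xs))
    graph.add_edge("source", "square")
    graph.add_edge("square", "total")
    return graph.run()
```

Calling `build_and_run()` executes the three vertices concurrently; each one blocks on its input channels, so the data dependencies alone determine the execution order.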
Explicit multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. XMT introduces a computational framework with (1) a simple programming style that relies on fine-grained PRAM-style algorithms; (2) hardware support for low-overhead parallel threads, scalable load balancing, and efficient synchronization. The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. This paper also takes this new opportunity to evaluate the overall effectiveness of the interaction between the programming model and the hardware, and to enhance its performance where needed by incorporating new optimizations into the XMT compiler. We present a wide range of applications which, written in XMT, obtain significant speedups relative to the best serial programs. We show that XMT is especially useful for more advanced applications with dynamic, irregular access patterns, while for regular computations we demonstrate performance gains that scale up to much higher levels than have been demonstrated before for on-chip systems.
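To give a flavor of what a "fine-grained PRAM-style algorithm" looks like, the sketch below simulates a textbook parallel inclusive scan (the Hillis-Steele formulation) sequentially in Python. It is not XMT code and does not use the XMT compiler or its language; the function name is hypothetical, and each list comprehension stands in for one synchronous round in which every element would be handled by its own lightweight thread.

```python
def pram_inclusive_scan(values):
    """Inclusive prefix sum in O(log n) synchronous PRAM-style rounds."""
    data = list(values)
    n = len(data)
    step = 1
    while step < n:
        # One round: every position i reads the OLD array, then all write.
        # On a PRAM, each i would be an independent fine-grained thread.
        data = [data[i] + data[i - step] if i >= step else data[i]
                for i in range(n)]
        step *= 2
    return data


prefix = pram_inclusive_scan([3, 1, 4, 1, 5])  # → [3, 4, 8, 9, 14]
```

The number of rounds grows logarithmically with the input size, which is why this style of algorithm benefits from hardware with very cheap thread creation and synchronization.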
Cenju is an experimental multiprocessor system with a distributed shared memory scheme, developed mainly for circuit simulation. The system is composed of 64 PEs (Processor Elements), which are divided into eight clusters. In each cluster, eight PEs are connected by a cluster bus. The cluster buses are in turn connected by a multistage network to form the whole system. Each PE consists of a 32-bit MC68020 microprocessor (20 MHz), 4/8 MB of RAM, and a WTL1167 floating-point processor (20 MHz). The system supports parallel programming in C and FORTRAN, with parallel primitives provided as subroutines to be embedded by the programmer. In this system, programmers must adhere to a Producer-Consumer model in which the producer of the data always writes the data to the consumer's memory. The simulation algorithm used in circuit simulation is hierarchical modular simulation, in which the circuit to be simulated is divided into subcircuits connected by an interconnection network. For the 64-processor system, a speedup of 15.8 compared to the single-processor case was attained for a DRAM circuit. Furthermore, by parallelizing the serial bottleneck, a speedup of 25.8 could be realized. In this article, the authors briefly describe the simulation algorithm and the Cenju architecture, then discuss the parallel programming aspects of Cenju in some detail.
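The speedup figures quoted in this abstract (15.8 and 25.8 on 64 processors) can be put in context with a short calculation. Parallel efficiency is simply speedup divided by processor count. The serial-fraction estimate additionally assumes that Amdahl's law describes the workload, which is an assumption made here for illustration and is not stated in the abstract; the function names are hypothetical.

```python
def efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / processors


def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Amdahl's law: S = 1 / (f + (1 - f) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def amdahl_serial_fraction(speedup: float, processors: int) -> float:
    """Invert Amdahl's law: f = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


# Figures from the abstract: 64 PEs, speedups of 15.8 and 25.8.
for s in (15.8, 25.8):
    print(f"speedup {s}: efficiency {efficiency(s, 64):.3f}, "
          f"implied serial fraction {amdahl_serial_fraction(s, 64):.4f}")
```

Under that assumption, the efficiency rises from roughly 25% to roughly 40%, and the implied serial fraction falls from about 4.8% to about 2.4%, which fits the abstract's remark that the gain came from parallelizing a serial bottleneck.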
ISBN: 9781510622043 (digital)
ISBN: 9781510622043 (print)
The paper presents experience from the design and development of an industrial measurement system. The architecture of the system is parallel and highly scalable. As studies show, parallel systems are more error-prone than sequential ones. Errors may occur in synchronization or data sharing and can sometimes prevent processing from completing within the time limits acceptable for a measurement system. Performance problems may therefore also be dependability problems. In this paper, the problems encountered during the implementation of a measurement system, as well as their solutions, are presented. One of them was the unpredictable behavior of the garbage collector, which decreased system performance. Some deadlock situations have also been identified, which may occur if the measurement device (i.e. hardware) experiences a specific failure mode. It is shown how a substantial performance increase and effective, scalable code were achieved.