Scientific computing is usually associated with compiled languages for maximum efficiency. However, in a typical application program, only a small part of the code is time-critical and requires the efficiency of a compiled language. It is often advantageous to use interpreted high-level languages for the remaining tasks, adopting a mixed-language approach. This will be demonstrated for Python, an interpreted object-oriented high-level language that is well suited for scientific computing. Particular attention is paid to high-level parallel programming using Python and the BSP model. We explain the basics of BSP and how it differs from other parallel programming tools like MPI. Thereafter we present an application of Python and BSP for solving a partial differential equation from computational science, utilizing high-level design of libraries and mixed-language (Python-C or Python-Fortran) programming. (c) 2004 Published by Elsevier B.V.
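The BSP model discussed in this abstract is language-neutral, so one superstep can be sketched with plain MPI even though the paper itself uses Python: a local computation phase, then a communication phase, then a global barrier that ends the superstep. Everything in the sketch below (the local array, the boundary exchange, the number of supersteps) is hypothetical and only illustrates the superstep discipline.

```cpp
// Minimal BSP-style superstep loop expressed with MPI (illustrative only).
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    std::vector<double> local(1000, rank);   // each process owns one block
    std::vector<double> edges(nprocs);       // one boundary value per process

    const int nsteps = 10;                   // hypothetical number of supersteps
    for (int step = 0; step < nsteps; ++step) {
        // 1) local computation phase (no communication)
        for (double& x : local) x = 0.5 * (x + rank);

        // 2) communication phase: publish one boundary value to every process
        double my_edge = local.back();
        MPI_Allgather(&my_edge, 1, MPI_DOUBLE,
                      edges.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD);

        // 3) bulk synchronization: the superstep ends here for all processes
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```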
In the past, the persistent semiconductor problems of operating temperature and power consumption limited performance growth for single-core microprocessors. Microprocessor vendors therefore adopted multicore chip organizations with parallel processing, because the new technology promises higher speed at lower power. This trend quickly reached CPU development first and then other devices such as GPUs. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex graphical algorithms. However, multicore processor technology also confronted programmers with a disruptive change. Multicore processors offer high performance, but parallel processing brings a challenge as well as an opportunity: efficiency, and the way the programmer or compiler explicitly parallelizes the software, are the keys to better performance on a multicore chip. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming. Two verification experiments are presented. In the first, we verify the availability and correctness of auto-parallelization tools and discuss performance issues on CPUs, GPUs, and embedded systems. In the second, we verify how hybrid programming can improve performance. Copyright (C) 2016 John Wiley & Sons, Ltd.
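A minimal sketch of the hybrid structure this abstract refers to, assuming a simple block partition of the work: MPI distributes iterations across nodes and OpenMP threads process the local slice; the place where a CUDA kernel would be launched is only marked with a comment. The problem size and per-element work are invented for illustration.

```cpp
// Hedged MPI + OpenMP skeleton of a hybrid (MPI/OpenMP/CUDA) computation.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long N = 1 << 20;                        // hypothetical problem size
    const long chunk = (N + nprocs - 1) / nprocs;  // block of iterations per MPI rank
    long begin = rank * chunk;  if (begin > N) begin = N;
    long end   = begin + chunk; if (end > N)   end = N;

    std::vector<double> local(end - begin, 0.0);

    // Within the node, OpenMP threads work on the local slice; in the hybrid
    // scheme described above this loop body would instead copy its sub-range
    // to the GPU and launch a CUDA kernel.
    #pragma omp parallel for
    for (long i = begin; i < end; ++i)
        local[i - begin] = 0.5 * static_cast<double>(i);

    double local_sum = 0.0, global_sum = 0.0;
    for (double v : local) local_sum += v;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```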
ISBN (Print): 9781424441563
Horde is a general programming framework for writing parallel applications on clusters. A computing task is modeled as a graph in Horde: each sub-task maps to one vertex, and data channels map to edges in the graph. Programming with Horde is very simple: the developer writes sequential code for the vertices and adds edges to link them. Horde tolerates transient faults and provides support for writing code that tolerates permanent faults. Horde is portable and supports various cluster job managers. We evaluate Horde's communication efficiency through micro-benchmarks and demonstrate its ease of use by implementing a MapReduce engine. Tests on a small-scale cluster show that our implementation outperforms Hadoop.
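The following is an illustrative sketch of the programming model this abstract describes, not Horde's actual API: two "vertices" run ordinary sequential code and are linked by one "edge" implemented as a blocking channel. All names and the end-of-stream convention are hypothetical.

```cpp
// Vertex-and-edge dataflow sketch: vertices run sequential code, an edge is a channel.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Channel {               // an "edge": a blocking data channel
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T receive() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

int main() {
    Channel<int> edge;        // edge linking the two vertices

    // Vertex A: plain sequential code that produces values.
    std::thread vertexA([&] {
        for (int i = 1; i <= 5; ++i) edge.send(i * i);
        edge.send(-1);        // end-of-stream marker (hypothetical convention)
    });

    // Vertex B: plain sequential code that consumes values.
    std::thread vertexB([&] {
        for (int v = edge.receive(); v != -1; v = edge.receive())
            std::cout << "vertex B received " << v << '\n';
    });

    vertexA.join();
    vertexB.join();
    return 0;
}
```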
Explicit multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. XMT introduces a computational framework with (1) a simple programming style that relies on fine-grained PRAM-style algorithms, and (2) hardware support for low-overhead parallel threads, scalable load balancing, and efficient synchronization. The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. This paper also takes this new opportunity to evaluate the overall effectiveness of the interaction between the programming model and the hardware, and to enhance its performance where needed by incorporating new optimizations into the XMT compiler. We present a wide range of applications which, written in XMT, obtain significant speedups relative to the best serial programs. We show that XMT is especially useful for more advanced applications with dynamic, irregular access patterns, while for regular computations we demonstrate performance gains that scale up to much higher levels than have been demonstrated before for on-chip systems.
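As a rough standard-C++ analogue of the fine-grained PRAM-style idiom XMT targets (XMTC's spawn and prefix-sum primitives are not shown here), the sketch below compacts an array by letting one task per element claim an output slot through an atomic counter; the counter plays the role a prefix-sum primitive would play in XMT code.

```cpp
// Fine-grained, one-task-per-element array compaction (PRAM-style analogue).
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> input = {7, -3, 4, -1, 9, -8, 2, 6};
    std::vector<int> output(input.size());
    std::atomic<int> next_slot{0};            // analogue of a prefix-sum primitive

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < input.size(); ++i) {
        workers.emplace_back([&, i] {          // one fine-grained task per element
            if (input[i] > 0) {
                int slot = next_slot.fetch_add(1);   // claim a unique output slot
                output[slot] = input[i];
            }
        });
    }
    for (auto& w : workers) w.join();

    for (int k = 0; k < next_slot.load(); ++k)
        std::cout << output[k] << ' ';
    std::cout << '\n';
    return 0;
}
```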
Cenju is an experimental multiprocessor system with a distributed shared memory scheme developed mainly for circuit simulation. The system is composed of 64 PEs (Processor Elements) divided into eight clusters. In each cluster, eight PEs are connected by a cluster bus. The cluster buses are in turn connected by a multistage network to form the whole system. Each PE consists of a 32-bit MC68020 microprocessor (20 MHz), 4/8 MB of RAM, and a WTL1167 floating-point processor (20 MHz). The system supports parallel programming in C and FORTRAN, in which parallel primitives are provided as subroutines to be embedded by the programmer. In this system, programmers must adhere to a producer-consumer model in which the producer of the data always writes the data to the consumer's memory. The simulation algorithm used in circuit simulation is hierarchical modular simulation, in which the circuit to be simulated is divided into subcircuits connected by an interconnection network. For the 64-processor system, a speedup of 15.8 compared to the one-processor case was attained for a DRAM circuit. Furthermore, by parallelizing the serial bottleneck, a speedup of 25.8 could be realized. In this article, the authors briefly describe the simulation algorithm and the Cenju architecture, then dwell in some detail on the parallel programming aspects of Cenju.
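A shared-memory analogue of the producer-consumer discipline described above, in which the producer always writes directly into memory owned by the consumer and then signals completion; on Cenju the write would target the consumer PE's memory across the network, whereas here it is simply a consumer-owned buffer. Names and sizes are hypothetical.

```cpp
// Producer writes into the consumer's memory, then signals readiness.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

struct ConsumerMailbox {
    std::vector<double> slot = std::vector<double>(4, 0.0); // consumer-owned memory
    std::atomic<bool> ready{false};
};

int main() {
    ConsumerMailbox box;

    std::thread producer([&] {
        for (std::size_t i = 0; i < box.slot.size(); ++i)
            box.slot[i] = 1.5 * static_cast<double>(i);     // write into consumer's memory
        box.ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        while (!box.ready.load(std::memory_order_acquire))
            std::this_thread::yield();                       // wait for the producer's write
        double sum = 0.0;
        for (double v : box.slot) sum += v;
        std::cout << "consumer read sum = " << sum << '\n';
    });

    producer.join();
    consumer.join();
    return 0;
}
```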
ISBN (Print): 9781665414555
Stream processing applications are spread across different sectors of industry and people's daily lives. The increasing amount of data we produce, such as audio, video, images, and text, demands fast and efficient computation. This can be achieved through stream parallelism, which is still a challenging task and mostly reserved for experts. We introduce a stream processing framework for assessing Parallel Programming Interfaces (PPIs). Our framework targets multi-core architectures and C++ stream processing applications, providing an API that abstracts the details of the stream operators of these applications. Therefore, users can easily identify all the basic operators and implement parallelism through different PPIs. In this paper, we present the proposed framework, implement three applications using its API, and show how it works by using it to parallelize and evaluate the applications with the PPIs Intel TBB, FastFlow, and SPar. The performance results were consistent with the literature.
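As a point of reference for the kind of stream operators such a framework abstracts, the sketch below expresses a three-stage source/transform/sink stream directly with one of the evaluated PPIs, Intel TBB's parallel_pipeline (assuming the oneTBB flavor of that API); it is not the framework's own API, and the stream length and transformation are hypothetical.

```cpp
// Three-stage stream (source -> transform -> sink) with TBB's parallel_pipeline.
#include <tbb/parallel_pipeline.h>
#include <cmath>
#include <iostream>

int main() {
    const int n_items = 100;       // hypothetical stream length
    int produced = 0;
    double accumulated = 0.0;

    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/8,
        // Source operator: emits one item per call, serial and in order.
        tbb::make_filter<void, double>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> double {
                if (produced >= n_items) { fc.stop(); return 0.0; }
                return static_cast<double>(produced++);
            }) &
        // Middle operator: stateless transformation, safe to run in parallel.
        tbb::make_filter<double, double>(tbb::filter_mode::parallel,
            [](double x) { return std::sqrt(x) * 2.0; }) &
        // Sink operator: serial, accumulates the results.
        tbb::make_filter<double, void>(tbb::filter_mode::serial_in_order,
            [&](double x) { accumulated += x; }));

    std::cout << "accumulated = " << accumulated << '\n';
    return 0;
}
```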
ISBN (Digital): 9781510622043
ISBN (Print): 9781510622043
The paper presents the experiences from the design and development of an industrial measurement system. The architecture of the system is parallel and highly scalable. As studies show, parallel systems are more error-prone than sequential ones. Errors may occur in synchronization or data sharing and can sometimes prevent processing within the time limits acceptable for a measurement system; thus, performance problems may also be dependability problems. In this paper, the problems encountered during the implementation of a measurement system, as well as their solutions, are presented. One of them was the unpredictable behavior of the garbage collector, which decreased system performance. Some deadlock situations have also been identified, which may occur if the measurement device (i.e., the hardware) experiences a specific failure mode. It is shown how a substantial performance increase and effective, scalable code were achieved.
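One common way to keep the kind of hardware-failure-induced deadlock mentioned above from blocking the whole system is a bounded wait on the device lock; the sketch below illustrates this idea with a timed mutex and is not taken from the system described in the paper. All names and the timeout value are hypothetical.

```cpp
// Bounded wait on a device lock: a hardware failure becomes a reported error, not a deadlock.
#include <chrono>
#include <iostream>
#include <mutex>

std::timed_mutex device_mutex;   // guards access to the measurement device

bool read_measurement(double& value) {
    using namespace std::chrono_literals;
    std::unique_lock<std::timed_mutex> lock(device_mutex, std::defer_lock);
    if (!lock.try_lock_for(200ms)) {        // bounded wait instead of blocking forever
        std::cerr << "device lock timeout: treating as a failure mode\n";
        return false;
    }
    value = 42.0;                           // placeholder for the real device read
    return true;
}

int main() {
    double v = 0.0;
    if (read_measurement(v))
        std::cout << "measurement = " << v << '\n';
    return 0;
}
```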
ISBN (Print): 9781728152585
This paper proposes a technology for analyzing large biomedical data sets based on CUDA computation. The technology was used to analyze a large set of fundus images used for automatic diabetic retinopathy diagnostics. A high-performance algorithm has been developed to calculate effective textural characteristics for medical image analysis. During the automatic image diagnostics, the following classes were distinguished: thin vessels, thick vessels, exudates, and healthy areas. The algorithm's efficiency was studied on images of 500x500 to 1000x1000 pixels using a 12x12 window. The relationship between the algorithm's speedup and the data size was demonstrated. The study showed that the algorithm's effectiveness can depend on certain characteristics of the image, such as its clarity, the shape of the exudate zones, the variability of the blood vessels, and the location of the optic disc.
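A hedged CUDA sketch of the kind of windowed computation described above: each thread computes simple textural statistics (local mean and variance, used here only as stand-ins for the paper's actual features) over a 12x12 window. The image size and contents are hypothetical.

```cuda
// One thread per pixel computes mean/variance over a win x win window.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void window_stats(const float* img, float* mean, float* var,
                             int width, int height, int win) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width - win || y >= height - win) return;   // skip the border

    float sum = 0.0f, sum_sq = 0.0f;
    for (int dy = 0; dy < win; ++dy)
        for (int dx = 0; dx < win; ++dx) {
            float v = img[(y + dy) * width + (x + dx)];
            sum += v;
            sum_sq += v * v;
        }
    float n = static_cast<float>(win * win);
    float m = sum / n;
    mean[y * width + x] = m;
    var[y * width + x]  = sum_sq / n - m * m;
}

int main() {
    const int width = 512, height = 512, win = 12;        // hypothetical sizes
    std::vector<float> host_img(width * height, 0.5f);

    float *d_img, *d_mean, *d_var;
    size_t bytes = host_img.size() * sizeof(float);
    cudaMalloc(&d_img, bytes);
    cudaMalloc(&d_mean, bytes);
    cudaMalloc(&d_var, bytes);
    cudaMemcpy(d_img, host_img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    window_stats<<<grid, block>>>(d_img, d_mean, d_var, width, height, win);
    cudaDeviceSynchronize();
    std::printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_img); cudaFree(d_mean); cudaFree(d_var);
    return 0;
}
```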
ISBN (Print): 9781467324229
Remote Sensing (RS) data processing is characterized by massive remote sensing images and an increasing number of algorithms of higher complexity. Parallel programming for data-intensive applications such as massive remote sensing image processing on parallel systems is far from trivial and remains challenging. We propose a generic parallel programming skeleton, enabled by the C++ template mechanism, for these remote sensing applications on high-performance clusters. It provides both programming templates for distributed RS data and generic parallel skeletons for RS algorithms. Through the one-sided communication primitives provided by MPI, the distributed RS data template provides a global view of the large RS data whose sliced data blocks are scattered among the distributed memory of the cluster nodes. Moreover, by data serialization and RMA (Remote Memory Access), the data templates also offer a simple and effective way to distribute and communicate massive remote sensing data with complex data structures. Furthermore, the generic parallel skeletons implement recurring patterns of computation and performance optimization, and accept user-defined sequential functions as template parameters for type genericity. With the implemented skeletons, developers without extensive parallel computing expertise can implement efficient parallel remote sensing programs without being concerned with parallel computing details. Through experiments on remote sensing applications, we confirmed that our templates were productive and efficient.
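The one-sided communication that underlies such data templates can be illustrated with plain MPI RMA calls: each process exposes its slice of the data through an MPI window, and any process can fetch a remote block with MPI_Get, which is what gives the distributed data a global view. The sketch below is a minimal, hypothetical example, not the skeleton library's API.

```cpp
// MPI one-sided (RMA) access: fetch a remote block from another rank's window.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int block = 8;                                   // hypothetical block size
    std::vector<double> local(block, static_cast<double>(rank));

    MPI_Win win;
    MPI_Win_create(local.data(), block * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    std::vector<double> remote(block, 0.0);
    int target = (rank + 1) % nprocs;                      // read the next rank's block

    MPI_Win_fence(0, win);                                 // open the access epoch
    MPI_Get(remote.data(), block, MPI_DOUBLE,
            target, /*target_disp=*/0, block, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                                 // close the epoch: data is valid

    std::printf("rank %d read %.1f from rank %d\n", rank, remote[0], target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```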
Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node. (C) 2010 Elsevier B.V. All rights reserved.
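A hedged sketch of the partitioning scheme described above, assuming a contiguous block of loop iterations per MPI process (one per GPU node) that is then handed to a CUDA kernel; the problem size and per-element work are invented for illustration.

```cuda
// Each MPI process takes one contiguous block of iterations and runs it on its GPU.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void process_block(double* data, long count, long offset) {
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[i] = static_cast<double>(offset + i) * 0.5;   // placeholder work
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long N = 1 << 22;                                // total loop iterations
    const long chunk = (N + nprocs - 1) / nprocs;          // block per MPI process
    const long begin = rank * chunk;
    const long count = (begin < N) ? ((N - begin < chunk) ? (N - begin) : chunk) : 0;

    if (count > 0) {
        double* d_data = nullptr;
        cudaMalloc(&d_data, count * sizeof(double));

        const int threads = 256;
        const int blocks = static_cast<int>((count + threads - 1) / threads);
        process_block<<<blocks, threads>>>(d_data, count, begin);
        cudaDeviceSynchronize();

        std::vector<double> host(count);
        cudaMemcpy(host.data(), d_data, count * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }

    MPI_Finalize();
    return 0;
}
```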