ISBN: 9781424442379 (print)
In this paper, we study the impact of multiprocessor memory systems, in particular distributed memory (DM) and virtual shared memory (VSM), on the implementation of parallel backpropagation neural network algorithms. In the first instance, the neural network is partitioned into sub-networks by applying a hybrid partitioning scheme. In the second, each partitioned network is evaluated with matrix multiplication. Three different sizes of neural network are used, with exchange rate prediction as the reference problem. Parallel implementations are obtained for each of the distributed memory and virtual shared memory scenarios. These algorithms are implemented on a high-performance cluster, "Monolith", consisting of over 396 nodes. Programming is realized using the Message Passing Interface (MPI) library and C-Linda. The partitioned matrix multiplication approach has the fastest execution time, and the DM/MPI implementation is always faster than the VSM/Linda equivalent. However, in VSM/Linda it is possible to allow the parallel neural network to choose the optimum number of processors dynamically.
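As a rough illustration of evaluating a partitioned network with matrix multiplication, the sketch below splits one layer's weight matrix into row blocks and computes each block independently. It is not the paper's hybrid scheme: the row-wise split, the sigmoid activation, the use of Python threads in place of MPI ranks or Linda workers, and all function names are assumptions made for this example.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partition_rows(weights, parts):
    """Split the weight rows into at most `parts` contiguous blocks."""
    size = math.ceil(len(weights) / parts)
    return [weights[i:i + size] for i in range(0, len(weights), size)]


def block_forward(block, inputs):
    """Matrix-vector product for one block of neurons, then a sigmoid."""
    outputs = []
    for row in block:
        z = sum(w * x for w, x in zip(row, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def layer_forward(weights, inputs, parts=2):
    """Evaluate one layer by handing each row block to a separate worker."""
    if not weights:
        return []
    blocks = partition_rows(weights, parts)
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        # Threads stand in for the processors of a cluster (assumption).
        results = list(pool.map(block_forward, blocks, [inputs] * len(blocks)))
    # Concatenate the partial outputs in neuron order.
    return [y for part in results for y in part]
```

Because each block only needs the shared input vector, the blocks can be computed without communication until the partial outputs are gathered.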
Groundwater flow simulation has become one of the top international issues in the new generation of environmental applications. When managing large-scale groundwater flow problems, the intensive computational ability and ...
ISBN: 9781467302197 (print)
This paper introduces a CMOS-based system that has been designed to allow parallel comparison of fragmented DNA sequences for on-chip assembly. The compatibility of different existing PC-based algorithms for implementation in CMOS is compared, and the overlap-layout-consensus approach is found to be the most suitable one. The designed system comprises a scalable processing array capable of parallel computation, which allows identification of overlaps in DNA fragments in addition to error tolerance through dynamic programming. Analysis shows that there is a "pixel area vs. computation time" trade-off when implementing such a parallel architecture. Results from a hypothetical assembly confirm good overlap detection and error tolerance, with up to 94% similarity in the detected overlaps when the error is as much as 10%.
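To make the idea of error-tolerant overlap detection through dynamic programming concrete, here is a minimal software sketch. It is not the paper's CMOS design: the edit-distance scoring, the `max_error_rate` and `min_len` parameters, and the function name are assumptions chosen for illustration.

```python
def best_overlap(a, b, max_error_rate=0.1, min_len=3):
    """Return (length, errors) for the longest prefix of `b` that matches
    some suffix of `a` within the allowed error rate, or (0, 0) if none.

    dp[j] holds the smallest edit distance between b[:j] and any suffix of
    the part of `a` read so far (the suffix may be empty).
    """
    m = len(b)
    dp = list(range(m + 1))  # with nothing read from `a`, b[:j] costs j edits
    for ch in a:
        prev_diag = dp[0]
        dp[0] = 0  # the overlap may start at any position in `a`
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(
                prev_diag + (ch != b[j - 1]),  # match or substitution
                cur + 1,                       # extra character in `a`
                dp[j - 1] + 1,                 # extra character in `b`
            )
            prev_diag = cur
    # Prefer the longest prefix of `b` whose error count is acceptable.
    for j in range(m, min_len - 1, -1):
        if dp[j] <= max_error_rate * j:
            return j, dp[j]
    return 0, 0
```

For example, `best_overlap("ACGTTGCA", "TGCAGGTC", max_error_rate=0.0)` reports an exact overlap of length 4 ("TGCA"), while a nonzero error rate lets overlaps with a few mismatches or indels through.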
Approximate pattern matching (APM) aims to find the occurrences of a pattern inside a subject text while allowing a limited number of errors. It has been widely used in many application areas such as bioinformatics and information retrieval. Bit-parallel APM takes advantage of the intrinsic parallelism of bitwise operations inside a machine word. This approach typically encodes non-deterministic finite automaton (NFA) states, or value differences between adjacent cells of a dynamic programming matrix, in the form of bit arrays. Wu-Manber (WM) is a well-known bit-parallel APM algorithm, which simulates an NFA and gains parallel efficiency by performing multiple state updates within a machine word. An important parameter is the machine word size (e.g. 32 or 64 bits for CPUs). Due to increasing vector capabilities, efficient mapping of bit-parallel APM algorithms onto modern high-performance computing architectures is an interesting research topic. Prominent examples are Xeon Phi coprocessors and CUDA-enabled GPUs, which provide words of size 512 bits (by means of vector registers) and 1024 bits (by means of warps), respectively. In this paper, we investigate mappings of the WM algorithm onto these two accelerator types. Both architectures are able to achieve speedups of around two orders of magnitude compared to a single-threaded CPU implementation. Moreover, our tile-based implementation on a GeForce Titan graphics card runs up to 2.9x faster than our implementation on an Intel Xeon Phi 5110P. Source code is available at http://***. (C) 2015 Elsevier B.V. All rights reserved.
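The bit-parallel NFA simulation described above can be sketched in a few lines. This is a scalar, textbook-style Shift-And formulation with up to `k` edit errors, not the authors' Xeon Phi or CUDA code; Python's unbounded integers stand in for the machine word, and the function name is an assumption.

```python
def wu_manber_search(pattern, text, k):
    """Return the end positions in `text` where `pattern` matches with at
    most `k` errors (insertions, deletions, substitutions).

    Bit i of R[d] is set when pattern[:i + 1] matches a suffix of the text
    read so far with at most d errors.
    """
    m = len(pattern)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # One bit mask per character: bit i is set where pattern[i] == char.
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    # Before any text is read, d deletions already cover the first d chars.
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for pos, ch in enumerate(text):
        cm = mask.get(ch, 0)
        old_prev = R[0]
        R[0] = ((R[0] << 1) | 1) & cm
        new_prev = R[0]
        for d in range(1, k + 1):
            old = R[d]
            R[d] = (
                (((old << 1) | 1) & cm)  # match
                | old_prev               # extra text character
                | (old_prev << 1)        # substitution
                | (new_prev << 1)        # skipped pattern character
                | 1
            ) & full
            old_prev, new_prev = old, R[d]
        if R[k] & accept:
            hits.append(pos)
    return hits
```

Each text character updates all pattern positions of one error level with a handful of bitwise operations, which is why wider words (vector registers or warps) translate directly into more states updated per step.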
A method for the programming and evaluation of parallel signal-processor architectures based on a data-flow representation of signal-processing algorithms is described. The constant data flow, which is a special property of most signal-processing algorithms, allows the scheduling and resource allocation to be done at compile time, rather than at run time as in usual data-flow systems. It is therefore possible to describe arbitrary hardware configurations; a result that is closer to a realizable hardware solution is guaranteed. Therefore, hardware requirements can be kept low. 7 refs.
There are many tools for OpenMP benchmarking which measure the various aspects of the performance, such as the overheads of OpenMP directives and the characteristics of the whole system. But we lack some tools to show...
Recently, GPGPU has been adopted well in the High Performance Computing (HPC) field. The limited global memory bandwidth poses a great challenge to many GPGPU programmers trying to exploit parallelism within the CPUGP...
ISBN: 0769515797 (print)
Block sorting is used in connection with Optical Character Recognition (OCR). Recent work has focused on finding good strategies which work in practice. In this paper, we show that optimizing block sorting is NP-hard. Along with this result, we give new non-trivial lower bounds. These bounds can be computed efficiently. We define the concept of "Local Property algorithms" and show that several previously published block sorting algorithms fall into this class.
In this work, two parallel techniques based on shared memory programming are presented. These models are especially suitable for application to evolutionary algorithms. To study their performance, the algorithm UEGO (Universal Evolutionary Global Optimizer) has been chosen.
Shape theory is a new approach to data types and programming based on the separation of a data type into its 'shape' and 'data' parts. Shape is common in parallel computing. This paper identifies areas where the explicit use of shape reduces the burden of programming a parallel computer, using examples from an implementation of Cholesky decomposition.