Details
ISBN (print): 3540219463
This paper is devoted to applications of evolutionary algorithms to the optimal design of nonlinear structures and the identification of holes. Parallel and distributed evolutionary algorithms are considered. The optimality criterion is to minimize plastic strain areas and stress values, or an identification functional. The fitness functions are computed using the finite element method or the coupled finite and boundary element method.
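The abstract above contains no code; the following is a minimal, illustrative sketch of a parallel evolutionary loop of the kind it describes, with the expensive FEM / coupled FEM-BEM fitness evaluation replaced by a toy placeholder. All names (evaluate_fitness, mutate, evolve) and parameter values are invented for illustration and are not taken from the paper.

# Minimal sketch of a parallel evolutionary loop; the FEM/BEM fitness
# evaluation is replaced by a placeholder so the example stays runnable.
import random
from concurrent.futures import ProcessPoolExecutor

def evaluate_fitness(individual):
    # Placeholder for the expensive FEM / coupled FEM-BEM analysis;
    # here just a toy quadratic (to be minimized).
    return sum(x * x for x in individual)

def mutate(individual, sigma=0.1):
    return [x + random.gauss(0.0, sigma) for x in individual]

def evolve(pop_size=20, genes=4, generations=50):
    population = [[random.uniform(-1, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    with ProcessPoolExecutor() as pool:           # fitness evaluations run in parallel
        for _ in range(generations):
            fitness = list(pool.map(evaluate_fitness, population))
            ranked = [ind for _, ind in sorted(zip(fitness, population))]
            parents = ranked[:pop_size // 2]      # truncation selection (minimization)
            population = parents + [mutate(p) for p in parents]
    return min(population, key=evaluate_fitness)

if __name__ == "__main__":
    print(evolve())

The only point of the sketch is that fitness evaluations are independent and can be farmed out to worker processes; parallel and distributed variants then differ mainly in how the population is partitioned and exchanged.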
Details
ISBN (print): 9781450365109
The deluge of genomics data is incurring prohibitively high computational costs. As an important building block of genomic data processing algorithms, FM-index search occupies most of the execution time in sequence alignment. Due to massive random streaming memory references relative to only a small amount of computation, the FM-index search algorithm exhibits extremely low efficiency on conventional architectures. This paper proposes Niubility, an accelerator for FM-index search in genomic sequence alignment. Based on our algorithm-architecture co-design analysis, we found that conventional architectures exploit low memory-level parallelism, so the available memory bandwidth cannot be fully utilized. The Niubility accelerator customizes bit-wise operations and exploits data-level parallelism, producing maximal concurrent memory accesses to saturate memory bandwidth. We implement an accelerator ASIC in an ST 28 nm process that achieves up to 990x speedup over the state-of-the-art software.
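For readers unfamiliar with the kernel being accelerated, here is a hedged sketch of textbook FM-index backward search, the operation the abstract identifies as dominating alignment time. It is a plain-Python reference, not Niubility's bit-wise hardware formulation; the naive occ scan stands in for the sampled rank structures a real index would use, and all names are illustrative.

# Textbook FM-index backward search over a toy text.
def build_fm_index(text):
    text += "$"                                   # unique sentinel
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]               # Burrows-Wheeler transform
    alphabet = sorted(set(text))
    counts = {c: 0 for c in alphabet}
    for ch in text:
        counts[ch] += 1
    C, total = {}, 0                              # C[c] = #chars strictly smaller than c
    for c in alphabet:
        C[c] = total
        total += counts[c]
    return bwt, C

def occ(bwt, c, i):
    # Rank of c in bwt[0:i]; a real index uses sampled checkpoints instead.
    return sum(1 for ch in bwt[:i] if ch == c)

def backward_search(bwt, C, pattern):
    lo, hi = 0, len(bwt)                          # current suffix-array interval
    for c in reversed(pattern):                   # one LF-mapping step per symbol
        if c not in C:
            return 0
        lo = C[c] + occ(bwt, c, lo)
        hi = C[c] + occ(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo                                # number of occurrences

bwt, C = build_fm_index("ACGTACGTAC")
print(backward_search(bwt, C, "ACG"))             # -> 2

Each pattern symbol costs two rank queries, i.e. two essentially random memory references, which is why the abstract stresses memory-level parallelism rather than arithmetic throughput.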
Details
ISBN (print): 9780769533025
DPCM (Differential Pulse Code Modulation) coding is widely used in many applications, including lossless JPEG compression. DPCM decoding is inherently a 1-indexed or 2-indexed recurrence relation. Thus, although it is hard to parallelize efficiently, some O(N log N) or O(log² N) algorithms have been studied for an N x N image with N x N or N processors. Recently, commodity microprocessors have been equipped with multiple cores and SMP architectures are used in some PCs, but the degree of parallelism is not so large (up to 80). Thus, it is unrealistic to parallelize the processing of an N x N image with N x N or N processors. In this paper we implement two parallel DPCM algorithms for an N x N image on P processors (P << N): Fat-pipeline and P-scheme. Our experimental results show that both approaches provide parallelism of about 3.2 with 6 processing cores.
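As a small illustration of why DPCM decoding is a sequential recurrence and how P processors can still split it, here is a sketch of a first-order decode plus a generic block-and-offset decomposition. This is the standard prefix-sum idea, not necessarily the paper's exact Fat-pipeline or P-scheme; all names are illustrative.

def dpcm_decode_serial(deltas, x0=0):
    # x[i] = x[i-1] + d[i]: an inherently sequential 1-indexed recurrence
    out, prev = [], x0
    for d in deltas:
        prev += d
        out.append(prev)
    return out

def dpcm_decode_blocked(deltas, x0=0, P=4):
    # Phase 1: each "processor" sums its block independently.
    n = len(deltas)
    bounds = [round(i * n / P) for i in range(P + 1)]
    block_sums = [sum(deltas[bounds[p]:bounds[p + 1]]) for p in range(P)]
    # Phase 2: a short serial scan over P partial sums gives each block's offset.
    offsets, acc = [], x0
    for s in block_sums:
        offsets.append(acc)
        acc += s
    # Phase 3: each block finishes its local prefix sum from its offset.
    out = []
    for p in range(P):
        out.extend(dpcm_decode_serial(deltas[bounds[p]:bounds[p + 1]], offsets[p]))
    return out

deltas = [3, -1, 4, -1, 5, -9, 2, 6]
print(dpcm_decode_serial(deltas))        # [3, 2, 6, 5, 10, 1, 3, 9]
print(dpcm_decode_blocked(deltas, P=4))  # identical output

With P << N this decomposition does O(N) work per processor plus an O(P) serial step, which matches the paper's setting of a small number of cores rather than N or N x N processors.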
Details
ISBN (print): 9781510840232
The problem of scheduling independent jobs on identical parallel machines to minimize makespan has been intensely studied in the literature. One of the most popular constructive algorithms for this problem is the LPT (Longest Processing Time first) rule, whose approximation ratio has been proved by contradiction. A direct proof of its approximation ratio is presented, which can be regarded as an acquisition of knowledge by deductive means.
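For reference, a minimal sketch of the LPT rule itself: sort jobs by non-increasing processing time, then repeatedly assign the next job to the currently least-loaded machine (a min-heap keeps that lookup cheap). The instance and names below are illustrative, not from the paper.

import heapq

def lpt_makespan(processing_times, m):
    loads = [(0, i) for i in range(m)]            # min-heap of (current load, machine id)
    heapq.heapify(loads)
    assignment = [[] for _ in range(m)]
    # Consider jobs in non-increasing order of processing time.
    for job, t in sorted(enumerate(processing_times), key=lambda jt: -jt[1]):
        load, machine = heapq.heappop(loads)      # currently least-loaded machine
        assignment[machine].append(job)
        heapq.heappush(loads, (load + t, machine))
    return max(load for load, _ in loads), assignment

makespan, schedule = lpt_makespan([7, 7, 6, 6, 5, 4, 4, 2], m=3)
print(makespan, schedule)                         # makespan 15; an optimal schedule achieves 14

On this instance LPT yields a makespan of 15 against an optimum of 14, consistent with Graham's classical bound of 4/3 - 1/(3m) on the approximation ratio, which is the ratio the paper re-proves directly.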
Details
The prototype system OOXSDAR VISIS was implemented in VisualWorks/Smalltalk and Distributed Smalltalk, respectively. To achieve distribution in a heterogeneous network, a Common Object Request Broker Architecture (CORBA)-based architecture was chosen. The architecture consists of three layers: the knowledge client level, the knowledge domain agent server level, and the persistent knowledge storage level. It is based on the semantic/presentation split of logical knowledge objects. This architecture combines the advantages of standardized communication protocols such as CORBA with the power and expressivity of an OODBMS. First studies with the prototype system OOXSDAR VISIS were carried out for performance analysis. The results allowed significant improvement of the distributed inference process. Well-known principles such as increasing intra-module cohesion and minimizing inter-module dependencies were applied to restructure the distributed knowledge bases.
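The three-layer split described above can be pictured with a toy, plain-Python stand-in. The real system used CORBA and an OODBMS; every class, method, and data item below is invented purely for illustration.

class PersistentKnowledgeStore:
    """Persistent knowledge storage level (would be an OODBMS)."""
    def __init__(self):
        self._objects = {"rule-1": "IF pressure high THEN open valve"}
    def load(self, key):
        return self._objects[key]

class KnowledgeDomainAgent:
    """Knowledge domain agent server level: handles the semantic part of knowledge objects."""
    def __init__(self, store):
        self._store = store
    def infer(self, key):
        return {"semantic": self._store.load(key)}

class KnowledgeClient:
    """Knowledge client level: handles the presentation part of knowledge objects."""
    def __init__(self, agent):
        self._agent = agent
    def show(self, key):
        result = self._agent.infer(key)
        print(f"{key}: {result['semantic']}")

KnowledgeClient(KnowledgeDomainAgent(PersistentKnowledgeStore())).show("rule-1")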
Authors:
Böhm, A.; Brehm, J.; Finnemann, H.
IMMD3, University of Erlangen-Nuremberg, Martensstr. 3, Germany
Hammerbacher Str. 12+14, D-8520 Erlangen, Germany
In this paper we present an implementation of a parallelized sparse matrix algorithm for solving the Neutron Diffusion Equation on the SUPRENUM multiprocessor system. The solution of the steady-state and transient Neu...
Details
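The abstract above is truncated, so the following is only a generic stand-in for the kind of kernel such a solver distributes across processors: one Jacobi sweep over a sparse matrix stored in CSR form. It is not the paper's algorithm, and the tiny tridiagonal test system is invented for the example.

def jacobi_sweep(values, col_idx, row_ptr, diag, b, x):
    # One Jacobi iteration: x_new[i] = (b[i] - sum_{j != i} A[i,j] * x[j]) / A[i,i]
    x_new = [0.0] * len(x)
    for i in range(len(x)):                        # rows are independent -> parallelizable
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j != i:
                s += values[k] * x[j]
        x_new[i] = (b[i] - s) / diag[i]
    return x_new

# Tiny 1-D diffusion-like tridiagonal system: 2*x_i - x_{i-1} - x_{i+1} = 1
values  = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
diag    = [2.0, 2.0, 2.0]
x, b = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
for _ in range(50):
    x = jacobi_sweep(values, col_idx, row_ptr, diag, b, x)
print(x)   # converges toward [1.5, 2.0, 1.5]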
Data compression plays an important role in the era of big data; however, such compression is typically one of the bottlenecks of a massive data processing system due to intensive computing and memory access. In this p...
详细信息
Details
ISBN (print): 9781467323703; 9781467323727
An efficient parallel priority queue is at the core of the effort to parallelize important non-numeric irregular computations such as discrete event simulation scheduling and branch-and-bound algorithms. GPGPUs can provide a powerful computing platform for such non-numeric computations if an efficient parallel priority queue implementation is available. In this paper, aiming at fine-grained applications, we develop an efficient parallel heap system employing CUDA. To our knowledge, this is the first parallel priority queue implementation on many-core architectures and thus represents a breakthrough. By allowing wide heap nodes to enable thousands of simultaneous deletions of highest-priority items and insertions of new items, and by taking full advantage of CUDA's data-parallel SIMT architecture, we demonstrate up to 30-fold absolute speedup for relatively fine-grained compute loads compared to an optimized sequential priority queue implementation on fast multicores. By comparison, our optimized multicore parallelization of the parallel heap yields only 2-3 fold speedup for such fine-grained loads. This parallelization of a tree-based data structure on GPGPUs provides a roadmap for future parallelizations of other such data structures.
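As a hedged illustration of the "wide heap node" idea mentioned above, the toy class below lets one delete operation return the r smallest keys at once. It is a sequential stand-in that only shows the batched semantics, not the paper's pipelined CUDA implementation; the class and method names are ours.

import heapq

class WideHeap:
    """Toy priority queue whose delete step returns a whole node's worth of keys."""
    def __init__(self, r):
        self.r = r                      # keys per node ("width" of the root)
        self.items = []                 # toy backing store: a flat binary heap
    def insert_batch(self, keys):
        for k in keys:                  # in the real structure a whole batch
            heapq.heappush(self.items, k)   # moves down the tree together
    def delete_mins(self):
        # Remove the r smallest keys in one operation (the root node's content).
        return [heapq.heappop(self.items)
                for _ in range(min(self.r, len(self.items)))]

h = WideHeap(r=4)
h.insert_batch([9, 3, 7, 1, 8, 2, 6, 5])
print(h.delete_mins())   # -> [1, 2, 3, 5]

The design point is that width, not tree depth, is what exposes enough parallel work for thousands of GPU threads per operation.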
The efficiency of complex multiplication in signal processing is investigated in this study on three hardware platforms: CPU, GPU, and FPGA. A unified codebase was created to assess and contrast the efficiency and spe...
Details
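The abstract above is truncated; as a small self-contained illustration of the kernel it benchmarks, here are the usual 4-multiplication complex product and the 3-multiplication (Gauss) variant often preferred when hardware multipliers are the scarce resource, e.g. on FPGAs. Whether the study uses this variant is not stated, so treat it purely as background.

def cmul_4mul(a, b, c, d):
    # (a + bj)(c + dj) with 4 real multiplications and 2 additions
    return a * c - b * d, a * d + b * c

def cmul_3mul(a, b, c, d):
    # Same product with 3 real multiplications and extra additions (Gauss's trick)
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2

print(cmul_4mul(1.0, 2.0, 3.0, 4.0))   # (-5.0, 10.0)
print(cmul_3mul(1.0, 2.0, 3.0, 4.0))   # (-5.0, 10.0)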
Details
ISBN (print): 9781538610442
GPUs are commonly used as coprocessors to accelerate compute-intensive tasks, thanks to their massively parallel architecture. There has been study of different abstract parallel models, which allow researchers to design and analyse parallel algorithms. However, most work on analysing GPU algorithms has consisted of software-based tools for profiling a GPU algorithm. Recently, some abstract GPU models have been proposed, yet they do not capture all elements of a GPU. In particular, they miss the data transfer between CPU and GPU, which in practice can cause a bottleneck and reduce performance dramatically. We propose a comprehensive model called Abstract Transferring GPU, which to our knowledge is the first abstract GPU model to capture data transfer between CPU and GPU. We show via experiments that existing abstract GPU models cannot sufficiently capture the actual running time of a GPU algorithm in all cases, as they do not capture data transfer. We show that by capturing data transfer with our model, we are able to obtain more accurate predictions of a GPU algorithm's actual running time. We expect our model to help improve the design and analysis of heterogeneous systems consisting of CPU and GPU, and to allow researchers to make better-informed implementation decisions, as they will be aware of how data transfer will affect their programs.
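In the spirit of the abstract, the sketch below shows what a running-time prediction that includes host-device transfer terms might look like. The linear bandwidth-plus-latency transfer model and all parameter names and default values are our assumptions, not the paper's Abstract Transferring GPU model.

def predicted_gpu_time(bytes_in, bytes_out, kernel_time_s,
                       pcie_bw_bytes_per_s=12e9, per_transfer_latency_s=10e-6):
    # Total offload time = host-to-device copy + kernel + device-to-host copy.
    h2d = per_transfer_latency_s + bytes_in / pcie_bw_bytes_per_s
    d2h = per_transfer_latency_s + bytes_out / pcie_bw_bytes_per_s
    return h2d + kernel_time_s + d2h

# Example: 256 MiB in, 64 MiB out, 5 ms of kernel time.
t = predicted_gpu_time(256 * 2**20, 64 * 2**20, 5e-3)
print(f"{t * 1e3:.2f} ms")   # the transfer terms dominate the 5 ms kernel here

A kernel-only model would report 5 ms for this example, while the transfer-aware estimate is several times larger, which is exactly the kind of gap the abstract argues existing abstract GPU models fail to capture.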