The quadtree medial axis transform (QMAT) representation of a binary image is a very useful scheme for computer graphics and image processing applications. We present an efficient algorithm for QMAT on the shared memo...
ISBN (print): 9783540695004
In this paper we present a novel and complete approach to encapsulating parallelism for relational database query execution that strives for maximum resource utilization for both CPU and disk activities. Its simple and robust design is capable of modeling intra- and inter-operator parallelism for one or more parallel queries in a most natural way. In addition, encapsulation guarantees that the bulk of relational operators can remain unmodified, as long as their implementation is thread-safe. We will show that, with this approach, the problem of scheduling parallel tasks is generalized so that it can be safely entrusted to the underlying operating system (OS) without suffering any performance penalties. On the contrary, relocating all scheduling decisions from the DBMS to the OS guarantees a centralized and therefore near-optimal resource allocation (depending on the OS's abilities) for the complete system that hosts the database server as one of its tasks. Moreover, with this proposal, query parallelization is fully transparent at the SQL interface of the database system. The system can be configured for effective parallel query execution by the DB administrator through two descriptive tuning parameters. A prototype implementation has been integrated into the Transbase(R) relational DBMS engine.
ISBN (print): 9783319495835; 9783319495828
Execution plans constitute the traditional interface between DBMS front-ends and back-ends; similar networks of interconnected operators are also found outside database systems. Tasks like adapting execution plans for distributed or heterogeneous runtime environments require a plan transformation mechanism that is simple enough to produce predictable results while general enough to express advanced communication schemes required, for instance, in skew-resistant partitioning. In this paper, we describe the BobolangNG language, designed to express execution plans as well as their transformations, based on hierarchical models known from many environments but enhanced with a novel compile-time mechanism of component multiplication. Compared to approaches based on general graph rewriting, the plan transformation in BobolangNG is not iterative; therefore the consequences and limitations of the process are easier to understand, and the development of distribution strategies and experimentation with distributed plans are easier and safer.
ISBN (print): 9783540695004
Due to their high-level nature, parallel functional languages provide some advantages for the programmer. Unfortunately, the functional programming community has not paid much attention to some important practical problems, like debugging parallel programs. In this paper we introduce the first debugger that works with any parallel extension of the functional language Haskell, the de facto standard in the (lazy evaluation) functional programming community. The debugger is implemented as an independent library. Thus, it can be used with any Haskell compiler. Moreover, the debugger can be used to analyze how much speculative work has been done in any program.
In this work we investigate how Distributed Shared Memory (DSM) architectures affect performance of or-parallel logic programming systems and how this performance approaches that of conventional C systems. Our work co...
This paper describes a data-parallel algorithm for Boolean function manipulation. The algorithm adopts Binary Decision Diagrams (BDDs), which are the state-of-the-art approach for representing and handling Boolean functions. The algorithm is well suited for SIMD architectures and is based on distributing BDD nodes to the available Processing Elements and traversing BDDs in a breadth-first manner. An improved version of the same algorithm is also presented, which does not use virtual processors. A prototypical package has been implemented and its behavior has been studied with two different applications. In both cases the results show that the approach exploits the parallel hardware well by effectively distributing the load; thanks to the limited CPU time required and to the large amount of memory available, it can solve problems that cannot be handled by conventional architectures.
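To make the breadth-first, level-synchronous idea concrete, the sketch below (not the paper's package; written in CUDA as a modern SIMD-style stand-in, with illustrative names such as BddNode and expandLevel) distributes the nodes of the current BDD level one per thread and lets each thread emit its children into the next frontier. A real manipulation package would additionally hash and merge duplicate nodes to keep the diagram canonical.

#include <cstdio>
#include <cuda_runtime.h>

struct BddNode {
    int var;   // decision variable index; -1 marks a terminal (0/1 leaf)
    int lo;    // child index for var = 0
    int hi;    // child index for var = 1
};

__global__ void expandLevel(const BddNode* nodes,
                            const int* frontier, int frontierSize,
                            int* nextFrontier, int* nextSize) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= frontierSize) return;

    BddNode n = nodes[frontier[t]];
    if (n.var < 0) return;                       // terminals have no children

    // Each processing element pushes its two children; duplicates are tolerated here.
    int slot = atomicAdd(nextSize, 2);
    nextFrontier[slot]     = n.lo;
    nextFrontier[slot + 1] = n.hi;
}

int main() {
    // Tiny BDD for f(x0, x1) = x0 AND x1: node 0 tests x0, node 1 tests x1,
    // nodes 2 and 3 are the terminals 0 and 1.
    BddNode hNodes[] = { {0, 2, 1}, {1, 2, 3}, {-1, -1, -1}, {-1, -1, -1} };
    int hRoot[] = { 0 };

    BddNode* dNodes; int *dFront, *dNext, *dNextSize;
    cudaMalloc(&dNodes, sizeof(hNodes));
    cudaMalloc(&dFront, 16 * sizeof(int));
    cudaMalloc(&dNext, 16 * sizeof(int));
    cudaMalloc(&dNextSize, sizeof(int));
    cudaMemcpy(dNodes, hNodes, sizeof(hNodes), cudaMemcpyHostToDevice);
    cudaMemcpy(dFront, hRoot, sizeof(hRoot), cudaMemcpyHostToDevice);

    int size = 1;
    while (size > 0) {                           // one kernel launch per BDD level
        cudaMemset(dNextSize, 0, sizeof(int));
        expandLevel<<<(size + 127) / 128, 128>>>(dNodes, dFront, size, dNext, dNextSize);
        cudaMemcpy(&size, dNextSize, sizeof(int), cudaMemcpyDeviceToHost);
        printf("next level holds %d node references\n", size);
        int* tmp = dFront; dFront = dNext; dNext = tmp;
    }
    cudaFree(dNodes); cudaFree(dFront); cudaFree(dNext); cudaFree(dNextSize);
    return 0;
}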
ISBN (print): 9781479989379
Tasking is a prominent parallel programming model. In this paper we conduct a first study into the feasibility of task-parallel execution at the CUDA grid level, rather than the stream/kernel level, for regular, fixed in-out dependency task graphs similar to those found in wavefront computational patterns, making the findings broadly applicable. We propose and evaluate three CUDA task progression algorithms, in which thread blocks cooperatively process the task graph, and argue about their performance in terms of tasking throughput, atomics, and memory I/O overheads. Our initial results demonstrate a throughput of 38 million tasks/second on a Kepler K20X architecture.
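The following sketch illustrates one possible grid-level task progression scheme of this flavor. It is not one of the paper's three algorithms; the names (TaskPool, taskLoop), the 2D wavefront dependency pattern, and all sizes are assumptions made for the example. Persistent thread blocks claim ready tasks from a global queue with atomics, and each finished task resolves the dependency counters of its neighbours.

#include <cstdio>
#include <cuda_runtime.h>

const int N = 8;                      // N x N wavefront: task (i,j) waits on (i-1,j) and (i,j-1)
const int NUM_TASKS = N * N;

struct TaskPool {
    int remaining[NUM_TASKS];         // unmet dependencies per task
    int ready[NUM_TASKS];             // published task ids; -1 = slot not yet written
    int head;                         // next queue slot a worker may claim
    int tail;                         // next queue slot a producer may fill
    float result[NUM_TASKS];          // stand-in for the real task output
};

// Mark one task as runnable. Callers must have fenced their writes first.
__device__ void publish(TaskPool* p, int task) {
    int slot = atomicAdd(&p->tail, 1);
    atomicExch(&p->ready[slot], task);    // atomic store doubles as the "slot filled" signal
}

__global__ void taskLoop(TaskPool* p) {
    __shared__ int task;
    for (;;) {
        if (threadIdx.x == 0) {
            task = -1;
            int slot = atomicAdd(&p->head, 1);        // claim the next queue slot, if any
            if (slot < NUM_TASKS) {
                while ((task = atomicAdd(&p->ready[slot], 0)) < 0) { }  // wait until published
                __threadfence();
            }
        }
        __syncthreads();
        if (task < 0) return;                         // every task has been claimed: retire

        if (threadIdx.x == 0) {
            // "Run" the task: longest-path length stands in for real tile work.
            int i = task / N, j = task % N;
            float left = (j > 0) ? p->result[task - 1] : 0.0f;
            float up   = (i > 0) ? p->result[task - N] : 0.0f;
            p->result[task] = fmaxf(left, up) + 1.0f;
            __threadfence();                          // make the result visible before releasing dependents

            // Resolve dependents: right neighbour (i, j+1) and lower neighbour (i+1, j).
            if (j + 1 < N && atomicSub(&p->remaining[task + 1], 1) == 1) publish(p, task + 1);
            if (i + 1 < N && atomicSub(&p->remaining[task + N], 1) == 1) publish(p, task + N);
        }
        __syncthreads();
    }
}

int main() {
    TaskPool h = {};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            h.remaining[i * N + j] = (i > 0) + (j > 0);
            h.ready[i * N + j] = -1;
        }
    h.ready[0] = 0;                   // task (0,0) is ready from the start
    h.tail = 1;

    TaskPool* d;
    cudaMalloc(&d, sizeof(TaskPool));
    cudaMemcpy(d, &h, sizeof(TaskPool), cudaMemcpyHostToDevice);
    taskLoop<<<16, 64>>>(d);          // 16 persistent worker blocks cooperate on the graph
    cudaMemcpy(&h, d, sizeof(TaskPool), cudaMemcpyDeviceToHost);
    printf("corner task value: %.0f (expected %d)\n", h.result[NUM_TASKS - 1], 2 * N - 1);
    cudaFree(d);
    return 0;
}

The queue guarantees progress because every claimed slot is eventually filled by a worker that is already resident, which is one way to sidestep the scheduling pitfalls that grid-level tasking has to argue about.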
ISBN (print): 9781424420025
Graphics processing units (GPUs) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. Unlike multicore CPU architectures, which currently ship with two or four cores, GPU architectures are "manycore" with hundreds of cores capable of running thousands of threads in parallel. NVIDIA's CUDA is a co-evolved hardware-software architecture that enables high-performance computing developers to harness the tremendous computational power and memory bandwidth of the GPU in a familiar programming environment - the C programming language. We describe the CUDA programming model and motivate its use in the biomedical imaging community.
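As a point of reference for readers new to the model, a minimal CUDA C example of the kind the abstract alludes to might look as follows (illustrative only, not taken from the paper): a scalar loop is rewritten as a kernel in which each of thousands of lightweight threads handles one array element.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// y = a*x + y, one element per thread.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);   // 4096 blocks of 256 threads
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %.1f (expected 5.0)\n", hy[0]);
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}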
ISBN (digital): 9783540320715
ISBN (print): 3540292357
Several algorithms for distributed mutual exclusion are discussed, a unified model of them in terms of a finite-population queuing system is proposed, and a simulation-based performance study is presented under the assumption that they use multicast communication where possible. To formally represent the algorithms for simulation, a class of extended Petri nets is used. The simulation was carried out in the Winsim simulation system, which is based on this class of Petri nets.
ISBN (print): 9781479907298
For high-performance computation, memory access is a major issue. Whether the target is a supercomputer, a GPGPU device, or an Application Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that, from a computational point of view, can be implemented efficiently on parallel architectures fail to achieve significant speed-ups. The reason is very often that the available execution units are poorly utilized due to inefficient data access. This paper shows a method for improving the access time for data sequences that are completely static, at the cost of extra memory. This is done by resolving memory conflicts using padding. The method can be applied automatically, and it is shown to significantly reduce the data access time for sorting and FFTs. The execution time for the FFT is improved by up to a factor of 3.4, and for sorting by a factor of up to 8.
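The padding idea generalizes beyond the ASIP setting of the paper. As an illustration only (the paper's own method and target architecture differ), the same trick applied to CUDA shared memory is sketched below: a statically known column-wise access pattern is padded by one element so that the 32 accesses of a warp fall into distinct memory banks instead of serializing on one.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 32

// Transpose an n x n matrix tile by tile. Without the "+ 1" padding, the
// column-wise reads of "tile" below would all hit the same shared-memory bank
// (a 32-way conflict); the extra column shifts each row into a different bank.
__global__ void transpose(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];      // padding resolves the bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;   // coordinates of the transposed tile
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    const int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(float);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) h[i] = (float)i;

    float *din, *dout;
    cudaMalloc(&din, bytes); cudaMalloc(&dout, bytes);
    cudaMemcpy(din, h, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    transpose<<<grid, block>>>(din, dout, n);
    cudaMemcpy(h, dout, bytes, cudaMemcpyDeviceToHost);

    printf("element (1,0) of the transpose: %.0f (expected 1)\n", h[1 * n + 0]);
    cudaFree(din); cudaFree(dout); free(h);
    return 0;
}

As in the paper's static access sequences, the conflict-free layout is decided entirely at compile time; the only cost is the unused padding column.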