Robust and efficient parallel numerical algorithms and their implementation in easy-to-use portable software components are crucial for computational science and engineering applications. They are strongly influenced ...
ISBN: 9781450359863 (print)
Graph-specific computing with the support of dedicated accelerators has greatly improved graph processing in both efficiency and energy consumption. Nevertheless, data conflict management in these accelerators is still sequential when a vertex needs a large number of conflicting updates at the same time, leading to prohibitive performance degradation. This is particularly true, and serious, when processing natural graphs. In this paper, we observe that the atomic operations for vertex updates in many graph algorithms (e.g., BFS, PageRank, and WCC) are typically incremental and simplex. This allows us to parallelize conflicting vertex updates in an accumulative manner. We architect AccuGraph, a novel graph-specific accelerator that can process atomic vertex updates simultaneously for massive parallelism while ensuring correctness. A parallel accumulator is designed to remove the serialization in atomic protection for conflicting vertex updates by merging their results in parallel. Our implementation on a Xilinx FPGA with a wide variety of typical graph algorithms shows that our accelerator achieves an average throughput of 2.36 GTEPS and up to 3.14x speedup in comparison with the state-of-the-art ForeGraph (in its single-chip version).
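The accumulative idea in this abstract can be illustrated in software. The sketch below is only an analogy, not the AccuGraph hardware design: it assumes the update operator (`min` for BFS-style distances, addition for PageRank-style sums) is associative and commutative, so conflicting updates to one vertex can be merged pairwise in a tree and written back once. The names `merge_tree` and `apply_updates` are hypothetical.

```python
import operator


def merge_tree(values, combine):
    """Pairwise (tree) reduction of conflicting updates.

    Each level combines neighbouring pairs, so n values need about
    log2(n) levels instead of n serialized atomic updates.
    """
    level = list(values)
    while len(level) > 1:
        merged = [combine(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element is carried to the next level
            merged.append(level[-1])
        level = merged
    return level[0]


def apply_updates(state, updates, combine):
    """Group updates by destination vertex, merge each group, write once."""
    groups = {}
    for vertex, value in updates:
        groups.setdefault(vertex, []).append(value)
    for vertex, values in groups.items():
        state[vertex] = combine(state[vertex], merge_tree(values, combine))
    return state


# BFS-style update: keep the smallest proposed distance per vertex.
INF = float("inf")
bfs_state = apply_updates([0, INF, INF], [(1, 3), (1, 2), (2, 5), (1, 4)], min)

# PageRank-style update: accumulate integer contributions per vertex.
sum_state = apply_updates([0, 0], [(0, 1), (0, 2), (0, 3), (1, 7)], operator.add)
```

Within one level of `merge_tree`, every pair is independent of the others, which is the property a hardware accumulator would exploit.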
ISBN: 9781538674796 (print)
Sequence alignment is the most widely used operation in bioinformatics. With the exponential growth of biological sequence databases, searching a database to find the optimal alignment for a query sequence (which can be on the order of hundreds of millions of characters long) requires excessive processing power and memory bandwidth. Sequence alignment algorithms can potentially benefit from the processing power of massively parallel processors due to their simple arithmetic operations, coupled with the inherent fine-grained and coarse-grained parallelism that they exhibit. However, the limited memory bandwidth in conventional computing systems prevents exploiting the maximum achievable speedup. In this paper, we propose a processing-in-memory architecture as a viable solution for the excessive memory bandwidth demand of bioinformatics applications. The design is composed of a set of simple and lightweight processing elements, customized to the sequence alignment algorithm and integrated at the logic layer of an emerging 3D DRAM architecture. Experimental results show that the proposed architecture achieves up to 2.4x speedup and a 41% reduction in power consumption compared to a processor-side parallel implementation.
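To make the "simple arithmetic operations" and fine-grained parallelism concrete, here is a minimal sketch of local alignment scoring. The abstract does not name the alignment algorithm, so Smith-Waterman with a linear gap penalty is an assumption, as are the scoring parameters and the function name `smith_waterman_score`. The table is traversed by anti-diagonals because all cells on one anti-diagonal depend only on the two previous ones and could be computed in parallel.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (assumed scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    # Anti-diagonal d holds the cells with i + j == d.
    for d in range(2, rows + cols - 1):
        # Every cell in this inner loop is independent of the others.
        for i in range(max(1, d - cols + 1), min(rows, d)):
            j = d - i
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,        # gap in b
                H[i][j - 1] + gap,        # gap in a
            )
            best = max(best, H[i][j])
    return best


identical = smith_waterman_score("ACGT", "ACGT")
disjoint = smith_waterman_score("AAAA", "TTTT")
substring = smith_waterman_score("GATTACA", "TTAC")
```

Each cell needs only a comparison, three additions and a maximum, but the full table is read and written once per query, which is why memory bandwidth rather than arithmetic tends to limit throughput.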
ISBN: 9783319111940 (digital)
ISBN: 9783319111940; 9783319111933 (print)
The Kirchhoff pre-stack depth migration (KPSDM) algorithm, one of the most widely used migration algorithms, plays an important part in obtaining the real image of the earth. However, the program takes considerable time due to its high computational cost; hence the working efficiency of the oil industry is affected. The general-purpose Graphics Processing Unit (GPU) and the Compute Unified Device Architecture (CUDA) developed by NVIDIA provide a new solution to this problem. In this study, we propose a parallel algorithm for Kirchhoff pre-stack depth migration and an optimization strategy based on CUDA technology. Our experiments indicate that for large data computations, the accelerated algorithm achieves a speedup of 8 to 15 times compared with NVIDIA GPU.
ISBN: 9783642131189 (print)
In this paper, we present a general architecture for hybrid prefix/carry-select adders. Based on this architecture, we formalize the hybrid adder's algorithm using first-order recursive equations and develop a proof framework to prove its correctness. Since several previous adders in the literature are special cases of this general architecture, our methodology can be used to prove the correctness of different hybrid prefix/carry-select adders. The formal proof for a special hybrid prefix/carry-select adder shows the effectiveness of the algebraic structures built in this paper.
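As an illustration of the kind of first-order recurrence involved, the sketch below models a hybrid adder in software. It is not the paper's formalization: it assumes the textbook carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]) with g = a AND b and p = a XOR b, combines per-block (generate, propagate) pairs with the usual prefix operator to obtain block carries, and then selects between two precomputed block sums. All function names and the block size are hypothetical.

```python
def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def block_gp(a_bits, b_bits):
    """Group (generate, propagate) of a block via the prefix operator."""
    G, P = 0, 1  # identity element of the operator
    for a, b in zip(a_bits, b_bits):
        g, p = a & b, a ^ b
        G, P = g | (p & G), p & P
    return G, P


def block_sum(a_bits, b_bits, carry_in):
    """Sum bits of one block for a fixed carry-in (carry-select half)."""
    c, out = carry_in, []
    for a, b in zip(a_bits, b_bits):
        g, p = a & b, a ^ b
        out.append(p ^ c)
        c = g | (p & c)  # first-order carry recurrence
    return out


def hybrid_add(x, y, n=8, block=4):
    a, b = to_bits(x, n), to_bits(y, n)
    blocks = [(a[lo:lo + block], b[lo:lo + block]) for lo in range(0, n, block)]
    # Block carries from group signals; hardware would use a prefix tree here.
    carries = [0]
    for G, P in (block_gp(ba, bb) for ba, bb in blocks):
        carries.append(G | (P & carries[-1]))
    total = []
    for (ba, bb), c in zip(blocks, carries):
        sum0, sum1 = block_sum(ba, bb, 0), block_sum(ba, bb, 1)
        total += sum1 if c else sum0  # carry-select multiplexer
    return sum(bit << i for i, bit in enumerate(total)) + (carries[-1] << n)


# Exhaustive check of the 8-bit model against integer addition.
all_correct = all(hybrid_add(x, y) == x + y
                  for x in range(256) for y in range(256))
```

An exhaustive check like this only covers one small width; the appeal of a proof framework is that it establishes correctness for every width and block partition at once.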
ISBN: 9783319499567; 9783319499550 (print)
This article describes an approach to scalability analysis of parallel applications, which is a major part of the algorithm description used in AlgoWiki, the Open Encyclopedia of Parallel Algorithmic Features. The proposed approach is based on the suggested definition of generalized scalability of a parallel application. This study uses joined and structured data on an application's execution and supercomputing co-design technologies. Parallel application properties are studied by analyzing data collected from all available sources of its dynamic characteristics, together with information about the hardware and software platforms, in relation to the features of an algorithm and its implementation. This allows reasonable conclusions to be drawn regarding potential reasons for changes in the execution quality of any parallel application and allows the scalability of various programs to be compared.
This paper considers the matrix decomposition A = LDL^T as a vehicle to explore the improvement in performance obtainable through the execution of multiple streams of control on SIMD architectures. Several methods for partitioning the SIMD array are considered. Architectural support for, and the feasibility of, using control parallelism in SIMD algorithms are briefly considered. Techniques for converting the extracted control parallelism into increased performance are illustrated via their application to the example algorithm. Analytical expressions for execution times are given in terms of the execution times of the constituent operations. Experimental results for the various partitioning schemes based on execution traces are also presented. Timings based on MasPar MP-2 operations and extrapolated from experimental data are used to compare the various control-parallel versions of the algorithm and the traditional SIMD counterpart.
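For readers unfamiliar with the decomposition, a minimal sequential sketch of A = LDL^T follows. It is not one of the paper's partitioned SIMD variants: it assumes a symmetric input with nonzero pivots, does no pivoting, and the function name `ldlt` is hypothetical. L is unit lower triangular and D is returned as a list of diagonal entries.

```python
def ldlt(A):
    """Return (L, D) with A = L * diag(D) * L^T for symmetric A."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        # Diagonal entry: remove contributions of earlier columns.
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        # Entries below the diagonal in column j are mutually independent.
        for i in range(j + 1, n):
            acc = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (A[i][j] - acc) / D[j]
    return L, D


A = [[4, 12, -16],
     [12, 37, -43],
     [-16, -43, 98]]
L, D = ldlt(A)

# Rebuild A from the factors as a consistency check.
rebuilt = [[sum(L[i][k] * D[k] * L[j][k] for k in range(3)) for j in range(3)]
           for i in range(3)]
```

The column loop over `j` is inherently sequential, while the updates inside one column are data parallel, which is the structure a SIMD partitioning scheme has to work with.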
More and more computers use hybrid architectures combining multi-core processors and hardware accelerators such as graphics processing units (GPUs). In this paper we present a new method for efficiently scheduling parallel applications on m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU. The objective is to minimize the maximum completion time (makespan). The corresponding scheduling problem is NP-hard. Copyright (c) 2014 John Wiley & Sons, Ltd.
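The problem statement can be made concrete with a small baseline. The sketch below is not the method proposed in the paper, which this abstract does not describe; it is a simple greedy heuristic, assuming independent tasks with known CPU and GPU processing times and at least one processor of each kind. The name `greedy_schedule` and the task format are hypothetical.

```python
def greedy_schedule(tasks, m, k):
    """Assign each (cpu_time, gpu_time) task to the unit that finishes it first.

    Returns (makespan, assignment), where assignment[i] is ("cpu", index)
    or ("gpu", index) for task i. Requires m >= 1 and k >= 1.
    """
    cpu_load = [0] * m
    gpu_load = [0] * k
    assignment = []
    for cpu_time, gpu_time in tasks:
        c = min(range(m), key=cpu_load.__getitem__)  # least loaded CPU
        g = min(range(k), key=gpu_load.__getitem__)  # least loaded GPU
        if cpu_load[c] + cpu_time <= gpu_load[g] + gpu_time:
            cpu_load[c] += cpu_time
            assignment.append(("cpu", c))
        else:
            gpu_load[g] += gpu_time
            assignment.append(("gpu", g))
    return max(cpu_load + gpu_load), assignment


# Three tasks on one CPU and one GPU.
makespan, plan = greedy_schedule([(4, 1), (2, 3), (6, 2)], m=1, k=1)
```

A greedy rule like this carries no optimality guarantee; because the problem is NP-hard, practical methods generally trade exactness for bounded approximation error.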
ISBN: 9783642246685 (print)
An emotional agent software architecture for real-time mobile robotic applications has been developed. To allow the agent to solve more dynamically constrained application problems, the processor computation time should be reduced and the time gained used for executing more complex processes. In this paper, the response time of the operating processes in each attention cycle of the agent is decreased by parallelizing the highly parallel processes of the architecture, namely the emotional contribution processes. The implementation of these processes has been evaluated on Field Programmable Gate Array (FPGA) and multicore processors.
ISBN: 3540664432 (print)
The development of skeleton tools constitutes an alternative for bridging the gap between current parallel architectures and sequential programmers. Their construction involves formal models, paradigms and methodologies. Based on automata theory, we have developed a formal model for parallel Dynamic Programming over pipeline networks. This model constitutes a paradigm which is the core of skeleton tools oriented to the Dynamic Programming technique. Following the methodology imposed by the model, we present a tool that provides the user with the ability to obtain parallel programs adapted to the parallel architecture. The efficiency is evaluated on three current parallel platforms: Cray T3E, IBM SP2 and SGI Origin 2000.