Multi-Swarm PSO (MPSO) is an extension of the PSO algorithm that incorporates multiple, collaborating swarms. Although embarrassingly parallel in appearance, MPSO is memory bound, introducing challenges for GPU-based ...
ISBN: 9781479901036 (print)
Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, parallelizing it efficiently and scalably can be challenging. This holds especially for platforms that contain one or more massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, use of global memory, and similar concerns. In this paper we compare two approaches to implementing general-purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then to search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but can support only a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when we use characters as input, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.
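The two strategies compared in this abstract can be illustrated with a small serial sketch. It is not the paper's CUDA or Thrust code: the function names, the equal-width half-open bins, and the chunk size standing in for a thread block are all assumptions made for illustration. The privatized version keeps one private counter array per chunk and merges them at the end; the sort-search version sorts the input and binary-searches each bin's upper edge.

```python
from bisect import bisect_left

def histogram_privatized(values, num_bins, lo, hi, chunk_size=4):
    """Serial emulation of privatization: one private counter array per chunk."""
    width = (hi - lo) / num_bins
    private = []
    for start in range(0, len(values), chunk_size):
        counts = [0] * num_bins          # private copy (shared memory on a GPU)
        for v in values[start:start + chunk_size]:
            if lo <= v < hi:
                counts[min(int((v - lo) / width), num_bins - 1)] += 1
        private.append(counts)
    # Merge step: sum the private copies into the final histogram.
    return [sum(c[b] for c in private) for b in range(num_bins)]

def histogram_sort_search(values, num_bins, lo, hi):
    """Sort the input, then binary-search the upper edge of every bin."""
    width = (hi - lo) / num_bins
    data = sorted(v for v in values if lo <= v < hi)
    counts, below_prev = [], 0
    for b in range(1, num_bins + 1):
        edge = hi if b == num_bins else lo + b * width
        below = bisect_left(data, edge)  # number of elements below this edge
        counts.append(below - below_prev)
        below_prev = below
    return counts

sample = [1, 2, 2, 3, 5, 7, 8, 8, 9]
print(histogram_privatized(sample, 5, 0, 10))
print(histogram_sort_search(sample, 5, 0, 10))
```

The sketch also hints at the trade-off the abstract reports: the privatized version needs one counter per bin in every private copy, so the number of bins is limited by the fast memory available, while the sort-search version only stores the sorted data and the bin edges.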
A high performance VLSI architecture for integer motion estimation (IME) in High Efficiency Video Coding (HEVC) is presented in this paper. It supports coding tree block (CTB) structure with the asymmetric motion part...
ISBN: 9783642368035 (print)
Several highly optimized implementations of Finite Difference schemes are discussed. The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning leads to efficient computational kernels in one to three spatial dimensions, with truncation errors of order two to twelve, and for isotropic and compact anisotropic stencils. The kernels are implemented on and tuned for several processor architectures: recent Intel Sandy Bridge, Ivy Bridge, and AMD Bulldozer CPU cores, all with AVX vector instructions; Nvidia Kepler and Fermi and AMD Southern and Northern Islands GPU architectures; and some older architectures for comparison. The kernels are based either on a cache-aware spatial loop or on time-slicing to compute several time steps at once. Furthermore, vector components can be independent, grouped in short vectors of SSE, AVX, or GPU warp size, or grouped in larger virtual vectors with explicit synchronization. The optimal choice of the algorithm and its parameters depends both on the Finite Difference stencil and on the processor architecture.
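Spatial loop tiling, one of the techniques named in this abstract, can be sketched as follows. This is a minimal illustration rather than the paper's kernels: the second-order five-point averaging stencil, the function names, and the tile size are assumptions. The tiled sweep visits the grid in small blocks so that neighbouring values stay in cache, and it must produce exactly the same result as the plain sweep.

```python
def stencil_step(u):
    """One sweep of a 2D five-point averaging stencil (boundaries kept fixed)."""
    n, m = len(u), len(u[0])
    out = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return out

def stencil_step_tiled(u, tile=4):
    """Same sweep, but the interior is traversed tile by tile."""
    n, m = len(u), len(u[0])
    out = [row[:] for row in u]
    for ii in range(1, n - 1, tile):          # loop over tile origins
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):   # loop inside one tile
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                        + u[i][j - 1] + u[i][j + 1])
    return out

# Hypothetical test grid: value depends on position so errors would show up.
grid = [[float((3 * i + 5 * j) % 7) for j in range(10)] for i in range(9)]
print(stencil_step(grid) == stencil_step_tiled(grid, tile=3))
```

Because each output point is computed with the same operands in the same order, tiling changes only the traversal order and therefore the memory access pattern; the tile size is the kind of parameter the abstract says must be tuned per architecture.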
ISBN: 9789898565716 (print)
The paper challenges the current state of the art accepted by the automotive industry. Present-day vehicles are unsophisticatedly over-engineered and, as a consequence, are uneconomic and hence unsustainable. Vehicles currently under development, however, offer tremendous opportunities for shifting from this position by including onboard active safety systems, e.g. collision avoidance. It is argued that future vehicles should be significantly lighter and should exploit the developing safety features to the full. Indeed, such a development would reduce the existing need for crashworthiness. The above arguments, coupled with parallel developments in smart materials, pave the way towards a new generation of actively controlled vehicle architecture design. Whilst the move to lighter vehicles, with onboard active safety systems and actively controlled structures, may be seen as controversial, there is a convincing case for a paradigm shift towards a truly sustainable transport future.
ISBN: 9781467344265; 9781467344258 (print)
In pursuit of bringing high-end applications to radio platforms, recent and evolving wireless standards impose stringent requirements: high throughput, error rate performance close to theoretical limits, and multi-mode transmission to use bandwidth efficiently in different channel conditions. The designer thus faces contradicting requirements. To achieve the required error rate performance, iterative (Turbo) processing (Turbo/LDPC decoding, Turbo demodulation, and Turbo equalization) is common implementation practice in baseband receivers. However, it creates a bottleneck in achieving the imposed throughput. In this scenario, a study of parallelism and of the resulting throughput gains at unchanged error rate convergence gives the designer concrete results for establishing a compromise among design constraints. In this paper, a three-level parallelism study is first presented for turbo decoding, turbo demodulation, and MIMO turbo equalization. To aid the designer in making decisions during the design, mathematical expressions for the throughput gain in a unified parallel turbo receiver are provided. Throughput gains for different system scenarios are computed by using system parameters and simulation results in the derived expressions.
ISBN: 9781479927012 (print)
Over the last decades, graphics processing units have developed from special-purpose graphics accelerators into general-purpose massively parallel co-processors. In recent years they have gained increased traction in high performance computing, as they provide superior computational performance in terms of runtime and energy consumption for a wide range of problems. In this survey, we review their employment in distributed computing for a broad range of application scenarios. Common characteristics and a classification of the most relevant use cases are described. Furthermore, we discuss possible future developments in the use of general-purpose graphics processing units in the area of service-oriented architecture. The aim of this work is to inspire future research in this field and to give guidelines on when and how to incorporate this new hardware technology.
ISBN: 9783642408199 (print)
The proceedings contain 34 papers. The topics discussed include: a virtual network embedding algorithm based on graph theory; access annotation for safe program parallelization; extracting threaded traces in simulation environments; a network-aware virtual machine allocation in cloud datacenter; Totoro: a scalable and fault-tolerant data center network by using backup port; a cloud resource allocation mechanism based on mean-variance optimization and double multi-attribution auction; a scheduling method for multiple virtual machines migration in cloud; speeding up Galois field arithmetic on Intel MIC architecture; software/hardware hybrid network-on-chip simulation on FPGA; total exchange routing on hierarchical dual-nets; efficiency of flexible rerouting scheme for maximizing logical arrays; conditional diagnosability of complete Josephus cubes; accelerating parallel frequent itemset mining on graphics processors with sorting; and asymmetry-aware scheduling in heterogeneous multi-core architectures.
ISBN: 9783642400476 (print)
Molecular dynamics simulations allow us to study the behavior of complex biomolecular systems. These simulations suffer from high computational complexity, which leads to simulation times of several weeks to recreate just a few microseconds of a molecule's motion, even on high-performance computing platforms. In recent years, state-of-the-art molecular dynamics algorithms have benefited from the parallel computing capabilities of multicore systems, as well as GPUs used as co-processors. In this paper we present a parallel molecular dynamics algorithm for on-board multi-GPU architectures. We parallelize a state-of-the-art molecular dynamics algorithm at two levels. We employ a spatial partitioning approach to simulate the dynamics of one portion of a molecular system on each GPU, and we take advantage of direct communication between GPUs to transfer data among portions. We also parallelize the simulation algorithm to exploit the multi-processor computing model of GPUs. Most importantly, we present novel parallel algorithms to update the spatial partitioning and set up transfer data packages on each GPU. We demonstrate the feasibility and scalability of our proposal through a comparative study with NAMD, a well-known parallel molecular dynamics implementation.
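The spatial partitioning idea described above can be sketched in a few lines. The abstract does not specify the partitioning scheme, so everything here is an assumption made for illustration: slabs along the x axis, a non-periodic box, a single interaction cutoff, and the hypothetical function name `partition_slabs`. Each device owns the atoms in its slab, and atoms within the cutoff of a slab face are listed in the transfer package for the neighbouring device.

```python
def partition_slabs(positions, num_devices, box_length, cutoff):
    """Assign atoms to x-axis slabs and build per-device transfer packages.

    Returns (owned, incoming): owned[d] lists atom indices simulated on
    device d; incoming[d] lists indices that neighbours must send to d.
    """
    slab = box_length / num_devices
    owned = [[] for _ in range(num_devices)]
    incoming = [[] for _ in range(num_devices)]
    for idx, (x, _y, _z) in enumerate(positions):
        d = min(int(x / slab), num_devices - 1)   # owning device
        owned[d].append(idx)
        # Close to the lower face: the left neighbour needs a copy.
        if d > 0 and x - d * slab < cutoff:
            incoming[d - 1].append(idx)
        # Close to the upper face: the right neighbour needs a copy.
        if d < num_devices - 1 and (d + 1) * slab - x < cutoff:
            incoming[d + 1].append(idx)
    return owned, incoming

# Hypothetical atom positions in a box of length 10 split across 2 devices.
atoms = [(0.5, 1.0, 1.0), (4.9, 2.0, 2.0), (5.2, 3.0, 3.0), (9.0, 4.0, 4.0)]
owned, incoming = partition_slabs(atoms, num_devices=2, box_length=10.0, cutoff=1.0)
print(owned)
print(incoming)
```

As atoms move, both lists change, which is why updating the partitioning and rebuilding the transfer packages on each GPU is singled out in the abstract as the key step to parallelize.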
ISBN: 9781467352000; 9781467351980 (print)
This research investigates the optimisation of static task mapping on a heterogeneous CPU/FPGA (Central Processing Unit/Field-Programmable Gate Array) computing system used to implement intimately coupled hardware and software models. Faced with obstacles such as the memory wall, the power wall, and real-time requirements, hardware designers are turning more and more towards reconfigurable computing. The use of heterogeneous CPU/FPGA systems is one of the most promising ways to increase performance. Indeed, in such systems, multi-core processors (CPU) provide high computation rates while the reconfigurable logic (FPGA) offers high performance and adaptability to the application's real-time constraints. However, heterogeneous computing systems present new challenges, and one of the most important issues is how to map the application tasks efficiently onto the available resources while considering real-time constraints. This work includes the development of two exact methods that focus on the static initial task mapping, for two different case studies. In the first case, the execution is considered preemptive and the problem of task mapping is treated in terms of workload on the heterogeneous system. In the second case, the execution is considered non-preemptive and the main objective is to minimize the makespan. In both case studies we consider communication constraints, since the application tasks are linked by precedence.
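The second case study (non-preemptive execution, makespan objective, precedence-linked tasks) can be illustrated with a brute-force sketch. The abstract does not describe the paper's exact methods, so this is not one of them: the resource names, the fixed topological execution order, the single constant communication delay between resources, and the example task data are all hypothetical. The search is exhaustive over mappings for that fixed order only.

```python
from itertools import product

RESOURCES = ("cpu", "fpga")

def makespan(mapping, order, exec_time, preds, comm_delay):
    """Finish time of the last task for one mapping, run non-preemptively."""
    free_at = {r: 0 for r in RESOURCES}   # when each resource becomes idle
    finish = {}
    for task in order:                    # order must be topological
        res = mapping[task]
        start = free_at[res]
        for p in preds.get(task, ()):
            # Data from a predecessor on the other resource arrives later.
            delay = comm_delay if mapping[p] != res else 0
            start = max(start, finish[p] + delay)
        finish[task] = start + exec_time[task][res]
        free_at[res] = finish[task]
    return max(finish.values())

def best_mapping(order, exec_time, preds, comm_delay):
    """Try every task-to-resource assignment and keep the shortest makespan."""
    best = None
    for choice in product(RESOURCES, repeat=len(order)):
        mapping = dict(zip(order, choice))
        span = makespan(mapping, order, exec_time, preds, comm_delay)
        if best is None or span < best[0]:
            best = (span, mapping)
    return best

# Hypothetical three-task graph: tasks a and b both precede c.
times = {"a": {"cpu": 4, "fpga": 2},
         "b": {"cpu": 3, "fpga": 5},
         "c": {"cpu": 2, "fpga": 1}}
span, mapping = best_mapping(["a", "b", "c"], times, {"c": ["a", "b"]}, comm_delay=1)
print(span, mapping)
```

Enumeration grows as 2^n in the number of tasks, so it is usable only for very small graphs; it serves here to make the objective and the communication constraint concrete.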