ISBN (print): 9781665442787
Graph sampling and random walk operations, which capture the structural properties of graphs, play an important role today because computing-intensive algorithms cannot be applied directly to large-scale graphs. Existing system frameworks for these tasks are not only spatially and temporally inefficient, but many also lead to biased results. This paper presents Skywalker, a high-throughput, quality-preserving random walk and sampling framework based on GPUs. Skywalker makes three key contributions. First, it takes the first step toward efficient biased sampling with the alias method on a GPU. Second, it introduces well-crafted load-balancing techniques to effectively utilize the massive parallelism of GPUs. Third, it accelerates alias table construction and reduces the GPU memory requirement with an efficient memory management scheme. We show that Skywalker greatly outperforms the state-of-the-art CPU-based and GPU-based baselines across a wide spectrum of workload scenarios.
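For readers unfamiliar with the alias method mentioned in this abstract, the following is a minimal sequential sketch in Python of Vose's variant, not Skywalker's GPU implementation. The function names `build_alias_table` and `sample_neighbor` and the edge-weight list are illustrative assumptions; the abstract does not describe the framework's API.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(n) construction enabling O(1) weighted sampling."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]   # average bucket height is 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]        # bucket s keeps its own mass ...
        alias[s] = l               # ... and is topped up by item l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:        # leftovers are full buckets (up to rounding)
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def sample_neighbor(neighbors, prob, alias, rng=random):
    """Draw one neighbor with probability proportional to its edge weight."""
    i = rng.randrange(len(neighbors))           # pick a bucket uniformly
    return neighbors[i] if rng.random() < prob[i] else neighbors[alias[i]]
```

In a biased random walk, one such table would be built per vertex over its outgoing edge weights, after which each step costs two random numbers and one table lookup regardless of vertex degree.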
ISBN (print): 3540664432
PAMIHR, a parallel adaptive routine for the approximate computation of a multidimensional integral over a hyperrectangular region, is described. The software is designed to run efficiently in a MIMD distributed-memory environment, and it is based on the widely used communication system BLACS. PAMIHR also gives special attention to the problems of scalability and of load balancing among the processes.
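The abstract does not specify PAMIHR's integration rule or how regions are distributed among processes. As a rough illustration of the generic globally adaptive strategy only, the sequential Python sketch below keeps a priority queue of boxes ordered by estimated error and repeatedly bisects the worst one. The midpoint rule, the error estimate, and all function names are assumptions chosen for brevity.

```python
import heapq
from itertools import count

def midpoint(f, lo, hi):
    """Midpoint-rule estimate of the integral of f over the box [lo, hi]."""
    vol = 1.0
    for a, b in zip(lo, hi):
        vol *= (b - a)
    return vol * f([(a + b) / 2.0 for a, b in zip(lo, hi)])

def halves(lo, hi, k):
    """Bisect the box along axis k."""
    mid = (lo[k] + hi[k]) / 2.0
    left_hi = list(hi); left_hi[k] = mid
    right_lo = list(lo); right_lo[k] = mid
    return (list(lo), left_hi), (right_lo, list(hi))

def estimate(f, lo, hi):
    """Return (value, error estimate, axis along which f varies most)."""
    coarse = midpoint(f, lo, hi)
    best_axis, best_diff, best_fine = 0, -1.0, coarse
    for k in range(len(lo)):
        (l1, h1), (l2, h2) = halves(lo, hi, k)
        fine = midpoint(f, l1, h1) + midpoint(f, l2, h2)
        diff = abs(fine - coarse)
        if diff > best_diff:
            best_axis, best_diff, best_fine = k, diff, fine
    return best_fine, best_diff, best_axis

def integrate(f, lo, hi, tol=1e-4, max_regions=20000):
    """Globally adaptive integration: always refine the worst box first."""
    tie = count()                               # tie-breaker for equal errors
    val, err, axis = estimate(f, lo, hi)
    heap = [(-err, next(tie), val, list(lo), list(hi), axis)]
    total, total_err = val, err
    while total_err > tol and len(heap) < max_regions:
        neg_err, _, val, rlo, rhi, axis = heapq.heappop(heap)
        total -= val
        total_err += neg_err                    # remove the parent's error
        for clo, chi in halves(rlo, rhi, axis):
            cval, cerr, caxis = estimate(f, clo, chi)
            total += cval
            total_err += cerr
            heapq.heappush(heap, (-cerr, next(tie), cval, clo, chi, caxis))
    return total, total_err
```

The error estimate here is heuristic and can understate the true error when the integrand varies along several axes at once. In a parallel setting, the load-balancing problem the abstract mentions arises because the boxes with large error tend to cluster in one part of the domain.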
ISBN (print): 9783540695004
In this paper we provide both a qualitative and a quantitative evaluation of a decoupled multithreaded architecture that uses non-blocking threads. Our architecture is based on simple in-order pipelines and complete decoupling of memory accesses from execution pipelines. We extend the architecture to support thread-level speculation using snooping cache coherency protocols. We evaluate the performance gains from speculation by varying the ratio of load/store instructions to computational instructions, the misspeculation rate, and the degree of thread-level speculation. Our architecture presents a viable alternative to complex superscalar and super-speculative CPUs.
ISBN (print): 9781509060580
In the last decade, different computing paradigms and modelling frameworks for the description and simulation of biochemical systems based on stochastic modelling have been proposed. From a computational point of view, many simulations of the model are necessary to identify the behaviour of the system. The execution of thousands of simulations can require a huge amount of time, so the parallelization of these algorithms is highly desirable. In particular, models that consider the size of the volumes and objects involved in the reaction are very time-consuming, since many rules must be considered to take into account the position of the different molecules. In this work we present an implementation of a stochastic space-aware simulator which exploits the benefits and features of hybrid low-power computing architectures. This work shows that the simulator's dynamic probabilistic approach to selecting possible chemical reactions can be applied and implemented on hybrid low-power, low-cost architectures as well as on current industry high-end servers.
ISBN (print): 0818691948
Emerging trends in computer design attempt to include specific solutions for handling images in general-purpose computers as well, because of the current spread of multimedia, image processing and computer graphics applications. In this context, this paper proposes hardware pre-fetching techniques specific to caching images. Our main observation is that most algorithms working on images exhibit a 2D spatial locality that is not taken into account in current cache organization and data access strategies. To this end we propose an adaptive local pre-fetching for the image data type; this technique, mirroring the two-dimensional spatial locality of image processing algorithms, proves more efficient than other approaches, such as sequential pre-fetching and adaptive pre-fetching. Performance is evaluated on different classes of image processing algorithms, namely raster-scan and propagative algorithms, which are common in computer vision and multimedia applications.
ISBN (print): 9783642131356
A parallel scheme for a distributed memory hierarchy system is presented to solve the large-scale three-dimensional heat equation. Since managing interprocess communication and coordination is the main difficulty with such a system, the local physics/global algebraic object paradigm is introduced. A domain decomposition method is used to partition the modeling area and, with it, the intensive computational effort and large memory requirement. Efficient storage and assembly of the sparse matrix and parallel iterative solution of the linear system are considered and developed. The efficiency and scalability of the parallel program are demonstrated by two experiments on a Linux cluster, in which different preconditioning methods are tested and analyzed. The results demonstrate that this method can achieve desirable parallel performance.
ISBN (print): 9781450362955
In this paper, we investigate the performance of Parallel Discrete Event Simulation (PDES) on a cluster of many-core Intel KNL processors. Specifically, we analyze the impact of different Global Virtual Time (GVT) algorithms in this environment and contribute three significant results. First, we show that it is essential to isolate the thread performing MPI communications from the task of processing simulation events; otherwise the simulation is significantly imbalanced and performs poorly. This applies to both synchronous and asynchronous GVT algorithms. Second, we demonstrate that a synchronous GVT algorithm based on barrier synchronization is a better choice for communication-dominated models, while asynchronous GVT based on Mattern's algorithm performs better for computation-dominated scenarios. Third, we propose the Controlled Asynchronous GVT (CA-GVT) algorithm, which selectively adds synchronization to Mattern-style GVT based on simulation conditions. We demonstrate that CA-GVT outperforms both barrier and Mattern's GVT and achieves about 8% performance improvement on mixed computation-communication models. This is a reasonable improvement for a simple modification to a GVT algorithm.
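As background on what a GVT algorithm computes, here is a minimal Python sketch of the quantity itself: the smallest timestamp that any worker could still process, taken over pending events and messages still in transit. The function names are hypothetical, and the sketch shows neither the paper's barrier implementation nor Mattern's algorithm nor CA-GVT.

```python
def synchronous_gvt(next_event_times, in_transit_times=()):
    """GVT estimate: minimum over each worker's next unprocessed event
    timestamp and the timestamps of messages not yet delivered."""
    candidates = list(next_event_times) + list(in_transit_times)
    return min(candidates) if candidates else float("inf")

def fossil_collect(processed_event_times, gvt):
    """Keep only history at or after GVT; anything earlier can never be
    rolled back, so its saved state may be reclaimed."""
    return [t for t in processed_event_times if t >= gvt]
```

A barrier-based scheme obtains a consistent snapshot by stopping all workers before taking the minimum, whereas an asynchronous scheme must account for in-transit messages without stopping them.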
ISBN (print): 0889865248
A method for describing and optimizing the continuous structure of a hierarchical processing system is presented. The structure of the system is defined as a finite sequence of density functions of distributions. Each distribution corresponds to the connections between a level and the previous level, and shows how the size of the previous level is distributed among the sizes of the current level. The corresponding optimization problem is a calculus of variations problem. Some reduced variants of this problem have good mathematical properties and are solved analytically. The approach is given in terms of convex analysis, integer programming and calculus of variations.
ISBN (print): 9783642131189
In this paper we report on recent progress in computing bivariate polynomial resultants on Graphics Processing Units (GPUs). Given two polynomials in Z[x, y], our algorithm first maps the polynomials to a prime field. Then, each modular image is processed individually. The GPU evaluates the polynomials at a number of points and computes univariate modular resultants in parallel. The remaining "combine" stage of the algorithm is executed sequentially on the host machine. Porting this stage to the graphics hardware is the subject of ongoing research. Our algorithm is based on an efficient modular arithmetic from [1]. With the theory of displacement structure we have been able to parallelize the resultant algorithm down to a very fine scale suitable for realization on the GPU. Our benchmarks show a substantial speed-up over a host-based resultant algorithm [2] from CGAL (***).
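The evaluate-then-interpolate structure described in this abstract can be sketched sequentially in Python for a single prime. This sketch swaps in a Sylvester-matrix determinant computed by Gaussian elimination where the paper uses displacement structure, runs the independent evaluations in a plain loop rather than on a GPU, and omits the combination across several primes. All function names and the coefficient layout are assumptions.

```python
def eval_poly(coeffs, y, p):
    """Horner evaluation of a low-degree-first coefficient list mod p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * y + c) % p
    return acc

def det_mod(mat, p):
    """Determinant over GF(p) by Gaussian elimination (p must be prime)."""
    n = len(mat)
    a = [row[:] for row in mat]
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] % p), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det = det * a[col][col] % p
        inv = pow(a[col][col], p - 2, p)        # Fermat inverse
        for r in range(col + 1, n):
            factor = a[r][col] * inv % p
            if factor:
                for c in range(col, n):
                    a[r][c] = (a[r][c] - factor * a[col][c]) % p
    return det % p

def sylvester(f, g):
    """Sylvester matrix of univariate f, g (low-degree-first, formal degrees)."""
    m, n = len(f) - 1, len(g) - 1
    rows = [[0] * i + f[::-1] + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + g[::-1] + [0] * (m - 1 - i) for i in range(m)]
    return rows

def interpolate(xs, ys, p):
    """Lagrange interpolation over GF(p); returns low-degree-first coefficients."""
    result = [0] * len(xs)
    for i in range(len(xs)):
        basis, denom = [1], 1
        for j in range(len(xs)):
            if j != i:                          # multiply basis by (y - xs[j])
                basis = [(lo - xs[j] * hi) % p
                         for lo, hi in zip([0] + basis, basis + [0])]
                denom = denom * (xs[i] - xs[j]) % p
        scale = ys[i] * pow(denom, p - 2, p) % p
        for k, b in enumerate(basis):
            result[k] = (result[k] + scale * b) % p
    return result

def resultant_mod_p(f, g, p):
    """Resultant with respect to x, mod prime p. f[i] is the coefficient of
    x**i, itself a low-degree-first list of integer coefficients in y."""
    m, n = len(f) - 1, len(g) - 1
    dy_f = max(len(c) for c in f) - 1
    dy_g = max(len(c) for c in g) - 1
    points = list(range(n * dy_f + m * dy_g + 1))   # requires p > degree bound
    values = [det_mod(sylvester([eval_poly(c, y0, p) for c in f],
                                [eval_poly(c, y0, p) for c in g]), p)
              for y0 in points]                     # independent per point
    return interpolate(points, values, p)
```

For example, with f = x^2 - y and g = x - 1 the resultant in x is 1 - y, so `resultant_mod_p([[0, -1], [0], [1]], [[-1], [1]], 101)` returns the coefficients of 1 - y reduced modulo 101. Using formal degrees in the Sylvester matrix means evaluation points where a leading coefficient vanishes need no special handling.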
ISBN (print): 9781479987481
A novel architecture for the H.264/AVC deblocking filter is proposed. It includes three filtering units to filter the luma and chroma components in parallel. Also, a proper two-dimensional filtering order for the luma edges and a one-dimensional filtering order for the chroma edges are presented. The architecture achieves 750 MHz and 560 MHz in 90 nm and 130 nm, respectively, and requires 76 cycles to filter each macroblock (MB). It outperforms existing architectures in frequency and throughput, and it is the only one that achieves over 60 fps at 8K-UHD resolution.
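The throughput figures in this abstract can be related by a back-of-the-envelope calculation, sketched here in Python. It assumes 16x16 macroblocks, an 8K-UHD frame of 7680x4320 pixels, and that the quoted 76 cycles per MB is sustained with no other per-frame overhead; none of these assumptions is stated in the abstract, and the function name and parameters are illustrative.

```python
def deblocking_fps(clock_hz, cycles_per_mb=76, width=7680, height=4320, mb_size=16):
    """Frames per second if every macroblock costs a fixed number of cycles."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)   # 480 * 270
    return clock_hz / (cycles_per_mb * mbs_per_frame)

fps_at_750_mhz = deblocking_fps(750e6)   # the higher of the two quoted clocks
fps_at_560_mhz = deblocking_fps(560e6)   # the lower of the two quoted clocks
```

Under these assumptions a 750 MHz clock gives roughly 76 frames per second and a 560 MHz clock roughly 57, so the 60 fps figure would correspond to the higher clock frequency.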