Heterogeneous accelerated processing units (APUs) integrate a multi-core CPU and a GPU on the same chip. Modern APUs implement CPU-GPU platform atomics for simple data types. However, ensuring atomicity for complex data types is a task delegated to programmers. Transactional memory (TM) is an optimistic approach to achieve this goal. With TM, shared data can be accessed speculatively by multiple computing threads, but changes become visible only if a transaction finishes without conflicting with the memory accesses of other transactions. In this paper we present APUTM, a software TM designed for APU processors that focuses on minimizing accesses to shared metadata. The main goal of APUTM is to understand the trade-offs of implementing a software TM on such a platform. In our experiments, APUTM outperforms sequential execution of the applications, and we also evaluate how well it adapts to running on either of the devices or on both simultaneously.
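The optimistic read-validate-commit cycle that TM builds on can be illustrated with a small sketch. The code below is a minimal word-based software transaction in C++, not the APUTM design: the names (VersionedCell, Transaction) and the single global commit lock are simplifying assumptions standing in for real per-object metadata.

```cpp
// Minimal sketch of an optimistic software transaction over plain ints.
// NOT the APUTM implementation; the single global commit mutex is a
// simplification of real per-object metadata.
#include <atomic>
#include <mutex>
#include <unordered_map>

struct VersionedCell {
    std::atomic<int>      value{0};
    std::atomic<unsigned> version{0};   // bumped on every committed write
};

class Transaction {
    std::unordered_map<VersionedCell*, unsigned> readSet;   // cell -> version seen
    std::unordered_map<VersionedCell*, int>      writeSet;  // cell -> pending value
public:
    int read(VersionedCell& c) {
        auto w = writeSet.find(&c);
        if (w != writeSet.end()) return w->second;           // read-your-own-write
        unsigned v = c.version.load();
        int val = c.value.load();
        readSet.emplace(&c, v);                               // remember version for validation
        return val;
    }
    void write(VersionedCell& c, int val) { writeSet[&c] = val; }

    bool commit() {
        static std::mutex commitLock;                         // simplification: one global lock
        std::lock_guard<std::mutex> g(commitLock);
        for (auto& [cell, ver] : readSet)                     // validate: nothing we read has changed
            if (cell->version.load() != ver) return false;    // conflict -> abort
        for (auto& [cell, val] : writeSet) {                  // publish writes atomically w.r.t. commits
            cell->value.store(val);
            cell->version.fetch_add(1);
        }
        return true;
    }
};
```

A caller would typically wrap the transaction body in a retry loop, re-executing it whenever commit() returns false.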
The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis, as it is a basic component of many methods used in these areas. In this work, we present a minimalistic parallel algorithm based on front propagation to compute approximate geodesic distances on meshes. Our method is practical and simple to implement, and does not require any heavy preprocessing. The convergence of our algorithm depends on the number of discrete level sets around the source points from which distance information propagates. To implement our method appropriately on GPUs while taking memory coalescence into account, we take advantage of a graph representation based on a breadth-first search traversal that works harmoniously with our parallel front propagation approach. We report experiments that show how our method scales with the size of the problem. We compare the mean error and processing time of our method with those of other methods. Our method produces results in competitive times with almost the same accuracy, especially for large meshes. We also demonstrate its use for solving two classical geometry processing problems: the regular sampling problem and the Voronoi tessellation on meshes. (C) 2019 Elsevier Ltd. All rights reserved.
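A rough sense of the front-propagation idea can be given by a short sequential sketch: distances are relaxed one discrete level set at a time over a graph with Euclidean edge lengths. The function below is an illustrative CPU version under that assumption; it is not the paper's coalesced GPU implementation, and the adjacency-list representation is assumed for simplicity.

```cpp
// Sequential sketch of front propagation over a mesh given as an adjacency
// list with edge lengths; distances are relaxed level set by level set.
#include <limits>
#include <vector>

struct Edge { int to; float length; };

std::vector<float> frontPropagation(const std::vector<std::vector<Edge>>& adj,
                                    const std::vector<int>& sources) {
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(adj.size(), INF);
    std::vector<int> front;
    for (int s : sources) { dist[s] = 0.0f; front.push_back(s); }

    while (!front.empty()) {                       // one iteration per discrete level set
        std::vector<int> next;
        for (int u : front)
            for (const Edge& e : adj[u]) {
                float cand = dist[u] + e.length;   // relax along the edge
                if (cand < dist[e.to]) {           // improved estimate -> vertex joins next front
                    dist[e.to] = cand;
                    next.push_back(e.to);
                }
            }
        front.swap(next);
    }
    return dist;
}
```

Each iteration of the outer loop corresponds to one discrete level set; a GPU version would process all vertices of the current front in parallel.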
Particle filter techniques are common methods used to estimate the evolving state of nonlinear, non-Gaussian time-variant systems by utilizing a periodic sequence of noisy measurements. The accuracy of particle filter methods has often been shown to be superior to other state estimation techniques, such as the extended Kalman filter (EKF), for many applications. Unfortunately, the high computational cost and highly nondeterministic runtime behavior of particle filters often preclude their use in hard real-time environments, where filter response must meet the strict timing requirements of the application. Particle filter algorithms are composed of three main stages: prediction, update, and resampling. General purpose graphics processing units (GPGPUs) have been successfully employed in previous research to accelerate the computation of both the prediction and update stages by exploiting their natural fine-grain parallelism. This research focuses on accelerating the resampling stage for GPGPU execution, which has been much more difficult to parallelize due to its apparently inherent sequentiality. This paper introduces a novel GPGPU implementation of the systematic and stratified resampling algorithms that exploits the monotonically increasing nature of the prefix sum and the evolutionary nature of the particle weighting process to allow the re-indexing portion of the algorithms to occur in a two-phase, multi-threaded manner. The measured performance improvement over the serial implementations was a factor of 15x for the systematic algorithm and 32x for the stratified algorithm.
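The property being exploited can be sketched concisely: because the inclusive prefix sum of the particle weights is monotonically increasing, each resampled index can be resolved independently, which is what makes a per-thread GPU mapping possible. The following CPU sketch of systematic resampling is an illustration under that assumption and is not the paper's two-phase GPGPU kernel.

```cpp
// Systematic resampling driven by an inclusive prefix sum of particle weights.
// Each output slot is resolved independently (here with a binary search),
// which is the property that enables a per-thread GPU mapping.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> systematicResample(const std::vector<double>& weights) {
    const int n = static_cast<int>(weights.size());
    std::vector<double> cdf(n);
    std::partial_sum(weights.begin(), weights.end(), cdf.begin());  // inclusive prefix sum
    const double total = cdf.back();

    std::mt19937 gen(std::random_device{}());
    const double u0 = std::uniform_real_distribution<double>(0.0, 1.0)(gen);

    std::vector<int> indices(n);
    for (int j = 0; j < n; ++j) {                                    // independent per output slot
        double u = (j + u0) * total / n;                             // systematic threshold
        int idx = static_cast<int>(
            std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin());
        indices[j] = std::min(idx, n - 1);                           // guard against round-off at the tail
    }
    return indices;
}
```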
It is often a challenge to keep input/output tasks and results in order for parallel computations over data streams, particularly when stateless task operators are replicated to increase parallelism in the presence of irregular tasks. Maintaining input/output order requires additional coding effort and may significantly impact the application's actual throughput. Thus, we propose a new implementation technique designed to be easily integrated with any of the existing C++ parallel programming frameworks that support stream parallelism. In this paper, it is first implemented and studied using SPar, our high-level domain-specific language for stream parallelism. We discuss the results of a set of experiments with real-world applications, revealing how significant performance improvements may be achieved when our proposed solution is integrated within SPar, especially for data compression applications. We also show the results of experiments performed after integrating our solution within FastFlow and TBB, revealing no significant overheads.
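The basic reordering mechanism such techniques rely on can be illustrated with a small sketch: items carry sequence numbers, replicated workers may complete them out of order, and a collector buffers early results until the next expected sequence number arrives. The class below is an illustrative C++ sketch under those assumptions, not the SPar/FastFlow/TBB integration the paper describes.

```cpp
// Reordering collector: results arriving out of order are held in a buffer
// keyed by sequence number and emitted only when the next expected sequence
// number becomes available.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Result { std::uint64_t seq; std::string payload; };

class OrderedCollector {
    std::uint64_t nextSeq = 0;
    std::map<std::uint64_t, std::string> pending;    // results that arrived early
public:
    // Returns the (possibly empty) run of results that can now be emitted in order.
    std::vector<std::string> push(Result r) {
        pending.emplace(r.seq, std::move(r.payload));
        std::vector<std::string> ready;
        for (auto it = pending.find(nextSeq); it != pending.end();
             it = pending.find(nextSeq)) {
            ready.push_back(std::move(it->second));  // emit the next in-order result
            pending.erase(it);
            ++nextSeq;
        }
        return ready;
    }
};
```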
Real-time data processing is one of the central processes of particle physics experiments, which require large computing resources. The LHCb (Large Hadron Collider beauty) experiment will be upgraded to cope with a particle-bunch collision rate of 30 million collisions per second, producing 10^9 particles/s. 40 Tbit/s need to be processed in real time to make the filtering decisions that determine which data are stored. This poses a computing challenge that requires exploring modern hardware and software solutions. We present Compass, a particle tracking algorithm and a parallel raw input decoding optimized for GPUs. It is designed for highly parallel architectures, is data-oriented, and is optimized for fast and localized data access. Our algorithm is configurable, and we explore the trade-off in computing and physics performance of various configurations. A CPU implementation that delivers the same physics performance as our GPU implementation is also presented. We discuss the achieved physics performance and validate it with Monte Carlo simulated data. We show a computing performance analysis comparing consumer and server-grade GPUs, and a CPU. We show the feasibility of using a full GPU decoding and particle tracking algorithm for high-throughput particle trajectory reconstruction, where our algorithm improves throughput by up to 7.4x compared to the LHCb baseline.
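One generic way to picture the "data-oriented, localized access" aspect is a structure-of-arrays (SoA) layout, where threads reading the same field touch contiguous memory. The sketch below is only an illustration of that general pattern; the type and functions are hypothetical and do not represent the Compass data layout.

```cpp
// Structure-of-arrays layout for detector hits: one contiguous array per
// field, so a pass over a single field maps well to coalesced GPU loads.
#include <cstdint>
#include <vector>

struct HitsSoA {
    std::vector<float>         x, y, z;    // coordinates stored contiguously
    std::vector<std::uint32_t> sensorId;

    void add(float px, float py, float pz, std::uint32_t id) {
        x.push_back(px); y.push_back(py); z.push_back(pz);
        sensorId.push_back(id);
    }
    std::size_t size() const { return x.size(); }
};

// Touches only the z array, i.e. one contiguous stream of memory.
float sumZ(const HitsSoA& hits) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < hits.size(); ++i) acc += hits.z[i];
    return acc;
}
```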
Programming correct parallel software in a cost-effective way is a challenging task requiring a high degree of expertise. In an attempt to overcome the pitfalls that undermine parallel programming, this paper proposes a pattern-based, formally grounded tool that eases writing parallel code by automatically generating platform-dependent programs from high-level, platform-independent specifications. The tool builds on three pillars: (1) a platform-agnostic parallel programming pattern, called PCR, (2) a formal translation of PCRs into a parallel execution model, namely Concurrent Collections (CnC), and (3) a program rewriting engine that generates code for a concrete runtime implementing CnC. The experimental evaluation gives evidence that code produced from PCRs can deliver performance comparable with handwritten code, but with assured correctness. The technical contribution of this paper is threefold. First, it discusses a parallel programming pattern, called PCR, consisting of producers, consumers, and reducers that operate concurrently on data sets. To favor correctness, the semantics of PCRs is mathematically defined in terms of the FXML formalism. PCRs are shown to be composable and to seamlessly subsume other well-known parallel programming patterns, thus providing a framework for heterogeneous designs. Second, it formally shows how the PCR pattern can be correctly implemented in terms of a more concrete parallel execution model. Third, it proposes a platform-agnostic C++ template library to express PCRs. It presents a prototype source-to-source compilation tool, based on C++ template rewriting, which automatically generates parallel implementations relying on the Intel CnC C++ library.
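The producer/consumer/reducer shape can be made concrete with a tiny sequential sketch: a producer emits items, consumers transform each item independently, and a reducer folds the partial results. The template and the sumOfSquares example below are illustrative assumptions, not the paper's PCR template library or its CnC-based code generation.

```cpp
// Sequential sketch of the PCR shape: produce -> consume each item -> reduce.
// The per-item consume step is independent, which is what a parallel runtime
// would exploit.
#include <functional>
#include <numeric>
#include <vector>

template <typename In, typename Mid, typename Out>
Out pcr(std::function<std::vector<In>()>    produce,
        std::function<Mid(const In&)>       consume,   // independent per item
        std::function<Out(Out, const Mid&)> reduce,
        Out init) {
    std::vector<In> items = produce();
    std::vector<Mid> partials;
    partials.reserve(items.size());
    for (const In& it : items) partials.push_back(consume(it));
    Out acc = init;
    for (const Mid& p : partials) acc = reduce(acc, p);
    return acc;
}

// Example: sum of squares of 1..10 expressed as a PCR.
int sumOfSquares() {
    return pcr<int, int, int>(
        [] { std::vector<int> v(10); std::iota(v.begin(), v.end(), 1); return v; },
        [](const int& x) { return x * x; },
        [](int acc, const int& x) { return acc + x; },
        0);
}
```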
As data become larger and more complex, they are increasingly processed in distributed systems implemented on clusters. Due to power consumption, cost, and differing price-performance, clusters are evolving into systems with heterogeneous hardware, which leads to performance differences among the nodes. Even in a homogeneous cluster, node performance differs because of resource competition and communication costs, and nodes with poor performance drag down the efficiency of the whole system. Existing parallel computing strategies such as the bulk synchronous parallel (BSP) and stale synchronous parallel (SSP) strategies are not well suited to this problem. To address it, we propose a free stale synchronous parallel (FSSP) strategy that frees the system from the negative impact of such nodes. FSSP extends the SSP strategy and can effectively and accurately identify slow nodes and eliminate their negative effects. We validated the performance of the FSSP strategy using classical machine learning algorithms and datasets. Our experimental results demonstrate that FSSP was 1.5-12x faster than the BSP and SSP strategies, and that it needed 4x fewer iterations than the asynchronous parallel strategy to converge.
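A minimal way to picture an SSP-style bound with straggler exclusion is sketched below: a worker may advance while it is within a staleness bound of the slowest worker that is still counted, and workers lagging beyond a drop threshold are treated as slow nodes and ignored by the bound. Both thresholds and the exclusion rule are assumptions made for illustration; they are not the FSSP criteria defined in the paper.

```cpp
// SSP-style progress check with straggler exclusion (illustrative only).
#include <algorithm>
#include <vector>

struct SspCoordinator {
    std::vector<int> clock;    // per-worker iteration counters
    int staleness;             // allowed gap between a worker and the slowest counted worker
    int dropThreshold;         // gap beyond which a worker is treated as a straggler

    bool mayAdvance(int worker) const {
        int fastest = *std::max_element(clock.begin(), clock.end());
        int slowestCounted = fastest;
        for (int c : clock)
            if (fastest - c <= dropThreshold)          // ignore excluded stragglers
                slowestCounted = std::min(slowestCounted, c);
        return clock[worker] - slowestCounted <= staleness;
    }

    void finishIteration(int worker) { ++clock[worker]; }
};
```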
This paper presents a study of the adaptation of a Non-Linear Iterative Partial Least Squares (NIPALS) algorithm, applied to hyperspectral imaging, to a Massively Parallel Processor Array (MPPA) manycore architecture that assembles 256 cores distributed over 16 clusters. This work aims at optimizing the internal communications of the platform to achieve real-time processing of large data volumes with limited computational resources and memory bandwidth. As hyperspectral images are composed of extensive volumes of spectral information, real-time requirements, which are upper-bounded by the image capture rate of the hyperspectral sensor, are a challenging objective. To address this issue, the image size is usually reduced prior to the processing phase, which is itself a computationally intensive task. Consequently, this paper proposes an analysis of the intrinsic parallelism and the data dependencies within the NIPALS algorithm and its subsequent implementation on a manycore architecture. Furthermore, this implementation has been validated using three hyperspectral images extracted from remote sensing and medical datasets. As a result, an average speedup of 17x has been achieved when compared to the sequential version. Finally, this approach has been compared with other state-of-the-art implementations, outperforming them in terms of performance.
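For reference, the scores/loadings alternation at the core of NIPALS can be sketched for a single component as below. This plain CPU version only illustrates the data dependencies the paper analyzes; it is not the MPPA implementation, and deflation for further components is omitted.

```cpp
// One-component NIPALS on a row-major matrix X (n samples x m variables):
// alternate loadings p = X^T t / (t^T t) (normalized) and scores t = X p
// until the scores stop changing.
#include <cmath>
#include <vector>

struct Component { std::vector<double> scores, loadings; };

Component nipalsOneComponent(const std::vector<std::vector<double>>& X,
                             int maxIter = 100, double tol = 1e-8) {
    const std::size_t n = X.size(), m = X[0].size();
    std::vector<double> t(n), p(m);
    for (std::size_t i = 0; i < n; ++i) t[i] = X[i][0];     // init scores with first column

    for (int it = 0; it < maxIter; ++it) {
        double tt = 0; for (double v : t) tt += v * v;
        for (std::size_t j = 0; j < m; ++j) {                // loadings: p = X^T t / (t^T t)
            double s = 0; for (std::size_t i = 0; i < n; ++i) s += X[i][j] * t[i];
            p[j] = s / tt;
        }
        double pn = 0; for (double v : p) pn += v * v;        // normalize loadings
        pn = std::sqrt(pn);
        for (double& v : p) v /= pn;

        std::vector<double> tNew(n);
        double diff = 0;
        for (std::size_t i = 0; i < n; ++i) {                 // scores: t = X p
            double s = 0; for (std::size_t j = 0; j < m; ++j) s += X[i][j] * p[j];
            tNew[i] = s;
            diff += (tNew[i] - t[i]) * (tNew[i] - t[i]);
        }
        t.swap(tNew);
        if (std::sqrt(diff) < tol) break;                     // converged
    }
    return {t, p};
}
```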
This paper is devoted to the performance evaluation of a hybrid computer cluster built on IBM POWER8 CPUs and NVIDIA Tesla P100 GPUs. The architecture of the computing system and the software used are described. Results of experiments carried out using the STREAM, NPB, Crossroads/NERSC-9 DGEMM, and HPL packages are discussed. The efficiency of the simultaneous multithreading (SMT) technology supported by POWER8 processors is analyzed, along with the performance of several compilers, parallel programming libraries, and mathematical libraries on this architecture.
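As an indication of what such bandwidth measurements exercise, the sketch below shows a STREAM-style "triad" loop (a[i] = b[i] + q * c[i]) parallelized with OpenMP. It is only an illustration of the access pattern; it is not the official STREAM benchmark code, and the function name is an assumption.

```cpp
// STREAM-style triad kernel: a bandwidth-bound loop, split across OpenMP threads.
#include <cstddef>
#include <vector>

double triad(std::vector<double>& a, const std::vector<double>& b,
             const std::vector<double>& c, double q) {
    const std::size_t n = a.size();
    #pragma omp parallel for                  // each thread streams a contiguous chunk
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];
    return a[0];                              // keep the result observable
}
```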
In this paper, an enhanced visual place recognition system is proposed, aiming to improve the localization performance of a mobile platform. Our technique takes full advantage of the continuous input image stream in order to provide additional knowledge to the matching functionality. The well-established Bag-of-Visual-Words model is adapted into a hierarchical design that incorporates the visual information of the full natural scene into the description, while additionally preserving the geometric structure of the explored world. Our approach is evaluated as part of a state-of-the-art Simultaneous Localization and Mapping algorithm, and parallelization techniques are exploited, utilizing every available hardware module in a low-power device. The implemented algorithm has been tested on several publicly available datasets, offering consistently accurate localization results and avoiding the majority of the redundant computations that the additional geometric verifications can induce. (C) 2019 Elsevier B.V. All rights reserved.
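The baseline matching step that a Bag-of-Visual-Words pipeline performs can be sketched briefly: each image is summarized as a normalized histogram of visual-word occurrences, and the query is matched to the database image with the highest histogram similarity. The code below is such a baseline sketch; the hierarchical design and geometric verification described in the paper are not modeled, and the similarity measure (histogram intersection) is an assumption.

```cpp
// Baseline BoVW matching: compare a query histogram against a database of
// histograms using histogram intersection and return the best match.
#include <algorithm>
#include <cstddef>
#include <vector>

// Histogram intersection similarity between two normalized BoVW histograms.
double similarity(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += std::min(a[i], b[i]);
    return s;
}

// Index of the best-matching database image for the query.
std::size_t bestMatch(const std::vector<double>& query,
                      const std::vector<std::vector<double>>& database) {
    std::size_t best = 0;
    double bestScore = -1.0;
    for (std::size_t i = 0; i < database.size(); ++i) {
        double s = similarity(query, database[i]);
        if (s > bestScore) { bestScore = s; best = i; }
    }
    return best;
}
```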