Heterogeneous accelerated processing units (APUs) integrate a multi-core CPU and a GPU on the same chip. Modern APUs implement CPU-GPU platform atomics for simple data types. However, ensuring atomicity for complex data types is a task delegated to programmers. Transactional memory (TM) is an optimistic approach to achieve this goal. With TM, shared data can be accessed speculatively by multiple computing threads, but changes become visible only if a transaction finishes without conflicting with the memory accesses of other transactions. In this paper we present APUTM, a software TM designed for APU processors that focuses on minimizing accesses to shared metadata. The main goal of APUTM is to understand the trade-offs of implementing a software TM on such a platform. In our experiments, APUTM outperforms sequential execution of the applications, and we also evaluate how well it adapts to running on either of the devices or on both simultaneously.
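The optimistic read-validate-commit cycle that TM builds on can be illustrated with a small sketch. The code below is a minimal word-based software transaction in C++, not the APUTM design: the names (VersionedCell, Transaction) and the single global commit lock are simplifying assumptions standing in for real per-object metadata.

```cpp
// Minimal sketch of an optimistic software transaction over plain ints.
// NOT the APUTM implementation; the single global commit mutex is a
// simplification of real per-object metadata.
#include <atomic>
#include <mutex>
#include <unordered_map>

struct VersionedCell {
    std::atomic<int>      value{0};
    std::atomic<unsigned> version{0};   // bumped on every committed write
};

class Transaction {
    std::unordered_map<VersionedCell*, unsigned> readSet;   // cell -> version seen
    std::unordered_map<VersionedCell*, int>      writeSet;  // cell -> pending value
public:
    int read(VersionedCell& c) {
        auto w = writeSet.find(&c);
        if (w != writeSet.end()) return w->second;           // read-your-own-write
        unsigned v = c.version.load();
        int val = c.value.load();
        readSet.emplace(&c, v);                               // remember version for validation
        return val;
    }
    void write(VersionedCell& c, int val) { writeSet[&c] = val; }

    bool commit() {
        static std::mutex commitLock;                         // simplification: one global lock
        std::lock_guard<std::mutex> g(commitLock);
        for (auto& [cell, ver] : readSet)                     // validate: nothing we read has changed
            if (cell->version.load() != ver) return false;    // conflict -> abort
        for (auto& [cell, val] : writeSet) {                  // publish writes atomically w.r.t. commits
            cell->value.store(val);
            cell->version.fetch_add(1);
        }
        return true;
    }
};
```

A caller would typically wrap the transaction body in a retry loop, re-executing it whenever commit() returns false.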
The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis, as it is a basic component of many methods used in these areas. In this work, we present a minimalistic parallel algorithm based on front propagation to compute approximate geodesic distances on meshes. Our method is practical and simple to implement, and does not require any heavy preprocessing. The convergence of our algorithm depends on the number of discrete level sets around the source points from which distance information propagates. To implement our method appropriately on GPUs while taking memory coalescence into account, we take advantage of a graph representation based on a breadth-first search traversal that works harmoniously with our parallel front propagation approach. We report experiments that show how our method scales with the size of the problem. We compare the mean error and processing time of our method with those of other methods. Our method produces results in competitive times with almost the same accuracy, especially for large meshes. We also demonstrate its use for solving two classical geometry processing problems: the regular sampling problem and the Voronoi tessellation on meshes. (C) 2019 Elsevier Ltd. All rights reserved.
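A rough sense of the front-propagation idea can be given by a short sequential sketch: distances are relaxed one discrete level set at a time over a graph with Euclidean edge lengths. The function below is an illustrative CPU version under that assumption; it is not the paper's coalesced GPU implementation, and the adjacency-list representation is assumed for simplicity.

```cpp
// Sequential sketch of front propagation over a mesh given as an adjacency
// list with edge lengths; distances are relaxed level set by level set.
#include <limits>
#include <vector>

struct Edge { int to; float length; };

std::vector<float> frontPropagation(const std::vector<std::vector<Edge>>& adj,
                                    const std::vector<int>& sources) {
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(adj.size(), INF);
    std::vector<int> front;
    for (int s : sources) { dist[s] = 0.0f; front.push_back(s); }

    while (!front.empty()) {                       // one iteration per discrete level set
        std::vector<int> next;
        for (int u : front)
            for (const Edge& e : adj[u]) {
                float cand = dist[u] + e.length;   // relax along the edge
                if (cand < dist[e.to]) {           // improved estimate -> vertex joins next front
                    dist[e.to] = cand;
                    next.push_back(e.to);
                }
            }
        front.swap(next);
    }
    return dist;
}
```

Each iteration of the outer loop corresponds to one discrete level set; a GPU version would process all vertices of the current front in parallel.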
Particle filter techniques are common methods used to estimate the evolving state of nonlinear, non-Gaussian time-variant systems by utilizing a periodic sequence of noisy measurements. The accuracy of particle filter methods has often been shown to be superior to other state estimation techniques, such as the extended Kalman filter (EKF), for many applications. Unfortunately, the high computational cost and highly nondeterministic runtime behavior of particle filters often preclude their use in hard real-time environments, where filter response must meet the strict timing requirements of the application. Particle filter algorithms are composed of three main stages: prediction, update, and resampling. General purpose graphics processing units (GPGPUs) have been successfully employed in previous research to accelerate the computation of both the prediction and update stages by exploiting their natural fine-grain parallelism. This research focuses on accelerating the resampling stage for GPGPU execution, which has been much more difficult to parallelize due to its apparently inherent sequentiality. This paper introduces a novel GPGPU implementation of the systematic and stratified resampling algorithms that exploits the monotonically increasing nature of the prefix sum and the evolutionary nature of the particle weighting process to allow the re-indexing portion of the algorithms to occur in a two-phase, multi-threaded manner. The measured performance improvement over the serial implementations was a factor of 15x for the systematic algorithm and 32x for the stratified algorithm.
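The property being exploited can be sketched concisely: because the inclusive prefix sum of the particle weights is monotonically increasing, each resampled index can be resolved independently, which is what makes a per-thread GPU mapping possible. The following CPU sketch of systematic resampling is an illustration under that assumption and is not the paper's two-phase GPGPU kernel.

```cpp
// Systematic resampling driven by an inclusive prefix sum of particle weights.
// Each output slot is resolved independently (here with a binary search),
// which is the property that enables a per-thread GPU mapping.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> systematicResample(const std::vector<double>& weights) {
    const int n = static_cast<int>(weights.size());
    std::vector<double> cdf(n);
    std::partial_sum(weights.begin(), weights.end(), cdf.begin());  // inclusive prefix sum
    const double total = cdf.back();

    std::mt19937 gen(std::random_device{}());
    const double u0 = std::uniform_real_distribution<double>(0.0, 1.0)(gen);

    std::vector<int> indices(n);
    for (int j = 0; j < n; ++j) {                                    // independent per output slot
        double u = (j + u0) * total / n;                             // systematic threshold
        int idx = static_cast<int>(
            std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin());
        indices[j] = std::min(idx, n - 1);                           // guard against round-off at the tail
    }
    return indices;
}
```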
It is often a challenge to keep input/output tasks and results in order for parallel computations over data streams, particularly when stateless task operators are replicated to increase parallelism in the presence of irregular tasks. Maintaining input/output order requires additional coding effort and may significantly impact the application's actual throughput. Thus, we propose a new implementation technique designed to be easily integrated with any of the existing C++ parallel programming frameworks that support stream parallelism. In this paper, it is first implemented and studied using SPar, our high-level domain-specific language for stream parallelism. We discuss the results of a set of experiments with real-world applications, revealing how significant performance improvements may be achieved when our proposed solution is integrated within SPar, especially for data compression applications. We also show the results of experiments performed after integrating our solution within FastFlow and TBB, revealing no significant overheads.
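The basic reordering mechanism such techniques rely on can be illustrated with a small sketch: items carry sequence numbers, replicated workers may complete them out of order, and a collector buffers early results until the next expected sequence number arrives. The class below is an illustrative C++ sketch under those assumptions, not the SPar/FastFlow/TBB integration the paper describes.

```cpp
// Reordering collector: results arriving out of order are held in a buffer
// keyed by sequence number and emitted only when the next expected sequence
// number becomes available.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Result { std::uint64_t seq; std::string payload; };

class OrderedCollector {
    std::uint64_t nextSeq = 0;
    std::map<std::uint64_t, std::string> pending;    // results that arrived early
public:
    // Returns the (possibly empty) run of results that can now be emitted in order.
    std::vector<std::string> push(Result r) {
        pending.emplace(r.seq, std::move(r.payload));
        std::vector<std::string> ready;
        for (auto it = pending.find(nextSeq); it != pending.end();
             it = pending.find(nextSeq)) {
            ready.push_back(std::move(it->second));  // emit the next in-order result
            pending.erase(it);
            ++nextSeq;
        }
        return ready;
    }
};
```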
Real-time data processing is one of the central processes of particle physics experiments, which require large computing resources. The LHCb (Large Hadron Collider beauty) experiment will be upgraded to cope with a particle-bunch collision rate of 30 million collisions per second, producing 10^9 particles/s. 40 Tbit/s need to be processed in real time to make the filtering decisions that determine which data are stored. This poses a computing challenge that requires exploring modern hardware and software solutions. We present Compass, a particle tracking algorithm and a parallel raw input decoding optimized for GPUs. It is designed for highly parallel architectures, is data-oriented, and is optimized for fast and localized data access. Our algorithm is configurable, and we explore the trade-off in computing and physics performance of various configurations. A CPU implementation that delivers the same physics performance as our GPU implementation is also presented. We discuss the achieved physics performance and validate it with Monte Carlo simulated data. We show a computing performance analysis comparing consumer and server-grade GPUs, and a CPU. We show the feasibility of using a full GPU decoding and particle tracking algorithm for high-throughput particle trajectory reconstruction, where our algorithm improves throughput by up to 7.4x compared to the LHCb baseline.
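One generic way to picture the "data-oriented, localized access" aspect is a structure-of-arrays (SoA) layout, where threads reading the same field touch contiguous memory. The sketch below is only an illustration of that general pattern; the type and functions are hypothetical and do not represent the Compass data layout.

```cpp
// Structure-of-arrays layout for detector hits: one contiguous array per
// field, so a pass over a single field maps well to coalesced GPU loads.
#include <cstdint>
#include <vector>

struct HitsSoA {
    std::vector<float>         x, y, z;    // coordinates stored contiguously
    std::vector<std::uint32_t> sensorId;

    void add(float px, float py, float pz, std::uint32_t id) {
        x.push_back(px); y.push_back(py); z.push_back(pz);
        sensorId.push_back(id);
    }
    std::size_t size() const { return x.size(); }
};

// Touches only the z array, i.e. one contiguous stream of memory.
float sumZ(const HitsSoA& hits) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < hits.size(); ++i) acc += hits.z[i];
    return acc;
}
```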
Programming correct parallel software in a cost-effective way is a challenging task requiring a high degree of expertise. In an attempt to overcome the pitfalls that undermine parallel programming, this paper proposes a pattern-based, formally grounded tool that eases writing parallel code by automatically generating platform-dependent programs from high-level, platform-independent specifications. The tool builds on three pillars: (1) a platform-agnostic parallel programming pattern, called PCR, (2) a formal translation of PCRs into a parallel execution model, namely Concurrent Collections (CnC), and (3) a program rewriting engine that generates code for a concrete runtime implementing CnC. The experimental evaluation gives evidence that code produced from PCRs can deliver performance comparable with handwritten code, but with assured correctness. The technical contribution of this paper is threefold. First, it discusses a parallel programming pattern, called PCR, consisting of producers, consumers, and reducers that operate concurrently on data sets. To favor correctness, the semantics of PCRs is mathematically defined in terms of the FXML formalism. PCRs are shown to be composable and to seamlessly subsume other well-known parallel programming patterns, thus providing a framework for heterogeneous designs. Second, it formally shows how the PCR pattern can be correctly implemented in terms of a more concrete parallel execution model. Third, it proposes a platform-agnostic C++ template library to express PCRs. It presents a prototype source-to-source compilation tool, based on C++ template rewriting, which automatically generates parallel implementations relying on the Intel CnC C++ library.
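The producer/consumer/reducer shape can be made concrete with a tiny sequential sketch: a producer emits items, consumers transform each item independently, and a reducer folds the partial results. The template and the sumOfSquares example below are illustrative assumptions, not the paper's PCR template library or its CnC-based code generation.

```cpp
// Sequential sketch of the PCR shape: produce -> consume each item -> reduce.
// The per-item consume step is independent, which is what a parallel runtime
// would exploit.
#include <functional>
#include <numeric>
#include <vector>

template <typename In, typename Mid, typename Out>
Out pcr(std::function<std::vector<In>()>    produce,
        std::function<Mid(const In&)>       consume,   // independent per item
        std::function<Out(Out, const Mid&)> reduce,
        Out init) {
    std::vector<In> items = produce();
    std::vector<Mid> partials;
    partials.reserve(items.size());
    for (const In& it : items) partials.push_back(consume(it));
    Out acc = init;
    for (const Mid& p : partials) acc = reduce(acc, p);
    return acc;
}

// Example: sum of squares of 1..10 expressed as a PCR.
int sumOfSquares() {
    return pcr<int, int, int>(
        [] { std::vector<int> v(10); std::iota(v.begin(), v.end(), 1); return v; },
        [](const int& x) { return x * x; },
        [](int acc, const int& x) { return acc + x; },
        0);
}
```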
As data become larger and more complex, they are increasingly processed in distributed systems implemented on clusters. Due to power consumption, cost, and differing price-performance, clusters are evolving into systems with heterogeneous hardware, which leads to performance differences among the nodes. Even in a homogeneous cluster, node performance differs because of resource competition and communication costs, and nodes with poor performance drag down the efficiency of the whole system. Existing parallel computing strategies such as the bulk synchronous parallel (BSP) and stale synchronous parallel (SSP) strategies are not well suited to this problem. To address it, we propose a free stale synchronous parallel (FSSP) strategy that frees the system from the negative impact of such nodes. FSSP extends the SSP strategy and can effectively and accurately identify slow nodes and eliminate their negative effects. We validated the performance of the FSSP strategy using classical machine learning algorithms and datasets. Our experimental results demonstrate that FSSP was 1.5-12x faster than the BSP and SSP strategies, and that it needed 4x fewer iterations than the asynchronous parallel strategy to converge.
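A minimal way to picture an SSP-style bound with straggler exclusion is sketched below: a worker may advance while it is within a staleness bound of the slowest worker that is still counted, and workers lagging beyond a drop threshold are treated as slow nodes and ignored by the bound. Both thresholds and the exclusion rule are assumptions made for illustration; they are not the FSSP criteria defined in the paper.

```cpp
// SSP-style progress check with straggler exclusion (illustrative only).
#include <algorithm>
#include <vector>

struct SspCoordinator {
    std::vector<int> clock;    // per-worker iteration counters
    int staleness;             // allowed gap between a worker and the slowest counted worker
    int dropThreshold;         // gap beyond which a worker is treated as a straggler

    bool mayAdvance(int worker) const {
        int fastest = *std::max_element(clock.begin(), clock.end());
        int slowestCounted = fastest;
        for (int c : clock)
            if (fastest - c <= dropThreshold)          // ignore excluded stragglers
                slowestCounted = std::min(slowestCounted, c);
        return clock[worker] - slowestCounted <= staleness;
    }

    void finishIteration(int worker) { ++clock[worker]; }
};
```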
This paper presents a study of the adaptation of a Non-Linear Iterative Partial Least Squares (NIPALS) algorithm, applied to hyperspectral imaging, to a Massively Parallel Processor Array (MPPA) manycore architecture that assembles 256 cores distributed over 16 clusters. This work aims at optimizing the internal communications of the platform to achieve real-time processing of large data volumes with limited computational resources and memory bandwidth. As hyperspectral images are composed of extensive volumes of spectral information, real-time requirements, which are upper-bounded by the image capture rate of the hyperspectral sensor, are a challenging objective. To address this issue, the image size is usually reduced prior to the processing phase, which is itself a computationally intensive task. Consequently, this paper proposes an analysis of the intrinsic parallelism and the data dependencies within the NIPALS algorithm and its subsequent implementation on a manycore architecture. Furthermore, this implementation has been validated using three hyperspectral images extracted from remote sensing and medical datasets. As a result, an average speedup of 17x has been achieved when compared to the sequential version. Finally, this approach has been compared with other state-of-the-art implementations, outperforming them in terms of performance.
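For reference, the scores/loadings alternation at the core of NIPALS can be sketched for a single component as below. This plain CPU version only illustrates the data dependencies the paper analyzes; it is not the MPPA implementation, and deflation for further components is omitted.

```cpp
// One-component NIPALS on a row-major matrix X (n samples x m variables):
// alternate loadings p = X^T t / (t^T t) (normalized) and scores t = X p
// until the scores stop changing.
#include <cmath>
#include <vector>

struct Component { std::vector<double> scores, loadings; };

Component nipalsOneComponent(const std::vector<std::vector<double>>& X,
                             int maxIter = 100, double tol = 1e-8) {
    const std::size_t n = X.size(), m = X[0].size();
    std::vector<double> t(n), p(m);
    for (std::size_t i = 0; i < n; ++i) t[i] = X[i][0];     // init scores with first column

    for (int it = 0; it < maxIter; ++it) {
        double tt = 0; for (double v : t) tt += v * v;
        for (std::size_t j = 0; j < m; ++j) {                // loadings: p = X^T t / (t^T t)
            double s = 0; for (std::size_t i = 0; i < n; ++i) s += X[i][j] * t[i];
            p[j] = s / tt;
        }
        double pn = 0; for (double v : p) pn += v * v;        // normalize loadings
        pn = std::sqrt(pn);
        for (double& v : p) v /= pn;

        std::vector<double> tNew(n);
        double diff = 0;
        for (std::size_t i = 0; i < n; ++i) {                 // scores: t = X p
            double s = 0; for (std::size_t j = 0; j < m; ++j) s += X[i][j] * p[j];
            tNew[i] = s;
            diff += (tNew[i] - t[i]) * (tNew[i] - t[i]);
        }
        t.swap(tNew);
        if (std::sqrt(diff) < tol) break;                     // converged
    }
    return {t, p};
}
```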
This paper is devoted to the performance evaluation of a hybrid computer cluster built on IBM POWER8 CPUs and NVIDIA Tesla P100 GPUs. The architecture of the computing system and the software used are described. Results of experiments carried out using the STREAM, NPB, Crossroads/NERSC-9 DGEMM, and HPL packages are discussed. The efficiency of the simultaneous multithreading (SMT) technology supported by POWER8 processors is analyzed, along with the performance of several compilers, parallel programming libraries, and mathematical libraries on this architecture.
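As an indication of what such bandwidth measurements exercise, the sketch below shows a STREAM-style "triad" loop (a[i] = b[i] + q * c[i]) parallelized with OpenMP. It is only an illustration of the access pattern; it is not the official STREAM benchmark code, and the function name is an assumption.

```cpp
// STREAM-style triad kernel: a bandwidth-bound loop, split across OpenMP threads.
#include <cstddef>
#include <vector>

double triad(std::vector<double>& a, const std::vector<double>& b,
             const std::vector<double>& c, double q) {
    const std::size_t n = a.size();
    #pragma omp parallel for                  // each thread streams a contiguous chunk
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];
    return a[0];                              // keep the result observable
}
```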
In this paper, an enhanced visual place recognition system is proposed, aiming to improve the localization performance of a mobile platform. Our technique takes full advantage of the continuous input image stream in order to provide additional knowledge to the matching functionality. The well-established Bag-of-Visual-Words model is adapted into a hierarchical design that incorporates the visual information of the full natural scene into the description, while additionally preserving the geometric structure of the explored world. Our approach is evaluated as part of a state-of-the-art Simultaneous Localization and Mapping algorithm, and parallelization techniques are exploited, utilizing every available hardware module in a low-power device. The implemented algorithm has been tested on several publicly available datasets, offering consistently accurate localization results and avoiding the majority of the redundant computations that the additional geometric verifications can induce. (C) 2019 Elsevier B.V. All rights reserved.
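The baseline matching step that a Bag-of-Visual-Words pipeline performs can be sketched briefly: each image is summarized as a normalized histogram of visual-word occurrences, and the query is matched to the database image with the highest histogram similarity. The code below is such a baseline sketch; the hierarchical design and geometric verification described in the paper are not modeled, and the similarity measure (histogram intersection) is an assumption.

```cpp
// Baseline BoVW matching: compare a query histogram against a database of
// histograms using histogram intersection and return the best match.
#include <algorithm>
#include <cstddef>
#include <vector>

// Histogram intersection similarity between two normalized BoVW histograms.
double similarity(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += std::min(a[i], b[i]);
    return s;
}

// Index of the best-matching database image for the query.
std::size_t bestMatch(const std::vector<double>& query,
                      const std::vector<std::vector<double>>& database) {
    std::size_t best = 0;
    double bestScore = -1.0;
    for (std::size_t i = 0; i < database.size(); ++i) {
        double s = similarity(query, database[i]);
        if (s > bestScore) { bestScore = s; best = i; }
    }
    return best;
}
```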