The use of synchronization mechanisms in multithreaded applications is essential on shared-memory multi-core architectures. However, debugging parallel applications to avoid potential failures, such as data races or deadlocks, can be challenging. Race detectors are key to spotting such concurrency bugs; nevertheless, when lock-free data structures are used, they may emit a significant number of false positives. In this paper, we present a framework for detecting semantic violations of lock-free data structures which makes use of contracts, a novel feature of the upcoming C++20, and a customized version of the ThreadSanitizer race detector. We evaluate the detection accuracy of the framework in terms of false positives and false negatives on synthetic benchmarks that exercise the SPSC and MPMC lock-free queue structures from the Boost C++ library. Thanks to this framework, we are able to check the correct use of lock-free data structures, thus reducing the number of false positives.
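As a concrete illustration of the setting, the sketch below shows the kind of single-producer/single-consumer usage such benchmarks exercise on Boost's spsc_queue. Since C++20 contract syntax was not yet available in shipping compilers, plain assertions stand in for the contract checks here, and the producer-role bookkeeping is our own illustrative assumption, not the paper's framework:

```cpp
// Minimal sketch, assuming Boost >= 1.53 (boost::lockfree). Assertions
// stand in for the contract checks the paper's framework would attach;
// the thread-role bookkeeping is our illustrative assumption.
#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <cassert>
#include <thread>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> q;
std::atomic<std::thread::id> producer_id{};  // first pusher claims the role

void checked_push(int v) {
    std::thread::id expected{}, self = std::this_thread::get_id();
    bool claimed = producer_id.compare_exchange_strong(expected, self);
    // Contract-style precondition: exactly one thread owns the producer role.
    assert(claimed || expected == self);
    (void)claimed;
    while (!q.push(v)) { /* queue full: spin; acceptable in a benchmark */ }
}

int main() {
    std::thread prod([] { for (int i = 0; i < 100; ++i) checked_push(i); });
    int v, got = 0;                       // single consumer: the main thread
    while (got < 100)
        if (q.pop(v)) { assert(v == got); ++got; }  // FIFO order must hold
    prod.join();
}
```

A second producer calling checked_push would trip the assertion, which is exactly the semantic violation (rather than a raw data race) that the framework is meant to flag.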
Ada 2022 includes parallel programming features that use lightweight logical threads of control on top of the heavier-weight Ada tasks. This talk will report on the work in progress to implement a work-stealing schedu...
Parallel programming can be difficult and error-prone, in particular if low-level optimizations are required in order to reach high performance in complex environments such as multi-core clusters using MPI and OpenMP. One approach to overcoming these issues is based on algorithmic skeletons: predefined patterns which are implemented in parallel and can be composed by application programmers without attending to low-level programming aspects. Support for algorithmic skeletons is typically provided as a library. However, optimizations are hard to implement in this setting, and programming can still be tedious because of the required boilerplate code. Thus, we propose a domain-specific language for algorithmic skeletons that performs optimizations and generates low-level C++ code. Our experimental results on four benchmarks show that the models are significantly shorter and that the execution time and speedup of the generated code often outperform equivalent library implementations using the Muenster Skeleton Library.
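For reference, the library-based approach that the DSL is compared against can be as small as the following map skeleton in plain C++17. This is our own sketch, not Muenster Skeleton Library code; the boilerplate the abstract refers to arises when user functions and data must be wrapped into such calls throughout a real application:

```cpp
// Illustrative map skeleton in plain C++17 (our sketch, not MSL code):
// the pattern is implemented once, in parallel, and the application
// programmer only supplies the element-wise function. On GCC, parallel
// execution policies require linking against TBB (-ltbb).
#include <algorithm>
#include <execution>
#include <vector>

template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(), f);
    return out;
}

int main() {
    std::vector<double> v(1 << 20, 1.5);
    auto squared = map_skeleton(v, [](double x) { return x * x; });
    return squared.empty();  // keep the result alive
}
```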
Bioinformatics is an interdisciplinary field that applies techniques from information technology, mathematics, and statistics to the study of large biological data. It involves several computational techniques such as sequence and structural alignment, data mining, macromolecular geometry, prediction of protein structure, and gene finding. Protein structure and sequence analysis are vital to the understanding of cellular processes, which in turn contributes to the development of drugs for metabolic pathways. Protein sequence alignment is concerned with identifying the similarities and relationships among different protein structures. In this paper, we target two well-known protein sequence alignment algorithms, the Needleman-Wunsch and Smith-Waterman algorithms. Both are computationally expensive, which hinders their applicability to large data sets. Thus, we propose a hybrid parallel approach that combines the capabilities of multi-core CPUs with the power of contemporary GPUs and significantly speeds up the execution of the target algorithms. The validity of our approach is tested on real protein sequences, and its scalability is verified on randomly generated sequences with predefined similarity levels. The results show that the proposed hybrid approach is up to 242 times faster than the sequential approach.
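For context, the Needleman-Wunsch recurrence that dominates the running time is the classic dynamic program sketched below. Cell (i, j) depends only on its three upper-left neighbours, so anti-diagonals of the table are independent, which is the property hybrid CPU/GPU implementations exploit. The C++ code and its scoring parameters are our illustration, not the paper's implementation:

```cpp
// Sequential Needleman-Wunsch scoring (illustrative parameters:
// match +1, mismatch -1, linear gap -2). Cell (i,j) depends only on
// (i-1,j-1), (i-1,j) and (i,j-1), so anti-diagonals can be computed
// in parallel.
#include <algorithm>
#include <string>
#include <vector>

int needleman_wunsch(const std::string& a, const std::string& b) {
    const int match = 1, mismatch = -1, gap = -2;
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1));
    for (size_t i = 0; i <= n; ++i) H[i][0] = static_cast<int>(i) * gap;
    for (size_t j = 0; j <= m; ++j) H[0][j] = static_cast<int>(j) * gap;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i][j] = std::max({H[i - 1][j - 1] + s,  // align a[i-1], b[j-1]
                                H[i - 1][j] + gap,    // gap in b
                                H[i][j - 1] + gap});  // gap in a
        }
    return H[n][m];  // global alignment score
}
```

Smith-Waterman differs mainly by clamping each cell at zero and tracking the table maximum, so the same parallelization strategy applies to both algorithms.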
Existing best-effort requester-wins implementations of transactional memory must resort to non-speculative execution to provide forward progress in the presence of transactions that exceed hardware capacity, experience page faults, or suffer contention high enough to cause livelock. Current approaches to irrevocability employ lock-based synchronization to achieve mutual exclusion when executing a transaction non-speculatively, conservatively precluding concurrency with any other transactions in order to guarantee atomicity, at the cost of degraded performance. In this article, we propose a new form of concurrent irrevocability whose goal is to minimize the loss of concurrency paid when transactions resort to irrevocability to complete. By enabling optimistic concurrency control during the non-speculative execution of a transaction as well, our proposal allows for higher parallelism than existing schemes. We describe the instruction set extensions that provide concurrent irrevocable transactions as well as the architectural extensions required to realize them on a best-effort HTM system without any modification to the cache coherence protocol. Our evaluation shows that our proposal achieves an average reduction of 12.5 percent in execution time across the STAMP benchmarks, rising to 15.8 percent on average for highly contended workloads.
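For contrast, the conventional lock-based irrevocability that the article improves upon looks roughly like the fallback path below on a best-effort HTM such as Intel TSX. This is our sketch of the baseline pattern, using the real RTM intrinsics; the article's proposal replaces the global mutual exclusion shown here with optimistic concurrency control:

```cpp
// Conventional single-global-lock fallback on a best-effort HTM (Intel
// RTM intrinsics; compile with -mrtm). Once a transaction gives up, it
// takes the lock and runs irrevocably, excluding ALL concurrent
// transactions: the concurrency loss the article's proposal targets.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_lock{false};

template <typename F>
void atomic_region(F body, int max_retries = 3) {
    for (int i = 0; i < max_retries; ++i) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock.load())   // subscribe to the lock so a
                _xabort(0xff);          // non-speculative writer aborts us
            body();
            _xend();
            return;
        }
        // Aborted: capacity overflow, conflict, page fault... retry.
    }
    // Irrevocable path: mutual exclusion with every other transaction.
    while (fallback_lock.exchange(true)) { /* spin until acquired */ }
    body();
    fallback_lock.store(false);
}
```

Because every speculative transaction reads fallback_lock into its read set, a thread entering the irrevocable path aborts all in-flight transactions, which is precisely the conservatism the proposed concurrent irrevocability relaxes.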
Computationally intensive deep neural networks (DNNs) are well-suited to run on GPUs, but newly developed algorithms usually require heavily optimized DNN routines to work efficiently, and the problem can be even harder for specialized DNN architectures. In this article, we propose a mathematical formulation that can be used to transfer algorithm-optimization knowledge across computing platforms. We observe that data movement and storage inside parallel processor architectures can be viewed as tensor transforms across memory hierarchies, making it possible to describe many memory optimization techniques mathematically. This transform, which we call the memory-efficient ranged inner-product tensor (MERIT) transform, applies not only to DNN tasks but also to many traditional machine learning and computer vision computations. Moreover, the tensor transforms can be readily mapped to existing vector processor architectures. We demonstrate that many popular applications can be converted to a succinct MERIT notation on GPUs, speeding up GPU kernels by up to 20 times while using only half as many code tokens. We also use the principle of the proposed transform to design a specialized hardware unit called the MERIT-z processor, which can be applied to a variety of DNN tasks as well as other computer vision tasks while providing area and power efficiency comparable to dedicated DNN application-specific integrated circuits (ASICs).
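The central observation, that moving data across memory hierarchies is itself a tensor transform, is easiest to see in the classical im2col lowering, where convolution becomes plain inner products over a rearranged tensor. The sketch below is our example of that general idea, not the paper's MERIT notation:

```cpp
// Classical im2col lowering (our example of a "memory transform", not
// the MERIT notation itself): each K x K sliding window of the H x W
// input becomes one row, after which convolution is a row-wise inner
// product with the flattened filter, i.e. a plain GEMM.
#include <vector>

std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
    int outH = H - K + 1, outW = W - K + 1;  // "valid" output extent
    std::vector<float> cols(static_cast<size_t>(outH) * outW * K * K);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[((static_cast<size_t>(y) * outW + x) * K + ky) * K + kx] =
                        img[static_cast<size_t>(y + ky) * W + (x + kx)];
    return cols;  // shape: (outH*outW) x (K*K), ready for a GEMM
}
```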
A wealth of important scientific and engineering applications are configured for use on high performance computing architectures using functionality found in the MPI specification. This specification provides application developers with a straightforward means of implementing their ideas for execution on distributed-memory parallel processing computers, while OpenMP directives provide a means of operating on the shared-memory regions of those computers. With the advent of machines composed of many-core processors, the strict synchronisation required by the bulk synchronous parallel (BSP) communication model can hinder performance increases. This is due to the complexity of handling load imbalances, reducing the serialisation imposed by blocking communication patterns, overlapping communication with computation and, finally, dealing with increasing memory overheads. The MPI specification provides advanced features such as non-blocking calls or shared memory to mitigate some of these factors; however, applying these features efficiently usually requires significant changes to the application structure. Task parallel programming models are being developed as a means of mitigating the abovementioned issues without requiring extensive changes to the application code. In this work, we present a methodology for developing hybrid applications based on tasks, called hierarchical domain overdecomposition with tasking (HDOT), which overcomes most of the issues found in MPI-only and traditional hybrid MPI+OpenMP applications. By emphasising the reuse of data partition schemes from the process level and applying them at the task level, it enables a natural coexistence between MPI and shared-memory programming models. The proposed methodology shows promising results in terms of programmability and performance, measured on a set of applications.
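A minimal shape of the overdecomposition pattern, under our own simplifying assumptions (a 1-D domain, one task per block, an MPI library providing MPI_THREAD_MULTIPLE), might look as follows; it is an illustration of the idea, not HDOT code:

```cpp
// Hypothetical sketch (our illustration, not HDOT itself): each MPI rank
// reuses its process-level partition to carve the local domain into
// task-sized blocks; communication is just another task, ordered against
// the blocks it touches via dependency sentinels. Build with e.g.
// `mpicxx -fopenmp`.
#include <mpi.h>
#include <vector>

void compute_block(std::vector<double>& u, int first, int last) {
    for (int i = first; i < last; ++i) u[i] = 0.5 * (u[i] + 1.0);  // stand-in kernel
}

int main(int argc, char** argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20, nblocks = 16, bs = n / nblocks;  // assumes nblocks | n
    std::vector<double> u(n, 1.0);  // this rank's subdomain (process level)

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; ++b) {  // task-level overdecomposition
            #pragma omp task depend(inout: u[b * bs]) firstprivate(b) shared(u)
            compute_block(u, b * bs, (b + 1) * bs);
        }
        if (size > 1) {  // ring shift: send last element up, receive into first
            int next = (rank + 1) % size, prev = (rank - 1 + size) % size;
            #pragma omp task depend(in: u[(nblocks - 1) * bs]) \
                             depend(inout: u[0]) shared(u) firstprivate(next, prev)
            MPI_Sendrecv(&u[n - 1], 1, MPI_DOUBLE, next, 0,
                         &u[0],     1, MPI_DOUBLE, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }  // implicit barrier: every task has completed here
    MPI_Finalize();
}
```

The point of the pattern is visible in the dependency clauses: interior blocks run concurrently with the communication task, which only waits for the two boundary blocks it actually touches.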
The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is one of the most prominent such methods. NMF is based on compute-intensive non-convex constrained minimization which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF fail to estimate the number of latent features. In practice, identifying these features is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. In this paper, we introduce a distributed NMF algorithm coupled with distributed custom clustering and a stability analysis on dense data, which we call DnMFk, to determine the number of latent variables. Results on synthetic data and the classical Swimmer data set demonstrate the accuracy of model determination while scaling nearly linearly across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte matrix.
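In symbols, the underlying problem is the standard NMF formulation below, shown with the classical Lee-Seung multiplicative updates as one common solver choice (not necessarily the solver DnMFk uses internally). The model-selection question is which rank k to pick; per the abstract, DnMFk answers it by clustering factors from many runs and measuring their stability:

```latex
% Standard NMF objective; k is the unknown number of latent features.
\min_{W \ge 0,\; H \ge 0} \; \lVert A - W H \rVert_F^2,
\qquad A \in \mathbb{R}^{m \times n}_{\ge 0},\;
       W \in \mathbb{R}^{m \times k}_{\ge 0},\;
       H \in \mathbb{R}^{k \times n}_{\ge 0}

% Classical Lee--Seung multiplicative updates (one solver choice),
% where \circ and the fraction denote element-wise product and division:
W \leftarrow W \circ \frac{A H^{\top}}{W H H^{\top}},
\qquad
H \leftarrow H \circ \frac{W^{\top} A}{W^{\top} W H}
```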
For input data of a homogeneous type, a standard convolutional neural network is normally constructed with universally applied filters that identify global patterns. For certain datasets, however, there are identifiable trends and patterns within subgroups of the input data. This research proposes a convolutional neural network that deliberately partitions the input data into groups processed by unique sets of convolutional layers, thus identifying the underlying features of individual data groups. Training and testing data are built from historical stock market prices and preprocessed so that the generated datasets are suitable for both the standard and the proposed convolutional neural network. The author also developed a software framework that constructs neural networks to perform the necessary testing. The calculation logic was implemented with parallel programming and executed on an NVIDIA graphics processing unit, allowing tests to be run without expensive hardware. Tests were executed on 134 datasets to benchmark the performance of the standard against the proposed convolutional neural network. The results show that the partitioned convolution method is capable of performance that rivals its standard counterpart. Further analysis indicates that more sophisticated methods of building datasets, larger training sets, or more training epochs can further improve the performance of the partitioned network. For suitable datasets, the proposed method could be a viable replacement for, or supplement to, the standard convolutional neural network structure.
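The core idea, separate filter sets per input partition instead of one universal filter bank, can be sketched as a grouped one-dimensional convolution. The code below is our minimal illustration; the paper's networks, datasets, and framework are of course richer:

```cpp
// Partitioned (grouped) 1-D convolution, a minimal illustration:
// the input is split into `filters.size()` contiguous groups and each
// group is convolved with its own filter, so per-group patterns get
// dedicated weights. Assumes the input length divides evenly.
#include <vector>

std::vector<float> grouped_conv1d(const std::vector<float>& x,
                                  const std::vector<std::vector<float>>& filters) {
    const int groups = static_cast<int>(filters.size());
    const int glen = static_cast<int>(x.size()) / groups;
    std::vector<float> y;
    for (int g = 0; g < groups; ++g) {
        const auto& w = filters[g];               // this group's own filter
        const int k = static_cast<int>(w.size());
        for (int i = 0; i + k <= glen; ++i) {     // "valid" convolution
            float acc = 0.f;
            for (int j = 0; j < k; ++j)
                acc += x[g * glen + i + j] * w[j];
            y.push_back(acc);
        }
    }
    return y;  // concatenated per-group feature maps
}
```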
Coprocessor architectures in High Performance Computing are prevalent in today's scientific computing clusters and require specialized knowledge for proper utilization. Various alternative paradigms for parallel and offload computation exist, but little is known about the human-factors impact of using the different paradigms. With computer science students from the University of Nevada, Las Vegas who had no previous exposure to Graphics Processing Unit programming as participants, our study compared NVIDIA CUDA C/C++ as a control group against the Thrust library, whose designers claim that its higher level of abstraction enhances programmer productivity. The trial was conducted on 91 participants and administered through our computerized testing platform. Although the study was narrowly focused on the basic steps of an offloaded computation problem and was not intended as a comprehensive evaluation of the superiority of one approach or the other, we found evidence that although Thrust was designed for ease of use, its abstractions tended to confuse students and in several cases diminished productivity. Specifically, the Thrust abstractions for (i) memory allocation through a C++ Standard Template Library-style vector call, (ii) memory transfers between the host and the Graphics Processing Unit coprocessor through an overloaded assignment operator, and (iii) execution of an offloaded routine through a generic transform call instead of a CUDA kernel routine all performed either equal to or worse than CUDA.
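Concretely, the three Thrust abstractions the study measured look like the sketch below, which uses the real Thrust API on a toy computation of our choosing:

```cpp
// The three Thrust abstractions the study examined, in one sketch
// (real Thrust API; the toy negation example is our choice):
//   (i)   allocation via an STL-style vector,
//   (ii)  host<->device transfer via overloaded assignment,
//   (iii) offloaded execution via transform instead of a CUDA kernel.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::host_vector<float> h(1 << 20, 1.0f);
    thrust::device_vector<float> d = h;          // (i) + (ii): alloc and copy in
    thrust::transform(d.begin(), d.end(), d.begin(),
                      thrust::negate<float>());  // (iii): no __global__ kernel
    h = d;                                       // (ii): copy the result back
    return h[0] == -1.0f ? 0 : 1;
}
```

The study's finding is that each of these conveniences hides a step (allocation, transfer, launch) that the explicit cudaMalloc/cudaMemcpy/kernel-launch sequence forces novices to confront directly, which may explain the observed confusion.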