Effectively implementing scientific algorithms in distributed-memory parallel applications is a difficult task for domain scientists, as evidenced by the large number of domain-specific languages and libraries available today that attempt to facilitate the process. However, these usually provide a closed set of parallel patterns and are not open for extension without vast modifications to the underlying system. In this work, we present the AllScale API, a programming interface for developing distributed-memory parallel applications with the ease of shared-memory programming models. The AllScale API is closed for modification but open for extension, allowing new user-defined parallel patterns and data structures to be implemented on top of existing core primitives and therefore to be fully supported by the AllScale framework. Focusing on the high-level functionality directly offered to application developers, we present the advantages of such an API design, detail parts of its specification, and evaluate it using three real-world use cases. Our results show that AllScale decreases the complexity of implementing scientific applications for distributed memory while attaining comparable or higher performance than MPI reference implementations.
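To make the open-for-extension idea concrete, here is a minimal C++ sketch, not the actual AllScale API: a hypothetical divide-and-conquer core primitive (`rec_split`) on which a user layers a new parallel pattern (`my_parallel_map`) without touching the underlying framework. A distributed runtime would schedule the halves on remote nodes; `std::async` stands in here.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical core primitive: recursively split [lo, hi) until the range
// is small, then run the base-case body on it.
template <typename Body>
void rec_split(std::size_t lo, std::size_t hi, std::size_t grain, Body body) {
    if (hi - lo <= grain) { body(lo, hi); return; }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           [=] { rec_split(lo, mid, grain, body); });
    rec_split(mid, hi, grain, body);   // process the right half ourselves
    left.get();
}

// User-defined pattern built purely on the core primitive: the framework
// itself needs no change (closed for modification, open for extension).
template <typename T, typename F>
void my_parallel_map(std::vector<T>& data, F f) {
    rec_split(0, data.size(), 1 << 16, [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) data[i] = f(data[i]);
    });
}

int main() {
    std::vector<int> v(1 << 20, 1);
    my_parallel_map(v, [](int x) { return x * 2; });
}
```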
Scaling up Artificial Intelligence (AI) algorithms to massive datasets in order to improve their performance is becoming crucial. In Machine Translation (MT), one of the most important research fields of AI, models based on Recurrent Neural Networks (RNNs) have shown state-of-the-art performance in recent years, and many researchers keep working on improving RNN-based models to achieve better accuracy in translation tasks. Most implementations of Neural Machine Translation (NMT) models employ a padding strategy when processing a mini-batch, so that all sentences in the mini-batch have the same length. This enables efficient utilization of caches and GPU/SIMD parallelism but wastes computation time. In this paper, we implement and parallelize batch learning for a Sequence-to-Sequence (Seq2Seq) model, the most basic NMT model, without using a padding strategy. More specifically, while processing one sentence, our approach gathers the vectors representing the input words, as well as the neural network's states at different time steps, into matrices; as a result, it makes better use of the cache and streamlines the weight and bias updates during the back-propagation phase. Our experimental evaluation shows that our implementation achieves better scalability on multi-core CPUs. We also discuss our approach's potential to be used in other implementations of RNN-based models.
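A rough C++ sketch of the padding-free strategy, with hypothetical names and a deliberately toy update rule rather than the paper's implementation: each sentence's per-time-step vectors are packed contiguously, so the recurrence touches only real tokens and total work is the sum of real sentence lengths rather than batch size times the maximum length.

```cpp
#include <cstddef>
#include <vector>

struct Sentence {
    std::size_t len;            // number of real tokens -- no padding
    std::vector<float> embed;   // len x dim, row-major: one row per token
};

// One toy recurrent update, h += W * x (W is dim x dim, row-major).
void step(std::vector<float>& h, const float* x,
          const std::vector<float>& W, std::size_t dim) {
    for (std::size_t i = 0; i < dim; ++i)
        for (std::size_t j = 0; j < dim; ++j)
            h[i] += W[i * dim + j] * x[j];
}

// Forward pass without padding: each sentence's rows are contiguous, which
// keeps accesses cache friendly, and no cycles are spent on pad tokens.
void forward(const std::vector<Sentence>& batch,
             const std::vector<float>& W, std::size_t dim) {
    for (const Sentence& s : batch) {
        std::vector<float> h(dim, 0.0f);          // per-sentence hidden state
        for (std::size_t t = 0; t < s.len; ++t)   // only real tokens
            step(h, &s.embed[t * dim], W, dim);
    }
}

int main() {
    const std::size_t dim = 8;
    std::vector<float> W(dim * dim, 0.01f);
    std::vector<Sentence> batch = {               // lengths differ freely
        {3, std::vector<float>(3 * dim, 1.0f)},
        {7, std::vector<float>(7 * dim, 1.0f)},
    };
    forward(batch, W, dim);
}
```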
Support Vector Machine (SVM) is a supervised machine learning model for classification tasks. Training an SVM on a large number of data samples is challenging due to the high computational cost and memory requirements. Hence, model training is typically supported on a high-performance server that runs a sequential training algorithm on centralized data. However, as we move towards massive workloads, it will be impossible to store all the data centrally and expect such sequential training algorithms to scale on traditional processors. Moreover, with the growing demands of real-time machine learning for edge analytics, it is imperative to devise an efficient training framework with relatively cheap computation and limited memory. Therefore, we propose and implement a first-of-its-kind system that uses multiple FPGAs as a distributed computing framework, comprising up to eight FPGA units on Amazon F1 instances with negligible communication overhead, to fully parallelize, accelerate, and scale SVM training on decentralized data. Each FPGA unit has a pipelined SVM-training IP logic core operating at 125 MHz with a power dissipation of 39 Watts, accelerating its allocated share of the overall training process. We evaluate and compare the performance of the proposed system on five real SVM benchmarks.
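The following is a plain-C++ stand-in for the data-parallel structure only, assuming a simple linear SVM with hinge loss; the paper's workers are pipelined FPGA IP cores, not threads, and the names here are illustrative. Each shard computes a partial sub-gradient over its decentralized data, and the host reduces the partials before the weight update.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Shard {                          // one worker's local data partition
    std::vector<std::vector<float>> X;  // samples
    std::vector<float> y;               // labels in {-1, +1}
};

// Partial sub-gradient of the hinge loss over one shard (margin violators only).
void partial_grad(const Shard& s, const std::vector<float>& w,
                  std::vector<float>& g) {
    for (std::size_t i = 0; i < s.X.size(); ++i) {
        float margin = 0.0f;
        for (std::size_t j = 0; j < w.size(); ++j)
            margin += w[j] * s.X[i][j];
        if (s.y[i] * margin < 1.0f)
            for (std::size_t j = 0; j < w.size(); ++j)
                g[j] -= s.y[i] * s.X[i][j];
    }
}

// One synchronous round: workers compute partials in parallel, the host
// reduces them and takes a step (lambda: regularization, eta: learning rate).
void train_round(const std::vector<Shard>& shards, std::vector<float>& w,
                 float lambda, float eta, std::size_t n_total) {
    std::vector<std::vector<float>> partials(
        shards.size(), std::vector<float>(w.size(), 0.0f));
    std::vector<std::thread> workers;
    for (std::size_t k = 0; k < shards.size(); ++k)
        workers.emplace_back(partial_grad, std::cref(shards[k]),
                             std::cref(w), std::ref(partials[k]));
    for (auto& t : workers) t.join();
    for (std::size_t j = 0; j < w.size(); ++j) {
        float g = lambda * w[j];
        for (const auto& p : partials) g += p[j] / n_total;
        w[j] -= eta * g;
    }
}

int main() {
    std::vector<Shard> shards(4);       // stand-ins for the FPGA units
    for (auto& s : shards) { s.X = {{1.0f, 0.5f}}; s.y = {1.0f}; }
    std::vector<float> w(2, 0.0f);
    train_round(shards, w, 0.01f, 0.1f, 4);
}
```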
We present 3D-EPUG-OVERLAY, a fast, exact, parallel, memory-efficient algorithm for computing the intersection between two large 3-D triangular meshes with geometric degeneracies. Applications include CAD/CAM, CFD, GIS, and additive manufacturing. 3D-EPUG-OVERLAY combines five techniques: multiple-precision rational numbers, to eliminate roundoff errors during the computations; Simulation of Simplicity, to properly handle geometric degeneracies; simple data representations and only local topological information, to simplify the correct processing of the data and make the algorithm more parallelizable; a uniform grid, to efficiently index the data and accelerate testing pairs of triangles for intersection or locating points in the mesh; and parallel programming, to exploit current hardware. 3D-EPUG-OVERLAY is up to 101 times faster than LibiGL, and comparable to QuickCSG, a parallel inexact algorithm. 3D-EPUG-OVERLAY is also more memory efficient. In all test cases, 3D-EPUG-OVERLAY's result matched the reference solution.
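A sketch of just the uniform-grid indexing ingredient, with illustrative types (`Tri`, `Grid`) and without the exact rational arithmetic or Simulation of Simplicity: triangles are binned by bounding box, and only pairs sharing a cell proceed to the expensive intersection test.

```cpp
#include <cstddef>
#include <vector>

struct Tri { float minx, miny, minz, maxx, maxy, maxz; };  // bounding box only

struct Grid {
    int n;                                        // cells per axis
    float lo, cell;                               // domain origin, cell size
    std::vector<std::vector<std::size_t>> cells;  // triangle ids per cell

    Grid(int n_, float lo_, float hi)
        : n(n_), lo(lo_), cell((hi - lo_) / n_),
          cells(std::size_t(n_) * n_ * n_) {}

    int clampc(float v) const {                   // world coord -> cell index
        int c = int((v - lo) / cell);
        return c < 0 ? 0 : (c >= n ? n - 1 : c);
    }

    // Bin a triangle into every cell its bounding box overlaps.
    void insert(std::size_t id, const Tri& t) {
        for (int x = clampc(t.minx); x <= clampc(t.maxx); ++x)
            for (int y = clampc(t.miny); y <= clampc(t.maxy); ++y)
                for (int z = clampc(t.minz); z <= clampc(t.maxz); ++z)
                    cells[(std::size_t(x) * n + y) * n + z].push_back(id);
    }
};

int main() {
    Grid g(16, 0.0f, 1.0f);
    g.insert(0, Tri{0.10f, 0.10f, 0.10f, 0.25f, 0.20f, 0.15f});
    // Candidate pairs are ids co-located in some cell; only those reach the
    // expensive exact triangle-triangle intersection test.
}
```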
In this paper, we consider the pair-wise semiglobal sequence alignment problem with gaps, motivated by the re-sequencing problem, which requires assembling short read sequences into a genome sequence by referring to a reference sequence. The problem has been studied before for a single gap and for a bounded number of gaps; for the single-gap case, a GPU-based algorithm has been proposed (Barton et al., 2015). In our work, we propose a GPU-based algorithm for the bounded-number-of-gaps case, called GPUGapsMis. We implement the algorithm and compare its performance with that of the CPU-based algorithm, called CPUGapsMis. The algorithm has two distinct phases: alignment and backtracking. We investigate several approaches in order to determine the most favorable for this problem: a hybrid CPU-GPU model versus a wholly GPU-based model, and aligning a single text sequence versus multiple text sequences on the GPU at a time. We show that the alignment phase is a good candidate for parallelization, with a peak speedup of 11 times. We also show that although the backtracking phase is sequential, it is more beneficial to perform it on the GPU than to return to the CPU and perform it there. When performing both phases on the GPU, GPUGapsMis achieves a peak speedup of 10.4 times over CPUGapsMis. Our data-parallel GPU algorithm improves on the results of an existing GPU data-parallel implementation (Ojiaku, 2014).
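As a hedged sketch of the alignment phase only, using a generic semiglobal edit-distance recurrence rather than the GapsMis scoring: the pattern must be fully consumed but may start and end anywhere in the text, and the cells of each anti-diagonal are mutually independent, which is what makes this phase amenable to GPU parallelization.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// D[i][j] = best cost of aligning pat[0..j) so that the alignment ends at
// text position i; leading and trailing unaligned text is free.
int semiglobal(const std::string& text, const std::string& pat) {
    const int n = text.size(), m = pat.size();
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1));
    for (int i = 0; i <= n; ++i) D[i][0] = 0;   // free leading text
    for (int j = 1; j <= m; ++j) D[0][j] = j;   // gaps cost 1 each
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            D[i][j] = std::min({D[i-1][j-1] + (text[i-1] != pat[j-1]),
                                D[i-1][j] + 1,   // gap in the pattern
                                D[i][j-1] + 1}); // gap in the text
    int best = m;                                // free trailing text
    for (int i = 0; i <= n; ++i) best = std::min(best, D[i][m]);
    return best;                                 // backtracking omitted
}

int main() {
    int d = semiglobal("ACGTACGTACGT", "GTAC");
    (void)d;
}
```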
The use of synchronization mechanisms in multithreaded applications is essential on shared-memory multi-core architectures. However, debugging parallel applications to avoid potential failures, such as data races or deadlocks, can be challenging. Race detectors are key to spotting such concurrency bugs; nevertheless, if lock-free data structures are used, these detectors may emit a significant number of false positives. In this paper, we present a framework for detecting semantic violations of lock-free data structures which makes use of contracts, a feature proposed for C++20, and a customized version of the ThreadSanitizer race detector. We evaluate the detection accuracy of the framework, in terms of false positives and false negatives, on synthetic benchmarks that use the SPSC and MPMC lock-free queue structures from the Boost C++ library. Thanks to this framework, we are able to check the correct use of lock-free data structures, thus reducing the number of false positives.
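A simplified stand-in for the kind of semantic contract being checked (the framework itself uses C++ contracts plus a modified ThreadSanitizer; this sketch merely asserts the single-producer/single-consumer rule at runtime on Boost's real `spsc_queue`):

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <cassert>
#include <cstddef>
#include <thread>

// Wrapper enforcing the usage rule of boost::lockfree::spsc_queue: exactly
// one producer thread and one consumer thread. The lazy thread-id capture
// below is itself simplified and not race-free.
template <typename T, std::size_t N>
class CheckedSpsc {
    boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q_;
    std::thread::id producer_{}, consumer_{};
public:
    bool push(const T& v) {
        if (producer_ == std::thread::id{})
            producer_ = std::this_thread::get_id();
        assert(producer_ == std::this_thread::get_id() && "second producer");
        return q_.push(v);
    }
    bool pop(T& v) {
        if (consumer_ == std::thread::id{})
            consumer_ = std::this_thread::get_id();
        assert(consumer_ == std::this_thread::get_id() && "second consumer");
        return q_.pop(v);
    }
};

int main() {
    CheckedSpsc<int, 1024> q;
    q.push(42);
    int v = 0;
    q.pop(v);
}
```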
Ada 2022 includes parallel programming features that use lightweight logical threads of control on top of the heavier-weight Ada tasks. This talk will report on work in progress to implement a work-stealing scheduler for these lightweight threads.
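As a generic illustration of the work-stealing idea behind such schedulers, in C++ rather than Ada and with a simplified locked deque (production schedulers use lock-free deques): each worker pushes and pops work at the back of its own deque, while idle workers steal from the front of a victim's deque.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// One worker's deque: the owner pushes/pops at the back (LIFO, cache warm),
// idle workers steal from the front (FIFO, oldest and typically largest tasks).
struct WorkDeque {
    std::deque<std::function<void()>> d;
    std::mutex m;   // simplified; production deques avoid locks

    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m);
        d.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop() {     // owner end
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.back()); d.pop_back(); return t;
    }
    std::optional<std::function<void()>> steal() {   // thief end
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.front()); d.pop_front(); return t;
    }
};

int main() {
    WorkDeque w;
    w.push([] { /* lightweight unit of work */ });
    if (auto t = w.steal()) (*t)();   // an idle worker takes it
}
```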
Parallel programming can be difficult and error prone, in particular if low-level optimizations are required in order to reach high performance in complex environments such as multi-core clusters using MPI and OpenMP. One approach to overcoming these issues is based on algorithmic skeletons. These are predefined patterns which are implemented in parallel and can be composed by application programmers without having to take care of low-level programming aspects. Support for algorithmic skeletons is typically provided as a library. However, optimizations are hard to implement in this setting, and programming may still be tedious because of the required boilerplate code. Thus, we propose a domain-specific language for algorithmic skeletons that performs optimizations and generates low-level C++ code. Our experimental results on four benchmarks show that the models are significantly shorter and that the generated code often outperforms equivalent library implementations based on the Muenster Skeleton Library in execution time and speedup.
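A hedged sketch of the skeleton style such a DSL targets, with hypothetical `skel_map`/`skel_reduce` rather than actual Muenster Skeleton Library or generated code: the application is a pure composition of patterns, while the parallelization details live inside the skeletons.

```cpp
#include <numeric>
#include <vector>

// Predefined skeletons: parallelization lives inside, not in application code.
template <typename T, typename F>
std::vector<T> skel_map(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    #pragma omp parallel for          // low-level detail hidden in the skeleton
    for (long i = 0; i < (long)in.size(); ++i) out[i] = f(in[i]);
    return out;
}

template <typename T, typename Op>
T skel_reduce(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);  // serial for brevity
}

int main() {
    std::vector<double> xs(1000000, 0.5);
    // The application is pure composition -- here, a sum of squares.
    auto sq = skel_map(xs, [](double x) { return x * x; });
    double sum = skel_reduce(sq, 0.0, [](double a, double b) { return a + b; });
    (void)sum;
}
```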
Bioinformatics is an interdisciplinary field that applies current techniques in information technology, mathematics, and statistics to the study of large biological data. Bioinformatics involves several computational techniques such as sequence and structural alignment, data mining, macromolecular geometry, protein structure prediction, and gene finding. Protein structure and sequence analysis are vital to the understanding of cellular processes, and understanding cellular processes contributes to the development of drugs for metabolic pathways. Protein sequence alignment is concerned with identifying the similarities and relationships among different protein structures. In this paper, we target two well-known protein sequence alignment algorithms, the Needleman-Wunsch and Smith-Waterman algorithms. These two algorithms are computationally expensive, which hinders their applicability to large data sets. Thus, we propose a hybrid parallel approach that combines the capabilities of multi-core CPUs and the power of contemporary GPUs, and significantly speeds up the execution of the target algorithms. The validity of our approach is tested on real protein sequences, and its scalability is verified on randomly generated sequences with predefined similarity levels. The results showed that the proposed hybrid approach was up to 242 times faster than the sequential approach.
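For reference, a minimal Needleman-Wunsch scoring sketch with illustrative parameters, not the paper's tuned implementation: the anti-diagonal dependency structure visible in the recurrence is the usual source of fine-grained parallelism that hybrid CPU-GPU schemes exploit.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Global alignment score with a linear gap penalty; match/mismatch/gap
// values are illustrative. Cells on one anti-diagonal depend only on the
// previous two anti-diagonals, so they can be computed in parallel.
int needleman_wunsch(const std::string& a, const std::string& b,
                     int match = 1, int mismatch = -1, int gap = -2) {
    const int n = a.size(), m = b.size();
    std::vector<std::vector<int>> S(n + 1, std::vector<int>(m + 1));
    for (int i = 0; i <= n; ++i) S[i][0] = i * gap;
    for (int j = 0; j <= m; ++j) S[0][j] = j * gap;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            S[i][j] = std::max({S[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch),
                                S[i-1][j] + gap,    // gap in b
                                S[i][j-1] + gap});  // gap in a
    return S[n][m];   // traceback to recover the actual alignment is omitted
}

int main() {
    int s = needleman_wunsch("GATTACA", "GCATGCU");
    (void)s;
}
```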
Existing best-effort, requester-wins implementations of transactional memory must resort to non-speculative execution to provide forward progress in the presence of transactions that exceed hardware capacity, experience page faults, or suffer high contention leading to livelocks. Current approaches to irrevocability employ lock-based synchronization to achieve mutual exclusion when executing a transaction non-speculatively, conservatively precluding concurrency with any other transactions in order to guarantee atomicity, at the cost of degraded performance. In this article, we propose a new form of concurrent irrevocability whose goal is to minimize the loss of concurrency incurred when transactions resort to irrevocability in order to complete. By enabling optimistic concurrency control during the non-speculative execution of a transaction as well, our proposal allows for higher parallelism than existing schemes. We describe the instruction set extensions that provide concurrent irrevocable transactions, as well as the architectural extensions required to realize them on a best-effort HTM system without requiring any modification to the cache coherence protocol. Our evaluation shows that our proposal achieves an average reduction of 12.5 percent in execution time across the STAMP benchmarks, and of 15.8 percent on average for highly contended workloads.
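For context, a sketch of the conventional baseline the article improves on, not the proposed ISA extensions: a best-effort HTM path (x86 TSX intrinsics; compile with -mrtm and run on TSX-capable hardware) whose transactions fall back to a single global lock and thereby exclude all concurrent speculation.

```cpp
#include <atomic>
#include <immintrin.h>

std::atomic<bool> fallback_lock{false};   // single global irrevocability lock

template <typename F>
void atomic_region(F body, int max_retries = 3) {
    for (int i = 0; i < max_retries; ++i) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Subscribe to the lock: it joins the read set, so a fallback
            // writer aborts this transaction (requester wins).
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0xff);
            body();
            _xend();
            return;
        }
        // Aborted (capacity, conflict, page fault, ...): retry or fall back.
    }
    // Irrevocable fallback: mutual exclusion with ALL other transactions --
    // exactly the concurrency loss that concurrent irrevocability targets.
    while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
    body();
    fallback_lock.store(false, std::memory_order_release);
}

int main() {
    int counter = 0;
    atomic_region([&] { ++counter; });
    return counter == 1 ? 0 : 1;
}
```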