Effectively implementing scientific algorithms in distributed-memory parallel applications is a difficult task for domain scientists, as evidenced by the large number of domain-specific languages and libraries available today that attempt to facilitate the process. However, these usually provide a closed set of parallel patterns and are not open for extension without vast modifications to the underlying system. In this work, we present the AllScale API, a programming interface for developing distributed-memory parallel applications with the ease of shared-memory programming models. The AllScale API is closed for modification but open for extension, allowing new user-defined parallel patterns and data structures to be implemented on top of existing core primitives and therefore to be fully supported by the AllScale framework. Focusing on the high-level functionality directly offered to application developers, we present the advantages of such an API design, detail parts of its specification, and evaluate it using three real-world use cases. Our results show that AllScale decreases the complexity of implementing scientific applications for distributed memory while attaining comparable or higher performance than MPI reference implementations.
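To make the open-for-extension idea concrete, here is a minimal C++ sketch, not the actual AllScale API: a hypothetical divide-and-conquer core primitive (`rec_split`) on which a user layers a new parallel pattern (`my_parallel_map`) without touching the underlying framework. A distributed runtime would schedule the halves on remote nodes; `std::async` stands in here.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical core primitive: recursively split [lo, hi) until the range
// is small, then run the base-case body on it.
template <typename Body>
void rec_split(std::size_t lo, std::size_t hi, std::size_t grain, Body body) {
    if (hi - lo <= grain) { body(lo, hi); return; }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           [=] { rec_split(lo, mid, grain, body); });
    rec_split(mid, hi, grain, body);   // process the right half ourselves
    left.get();
}

// User-defined pattern built purely on the core primitive: the framework
// itself needs no change (closed for modification, open for extension).
template <typename T, typename F>
void my_parallel_map(std::vector<T>& data, F f) {
    rec_split(0, data.size(), 1 << 16, [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) data[i] = f(data[i]);
    });
}

int main() {
    std::vector<int> v(1 << 20, 1);
    my_parallel_map(v, [](int x) { return x * 2; });
}
```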
Scaling up Artificial Intelligence (AI) algorithms to massive datasets in order to improve their performance is becoming crucial. In Machine Translation (MT), one of the most important research fields of AI, models based on Recurrent Neural Networks (RNNs) have shown state-of-the-art performance in recent years, and many researchers keep working on improving RNN-based models to achieve better accuracy in translation tasks. Most implementations of Neural Machine Translation (NMT) models employ a padding strategy when processing a mini-batch, so that all sentences in the mini-batch have the same length. This enables efficient utilization of caches and GPU/SIMD parallelism but wastes computation time. In this paper, we implement and parallelize batch learning for a Sequence-to-Sequence (Seq2Seq) model, the most basic NMT model, without using a padding strategy. More specifically, while processing one sentence, our approach gathers the vectors representing the input words, as well as the neural network's states at different time steps, into matrices; as a result, it makes better use of the cache and streamlines the weight and bias updates during the back-propagation phase. Our experimental evaluation shows that our implementation achieves better scalability on multi-core CPUs. We also discuss our approach's potential to be used in other implementations of RNN-based models.
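A rough C++ sketch of the padding-free strategy, with hypothetical names and a deliberately toy update rule rather than the paper's implementation: each sentence's per-time-step vectors are packed contiguously, so the recurrence touches only real tokens and total work is the sum of real sentence lengths rather than batch size times the maximum length.

```cpp
#include <cstddef>
#include <vector>

struct Sentence {
    std::size_t len;            // number of real tokens -- no padding
    std::vector<float> embed;   // len x dim, row-major: one row per token
};

// One toy recurrent update, h += W * x (W is dim x dim, row-major).
void step(std::vector<float>& h, const float* x,
          const std::vector<float>& W, std::size_t dim) {
    for (std::size_t i = 0; i < dim; ++i)
        for (std::size_t j = 0; j < dim; ++j)
            h[i] += W[i * dim + j] * x[j];
}

// Forward pass without padding: each sentence's rows are contiguous, which
// keeps accesses cache friendly, and no cycles are spent on pad tokens.
void forward(const std::vector<Sentence>& batch,
             const std::vector<float>& W, std::size_t dim) {
    for (const Sentence& s : batch) {
        std::vector<float> h(dim, 0.0f);          // per-sentence hidden state
        for (std::size_t t = 0; t < s.len; ++t)   // only real tokens
            step(h, &s.embed[t * dim], W, dim);
    }
}

int main() {
    const std::size_t dim = 8;
    std::vector<float> W(dim * dim, 0.01f);
    std::vector<Sentence> batch = {               // lengths differ freely
        {3, std::vector<float>(3 * dim, 1.0f)},
        {7, std::vector<float>(7 * dim, 1.0f)},
    };
    forward(batch, W, dim);
}
```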
Support Vector Machine (SVM) is a supervised machine learning model for classification tasks. Training an SVM on a large number of data samples is challenging due to the high computational cost and memory requirements. Hence, model training is typically supported on a high-performance server that runs a sequential training algorithm on centralized data. However, as we move towards massive workloads, it will be impossible to store all the data centrally and expect such sequential training algorithms to scale on traditional processors. Moreover, with the growing demands of real-time machine learning for edge analytics, it is imperative to devise an efficient training framework with relatively cheap computation and limited memory. Therefore, we propose and implement a first-of-its-kind system that uses multiple FPGAs as a distributed computing framework, comprising up to eight FPGA units on Amazon F1 instances with negligible communication overhead, to fully parallelize, accelerate, and scale SVM training on decentralized data. Each FPGA unit has a pipelined SVM-training IP logic core operating at 125 MHz with a power dissipation of 39 Watts, accelerating its allocated share of the overall training process. We evaluate and compare the performance of the proposed system on five real SVM benchmarks.
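The following is a plain-C++ stand-in for the data-parallel structure only, assuming a simple linear SVM with hinge loss; the paper's workers are pipelined FPGA IP cores, not threads, and the names here are illustrative. Each shard computes a partial sub-gradient over its decentralized data, and the host reduces the partials before the weight update.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Shard {                          // one worker's local data partition
    std::vector<std::vector<float>> X;  // samples
    std::vector<float> y;               // labels in {-1, +1}
};

// Partial sub-gradient of the hinge loss over one shard (margin violators only).
void partial_grad(const Shard& s, const std::vector<float>& w,
                  std::vector<float>& g) {
    for (std::size_t i = 0; i < s.X.size(); ++i) {
        float margin = 0.0f;
        for (std::size_t j = 0; j < w.size(); ++j)
            margin += w[j] * s.X[i][j];
        if (s.y[i] * margin < 1.0f)
            for (std::size_t j = 0; j < w.size(); ++j)
                g[j] -= s.y[i] * s.X[i][j];
    }
}

// One synchronous round: workers compute partials in parallel, the host
// reduces them and takes a step (lambda: regularization, eta: learning rate).
void train_round(const std::vector<Shard>& shards, std::vector<float>& w,
                 float lambda, float eta, std::size_t n_total) {
    std::vector<std::vector<float>> partials(
        shards.size(), std::vector<float>(w.size(), 0.0f));
    std::vector<std::thread> workers;
    for (std::size_t k = 0; k < shards.size(); ++k)
        workers.emplace_back(partial_grad, std::cref(shards[k]),
                             std::cref(w), std::ref(partials[k]));
    for (auto& t : workers) t.join();
    for (std::size_t j = 0; j < w.size(); ++j) {
        float g = lambda * w[j];
        for (const auto& p : partials) g += p[j] / n_total;
        w[j] -= eta * g;
    }
}

int main() {
    std::vector<Shard> shards(4);       // stand-ins for the FPGA units
    for (auto& s : shards) { s.X = {{1.0f, 0.5f}}; s.y = {1.0f}; }
    std::vector<float> w(2, 0.0f);
    train_round(shards, w, 0.01f, 0.1f, 4);
}
```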
We present 3D-EPUG-OVERLAY, a fast, exact, parallel, memory-efficient algorithm for computing the intersection between two large 3-D triangular meshes with geometric degeneracies. Applications include CAD/CAM, CFD, GIS, and additive manufacturing. 3D-EPUG-OVERLAY combines five techniques: multiple-precision rational numbers, to eliminate roundoff errors during the computations; Simulation of Simplicity, to properly handle geometric degeneracies; simple data representations and only local topological information, to simplify the correct processing of the data and make the algorithm more parallelizable; a uniform grid, to efficiently index the data and accelerate testing pairs of triangles for intersection or locating points in the mesh; and parallel programming, to exploit current hardware. 3D-EPUG-OVERLAY is up to 101 times faster than LibiGL, and comparable to QuickCSG, a parallel inexact algorithm. 3D-EPUG-OVERLAY is also more memory efficient. In all test cases, 3D-EPUG-OVERLAY's result matched the reference solution.
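A sketch of just the uniform-grid indexing ingredient, with illustrative types (`Tri`, `Grid`) and without the exact rational arithmetic or Simulation of Simplicity: triangles are binned by bounding box, and only pairs sharing a cell proceed to the expensive intersection test.

```cpp
#include <cstddef>
#include <vector>

struct Tri { float minx, miny, minz, maxx, maxy, maxz; };  // bounding box only

struct Grid {
    int n;                                        // cells per axis
    float lo, cell;                               // domain origin, cell size
    std::vector<std::vector<std::size_t>> cells;  // triangle ids per cell

    Grid(int n_, float lo_, float hi)
        : n(n_), lo(lo_), cell((hi - lo_) / n_),
          cells(std::size_t(n_) * n_ * n_) {}

    int clampc(float v) const {                   // world coord -> cell index
        int c = int((v - lo) / cell);
        return c < 0 ? 0 : (c >= n ? n - 1 : c);
    }

    // Bin a triangle into every cell its bounding box overlaps.
    void insert(std::size_t id, const Tri& t) {
        for (int x = clampc(t.minx); x <= clampc(t.maxx); ++x)
            for (int y = clampc(t.miny); y <= clampc(t.maxy); ++y)
                for (int z = clampc(t.minz); z <= clampc(t.maxz); ++z)
                    cells[(std::size_t(x) * n + y) * n + z].push_back(id);
    }
};

int main() {
    Grid g(16, 0.0f, 1.0f);
    g.insert(0, Tri{0.10f, 0.10f, 0.10f, 0.25f, 0.20f, 0.15f});
    // Candidate pairs are ids co-located in some cell; only those reach the
    // expensive exact triangle-triangle intersection test.
}
```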
In this paper, we consider the pair-wise semiglobal sequence alignment problem with gaps, motivated by the re-sequencing problem, which requires assembling short read sequences into a genome sequence by referring to a reference sequence. The problem has been studied before for a single gap and for a bounded number of gaps; for the single-gap case, a GPU-based algorithm has been proposed (Barton et al., 2015). In our work, we propose a GPU-based algorithm for the bounded-number-of-gaps case, called GPUGapsMis. We implement the algorithm and compare its performance with that of the CPU-based algorithm, called CPUGapsMis. The algorithm has two distinct phases: alignment and backtracking. We investigate several approaches in order to determine the most favorable for this problem: a hybrid CPU-GPU model versus a wholly GPU-based model, and aligning a single text sequence versus multiple text sequences on the GPU at a time. We show that the alignment phase is a good candidate for parallelization, with a peak speedup of 11 times. We also show that although the backtracking phase is sequential, it is more beneficial to perform it on the GPU than to return to the CPU and perform it there. When performing both phases on the GPU, GPUGapsMis achieves a peak speedup of 10.4 times over CPUGapsMis. Our data-parallel GPU algorithm improves on the results of an existing GPU data-parallel implementation (Ojiaku, 2014).
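As a hedged sketch of the alignment phase only, using a generic semiglobal edit-distance recurrence rather than the GapsMis scoring: the pattern must be fully consumed but may start and end anywhere in the text, and the cells of each anti-diagonal are mutually independent, which is what makes this phase amenable to GPU parallelization.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// D[i][j] = best cost of aligning pat[0..j) so that the alignment ends at
// text position i; leading and trailing unaligned text is free.
int semiglobal(const std::string& text, const std::string& pat) {
    const int n = text.size(), m = pat.size();
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1));
    for (int i = 0; i <= n; ++i) D[i][0] = 0;   // free leading text
    for (int j = 1; j <= m; ++j) D[0][j] = j;   // gaps cost 1 each
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            D[i][j] = std::min({D[i-1][j-1] + (text[i-1] != pat[j-1]),
                                D[i-1][j] + 1,   // gap in the pattern
                                D[i][j-1] + 1}); // gap in the text
    int best = m;                                // free trailing text
    for (int i = 0; i <= n; ++i) best = std::min(best, D[i][m]);
    return best;                                 // backtracking omitted
}

int main() {
    int d = semiglobal("ACGTACGTACGT", "GTAC");
    (void)d;
}
```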
The use of synchronization mechanisms in multithreaded applications is essential on shared-memory multi-core architectures. However, debugging parallel applications to avoid potential failures, such as data races or deadlocks, can be challenging. Race detectors are key to spotting such concurrency bugs; nevertheless, if lock-free data structures are used, these detectors may emit a significant number of false positives. In this paper, we present a framework for detecting semantic violations of lock-free data structures which makes use of contracts, a feature proposed for C++20, and a customized version of the ThreadSanitizer race detector. We evaluate the detection accuracy of the framework, in terms of false positives and false negatives, on synthetic benchmarks that use the SPSC and MPMC lock-free queue structures from the Boost C++ library. Thanks to this framework, we are able to check the correct use of lock-free data structures, thus reducing the number of false positives.
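A simplified stand-in for the kind of semantic contract being checked (the framework itself uses C++ contracts plus a modified ThreadSanitizer; this sketch merely asserts the single-producer/single-consumer rule at runtime on Boost's real `spsc_queue`):

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <cassert>
#include <cstddef>
#include <thread>

// Wrapper enforcing the usage rule of boost::lockfree::spsc_queue: exactly
// one producer thread and one consumer thread. The lazy thread-id capture
// below is itself simplified and not race-free.
template <typename T, std::size_t N>
class CheckedSpsc {
    boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q_;
    std::thread::id producer_{}, consumer_{};
public:
    bool push(const T& v) {
        if (producer_ == std::thread::id{})
            producer_ = std::this_thread::get_id();
        assert(producer_ == std::this_thread::get_id() && "second producer");
        return q_.push(v);
    }
    bool pop(T& v) {
        if (consumer_ == std::thread::id{})
            consumer_ = std::this_thread::get_id();
        assert(consumer_ == std::this_thread::get_id() && "second consumer");
        return q_.pop(v);
    }
};

int main() {
    CheckedSpsc<int, 1024> q;
    q.push(42);
    int v = 0;
    q.pop(v);
}
```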
Ada 2022 includes parallel programming features that use lightweight logical threads of control on top of the heavier-weight Ada tasks. This talk will report on work in progress to implement a work-stealing scheduler for these lightweight threads.
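As a generic illustration of the work-stealing idea behind such schedulers, in C++ rather than Ada and with a simplified locked deque (production schedulers use lock-free deques): each worker pushes and pops work at the back of its own deque, while idle workers steal from the front of a victim's deque.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// One worker's deque: the owner pushes/pops at the back (LIFO, cache warm),
// idle workers steal from the front (FIFO, oldest and typically largest tasks).
struct WorkDeque {
    std::deque<std::function<void()>> d;
    std::mutex m;   // simplified; production deques avoid locks

    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m);
        d.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop() {     // owner end
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.back()); d.pop_back(); return t;
    }
    std::optional<std::function<void()>> steal() {   // thief end
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.front()); d.pop_front(); return t;
    }
};

int main() {
    WorkDeque w;
    w.push([] { /* lightweight unit of work */ });
    if (auto t = w.steal()) (*t)();   // an idle worker takes it
}
```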
Parallel programming can be difficult and error prone, in particular if low-level optimizations are required in order to reach high performance in complex environments such as multi-core clusters using MPI and OpenMP. One approach to overcoming these issues is based on algorithmic skeletons. These are predefined patterns which are implemented in parallel and can be composed by application programmers without having to take care of low-level programming aspects. Support for algorithmic skeletons is typically provided as a library. However, optimizations are hard to implement in this setting, and programming may still be tedious because of the required boilerplate code. Thus, we propose a domain-specific language for algorithmic skeletons that performs optimizations and generates low-level C++ code. Our experimental results on four benchmarks show that the models are significantly shorter and that the generated code often outperforms equivalent library implementations based on the Muenster Skeleton Library in execution time and speedup.
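A hedged sketch of the skeleton style such a DSL targets, with hypothetical `skel_map`/`skel_reduce` rather than actual Muenster Skeleton Library or generated code: the application is a pure composition of patterns, while the parallelization details live inside the skeletons.

```cpp
#include <numeric>
#include <vector>

// Predefined skeletons: parallelization lives inside, not in application code.
template <typename T, typename F>
std::vector<T> skel_map(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    #pragma omp parallel for          // low-level detail hidden in the skeleton
    for (long i = 0; i < (long)in.size(); ++i) out[i] = f(in[i]);
    return out;
}

template <typename T, typename Op>
T skel_reduce(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);  // serial for brevity
}

int main() {
    std::vector<double> xs(1000000, 0.5);
    // The application is pure composition -- here, a sum of squares.
    auto sq = skel_map(xs, [](double x) { return x * x; });
    double sum = skel_reduce(sq, 0.0, [](double a, double b) { return a + b; });
    (void)sum;
}
```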
Bioinformatics is an interdisciplinary field that applies current techniques in information technology, mathematics, and statistics to the study of large biological data. Bioinformatics involves several computational techniques such as sequence and structural alignment, data mining, macromolecular geometry, protein structure prediction, and gene finding. Protein structure and sequence analysis are vital to the understanding of cellular processes, and understanding cellular processes contributes to the development of drugs for metabolic pathways. Protein sequence alignment is concerned with identifying the similarities and relationships among different protein structures. In this paper, we target two well-known protein sequence alignment algorithms, the Needleman-Wunsch and Smith-Waterman algorithms. These two algorithms are computationally expensive, which hinders their applicability to large data sets. Thus, we propose a hybrid parallel approach that combines the capabilities of multi-core CPUs and the power of contemporary GPUs, and significantly speeds up the execution of the target algorithms. The validity of our approach is tested on real protein sequences, and its scalability is verified on randomly generated sequences with predefined similarity levels. The results showed that the proposed hybrid approach was up to 242 times faster than the sequential approach.
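For reference, a minimal Needleman-Wunsch scoring sketch with illustrative parameters, not the paper's tuned implementation: the anti-diagonal dependency structure visible in the recurrence is the usual source of fine-grained parallelism that hybrid CPU-GPU schemes exploit.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Global alignment score with a linear gap penalty; match/mismatch/gap
// values are illustrative. Cells on one anti-diagonal depend only on the
// previous two anti-diagonals, so they can be computed in parallel.
int needleman_wunsch(const std::string& a, const std::string& b,
                     int match = 1, int mismatch = -1, int gap = -2) {
    const int n = a.size(), m = b.size();
    std::vector<std::vector<int>> S(n + 1, std::vector<int>(m + 1));
    for (int i = 0; i <= n; ++i) S[i][0] = i * gap;
    for (int j = 0; j <= m; ++j) S[0][j] = j * gap;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            S[i][j] = std::max({S[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch),
                                S[i-1][j] + gap,    // gap in b
                                S[i][j-1] + gap});  // gap in a
    return S[n][m];   // traceback to recover the actual alignment is omitted
}

int main() {
    int s = needleman_wunsch("GATTACA", "GCATGCU");
    (void)s;
}
```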
Existing best-effort, requester-wins implementations of transactional memory must resort to non-speculative execution to provide forward progress in the presence of transactions that exceed hardware capacity, experience page faults, or suffer high contention leading to livelocks. Current approaches to irrevocability employ lock-based synchronization to achieve mutual exclusion when executing a transaction non-speculatively, conservatively precluding concurrency with any other transactions in order to guarantee atomicity, at the cost of degraded performance. In this article, we propose a new form of concurrent irrevocability whose goal is to minimize the loss of concurrency incurred when transactions resort to irrevocability in order to complete. By enabling optimistic concurrency control during the non-speculative execution of a transaction as well, our proposal allows for higher parallelism than existing schemes. We describe the instruction set extensions that provide concurrent irrevocable transactions, as well as the architectural extensions required to realize them on a best-effort HTM system without requiring any modification to the cache coherence protocol. Our evaluation shows that our proposal achieves an average reduction of 12.5 percent in execution time across the STAMP benchmarks, and of 15.8 percent on average for highly contended workloads.
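For context, a sketch of the conventional baseline the article improves on, not the proposed ISA extensions: a best-effort HTM path (x86 TSX intrinsics; compile with -mrtm and run on TSX-capable hardware) whose transactions fall back to a single global lock and thereby exclude all concurrent speculation.

```cpp
#include <atomic>
#include <immintrin.h>

std::atomic<bool> fallback_lock{false};   // single global irrevocability lock

template <typename F>
void atomic_region(F body, int max_retries = 3) {
    for (int i = 0; i < max_retries; ++i) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Subscribe to the lock: it joins the read set, so a fallback
            // writer aborts this transaction (requester wins).
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0xff);
            body();
            _xend();
            return;
        }
        // Aborted (capacity, conflict, page fault, ...): retry or fall back.
    }
    // Irrevocable fallback: mutual exclusion with ALL other transactions --
    // exactly the concurrency loss that concurrent irrevocability targets.
    while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
    body();
    fallback_lock.store(false, std::memory_order_release);
}

int main() {
    int counter = 0;
    atomic_region([&] { ++counter; });
    return counter == 1 ? 0 : 1;
}
```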