The use of synchronization mechanisms in multithreaded applications is essential on shared-memory multi-core architectures. However, debugging parallel applications to avoid potential failures, such as data races or deadlocks, can be challenging. Race detectors are key to spotting such concurrency bugs; nevertheless, when lock-free data structures are used, they may emit a significant number of false positives. In this paper, we present a framework for detecting semantic violations of lock-free data structures which makes use of contracts, a novel feature of the upcoming C++20, and a customized version of the ThreadSanitizer race detector. We evaluate the detection accuracy of the framework in terms of false positives and false negatives on synthetic benchmarks that exercise the SPSC and MPMC lock-free queue structures from the Boost C++ library. Thanks to this framework, we are able to check the correct use of lock-free data structures, thus reducing the number of false positives.
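As a concrete illustration of the setting, the sketch below shows the kind of single-producer/single-consumer usage such benchmarks exercise on Boost's spsc_queue. Since C++20 contract syntax was not yet available in shipping compilers, plain assertions stand in for the contract checks here, and the producer-role bookkeeping is our own illustrative assumption, not the paper's framework:

```cpp
// Minimal sketch, assuming Boost >= 1.53 (boost::lockfree). Assertions
// stand in for the contract checks the paper's framework would attach;
// the thread-role bookkeeping is our illustrative assumption.
#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <cassert>
#include <thread>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> q;
std::atomic<std::thread::id> producer_id{};  // first pusher claims the role

void checked_push(int v) {
    std::thread::id expected{}, self = std::this_thread::get_id();
    bool claimed = producer_id.compare_exchange_strong(expected, self);
    // Contract-style precondition: exactly one thread owns the producer role.
    assert(claimed || expected == self);
    (void)claimed;
    while (!q.push(v)) { /* queue full: spin; acceptable in a benchmark */ }
}

int main() {
    std::thread prod([] { for (int i = 0; i < 100; ++i) checked_push(i); });
    int v, got = 0;                       // single consumer: the main thread
    while (got < 100)
        if (q.pop(v)) { assert(v == got); ++got; }  // FIFO order must hold
    prod.join();
}
```

A second producer calling checked_push would trip the assertion, which is exactly the semantic violation (rather than a raw data race) that the framework is meant to flag.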
Ada 2022 includes parallel programming features that use lightweight logical threads of control on top of the heavier-weight Ada tasks. This talk will report on the work in progress to implement a work-stealing schedu...
Parallel programming can be difficult and error-prone, in particular if low-level optimizations are required in order to reach high performance in complex environments such as multi-core clusters using MPI and OpenMP. One approach to overcoming these issues is based on algorithmic skeletons: predefined patterns which are implemented in parallel and can be composed by application programmers without attending to low-level programming aspects. Support for algorithmic skeletons is typically provided as a library. However, optimizations are hard to implement in this setting, and programming can still be tedious because of the required boilerplate code. Thus, we propose a domain-specific language for algorithmic skeletons that performs optimizations and generates low-level C++ code. Our experimental results on four benchmarks show that the models are significantly shorter and that the execution time and speedup of the generated code often outperform equivalent library implementations using the Muenster Skeleton Library.
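For reference, the library-based approach that the DSL is compared against can be as small as the following map skeleton in plain C++17. This is our own sketch, not Muenster Skeleton Library code; the boilerplate the abstract refers to arises when user functions and data must be wrapped into such calls throughout a real application:

```cpp
// Illustrative map skeleton in plain C++17 (our sketch, not MSL code):
// the pattern is implemented once, in parallel, and the application
// programmer only supplies the element-wise function. On GCC, parallel
// execution policies require linking against TBB (-ltbb).
#include <algorithm>
#include <execution>
#include <vector>

template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(), f);
    return out;
}

int main() {
    std::vector<double> v(1 << 20, 1.5);
    auto squared = map_skeleton(v, [](double x) { return x * x; });
    return squared.empty();  // keep the result alive
}
```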
Bioinformatics is an interdisciplinary field that applies techniques from information technology, mathematics, and statistics to the study of large biological data. It involves several computational techniques such as sequence and structural alignment, data mining, macromolecular geometry, prediction of protein structure, and gene finding. Protein structure and sequence analysis are vital to the understanding of cellular processes, which in turn contributes to the development of drugs for metabolic pathways. Protein sequence alignment is concerned with identifying the similarities and relationships among different protein structures. In this paper, we target two well-known protein sequence alignment algorithms, the Needleman-Wunsch and Smith-Waterman algorithms. Both are computationally expensive, which hinders their applicability to large data sets. Thus, we propose a hybrid parallel approach that combines the capabilities of multi-core CPUs with the power of contemporary GPUs and significantly speeds up the execution of the target algorithms. The validity of our approach is tested on real protein sequences, and its scalability is verified on randomly generated sequences with predefined similarity levels. The results show that the proposed hybrid approach is up to 242 times faster than the sequential approach.
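For context, the Needleman-Wunsch recurrence that dominates the running time is the classic dynamic program sketched below. Cell (i, j) depends only on its three upper-left neighbours, so anti-diagonals of the table are independent, which is the property hybrid CPU/GPU implementations exploit. The C++ code and its scoring parameters are our illustration, not the paper's implementation:

```cpp
// Sequential Needleman-Wunsch scoring (illustrative parameters:
// match +1, mismatch -1, linear gap -2). Cell (i,j) depends only on
// (i-1,j-1), (i-1,j) and (i,j-1), so anti-diagonals can be computed
// in parallel.
#include <algorithm>
#include <string>
#include <vector>

int needleman_wunsch(const std::string& a, const std::string& b) {
    const int match = 1, mismatch = -1, gap = -2;
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1));
    for (size_t i = 0; i <= n; ++i) H[i][0] = static_cast<int>(i) * gap;
    for (size_t j = 0; j <= m; ++j) H[0][j] = static_cast<int>(j) * gap;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i][j] = std::max({H[i - 1][j - 1] + s,  // align a[i-1], b[j-1]
                                H[i - 1][j] + gap,    // gap in b
                                H[i][j - 1] + gap});  // gap in a
        }
    return H[n][m];  // global alignment score
}
```

Smith-Waterman differs mainly by clamping each cell at zero and tracking the table maximum, so the same parallelization strategy applies to both algorithms.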
Existing best-effort requester-wins implementations of transactional memory must resort to non-speculative execution to provide forward progress in the presence of transactions that exceed hardware capacity, experience page faults, or suffer contention high enough to cause livelock. Current approaches to irrevocability employ lock-based synchronization to achieve mutual exclusion when executing a transaction non-speculatively, conservatively precluding concurrency with any other transactions in order to guarantee atomicity, at the cost of degraded performance. In this article, we propose a new form of concurrent irrevocability whose goal is to minimize the loss of concurrency paid when transactions resort to irrevocability to complete. By enabling optimistic concurrency control during the non-speculative execution of a transaction as well, our proposal allows for higher parallelism than existing schemes. We describe the instruction set extensions that provide concurrent irrevocable transactions as well as the architectural extensions required to realize them on a best-effort HTM system without any modification to the cache coherence protocol. Our evaluation shows that our proposal achieves an average reduction of 12.5 percent in execution time across the STAMP benchmarks, rising to 15.8 percent on average for highly contended workloads.
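For contrast, the conventional lock-based irrevocability that the article improves upon looks roughly like the fallback path below on a best-effort HTM such as Intel TSX. This is our sketch of the baseline pattern, using the real RTM intrinsics; the article's proposal replaces the global mutual exclusion shown here with optimistic concurrency control:

```cpp
// Conventional single-global-lock fallback on a best-effort HTM (Intel
// RTM intrinsics; compile with -mrtm). Once a transaction gives up, it
// takes the lock and runs irrevocably, excluding ALL concurrent
// transactions: the concurrency loss the article's proposal targets.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_lock{false};

template <typename F>
void atomic_region(F body, int max_retries = 3) {
    for (int i = 0; i < max_retries; ++i) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock.load())   // subscribe to the lock so a
                _xabort(0xff);          // non-speculative writer aborts us
            body();
            _xend();
            return;
        }
        // Aborted: capacity overflow, conflict, page fault... retry.
    }
    // Irrevocable path: mutual exclusion with every other transaction.
    while (fallback_lock.exchange(true)) { /* spin until acquired */ }
    body();
    fallback_lock.store(false);
}
```

Because every speculative transaction reads fallback_lock into its read set, a thread entering the irrevocable path aborts all in-flight transactions, which is precisely the conservatism the proposed concurrent irrevocability relaxes.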
Computationally intensive deep neural networks (DNNs) are well-suited to run on GPUs, but newly developed algorithms usually require heavily optimized DNN routines to work efficiently, and the problem can be even harder for specialized DNN architectures. In this article, we propose a mathematical formulation that can be used to transfer algorithm-optimization knowledge across computing platforms. We observe that data movement and storage inside parallel processor architectures can be viewed as tensor transforms across memory hierarchies, making it possible to describe many memory optimization techniques mathematically. This transform, which we call the memory-efficient ranged inner-product tensor (MERIT) transform, applies not only to DNN tasks but also to many traditional machine learning and computer vision computations. Moreover, the tensor transforms can be readily mapped to existing vector processor architectures. We demonstrate that many popular applications can be converted to a succinct MERIT notation on GPUs, speeding up GPU kernels by up to 20 times while using only half as many code tokens. We also use the principle of the proposed transform to design a specialized hardware unit called the MERIT-z processor, which can be applied to a variety of DNN tasks as well as other computer vision tasks while providing area and power efficiency comparable to dedicated DNN application-specific integrated circuits (ASICs).
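The central observation, that moving data across memory hierarchies is itself a tensor transform, is easiest to see in the classical im2col lowering, where convolution becomes plain inner products over a rearranged tensor. The sketch below is our example of that general idea, not the paper's MERIT notation:

```cpp
// Classical im2col lowering (our example of a "memory transform", not
// the MERIT notation itself): each K x K sliding window of the H x W
// input becomes one row, after which convolution is a row-wise inner
// product with the flattened filter, i.e. a plain GEMM.
#include <vector>

std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
    int outH = H - K + 1, outW = W - K + 1;  // "valid" output extent
    std::vector<float> cols(static_cast<size_t>(outH) * outW * K * K);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[((static_cast<size_t>(y) * outW + x) * K + ky) * K + kx] =
                        img[static_cast<size_t>(y + ky) * W + (x + kx)];
    return cols;  // shape: (outH*outW) x (K*K), ready for a GEMM
}
```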
A wealth of important scientific and engineering applications are configured for use on high performance computing architectures using functionality found in the MPI specification. This specification provides application developers with a straightforward means of implementing their ideas for execution on distributed-memory parallel processing computers, while OpenMP directives provide a means of operating on the shared-memory regions of those computers. With the advent of machines composed of many-core processors, the strict synchronisation required by the bulk synchronous parallel (BSP) communication model can hinder performance increases. This is due to the complexity of handling load imbalances, reducing the serialisation imposed by blocking communication patterns, overlapping communication with computation and, finally, dealing with increasing memory overheads. The MPI specification provides advanced features such as non-blocking calls or shared memory to mitigate some of these factors; however, applying these features efficiently usually requires significant changes to the application structure. Task parallel programming models are being developed as a means of mitigating the abovementioned issues without requiring extensive changes to the application code. In this work, we present a methodology for developing hybrid applications based on tasks, called hierarchical domain overdecomposition with tasking (HDOT), which overcomes most of the issues found in MPI-only and traditional hybrid MPI+OpenMP applications. By emphasising the reuse of data partition schemes from the process level and applying them at the task level, it enables a natural coexistence between MPI and shared-memory programming models. The proposed methodology shows promising results in terms of programmability and performance, measured on a set of applications.
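A minimal shape of the overdecomposition pattern, under our own simplifying assumptions (a 1-D domain, one task per block, an MPI library providing MPI_THREAD_MULTIPLE), might look as follows; it is an illustration of the idea, not HDOT code:

```cpp
// Hypothetical sketch (our illustration, not HDOT itself): each MPI rank
// reuses its process-level partition to carve the local domain into
// task-sized blocks; communication is just another task, ordered against
// the blocks it touches via dependency sentinels. Build with e.g.
// `mpicxx -fopenmp`.
#include <mpi.h>
#include <vector>

void compute_block(std::vector<double>& u, int first, int last) {
    for (int i = first; i < last; ++i) u[i] = 0.5 * (u[i] + 1.0);  // stand-in kernel
}

int main(int argc, char** argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20, nblocks = 16, bs = n / nblocks;  // assumes nblocks | n
    std::vector<double> u(n, 1.0);  // this rank's subdomain (process level)

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; ++b) {  // task-level overdecomposition
            #pragma omp task depend(inout: u[b * bs]) firstprivate(b) shared(u)
            compute_block(u, b * bs, (b + 1) * bs);
        }
        if (size > 1) {  // ring shift: send last element up, receive into first
            int next = (rank + 1) % size, prev = (rank - 1 + size) % size;
            #pragma omp task depend(in: u[(nblocks - 1) * bs]) \
                             depend(inout: u[0]) shared(u) firstprivate(next, prev)
            MPI_Sendrecv(&u[n - 1], 1, MPI_DOUBLE, next, 0,
                         &u[0],     1, MPI_DOUBLE, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }  // implicit barrier: every task has completed here
    MPI_Finalize();
}
```

The point of the pattern is visible in the dependency clauses: interior blocks run concurrently with the communication task, which only waits for the two boundary blocks it actually touches.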
The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is one of the most prominent such methods. NMF is based on compute-intensive non-convex constrained minimization which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF fail to estimate the number of latent features. In practice, identifying these features is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. In this paper, we introduce a distributed NMF algorithm coupled with distributed custom clustering and a stability analysis on dense data, which we call DnMFk, to determine the number of latent variables. Results on synthetic data and the classical Swimmer data set demonstrate the accuracy of model determination while scaling nearly linearly across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte matrix.
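In symbols, the underlying problem is the standard NMF formulation below, shown with the classical Lee-Seung multiplicative updates as one common solver choice (not necessarily the solver DnMFk uses internally). The model-selection question is which rank k to pick; per the abstract, DnMFk answers it by clustering factors from many runs and measuring their stability:

```latex
% Standard NMF objective; k is the unknown number of latent features.
\min_{W \ge 0,\; H \ge 0} \; \lVert A - W H \rVert_F^2,
\qquad A \in \mathbb{R}^{m \times n}_{\ge 0},\;
       W \in \mathbb{R}^{m \times k}_{\ge 0},\;
       H \in \mathbb{R}^{k \times n}_{\ge 0}

% Classical Lee--Seung multiplicative updates (one solver choice),
% where \circ and the fraction denote element-wise product and division:
W \leftarrow W \circ \frac{A H^{\top}}{W H H^{\top}},
\qquad
H \leftarrow H \circ \frac{W^{\top} A}{W^{\top} W H}
```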
For input data of a homogeneous type, a standard convolutional neural network is normally constructed with universally applied filters that identify global patterns. For certain datasets, however, there are identifiable trends and patterns within subgroups of the input data. This research proposes a convolutional neural network that deliberately partitions the input data into groups processed by unique sets of convolutional layers, thus identifying the underlying features of individual data groups. Training and testing data are built from historical stock market prices and preprocessed so that the generated datasets are suitable for both the standard and the proposed convolutional neural network. The author also developed a software framework that constructs neural networks to perform the necessary testing. The calculation logic was implemented with parallel programming and executed on an NVIDIA graphics processing unit, allowing tests to be run without expensive hardware. Tests were executed on 134 datasets to benchmark the performance of the standard against the proposed convolutional neural network. The results show that the partitioned convolution method is capable of performance that rivals its standard counterpart. Further analysis indicates that more sophisticated methods of building datasets, larger training sets, or more training epochs can further improve the performance of the partitioned network. For suitable datasets, the proposed method could be a viable replacement for, or supplement to, the standard convolutional neural network structure.
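The core idea, separate filter sets per input partition instead of one universal filter bank, can be sketched as a grouped one-dimensional convolution. The code below is our minimal illustration; the paper's networks, datasets, and framework are of course richer:

```cpp
// Partitioned (grouped) 1-D convolution, a minimal illustration:
// the input is split into `filters.size()` contiguous groups and each
// group is convolved with its own filter, so per-group patterns get
// dedicated weights. Assumes the input length divides evenly.
#include <vector>

std::vector<float> grouped_conv1d(const std::vector<float>& x,
                                  const std::vector<std::vector<float>>& filters) {
    const int groups = static_cast<int>(filters.size());
    const int glen = static_cast<int>(x.size()) / groups;
    std::vector<float> y;
    for (int g = 0; g < groups; ++g) {
        const auto& w = filters[g];               // this group's own filter
        const int k = static_cast<int>(w.size());
        for (int i = 0; i + k <= glen; ++i) {     // "valid" convolution
            float acc = 0.f;
            for (int j = 0; j < k; ++j)
                acc += x[g * glen + i + j] * w[j];
            y.push_back(acc);
        }
    }
    return y;  // concatenated per-group feature maps
}
```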
Coprocessor architectures in High Performance Computing are prevalent in today's scientific computing clusters and require specialized knowledge for proper utilization. Various alternative paradigms for parallel and offload computation exist, but little is known about the human-factors impact of using the different paradigms. With computer science students from the University of Nevada, Las Vegas who had no previous exposure to Graphics Processing Unit programming as participants, our study compared NVIDIA CUDA C/C++ as a control group against the Thrust library, whose designers claim that its higher level of abstraction enhances programmer productivity. The trial was conducted on 91 participants and administered through our computerized testing platform. Although the study was narrowly focused on the basic steps of an offloaded computation problem and was not intended as a comprehensive evaluation of the superiority of one approach or the other, we found evidence that although Thrust was designed for ease of use, its abstractions tended to confuse students and in several cases diminished productivity. Specifically, the Thrust abstractions for (i) memory allocation through a C++ Standard Template Library-style vector call, (ii) memory transfers between the host and the Graphics Processing Unit coprocessor through an overloaded assignment operator, and (iii) execution of an offloaded routine through a generic transform call instead of a CUDA kernel routine all performed either equal to or worse than CUDA.
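Concretely, the three Thrust abstractions the study measured look like the sketch below, which uses the real Thrust API on a toy computation of our choosing:

```cpp
// The three Thrust abstractions the study examined, in one sketch
// (real Thrust API; the toy negation example is our choice):
//   (i)   allocation via an STL-style vector,
//   (ii)  host<->device transfer via overloaded assignment,
//   (iii) offloaded execution via transform instead of a CUDA kernel.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::host_vector<float> h(1 << 20, 1.0f);
    thrust::device_vector<float> d = h;          // (i) + (ii): alloc and copy in
    thrust::transform(d.begin(), d.end(), d.begin(),
                      thrust::negate<float>());  // (iii): no __global__ kernel
    h = d;                                       // (ii): copy the result back
    return h[0] == -1.0f ? 0 : 1;
}
```

The study's finding is that each of these conveniences hides a step (allocation, transfer, launch) that the explicit cudaMalloc/cudaMemcpy/kernel-launch sequence forces novices to confront directly, which may explain the observed confusion.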