In this work, two parallel techniques based on shared-memory programming are presented. These models are especially suitable for application to evolutionary algorithms. To study their performance, the algorithm UEGO (Universal Evolutionary Global Optimizer) has been chosen.
Given a target protein structure, the prime objective of protein design is to find amino acid sequences that will fold into the given three-dimensional structure. The protein design problem belongs to the non-deterministic polynomial-time-hard class, as the sequence search space increases exponentially with protein length. To ensure better search space exploration and faster convergence, we propose a protein modularity-based parallel protein design algorithm. The modular architecture of the protein structure is exploited by considering an intermediate structural organization between secondary structure and domain, defined as the protein unit (PU). Here, we have incorporated a divide-and-conquer approach in which a protein is split into PUs and each PU region is explored in parallel. Further analysis shows that our shared-memory implementation of modularity-based parallel sequence search leads to better search space exploration than traditional full-protein design. Sequence-based analysis of the designed sequences shows an average of 39.7% sequence similarity on the benchmark data set. Structure-based comparison of the modeled structures of the designed proteins with the target structures exhibited an average root-mean-square deviation of 1.17 angstrom and an average template modeling score of 0.89. The selected modeled structures of the designed protein sequences are validated using 100 ns molecular dynamics simulations, in which 80% of the proteins have shown better or similar stability to the respective target proteins. Our study indicates that our modularity-based protein design algorithm can be extended to protein interaction design as well.
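The divide-and-conquer idea in this abstract can be sketched in a few lines: split the target at unit boundaries and explore each region concurrently. This is a toy illustration, not the paper's algorithm; the alphabet, the greedy per-unit search, and the function names are all invented here for demonstration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy setup: a target "structure" is a string, split into
# protein-unit (PU) regions, each designed independently and in parallel.
ALPHABET = "ACDEFG"  # toy amino-acid alphabet (illustrative only)

def design_unit(target_unit):
    """Stand-in per-PU search: greedily pick the residue matching the target."""
    return "".join(
        max(ALPHABET, key=lambda aa: 1 if aa == pos else 0) for pos in target_unit
    )

def parallel_design(target, boundaries):
    """Split the target at PU boundaries and explore each region concurrently."""
    units = [target[a:b] for a, b in zip([0] + boundaries, boundaries + [len(target)])]
    with ThreadPoolExecutor() as pool:
        # pool.map preserves order, so the designed PUs concatenate correctly
        return "".join(pool.map(design_unit, units))
```

Because each PU is searched independently, the per-unit search spaces add rather than multiply, which is the source of the claimed speed-up over full-protein design.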
Finding a good graph coloring quickly is often a crucial phase in the development of efficient, parallel algorithms for many scientific and engineering applications. In this paper we consider the problem of solving the graph coloring problem itself in parallel. We present a simple and fast parallel graph coloring heuristic that is well suited for shared-memory programming and yields an almost linear speedup on the PRAM model. We also present a second heuristic that improves on the number of colors used. The heuristics have been implemented using OpenMP. Experiments conducted on an SGI Cray Origin 2000 supercomputer using very large graphs from finite element methods and eigenvalue computations validate the theoretical run-time analysis. Copyright (C) 2000 John Wiley & Sons, Ltd.
In recent years, high-performance computing and powerful supercomputers have become a staple in many areas of academia and industry. The author introduces the concepts of shared-memory programming in the context of solving the heat equation, which allows the exploration of several finite difference and parallelization schemes.
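The classic explicit finite-difference scheme behind this kind of exercise updates each grid point from its two neighbours, u_i^(n+1) = u_i^n + r (u_{i-1} - 2 u_i + u_{i+1}) with r = alpha*dt/dx^2 <= 1/2 for stability; the loop over i is the natural candidate for a shared-memory parallel loop. A minimal serial sketch (not from the article):

```python
def heat_step(u, r):
    """One explicit time step of the 1-D heat equation,
    with fixed (Dirichlet) boundary values. In OpenMP-style shared
    memory, this loop over interior points is the parallel region."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

def solve(u0, r, steps):
    """March the explicit scheme forward `steps` time steps."""
    u = u0[:]
    for _ in range(steps):
        u = heat_step(u, r)
    return u
```

Writing into a fresh array `new` rather than updating `u` in place is what makes the iterations independent, and hence safely parallelizable.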
We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. (c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license.
In the era of Exascale computing, writing efficient parallel programs is indispensable and, at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this article, we propose LLOV, a fast, lightweight, language-agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and F1 score of LLOV are comparable to those of other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.
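The kind of bug such checkers flag is the unsynchronised read-modify-write on shared state. A minimal Python analogue (using `threading` in place of OpenMP; the lock plays the role of an OpenMP critical/atomic region):

```python
import threading

counter = 0
lock = threading.Lock()

def racy_add(n):
    """Data race: `counter += 1` is a read-modify-write, so concurrent
    unsynchronised updates from several threads can be lost."""
    global counter
    for _ in range(n):
        counter += 1

def safe_add(n):
    """Guarding the update with a lock removes the race."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, n_threads=4, n=10_000):
    """Reset the counter, run `worker` on n_threads threads, return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(n,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With `safe_add` the result is deterministic (`n_threads * n`); with `racy_add` lost updates are possible, which is exactly the nondeterminism that makes races hard to find by testing and motivates static checking.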
ISBN: (Print) 9798350305487
Modern classification problems tackled by using Decision Tree (DT) models often require demanding constraints in terms of accuracy and scalability. This is often hard to achieve due to the ever-increasing volume of data used for training and testing. Bayesian approaches to DTs using Markov Chain Monte Carlo (MCMC) methods have demonstrated great accuracy in a wide range of applications. However, the inherently sequential nature of MCMC makes it unsuitable to meet both accuracy and scaling constraints. One could run multiple MCMC chains in an embarrassingly parallel fashion. Despite the improved runtime, this approach sacrifices accuracy in exchange for strong scaling. Sequential Monte Carlo (SMC) samplers are another class of Bayesian inference methods that also have the appealing property of being parallelizable without trading off accuracy. Nevertheless, finding an effective parallelization for the SMC sampler is difficult, due to the challenges in parallelizing its bottleneck, redistribution, in such a way that the workload is equally divided across the processing elements, especially when dealing with variable-size models such as DTs. This study presents a parallel SMC sampler for DTs on shared-memory (SM) architectures, with an O(log2 N) parallel redistribution for variable-size samples. On an SM machine with 32 cores, the experimental results show that our proposed method scales up to a factor of 16 compared to its serial implementation, and provides comparable accuracy to MCMC, but 51 times faster.
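The redistribution step named as the bottleneck here is, in essence: after resampling assigns each sample an integer copy count, expand the samples into a new, equally weighted population. A serial sketch (not the paper's O(log2 N) algorithm, though the write offsets below are exactly the prefix sums that such parallel schemes compute):

```python
from itertools import accumulate

def redistribute(samples, counts):
    """Expand each samples[i] into counts[i] copies, preserving order.
    The prefix sums over `counts` give each sample a disjoint output
    slice, so in a parallel version every i can write independently."""
    assert len(samples) == len(counts)
    offsets = [0] + list(accumulate(counts))  # exclusive prefix sums
    out = [None] * offsets[-1]
    for i, s in enumerate(samples):
        out[offsets[i]:offsets[i + 1]] = [s] * counts[i]
    return out
```

Because the copy counts vary per sample (and, for variable-size models like decision trees, so do the samples themselves), naive static loop partitioning leaves the workload unbalanced, which is why the paper's balanced parallel redistribution is nontrivial.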
ISBN: (Print) 9781450375887
Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult: developers were forced to use programming models that exposed multiple memory regions, requiring developers to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming. In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.
ISBN: (Print) 9781605584133
The engagement of cluster and grid computing, two popular trends of today's high-performance computation, has formed an imperative need for efficient utilization of the afforded resources. In this paper we present the concept, design and implementation of the Pleiad platform. Having its origin in the proposition of distributed shared memory (DSM), Pleiad is a cluster middleware that provides a shared-memory abstraction enabling transparent multithreaded execution across the cluster nodes. It belongs to the new generation of cluster middleware that, aside from providing a proof of concept for unifying the cluster memory resources, aims to achieve satisfactory levels of performance and scalability for a broad range of multithreaded applications. First results from the performance evaluation of Pleiad appear encouraging, and they are presented in comparison with an efficient implementation of MPI for the Java platform.
ISBN: (Print) 9781728127941
Various partitioned global address space (PGAS) languages capable of providing global-view programming environments on multi-node computer systems have been proposed to improve programming productivity in high-performance computing. However, several PGAS languages often require a detailed description of the remote data access, similar to descriptions used in message passing interface one-sided communications. Some PGAS languages have limitations pertaining to remote data access and recommend their local-view programming models, rather than the global-view ones, for performance reasons. In this study, we propose SMint, an application programming interface that provides a global-view programming model with a software distributed shared memory, mSMS, as the runtime. Using stencil computation as a typical processing method, the performance and programmability of SMint have been compared with those of XcalableMP and Unified Parallel C, which are well-known examples of PGAS languages based on the C language. It was found that SMint achieved the best performance under the ideal global-view programming model.
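The global-view versus local-view contrast in this abstract can be illustrated with a 1-D three-point stencil: in a global view the programmer indexes one shared array directly, whereas a local view partitions the array into chunks with halo cells and stitches the results. This is a toy sketch with invented names, not SMint's API:

```python
def stencil_global(u):
    """Global-view style: index the whole shared array directly."""
    return [u[i - 1] + u[i] + u[i + 1] for i in range(1, len(u) - 1)]

def stencil_local(u, nparts=2):
    """Local-view style: split the interior into chunks, give each chunk
    one halo cell per side, compute locally, then concatenate."""
    n = len(u) - 2                       # number of interior points
    step = (n + nparts - 1) // nparts    # chunk size (last chunk may be short)
    out = []
    for p in range(nparts):
        lo = 1 + p * step
        hi = min(1 + (p + 1) * step, n + 1)
        local = u[lo - 1:hi + 1]         # chunk interior plus halo cells
        out += [local[i - 1] + local[i] + local[i + 1]
                for i in range(1, len(local) - 1)]
    return out
```

Both produce identical results; the difference the paper measures is in how much of this chunk-and-halo bookkeeping the language makes the programmer write, and at what performance cost the runtime can hide it.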