This paper presents a simple and efficient approach for finding the bridges and failure points in a densely connected network mapped as a graph. The algorithm presented here is a parallel algorithm which works in a di...
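The bridge-finding problem this abstract targets has a classical sequential baseline, Tarjan's one-pass depth-first search; the sketch below shows that baseline, not the paper's parallel algorithm:

```python
def find_bridges(adj):
    """Tarjan-style bridge finding via one DFS: edge (u, v) is a
    bridge iff no back edge from v's subtree reaches u or above,
    i.e. low[v] > disc[u]. Sequential sketch only; the paper's
    contribution is a parallel/distributed variant."""
    disc, low, bridges = {}, {}, []
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v not in disc:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:      # v's subtree cannot bypass (u, v)
                    bridges.append((u, v))
            elif v != parent:
                low[u] = min(low[u], disc[v])

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return bridges

# Triangle 0-1-2 with a pendant node 3: only (2, 3) is a bridge.
example = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```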
Application performance on graphical processing units (GPUs), in terms of execution speed and memory usage, depends on the efficient use of hierarchical memory. It is expected that enhancing data locality in molecular dynamics simulations will lower the cost of data movement across the GPU memory hierarchy. The work presented in this article analyses the spatial data locality and data reuse characteristics of row-major, Hilbert and Morton orderings and the impact these have on the performance of molecular dynamics simulations. A simple cache model is presented, and this is found to give results that are consistent with the timing results for the particle force computation obtained on NVidia GeForce GTX960 and Tesla P100 GPUs. Further analysis of the observed memory use, in terms of cache hits and the number of memory transactions, provides a more detailed explanation of execution behaviour for the different orderings. To the best of our knowledge, this is the first study to investigate memory analysis and data locality issues for molecular dynamics simulations of Lennard-Jones fluids on NVidia's Maxwell and Tesla architectures.
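As a concrete illustration of one of the orderings studied, a Morton (Z-order) code interleaves the bits of cell coordinates so that spatially nearby cells receive nearby indices; this sketch is illustrative and is not code from the paper:

```python
def morton_encode(x: int, y: int) -> int:
    """Morton (Z-order) code of a 2D cell: interleave the bits of x
    and y, so sorting by the code clusters neighbouring cells."""
    code = 0
    for i in range(16):                    # 16 bits per coordinate
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Reordering particle cells by Morton code improves spatial locality
# when the force loop walks the particle array.
cells = [(3, 1), (0, 0), (1, 1), (2, 2)]
ordered = sorted(cells, key=lambda c: morton_encode(*c))
```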
The integration of reduced-order models with high-performance computing is critical for developing digital twins, particularly for real-time monitoring and predictive maintenance of industrial systems. This paper presents a comprehensive, high-performance computing-enabled workflow for developing and deploying projection-based reduced-order models for large-scale mechanical simulations. We use PyCOMPSs’ parallel framework to efficiently execute reduced-order model training simulations, employing parallel singular value decomposition algorithms such as randomized singular value decomposition, Lanczos singular value decomposition, and full singular value decomposition based on tall-skinny QR. Moreover, we introduce a partitioned version of the hyperreduction scheme known as the Empirical Cubature Method to further enhance computational efficiency in projection-based reduced-order models for mechanical systems. Despite the widespread use of high-performance computing for projection-based reduced-order models, there is a significant lack of publications detailing comprehensive workflows for building and deploying end-to-end projection-based reduced-order models in high-performance computing environments. Our workflow is validated through a case study focusing on the thermal dynamics of a motor, a multiphysics problem involving convective heat transfer and mechanical components. The projection-based reduced-order model is designed to deliver a real-time prognosis tool that could enable rapid and safe motor restarts post-emergency shutdowns under different operating conditions, demonstrating its potential impact on the practice of simulations in engineering mechanics. To facilitate deployment, we use the High-Performance Computing Workflow as a Service strategy and Functional Mock-Up Units to ensure compatibility and ease of integration across high-performance computing, edge, and cloud environments. The outcomes illustrate the efficacy of combining projection-based reduc
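One of the building blocks named in the abstract, randomized singular value decomposition, can be sketched in a few lines of NumPy (Halko-style range finding; the function name and defaults here are illustrative, not the paper's or PyCOMPSs' API):

```python
import numpy as np

def randomized_svd(A, k, oversample=5, seed=0):
    """Randomized SVD sketch: project A onto a random low-dimensional
    subspace, orthonormalize to approximate range(A), then take the
    exact SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)       # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                          # small matrix sharing A's row space
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# For a rank-2 matrix, a rank-2 randomized SVD reconstructs it exactly.
A = np.arange(12.0).reshape(4, 3)        # rows form arithmetic progressions: rank 2
U, s, Vt = randomized_svd(A, k=2)
```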
ISBN:
(Print) 9781665407601
Complex networks are large, and their analysis requires significantly different methods than small networks. Parallel processing is needed to analyse these networks in a timely manner. Graph centrality measures provide convenient ways to assess the structure of these networks. We review the main centrality algorithms, describe an implementation of closeness centrality in Python, propose a simple parallel closeness centrality algorithm, and show its implementation in Python together with the obtained results.
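A sequential Python baseline for closeness centrality via breadth-first search can be sketched as follows (an illustrative sketch, not the paper's code):

```python
from collections import deque

def closeness_centrality(adj, v):
    """Closeness centrality of v in an unweighted graph:
    (n - 1) / sum of shortest-path distances from v,
    with distances found by BFS."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Path graph 0-1-2: distances from 0 are 0, 1, 2, so C(0) = 2/3.
adj = {0: [1], 1: [0, 2], 2: [1]}
```

Each vertex's BFS is independent of the others, which is why the per-vertex computation parallelizes naturally, e.g. with a `multiprocessing.Pool` mapping over the vertex set.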
ISBN:
(Print) 9781450323789
Recently, there has been substantial interest in the study of various random networks as mathematical models of complex systems. As these complex systems grow larger, the ability to generate progressively large random networks becomes all the more important. This motivates the need for efficient parallel algorithms for generating such networks. Naive parallelization of the sequential algorithms for generating random networks may not work due to the dependencies among the edges and the possibility of creating duplicate (parallel) edges. In this paper, we present MPI-based distributed memory parallel algorithms for generating random scale-free networks using the preferential-attachment model. Our algorithms scale very well to a large number of processors and provide almost linear speedups. The algorithms can generate scale-free networks with 50 billion edges in 123 seconds using 768 processors.
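The sequential preferential-attachment generator whose edge dependencies complicate naive parallelization can be sketched as follows (an illustrative sequential sketch, not the paper's MPI algorithm):

```python
import random

def barabasi_albert(n, m, seed=None):
    """Sequential preferential attachment: each new node attaches m
    edges to existing nodes chosen proportionally to degree. The
    'targets' list repeats each node once per incident edge, so
    uniform sampling from it is degree-proportional. Each new node's
    choices depend on all earlier edges, which is the dependency
    that defeats naive parallelization."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))           # initial seed nodes
    for v in range(m, n):
        chosen = set()
        while len(chosen) < m:         # resample to avoid duplicate edges
            chosen.add(rng.choice(targets))
        for u in chosen:
            edges.append((v, u))
        targets.extend(chosen)
        targets.extend([v] * m)
    return edges

edges = barabasi_albert(1000, 3, seed=42)
```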
In this paper, we present the staggered parallel short-time Fourier transform, an algorithm that uses a quasi-parallel procedure to compute exact STFT coefficients of 1D signals. The algorithm combines parallelism with the capacity of feedforward STFT algorithms to re-use prior computations. It does this by carefully organizing input signals and collecting past computations into 2D memory buffers. Re-using stored information in memory enables fast computation of up to N/2 FFTs in parallel. The algorithm's time complexity is O[6T] under an abstract circuit implementation, a complexity measure that is independent of the sample complexity N. Its time complexity is asymptotically equivalent to that of the best possible exact algorithm, which runs in O[T] time, with a constant efficiency of O[1] relative to the best known sequential algorithm. This efficiency property holds both in an abstract circuit implementation and in a CPU implementation with a limited number of cores. In general, the algorithm uses fewer processors than other parallel STFT algorithms but can require more memory. To test the algorithm's properties, we implement several STFT algorithms on a CPU with varying numbers of cores. These algorithms use FFT, iterative, or feedforward schemes to cover the range of existing STFT algorithms for comparison. In our experimental results, the proposed algorithm has the lowest running time among exact STFT algorithms while using fewer CPU cores than other parallel implementations. (C) 2019 Elsevier Inc. All rights reserved.
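The baseline exact STFT against which feedforward and parallel variants are compared simply slides a window along the signal and transforms each frame; the sketch below uses a naive DFT for self-containment and is not the paper's staggered algorithm:

```python
import cmath

def dft(frame):
    """Naive O(N^2) DFT of one frame (for illustration only; a real
    implementation would use an FFT)."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def stft(signal, frame_len, hop):
    """Reference exact STFT: slide a frame_len-sample window by hop
    samples and transform each frame independently. Feedforward
    schemes instead re-use the overlap between consecutive frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return [dft(f) for f in frames]

# 16-sample alternating signal, 4-sample frames, 50% overlap -> 7 frames.
coeffs = stft([0.0, 1.0, 0.0, -1.0] * 4, frame_len=4, hop=2)
```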
ISBN:
(Print) 9798350396386
Most current research on object detection improves the whole framework in order to increase detection accuracy, but another problem in object detection is detection speed: the more complex the architecture, the slower the inference. In this work, we implemented a Single Shot Multibox Detector (SSD) on a GPU. We have improved the object detection speed of SSD, one of the most commonly used object detection frameworks. The most time-consuming part, the VGG16 network, is reimplemented using cuDNN, which makes it about 9% faster. The second most time-consuming part is post-processing, where non-maximum suppression (NMS) is performed. We accelerated NMS by implementing new algorithms suited to GPUs, which are about 52% faster than the original PyTorch version [11]. We also ported the parts originally executed on the CPU to the GPU. In total, our GPU-accelerated SSD detects objects 22.5% faster than the original version. We demonstrate that using GPUs to accelerate existing frameworks is a viable approach.
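Greedy NMS, the post-processing step this paper accelerates, can be sketched as follows (a pure-Python reference, not the paper's GPU kernel):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    discard boxes overlapping it above thresh, repeat. The inner IoU
    comparisons are independent, which is what a GPU implementation
    parallelizes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

# Two heavily overlapping boxes plus one distant box: the lower-scoring
# overlap is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
```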
ISBN:
(Print) 9783642366086; 9783642366079
Information-seeking applications employ information filtering as a main component of their functioning. The purpose of the present article is to explore techniques for implementing scalable and efficient information filtering, based on the XML representation, when both structural and value constraints are imposed. Most existing implementations use the XML representation on single-processor systems, represent user profiles with the XPath query language, and employ efficient heuristics to constrain the complexity of the filtering mechanism. Here, we propose a parallel filtering algorithm based on the well-known YFilter algorithm, which dynamically applies a work-load balancing approach to each thread to achieve the best parallelization. In addition, the proposed filtering algorithm adds support for value-based predicates by embedding three different algorithms for handling value constraints during XML filtering, based on the popularity and the semantic interpretation of the predicate values. Experimental results show that the proposed system outperforms previous parallel approaches to the XML filtering problem.
Language communication and understanding involve ever more fields: human-computer interaction, translation systems, and similar technologies have entered our lives and are changing the way we live and work. The GLR algorithm can analyse English translation sentences well, and its analysis time complexity is lower than that of comparable parallel algorithms. The main purpose of this paper is to design and study an intelligent recognition model for English translation based on GLR and cloud computing. The paper outlines the main points of the GLR algorithm, the characteristics of the word algorithm, and the classification of words and collocations by computation, which facilitates intelligent processing. Experiments show that the accuracy of English translation improves by 24% after proofreading, a clear difference. The intelligent English-translation recognition model of this system is therefore effective.
We provide time lower bounds for sequential and parallel algorithms deciding bisimulation on labelled transition systems that use partition refinement. For sequential algorithms this is Ω((m + n) log n) and for parallel a...