The authors present and evaluate an unplugged activity to introduce parallel computing concepts to undergraduate students. Students in five CS classrooms used a deck of playing cards in small groups to consider how pa...
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatic asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful in spotting certain patterns in parallel execution that easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers (D3Q19 and SPEChpc D2Q37), and the LULESH and HPCG proxy applications. © 2023 Elsevier B.V. All rights reserved.
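To make the delay-injection idea concrete, here is a minimal sketch (our own illustration, not the paper's code) that perturbs one rank of a memory-bound, STREAM-Triad-like MPI loop. mpi4py is assumed, and the array size and the 50 ms delay are arbitrary placeholder values.

```python
# Minimal sketch: inject a one-off delay into a single MPI rank of a
# memory-bound loop and let non-blocking neighbor exchange carry the
# resulting idle wave (mpi4py assumed; sizes and delay are placeholders).
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 10_000_000                          # per-rank working set (memory-bound)
a, b, c = np.zeros(N), np.ones(N), np.ones(N)
halo_out, halo_in = np.empty(1), np.empty(1)

for step in range(100):
    if step == 10 and rank == 0:
        time.sleep(0.05)                # deliberate delay on one rank only
    a[:] = b + 1.5 * c                  # STREAM-Triad-like kernel
    halo_out[0] = a[-1]
    # non-blocking ring exchange: ranks are free to drift apart (desynchronize)
    reqs = [comm.Isend(halo_out, dest=(rank + 1) % size),
            comm.Irecv(halo_in, source=(rank - 1) % size)]
    MPI.Request.Waitall(reqs)
    b[0] = halo_in[0]
```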
The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper we investigate the effect of bottleneck evasion and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code so that its processes become desynchronized. This occurs most prominently in memory-bound programs, which is why we choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code, to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structures, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime.
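As a concrete picture of the communication/computation overlap the authors analyze, the following sketch (our own, with made-up random matrices and a simplified ring decomposition; mpi4py and SciPy assumed) posts the halo exchange first and performs the purely local part of a sparse matrix-vector multiplication while the messages are in flight.

```python
# Hypothetical sketch of overlapping halo exchange with the local part of a
# distributed SpMV (mpi4py + scipy assumed; matrices are random placeholders).
import numpy as np
from scipy.sparse import random as sparse_random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 4096
A_local = sparse_random(n_local, n_local, density=0.001, format="csr")
A_halo  = sparse_random(n_local, n_local, density=0.0001, format="csr")
x_local = np.random.rand(n_local)
x_halo  = np.empty(n_local)

# Post the halo exchange first ...
reqs = [comm.Isend(x_local, dest=(rank + 1) % size),
        comm.Irecv(x_halo, source=(rank - 1) % size)]

# ... then do the purely local part of the SpMV while messages are in flight.
y = A_local @ x_local

# Finish the exchange and apply the halo contribution.
MPI.Request.Waitall(reqs)
y += A_halo @ x_halo
```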
Big Data is an extremely massive amount of heterogeneous and multisource data which often requires fast processing and real-time analysis. Solving big data analytics problems needs powerful platforms to handle this enormous mass of data and efficient machine learning algorithms to exploit the full potential of big data. Hidden Markov models are rich statistical models that are widely used in various fields, especially for modeling and analyzing time-varying data sequences. They owe their success to the existence of many efficient and reliable algorithms. In this paper, we present ParaDist-HMM, a parallel distributed implementation of hidden Markov models for modeling and solving big data analytics problems. We describe the development and implementation of the improved algorithms and propose a Spark-based approach, consisting of a parallel distributed big data architecture in a cloud computing environment, to put the proposed algorithms into practice. We evaluated the model on synthetic and real financial data in terms of running time, speedup, and prediction quality, the latter measured by accuracy and root mean square error. Experimental results demonstrate that the ParaDist-HMM algorithms outperform other implementations of hidden Markov models in terms of processing speed and accuracy, and therefore in efficiency and effectiveness.
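As a rough picture of how such a Spark-based layout can look (this is our own toy sketch, not the authors' ParaDist-HMM implementation), one can partition many observation sequences across an RDD and score each against a fixed model with the scaled forward algorithm:

```python
# Toy sketch: scoring many observation sequences against a fixed HMM in
# parallel with PySpark (model parameters and data are placeholders).
import numpy as np
from pyspark import SparkContext

def forward_log_likelihood(obs, start, trans, emit):
    """Standard scaled forward recursion; returns log P(obs | model)."""
    alpha = start * emit[:, obs[0]]
    log_like = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        s = alpha.sum()
        log_like += np.log(s)
        alpha /= s
    return log_like

if __name__ == "__main__":
    sc = SparkContext(appName="hmm-scoring-sketch")
    # Toy 2-state model; real parameters would come from (distributed) training.
    start = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit  = np.array([[0.9, 0.1], [0.2, 0.8]])
    sequences = [np.random.randint(0, 2, 500) for _ in range(10_000)]
    scores = (sc.parallelize(sequences, numSlices=64)
                .map(lambda obs: forward_log_likelihood(obs, start, trans, emit))
                .collect())
    sc.stop()
```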
ISBN (print): 9783030602451; 9783030602444
In this article, we use a Kullback-Leibler random sample partition data model to generate a set of disjoint data blocks, where each block is a good representation of the entire data set. Every random sample partition (RSP) block has a sample distribution function similar to that of the entire data set. To obtain a statistical measure between them, Kernel Density Estimation (KDE) with a dual-tree recursion data structure is first applied to quickly estimate the probability density of each block. Then, based on the Kullback-Leibler (KL) divergence measure, we obtain the statistical similarity between a randomly selected RSP data block and the other RSP data blocks. We rank the RSP data blocks according to their divergence values in descending order and choose the first ten for ensemble classification learning. The classification models are built in parallel for the selected RSP data blocks, and the final ensemble classification model is obtained with a weighted voting ensemble strategy. The experiments were conducted by building XGBoost models on those ten blocks in parallel and incrementally ensembling them according to their KL values. The testing classification results show that our method can increase the generalization capability of the ensemble classification model. It can also reduce the model-building time in a parallel computing environment by using less than 15% of the entire data set, which alleviates the memory constraints of big data analysis.
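A compact illustration of the selection-plus-ensemble idea is sketched below. It is our own toy code: it uses SciPy's gaussian_kde and a Monte-Carlo KL estimate instead of the paper's dual-tree KDE, and the commented usage keeps the ten lowest-divergence blocks purely for illustration.

```python
# Toy sketch of KL-based block selection and a weighted-voting ensemble
# (our illustration; the paper uses dual-tree KDE and its own ranking).
import numpy as np
from scipy.stats import gaussian_kde
import xgboost as xgb

def kl_estimate(block, reference, n_samples=2000):
    """Monte-Carlo estimate of KL(block || reference) from two KDEs."""
    p = gaussian_kde(block.T)            # gaussian_kde expects shape (d, n)
    q = gaussian_kde(reference.T)
    xs = p.resample(n_samples)
    return float(np.mean(np.log(p(xs) + 1e-12) - np.log(q(xs) + 1e-12)))

def weighted_vote(models, weights, X):
    """Weighted majority vote over per-block classifiers (binary labels assumed)."""
    votes = np.zeros((X.shape[0], 2))
    for m, w in zip(models, weights):
        preds = m.predict(X).astype(int)
        votes[np.arange(X.shape[0]), preds] += w
    return votes.argmax(axis=1)

# blocks: list of (X_block, y_block) RSP partitions; reference: a pooled sample.
# divergences = [kl_estimate(Xb, reference) for Xb, _ in blocks]
# chosen = np.argsort(divergences)[:10]          # ten blocks kept for the ensemble
# models = [xgb.XGBClassifier(n_estimators=100).fit(*blocks[i]) for i in chosen]
```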
The aim of this paper is to present a parallel distributed version of the Viterbi algorithm that combines the advantages of Spark, the big data framework, and hidden Markov models to solve the decoding problem for large-scale multidimensional data. The scope of the paper includes a review of hidden Markov models, a study of the decoding problem, a presentation of related work, and a discussion of previously proposed implementations. The main part of the paper consists of a description of the development and implementation of a parallel distributed Viterbi algorithm in a cloud computing environment, followed by a description of the evaluation experiments for the presented algorithm. The results show that the proposed algorithm is faster and highly scalable, with no deterioration in forecast accuracy.
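The per-sequence decoding step that such a distributed version parallelizes is the classical Viterbi recursion. A plain log-space NumPy sketch (ours, not the paper's Spark implementation) is shown below; in a Spark setting it would simply be mapped over an RDD of sequences, as in the earlier scoring sketch.

```python
# Log-space Viterbi decoding for one observation sequence (our illustration).
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most likely hidden-state path and its log-probability."""
    T, n_states = len(obs), log_start.shape[0]
    delta = np.empty((T, n_states))              # best log-prob ending in each state
    psi   = np.empty((T, n_states), dtype=int)   # back-pointers
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):               # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```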
ISBN (print): 9781728190747
Empirical Dynamic Modeling (EDM) is a nonlinear time series causal inference framework. The latest implementation of EDM, cppEDM, has only been used for small datasets due to computational cost. With the growth of data collection capabilities, there is a great need to identify causal relationships in large datasets. We present mpEDM, a parallel distributed implementation of EDM optimized for modern GPU-centric supercomputers. We improve the original algorithm to reduce redundant computation and optimize the implementation to fully utilize hardware resources such as GPUs and SIMD units. As a use case, we run mpEDM on the AI Bridging Cloud Infrastructure (ABCI) using datasets of an entire animal brain sampled at single-neuron resolution to identify dynamical causation patterns across the brain. mpEDM is 1,530x faster than cppEDM, and a dataset containing 101,729 neurons was analyzed in 199 seconds on 512 nodes. This is the largest EDM causal inference achieved to date.
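The computational core that dominates EDM at this scale is time-delay embedding followed by exhaustive nearest-neighbour searches. The NumPy sketch below (our illustration of that core, not mpEDM's GPU code) shows the two steps that the paper optimizes for GPUs and SIMD units.

```python
# Sketch of the EDM core: delay embedding + brute-force kNN (our illustration).
import numpy as np

def delay_embed(x, E, tau=1):
    """Return the E-dimensional time-delay embedding of a 1-D series."""
    n = len(x) - (E - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(E)], axis=1)

def knn(library, targets, k):
    """For every target point, find its k nearest library points (brute force)."""
    d = np.linalg.norm(targets[:, None, :] - library[None, :, :], axis=2)
    return np.argsort(d, axis=1)[:, :k], np.sort(d, axis=1)[:, :k]

# x = ...  one neuron's activity trace (placeholder)
# emb = delay_embed(x, E=3)
# idx, dist = knn(emb, emb, k=4)   # neighbours feed simplex projection / cross mapping
```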
Because of their effectiveness and flexibility in finding useful solutions, Genetic Algorithms (GAs) are very popular search techniques for solving complex optimization problems in scientific and industrial fields. Parallel GAs (PGAs), and especially distributed ones, have usually been presented as the way to overcome the time-consuming nature of sequential GAs. When applying PGAs we can expect better performance, the reason being the exchange of knowledge during the parallel search process. The resulting distributed search differs from what sequential panmictic GAs do, and thus deserves additional study. This article presents a performance study of three different PGAs. Moreover, we investigate the effect of synchronizing communications on modern shared-memory multiprocessors. We consider the master-slave model along with synchronous and asynchronous distributed GAs (dGAs), presenting their different designs and expected similarities when running on a number of cores ranging from one to 32. The master-slave model showed competitive numerical effort versus the other dGAs and proved able to scale up well on multiprocessors. We describe how the speed-up and parallel performance of the dGAs change as the number of cores grows. Results for the island model show that synchronous and asynchronous dGAs have different numerical performance on a multiprocessor, with the asynchronous algorithm executing faster and thus being more attractive for time-demanding applications. Our results and statistical analyses help develop a novel body of knowledge on PGAs running on shared-memory multiprocessors (versus the overwhelming literature oriented to distributed-memory clusters), something useful for researchers, beginners, and end users of these techniques.
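For readers unfamiliar with the master-slave model, the sketch below (our own minimal Python illustration, not the authors' code) shows its defining property: a single population on the master, with only the fitness evaluations fanned out to worker processes; selection, variation, and replacement stay sequential.

```python
# Minimal master-slave GA sketch: fitness evaluation is parallelized, the rest
# runs on the master (toy objective, truncation selection, no crossover).
import numpy as np
from multiprocessing import Pool

def fitness(individual):
    # Placeholder objective: maximize closeness of every gene to 0.5.
    return -np.sum((individual - 0.5) ** 2)

def master_slave_ga(pop_size=64, genome_len=32, generations=100, workers=8):
    rng = np.random.default_rng(0)
    pop = rng.random((pop_size, genome_len))
    best = pop[0]
    with Pool(workers) as pool:
        for _ in range(generations):
            fits = np.array(pool.map(fitness, pop))            # parallel evaluation
            best = pop[fits.argmax()].copy()                    # elitism
            parents = pop[fits.argsort()[-pop_size // 2:]]      # truncation selection
            children = parents[rng.integers(0, len(parents), pop_size - 1)]
            mask = rng.random(children.shape) < 0.05            # mutation
            children[mask] = rng.random(int(mask.sum()))
            pop = np.vstack([best, children])
    return best

if __name__ == "__main__":   # guard needed for multiprocessing on some platforms
    print(master_slave_ga())
```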
ISBN (print): 9781728159751
In this work we present the experience of the course "Build your own supercomputer with Raspberry Pi", offered as a non-mandatory workshop with the purpose of bringing High Performance Computing (HPC) closer to bachelor students of Universitat Jaume I (UJI, Spain). The intention of the course is twofold: on the one hand, we aim to increase Computer Science and Engineering students' knowledge of the work performed by the HPC community; on the other hand, we aim to create a personalized experience for each student by fulfilling their curiosity about the topics presented and discussed in class. To evaluate the impact and learning, we analyze two surveys filled out by the students before and after the course, respectively, which capture their interest in and knowledge of HPC.
ISBN (print): 9783319911892
A model complex and algorithms for assessing the technical condition (TC) and reliability of the launch vehicle (LV) "Soyuz-2", with decision support (DS) for managing its life cycle (LC), are considered in this article. On the basis of an analysis of current problems and of the requirements for the efficiency, quality, and reliability of assessing the TC and reliability of the LV, it is concluded that the new intelligent information technology (IIT) presented in the article is necessary when designing automated systems for monitoring the condition of the LV "Soyuz-2" and supporting decisions on managing its LC. As a theoretical basis for this technology, a modification of the generalized computational model (GCM) is considered as a knowledge representation model that makes it possible to build simulation-analytical model-based complexes for monitoring the condition of, and managing, complex organizational and technical objects (COTO).