The authors present and evaluate an unplugged activity to introduce parallel computing concepts to undergraduate students. Students in five CS classrooms used a deck of playing cards in small groups to consider how pa...
Numerical simulation plays a key role in industrial design because it reduces the time and cost of developing new products. In the face of international competition, it is important to have a complete chain of simulation tools with which to perform virtual prototyping efficiently. In this paper, we describe two components of large aeronautic numerical simulation chains that are extremely demanding of computer resources. The first is used in computational fluid dynamics for aerodynamic studies. The second is used to study wave propagation phenomena in acoustics. Because these software packages are used to analyze large and complex case studies in a limited amount of time, they are implemented on parallel distributed computers. We describe the physical problems addressed by these codes and the main characteristics of their implementation. For the sake of re-usability and interoperability, these packages are developed using object-oriented technologies. We illustrate their parallel performance on clusters of symmetric multiprocessors. Finally, we discuss some challenges for future generations of parallel distributed numerical software, which will have to enable the simulation of multi-physics phenomena in the context of virtual organizations, also known as the extended enterprise. (c) 2005 Elsevier Inc. All rights reserved.
ISBN:
(Print) 9783030602451; 9783030602444
In this article, we use a Kullback-Leibler random sample partition data model to generate a set of disjoint data blocks, where each block is a good representation of the entire data set. Every random sample partition (RSP) block has a sample distribution function similar to that of the entire data set. To obtain a statistical measure between them, Kernel Density Estimation (KDE) with a dual-tree recursion data structure is first applied to quickly estimate the probability density of each block. Then, based on the Kullback-Leibler (KL) divergence measure, we obtain the statistical similarity between a randomly selected RSP data block and the other RSP data blocks. We rank the RSP data blocks by their divergence values in descending order and choose the first ten for ensemble classification learning. The classification models are built in parallel for the selected RSP data blocks, and the final ensemble classification model is obtained by a weighted voting ensemble strategy. The experiments were conducted by building XGBoost models on those ten blocks in parallel and incrementally ensembling them according to their KL values. The test classification results show that our method can increase the generalization capability of the ensemble classification model. It can reduce the model building time in a parallel computation environment by using less than 15% of the entire data, which also alleviates the memory constraints of big data analysis.
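The abstract's ranking step (estimate each block's density, compare it to a reference block with KL divergence, sort) can be sketched as follows. This is a toy reconstruction, not the paper's code: it uses plain grid-based Gaussian KDE instead of the dual-tree recursion, one-dimensional synthetic data, and invented block/grid sizes.

```python
import math
import random

def gaussian_kde(sample, grid, bandwidth=0.3):
    """Estimate the density of `sample` at each grid point (plain O(n*m) KDE;
    the paper's dual-tree recursion serves the same purpose, only faster)."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2)
                       for x in sample) for g in grid]

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) over densities sampled on a grid."""
    ps, qs = sum(p), sum(q)
    return sum((pi / ps) * math.log((pi / ps + eps) / (qi / qs + eps))
               for pi, qi in zip(p, q))

random.seed(0)
# Toy stand-ins for RSP blocks: disjoint random samples of one data set.
data = [random.gauss(0.0, 1.0) for _ in range(3000)]
random.shuffle(data)
blocks = [data[i::6] for i in range(6)]          # six disjoint "RSP blocks"
grid = [x / 10.0 for x in range(-40, 41)]        # density evaluation grid

reference = gaussian_kde(blocks[0], grid)        # randomly selected block
scores = [(i, kl_divergence(reference, gaussian_kde(b, grid)))
          for i, b in enumerate(blocks[1:], start=1)]
# Rank candidate blocks by divergence; the top-ranked ones feed the ensemble.
ranking = sorted(scores, key=lambda s: s[1], reverse=True)
print([i for i, _ in ranking])
```

In the paper the models trained on the selected blocks are then combined by weighted voting; here only the selection stage is shown.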
Big Data is an extremely massive amount of heterogeneous and multi-source data which often requires fast processing and real-time analysis. Solving big data analytics problems requires powerful platforms to handle this enormous mass of data and efficient machine learning algorithms to exploit big data's full potential. Hidden Markov models are rich statistical models, widely used in various fields, especially for modeling and analyzing time-varying data sequences. They owe their success to the existence of many efficient and reliable algorithms. In this paper, we present ParaDist-HMM, a parallel distributed implementation of the hidden Markov model for modeling and solving big data analytics problems. We describe the development and implementation of the improved algorithms, and we propose a Spark-based approach consisting of a parallel distributed big data architecture in a cloud computing environment to put the proposed algorithms into practice. We evaluated the model on synthetic and real financial data in terms of running time, speedup, and prediction quality, the latter measured by accuracy and root mean square error. Experimental results demonstrate that the ParaDist-HMM algorithms outperform other implementations of hidden Markov models in processing speed and accuracy, and therefore in efficiency and effectiveness.
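The abstract does not spell out ParaDist-HMM's internals, but the "efficient and reliable algorithms" it refers to include the classic forward recurrence for sequence likelihood, which is the kind of kernel a Spark-based implementation distributes across sequence chunks or observation batches. A minimal serial sketch, with all model numbers invented:

```python
# Classic HMM forward algorithm (sequence likelihood). This is only the
# standard serial recurrence, not the paper's distributed variant.
def forward(obs, start_p, trans_p, emit_p):
    """Return P(obs | model) for a discrete-emission HMM.
    start_p[i]: P(state i at t=0); trans_p[i][j]: P(j | i);
    emit_p[i][o]: P(symbol o | state i)."""
    n_states = len(start_p)
    alpha = [start_p[i] * emit_p[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans_p[i][j] for i in range(n_states))
                 * emit_p[j][o] for j in range(n_states)]
    return sum(alpha)

# Two hidden states, two observable symbols (all probabilities hypothetical).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward([0, 1, 0], start, trans, emit)
print(round(likelihood, 6))  # -> 0.10893
```

Because each time step only needs the previous alpha vector, long sequences can be cut into chunks whose boundary vectors are exchanged between workers, which is the natural parallelization point.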
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatic asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful for spotting patterns in parallel execution that easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers (D3Q19 and SPEChpc D2Q37), and the LULESH and HPCG proxy applications. (c) 2023 Elsevier B.V. All rights reserved.
Because of their effectiveness and flexibility in finding useful solutions, Genetic Algorithms (GAs) are very popular search techniques for solving complex optimization problems in scientific and industrial fields. Parallel GAs (PGAs), and especially distributed ones, have usually been presented as the way to overcome the time-consuming shortcoming of sequential GAs. When applying PGAs, we can expect better performance because knowledge is exchanged during the parallel search process. The resulting distributed search differs from what sequential panmictic GAs do, and therefore deserves additional study. This article presents a performance study of three different PGAs. Moreover, we investigate the effect of synchronizing communications on modern shared-memory multiprocessors. We consider the master-slave model along with synchronous and asynchronous distributed GAs (dGAs), presenting their different designs and expected similarities when running on one to 32 cores. The master-slave model showed competitive numerical effort versus the other dGAs and was shown to scale up well on multiprocessors. We describe how the speedup and parallel performance of the dGAs change as the number of cores grows. Results for the island model show that synchronous and asynchronous dGAs have different numerical performance on a multiprocessor, with the asynchronous algorithm executing faster and therefore being more attractive for time-demanding applications. Our results and statistical analyses help develop a novel body of knowledge on PGAs running on shared-memory multiprocessors (versus the overwhelming literature oriented to distributed-memory clusters), something useful for researchers, beginners, and end users of these techniques.
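The island (distributed) model the abstract compares can be illustrated with a toy: two populations evolve independently and periodically exchange their best individuals. This is a single-process sketch on the OneMax problem with invented parameters; in a real dGA each island runs on its own core and migration is an (a)synchronous message.

```python
import random

# Minimal island-model distributed GA on the OneMax toy problem
# (maximize the number of 1-bits in a fixed-length bit string).
def evolve(pop, n_bits, rng):
    """One generation: tournament selection, one-point crossover, mutation."""
    def pick():
        return max(rng.sample(pop, 3), key=sum)  # tournament of size 3
    nxt = []
    while len(nxt) < len(pop):
        a, b = pick(), pick()
        cut = rng.randrange(1, n_bits)
        child = a[:cut] + b[cut:]                # one-point crossover
        if rng.random() < 0.1:                   # occasional bit flip
            i = rng.randrange(n_bits)
            child = child[:i] + [1 - child[i]] + child[i + 1:]
        nxt.append(child)
    return nxt

rng = random.Random(42)
n_bits, pop_size, generations = 20, 30, 40
islands = [[[rng.randint(0, 1) for _ in range(n_bits)]
            for _ in range(pop_size)] for _ in range(2)]

for gen in range(generations):
    islands = [evolve(p, n_bits, rng) for p in islands]
    if gen % 5 == 0:       # migration: best of each island replaces a
        for src, dst in ((0, 1), (1, 0)):  # random individual of the other
            best = max(islands[src], key=sum)
            islands[dst][rng.randrange(pop_size)] = best[:]

best = max((max(p, key=sum) for p in islands), key=sum)
print(sum(best))
```

The synchronous/asynchronous distinction studied in the article is about whether an island blocks while waiting for migrants; in this sequential sketch migration is trivially synchronous.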
Currently, many interconnection networks and parallel algorithms exist for message-passing computers. Users of these machines wish to determine which message-passing computer is best for a given job, and how it will scale with the number of processors and the algorithm size. The paper describes a general-purpose simulator for message-passing multiprocessors (Parsim), which facilitates system modelling. A structured method for simulator design has been used, which gives Parsim the ability to simulate different topology and algorithm combinations easily. This is illustrated by applying Parsim to a number of algorithms on a variety of topologies. Parsim is then used to predict the performance of the new IBM SP2 parallel computer, with topologies ranging up to 1024 processors.
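The key design idea the abstract mentions, simulating different topology/algorithm combinations interchangeably, can be sketched as follows. This is not Parsim's design, just a toy illustration of the separation: topologies expose a hop-count metric, and an algorithm's communication pattern is costed against whichever topology is plugged in. The latency figure is invented.

```python
# Two interchangeable topologies exposing the same hops() interface.
class Ring:
    def __init__(self, n): self.n = n
    def hops(self, src, dst):
        d = abs(src - dst)
        return min(d, self.n - d)        # shortest way around the ring

class Hypercube:
    def __init__(self, n): self.n = n    # n must be a power of two
    def hops(self, src, dst):
        return bin(src ^ dst).count("1")  # Hamming distance

def simulate(topology, messages, hop_latency_us=1.0):
    """Cost a list of (src, dst) messages, sent one after another."""
    return sum(topology.hops(s, d) * hop_latency_us for s, d in messages)

# "Algorithm": every rank sends its result to rank 0 (a naive gather).
n = 16
gather = [(r, 0) for r in range(1, n)]
print(simulate(Ring(n), gather), simulate(Hypercube(n), gather))
# -> 64.0 32.0: the same algorithm costed on two different topologies.
```

A real simulator would add contention, link bandwidth, and overlapping transfers, but the plug-in structure is the point here.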
With the proliferation of workstation clusters connected by high-speed networks, providing efficient system support for concurrent applications engaging in nontrivial interaction has become an important problem. Two principal barriers to harnessing parallelism are: (1) efficient mechanisms that achieve transparent dependency maintenance while preserving semantic correctness, and (2) scheduling algorithms that match coupled processes to distributed resources while explicitly incorporating their communication costs. This paper describes a set of performance features and their properties and implementation in a system support environment called DUNES that achieves transparent dependency maintenance (IPC, file access, memory access, process creation/termination, process relationships) under dynamic load balancing. The two principal performance features are push/pull-based active and passive end-point caching and communication-sensitive load balancing. Collectively, they mitigate the overhead introduced by the transparent dependency maintenance mechanisms. Communication-sensitive load balancing, in addition, affects the scheduling of distributed resources to application processes, where both communication and computation costs are explicitly taken into account. DUNES' architecture endows commodity operating systems with distributed operating system functionality while achieving transparency with respect to their existing application base. DUNES also preserves semantic correctness with respect to single-processor semantics. We show performance measurements of a UNIX-based implementation on SPARC and x86 architectures over high-speed LAN environments. We show that significant performance gains in terms of system throughput and parallel application speedup are achievable. (C) 1999 Academic Press.
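Communication-sensitive load balancing, as described above, charges a placement both for the computation it adds to a node and for the traffic it generates to peers on other nodes. A minimal greedy sketch of that idea (DUNES' actual policies are richer and dynamic; all names and numbers here are invented):

```python
# Greedy communication-sensitive placement: each process goes to the node
# minimizing (node load + own demand + penalty for remote communication
# with already-placed peers).
def place(procs, comm, n_nodes, remote_cost=2.0):
    """procs: {name: cpu_demand}; comm: {(a, b): traffic volume}."""
    load = [0.0] * n_nodes
    where = {}
    for p, demand in procs.items():
        def score(node):
            remote = sum(v for (a, b), v in comm.items()
                         if p in (a, b)
                         and where.get(b if a == p else a, node) != node)
            return load[node] + demand + remote_cost * remote
        node = min(range(n_nodes), key=score)
        where[p] = node
        load[node] += demand
    return where

procs = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
comm = {("a", "b"): 5.0, ("c", "d"): 5.0}   # two tightly coupled pairs
placement = place(procs, comm, n_nodes=2)
print(placement)
```

With the communication term, each coupled pair lands on one node while the two pairs still spread across nodes; with `remote_cost=0` the same code degenerates to plain least-loaded placement and may split the pairs.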
The solution of elliptic problems is challenging on parallel distributed memory computers since their Green's functions are global. To address this issue, we present a set of preconditioners for the Schur complement domain decomposition method. They implement a global coupling mechanism, through coarse-space components, similar to the one proposed in [Bramble, Pasciak, and Schatz, Math. Comp., 47 (1986), pp. 103-134]. The definition of the coarse-space components is algebraic; they are defined using the mesh partitioning information and simple interpolation operators. These preconditioners are implemented on distributed memory computers without introducing any new global synchronization in the preconditioned conjugate gradient iteration. The numerical and parallel scalability of these preconditioners is illustrated on two-dimensional model examples that exhibit anisotropy and/or discontinuity phenomena.
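For readers unfamiliar with the Schur complement method the preconditioners target, here is the smallest concrete instance I can think of: the 1D Poisson matrix tridiag(-1, 2, -1) split into two subdomains joined by one interface unknown. Interior unknowns are eliminated independently (in parallel, in practice) and only the reduced interface system is solved globally. Sizes and right-hand side are toy choices, not from the paper.

```python
# Schur complement on a 7-point 1D Poisson problem; node 3 is the interface.
def solve(A, b):
    """Dense Gaussian elimination with partial pivoting (teaching-sized)."""
    n = len(b)
    A = [row[:] for row in A]; b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p], b[k], b[p] = A[p], A[k], b[p], b[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            b[i] -= f * b[k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

n = 7
lap = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
        for j in range(n)] for i in range(n)]
rhs = [1.0] * n
interior = [0, 1, 2, 4, 5, 6]; gamma = 3   # G = the single interface node

# Block pieces A_II, A_IG, A_GI of the 2x2 partitioned system.
A_II = [[lap[i][j] for j in interior] for i in interior]
A_IG = [lap[i][gamma] for i in interior]
A_GI = [lap[gamma][j] for j in interior]
f_I = [rhs[i] for i in interior]

w = solve(A_II, A_IG)                  # A_II^{-1} A_IG (two independent
v = solve(A_II, f_I)                   # subdomain solves in disguise)
S = lap[gamma][gamma] - sum(a * b for a, b in zip(A_GI, w))
g = rhs[gamma] - sum(a * b for a, b in zip(A_GI, v))
x_gamma = g / S                        # 1x1 interface (Schur) system

x_direct = solve(lap, rhs)
print(round(x_gamma, 10), round(x_direct[gamma], 10))  # both 8.0
```

In 2D decompositions the interface holds many unknowns, S is never formed explicitly, and the paper's coarse-space preconditioners exist precisely to keep the conjugate gradient iteration on S well conditioned.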
The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper we investigate the effect of bottleneck evasion and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code, so that its processes become desynchronized. This occurs most prominently in memory-bound programs, which is why we choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code, to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers of desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structure, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime.
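The core claim, that removing global synchronization lets noisy per-rank compute times overlap instead of accumulating, can be made tangible with a toy timeline model. This is not the paper's methodology, just an illustrative simulation with invented timings: with a barrier every rank pays the per-iteration maximum across all ranks; with only a nearest-neighbor exchange, a rank waits just for its two neighbors.

```python
import random

# Toy model: ranks alternate a noisy compute phase and a synchronization.
def makespan(n_ranks, n_iters, rng, barrier):
    t = [0.0] * n_ranks          # per-rank clock
    for _ in range(n_iters):
        # memory-bound compute phase with natural per-rank noise
        t = [ti + rng.uniform(0.8, 1.2) for ti in t]
        if barrier:
            t = [max(t)] * n_ranks           # bulk-synchronous: all wait
        else:                                 # nearest-neighbor exchange only
            t = [max(t[max(r - 1, 0)], t[r], t[min(r + 1, n_ranks - 1)])
                 for r in range(n_ranks)]
    return max(t)

sync = makespan(64, 50, random.Random(1), barrier=True)
desync = makespan(64, 50, random.Random(1), barrier=False)
print(round(sync, 2), round(desync, 2))   # neighbor-only sync finishes first
```

With the barrier, the makespan is the sum of 50 per-iteration maxima over 64 ranks; without it, delays can only propagate one neighbor per iteration (the "idle wave" of the abstract), so much of the noise averages out, which mirrors the paper's point that the gain exceeds the bare cost of the removed collective.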