Latency and throughput are often critical performance metrics in stream processing. An application's performance can fluctuate depending on the input stream; this unpredictability stems from variations in data arrival frequency, item size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios for further evaluating these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generates the frequency patterns most commonly used in related work for benchmarking stream processing, enabling the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow, two libraries that leverage stream parallelism on multi-core architectures. Our results show that our test cases did not benefit from micro-batching on multi-cores. Across different data stream frequency configurations, TBB achieved the lowest latency, while FastFlow achieved higher throughput in shorter pipelines.
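SPBench's own interface is not reproduced in the abstract; as a rough sketch of what micro-batching looks like on one of the evaluated runtimes, the following oneTBB pipeline groups stream items into batches before the parallel stage. `Item`, `next_item`, `process`, and the batch size are illustrative assumptions, not SPBench's API.

```cpp
// Illustrative sketch only: a serial source filter groups incoming items
// into micro-batches, a parallel middle filter processes whole batches,
// and a serial sink emits them in order.
#include <oneapi/tbb/parallel_pipeline.h>
#include <cstddef>
#include <optional>
#include <vector>

struct Item { /* application-defined payload */ };
using Batch = std::vector<Item>;

std::optional<Item> next_item();   // assumed stream source
void process(Item&);               // assumed per-item operator

void run_pipeline(std::size_t batch_size, std::size_t max_tokens) {
    tbb::parallel_pipeline(max_tokens,
        tbb::make_filter<void, Batch>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> Batch {
                Batch b;
                while (b.size() < batch_size) {
                    auto it = next_item();
                    if (!it) break;            // stream exhausted
                    b.push_back(*it);
                }
                if (b.empty()) fc.stop();      // end of stream
                return b;
            }) &
        tbb::make_filter<Batch, Batch>(tbb::filter_mode::parallel,
            [](Batch b) {                      // one token = one micro-batch
                for (auto& item : b) process(item);
                return b;
            }) &
        tbb::make_filter<Batch, void>(tbb::filter_mode::serial_in_order,
            [](Batch) { /* emit results downstream */ }));
}
```

With `batch_size = 1` this degenerates to item-at-a-time processing, which is the baseline the batched configurations are compared against.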
With the development of heterogeneous systems, the demand for high-level programming methods that ease heterogeneous programming and produce portable applications has become more urgent. This paper proposes DACL, the Data Associated Computing Language. DACL introduces data partition patterns to achieve architecture-independent expression of parallelism. It also provides simplified language extensions and programming features such as serialization of the computing process, parameterization of data attributes, and modularity, thus reducing the difficulty of heterogeneous programming and improving productivity. The operational semantics show that DACL enables calculation of parallelism degrees at different levels and retains data access patterns, preserving optimization potential. To support cross-platform execution, the currently implemented source-to-source compilers employ OpenMP and OpenCL as backends. We reconstructed multiple benchmarks selected from the Parboil and Rodinia benchmark suites with DACL and conducted comparison tests on CPU, GPU, and MIC platforms. The code size of each rebuilt benchmark is roughly equivalent to that of the serial code, only 13%-64% of the benchmark's OpenCL code. With the support of the compilation system, the reconstructed code executes on different processors without modification, yielding performance competitive with or better than that of the manually written benchmark code.
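DACL's concrete syntax is not shown in the abstract, so no attempt is made to reproduce it here. As a purely hypothetical illustration of the underlying idea, a data partition pattern declared once and lowered by a backend, consider this OpenMP sketch; the function name and block granularity are invented for illustration:

```cpp
// Hypothetical illustration only: NOT DACL syntax. The programmer states
// how data is partitioned (a 1D block partition) and the per-element
// computation; a backend supplies the parallelism. On this OpenMP
// backend, blocks map to threads; an OpenCL backend could map the same
// pattern to work-groups instead.
#include <cstddef>

template <typename T, typename Body>
void block_partition_map(T* data, std::ptrdiff_t n,
                         std::ptrdiff_t block, Body body) {
    const std::ptrdiff_t n_blocks = (n + block - 1) / block;
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t b = 0; b < n_blocks; ++b) {
        const std::ptrdiff_t lo = b * block;
        const std::ptrdiff_t hi = (lo + block < n) ? lo + block : n;
        for (std::ptrdiff_t i = lo; i < hi; ++i)
            body(data[i]);                 // per-element computation
    }
}

// Usage: the computation is expressed once, independent of the backend.
// block_partition_map(v.data(), (std::ptrdiff_t)v.size(), 1024,
//                     [](float& x) { x *= 2.0f; });
```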
Multiple sequence alignment approaches refer to algorithmic solutions for the alignment of biological sequences. Since multiple sequence alignment has exponential time complexity when a dynamic programming approach is applied, a substantial number of parallel computing approaches have been implemented over the last two decades to improve performance. In this paper, we present a systematic literature review of parallel computing approaches applied to multiple sequence alignment algorithms for proteins, published in the open literature from 1988 to 2022. We extracted articles from four scientific databases: ACM Digital Library, IEEE Xplore, ScienceDirect, and SpringerLink, and four journals: Bioinformatics, PLOS Computational Biology, PLOS ONE, and Scientific Reports. Additionally, to cover other potential databases and journals, we performed a transversal search through Google Scholar. Our selection process yielded 106 research articles; we then analyzed these articles and defined a classification framework. Finally, we point out directions and trends for parallel computing approaches to multiple sequence alignment, as well as some unsolved problems.
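The exponential complexity the review starts from can be made concrete with a short, standard calculation (textbook material, not drawn from the reviewed articles): aligning k sequences of length n by dynamic programming fills a k-dimensional table.

```latex
% DP-based alignment of k sequences of length n:
%   table cells:            (n+1)^k
%   predecessors per cell:  2^k - 1  (every nonempty subset of the
%                                     sequences may advance one position)
T(n,k) \;=\; O\!\big((2^k - 1)\,(n+1)^k\big) \;=\; O\!\big(2^k\,n^k\big)
% For k = 2 this reduces to the familiar O(n^2) pairwise case; already
% for k = 10 and n = 100 it exceeds 10^{20} cell updates, which is why
% heuristics and parallelization are needed for realistic k.
```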
Several real-world parallel applications are becoming more dynamic and long-running, demanding online (at run-time) adaptations. Stream processing is a representative scenario that computes data items arriving in real time and where parallel executions are necessary. However, it is challenging for humans to continuously monitor and manually optimize complex, long-running parallel executions. Moreover, although high-level and structured parallel programming aims to facilitate parallelism, several issues still need to be addressed to improve the existing abstractions. In this paper, we extend self-adaptiveness to support autonomous, online changes of parallel pattern compositions. Online self-adaptation is achieved with an online profiler that characterizes the applications, combined with a new self-adaptive strategy and a model for smooth transitions during reconfigurations. The solution provides a new abstraction layer that enables application programmers to define non-functional requirements instead of hand-tuning complex configurations. Hence, we contribute additional abstractions and flexible self-adaptation for responsiveness at run-time. The proposed solution is evaluated with applications having different processing characteristics, workloads, and configurations. The results show that it is possible to provide additional abstractions, flexibility, and responsiveness while achieving performance comparable to the best static-configuration executions.
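The paper's strategy is not spelled out in the abstract; as a minimal sketch of the general mechanism (an online profiler feeding a reconfiguration decision), assuming hypothetical hooks `measure_tput` and `reconfigure_workers` and a throughput target standing in for the non-functional requirement:

```cpp
// Minimal sketch, not the paper's algorithm: a run-time control loop that
// samples throughput over a window and adapts the parallelism degree.
#include <chrono>
#include <functional>
#include <thread>

void adaptation_loop(double target_tput, int min_w, int max_w,
                     const std::function<double()>& measure_tput,       // items/s, last window (assumed hook)
                     const std::function<void(int)>& reconfigure_workers) // apply new degree (assumed hook)
{
    int workers = min_w;
    reconfigure_workers(workers);
    for (;;) {
        std::this_thread::sleep_for(std::chrono::seconds(1)); // sampling window
        const double tput = measure_tput();
        // Simple hysteresis keeps reconfigurations "smooth": scale up only
        // when clearly below target, scale down only when well above it.
        if (tput < 0.95 * target_tput && workers < max_w)
            reconfigure_workers(++workers);
        else if (tput > 1.20 * target_tput && workers > min_w)
            reconfigure_workers(--workers);
    }
}
```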
The Multilevel Monte Carlo (MLMC) method has proven to be an effective variance-reduction statistical method for Uncertainty Quantification (UQ) in Partial Differential Equation (PDE) models, combining model computations at different levels to create an accurate estimate. Still, the computational complexity of the resulting method is extremely high, particularly for 3D models, requiring advanced algorithms to exploit High Performance Computing (HPC) efficiently. In this article we present a new implementation of MLMC on massively parallel computer architectures, exploiting parallelism within and between each level of the hierarchy. The numerical approximation of the PDE is performed using the finite element method, but the algorithm is quite general and could be applied to other discretization methods. The two key ingredients of the implementation are a good processor partition scheme and a good scheduling algorithm to assign work to the different processors. We introduce a multiple partition of the set of processors that permits the simultaneous execution of different levels, and we develop a dynamic scheduling algorithm to exploit it. Finding the optimal schedule of distributed tasks on a parallel computer is an NP-complete problem. We propose and analyze a new greedy scheduling algorithm to assign samples, and we show that it is a 2-approximation, the best that may be expected under general assumptions. On top of this result, we design a distributed-memory implementation using the Message Passing Interface (MPI) standard. Finally, we present a set of numerical experiments illustrating its scalability properties.
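The greedy scheduler is described only at a high level; the sketch below shows the classic least-loaded-first (list scheduling) assignment consistent with the stated 2-approximation bound (Graham's bound for list scheduling is 2 - 1/m). Per-sample cost estimates and the processor-group granularity are simplifying assumptions:

```cpp
// Greedy list scheduling: each MLMC sample, with an estimated cost, is
// assigned to the currently least-loaded processor group (min-heap by load).
#include <cstddef>
#include <queue>
#include <vector>

struct Group { double load; std::size_t id; };
struct ByLoad {
    bool operator()(const Group& a, const Group& b) const {
        return a.load > b.load;            // inverted: makes the heap a min-heap
    }
};

// Returns assignment[i] = processor group that receives sample i.
std::vector<std::size_t> greedy_schedule(const std::vector<double>& sample_cost,
                                         std::size_t n_groups) {
    std::priority_queue<Group, std::vector<Group>, ByLoad> heap;
    for (std::size_t g = 0; g < n_groups; ++g) heap.push({0.0, g});

    std::vector<std::size_t> assignment(sample_cost.size());
    for (std::size_t i = 0; i < sample_cost.size(); ++i) {
        Group g = heap.top(); heap.pop();  // least-loaded group so far
        assignment[i] = g.id;
        g.load += sample_cost[i];          // cost grows with mesh level in MLMC
        heap.push(g);
    }
    return assignment;
}
```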
The Software Defect Prediction (SDP) method forecasts the occurrence of defects early in the software development process. Early fault detection decreases the overall cost of software and improves its dependability. However, defect prediction has not previously been addressed for high-performance software. The contribution of this paper is predicting and correcting software defects in Message Passing Interface (MPI) code using machine learning (ML). The system predicts defects including deadlocks, race conditions, and mismatches, dividing the model into three stages: training, testing, and prediction. The training phase extracts and combines the features and the label, then trains a classifier. During the testing phase, these features are extracted and classified. The prediction phase takes MPI code as input and determines whether it contains defects; if a defect is found, the correction subsystem corrects it. We collected 40 MPI codes in C++, covering all forms of MPI communication. Results show the NB (Naive Bayes) classifier achieves accuracy, precision, and recall values close to 1.
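To make the defect classes concrete, here is a textbook instance of one of them, not taken from the paper's dataset: a send-send deadlock between two ranks, with a standard correction noted in a comment.

```cpp
// Classic MPI deadlock: both ranks issue a blocking MPI_Send before
// posting any receive. For messages too large for the eager protocol,
// neither send can complete, so neither rank reaches MPI_Recv.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int peer = 1 - rank;   // assumes the job runs with exactly 2 ranks
    int sendbuf = rank, recvbuf = -1;

    // DEFECT: matched blocking sends -> potential deadlock.
    MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // A safe correction pairs both operations in a single call:
    // MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, 0,
    //              &recvbuf, 1, MPI_INT, peer, 0,
    //              MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```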
Contemporary HPC hardware typically provides several levels of parallelism, e.g., multiple nodes, each with multiple cores (possibly with vectorization) and accelerators. Efficiently programming such systems usually requires skill in combining several low-level frameworks such as MPI, OpenMP, and CUDA, which overburdens programmers without substantial parallel programming experience. One way to overcome this problem and abstract from the details of parallel programming is to use algorithmic skeletons. In the present paper, we evaluate multi-node, multi-CPU, and multi-GPU implementations of the most essential skeletons: Map, Reduce, and Zip. Our main contribution is a discussion of the efficiency of using multiple parallelization levels and a consideration of which fine-tuning settings should be offered to the user.
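The abstract does not name the library's API, so the sketch below only fixes the sequential semantics of the three skeletons; an implementation like the one evaluated would keep these signatures and dispatch to MPI, OpenMP, or CUDA underneath, so the user never writes low-level parallel code.

```cpp
// Sequential reference semantics of the Map, Reduce, and Zip skeletons
// (illustrative, not the evaluated library's actual interface).
#include <cstddef>
#include <numeric>
#include <vector>

template <typename T, typename F>
std::vector<T> skeleton_map(const std::vector<T>& a, F f) {   // Map: f elementwise
    std::vector<T> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = f(a[i]);
    return out;
}

template <typename T, typename F>
T skeleton_reduce(const std::vector<T>& a, T init, F f) {     // Reduce: fold with
    return std::accumulate(a.begin(), a.end(), init, f);      // associative f
}

template <typename T, typename F>
std::vector<T> skeleton_zip(const std::vector<T>& a,
                            const std::vector<T>& b, F f) {   // Zip: combine pairwise
    std::vector<T> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = f(a[i], b[i]);
    return out;
}
```

Associativity of the Reduce operator is what lets a parallel backend split the fold across cores, nodes, and GPUs and combine partial results.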
Python is becoming increasingly popular in scientific computing. The package MPI for Python (mpi4py) allows writing efficient parallel programs that scale across multiple nodes. However, it does not support non-contiguous data expressed via slices, a well-known feature of NumPy. In this work, we therefore evaluate several methods to support the direct transfer of non-contiguous arrays in mpi4py. This significantly simplifies the code, while performance essentially stays the same. Using ping-pong, stencil, and Lattice-Boltzmann benchmarks, we compare the common manual-copying approach, a NumPy-copy design, and a design based on MPI derived datatypes. In one case, the MPI derived-datatype design achieved a speedup of 15% in the stencil benchmark on four compute nodes. Our designs are superior to naive manual copies, but for maximum performance, manual copies with pre-allocated buffers or MPI persistent communication remain the better choice.
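The paper works in Python with mpi4py, but the derived-datatype design it evaluates corresponds, at the C/C++ MPI level, to describing a strided slice with `MPI_Type_vector` and sending it without a packing copy. A minimal sketch (function name and layout are illustrative):

```cpp
// Send one column of a row-major matrix directly: the derived datatype
// describes `rows` blocks of 1 element, spaced `cols` elements apart,
// so MPI walks the stride itself and no manual copy is needed.
#include <mpi.h>
#include <vector>

void send_column(const std::vector<double>& matrix,  // row-major, rows x cols
                 int rows, int cols, int col, int dest, MPI_Comm comm) {
    MPI_Datatype column;
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    // One element of type `column`, starting at the column's first entry.
    MPI_Send(matrix.data() + col, 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}
```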
Most modern technologies, such as social media, smart cities, and the Internet of Things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima. As a result, it is necessary to look into new methods for dealing with large-scale data classification. Several solutions have been proposed for overcoming this challenge, but the rapid growth of the available data threatens to limit the usefulness of many traditional methods. Techniques such as oversampling and undersampling have shown great promise in addressing the issue of class imbalance. Among all of these techniques, the Synthetic Minority Oversampling Technique (SMOTE) has produced the best results, generating synthetic samples for the minority class to create a balanced dataset. The issue is that its practical applicability is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and a MapReduce strategy, which distributes the operation of the algorithm among a group of computational nodes to address the aforementioned problem. The proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each map block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is trained. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's efficiency. The results show that the hybrid SMOTE has good scalability within the proposed framework and also cuts down processing time.
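The MapReduce plumbing aside, the core of SMOTE is interpolation between a minority sample and a nearby minority neighbour: x_new = x + u * (x_nn - x) with u uniform in (0, 1). A minimal, self-contained sketch (k = 1 neighbour for brevity; the paper's hybrid variant and its distribution across map blocks are not reproduced):

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Sample = std::vector<double>;

// Brute-force nearest minority neighbour (real SMOTE samples among the
// k nearest; k = 1 here to keep the sketch short).
static std::size_t nearest(const std::vector<Sample>& m, std::size_t i) {
    std::size_t best = (i + 1) % m.size();
    double best_d = -1.0;
    for (std::size_t j = 0; j < m.size(); ++j) {
        if (j == i) continue;
        double d = 0.0;
        for (std::size_t t = 0; t < m[i].size(); ++t) {
            const double diff = m[i][t] - m[j][t];
            d += diff * diff;                    // squared Euclidean distance
        }
        if (best_d < 0.0 || d < best_d) { best_d = d; best = j; }
    }
    return best;
}

// Generate n_synthetic minority samples by feature-wise interpolation.
std::vector<Sample> smote(const std::vector<Sample>& minority,
                          std::size_t n_synthetic, std::mt19937& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, minority.size() - 1);
    std::vector<Sample> synthetic;
    synthetic.reserve(n_synthetic);
    while (synthetic.size() < n_synthetic) {
        const std::size_t i = pick(rng);
        const Sample& x = minority[i];
        const Sample& z = minority[nearest(minority, i)];
        const double u = unit(rng);
        Sample s(x.size());
        for (std::size_t d = 0; d < x.size(); ++d)
            s[d] = x[d] + u * (z[d] - x[d]);     // point on the segment x -> z
        synthetic.push_back(std::move(s));
    }
    return synthetic;
}
```

In the paper's scheme this generation step runs independently inside each map block, which is what makes the approach scale to millions of instances.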