ISBN: 9781728137735 (print)
This paper proposes a new load balancer that aims to reduce the runtime and power consumption of parallel applications run in shared-memory environments. The balancer's algorithm collects system and application information in real time and then uses it to make task migration decisions. The strategy was implemented with the Charm++ parallel programming model. Preliminary results show reductions of up to 35.36% in runtime and energy consumption for the three benchmarks used in the tests.
One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI’s one-sided interface and PGAS programming languages, lack application-leve...
ISBN: 9781538678800 (print)
Many scientific applications handle large sparse matrices, which can be stored in special compression formats to reduce memory space and processing time. The choice of the Optimal Compression Format (OCF) is a critical process that involves several criteria. In this paper, we propose to use a machine learning approach to predict the OCF (among CSR, CSC, ELL and COO) for the SMVP kernel on a multiprocessor platform. Our goal is not only to reach high accuracy but also to minimize the LUBS (Loss Under Best Selection). Our main contribution consists in using a data-parallel model to extract the feature dataset. Experimental results show that we achieve more than 95% accuracy.
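To make the format names in this abstract concrete, the following is a minimal sketch of two of them, COO and CSR, together with a sparse matrix-vector product over CSR. It is an illustration only and does not reproduce the authors' prediction model or feature extraction; the function names and the example matrix are hypothetical.

```python
# Illustrative only: COO and CSR storage of a small matrix, plus a CSR
# sparse matrix-vector product. All names here are hypothetical.

def dense_to_coo(dense):
    """Return (rows, cols, vals) listing every nonzero in row-major order."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert row-major-sorted COO into CSR (row_ptr, col_idx, vals)."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:                 # count nonzeros per row
        row_ptr[r + 1] += 1
    for i in range(n_rows):        # prefix sum turns counts into offsets
        row_ptr[i + 1] += row_ptr[i]
    return row_ptr, list(cols), list(vals)


def csr_matvec(row_ptr, col_idx, vals, x):
    """Compute y = A @ x where A is stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


A = [[4, 0, 0],
     [0, 0, 5],
     [1, 0, 2]]
rows, cols, vals = dense_to_coo(A)
row_ptr, col_idx, csr_vals = coo_to_csr(len(A), rows, cols, vals)
y = csr_matvec(row_ptr, col_idx, csr_vals, [1, 2, 3])
```

COO keeps one (row, column, value) triple per nonzero, while CSR replaces the row array with per-row offsets. Which layout is faster for a given matrix depends on its sparsity pattern, which is the kind of choice the abstract proposes to predict.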
The importance of optimization and NP-problem solving cannot be overemphasized. The usefulness and popularity of evolutionary computing methods are also well established. There are various types of evolutionary methods; they are mostly sequential, but some of them have parallel implementations as well. We propose a multi-population method to parallelize the Imperialist Competitive Algorithm. The algorithm has been implemented with the Message Passing Interface on two computer platforms, and we have tested our method on shared memory and message passing architectural models. An outstanding performance is obtained, demonstrating that the proposed method is very efficient in both speed and accuracy. In addition, compared with a set of existing well-known parallel algorithms, our approach obtains more accurate results within a shorter time period.
Technology improvements as well as demographic expansion imply an increase in the amount of information that needs processing. With this necessity, it becomes apparent that algorithms capable of handling such an amount of information are a must-have, as are algorithms capable of taking maximum advantage of the current processing *** work presents the use of the decision tree to analyze numeric data; furthermore, we explore the massive parallelization of the algorithm using the CUDA technology and the PyCUDA module, chosen for its easy integration and the meta-programming capabilities it provides, and show the optimizations made in the process. A comparison of results between the original algorithm and the optimized implementation will be presented, and conclusions will be drawn *** algorithm to employ is the decision tree. This algorithm was selected for its simplicity, its inherent partitioning of the data, and its results. Unlike other machine learning algorithms, the decision tree provides a clear description of the process followed to reach a certain classification, which is a desired property for further *** technology will be exploited using CUDA's programming interface, achieving an improvement of over 17000x over the classic serial implementation and their bounds and limitations.
ISBN: 9781538658628 (print)
Context: Software developers face complex, connected, and large software projects. The development of such systems involves design decisions that directly impact the quality of the software. For early decision making, software developers can use model-based prediction approaches for (non-)functional quality properties. Unfortunately, the accuracy of these approaches is challenged by newly introduced hardware features like multiple cores within a single CPU (multicores) and their dependence on shared memory and other shared resources. Objectives: Our goal is to understand whether and how existing model-based performance prediction approaches face this challenge. We plan to use the insights gained as a foundation for enriching existing prediction approaches with capabilities to predict systems running on multicores. Methods: We perform a Systematic Literature Review (SLR) to identify current model-based prediction approaches in the context of multicores. Results: Our SLR covers the software engineering, embedded systems, High Performance Computing, and Software Performance Engineering domains, for which we examined 34 sources in detail. We found various performance prediction approaches that try to increase prediction accuracy for multicore systems by including shared memory designs in the prediction models. Conclusion: However, our results show that the memory design models are only in an initial phase. Further research has to be done to improve cache, memory, and memory bandwidth models as well as to include auto-tuner support.
ISBN: 9781509042432 (print)
Bioinformatics is an interdisciplinary field that applies techniques from computer science, statistics and engineering to guide the study of large biological data. Protein structure and sequence analysis is very important in bioinformatics, mainly for understanding cellular processes, which helps simplify the development of drugs for metabolic pathways. Protein sequence alignment is a technique concerned with identifying the similarities among different protein structures in order to discover the relationships among them. Techniques of this kind are computationally intensive, which hinders their applicability. In this paper, we propose a parallel approach to reduce the computation time of two sequence alignment algorithms using a hybrid implementation that combines the power of multicore CPUs and that of contemporary GPUs. Our study shows that the hybrid approach solves the problem much faster than its sequential counterpart.
The rapid progress of multi/many-core architectures has left data-intensive parallel applications not yet fully optimized to deliver the best performance. With the advent of concurrent programming, frameworks offering structured patterns have alleviated developers' burden of adapting such applications to multithreaded architectures. While some of these patterns are implemented using synchronization primitives, others avoid them by means of lock-free data mechanisms. However, lock-free programming is not straightforward: ensuring an appropriate use of their interfaces can be challenging, since different memory models plus instruction reordering at compiler/processor levels can interfere in the occurrence of data races. The benefits of race detectors are formidable in this sense; however, they may emit false positives if they are unaware of the underlying lock-free structure semantics. To mitigate this issue, this paper extends ThreadSanitizer, a race detection tool, with the semantics of two lock-free data structures: the single-producer/single-consumer and the multiple-producer/multiple-consumer queues. With it, we are able to drop false positives and detect potential semantic violations. The experimental evaluation, using different queue implementations on a set of benchmarks and real applications, demonstrates that it is possible to reduce the number of data race warnings by 60% on average and to detect wrong uses of these structures.
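As a rough illustration of what a "semantic violation" of a single-producer/single-consumer queue can look like, the toy sketch below records which thread first takes each role and flags any other thread that later uses the same role. It is not ThreadSanitizer and not the paper's extension; the class name, the `violations` list, and the message text are all hypothetical, and the bookkeeping itself is deliberately left unsynchronized for brevity.

```python
import threading
from collections import deque


class CheckedSPSCQueue:
    """Toy queue that checks the single-producer/single-consumer contract.

    Hypothetical illustration: it only records thread identities and does
    not implement a real lock-free ring buffer.
    """

    def __init__(self):
        self._items = deque()
        self._producer = None   # thread id of the first thread that pushed
        self._consumer = None   # thread id of the first thread that popped
        self.violations = []

    def _check_role(self, role, owner):
        me = threading.get_ident()
        if owner is None:
            return me           # first user of this role claims it
        if owner != me:
            self.violations.append(f"second {role} thread used the queue")
        return owner

    def push(self, item):
        self._producer = self._check_role("producer", self._producer)
        self._items.append(item)

    def pop(self):
        self._consumer = self._check_role("consumer", self._consumer)
        return self._items.popleft() if self._items else None


q = CheckedSPSCQueue()
q.push(1)                                    # main thread becomes the producer
t = threading.Thread(target=q.push, args=(2,))
t.start()
t.join()                                     # a second producer breaks the contract
first = q.pop()                              # main thread becomes the consumer
```

A detector that knows this contract can report the second producer as a misuse even when no low-level data race is observed, and can stay silent about accesses that the queue's own protocol already orders.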
ISBN: 9781538610428 (print)
Network analysis software relies on graph layout algorithms to enable users to visually explore network data. Nowadays, networks easily consist of millions of nodes and edges, resulting in hours of computation time to obtain a readable graph layout on a typical workstation. Although these machines usually do not have a very large number of CPU cores, they can easily be equipped with Graphics Processing Units (GPUs), opening up the possibility of exploiting hundreds or even thousands of cores to counter the aforementioned computational challenges. In this paper we introduce a novel GPU framework for visualizing large real-world network data. The main focus is on a GPU implementation of force-directed graph layout algorithms, which are known to create high quality network visualizations. The proposed framework is used to parallelize the well-known ForceAtlas2 algorithm, which is widely used in many popular network analysis packages and toolkits. The different procedures and data structures of the algorithm are adjusted to the CUDA GPU architecture's specifics in terms of memory coalescing, shared memory usage and thread workload balance. To evaluate its performance, the GPU implementation is tested using a diverse set of 38 different large-scale real-world networks. This allows for a thorough characterization of the parallelizable components of both force-directed layout algorithms in general as well as the proposed GPU framework as a whole. Experiments demonstrate how the approach can efficiently process very large real-world networks, showing overall speedup factors between 40x and 123x compared to existing CPU implementations. In practice, this means that a network with 4 million nodes and 120 million edges can be visualized in 14 minutes rather than 9 hours.
ISBN: 9781538623268 (print)
Efficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on performance. However, optimizing such thread placement is difficult: architectures become increasingly complex, and application behavior changes with implementations and input parameters, e.g., problem size and number of threads. In this work, we propose a fully automatic, abstracted and portable affinity module. It produces and implements an optimized affinity strategy that combines knowledge about application characteristics and the platform topology. Implemented in the back-end of our runtime system (ORWL), our approach was used to enhance the performance and the scalability of several unmodified ORWL-coded applications: matrix multiplication, a 2D stencil (Livermore Kernel 23), and a real-world video tracking application. On two SMP machines with quite different hardware characteristics, our tests show spectacular performance improvements for these unmodified application codes due to a dramatic decrease of cache misses and pipeline stalls. A comparison to reference implementations using OpenMP confirms this performance gain of almost one order of magnitude.