Search results - Inner Mongolia University Library

Simpósio em Sistemas Computacionais (WSCAD-SSC)

Authors: Vinicius R.S. dos Santos; Edson L. Padoin; Philippe O. A. Navaux. Affiliations: Regional University of the Northwest of Rio Grande do Sul; Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, BR

ISBN: (print) 9781728137735

This paper proposes a new load balancer that aims to reduce the runtime and power consumption of parallel applications running in shared-memory environments. The balancer's algorithm collects system and application information in real time and uses it to make task migration decisions. The strategy was implemented using the Charm++ parallel programming model. Preliminary results show reductions of up to 35.36% in runtime and energy consumption for the three benchmarks used in the tests.

Keywords: High performance computing; Runtime; Heuristic algorithms; Load management; Power demand; Parallel programming; Integrated circuits

BCL: A Cross-Platform Distributed Container Library

arXiv, 2018

Authors: Brock, Benjamin; Buluç, Aydın; Yelick, Katherine. Affiliations: University of California; Lawrence Berkeley National Laboratory, Berkeley, United States

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI’s one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization. Copyright © 2018, The Authors. All rights reserved.

Keywords: parallel programming

Machine Learning for Optimal Compression Format Prediction on Multiprocessor Platform

International Conference on High Performance Computing & Simulation (HPCS)

Authors: Ichrak Mehrez; Olfa Hamdi-Larbi; Thomas Dufaud; Nahid Emad. Affiliations: Université de Versailles St-Quentin, Université Paris-Saclay, Li-Parad, Versailles, France

ISBN: (print) 9781538678800

Many scientific applications handle large sparse matrices, which can be stored in special compression formats to reduce memory space and processing time. The choice of the Optimal Compression Format (OCF) is a critical process that involves several criteria. In this paper, we propose a machine learning approach to predict the OCF (among CSR, CSC, ELL and COO) for the sparse matrix-vector product (SMVP) kernel on a multiprocessor platform. Our goal is not only to reach high accuracy but also to minimize the LUBS (Loss Under Best Selection). Our main contribution consists in using a data-parallel model to extract the feature dataset. Experimental results show that we achieve more than 95% accuracy.

Keywords: Sparse matrices; Numerical models; Data models; Parallel programming; Computational modeling; Computer architecture; Program processors
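
To make the formats named in the abstract above concrete, here is a minimal sketch of the CSR layout and the sparse matrix-vector product computed on it. It is pure Python for illustration only; the function names `dense_to_csr` and `csr_matvec` are assumptions introduced here and are not taken from the paper, whose implementation is not described in this record.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) into the three CSR arrays.

    values  : nonzero entries, stored row by row
    col_idx : column index of each stored entry
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, entry in enumerate(row):
            if entry != 0:
                values.append(entry)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A given in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        # Only the stored nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    A = [[4, 0, 0],
         [0, 0, 5],
         [1, 2, 0]]
    vals, cols, ptr = dense_to_csr(A)
    print(vals, cols, ptr)
    print(csr_matvec(vals, cols, ptr, [1, 2, 3]))
```

Each row is independent in `csr_matvec`, which is why the row loop is the natural unit of parallel work on a multiprocessor; formats such as COO or ELL change how the nonzeros are laid out and therefore how evenly that work divides.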

Parallel imperialist competitive algorithms

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2018, Vol. 30, No. 7

Authors: Majd, Amin; Sahebi, Golnaz; Daneshtalab, Masoud; Plosila, Juha; Lotfi, Shahriar; Tenhunen, Hannu. Affiliations: Abo Akad Univ, SF-20500 Turku, Finland; Malardalen Univ, S-72123 Vasteras, Sweden; Univ Tabriz, Tabriz *** Iran; Univ Turku, SF-20500 Turku, Finland

The importance of optimization and NP-problem solving cannot be overemphasized. The usefulness and popularity of evolutionary computing methods are also well established. There are various types of evolutionary methods; they are mostly sequential, but some have parallel implementations as well. We propose a multi-population method to parallelize the Imperialist Competitive Algorithm. The algorithm has been implemented with the Message Passing Interface on 2 computer platforms, and we have tested our method on shared memory and message passing architectural models. An outstanding performance is obtained, demonstrating that the proposed method is very efficient concerning both speed and accuracy. In addition, compared with a set of existing well-known parallel algorithms, our approach obtains more accurate results within a shorter time period.

Keywords: evolutionary computing; ICA; multi-population; parallel approaches; parallel programming; optimization; super-linear performance

Data Intensive Parallel Tree Algorithm Patterns based on GPUs

2018 International Conference on Data Science and Information Technology (DSIT 2018)

Authors: Edgar León-Sandoval; Liliana Ibeth Barbosa-Santillán. Affiliation: University of Guadalajara

Technology improvements as well as demographic expansion imply an increase in the amount of information that needs processing. With this necessity, it becomes apparent that algorithms capable of handling such amounts of information are a must, as are algorithms capable of taking maximum advantage of the current processing *** work presents the use of the decision tree to analyze numeric data; furthermore, we explore the massive parallelization of the algorithm using the CUDA technology and the PyCUDA module, chosen for its easy integration and the meta-programming capabilities it provides, and show the optimizations made in the process. A comparison of results between the original algorithm and the optimized implementation will be presented, and conclusions will be drawn *** algorithm to employ is the decision tree. This algorithm was selected for its simplicity, its inherent partitioning of the data, and the results. Unlike other machine learning algorithms, the decision tree provides a clear description of the process followed to reach a certain classification, which is a desired property for further *** technology will be exploited using CUDA's programming interface, achieving an improvement of over 17000x over the classic serial implementation, and their bounds and limitations.

Keywords: algorithms; parallel programming; Entropy; GPUs
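
The record above lists "Entropy" as a keyword and describes a decision tree over numeric data. As a rough illustration, assuming an entropy-based (information gain) split criterion, which this record does not confirm, the search for one split threshold could be sketched as follows. The function names are hypothetical and this is a serial sketch, not the authors' CUDA/PyCUDA code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting one numeric feature at threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between distinct sorted values; return the best one.

    Returns None when the feature has fewer than two distinct values.
    """
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not midpoints:
        return None
    return max(midpoints, key=lambda t: information_gain(values, labels, t))


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = ["a", "a", "b", "b"]
    print(best_threshold(xs, ys))
```

Every candidate threshold is scored independently of the others, so the evaluation inside `best_threshold` is the kind of data-parallel work that maps naturally onto GPU threads.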

Parallelization, Modeling, and Performance Prediction in the Multi-/Many Core Area: A Systematic Literature Review

IEEE 7th International Symposium on Cloud and Service Computing (IEEE SC2)

Authors: Frank, Markus; Hilbrich, Marcus; Lehrig, Sebastian; Becker, Steffen. Affiliations: Tech Univ Chemnitz, Chemnitz, Germany; Univ Stuttgart, Stuttgart, Germany

ISBN: (print) 9781538658628

Context: Software developers face complex, connected, and large software projects. The development of such systems involves design decisions that directly impact the quality of the software. For early decision making, software developers can use model-based prediction approaches for (non-)functional quality properties. Unfortunately, the accuracy of these approaches is challenged by newly introduced hardware features like multiple cores within a single CPU (multicores) and their dependence on shared memory and other shared resources. Objectives: Our goal is to understand whether and how existing model-based performance prediction approaches face this challenge. We plan to use the gained insights as a foundation for enriching existing prediction approaches with capabilities to predict systems running on multicores. Methods: We perform a Systematic Literature Review (SLR) to identify current model-based prediction approaches in the context of multicores. Results: Our SLR covers the software engineering, embedded systems, High Performance Computing, and Software Performance Engineering domains, for which we examined 34 sources in detail. We found various performance prediction approaches that try to increase prediction accuracy for multicore systems by including shared memory designs in the prediction models. Conclusion: However, our results show that the memory design models are only in an initial phase. Further research has to be done to improve cache, memory, and memory bandwidth models as well as to include auto-tuner support.

Keywords: Software; Multicore processing; Predictive models; Unified modeling language; Parallel programming; Hardware; Computational modeling

A Hybrid CPU-GPU Implementation to Accelerate Multiple Pairwise Protein Sequence Alignment

8th International Conference on Information and Communication Systems (ICICS)

Authors: Shehab, Mohammed A.; Ghadawi, Abdullah A.; Alawneh, Luay; Al-Ayyoub, Mahmoud; Jararweh, Yaser. Affiliations: Jordan Univ Sci & Technol, Irbid, Jordan; Univ Technol Malaysia, Johor Baharu, Malaysia

ISBN: (print) 9781509042432

Bioinformatics is an interdisciplinary field that applies techniques from computer science, statistics and engineering to the study of large biological data. Protein structure and sequence analysis is very important in bioinformatics, mainly for understanding cellular processes, which helps simplify the development of drugs for metabolic pathways. Protein sequence alignment is a technique concerned with identifying the similarities among different protein structures in order to discover the relationships among them. Such techniques are computationally intensive, which hinders their applicability. In this paper, we propose a parallel approach to speed up the computation of two sequence alignment algorithms using a hybrid implementation that combines the power of multicore CPUs and that of contemporary GPUs. Our study shows that the hybrid approach solves the problem much faster than its sequential counterpart.

Keywords: Bioinformatics; Parallel programming; Performance

Enabling semantics to improve detection of data races and misuses of lock-free data structures

Euro-Par Conference / 7th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM) held in conjunction with the 21st SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Dolz, Manuel F.; Astorga, David Del Rio; Fernandez, Javier; Torquati, Massimo; Garcia, Jose Daniel; Garcia-Carballeira, Felix; Danelutto, Marco. Affiliations: Univ Carlos III Madrid, Dept Comp Sci, Madrid 28911, Spain; Univ Pisa, Dept Comp Sci, I-56127 Pisa, Italy

The rapid progress of multi/many-core architectures has left data-intensive parallel applications not yet fully optimized to deliver the best performance. With the advent of concurrent programming, frameworks offering structured patterns have alleviated developers' burden of adapting such applications to multithreaded architectures. While some of these patterns are implemented using synchronization primitives, others avoid them by means of lock-free data mechanisms. However, lock-free programming is not straightforward: ensuring appropriate use of these interfaces can be challenging, since different memory models plus instruction reordering at the compiler and processor levels can contribute to the occurrence of data races. The benefits of race detectors are formidable in this sense; however, they may emit false positives if they are unaware of the underlying lock-free structure semantics. To mitigate this issue, this paper extends ThreadSanitizer, a race detection tool, with the semantics of 2 lock-free data structures: the single-producer/single-consumer and the multiple-producer/multiple-consumer queues. With it, we are able to drop false positives and detect potential semantic violations. The experimental evaluation, using different queue implementations on a set of benchmarks and real applications, demonstrates that it is possible to reduce the number of data race warnings by 60% on average and to detect wrong uses of these structures.

Keywords: data race detectors; parallel programming; semantics; wait-/lock-free data structures

Exploiting GPUs for fast force-directed visualization of large-scale networks

46th International Conference on Parallel Processing Workshops (ICPPW)

Authors: Brinkmann, Govert G.; Rietveld, Kristian F. D.; Takes, Frank W. Affiliations: Leiden Univ, LIACS, Leiden, Netherlands; Univ Amsterdam, CORPNET, Amsterdam, Netherlands

ISBN: (print) 9781538610428

Network analysis software relies on graph layout algorithms to enable users to visually explore network data. Nowadays, networks easily consist of millions of nodes and edges, resulting in hours of computation time to obtain a readable graph layout on a typical workstation. Although these machines usually do not have a very large number of CPU cores, they can easily be equipped with Graphics Processing Units (GPUs), opening up the possibility of exploiting hundreds or even thousands of cores to counter the aforementioned computational challenges. In this paper we introduce a novel GPU framework for visualizing large real-world network data. The main focus is on a GPU implementation of force-directed graph layout algorithms, which are known to create high quality network visualizations. The proposed framework is used to parallelize the well-known ForceAtlas2 algorithm, which is widely used in many popular network analysis packages and toolkits. The different procedures and data structures of the algorithm are adjusted to the CUDA GPU architecture's specifics in terms of memory coalescing, shared memory usage and thread workload balance. To evaluate its performance, the GPU implementation is tested using a diverse set of 38 different large-scale real-world networks. This allows for a thorough characterization of the parallelizable components of both force-directed layout algorithms in general as well as the proposed GPU framework as a whole. Experiments demonstrate how the approach can efficiently process very large real-world networks, showing overall speedup factors between 40x and 123x compared to existing CPU implementations. In practice, this means that a network with 4 million nodes and 120 million edges can be visualized in 14 minutes rather than 9 hours.

Keywords: network visualization; force-directed graph layout; large-scale networks; parallel programming; CUDA
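
For readers unfamiliar with force-directed layout, the following is a minimal sketch of one iteration of a generic repulsion/attraction scheme. It is not ForceAtlas2 and not the paper's CUDA framework: the force laws, the constants `k_rep`, `k_att` and `step`, and the function name `layout_step` are all simplifying assumptions chosen for illustration.

```python
def layout_step(pos, edges, k_rep=1.0, k_att=0.1, step=0.05):
    """Advance a 2D layout by one iteration and return the new positions.

    pos   : list of (x, y) tuples, one per node
    edges : list of (u, v) index pairs
    Every pair of nodes repels (magnitude k_rep / distance); every edge
    pulls its endpoints together (magnitude k_att * distance).
    """
    n = len(pos)
    forces = [[0.0, 0.0] for _ in range(n)]

    # All-pairs repulsion: the O(n^2) part that dominates the running time.
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist_sq = dx * dx + dy * dy
            if dist_sq == 0.0:
                dist_sq = 1e-9  # avoid division by zero for coincident nodes
            scale = k_rep / dist_sq
            forces[i][0] += scale * dx
            forces[i][1] += scale * dy
            forces[j][0] -= scale * dx
            forces[j][1] -= scale * dy

    # Attraction along edges.
    for u, v in edges:
        dx = pos[v][0] - pos[u][0]
        dy = pos[v][1] - pos[u][1]
        forces[u][0] += k_att * dx
        forces[u][1] += k_att * dy
        forces[v][0] -= k_att * dx
        forces[v][1] -= k_att * dy

    return [(x + step * fx, y + step * fy)
            for (x, y), (fx, fy) in zip(pos, forces)]


if __name__ == "__main__":
    positions = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
    for _ in range(100):
        positions = layout_step(positions, [(0, 1), (1, 2)])
    print(positions)
```

The per-node force accumulation in the repulsion loop is independent across nodes, which is the kind of work the abstract describes distributing over GPU threads.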

Automatic, Abstracted and Portable Topology-Aware Thread Placement

IEEE International Conference on Cluster Computing (CLUSTER)

Authors: Gustedt, Jens; Jeannot, Emmanuel; Mansouri, Farouk. Affiliations: Univ Strasbourg, ICube, INRIA, Strasbourg, France; Univ Bordeaux, CNRS, INRIA, LaBRI, Bordeaux INP, Bordeaux, France; Inria Bordeaux, Bordeaux, France

ISBN: (print) 9781538623268

Efficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on performance. However, optimizing such thread placement is difficult: architectures become increasingly complex, and application behavior changes with implementations and input parameters, e.g., problem size and number of threads. In this work, we propose a fully automatic, abstracted and portable affinity module. It produces and implements an optimized affinity strategy that combines knowledge about application characteristics and the platform topology. Implemented in the back-end of our runtime system (ORWL), our approach was used to enhance the performance and scalability of several unmodified ORWL-coded applications: matrix multiplication, a 2D stencil (Livermore Kernel 23), and a real-world video tracking application. On two SMP machines with quite different hardware characteristics, our tests show spectacular performance improvements for these unmodified application codes due to a dramatic decrease in cache misses and pipeline stalls. A comparison to reference implementations using OpenMP confirms this performance gain of almost one order of magnitude.

Keywords: Thread placement; Task based runtimes; Hardware affinity; Parallel programming
