Search Results - Inner Mongolia University Library

International Conference on Parallel and Distributed Systems (ICPADS)

Author: Christian Siebert (Heinrich Heine University Düsseldorf Centre for Information and Media Technology Düsseldorf Germany)

ISBN: (print) 9781665408790

Checksums are used to detect errors that might occur while storing or communicating data. Checking the integrity of data is well-established, but only for smaller data sets. In contrast, supercomputers have to deal with huge amounts of data, which introduces failures that may remain undetected. Therefore, additional protection becomes a necessity at large scale. However, checking the integrity of larger data sets, especially in case of distributed data, clearly requires parallel approaches. We show how popular checksums, such as CRC-32 or Adler-32, can be parallelized efficiently. This also disproves a widespread belief that parallelizing the aforementioned checksums, especially in a scalable way, is not possible. The mathematical properties behind these checksums enable a method to combine partial checksums such that its result corresponds to the checksum of the concatenated partial data. Our parallel checksum algorithm utilizes this combination idea in a scalable hierarchical reduction scheme to combine the partial checksums from an arbitrary number of processing elements. Although this reduction scheme can be implemented manually using most parallel programming interfaces, we use the Message Passing Interface, which supports such functionality directly via non-commutative user-defined reduction operations. In conjunction with the efficient checksum capabilities of the zlib library, our algorithm can not only be implemented conveniently and in a portable way, but also very efficiently. Additional shared-memory parallelization within compute nodes completes our hybrid parallel checksum solutions, which show a high scalability of up to 524,288 threads. At this scale, computing the checksums of 240 TiB data took only 3.4 seconds for CRC-32 and 2.6 seconds for Adler-32. Finally, we discuss the APES application as a representative of dynamic supercomputer applications. Thanks to our scalable checksum algorithm, even such applications are now able to detect many errors withi

Keywords: Runtime parallel programming Heuristic algorithms Scalability Message passing Distributed databases Supercomputers
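
The combination idea described in the abstract above can be illustrated for Adler-32 with a minimal sketch. This is not the paper's implementation: the function names `adler32_combine` and `chunked_adler32` are hypothetical, the combine formula is derived directly from the Adler-32 definition, and a sequential `functools.reduce` stands in for the hierarchical reduction. Only `zlib.adler32` is a real library call.

```python
import zlib
from functools import reduce

MOD = 65521  # Adler-32 modulus (largest prime below 2**16)


def adler32_combine(left, right):
    """Combine (checksum, length) pairs of two ADJACENT chunks, left before right.

    The operation is associative but not commutative, so chunk order matters.
    """
    adler1, len1 = left
    adler2, len2 = right
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    # A is 1 plus the byte sum, so the two "1" offsets must not be counted twice.
    a = (a1 + a2 - 1) % MOD
    # Every running sum in the right chunk is shifted by the left byte sum (a1 - 1).
    b = (b1 + b2 + (len2 % MOD) * (a1 - 1)) % MOD
    return (b << 16) | a, len1 + len2


def chunked_adler32(data, n_chunks):
    """Checksum each chunk independently, then reduce the partial results in order."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [(zlib.adler32(c), len(c)) for c in chunks]  # could run in parallel
    # (1, 0) is the checksum and length of empty data, the identity element.
    return reduce(adler32_combine, partials, (1, 0))[0]


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000
    print(chunked_adler32(payload, 8) == zlib.adler32(payload))
```

Because the combine step is associative, the partial results can be merged pairwise in a tree rather than left to right, provided the chunk order is preserved.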

Design and Implementation of Multi-core DSP Parallel Compiler Based on Otsu Method 20

Proceedings of the 4th International Conference on Advances in Image Processing

Authors: Tianxu Zhang; Fanchen Meng (Wuhan Institute of Technology China and Huazhong University of Science and Technology China; Wuhan Institute of Technology China)

ISBN: (print) 9781450388368

This paper introduces the principles of three classical and widely applied local value methods: the Otsu method, the maximum entropy method, and the iterative method. The methods are run on the VS2010 (Microsoft Visual Studio 2010) platform, then compared and analyzed. The Otsu method, which gives relatively good results, is then ported to standard C on the CCS (Code Composer Studio) platform. After a multi-core DSP (Digital Signal Processor) environment is set up on the TMS320C6678, the OpenMP framework is used in fork-join mode to parallelize and optimize the Otsu method. Two, four, and eight cores are used to speed up the Otsu method, and the pattern of the speed increase is summarized. The results show that the parallel implementation of the digital image processing algorithm on the multi-core DSP effectively improves the running speed while preserving the accuracy of the Otsu method.

Keywords: parallel programming Multi-core DSP Otsu method Digital image processing
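
For readers unfamiliar with the Otsu method named in this record, the following is a minimal sketch in plain Python, not the paper's C/OpenMP code. It assumes 8-bit grey levels and a flat list of pixel values; the helper names `histogram`, `merge_histograms`, and `otsu_threshold` are hypothetical. The histogram is built per chunk and then merged, which is the part that a fork-join scheme could hand to separate cores.

```python
def histogram(pixels, levels=256):
    """Count how often each grey level occurs in one chunk of pixels."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    return hist


def merge_histograms(hists):
    """Join step: add the per-chunk histograms element by element."""
    return [sum(column) for column in zip(*hists)]


def otsu_threshold(hist):
    """Return the level t that maximizes the between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of grey values in the lower class
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


if __name__ == "__main__":
    image = [12, 15, 10, 14] * 50 + [200, 210, 190, 205] * 50
    chunks = [image[i:i + 100] for i in range(0, len(image), 100)]
    hist = merge_histograms([histogram(c) for c in chunks])  # fork, then join
    print(otsu_threshold(hist))
```

The threshold search itself is a single pass over 256 levels, so in this sketch only the histogram stage would benefit from being split across cores.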

ParDSL: a domain-specific language framework for supporting deployment of parallel algorithms

SOFTWARE AND SYSTEMS MODELING, 2019, Vol. 18, No. 5, pp. 2907-2935

Authors: Tekinerdogan, Bedir; Arkin, Ethem (Wageningen Univ Informat Technol Wageningen Netherlands; Aselsan Ankara Turkey)

An important challenge in parallel computing is the mapping of parallel algorithms to parallel computing platforms. This requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform and the implementation and deployment of the algorithm to the computing platform. However, in current parallel computing approaches very often only conceptual and idiosyncratic models are used which fall short in supporting the communication and analysis of the design decisions. In this article, we present ParDSL, a domain-specific language framework for providing explicit models to support the activities for mapping parallel algorithms to parallel computing platforms. The language framework includes four coherent sets of domain-specific languages, each of which focuses on an activity of the mapping process. We use the domain-specific languages for modeling the design as well as for generating the required platform-specific models and the code of the selected parallel algorithm. In addition to the languages, a library is defined to support systematic reuse. We discuss the overall architecture of the language framework, the separate DSLs, the corresponding model transformations and the toolset. The framework is illustrated for four different parallel computing algorithms.

Keywords: Model-driven software development parallel programming High-performance computing Domain-specific language Architecture framework

On the parallel programmability of JavaSymphony for multi-cores and clusters

INTERNATIONAL JOURNAL OF AD HOC AND UBIQUITOUS COMPUTING, 2019, Vol. 30, No. 4, pp. 247-264

Authors: Aleem, Muhammad; Prodan, Radu; Islam, Muhammad Arshad; Iqbal, Muhammad Azhar (Capital Univ Sci & Technol Dept Comp Sci Islamabad 44000 Pakistan; Univ Innsbruck Inst Comp Sci A-6020 Innsbruck Austria)

This paper explains the programming aspects of a promising Java-based programming and execution framework called JavaSymphony. JavaSymphony provides unified high-level programming constructs for applications related to shared, distributed, hybrid memory parallel computers, and co-processor accelerators. JavaSymphony applications can be executed on multi/many-core conventional and data-parallel architectures. JavaSymphony is based on the concept of dynamic virtual architectures, which allows programmers to define a hierarchical structure of the underlying computing resources and to control load-balancing and task-locality. In addition to GPU support, JavaSymphony provides a multi-core aware scheduling mechanism capable of mapping parallel applications on large multi-core machines and heterogeneous clusters. Several real applications and benchmarks (on modern multi-core computers, heterogeneous clusters, and machines consisting of a combination of different multi-core CPU and GPU devices) have been used to evaluate the performance. The results demonstrate that JavaSymphony outperforms the Java implementations, as well as other modern alternative solutions.

Keywords: parallel programming Java multi-core scheduler GPU computing

Fast and Communication-Efficient Algorithm for Distributed Support Vector Machine Training

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, Vol. 30, No. 5, pp. 1065-1076

Authors: Dass, Jyotikrishna; Sarin, Vivek; Mahapatra, Rabi N. (Texas A&M Univ Dept Comp Sci & Engn College Stn TX 77840 USA)

Support Vector Machines (SVM) are widely used as supervised learning models to solve the classification problem in machine learning. Training SVMs for large datasets is an extremely challenging task due to excessive storage and computational requirements. To tackle so-called big data problems, one needs to design scalable distributed algorithms to parallelize the model training and to develop efficient implementations of these algorithms. In this paper, we propose a distributed algorithm for SVM training that is scalable and communication-efficient. The algorithm uses a compact representation of the kernel matrix, which is based on the QR decomposition of low-rank approximations, to reduce both computation and storage requirements for the training stage. This is accompanied by considerable reduction in communication required for a distributed implementation of the algorithm. Experiments on benchmark data sets with up to five million samples demonstrate negligible communication overhead and scalability on up to 64 cores. Execution times are vast improvements over other widely used packages. Furthermore, the proposed algorithm has linear time complexity with respect to the number of samples making it ideal for SVM training on decentralized environments such as smart embedded systems and edge-based internet of things, IoT.

Keywords: Machine learning support vector machines classification algorithms parallel programming distributed computing message passing quadratic programming iterative algorithms optimization multicore processing

Kernel Tuner: A search-optimizing GPU code auto-tuner

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, Vol. 90, pp. 347-358

Author: van Werkhoven, Ben (Netherlands eSci Ctr Sci Pk 140 NL-1098 XG Amsterdam Netherlands)

A very common problem in GPU programming is that some combination of thread block dimensions and other code optimization parameters, like tiling or unrolling factors, results in dramatically better performance than other kernel configurations. To obtain highly-efficient kernels it is often required to search vast and discontinuous search spaces that consist of all possible combinations of values for all tunable parameters. This paper presents Kernel Tuner, an easy-to-use tool for testing and auto-tuning OpenCL, CUDA, and C kernels with support for many search optimization algorithms that accelerate the tuning process. This paper introduces the application of many new solvers and global optimization algorithms for auto-tuning GPU applications. We demonstrate that Kernel Tuner can be used in a wide range of application scenarios and drastically decreases the time spent tuning, e.g. tuning a GEMM kernel on AMD Vega Frontier Edition 71.2x faster than brute force search. (C) 2018 The Author. Published by Elsevier B.V.

Keywords: GPU computing Auto-tuning parallel programming Performance optimization Software development
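
The search problem described in this abstract (all combinations of values for all tunable parameters) can be sketched as follows. This is an illustration only and does not use Kernel Tuner's API: `search_space`, `brute_force`, `random_search`, and `toy_cost` are hypothetical names, and `toy_cost` is a made-up stand-in for a measured kernel run time.

```python
import itertools
import random


def search_space(tune_params):
    """Yield every configuration in the Cartesian product of parameter values."""
    names = list(tune_params)
    for values in itertools.product(*(tune_params[n] for n in names)):
        yield dict(zip(names, values))


def brute_force(tune_params, measure):
    """Evaluate every configuration and return the fastest one."""
    return min(search_space(tune_params), key=measure)


def random_search(tune_params, measure, budget, seed=0):
    """Evaluate only `budget` randomly chosen configurations."""
    configs = list(search_space(tune_params))
    rng = random.Random(seed)
    sample = rng.sample(configs, min(budget, len(configs)))
    return min(sample, key=measure)


def toy_cost(cfg):
    """Hypothetical cost model; a real tuner would compile and time the kernel."""
    return (abs(cfg["block_size_x"] - 128) / 32
            + abs(cfg["tile_size"] - 4)
            + (0.0 if cfg["unroll"] else 0.5))


if __name__ == "__main__":
    params = {
        "block_size_x": [32, 64, 128, 256, 512],
        "tile_size": [1, 2, 4, 8],
        "unroll": [False, True],
    }
    print(brute_force(params, toy_cost))
    print(random_search(params, toy_cost, budget=10))
```

Brute force costs one measurement per configuration, so the space grows multiplicatively with each added parameter; a search strategy trades a guarantee of finding the optimum for far fewer measurements.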

CHAOS: a parallelization scheme for training convolutional neural networks on Intel Xeon Phi

JOURNAL OF SUPERCOMPUTING, 2019, Vol. 75, No. 1, pp. 197-227

Authors: Viebke, Andre; Memeti, Suejb; Pllana, Sabri; Abraham, Ajith (Linnaeus Univ Dept Comp Sci S-35195 Vaxjo Sweden; Machine Intelligence Res Labs MIR Labs 13rd St NWPOB 2259 Auburn WA 98071 USA)

Deep learning is an important component of Big Data analytic tools and intelligent applications, such as self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training convolutional neural networks (CNN) named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). Major features of CHAOS include the support for thread and vector parallelism, non-instant updates of weight parameters during back-propagation without a significant delay, and implicit synchronization in arbitrary order. CHAOS is tailored for parallel computing systems that are accelerated with the Intel Xeon Phi. We evaluate our parallelization approach empirically using measurement techniques and performance modeling for various numbers of threads and CNN architectures. Experimental results for the MNIST dataset of handwritten digits using the total number of threads on the Xeon Phi show speedups of up to 103x compared to the execution on one thread of the Xeon Phi, 14x compared to the sequential execution on Intel Xeon E5, and 58x compared to the sequential execution on Intel Core i5.

Keywords: parallel programming Deep learning Convolutional neural networks Intel Xeon Phi

Parallelization and improvement of the MDV-SW algorithm for HEVC intra-prediction coding

JOURNAL OF SUPERCOMPUTING, 2019, Vol. 75, No. 3, pp. 1150-1162

Authors: Georgiana Paraschiv, Elena; Ruiz-Coll, Damian; Pantoja, Maria; Fernandez-Escribano, Gerardo (Univ Castilla La Mancha Inst Invest Informat Albacete Albacete Spain; Univ Rey Juan Carlos Fuenlabrada Spain; Cal Poly San Luis Obispo Coll Engn San Luis Obispo CA USA)

The new video coding standard high-efficiency video encoding (HEVC) greatly improves the efficiency of intra-prediction with respect to previous standards. However, these new features increase significantly the computational complexity, by evaluating all the possible combinations of unit size and intra-prediction modes. In this paper, we improved our previous version of the mean directional variance in sliding window (MDV-SW) algorithm, which detects the texture orientation of a block of pixels, allowing the speedup of the HEVC intra-prediction. This was done by doubling the number of texture orientations detectable, which allowed us to use pixels from the original image as reference samples instead of the reconstructed pixels, eliminating the dependency between blocks and making it possible to parallelize the algorithm at block level when an image is processed with MDV-SW. Finally, this paper shows how the use of parallel implementation can speed up significantly the MDV-SW algorithm, achieving a reduction of around 70% when threads in Windows or OpenMP are used, compared to sequential implementation.

Keywords: HEVC Intra-prediction parallel programming Threads

High-Level and Productive Stream Parallelism for Dedup, Ferret, and Bzip2

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2019, Vol. 47, No. 2, pp. 253-271

Authors: Griebler, Dalvan; Hoffmann, Renato B.; Danelutto, Marco; Fernandes, Luiz G. (Pontificia Univ Catolica Rio Grande do Sul Fac Informat Porto Alegre RS Brazil; Univ Pisa Comp Sci Dept Pisa Italy)

Parallel programming has been a challenging task for application programmers. Stream processing is an application domain present in several scientific, enterprise, and financial areas that lack suitable abstractions to exploit parallelism. Our goal is to assess the feasibility of state-of-the-art frameworks/libraries (Pthreads, TBB, and FastFlow) and the SPar domain-specific language for real-world streaming applications (Dedup, Ferret, and Bzip2) targeting multi-core architectures. SPar was specially designed to provide high-level and productive stream parallelism abstractions, supporting programmers with standard C++11 annotations. For the experiments, we implemented three streaming applications. We discussed SPar's programmability advantages compared to the frameworks in terms of productivity and structured parallel programming. The results demonstrate that SPar improves productivity and provides the necessary features to achieve similar performance compared to the state-of-the-art.

Keywords: High-level parallelism parallel programming Stream processing parallel patterns Pipeline parallelism Streaming applications

Toward a BLAS library truly portable across different accelerator types

JOURNAL OF SUPERCOMPUTING, 2019, Vol. 75, No. 11, pp. 7101-7124

Authors: Rodriguez-Gutiez, Eduardo; Moreton-Fernandez, Ana; Gonzalez-Escribano, Arturo; Llanos, Diego R. (Univ Valladolid Dept Informat Paseo Belen E-47011 Valladolid Spain)

Scientific applications are some of the most computationally demanding software pieces. Their core is usually a set of linear algebra operations, which may represent a significant part of the overall run-time of the application. BLAS libraries aim to solve this problem by exposing a set of highly optimized, reusable routines. There are several implementations specifically tuned for different types of computing platforms, including coprocessors. Some examples include the one bundled with the Intel MKL library, which targets Intel CPUs or Xeon Phi coprocessors, or the cuBLAS library, which is specifically designed for NVIDIA GPUs. Nowadays, computing nodes in many supercomputing clusters include one or more different coprocessor types. Fully exploiting these platforms might require programs that can adapt at run-time to the chosen device type, hardwiring in the program the code needed to use a different library for each device type that can be selected. This also forces the programmer to deal with different interface particularities and mechanisms to manage the memory transfers of the data structures used as parameters. This paper presents a unified, performance-oriented, and portable interface for BLAS. This interface has been integrated into a heterogeneous programming model (Controllers) which supports groups of CPU cores, Xeon Phi accelerators, or NVIDIA GPUs in a transparent way. The contribution of this paper includes: an abstraction layer to hide programming differences between diverse BLAS libraries; new types of kernel classes to support the context manipulation of different external BLAS libraries; a new kernel selection policy that considers both programmer kernels and different external libraries; a completely new Controller library interface for the whole collection of BLAS routines. This proposal enables the creation of BLAS-based portable codes that can execute on top of different types of accelerators by changing a single initialization parameter. Our softw

Keywords: BLAS parallel programming Scientific libraries Heterogeneous programming Accelerators Coprocessors GPU Xeon Phi MIC CUDA
