ISBN: 9781665408790 (print)
Checksums are used to detect errors that might occur while storing or communicating data. Checking the integrity of data is well-established, but only for smaller data sets. In contrast, supercomputers have to deal with huge amounts of data, which introduces failures that may remain undetected. Therefore, additional protection becomes a necessity at large scale. However, checking the integrity of larger data sets, especially in the case of distributed data, clearly requires parallel approaches. We show how popular checksums, such as CRC-32 or Adler-32, can be parallelized efficiently. This also disproves a widespread belief that parallelizing the aforementioned checksums, especially in a scalable way, is not possible. The mathematical properties behind these checksums enable a method to combine partial checksums such that the result corresponds to the checksum of the concatenated partial data. Our parallel checksum algorithm utilizes this combination idea in a scalable hierarchical reduction scheme to combine the partial checksums from an arbitrary number of processing elements. Although this reduction scheme can be implemented manually using most parallel programming interfaces, we use the Message Passing Interface, which supports such functionality directly via non-commutative user-defined reduction operations. In conjunction with the efficient checksum capabilities of the zlib library, our algorithm can be implemented not only conveniently and portably, but also very efficiently. Additional shared-memory parallelization within compute nodes completes our hybrid parallel checksum solutions, which show high scalability up to 524,288 threads. At this scale, computing the checksums of 240 TiB of data took only 3.4 seconds for CRC-32 and 2.6 seconds for Adler-32. Finally, we discuss the APES application as a representative of dynamic supercomputer applications. Thanks to our scalable checksum algorithm, even such applications are now able to detect many errors withi
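The combination idea described in this abstract can be illustrated for Adler-32 with a minimal Python sketch. It is not the authors' implementation: the helper names `adler32_combine` and `tree_reduce` are hypothetical, the combination formula is written out by hand from the Adler-32 definition, and a sequential pairwise loop stands in for the MPI non-commutative reduction. Only `zlib.adler32` from the Python standard library is used to compute the partial checksums.

```python
import zlib

MOD = 65521  # Adler-32 modulus (largest prime below 2**16)


def adler32_combine(left, right):
    """Combine (checksum, length) pairs of two ADJACENT chunks, left before right.

    Hypothetical helper: derived from the Adler-32 definition, where the low
    16 bits hold A = 1 + sum(bytes) and the high 16 bits hold B = sum of all
    running values of A, both modulo 65521.
    """
    (c1, n1), (c2, n2) = left, right
    a1, b1 = c1 & 0xFFFF, c1 >> 16
    a2, b2 = c2 & 0xFFFF, c2 >> 16
    a = (a1 + a2 - 1) % MOD
    # Every running A of the right chunk is shifted by (a1 - 1), n2 times.
    b = (b1 + b2 + (n2 % MOD) * (a1 - 1)) % MOD
    return ((b << 16) | a, n1 + n2)


def tree_reduce(parts):
    """Pairwise (hierarchical) reduction that preserves chunk order.

    The operation is associative but not commutative, so neighbours are
    always combined left-to-right.
    """
    parts = list(parts)
    while len(parts) > 1:
        paired = [adler32_combine(parts[i], parts[i + 1])
                  for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:  # odd element is carried to the next level
            paired.append(parts[-1])
        parts = paired
    return parts[0]


if __name__ == "__main__":
    data = bytes(range(256)) * 1000
    # Unequal chunk sizes, as if each chunk lived on a different process.
    bounds = [0, 1000, 70000, 70001, 200000, len(data)]
    chunks = [data[s:e] for s, e in zip(bounds, bounds[1:])]
    partial = [(zlib.adler32(c), len(c)) for c in chunks]
    combined, total_len = tree_reduce(partial)
    print(combined == zlib.adler32(data), total_len == len(data))
```

Because each pair carries its chunk length, the reduction needs no access to the original data once the partial checksums exist, which is what makes a hierarchical scheme possible in the first place.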
ISBN: 9781450388368 (print)
This paper introduces the principles of three classical and widely applied local value methods: the Otsu method, the maximum entropy method, and the iterative method. The methods are run on the VS2010 (Microsoft Visual Studio 2010) platform, then compared and analyzed. The Otsu method, which gives relatively good results, is then ported to standard C on the CCS (Code Composer Studio) platform. After a multi-core DSP (Digital Signal Processor) environment based on the TMS320C6678 is established, the OpenMP framework is used to optimize the Otsu method through parallel processing in fork-join mode. Two, four, and eight cores are used for fast processing of the Otsu method, and the pattern of the resulting speed-up is summarized. The results show that the parallel implementation of the digital image processing algorithm on a multi-core DSP presented in this paper can effectively improve the running speed while preserving the accuracy of the Otsu method.
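As a rough illustration of the Otsu method named in this abstract, the following Python sketch picks the threshold that maximizes the between-class variance of a grayscale histogram. It is a sketch under stated assumptions, not the paper's C/OpenMP code: the function names are hypothetical, and the per-chunk histograms that are merged afterwards only mimic, sequentially, the kind of work a fork-join scheme could distribute across cores.

```python
def histogram(pixels, levels=256):
    """Count how often each gray level occurs in a flat list of pixels."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    return hist


def merge_histograms(partials):
    """Join step: add up per-chunk histograms element by element."""
    return [sum(column) for column in zip(*partials)]


def otsu_threshold(hist):
    """Return t maximizing between-class variance; pixels <= t form class 0."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of gray levels in class 0
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        if w0 == 0:
            continue            # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break               # class 1 empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


if __name__ == "__main__":
    pixels = [10] * 100 + [200] * 100          # two well-separated gray levels
    chunks = [pixels[i:i + 50] for i in range(0, len(pixels), 50)]
    hist = merge_histograms([histogram(c) for c in chunks])  # "fork" then "join"
    t = otsu_threshold(hist)
    print(10 <= t < 200)
```

The histogram pass touches every pixel and is independent per chunk, so it is the natural candidate for parallel execution; the threshold search itself runs over only 256 levels.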
An important challenge in parallel computing is the mapping of parallel algorithms to parallel computing platforms. This requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, and the implementation and deployment of the algorithm to the computing platform. However, current parallel computing approaches very often use only conceptual and idiosyncratic models, which fall short in supporting the communication and analysis of the design decisions. In this article, we present ParDSL, a domain-specific language framework for providing explicit models to support the activities for mapping parallel algorithms to parallel computing platforms. The language framework includes four coherent sets of domain-specific languages, each of which focuses on an activity of the mapping process. We use the domain-specific languages for modeling the design as well as for generating the required platform-specific models and the code of the selected parallel algorithm. In addition to the languages, a library is defined to support systematic reuse. We discuss the overall architecture of the language framework, the separate DSLs, the corresponding model transformations, and the toolset. The framework is illustrated for four different parallel computing algorithms.
This paper explains the programming aspects of a promising Java-based programming and execution framework called JavaSymphony. JavaSymphony provides unified high-level programming constructs for applications targeting shared, distributed, and hybrid memory parallel computers, as well as co-processor accelerators. JavaSymphony applications can be executed on multi/many-core conventional and data-parallel architectures. JavaSymphony is based on the concept of dynamic virtual architectures, which allows programmers to define a hierarchical structure of the underlying computing resources and to control load-balancing and task-locality. In addition to GPU support, JavaSymphony provides a multi-core aware scheduling mechanism capable of mapping parallel applications on large multi-core machines and heterogeneous clusters. Several real applications and benchmarks (on modern multi-core computers, heterogeneous clusters, and machines consisting of a combination of different multi-core CPU and GPU devices) have been used to evaluate the performance. The results demonstrate that JavaSymphony outperforms the Java implementations, as well as other modern alternative solutions.
Support Vector Machines (SVMs) are widely used as supervised learning models to solve the classification problem in machine learning. Training SVMs for large datasets is an extremely challenging task due to excessive storage and computational requirements. To tackle so-called big data problems, one needs to design scalable distributed algorithms to parallelize the model training and to develop efficient implementations of these algorithms. In this paper, we propose a distributed algorithm for SVM training that is scalable and communication-efficient. The algorithm uses a compact representation of the kernel matrix, which is based on the QR decomposition of low-rank approximations, to reduce both computation and storage requirements for the training stage. This is accompanied by a considerable reduction in the communication required for a distributed implementation of the algorithm. Experiments on benchmark data sets with up to five million samples demonstrate negligible communication overhead and scalability on up to 64 cores. Execution times are vast improvements over other widely used packages. Furthermore, the proposed algorithm has linear time complexity with respect to the number of samples, making it ideal for SVM training in decentralized environments such as smart embedded systems and the edge-based Internet of Things (IoT).
A very common problem in GPU programming is that some combination of thread block dimensions and other code optimization parameters, like tiling or unrolling factors, results in dramatically better performance than other kernel configurations. To obtain highly-efficient kernels it is often required to search vast and discontinuous search spaces that consist of all possible combinations of values for all tunable parameters. This paper presents Kernel Tuner, an easy-to-use tool for testing and auto-tuning OpenCL, CUDA, and C kernels with support for many search optimization algorithms that accelerate the tuning process. This paper introduces the application of many new solvers and global optimization algorithms for auto-tuning GPU applications. We demonstrate that Kernel Tuner can be used in a wide range of application scenarios and drastically decreases the time spent tuning, e.g. tuning a GEMM kernel on AMD Vega Frontier Edition 71.2x faster than brute force search. (C) 2018 The Author. Published by Elsevier B.V.
Deep learning is an important component of Big Data analytic tools and intelligent applications, such as self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training convolutional neural networks (CNN) named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). Major features of CHAOS include the support for thread and vector parallelism, non-instant updates of weight parameters during back-propagation without a significant delay, and implicit synchronization in arbitrary order. CHAOS is tailored for parallel computing systems that are accelerated with the Intel Xeon Phi. We evaluate our parallelization approach empirically using measurement techniques and performance modeling for various numbers of threads and CNN architectures. Experimental results for the MNIST dataset of handwritten digits using the total number of threads on the Xeon Phi show speedups of up to 103x compared to the execution on one thread of the Xeon Phi, 14x compared to the sequential execution on Intel Xeon E5, and 58x compared to the sequential execution on Intel Core i5.
The new video coding standard High Efficiency Video Coding (HEVC) greatly improves the efficiency of intra-prediction with respect to previous standards. However, these new features significantly increase the computational complexity, because all the possible combinations of unit size and intra-prediction modes are evaluated. In this paper, we improve our previous version of the mean directional variance in sliding window (MDV-SW) algorithm, which detects the texture orientation of a block of pixels, allowing the speedup of the HEVC intra-prediction. This was done by doubling the number of detectable texture orientations, which allowed us to use pixels from the original image as reference samples instead of the reconstructed pixels, eliminating the dependency between blocks and making it possible to parallelize the algorithm at block level when an image is processed with MDV-SW. Finally, this paper shows how a parallel implementation can significantly speed up the MDV-SW algorithm, achieving a reduction of around 70% when threads in Windows or OpenMP are used, compared to the sequential implementation.
Parallel programming has been a challenging task for application programmers. Stream processing is an application domain present in several scientific, enterprise, and financial areas that lack suitable abstractions to exploit parallelism. Our goal is to assess the feasibility of state-of-the-art frameworks/libraries (Pthreads, TBB, and FastFlow) and the SPar domain-specific language for real-world streaming applications (Dedup, Ferret, and Bzip2) targeting multi-core architectures. SPar was specially designed to provide high-level and productive stream parallelism abstractions, supporting programmers with standard C++11 annotations. For the experiments, we implemented three streaming applications. We discuss SPar's programmability advantages compared to the frameworks in terms of productivity and structured parallel programming. The results demonstrate that SPar improves productivity and provides the necessary features to achieve performance similar to the state of the art.
Scientific applications are some of the most computationally demanding pieces of software. Their core is usually a set of linear algebra operations, which may represent a significant part of the overall run-time of the application. BLAS libraries aim to solve this problem by exposing a set of highly optimized, reusable routines. There are several implementations specifically tuned for different types of computing platforms, including coprocessors. Examples include the one bundled with the Intel MKL library, which targets Intel CPUs or Xeon Phi coprocessors, and the cuBLAS library, which is specifically designed for NVIDIA GPUs. Nowadays, computing nodes in many supercomputing clusters include one or more different coprocessor types. Fully exploiting these platforms might require programs that can adapt at run-time to the chosen device type, hardwiring into the program the code needed to use a different library for each device type that can be selected. This also forces the programmer to deal with different interface particularities and mechanisms to manage the memory transfers of the data structures used as parameters. This paper presents a unified, performance-oriented, and portable interface for BLAS. This interface has been integrated into a heterogeneous programming model (Controllers) which supports groups of CPU cores, Xeon Phi accelerators, or NVIDIA GPUs in a transparent way. The contributions of this paper include: an abstraction layer to hide programming differences between diverse BLAS libraries; new types of kernel classes to support the context manipulation of different external BLAS libraries; a new kernel selection policy that considers both programmer kernels and different external libraries; and a completely new Controller library interface for the whole collection of BLAS routines. This proposal enables the creation of BLAS-based portable codes that can execute on top of different types of accelerators by changing a single initialization parameter. Our softw