ISBN (print): 9781728167206
With the spread of multi-core systems, parallel programming has grown in popularity. However, parallelizing an algorithm can in some cases yield negative results due to overhead, and implementing parallel algorithms is not always an easy or even achievable task. Therefore, finding out to what extent a multi-core architecture can enhance an algorithm's speedup is extremely valuable. This paper measures the execution time and speedup of three of the most popular divide-and-conquer algorithms (merge sort, quicksort, and matrix multiplication), with experiments conducted over various array sizes. The experiments run on three different multi-core machines, ranging from a dual-core to a hexa-core CPU. The results show that speedup is directly proportional to the number of CPU cores: using a hexa-core CPU in lieu of a dual-core CPU can achieve up to twice the speedup. Thus, utilizing a powerful multi-core CPU can rival the use of parallelism on a standard CPU.
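A minimal sketch of the kind of kernel benchmarked above: a merge sort parallelized with OpenMP tasks. The paper does not publish its code, so the function names and the cutoff constant below are illustrative:

#include <stdlib.h>
#include <string.h>
#include <omp.h>

/* Merge two sorted halves a[lo..mid) and a[mid..hi) using scratch space. */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

/* Sort a[lo..hi); spawn tasks only for ranges big enough to amortize overhead. */
static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (hi - lo > 4096) {                /* illustrative cutoff against task overhead */
        #pragma omp task shared(a, tmp)
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
        #pragma omp taskwait              /* both halves must finish before merging */
    } else {                              /* sequential below the cutoff */
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
    }
    merge(a, tmp, lo, mid, hi);
}

void parallel_merge_sort(int *a, int n) {
    int *tmp = malloc((size_t)n * sizeof(int));
    #pragma omp parallel
    #pragma omp single nowait             /* one thread seeds the task tree */
    msort(a, tmp, 0, n);
    free(tmp);
}

The sequential cutoff reflects exactly the overhead trade-off the abstract mentions: spawning tasks for small subarrays costs more than it saves, which is why parallelization can yield negative results at small sizes.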
The PVS search function, as a mainstream and efficient algorithm, has been widely used in various kinds of chess games. We applied a parallel search function based on PVS and improved the running speed of the program. At the same time, we also did some research and experiments on the evaluation function of Amazon chess, which provide a set of usable Amazon evaluation functions and parameter-adjustment results for reference.
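For context, a compact sketch of Principal Variation Search (PVS) in negamax form follows; the game interface (Position, gen_moves, make_move, unmake_move, evaluate) is a placeholder, not the paper's Amazons engine:

#define MAX_MOVES 256
typedef struct Position Position;                 /* opaque game state */
typedef int Move;
extern int  gen_moves(Position *p, Move *out);    /* returns move count */
extern void make_move(Position *p, Move m);
extern void unmake_move(Position *p, Move m);
extern int  evaluate(const Position *p);          /* static evaluation */

int pvs(Position *pos, int depth, int alpha, int beta) {
    if (depth == 0) return evaluate(pos);         /* leaf: static evaluation */
    Move moves[MAX_MOVES];
    int n = gen_moves(pos, moves);
    for (int i = 0; i < n; i++) {
        make_move(pos, moves[i]);
        int score;
        if (i == 0) {
            /* first move: full-window search, assumed to be the principal variation */
            score = -pvs(pos, depth - 1, -beta, -alpha);
        } else {
            /* later moves: cheap null-window probe first */
            score = -pvs(pos, depth - 1, -alpha - 1, -alpha);
            if (score > alpha && score < beta)    /* probe failed high: re-search */
                score = -pvs(pos, depth - 1, -beta, -alpha);
        }
        unmake_move(pos, moves[i]);
        if (score >= beta) return beta;           /* beta cutoff */
        if (score > alpha) alpha = score;
    }
    return alpha;
}

The null-window probes are what make PVS cheap when move ordering is good, and the independent subtree searches are the natural units for the parallelization the abstract describes.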
ISBN (print): 9781450388368
This paper introduces the principles of three classical and widely applied threshold segmentation methods: the Otsu method, the maximum entropy method, and the iterative method. The three methods are implemented, compared, and analyzed on the VS2010 (Microsoft Visual Studio 2010) platform, and the Otsu method, which gives relatively good results, is then ported to standard C on the CCS (Code Composer Studio) platform. After a multi-core DSP (Digital Signal Processor) environment is established on the TMS320C6678, the OpenMP framework is used to parallelize the Otsu method in fork-join mode. Two, four, and eight cores are used to accelerate the Otsu method, and the resulting speedup behavior is summarized. The results show that the parallel implementation of the digital image processing algorithm on a multi-core DSP can effectively improve running speed while preserving the accuracy of the Otsu method.
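As a sketch of the fork-join parallelization described above, the following C function computes the Otsu threshold with an OpenMP-parallel histogram (OpenMP 4.5 array reduction); the names are ours, not the paper's CCS code:

#include <stdint.h>
#include <omp.h>

int otsu_threshold(const uint8_t *img, long n) {
    long hist[256] = {0};
    #pragma omp parallel for reduction(+:hist[:256])
    for (long i = 0; i < n; i++)          /* fork: each core histograms a chunk */
        hist[img[i]]++;
                                          /* join: the serial 256-bin scan is cheap */
    double total = (double)n, sum = 0.0;
    for (int t = 0; t < 256; t++) sum += (double)t * hist[t];

    double sumB = 0.0, wB = 0.0, best = -1.0;
    int thresh = 0;
    for (int t = 0; t < 256; t++) {
        wB += hist[t];                    /* background weight */
        if (wB == 0) continue;
        double wF = total - wB;           /* foreground weight */
        if (wF == 0) break;
        sumB += (double)t * hist[t];
        double mB = sumB / wB, mF = (sum - sumB) / wF;
        double between = wB * wF * (mB - mF) * (mB - mF);
        if (between > best) { best = between; thresh = t; }
    }
    return thresh;                        /* threshold maximizing between-class variance */
}

Only the per-pixel histogram pass scales with image size, which is why it is the part worth forking across cores.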
This paper explains the programming aspects of a promising Java-based programming and execution framework called JavaSymphony. JavaSymphony provides unified high-level programming constructs for applications targeting shared-, distributed-, and hybrid-memory parallel computers, as well as co-processor accelerators. JavaSymphony applications can be executed on multi-/many-core conventional and data-parallel architectures. JavaSymphony is based on the concept of dynamic virtual architectures, which allows programmers to define a hierarchical structure of the underlying computing resources and to control load balancing and task locality. In addition to GPU support, JavaSymphony provides a multi-core-aware scheduling mechanism capable of mapping parallel applications onto large multi-core machines and heterogeneous clusters. Several real applications and benchmarks (on modern multi-core computers, heterogeneous clusters, and machines combining different multi-core CPU and GPU devices) have been used to evaluate the performance. The results demonstrate that JavaSymphony outperforms alternative Java implementations, as well as other modern alternative solutions.
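JavaSymphony's actual API is Java and is not reproduced here; the following generic C sketch (all names invented) only illustrates the dynamic-virtual-architecture idea of a resource hierarchy onto which tasks are mapped for locality:

/* Hypothetical illustration, NOT JavaSymphony's API: a "virtual architecture"
 * is a tree of compute resources; tasks are placed near their data. */
typedef enum { CLUSTER, NODE, SOCKET, CORE, GPU } Level;

typedef struct VA {
    Level level;
    int id;
    int nchildren;
    struct VA **children;   /* e.g. a NODE contains sockets, a SOCKET cores */
} VA;

/* Locality-aware placement: descend toward the child that already holds
 * the task's data; otherwise fall back to the first child (load balancing
 * policies would go here). */
VA *map_task(VA *root, int data_home_id) {
    if (root->nchildren == 0) return root;            /* leaf resource */
    for (int i = 0; i < root->nchildren; i++)
        if (root->children[i]->id == data_home_id)
            return map_task(root->children[i], data_home_id);
    return map_task(root->children[0], data_home_id);
}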
A very common problem in GPU programming is that some combinations of thread block dimensions and other code optimization parameters, such as tiling or unrolling factors, result in dramatically better performance than other kernel configurations. To obtain highly efficient kernels, it is often necessary to search vast and discontinuous spaces consisting of all possible combinations of values for all tunable parameters. This paper presents Kernel Tuner, an easy-to-use tool for testing and auto-tuning OpenCL, CUDA, and C kernels, with support for many search optimization algorithms that accelerate the tuning process. This paper introduces the application of many new solvers and global optimization algorithms to auto-tuning GPU applications. We demonstrate that Kernel Tuner can be used in a wide range of application scenarios and drastically decreases the time spent tuning, e.g., tuning a GEMM kernel on an AMD Vega Frontier Edition 71.2x faster than brute-force search.
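Kernel Tuner itself is a Python tool whose API is not shown here; the self-contained C sketch below only illustrates the brute-force baseline such a tuner replaces, with a synthetic cost model standing in for real kernel compilation and timing:

#include <stdio.h>
#include <float.h>

/* Stub cost model standing in for an actual kernel launch + timing run.
 * Synthetic: pretends 128-wide blocks with tiling factor 4 are optimal. */
static double benchmark_kernel(int block_size, int tile) {
    return 1.0 + 0.001 * (block_size - 128) * (block_size - 128) / 128.0
               + 0.05 * (tile - 4) * (tile - 4);
}

int main(void) {
    const int block_sizes[] = {32, 64, 128, 256, 512};
    const int tiles[]       = {1, 2, 4, 8};
    double best = DBL_MAX;
    int best_b = 0, best_t = 0;
    /* Brute force: every combination is measured. Real spaces have many more
     * parameters, so the cost grows multiplicatively, which is why the guided
     * optimizers the paper introduces pay off. */
    for (int b = 0; b < 5; b++)
        for (int t = 0; t < 4; t++) {
            double ms = benchmark_kernel(block_sizes[b], tiles[t]);
            if (ms < best) { best = ms; best_b = block_sizes[b]; best_t = tiles[t]; }
        }
    printf("best: block=%d tile=%d (%.3f ms)\n", best_b, best_t, best);
    return 0;
}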
Support Vector Machines (SVMs) are widely used as supervised learning models to solve classification problems in machine learning. Training SVMs on large datasets is an extremely challenging task due to excessive storage and computational requirements. To tackle such big-data problems, one needs to design scalable distributed algorithms that parallelize model training and to develop efficient implementations of these algorithms. In this paper, we propose a distributed algorithm for SVM training that is scalable and communication-efficient. The algorithm uses a compact representation of the kernel matrix, based on the QR decomposition of low-rank approximations, to reduce both the computation and the storage requirements of the training stage. This is accompanied by a considerable reduction in the communication required for a distributed implementation of the algorithm. Experiments on benchmark data sets with up to five million samples demonstrate negligible communication overhead and scalability on up to 64 cores. Execution times are vast improvements over those of other widely used packages. Furthermore, the proposed algorithm has linear time complexity with respect to the number of samples, making it ideal for SVM training in decentralized environments such as smart embedded systems and edge-based Internet of Things (IoT) devices.
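The abstract does not spell out the exact factorization; as a sketch, one standard way a QR decomposition of a low-rank (e.g. Nyström-style) approximation compresses the kernel matrix is, in our own notation:

% Low-rank kernel compression (standard construction; the paper's exact
% variant is not specified in the abstract).
K \approx G G^{\top}, \qquad G \in \mathbb{R}^{n \times r},\; r \ll n
% QR-factor the thin matrix G:
G = Q R, \qquad Q \in \mathbb{R}^{n \times r},\; Q^{\top} Q = I_r
% Kernel-vector products then reduce to thin-matrix operations:
K v \approx Q \,(R R^{\top})\, (Q^{\top} v)

Storage drops from O(n^2) for K to O(nr + r^2) for Q and R, which is also what shrinks the data that distributed workers must exchange.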
ISBN (print): 9781665408790
Checksums are used to detect errors that might occur while storing or communicating data. Checking the integrity of data is well established, but only for smaller data sets. In contrast, supercomputers have to deal with huge amounts of data, where failures may remain undetected. Therefore, additional protection becomes a necessity at large scale. However, checking the integrity of larger data sets, especially distributed data, clearly requires parallel approaches. We show how popular checksums, such as CRC-32 or Adler-32, can be parallelized efficiently. This also disproves a widespread belief that parallelizing the aforementioned checksums, especially in a scalable way, is not possible. The mathematical properties behind these checksums enable a method to combine partial checksums such that the result corresponds to the checksum of the concatenated partial data. Our parallel checksum algorithm uses this combination idea in a scalable hierarchical reduction scheme to combine the partial checksums from an arbitrary number of processing elements. Although this reduction scheme can be implemented manually using most parallel programming interfaces, we use the Message Passing Interface, which supports such functionality directly via non-commutative user-defined reduction operations. In conjunction with the efficient checksum capabilities of the zlib library, our algorithm can be implemented not only conveniently and portably, but also very efficiently. Additional shared-memory parallelization within compute nodes completes our hybrid parallel checksum solutions, which show a high scalability of up to 524,288 threads. At this scale, computing the checksums of 240 TiB of data took only 3.4 seconds for CRC-32 and 2.6 seconds for Adler-32. Finally, we discuss the APES application as a representative of dynamic supercomputer applications. Thanks to our scalable checksum algorithm, even such applications are now able to detect many errors within …
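A minimal sketch of the approach the abstract describes, combining zlib's crc32_combine() with a non-commutative user-defined MPI reduction (error handling omitted; the byte-wise struct datatype assumes a homogeneous system):

#include <mpi.h>
#include <zlib.h>
#include <stdint.h>

/* Partial checksum of one rank's block: the CRC value plus the block length,
 * which crc32_combine() needs to merge two partial results. */
typedef struct { uint32_t crc; uint64_t len; } PartialCrc;

/* Non-commutative combine. MPI applies user ops in rank order when
 * commute = 0, with invec holding the left (earlier-rank) operand:
 * inout = in  followed by  inout. */
static void combine_crc(void *invec, void *inoutvec, int *len, MPI_Datatype *dt) {
    PartialCrc *in = invec, *io = inoutvec;
    for (int i = 0; i < *len; i++) {
        io[i].crc = (uint32_t)crc32_combine(in[i].crc, io[i].crc, (z_off_t)io[i].len);
        io[i].len += in[i].len;
    }
    (void)dt;
}

/* Reduce each rank's local CRC-32 into the checksum of the concatenated data. */
uint32_t parallel_crc32(const unsigned char *buf, uint64_t n, MPI_Comm comm) {
    PartialCrc local = { (uint32_t)crc32(0L, buf, (uInt)n), n }, global;

    MPI_Datatype ptype;
    MPI_Type_contiguous(sizeof(PartialCrc), MPI_BYTE, &ptype);
    MPI_Type_commit(&ptype);

    MPI_Op op;
    MPI_Op_create(combine_crc, /*commute=*/0, &op);   /* order must be preserved */

    MPI_Reduce(&local, &global, 1, ptype, op, 0, comm);

    MPI_Op_free(&op);
    MPI_Type_free(&ptype);
    return global.crc;                                 /* valid on rank 0 */
}

MPI's reduction tree supplies exactly the scalable hierarchical combination scheme the abstract mentions, with no hand-written communication.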
Many task models have been proposed to express and analyze the behavior of real-time applications at different levels of precision. Most of them target sequential applications, with no support for parallelism. The digraph task model is one of the most general, as it allows modeling arbitrary directed graphs (digraphs) of sequential job releases. In this paper, we extend the digraph task model to support intra-task parallelism. For the proposed parallel multi-mode digraph model, we derive sufficient schedulability tests, together with a dichotomic search that reduces their pessimism, for a set of n tasks on a heterogeneous single-ISA multi-core platform. To reduce the computational complexity of the schedulability test, we also propose heuristics for (i) partitioning parallel digraph tasks onto the heterogeneous cores and (ii) assigning core operating frequencies to reduce overall energy consumption while meeting real-time constraints. The effectiveness of the proposed approach is validated with an exhaustive set of simulations.
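As a rough illustration of the base model (our own encoding, not the paper's notation, and without the parallel multi-mode extension), a digraph task stores vertices labeled with worst-case execution times and deadlines, plus edges labeled with minimum inter-release separations:

/* Minimal digraph-task encoding (hypothetical field names). */
#define MAX_V 16

typedef struct {
    int nv;                    /* number of vertices (job types)                 */
    int wcet[MAX_V];           /* worst-case execution time of each job type     */
    int deadline[MAX_V];       /* relative deadline of each job type             */
    int edge[MAX_V][MAX_V];    /* edge[u][v] = minimum inter-release time from a
                                  job of type u to one of type v; 0 if the
                                  digraph has no edge u -> v                     */
} DigraphTask;

Any walk through the digraph is a legal release sequence, which is what makes the model general and its schedulability analysis expensive.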
An important challenge in parallel computing is the mapping of parallel algorithms to parallel computing platforms. This requires several activities, such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, and the implementation and deployment of the algorithm on the computing platform. However, current parallel computing approaches very often rely only on conceptual and idiosyncratic models, which fall short of supporting the communication and analysis of design decisions. In this article, we present ParDSL, a domain-specific language framework that provides explicit models to support the activities for mapping parallel algorithms to parallel computing platforms. The language framework includes a coherent set of four domain-specific languages, each of which focuses on one activity of the mapping process. We use the domain-specific languages for modeling the design as well as for generating the required platform-specific models and the code of the selected parallel algorithm. In addition to the languages, a library is defined to support systematic reuse. We discuss the overall architecture of the language framework, the separate DSLs, the corresponding model transformations, and the toolset. The framework is illustrated on four different parallel computing algorithms.
Background: Protein structure comparative analysis and similarity searches play essential roles in structural bioinformatics. Several algorithms for protein structure alignment have been developed in recent years. However, facing the rapid growth of protein structure data, improving overall comparison performance and running efficiency on massive numbers of structures is still challenging. Results: Here, we propose MADOKA, an ultra-fast approach for massive structural-neighbor searching using a novel two-phase algorithm. First, we apply a fast alignment between pairs of structures. Then, we employ a score to select the more similar pairs, on which a more accurate fragment-based, residue-level alignment is carried out. MADOKA performs about 6-100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the structural alignments produced by MADOKA are of better quality than those of the existing algorithms in terms of TM-score and number of aligned residues. We also develop a web server to search for structural neighbors in the PDB database (about 360,000 protein chains in total), with additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://***/ Conclusions: MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA that exploits the power of multi-core CPUs.
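For reference, the TM-score used above as the quality metric is the standard definition (not specific to MADOKA):

\mathrm{TM\text{-}score} = \max\left[\frac{1}{L_{\mathrm{target}}} \sum_{i=1}^{L_{\mathrm{ali}}} \frac{1}{1 + \left(d_i / d_0(L_{\mathrm{target}})\right)^2}\right], \qquad d_0(L) = 1.24\,\sqrt[3]{L - 15} - 1.8

where L_target is the target length, L_ali the number of aligned residue pairs, d_i the distance between the i-th aligned pair, and the maximum runs over superpositions; scores range over (0, 1], with 1 a perfect match.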