ISBN: 9781509048342 (print)
Support Vector Machine (SVM) is one of the most popular machine learning algorithms for classification tasks and helps organizations improve their efficiency in different ways. Many studies have sought to improve SVM in terms of speed, accuracy, and/or scalability. The algorithm has parameters that need precise tuning to perform well. This work proposes a novel parallelized parameter selection method that uses the Flower Pollination Algorithm (FPA) to quickly find the optimal parameters of SVM. In particular, the MapReduce model introduced in big data frameworks is applied to both FPA and SVM, which yields a fully distributed algorithm that supports large datasets. The experimental results of parallelized FPA-SVM on real datasets show its outstanding speed in generating optimal models while maintaining high accuracy.
ISBN: 9781538623268 (print)
Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor operations make such implementations challenging. We leverage the fact that sparse tensor operations share similar computation patterns to propose a unified tensor representation called F-COO. Combined with GPU-specific optimizations, F-COO provides highly-optimized implementations of sparse tensor computations on GPUs. The performance of the proposed unified approach is demonstrated for tensor-based kernels such as the Sparse Matricized Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix Multiply (SpTTM) that are used in tensor decomposition algorithms. Compared to state-of-the-art work we improve the performance of SpTTM and SpMTTKRP up to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement the CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
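To make the SpTTM kernel named above concrete, here is a minimal sequential sketch in plain Python. It assumes an ordinary coordinate (COO) list of nonzeros, not the F-COO layout the abstract proposes, and it assumes the product is taken along the third mode of a third-order tensor. The function name `spttm_mode3` and the sample data are hypothetical.

```python
from collections import defaultdict

def spttm_mode3(coords, values, U):
    """Multiply a sparse third-order tensor (plain COO) by a dense matrix along mode 3.

    coords: list of (i, j, k) index triples of the nonzeros
    values: list of the matching nonzero values
    U:      dense K x R matrix given as a list of rows
    Returns a dict mapping (i, j) to the dense length-R fiber Y[i, j, :],
    where Y[i, j, r] = sum over k of X[i, j, k] * U[k][r].
    """
    rank = len(U[0])
    out = defaultdict(lambda: [0.0] * rank)
    for (i, j, k), v in zip(coords, values):
        row = U[k]
        fiber = out[(i, j)]
        # Each nonzero contributes a scaled row of U to one output fiber.
        for r in range(rank):
            fiber[r] += v * row[r]
    return dict(out)

# Hypothetical 2 x 3 x 3 tensor with three nonzeros and a 3 x 2 dense matrix.
coords = [(0, 0, 0), (0, 0, 2), (1, 2, 1)]
values = [2.0, 3.0, 4.0]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = spttm_mode3(coords, values, U)
```

The output is sparse over the (i, j) pairs but dense along r, so nonzeros that share a fiber accumulate into the same output row. In a parallel implementation, this shared accumulation is where updates have to be coordinated.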
ISBN: 9781538631690 (print)
This paper presents parallelization strategies for the implementation of imaging algorithms for synthetic aperture radar (SAR). Great emphasis is placed on time-domain based algorithms, namely the Global Backprojection Algorithm (GBP) and its accelerated version, the Fast Factorized Backprojection Algorithm (FFBP). Multi-core platforms are selected for implementation because some combine good performance with moderate power consumption. The implemented algorithms support several types of parallelization, as the stages of the algorithms can be handled sequentially or interleaved. For the GBP algorithm, three different data distribution schemes are investigated. For the FFBP algorithm, a successive stage calculation method is compared with a combined calculation method. As an example, the performance is evaluated on the Odroid-XU4, a low-cost, low-energy, yet powerful multi-core platform. All parallelization strategies show an almost linear speed-up with the number of cores used. Even though a specific multi-core platform is investigated, the design decisions are applicable to general multi-core architectures.
Underdetermined systems of equations in which the minimum norm solution needs to be computed arise in many applications, such as geophysics, signal processing, and biomedical engineering. In this article, we introduce a new parallel algorithm for obtaining the minimum 2-norm solution of an underdetermined system of equations. The proposed algorithm is based on the Balance scheme, which was originally developed for the parallel solution of banded linear systems. The proposed scheme assumes a generalized banded form where the coefficient matrix has column overlapped block structure in which the blocks could be dense or sparse. In this article, we implement the more general sparse case. The blocks can be handled independently by any existing sequential or parallel QR factorization library. A smaller reduced system is formed and solved before obtaining the minimum norm solution of the original system in parallel. We experimentally compare and confirm the error bound of the proposed method against the QR factorization based techniques by using true single-precision arithmetic. We implement the proposed algorithm by using the message passing paradigm. We demonstrate numerical effectiveness as well as parallel scalability of the proposed algorithm on both shared and distributed memory architectures for solving various types of problems.
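As a small illustration of what a minimum 2-norm solution is, the sketch below uses the textbook formula x = Aᵀ(AAᵀ)⁻¹b for a matrix with full row rank, evaluated in exact rational arithmetic. This is not the Balance-scheme or QR-based method described in the abstract. The function names and the 2 x 3 example system are hypothetical.

```python
from fractions import Fraction

def solve_square(M, rhs):
    """Solve M y = rhs by Gauss-Jordan elimination with exact fractions.

    Assumes M is square and nonsingular.
    """
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def min_norm_solution(A, b):
    """Return x = A^T (A A^T)^{-1} b for an m x n matrix A with m < n and full row rank."""
    m, n = len(A), len(A[0])
    gram = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)]
            for i in range(m)]
    y = solve_square(gram, b)
    return [sum(A[i][k] * y[i] for i in range(m)) for k in range(n)]

# Hypothetical underdetermined system: 2 equations, 3 unknowns.
A = [[1, 1, 0], [0, 1, 1]]
b = [1, 1]
x = min_norm_solution(A, b)  # [1/3, 2/3, 1/3] as Fractions
```

Among all solutions of Ax = b, this x has the smallest Euclidean norm. For example, (1, 0, 1) also satisfies the system but has squared norm 2, compared with 2/3 here. Forming AAᵀ squares the condition number, which is one reason QR-based approaches are usually preferred in floating-point arithmetic.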
ISBN: 9781538645475 (print)
This paper presents a new efficient algorithm for scaling by a power of two in the residue number system (RNS). It focuses on arbitrary moduli sets with large dynamic ranges. To determine the remainder when dividing the number to be scaled by the scaling factor, the algorithm uses an interval estimation of the RNS representation. The proposed algorithm requires only machine-precision integer and floating-point operations and parallelizes well. The algorithm is implemented for CPU, as well as for GPU using CUDA C.
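The sketch below shows what scaling by a power of two means in an RNS, under several assumptions: the moduli (7, 11, 13) are a hypothetical pairwise-coprime odd set, and the remainder is obtained by a full Chinese Remainder Theorem reconstruction, used here only as a stand-in for the interval estimation described in the abstract. Once the remainder is known, each residue channel is updated independently.

```python
from math import prod

MODULI = (7, 11, 13)  # hypothetical odd, pairwise-coprime moduli; dynamic range 1001

def to_rns(x):
    """Encode a non-negative integer below prod(MODULI) as a tuple of residues."""
    return tuple(x % m for m in MODULI)

def from_rns(residues):
    """Decode residues with the Chinese Remainder Theorem."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M

def scale_pow2(residues, k):
    """Return the RNS representation of floor(X / 2**k).

    The remainder X mod 2**k is computed by full reconstruction here,
    which is only a stand-in for the paper's interval estimation.
    """
    step = 1 << k
    remainder = from_rns(residues) % step
    # (X - remainder) is divisible by 2**k, and 2**k is invertible modulo
    # every odd modulus, so the division can be done per channel.
    return tuple(((r - remainder) * pow(step, -1, m)) % m
                 for r, m in zip(residues, MODULI))

x = 845
scaled = scale_pow2(to_rns(x), 3)  # representation of 845 // 8 == 105
```

The per-channel step needs no communication between residues, which is where the operation parallelizes naturally. The modular inverse via `pow(a, -1, m)` requires Python 3.8 or later.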
ISBN: 9781538632505 (print)
The goal of Nonnegative Matrix Factorization (NMF) is to approximate a large nonnegative matrix as a product of two significantly smaller nonnegative matrices. In comparison to other algorithms for computing the NMF, Newton-type methods can be parallelized very well because Newton iterations can be performed in parallel without exchanging data between processes. However, these methods can show problematic convergence behavior, limiting their efficiency. We present a modified algorithm that achieves stable convergence by using Karush-Kuhn-Tucker (KKT) conditions and a reflective technique for constraint handling, backtracking line search for global convergence, and a modified target function to avoid explicit inequality handling. Our method allows for an inexact approach, where only a few Newton iterations are performed per outer iteration. Experiments show that this leads to faster convergence in the sequential as well as in the parallel case. Although shorter outer iterations increase communication overhead, speedups are still satisfactory.
We discuss multidimensional parallel computation for pseudo arc-length moving mesh schemes, which can be used to capture the strong discontinuities of multidimensional detonations. Unlike traditional Euler numerical schemes, parallelizing pseudo arc-length moving mesh schemes raises the problems of diagonal processor communication and mesh point communication, which are illustrated with a schematic diagram and key pseudocode. Finally, numerical examples show that the pseudo arc-length moving mesh schemes are second-order convergent and can successfully capture the strong discontinuity of the detonation wave. In addition, our parallel methods prove effective and clearly reduce the computational time.
ISBN: 9781538626528 (print)
Lattice rules for multiple integration yield a powerful method to approximate high-dimensional integrals for various function classes. Using generator vectors obtained from the fast component-by-component (CBC) construction of lattice rules, we incorporated rank-1 lattices for numerical integration on GPU accelerators. We show accuracy and efficiency results for a number of multivariate integrals, and compare with results obtained by Monte Carlo integration for the same functions also on GPU. The lattice rules achieve high accuracy and excellent speedups.
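A rank-1 lattice rule approximates an integral over the unit cube by averaging the integrand at the points {k z / n} for k = 0, ..., n - 1, where z is the generator vector and {·} is the fractional part. The sequential sketch below assumes a two-dimensional Fibonacci lattice (n = 987, z = (1, 610)) in place of a CBC-constructed generator vector, and a hypothetical periodic test integrand whose exact integral is 1.

```python
import math

def rank1_lattice_rule(f, n, z):
    """Approximate the integral of f over [0, 1)^d with a rank-1 lattice rule.

    n: number of lattice points
    z: integer generator vector of length d
    """
    total = 0.0
    for k in range(n):
        # Fractional part of k * z / n, computed exactly with integer arithmetic.
        point = tuple((k * zj % n) / n for zj in z)
        total += f(point)
    return total / n

def integrand(x):
    """Hypothetical smooth periodic test function; its exact integral is 1."""
    return math.prod(1.0 + math.cos(2.0 * math.pi * xj) for xj in x)

# Fibonacci lattice: n and the second component are consecutive Fibonacci numbers.
estimate = rank1_lattice_rule(integrand, 987, (1, 610))
```

For this integrand the rule is exact up to rounding error, because none of its nonzero Fourier frequencies lies on the dual lattice. The evaluations for different k are independent of one another, which is what makes the method a natural fit for GPU threads.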
ISBN: 9781509060580 (print)
In this paper, we explore the application of bio-inspired approaches to the association rules mining (ARM) problem in order to accelerate the extraction of correlations between items in sizeable data instances. A new bio-inspired GPU-based model is proposed, which benefits from massive GPU threading by evaluating multiple rules in parallel on the GPU. To validate the proposed model, the most used bio-inspired approaches (GA, PSO, and BSO) have been executed on GPU to solve well-known large ARM instances. Real experiments have been carried out on a 64-bit quad-core Intel Xeon E5520 processor coupled to an Nvidia Tesla C2075 GPU device. The results show that the genetic algorithm outperforms PSO and BSO. Moreover, it outperforms the state-of-the-art GPU-based ARM approaches when dealing with the challenging Webdocs instance.
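Evaluating a candidate association rule usually comes down to its support and confidence over the transaction database. The abstract does not state which fitness function is used, so the sequential sketch below assumes these two standard measures. The function names and the toy transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical toy database of four transactions.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 with bread
```

Each rule is scored with one pass over the transactions and does not depend on any other rule, so many candidate rules can be scored concurrently.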
ISBN: 9781450341974 (print)
Multicore CPUs and cheap co-processors such as GPUs create opportunities for vastly accelerating database queries. However, given the differences in their threading models, expected granularities of parallelism, and memory subsystems, effectively utilising all cores with all co-processors for an intensive query is very difficult. This paper introduces a novel templating methodology to create portable, yet architecture-aware, algorithms. We apply this methodology to the very compute-intensive task of calculating the skycube, a materialisation of exponentially many skyline query results, which finds applications in data exploration and multi-criteria decision making. We define three parallel templates: two that leverage insights from previous skycube research and a third that exploits a novel point-based paradigm to expose more data parallelism. An experimental study shows that, relative to the state of the art, which does not parallelise well due to its memory and cache requirements, our algorithms provide an order of magnitude improvement on either architecture and improve proportionately as more GPUs are added.
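To show why the skycube is compute-intensive, here is a naive sequential sketch that computes one skyline per non-empty subset of dimensions. It is a brute-force baseline, not one of the parallel templates from the abstract. It assumes that smaller values are better in every dimension, and the sample points are hypothetical.

```python
from itertools import combinations

def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points, dims):
    """Points not dominated by any other point when restricted to the given dimensions."""
    projected = [tuple(p[d] for d in dims) for p in points]
    return [p for p, pp in zip(points, projected)
            if not any(dominates(qq, pp) for qq in projected)]

def skycube(points):
    """Map each non-empty subset of dimensions to its skyline (2**d - 1 subsets)."""
    d = len(points[0])
    return {dims: skyline(points, dims)
            for size in range(1, d + 1)
            for dims in combinations(range(d), size)}

# Hypothetical 2-dimensional data set.
points = [(1, 4), (2, 2), (3, 3), (4, 1)]
cube = skycube(points)
```

With d dimensions there are 2^d - 1 subspace skylines, and this naive version performs a quadratic number of dominance tests for each one, which is why the task rewards parallel hardware.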