ISBN (Print): 9783319093338; 9783319093321
With the increasing amount of digital image data, massive image processing and feature extraction have become time-consuming tasks. Hadoop, an open-source cloud platform with excellent capacity for mass data processing and storage, provides the MapReduce parallel computing model and the HDFS distributed file system. We first introduce the Hadoop programming framework and Tamura texture features, and then implement the image processing and texture feature extraction computations on the Hadoop platform. A comparison with the Matlab platform shows that Hadoop offers little advantage for image processing and feature extraction on low-resolution images, but for high-resolution images the processing time on Hadoop is greatly reduced and its advantage in data processing capability is obvious.
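For illustration only, here is a minimal sketch of how one per-image feature could be computed in a Hadoop Streaming mapper written in Python; the paper's own implementation targets the Java MapReduce API, and coarseness_proxy below is a simplified stand-in for the full Tamura feature set (availability of NumPy and Pillow on the worker nodes is assumed).

```python
# Hypothetical Hadoop Streaming mapper: one input line = path to a grayscale image.
# coarseness_proxy is a simplified stand-in for Tamura coarseness, not the full feature.
import sys
import numpy as np
from PIL import Image  # assumption: Pillow is installed on the cluster nodes

def coarseness_proxy(gray, scales=(1, 2, 4, 8)):
    # For each scale k, average the absolute difference between horizontally
    # neighbouring k x k block means; report the scale with the largest response.
    best_scale, best_resp = scales[0], -1.0
    for k in scales:
        h, w = gray.shape
        hk, wk = (h // k) * k, (w // k) * k
        blocks = gray[:hk, :wk].reshape(hk // k, k, wk // k, k).mean(axis=(1, 3))
        resp = np.abs(np.diff(blocks, axis=1)).mean()
        if resp > best_resp:
            best_scale, best_resp = k, resp
    return float(best_scale)

if __name__ == "__main__":
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        gray = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
        # key<TAB>value, the usual Hadoop Streaming contract
        print(f"{path}\t{coarseness_proxy(gray)}")
```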
ISBN (Print): 9781450329712
Physical and thermal restrictions hinder commensurate performance gains from the ever increasing transistor density. While multi-core scaling helped alleviate dimmed or dark silicon for some time, future processors will need to become more heterogeneous. To this end, single instruction set architecture (ISA) heterogeneous processors are a particularly interesting solution that combines multiple cores with the same ISA but asymmetric performance and power characteristics. These processors, however, are no free lunch for database systems. Mapping jobs to the core that fits best is notoriously hard for the operating system or a compiler. To achieve optimal performance and energy efficiency, heterogeneity needs to be exposed to the database system. In this paper, we provide a thorough study of parallelized core database operators and TPC-H query processing on a heterogeneous single-ISA multi-core architecture. Using these insights, we design a heterogeneity-conscious job-to-core mapping approach for our high-performance main memory database system HyPer and show that it is indeed possible to get better mileage while driving faster compared to static and operating-system-controlled mappings. Our approach improves the energy delay product of a TPC-H power run by 31% and by up to over 60% for specific TPC-H queries.
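The energy delay product metric reported above can be made concrete with a small, purely hypothetical example: the core profiles and the selection rule below are invented for illustration and are not HyPer's actual dispatcher.

```python
# Toy illustration: pick the core type that minimizes energy * delay for a job.
# The per-core estimates below are hypothetical, not measurements from the paper.
from dataclasses import dataclass

@dataclass
class CoreProfile:
    name: str
    speedup: float   # relative throughput vs. the slowest core
    power_w: float   # average active power in watts

def energy_delay_product(work_s_on_slowest: float, core: CoreProfile) -> float:
    delay = work_s_on_slowest / core.speedup      # seconds
    energy = core.power_w * delay                 # joules
    return energy * delay                         # joule-seconds

def pick_core(work_s: float, cores: list[CoreProfile]) -> CoreProfile:
    return min(cores, key=lambda c: energy_delay_product(work_s, c))

if __name__ == "__main__":
    cores = [CoreProfile("big", speedup=2.0, power_w=2.5),
             CoreProfile("LITTLE", speedup=1.0, power_w=0.6)]
    best = pick_core(5.0, cores)
    print(best.name, round(energy_delay_product(5.0, best), 3))
```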
Givens Rotation is a key computation-intensive block in embedded wireless applications. In order to achieve an efficient mapping that scales smoothly to the underlying architecture, we propose two new column-based Givens Rotation algorithms, derived from the traditional Fast Givens and Square-root and Division Free Givens algorithms. These algorithms allow annihilation of multiple elements in a column of the input matrix simultaneously, without a dependency bottleneck, allowing increased parallelism, resource sharing and scalability. The ease of mapping and scalability has been tested on a layered coarse-grained reconfigurable architecture, reaching close to optimal results for highly parallel architectures.
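For reference, a minimal NumPy sketch of the classical Givens rotation annihilating one column's sub-diagonal entries is given below; the paper's column-based, square-root-and-division-free variants restructure this computation for parallel annihilation but are not reproduced here.

```python
# Reference sketch: classical Givens rotations zeroing the entries below the
# diagonal of one column, one element at a time (the sequential baseline).
import numpy as np

def givens(a, b):
    """Return c, s such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def annihilate_column(A, col):
    A = A.astype(float).copy()
    for row in range(A.shape[0] - 1, col, -1):
        c, s = givens(A[row - 1, col], A[row, col])
        G = np.array([[c, s], [-s, c]])
        A[row - 1:row + 1, :] = G @ A[row - 1:row + 1, :]
    return A

A = np.array([[4.0, 1.0], [3.0, 2.0], [1.0, 5.0]])
print(np.round(annihilate_column(A, 0), 6))
```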
ISBN (Print): 9783642552243
EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The multidimensional positive definite advection transport algorithm (MPDATA) is among the most time-consuming components of EULAG. The main aim of our work is to design an efficient adaptation of the MPDATA algorithm to the NVIDIA Kepler GPU architecture. We focus on analysis of resource usage on the GPU platform and its influence on performance results. In this paper, a performance model is proposed which ensures a comprehensive analysis of resource consumption, including registers as well as shared, global and texture memories. The performance model allows us to identify bottlenecks of the algorithm and shows directions for optimization. The most common group of bottlenecks is considered in this work. They include data transfers between host memory and GPU global memory, and between GPU global memory and shared memory, as well as instruction latencies, instruction serialization, and GPU occupancy. We put the emphasis on providing a fixed memory access pattern, padding, reducing divergent branches and instruction latencies, and organizing computation in the MPDATA algorithm to provide efficient reuse of shared memory and the register file.
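A toy version of the kind of resource-driven occupancy reasoning described above is sketched below; the Kepler-like limits and the example kernel configuration are assumptions for illustration, not the paper's performance model.

```python
# Toy occupancy estimate in the spirit of the resource analysis described above.
# The Kepler-class limits below are illustrative assumptions, not the paper's model.
SM_REGISTERS   = 65536      # 32-bit registers per SM
SM_SHARED_B    = 48 * 1024  # bytes of shared memory per SM
SM_MAX_THREADS = 2048
SM_MAX_BLOCKS  = 16

def blocks_per_sm(threads_per_block, regs_per_thread, shared_per_block):
    by_regs    = SM_REGISTERS // (regs_per_thread * threads_per_block)
    by_shared  = SM_SHARED_B // shared_per_block if shared_per_block else SM_MAX_BLOCKS
    by_threads = SM_MAX_THREADS // threads_per_block
    return max(0, min(by_regs, by_shared, by_threads, SM_MAX_BLOCKS))

def occupancy(threads_per_block, regs_per_thread, shared_per_block):
    b = blocks_per_sm(threads_per_block, regs_per_thread, shared_per_block)
    return b * threads_per_block / SM_MAX_THREADS

# Which resource limits a register-heavy, shared-memory-heavy kernel configuration?
print(occupancy(threads_per_block=256, regs_per_thread=64, shared_per_block=12 * 1024))
```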
Sorting is one of the classic problems of data processing, and many practical applications require implementations of parallel sorting algorithms. Only a few algorithms have previously been implemented using MPI; in this paper, several additional parallel sorting algorithms are implemented using MPI. A unified performance analysis of all these algorithms is presented on two different architectures. On the basis of the experimental results obtained, some guidelines are suggested for the selection of a suitable algorithm.
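As a minimal sketch of MPI-based parallel sorting (local sort followed by a root-side merge, using mpi4py), the example below illustrates the general pattern; it is not necessarily one of the algorithms evaluated in the paper.

```python
# Minimal mpi4py sketch of one simple parallel sort: scatter, local sort, root merge.
# Run with e.g.:  mpiexec -n 4 python parallel_sort.py
from heapq import merge
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    data = np.random.randint(0, 10**6, size=1_000_000)
    chunks = np.array_split(data, size)
else:
    chunks = None

local = comm.scatter(chunks, root=0)     # distribute one chunk per rank
local_sorted = np.sort(local)            # each rank sorts its chunk independently
pieces = comm.gather(local_sorted, root=0)

if rank == 0:
    result = np.fromiter(merge(*pieces), dtype=data.dtype, count=data.size)
    assert np.array_equal(result, np.sort(data))
    print("sorted", result.size, "elements on", size, "ranks")
```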
ISBN (Print): 9781479967162
Factorization Machines [1, 2] are a new factorization model that combines the merits of SVMs with matrix factorization models. They can model all interactions between variables using factorized parameters, and can therefore mimic most other matrix factorization models through feature engineering. Due to this flexibility, Factorization Machines have already been widely used in many recommendation algorithm competitions and practical online recommender systems. However, because of the prevalence of large datasets, there is a need to improve the scalability of computation in the Factorization Machines model. In this paper, we propose a parallel algorithm for the Factorization Machines model. The experimental results show that the proposed algorithm achieves good speed-up and scalability on large datasets.
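The degree-2 Factorization Machines prediction from [1, 2] can be evaluated in linear time in the number of features; the sketch below shows that standard formulation in NumPy (parameter shapes are assumed), independent of the parallel algorithm proposed in the paper.

```python
# Degree-2 Factorization Machine prediction, using the O(k*n) identity
#   sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f ((sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2)
import numpy as np

def fm_predict(x, w0, w, V):
    """x: (n,) feature vector, w0: bias, w: (n,) linear weights, V: (n, k) factors."""
    linear = w0 + w @ x
    xv = x @ V                      # (k,)  sum_i v_{i,f} x_i
    x2v2 = (x ** 2) @ (V ** 2)      # (k,)  sum_i v_{i,f}^2 x_i^2
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 8, 3
x = rng.random(n)
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```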
LEA is a new lightweight and low-power encryption algorithm. This algorithm has certain useful features which are especially suitable for parallel hardware and software implementations, i.e., simple ARX operations, ...
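As background only, the snippet below sketches generic 32-bit ARX primitives of the kind referred to above; the mixing step, rotation amount and constants are hypothetical and do not reproduce the official LEA round function or key schedule.

```python
# Generic 32-bit ARX primitives (Addition modulo 2^32, Rotation, XOR).
# This is NOT the LEA specification; the mix() step below is a made-up example.
MASK = 0xFFFFFFFF

def add(a, b):           # modular addition
    return (a + b) & MASK

def rol(x, r):           # rotate left by r bits
    return ((x << r) | (x >> (32 - r))) & MASK

def mix(a, b, rk):       # hypothetical ARX mixing step with a round-key word
    return rol(add(a ^ rk, b), 9)

print(hex(mix(0x01234567, 0x89ABCDEF, 0x0F1E2D3C)))
```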
This work proposes a novel technique for accelerating sparse recovery algorithms on multi-core shared memory architectures. All prior works attempt to speed up algorithms by leveraging the speed-ups in matrix-vector products offered by the GPU. A major limitation of these studies is that in most signal processing applications, the operators are not available as explicit matrices but as implicit fast operators. In such a practical scenario, the prior techniques fail to speed up the sparse recovery algorithms. Our work is based on the principles of stochastic gradient descent. The main sequential bottleneck of sparse recovery methods is a gradient descent step. Instead of computing the full gradient, we compute multiple stochastic gradients on parallel cores; the full gradient is estimated by averaging these stochastic gradients. The other step of sparse recovery algorithms is a shrinkage operation, which is inherently parallel. Our proposed method has been compared with existing sequential algorithms. We find that our method is as accurate as the sequential version but is significantly faster: the larger the problem, the faster our method is.
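A serial emulation of the idea reads as follows: several stochastic gradients, each computed on a row subset standing in for one core, are averaged into a full-gradient estimate, and the shrinkage step is then applied as in ISTA. For brevity the operator is an explicit matrix here, although the paper specifically targets implicit fast operators; all sizes and constants are illustrative.

```python
# Sketch: averaged stochastic gradients of 0.5*||Ax - y||^2 followed by shrinkage.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def averaged_stochastic_gradient(A, y, x, n_cores, rng):
    m = A.shape[0]
    grads = []
    for _ in range(n_cores):                          # conceptually: one chunk per core
        idx = rng.choice(m, size=max(1, m // n_cores), replace=False)
        grads.append(A[idx].T @ (A[idx] @ x - y[idx]) * (m / idx.size))  # unbiased
    return np.mean(grads, axis=0)

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 400))
x_true = np.zeros(400); x_true[:10] = 1.0
y = A @ x_true

step = 1.0 / np.linalg.norm(A, 2) ** 2                # safe step for the full gradient
lam = 0.1
x = np.zeros(400)
for _ in range(300):
    g = averaged_stochastic_gradient(A, y, x, n_cores=4, rng=rng)
    x = soft_threshold(x - step * g, step * lam)      # shrinkage: inherently parallel
print("residual norm:", round(float(np.linalg.norm(A @ x - y)), 3))
```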
We present a simulated annealing based partitioning technique for mapping task graphs onto heterogeneous processing architectures. Task partitioning onto homogeneous architectures to minimize the makespan of a task graph is a known NP-hard problem. Heterogeneity greatly complicates the aforementioned partitioning problem, making heuristic solutions essential. A number of heuristic approaches have been proposed, some using simulated annealing. We propose a simulated annealing method with a novel NEXT STATE function that enables exploration of different regions of the global search space when the annealing temperature is high and makes the search more local as the temperature drops. The novelty of our approach is two-fold: (1) we go a step further than the existing scientific literature by considering heterogeneity at the levels of task parallelism, data parallelism and communication; (2) we present a novel algorithm that uses simulated annealing to find better partitions in the presence of heterogeneous architectures, data-parallel execution units, and significant data communication costs. We conduct a statistical analysis of the performance of the proposed method, which shows that our approach clearly outperforms the existing simulated annealing method.
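A generic simulated-annealing skeleton with a temperature-dependent NEXT STATE is sketched below: many tasks may be remapped per move at high temperature, only one at low temperature. The cost function and task weights are placeholders, not the paper's heterogeneous makespan model.

```python
# Generic SA skeleton: move size shrinks with temperature (global -> local search).
import math
import random

def next_state(mapping, n_cores, temperature, t_max):
    new = list(mapping)
    k = max(1, int(len(mapping) * 0.25 * temperature / t_max))  # tasks to remap
    for i in random.sample(range(len(mapping)), k):
        new[i] = random.randrange(n_cores)
    return new

def anneal(n_tasks, n_cores, cost, t_max=100.0, t_min=0.1, alpha=0.95):
    state = [random.randrange(n_cores) for _ in range(n_tasks)]
    best, t = state, t_max
    while t > t_min:
        cand = next_state(state, n_cores, t, t_max)
        delta = cost(cand) - cost(state)
        if delta < 0 or random.random() < math.exp(-delta / t):   # Metropolis rule
            state = cand
            if cost(state) < cost(best):
                best = state
        t *= alpha
    return best

# Placeholder cost: load imbalance across cores with made-up task weights.
weights = [random.uniform(1, 5) for _ in range(32)]
def load_imbalance(mapping, n_cores=4):
    loads = [0.0] * n_cores
    for w, core in zip(weights, mapping):
        loads[core] += w
    return max(loads) - min(loads)

print(round(load_imbalance(anneal(32, 4, load_imbalance)), 3))
```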
Blind Signal Separation is an algorithmic problem class that deals with the restoration of original signal data from a signal mixture. Implementations such as FastICA are optimized for parallelization on CPU or first-generation GPU hardware. With the advent of modern, compute-centered GPU hardware with powerful features such as dynamic parallelism support, these solutions no longer leverage the available hardware performance in the best possible way. We present an optimized implementation of the FastICA algorithm which is specifically tailored for next-generation GPU architectures such as Nvidia Kepler. Our proposal achieves a two-digit speedup factor in the prototype implementation compared to a multithreaded CPU implementation. Our custom matrix multiplication kernels, tailored specifically for the use case, contribute to the speedup by delivering better performance than the state-of-the-art CUBLAS library.
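For orientation, the standard one-unit FastICA fixed-point update (tanh nonlinearity, whitened data) is shown below in plain NumPy; the paper's contribution lies in GPU-specific kernels and dynamic parallelism, which this CPU sketch does not attempt to reproduce.

```python
# Standard one-unit FastICA fixed-point iteration on whitened data (tanh contrast).
import numpy as np

def fastica_one_unit(Z, n_iter=200, tol=1e-8, rng=None):
    """Z: (d, n) whitened observations; returns one unmixing vector w."""
    rng = rng or np.random.default_rng()
    d, n = Z.shape
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        wx = w @ Z                               # projections, shape (n,)
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w   # fixed-point update
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:      # converged up to sign
            return w_new
        w = w_new
    return w

# Toy mixture of two independent sources, whitened before running FastICA.
rng = np.random.default_rng(0)
S = np.vstack([np.sign(rng.normal(size=5000)), rng.uniform(-1, 1, size=5000)])
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S
Xc = X - X.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(Xc @ Xc.T / Xc.shape[1])
Z = np.diag(1.0 / np.sqrt(s)) @ U.T @ Xc         # whitening
print(fastica_one_unit(Z, rng=rng))
```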