Sparse matrix-vector multiplication on GPUs faces a serious problem when the vector is too large to be stored in the GPU's device memory. To solve this problem, we propose a novel software-hardware hybrid method for a heterogeneous system with GPUs and functional memory modules connected by PCI Express. The functional memory provides a very large memory capacity and supports scatter/gather operations. We perform a preliminary evaluation of the proposed method using a sparse matrix benchmark collection. We observe that the proposed method on a GPU, which converts indirect references to direct references without exhausting the GPU's cache memory, achieves a 4.1 times speedup compared with conventional methods. The proposed method inherently scales well with the number of GPUs because intercommunication among GPUs is completely eliminated. Therefore, we estimate that the performance of our proposed method can be expressed as the single-GPU execution performance, which may be limited by the burst-transfer bandwidth of PCI Express, multiplied by the number of GPUs.
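The difference between indirect and direct references mentioned in this abstract can be illustrated with a small sketch. It assumes a CSR (compressed sparse row) matrix layout, which the abstract does not specify, and the function names `spmv_csr`, `gather`, and `spmv_direct` are hypothetical. Here `gather` is only a software stand-in for the gather operation that the functional memory would provide.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Conventional CSR product y = A x; x[col_idx[k]] is an indirect reference."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def gather(x, col_idx):
    """Assumed stand-in for a hardware gather: reorder x to match the nonzeros."""
    return [x[c] for c in col_idx]


def spmv_direct(row_ptr, values, x_gathered):
    """Same product, but the vector is read sequentially (direct reference)."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x_gathered[k]
        y.append(acc)
    return y


# Example matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y_indirect = spmv_csr(row_ptr, col_idx, values, x)
y_direct = spmv_direct(row_ptr, values, gather(x, col_idx))
```

Both calls produce the same result vector. In the second form the multiply loop streams through `values` and `x_gathered` in order, so it never needs random access into a vector that may not fit in device memory.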
Recovery algorithms play a key role in compressive sampling (CS). Currently, a popular recovery algorithm for CS is orthogonal matching pursuit (OMP), which has the merits of low complexity and good recovery quality. Because OMP involves massive matrix/vector operations, it is well suited to parallel implementation on a graphics processing unit (GPU). In this paper, we first analyze the complexity of each module in OMP and point out that its bottlenecks lie in the projection module and the least-squares module. To speed up the projection module, Fujimoto's matrix-vector multiplication algorithm is adopted. To speed up the least-squares module, the matrix-inverse-update algorithm is adopted. Experimental results show that our implementation of OMP on a GTX480 GPU achieves a speedup of 40x or more over an Intel(R) Core(TM) i7 CPU. Since the projection module occupies more than 2/3 of the total run time, we are looking for a faster matrix-vector multiplication algorithm.
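The two OMP modules named in this abstract can be seen in a minimal CPU sketch. It is not the authors' GPU implementation: the least-squares step below simply re-solves the normal equations from scratch with Gaussian elimination instead of using the matrix-inverse-update algorithm, and the helper names `solve` and `omp` are hypothetical.

```python
def solve(M, b):
    """Solve the small dense system M z = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / aug[i][i]
    return z


def omp(A, y, sparsity):
    """Recover a sparse x with A x ~= y by greedily selecting `sparsity` columns."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    residual = list(y)
    support, coef = [], []
    for _ in range(sparsity):
        # Projection module: correlate every column with the current residual.
        corr = [abs(sum(col[i] * residual[i] for i in range(m))) for col in cols]
        for s in support:
            corr[s] = -1.0  # never pick the same column twice
        support.append(max(range(n), key=lambda idx: corr[idx]))
        # Least-squares module: fit y on the selected columns (normal equations).
        gram = [[sum(cols[a][i] * cols[b][i] for i in range(m)) for b in support]
                for a in support]
        rhs = [sum(cols[a][i] * y[i] for i in range(m)) for a in support]
        coef = solve(gram, rhs)
        residual = [y[i] - sum(coef[k] * cols[support[k]][i]
                               for k in range(len(support))) for i in range(m)]
    x = [0.0] * n
    for k, s in enumerate(support):
        x[s] = coef[k]
    return x


# Toy problem: y = 2 * column 0 + 3 * column 2.
A = [[1.0, 0.0, 0.0, 0.5],
     [0.0, 1.0, 0.0, 0.5],
     [0.0, 0.0, 1.0, 0.5]]
y = [2.0, 0.0, 3.0]
x_hat = omp(A, y, sparsity=2)
```

The correlation loop is one matrix-vector product per iteration, which is consistent with the abstract's observation that the projection module dominates the run time.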
Pedestrian detection is of interest in many computer vision applications such as intelligent transportation systems and human-robot interaction. Among the existing methods, the combination of shape features (i.e., Histogram of Oriented Gradients (HOG)) and texture features (i.e., Local Binary Pattern (LBP)) has shown promising results in detection accuracy, but it is limited by its computation cost. In this paper, we introduce a new pedestrian detection algorithm with fast computation of these features on a GPU. We propose a robust and rapid pedestrian detector that combines HOG with LBP as the feature set and uses corresponding Support Vector Machine (SVM) classifiers. We also use the integral image method and an efficient parallel implementation to reduce detection time. We achieve a speedup of more than 10× and a 7% increase in detection rate.
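The integral image method mentioned in this abstract can be sketched as follows. The abstract does not say how the authors lay out their integral images, so this is a generic single-channel version with hypothetical names `integral_image` and `box_sum`. A HOG or LBP detector would typically keep one such table per histogram bin.

```python
def integral_image(img):
    """Return a table ii with ii[r][c] = sum of img[0:r][0:c] (one row/column of padding)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] using four table lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)         # whole image
lower_right = box_sum(ii, 1, 1, 3, 3)   # the 2x2 block 5, 6, 8, 9
```

Once the table is built, the cost of summing any rectangular cell is constant regardless of its size, which is what makes dense sliding-window feature extraction cheap.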
Onboard processing of remotely sensed hyperspectral data is a highly desirable goal in many applications. For this purpose, compact reconfigurable hardware modules such as field programmable gate arrays (FPGAs) are widely used. In this paper, we develop a new implementation of an automatic target generation process (ATGP) for hyperspectral images. Our implementation is based on a design methodology that starts from a high-level description in Matlab (or, alternatively, C/C++) and obtains a register transfer level (RTL) description that can be ported to FPGAs. To validate our new implementation, we conduct a quantitative and comparative study using two different FPGA architectures: Xilinx Virtex-5 and Altera Stratix-III. Experimental results have been obtained in the context of a real application focused on the detection of mineral components over the Cuprite mining district (Nevada), using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS). Our experimental results indicate that the proposed implementation can achieve peak frequency designs above 200 MHz in the considered FPGAs, in addition to satisfactory results in terms of target detection accuracy and parallel performance. This represents a step forward towards the design of real-time onboard implementations of hyperspectral image analysis algorithms.
In this paper, we present a novel method for the acquisition and compression of hyperspectral images based on two concepts: distributed source coding and compressive sensing. Compressive sensing (CS) is a signal acquisition method that samples at sub-Nyquist rates, which is possible for signals that are sparse in some transform domain. Distributed source coding (DSC) is a method to encode correlated sources separately and decode them together in an attempt to shift complexity from the encoder to the decoder. Distributed compressive sensing (DCS) is a new framework suggested for jointly sparse signals, which we apply to the correlated bands of hyperspectral images. We compressively sense each band of the hyperspectral image individually and can then recover the bands either separately or with a joint recovery method. We use Orthogonal Matching Pursuit (OMP) for individual recovery and Simultaneous Orthogonal Matching Pursuit (SOMP) for joint decoding, and compare the two methods. The latter is shown to perform consistently better, showing that the distributed compressive sensing method, which exploits the joint sparsity of the hyperspectral image, is much better than individual recovery.
This paper proposes an image steganalysis method based on supervised learning that uses Sparse Code Shrinkage as a feature of image data. Sparse coding represents a source signal as a linear sum of basis images, and has the property that the coefficients of the basis images follow a non-Gaussian distribution. Sparse Code Shrinkage, which can be regarded as a filter, can effectively separate Gaussian noise from sparse code coefficients. We assume that the degradation of image data by information hiding occurs as Gaussian noise. Therefore, the noise estimated by Sparse Code Shrinkage should be informative for image steganalysis. In the experiments, we show that our method outperforms previous steganalysis methods for F5, StegHide, and spread spectrum image steganography.
Current online social networks are massive and still growing. For example, Facebook has over 500 million active users sharing over 30 billion items per month. The scale of these data streams has outstripped traditional graph analysis methods. Real-time monitoring for anomalies may require dynamic analysis rather than repeated static analysis. The massive state behind multiple persistent queries requires shared data structures and flexible representations. We present a framework based on the STINGER data structure that can monitor a global property, connected components, on a graph of 16 million vertices at rates of up to 240,000 updates per second on 32 processors of a Cray XMT. For very large scale-free graphs, our implementation uses novel batching techniques that exploit the scale-free nature of the data and run over three times faster than prior methods. Our framework handles, for the first time, real-world data rates, opening the door to higher-level analytics such as community and anomaly detection.
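The idea of maintaining connected components dynamically, rather than recomputing them after every change, can be sketched with a union-find structure that consumes batches of edge insertions. This is an assumed, much simpler substitute for the STINGER-based framework: it is sequential, handles insertions only, and the class name `Components` is hypothetical.

```python
class Components:
    """Track the number of connected components under batched edge insertions."""

    def __init__(self, num_vertices):
        self.parent = list(range(num_vertices))
        self.count = num_vertices

    def find(self, v):
        # Path halving keeps the trees shallow without recursion.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def apply_batch(self, edges):
        """Insert a batch of (u, v) edges and return the updated component count."""
        for u, v in edges:
            ru, rv = self.find(u), self.find(v)
            if ru != rv:  # the edge joins two components
                self.parent[ru] = rv
                self.count -= 1
        return self.count


cc = Components(6)
after_first = cc.apply_batch([(0, 1), (2, 3)])
after_second = cc.apply_batch([(1, 2), (0, 3)])  # (0, 3) is already connected
```

In the second batch the edge `(0, 3)` falls inside an existing component and changes nothing, which hints at why batching pays off on scale-free graphs: many updates land in one giant component and can be dismissed cheaply.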
In this paper, we describe novel consensus-based distributed particle filtering algorithms that are applied to cooperative blind equalization of frequency-selective channels in a network with one transmitter and multiple receivers. The proposed algorithms employ parallel consensus averaging iterations to evaluate the product of some node-dependent quantities across the receiver network, thus eliminating the need for message broadcasts beyond each receiver's local neighborhood. Additionally, parallel minimum consensus iterations are used to assess the convergence of the quantized consensus averages and thereby ensure the coherence of particle sets across the different network nodes. We verify via computer simulations that the consensus-based schemes exhibit a small performance gap compared to both centralized and communication-intensive broadcast solutions.
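One common way to evaluate a network-wide product with averaging iterations is to average the logarithms and exponentiate. The sketch below assumes that approach, an unquantized linear consensus update, a ring of four receivers, and a node count known to every node. None of these details are stated in the abstract, and the name `consensus_average` is hypothetical.

```python
import math


def consensus_average(values, neighbors, step, iterations):
    """Each node repeatedly moves toward its neighbors' values (local messages only)."""
    x = list(values)
    for _ in range(iterations):
        x = [x[i] + step * sum(x[j] - x[i] for j in neighbors[i])
             for i in range(len(x))]
    return x


# Ring of four receivers; each node talks only to its two neighbors.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local_quantities = [0.5, 0.2, 0.8, 0.4]  # illustrative node-dependent factors

# The product over nodes equals exp(N * average of the logs).
logs = [math.log(q) for q in local_quantities]
averages = consensus_average(logs, neighbors, step=0.25, iterations=200)
products = [math.exp(len(local_quantities) * a) for a in averages]
```

After the iterations every node holds approximately the same value, the product of all four local quantities, without any node having broadcast beyond its neighbors. The step size must stay below the reciprocal of the maximum node degree for this simple update to converge.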
Stereo matching is an active area of research in image processing. In a recent work, a convex programming approach was developed to generate a dense disparity field. In this paper, we address the same estimation problem and propose to solve it in a more general convex optimization framework based on proximal methods. More precisely, unlike previous works where the criterion must satisfy some restrictive conditions for the minimization problem to be solved numerically, this work offers great flexibility in the choice of the criterion. The method is validated in a stereo image coding framework, and the results demonstrate the good performance of the proposed parallel proximal algorithm.
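Proximal methods build a solver out of proximity operators, one per term of the criterion, which is where the flexibility described in this abstract comes from. The abstract does not state which terms the authors use, so the two operators below (an l1 penalty and a box constraint) are standard textbook examples chosen for illustration, with hypothetical names `prox_l1` and `prox_box`.

```python
def prox_l1(v, t):
    """Proximity operator of t * ||.||_1: component-wise soft thresholding."""
    out = []
    for vi in v:
        if vi > t:
            out.append(vi - t)
        elif vi < -t:
            out.append(vi + t)
        else:
            out.append(0.0)
    return out


def prox_box(v, lower, upper):
    """Proximity operator of a box constraint's indicator: projection onto the box."""
    return [min(max(vi, lower), upper) for vi in v]


field = [3.0, -0.5, 1.0, -4.0]  # illustrative values, not real disparities
shrunk = prox_l1(field, 1.0)
clipped = prox_box(field, -2.0, 2.0)
```

Both operators have closed forms and act independently on each component, so terms like these can be evaluated in parallel. Swapping a term of the criterion only requires swapping its proximity operator.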
The complexity and performance requirements of embedded software are continuously increasing, making Multiprocessor System-on-Chip (MPSoC) architectures more and more important in the domain of embedded and cyber-physical systems. Using multiple cores in a single system mitigates problems with energy consumption and heat dissipation while increasing performance. Nevertheless, these benefits do not come for free. Porting existing, mostly sequential, applications to MPSoCs requires extracting efficient parallelism to utilize all available cores. Many embedded applications, such as network services and multimedia tasks for voice, image, and video processing, operate on data streams and thus have a streaming-based structure. Despite the abundance of parallelism in streaming applications, it is a non-trivial task to split sequential applications and map them efficiently to MPSoCs. Therefore, we present an algorithm that automatically extracts pipeline parallelism from sequential ANSI-C applications. The presented tool employs an integer linear programming (ILP) based approach enriched with an adequate cost model to automatically control the granularity of the parallelization. By applying our tool to real-life applications, we show that our approach is able to speed up applications by a factor of up to 3.9x on a four-core MPSoC architecture, compared to sequential execution.
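The granularity question in this abstract, namely how to cut a sequential program into pipeline stages, can be illustrated on a simplified model. The sketch below assumes the program is a linear chain of tasks with known costs, ignores communication overhead, and replaces the authors' ILP formulation with a small dynamic program. The name `best_pipeline` and the task costs are hypothetical.

```python
from functools import lru_cache


def best_pipeline(costs, cores):
    """Smallest achievable bottleneck when splitting a task chain into
    at most `cores` contiguous pipeline stages."""
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    n = len(costs)

    @lru_cache(maxsize=None)
    def solve(start, stages_left):
        if start == n:
            return 0
        remaining = prefix[n] - prefix[start]
        if stages_left == 1:
            return remaining  # everything left goes into the last stage
        best = remaining
        for end in range(start + 1, n + 1):
            stage = prefix[end] - prefix[start]
            best = min(best, max(stage, solve(end, stages_left - 1)))
        return best

    return solve(0, cores)


task_costs = [4, 2, 3, 3, 4]  # illustrative per-task execution times
bottleneck = best_pipeline(task_costs, cores=4)
speedup = sum(task_costs) / bottleneck
```

In steady state a pipeline's throughput is set by its slowest stage, so the estimated speedup is the total cost divided by the bottleneck. For this chain the best four-stage split has a bottleneck of 5, giving a speedup of 3.2 rather than the ideal 4, which shows why stage balancing needs a cost model.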