Real-time stereo matching, which is important in many applications such as self-driving cars and 3-D scene reconstruction, requires large computation capability and high memory bandwidth. The most time-consuming part of stereo-matching algorithms is the aggregation of information (i.e., costs) over local image regions. In this paper, we present a generic representation and suitable implementations for three commonly used cost aggregators on many-core processors. We perform typical optimizations on the kernels, which lead to significant performance improvements (up to two orders of magnitude). Finally, we present a performance model for the three aggregators to predict the aggregation speed for a given pair of input images on a given architecture. Experimental results validate our model with an acceptable error margin (10.4% on average). We conclude that GPU-like many-cores are excellent platforms for accelerating stereo matching.
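As a rough illustration of what aggregating costs over a local image region means, the sketch below sums a per-pixel matching cost over a square window for a single disparity. The box window and the names `aggregate_box`, `cost`, and `radius` are assumptions made for this example; the abstract does not name the three aggregators the paper covers, and this is plain sequential Python rather than a many-core kernel.

```python
def aggregate_box(cost, radius):
    """Sum cost[y][x] over a (2*radius+1) x (2*radius+1) window.

    `cost` is a list of rows holding the matching cost of every pixel for
    one disparity hypothesis. Windows are clipped at the image borders.
    """
    height, width = len(cost), len(cost[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            # Visit every pixel of the window that lies inside the image.
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += cost[yy][xx]
            out[y][x] = total
    return out
```

Every output pixel is computed independently of the others, which is the property that makes this kind of aggregation a natural fit for data-parallel hardware.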
This paper focuses on reducing the execution time of video compression algorithms based on the 3D wavelet transform. We present several optimizations that could not be applied by the compiler due to the characteristics of the algorithm. First, we use the Streaming SIMD Extensions (SSE) for some of the dimensions of the sequence (y and time) in order to reduce the number of floating-point instructions by exploiting data-level parallelism. Then, we apply loop unrolling and data prefetching to critical parts of the code, and finally the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show improvements of up to 1.54 over a version compiled with the maximum optimizations of the Intel C/C++ compiler. Our experiments also show that allowing the compiler to perform some of these optimizations (i.e., automatic code vectorization) causes a performance slowdown, which demonstrates the effectiveness of our optimizations.
Making sense of big data and big metadata remains a challenge as more and more data are churned out every day. The problem of adding value to unstructured data requires the application of computationally intensive algorithms to discover useful patterns in the data. For data streams from public transport such as buses, we address the problem of running time-consuming algorithms to model the data while still being able to process abnormal events in real time. We propose using Hidden Markov Models (HMMs) for identifying conditions for an abnormal event in bus journeys, and methods for isolating HMM computations from real-time event processing. Results show that training HMMs with even noisy metadata can generate models that can recognize an abnormal event in a parallel and distributed manner in the cloud.
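One common way to use a trained HMM for spotting unusual sequences is to score each observation sequence with the forward algorithm and treat a low likelihood as a sign of abnormality. The sketch below shows that scoring step only. The function name, the model parameters, and the idea of thresholding the score are assumptions for illustration; the abstract does not describe the authors' actual detection rule.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) using the scaled forward algorithm.

    obs   : list of observation symbol indices
    start : start[s]    = P(first state is s)
    trans : trans[p][s] = P(next state is s | current state is p)
    emit  : emit[s][o]  = P(symbol o | state s)
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the state distribution, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik
```

Because scoring a sequence only reads the model parameters, it can run separately from (and in parallel with) the slower training step, which is in the spirit of isolating HMM computations from real-time event processing.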
In this paper we present an efficient method for 3-D parallel digital filtering using a new parallel filtering algorithm based on the 3-D vector radix fast Hartley transform (3-D VR FHT). This method is suitable for high-resolution, high-speed image and video processing. The 3-D parallel algorithm is highly parallel and efficient, as it overcomes the overhead and performance limitations of the block filtering method by eliminating the overlapping segments and boundary conditions in parallel filtering applications. It also lifts the restrictions on the input size for high performance in the block-filtering algorithm, as both the 3-D input data and the impulse response of the system are segmented into smaller subsections. These subsections are independent and can be processed simultaneously. The algorithm's structure and mathematical derivation are given, and the performance of the algorithm is tested and presented using a parallel processing system with four DSP processors.
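To make the idea of filtering in the Hartley domain concrete, the sketch below performs a 1-D circular convolution through the discrete Hartley transform (DHT). It is a deliberately simplified stand-in: it uses a direct O(N^2) DHT rather than a vector radix fast algorithm, it is 1-D rather than 3-D, and it does no segmentation. The function names are chosen for this example only.

```python
import math


def dht(x):
    """Direct discrete Hartley transform: H[k] = sum_j x[j] * cas(2*pi*j*k/n)."""
    n = len(x)
    return [
        sum(
            x[j] * (math.cos(2 * math.pi * j * k / n) + math.sin(2 * math.pi * j * k / n))
            for j in range(n)
        )
        for k in range(n)
    ]


def circular_filter_dht(x, h):
    """Circularly convolve x with h (same length) in the Hartley domain."""
    n = len(x)
    X, H = dht(x), dht(h)
    Z = []
    for k in range(n):
        neg = -k % n  # index of the mirrored frequency
        xe, xo = (X[k] + X[neg]) / 2, (X[k] - X[neg]) / 2  # even / odd parts
        he, ho = (H[k] + H[neg]) / 2, (H[k] - H[neg]) / 2
        # Hartley-domain form of the convolution theorem.
        Z.append(xe * he - xo * ho + xe * ho + xo * he)
    # The DHT is its own inverse up to a factor of 1/n.
    return [v / n for v in dht(Z)]
```

The transform works entirely on real numbers and serves as its own inverse apart from the scaling, so the same routine handles both the forward and the backward step.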
ISBN: 9781479983407 (print)
Photo-Acoustic Tomography (PAT) combines ultrasound resolution and penetration with endogenous optical contrast of tissue. Real-time PAT imaging is limited by the number of parallel data acquisition channels and pulse repetition rate of the laser. Typical photoacoustic signals afford sparse representation. Additionally, PAT transducer configurations exhibit significant intra- and intersignal correlation. In this work, we formulate photoacoustic signal recovery in the distributed Compressed Sensing (DCS) framework to exploit this correlation. Reconstruction using the proposed method achieves better image quality than compressed sensing with significantly fewer samples. Through our results, we demonstrate that DCS has the potential to achieve real-time PAT imaging.
This paper proposes to use a frequency-based cache admission policy in order to boost the effectiveness of caches subject to skewed access distributions. Rather than deciding on which object to evict, TinyLFU decides, based on the recent access history, whether it is worth admitting an accessed object into the cache at the expense of the eviction candidate. Realizing this concept is enabled through a novel approximate LFU structure called TinyLFU, which maintains an approximate representation of the access frequency of recently accessed objects. TinyLFU is extremely compact and lightweight as it builds upon Bloom filter theory. The paper presents an analysis of the properties of TinyLFU, including simulations of synthetic workloads as well as YouTube and Wikipedia traces.
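The admission decision described above can be sketched with a small approximate frequency counter. The example below uses a count-min style sketch and admits a newly accessed object only if its estimated frequency exceeds that of the eviction candidate. The class and function names, the sketch dimensions, and the periodic halving of counters (used here to keep the history recent) are assumptions for illustration and are not taken from the abstract.

```python
import hashlib


class FrequencySketch:
    """Approximate access counter for string keys (count-min style)."""

    def __init__(self, width=1024, depth=4, sample_size=10_000):
        self.width = width
        self.depth = depth
        self.sample_size = sample_size  # assumed aging period
        self.rows = [[0] * width for _ in range(depth)]
        self.seen = 0

    def _index(self, key, row):
        # One independent hash per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def record(self, key):
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += 1
        self.seen += 1
        if self.seen >= self.sample_size:
            # Age the history: halve every counter so old accesses fade out.
            self.rows = [[c // 2 for c in row] for row in self.rows]
            self.seen //= 2

    def estimate(self, key):
        # Collisions can only inflate a counter, so take the minimum.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))


def admit(sketch, candidate, victim):
    """Admit `candidate` only if it looks more popular than the eviction victim."""
    return sketch.estimate(candidate) > sketch.estimate(victim)
```

A cache would call `record` on every access and consult `admit` only on a miss when it is full, leaving the choice of victim to whatever eviction policy is already in place.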
Parallel MR imaging methods such as SMASH and SENSE use multiple receiver coils to accelerate the imaging process by reducing the Fourier space sampling requirement. In this paper we show how one can optimally select the sampling locations based upon (1) knowledge of the statistics of the object being imaged and (2) a statistical criterion for optimality determined by the application. In particular, we show that the optimal uniform sample spacing is not necessarily an integer multiple of the Nyquist interval and depends upon the specific coil sensitivities and configuration.
With increasing bandwidth available to the client and the number of users growing at an exponential rate, the Web server can become a performance bottleneck. This paper considers the parallelization of requests to Web pages, each of which is composed of a number of embedded objects. The performance of systems in which the embedded objects are distributed across multiple backend servers is analyzed. Parallelization of Web requests gives rise to a significant improvement in performance. Replication of servers is observed to be beneficial, especially when the embedded objects in a Web page are not evenly distributed across servers. Load balancing policies used by the dispatcher of Web page requests are investigated. A simple round-robin policy for backend server selection gives better performance than the default random policy used by the Apache server.
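The difference between the two dispatch policies can be illustrated by counting how many requests each backend receives. The sketch below is a toy model with hypothetical names (`dispatch_counts`, `policy`); it only shows how evenly requests are spread and does not model service times, object sizes, or the Apache dispatcher itself.

```python
import random


def dispatch_counts(num_requests, num_servers, policy, rng=None):
    """Return the number of requests assigned to each backend server."""
    if rng is None:
        rng = random.Random()
    counts = [0] * num_servers
    for i in range(num_requests):
        if policy == "round_robin":
            server = i % num_servers  # cycle through the servers in order
        elif policy == "random":
            server = rng.randrange(num_servers)  # pick any server uniformly
        else:
            raise ValueError(f"unknown policy: {policy}")
        counts[server] += 1
    return counts
```

Round robin keeps the per-server request counts within one of each other by construction, whereas random selection is balanced only on average, which is one plausible reason for the performance gap reported above.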
With the rapid growth of the video camera market, reviewing videos has become one of the most effective methods in sports training, because athletes and coaches can analyze performance objectively. By following their coaches' advice, athletes can correct bad habits immediately. Although videos taken by multiple cameras from multiple angles contain much useful information, it is very hard to extract that information within the limited training time. Using an actual example of a training field where multiple cameras are used, this paper focuses on the development of new video synthesizing software that generates a single video from multiple videos by applying a sequence of commands to control the synthesis. The paper also describes an optimization technique that speeds up the video synthesis by applying parallel processing on a multicore processor, invoking multiple threads that perform the synthesis concurrently. A performance evaluation shows the validity of the optimization technique.
ISBN: 9780819473042 (print)
The increasing number of airborne and satellite platforms that incorporate hyperspectral imaging spectrometers has created the need for efficient storage, transmission and data compression methodologies. In particular, hyperspectral data compression is expected to play a crucial role in many remote sensing applications. Many efforts have been devoted to designing and developing lossless and lossy algorithms for hyperspectral imagery. However, most available lossy compression approaches have largely overlooked the impact of mixed pixels and subpixel targets, which can be accurately modeled and uncovered by resorting to the wealth of spectral information provided by hyperspectral image data. In this paper, we develop a simple lossy compression technique which relies on the concept of spectral unmixing, one of the most popular approaches to deal with mixed pixels and subpixel targets in hyperspectral analysis. The proposed method uses a two-stage approach in which the purest spectral signatures (also called endmembers) are first extracted from the input data, and then used to express mixed pixels as linear combinations of endmembers. Analytical and experimental results are presented in the context of a real application, using hyperspectral data collected by NASA's Jet Propulsion Laboratory over the World Trade Center area in New York City, right after the terrorist attacks of September 11th. These data are used in this work to evaluate the impact of compression using different methods on spectral signature quality for accurate detection of hot spot fires. Two parallel implementations are developed for the proposed lossy compression algorithm: a multiprocessor implementation tested on Thunderhead, a massively parallel Beowulf cluster at NASA's Goddard Space Flight Center, and a hardware implementation developed on a Xilinx Virtex-II FPGA device. Combined, these parts offer a thoughtful perspective on the potential and emerging challenges of incorporating parallel d