ISBN:
(Print) 9783319780542; 9783319780535
The row-wise and column-wise prefix-sum computation of a matrix has many applications in the area of image processing, such as computation of the summed area table and the Euclidean distance map. It is known that the prefix-sums of a 1-dimensional array can be computed efficiently on the GPU. Hence, the row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing the column-wise prefix-sums, because it performs inefficient strided access to the global memory. The main contribution of this paper is to present an almost optimal column-wise prefix-sum algorithm on the GPU. Since all elements in an input matrix must be read and the resulting prefix-sums must be written, computation of the column-wise prefix-sums cannot be faster than simple matrix duplication in the global memory of the GPU. Quite surprisingly, experimental results using an NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2-6% slower than matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal.
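To make the access-pattern issue concrete, here is a minimal NumPy sketch (not the paper's GPU algorithm) contrasting row-wise and column-wise scans, together with the common transpose workaround; the matrix contents are arbitrary:

```python
import numpy as np

# Toy matrix; on a GPU the data lives in global memory, where the
# access order determines whether reads coalesce.
a = np.arange(12, dtype=np.float64).reshape(3, 4)

row_psum = np.cumsum(a, axis=1)  # row-wise: consecutive elements, friendly access
col_psum = np.cumsum(a, axis=0)  # column-wise: strided access on a GPU

# A common GPU workaround: transpose, scan rows, transpose back.
# It restores coalesced access at the cost of two extra passes.
col_psum_t = np.cumsum(a.T, axis=1).T
assert np.allclose(col_psum, col_psum_t)
```

The transpose trick restores contiguous access for the scan itself, but it pays for two extra passes over the matrix, which is exactly the kind of overhead an almost optimal algorithm must avoid.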
ISBN:
(Digital) 9781510616486
ISBN:
(Print) 9781510616486
Deep convolutional neural networks have found success in semantic image segmentation tasks in computer vision and medical imaging. These algorithms are executed on conventional von Neumann processor architectures or GPUs. This is suboptimal. Neuromorphic processors that replicate the structure of the brain are better suited to train and execute deep learning models for image segmentation by relying on massively parallel processing. However, given that they closely emulate the human brain, on-chip hardware and digital memory limitations also constrain them. Adapting deep learning models to execute image segmentation tasks on such chips requires specialized training and validation. In this work, we demonstrate, for the first time, spinal image segmentation performed using a deep learning network implemented on neuromorphic hardware of the IBM TrueNorth Neurosynaptic System, and we validate the performance of our network by comparing it to human-generated segmentations of spinal vertebrae and disks. To achieve this on neuromorphic hardware, the training model constrains the coefficients of individual neurons to {-1, 0, 1} using the Energy Efficient Deep Neuromorphic (EEDN) network training algorithm. Given the roughly 1 million neurons and 256 million synapses, the scale and size of the neural network implemented by the IBM TrueNorth allow us to execute the requisite mapping between segmented images and non-uniform intensity MR images >20 times faster than on a GPU-accelerated network while using <0.1 W. This speed and efficiency imply that a trained neuromorphic chip can be deployed in intra-operative environments where real-time medical image segmentation is necessary.
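As a rough illustration of the weight constraint described above (not the EEDN training algorithm itself), the following sketch maps real-valued weights to {-1, 0, 1} by simple thresholding; the threshold value is an arbitrary assumption:

```python
import numpy as np

def ternarize(weights: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Map real-valued weights to {-1, 0, 1}.

    A simplified stand-in for the constraint EEDN enforces during
    training; the threshold is an illustrative choice, not a value
    from the paper.
    """
    q = np.zeros_like(weights, dtype=np.int8)
    q[weights > threshold] = 1
    q[weights < -threshold] = -1
    return q

w = np.random.randn(256, 256) * 0.1
print(np.unique(ternarize(w)))  # -> [-1  0  1]
```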
ISBN:
(Digital) 9781351778572
ISBN:
(Print) 9781138712263
This book describes methods and algorithms for image pre-processing and recognition. These methods are based on a parallel shift technology of the imaging copy, as well as simple mathematical operations, to allow the generation of a minimum set of features for describing and recognizing an image. The book also describes the theoretical foundations of parallel shift technology and pattern recognition. Based on these methods and theories, it is intended to help researchers design artificial intelligence systems and robotics, and develop software and hardware applications.
ISBN:
(Print) 9781450356305
Advances in vision processing have ignited a proliferation of mobile vision applications, including augmented reality. However, limited by the inability to rapidly reconfigure sensor operation for performance-efficiency tradeoffs, high power consumption causes vision applications to drain the device's battery. To explore the potential impact of enabling rapid reconfiguration, we use a case study around marker-based pose estimation to understand the relationship between image frame resolution, task accuracy, and energy efficiency. Our case study shows that, to balance energy efficiency and task accuracy, the application needs to dynamically and frequently reconfigure sensor resolution. To explore the latency bottlenecks to sensor resolution reconfiguration, we define and profile the end-to-end reconfiguration latency and frame-to-frame latency of changing capture resolution on an LG Nexus 5X device. We identify three major sources of sensor resolution reconfiguration latency in current Android systems: (i) sequential configuration patterns, (ii) expensive system calls, and (iii) imaging pipeline delay. Based on our intuitions, we propose a redesign of the Android camera system to mitigate these sources of latency. Enabling smooth transitions between sensor configurations will unlock new classes of adaptive-resolution vision applications.
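The measurement below is a rough desktop analogue of the frame-to-frame latency profiling described above, using OpenCV rather than the Android camera stack; the camera index and the two resolutions are assumptions:

```python
import time
import cv2

# Desktop analogue, not the Android pipeline: time how long a
# capture-resolution change takes before the next frame arrives.
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
cap.read()  # warm up at the initial resolution

t0 = time.perf_counter()
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)   # reconfigure to a lower resolution
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
ok, frame = cap.read()                   # first frame at the new resolution
t1 = time.perf_counter()

if ok:
    print(f"reconfiguration-to-frame latency: {(t1 - t0) * 1e3:.1f} ms")
cap.release()
```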
In the last few decades, 3D/4D ultrasonography has been gaining popularity not only as a scientific research topic but also as a new modality of medical imaging in clinical applications. However, the design and implementation of a 3D/4D device for high-quality ultrasound imaging within portable, handheld systems is a technological challenge. The design of transmit/receive (Tx/Rx) electronics that operate efficiently with 2D array transducers comprising thousands of elements, the enormous amount of input/output data that must be transferred and processed, and power consumption limits are just a few of the difficulties that arise. No less important is the development of reliable and numerically efficient algorithms for 3D/4D imaging that take all these restrictions into account. The main objective of this paper is to present a new hybrid spectral domain imaging (HSDI) method that delivers an original and innovative solution to the technical limitations of modern 3D/4D ultrasonography. The developed image reconstruction method is based on plane-wave insonification (PWI) with sub-aperture data acquisition combined with frequency-domain (FD) data processing. The performance of the method was tested using Field II simulated acoustic data of a 3D cyst phantom. For a single 3D low-resolution image (LRI) of 64×64×512 pixels, the proposed HSDI method is about 100 times faster than its counterpart based on the PWI synthetic aperture time-domain (TD) method for a single Tx/Rx event. Moreover, the frame rate increase is proportional to the number of sub-apertures used to synthesize a single high-resolution image (HRI).
ISBN:
(Digital) 9781728144849
ISBN:
(Print) 9781728144856
Hyperspectral image registration is a relevant task for real-time applications like environmental disaster management or search and rescue scenarios. Traditional algorithms for this problem were not really devoted to real-time performance. The HYFMGPU algorithm arose as a high-performance GPU-based solution to fill this gap. Nevertheless, a single-GPU solution is not enough, as sensors are evolving and generating images with finer resolutions and wider wavelength ranges. An MPI+CUDA multi-GPU implementation of HYFMGPU was previously presented. However, that solution exposes the programming complexity of combining MPI with an accelerator programming model. In this paper we present a new and more abstract programming approach for this type of application, which provides high efficiency while simplifying the programming of the multi-device parts of the code. The solution uses Hitmap, a library that eases the programming of parallel applications based on distributed arrays. It takes a more algorithm-oriented approach than MPI, including abstractions for the automatic partition and mapping of arrays at runtime with arbitrary granularity, as well as techniques to build flexible communication patterns that transparently adapt to the data partitions. We show how these abstractions apply to this application class. We present a comparison of development effort metrics between the original MPI implementation and the one based on Hitmap, with reductions of up to 95% in the Halstead score for specific work redistribution steps. We finally present experimental results showing that these abstractions are internally implemented in a highly efficient way that can reduce the overall execution time by up to 37% compared with the original MPI implementation.
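For readers unfamiliar with distributed-array programming, the sketch below shows the kind of row-block partition and gather that a library like Hitmap automates, written in plain mpi4py rather than Hitmap's own API; the array shape and the per-block workload are illustrative assumptions:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Row-block partition of a 2D array across processes (hypothetical shape).
rows, cols = 1024, 2048
counts = [(rows // size + (r < rows % size)) * cols for r in range(size)]
displs = [sum(counts[:r]) for r in range(size)]

image = np.random.rand(rows, cols) if rank == 0 else None
local = np.empty((counts[rank] // cols, cols))

comm.Scatterv([image, counts, displs, MPI.DOUBLE], local, root=0)
local *= 2.0  # stand-in for the real per-block registration work
comm.Gatherv(local, [image, counts, displs, MPI.DOUBLE], root=0)

if rank == 0:
    print("done:", image.shape)
```

The counts/displacements bookkeeping visible here, and its adaptation whenever the partition changes, is exactly what such abstractions derive automatically at runtime.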
Motion estimation is a core task in computer vision, and many applications utilize optical flow methods as fundamental tools to analyze motion in images and videos. Optical flow is the apparent motion of objects in image sequences that results from relative motion between the objects and the imaging perspective. Today, optical flow fields are utilized to solve problems in various areas such as object detection and tracking, interpolation, visual odometry, etc. In this dissertation, three problems from different areas of computer vision and the solutions that make use of modified optical flow methods are explained. The contributions of this dissertation are approaches and frameworks that introduce i) a new optical flow-based interpolation method to achieve minimally divergent velocimetry data, ii) a framework that improves the accuracy of change detection algorithms in synthetic aperture radar (SAR) images, and iii) a set of new methods to integrate proton magnetic resonance spectroscopy (1H-MRSI) data into three-dimensional (3D) neuronavigation systems for tumor biopsies. In the first application, an optical flow-based approach for the interpolation of minimally divergent velocimetry data is proposed. The velocimetry data of incompressible fluids contain signals that describe the flow velocity. The approach uses the additional flow velocity information to guide the interpolation process towards reduced divergence in the interpolated data. In the second application, a framework that mainly consists of optical flow methods and other image processing and computer vision techniques is proposed to improve object extraction from synthetic aperture radar images. The proposed framework is used to distinguish between actual motion and motion detected due to misregistration in SAR image sets, and it can lead to more accurate and meaningful change detection and improve object extraction from SAR datasets. In the third application, a set of new methods is presented that aim to improve the integration of 1H-MRSI data into 3D neuronavigation systems for tumor biopsies.
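As background, a generic dense optical flow baseline (Farneback's method via OpenCV) is sketched below on a synthetic pair of frames; the dissertation's contributions are modified flow methods built on ideas like this, not this baseline:

```python
import cv2
import numpy as np

# Synthetic frames: a bright square shifted 3 px to the right.
prev = np.zeros((128, 128), dtype=np.uint8)
prev[40:80, 40:80] = 255
curr = np.roll(prev, 3, axis=1)

# Dense flow field: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Average horizontal motion inside the square should be close to +3 px.
print(flow[40:80, 40:80, 0].mean())
```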
The fast compression of images is a requisite in many applications like TV production, teleconferencing, or digital cinema. Many of the algorithms employed in current image compression standards are inherently sequential. High-performance implementations of such algorithms often require specialized hardware like field-programmable gate arrays (FPGAs). Graphics Processing Units (GPUs) do not commonly achieve high performance on these algorithms because the algorithms do not exhibit fine-grained parallelism. Our previous work introduced a new core algorithm for wavelet-based image coding systems, tailored for massively parallel architectures, called bitplane coding with parallel coefficient processing (BPC-PaCo). This paper introduces the first high-performance, GPU-based implementation of BPC-PaCo. A detailed analysis of the algorithm aids its implementation on the GPU. The main insights behind the proposed codec are an efficient thread-to-data mapping, smart memory management, and the use of efficient cooperation mechanisms to enable inter-thread communication. Experimental results indicate that the proposed implementation meets the requirements for real-time high-resolution (4K) digital cinema, yielding speedups of 30x with respect to the fastest implementations of current compression standards. A power consumption evaluation also shows that our implementation consumes 40x less energy than state-of-the-art methods at equivalent performance.
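To illustrate the bitplane view of the data that coders in this family sweep, the sketch below extracts bitplanes from a few made-up quantized coefficients, from most to least significant; it is not BPC-PaCo itself:

```python
import numpy as np

# Hypothetical quantized wavelet coefficients: split into sign and
# magnitude, then view the magnitudes one bitplane at a time.
coeffs = np.array([[5, -3], [12, 0]], dtype=np.int32)
signs = coeffs < 0
mags = np.abs(coeffs)

nplanes = int(mags.max()).bit_length()
for b in range(nplanes - 1, -1, -1):  # MSB first, as bitplane coders scan
    plane = (mags >> b) & 1           # one bit per coefficient
    print(f"bitplane {b}:\n{plane}")
```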
ISBN:
(Print) 9781538666142
The Convolutional Neural Network (CNN) is a state-of-the-art algorithm widely used in applications such as face recognition, intelligent monitoring, image recognition, and text recognition. Because of its high computational complexity, many efficient hardware accelerators have been proposed to exploit the high degree of parallelism in CNNs. However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption. Other accelerators, such as GPUs, are general enough, but they lead to higher power consumption. Fine-grained dataflow architectures, which depart from the conventional von Neumann architecture, show natural advantages in processing CNN-like algorithms with high computational efficiency and low power consumption, while remaining broadly applicable and adaptable. In this paper, we propose a scheme for implementing and optimizing CNNs on accelerators based on a fine-grained dataflow architecture. The experimental results reveal that, using our scheme, the performance of AlexNet running on the dataflow accelerator is 3.11x higher than that on an NVIDIA Tesla K80, and the power consumption of our hardware is 8.52x lower than that of the K80.
Recently, high A-line speed, parallel, wide-field imaging for optical coherence tomography angiography (OCTA) has become more prevalent, resulting in a dramatic increase in data quantity that poses a challenge for real-time imaging, even for GPU-based data processing. In this manuscript, we propose a new OCTA processing technique, Gabor optical coherence tomographic angiography (GOCTA), for label-free human retinal angiography imaging. In spectral domain optical coherence tomography (SDOCT), previous OCTA algorithms required k-space resampling and the Fourier transform (FFT) over the entire data set of interference fringes to calculate blood flow information, which is computationally intensive. As the anterior-posterior radius of the adult eye is nearly constant, only 3 A-scan lines need to be processed to obtain the gross orientation of the retina using a sphere model. Subsequently, en face microvascular images can be obtained with the GOCTA algorithm from the interference fringes directly, without the steps of k-space resampling, numerical dispersion compensation, FFT, and maximum (mean) projection, resulting in a significant improvement of the data processing speed, 4 to 20 times faster than existing methods. GOCTA is potentially suitable for SDOCT systems in en face preview applications requiring real-time microvascular imaging. (C) 2017 Optical Society of America
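The sketch below illustrates the core idea of depth-selective Gabor filtering of a spectral interference fringe (not the full GOCTA pipeline); the fringe frequency, kernel size, and bandwidth are made-up values:

```python
import numpy as np

# Synthetic spectral fringe: a single reflector produces one fringe
# frequency across the spectrometer pixels k.
n = 2048
k = np.arange(n)
fringe = np.cos(2 * np.pi * 0.05 * k)

# Complex Gabor kernel tuned to that fringe frequency (illustrative
# values): a Gaussian envelope times a complex exponential.
t = np.arange(-64, 65)
f0, sigma = 0.05, 16.0
gabor = np.exp(-0.5 * (t / sigma) ** 2) * np.exp(2j * np.pi * f0 * t)

# Convolving with the complex kernel band-passes the fringe; the
# magnitude responds strongly only when the fringe frequency matches
# f0, i.e. when a reflector sits at the targeted depth, so no full
# FFT over the data set is needed.
response = np.abs(np.convolve(fringe, gabor, mode="same"))
print(response.mean())
```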