ISBN: 9798331529543 (digital)
ISBN: 9798331529550 (print)
The emergence of Foundation Vision-Language Models (VLMs) has ignited a surge of research in the computer vision field due to their robust baseline performance. Inspired by this, we propose the Anchoring Vision-Language Network (AnViL-Net), which integrates a vision-language model for the challenging task of Weakly-Supervised Group Activity Recognition (WSGAR). Our network effectively incorporates VLMs into WSGAR, addressing the challenges posed by dynamic actor motions and domain-specific activity classes. AnViL-Net leverages highly generalized VLM vision features as anchors for extracting visual features. Additionally, semantically meaningful VLM language features serve as anchors for inferring the semantic relationships between actors and their activities. We demonstrate the effectiveness of AnViL-Net on multiple group activity datasets, achieving competitive state-of-the-art results.
ISBN: 9798331529543 (digital)
ISBN: 9798331529550 (print)
Lookup tables (LUTs) are commonly used to speed up image processing by handling complex mathematical functions like sine and exponential calculations. They are used in various applications such as camera image processing, high-dynamic-range imaging, and edge-preserving filtering. However, due to the increasing gap between computing and input/output performance, LUTs are becoming less effective. Even though specific circuits like SIMD can improve LUT efficiency, they still do not fully bridge the performance gap. This gap makes it difficult to choose between direct numerical and LUT calculations. To address this problem, a register-LUT method with nearest-neighbor lookup was proposed; however, it is limited to functions with narrow-range values approaching zero. In this paper, we propose a method for using register LUTs to process images efficiently over a wide range of values. Our contributions include proposing a register LUT with linear interpolation for efficient computation, using a smaller data type for further efficiency, and suggesting an efficient data retrieval method.
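The difference between nearest-neighbor lookup and linear interpolation on a very small table can be sketched as follows. This is an illustrative scalar version only, not the register-level SIMD implementation the abstract describes; the 16-entry table size, the function exp(-x), the range [0, 4], and all function names are assumptions chosen for the example.

```python
import math

def build_lut(func, lo, hi, size):
    """Sample func at `size` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (size - 1)
    return [func(lo + i * step) for i in range(size)], step

def lut_nearest(table, step, lo, x):
    """Nearest-neighbor lookup: use the sample closest to x."""
    i = int(round((x - lo) / step))
    i = min(max(i, 0), len(table) - 1)  # clamp to the table range
    return table[i]

def lut_linear(table, step, lo, x):
    """Linear interpolation between the two samples that bracket x."""
    pos = (x - lo) / step
    pos = min(max(pos, 0.0), len(table) - 1.0)  # clamp to the table range
    i = min(int(pos), len(table) - 2)           # left sample index
    frac = pos - i                              # weight of the right sample
    return table[i] + frac * (table[i + 1] - table[i])

# Hypothetical setup: a 16-entry table for exp(-x) on [0, 4].
LO, HI, SIZE = 0.0, 4.0, 16
TABLE, STEP = build_lut(lambda v: math.exp(-v), LO, HI, SIZE)

if __name__ == "__main__":
    for x in (0.4, 1.1, 2.5):
        exact = math.exp(-x)
        print(x,
              abs(lut_nearest(TABLE, STEP, LO, x) - exact),
              abs(lut_linear(TABLE, STEP, LO, x) - exact))
```

With so few entries, interpolation recovers most of the accuracy that a nearest-neighbor read loses between samples, which is the motivation for pairing a small table with linear interpolation.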
The image sequences captured by Unmanned Aerial Vehicles (UAVs) can be applied to many computer vision tasks. However, due to the instability of UAV flight, the captured image sequences deviate from the preset trajectory and pose, which reduces the quality of subsequent applications such as panoramic image stitching. In this paper, a novel method is proposed to rectify UAV-captured image sequences by transforming the images to a regular trajectory with a uniform pose. First, to minimize the total transformation deviation, a virtual regular camera trajectory is derived by minimizing the global error of coordinates between the actual and virtual camera trajectories. Then, a camera-pose-relevant local homography is proposed, which inserts the camera pose into the local homography to transform the images to the derived virtual trajectory with a uniform pose and to correct translation parallax. The experimental results demonstrate the effectiveness of the proposed rectification algorithm at both the theoretical and application levels.
A seam is a set of pixels with minimum energy forming a continuous line in an image. By eliminating or duplicating seams iteratively, an input image can be retargeted. However, this process often results in blurring, stretching, or distortion problems around the seams, especially when extending a target image. We propose a novel approach for image extension using content-aware seam restoration to solve this problem. First, we design CSR-Net, which employs features from the horizontal region of target pixels to restore the seams. Second, we develop an image extension scenario based on the seam restoration and the training methodology of CSR-Net. Experimental results demonstrate that the proposed algorithm provides more accurate expanded results at seam pixels than conventional algorithms.
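The seam definition in the first sentence corresponds to the classical dynamic-programming search used in seam carving. The sketch below shows that conventional baseline only; it is not CSR-Net. It assumes an 8-connected vertical seam and a precomputed energy map given as a list of rows, and the function names are hypothetical.

```python
def find_vertical_seam(energy):
    """Return one column index per row for the minimum-energy vertical seam."""
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = minimum total energy of any seam ending at pixel (r, c)
    cost = [row[:] for row in energy]
    for r in range(1, rows):
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            cost[r][c] += min(cost[r - 1][lo:hi + 1])
    # Backtrack from the cheapest pixel in the last row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam

def remove_seam(image, seam):
    """Delete the seam pixel from every row, shrinking the width by one."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]

if __name__ == "__main__":
    energy = [[5, 1, 5],
              [5, 5, 1],
              [5, 1, 5]]
    seam = find_vertical_seam(energy)
    print(seam, remove_seam(energy, seam))
```

Duplicating the pixels along a seam instead of deleting them extends the image, which is the case where the abstract reports visible artifacts.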
ISBN: 9798331529543 (digital)
ISBN: 9798331529550 (print)
Plenoptic cameras are light field capturing devices able to acquire large amounts of angular and spatial information. Each frame of the lenslet video produced by such cameras presents a distinctive hexagonal pattern of micro-images. Due to the particular structure of lenslet images, traditional video codecs perform poorly on lenslet video. Previous works have proposed a preprocessing scheme that cuts and realigns the micro-images on each lenslet frame. While effective, this method introduces high-frequency components into the processed image. In this paper, we propose an additional step to the aforementioned scheme by applying an invertible smoothing transform. We evaluate the enhanced scheme on lenslet video sequences captured with single-focused and multi-focused plenoptic cameras. On average, the enhanced scheme achieves a 9.85% bitrate reduction compared to the existing scheme.
ISBN: 9798331529543 (digital)
ISBN: 9798331529550 (print)
This paper focuses on the Referring Image Segmentation (RIS) task, which aims to segment objects from an image based on a given language description and has significant potential in practical applications such as food safety detection. Recent work using the attention mechanism for cross-modal interaction has achieved excellent progress. However, current methods tend to lack explicit interaction design principles as guidelines, leading to inadequate cross-modal comprehension. Additionally, most previous works use a single-modal mask decoder for prediction, losing the advantage of full cross-modal alignment. To address these challenges, we present a Fully Aligned Network (FAN) that follows four cross-modal interaction principles. Guided by these principles, our FAN achieves state-of-the-art performance on the prevalent RIS benchmarks (RefCOCO, RefCOCO+, G-Ref) with a simple architecture.
ISBN: 9783030968786; 9783030968779 (print)
In recent years, special emphasis has been placed on visual-based gait recognition due to its unique characteristics, such as not requiring a special user action and being recognizable at long distances. In general, there exist two kinds of methods, model-based and appearance-based, both of which come with their own advantages and disadvantages. In an effort to harness the best of both worlds, we create a compact 3D human model-based gait representation out of 2D images with the help of the DensePose algorithm. We design a simple CNN and train several instances to show that the obtained gait representation can in fact be used to improve gait recognition accuracy. Experimental results are based on the publicly available CASIA-B dataset.
ISBN: 9781665462198 (print)
The ever-evolving nature of the Internet and wireless communications, as well as the production of huge amounts of multimedia every day, have created a dire need for their security. In this paper, an image encryption technique based on three stages is proposed. The first stage makes use of DNA encoding. The second stage proposes and utilizes a novel S-box that is based on the Mersenne Twister and a linear descent algorithm. The third stage employs the Tent chaotic map. The computed performance evaluation metrics exhibit a high level of achieved security.
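For readers unfamiliar with the Tent chaotic map named in the third stage, the sketch below iterates the standard map x -> mu*x for x < 0.5 and mu*(1 - x) otherwise, and uses it as an XOR keystream. This is a toy illustration and not the paper's scheme: the byte quantization, the burn-in length, the parameter values, and the function names are all assumptions, and the result should not be treated as secure.

```python
def tent_keystream(x0, mu, n, burn_in=100):
    """Generate n bytes by iterating the tent map from seed x0 in (0, 1)."""
    x = x0
    out = []
    for i in range(burn_in + n):
        # Standard tent map; mu slightly below 2 avoids the floating-point
        # collapse to 0 that occurs with mu == 2.
        x = mu * x if x < 0.5 else mu * (1.0 - x)
        if i >= burn_in:                  # discard the transient iterations
            out.append(int(x * 256) % 256)
    return out

def xor_pixels(pixels, x0=0.37, mu=1.99):
    """XOR 8-bit pixel values with the keystream; applying it twice decrypts."""
    keystream = tent_keystream(x0, mu, len(pixels))
    return [p ^ k for p, k in zip(pixels, keystream)]

if __name__ == "__main__":
    plain = [0, 17, 128, 255, 42]
    cipher = xor_pixels(plain)
    print(cipher, xor_pixels(cipher) == plain)
```

The seed x0 and the parameter mu act as the key here: the same pair must be supplied to recover the original pixels.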
Learning-based image compression methods have emerged as the state of the art, showing higher performance than conventional compression solutions. These data-driven approaches aim to learn the parameters of a neural network model through iterative training on large amounts of data. The optimization process typically involves minimizing the distortion between the decoded and the original ground-truth images. This paper focuses on perceptual optimization of learning-based image compression solutions and proposes: i) a novel loss function to be used during training and ii) a novel subjective test methodology that aims to evaluate the decoded image fidelity. According to experimental results from the subjective test conducted with the new methodology, the optimization procedure can enhance image quality at low rates while offering no advantage at high rates.
Deep learning has been performing reasonably well in computer vision tasks that call for a high volume of photos, although gathering images is often expensive and challenging. Different picture augmentation techniques...