This paper considers the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes beyond a pre-defined, closed set of categories. The main contributions are as follows. First, we propose a transformer-based model for OVS, termed OVSegmentor, which exploits only web-crawled image-text pairs for pre-training, without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention-based binding module, then aligns the group tokens to the corresponding caption embeddings. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, encouraging the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M for frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on four benchmark datasets: PASCAL VOC, PASCAL Context, COCO Object, and ADE20K. OVSegmentor achieves superior results over state-of-the-art approaches on PASCAL VOC using only 3% of the data (4M vs. 134M) for pre-training.
ISBN (print): 9781728173221
Multi-modal image registration has received increasing attention in computer vision and computational photography. However, non-linear intensity variations prevent accurate feature point matching between image pairs of different modalities. We therefore propose a robust image descriptor for multi-modal image registration, named the shearlet-based modality robust descriptor (SMRD). Based on the discrete shearlet transform, it encodes the anisotropic features of edge and texture information at multiple scales to describe the region around a point of interest. We conducted experiments comparing the proposed SMRD with several state-of-the-art multi-modal/multispectral descriptors on four different multi-modal datasets. The experimental results show that SMRD outperforms the other methods in terms of precision, recall, and F1-score.
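The three evaluation metrics named in this abstract can be illustrated with a minimal sketch. The function name and the way matches are counted (correct matches, all returned matches, all ground-truth correspondences) are assumptions made for illustration; the abstract does not state the exact evaluation protocol.

```python
def match_scores(num_correct, num_matched, num_ground_truth):
    """Precision, recall and F1 for a set of descriptor matches.

    num_correct      -- returned matches that agree with the ground truth
    num_matched      -- all matches returned by the descriptor
    num_ground_truth -- all correspondences that exist in the ground truth
    (These counting conventions are assumptions, not taken from the paper.)
    """
    precision = num_correct / num_matched if num_matched else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, 80 correct matches out of 100 returned, with 160 ground-truth correspondences, gives a precision of 0.8 and a recall of 0.5.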
To meet the urgent need for automated, intelligent kiwifruit picking, and to address the problems of poorly constructed kiwifruit data sets, low fruit recognition accuracy, and poor spatial positioning in the natural orchard environment, a precise recognition and visual positioning method for kiwifruit based on an improved Yolov5s is proposed. In view of the growth characteristics of kiwifruit in trellis orchards, a multi-type kiwifruit data set is first constructed. An attention mechanism and a multi-scale module are then combined to improve the Yolov5s network structure, which identifies kiwifruit and extracts the center coordinates of each prediction box. The experimental results show that the average accuracy of the model for six kiwifruit types under different weather and light conditions is 98%. The recognition time for a single $1280\times 720$ pixel image is about 13.8 ms, and the model weights occupy only 15.21 Mb. This study can therefore provide technical support for the vision system of an automatic kiwifruit picking robot, and a reference for the intelligent recognition and positioning of other fruits (such as apples, mangoes, and oranges).
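Extracting the center of a prediction box, as mentioned in this abstract, can be sketched as follows. The corner-format box (x_min, y_min, x_max, y_max) in pixel units and the function name are assumptions; the abstract does not say which box format the authors use.

```python
def box_center(x_min, y_min, x_max, y_max):
    """Return the (x, y) center of an axis-aligned box given its corners.

    Assumes corner-format pixel coordinates; this is an illustrative
    convention, not one confirmed by the paper.
    """
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
```

The resulting image-plane point is what a downstream positioning step would map to a spatial location.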
ISBN (digital): 9781665450928
ISBN (print): 9781665450935
Steganography is the common name for methods that aim at secret communication. In this paper, we present a novel steganography algorithm that hides a plaintext payload in halftone images, together with a payload extraction algorithm suited to messages hidden using this method. Our steganography algorithm uses a modified pattern-based halftone image generation procedure and distributes the payload across multiple output images. The proposed method has proven to be secure and able to hide large payloads. Objective and subjective evaluations indicate that the proposed method produces promising results.
ISBN (print): 9781665435413
Camera-based monitoring is becoming increasingly popular, as multi-objective detection tasks can be enabled by video analytics over captured frames. Yet video frames have to be delivered to computation-capable edge nodes for further processing, because the required resources exceed the capacity of the cameras' built-in hardware. In this paper, observing that video resolution directly determines the subsequent bandwidth and computing resource consumption, as well as the analytic accuracy, we propose an edge-assisted, object-based resolution configuration algorithm to achieve efficient multi-task video analytics. The proposed algorithm harnesses the diversity of the neural networks used for detecting different objects in one frame, which offers two opportunities for bandwidth saving. On the one hand, background information should not be transmitted indiscriminately, as it is unlikely to improve the analytics accuracy. On the other hand, fine-grained resolution selection allows an object-level optimal resolution that minimizes the transmitted data volume under accuracy and latency constraints. Simulation results demonstrate that the proposed method can reduce the transmitted data volume by up to 50% compared to existing benchmarks.
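The object-level selection described in this abstract (smallest data volume subject to accuracy and latency constraints) can be sketched as a simple filter-then-minimize step. The candidate fields (`height`, `bytes`, `accuracy`, `latency_ms`), the function name, and the profile numbers in the usage note are all hypothetical; the paper's actual algorithm and cost model are not given in the abstract.

```python
def pick_resolution(candidates, min_accuracy, max_latency_ms):
    """Pick the candidate with the fewest bytes that meets both constraints.

    candidates is a list of dicts with hypothetical keys:
    'height', 'bytes', 'accuracy', 'latency_ms'.
    Returns None when no candidate is feasible.
    """
    feasible = [
        c for c in candidates
        if c["accuracy"] >= min_accuracy and c["latency_ms"] <= max_latency_ms
    ]
    if not feasible:
        return None
    # Among feasible options, minimize the transmitted data volume.
    return min(feasible, key=lambda c: c["bytes"])
```

With made-up profiles for 360p, 720p and 1080p crops of one object, an accuracy floor of 0.85 and a 40 ms latency budget would select the 720p option if it is the only one satisfying both limits at the lowest byte count.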
In recent years, the number of dairy and beef cattle farms has been decreasing, while the number of cattle and the number of cattle per farm have been increasing, so systems for automatically monitoring cattle have be...
Image captioning is a rapidly emerging area in the Artificial Intelligence applications for natural language definitions. It works at the confluence of image data obtained through datasets, and the sentence definition...
In vehicle visual navigation, the image matching algorithm is critical to positioning accuracy and processing efficiency. No single matching algorithm can accurately acquire all types of image features, so Harris, SUSAN, FAST, SIFT, and SURF are each applied to various road images under normal lighting conditions. In practical applications, the appropriate algorithm can be selected based on the detection rate and running time of these algorithms. Because many traditional matching algorithms perform poorly when the illumination of the collected images changes, an image precise matching algorithm that is robust to illumination change is proposed. Since image edges and detail information are less sensitive to illumination change, SURF feature points are optimized using the image gradient, following the idea of Canny, and a bidirectional search is used to obtain precise matching points. The experimental results show that the feature point detection of the algorithm remains stable under illumination change, and the matching accuracy can exceed 94%. The algorithm is not only robust to illumination change, but also achieves a higher matching speed while significantly improving the matching accuracy.
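One common reading of a bidirectional search is mutual nearest-neighbour matching: a pair is kept only if each descriptor is the other's closest match. The sketch below assumes that reading, uses squared Euclidean distance, and uses hypothetical function names; the abstract does not specify the authors' exact search procedure.

```python
def nearest(query, pool):
    """Index of the descriptor in pool closest to query (squared Euclidean)."""
    return min(
        range(len(pool)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(query, pool[j])),
    )


def bidirectional_matches(desc_a, desc_b):
    """Keep (i, j) only if j is i's nearest in B and i is j's nearest in A.

    This mutual check is an assumed interpretation of 'bidirectional search'.
    """
    if not desc_a or not desc_b:
        return []
    matches = []
    for i, d in enumerate(desc_a):
        j = nearest(d, desc_b)            # forward search: A -> B
        if nearest(desc_b[j], desc_a) == i:  # backward search: B -> A
            matches.append((i, j))
    return matches
```

The backward check discards one-sided matches, which is why this kind of filter tends to trade a few lost correspondences for fewer false ones.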
In order to measure the perceptual quality of images, it is important to find suitable Image Quality Assessment (IQA) methods. Compared with the traditional objective IQA methods, the subjective IQA methods can more t...
ISBN (print): 9781728163956
Video traffic comprises a large majority of the total traffic on the internet today. Uncompressed visual data requires a very large data rate; lossy compression techniques are employed in order to keep the data rate manageable. Increasingly, a significant amount of the visual data being generated is consumed by analytics (such as classification and detection) residing in the cloud. Image and video compression can produce visual artifacts, especially at lower data rates, which can result in a significant drop in performance on such analytic tasks. Moreover, standard image and video compression techniques aim to optimize perceptual quality for human consumption by allocating more bits to perceptually significant features of the scene. However, these features may not necessarily be the most suitable ones for semantic tasks. We present here an approach to compress visual data in order to maximize performance on a given analytic task. We train a deep auto-encoder using a multi-task loss to learn the relevant embeddings. An approximate differentiable model of the quantizer is used during training, which helps boost the accuracy during inference. We apply our approach to an image classification problem and show that, for a given level of compression, it achieves higher classification accuracy than that obtained by performing classification on images compressed using JPEG. Our approach also outperforms the relevant state-of-the-art approach by a significant margin.
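Rounding has zero gradient almost everywhere, which is why a differentiable stand-in is needed during training. One widely used stand-in is additive uniform noise of one quantization step in width; the sketch below assumes that choice and uses hypothetical function names, since the abstract does not say which approximation the authors adopt.

```python
import random


def hard_quantize(x, step=1.0):
    """Inference-time quantizer: snap x to the nearest multiple of step.

    Uses Python's round(), which rounds exact halves to the even integer.
    """
    return step * round(x / step)


def noisy_quantize(x, step=1.0, rng=random):
    """Training-time proxy (assumed): add uniform noise in [-step/2, step/2].

    The output is x plus a noise term, so its derivative with respect to x
    is 1, unlike hard rounding.
    """
    return x + rng.uniform(-step / 2.0, step / 2.0)
```

Both functions perturb a value by at most half a step, so the noise seen in training has the same range as the rounding error seen at inference.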