ISBN: 9798350365474 (digital), 9798350365481 (print)
In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge. This project is designed to challenge the computer vision community to develop robust image captioning models that advance the state of the art in both accuracy and fairness. Through the challenge, the image captioning models were tested using a new evaluation dataset that includes a large variety of visual concepts from many domains. No specific training data was provided for the challenge, so the entries were required to adapt to new types of image descriptions that had not been seen during training. This report includes information on the newly proposed NICE dataset, the evaluation methods, the challenge results, and technical details of the top-ranking entries. We expect that the outcomes of the challenge will contribute to the improvement of AI models on various vision-language tasks.
ISBN: 9798350365474 (digital), 9798350365481 (print)
Neural networks are notorious for being overconfident predictors, posing a significant challenge to their safe deployment in real-world applications. While feature normalization has garnered considerable attention within the deep learning literature, current train-time regularization methods for Out-of-Distribution (OOD) detection are yet to fully exploit this potential. Indeed, the naive incorporation of feature normalization within neural networks does not guarantee substantial improvement in OOD detection performance. In this work, we introduce T2FNorm, a novel approach that transforms features to hyperspherical space during training while employing the non-transformed space for OOD scoring. This method yields a surprising enhancement in OOD detection capabilities without compromising model accuracy on in-distribution (ID) data. Our investigation demonstrates that the proposed technique substantially diminishes the norm of the features of all samples, more so for out-of-distribution samples, thereby addressing the prevalent concern of overconfidence in neural networks. The proposed method also significantly improves various post-hoc OOD detection methods.
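A minimal PyTorch-style sketch of the scheme the abstract describes: normalize features onto the unit hypersphere for the classifier during training, but score OOD-ness on the raw feature norm at test time. The backbone split, feature_dim, temperature tau, and the norm-based score are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class T2FNormSketch(nn.Module):
        """Train on hyperspherical features; score OOD on raw feature norms."""

        def __init__(self, backbone: nn.Module, feature_dim: int,
                     num_classes: int, tau: float = 0.1):
            super().__init__()
            self.backbone = backbone                 # any feature extractor
            self.fc = nn.Linear(feature_dim, num_classes)
            self.tau = tau                           # temperature (assumed)

        def forward(self, x):
            z = self.backbone(x)                     # raw (non-transformed) features
            z_sphere = F.normalize(z, dim=-1)        # hyperspherical transform, train-time
            logits = self.fc(z_sphere) / self.tau
            return logits, z

        @torch.no_grad()
        def ood_score(self, x):
            # Under this training scheme, OOD samples end up with smaller
            # feature norms, so the raw norm serves as an ID-ness score.
            _, z = self.forward(x)
            return z.norm(dim=-1)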
ISBN: 9798350365474 (digital), 9798350365481 (print)
The Segment Anything Model (SAM) and CLIP are remarkable vision foundation models (VFMs). SAM, a prompt-driven segmentation model, excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. However, their unified potential has not yet been explored in medical image segmentation. To adapt SAM to medical imaging, existing methods primarily rely on tuning strategies that require extensive data or prior prompts tailored to the specific task, making it particularly challenging when only a limited number of data samples are available. This work presents an in-depth exploration of integrating SAM and CLIP into a unified framework for medical image segmentation. Specifically, we propose a simple unified framework, SaLIP, for organ segmentation. Initially, SAM is used for part-based segmentation within the image, followed by CLIP to retrieve the mask corresponding to the region of interest (ROI) from the pool of SAM's generated masks. Finally, SAM is prompted with the retrieved ROI to segment the specific organ. Thus, SaLIP is training- and fine-tuning-free and does not rely on domain expertise or labeled data for prompt engineering. Our method shows substantial enhancements in zero-shot segmentation, with notable improvements in DICE scores across diverse segmentation tasks such as brain (63.46%), lung (50.11%), and fetal head (30.82%) segmentation, when compared to un-prompted SAM. Code and text prompts are available at SaLIP.
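A hedged sketch of the three-stage pipeline described above, using the public segment-anything and CLIP packages. The checkpoint path, input file, and text prompt are assumptions for illustration; the paper's exact prompt design may differ.

    import numpy as np
    import torch
    import clip
    from PIL import Image
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator, SamPredictor

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # assumed path
    clip_model, clip_preprocess = clip.load("ViT-B/32", device="cpu")

    image = np.array(Image.open("scan.png").convert("RGB"))         # assumed input

    # 1) SAM: unprompted, part-based segmentation of the whole image.
    masks = SamAutomaticMaskGenerator(sam).generate(image)

    # 2) CLIP: score each mask crop against a text description of the ROI.
    text = clip.tokenize(["a photo of a lung"])                     # assumed prompt
    scores = []
    with torch.no_grad():
        text_feat = clip_model.encode_text(text)
        for m in masks:
            x, y, w, h = [int(v) for v in m["bbox"]]                # XYWH boxes
            crop = clip_preprocess(Image.fromarray(image[y:y+h, x:x+w])).unsqueeze(0)
            img_feat = clip_model.encode_image(crop)
            scores.append(torch.cosine_similarity(img_feat, text_feat).item())
    best = masks[int(np.argmax(scores))]

    # 3) SAM again: prompt with the retrieved ROI's box to segment the organ.
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    x, y, w, h = best["bbox"]
    organ_mask, _, _ = predictor.predict(box=np.array([x, y, x + w, y + h]))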
ISBN: 9798350365474 (digital), 9798350365481 (print)
Robots paired with computer vision are widely used in precision agriculture. Simulations are critical for safety and performance estimation, verifying robot routines in a virtual world before real-world testing and deployment. However, many simulators used for agricultural robots lack photorealism in their virtual worlds compared to the real world. We used Unreal Engine 5 (UE5) and the Robot Operating System (ROS) to develop a robot simulator tailored to agricultural tasks and synthetic data generation with RGB, segmentation, and depth images. We designed a method for assigning multiple segmentation labels within a single plant mesh. We experimented with a semi-spherical routine for two robot arms to perform 3D point cloud reconstruction across 10 plant assets. We show that our simulator produces much more accurate segmentation images and reconstructions than existing UE5 solutions. We extend our results with Neural Radiance Field (NeRF) reconstructions. The packaged simulator, UE5 project, and ROS package with the Python routine can be found at https://***/NCSU-BAE-ARLab/AgriRoboSimUE5.
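A small NumPy sketch of what a semi-spherical viewpoint routine of this kind might look like: sample camera positions on a hemisphere around a plant and aim each camera at the plant's center. The function, parameters, and angle ranges are hypothetical; the paper's routine drives two robot arms through UE5/ROS.

    import numpy as np

    def hemisphere_viewpoints(center, radius, n_azimuth=12, n_elevation=4):
        """Yield (position, look_at) pairs covering the upper hemisphere."""
        for el in np.linspace(np.deg2rad(15), np.deg2rad(75), n_elevation):
            for az in np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False):
                offset = radius * np.array([
                    np.cos(el) * np.cos(az),
                    np.cos(el) * np.sin(az),
                    np.sin(el),
                ])
                yield np.asarray(center) + offset, np.asarray(center)

    for position, target in hemisphere_viewpoints(center=[0.0, 0.0, 0.3], radius=0.6):
        pass  # send the pose to the arm controller and capture RGB/depth here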
In this paper we propose a novel, highly flexible camera. The camera consists of an image detector and a special aperture, but no lens. The aperture is a set of parallel light attenuating layers whose transmittances are controllable in space and time. By applying different transmittance patterns to this aperture, it is possible to modulate the incoming light in useful ways and capture images that are impossible to capture with conventional lens-based cameras. For example, the camera can pan and tilt its field of view without the use of any moving parts. It can also capture disjoint regions of interest in the scene without having to capture the regions in between them. In addition, the camera can be used as a computational sensor, where the detector measures the end result of computations performed by the attenuating layers on the scene radiance values. These and other imaging functionalities can be implemented with the same physical camera and the functionalities can be switched from one video frame to the next via software. We have built a prototype camera based on this approach using a bare image detector and a liquid crystal modulator for the aperture. We discuss in detail the merits and limitations of lensless imaging using controllable apertures.
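A toy NumPy sketch of the "computational sensor" idea described above: with a lensless, controllable aperture, each detector reading can be modeled as a linear functional of scene radiance, so a sequence of transmittance patterns implements a matrix-vector product optically. All dimensions and values are illustrative.

    import numpy as np

    n_scene = 64          # scene radiance samples (flattened)
    n_patterns = 64       # transmittance patterns applied over time

    scene = np.random.rand(n_scene)            # unknown scene radiance
    A = np.random.rand(n_patterns, n_scene)    # one transmittance pattern per row

    measurements = A @ scene                   # what the bare detector records

    # If A is well-conditioned, the scene can be recovered computationally
    # from the modulated measurements.
    recovered = np.linalg.solve(A, measurements)
    assert np.allclose(recovered, scene)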
Face motion is the sum of rigid motion related to face pose and non-rigid motion related to facial expression. The two motions are coupled in the captured image, so they cannot easily be recovered from the image directly. In this paper, a novel technique is proposed to recover 3D face pose and facial expression simultaneously from a monocular video sequence in real time. First, twenty-eight salient facial features are detected and tracked robustly under various face orientations and facial expressions. Second, after modelling the coupling between face pose and facial expression in the 2D image as a nonlinear function, a normalized SVD (N-SVD) decomposition technique is proposed to recover the pose and expression parameters analytically. A nonlinear technique is subsequently utilized to refine the solution obtained from the N-SVD technique by imposing the orthonormality constraint on the pose parameters. Compared to the original SVD technique proposed in [1], which is very sensitive to image noise and numerically unstable in practice, the proposed method recovers the face pose and facial expression robustly and accurately. Finally, the performance of the proposed technique is evaluated in experiments using both synthetic and real image sequences.
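A hedged sketch of the orthonormality-constraint step mentioned above: project a noisy pose (rotation) estimate onto the nearest orthonormal matrix via SVD. This illustrates only the constraint; the paper's N-SVD decomposition of the coupled pose/expression parameters is more involved.

    import numpy as np

    def nearest_orthonormal(R_est: np.ndarray) -> np.ndarray:
        """Return the orthonormal matrix closest to R_est in Frobenius norm."""
        U, _, Vt = np.linalg.svd(R_est)
        R = U @ Vt
        if np.linalg.det(R) < 0:      # keep it a proper rotation
            U[:, -1] *= -1
            R = U @ Vt
        return R

    noisy = np.eye(3) + 0.1 * np.random.randn(3, 3)   # noisy pose estimate
    R = nearest_orthonormal(noisy)
    assert np.allclose(R @ R.T, np.eye(3), atol=1e-8)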
ISBN: 9798350365474 (digital), 9798350365481 (print)
To compete with existing mobile architectures, MobileViG introduced Sparse Vision Graph Attention (SVGA), a fast token-mixing operator based on the principles of GNNs. However, MobileViG scales poorly with model size, falling up to 1% behind models of similar latency. This paper introduces Mobile Graph Convolution (MGC), a new vision graph neural network (ViG) module that solves this scaling problem. Our proposed mobile vision architecture, MobileViGv2, uses MGC to demonstrate the effectiveness of our approach. MGC improves on SVGA by increasing graph sparsity and introducing conditional positional encodings to the graph operation. Our smallest model, MobileViGv2-Ti, achieves 77.7% top-1 accuracy on ImageNet-1K, 2% higher than MobileViG-Ti, with 0.9 ms inference latency on the iPhone 13 Mini NPU. Our largest model, MobileViGv2-B, achieves 83.4% top-1 accuracy, 0.8% higher than MobileViG-B, with 2.7 ms inference latency. Besides image classification, we show that MobileViGv2 generalizes well to other tasks. For object detection and instance segmentation on MS COCO 2017, MobileViGv2-M outperforms MobileViG-M by 1.2 AP^box and 0.7 AP^mask, and MobileViGv2-B outperforms MobileViG-B by 1.0 AP^box and 0.7 AP^mask. For semantic segmentation on ADE20K, MobileViGv2-M achieves 42.9% mIoU and MobileViGv2-B achieves 44.3% mIoU.
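A brief PyTorch sketch of a conditional positional encoding of the kind MGC adds to the graph operation: a depthwise convolution whose output is added back to the token grid, so positional information depends on local content. Layer sizes are illustrative, not MobileViGv2's exact configuration.

    import torch
    import torch.nn as nn

    class ConditionalPosEnc(nn.Module):
        def __init__(self, dim: int, kernel_size: int = 3):
            super().__init__()
            # Depthwise conv: each channel sees only its own spatial neighborhood.
            self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                    padding=kernel_size // 2, groups=dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, height, width) token grid
            return x + self.dwconv(x)

    tokens = torch.randn(1, 64, 14, 14)
    out = ConditionalPosEnc(dim=64)(tokens)
    assert out.shape == tokens.shape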
ISBN: 9798350365474 (digital), 9798350365481 (print)
Existing fine-grained hashing methods typically lack code interpretability, as they compute hash code bits holistically using both global and local features. To address this limitation, we propose ConceptHash, a novel method that achieves sub-code-level interpretability. In ConceptHash, each sub-code corresponds to a human-understandable concept, such as an object part, and these concepts are discovered automatically without human annotations. Specifically, we leverage a Vision Transformer architecture and introduce concept tokens as visual prompts, alongside the image patch tokens, as model inputs. Each concept is then mapped to a specific sub-code at the model output, providing natural sub-code interpretability. To capture subtle visual differences among highly similar sub-categories (e.g., bird species), we incorporate language guidance to ensure that the learned hash codes are distinguishable within fine-grained object classes while maintaining semantic alignment. This approach allows us to develop hash codes that exhibit similarity within families of species while remaining distinct from species in other families. Extensive experiments on four fine-grained image retrieval benchmarks demonstrate that ConceptHash outperforms previous methods by a significant margin, offering unique sub-code interpretability as an additional benefit. Code at: https://***/kamwoh/concepthash.
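A minimal PyTorch sketch of the concept-token idea: learnable concept tokens are prepended to the patch tokens, and each concept's output embedding is mapped to one sub-code of the hash. The encoder choice, token counts, bit widths, and tanh relaxation are assumptions for illustration, not ConceptHash's exact architecture or training losses.

    import torch
    import torch.nn as nn

    class ConceptTokenHasher(nn.Module):
        def __init__(self, encoder: nn.Module, dim: int,
                     num_concepts: int = 4, bits_per_concept: int = 12):
            super().__init__()
            self.concept_tokens = nn.Parameter(torch.randn(1, num_concepts, dim))
            self.encoder = encoder                   # any token-sequence encoder
            self.to_subcode = nn.Linear(dim, bits_per_concept)

        def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
            # patch_tokens: (batch, num_patches, dim)
            b = patch_tokens.shape[0]
            concepts = self.concept_tokens.expand(b, -1, -1)
            x = self.encoder(torch.cat([concepts, patch_tokens], dim=1))
            concept_out = x[:, : concepts.shape[1]]  # one embedding per concept
            # tanh relaxes the binary code for training; sign() binarizes at test.
            subcodes = torch.tanh(self.to_subcode(concept_out))
            return subcodes.flatten(1)               # (batch, num_concepts * bits)

    enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2)
    codes = ConceptTokenHasher(enc, dim=64)(torch.randn(2, 196, 64))  # (2, 48)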
Video surveillance applications such as smart rooms and security systems are prevalent nowadays. Camera calibration information (e.g., camera position, orientation, and focal length) is very useful for such surveillance systems because it can provide scene knowledge and limit the search space for object detection and tracking. In this paper, we describe a camera calibration tool based on vanishing points that does not require any calibration object or specific geometric objects. In urban environments, vanishing points are easily obtainable, since many parallel lines, such as street lines, light poles, and building edges, exist in both outdoor and indoor scene images. Experimental results from various surveillance cameras are presented.
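A sketch of the classic calibration step such vanishing-point methods rely on: given the two vanishing points u and v of orthogonal scene directions, and assuming zero skew, square pixels, and the principal point p at the image center, the orthogonality constraint (u - p) . (v - p) + f^2 = 0 yields the focal length. The specific coordinates below are invented for illustration.

    import numpy as np

    def focal_from_orthogonal_vps(u, v, principal_point):
        u, v, p = map(np.asarray, (u, v, principal_point))
        d = np.dot(u - p, v - p)
        if d >= 0:
            raise ValueError("vanishing points inconsistent with orthogonal directions")
        return float(np.sqrt(-d))

    # Example: a 640x480 image with the principal point at its center.
    f = focal_from_orthogonal_vps(u=(900.0, 240.0), v=(-150.0, 260.0),
                                  principal_point=(320.0, 240.0))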
We present a novel multiscale approach that combines segmentation with classification to detect abnormal brain structures in medical imagery, and demonstrate its utility in detecting multiple sclerosis lesions in 3D MRI data. Our method uses segmentation to obtain a hierarchical decomposition of a multi-channel, anisotropic MRI scan. It then produces a rich set of features describing the segments in terms of intensity, shape, location, and neighborhood relations. These features are fed into a decision-tree-based classifier, trained with data labeled by experts, enabling the detection of lesions at all scales. Unlike common approaches that use voxel-by-voxel analysis, our system can utilize regional properties that are often important for characterizing abnormal brain structures. We provide experiments showing successful detection of lesions in both simulated and real MR images.
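A compact sketch of the regions-plus-classifier idea: describe each segment with region-level features and train a decision tree on expert labels. The feature names and random data below are placeholders, not the paper's actual feature set or training data.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # One row per segment: [mean intensity, volume, shape compactness, location]
    X = rng.random((200, 4))
    y = rng.integers(0, 2, 200)        # expert label: 1 = lesion, 0 = normal

    clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
    lesion_prob = clf.predict_proba(X)[:, 1]   # per-segment lesion score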