ISBN (digital): 9798350365474; ISBN (print): 9798350365481
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at ***/folbaeni/multimodal-icl
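As an illustration of the majority-voting baseline referred to above, the sketch below predicts the most frequent label among the retrieved in-context demonstrations while ignoring the query image entirely; the function name and the (image, text, label) demonstration format are assumptions made for this example, not taken from the paper.

```python
from collections import Counter

def majority_vote_baseline(demonstrations):
    """Predict the most common label among the in-context demonstrations.

    `demonstrations` is a list of (image, text, label) tuples, e.g. the
    examples retrieved by a similarity-based strategy such as RICES.
    The query itself is deliberately ignored, which is what makes this a
    pure context-driven baseline.
    """
    labels = [label for _, _, label in demonstrations]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical usage: four retrieved demonstrations, three labelled "cat".
demos = [(None, "a photo of", "cat"),
         (None, "a photo of", "cat"),
         (None, "a photo of", "dog"),
         (None, "a photo of", "cat")]
print(majority_vote_baseline(demos))  # -> "cat"
```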
This paper discusses strategies for object detection in marine images from the perspective of a practitioner working with real-world, long-tail distributed datasets and a large amount of additional unlabeled data on hand. The paper discusses the benefits of separating the localization and classification stages, making the case for robust localization through the amalgamation of additional datasets, inspired by an approach widely used by practitioners in the camera-trap literature. For the classification stage, the paper compares strategies for using the additional unlabeled data, contrasting supervised, iteratively supervised, self-supervised, and semi-supervised pre-training approaches. Our findings reveal that semi-supervised pre-training, followed by supervised fine-tuning, yields a significantly improved balanced performance across the long-tail distribution, albeit occasionally with a trade-off in overall accuracy. These insights are validated through experiments on two real-world long-tailed underwater datasets collected by the Monterey Bay Aquarium Research Institute (MBARI).
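The abstract does not specify which semi-supervised recipe is used, so the sketch below shows one common possibility (FixMatch-style pseudo-labelling with a confidence threshold) purely as an illustration of combining labelled long-tail data with unlabelled data; the helper name and hyper-parameters are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_lab, y_lab, x_unlab_weak, x_unlab_strong,
                         threshold=0.95, lambda_u=1.0):
    """One FixMatch-style training step: supervised loss on labelled crops
    plus a consistency loss on confidently pseudo-labelled unlabelled crops."""
    # Supervised loss on the labelled long-tail data.
    loss_sup = F.cross_entropy(model(x_lab), y_lab)

    # Pseudo-labels from weakly augmented unlabelled images.
    with torch.no_grad():
        probs = F.softmax(model(x_unlab_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()

    # Consistency loss on strongly augmented views, counted only where confident.
    loss_unsup = (F.cross_entropy(model(x_unlab_strong), pseudo,
                                  reduction="none") * mask).mean()
    return loss_sup + lambda_u * loss_unsup

# Hypothetical usage with a toy classifier and random tensors.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
loss = semi_supervised_step(model,
                            torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)),
                            torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32))
loss.backward()
```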
Operators devoid of multiplication, such as Shift and Add, have gained prominence for their compatibility with hardware. However, neural networks (NNs) employing these operators typically exhibit lower accuracy compared to conventional NNs with identical structures. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It embeds a ShiftAdd tiny NN into a large multiplicative model and trains it as a sub-model to obtain additional supervision. To solve the weight discrepancy problem between hybrid operators, a new weight-sharing method is proposed. Additionally, a novel two-stage neural architecture search is used to obtain better augmentation effects for smaller but stronger multiplication-free tiny neural networks. The superiority of ShiftAddAug is validated through experiments on image classification and semantic segmentation, consistently delivering noteworthy enhancements. Remarkably, it secures up to a 4.95% increase in accuracy on CIFAR-100 compared to its directly trained counterparts, even surpassing the performance of multiplicative NNs.
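For readers unfamiliar with multiplication-free operators, the following sketch shows a DeepShift-style linear layer whose weights are rounded to signed powers of two, so each multiply can be realised in hardware as a sign flip plus a bit shift. This is illustrative background only, not ShiftAddAug's implementation, and it omits the multiplicative augmentation and weight-sharing scheme described above.

```python
import torch
import torch.nn as nn

class ShiftLinear(nn.Module):
    """Linear layer whose weights are quantised to signed powers of two."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w = self.weight
        # Round |w| to the nearest power of two, keep the sign.
        shift = torch.round(torch.log2(w.abs().clamp(min=1e-8)))
        w_q = torch.sign(w) * torch.pow(2.0, shift)
        # Straight-through estimator: quantised weights in the forward pass,
        # dense-weight gradients in the backward pass.
        w_ste = w + (w_q - w).detach()
        return nn.functional.linear(x, w_ste, self.bias)

# Hypothetical usage.
layer = ShiftLinear(16, 4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```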
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As a result, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune large-scale datasets to produce an informative subset for training sophisticated deep models with a negligible performance drop. We propose a simple yet effective dataset pruning method that exploits both prediction uncertainty and training dynamics. We study dataset pruning by measuring the variation of predictions during the whole training process on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves a 25% lossless pruning ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://***/BAAI-DCAI/Dataset-Pruning.
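A minimal sketch of pruning by prediction variation is given below: each sample is scored by how much its correctness fluctuates across training checkpoints, and the lowest-scoring fraction is dropped. The scoring rule, the `keep_ratio` value, and the function name are simplifications assumed for illustration rather than the paper's exact criterion.

```python
import numpy as np

def prune_by_prediction_variation(correct_history, keep_ratio=0.75):
    """Keep the samples whose predictions flip most during training.

    `correct_history` is a (num_checkpoints, num_samples) boolean array
    recording whether each sample was classified correctly at each recorded
    checkpoint. Samples whose correctness fluctuates the most are treated as
    the most informative; consistently easy (or consistently wrong) samples
    are pruned first.
    """
    # Variance of per-checkpoint correctness is high for samples the model
    # keeps flipping on, and zero for always-right / always-wrong samples.
    scores = correct_history.astype(np.float32).var(axis=0)
    num_keep = int(keep_ratio * correct_history.shape[1])
    keep_idx = np.argsort(-scores)[:num_keep]
    return np.sort(keep_idx)

# Hypothetical usage: 10 checkpoints of a 1000-sample dataset.
history = np.random.rand(10, 1000) > 0.3
subset = prune_by_prediction_variation(history, keep_ratio=0.75)
print(len(subset))  # 750
```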
We introduce a new technique for generating retinal fundus images with anatomically accurate vascular structures, using diffusion models. We generate artery/vein masks to create the vascular structure, on which we then condition to produce retinal fundus images. The proposed method can generate high-quality images with more realistic vascular structures and, owing to the strengths of the diffusion model, can create a diverse range of images. We present quantitative evaluations demonstrating the performance improvement when using our method for data augmentation on vessel segmentation and artery/vein classification. We also present Turing test results from clinical experts, showing that our generated images are difficult to distinguish from real images. We believe our method can be applied to construct stand-alone datasets free of patient privacy concerns.
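One straightforward way to condition a diffusion model on an artery/vein mask is to append the mask as extra input channels of the denoiser, as sketched below. Whether the paper uses channel concatenation or another conditioning mechanism is not stated in the abstract, so treat this as an assumed illustration; the toy denoiser and the noise schedule are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conditioned_diffusion_loss(denoiser, fundus, av_mask, T=1000):
    """One DDPM-style training step where the artery/vein mask is supplied
    as extra input channels, so the denoiser learns to generate fundus
    images consistent with a given vascular structure."""
    b = fundus.size(0)
    t = torch.randint(0, T, (b,), device=fundus.device)
    # Simple linear beta schedule and its cumulative products.
    betas = torch.linspace(1e-4, 0.02, T, device=fundus.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(fundus)
    x_t = alpha_bar.sqrt() * fundus + (1 - alpha_bar).sqrt() * noise
    # Concatenate the condition (artery/vein mask) along the channel axis.
    pred = denoiser(torch.cat([x_t, av_mask], dim=1), t)
    return F.mse_loss(pred, noise)

# Hypothetical usage with a toy denoiser that ignores the timestep.
denoiser = lambda x, t: nn.Conv2d(4, 3, 3, padding=1)(x)
loss = conditioned_diffusion_loss(denoiser,
                                  torch.randn(2, 3, 64, 64),
                                  torch.randn(2, 1, 64, 64))
print(float(loss))
```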
This paper presents an innovative approach to multi-view generation that offers comprehensive control over both perspective (viewpoint) and non-perspective attributes (such as depth maps). Our controllable dual-branch pipeline, named Depth Guided Branched Diffusion (DGBD), leverages depth maps and perspective information to generate images from alternative viewpoints while preserving shape and size fidelity. In the first DGBD branch, we fine-tune a pre-trained diffusion model on multi-view data, introducing a regularized batch-aware self-attention mechanism for multi-view consistency and generalization. Direct control over perspective is then achieved through cross-attention conditioned on camera position. Meanwhile, the second DGBD branch introduces non-perspective control using depth maps. Qualitative and quantitative experiments validate the effectiveness of our approach, which surpasses or matches the performance of state-of-the-art novel-view and multi-view synthesis methods.
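One plausible reading of the batch-aware self-attention mentioned above is to fold the view axis into the token axis so that tokens from all views of a scene attend to each other, as in the sketch below. The regularization term and the exact token layout used by DGBD are not given in the abstract, so this is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class BatchAwareSelfAttention(nn.Module):
    """Self-attention that lets tokens from all views of the same scene
    attend to each other, by folding the view axis into the token axis."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, num_views):
        # tokens: (batch * num_views, num_tokens, dim)
        bv, n, d = tokens.shape
        b = bv // num_views
        joint = tokens.reshape(b, num_views * n, d)   # join views per scene
        out, _ = self.attn(joint, joint, joint)       # cross-view attention
        return out.reshape(bv, n, d)

# Hypothetical usage: 2 scenes x 4 views, 256 tokens of width 64.
layer = BatchAwareSelfAttention(dim=64)
x = torch.randn(2 * 4, 256, 64)
print(layer(x, num_views=4).shape)  # torch.Size([8, 256, 64])
```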
Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformers. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cross-view deformations. Secondly, we present selective fusion to merge local and global messages from cross attention. Apart from the network structure, we also identify the importance of enforcing spatial smoothness in the loss design, which has been omitted by previous works. Based on these augmentations, our network demonstrates strong matching capacity under different settings. The full version of our network achieves state-of-the-art performance among semi-dense matching methods at a cost similar to LoFTR, while the slim version reaches the LoFTR baseline's performance with only 15% of the computation cost and 18% of the parameters.
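The abstract only states that spatial smoothness is enforced in the loss; a simple first-order, total-variation style penalty on a dense correspondence field, shown below, illustrates one way such a term can look. The tensor layout and the function are assumptions, not the paper's exact formulation.

```python
import torch

def spatial_smoothness_loss(flow):
    """First-order smoothness penalty on a dense correspondence field.

    `flow` has shape (B, 2, H, W) and stores, for every pixel in image A,
    the 2-D offset of its predicted match in image B. Penalising differences
    between neighbouring offsets encourages locally coherent matches.
    """
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy

# Hypothetical usage.
flow = torch.randn(1, 2, 60, 80)
print(float(spatial_smoothness_loss(flow)))
```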
This paper outlines the advancements and results of the Fifth Thermal Image Super-Resolution challenge, hosted at the Perception Beyond the Visible Spectrum CVPR 2024 workshop. The challenge employed a novel benchmark cross-spectral dataset consisting of 1000 thermal images, each paired with its corresponding registered RGB image. The challenge featured two tracks: Track-1 focused on Single Thermal Image Super-Resolution with an ×8 upscale factor, while Track-2 extended its evaluation to include both ×8 and ×16 scaling factors, utilizing high-resolution RGB images to guide the super-resolution process for low-resolution thermal images. The participation of over 175 teams highlights the research community’s strong engagement and dedication to enhancing image resolution techniques across both single and cross-spectral methodologies. This year’s challenge sets new benchmarks and provides valuable insights into future directions for research in thermal image super-resolution.
The assessment of rehabilitation exercises for neurological and musculoskeletal disorders is crucial for recovery. Traditionally, assessment methods have been subjective, with inherent uncertainty and limitations. This paper introduces a novel multi-modality dataset named FineRehab to promote the study of rehabilitation movement analysis, leveraging advancements in sensor technology and artificial intelligence. FineRehab collects 16 actions from 50 participants, including both patients with musculoskeletal disorders and healthy individuals, and consists of 4,215 action samples captured by two Kinect cameras and 17 IMUs. To benchmark FineRehab, we present a reliable approach to analyzing rehabilitation exercises and conduct experiments to evaluate comprehensive movement quality across multiple dimensions. Comparative experimental analyses have verified the validity of our dataset in distinguishing between the movements of healthy participants and patients, which can offer a quantifiable basis for personalized rehabilitation feedback. The introduction of FineRehab will encourage researchers to apply, develop, and adapt various methods for rehabilitation exercise analysis.
Event cameras are a new type of vision sensor that incorporates asynchronous and independent pixels, offering advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur. However, their output is not easily understandable by humans, making the reconstruction of intensity images from event streams a fundamental task in event-based vision. While recent deep learning-based methods have shown promise in video reconstruction from events, this problem is not completely solved yet. To facilitate comparison between different approaches, standardized evaluation protocols and diverse test datasets are essential. This paper proposes a unified evaluation methodology and introduces an open-source framework called EVREAL to comprehensively benchmark and analyze various event-based video reconstruction methods from the literature. Using EVREAL, we give a detailed analysis of the state-of-the-art methods for event-based video reconstruction, and provide valuable insights into the performance of these methods under varying settings, challenging scenarios, and downstream tasks.
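As a rough illustration of how a benchmark such as EVREAL scores reconstructions against ground-truth frames, the snippet below averages two standard full-reference metrics over a sequence. EVREAL itself also covers no-reference metrics, challenging scenarios, and downstream tasks; the function here is a hypothetical stand-in rather than its actual API.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_reconstruction(recon_frames, gt_frames):
    """Score reconstructed intensity frames against ground truth with
    full-reference metrics, averaged over the sequence."""
    psnr, ssim = [], []
    for rec, gt in zip(recon_frames, gt_frames):
        psnr.append(peak_signal_noise_ratio(gt, rec, data_range=1.0))
        ssim.append(structural_similarity(gt, rec, data_range=1.0))
    return {"PSNR": float(np.mean(psnr)), "SSIM": float(np.mean(ssim))}

# Hypothetical usage with random frames in [0, 1].
gt = [np.random.rand(180, 240).astype(np.float32) for _ in range(5)]
rec = [np.clip(f + 0.05 * np.random.randn(*f.shape), 0, 1).astype(np.float32)
       for f in gt]
print(evaluate_reconstruction(rec, gt))
```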