ISBN (Print): 9789819785070; 9789819785087
Learning-based point cloud registration methods can handle clean point clouds well, but they still struggle to generalize to noisy, partial, and density-varying point clouds. To this end, we propose a novel point cloud registration framework for these imperfect point clouds. By introducing a neural implicit representation, we replace the problem of rigid registration between point clouds with a registration problem between the point cloud and the neural implicit function. We then propose to alternately optimize the implicit function and the registration between the implicit function and the point cloud. In this way, point cloud registration can be performed in a coarse-to-fine manner. By fully capitalizing on the capabilities of the neural implicit function without computing point correspondences, our method shows remarkable robustness to noise, incompleteness, and density changes in point clouds.
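As a rough illustration of the alternating scheme this abstract describes, the hedged sketch below fits a small implicit distance network to the target cloud and then optimizes a rigid transform that pulls the source cloud onto its zero level set. The network size, distance supervision, and optimization schedule are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ImplicitSDF(nn.Module):
    """Small MLP mapping a 3D point to an (unsigned) distance value."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def skew(w):
    # Skew-symmetric matrix of a 3-vector, used in the SO(3) exponential map.
    zero = torch.zeros((), dtype=w.dtype, device=w.device)
    return torch.stack([
        torch.stack([zero, -w[2],  w[1]]),
        torch.stack([w[2],  zero, -w[0]]),
        torch.stack([-w[1], w[0],  zero]),
    ])

def register(source, target, rounds=5, fit_iters=200, reg_iters=200):
    """Alternately (1) fit the implicit function to the target cloud and
    (2) optimize a rigid transform pulling the source onto its zero level set."""
    sdf = ImplicitSDF()
    w = torch.zeros(3, requires_grad=True)  # rotation (axis-angle)
    t = torch.zeros(3, requires_grad=True)  # translation
    opt_f = torch.optim.Adam(sdf.parameters(), lr=1e-3)
    opt_T = torch.optim.Adam([w, t], lr=1e-2)
    for _ in range(rounds):
        # Step 1: supervise the network with approximate distances to the target cloud.
        for _ in range(fit_iters):
            opt_f.zero_grad()
            q = torch.rand(512, 3) * 2 - 1                 # random queries in [-1, 1]^3
            d = torch.cdist(q, target).min(dim=1).values   # distance to nearest target point
            loss_f = (sdf(q) - d).pow(2).mean() + sdf(target).abs().mean()
            loss_f.backward()
            opt_f.step()
        # Step 2: move the transformed source onto the zero level set of the frozen field.
        for _ in range(reg_iters):
            opt_T.zero_grad()
            R = torch.linalg.matrix_exp(skew(w))
            loss_T = sdf(source @ R.T + t).abs().mean()
            loss_T.backward()
            opt_T.step()
    return torch.linalg.matrix_exp(skew(w)).detach(), t.detach()

if __name__ == "__main__":
    target = torch.rand(1024, 3) * 1.6 - 0.8   # toy target cloud in [-0.8, 0.8]^3
    source = target + 0.2                      # toy misaligned source
    R, t = register(source, target)
    print(R, t)
```

Because the transform is re-estimated after every refinement of the implicit field, each round effectively registers against a progressively better surface, which is one way to realize the coarse-to-fine behavior described above.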
ISBN (Print): 9789819785100; 9789819785117
Modern Large Visual Language Models (LVLMs) can transfer the powerful abilities of Large Language Models (LLMs) to visual domains by combining LLMs with a pre-trained visual encoder, and can also leverage the in-context learning that originates from LLMs to achieve remarkable performance on the Text-based Visual Question Answering (TextVQA) task. However, the alignment process between vision and language requires a significant amount of training resources. This study introduces SETS (short for Show Exemplars and Tell me what you See), a straightforward yet effective in-context learning framework for TextVQA. SETS consists of two components: an LLM for reasoning and decision-making, and a set of external tools that extract visual entities in scene images, including scene text and objects, to assist the LLM. More specifically, SETS selects visual entities relevant to questions, constructs their spatial relationships, and customizes task-specific instructions. Furthermore, given these instructions, a two-round inference strategy is applied to automatically choose the final predicted answer. Extensive experiments on three widely used TextVQA datasets demonstrate that SETS enables frozen LLMs such as Vicuna and LLaMA2 to achieve superior performance compared with their LVLM counterparts.
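A minimal sketch of the two-round inference flow described above: visual entities extracted by external tools (scene text, objects) are serialized into a prompt, a first LLM pass proposes candidate answers, and a second pass commits to a final answer. The `llm_generate` callable is a hypothetical stand-in for whichever frozen LLM is used, and the prompt wording is illustrative, not the paper's exact template.

```python
from typing import Callable, Dict, List

def serialize_entities(entities: List[Dict]) -> str:
    # Each entity: {"label": str, "box": (x1, y1, x2, y2)} from OCR / object detection tools.
    lines = []
    for e in entities:
        x1, y1, x2, y2 = e["box"]
        lines.append(f'- "{e["label"]}" at ({x1}, {y1}, {x2}, {y2})')
    return "\n".join(lines)

def two_round_answer(question: str,
                     entities: List[Dict],
                     exemplars: str,
                     llm_generate: Callable[[str], str]) -> str:
    context = serialize_entities(entities)
    # Round 1: reason over the serialized entities and propose candidate answers.
    prompt1 = (f"{exemplars}\n\nScene entities:\n{context}\n\n"
               f"Question: {question}\nList up to 3 candidate answers, one per line.")
    candidates = llm_generate(prompt1)
    # Round 2: ask the model to choose a single final answer from its own candidates.
    prompt2 = (f"Scene entities:\n{context}\n\nQuestion: {question}\n"
               f"Candidate answers:\n{candidates}\n"
               f"Reply with only the single best answer.")
    return llm_generate(prompt2).strip()
```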
ISBN (Print): 9789819788576; 9789819788583
Head detection is a challenging and widely applied object detection task. Although previous CNN-based head detectors have made good progress, the inherent locality of CNNs restricts the extraction of global contextual information, which leads to low precision and recall in head detection. In this article, we propose an end-to-end high-quality head detector based on the Transformer, which effectively models the contextual relationships between heads, other objects, and the background. To extract and generate discriminative feature maps suitable for detecting small head targets, we incorporate specific CNN-based auxiliary detector heads for joint training. The GIoU-aware classification loss function is improved to generate bounding boxes with high localization quality and high classification confidence, and a feature fusion module is introduced to enhance the feature representation capabilities of the model. We conduct experiments on the COCO 2017 dataset and the Brainwash head dataset, and the results demonstrate that our method outperforms previous CNN-based detectors as well as other current mainstream Transformer-based object detection models on both COCO general object detection and Brainwash head detection.
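For intuition about what a GIoU-aware classification loss can look like, here is a hedged sketch in which the positive-class target is scaled by the GIoU between a prediction and its matched ground-truth box, so well-localized boxes are pushed toward higher classification confidence. This is a soft-target formulation in the spirit of the abstract; the authors' exact weighting is not specified here.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def giou_aware_cls_loss(cls_logits, pred_boxes, gt_boxes, gt_labels):
    """
    cls_logits: (N, num_classes) raw logits for N matched predictions
    pred_boxes, gt_boxes: (N, 4) boxes in (x1, y1, x2, y2) format
    gt_labels: (N,) class indices of the matched ground truths
    """
    # Element-wise GIoU for matched pairs, mapped from [-1, 1] to [0, 1].
    giou = generalized_box_iou(pred_boxes, gt_boxes).diagonal()
    quality = (giou + 1.0) / 2.0
    # Soft one-hot targets whose positive entry equals the localization quality.
    targets = torch.zeros_like(cls_logits)
    targets[torch.arange(len(gt_labels)), gt_labels] = quality.detach()
    return F.binary_cross_entropy_with_logits(cls_logits, targets, reduction="mean")
```

The design intent is simply to couple the two objectives named in the abstract: a box can only obtain a high classification score if it is also well localized.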
ISBN (Print): 9789819784950; 9789819784967
Deep neural networks have displayed promising performance in various fields, including biometrics, medical image processing and analysis, as well as dental healthcare. However, deep learning solutions have not yet become the norm in routine dental practice. This is mainly due to the scarcity of dental datasets. To address this challenge, we have built a dataset called the Quadruple Dental X-ray Panoramic (Quad-DXP) Dataset, specifically targeted at the recognition of dental disease and treatment. This dataset annotates nine types of dental issues (disease or treatment) and is the dental panorama dataset with the most abundant annotation types to date. We further propose a framework for identifying dental pathological issues on panoramic radiographs. This framework takes a panoramic X-ray image as input, feeds it into a series of neural network modules, and then produces recognition results for dental disease/treatment and enumeration detection. We have achieved satisfactory experimental results under the supervision of dentists and experts, which proves the effectiveness and reliability of our framework in dental diagnosis. This work can assist dentists in formulating treatment plans and improving dental healthcare.
ISBN (Print): 9789819784981; 9789819784998
Face Forgery Detection (FFD) plays a pivotal role in preserving privacy and bolstering information security by identifying counterfeit face images sourced from the internet. However, FFD encounters a significant challenge in terms of its limited capacity to generalize across diverse datasets due to the striking similarities between genuine and forged images. To tackle this issue, this paper introduces a novel approach known as Multi-level Distributional Discrepancy Enhancement (MDDE). The primary objective of MDDE is to discern variations in the distribution patterns of real and fake data at multiple levels of latent representation. To further enhance its generalization capability, we incorporate a deformable convolution module that extracts intricate features from genuine images. The integration of this module equips MDDE with the ability to generalize to a broader range of samples. Extensive experiments conducted on diverse datasets verify the efficacy of our proposed method and its superior performance compared to several state-of-the-art techniques.
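The deformable convolution module mentioned above can be sketched with the standard torchvision operator; the channel widths and the placement of the offset predictor below are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A plain conv predicts per-location sampling offsets (2 values per kernel tap).
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset_pred(x)
        return self.act(self.deform_conv(x, offsets))

if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)   # toy feature map
    block = DeformableBlock(64, 128)
    print(block(feat).shape)            # torch.Size([2, 128, 56, 56])
```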
ISBN (Print): 9798350376371; 9798350376364
Passive acoustic monitoring is essential for monitoring cetaceans in their natural habitats. In this paper, we consider the monitoring of fin whales (Balaenoptera physalus). The deployment of automated tools for this purpose is essential to efficiently handle and analyse the vast amounts of data generated by hundreds of hours of recordings, something humans cannot do. We present two automated detection methods: one based on a convolutional neural network (CNN) classifier and one based on a circle detection technique. Both use spectrograms of the recordings as input, converting the sounds into images. The first method consists of a two-stage R-CNN classifier with 26 layers. The second method is an image-based technique that uses classical computer vision algorithms based on the morphology of the pulses. Both approaches demonstrate good performance, with circle detection showing better results even though it is the simpler method. The results obtained on a large dataset demonstrate that the proposed approach is highly effective in detecting and characterising animals in their habitats, thus offering valuable information to identify seasonal patterns.
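To make the classical image-based pipeline concrete, the sketch below converts a recording into a spectrogram image and searches it for circular pulse-like blobs with a Hough circle transform. The sample rate, frequency band, smoothing, and Hough parameters are illustrative assumptions that would need tuning on real fin whale recordings, and the paper's own circle detection technique may differ.

```python
import numpy as np
import cv2
from scipy.signal import spectrogram

def detect_pulses(audio: np.ndarray, fs: int = 250):
    # Fin whale 20 Hz pulses sit at low frequencies, so a low sample rate suffices.
    f, t, Sxx = spectrogram(audio, fs=fs, nperseg=256, noverlap=192)
    # Convert the spectrogram to an 8-bit image (log scale, normalized to 0..255).
    img = 10 * np.log10(Sxx + 1e-12)
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    img = cv2.GaussianBlur(img, (5, 5), 0)
    circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=10,
                               param1=100, param2=20, minRadius=2, maxRadius=15)
    return [] if circles is None else circles[0]

if __name__ == "__main__":
    noise = np.random.randn(250 * 60).astype(np.float32)  # one minute of toy audio
    print(len(detect_pulses(noise)), "candidate pulse detections")
```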
ISBN (Print): 9789819784929; 9789819784936
Salient object detection (SOD) models based on the UNet or FCN structure have reached a significant milestone, and the addition of edge constraints to SOD models has progressively become common practice in current methods. Although these methods produce excellent results, they still lack sufficient confidence in regions with sharp object edges owing to sample imbalance. In addition, compressing the encoded features to lower dimensions to reduce computational cost, a commonly used practice, unavoidably diminishes the model's precision. To overcome these issues, we propose a feature mutual feedback network (FMFNet) for the SOD task, in which a semantic supplement module (SSM) integrates diverse feature information through different receptive fields to preserve important features. In addition, we provide a novel details map, which can better serve as an edge map to help the model learn the hard edge regions, resulting in more complete saliency maps. Multiple experiments on five benchmark datasets indicate the effectiveness, robustness, and superiority of the proposed model and details map.
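One common way to integrate feature information across different receptive fields, the general idea behind the semantic supplement module, is a set of parallel dilated convolutions whose outputs are concatenated and fused. The layout below is an assumption for illustration, not the paper's exact SSM architecture.

```python
import torch
import torch.nn as nn

class MultiReceptiveField(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees the same input at a different receptive field, then fuse.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

if __name__ == "__main__":
    print(MultiReceptiveField(256, 64)(torch.randn(1, 256, 32, 32)).shape)
```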
ISBN (Print): 9789819785070; 9789819785087
Existing generalizable object pose estimation frameworks utilize a set of reference images to predict the complete pose of the target object in a query scene, which does not require textured CAD models to generate training data and can handle unseen novel objects during inference. However, current methods suffer from insufficient discriminative capability due to the template matching strategy. Both potential distractors and negative samples with similar appearance can be confused with the foreground, which limits performance on precise pose estimation. To address these problems, we propose a novel method called ESD-Pose to enhance the discrimination capacity of the framework. Specifically, a semantic interaction aware (SIA) module is introduced to seek semantic consistency among reference images and discrepancies between reference-query pairs. This module mitigates problems related to model deception caused by distractors. To deal with slender objects robustly, we propose a dynamic scale weight learner that generates adaptive weights for multi-scale feature fusion, enabling reasonable utilization of semantic information at different levels. Finally, an IoU-guided loss is designed to align localization and scale prediction, thus facilitating accurate pose estimation. Comprehensive experiments on the LINEMOD and GenMOP datasets demonstrate that ESD-Pose outperforms existing advanced methods, further validating the effectiveness of our method.
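A dynamic scale weight learner of the kind described above can be sketched as a small network that predicts per-scale fusion weights from globally pooled features. The pooling and MLP design below are illustrative assumptions; the paper's exact learner may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicScaleFusion(nn.Module):
    def __init__(self, channels, num_scales):
        super().__init__()
        self.weight_mlp = nn.Sequential(
            nn.Linear(channels * num_scales, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, num_scales))

    def forward(self, feats):
        # feats: list of (B, C, Hi, Wi) maps; resize all to the finest resolution.
        h, w = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                 for f in feats]
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)  # (B, C*S)
        weights = torch.softmax(self.weight_mlp(pooled), dim=1)         # (B, S)
        stacked = torch.stack(feats, dim=1)                             # (B, S, C, H, W)
        return (weights[:, :, None, None, None] * stacked).sum(dim=1)

if __name__ == "__main__":
    f1, f2, f3 = (torch.randn(2, 64, s, s) for s in (64, 32, 16))
    print(DynamicScaleFusion(64, 3)([f1, f2, f3]).shape)  # torch.Size([2, 64, 64, 64])
```

Because the weights depend on the input features, objects with unusual aspect ratios (e.g. slender objects) can emphasize whichever scale carries the most useful semantic information.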
ISBN (Print): 9789819785049; 9789819785056
In the domain of computer vision, Transformers have shown great promise, yet they face difficulties when trained from scratch on small datasets, often underperforming compared to convolutional neural networks (ConvNets). Our work highlights that Vision Transformers (ViTs) suffer from unfocused attention when trained on limited datasets. This insight has catalyzed the development of our Swelling ViT framework, an adaptive training strategy that initializes the ViT with a local attention window and allows it to expand gradually during training. This approach enables the model to learn local features more easily, thereby mitigating the attention dispersion phenomenon. Our empirical evaluation of Swelling ViT-B on the CIFAR-100 dataset has yielded remarkable results, achieving an accuracy of 82.60% after 300 epochs from scratch and further improving to 83.31% with 900 epochs of training. These outcomes not only signify state-of-the-art performance but also underscore Swelling ViT's capability to effectively address the attention dispersion issue, particularly on small datasets. Moreover, the robustness of our Swelling ViT is affirmed by its consistent performance on the extensive ImageNet dataset, confirming that the strategy does not compromise effectiveness when scaled to larger data regimes. This work, therefore, not only bridges the gap in data efficiency for ViT models but also introduces a versatile solution that can be readily adapted to various domains, regardless of data availability.
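The core idea of an expanding ("swelling") local attention window can be illustrated with a mask that restricts attention to nearby tokens and grows with training progress. The 1D token-distance windowing, the linear growth schedule, and the omission of special handling for the class token are simplifying assumptions, not the paper's exact configuration.

```python
import torch

def local_window_mask(num_tokens: int, window: int) -> torch.Tensor:
    # True where attention is allowed: |i - j| <= window.
    idx = torch.arange(num_tokens)
    return (idx[None, :] - idx[:, None]).abs() <= window

def swelling_window(epoch: int, total_epochs: int, min_win: int = 2, max_win: int = 196) -> int:
    # Linearly expand the window from min_win to max_win over training.
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return int(min_win + frac * (max_win - min_win))

def masked_attention(q, k, v, mask):
    # q, k, v: (B, heads, N, d); mask: (N, N) boolean, True = attend.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    q = k = v = torch.randn(1, 4, 197, 64)          # ViT-like token sequence
    mask = local_window_mask(197, swelling_window(epoch=0, total_epochs=300))
    print(masked_attention(q, k, v, mask).shape)    # torch.Size([1, 4, 197, 64])
```

Early in training the mask forces attention onto local neighborhoods; by the end the window covers all tokens, recovering standard global self-attention.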
ISBN (Print): 9789819787913; 9789819787920
In the field of autonomous driving, profound scene understanding is crucial, and semantic segmentation of LiDAR point clouds plays a key role in this context. A prevalent issue in point cloud datasets is the imbalance in class distribution. To address this, we introduce the InstanceAug data augmentation pipeline, which balances the class distribution by duplicating instances within scenes. This approach significantly enhances the robustness of our model. Deep learning models for point cloud processing often use sparse convolution for efficiency, but this limits feature transmission and the receptive field. Building on the strengthened dataset, we present KA-Seg, an innovative attention-based framework. KA-Seg refines sparse voxel features to further enhance robustness. Its core feature is an attention mechanism with super-voxel partitioning and key point subsampling, which greatly improves the model's ability to identify complex spatial patterns and focus on important voxel regions. Inspired by the Transformer architecture, KA-Seg utilizes learnable key point sampling for global feature querying, expanding the model's spatial understanding. This method augments spatial information processing across the point cloud and achieves a 1.3% higher mean intersection over union (mIoU) on the test set compared to the baseline model. Our code is publicly available at https://***/cvkdnk/kaseg.
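The class-balancing idea behind InstanceAug, duplicating under-represented instances within a scene, can be sketched as follows. Which classes count as rare, the number of copies, and the translation range are illustrative assumptions; the released pipeline likely handles instance boundaries and collisions more carefully.

```python
import numpy as np

def instance_aug(points: np.ndarray, labels: np.ndarray,
                 rare_classes=(2, 5), copies: int = 1, shift: float = 5.0,
                 rng=np.random.default_rng()):
    """points: (N, 3+) array, labels: (N,) per-point semantic labels."""
    new_pts, new_lbls = [points], [labels]
    for cls in rare_classes:
        mask = labels == cls
        if not mask.any():
            continue
        for _ in range(copies):
            dup = points[mask].copy()
            dup[:, :2] += rng.uniform(-shift, shift, size=2)  # random planar translation
            new_pts.append(dup)
            new_lbls.append(labels[mask])
    return np.concatenate(new_pts), np.concatenate(new_lbls)

if __name__ == "__main__":
    pts = np.random.rand(1000, 3) * 50
    lbl = np.random.randint(0, 10, size=1000)
    aug_pts, aug_lbl = instance_aug(pts, lbl)
    print(aug_pts.shape, np.bincount(aug_lbl))
```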