ISBN:
(Print) 9798350353006
The rapid advancement of deep learning models is often attributed to their ability to leverage massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning, mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, due to the large domain gap between 3D point cloud datasets, such mixed supervision could adversely affect the model and lead to degraded performance (i.e., negative transfer) compared to single-dataset training. In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts, and Language-guided Categorical Alignment, which unifies the label spaces of the multiple datasets by leveraging the relationships between label texts. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when serving as a pre-training framework, it outperforms other pre-training approaches in representation quality and attains remarkable state-of-the-art performance across more than ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios.
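As a rough illustration of the prompt-driven normalization idea described above (a minimal sketch under assumptions, not the released PPT code), the snippet below conditions a shared normalization layer on a learned per-dataset prompt: one embedding per dataset is mapped to the affine scale and shift, so a single weight-shared backbone can adapt to each domain. All class and parameter names here are hypothetical.

import torch
import torch.nn as nn

class PromptDrivenNorm(nn.Module):
    """Normalization whose affine parameters come from a per-dataset prompt."""
    def __init__(self, num_channels: int, num_datasets: int, prompt_dim: int = 64):
        super().__init__()
        self.norm = nn.BatchNorm1d(num_channels, affine=False)
        self.prompts = nn.Embedding(num_datasets, prompt_dim)     # one learned prompt per dataset
        self.to_affine = nn.Linear(prompt_dim, 2 * num_channels)  # prompt -> (gamma, beta)

    def forward(self, x: torch.Tensor, dataset_id: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N) point features; dataset_id: (B,) integer domain labels
        gamma, beta = self.to_affine(self.prompts(dataset_id)).chunk(2, dim=-1)
        x = self.norm(x)
        return x * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)

feats = torch.randn(8, 32, 1024)                          # toy batch: 8 clouds, 32 channels, 1024 points
layer = PromptDrivenNorm(num_channels=32, num_datasets=3)
out = layer(feats, torch.zeros(8, dtype=torch.long))      # all samples drawn from dataset 0

In this reading, the prompt modulates only the normalization statistics, leaving every other weight shared across datasets.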
The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend, which straightforwardly resorts to transferring 2D alignment strategies to the 3D domain, encounters three distinct challenges: (1) Information degradation: this arises from aligning 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation-transfer approach.
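To make the "joint" alignment concrete, here is a minimal sketch assuming a symmetric InfoNCE objective in which the 3D embedding is contrasted against a single fused image-text target rather than against each modality separately; the fusion rule (a normalized sum) and all names are illustrative guesses, not JM3D's actual JMA module.

import torch
import torch.nn.functional as F

def joint_alignment_loss(point_emb, image_emb, text_emb, temperature=0.07):
    # point_emb, image_emb, text_emb: (B, D) L2-normalized embeddings
    joint = F.normalize(image_emb + text_emb, dim=-1)     # single fused vision-language target
    logits = point_emb @ joint.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(point_emb.size(0), device=point_emb.device)
    # symmetric InfoNCE: each shape matches its own fused target, and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

B, D = 16, 256
p, i, t = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(3))
loss = joint_alignment_loss(p, i, t)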
ISBN:
(Print) 9781665405409
The success of supervised deep learning heavily depends on large labeled datasets, whose construction is often challenging in medical image analysis. Contrastive learning, a variant of self-supervised learning, is a potential solution to alleviate the strong demand for data annotation. In this work, we extend the contrastive learning framework to 3D volumetric medical imaging. Specifically, we propose (1) a multiview contrasting strategy to maximize the mutual information between three views of a 3D image to learn global representations, and (2) a long-short spatial contrasting strategy to learn local representations by matching a short spatial clip to a long spatial clip in the latent space. To combine these two strategies, we propose the multiview long-short spatial contrastive learning (MLSSCL) framework, which can effectively learn generic 3D representations. Our extensive experiments on two brain Magnetic Resonance Imaging (MRI) datasets demonstrate that MLSSCL significantly outperforms learning from scratch and other self-supervised learning methods on both classification and segmentation tasks.
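A hedged sketch of the two objectives follows: one InfoNCE term pulls together the embeddings of three orthogonal views of the same scan (multiview contrast), and a second term matches a short spatial clip's embedding to that of the long clip containing it (long-short spatial contrast). The encoders, shapes, and equal weighting of the terms are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    # a, b: (B, D) embeddings; row i of `a` is the positive for row i of `b`
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def mlsscl_loss(view_ax, view_co, view_sa, short_emb, long_emb):
    # multiview term: pull axial/coronal/sagittal embeddings of the same volume together
    mv = (info_nce(view_ax, view_co) + info_nce(view_ax, view_sa)
          + info_nce(view_co, view_sa)) / 3.0
    # long-short spatial term: a short clip should match the long clip that contains it
    ls = info_nce(short_emb, long_emb)
    return mv + ls

B, D = 8, 128
loss = mlsscl_loss(*(torch.randn(B, D) for _ in range(5)))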
Most existing 3D object classification and retrieval algorithms rely on one-off supervised learning on closed 3D object sets and tend to produce rigid convolutional neural networks with little scalability. Such limitations substantially restrict their potential to continually learn newly emerging 3D object classes in the real world. Aiming to go beyond these limitations, we propose two new and challenging tasks: class-incremental 3D object classification (CI-3DOC) and class-incremental 3D object retrieval (CI-3DOR), the key to which is class-incremental 3D representation learning. It expects the network to update continually, learning new 3D class representations without forgetting previously learned ones. To this end, we design a novel balanced distillation network (BdNet) that uses a dual supervision mechanism to carefully balance consolidating old knowledge (stability) against adapting to new 3D object classes (plasticity). On the one hand, we employ stability-based supervision to retain the stable and discriminative information of old classes, which greatly benefits both classification and retrieval. On the other hand, we use plasticity-based supervision to improve the network's generalization for learning new-class 3D representations by transferring knowledge from a temporary teacher network to the current model. By properly handling the relationship between the two modules, we achieve a surprising performance improvement. Furthermore, since no dataset is available for evaluating these two new tasks, we build two 3D datasets, INOR-1 and INOR-2. Extensive experimental results demonstrate that our method significantly outperforms other state-of-the-art class-incremental learning methods. Even when storing 500-1000 fewer 3D objects than SOTA methods, BdNet still achieves comparable performance.
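The dual supervision mechanism might be sketched as below, assuming logit-level distillation: a frozen copy of the previous-stage model anchors the old-class outputs (stability), while a temporary teacher guides the new-class outputs (plasticity), with a weight lam trading the two off. The loss decomposition, temperature, and weighting are hypothetical stand-ins, not BdNet's published formulation.

import torch
import torch.nn.functional as F

def balanced_distillation_loss(student_logits, old_logits, teacher_logits,
                               labels, num_old, lam=0.5, T=2.0):
    # student_logits: (B, K_old + K_new); old_logits: frozen previous model, (B, K_old)
    # teacher_logits: temporary teacher over all classes, (B, K_old + K_new)
    ce = F.cross_entropy(student_logits, labels)   # plain classification on all seen classes
    # stability: keep old-class predictions close to the frozen old model
    stab = F.kl_div(F.log_softmax(student_logits[:, :num_old] / T, dim=-1),
                    F.softmax(old_logits / T, dim=-1), reduction="batchmean") * T * T
    # plasticity: follow the temporary teacher on the new classes
    plas = F.kl_div(F.log_softmax(student_logits[:, num_old:] / T, dim=-1),
                    F.softmax(teacher_logits[:, num_old:] / T, dim=-1),
                    reduction="batchmean") * T * T
    return ce + lam * stab + (1.0 - lam) * plas

B, k_old, k_new = 4, 10, 5
loss = balanced_distillation_loss(torch.randn(B, k_old + k_new), torch.randn(B, k_old),
                                  torch.randn(B, k_old + k_new),
                                  torch.randint(0, k_old + k_new, (B,)), num_old=k_old)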
ISBN:
(Print) 9798350362466; 9798350362459
3D scene graphs are an emerging 3D scene representation that models both the objects present in a scene and their relationships. However, learning 3D scene graphs is a challenging task because it requires not only object labels but also relationship annotations, which are very scarce in datasets. While it is widely accepted that pre-training is an effective approach to improving model performance in low-data regimes, in this paper we find that existing pre-training methods are ill-suited for 3D scene graphs. To solve this issue, we present the first language-based pre-training approach for 3D scene graphs, whereby we exploit the strong relationship between scene graphs and language. To this end, we leverage the language encoder of CLIP, a popular vision-language model, to distill its knowledge into our graph-based network. We formulate a contrastive pre-training that aligns text embeddings of relationships (subject-predicate-object triplets) with predicted 3D graph features. Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark, showing improved effectiveness over pre-training baselines and outperforming all existing fully supervised scene graph prediction methods by a significant margin. Furthermore, since our scene graph features are language-aligned, we can query them in the language space in a zero-shot manner. As an example of this property, we show how to predict the room type of a scene without further training.
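A minimal sketch of this contrastive language-graph pre-training follows. It assumes the open-source clip package for the frozen CLIP text encoder and uses random tensors as stand-ins for the graph network's predicted edge features; the triplet phrasings and temperature are illustrative.

import torch
import torch.nn.functional as F
import clip  # open-source package from https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# text embeddings of subject-predicate-object triplets (frozen CLIP text encoder)
triplets = ["chair standing on floor", "lamp attached to wall"]
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize(triplets).to(device)).float(), dim=-1)

# stand-in for the graph network's predicted edge features, shape (B, 512)
graph_emb = F.normalize(torch.randn(len(triplets), 512, device=device), dim=-1)

logits = graph_emb @ text_emb.t() / 0.07                  # contrastive alignment
labels = torch.arange(len(triplets), device=device)
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))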
ISBN:
(Digital) 9783031198274
ISBN:
(Print) 9783031198267; 9783031198274
Recent advances in 3D semantic segmentation with deep neural networks have shown remarkable success, with rapid performance increases on available datasets. However, current 3D semantic segmentation benchmarks contain only a small number of categories (fewer than 30 for ScanNet and SemanticKITTI, for instance), which is not enough to reflect the diversity of real environments (e.g., semantic image understanding covers hundreds to thousands of classes). Thus, we propose to study a larger vocabulary for 3D semantic segmentation with a new extended benchmark on ScanNet data with 200 class categories, an order of magnitude more than previously studied. This large number of class categories also induces a large natural class imbalance, and both properties are challenging for existing 3D semantic segmentation methods. To learn more robust 3D features in this context, we propose a language-driven pre-training method that encourages learned 3D features with limited training examples to lie close to their pre-trained text embeddings. Extensive experiments show that our approach consistently outperforms state-of-the-art 3D pre-training for 3D semantic segmentation on our proposed benchmark (+9% relative mIoU), including limited-data scenarios with +25% relative mIoU using only 5% of the annotations.
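One plausible reading of this language-driven pre-training, sketched below under assumptions: each point feature is classified against the frozen text embeddings of all 200 class names, which pushes features of rarely annotated classes toward their pre-trained text anchors. The loss form and the random placeholder embeddings are illustrative, not the paper's exact objective.

import torch
import torch.nn.functional as F

def language_anchored_loss(point_feats, point_labels, class_text_emb, temperature=0.07):
    # point_feats: (N, D) learned 3D features; class_text_emb: (K, D) frozen text embeddings
    point_feats = F.normalize(point_feats, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    logits = point_feats @ class_text_emb.t() / temperature   # (N, K) point-to-class similarities
    return F.cross_entropy(logits, point_labels)              # pull each point to its class text anchor

N, D, K = 4096, 512, 200
loss = language_anchored_loss(torch.randn(N, D), torch.randint(0, K, (N,)), torch.randn(K, D))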
Motivation: Protein model quality assessment (ProteinQA) is a fundamental task that is essential for biologically relevant applications, e.g., protein structure refinement and protein design. Previous works conducted ProteinQA only at the global-structure or per-residue level, ignoring potentially usable and precise cues from a fine-grained per-atom perspective. In this study, we propose an atom-level ProteinQA model, named Atom-ProteinQA, in which two innovative modules are designed to extract geometric and topological atom-level relationships, respectively. Specifically, on the one hand, a geometric perception module exploits 3D sparse convolution to capture the geometric features of the input protein, generating fine-grained atom-level predictions. On the other hand, natural chemical bonds are used to construct an atom-level graph, and message passing in a topological perception module outputs residue-level predictions in parallel. Finally, through a cross-model aggregation module, features from the different modules interact with each other, enhancing performance at both the atom and residue levels. Results: Extensive experiments show that our proposed Atom-ProteinQA outperforms previous methods by a large margin on both residue-level and atom-level assessment. Concretely, we achieve state-of-the-art performance on CATH-2084, decoy-8000, the public benchmarks CASP13 & CASP14, and CAMEO. Availability: The repository of this project is released at https://github.com/luyfcandy/Atom_ProteinQA.
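As a toy sketch of the topological branch's core idea (hypothetical shapes and names, not the released Atom_ProteinQA code), the snippet below treats atoms as nodes and chemical bonds as directed edges and runs one round of mean-aggregated message passing over the bond graph.

import torch
import torch.nn as nn

class BondMessagePassing(nn.Module):
    """One round of message passing over an atom-level bond graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, atom_feats, bonds):
        # atom_feats: (A, D) atom features; bonds: (E, 2) index pairs of bonded atoms
        src, dst = bonds[:, 0], bonds[:, 1]
        messages = self.msg(atom_feats[src])                              # (E, D) bond messages
        agg = torch.zeros_like(atom_feats).index_add_(0, dst, messages)   # sum messages per atom
        deg = torch.zeros(atom_feats.size(0), 1).index_add_(
            0, dst, torch.ones(bonds.size(0), 1)).clamp(min=1)            # in-degree for mean aggregation
        return self.upd(torch.cat([atom_feats, agg / deg], dim=-1))       # update with mean message

atoms = torch.randn(10, 32)                       # 10 atoms, 32-dim features
bonds = torch.tensor([[0, 1], [1, 2], [2, 3]])    # toy chemical-bond edge list
out = BondMessagePassing(32)(atoms, bonds)

Residue-level predictions would then follow from pooling the updated atom features within each residue, as the abstract's parallel prediction branch suggests.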