ISBN (digital): 9798350365474
ISBN (print): 9798350365481
We ask whether 3D objects can be reconstructed from real-world data collected for some other purpose, such as autonomous driving or augmented reality, thus inferring objects only incidentally. 3D reconstruction from incidental data is a major challenge because, in addition to significant noise, only a few views of each object are observed, which are insufficient for reconstruction. We approach this problem as a co-reconstruction task, in which multiple objects are reconstructed together, learning shape and appearance priors for regularization. To do so, we introduce a neural radiance field that is conditioned, via an attention mechanism, on the identity of the individual objects. We further disentangle shape from appearance, and diffuse color from specular color, via an asymmetric two-stream network that factors shared information from instance-specific details. We demonstrate the ability of this method to reconstruct full 3D objects from partial, incidental observations in autonomous driving and other datasets.
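As a rough illustration of how such identity conditioning might be wired up, the sketch below conditions a radiance-field MLP on per-object latent codes through cross-attention and splits the output into a shared density/diffuse stream and an instance-specific specular stream. All names, layer sizes, and the 63-dimensional positional encoding are our own assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IdentityConditionedField(nn.Module):
    """Sketch: radiance field conditioned on object identity codes
    via cross-attention (hypothetical sizes, not the paper's model)."""
    def __init__(self, num_objects, code_dim=64, hidden=128):
        super().__init__()
        self.codes = nn.Embedding(num_objects, code_dim)  # one code per instance
        self.attn = nn.MultiheadAttention(code_dim, num_heads=4, batch_first=True)
        self.pos_proj = nn.Linear(63, code_dim)  # positionally encoded 3D point
        # asymmetric two streams: shared shape/diffuse vs. instance specular
        self.shared = nn.Sequential(nn.Linear(code_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1 + 3))  # density + diffuse RGB
        self.specular = nn.Sequential(nn.Linear(2 * code_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 3))    # specular RGB

    def forward(self, x_enc, obj_id):
        q = self.pos_proj(x_enc).unsqueeze(1)              # (B, 1, D) query per point
        kv = self.codes.weight.unsqueeze(0).expand(q.size(0), -1, -1)
        ctx, _ = self.attn(q, kv, kv)                      # attend over identity codes
        ctx = ctx.squeeze(1)
        sigma_diffuse = self.shared(ctx)                   # shared-stream output
        spec = self.specular(torch.cat([ctx, self.codes(obj_id)], dim=-1))
        return sigma_diffuse[:, :1], sigma_diffuse[:, 1:], spec
```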
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Diffusion models have found valuable applications in anomaly detection by capturing the nominal data distribution and identifying anomalies via reconstruction. Despite their merits, they struggle to localize anomalies of varying scales, especially larger anomalies such as entire missing components. Addressing this, we present a novel framework that enhances the capability of diffusion models by extending the previously introduced implicit conditioning approach [24] in three significant ways. First, we incorporate a dynamic step-size computation that allows for variable noising steps in the forward process, guided by an initial anomaly prediction. Second, we demonstrate that denoising an input that is only scaled, without any added noise, outperforms the conventional denoising process. Third, we project images into a latent space to abstract away from fine details that interfere with the reconstruction of large missing components. Additionally, we propose a fine-tuning mechanism that helps the model grasp the nuances of the target domain. Our method undergoes rigorous evaluation on the prominent anomaly detection datasets VisA, BTAD, and MVTec, yielding strong performance. Importantly, our framework effectively localizes anomalies regardless of their scale, marking a pivotal advancement in diffusion-based anomaly detection.
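The first two ideas lend themselves to a compact sketch: an initial anomaly estimate picks the noising depth, and the reverse chain starts from an input that is only scaled by the forward-process mean coefficient, with no Gaussian noise added. The step bounds, the `eps_model(x, t)` signature, and the plain-DDPM reverse update are our assumptions; the paper's exact schedule may differ.

```python
import torch

def dynamic_num_steps(anomaly_score, t_min=50, t_max=400):
    # Larger initial anomaly estimates get a deeper noising depth so the
    # model can "repaint" larger missing components (bounds are illustrative).
    s = float(torch.clamp(anomaly_score, 0.0, 1.0))
    return int(t_min + s * (t_max - t_min))

@torch.no_grad()
def denoise_scaled_input(eps_model, betas, x0, t_start):
    """Start the reverse chain from a scaled (not noised) input.

    `eps_model(x, t)` is an assumed noise-prediction interface; `betas`
    is the forward-process variance schedule as a 1-D tensor."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = alpha_bar[t_start].sqrt() * x0              # scaling only, no added noise
    for t in range(t_start, 0, -1):
        eps = eps_model(x, torch.full((x.shape[0],), t))
        coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()     # standard DDPM posterior mean
        if t > 1:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```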
Modern agricultural applications increasingly rely on deep learning solutions. However, training well-performing deep networks requires a large amount of annotated data that may not be available and, in the case of 3D annotation, may not even be feasible for human annotators. In this work, we develop a deep learning approach to segment mushrooms and estimate their pose from 3D data, in the form of point clouds acquired by depth sensors. To circumvent the annotation problem, we create a synthetic dataset of mushroom scenes for which full 3D information, such as the pose of each mushroom, is known. The proposed network has a fully convolutional backbone that parses sparse 3D data and predicts pose information that implicitly defines both the instance segmentation and pose estimation tasks. We validate the effectiveness of the proposed implicit approach on a synthetic test set and provide qualitative results on a small set of real point clouds acquired with depth sensors.
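One common way to make per-point pose predictions implicitly define instances is centre voting: each point predicts an offset to its mushroom's centre, and clustering the shifted points yields both instance labels and centre estimates. The sketch below illustrates that idea; the voting scheme and DBSCAN parameters are our assumptions, not necessarily the paper's prediction head.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def instances_from_votes(points, pred_offsets, eps=0.01, min_samples=20):
    """Per-point votes for the mushroom centre implicitly define instances:
    shifted points from the same mushroom collapse together, so a
    density-based clustering recovers instance labels and centres."""
    votes = points + pred_offsets                 # each point votes for its centre
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(votes)
    centres = [votes[labels == k].mean(axis=0) for k in range(labels.max() + 1)]
    return labels, np.array(centres)              # label -1 marks unassigned points
```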
ISBN (print): 9781665458245
The large-scale use of surveillance cameras in public spaces has raised severe concerns about individual privacy breaches. Introducing privacy and security into video surveillance systems, primarily into person re-identification (re-id), is quite challenging. Event cameras are novel sensors that respond only to brightness changes in the scene. This characteristic makes event-based vision sensors viable for privacy-preserving video surveillance. To integrate privacy into person re-id, this work investigates, for the first time, the possibility of performing person re-id with an event-camera network. We transform the asynchronous event stream generated by an event camera into synchronous image-like representations to leverage deep learning models, and then evaluate how complex the re-id problem is with this new sensor modality. Interestingly, such event-based representations contain meaningful spatial details that are very similar to standard edges and contours. We use two different representations: image-like frames and their transformation to polar coordinates (which carry more distinct edge patterns). Finally, we train a person re-id model on such images to demonstrate the feasibility of performing event-driven re-id. We evaluate the performance of our approach and produce baseline results on two synthetic datasets (generated from the publicly available datasets SAIVT and DukeMTMC-reid).
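A minimal version of the two representations described above, assuming the event stream has already been decoded into coordinate and polarity arrays (function names and normalization choices are ours, not the paper's exact pipeline):

```python
import numpy as np
import cv2

def events_to_frame(xs, ys, ps, h, w):
    """Accumulate signed event polarities into an image-like frame."""
    frame = np.zeros((h, w), np.float32)
    np.add.at(frame, (ys, xs), np.where(ps > 0, 1.0, -1.0))
    frame = cv2.normalize(frame, None, 0, 255, cv2.NORM_MINMAX)
    return frame.astype(np.uint8)

def to_polar(frame):
    """Remap the frame to polar coordinates to emphasise edge patterns."""
    h, w = frame.shape
    centre = (w / 2.0, h / 2.0)
    max_radius = np.hypot(w / 2.0, h / 2.0)
    return cv2.linearPolar(frame, centre, max_radius, cv2.WARP_FILL_OUTLIERS)
```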
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Multi-modality information fusion can compensate for the deficiencies of a single modality and provide rich scene information for 2D semantic segmentation. However, inconsistency in the feature space between different modalities may lead to poor representation of objects, which affects subsequent segmentation effectiveness. Modality transition can reduce modal differences and avoid biased processing during fusion, but it is hard to fully retain the content of the source images. To address these challenges, a fusion method based on dual-cycle cross-awareness of the structure tensor is proposed. First, we propose a dual-cycle modality transition network based on cross-awareness consistency to learn the differences in feature space between modalities. Second, a set of global structure-tensor-preserving modules is designed to enhance the network's ability to capture complementary features and perceive global modal consistency. Under the joint constraint of the global structure-tensor awareness loss and the cross-awareness loss, our network achieves a robust mapping of the feature space from visible to pseudo-infrared images without relying on ground truth. Finally, the pseudo-infrared images, which inherit the superior qualities of both modalities, are fused directly with the original infrared images, effectively reducing the complexity of fusion. Extensive comparative experiments show that our method outperforms state-of-the-art methods in qualitative and quantitative evaluation.
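The structure tensor at the heart of the proposed modules is a classical quantity; below is a small sketch of how it could be computed and compared between a source image and its pseudo-infrared counterpart. The L1 comparison is a hypothetical stand-in for the paper's global structure-tensor awareness loss, not its actual formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def structure_tensor(img, sigma=1.5):
    """Classic 2x2 structure tensor, smoothed component-wise."""
    ix = sobel(img.astype(np.float64), axis=1)   # horizontal gradient
    iy = sobel(img.astype(np.float64), axis=0)   # vertical gradient
    return (gaussian_filter(ix * ix, sigma),
            gaussian_filter(ix * iy, sigma),
            gaussian_filter(iy * iy, sigma))

def structure_tensor_distance(img_a, img_b):
    # Mean absolute difference between the two tensors; a hypothetical
    # stand-in for the paper's global structure-tensor awareness loss.
    return sum(np.abs(a - b).mean()
               for a, b in zip(structure_tensor(img_a), structure_tensor(img_b)))
```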
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Multi-Target Multi-Camera tracking (MTMC) surpasses conventional single-camera tracking by enabling seamless object tracking across multiple camera views. This capability is critical for security systems and for improving situational awareness in various environments. This paper proposes a novel MTMC framework designed for online operation. The framework employs a three-stage pipeline: Multi-Object Tracking (MOT), Multi-Target Multi-Camera tracking (MTMC), and Cross Interval Synchronization (CIS). In the MOT stage, ReID features are extracted and localized tracklets are created. MTMC links these tracklets across cameras using spatial-temporal constraints and constrained hierarchical clustering with anchor features for improved inter-camera association. Finally, CIS ensures the temporal coherence of tracklets across time intervals. The proposed framework achieves robust tracking performance, validated on the challenging 2024 AI City Challenge with a HOTA score of 51.0556%, ranking sixth. The code is available at: https://***/ARV-MLCORE/AIC2024Track1ARV
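To make the inter-camera association step concrete, the sketch below clusters tracklet ReID features hierarchically while forbidding merges between tracklets that overlap in time on the same camera. The distance construction and threshold are illustrative assumptions, not the challenge entry's exact implementation (see the linked repository for that).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_tracklets(feats, cam_ids, t_ranges, thresh=0.4):
    """Group tracklets into global identities by agglomerative clustering.

    Tracklets from the same camera whose time ranges overlap cannot belong
    to the same person, so their pairwise distance is made prohibitive."""
    n = len(feats)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = feats[i], feats[j]
            cos_d = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            overlap = (cam_ids[i] == cam_ids[j]
                       and t_ranges[i][0] < t_ranges[j][1]
                       and t_ranges[j][0] < t_ranges[i][1])
            dist[i, j] = dist[j, i] = 1e6 if overlap else cos_d
    z = linkage(squareform(dist), method='average')
    return fcluster(z, t=thresh, criterion='distance')  # global ID per tracklet
```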
In this paper, we tackle the problem of few-shot class incremental learning (FSCIL). FSCIL aims to incrementally learn new classes with only a few samples in each class. Most existing methods only consider the increme...
Contrastive learning methods have been applied to a range of domains and modalities by training models to identify similar "views" of data points. However, specialized scientific modalities pose a challenge for this paradigm, as identifying good views for each scientific instrument is complex and time-intensive. In this paper, we focus on applying contrastive learning approaches to a variety of remote sensing datasets. We show that Viewmaker networks, a recently proposed method for generating views without extensive domain knowledge, can produce useful views in this setting. We also present a Viewmaker variant called Divmaker, which achieves similar performance and does not require adversarial optimization. Applying both methods to four multispectral imaging problems, each with a different format, we find that Viewmaker and Divmaker can outperform cropping- and reflection-based methods for contrastive learning in every case when evaluated on downstream classification tasks. This provides additional evidence that domain-agnostic methods can empower contrastive learning to scale to real-world scientific domains. Open source code can be found at https://***/jbayrooti/divmaker.
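The core Viewmaker idea of learning budgeted perturbations as views can be sketched compactly: a small network maps an input plus a random noise channel to an additive perturbation, which is projected onto an L1 ball so the view stays close to the original. Layer sizes and the budget value here are illustrative; see the linked repository for the actual Divmaker code.

```python
import torch
import torch.nn as nn

class ViewGenerator(nn.Module):
    """Tiny stand-in for a Viewmaker-style network: an extra random noise
    channel makes views stochastic, and the perturbation is projected onto
    an L1 ball so views stay close to the input."""
    def __init__(self, channels=3, budget=0.05):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))
        self.budget = budget

    def forward(self, x):
        noise = torch.rand_like(x[:, :1])             # stochastic view channel
        delta = self.net(torch.cat([x, noise], dim=1))
        # scale each perturbation so its L1 norm fits the per-pixel budget
        l1 = delta.flatten(1).abs().sum(dim=1) + 1e-8
        scale = (self.budget * x[0].numel() / l1).clamp(max=1.0)
        delta = delta * scale.view(-1, 1, 1, 1)
        return (x + delta).clamp(0.0, 1.0)
```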
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Representation learning from Gigapixel Whole Slide Images (WSI) poses a significant challenge in computational pathology due to the complicated nature of tissue structures and the scarcity of labeled data. Multi-instance learning (MIL) methods have addressed this challenge by leveraging image patches to classify slides, using feature encoders pretrained with Self-Supervised Learning (SSL) approaches. The performance of both SSL and MIL methods relies on the architecture of the feature encoder. This paper proposes leveraging the vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology. We evaluate the performance of Vim against vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification. Our findings highlight Vim's enhanced performance compared to ViT, particularly at smaller scales, where Vim achieves an 8.21-point increase in ROC AUC for models of similar size. An explainability analysis further reveals that Vim, unlike ViT, emulates the pathologist's workflow. This alignment with human expert analysis highlights Vim's potential in practical diagnostic settings and contributes significantly to developing effective representation-learning algorithms in computational pathology. We release the code and pretrained weights at https://***/AtlasAnalyticsLab/Vim4Path.
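For the slide-level stage, a generic gated-attention MIL head over frozen patch embeddings (from Vim or ViT) looks roughly like the following. This is the standard attention-pooling formulation, not necessarily the exact head used in the paper; the embedding dimension is an assumption.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated-attention MIL pooling over patch embeddings from a frozen
    encoder; the standard formulation, not the paper's specific head."""
    def __init__(self, dim=384, hidden=128, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, patches):                    # (n_patches, dim) per slide
        a = self.attn_w(self.attn_v(patches) * self.attn_u(patches))
        a = torch.softmax(a, dim=0)                # attention over instances
        slide_feat = (a * patches).sum(dim=0)      # weighted bag embedding
        return self.cls(slide_feat), a             # slide logits + patch weights
```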
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Deep neural networks have proven susceptible to adversarial-example attacks in recent years. Black-box attacks in particular pose a serious threat to practical applications. However, while most existing black-box attacks achieve a high success rate in deceiving models, they have not focused on the stealthiness of adversarial examples, which often exhibit suspicious visual appearance. To address this issue, this paper proposes the Mask Momentum Iterative Attack (MMIA), which introduces a masking mechanism and adopts an optimal perturbation strategy to identify the regions of an image most vulnerable to attack. This approach effectively ensures both the transferability and the stealthiness of adversarial examples. Simultaneously, by integrating image enhancement techniques and temporal and spatial momentum terms into the iterative process of the attack, we prevent the attack from getting stuck in local optima, further improving the transferability of adversarial examples. To enhance the success rate of black-box attacks, we apply MMIA to a model ensemble using a joint optimization strategy. We demonstrate that adversarially trained models with strong defensive ability are also susceptible to our black-box attacks. We conduct extensive experiments on classification tasks using common vision models, and our results demonstrate the superiority of our method over state-of-the-art approaches when considering both transferability and stealthiness.
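A simplified sketch of the masked momentum iteration, restricted to a precomputed vulnerability mask: the mask-selection, image-enhancement, and ensemble components are omitted, and all hyperparameters are illustrative rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def masked_momentum_attack(model, x, y, mask, eps=8/255, alpha=2/255,
                           steps=10, mu=1.0):
    """Momentum iterative attack restricted to a precomputed mask of
    vulnerable regions (a simplified sketch, not the full MMIA)."""
    x_adv, g = x.clone(), torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # momentum accumulation with L1-normalized gradients (as in MI-FGSM)
        g = mu * g + grad / grad.abs().mean(dim=(1, 2, 3), keepdim=True)
        with torch.no_grad():
            x_adv = x_adv + alpha * g.sign() * mask   # perturb masked region only
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # L-inf projection
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```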