ISBN (Print): 9798350365474
In this paper, we address the challenge of selecting an optimal dataset from an annotated source pool to enhance performance on a target dataset drawn from a different source. This matters in scenarios where on-the-fly dataset annotation is hard to afford, and it is also the theme of the second Visual Data Understanding (VDU) Challenge. Our solution, the Classifier Guided Cluster Density Reduction (CCDR) framework, operates in two stages. First, we employ a filtering technique to identify images that align with the target dataset's distribution. Second, we apply a graph-based cluster density reduction method, steered by a classifier that approximates the distance between the target and source distributions. This classifier is trained to distinguish images that resemble the target dataset from those that do not, facilitating the pruning process shown in Figure 1. Our approach balances selecting pertinent images that match the target distribution against eliminating redundant ones that do not improve the detection model. We demonstrate the superiority of our method over various baselines on object detection tasks, particularly in optimizing the training set distribution on the region100 dataset. Our code is available at https://***/himsR/DataCVChallenge-2024/tree/main
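To make the two-stage recipe concrete, here is a minimal sketch of the idea in Python, not the released CCDR code: a target-vs-source classifier supplies target-likeness scores, and a clustering pass thins out redundant images. Feature extraction, the variable names (`source_feats`, `target_feats`), and the hyperparameters are all assumptions, and the actual method uses a graph-based reduction rather than the plain k-means used here.

```python
# Hedged sketch of the two-stage selection idea (not the authors' code).
# Assumes precomputed image features; all names and settings are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

def select_training_set(source_feats, target_feats, budget, n_clusters=50):
    # Stage 1: train a target-vs-source classifier and keep source images
    # that the classifier scores as target-like.
    X = np.vstack([target_feats, source_feats])
    y = np.concatenate([np.ones(len(target_feats)), np.zeros(len(source_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    target_likeness = clf.predict_proba(source_feats)[:, 1]
    kept = np.argsort(-target_likeness)[: budget * 3]  # loose pre-filter

    # Stage 2: cluster the filtered pool and thin out dense clusters,
    # preferring high-scoring images, to reduce redundancy.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(source_feats[kept])
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = kept[labels == c]
        order = np.argsort(-target_likeness[members])
        selected.extend(members[order][:per_cluster].tolist())
    return selected
```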
ISBN (Print): 9798350301298
Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method that addresses these constraints by producing high-fidelity, diverse 3D shapes without requiring (text, shape) pairs during training. CLIP-Sculptor achieves this through a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space modeled by a transformer conditioned on CLIP's image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines.
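The abstract does not spell out the guidance variant, but the mechanism it builds on, classifier-free guidance over the transformer's discrete token logits, can be sketched in a few lines. The `prior` model and its signature are hypothetical.

```python
# Minimal sketch of standard classifier-free guidance over discrete token
# logits; the paper's variant differs in details not given in the abstract.
import torch

def cfg_logits(prior, tokens, clip_embed, guidance_scale):
    cond = prior(tokens, cond=clip_embed)   # conditioned on the CLIP embedding
    uncond = prior(tokens, cond=None)       # null / dropped conditioning
    # Push the conditional distribution away from the unconditional one.
    return uncond + guidance_scale * (cond - uncond)
```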
ISBN (Print): 9798350301298
Although vision transformers (ViTs) have recently shown promising results in various computer vision tasks, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computational cost. Nevertheless, errors caused by pruning strategies can lead to significant information loss; our quantitative experiments reveal that the impact of pruned tokens on performance is noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing (TPS) module for compressing vision transformers more efficiently. First, TPS prunes tokens to obtain reserved and pruned subsets. Second, TPS squeezes the information of the pruned tokens into a subset of the reserved tokens via unidirectional nearest-neighbor matching and similarity-based fusing. Compared with state-of-the-art methods, our approach outperforms them at all token pruning intensities. In particular, when shrinking the computational budgets of DeiT-tiny and DeiT-small to 35%, it improves accuracy by 1%-6% over baselines on ImageNet classification. The method accelerates the throughput of DeiT-small beyond that of DeiT-tiny while surpassing DeiT-tiny's accuracy by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, and analysis experiments show greater robustness to errors in the token pruning policy. Code is available at https://***/megvii-research/TPS-cvpr2023.
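A rough sketch of the prune-then-squeeze step as the abstract describes it (not the released implementation); the token scores, the keep ratio, and the absence of any re-normalization step are assumptions:

```python
# Hedged sketch of joint token pruning & squeezing for one layer.
# tokens: (N, D) token embeddings; scores: (N,) importance scores (assumed given).
import torch
import torch.nn.functional as F

def prune_and_squeeze(tokens, scores, keep_ratio=0.7):
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    order = scores.argsort(descending=True)
    reserved, pruned = tokens[order[:n_keep]], tokens[order[n_keep:]]

    # Unidirectional nearest-neighbor matching: each pruned token picks
    # its most similar reserved token (cosine similarity).
    sim = F.normalize(pruned, dim=-1) @ F.normalize(reserved, dim=-1).T
    weight, match = sim.max(dim=-1)

    # Similarity-based fusing: fold each pruned token into its host token,
    # weighted by similarity, instead of discarding its information outright.
    out = reserved.clone()
    out.index_add_(0, match, weight.unsqueeze(-1) * pruned)
    return out
```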
ISBN (Print): 9798350301298
The current approach to testing the robustness of object detectors suffers from serious deficiencies, such as improper methods of performing out-of-distribution detection and the use of calibration metrics that do not consider both localisation and classification quality. In this work, we address these issues and introduce the Self-Aware Object Detection (SAOD) task, a unified testing framework that respects and adheres to the challenges object detectors face in safety-critical environments such as autonomous driving. Specifically, the SAOD task requires an object detector to be robust to domain shift, to obtain reliable uncertainty estimates for the entire scene, and to provide calibrated confidence scores for the detections. We extensively use our framework, which introduces novel metrics and large-scale test datasets, to evaluate numerous object detectors in two different use-cases, allowing us to highlight critical insights into their robustness. Finally, we introduce a simple baseline for the SAOD task, enabling researchers to benchmark future proposed methods and move towards robust object detectors that are fit for purpose. Code is available at: https://***/fiveai/saod.
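One ingredient the SAOD task evaluates, calibrated detection confidences, can be illustrated with a simple binned calibration error; the paper's actual metric also accounts for localisation quality, which this sketch omits. The input arrays are assumed.

```python
# Hedged sketch: binned calibration error over (confidence, correctness) pairs.
# confidences: (N,) detection scores in [0, 1]; correct: (N,) 0/1 match flags.
import numpy as np

def detection_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err, total = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            # Gap between mean confidence and empirical precision in this bin,
            # weighted by the fraction of detections falling in the bin.
            err += mask.sum() / total * abs(confidences[mask].mean() - correct[mask].mean())
    return err
```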
ISBN (Print): 9798350301298
In digital pathology, the spatial context of cells is important for cell classification, cancer diagnosis, and prognosis. Modeling such complex cell context, however, is challenging: cells form mixtures, lineages, clusters, and holes. To model such structural patterns in a learnable fashion, we introduce several mathematical tools from spatial statistics and topological data analysis. We incorporate these structural descriptors into a deep generative model as both conditional inputs and a differentiable loss. In this way, we are able to generate high-quality multi-class cell layouts for the first time. We show that the topology-rich cell layouts can be used for data augmentation and improve the performance of downstream tasks such as cell classification.
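As an example of the spatial-statistics descriptors the abstract refers to, here is a minimal estimator of Ripley's K function over cell centroids; it omits edge correction and is not the paper's differentiable formulation:

```python
# Hedged sketch: Ripley's K function for a 2D point set (cell centroids).
# points: (N, 2) coordinates inside an observation window of area `area`.
import numpy as np

def ripley_k(points, radii, area):
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # ignore self-pairs
    lam = n / area                # point intensity
    # K(r): expected number of neighbours within radius r, normalised by intensity.
    return np.array([(d < r).sum() / (n * lam) for r in radii])
```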
ISBN (Print): 9798350301298
In this work, instead of directly predicting pixel-level segmentation masks, we formulate referring image segmentation as sequential polygon generation; the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and autoregressively outputs a sequence of polygon vertices. For more accurate geometric localization, we propose a regression-based decoder that predicts precise floating-point coordinates directly, without any coordinate quantization error. In experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
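The final conversion step, turning predicted polygon vertices into a mask, can be sketched with PIL; the function name and inputs are assumptions, and the sub-pixel vertices are simply rasterized here:

```python
# Hedged sketch: rasterize predicted polygons into a binary segmentation mask.
# polygons: list of vertex lists, each like [(x0, y0), (x1, y1), ...].
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_mask(polygons, height, width):
    mask = Image.new("L", (width, height), 0)
    draw = ImageDraw.Draw(mask)
    for poly in polygons:
        draw.polygon([(float(x), float(y)) for x, y in poly], fill=1)
    return np.array(mask, dtype=bool)
```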
ISBN (Print): 9798350365474
This paper focuses on bridging the gap between natural language descriptions, 360-degree panoramas, room shapes, and layouts/floorplans of indoor spaces. To enable new multimodal (image, geometry, language) research directions in indoor environment understanding, we propose a novel extension to the Zillow Indoor Dataset (ZInD), which we call ZInD-Tell. We first introduce an effective technique for extracting geometric information from ZInD's raw structural data, which facilitates the generation of accurate ground-truth descriptions using GPT-4. A human-in-the-loop approach is then employed to ensure the quality of these descriptions. To demonstrate the potential of our dataset, we introduce the ZInD-Tell benchmark, focusing on two exemplary tasks: language-based home retrieval and indoor description generation. Furthermore, we propose an end-to-end, zero-shot baseline model, ZInD-Agent, designed to process an unordered set of panorama images and generate home descriptions. ZInD-Agent outperforms naive baselines on both tasks, illustrating the potential use of the data and the impact of geometry. We believe this work initiates new trajectories in leveraging computer vision techniques to analyze indoor panorama images descriptively by learning the latent relation between the vision, geometry, and language modalities.
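To give a sense of what the language-based home retrieval task involves, a minimal embedding-similarity baseline might look like the following; the encoder producing `query_vec` and `home_vecs` is assumed and is not the ZInD-Agent model:

```python
# Hedged sketch: rank homes by cosine similarity between a text-query
# embedding and per-home description embeddings (encoder assumed).
import numpy as np

def retrieve_homes(query_vec, home_vecs, top_k=5):
    q = query_vec / np.linalg.norm(query_vec)
    h = home_vecs / np.linalg.norm(home_vecs, axis=1, keepdims=True)
    scores = h @ q                       # cosine similarity per home
    return np.argsort(-scores)[:top_k]   # indices of the best-matching homes
```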
ISBN (Print): 9798350365474
Modern deep CNN face matchers are trained on datasets containing "color" images. We show that such matchers achieve essentially the same accuracy on color images when trained using only grayscale images. We then consider possible causes for deep CNN face matchers "not using color". Popular web-scraped face datasets actually have 30 to 60% of their identities represented by one or more grayscale images. We analyze whether this grayscale element in the training set impacts the accuracy achieved and conclude that it does not. Comparable accuracy on color test images after training only on grayscale images implies that the inclusion of "color" may not add significant information for recognizing individuals. It also implies that computing resources can be used more efficiently: training on grayscale images reduces the memory footprint of the training data and thereby decreases processing time during training. Additionally, our findings emphasize that adopting grayscale images not only makes face recognition training more efficient but also frees capacity to include more training data, which could result in more accurate face recognition models.
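The training-time change studied here is simple to apply in practice; a minimal sketch of grayscale loading (with an assumed path and crop size) shows where the 3x per-image memory saving comes from:

```python
# Hedged sketch: load a face crop as single-channel grayscale instead of RGB,
# so each image occupies one third of the memory of a 3-channel tensor.
import numpy as np
from PIL import Image

def load_grayscale(path, size=(112, 112)):
    img = Image.open(path).convert("L").resize(size)        # "L" = 8-bit grayscale
    return np.asarray(img, dtype=np.float32)[None] / 255.0  # shape (1, H, W)
```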
ISBN (Print): 9798350301298
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of the positional embeddings instead of using whole-image positional embeddings. This better matches how positional embeddings are used at region level in the detection fine-tuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection fine-tuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by +5.8 points, in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well, achieving state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches with larger models.
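The cropped-positional-embedding trick can be sketched directly: crop a random region of the 2D embedding grid and bilinearly resize it back to full size. The grid size and tensor layout are assumptions, not the released RO-ViT code:

```python
# Hedged sketch: randomly crop and resize the 2D positional-embedding grid
# so image-level PEs better match region-level use at detection time.
import torch
import torch.nn.functional as F

def cropped_pos_embed(pos_embed, grid=14):
    # pos_embed: (grid*grid, D) -> (1, D, grid, grid) for interpolation.
    d = pos_embed.shape[-1]
    pe = pos_embed.T.reshape(1, d, grid, grid)
    # Sample a random crop region (at least 1x1) inside the grid.
    h = torch.randint(1, grid + 1, (1,)).item()
    w = torch.randint(1, grid + 1, (1,)).item()
    top = torch.randint(0, grid - h + 1, (1,)).item()
    left = torch.randint(0, grid - w + 1, (1,)).item()
    crop = pe[:, :, top:top + h, left:left + w]
    out = F.interpolate(crop, size=(grid, grid), mode="bilinear", align_corners=False)
    return out.reshape(d, grid * grid).T
```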
ISBN (Print): 9798350301298
Artificial light sources are often powered by an electric grid, and their intensities therefore oscillate rapidly in response to the grid's alternating current (AC). Interestingly, the flickers in scene radiance values due to AC illumination are useful for extracting rich information about a scene of interest. In this paper, we show that these flickers are useful for intrinsic image decomposition (IID). Our proposed method conducts light source separation (LSS) followed by IID under AC illumination. In particular, we reveal the ambiguity in blind LSS via matrix factorization and the ambiguity in IID assuming the diffuse reflection model, and we show why and how those ambiguities can be resolved via a physics-based approach. We experimentally confirmed that our method can recover the colors of the light sources, the diffuse reflectance values, and the diffuse and specular intensities (shadings) under each light source, and that IID under AC illumination is effective for auto white balancing.
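The blind LSS step can be illustrated with an off-the-shelf non-negative matrix factorization over per-pixel intensity traces; NMF is one standard choice here, and the physics-based constraints the paper uses to resolve the factorization ambiguity are not shown. The `frames` input is assumed.

```python
# Hedged sketch: factor per-pixel intensity traces over time into per-source
# flicker waveforms and per-source shading images. frames: (T, H, W), non-negative.
import numpy as np
from sklearn.decomposition import NMF

def separate_light_sources(frames, n_sources=2):
    t, h, w = frames.shape
    M = frames.reshape(t, h * w)                           # observations: time x pixels
    model = NMF(n_components=n_sources, init="nndsvda", max_iter=500)
    waveforms = model.fit_transform(M)                     # (T, n_sources) flicker signals
    shading = model.components_.reshape(n_sources, h, w)   # per-source images
    return waveforms, shading
```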