ISBN: (Print) 9798350365474
Labels are the cornerstone of supervised machine learning algorithms. Most visual recognition methods are fully supervised, using bounding boxes or pixel-wise segmentations for object localization. Traditional labeling methods, such as crowd-sourcing, are prohibitive on large datasets due to cost, data privacy concerns, the time required, and the potential for errors. To address these issues, we propose a novel annotation framework, Advanced Line Identification and Notation Algorithm (ALINA), which can be used for labeling taxiway datasets that consist of different camera perspectives and variable weather attributes (sunny and cloudy). Additionally, we propose the CIRCular threshoLd pixEl Discovery And Traversal (CIRCLEDAT) algorithm, an integral step in determining the pixels corresponding to taxiway line markings. Once the pixels are identified, ALINA generates corresponding pixel coordinate annotations on the frame. Using this approach, 60,249 frames from the taxiway dataset AssistTaxi have been labeled. To evaluate performance, a context-based edge map (CBEM) set was generated manually based on edge features and connectivity. The detection rate after testing the annotated labels against the CBEM set was 98.45%, attesting to its dependability and effectiveness.
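The abstract suggests a flood-fill-style traversal with a circular probing neighborhood over thresholded marking pixels. Below is a minimal sketch of what such a discovery-and-traversal step might look like, assuming an upstream binary mask of candidate line pixels; the radius, sampling density, and helper names are illustrative assumptions, not the published CIRCLEDAT algorithm.

```python
import numpy as np

def circular_neighbors(y, x, radius=2):
    """Sample integer pixel positions on a circle around (y, x).
    Radius and sampling density are illustrative assumptions."""
    for theta in np.linspace(0.0, 2.0 * np.pi, 8 * radius, endpoint=False):
        yield (int(round(y + radius * np.sin(theta))),
               int(round(x + radius * np.cos(theta))))

def circular_traversal(mask, seed, radius=2):
    """Traverse a binary line-marking mask from a seed pixel, collecting
    connected marking pixels via circular neighborhood probing."""
    h, w = mask.shape
    visited, stack, line_pixels = set(), [seed], []
    while stack:
        y, x = stack.pop()
        if (y, x) in visited or not (0 <= y < h and 0 <= x < w) or not mask[y, x]:
            continue
        visited.add((y, x))
        line_pixels.append((y, x))  # coordinates ALINA would emit as annotations
        stack.extend(circular_neighbors(y, x, radius))
    return line_pixels

# Toy usage: recover a diagonal line from a synthetic mask.
mask = np.zeros((32, 32), dtype=bool)
mask[np.arange(32), np.arange(32)] = True
print(len(circular_traversal(mask, seed=(0, 0))))  # 32
```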
ISBN: (Print) 9798350365474
In this paper, we address the challenge of selecting an optimal dataset from an annotated source pool to enhance performance on a target dataset derived from a different source. This is important in scenarios where on-the-fly dataset annotation is hard to afford, and it is also the theme of the second Visual Data Understanding (VDU) Challenge. Our solution, the Classifier Guided Cluster Density Reduction (CCDR) framework, operates in two stages. First, we employ a filtering technique to identify images that align with the target dataset's distribution. Subsequently, we implement a graph-based cluster density reduction method, steered by a classifier that approximates the distance between the target distribution and the source distribution. This classifier is trained to distinguish between images that resemble the target dataset and those that do not, facilitating the pruning process shown in Figure 1. Our approach maintains a balance between selecting pertinent images that match the target distribution and eliminating redundant ones that do not contribute to the enhancement of the detection model. We demonstrate the superiority of our method over various baselines in object detection tasks, particularly in optimizing the training set distribution on the region100 dataset. We have released our code at https://***/himsR/DataCVChallenge-2024/tree/main
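As a concrete illustration of the two-stage idea, here is a minimal sketch assuming precomputed image features; logistic regression stands in for the guiding classifier and k-means for the graph-based clustering, so this simplifies CCDR rather than reproducing the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def ccdr_like_select(source_feats, target_feats, n_clusters=10, keep_per_cluster=20):
    """Select source images that look target-like while thinning dense clusters."""
    # Stage 1: classifier separating target-like from source-like features.
    X = np.vstack([target_feats, source_feats])
    y = np.concatenate([np.ones(len(target_feats)), np.zeros(len(source_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    target_likeness = clf.predict_proba(source_feats)[:, 1]

    # Stage 2: cluster the source pool and keep only the most target-like
    # members of each cluster, reducing per-cluster density.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(source_feats)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        ranked = idx[np.argsort(-target_likeness[idx])]
        selected.extend(ranked[:keep_per_cluster].tolist())
    return selected  # indices into the source pool

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 16))            # stand-in source features
tgt = rng.normal(loc=0.5, size=(100, 16))   # stand-in target features
print(len(ccdr_like_select(src, tgt)))
```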
ISBN: (Print) 9798350365474
Synthetic images can help alleviate much of the cost of creating training data for plant phenotyping-focused AI development. Synthetic-to-real style transfer is of particular interest to users of artificial data because of the domain shift problem created by training neural networks on images generated in a digital environment. In this paper, we present a pipeline for synthetic plant creation and image-to-image style transfer, with a particular interest in synthetic-to-real domain adaptation targeting specific real datasets. Utilizing new advances in generative AI, we employ a combination of Stable Diffusion, Low-Rank Adapters (LoRA), and ControlNets to produce an advanced style transfer system. We focus our work on the core task of leaf instance segmentation, exploring both synthetic-to-real and inter-species style transfer, and find that our pipeline makes numerous improvements over CycleGAN for style transfer and that the images we produce are comparable to real images when used as training data.
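A minimal sketch of how such a Stable Diffusion + LoRA + ControlNet restyling step can be wired together with the diffusers library; the model IDs, LoRA path, file names, and prompt are placeholders, not the authors' released assets or exact configuration.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

# Edge-conditioned ControlNet preserves leaf structure so that instance
# masks from the synthetic render remain valid after restyling.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/plant-style-lora")  # hypothetical LoRA adapter

synthetic = load_image("synthetic_plant.png")    # rendered input image
edges = load_image("synthetic_plant_canny.png")  # structure condition

real_styled = pipe(
    prompt="a photo of a real plant, top-down view",
    image=synthetic, control_image=edges, strength=0.6,
).images[0]
real_styled.save("styled_plant.png")
```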
ISBN: (Print) 9798350365474
The inherent complexity and uncertainty of Machine Learning (ML) make it difficult for ML-based computer vision (CV) approaches to become prevalent in safety-critical domains like autonomous driving, despite their high performance. A crucial challenge in these domains is the safety assurance of ML-based systems. To address this, recent safety standardization in the automotive domain has introduced an ML safety lifecycle following an iterative development process. While this approach facilitates safety assurance, its iterative nature requires frequent adaptation and optimization of the ML function, which might include costly retraining of the ML model and is not guaranteed to converge to a safe AI solution. In this paper, we propose a modular ML approach which allows for more efficient and targeted measures for each of the modules and process steps. Each module of the modular concept model represents one visual concept and is aggregated with the other modules' outputs into a task output. The design choices of a modular concept model can be categorized into the selection of the concept modules, the aggregation of their outputs, and the training of the concept modules. Using the example of traffic sign classification, we present each step of the involved design choices and the corresponding targeted measures to take in an iterative development process for engineering safe AI.
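To make the modular idea concrete, here is a minimal sketch of a concept-modular classifier for traffic signs; the concept choices (shape, color, pictogram), the tiny backbones, the linear aggregation, and the GTSRB-sized class count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConceptModule(nn.Module):
    """One module per visual concept; each can be retrained, verified, or
    replaced in isolation during the iterative safety lifecycle."""
    def __init__(self, n_outputs):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, n_outputs),
        )
    def forward(self, x):
        return self.net(x)

class ModularConceptClassifier(nn.Module):
    """Concept outputs are concatenated and aggregated into the task output."""
    def __init__(self, concept_dims=(4, 3, 10), n_classes=43):
        super().__init__()
        self.concepts = nn.ModuleList(ConceptModule(d) for d in concept_dims)
        self.aggregate = nn.Linear(sum(concept_dims), n_classes)
    def forward(self, x):
        concept_out = torch.cat([m(x) for m in self.concepts], dim=1)
        return self.aggregate(concept_out)

model = ModularConceptClassifier()
logits = model(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 43])
```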
ISBN: (Print) 9798350365474
This paper focuses on bridging the gap between natural language descriptions, 360-degree panoramas, room shapes, and layouts/floorplans of indoor spaces. To enable new multimodal (image, geometry, language) research directions in indoor environment understanding, we propose a novel extension to the Zillow Indoor Dataset (ZInD), which we call ZInD-Tell. We first introduce an effective technique for extracting geometric information from ZInD's raw structural data, which facilitates the generation of accurate ground-truth descriptions using GPT-4. A human-in-the-loop approach is then employed to ensure the quality of these descriptions. To demonstrate the vast potential of our dataset, we introduce the ZInD-Tell benchmark, focusing on two exemplary tasks: language-based home retrieval and indoor description generation. Furthermore, we propose an end-to-end, zero-shot baseline model, ZInD-Agent, designed to process an unordered set of panorama images and generate home descriptions. ZInD-Agent outperforms naive methods in both tasks and can thus be considered a complement to them, demonstrating the potential use of the data and the impact of geometry. We believe this work initiates new trajectories in leveraging computer vision techniques to analyze indoor panorama images descriptively by learning the latent relation between vision, geometry, and language modalities.
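A minimal sketch of the geometry-to-description step, assuming room-level geometry has already been extracted; the JSON schema, prompt wording, and model name are illustrative, not the exact ZInD-Tell pipeline.

```python
import json
from openai import OpenAI

# Room-level geometry assumed already extracted from ZInD's raw structural
# data (toy values; the real schema is richer).
rooms = [
    {"type": "living room", "area_m2": 24.5, "windows": 3, "adjacent": ["kitchen"]},
    {"type": "kitchen", "area_m2": 11.0, "windows": 1, "adjacent": ["living room"]},
]

prompt = ("Write a concise natural-language description of this home layout:\n"
          + json.dumps(rooms, indent=2))

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # candidate description for human review
```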
ISBN: (Print) 9798350365474
Knowledge Distillation (KD) has been extensively studied as a means to enhance the performance of smaller models in Convolutional Neural Networks (CNNs). Recently, the Vision Transformer (ViT) has demonstrated remarkable success in various computer vision tasks, leading to an increased demand for KD in ViT. However, while logit-based KD has been applied to ViT, other feature-based KD methods for CNNs cannot be directly implemented due to the significant structural gap. In this paper, we analyze the properties of different feature layers in ViT to identify a method for feature-based ViT distillation. Our findings reveal that both shallow and deep layers in ViT are equally important for distillation and require distinct distillation strategies. Based on these guidelines, we propose our feature-based method ViTKD, in which the student mimics the teacher's shallow layers and generates the teacher's deep layer. ViTKD leads to consistent and significant improvements in the student models. On ImageNet-1K, we achieve performance boosts of 1.64% for DeiT-Tiny, 1.40% for DeiT-Small, and 1.70% for DeiT-Base. Downstream tasks also demonstrate the superiority of ViTKD. Additionally, ViTKD and logit-based KD are complementary and can be applied together directly, further enhancing the student's performance. Specifically, DeiT-T, S, and B achieve accuracies of 77.78%, 83.59%, and 85.41%, respectively, using this combined approach. Code is available at https://github.com/yzdv/cls_KD
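A minimal sketch of the two-part feature objective the abstract describes (mimicking shallow layers, generating the deep layer); the layer split, linear projections, and MSE losses are assumptions standing in for ViTKD's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vitkd_like_loss(student_feats, teacher_feats, proj):
    """Shallow student tokens mimic the teacher directly; projected deep
    student tokens are trained to generate the teacher's deep tokens."""
    s_shallow, s_deep = student_feats
    t_shallow, t_deep = teacher_feats
    mimic = F.mse_loss(proj["shallow"](s_shallow), t_shallow)
    generate = F.mse_loss(proj["deep"](s_deep), t_deep)
    return mimic + generate

# Toy shapes: batch 2, 196 tokens, student dim 192 -> teacher dim 384,
# roughly DeiT-Tiny distilled from a larger teacher.
proj = {"shallow": nn.Linear(192, 384), "deep": nn.Linear(192, 384)}
student = (torch.randn(2, 196, 192), torch.randn(2, 196, 192))
teacher = (torch.randn(2, 196, 384), torch.randn(2, 196, 384))
print(vitkd_like_loss(student, teacher, proj).item())
```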
ISBN: (Print) 9798350365474
In the field of Class Incremental Object Detection (CIOD), creating models that can continuously learn like humans is a major challenge. Pseudo-labeling methods, although initially powerful, struggle with multi-scenario incremental learning due to their tendency to forget past knowledge. To overcome this, we introduce a new approach called Vision-Language Model assisted Pseudo-Labeling (VLM-PL). This technique uses a Vision-Language Model (VLM) to verify the correctness of pseudo ground-truths (GTs) without requiring additional model training. VLM-PL starts by deriving pseudo GTs from a pre-trained detector. Then, we generate a custom query for each pseudo GT using carefully designed prompt templates that combine image and text features, allowing the VLM to classify correctness through its responses. Furthermore, VLM-PL integrates refined pseudo GTs and real GTs from upcoming training, effectively combining new and old knowledge. Extensive experiments conducted on the Pascal VOC and MS COCO datasets not only highlight VLM-PL's exceptional performance in multi-scenario settings but also illuminate its effectiveness in dual-scenario settings, achieving state-of-the-art results in both.
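A minimal sketch of the verification loop, with a mocked VLM call; the prompt template and the `query_vlm` helper are hypothetical stand-ins for the paper's prompt design and deployed model.

```python
from dataclasses import dataclass

@dataclass
class PseudoGT:
    image_path: str
    box: tuple   # (x1, y1, x2, y2) from the pre-trained detector
    label: str   # the detector's predicted class

def query_vlm(image_path, box, prompt):
    """Hypothetical VLM call: in practice, crop `box` from the image, send
    it with `prompt` to the vision-language model, and return its text
    answer. Mocked here so the sketch runs."""
    return "yes"

def verify_pseudo_gts(pseudo_gts):
    """Keep only pseudo GTs the VLM confirms; survivors are later merged
    with real GTs from the incoming training data."""
    kept = []
    for p in pseudo_gts:
        prompt = f"Does this image region contain a {p.label}? Answer yes or no."
        if query_vlm(p.image_path, p.box, prompt).strip().lower().startswith("yes"):
            kept.append(p)
    return kept

candidates = [PseudoGT("frame_001.jpg", (10, 20, 80, 120), "person")]
print(len(verify_pseudo_gts(candidates)))  # 1 with the mocked "yes"
```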
ISBN: (Print) 9798350365474
We analyze various factors affecting the proper functioning of MIA and MINT, two research lines aimed at detecting the data used to train a model. The difference between these lines lies in the environmental conditions, while the fundamental bases are similar for both. As evident in the literature, this detection task is far from straightforward and poses an ongoing challenge for the scientific community. Specifically, in this work, we conclude that factors such as the number of times the data passes through the original network, the loss function, and dropout significantly impact detection outcomes. It is therefore crucial to consider them when developing these methods and when training any neural network, both to avoid (MIA) and to enhance (MINT) this detection. We evaluate the AdaFace facial recognition model using five databases with over 22 million images, modifying the different factors under analysis and defining a suitable protocol for their examination. State-of-the-art accuracy of up to 87% is achieved, surpassing existing methods.
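To illustrate why the number of passes over the data matters, here is a toy loss-threshold membership detection experiment (the classic baseline in this area, not the paper's protocol or the AdaFace setup); increasing the epoch count widens the train/held-out loss gap that the detection exploits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_in, X_out = torch.randn(200, 10), torch.randn(200, 10)  # members / non-members
y_in, y_out = (X_in.sum(1) > 0).float(), (X_out.sum(1) > 0).float()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss(reduction="none")

for _ in range(200):  # factor under study: passes over the training data
    opt.zero_grad()
    loss_fn(model(X_in).squeeze(1), y_in).mean().backward()
    opt.step()

# Members tend to have lower loss than held-out samples; threshold on it.
with torch.no_grad():
    l_in = loss_fn(model(X_in).squeeze(1), y_in)
    l_out = loss_fn(model(X_out).squeeze(1), y_out)
thresh = torch.cat([l_in, l_out]).median()
acc = ((l_in < thresh).float().mean() + (l_out >= thresh).float().mean()) / 2
print(f"membership detection accuracy: {acc:.2f}")
```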
ISBN: (Print) 9798350365474
We propose a content-based system for matching video and background music. The system aims to address the challenges of music recommendation for new users or new music given short-form videos. To this end, we propose a cross-modal framework, VMCML (Video and Music Matching via Cross-Modality Lifting), that finds a shared embedding space between video and music representations. To ensure the embedding space can be effectively shared by both representations, we leverage CosFace, a margin-based cosine similarity loss. Furthermore, to address the limitations of previous datasets, we collect videos and music from a well-known multimedia platform, ensuring that the music is not the original sound of a video and that more than one video is matched to the same music. We establish a large-scale dataset called MSV, which provides 390 individual music tracks and the corresponding 150,000 matched videos. We conduct extensive experiments on the YouTube-8M and our MSV datasets. Our quantitative and qualitative results demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art video and music matching performance.
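For reference, a minimal implementation of the CosFace margin-based cosine similarity loss the framework leverages; treating each of the 390 MSV tracks as a class for video embeddings is an assumption about how it is applied here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceLoss(nn.Module):
    """CosFace: cross-entropy over scaled cosine similarities, with a
    margin m subtracted from the target-class cosine."""
    def __init__(self, embed_dim, n_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        self.s, self.m = s, m
    def forward(self, embeddings, labels):
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)

loss_fn = CosFaceLoss(embed_dim=128, n_classes=390)  # 390 tracks in MSV
video_emb = torch.randn(8, 128)                      # lifted video embeddings
track_ids = torch.randint(0, 390, (8,))              # matched music tracks
print(loss_fn(video_emb, track_ids).item())
```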
ISBN: (Print) 9798350365474
Monitoring dietary intake is a crucial aspect of promoting healthy living. In recent years, advances in computer vision technology have facilitated dietary intake monitoring through the use of images and depth cameras. However, the current state-of-the-art image-based food portion estimation algorithms assume that users take images of their meals one or two times, which can be inconvenient and fails to capture food items that are not visible from a top-down perspective, such as ingredients submerged in a stew. To address these limitations, we introduce an innovative solution that utilizes stationary user-facing cameras to track food items on utensils, requiring no change of camera perspective after installation. The shallow depth of utensils provides a more favorable angle for capturing food items, and tracking them on the utensil's surface offers a significantly more accurate estimation of dietary intake without the need for post-meal image capture. The system reliably estimates the nutritional content of liquid-solid heterogeneous mixtures such as soups and stews. Through a series of experiments, we demonstrate the exceptional potential of our method as a non-invasive, user-friendly, and highly accurate dietary intake monitoring tool.
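A minimal sketch of how per-bite detections on the utensil could be accumulated into an intake estimate; the bite-event interface, food classes, and caloric densities are illustrative values, not the paper's calibration.

```python
# Illustrative caloric densities (kcal per ml); real values would come
# from a nutrition database and per-food volume calibration.
FOOD_DENSITY_KCAL_PER_ML = {"stew_broth": 0.5, "beef": 2.5, "potato": 0.9}

def accumulate_intake(bite_events):
    """Each bite event is (food_class, estimated_volume_ml), produced by
    the user-facing camera when a loaded utensil moves toward the mouth."""
    total_kcal = 0.0
    for food, volume_ml in bite_events:
        total_kcal += FOOD_DENSITY_KCAL_PER_ML.get(food, 1.0) * volume_ml
    return total_kcal

bites = [("stew_broth", 12.0), ("beef", 8.0), ("potato", 10.0)]
print(f"estimated intake: {accumulate_intake(bites):.0f} kcal")
```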