Transformer neural networks (TNNs) have been widely applied to a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV). Their widespread adoption has...
ISBN: 9798350365474 (digital), 9798350365481 (print)
Labels are the cornerstone of supervised machine learning algorithms. Most visual recognition methods are fully supervised, using bounding boxes or pixel-wise segmentations for object localization. Traditional labeling methods, such as crowd-sourcing, are prohibitive on large datasets due to cost, data privacy concerns, time requirements, and potential errors. To address these issues, we propose a novel annotation framework, Advanced Line Identification and Notation Algorithm (ALINA), which can be used for labeling taxiway datasets that consist of different camera perspectives and variable weather attributes (sunny and cloudy). Additionally, the CIRCular threshoLd pixEl Discovery And Traversal (CIRCLEDAT) algorithm has been proposed, which is an integral step in determining the pixels corresponding to taxiway line markings. Once the pixels are identified, ALINA generates corresponding pixel coordinate annotations on the frame. Using this approach, 60,249 frames from the taxiway dataset AssistTaxi have been labeled. To evaluate the performance, a context-based edge map (CBEM) set was generated manually based on edge features and connectivity. The detection rate after testing the annotated labels against the CBEM set was recorded as 98.45%, attesting to its dependability and effectiveness.
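The abstract does not spell out the CIRCLEDAT procedure. As a rough, hedged illustration of a circular threshold-and-traverse pixel discovery pass, the sketch below grows a set of bright line-marking pixels outward from a seed point; the function name discover_line_pixels, the intensity threshold, and the neighbourhood radius are illustrative assumptions, not the authors' parameters.

```python
# Hedged sketch: a generic threshold-and-traverse pixel discovery pass,
# loosely in the spirit of the CIRCLEDAT step described above. Threshold,
# radius, and seed strategy are illustrative assumptions.
from collections import deque
import numpy as np

def discover_line_pixels(gray, seed, intensity_thresh=200, radius=2):
    """Grow a set of candidate line-marking pixels from a bright seed pixel.

    gray : (H, W) uint8 grayscale frame
    seed : (row, col) starting pixel on the marking
    Returns a list of (row, col) coordinates (the annotation ALINA would emit).
    """
    h, w = gray.shape
    visited = np.zeros((h, w), dtype=bool)
    queue, found = deque([seed]), []
    # Neighbour offsets inside a circular window of the given radius.
    offsets = [(dr, dc) for dr in range(-radius, radius + 1)
               for dc in range(-radius, radius + 1)
               if 0 < dr * dr + dc * dc <= radius * radius]
    while queue:
        r, c = queue.popleft()
        if visited[r, c]:
            continue
        visited[r, c] = True
        if gray[r, c] < intensity_thresh:
            continue  # not bright enough to be a painted marking
        found.append((r, c))
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not visited[nr, nc]:
                queue.append((nr, nc))
    return found
```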
This paper presents the results of two tracks from the fourth Thermal Image Super-Resolution (TISR) challenge, held at the Perception Beyond the Visible Spectrum (PBVS) 2023 workshop. Track-1 uses the same thermal image dataset as previous challenges, with 951 training images and 50 validation images at each resolution. In this track, two evaluations were conducted: the first consists of generating an SR image from an HR noisy thermal image downsampled by four, and the second consists of generating an SR image from a mid-resolution image and comparing it with its semi-registered HR image (acquired with another camera). The results of Track-1 outperformed those from last year's challenge. On the other hand, Track-2 uses a newly acquired dataset consisting of 160 registered visible and thermal images of the same scenario for training and 30 validation images. This year, more than 150 teams participated in the challenge tracks, demonstrating the community's ongoing interest in this topic.
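As a hedged sketch of the first Track-1 evaluation described above (super-resolving an HR thermal image that has been downsampled by four), the snippet below bicubically downsamples an image, calls a placeholder SR model, and scores the result with PSNR. The model interface and the use of PSNR here are assumptions, not the challenge's official protocol.

```python
# Hedged sketch of a Track-1 style evaluation: downsample HR by x4,
# super-resolve, and compare against the original HR image.
import numpy as np
from PIL import Image

def psnr(ref, est, max_val=255.0):
    """Peak signal-to-noise ratio between two uint8-range images."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def evaluate_x4(hr_path, sr_model):
    hr = Image.open(hr_path).convert("L")             # thermal images are single channel
    lr = hr.resize((hr.width // 4, hr.height // 4), Image.BICUBIC)
    sr = sr_model(np.asarray(lr))                     # hypothetical model: LR array -> HR-sized array
    return psnr(np.asarray(hr), sr)
```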
In this paper, we develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels to, in turn, facilitate the transfer of comics from one publication channel to another by assisting authors in the task of reconfiguring their narratives. Our MTL method can successfully identify the semantic units as well as the embedded notion of 3D in comics panels. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the author's creative process. Typically, dense image-based prediction techniques require a large corpus of data. Finding an automated solution for dense prediction in the comics domain, therefore, becomes more difficult with the lack of ground-truth dense annotations for the comics images. To address these challenges, we develop the following solutions: (i) we leverage a commonly-used strategy known as unsupervised image-to-image translation, which allows us to utilize a large corpus of real-world annotations; (ii) we utilize the results of the translations to develop our multitasking approach that is based on a vision transformer backbone and a domain transferable attention module; (iii) we study the feasibility of integrating our MTL dense-prediction method with an existing retargeting method, thereby reconfiguring comics.
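The sketch below illustrates, under simplifying assumptions, the kind of multi-task dense-prediction model the abstract describes: a shared vision-transformer-style encoder feeding two lightweight dense heads (one for semantic units, one for a depth-like 3D cue). Layer sizes, the naive upsampling heads, and the class name TinyMTLNet are illustrative; the paper's domain transferable attention module is not reproduced.

```python
# Hedged sketch: shared ViT-style encoder with two dense-prediction heads.
import torch
import torch.nn as nn

class TinyMTLNet(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, n_classes=10):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.grid = img_size // patch
        # Two task-specific heads over the shared token features.
        self.seg_head = nn.Conv2d(dim, n_classes, kernel_size=1)
        self.depth_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.up = nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens)
        feat = tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.up(self.seg_head(feat)), self.up(self.depth_head(feat))
```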
ISBN: 9798350365474 (digital), 9798350365481 (print)
Instance-based semantic segmentation provides detailed per-pixel scene understanding information crucial for both computer vision and robotics applications. However, state-of-the-art approaches such as Mask2Former are computationally expensive, and reducing this computational burden while maintaining high accuracy remains challenging. Knowledge distillation has been regarded as a potential way to compress neural networks, but to date limited work has explored how to apply this to distill information from the output queries of a model such as Mask2Former. In this paper, we match the output queries of the student and teacher models to enable a query-based knowledge distillation scheme. We independently match the teacher and the student to the ground truth and use this to define the teacher-to-student relationship for knowledge distillation. Using this approach we show that it is possible to perform knowledge distillation where the student models can have a lower number of queries and the backbone can be changed from a Transformer architecture to a convolutional neural network architecture. Experiments on two challenging agricultural datasets, sweet pepper (BUP20) and sugar beet (SB20), and Cityscapes demonstrate the efficacy of our approach. Across the three datasets the student models obtain an average absolute performance improvement in AP of 1.8 and 1.9 points for the ResNet-50 and Swin-Tiny backbones, respectively. To the best of our knowledge, this is the first work to propose knowledge distillation schemes for instance semantic segmentation with transformer-based models.
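A minimal sketch of the query-matching idea described above, assuming class-only matching costs: teacher and student queries are each Hungarian-matched to the ground-truth instances, and queries assigned to the same instance become teacher-to-student distillation pairs. The mask-based cost terms and loss weights of the actual method are omitted.

```python
# Hedged sketch: query-based knowledge distillation via independent matching
# of teacher and student queries to the ground truth.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_to_gt(pred_logits, gt_labels):
    """Assign each ground-truth instance to one query via a class-cost matching."""
    prob = pred_logits.softmax(-1)                    # (num_queries, num_classes)
    cost = -prob[:, gt_labels]                        # (num_queries, num_gt)
    q_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return dict(zip(gt_idx.tolist(), q_idx.tolist())) # gt instance -> query index

def query_kd_loss(student_logits, teacher_logits, gt_labels, temperature=2.0):
    s_map = match_to_gt(student_logits, gt_labels)
    t_map = match_to_gt(teacher_logits, gt_labels)
    losses = []
    for gt in s_map:                                  # pair queries via the shared gt instance
        s_q, t_q = student_logits[s_map[gt]], teacher_logits[t_map[gt]]
        losses.append(F.kl_div(
            F.log_softmax(s_q / temperature, -1),
            F.softmax(t_q / temperature, -1),
            reduction="sum") * temperature ** 2)
    if not losses:
        return student_logits.new_zeros(())
    return torch.stack(losses).mean()
```

Pairing through the shared ground-truth instance, rather than matching student queries to teacher queries directly, is what allows the student to use fewer queries than the teacher.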
ISBN: 9798350365474 (digital), 9798350365481 (print)
The rapid growth of machine learning has spurred legislative initiatives such as "the Right to be Forgotten," allowing users to request data removal. In response, machine unlearning proposes the selective removal of unwanted data without the need for retraining from scratch. While the Neural-Tangent-Kernel (NTK) based unlearning method excels in performance, it suffers from significant computational complexity, especially for large-scale models and datasets. To improve this situation, our work introduces "Fast-NTK," a novel NTK-based unlearning algorithm that significantly reduces the computational complexity by incorporating parameter-efficient fine-tuning methods, such as fine-tuning batch normalization layers in a CNN or visual prompts in a vision transformer. Our experimental results demonstrate scalability to large neural networks and datasets (e.g., 88M parameters and 5k images), surpassing the limitations of previous full-model NTK-based approaches designed for smaller cases (e.g., 8M parameters and 500 images). Notably, our approach maintains performance comparable to the traditional method of retraining on the retain set alone. Fast-NTK can thus enable practical and scalable NTK-based unlearning in deep neural networks.
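The following is a minimal sketch, under stated assumptions, of the computational saving that Fast-NTK exploits: the empirical NTK Gram matrix is built from Jacobians taken only with respect to a small parameter-efficient subset (here, batch-norm affine parameters) rather than all model weights. The unlearning update that consumes this kernel is not reproduced, and the helper names are hypothetical.

```python
# Hedged sketch: empirical NTK restricted to a parameter-efficient subset.
import torch
import torch.nn as nn

def bn_params(model):
    """Collect only the batch-norm affine parameters of a CNN."""
    return [p for m in model.modules() if isinstance(m, nn.BatchNorm2d)
            for p in (m.weight, m.bias)]

def ntk_on_subset(model, inputs, class_idx=0):
    """Empirical NTK K[i, j] = <J_i, J_j> over the BN parameters only."""
    params = bn_params(model)
    jacs = []
    for x in inputs:                                  # one sample at a time
        out = model(x.unsqueeze(0))[0, class_idx]     # scalar output for simplicity
        grads = torch.autograd.grad(out, params)
        jacs.append(torch.cat([g.flatten() for g in grads]))
    J = torch.stack(jacs)                             # (N, n_bn_params)
    return J @ J.T                                    # (N, N) kernel, tiny vs. full-model Jacobians
```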
ISBN: 9798350365474 (digital), 9798350365481 (print)
Automatic Lip-Reading (ALR) requires the recognition of spoken words based on a visual recording of the speaker's lips, without access to the sound. ALR with neuromorphic event-based vision sensors, instead of traditional frame-based cameras, is particularly promising for edge applications due to their high temporal resolution, low power consumption, and robustness. Neuromorphic models, such as Spiking Neural Networks (SNNs), encode information using events and are naturally compatible with such data. The sparse and event-based nature of both the sensor data and SNN activations can be leveraged in an end-to-end neuromorphic hardware pipeline for low-power and low-latency edge applications. However, the accuracy of SNNs is often significantly degraded compared to state-of-the-art non-spiking Artificial Neural Networks (ANNs). In this work, a new SNN model, the Signed Spiking Gated Recurrent Unit (SpikGRU2+), is proposed and used as a task head for event-based ALR. The SNN architecture is as accurate as its ANN equivalent and outperforms the state-of-the-art on the DVS-Lip dataset. Notably, the accuracy is improved by 25% (respectively 4%) compared to the previous state-of-the-art SNN (respectively ANN). In addition, the SNN spike sparsity can be optimized to further reduce the number of operations by up to 22x compared to the ANN while maintaining high accuracy. This work opens up new perspectives for the use of SNNs for accurate and low-power end-to-end neuromorphic gesture recognition. Code is available.
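The abstract does not give the SpikGRU2+ definition; as a loose illustration of a signed spiking gated recurrent unit, the toy cell below maintains a membrane potential with gated recurrence and emits ternary {-1, 0, +1} spikes through a thresholded nonlinearity with a surrogate gradient. All names and constants are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: a toy signed-spike gated recurrent cell with a surrogate gradient.
import torch
import torch.nn as nn

class SignedSpike(torch.autograd.Function):
    THRESH = 1.0

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        # Ternary spikes: +1 above threshold, -1 below negative threshold, else 0.
        return (v > SignedSpike.THRESH).float() - (v < -SignedSpike.THRESH).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate gradient around both firing thresholds.
        surrogate = ((v.abs() - SignedSpike.THRESH).abs() < 0.5).float()
        return grad_out * surrogate

class ToySpikingGRUCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.gate = nn.Linear(in_dim + hid_dim, hid_dim)
        self.cand = nn.Linear(in_dim + hid_dim, hid_dim)

    def forward(self, x, state):
        v, s = state                                  # membrane potential, previous spikes
        z = torch.sigmoid(self.gate(torch.cat([x, s], dim=-1)))
        v = z * v + (1 - z) * self.cand(torch.cat([x, s], dim=-1))
        s = SignedSpike.apply(v)
        return s, (v, s)
```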
ISBN: 9798350365474 (digital), 9798350365481 (print)
Recent advancements in image segmentation have focused on enhancing the efficiency of the models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially on semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
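To make the mask-classification head concrete, the sketch below shows a common formulation (in the style of Mask2Former-like heads) that the transformer-based head described above plausibly follows: learned queries attend to per-pixel features, and each query yields class probabilities plus a binary mask via a dot product with the pixel embeddings. The two-stream spatial/context backbone is stubbed out and all dimensions are illustrative.

```python
# Hedged sketch: a mask-classification segmentation head over fused features.
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    def __init__(self, dim=128, num_queries=100, num_classes=19):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, pixel_feats):
        # pixel_feats: (B, dim, H, W), e.g. fused spatial + context path features.
        B, C, H, W = pixel_feats.shape
        mem = pixel_feats.flatten(2).transpose(1, 2)               # (B, H*W, dim)
        q = self.decoder(self.queries.weight.expand(B, -1, -1), mem)
        class_logits = self.cls_head(q)                            # (B, Q, K+1)
        masks = torch.einsum("bqc,bchw->bqhw",
                             self.mask_embed(q), pixel_feats)      # (B, Q, H, W)
        return class_logits, masks
```

Because both semantic and panoptic outputs can be assembled from the same per-query class scores and masks, a head of this form supports multiple tasks without task-specific architectures.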
ISBN: 9798350365474 (digital), 9798350365481 (print)
Detecting and segmenting moving objects from a moving monocular camera is challenging in the presence of unknown camera motion, diverse object motions and complex scene structures. Most existing methods rely on a single motion cue to perform motion segmentation, which is usually insufficient when facing different complex environments. While a few recent deep learning based methods are able to combine multiple motion cues to achieve improved accuracy, they depend heavily on vast datasets and extensive annotations, making them less adaptable to new scenarios. To address these limitations, we propose a novel monocular dense segmentation method that achieves state-of-the-art motion segmentation results in a zero-shot manner. The proposed method synergistically combines the strengths of deep learning and geometric model fusion methods by performing geometric model fusion on object proposals. Experiments show that our method achieves competitive results on several motion segmentation datasets and even surpasses some state-of-the-art supervised methods on certain benchmarks, while not being trained on any data. We also present an ablation study to show the effectiveness of combining different geometric models together for motion segmentation, highlighting the value of our geometric model fusion strategy.
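As a hedged sketch of fitting a geometric model on object proposals, the snippet below fits a single fundamental matrix to all correspondences as the camera-motion model and flags a proposal as moving when its points show large epipolar residuals. The thresholds, the single-model choice, and the proposal source are assumptions, not the paper's exact fusion rule (which combines several geometric models).

```python
# Hedged sketch: per-proposal epipolar consistency check under a global
# camera-motion model estimated with RANSAC.
import cv2
import numpy as np

def moving_proposals(pts_prev, pts_curr, proposal_masks, err_thresh=2.0):
    """pts_prev/pts_curr: (N, 2) float32 matched points between two frames;
    proposal_masks: list of (N,) boolean arrays selecting each proposal's points."""
    F, _ = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC, 1.0, 0.99)
    moving = []
    for mask in proposal_masks:
        p1 = cv2.convertPointsToHomogeneous(pts_prev[mask]).reshape(-1, 3)
        p2 = cv2.convertPointsToHomogeneous(pts_curr[mask]).reshape(-1, 3)
        lines = p1 @ F.T                                   # epipolar lines in frame 2
        err = np.abs(np.sum(p2 * lines, axis=1))
        err = err / np.linalg.norm(lines[:, :2], axis=1)   # point-to-line distance
        moving.append(np.median(err) > err_thresh)         # large residual => independent motion
    return moving
```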
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image. However, the performance is not reliable for images with challenging factors, such as heavy occlusion, motion blur, etc. In this work, we propose to understand human attributes using video frames that can make full use of temporal information. Specifically, we formulate the video-based PAR as a vision-language fusion problem and adopt the pre-trained big model CLIP to extract the feature embeddings of given video frames. To better utilize the semantic information, we take the attribute list as another input and transform the attribute words/phrases into corresponding sentences via split, expand, and prompt. Then, the text encoder of CLIP is utilized for language embedding. The averaged visual tokens and text tokens are concatenated and fed into a fusion Transformer for multi-modal interactive learning. The enhanced tokens are then fed into a classification head for pedestrian attribute prediction. Extensive experiments on a large-scale video-based PAR dataset fully validate the effectiveness of our proposed framework. Both the source code and pre-trained models will be released at https://***/Event-AHU/VTF_PAR.
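A minimal sketch of the pipeline described above, using the Hugging Face CLIP interface as a stand-in: frame features from the CLIP image encoder are averaged over time, concatenated with text embeddings of the attribute sentences, passed through a small fusion transformer, and scored by a per-attribute head. Prompt templates, layer sizes, and the class name VideoPARFusion are illustrative; this is not the released VTF_PAR code.

```python
# Hedged sketch: CLIP-based vision-language fusion for video PAR.
import torch
import torch.nn as nn
from transformers import CLIPModel

class VideoPARFusion(nn.Module):
    def __init__(self, num_attributes, dim=512):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)                     # one logit per attribute token
        self.num_attributes = num_attributes

    def forward(self, frame_pixels, attr_input_ids):
        # frame_pixels: (T, 3, 224, 224) preprocessed frames of one pedestrian track
        # attr_input_ids: (num_attributes, L) tokenized attribute sentences
        with torch.no_grad():                             # CLIP kept frozen in this sketch
            vis = self.clip.get_image_features(pixel_values=frame_pixels)   # (T, dim)
            txt = self.clip.get_text_features(input_ids=attr_input_ids)     # (A, dim)
        vis_token = vis.mean(dim=0, keepdim=True)         # temporal average of frame features
        tokens = torch.cat([vis_token, txt], dim=0).unsqueeze(0)            # (1, 1+A, dim)
        fused = self.fusion(tokens)
        return self.head(fused[0, 1:]).squeeze(-1)        # (A,) attribute logits
```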