In this paper, we propose a novel training method for transformer encoder-decoder based image captioning, which directly generates a caption text from an input image. In general, many image-to-text paired datasets need to be prepared for robust image captioning, but such datasets cannot always be collected in practical cases. Our key idea for mitigating the data preparation cost is to utilize text-to-text paraphrasing modeling, i.e., the task of converting an input text into different expressions without changing its meaning. In fact, paraphrasing deals with a transformation task similar to image captioning, even though paraphrasing handles texts instead of images. In our proposed method, an encoder-decoder network trained via the paraphrasing task is directly leveraged for image captioning. Thus, an encoder-decoder network pre-trained on a text-to-text transformation task is transferred to an image-to-text transformation task, even though a different modality must be handled by the encoder network. Our experiments using the MS COCO caption datasets demonstrate the effectiveness of the proposed method.
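The transfer described above hinges on both tasks sharing the same encoder-to-decoder interface, so that only the modality-specific frontend changes. The sketch below illustrates that structural idea only; all dimensions, weight matrices, and the pooled one-step decoder are invented stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared encoder-output dimension (illustrative)

# Stand-ins for components pre-trained on the text-to-text paraphrasing task.
W_text_embed = rng.normal(size=(100, D))  # token id -> D-dim state
W_decoder = rng.normal(size=(D, 100))     # D-dim state -> vocabulary logits

def encode_text(token_ids):
    # Paraphrasing-task encoder frontend: embeds tokens into the shared space.
    return W_text_embed[token_ids]

# For image captioning, only the modality-specific frontend is replaced;
# the components pre-trained on paraphrasing are reused unchanged.
W_img_proj = rng.normal(size=(2048, D))   # image feature -> D-dim state

def encode_image(features):
    return features @ W_img_proj

def decode(states):
    # Reused decoder (here reduced to a single pooled prediction step).
    return int(np.argmax(states.mean(axis=0) @ W_decoder))

# Both modalities feed the same decoder through the same D-dim interface.
txt_states = encode_text(np.array([3, 14, 15]))
img_states = encode_image(rng.normal(size=(5, 2048)))
first_token = decode(img_states)
```

Because the interface dimension `D` is identical for both frontends, the decoder never needs to know which modality produced its input states.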
Real-world image recognition systems often face corrupted input images, which cause distribution shifts and degrade model performance. These systems often use a single prediction model in a central server and process images sent from various environments, such as cameras distributed in cities or cars. Such single models face images corrupted in heterogeneous ways at test time. Thus, they need to adapt instantly to multiple corruptions during testing rather than being re-trained at high cost. Test-time adaptation (TTA), which aims to adapt models without accessing the training dataset, is one setting that can address this problem. Existing TTA methods indeed work well on a single corruption. However, their adaptation ability is limited when multiple types of corruption occur, which is more realistic. We hypothesize that this is because the distribution shift is more complicated and adaptation becomes more difficult under multiple corruptions. In fact, we experimentally found that a larger distribution gap remains after TTA. To address the distribution gap during testing, we propose a novel TTA method named Covariance-Aware Feature alignment (CAFe). We empirically show that CAFe outperforms prior TTA methods on image corruptions, including multiple types of corruptions.
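The general idea of covariance-aware feature alignment can be illustrated with a whitening-and-recoloring transform that matches a test batch's feature mean and covariance to source statistics. This is a minimal sketch of that family of techniques, not CAFe's actual objective or training procedure.

```python
import numpy as np

def align_features(feats, src_mean, src_cov, eps=1e-5):
    """Recolor test features so their mean/covariance match source statistics.

    Illustrative only: whitening with the test-batch covariance, then
    coloring with the source covariance. CAFe's actual formulation is
    not reproduced here.
    """
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + eps * np.eye(feats.shape[1])
    # Whitening transform for the test batch.
    w_vals, w_vecs = np.linalg.eigh(cov)
    whiten = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
    # Coloring transform built from the source covariance.
    s_vals, s_vecs = np.linalg.eigh(src_cov + eps * np.eye(feats.shape[1]))
    color = s_vecs @ np.diag(s_vals ** 0.5) @ s_vecs.T
    return (feats - mu) @ whiten @ color + src_mean

rng = np.random.default_rng(1)
src = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4)) + 2.0
tgt = rng.normal(size=(500, 4)) * 3.0 - 1.0  # "corrupted" shifted features
aligned = align_features(tgt, src.mean(axis=0), np.cov(src, rowvar=False))
```

After alignment, the first two moments of the adapted features match the source distribution, which closes the kind of distribution gap the abstract describes.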
Recurrent neural networks with a gating mechanism such as an LSTM or GRU are powerful tools to model sequential data. In the mechanism, a forget gate, which was introduced to control information flow in a hidden state...
This paper addresses a major issue in planning the trajectories of under-actuated autonomous vehicles based on neurodynamic optimization. A receding-horizon vehicle trajectory planning task is formulated as a sequential global optimization problem with weighted quadratic navigation functions and obstacle avoidance constraints based on given vehicle goals. The feasibility of the formulated optimization problem is guaranteed under derived conditions. The optimization problem is sequentially solved via collaborative neurodynamic optimization in a neurodynamics-driven trajectory planning method. Experimental results with under-actuated unmanned wheeled vehicles and autonomous surface vehicles are elaborated to substantiate the efficacy of the neurodynamics-driven trajectory planning method.
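A single neurodynamic model of the kind used in such planners can be sketched as a projection neural network that solves one constrained quadratic subproblem per horizon step. The sketch below shows that building block only; the collaborative scheme in the paper runs multiple such networks with information exchange to seek global optima, which is not reproduced here, and the toy cost and box limits are invented.

```python
import numpy as np

def neurodynamic_qp(Q, c, lo, hi, alpha=0.1, dt=0.1, steps=2000):
    """Solve min 1/2 x'Qx + c'x subject to lo <= x <= hi with a projection
    neural network: dx/dt = P(x - alpha*(Qx + c)) - x, integrated by
    forward Euler. A minimal single-network sketch."""
    x = np.zeros_like(c, dtype=float)
    for _ in range(steps):
        grad = Q @ x + c
        x_proj = np.clip(x - alpha * grad, lo, hi)  # projection onto the box
        x = x + dt * (x_proj - x)                   # Euler step of the dynamics
    return x

# Toy receding-horizon step: quadratic navigation cost with box (actuation) limits.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -2.0])   # unconstrained optimum at (1, 1)
x_star = neurodynamic_qp(Q, c, lo=0.0, hi=0.5)
```

The dynamics settle on the constrained optimum `(0.5, 0.5)`, the projection of the unconstrained minimizer onto the feasible box.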
Denoising is one of the most fundamental and important problems in signal processing, and graph signal denoising methods have been actively studied. Several graph signal denoising methods based on mathematical programming require solving linear equations involving the graph Laplacian matrix, which creates problems with computational accuracy and running time. This study proposes a fast and accurate method of solving these linear equations for denoising, based on the fast graph Fourier transform. Moreover, the proposed method can perform denoising not only on graphs for which the fast graph Fourier transform can be performed, but also on a wide class of graphs satisfying more relaxed conditions, without loss of accuracy. Experiments demonstrate the efficiency of the proposed method and confirm that denoising can be performed up to 167.3 times faster without loss of accuracy.
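The linear systems in question typically arise from Tikhonov-style graph denoising, where the solution of `(I + γL)x = y` can be computed as a per-frequency filter in the graph Fourier domain. The sketch below shows that standard reduction; a dense eigendecomposition stands in for the fast transform, and the example graph and signal are invented.

```python
import numpy as np

def gft_denoise(y, L, gamma=1.0):
    """Tikhonov graph-signal denoising: x* = argmin ||x - y||^2 + gamma*x'Lx,
    i.e. the solution of (I + gamma*L) x = y, computed in the graph Fourier
    domain instead of by a general linear solve."""
    lam, U = np.linalg.eigh(L)            # graph Fourier basis (eigenvectors of L)
    y_hat = U.T @ y                       # forward GFT
    x_hat = y_hat / (1.0 + gamma * lam)   # per-frequency low-pass filter
    return U @ x_hat                      # inverse GFT

# 4-node path graph Laplacian
L = np.array([[ 1, -1,  0,  0],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [ 0,  0, -1,  1]], dtype=float)
y = np.array([1.0, 0.2, 0.9, 0.1])
x = gft_denoise(y, L, gamma=0.5)
```

The spectral solution agrees with solving `(I + γL)x = y` directly, which is why replacing the dense transform with a fast GFT yields the same answer faster.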
While foundation models have been exploited for various expert tasks through fine-tuning, any foundation model will become outdated due to its old knowledge or limited capability. Thus the underlying foundation model ...
This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speaker’s speech while identifying whether they are target or non-target speakers from multi-talker overlapped spee...
This paper proposes a knowledge distillation method for an external bidirectional language model trained by masked language modeling, to achieve high accuracy in scene text recognition. In Asian languages such as Japanese, text recognition must be performed in units of multiple words or sentences rather than individual words because words are not separated by spaces, so high-level linguistic knowledge is needed to recognize text correctly. To enhance linguistic knowledge, several methods that use an external language model have been proposed, but these methods fail to adequately consider future context because they revise the text candidates yielded by autoregressive text recognition models, which consider mainly past context. To overcome this deficiency, our key idea is to enhance a text recognition model by utilizing the knowledge of an external bidirectional language model trained by masked language modeling, which reflects not only past but also future context. To actively consider future context in text recognition, our proposed method introduces a distillation loss term that makes the output probability of the text recognition model closer to that of the bidirectional language model. Experiments on Japanese scene text recognition demonstrate the effectiveness of the proposed method.
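A distillation loss of the kind described can be sketched as the usual cross-entropy to the ground-truth labels plus a KL term pulling the recognizer's per-position output distribution toward the bidirectional language model's. This is a generic sketch; the `alpha` weighting and the exact combination are assumptions, not the paper's formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(rec_logits, blm_probs, labels, alpha=0.5, eps=1e-12):
    """Cross-entropy to ground truth plus a KL term toward the bidirectional
    LM's distribution. `alpha` is an assumed weighting hyperparameter."""
    p = softmax(rec_logits)                                   # (T, V) recognizer probs
    ce = -np.log(p[np.arange(len(labels)), labels] + eps).mean()
    kl = (blm_probs * (np.log(blm_probs + eps) - np.log(p + eps))).sum(-1).mean()
    return ce + alpha * kl

rng = np.random.default_rng(0)
T, V = 5, 10                       # sequence length, vocabulary size (toy)
logits = rng.normal(size=(T, V))   # recognizer outputs
teacher = softmax(rng.normal(size=(T, V)))  # bidirectional-LM distribution
labels = rng.integers(0, V, size=T)
loss = distill_loss(logits, teacher, labels)
```

Minimizing the KL term drives the recognizer's distribution toward the teacher's, which is how the future-context knowledge of the masked LM reaches the autoregressive recognizer.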
Training deep neural networks (DNNs) is computationally expensive, which is problematic especially when performing duplicated or similar training runs in model ensemble or fine-tuning pre-trained models, for example. ...
This paper presents a novel method for online domain adaptation (OnDA) for DEtection TRansformer (DETR)-based object detection models, called OnDA-DETR. OnDA is a domain adaptation paradigm that adapts a model trained on source domain data to perform well on the target domain in an online manner during testing, using only the unlabeled test data from the target domain. Owing to its challenging and realistic problem setting, OnDA has garnered significant attention. However, OnDA methods for DETR-based models, which have demonstrated excellent performance in object detection research, had not previously been developed. OnDA-DETR is the first OnDA method specifically designed for DETR-based models. It incorporates a self-training framework that generates pseudo-labels for the unlabeled target domain data. To effectively incorporate the self-training framework into DETR-based models, we leverage recall-aware pseudo-labeling and quality-aware training. Experimental results indicate that OnDA-DETR improves the performance of the source-trained model by about 3.0 percentage points through OnDA.
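The self-training loop described above rests on turning a detector's own outputs into training targets. A generic sketch of that step is below: detections above a deliberately low, recall-oriented threshold are kept as pseudo-labels, and each score is carried along as a quality weight for the loss. The threshold value and weighting scheme are illustrative assumptions, not OnDA-DETR's actual criteria.

```python
import numpy as np

def make_pseudo_labels(scores, boxes, thresh=0.3):
    """Keep detections above a low (recall-oriented) score threshold as
    pseudo-labels; reuse each score as a per-box quality weight so that
    less confident pseudo-labels contribute less to the training loss."""
    keep = scores >= thresh
    return boxes[keep], scores[keep]

# Toy detector outputs on one unlabeled target-domain image.
scores = np.array([0.9, 0.45, 0.2, 0.35])
boxes = np.array([[0, 0, 10, 10],
                  [5, 5, 20, 20],
                  [1, 1, 2, 2],
                  [3, 3, 9, 9]])
pl_boxes, weights = make_pseudo_labels(scores, boxes)
```

A low threshold favors recall (few missed objects become "background" targets), while the quality weights keep noisy low-score pseudo-labels from dominating the update.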