ISBN:
(Print) 9781510872219
Visual information can be incorporated into automatic speech recognition (ASR) systems to improve their robustness in adverse acoustic conditions. Conventional audio-visual speech recognition (AVSR) systems require highly specialized audio-visual (AV) data in both system training and evaluation. For many real-world speech recognition applications, only audio information is available. This presents a major challenge to wider application of AVSR systems. To address this challenge, this paper proposes a semi-supervised visual feature learning approach for developing AVSR systems on a DARPA GALE Mandarin broadcast transcription task. Audio-to-visual feature inversion long short-term memory neural networks (LSTMs) were initially constructed using limited amounts of out-of-domain AV data. The acoustic feature domain mismatch against the broadcast data was further reduced using multi-level domain adaptive deep networks. Visual features were then automatically generated from the broadcast speech audio and used at both AVSR system training and testing time. Experimental results suggest that a CNN based AVSR system using the proposed semi-supervised cross-domain audio-to-visual feature generation technique outperformed the baseline audio-only CNN ASR system by an average relative CER reduction of 6.8%. In particular, on the most difficult Phoenix TV subset, a CER reduction of 1.32% absolute (8.34% relative) was obtained.
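The core of this pipeline is a regression network that maps acoustic features to visual features so that no camera stream is needed at recognition time. Below is a minimal PyTorch sketch of such an audio-to-visual inversion LSTM; the feature dimensions, layer sizes, and MSE training objective are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AudioToVisualLSTM(nn.Module):
    """Regresses per-frame visual features from acoustic features.
    Dimensions (40-d acoustic in, 32-d visual out) are assumed."""
    def __init__(self, audio_dim=40, visual_dim=32, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, visual_dim)

    def forward(self, audio):            # audio: (batch, frames, audio_dim)
        out, _ = self.lstm(audio)
        return self.proj(out)            # (batch, frames, visual_dim)

# One training step on parallel AV data (synthetic tensors stand in for
# the limited out-of-domain AV corpus described in the abstract).
model = AudioToVisualLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio = torch.randn(8, 100, 40)          # acoustic feature frames
visual = torch.randn(8, 100, 32)         # time-aligned visual features
opt.zero_grad()
loss = nn.functional.mse_loss(model(audio), visual)
loss.backward()
opt.step()
```

Once trained, the same network generates pseudo visual features directly from the broadcast audio for both AVSR training and testing.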
ISBN:
(Print) 9798350349405; 9798350349399
Deep learning for automated cell imaging analysis has become a tool of choice for processing large amounts of data. However, many of these methods lack explainability, slowing their deployment for tasks such as diagnosis. We present a prototype-based framework for analyzing structural changes that addresses the specific challenges of explainability in the context of cell imaging. Our method relies on classification between two distinct cell populations in a weakly supervised context where no label for individual cells is available. Our model extracts typical features from each population, representing intra-cellular structure, and explains its classification decision by creating visualizations of the local textures corresponding to the structures of interest. We show a real application where it effectively highlights a change in the organization of the actin content of the cells.
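As a rough illustration of a prototype-based decision rule of this kind, the sketch below scores an image embedding by its distance to learned per-population prototypes; the embedding dimension, the number of prototypes per class, and the nearest-prototype scoring are assumptions, since the paper's exact architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """Two populations, k learned prototypes each; a sample is scored
    by its distance to the nearest prototype of every class."""
    def __init__(self, feat_dim=128, k=5, n_classes=2):
        super().__init__()
        self.n_classes = n_classes
        self.prototypes = nn.Parameter(torch.randn(n_classes * k, feat_dim))

    def forward(self, feats):                         # (batch, feat_dim)
        d = torch.cdist(feats, self.prototypes)       # (batch, n_classes*k)
        d = d.view(feats.size(0), self.n_classes, -1) # (batch, class, k)
        return -d.min(dim=2).values                   # higher = closer

scores = PrototypeClassifier()(torch.randn(4, 128))   # (4, 2) class scores
```

Because each prototype lives in the same space as the image features, the local regions most similar to a prototype can be rendered back as texture visualizations, which is the kind of explanation the abstract describes.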
ISBN:
(Print) 9781450379885
There is a huge market demand for searching for products by image on e-commerce sites. Visual features play the most important role in solving this content-based image retrieval task. Most existing methods leverage models pre-trained on other large-scale datasets with well-annotated labels, e.g. the ImageNet dataset, to extract visual features. However, due to the large difference between product images and the images in ImageNet, a feature extractor trained on ImageNet is not efficient at extracting the visual features of product images. Retraining the feature extractor on the product images, meanwhile, faces the dilemma of lacking annotated labels. In this paper, we utilize easily accessible text information, namely the product title, as a supervision signal to learn features of the product image. Specifically, we use the n-grams extracted from the product title as the label of the product image to construct a dataset for image classification. This dataset is then used to fine-tune a pre-trained model. Finally, the basic max-pooling activation of convolutions (MAC) feature is extracted from the fine-tuned model. As a result, we achieve fourth place in the Grand Challenge of AI Meets Beauty at 2020 ACM Multimedia using only a single ResNet-50 model without any human annotations or pre-processing or post-processing tricks. Our code is available at: https://***/FanpdangFeng/AI-Meets-Beauty-2020.
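A hedged sketch of the two ingredients described above: n-gram labels mined from product titles, and MAC extraction from a ResNet-50. The n-gram size, the frequency threshold for keeping an n-gram as a class, and the use of torchvision are illustrative choices, not the authors' exact setup.

```python
from collections import Counter
import torch
import torchvision.models as models

def title_ngrams(title, n=2):
    """Word n-grams from a product title, used as weak class labels."""
    words = title.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Keep only n-grams frequent enough to act as classes (threshold assumed).
titles = ["matte lipstick long lasting", "long lasting eye liner"]
counts = Counter(g for t in titles for g in title_ngrams(t))
label_set = sorted(g for g, c in counts.items() if c >= 2)

# MAC descriptor: global max-pool over the last convolutional feature map.
resnet = models.resnet50(weights=None)   # load the fine-tuned weights in practice
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc

def mac_feature(images):                 # images: (batch, 3, H, W)
    fmap = backbone(images)              # (batch, 2048, h, w)
    return torch.amax(fmap, dim=(2, 3))  # (batch, 2048) MAC descriptor
```

Retrieval then reduces to nearest-neighbor search over (typically L2-normalized) MAC descriptors of the catalog images.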
Aiming to improve the performance of visual classification in a cost-effective manner, this paper proposes an incremental semi-supervised learning paradigm called deep co-space (DCS). Unlike many conventional semi-supervised learning methods, which are usually performed within a fixed feature space, our DCS gradually propagates information from labeled samples to unlabeled ones along with deep feature learning. We regard deep feature learning as a series of steps pursuing feature transformation, i.e., projecting the samples from a previous space into a new one, which tends to select the reliable unlabeled samples with respect to this setting. Specifically, for each unlabeled image instance, we measure its reliability by calculating the category variations of feature transformation from two different neighborhood variation perspectives and merging them into a unified sample mining criterion derived from the Hellinger distance. Then, those samples that keep a stable correlation to their neighboring samples (i.e., have small category variation in distribution) across successive feature space transformations automatically receive labels and are incorporated into the model for incremental training in terms of classification. Our extensive experiments on standard image classification benchmarks (e.g., Caltech-256 and SUN-397) demonstrate that the proposed framework is capable of effectively mining from large-scale unlabeled images, which boosts image classification performance and achieves promising results compared with other semi-supervised learning methods.
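The sample-mining criterion can be illustrated with a small sketch: for one unlabeled sample, compare the category distribution of its labeled neighbors in two successive feature spaces via the Hellinger distance, H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2, and accept a pseudo-label only when the distance is small. The neighborhood size, threshold, and use of scikit-learn are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_class_dist(feats, labels, x, n_classes, k=10):
    """Category distribution over the k labeled neighbors of x
    in one feature space."""
    knn = NearestNeighbors(n_neighbors=k).fit(feats)
    _, idx = knn.kneighbors(x[None, :])
    counts = np.bincount(labels[idx[0]], minlength=n_classes)
    return counts / counts.sum()

def hellinger(p, q):
    """H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2, in [0, 1]."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def is_reliable(p_old, p_new, threshold=0.2):
    """Accept the sample if its neighborhood category distribution is
    stable across the feature transformation (threshold is assumed)."""
    return hellinger(p_old, p_new) < threshold
```

Samples passing this test receive the majority label of their neighbors and are folded into the training set, after which the feature space is updated and the mining step repeats.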