ISBN (digital): 9783030012649
ISBN (print): 9783030012649; 9783030012632
Video summarization is a challenging, under-constrained problem because the summary of a video depends strongly on users' subjective understanding. Data-driven approaches such as deep neural networks can deal with the ambiguity inherent in this task to some extent, but acquiring temporal annotations for a large-scale video dataset is extremely expensive. To leverage plentiful web-crawled videos to improve video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge benchmark data and web data. Specifically, our framework couples two components: a variational autoencoder for learning latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of the raw video and summary generation. We present a loss term that learns the semantic matching between generated summaries and web videos, and formulate the overall framework as a unified conditional variational encoder-decoder, called the variational encoder-summarizer-decoder (VESD). Experiments on the challenging CoSum and TVSum datasets demonstrate the superior performance of VESD over existing state-of-the-art methods. The source code of this work can be found at https://***/cssjcai/vesd.
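Both components of VESD sample latent codes during training, which any VAE makes differentiable via the reparameterization trick. A minimal numpy sketch of that trick (illustrative names, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), keeping the sample
    differentiable with respect to the encoder outputs mu and log_var."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([1.0, -2.0])
log_var = np.array([-100.0, -100.0])  # near-zero variance: z collapses to mu
z = reparameterize(mu, log_var, rng)
```

With non-degenerate variances the same call yields stochastic codes whose randomness is isolated in `eps`, which is what lets gradients flow through the sampling step.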
Authors: Wang, Heng; Qin, Zengchang; Wan, Tao
Affiliations: Beihang Univ, Sch ASEE, Intelligent Comp & Machine Learning Lab, Beijing 100191, Peoples R China; Beihang Univ, Beijing Adv Innovat Ctr Biomed Engn, Sch Biol Sci & Med Engn, Beijing 100191, Peoples R China
ISBN (print): 9783319930374; 9783319930367
In this paper, we propose a model that uses a generative adversarial net (GAN) to generate realistic text. Instead of a standard GAN, we combine a variational autoencoder (VAE) with the adversarial setup. The high-level latent random variables help the model learn the data distribution and mitigate the tendency of GANs to emit very similar samples. We propose the VGAN model, in which the generative model is composed of a recurrent neural network and a VAE, and the discriminative model is a convolutional neural network. We train the model via policy gradient. We apply the proposed model to the task of text generation and compare it to other recent neural-network-based models, such as a recurrent neural network language model and Seq-GAN. We evaluate performance by negative log-likelihood and BLEU score. Experiments on three benchmark datasets show that our model outperforms the previous models.
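The BLEU metric used above combines clipped n-gram precisions with a brevity penalty. A self-contained sentence-level sketch (the paper's exact BLEU configuration is not stated; `max_n=2` here is an illustrative choice):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions,
    scaled by a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

score = bleu("the cat sat on the mat".split(), "the cat sat on the mat".split())
```

Production evaluations typically use corpus-level BLEU with smoothing over multiple references; this per-sentence form just makes the clipping and brevity-penalty mechanics explicit.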
ISBN (print): 9781510872219
We study cross-lingual voice conversion with non-parallel speech corpora in a one-shot learning setting. Most prior work requires either parallel speech corpora or a sufficient amount of training data from the target speaker. In contrast, we convert arbitrary sentences from an arbitrary source speaker into the target speaker's voice given only one training utterance from the target speaker. To achieve this, we formulate the problem as learning disentangled speaker-specific and context-specific representations, following the idea of [1], which uses a factorized hierarchical variational autoencoder (FHVAE). After training the FHVAE on multi-speaker data, given utterances from arbitrary source and target speakers, we estimate their latent representations and reconstruct the desired utterance in the target speaker's voice. We investigate the effectiveness of the approach through voice conversion experiments with varying numbers of training utterances; it achieves reasonable performance with even a single training utterance. We also examine the speech representation and show that the World vocoder outperforms the short-time Fourier transform (STFT) used in [1]. Finally, in subjective tests of both same-language and cross-lingual voice conversion, our approach achieved significantly better or comparable results relative to the VAE-STFT and GMM baselines in speech quality and similarity.
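The conversion step can be pictured as recombining latents: keep the source utterance's segment-level (content) codes and swap in the target's sequence-level (speaker) code. A deliberately toy numpy illustration, in which the per-utterance mean stands in for the FHVAE's sequential latent (this is not the actual FHVAE inference):

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(frames):
    # Toy disentanglement: the utterance mean plays the role of the
    # sequence-level (speaker) latent, the residual the segment-level
    # (content) latent.
    z_speaker = frames.mean(axis=0)
    z_content = frames - z_speaker
    return z_speaker, z_content

def decode(z_speaker, z_content):
    return z_content + z_speaker

content = rng.standard_normal((20, 4))
source_utt = content + 1.0                       # source speaker: constant offset 1
target_utt = rng.standard_normal((20, 4)) + 5.0  # target speaker: constant offset 5

z_spk_src, z_cnt_src = encode(source_utt)
z_spk_tgt, _ = encode(target_utt)
converted = decode(z_spk_tgt, z_cnt_src)         # source content, target "voice"
```

In the real system both latents are posterior means from trained inference networks and the decoder is a neural network over vocoder features, but the recombination logic is the same.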
ISBN (print): 9781538643341
In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID). An FHVAE can learn a latent space that separates the more static attributes within an utterance from the more dynamic attributes by encoding them into two different sets of latent variables. Useful factors for dialect identification, such as phonetic or linguistic content, are encoded by a segmental latent variable, while irrelevant factors that are relatively constant within a sequence, such as channel or speaker information, are encoded by a sequential latent variable. This disentanglement makes the segmental latent variable less susceptible to channel and speaker variation, and thus reduces degradation from channel-domain mismatch. We demonstrate that on fully supervised DID tasks, an end-to-end model trained on features extracted from the FHVAE achieves the best performance, compared to the same model trained on conventional acoustic features and to an i-vector based system. Moreover, we show that the proposed approach can leverage a large amount of unlabeled data for FHVAE training to learn domain-invariant features for DID, significantly improving performance in a low-resource condition where labels for the in-domain data are not available.
In this paper, we present an incremental learning framework for efficient and accurate facial performance tracking. Our approach alternates between the modeling step, which takes tracked meshes and texture maps to train our deep learning-based statistical model, and the tracking step, which takes the geometry and texture predictions our model infers from measured images and optimizes the predicted geometry by minimizing image, geometry, and facial landmark errors. Our Geo-Tex VAE model extends the convolutional variational autoencoder to face tracking, and jointly learns and represents deformations and variations in geometry and texture from tracked meshes and texture maps. To accurately model variations in facial geometry and texture, we introduce a decomposition layer in the Geo-Tex VAE architecture that decomposes the facial deformation into global and local components. We train the global deformation with a fully-connected network and the local deformations with convolutional layers. Although this model runs on each frame independently, thereby enabling a high degree of parallelization, we validate that our framework achieves sub-millimeter accuracy on synthetic data and outperforms existing methods. We also qualitatively demonstrate high-fidelity, long-duration facial performance tracking on several actors.
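The global/local split of a deformation has a simple geometric analogue: fit the best global affine map from rest to deformed vertices and treat the residual as the local part. The paper's decomposition layer is learned (fully-connected plus convolutional branches), so the least-squares version below is only an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(4)

def decompose_deformation(rest, deformed):
    """Split vertex positions into a global affine component (least-squares
    fit) and a local residual component."""
    X = np.hstack([rest, np.ones((len(rest), 1))])    # homogeneous rest coords
    A, *_ = np.linalg.lstsq(X, deformed, rcond=None)  # best-fit affine map
    global_part = X @ A
    local_part = deformed - global_part
    return global_part, local_part

rest = rng.standard_normal((100, 3))
true_affine = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
deformed = rest @ true_affine + 0.5   # a purely global (affine) deformation
g_part, l_part = decompose_deformation(rest, deformed)
```

For a purely affine deformation the local residual vanishes; real facial expressions leave a nonzero residual, which is what the convolutional branch is meant to capture.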
Deep generative adversarial networks (GANs) are an emerging technology in drug discovery and biomarker development. In our recent work, we demonstrated a proof of concept of implementing a deep generative adversarial autoencoder (AAE) to identify new molecular fingerprints with predefined anticancer properties. Another popular generative model is the variational autoencoder (VAE), which is also based on deep neural architectures. In this work, we developed an advanced AAE model for molecular feature extraction and demonstrated its advantages over the VAE in terms of (a) adjustability in generating molecular fingerprints; (b) capacity to process very large molecular data sets; and (c) efficiency in unsupervised pretraining for regression models. Our results suggest that the proposed AAE model significantly enhances the capacity and efficiency of developing new molecules with specific anticancer properties using deep generative models.
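The structural difference between the two models lies in how the latent distribution is regularised: a VAE uses a closed-form KL term, while an AAE trains a discriminator to make the aggregated posterior indistinguishable from the prior. A minimal numpy sketch of the AAE latent losses, using a fixed linear discriminator as a placeholder (the paper's actual architecture is not specified here):

```python
import numpy as np

rng = np.random.default_rng(5)

def bce(logits, label):
    """Binary cross-entropy of a sigmoid discriminator vs. a constant label."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(label * np.log(p + 1e-9) + (1 - label) * np.log(1 - p + 1e-9))

def aae_latent_losses(z_posterior, z_prior, w, b):
    # Discriminator: tell prior samples (label 1) from encoder samples (label 0).
    d_loss = bce(z_prior @ w + b, 1.0) + bce(z_posterior @ w + b, 0.0)
    # Encoder regulariser: make posterior samples fool the discriminator.
    g_loss = bce(z_posterior @ w + b, 1.0)
    return d_loss, g_loss

z_prior = rng.standard_normal((64, 2))       # samples from p(z) = N(0, I)
z_post = rng.standard_normal((64, 2)) + 3.0  # aggregated posterior, off-prior
d_loss, g_loss = aae_latent_losses(z_post, z_prior, np.array([0.5, 0.5]), 0.0)
```

In training, the two losses are minimised alternately with respect to discriminator and encoder parameters; that adversarial matching is what gives the AAE its extra adjustability over a fixed KL penalty.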
Electrical distribution networks are constantly ageing worldwide, so the probability of cable faults increases over time. Fast recovery of damaged networks is of vital importance, and quick, automatic identification of the failure source may help to promptly restore the functionality of the network. The scenario we consider is a vast number of recording devices spread across a network, constantly monitoring low-voltage cables. When the current in a cable reaches a very high value, the data is sent to a central server, which analyses it with a variant of a variational autoencoder (VAE), a deep neural network. This VAE has been trained on historical data from several hundred recorded faults, of which only a handful have been labelled by on-site analysis. The training data is simply the recorded voltage and current levels after a simple pre-processing step. The final goal is to let the network distinguish whether the fault occurred at a point along the cable, at a joint, or at the pot-end located at the termination. A preliminary evaluation of its ability to generalise to the non-labelled samples shows encouraging results.
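With only a handful of labelled faults, one simple way to exploit a trained VAE is to classify unlabelled recordings by proximity in latent space. A toy numpy sketch (the authors' actual classifier is not specified; the three clusters, labels, and 2-D latent space here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend latent codes from a trained VAE: three fault types cluster apart.
centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
z = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centres])

# One labelled example per class, mirroring the scarce on-site labels.
labelled_z = centres
fault_names = ["cable", "joint", "pot-end"]

def classify(zi):
    """Assign the label of the nearest labelled latent code: a hedged
    stand-in for whatever classifier is actually run on the VAE latents."""
    return fault_names[int(np.argmin(np.linalg.norm(labelled_z - zi, axis=1)))]

preds = [classify(zi) for zi in z]
```

The point of the sketch is only that a well-shaped latent space lets a near-trivial classifier separate the three fault locations from very few labels.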
ISBN (print): 9781510833135
Latent generative models can learn higher-level underlying factors from complex data in an unsupervised manner. Such models can be used in a wide range of speech processing applications, including synthesis, transformation and classification. While there have been many advances in this field in recent years, the application of the resulting models to speech processing tasks is generally not explicitly considered. In this paper we apply the variational autoencoder (VAE) to the task of modeling frame-wise spectral envelopes. The VAE model has many attractive properties, such as continuous latent variables, a prior probability over these latent variables, a tractable lower bound on the marginal log likelihood, both generative and recognition models, and end-to-end training of deep models. We consider different aspects of training such models for speech data and compare them to more conventional models such as the restricted Boltzmann machine (RBM). While evaluating generative models is difficult, we try to obtain a balanced picture by considering both reconstruction error and performance on a series of modeling and transformation tasks, to get an idea of the quality of the learned features.
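The "tractable lower bound on the marginal log likelihood" can be checked exactly on a toy linear-Gaussian model, where both the ELBO and the true marginal have closed forms. A sketch (the model is chosen purely for tractability, not taken from the paper):

```python
import math

def elbo(x, mu_q, var_q):
    """ELBO for the toy model p(z) = N(0,1), p(x|z) = N(z,1), with
    approximate posterior q(z|x) = N(mu_q, var_q); both terms are exact."""
    expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu_q) ** 2 + var_q)
    kl = 0.5 * (var_q + mu_q ** 2 - 1.0 - math.log(var_q))
    return expected_loglik - kl

def true_log_marginal(x):
    # Marginalising z gives x ~ N(0, 2).
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

x = 1.3
tight = elbo(x, x / 2.0, 0.5)   # exact posterior N(x/2, 1/2): bound is tight
loose = elbo(x, 0.0, 1.0)       # prior used as posterior: strictly looser
```

With the exact posterior the bound touches the true log marginal; any other choice of `q` leaves a gap equal to KL(q || p(z|x)), which is what the recognition network is trained to shrink.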
ISBN (print): 9781538646595
Modeling the speech generation process can provide flexible and interpretable ways to generate intended synthetic speech. In this paper, we present a deep generative model of fundamental frequency (F_0) contours of normal speech and singing voices. The proposed model 1) accurately decomposes an F_0 contour into the sum of the phrase and accent components of the Fujisaki model, a mathematical model describing the control mechanism of vocal fold vibration, without an iterative algorithm, and 2) can represent and generate F_0 contours of both normal speech and singing voices reasonably well.
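The phrase/accent decomposition follows the standard Fujisaki formulation: in the log-F_0 domain, the contour is a baseline plus second-order filter responses to phrase and accent commands. A numpy sketch with illustrative command timings, amplitudes, and filter constants (not values from the paper):

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Phrase command response Gp(t) = alpha^2 * t * exp(-alpha*t), t >= 0."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * np.clip(t, 0, None)), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Accent command response Ga(t) = min(1 - (1 + beta*t) exp(-beta*t), gamma)."""
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * np.clip(t, 0, None))
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def log_f0(t, fb=120.0, phrases=((0.0, 0.5),), accents=((0.3, 0.8, 0.4),)):
    """ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
                        + sum_j Aa_j * [Ga(t - T1_j) - Ga(t - T2_j)]"""
    y = np.full_like(t, np.log(fb))
    for t0, ap in phrases:
        y += ap * phrase_component(t - t0)
    for t1, t2, aa in accents:
        y += aa * (accent_component(t - t1) - accent_component(t - t2))
    return y

t = np.linspace(0, 2, 200)
f0 = np.exp(log_f0(t))
```

The proposed generative model infers the command amplitudes and timings from an observed contour; this forward sketch only shows what the decomposition it targets looks like.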
Modeling musical timbre is critical for various music information retrieval (MIR) tasks. This work addresses the classification of playing techniques, which involves extremely subtle variations in timbre between categories. A deep collaborative learning framework is proposed to represent music with greater discriminative power than previously achieved. First, a novel variational autoencoder (VAE) is developed to eliminate the variation of acoustic features within a class. Second, a Gaussian process classifier is jointly learned to distinguish the timbre variations between classes, which increases the discriminative power of the learned representations. We derive a new lower bound that guides the VAE-based representation learning. Experiments were conducted on a database of seven classes of guitar playing techniques; the results demonstrate that the proposed method outperforms the baselines in terms of F1-score and accuracy.
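The F1-score reported above is the harmonic mean of precision and recall per class; for a seven-class task it is commonly averaged over classes (the paper's exact averaging convention is not stated, so macro averaging and the class names below are assumptions). A minimal sketch:

```python
def f1_score(y_true, y_pred, positive):
    """Per-class F1: harmonic mean of precision and recall for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical labels for two guitar techniques, for illustration only.
score = macro_f1(["bend", "bend", "slide", "slide"],
                 ["bend", "slide", "slide", "slide"])
```

Macro averaging weights all seven technique classes equally, which matters when the database is imbalanced across techniques.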