Search Results - Inner Mongolia University Library

Interspeech Conference

Authors: Tuan Vu Ho; Quoc Huy Nguyen; Akagi, Masato; Unoki, Masashi. Affiliation: Japan Adv Inst Sci & Technol, Nomi, Japan

Speech-enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of the cIRM poses difficulties when it comes to effectively training a neural network. To alleviate this problem, this paper proposes a phase-aware speech-enhancement method that estimates the magnitude and phase of a complex adaptive Wiener filter. With this method, a noise-robust vector-quantized variational autoencoder is used for estimating the magnitude of the Wiener filter by using the Itakura-Saito divergence in the time-frequency domain, while the phase of the Wiener filter is estimated by a convolutional recurrent network under the scale-invariant signal-to-noise-ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech-enhancement methods and achieved a Perceptual Evaluation of Speech Quality score of 2.85 and Short-Time Objective Intelligibility score of 0.94, which is better than the state-of-the-art method based on cIRM estimation during the 2020 Deep Noise Challenge.
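As a rough illustration of the Itakura-Saito divergence mentioned in the abstract above, the following sketch computes it between two power spectra. The function name, the averaging over bins, and the small `eps` guard are assumptions made for this example; the abstract does not specify how the authors implement the loss.

```python
import math


def itakura_saito(p_true, p_est, eps=1e-12):
    """Mean Itakura-Saito divergence between two power spectra.

    p_true, p_est: equal-length sequences of non-negative power values
    (for instance, flattened time-frequency bins). eps is an assumed
    guard against division by zero and log(0).
    """
    if len(p_true) != len(p_est) or not p_true:
        raise ValueError("spectra must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(p_true, p_est):
        ratio = (p + eps) / (q + eps)
        # Per-bin term: P/Q - log(P/Q) - 1, which is zero only when P == Q.
        total += ratio - math.log(ratio) - 1.0
    return total / len(p_true)
```

The divergence is not symmetric: underestimating the power in a bin is penalized more heavily than overestimating it by the same factor, which is one reason it is often used with spectral power estimates.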

Keywords: Speech enhancement; vector-quantized variational autoencoder; complex Wiener filter; noise reduction

CRANK: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Kobayashi, Kazuhiro; Huang, Wen-Chin; Wu, Yi-Chiao; Tobing, Patrick Lumban; Hayashi, Tomoki; Toda, Tomoki. Affiliations: Nagoya Univ, Nagoya, Aichi, Japan; TARVO Inc, Tokyo, Japan

ISBN: (print) 9781728176055

In this paper, we present an open-source software for developing a nonparallel voice conversion (VC) system named crank. Although we have released an open-source VC software based on the Gaussian mixture model named sprocket in the last VC Challenge, it is not straightforward to apply any speech corpus because it is necessary to prepare parallel utterances of source and target speakers to model a statistical conversion function. To address this issue, in this study, we developed a new open-source VC software that enables users to model the conversion function by using only a nonparallel speech corpus. For implementing the VC software, we used a vector-quantized variational autoencoder (VQVAE). To rapidly examine the effectiveness of recent technologies developed in this research field, crank also supports several representative works for autoencoder-based VC methods such as the use of hierarchical architectures, cyclic architectures, generative adversarial networks, speaker adversarial training, and neural vocoders. Moreover, it is possible to automatically estimate objective measures such as mel-cepstrum distortion and pseudo mean opinion score based on MOSNet. In this paper, we describe representative functions developed in crank and make brief comparisons by objective evaluations.
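To make the mel-cepstrum distortion measure named in the abstract above more concrete, here is a minimal sketch of the commonly used formula. It assumes the two sequences are already time-aligned frame by frame and that the 0th (energy) coefficient is excluded; both are assumptions for this example, and the function below is hypothetical rather than part of crank's actual API.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average mel-cepstral distortion in dB over time-aligned frames.

    ref_frames, conv_frames: lists of frames, each frame a list of
    mel-cepstral coefficients. The 0th coefficient is skipped
    (an assumption; conventions differ between toolkits).
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("frame sequences must be non-empty and aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        # Squared Euclidean distance over coefficients 1..D.
        squared = sum((a - b) ** 2 for a, b in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)
```

Lower values indicate that the converted speech is spectrally closer to the target; identical sequences give 0 dB.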

Keywords: voice conversion; open-source software; vector-quantized variational autoencoder; nonparallel; neural vocoder

EM-LAST: Effective Multidimensional Latent Space Transport for an Unpaired Image-to-Image Translation With an Energy-Based Model

IEEE ACCESS, 2022, vol. 10, pp. 72839-72849

Authors: Han, Giwoong; Min, Jinhong; Han, Sung Won. Affiliation: Korea Univ, Sch Ind & Management Engn, Seoul 02841, South Korea

For an unpaired image-to-image translation to work effectively, the latent space of each image domain must be well-designed. The codes of each style must be translated toward the target while preserving the parts corresponding to the source content. In general, most variational autoencoder (VAE)-based models use a one-dimensional latent space. However, to apply high dimensional methodologies such as vector quantization, controlling a multidimensional latent space is necessary. In this study, among the VAE-based models that use relatively complex multidimensional latent spaces, we apply an Energy-Based Model and vector-quantized VAE v2, with the latter as the main model. We show that among the latent spaces that represent each image domain, the importance of each feature at the top and bottom latent spaces must be interpreted differently for appropriate translation. Therefore, we argue that simply understanding the features of latent space composition well can show effective image translation results. We also present various analyses and visual outcomes of multidimensional latent space transport.

Keywords: Task analysis; Aerospace electronics; Visualization; Licenses; Generative adversarial networks; Deep learning; Decoding; Energy-based model; image-to-image translation; Langevin dynamics; multidimensional latent space; vector-quantized variational autoencoder

Phase-Aware Speech Enhancement With Complex Wiener Filter

IEEE ACCESS, 2023, vol. 11, pp. 141573-141584

Authors: Nguyen, Huy; Ho, Tuan Vu; Akagi, Masato; Unoki, Masashi. Affiliations: Japan Adv Inst Sci & Technol (JAIST), Grad Sch Adv Sci & Technol, Nomi, Ishikawa 9231292, Japan; Hitachi Ltd, Adv Artificial Intelligent Innovat Ctr, Media Intelligent Proc Research Dept, Tokyo 1858601, Japan

In speech enhancement, accurate phase reconstruction can significantly improve speech quality. While phase-aware speech enhancement methods using the complex ideal ratio mask (cIRM) have shown promise, the estimation difficulty of the phase is shared with the real and imaginary parts of the cIRM. The lack of pattern in the imaginary part poses particular difficulties. To address this issue, we propose a phase-aware speech enhancement method that uses a complex Wiener filter, which delegates the estimation of speech and noise amplitude properties and the phase property to different models, mitigating the issues with the cIRM and improving the effectiveness of neural-network training. Our method uses a speech-variance estimation model with a noise-robust vector-quantized variational autoencoder and a phase corrector that maximizes the scale-invariant signal-to-noise ratio in the time domain. To further improve speech-variance estimation, we propose a loss function that uses a categorical distribution of fundamental frequency (F0) for enhancing the spectral fine structure of estimated speech variance. We evaluated our method on the open dataset released by Valentini et al. to directly compare it with other speech-enhancement methods. Our method achieved a perceptual evaluation of speech quality score of 2.86 and short-time objective intelligibility score of 0.94, better than the state-of-the-art method based on cIRM estimation during the 2020 Deep Noise Challenge. Our comprehensive analysis shows that incorporating the proposed loss function for spectral-fine-structure enhancement improves speech quality, especially when the F0 is low.
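The scale-invariant signal-to-noise ratio (SI-SNR) that the phase corrector in the abstract above maximizes can be sketched with the standard time-domain definition. This is a plain-Python illustration, not the authors' training code; the mean removal and the `eps` constant are common conventions assumed here.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("waveforms must be non-empty and of equal length")
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    # Remove the DC offset from both signals (assumed convention).
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, so the measure reflects waveform shape (including phase) rather than loudness.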

Keywords: Speech enhancement; complex Wiener filter; vector-quantized variational autoencoder; noise reduction; spectral fine structure enhancement; F0 distribution

End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

IEEE ACCESS, 2021, vol. 9, pp. 55144-55154

Authors: Effendi, Johanes; Sakti, Sakriani; Nakamura, Satoshi. Affiliations: Nara Inst Sci & Technol, Ikoma 6300192, Japan; RIKEN, Ctr Adv Intelligence Project AIP, Tokyo 1030027, Japan

Describing orally what we are seeing is a simple task we do in our daily life. However, in the natural language processing field, this simple task needs to be bridged by a textual modality that helps the system to generalize various objects in the image and various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that does not need any textual information in its training. We use a vector-quantized variational autoencoder (VQ-VAE) model to learn the discrete representation of a speech caption in an unsupervised manner, where discrete labels are used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with the minimum amount of paired image-speech data while still maintaining the quality of the speech caption. Our experimental results with a multi-speaker natural speech dataset demonstrate our proposed text-free Image2Speech system's performance close to the one with textual information. Furthermore, our approach also successfully outperforms the most recent existing frameworks with phoneme-based and grounding-based Image2Speech systems.
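The discrete speech labels described in the abstract above come from the vector-quantization step of a VQ-VAE, which maps each encoder output frame to the index of its nearest codebook vector. The sketch below shows only that lookup, with a made-up toy codebook; the encoder, decoder, and codebook training are omitted, and the function name is hypothetical.

```python
def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector.

    frames: list of encoder output vectors.
    codebook: list of code vectors with the same dimensionality.
    Distance is squared Euclidean, as in the usual VQ-VAE formulation.
    """
    indices = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


# Toy values chosen purely for illustration.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
toy_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]
labels = quantize(toy_frames, toy_codebook)
```

The resulting index sequence plays the role that a text or phoneme transcription would otherwise play, which is what lets a captioning model be trained without textual supervision.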

Keywords: Task analysis; Image reconstruction; Decoding; Training; Bridges; Speech recognition; Data models; Image-to-speech; image captioning; self-supervised speech representation; vector-quantized variational autoencoder; untranscribed unknown language

Weakly-supervised Speech-to-text Mapping with Visually Connected Non-parallel Speech-text Data using Cyclic Partially-aligned Transformer

Interspeech Conference

Authors: Effendi, Johanes; Sakti, Sakriani; Nakamura, Satoshi. Affiliations: Nara Inst Sci & Technol, Ikoma, Nara, Japan; RIKEN, Ctr Adv Intelligence Project AIP, Tokyo, Japan

ISBN: (print) 9781713836902

Despite the successful development of automatic speech recognition (ASR) systems for several of the world's major languages, they require a tremendous amount of parallel speech-text data. Unfortunately, for many other languages, such resources are usually unavailable. This study addresses the speech-to-text mapping problem given only a collection of visually connected non-parallel speech-text data. We call this "mapping" since the system attempts to learn the semantic association between speech and text instead of recognizing the speech with the exact word-by-word transcription. Here, we propose utilizing our novel cyclic partially-aligned Transformer with two-fold mechanisms. First, we train a Transformer-based vector-quantized variational autoencoder (VQ-VAE) to produce a discrete speech representation in a self-supervised manner. Then, we use a Transformer-based sequence-to-sequence model inside a chain mechanism to map from unknown untranscribed speech utterances into a semantically equivalent text. Because this is not strictly recognizing speech, we focus on evaluating the semantic equivalence of the generated text hypothesis. Our evaluation shows that our proposed method is also effective for a multispeaker natural speech dataset and can also be applied for a cross-lingual application.

Keywords: Speech-to-text mapping; non-parallel data; weakly-supervised; vector-quantized variational autoencoder; cyclic partially-aligned Transformer

A Vector Quantized Masked Autoencoder for Speech Emotion Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Sadok, Samir; Leglaive, Simon; Seguier, Renaud. Affiliation: CentraleSupelec, IETR UMR CNRS 6164, Gif Sur Yvette, France

ISBN: (print) 9798350302615

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.

Keywords: Self-supervised learning; masked autoencoder; vector-quantized variational autoencoder; speech emotion recognition

PCGen: A Fully Parallelizable Point Cloud Generative Model

SENSORS, 2024, vol. 24, no. 5, p. 1414

Authors: Vercheval, Nicolas; Royen, Remco; Munteanu, Adrian; Pizurica, Aleksandra. Affiliations: Univ Ghent, Fac Engn & Architecture, Dept Telecommun & Informat Proc, Res Grp Artificial Intelligence & Sparse Modelling, B-9000 Ghent, Belgium; Univ Ghent, Fac Engn & Architecture, Dept Elect & Informat Syst, Clifford Res Grp, B-9000 Ghent, Belgium; Vrije Univ Brussel, Dept Elect & Informat ETRO, Fac Engn, B-1050 Brussels, Belgium

Generative models have the potential to revolutionize 3D extended reality. A primary obstacle is that augmented and virtual reality need real-time computing. Current state-of-the-art point cloud random generation methods are not fast enough for these applications. We introduce a vector-quantized variational autoencoder model (VQVAE) that can synthesize high-quality point clouds in milliseconds. Unlike previous work in VQVAEs, our model offers a compact sample representation suitable for conditional generation and data exploration with potential applications in rapid prototyping. We achieve this result by combining architectural improvements with an innovative approach for probabilistic random generation. First, we rethink current parallel point cloud autoencoder structures, and we propose several solutions to improve robustness, efficiency and reconstruction quality. Notable contributions in the decoder architecture include an innovative computation layer to process the shape semantic information, an attention mechanism that helps the model focus on different areas and a filter to cover possible sampling errors. Secondly, we introduce a parallel sampling strategy for VQVAE models consisting of a double encoding system, where a variational autoencoder learns how to generate the complex discrete distribution of the VQVAE, not only allowing quick inference but also describing the shape with a few global variables. We compare the proposed decoder and our VQVAE model with established and concurrent work, and we prove, one by one, the validity of the single contributions.

Keywords: point clouds; autoencoder; variational autoencoder; vector-quantized variational autoencoder; real-time computing
