Text encoders play an important role in text-to-speech (TTS) by analyzing text input and converting it into linguistic representations. To generate expressive speech from text, pre-training text encoders on large amounts of data has recently become a common way to obtain richer and more effective linguistic representations. However, existing pre-trained text encoders rely only on self-supervised objectives over text data and do not consider the relationship between the text and speech modalities during pre-training. In this paper, we propose TEAR, a cross-modal pre-trained text encoder enhanced by acoustic representations for TTS. In addition to conventional text pre-training, TEAR incorporates speech pre-training to extract semantic and prosody-related acoustic representations from speech. TEAR then introduces a novel cross-modal pre-training task for the text encoder, termed acoustics-aware joint prediction. This task leverages the acoustic representations produced by the preceding speech pre-training, enabling the linguistic representation to perceive and comprehend prosody during encoding. In our implementation, TEAR was pre-trained on 130 million unlabeled Chinese and English sentences as well as 740,000 Chinese text-speech pairs. Downstream TTS experiments on three expressive TTS datasets indicate that TEAR encodes more effective and comprehensive linguistic representations than text-only pre-trained encoders, leading to the generation of more natural speech.
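To make the acoustics-aware joint prediction idea concrete, here is a minimal sketch of a text encoder trained with both a masked-text objective and an auxiliary head that predicts discrete acoustic (prosody) units. All module names, dimensions, and the assumed one-to-one alignment between text positions and acoustic units are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an acoustics-aware joint prediction objective in the
# spirit of TEAR; the alignment scheme and loss weighting are assumptions.
import torch
import torch.nn as nn

class AcousticsAwareTextEncoder(nn.Module):
    def __init__(self, vocab_size=32000, acoustic_codebook=512, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.mlm_head = nn.Linear(d_model, vocab_size)        # masked text prediction
        self.acoustic_head = nn.Linear(d_model, acoustic_codebook)  # prosody units

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), self.acoustic_head(h)

def joint_loss(model, token_ids, mlm_targets, acoustic_targets, alpha=0.5):
    # acoustic_targets: discrete units from a pre-trained speech model,
    # assumed here to be aligned one-to-one with text positions.
    mlm_logits, ac_logits = model(token_ids)
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_text = ce(mlm_logits.transpose(1, 2), mlm_targets)
    loss_acoustic = ce(ac_logits.transpose(1, 2), acoustic_targets)
    return loss_text + alpha * loss_acoustic
```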
ISBN (print): 9781450383127
Text encoders based on C-DSSM or transformers have demonstrated strong performance on many Natural Language Processing (NLP) tasks. Low-latency variants of these models have also been developed in recent years so that they can be applied to sponsored search, which has strict computational constraints. However, these models are not a panacea for all Natural Language Understanding (NLU) challenges, because the pure semantic information in the data is not sufficient to fully identify user intents. We propose the textGNN model, which naturally extends strong twin-tower structured encoders with complementary graph information from users' historical behaviors; this graph serves as a natural guide to better understand intents and hence generate better language representations. The model inherits all the benefits of twin-tower models such as C-DSSM and TwinBERT, so it can still be used in low-latency environments while achieving a significant performance gain over strong encoder-only baseline models in both offline evaluations and the online production system. In offline experiments, the model achieves a 0.14% overall increase in ROC-AUC with a 1% accuracy gain on long-tail, low-frequency ads; in online A/B testing, it shows a 2.03% increase in Revenue Per Mille and a 2.32% decrease in ad defect rate.
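A minimal sketch of the twin-tower scoring extended with graph features follows. The fusion by concatenation and the feature names are assumptions for illustration; the paper's exact fusion may differ.

```python
# Sketch of a twin-tower relevance model with graph side-information,
# in the spirit of textGNN; fusion scheme is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinTowerWithGraph(nn.Module):
    def __init__(self, d_text=256, d_graph=64):
        super().__init__()
        self.fuse_q = nn.Linear(d_text + d_graph, d_text)  # query tower fusion
        self.fuse_a = nn.Linear(d_text + d_graph, d_text)  # ad tower fusion

    def forward(self, q_text, q_graph, a_text, a_graph):
        # q_text/a_text: text-encoder outputs; q_graph/a_graph: aggregated
        # neighbor embeddings from the user-behavior graph (precomputed).
        q = F.normalize(self.fuse_q(torch.cat([q_text, q_graph], -1)), dim=-1)
        a = F.normalize(self.fuse_a(torch.cat([a_text, a_graph], -1)), dim=-1)
        return (q * a).sum(-1)   # cosine relevance score
```

Because the graph embeddings can be precomputed offline, the towers keep the low-latency serving profile of C-DSSM/TwinBERT-style models.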
ISBN (print): 9789819787944; 9789819787951
Early human action prediction aims to predict the complete action sequence from only the initial part of the sequence observed at an early stage. Because the execution of a single action usually relies on the coordinated movement of multiple key body parts, and different body parts move only slightly at the onset of an action, early action prediction is highly sensitive to where an action starts and what type of action it is. Current skeleton-based action prediction methods focus primarily on action classification and have limited ability to discriminate between semantically related actions. For instance, actions centered on elbow-joint movements, such as "touching the neck" and "touching the head," are difficult to distinguish through classification alone but can be separated via their semantic relationships. Therefore, when differentiating similar actions, incorporating descriptions of the specific joint movements involved can enhance the model's feature extraction ability. This paper introduces an Action Description-Assisted Learning Graph Convolutional Network (ADAL-GCN), which uses large language models as knowledge engines to pre-generate descriptions of the key body parts involved in different actions. These descriptions are then transformed into semantically rich feature vectors through text encoding. Furthermore, the model adopts a lightweight design: it decouples features across the channel and temporal dimensions, consolidates redundant network modules, and strategically migrates computation to improve processing efficiency. Experimental results demonstrate significant performance improvements from the proposed method, which substantially reduces training time without additional computational overhead.
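One plausible way to use the pre-generated descriptions is to pull pooled GCN skeleton features toward frozen text embeddings of each class's description, as sketched below. The contrastive formulation and encoder choices are assumptions, not the paper's exact training objective.

```python
# Hedged sketch of description-assisted training in the spirit of ADAL-GCN:
# skeleton features are classified against per-class description embeddings.
import torch
import torch.nn.functional as F

def description_assisted_loss(skel_feats, class_text_embs, labels, tau=0.07):
    # skel_feats: (B, D) pooled GCN features for a batch of sequences
    # class_text_embs: (C, D) frozen text embeddings, one per action class
    # labels: (B,) ground-truth class indices
    s = F.normalize(skel_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    logits = s @ t.t() / tau   # similarity to every class description
    return F.cross_entropy(logits, labels)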
Knowledge graph representation learning entails transforming the entities and relationships of a knowledge graph into vectors to support downstream tasks. The rise of pre-trained language models has recently promoted text-based approaches to knowledge graph representation learning. However, these methods often lack structural information about the knowledge graph, raising the challenge of integrating graph structure into text-based methodologies. To tackle this issue, we introduce a text-enhanced model with local structure (TEGS) that embeds local graph structure from the knowledge graph into the text encoder. TEGS integrates k-hop neighbor entity information into the text encoder and employs a decoupled attention mechanism to blend relative position encoding and text semantics. This strategy enriches the learnable content with graph structure information and mitigates semantic ambiguity via the decoupled attention mechanism. Experimental findings demonstrate TEGS's effectiveness at fusing graph structure information, resulting in state-of-the-art performance on three datasets for link prediction. In terms of Hit@1, compared to previous text-based models, our model improves by 2.1% on WN18RR, 2.4% on FB15k-237, and 2.7% on NELL-One. Our code is made publicly available.
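A decoupled attention score typically sums a content-to-content term and a content-to-relative-position term, as in the sketch below. Dimensions and the clipping of relative distances are illustrative assumptions; they follow the general decoupled-attention recipe rather than TEGS's exact formulation.

```python
# Sketch of decoupled attention: content and relative-position terms are
# computed separately and summed before scaling.
import torch
import torch.nn as nn

class DecoupledAttentionScores(nn.Module):
    def __init__(self, d_model=256, max_rel=8):
        super().__init__()
        self.q, self.k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        # one embedding per clipped relative distance in [-max_rel, max_rel]
        self.rel_k = nn.Embedding(2 * max_rel + 1, d_model)
        self.max_rel = max_rel

    def forward(self, x):                      # x: (B, L, D)
        B, L, D = x.shape
        q, k = self.q(x), self.k(x)
        content = q @ k.transpose(1, 2)        # content-to-content term
        idx = torch.arange(L)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel, self.max_rel)
        r = self.rel_k(rel + self.max_rel)     # (L, L, D) position embeddings
        position = torch.einsum('bld,lmd->blm', q, r)  # content-to-position
        return (content + position) / D ** 0.5
```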
Synthesizing photographic images from given text descriptions is a challenging problem. Although many previous studies have made significant progress on the visual quality of the generated images by using multi-stage and attentional networks, they ignore the interrelationships between the images generated by the generator at each stage and apply the attention mechanism only in a simple way. In this paper, the Photographic Text-to-Image Generation with Pyramid Contrastive Consistency Model (PCCM-GAN) is proposed to generate photographic images. PCCM-GAN introduces two modules: a Pyramid Contrastive Consistency Model (PCCM) and a stacked attention model (Stack-Attn). Based on the images generated at different stages, PCCM computes a contrastive loss for training the generator. Stack-Attn focuses on generating images with more detail and better semantic consistency by stacking the global-local attention mechanism. Visual inspection of the inner products of PCCM and Stack-Attn is also performed to validate their effectiveness. Extensive experiments and ablation studies on the CUB and MS-COCO datasets demonstrate the superiority of the proposed method.
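A contrastive loss across stages can be sketched as an InfoNCE objective in which features of the same sample from two generator stages are positives and other samples in the batch are negatives. The feature extractor and temperature below are assumptions, not the paper's exact loss.

```python
# Hedged sketch of a pyramid contrastive consistency loss in the spirit
# of PCCM; positives are cross-stage features of the same sample.
import torch
import torch.nn.functional as F

def pyramid_contrastive_loss(feat_stage_lo, feat_stage_hi, tau=0.1):
    # feat_stage_lo/hi: (B, D) features of images from consecutive stages
    z1 = F.normalize(feat_stage_lo, dim=-1)
    z2 = F.normalize(feat_stage_hi, dim=-1)
    logits = z1 @ z2.t() / tau                  # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))          # diagonal pairs are positives
    return F.cross_entropy(logits, targets)
```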
In social media, data-sharing activities have become ever more pervasive; individuals and companies have recognized the significance of promoting information through social media networks. However, they still face the challenge of how to obtain the full benefit that these platforms provide, so social media policies for improving online promotion are becoming more important. The popularity of social media content is tied to public attention and user interest, so popularity forecasting for online content is considered a major task in social media analytics and supports applications in diverse domains. This paper introduces a popularity forecasting approach that derives and combines rich information from a text content encoder, a user encoder, a time series encoder, and user sentiment analysis. The extracted features are then used for prediction via a Long Short-Term Memory (LSTM) network. In particular, to enhance the prediction accuracy of the LSTM, its weights are fine-tuned via Self-Adaptive Rain Optimization (SA-RO).
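A minimal sketch of the described fusion follows: the four feature streams are concatenated per time step and fed to an LSTM regressor. Feature dimensions are illustrative assumptions, and the SA-RO weight-tuning step is omitted.

```python
# Sketch of multi-encoder feature fusion with an LSTM popularity head.
import torch
import torch.nn as nn

class PopularityForecaster(nn.Module):
    def __init__(self, d_text=128, d_user=32, d_time=16, d_sent=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(d_text + d_user + d_time + d_sent, hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 1)   # predicted popularity score

    def forward(self, text_f, user_f, time_f, sent_f):
        x = torch.cat([text_f, user_f, time_f, sent_f], dim=-1)  # (B, T, D)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # forecast from the last step
```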
ISBN (digital): 9783031585357
ISBN (print): 9783031585340; 9783031585357
Text-to-image generation models generate photo-realistic images from textual descriptions, typically using GANs and BiLSTM networks. However, as the length of the input text sequence increases, these models suffer from a loss of information, leading to missed keywords and unsatisfactory results. To address this, we propose an attentional GAN (AttnGAN) model with a text attention mechanism. We evaluate AttnGAN variants on the MS-COCO dataset both qualitatively and quantitatively. For image quality analysis, we use performance measures such as the FID score, R-precision, and the IS score. Our results show that the proposed model outperforms existing approaches, producing more realistic images by preserving vital information in the input sequence.
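The word-level attention at the heart of AttnGAN-style models lets each image sub-region attend over word embeddings to build a word-context vector, as in the sketch below. Shapes are assumptions for illustration.

```python
# Sketch of word-level attention over image sub-regions (AttnGAN-style).
import torch
import torch.nn.functional as F

def word_attention(region_feats, word_embs):
    # region_feats: (B, N, D) image sub-region features
    # word_embs:    (B, T, D) word embeddings of the input text
    scores = torch.bmm(region_feats, word_embs.transpose(1, 2))  # (B, N, T)
    attn = F.softmax(scores, dim=-1)        # each region attends over words
    return torch.bmm(attn, word_embs)       # (B, N, D) word-context vectors
```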
Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align the two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting, which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic-programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence, exploiting the monotonic alignment of spoken content. The proposed model consists of an encoder block that produces audio and text embeddings, a projector block that projects the individual embeddings into a common latent space, and an audio-text aligner containing the DSP algorithm, which aligns the audio and text embeddings to determine whether the spoken content matches the text. Experimental results show that DSP is more effective than other partitioning schemes, and the proposed architecture outperforms the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal Error Rate (EER) by 14.4% and 28.9%, respectively.
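The monotonic partitioning problem can be solved with a standard dynamic program: split T audio frames into W contiguous, ordered segments (one per word) so that the total segment-word score is maximal. The mean frame-word similarity used as the segment score below is an assumption; the paper's exact formulation may differ.

```python
# Hedged sketch of a DSP-style dynamic-programming partitioner.
import numpy as np

def dsp_partition(sim):
    # sim: (T, W) frame-to-word similarity matrix
    T, W = sim.shape
    prefix = np.vstack([np.zeros((1, W)), np.cumsum(sim, axis=0)])
    seg = lambda s, e, w: (prefix[e, w] - prefix[s, w]) / (e - s)  # mean score
    dp = np.full((T + 1, W + 1), -np.inf)   # dp[e, w]: best score, e frames, w words
    back = np.zeros((T + 1, W + 1), dtype=int)
    dp[0, 0] = 0.0
    for w in range(1, W + 1):
        for e in range(w, T + 1):             # segment w ends at frame e
            for s in range(w - 1, e):         # segment w starts at frame s
                cand = dp[s, w - 1] + seg(s, e, w - 1)
                if cand > dp[e, w]:
                    dp[e, w], back[e, w] = cand, s
    cuts, e = [], T                           # walk back from (T, W)
    for w in range(W, 0, -1):
        cuts.append(back[e, w]); e = back[e, w]
    return dp[T, W], cuts[::-1][1:]           # total score, interior cut points
```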
The single-modal information retrieval pattern is gradually becoming unable to meet growing information processing needs. Cross-modal retrieval based on deep learning, as a new information retrieval scheme, is receiving increasing attention. To address the potential problem of imprecise text queries in cross-modal retrieval, an iterative query-based cross-modal retrieval model is proposed. The model is divided into four modules: image feature extraction, text feature extraction, matching and ranking, and query reinforcement. The model first extracts image and text features with deep learning models, then matches and retrieves the image-text features with an image-text stacked cross-attention algorithm. Finally, in the query reinforcement module, the most distinctive object category in the retrieval results is identified through deep reinforcement learning and presented to the user for confirmation, thereby increasing text richness and improving retrieval performance.
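Stacked cross-attention matching scores an image-text pair by letting each word attend over image regions and comparing the attended region context back against the words. The sketch below follows the generic SCAN-style recipe; shapes, temperature, and the final pooling are illustrative assumptions.

```python
# Sketch of image-text stacked cross-attention scoring (SCAN-style).
import torch
import torch.nn.functional as F

def stacked_cross_attention_score(regions, words, tau=9.0):
    # regions: (N, D) image region features; words: (T, D) word features
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    attn = F.softmax(tau * (w @ r.t()), dim=-1)   # (T, N) word-to-region
    attended = attn @ regions                     # (T, D) region context per word
    sims = F.cosine_similarity(attended, words, dim=-1)
    return sims.mean()                            # image-text matching score
```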
ISBN (print): 9783031353192; 9783031353208
A text-to-image generation approach seeks to produce photorealistic images that are semantically coherent with the provided text descriptions. Applications for creating photorealistic visuals from text include photo editing and more. Strong neural network architectures, such as Generative Adversarial Networks (GANs), have been shown to produce effective results in recent years. Two very significant factors, visual realism and content consistency, must be taken into account when creating images from text descriptions. Recent substantial advances in GANs have made it possible to produce images with a high level of visual realism. However, generating images from text with high content consistency between the text and the generated image remains ambitious. To address these two issues, a Bridge GAN model is proposed, where the bridge is a transitional space containing meaningful representations of the given text description. The proposed system combines the Bridge GAN with a char-CNN-RNN model to generate images with high content consistency, and the results show that the proposed system outperforms existing systems.
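One way to read the "bridge" idea is as a learned transitional mapping between the char-CNN-RNN text embedding and the generator's conditioning input, as in the sketch below. Every module here is an illustrative stand-in under that assumption, not the paper's architecture.

```python
# Hedged sketch of bridge-style conditioning: text embedding -> bridge
# space -> generator input; generator head is a toy stand-in.
import torch
import torch.nn as nn

class BridgeConditionedGenerator(nn.Module):
    def __init__(self, d_text=1024, d_bridge=128, d_noise=100):
        super().__init__()
        self.bridge = nn.Sequential(               # transitional representation
            nn.Linear(d_text, d_bridge), nn.LeakyReLU(0.2))
        self.generator = nn.Sequential(            # toy generator head
            nn.Linear(d_bridge + d_noise, 64 * 64 * 3), nn.Tanh())

    def forward(self, text_emb, noise):
        b = self.bridge(text_emb)                  # bridge-space representation
        img = self.generator(torch.cat([b, noise], dim=-1))
        return img.view(-1, 3, 64, 64)
```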