Text encoders play an important role in text-to-speech (TTS) by analyzing text input and converting it into linguistic representations. To generate expressive speech from text, pre-training text encoders on large amounts of data has recently become a common way to obtain richer and more effective linguistic representations. However, existing pre-trained text encoders rely only on self-supervised objectives over text data and do not consider the relationship between the text and speech modalities during pre-training. In this paper, we propose TEAR, a cross-modal pre-trained text encoder enhanced by acoustic representations for TTS. In addition to conventional text pre-training, TEAR incorporates speech pre-training to extract semantic and prosody-related acoustic representations from speech. TEAR then introduces a novel cross-modal pre-training task for the text encoder, termed acoustics-aware joint prediction. This task leverages the acoustic representations produced by the preceding speech pre-training, enabling the linguistic representation to perceive and comprehend prosody during encoding. In our implementation, TEAR was pre-trained on 130 million unlabeled Chinese and English sentences as well as 740,000 Chinese text-speech pairs. Downstream TTS experiments on three expressive TTS datasets indicate that TEAR encodes more effective and comprehensive linguistic representations than text-only pre-trained encoders, leading to the generation of more natural speech.
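To make the acoustics-aware joint prediction idea concrete, here is a minimal sketch of a text encoder trained with both a masked-text objective and an auxiliary head that predicts discrete acoustic (prosody) units. All module names, dimensions, and the assumed one-to-one alignment between text positions and acoustic units are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an acoustics-aware joint prediction objective in the
# spirit of TEAR; the alignment scheme and loss weighting are assumptions.
import torch
import torch.nn as nn

class AcousticsAwareTextEncoder(nn.Module):
    def __init__(self, vocab_size=32000, acoustic_codebook=512, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.mlm_head = nn.Linear(d_model, vocab_size)        # masked text prediction
        self.acoustic_head = nn.Linear(d_model, acoustic_codebook)  # prosody units

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), self.acoustic_head(h)

def joint_loss(model, token_ids, mlm_targets, acoustic_targets, alpha=0.5):
    # acoustic_targets: discrete units from a pre-trained speech model,
    # assumed here to be aligned one-to-one with text positions.
    mlm_logits, ac_logits = model(token_ids)
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_text = ce(mlm_logits.transpose(1, 2), mlm_targets)
    loss_acoustic = ce(ac_logits.transpose(1, 2), acoustic_targets)
    return loss_text + alpha * loss_acoustic
```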
ISBN (print): 9781450383127
Text encoders based on C-DSSM or transformers have demonstrated strong performance on many Natural Language Processing (NLP) tasks. Low-latency variants of these models have also been developed in recent years so that they can be applied to sponsored search, which has strict computational constraints. However, these models are not a panacea for all Natural Language Understanding (NLU) challenges, because the pure semantic information in the data is not sufficient to fully identify user intents. We propose the textGNN model, which naturally extends strong twin-tower structured encoders with complementary graph information from users' historical behaviors; this graph serves as a natural guide to better understand intents and hence generate better language representations. The model inherits all the benefits of twin-tower models such as C-DSSM and TwinBERT, so it can still be used in low-latency environments while achieving a significant performance gain over strong encoder-only baseline models in both offline evaluations and the online production system. In offline experiments, the model achieves a 0.14% overall increase in ROC-AUC with a 1% accuracy gain on long-tail, low-frequency ads; in online A/B testing, it shows a 2.03% increase in Revenue Per Mille and a 2.32% decrease in ad defect rate.
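A minimal sketch of the twin-tower scoring extended with graph features follows. The fusion by concatenation and the feature names are assumptions for illustration; the paper's exact fusion may differ.

```python
# Sketch of a twin-tower relevance model with graph side-information,
# in the spirit of textGNN; fusion scheme is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinTowerWithGraph(nn.Module):
    def __init__(self, d_text=256, d_graph=64):
        super().__init__()
        self.fuse_q = nn.Linear(d_text + d_graph, d_text)  # query tower fusion
        self.fuse_a = nn.Linear(d_text + d_graph, d_text)  # ad tower fusion

    def forward(self, q_text, q_graph, a_text, a_graph):
        # q_text/a_text: text-encoder outputs; q_graph/a_graph: aggregated
        # neighbor embeddings from the user-behavior graph (precomputed).
        q = F.normalize(self.fuse_q(torch.cat([q_text, q_graph], -1)), dim=-1)
        a = F.normalize(self.fuse_a(torch.cat([a_text, a_graph], -1)), dim=-1)
        return (q * a).sum(-1)   # cosine relevance score
```

Because the graph embeddings can be precomputed offline, the towers keep the low-latency serving profile of C-DSSM/TwinBERT-style models.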
ISBN (print): 9789819787944; 9789819787951
Early human action prediction aims to predict the complete action sequence from only the initial part of the sequence observed at an early stage. Because the execution of a single action usually relies on the coordinated movement of multiple key body parts, and different body parts move only slightly at the onset of an action, early action prediction is highly sensitive to where an action starts and what type of action it is. Current skeleton-based action prediction methods focus primarily on action classification and have limited ability to discriminate between semantically related actions. For instance, actions centered on elbow-joint movements, such as "touching the neck" and "touching the head," are difficult to distinguish through classification alone but can be separated via their semantic relationships. Therefore, when differentiating similar actions, incorporating descriptions of the specific joint movements involved can enhance the model's feature extraction ability. This paper introduces an Action Description-Assisted Learning Graph Convolutional Network (ADAL-GCN), which uses large language models as knowledge engines to pre-generate descriptions of the key body parts involved in different actions. These descriptions are then transformed into semantically rich feature vectors through text encoding. Furthermore, the model adopts a lightweight design: it decouples features across the channel and temporal dimensions, consolidates redundant network modules, and strategically migrates computation to improve processing efficiency. Experimental results demonstrate significant performance improvements from the proposed method, which substantially reduces training time without additional computational overhead.
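One plausible way to use the pre-generated descriptions is to pull pooled GCN skeleton features toward frozen text embeddings of each class's description, as sketched below. The contrastive formulation and encoder choices are assumptions, not the paper's exact training objective.

```python
# Hedged sketch of description-assisted training in the spirit of ADAL-GCN:
# skeleton features are classified against per-class description embeddings.
import torch
import torch.nn.functional as F

def description_assisted_loss(skel_feats, class_text_embs, labels, tau=0.07):
    # skel_feats: (B, D) pooled GCN features for a batch of sequences
    # class_text_embs: (C, D) frozen text embeddings, one per action class
    # labels: (B,) ground-truth class indices
    s = F.normalize(skel_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    logits = s @ t.t() / tau   # similarity to every class description
    return F.cross_entropy(logits, labels)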
Knowledge graph representation learning entails transforming the entities and relationships of a knowledge graph into vectors to support downstream tasks. The rise of pre-trained language models has recently promoted text-based approaches to knowledge graph representation learning. However, these methods often lack structural information about the knowledge graph, raising the challenge of integrating graph structure into text-based methodologies. To tackle this issue, we introduce a text-enhanced model with local structure (TEGS) that embeds local graph structure from the knowledge graph into the text encoder. TEGS integrates k-hop neighbor entity information into the text encoder and employs a decoupled attention mechanism to blend relative position encoding and text semantics. This strategy enriches the learnable content with graph structure information and mitigates semantic ambiguity via the decoupled attention mechanism. Experimental findings demonstrate TEGS's effectiveness at fusing graph structure information, resulting in state-of-the-art performance on three datasets for link prediction. In terms of Hit@1, compared to previous text-based models, our model improves by 2.1% on WN18RR, 2.4% on FB15k-237, and 2.7% on NELL-One. Our code is made publicly available.
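A decoupled attention score typically sums a content-to-content term and a content-to-relative-position term, as in the sketch below. Dimensions and the clipping of relative distances are illustrative assumptions; they follow the general decoupled-attention recipe rather than TEGS's exact formulation.

```python
# Sketch of decoupled attention: content and relative-position terms are
# computed separately and summed before scaling.
import torch
import torch.nn as nn

class DecoupledAttentionScores(nn.Module):
    def __init__(self, d_model=256, max_rel=8):
        super().__init__()
        self.q, self.k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        # one embedding per clipped relative distance in [-max_rel, max_rel]
        self.rel_k = nn.Embedding(2 * max_rel + 1, d_model)
        self.max_rel = max_rel

    def forward(self, x):                      # x: (B, L, D)
        B, L, D = x.shape
        q, k = self.q(x), self.k(x)
        content = q @ k.transpose(1, 2)        # content-to-content term
        idx = torch.arange(L)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel, self.max_rel)
        r = self.rel_k(rel + self.max_rel)     # (L, L, D) position embeddings
        position = torch.einsum('bld,lmd->blm', q, r)  # content-to-position
        return (content + position) / D ** 0.5
```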
Synthesizing photographic images from given text descriptions is a challenging problem. Although many previous studies have made significant progress on the visual quality of the generated images by using multi-stage and attentional networks, they ignore the interrelationships between the images generated by the generator at each stage and apply the attention mechanism only in a simple way. In this paper, the Photographic Text-to-Image Generation with Pyramid Contrastive Consistency Model (PCCM-GAN) is proposed to generate photographic images. PCCM-GAN introduces two modules: a Pyramid Contrastive Consistency Model (PCCM) and a stacked attention model (Stack-Attn). Based on the images generated at different stages, PCCM computes a contrastive loss for training the generator. Stack-Attn focuses on generating images with more detail and better semantic consistency by stacking the global-local attention mechanism. Visual inspection of the inner products of PCCM and Stack-Attn is also performed to validate their effectiveness. Extensive experiments and ablation studies on the CUB and MS-COCO datasets demonstrate the superiority of the proposed method.
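A contrastive loss across stages can be sketched as an InfoNCE objective in which features of the same sample from two generator stages are positives and other samples in the batch are negatives. The feature extractor and temperature below are assumptions, not the paper's exact loss.

```python
# Hedged sketch of a pyramid contrastive consistency loss in the spirit
# of PCCM; positives are cross-stage features of the same sample.
import torch
import torch.nn.functional as F

def pyramid_contrastive_loss(feat_stage_lo, feat_stage_hi, tau=0.1):
    # feat_stage_lo/hi: (B, D) features of images from consecutive stages
    z1 = F.normalize(feat_stage_lo, dim=-1)
    z2 = F.normalize(feat_stage_hi, dim=-1)
    logits = z1 @ z2.t() / tau                  # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))          # diagonal pairs are positives
    return F.cross_entropy(logits, targets)
```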
In social media, data-sharing activities have become ever more pervasive; individuals and companies have recognized the significance of promoting information through social media networks. However, they still face the challenge of how to obtain the full benefit that these platforms provide, so social media policies for improving online promotion are becoming more important. The popularity of social media content is tied to public attention and user interest, so popularity forecasting for online content is considered a major task in social media analytics and supports applications in diverse domains. This paper introduces a popularity forecasting approach that derives and combines rich information from a text content encoder, a user encoder, a time series encoder, and user sentiment analysis. The extracted features are then used for prediction via a Long Short-Term Memory (LSTM) network. In particular, to enhance the prediction accuracy of the LSTM, its weights are fine-tuned via Self-Adaptive Rain Optimization (SA-RO).
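A minimal sketch of the described fusion follows: the four feature streams are concatenated per time step and fed to an LSTM regressor. Feature dimensions are illustrative assumptions, and the SA-RO weight-tuning step is omitted.

```python
# Sketch of multi-encoder feature fusion with an LSTM popularity head.
import torch
import torch.nn as nn

class PopularityForecaster(nn.Module):
    def __init__(self, d_text=128, d_user=32, d_time=16, d_sent=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(d_text + d_user + d_time + d_sent, hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 1)   # predicted popularity score

    def forward(self, text_f, user_f, time_f, sent_f):
        x = torch.cat([text_f, user_f, time_f, sent_f], dim=-1)  # (B, T, D)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # forecast from the last step
```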
ISBN (digital): 9783031585357
ISBN (print): 9783031585340; 9783031585357
Text-to-image generation models generate photo-realistic images from textual descriptions, typically using GANs and BiLSTM networks. However, as the length of the input text sequence increases, these models suffer from a loss of information, leading to missed keywords and unsatisfactory results. To address this, we propose an attentional GAN (AttnGAN) model with a text attention mechanism. We evaluate AttnGAN variants on the MS-COCO dataset both qualitatively and quantitatively. For image quality analysis, we use performance measures such as the FID score, R-precision, and the IS score. Our results show that the proposed model outperforms existing approaches, producing more realistic images by preserving vital information in the input sequence.
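The word-level attention at the heart of AttnGAN-style models lets each image sub-region attend over word embeddings to build a word-context vector, as in the sketch below. Shapes are assumptions for illustration.

```python
# Sketch of word-level attention over image sub-regions (AttnGAN-style).
import torch
import torch.nn.functional as F

def word_attention(region_feats, word_embs):
    # region_feats: (B, N, D) image sub-region features
    # word_embs:    (B, T, D) word embeddings of the input text
    scores = torch.bmm(region_feats, word_embs.transpose(1, 2))  # (B, N, T)
    attn = F.softmax(scores, dim=-1)        # each region attends over words
    return torch.bmm(attn, word_embs)       # (B, N, D) word-context vectors
```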
Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align the two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting, which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic-programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence, exploiting the monotonic alignment of spoken content. The proposed model consists of an encoder block that produces audio and text embeddings, a projector block that projects the individual embeddings into a common latent space, and an audio-text aligner containing the DSP algorithm, which aligns the audio and text embeddings to determine whether the spoken content matches the text. Experimental results show that DSP is more effective than other partitioning schemes, and the proposed architecture outperforms the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal Error Rate (EER) by 14.4% and 28.9%, respectively.
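The monotonic partitioning problem can be solved with a standard dynamic program: split T audio frames into W contiguous, ordered segments (one per word) so that the total segment-word score is maximal. The mean frame-word similarity used as the segment score below is an assumption; the paper's exact formulation may differ.

```python
# Hedged sketch of a DSP-style dynamic-programming partitioner.
import numpy as np

def dsp_partition(sim):
    # sim: (T, W) frame-to-word similarity matrix
    T, W = sim.shape
    prefix = np.vstack([np.zeros((1, W)), np.cumsum(sim, axis=0)])
    seg = lambda s, e, w: (prefix[e, w] - prefix[s, w]) / (e - s)  # mean score
    dp = np.full((T + 1, W + 1), -np.inf)   # dp[e, w]: best score, e frames, w words
    back = np.zeros((T + 1, W + 1), dtype=int)
    dp[0, 0] = 0.0
    for w in range(1, W + 1):
        for e in range(w, T + 1):             # segment w ends at frame e
            for s in range(w - 1, e):         # segment w starts at frame s
                cand = dp[s, w - 1] + seg(s, e, w - 1)
                if cand > dp[e, w]:
                    dp[e, w], back[e, w] = cand, s
    cuts, e = [], T                           # walk back from (T, W)
    for w in range(W, 0, -1):
        cuts.append(back[e, w]); e = back[e, w]
    return dp[T, W], cuts[::-1][1:]           # total score, interior cut points
```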
The single-modal information retrieval pattern is gradually becoming unable to meet growing information processing needs. Cross-modal retrieval based on deep learning, as a new information retrieval scheme, is receiving increasing attention. To address the potential problem of imprecise text queries in cross-modal retrieval, an iterative query-based cross-modal retrieval model is proposed. The model is divided into four modules: image feature extraction, text feature extraction, matching and ranking, and query reinforcement. The model first extracts image and text features with deep learning models, then matches and retrieves the image-text features with an image-text stacked cross-attention algorithm. Finally, in the query reinforcement module, the most distinctive object category in the retrieval results is identified through deep reinforcement learning and presented to the user for confirmation, thereby increasing text richness and improving retrieval performance.
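Stacked cross-attention matching scores an image-text pair by letting each word attend over image regions and comparing the attended region context back against the words. The sketch below follows the generic SCAN-style recipe; shapes, temperature, and the final pooling are illustrative assumptions.

```python
# Sketch of image-text stacked cross-attention scoring (SCAN-style).
import torch
import torch.nn.functional as F

def stacked_cross_attention_score(regions, words, tau=9.0):
    # regions: (N, D) image region features; words: (T, D) word features
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    attn = F.softmax(tau * (w @ r.t()), dim=-1)   # (T, N) word-to-region
    attended = attn @ regions                     # (T, D) region context per word
    sims = F.cosine_similarity(attended, words, dim=-1)
    return sims.mean()                            # image-text matching score
```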
ISBN (print): 9783031353192; 9783031353208
A text-to-image generation approach seeks to produce photorealistic images that are semantically coherent with the provided text descriptions. Applications for creating photorealistic visuals from text include photo editing and more. Strong neural network architectures, such as Generative Adversarial Networks (GANs), have been shown to produce effective results in recent years. Two very significant factors, visual realism and content consistency, must be taken into account when creating images from text descriptions. Recent substantial advances in GANs have made it possible to produce images with a high level of visual realism. However, generating images from text with high content consistency between the text and the generated image remains ambitious. To address these two issues, a Bridge GAN model is proposed, where the bridge is a transitional space containing meaningful representations of the given text description. The proposed system combines the Bridge GAN with a char-CNN-RNN model to generate images with high content consistency, and the results show that the proposed system outperforms existing systems.
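One way to read the "bridge" idea is as a learned transitional mapping between the char-CNN-RNN text embedding and the generator's conditioning input, as in the sketch below. Every module here is an illustrative stand-in under that assumption, not the paper's architecture.

```python
# Hedged sketch of bridge-style conditioning: text embedding -> bridge
# space -> generator input; generator head is a toy stand-in.
import torch
import torch.nn as nn

class BridgeConditionedGenerator(nn.Module):
    def __init__(self, d_text=1024, d_bridge=128, d_noise=100):
        super().__init__()
        self.bridge = nn.Sequential(               # transitional representation
            nn.Linear(d_text, d_bridge), nn.LeakyReLU(0.2))
        self.generator = nn.Sequential(            # toy generator head
            nn.Linear(d_bridge + d_noise, 64 * 64 * 3), nn.Tanh())

    def forward(self, text_emb, noise):
        b = self.bridge(text_emb)                  # bridge-space representation
        img = self.generator(torch.cat([b, noise], dim=-1))
        return img.view(-1, 3, 64, 64)
```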