Search Results - Inner Mongolia University Library

Augmenting Context Representation with Triggers Knowledge for Relation Extraction

12th IFIP TC 12 International Conference on Intelligent Information Processing, IIP 2022

Authors: Li, En; Shi, Shumin; Yang, Zhikun; Huang, He Yan. Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing, China

ISBN: (print) 9783031039478

Relation Extraction (RE) requires the model to classify the correct relation from a set of relation candidates given the corresponding sentence and two entities. Recent work mainly studies how to utilize more data or incorporate extra context information, especially with Pre-trained Language Models (PLMs). However, these models still face the challenge of being affected by irrelevant or misleading words. In this paper, we propose a novel model to help alleviate this deficiency. Specifically, our model iteratively mines the triggers of the sentence, using the sentence itself from the previous iteration, and augments the semantics of the context representation from BERT with both the entity pair and the triggers. We conduct extensive experiments to evaluate the proposed model and obtain empirical improvements on TACRED. © 2022, IFIP International Federation for Information Processing.

Keywords: Semantics

Decoding Speech Categorization using Microstate Cortical EEG Signals and Machine Learning

2024 IEEE Signal Processing in Medicine and Biology Symposium, SPMB 2024

Authors: Mahmud, M.; Hasan, M.; Yeasin, M.; Bidelman, G. Affiliations: Univ. of Tennessee Health Science Center, Div. of General Internal Medicine, Memphis, TN, United States; Middle Tennessee State University, Computational and Data Science, Murfreesboro, TN, United States; University of Memphis, Department of Electrical and Computer Engineering, TN, United States; Indiana University, Department of Speech, Language and Hearing Sciences, Bloomington, IN, United States

ISBN: (print) 9798350388572

Categorical perception (CP) is a perceptual phenomenon that refers to the tendency of humans to group speech sounds into discrete units. In this work, we used cortical event-related potential (ERP) signals, recorded during a speech identification task, as input to a Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) and machine learning (ML) classifiers to examine how neural signatures of the brain distinguish prototypical vs. ambiguous speech token categories. In particular, we used extreme gradient boosting (XGBoost), support vector machine, and random forest classifiers. Our analysis shows that, using source-reconstructed whole-brain data, the XGBoost classifier yielded the best classification accuracy of 94.12% (area under the curve (AUC) 94.13%, precision 94.00%, F1-score 94.00%) in the 197-258 ms time window after the stimulus onset. We also identified 15 important brain regions that distinguished vowel classes with an accuracy of 90.28% (AUC 90.17%, F1-score 90.00%, precision 90.00%), 4% less than whole-brain data. Remarkably, out of these 15 critical brain regions, nine were from the left hemisphere, consistent with the left-brain dominance for language processing. The best speech decoding was obtained from early portions of the ERP (~180-320 ms), which corroborates notions that sensory encoding is critical for successful phonetic processing. © 2024 IEEE.

Keywords: Hidden Markov models

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

arXiv, 2024

Authors: Cai, Pengfei; Song, Yan; Li, Kang; Song, Haoyu; McLoughlin, Ian. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; ICT Cluster, Singapore Institute of Technology, Singapore; The Australian National University, Australia

Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively. © 2024, CC BY.

Keywords: Self-supervised learning

Greek Sign Language Recognition for the SL-ReDu Learning Platform

7th Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual Challenges and Perspectives, SLTAT 2022

Authors: Papadimitriou, Katerina; Potamianos, Gerasimos; Sapountzaki, Galini; Goulas, Theodor; Efthimiou, Eleni; Fotinea, Stavroula-Evita; Maragos, Petros. Affiliations: Department of Electrical & Computer Engineering, University of Thessaly, Volos, Greece; Department of Special Education, University of Thessaly, Volos, Greece; Institute for Language & Speech Processing, Athena Research & Innovation Center, Athens, Greece; School of Electrical & Computer Engineering, National Technical University of Athens, Greece

ISBN: (print) 9791095546825

There has been increasing interest lately in developing education tools for sign language (SL) learning that enable self-assessment and objective evaluation of learners' SL productions, assisting both students and their instructors. Crucially, such tools require the automatic recognition of SL videos, while operating in a signer-independent fashion and under realistic recording conditions. Here, we present an early version of a Greek Sign Language (GSL) recognizer that satisfies the above requirements, and integrate it within the SL-ReDu learning platform that constitutes a first in GSL with recognition functionality. We develop the recognition module incorporating state-of-the-art deep-learning based visual detection, feature extraction, and classification, designing it to accommodate a medium-size vocabulary of isolated signs and continuously fingerspelled letter sequences. We train the module on a specifically recorded GSL corpus of multiple signers by a web-cam in non-studio conditions, and conduct both multi-signer and signer-independent recognition experiments, reporting high accuracies. Finally, we let student users evaluate the learning platform during GSL production exercises, reporting very satisfactory objective and subjective assessments based on recognition performance and collected questionnaires, respectively. © European Language Resources Association (ELRA), licensed under CC-BY-NC 4.0.

Keywords: Deep learning

Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization

arXiv, 2024

Authors: He, Mao-Kui; Du, Jun; Niu, Shu-Tong; Liu, Qing-Feng; Lee, Chin-Hui. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China; IFlytek, Hefei, Anhui, China; School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States

In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, utilizing a series of binary classification output layers to simultaneously identify the activities of all speakers. This end-to-end framework is meticulously designed to effectively handle situations of overlapping speech, providing accurate discrimination between speech and non-speech segments through the utilization of multimodal information. Next, we employ a quality-aware audio-visual fusion structure to address signal quality issues for both audio degradations, such as noise, reverberation and other distortions, and video degradations, such as occlusions, off-screen speakers, or unreliable detection. Finally, a cross attention mechanism applied to multi-speaker embedding empowers the network to handle scenarios with varying numbers of speakers. Our experimental results, obtained from various data sets, demonstrate the robustness of our proposed techniques in diverse acoustic environments. Even in scenarios with severely degraded video quality, our system attains performance levels comparable to the best available audio-visual systems. Copyright © 2024, The Authors. All rights reserved.

Keywords: Reverberation

Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties

arXiv, 2024

Authors: Artemova, Ekaterina; Blaschke, Verena; Plank, Barbara. Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich, Germany; Department of Computer Science, IT University of Copenhagen, Denmark; Toloka.AI

Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English. © 2024, CC BY.

Keywords: Zero-shot learning
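
Note on the figures in the abstract above: each drop is quoted both as a relative change and in percentage points, which together imply the score before perturbation. The short sketch below works this out. The helper name `implied_baseline` is hypothetical, and because the quoted figures are rounded, the results are only approximate (roughly 77% for intent accuracy and 68% for slot F1); they are derived here and not reported by the authors.

```python
def implied_baseline(relative_drop_pct: float, absolute_drop_pp: float) -> float:
    """Return the baseline score (in %) implied by a relative drop (in %)
    and the matching absolute drop (in percentage points).

    Hypothetical helper for illustration: baseline = absolute / (relative / 100).
    """
    if relative_drop_pct <= 0:
        raise ValueError("relative_drop_pct must be positive")
    return absolute_drop_pp / (relative_drop_pct / 100.0)


# Figures quoted in the abstract (both pairs are rounded there).
intent_baseline = implied_baseline(6, 4.62)  # intent accuracy before perturbation
slot_baseline = implied_baseline(31, 21)     # slot F1 before perturbation

print(f"implied intent accuracy baseline: {intent_baseline:.1f}%")
print(f"implied slot F1 baseline: {slot_baseline:.1f}%")
```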

DSPGAN: A GAN-Based Universal Vocoder for High-Fidelity TTS by Time-Frequency Domain Supervision from DSP

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Kun Song; Yongmao Zhang; Yi Lei; Jian Cong; Hanzhao Li; Lei Xie; Gang He; Jinfeng Bai. Affiliations: Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; TAL Education Group, Beijing, China

Recent development of neural vocoders based on the generative adversarial neural network (GAN) has shown obvious advantages in generating raw waveforms conditioned on mel-spectrograms with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech from various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis by applying the time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch problem caused by the ground-truth spectrograms in the training phase and the predicted spectrograms in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the predicted mel-spectrogram from the Text-to-speech (TTS) acoustic model, as the time-frequency domain supervision to the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve the harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experiments show that DSPGAN significantly outperforms the compared approaches and that it can generate high-fidelity speech for various TTS models trained using diverse data.

Keywords: Training; Time-frequency analysis; Vocoders; Digital signal processing; Generative adversarial networks; Speech; Acoustics

Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis

IEEE Workshop on Automatic Speech Recognition and Understanding

Authors: Yuke Li; Xinfa Zhu; Yi Lei; Hai Li; Junhui Liu; Danming Xie; Lei Xie. Affiliations: Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; iQIYI Inc., Chengdu, China

Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces challenges of unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS [1] neural architecture, this paper addresses these challenges by introducing specifically-designed modules to model the language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module [2] to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression between different languages is extracted from a pre-trained self-supervised model, HuBERT, with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework’s effectiveness in synthesizing bi-lingual emotional speech for the monolingual target speaker without emotional training data. Speech samples: https://***/ZSET/

To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity

arXiv, 2024

Authors: Sedova, Anastasiia; Litschko, Robert; Frassinelli, Diego; Roth, Benjamin; Plank, Barbara. Affiliations: Faculty of Computer Science, UniVie Doctoral School Computer Science, Austria; Faculty of Philological and Cultural Studies, University of Vienna, Austria; MaiNLP, Center for Information and Language Processing, LMU Munich, Germany

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs. © 2024, CC BY.

Keywords: Computational linguistics