ISBN (print): 9789819794423; 9789819794430
This paper provides a comprehensive overview of the NLPCC 2024 Shared Task 5: Argument Mining for Chinese Argumentative Essay (AMCAE). The task aims to identify the argument type to which each sentence of an argumentative essay, authored by Chinese high school students, belongs, thus helping students improve their writing skills and making teaching more efficient. Fourteen teams submitted valid results, and we summarize some of the representative techniques used by participants. The task guideline and more information can be accessed at https://***/cubenlp/NLPCC-2024-Shared-Task5.
ISBN (print): 9789819794423; 9789819794430
In NLPCC 2024 Shared Task 5, our team employed a context-based learning strategy for model fine-tuning. Through meticulously designed prompts and iterative optimization, we successfully enhanced the ability of pre-trained large language models (LLMs) to mine arguments in Chinese argumentative essays. We further incorporated a model voting mechanism to improve prediction accuracy and robustness. Ultimately, our system ranked first on the test set with a composite score of 0.7936.
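As an illustration of the voting step, the sketch below shows one plausible way to aggregate per-sentence labels from several fine-tuned models. The argument-type labels and the tie-break rule are assumptions for illustration, not the team's actual configuration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the majority label per sentence across several models' outputs."""
    voted = []
    for labels in zip(*predictions):
        # most_common breaks ties by insertion order, i.e. the first model wins
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Three hypothetical fine-tuned models labelling four sentences of one essay.
model_outputs = [
    ["Thesis", "Evidence", "Elaboration", "Conclusion"],
    ["Thesis", "Evidence", "Evidence", "Conclusion"],
    ["Thesis", "Elaboration", "Elaboration", "Conclusion"],
]
print(majority_vote(model_outputs))
# -> ['Thesis', 'Evidence', 'Elaboration', 'Conclusion']
```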
ISBN (print): 9789819794423; 9789819794430
Conversational Emotion Detection (CED), spanning multiple modalities (e.g., textual, visual, and acoustic), has been drawing ever more interest in multi-modal research. Previous studies consistently treat the CED task as an utterance-by-utterance emotion classification problem, which largely ignores the global topic information of each conversation, especially the multi-modal topic information carried across modalities. Such information is crucial for alleviating the emotional information deficiency of a single utterance. With this in mind, we propose a Topic-enriched Variational Transformer (TVT) approach to capture the conversational topic information inside different modalities for CED. In particular, a modality-independent topic module in TVT is designed to mine topic clues from either the discrete textual content or the continuous visual and acoustic content of each conversation. Detailed evaluation shows the clear advantage of TVT over state-of-the-art baselines on the CED task, justifying the importance of multi-modal topic information to CED and the effectiveness of our approach in capturing it.
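The abstract describes the modality-independent topic module only at a high level. The following minimal sketch shows a generic variational encoder of the kind such a module could build on: a pooled conversation feature mapped to a latent topic vector via the reparameterization trick. All layer choices and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VariationalTopicModule(nn.Module):
    """Minimal sketch of a modality-independent variational topic encoder."""

    def __init__(self, feat_dim, topic_dim):
        super().__init__()
        self.mu = nn.Linear(feat_dim, topic_dim)
        self.logvar = nn.Linear(feat_dim, topic_dim)

    def forward(self, conv_feats):
        # conv_feats: (batch, feat_dim), a pooled conversation representation
        mu, logvar = self.mu(conv_feats), self.logvar(conv_feats)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        # KL divergence to a standard Gaussian prior, used as a training penalty
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl  # topic vector to enrich utterance representations, plus KL
```

The same module would be instantiated once per modality (text, visual, acoustic), consistent with the "modality-independent" framing.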
ISBN (print): 9783031780134; 9783031780141
Voice signals convey hidden, valuable information about speakers, such as age, gender, and emotional state. Extracting this kind of information from human speech is significant for human-computer interaction (HCI): it enables computers to understand human behavior and to build interactive systems with customized responses, raising the importance of advances in speech emotion recognition (SER), especially for widely spoken languages. Although over 100 million people speak the Egyptian dialect, SER studies that address it are extremely scarce and rely predominantly on traditional machine learning models and convolutional neural networks (CNNs) for classification. In this context, we propose an enhanced compact convolution transformer (CCT) that detects the speaker's age, gender, and emotional state, leveraging the strengths of CNNs for capturing spatial features and of transformers for modeling long-range dependencies. The proposed approach combines the best of both architectures, marking a novel architecture for the Egyptian emotion recognition task. To the best of our knowledge, this is the first work to address age detection from Egyptian speech, as well as the first to propose a unified model for recognizing age, gender, and emotion from Egyptian speech. To demonstrate its value for HCI, the proposed model was applied in a real-world setting by integrating it into a custom-developed Egyptian chatbot, enhancing the chatbot's ability to provide emotionally aware responses based on the user's emotional state.
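To make the hybrid CNN-plus-transformer design concrete, here is a minimal PyTorch sketch of a CCT-style multi-task classifier over spectrogram input: a convolutional tokenizer, a transformer encoder, and three task heads. Channel counts, head counts, and class counts are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CompactConvTransformer(nn.Module):
    """Sketch of a CCT-style model predicting age, gender, and emotion."""

    def __init__(self, n_ages=4, n_genders=2, n_emotions=6, d=128):
        super().__init__()
        # CNN tokenizer: spatial features from a (1, mels, frames) spectrogram
        self.tokenizer = nn.Sequential(
            nn.Conv2d(1, d, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.age = nn.Linear(d, n_ages)
        self.gender = nn.Linear(d, n_genders)
        self.emotion = nn.Linear(d, n_emotions)

    def forward(self, spec):  # spec: (batch, 1, mels, frames)
        tokens = self.tokenizer(spec).flatten(2).transpose(1, 2)  # (b, seq, d)
        pooled = self.encoder(tokens).mean(dim=1)  # simple mean pooling
        return self.age(pooled), self.gender(pooled), self.emotion(pooled)
```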
ISBN (print): 9789819794331; 9789819794348
Large language models (LLMs), such as ChatGPT, have demonstrated remarkable abilities in simple information extraction tasks. However, when it comes to complex and demanding tasks like relation extraction (RE), LLMs may still have considerable room for improvement. In this paper, we extensively evaluate ChatGPT's performance on RE to expose its strengths and limitations. We explore the design choices of ChatGPT's input prompts for RE. Considering different combinations of these choices, we conduct thorough experiments on benchmark datasets and analyze ChatGPT's comprehension abilities under different settings. Our experimental results provide insights for the future development of LLM-based RE models.
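As one example of the prompt design space such a study explores, the sketch below composes a zero- or few-shot RE prompt from a sentence, an entity pair, and a candidate label set. The exact wording, label set, and demonstration format are hypothetical, not the paper's prompts.

```python
def build_re_prompt(sentence, head, tail, relations, demos=None):
    """Compose a relation-extraction prompt for an instruction-tuned LLM."""
    prompt = (
        "Given a sentence and an entity pair, choose the relation between "
        f"them from: {', '.join(relations)}.\n\n"
    )
    for d in demos or []:  # optional few-shot demonstrations
        prompt += (f"Sentence: {d['sentence']}\nEntities: {d['head']}, "
                   f"{d['tail']}\nRelation: {d['relation']}\n\n")
    prompt += f"Sentence: {sentence}\nEntities: {head}, {tail}\nRelation:"
    return prompt

print(build_re_prompt(
    "Steve Jobs co-founded Apple in 1976.",
    "Steve Jobs", "Apple",
    ["founder_of", "employee_of", "no_relation"],
))
```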
ISBN (print): 9789819794331; 9789819794348
Time series forecasting is vital in industries such as weather and transportation. However, Transformer models may struggle to capture both variable and temporal correlations in multivariate forecasting, potentially hindering their understanding of complex data dynamics. To address this, we present VPformer, a model that leverages cross-variable and temporal correlations to enhance forecasting accuracy. VPformer uses variable embedding and self-attention to explore variable correlations and transform the raw data into a feature-rich space. It segments the transformed data into three patch types using non-overlapping partitioning and applies channel-independent techniques, sequence embedding, and attention mechanisms to capture temporal correlations. A fusion strategy then integrates the features, providing a holistic view that captures both variable and temporal information. VPformer's performance is validated on real-world datasets, demonstrating superior forecasting accuracy and computational efficiency.
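The patching step can be illustrated with a small helper that splits a multivariate series into non-overlapping, channel-independent patches; calling it with three different patch lengths would yield the three patch granularities the abstract mentions (an assumption about the mechanism, since the paper's exact segmentation is not given here).

```python
import torch

def segment_patches(x, patch_len):
    """Split each variable's sequence into non-overlapping patches.

    x: (batch, seq_len, n_vars)
    returns: (batch, n_vars, n_patches, patch_len)
    """
    b, seq_len, n_vars = x.shape
    n_patches = seq_len // patch_len          # drop any incomplete tail patch
    x = x[:, : n_patches * patch_len, :]
    x = x.transpose(1, 2)                     # channel-independent layout
    return x.reshape(b, n_vars, n_patches, patch_len)

series = torch.randn(8, 96, 7)                # e.g. 96 time steps, 7 variables
print(segment_patches(series, 16).shape)      # torch.Size([8, 7, 6, 16])
```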
ISBN (print): 9789819794393; 9789819794409
Text Classification (TC), as a fundamental task in Natural Language Processing (NLP), plays an important role in many areas. However, adversarial examples (AEs), which add small perturbations to input text samples, pose a serious challenge for TC. One key characteristic of AEs in the NLP context is visual consistency: attackers generally keep AEs visually similar to the original text sample so that users can still understand them. In this paper, we introduce Visage, an effective black-box method that generates AEs from the user's perspective. Specifically, Visage calculates the importance of the words in an input text sample and modifies them using visually similar characters to generate AEs. Visage also provides AEs for adversarial training, improving the robustness of TC. Extensive experiments show that AEs generated by Visage effectively reduce the accuracy of victim models, outperforming related work by 22.95% on average. Furthermore, adding Visage-generated AEs to training datasets for adversarial training improves robustness by 19.5%.
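A minimal sketch of the core perturbation idea follows: replace characters of high-importance words with visually similar ones. The homoglyph table and the way word importance is obtained are hypothetical, not Visage's actual character inventory or scoring procedure.

```python
# Hypothetical homoglyph table: Latin letters mapped to Cyrillic look-alikes.
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с", "p": "р"}

def perturb(text, important_words):
    """Swap characters of high-importance words for visually similar ones,
    keeping the sentence readable to a human while shifting model features."""
    out = []
    for word in text.split():
        if word.lower() in important_words:
            word = "".join(HOMOGLYPHS.get(ch, ch) for ch in word)
        out.append(word)
    return " ".join(out)

# Word importance would normally come from a black-box heuristic such as the
# victim model's score change under word deletion; here it is given directly.
print(perturb("the movie was a complete disaster", {"disaster"}))
```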
ISBN (print): 9789819794423; 9789819794430
The rise of large language models has brought about significant advancements in the field of natural language processing. However, these models often have the potential to generate content that is hallucinatory or toxic. To this end, we organize NLPCC 2024 Shared Task 10, i.e., Regulating Large Language Models, which includes two subtasks: Multimodal Hallucination Detection for Multimodal Large Language Models and Detoxifying Large Language Models. For the first subtask, we construct a fine-grained, human-calibrated benchmark for multimodal hallucination detection, named MHaluBench, which contains 1,270 training, 600 validation, and 300 test instances. The second subtask draws on the SafeEdit benchmark, containing 4,050 training, 2,700 validation, and 540 test instances; its aim is to design and implement strategies that prevent large language models from generating toxic content. This paper presents the details of the shared task, datasets, evaluation metrics, and evaluation results.
ISBN (print): 9783031779602; 9783031779619
End-to-end text-to-speech (TTS) systems allow for the generation of high-quality computer-generated speech without relying on expert-created modules. This paper outlines initial efforts to develop a Serbian end-to-end TTS system using the Tacotron architecture. Listening tests revealed that while Tacotron can produce natural-sounding synthesis when properly trained, it is prone to overfitting and requires extensive data to avoid frequent hallucinations and accent errors. The choice of vocoder proved crucial to overall speech quality; the degree of Tacotron training mattered less, although the model still overfits easily on relatively small databases. Correct accents and the absence of artifacts and hallucinations are extremely important to listeners, and any issues in these areas result in significantly lower ratings. Despite being less expressive, a controllable standard DNN-based TTS with a standard front end received better grades because it never hallucinates and rarely makes linguistic mistakes. Integrating expert knowledge from existing pipelines can further improve synthesis quality, especially in data-constrained scenarios.
ISBN (print): 9789819794423; 9789819794430
In the digital age, the widespread preference for instructional videos as an educational aid demands cutting-edge technologies that can swiftly pinpoint video segments in response to queries. Since learners from various cultural backgrounds may pose questions in different languages, the multilingual temporal answer grounding in single video (mTAGSV) challenge has been introduced. mTAGSV requires models to precisely identify the time segment within a video that aligns with a query posed in Chinese or English. To address the limitations of existing monolingual approaches and their inefficacy on silent videos, we use optical character recognition (OCR) to enrich the video information and leverage large language models (LLMs) to bridge the linguistic gap. For videos that contain audio, subtitles are extracted with an automatic speech recognition (ASR) tool. For silent videos, or those with insufficient audio-based subtitles, we apply an OCR tool to extract textual content from video frames, refining the OCR-generated text to serve as a surrogate for subtitles. Furthermore, we use LLMs to translate English queries into their Chinese equivalents, bridging the linguistic divide between queries and video content. The MutualSL model is employed as the backbone network for extracting features from textual subtitles and visual frames. Through extensive experiments, we demonstrate that our proposed techniques enhance task performance, securing first place in track 1 of NLPCC 2024 Shared Task 7.
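The subtitle-extraction fallback and the query translation can be sketched as below. The `asr`, `ocr`, and `llm` callables, the `video` interface, and the `min_chars` threshold are illustrative stand-ins for whatever tools and settings the team actually used.

```python
def get_subtitles(video, asr, ocr, min_chars=50):
    """Prefer ASR subtitles; fall back to OCR text from sampled frames when
    the video is silent or the ASR output is too sparse."""
    text = asr(video) if video.has_audio else ""
    if len(text) < min_chars:                 # silent or low-yield audio track
        frames = video.sample_frames(fps=1)   # hypothetical frame sampler
        text = "\n".join(ocr(frame) for frame in frames)
    return text

def translate_query(query, llm):
    """Translate an English query into Chinese with an LLM, so that queries
    and (Chinese) video content share a language before grounding."""
    return llm("Translate the following question into Chinese, "
               "preserving technical terms:\n" + query)
```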