ISBN (Print): 9789819794331; 9789819794348
Large language models (LLMs) are showing dramatic progress in language generation and reasoning tasks. Existing work on fake news detection mostly focuses on fine-tuning small language models such as BERT. One downside of fine-tuning is that it requires a lot of data, which might not always be available. With the prevalent spread of fake news and misinformation, alternative approaches are needed, especially where training data is scarce. In this paper, we propose using multi-agent debate strategies to enhance fake news detection by leveraging the capabilities of LLMs. We introduce two approaches: a uniform-prompt multi-agent debate and a diverse-prompt multi-agent debate in which each LLM agent adopts a distinct role such as fact-checker, journalist, or data scientist. These methods are benchmarked against single-LLM evaluations to assess the impact of collaborative reasoning. Our experiments on the PolitiFact and GossipCop datasets reveal that the multi-agent debate methods outperform single-LLM assessments. Notably, the diverse-persona debate approach achieves the highest performance, demonstrating the value of incorporating different perspectives into reasoning. These results suggest that multi-agent debates can effectively harness the strengths of single LLMs to improve the reliability of fake news detection systems.
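To make the diverse-persona debate protocol concrete, here is a minimal sketch. The `llm` callable (prompt in, text out), the persona list, the two-round schedule, and the majority-vote aggregation are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a diverse-persona multi-agent debate for fake news
# detection. `llm` is a hypothetical completion function standing in for
# any chat-completion API.

PERSONAS = ["fact-checker", "journalist", "data scientist"]

def debate(article: str, llm, rounds: int = 2) -> str:
    opinions = {p: "" for p in PERSONAS}
    for _ in range(rounds):
        for persona in PERSONAS:
            others = "\n".join(
                f"{p}: {o}" for p, o in opinions.items() if p != persona and o
            )
            prompt = (
                f"You are a {persona}. Decide whether the article is REAL or FAKE.\n"
                f"Article: {article}\n"
                f"Other agents' current views:\n{others or '(none yet)'}\n"
                "Give a one-sentence justification and a final REAL/FAKE verdict."
            )
            opinions[persona] = llm(prompt)
    # Aggregate the final round by majority vote over the agents' verdicts.
    votes = ["FAKE" in o.upper() for o in opinions.values()]
    return "FAKE" if sum(votes) > len(votes) / 2 else "REAL"
```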
ISBN (Print): 9789819794362; 9789819794379
The rise of voice interface applications has renewed interest in improving the robustness of spoken language understanding (SLU). Many advances have come from end-to-end speech-language joint training, such as inferring semantics directly from speech signals and post-editing automatic speech recognition (ASR) output. Despite their performance achievements, these methods either depend on large numbers of paired error-prone ASR transcriptions and ground-truth annotations, which may not be available, or are computationally costly. To mitigate these issues, we propose an ASR-robust pre-trained language model (ASRLM), which couples a generator that produces simulated ASR transcriptions from ground-truth annotations with a sample-efficient discriminator that distinguishes reasonable ASR errors from unrealistic ones. Experimental results demonstrate that ASRLM improves performance on a wide range of SLU tasks in the presence of ASR errors while saving 27% of the computation cost compared to baselines. Analysis also shows that our proposed generator is better than other simulation methods, including both BERT- and GPT-4-based ones, at simulating real-world ASR error situations.
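The generator/discriminator pairing is reminiscent of replaced-token detection. Below is a toy sketch of the idea under stated assumptions: `phonetic_confusions` is a hypothetical confusion table (a real system would derive substitution candidates from ASR lattices or phoneme alignments), and the discriminator is only described in a comment.

```python
# Illustrative sketch: a generator injects plausible ASR-style corruptions
# into clean text; a token-level discriminator would then learn to flag
# which tokens were corrupted (analogous to ELECTRA's replaced-token
# detection objective).

import random

phonetic_confusions = {"their": ["there"], "to": ["two", "too"], "flight": ["fright"]}

def generate_asr_errors(tokens, p=0.15):
    """Generator: substitute tokens with phonetically confusable ones."""
    corrupted, labels = [], []
    for tok in tokens:
        if tok in phonetic_confusions and random.random() < p:
            corrupted.append(random.choice(phonetic_confusions[tok]))
            labels.append(1)  # token was corrupted
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# A small transformer trained on (corrupted, labels) pairs would serve as
# the sample-efficient discriminator described in the abstract.
```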
Multi-purpose large language models (LLMs), a subset of generative artificial intelligence (AI), have recently made significant progress. While expectations for LLMs to assist systems engineering (SE) tasks run high, the interdisciplinary and complex nature of systems, along with the need to synthesize deep domain knowledge and operational context, raises questions about the efficacy of LLMs in generating SE artifacts, particularly given that they are trained on data broadly available on the internet. To that end, we present results from an empirical exploration in which a human expert-generated SE artifact was taken as a benchmark, parsed, and fed into various LLMs through prompt engineering to generate segments of typical SE artifacts. This procedure was applied without any fine-tuning or calibration to document baseline LLM performance. We then adopted a two-fold mixed-methods approach to compare AI-generated artifacts against the benchmark. First, we quantitatively compare the artifacts using natural language processing algorithms and find that, when prompted carefully, state-of-the-art algorithms cannot differentiate AI-generated artifacts from the human-expert benchmark. Second, we conduct a qualitative deep dive to investigate how they differ in quality. We document that while the two sets of material appear very similar, AI-generated artifacts exhibit serious failure modes that could be difficult to detect. We characterize these as premature requirements definition, unsubstantiated numerical estimates, and a propensity to overspecify. We contend that this study tells a cautionary tale about why the SE community must be more cautious in adopting AI-suggested feedback, at least when it is generated by multi-purpose LLMs.
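The abstract does not name the NLP algorithms used for the quantitative comparison; one plausible instance is embedding-based cosine similarity between requirement statements. The sketch below assumes the `sentence-transformers` library and an arbitrary model checkpoint, and the two requirement strings are invented examples.

```python
# Sketch of one plausible quantitative comparison: cosine similarity of
# sentence embeddings for a human-written vs. an LLM-generated requirement.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

human = "The system shall maintain attitude knowledge within 0.1 degrees (3-sigma)."
generated = "The spacecraft shall determine its attitude to within 0.1 deg, 3-sigma."

emb = model.encode([human, generated], convert_to_tensor=True)
print(f"cosine similarity: {util.cos_sim(emb[0], emb[1]).item():.3f}")
```

A high similarity score would illustrate the paper's point: surface-level metrics can miss the qualitative failure modes it documents.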
ISBN (Print): 9789819794300; 9789819794317
Dialogue discourse parsing aims to identify the discourse links and relations between utterances, and has attracted increasing interest in recent years. Previous studies either adopt local optimization to independently select one parent for each utterance or use global optimization to directly obtain the tree representing the dialogue structure. However, the influence of these two optimization methods remains under-explored. In this paper, we systematically inspect their performance. Specifically, for local optimization, we use a local loss during training and a greedy strategy during inference. For global optimization, we optimize unlabeled and labeled trees with structured losses, including Max-Margin and TreeCRF, and apply the Chu-Liu-Edmonds algorithm during inference. Experiments show that the performance of these two optimization methods is closely related to the characteristics of the dataset, and that global optimization can reduce the burden of identifying long-range dependency relations.
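As a concrete illustration of the global inference step, the Chu-Liu-Edmonds algorithm recovers the highest-scoring dependency tree from pairwise arc scores; networkx exposes it as a maximum spanning arborescence. The scores below are toy values, not model outputs.

```python
# Global decoding sketch: given arc scores between utterances, find the
# maximum spanning arborescence (Chu-Liu-Edmonds) as the dialogue tree.

import networkx as nx

# scores[(head, dependent)] = model score for attaching `dependent` to
# `head`; node 0 is a dummy root preceding the first utterance.
scores = {(0, 1): 5.0, (1, 2): 4.0, (0, 2): 1.5, (2, 3): 3.0, (1, 3): 2.5}

G = nx.DiGraph()
for (h, d), s in scores.items():
    G.add_edge(h, d, weight=s)

tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))  # [(0, 1), (1, 2), (2, 3)]
```

By contrast, local (greedy) decoding would simply take the argmax head per utterance, which can produce graphs that are not valid trees.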
ISBN (Print): 9789819794393; 9789819794409
Mathematical reasoning is challenging for large language models (LLMs), and the scaling relationship with respect to LLM capacity is under-explored. Existing works have tried to leverage the rationales of LLMs to train small language models (SLMs) for enhanced reasoning abilities, a process referred to as distillation. However, most existing distillation methods do not guide the small models to solve problems progressively from simple to complex, which can be a more effective strategy. This study proposes a multi-step self-questioning and answering (M-SQA) method that guides SLMs to solve complex problems by starting from simple ones. Initially, multi-step self-questioning and answering rationales are extracted from LLMs based on complexity-based prompting. Subsequently, these rationales are used to distill SLMs in a multi-task learning framework, during which the model learns to reason in multiple steps in a self-questioning-and-answering manner, answering each sub-question in a single step iteratively. Experiments on current mathematical reasoning tasks demonstrate the effectiveness of the proposed approach.
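Below is a hypothetical sketch of what one multi-step self-Q&A training instance might look like; the exact rationale format is an assumption inferred from the abstract, and the arithmetic problem is invented.

```python
# Hypothetical M-SQA-style training instance: the distilled SLM learns to
# pose and answer sub-questions from simple to complex before the final
# answer.

example = {
    "problem": "A shop sells pens at $2 each. Tom buys 3 pens and pays "
               "with $10. How much change does he get?",
    "rationale": [
        {"sub_question": "How much do 3 pens cost?",
         "sub_answer": "3 * 2 = 6 dollars."},
        {"sub_question": "How much change from $10 after paying $6?",
         "sub_answer": "10 - 6 = 4 dollars."},
    ],
    "final_answer": "4",
}

def to_training_text(ex):
    """Serialize an instance into the iterative self-Q&A target string."""
    steps = "\n".join(
        f"Q{i+1}: {s['sub_question']}\nA{i+1}: {s['sub_answer']}"
        for i, s in enumerate(ex["rationale"])
    )
    return f"Problem: {ex['problem']}\n{steps}\nFinal answer: {ex['final_answer']}"

print(to_training_text(example))
```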
ISBN (Print): 9789819794393; 9789819794409
Social media has become the primary source of information for individuals, yet much of this information remains unverified. The rise of generative artificial intelligence has further accelerated the creation of unverified content. Adaptive rumor resolution systems are imperative for maintaining information integrity and public trust. Traditional methods have relied on encoder-based frameworks to enhance rumor representation and propagation characteristics. However, these models are often small in scale and lack generalizability to unforeseen events. Recent advances in large language models (LLMs) show promise, but LLMs remain unreliable at discerning truth from falsehood. Our work leverages LLMs by creating a testbed for predicting unprecedented rumors and designing a retrieval-augmented framework that integrates historical knowledge and collective intelligence. Experiments on two real-world datasets demonstrate the effectiveness of our proposed framework.
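A minimal sketch of the retrieval-augmented idea follows, assuming TF-IDF retrieval over resolved historical rumors and a prompt that also carries user comments as collective intelligence. The historical rumors, retrieval method, and prompt wording are all placeholders, not the paper's implementation.

```python
# Sketch: retrieve similar resolved rumors and inject them, with user
# comments, into an LLM verification prompt.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [  # (resolved rumor, verdict) pairs -- toy examples
    ("Celebrity X died in a car crash last night.", "FALSE"),
    ("City Y will shut schools due to flooding.", "TRUE"),
]

def build_prompt(claim, comments, k=1):
    docs = [h[0] for h in history]
    vec = TfidfVectorizer().fit(docs + [claim])
    sims = cosine_similarity(vec.transform([claim]), vec.transform(docs))[0]
    top = sims.argsort()[::-1][:k]
    evidence = "\n".join(f"- {history[i][0]} (verdict: {history[i][1]})" for i in top)
    return (
        f"Claim: {claim}\n"
        f"Similar resolved rumors:\n{evidence}\n"
        "User comments:\n" + "\n".join(f"- {c}" for c in comments) +
        "\nIs the claim TRUE or FALSE? Answer with a verdict and a reason."
    )
```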
ISBN (Print): 9789819794362; 9789819794379
Knowledge-based visual question answering (VQA) requires external knowledge in addition to the image content to answer questions. Recent studies convert images to text descriptions and then generate answers or acquire implicit knowledge using a large language model (LLM). These methods achieve encouraging results thanks to the strong knowledge retrieval and reasoning capabilities of LLMs. However, methods that incorporate LLMs are limited by the discrepancies between images and the text descriptions presented to LLMs. To address this challenge, we present RAVL, a retrieval-augmented visual language model (VLM) framework for knowledge-based VQA. Specifically, we first fine-tune a VLM on the knowledge-based VQA task with inputs consisting of retrieved knowledge and image-question pairs, adapting the VLM to inputs with retrieved knowledge. After that, we adapt the retrieval module to the fine-tuned VLM using supervision signals provided by the VLM, so that the retrieved knowledge improves the VLM's perplexity. RAVL overcomes the limitation of visual information loss and improves the effectiveness of VLMs with external knowledge. We conduct experiments on the OK-VQA dataset, and our method achieves 65.73% accuracy, surpassing the previous state-of-the-art method (+3.63%).
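One way to read the second stage is that the VLM's loss on the gold answer scores how useful each retrieved passage is, and the retriever is trained toward that signal. The sketch below is an interpretation under that assumption; `vlm_answer_loss` is a hypothetical function, not RAVL's actual interface.

```python
# Sketch: turn VLM feedback into soft targets a retriever can imitate.
# Lower loss on the gold answer (lower perplexity) => more useful passage.

import math

def retriever_targets(image, question, answer, passages, vlm_answer_loss, tau=1.0):
    """Softmax over negative VLM losses, one utility per candidate passage."""
    utilities = [-vlm_answer_loss(image, question, answer, p) / tau
                 for p in passages]
    z = sum(math.exp(u) for u in utilities)
    return [math.exp(u) / z for u in utilities]
```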
ISBN (Print): 9789819794362; 9789819794379
General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry, as they generalize foundation models to various practical tasks through prompting. To assist users in selecting the best model for practical application scenarios, i.e., choosing the model that meets the application requirements while minimizing cost, we introduce A-Eval, an application-driven evaluation benchmark for general large language models. First, we categorize evaluation tasks into five main categories and 27 sub-categories from a practical application perspective. Next, we construct a dataset comprising 678 question-and-answer pairs through a process of collecting, annotating, and reviewing. Then, we design an objective and effective evaluation method and evaluate a series of LLMs of different scales on A-Eval. Finally, we reveal interesting laws regarding model scale and task difficulty level and propose a feasible method for selecting the best model. Through A-Eval, we provide clear empirical and engineering guidance for selecting the best model, reducing the barriers to selecting and using LLMs and promoting their application and development. Our benchmark is publicly available at https://***/UnicomAI/UnicomBenchmark/tree/main/A-Eval.
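An application-driven evaluation loop in this spirit might iterate over categorized Q&A pairs and aggregate scores per (category, difficulty). The dataset row, the `llm` callables, and the `grade` function below are placeholder assumptions, not A-Eval's actual scoring method.

```python
# Illustrative evaluation loop: query each model on categorized Q&A pairs
# and average graded scores per (model, category, difficulty) bucket.

from collections import defaultdict

dataset = [
    {"category": "text generation", "difficulty": "easy",
     "question": "Write a one-line slogan for a fitness app.",
     "reference": "placeholder reference answer"},
]

def evaluate(models, dataset, grade):
    scores = defaultdict(list)  # (model, category, difficulty) -> scores
    for name, llm in models.items():
        for item in dataset:
            answer = llm(item["question"])
            scores[(name, item["category"], item["difficulty"])].append(
                grade(answer, item["reference"])
            )
    return {k: sum(v) / len(v) for k, v in scores.items()}
```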
ISBN (Print): 9783031791635; 9783031791642
In recent years, machine learning and the web have developed rapidly. This has resulted in continual and explosive growth in the sharing of ideas and views on products and services over the worldwide web across an array of sectors. As a result, an enormous flow of internet data is available for analytical research. Sentiment analysis (SA) is a part of natural language processing (NLP) that requires processing enormous amounts of data in order to identify people's opinions and sentiments. Several studies have been conducted to deal with the negative effects of social networks. This field of research is growing in popularity in both the public and private sectors, leading to the creation of several challenges. However, the majority of the available datasets are in English, whereas Arabic Moroccan dialect (Darija) datasets are not. Accordingly, we created models combining NLP and machine learning techniques to detect and classify sentiments. We evaluated the models using the most common metrics: accuracy, loss, F1-score, precision, and recall. The experiments revealed modest scores between 87% and 89%. These findings imply that the models need to be improved, owing to the lack of accessible datasets and pre-processing techniques for handling the Moroccan dialect of Arabic (Darija).
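A minimal sketch of such a classify-and-evaluate pipeline follows, using scikit-learn and the metrics named above. The two Darija strings are invented toy examples, and the specific classifier is an assumption; the paper's own models are not specified here.

```python
# Sketch: TF-IDF + logistic regression sentiment classifier, reporting
# accuracy, precision, recall, and F1 as in the evaluation described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

X_train = ["had lproduit zwin bzf", "khayb hadchi makayswach"]  # toy examples
y_train = ["positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)  # evaluate on held-out data in practice
p, r, f1, _ = precision_recall_fscore_support(y_train, y_pred, average="macro")
print(f"acc={accuracy_score(y_train, y_pred):.2f} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```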
ISBN (Print): 9783031782541; 9783031782558
Recent advancements in natural language processing (NLP) have been driven by large language models (LLMs) that excel at understanding the complexities of natural language. These models have transformed NLP tasks through transfer learning, where pre-trained LLMs are fine-tuned on domain-specific datasets. Financial sentiment analysis is particularly challenging due to the complexity of financial language, requiring more advanced methods than traditional sentiment analysis approaches. Fine-tuning LLMs can enhance performance in the financial domain, but the high computational cost of standard full fine-tuning is a barrier. This study explores the effectiveness of four parameter-efficient fine-tuning (PEFT) methods, namely Low-Rank Adaptation (LoRA), prompt tuning, prefix tuning, and adapters, for financial sentiment analysis. The findings show that PEFT methods can match or exceed the performance of full fine-tuning while significantly reducing computational requirements. Specifically, adapting the Open Pre-trained Transformers (OPT) model with LoRA achieved the highest accuracy of 89% using only 0.19% of the model's parameters. PEFT methods also yielded substantial graphics processing unit (GPU) memory savings of up to 80%. Small-scale fine-tuned LLMs outperformed cutting-edge large-scale general-purpose models such as ChatGPT, highlighting the value of domain-specific fine-tuning. LLMs demonstrated superiority over conventional long short-term memory (LSTM) models by achieving an 18% increase in accuracy, thereby validating their higher implementation costs.
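For a concrete picture of LoRA-based PEFT, here is a sketch using Hugging Face's `peft` library. The OPT checkpoint, target modules, label count, and hyperparameters are illustrative choices, not the paper's exact configuration.

```python
# Sketch: wrap an OPT classifier with LoRA adapters so only a small
# fraction of parameters is trained.

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=3  # positive / neutral / negative
)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    task_type="SEQ_CLS",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

The wrapped model then trains with a standard fine-tuning loop (e.g., the `transformers` Trainer), which is where the reported GPU memory savings come from: gradients and optimizer states exist only for the adapter weights.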