Keyphrase extraction is a fundamental task in information management, often used as a preliminary step in various information retrieval and natural language processing tasks. The main contribution of this paper lies in providing a comparative assessment of prominent multilingual unsupervised keyphrase extraction methods that build on statistical (RAKE, YAKE), graph-based (TextRank, SingleRank) and deep learning (KeyBERT) approaches. For the experiments reported in this paper, we employ well-known datasets designed for keyphrase extraction in five different natural languages (English, French, Spanish, Portuguese and Polish). We use the F1 score and a partial match evaluation framework, aiming to investigate whether the number of terms in the documents and the language of each dataset affect the accuracy of the selected methods. Our experimental results reveal a set of insights about the suitability of the selected methods for texts of different sizes, as well as the performance of these methods on datasets of different languages.
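The abstract does not spell out the partial-match criterion. Below is a minimal Python sketch of one common variant, in which a predicted phrase counts as correct if it shares at least one token with some gold keyphrase; the matching rule and all names are illustrative assumptions, not the paper's exact evaluation framework.

# Minimal sketch of partial-match F1 for keyphrase extraction. One plausible
# variant, not necessarily the paper's framework: a prediction is credited
# when it overlaps a gold keyphrase in at least one (lowercased) token.
def partial_match(pred: str, gold: str) -> bool:
    """True if the two phrases share at least one lowercased token."""
    return bool(set(pred.lower().split()) & set(gold.lower().split()))

def partial_f1(predicted: list[str], gold: list[str]) -> float:
    """Partial-match F1 over one document's predicted and gold keyphrases."""
    if not predicted or not gold:
        return 0.0
    tp_pred = sum(any(partial_match(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(partial_match(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted)
    recall = tp_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with hypothetical extractor output:
preds = ["keyphrase extraction", "information retrieval", "deep models"]
gold = ["unsupervised keyphrase extraction", "information retrieval"]
print(f"partial-match F1 = {partial_f1(preds, gold):.3f}")  # 0.800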
ISBN (print): 9798891760998
Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading to an urgent need for more efficient methods of information storage and processing. This study introduces Anchor-based LLMs (AnLLMs), which utilize an innovative anchor-based self-attention network (AnSAN) together with an anchor-based inference strategy. This approach enables LLMs to compress sequence information into an anchor token, reducing the keys/values cache and enhancing inference efficiency. Experiments on question-answering benchmarks reveal that AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference. Despite a minor compromise in accuracy, the substantial gains of AnLLMs employing the AnSAN technique in resource utilization and computational efficiency underscore their potential for practical LLM applications.
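The abstract leaves the AnSAN mechanics unspecified, so the following toy PyTorch sketch only illustrates one plausible reading of the anchoring idea: queries that come after an anchor token attend to the anchor (trained to summarize the preceding span) but no longer to pre-anchor positions, which is what would let the pre-anchor keys/values be evicted from the cache. The mask construction and function name are assumptions for illustration, not the authors' implementation.

import torch

def anchor_attention_mask(seq_len: int, anchor_positions: list[int]) -> torch.Tensor:
    """Boolean mask (True = may attend) combining causality with anchoring."""
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    for a in anchor_positions:
        # Queries strictly after the anchor lose direct access to tokens
        # before the anchor; the anchor token itself stays visible.
        mask[a + 1:, :a] = False
    return mask

mask = anchor_attention_mask(seq_len=6, anchor_positions=[2])
print(mask.int())
# Rows 3-5 attend only to positions 2..row, so the keys/values for
# columns 0-1 are never read again and can be dropped from the cache.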
Recent advancements in natural language processing (NLP) have led to the development of NLP-based recommender systems that have shown superior performance. However, current models commonly treat items as mere IDs and ...
We investigate how different domains are encoded in modern neural network architectures. We analyze the relationship between natural language domains, model size, and the amount of training data used. The primary anal...
Over the years, the review helpfulness prediction task has been the subject of several works, but it remains a challenging issue in natural language processing, as results vary considerably depending on the domain, on the...
Contextual advertising provides advertisers with the opportunity to target the context most relevant to their ads. The large variety of potential topics makes it very challenging to collect training documents...
We present a web application for creating games and exercises for teaching English as a foreign language with the help of NLP tools. The application contains different kinds of games such as crosswords, word searches,...
ISBN (print): 9781959429197
The proceedings contain 17 papers. The topics discussed include: a unified framework for cross-domain and cross-task learning of mental health conditions;critical perspectives: a benchmark revealing pitfalls in PerspectiveAPI;securely capturing people’s interactions with voice assistants at home: a bespoke tool for ethical data collection;leveraging world knowledge in implicit hate speech detection;a dataset of sustainable diet arguments on twitter;impacts of low socio-economic status on educational outcomes: a narrative based analysis;enhancing crisis-related tweet classification with entity-masked language modeling and multi-task learning;misinformation detection in the wild: news source classification as a proxy for non-article texts;modelling persuasion through misuse of rhetorical appeals;breaking through inequality of information acquisition among social classes: a modest effort on measuring fun;and identifying condescending language: a tale of two distinct phenomena?.
This paper describes Team Cadence's winning submission to Task C of the MEDIQA-Chat 2023 shared tasks. We also present the set of methods, including a novel N-pass strategy to summarize a mix of clinical dialogue ...
ISBN (print): 9781955917803
This paper presents a combination of data augmentation methods to boost the performance of state-of-the-art transformer-based language models for Patronizing and Condescending Language (PCL) detection and multi-label PCL classification tasks. These tasks are inherently different from sentiment analysis because positive/negative hidden attitudes in the context will not necessarily be considered positive/negative for PCL tasks. The ablation study observes that the class imbalance of the PCL dataset is extreme. This paper presents a modified version of a sentence-paraphrasing deep learning model (PEGASUS) to tackle the limitation of maximum sequence length; the proposed algorithm imposes no specific maximum input length on the sequences it paraphrases. Augmenting the underrepresented class of annotated data achieved competitive results among the top 16 SemEval-2022 participants. This paper's approaches rely on fine-tuning pretrained RoBERTa and GPT-3 models, such as the Davinci and Curie engines, with the enriched PCL dataset. Furthermore, we discuss a few-shot learning technique to overcome the limitations of low-resource NLP problems.
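The authors' modified PEGASUS is not reproduced here; the sketch below shows one straightforward way to work around the encoder's maximum sequence length, namely paraphrasing a long text one sentence-sized chunk at a time and rejoining the outputs. The tuner007/pegasus_paraphrase checkpoint is an assumed public stand-in, not the paper's model.

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

MODEL = "tuner007/pegasus_paraphrase"  # public checkpoint, assumed stand-in
tokenizer = PegasusTokenizer.from_pretrained(MODEL)
model = PegasusForConditionalGeneration.from_pretrained(MODEL)

def paraphrase_long_text(text: str) -> str:
    """Paraphrase arbitrarily long text by processing one sentence at a time."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    outputs = []
    for sent in sentences:
        # Each chunk fits the encoder, so no global length limit applies.
        batch = tokenizer(sent, truncation=True, padding="longest",
                          return_tensors="pt")
        generated = model.generate(**batch, num_beams=4, max_length=60)
        outputs.append(tokenizer.decode(generated[0], skip_special_tokens=True))
    return " ".join(outputs)

print(paraphrase_long_text(
    "Patronizing language often hides behind positive wording. "
    "Data augmentation can rebalance an extremely skewed dataset."
))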