Search results - Inner Mongolia University Library

arXiv, 2023

Authors: Pluščec, Domagoj; Šnajder, Jan. Affiliation: University of Zagreb, Faculty of Electrical Engineering and Computing, Text Analysis and Knowledge Engineering Lab, Unska 3, Zagreb 10000, Croatia

Data scarcity is a problem that occurs in languages and tasks where we do not have large amounts of labeled data but want to use state-of-the-art models. Such models are often deep learning models that require a significant amount of data to train. Acquiring data for various machine learning problems is accompanied by high labeling costs. Data augmentation is a low-cost approach for tackling data scarcity. This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing, with an emphasis on methods for neural and transformer-based models. Furthermore, it discusses the practical challenges of data augmentation, possible mitigations, and directions for future research. © 2023, CC BY.

Keywords: Natural language processing systems

PANDORA Talks: Personality and Demographics on Reddit

9th International Workshop on Natural Language Processing for Social Media, SocialNLP 2021

Authors: Gjurkovic, Matej; Karan, Mladen; Vukojevic, Iva; Bosnjak, Mihaela; Snajder, Jan. Affiliation: Text Analysis and Knowledge Engineering Lab, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb 10000, Croatia

ISBN (print): 9781954085329

Personality and demographics are important variables in social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset in three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psychodemographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables. © SocialNLP 2021 Natural Language Processing for Social Media

Keywords: Computation theory

ALANNO: An Active Learning Annotation System for Mortals

arXiv, 2022

Authors: Jukić, Josip; Jelenić, Fran; Bićanić, Miroslav; Šnajder, Jan. Affiliation: University of Zagreb, Faculty of Electrical Engineering and Computing, Text Analysis and Knowledge Engineering Lab, Croatia

Supervised machine learning has become the cornerstone of today’s data-driven society, increasing the need for labeled data. However, the process of acquiring labels is often expensive and tedious. One possible remedy is to use active learning (AL) – a special family of machine learning algorithms designed to reduce labeling costs. Although AL has been successful in practice, a number of practical challenges hinder its effectiveness and are often overlooked in existing AL annotation tools. To address these challenges, we developed ALANNO, an open-source annotation system for NLP tasks equipped with features to make AL effective in real-world annotation projects. ALANNO facilitates annotation management in a multi-annotator setup and supports a variety of AL methods and underlying models, which are easily configurable and extensible. © 2022, CC BY.

Keywords: Learning algorithms

Large-scale Evaluation of Transformer-based Article Encoders on the Task of Citation Recommendation

arXiv, 2022

Authors: Medić, Zoran; Šnajder, Jan. Affiliation: Text Analysis and Knowledge Engineering Lab, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb 10000, Croatia

Recently introduced transformer-based article encoders (TAEs) designed to produce similar vector representations for mutually related scientific articles have demonstrated strong performance on benchmark datasets for scientific article recommendation. However, the existing benchmark datasets are predominantly focused on single domains and, in some cases, contain easy negatives in small candidate pools. Evaluating representations on such benchmarks might obscure the realistic performance of TAEs in setups with thousands of articles in candidate pools. In this work, we evaluate TAEs on large benchmarks with more challenging candidate pools. We compare the performance of TAEs with a lexical retrieval baseline model, BM25, on the task of citation recommendation, where the model produces a list of recommendations for citing in a given input article. We find that BM25 is still very competitive with the state-of-the-art neural retrievers, a finding which is surprising given the strong performance of TAEs on small benchmarks. As a remedy for the limitations of the existing benchmarks, we propose a new benchmark dataset for evaluating scientific article representations: the Multi-Domain Citation Recommendation dataset (MDCR), which covers different scientific fields and contains challenging candidate pools. © 2022, CC BY.

Keywords: Signal encoding

Staying true to your word: (How) can attention become explanation?

5th Workshop on Representation Learning for NLP, RepL4NLP 2020 at the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020

Authors: Tutek, Martin; Šnajder, Jan. Affiliation: Text Analysis and Knowledge Engineering Lab, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb 10000, Croatia

ISBN (print): 9781952148156

The attention mechanism has quickly become ubiquitous in NLP. In addition to improving performance of models, attention has been widely used as a glimpse into the inner workings of NLP models. The latter aspect has in recent years become a common topic of discussion, most notably in the work of Jain and Wallace (2019) and Wiegreffe and Pinter (2019). With the shortcomings of using attention weights as a tool of transparency revealed, the attention mechanism has been stuck in a limbo without concrete proof of when and whether it can be used as an explanation. In this paper, we provide an explanation as to why attention has seen rightful critique when used with recurrent networks in sequence classification tasks. We propose a remedy to these issues in the form of a word level objective and our findings give credibility for attention to provide faithful interpretations of recurrent models. © 2020 Association for Computational Linguistics.

Keywords: Natural language processing systems

A Survey of Citation Recommendation Tasks and Methods

Journal of Computing and Information Technology, 2020, vol. 28, no. 3, pp. 183-205

Authors: Medić, Zoran; Šnajder, Jan. Affiliation: Text Analysis and Knowledge Engineering Lab, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia

Scientific articles store vast amounts of knowledge amassed through many decades of research. They serve to communicate research results among scientists but also for learning and tracking progress in the field. However, scientific production has risen to levels that make it difficult even for experts to keep up with work in their field. As a remedy, specialized search engines are being deployed, incorporating novel natural language processing and machine learning methods. The task of citation recommendation, in particular, has attracted much interest as it holds promise for improving the quality of scientific production. In this paper, we present the state-of-the-art in citation recommendation: we survey the methods for global and local approaches to the task, the evaluation setups and datasets, and the most successful machine learning models. In addition, we overview two tasks complementary to citation recommendation: extraction of key aspects and entities from articles and citation function classification. With this survey, we hope to provide the ground for understanding current efforts and stimulate further research in this exciting and promising field. © 2020, Journal of Computing and Information Technology. All Rights Reserved.

Keywords: Surveys

XHATE-999: Analyzing and Detecting Abusive Language Across Domains and Languages

28th International Conference on Computational Linguistics, COLING 2020

Authors: Glavaš, Goran; Karan, Mladen; Vulić, Ivan. Affiliations: Data and Web Science Group, University of Mannheim, Germany; Text Analysis and Knowledge Engineering Lab, University of Zagreb, Croatia; Language Technology Lab, TAL, University of Cambridge, United Kingdom

ISBN (print): 9781952148279

We present XHATE-999, a multi-domain and multilingual evaluation data set for abusive language detection. By aligning test instances across six typologically diverse languages, XHATE-999 for the first time allows for disentanglement of the domain transfer and language transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHATE-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain- and language-adaptation, via intermediate masked language modeling on abusive corpora in the target language, can lead to substantially improved abusive language detection in the target language in the zero-shot transfer setups. © 2020 COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference. All rights reserved.

Keywords: Modeling languages

Staying True to Your Word: (How) Can Attention Become Explanation?

arXiv, 2020

Authors: Tutek, Martin; Šnajder, Jan. Affiliation: Text Analysis and Knowledge Engineering Lab, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb 10000, Croatia

Keywords: Recurrent neural networks

TakeLab at SemEval-2019 task 4: Hyperpartisan news detection

13th International Workshop on Semantic Evaluation, SemEval 2019, co-located with the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019

Authors: Palić, Niko; Vladika, Juraj; Čubelić, Dominik; Lovrenčić, Ivan; Buljan, Maja; Šnajder, Jan. Affiliation: Text Analysis and Knowledge Engineering Lab, Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb 10000, Croatia

ISBN (print): 9781950737062

In this paper, we demonstrate the system built to solve the SemEval-2019 task 4: Hyperpartisan News Detection (Kiesel et al., 2019), the task of automatically determining whether an article is heavily biased towards one side of the political spectrum. Our system receives an article in its raw, textual form, analyzes it, and predicts with moderate accuracy whether the article is hyperpartisan. The learning model used was primarily trained on a manually prelabeled dataset containing news articles. The system relies on the previously constructed SVM model, available in the Python Scikit-Learn library. We ranked 6th in the competition of 42 teams with an accuracy of 79.1% (the winning team had 82.2%). © 2019 Association for Computational Linguistics

Keywords: Python

Combining Shallow and Deep Learning for Aggressive Text Detection

COLING 2018 - 1st Workshop on Trolling, Aggression and Cyberbullying, TRAC 2018 - Proceedings of the Workshop

Authors: Golem, Viktor; Karan, Mladen; Šnajder, Jan. Affiliation: Faculty of Electrical Engineering and Computing, University of Zagreb, Text Analysis and Knowledge Engineering Lab, Croatia

ISBN (print): 9781948087605

We describe the participation of team Takelab in the aggression detection shared task at the TRAC1 workshop for English. Aggression manifests in a variety of ways. Unlike some forms of aggression that are impossible to prevent in day-to-day life, aggressive speech abounding on social networks could in principle be prevented or at least reduced by simply disabling users that post aggressively worded messages. The first step in achieving this is to detect such messages. The task, however, is far from being trivial, as what is considered as aggressive speech can be quite subjective, and the task is further complicated by the noisy nature of user-generated text on social networks. Our system learns to distinguish between open aggression, covert aggression, and non-aggression in social media texts. We tried different machine learning approaches, including traditional (shallow) machine learning models, deep learning models, and a combination of both. We achieved respectable results, ranking 4th and 8th out of 31 submissions on the Facebook and Twitter test sets, respectively. © COLING 2018 - 1st Workshop on Trolling, Aggression and Cyberbullying, TRAC 2018 - Proceedings of the Workshop.

Keywords: Social networking (online)
