ISBN (print): 9798891760882
Code switching (CS) is a very common phenomenon in written and spoken communication but one that is handled poorly by many natural language processing (NLP) applications. Looking to the application of building CS corpora, we explore CS language identification (LID) for corpus building. We make the task more realistic by scaling it to more languages and considering models with simpler architectures for faster inference. We also reformulate the task as a sentence-level multi-label tagging problem to make it more tractable. Having defined the task, we investigate three reasonable models for this task and define metrics that better reflect desired performance. We present empirical evidence that no current approach is adequate and finally provide recommendations for future work in this area.
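To illustrate the sentence-level multi-label reformulation described above, here is a minimal sketch, not one of the paper's three models: each sentence is tagged with the set of languages it contains, using character n-gram features and a one-vs-rest linear classifier as a stand-in for the simpler, fast-inference architectures the abstract mentions. The toy sentences, language codes, and per-label macro F1 metric are illustrative assumptions.

```python
# Minimal sketch of sentence-level multi-label language identification (LID)
# for code-switched text: predict the SET of languages present in a sentence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

# Toy code-switched training data; labels are sets of language codes.
sentences = [
    "I went to the mercado to buy fruta",   # en + es
    "The meeting is at noon tomorrow",      # en
    "el perro duerme en el sofa",           # es
    "je suis tres tired aujourd'hui",       # fr + en
]
labels = [{"en", "es"}, {"en"}, {"es"}, {"fr", "en"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Character n-grams are a cheap, language-agnostic signal; one linear
# classifier per language keeps inference fast.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(sentences, Y)

# Per-label (macro) F1 reflects whether each language is detected,
# independently of how many languages a sentence mixes.
pred = clf.predict(sentences)
print("macro F1:", f1_score(Y, pred, average="macro", zero_division=0))
print(mlb.inverse_transform(clf.predict(["where is the biblioteca"])))
```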
ISBN (print): 1577358872
Sentence Representation Learning (SRL) is a fundamental task in natural language processing (NLP), with the Contrastive Learning of Sentence Embeddings (CSE) being the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, with their only difference lying in the training data. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, since alignment and uniformity only measure the results, they fail to answer "What aspects of the training data contribute to the performance gap?" and "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, we identify the similarity pattern as a key factor in the performance gap, and introduce a metric, called Relative Fitting Difficulty (RFD), to measure the complexity of the similarity pattern. Then, based on the insights gained from the "What" question, we tackle the "How" question by increasing the pattern complexity of the training data. We achieve this by leveraging the In-Context Learning (ICL) capability of the Large Language Model (LLM) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE. We release our code and appendix at https://***/BDBC-KG-NLP/NGCSE.
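For context on the contrastive objective underlying CSE, the sketch below shows a SimCSE-style InfoNCE loss in which two dropout-noised encodings of the same batch serve as positive pairs (the unsupervised setting); supervised CSE differs only in where the positives come from (e.g., annotated entailment pairs). The toy encoder, dimensions, and temperature are illustrative assumptions, and the paper's RFD metric and LLM-generated data are not reproduced here.

```python
# Minimal sketch of the contrastive (InfoNCE) objective used in CSE.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05):
    """anchors, positives: (batch, dim) embeddings; i-th anchor matches i-th positive."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature          # cosine similarity matrix
    targets = torch.arange(a.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Dropout acts as minimal augmentation: encoding the same batch twice with
# dropout active yields the positive pairs of unsupervised, SimCSE-style CSE.
encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Dropout(0.1))
encoder.train()
x = torch.randn(8, 32)                      # placeholder "sentence" features
loss = info_nce(encoder(x), encoder(x))
loss.backward()
print(float(loss))
```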
Vision-language Models (VLMs) have seen a significant increase in both research interest and real-world applications across various domains, including healthcare, autonomous systems, and security. However, their growi...
As open-weight large language models (LLMs) achieve ever more impressive performances across a wide range of tasks in English, practitioners aim to adapt these models to different languages. However, such language ada...
We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals ...
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks. This paper explores a self-training paradigm, where the LLM aut...
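Since the abstract above is truncated, the following is only a generic, runnable sketch of a self-training loop of the kind it describes: the model pseudo-labels unlabeled inputs, only high-confidence labels are kept, and the model is updated on them. The keyword-rule "model", confidence threshold, and update step are toy stand-ins, not the paper's method.

```python
# Toy self-training loop: predict -> filter by confidence -> absorb examples.
def predict(rules, text):
    """Return (pseudo-label, confidence) from simple keyword rules."""
    for keyword, label in rules.items():
        if keyword in text:
            return label, 0.9
    return "other", 0.3

def self_train(rules, unlabeled, rounds=2, threshold=0.8):
    for _ in range(rounds):
        for text in unlabeled:
            label, score = predict(rules, text)
            if score >= threshold:              # keep only confident pseudo-labels
                rules[text.split()[0]] = label  # "fine-tune": absorb the example
    return rules

rules = self_train({"refund": "billing"}, ["refund my order", "reset my password"])
print(predict(rules, "refund status"), rules)
```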
Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetic...
Segmenting text into fine-grained units of meaning is important to a wide range of NLP applications. The default approach of segmenting text into sentences is often insufficient, especially since sentences are usually complex enoug...
Existing research predominantly focuses on developing powerful large language models (LLMs) for mathematical reasoning within monolingual languages, with few explorations in preserving efficacy in a multilingual conte...
Few-Shot Cross-Domain NER is the process of leveraging knowledge from data-rich source domains to perform entity recognition on data-scarce target domains. Most previous state-of-the-art (SOTA) approaches use pre-trai...