ISBN: (print) 9798350349122; 9798350349115
In the Text-to-SQL task, a significant challenge is enabling parsers to generalize effectively across diverse domains. Key to solving this challenge is schema linking, which maps words in the question to the pertinent columns or tables in the database. Existing methods based on pre-trained language models (PLMs), which rely on token masking, have limitations in capturing the variety of schemas. Unlike single tokens, phrases offer richer semantics and superior discrimination in determining whether a word corresponds to a table or a column. In this paper, we present an innovative approach named Phrase-based Schema-Linking for Text-to-SQL (PS-SQL). By incorporating phrases extracted from the question, we enhance PLMs' ability to learn the mapping between tokens and schemas, leading to more robust schema linking. We also introduce a mechanism to refine the extracted phrases, reducing noise. In evaluations on several real-world datasets, PS-SQL consistently delivers higher schema-linking precision, resulting in higher-quality SQL query generation.
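The schema-linking idea described in this abstract can be sketched in a much-simplified form: extract candidate phrases (word n-grams) from the question and match them against table and column names. This is only an illustrative string-matching baseline with hypothetical names, not the PS-SQL model, which learns the mapping with a PLM.

```python
# Toy phrase-based schema linking (illustrative baseline, NOT PS-SQL):
# extract word n-grams from the question, then match them against
# table/column names after normalizing spaces to underscores.

def extract_phrases(question, max_len=3):
    """Return all word n-grams of length 1..max_len from the question."""
    words = question.lower().replace("?", "").split()
    phrases = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases

def link_schema(question, schema):
    """Map phrases to matching schema items.

    `schema` is a dict: table name -> list of column names.
    Returns a dict: phrase -> (table, column or None for a table match).
    """
    links = {}
    for phrase in extract_phrases(question):
        key = phrase.replace(" ", "_")
        for table, columns in schema.items():
            if key == table.lower():
                links[phrase] = (table, None)
            for col in columns:
                if key == col.lower():
                    links[phrase] = (table, col)
    return links

schema = {"singer": ["singer_id", "name", "country"],
          "concert": ["concert_id", "singer_id", "year"]}
links = link_schema("What is the name and country of each singer?", schema)
print(links)
```

A learned linker additionally handles paraphrases and ambiguous mentions, which pure string matching cannot; the abstract's point is that multi-word phrases disambiguate better than single tokens.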
ISBN: (print) 9798891760899
Generating natural language text from graph-structured data is essential for conversational information seeking. Semantic triples derived from knowledge graphs can serve as a valuable source for grounding responses from conversational agents by providing a factual basis for the information they communicate. This is especially relevant in the context of large language models, which offer great potential for conversational interaction but are prone to hallucinating, omitting, or producing conflicting information. In this study, we conduct an empirical analysis of conversational large language models in generating natural language text from semantic triples. We compare four large language models of varying sizes with different prompting techniques. Through a series of benchmark experiments on the WebNLG dataset, we analyze the models' performance and identify the most common issues in the generated predictions. Our findings show that the capabilities of large language models in triple verbalization can be significantly improved through few-shot prompting, post-processing, and efficient fine-tuning techniques, particularly for smaller models that exhibit lower zero-shot performance.
ISBN: (print) 9798891760189
Supervised Word Sense Disambiguation (WSD) has been studied intensively for over three decades. However, disentangling diverse contexts remains a challenging problem. This paper addresses the problem and proposes a Perturbation-based constrained attention network (Pconan) for injecting lexical knowledge derived from WordNet. Pconan models beneficial dependencies between the segments/words within the input sequence using a mask-attention technique. We incorporate a perturbation method into our model to mitigate the overfitting that results from intensive learning. Experimental results on a benchmark dataset show that our method is comparable to state-of-the-art WSD methods. Our source code is available online.
ISBN: (print) 9798400704369
Human beings aspire to a better life, and financial well-being enables this. However, a lack of financial literacy, ever-growing wealth inequality, and persuasive illicit information circulating on social media inhibit one's progress toward good fortune. In this paper, we discuss four pillars where natural language processing can help improve financial literacy, reduce wealth disparity, ensure a sustainable future, and support economic prosperity. These pillars are: Inclusive investing, Improved investing, Impactful (green) investing, and Informed investing. Additionally, we specifically cater to the Indian market (Indic investing) and present several resources to enhance the comprehensibility of financial texts. Inclusive investing deals with enhancing the readability and reach of financial texts. Improved investing addresses the need to simplify investors' journeys by providing them with hypernyms and relations between entities. Impactful investing focuses on sustainable pathways. Informed investing is about eradicating finance-related misinformation from social media, e.g., evaluating the trustworthiness of posts by executives and detecting in-claim and exaggerated numerals. In most cases, we demonstrate the efficacy of our approaches by benchmarking them against existing state-of-the-art methods.
We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Ou...
The paper contains an analytical review of methods for solving problems of semantically coherent text processing, search and selection of learning models for solving text processing problems, comparison of the obtaine...
ISBN: (print) 9798891760615
Language models can be prompted to perform a wide variety of tasks with zero- and few-shot in-context learning. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens. In this paper, we analyze the factors that contribute to this variance and establish a new empirical hypothesis: the performance of a prompt is predicted by the extent to which the model is familiar with the language it contains. Over a wide range of tasks, we show that, among reasonable prompts related to a task, the lower the perplexity of the prompt, the better it performs the task. As part of our analysis, we also devise a method to automatically extend a small seed set of manually written prompts by paraphrasing with GPT3 and back-translation. This larger set allows us to verify that perplexity is a strong predictor of the success of a prompt, and we show that the lowest-perplexity prompts are consistently effective.
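The core measurement in this abstract, ranking candidate prompts by perplexity, can be illustrated with a toy character-bigram language model in place of a real LM (the paper uses large pre-trained models; the corpus and prompts below are invented for illustration):

```python
# Toy sketch of "lower prompt perplexity -> more familiar phrasing":
# train add-one-smoothed character-bigram probabilities on a small corpus,
# then rank candidate prompts by their perplexity under that model.
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Return a smoothed bigram probability function P(ch | prev)."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab_size = len(set(corpus))
    def prob(prev, ch):
        return (bigrams[(prev, ch)] + 1) / (unigrams[prev] + vocab_size)
    return prob

def perplexity(prob, text):
    """exp of the average negative log-probability per character transition."""
    logp = sum(math.log(prob(a, b)) for a, b in zip(text, text[1:]))
    return math.exp(-logp / max(len(text) - 1, 1))

corpus = ("translate english to french . summarize the article . "
          "answer the question . translate english to german . ") * 5
prob = train_bigram_lm(corpus)

prompts = ["translate english to french", "english french translate do"]
ranked = sorted(prompts, key=lambda p: perplexity(prob, p))
print(ranked[0])  # the fluent, corpus-like prompt scores lower perplexity
```

With a real LM, the same ranking is done by scoring each prompt's token sequence; the hypothesis in the abstract is that the lowest-perplexity prompts tend to perform the task best.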
ISBN: (print) 9798891760615
Several recent papers have published good solutions for language identification (LID) for about 300 high- and medium-resource languages. However, no available LID (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability, and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID, and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and generally noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://***/cisnlp/GlotLID.
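The LID task itself can be sketched with a classic character-n-gram-profile approach (this is a deliberately tiny illustration with made-up training snippets, not GlotLID-M, which is a trained model covering 1665 languages):

```python
# Toy character-trigram language identification (illustrative only):
# build an n-gram profile per language, then pick the language whose
# profile shares the most trigram mass with the input text.
from collections import Counter

def char_ngrams(text, n=3):
    """Counter of character n-grams, with padding around the text."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def train_profiles(samples):
    """samples: dict language code -> training text."""
    return {lang: char_ngrams(text) for lang, text in samples.items()}

def identify(profiles, text):
    """Return the language whose profile best overlaps the text's n-grams."""
    grams = char_ngrams(text)
    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

samples = {
    "eng": "the quick brown fox jumps over the lazy dog and the cat",
    "deu": "der schnelle braune fuchs springt ueber den faulen hund",
    "fra": "le rapide renard brun saute par dessus le chien paresseux",
}
profiles = train_profiles(samples)
print(identify(profiles, "the dog and the fox"))
```

The low-resource challenges the abstract lists (closely related languages, macrolanguage vs. variety, noisy data) are exactly where such simple overlap scoring breaks down, which motivates a carefully evaluated trained model.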
ISBN: (print) 9798891760615
Pre-trained Language Models (PLMs) are trained on vast unlabeled data that is rich in world knowledge. This fact has sparked the interest of the community in quantifying the amount of factual knowledge present in PLMs, as this explains their performance on downstream tasks and potentially justifies their use as knowledge bases. In this work, we survey methods and datasets used to probe PLMs for factual knowledge. Our contributions are: (1) we propose a categorization scheme for factual probing methods based on how their inputs, outputs, and the probed PLMs are adapted; (2) we provide an overview of the datasets used for factual probing; (3) we synthesize insights about knowledge retention and prompt optimization in PLMs, analyze obstacles to adopting PLMs as knowledge bases, and outline directions for future work.
ISBN: (print) 9798891760615
Linguists can access movement in sign language video corpora through manual annotation or computational methods. The first relies on a predefinition of features, and the second requires technical knowledge. Methods like MediaPipe and OpenPose are now used more often in sign language processing. MediaPipe detects a two-dimensional (2D) body pose in a single image with a limited approximation of the depth coordinate. Such a 2D projection of a three-dimensional (3D) body pose limits the potential application of the resulting models outside the capturing camera settings and position. 2D pose data also does not provide linguists with direct, human-readable access to the collected movement data. We propose four main contributions: a novel 3D normalization method for MediaPipe's 2D pose, a novel human-readable way of representing the 3D normalized pose data, and an analysis of Japanese Sign Language (JSL) sociolinguistic features using the proposed techniques, in which we show how an individual signer can be identified based on unique personal movement patterns, suggesting a potential threat to anonymity. Our method outperforms the common 2D normalization on a small, diverse JSL dataset. We demonstrate its benefit for deep-learning approaches by significantly outperforming pose-based state-of-the-art models on an open sign language recognition benchmark.
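The "common 2D normalization" that this paper uses as a baseline can be sketched as centering keypoints on a body reference point and scaling by a body-relative distance, so coordinates no longer depend on where the camera placed the person in the frame. The joint names and values below are hypothetical; the paper's own contribution goes further by recovering a normalized 3D pose.

```python
# Common 2D pose normalization baseline (illustrative, not the paper's
# 3D method): translate keypoints so the shoulder midpoint is the origin,
# then scale by the shoulder width.
import math

def normalize_pose(keypoints, left_shoulder, right_shoulder):
    """keypoints: dict joint name -> (x, y) in image pixels.

    Returns keypoints in body-relative units, independent of the
    person's position and apparent size in the frame.
    """
    lx, ly = keypoints[left_shoulder]
    rx, ry = keypoints[right_shoulder]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2          # shoulder midpoint
    scale = math.dist((lx, ly), (rx, ry)) or 1.0   # shoulder width
    return {name: ((x - cx) / scale, (y - cy) / scale)
            for name, (x, y) in keypoints.items()}

pose = {"l_shoulder": (100, 200), "r_shoulder": (180, 200), "r_wrist": (220, 120)}
norm = normalize_pose(pose, "l_shoulder", "r_shoulder")
print(norm["r_wrist"])  # wrist is one shoulder-width right and up from center
```

Because this baseline only rescales the 2D projection, it cannot undo perspective effects from the camera angle, which is the limitation the paper's 3D normalization targets.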