Existing text-to-SQL research assumes that a gold table is available when generating SQL queries. Leveraging information from the gold table makes it possible to generate complex and difficult queries effectively. However, in real-world scenarios, determining which of the numerous tables in a database should be referenced is challenging. Existing models therefore fall short of practicality, a core objective of text-to-SQL research. In response, we propose a practical framework that can effectively convert user questions into queries even when reference tables are not provided. By adding a table-finding phase, it can generate queries using only the information in the question, mitigating the limitations that arise when the reference is restricted to a single table. We demonstrate that our methods are suitable for practical use in text-to-SQL systems by achieving performance comparable to that of existing models with simple structures.
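The abstract does not say how the table-finding phase works. As a rough illustration only, the sketch below ranks candidate tables by token overlap between the question and each table's name and column names. The scoring rule, the `rank_tables` helper, and the `SCHEMA` dictionary are all assumptions made for this example, not the authors' method.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_tables(question, schema):
    """Rank tables by how many question tokens appear in the table name
    or its column names (a stand-in scoring rule, not the paper's)."""
    question_tokens = tokenize(question)
    scores = {}
    for table, columns in schema.items():
        vocabulary = tokenize(table + " " + " ".join(columns))
        scores[table] = len(question_tokens & vocabulary)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))

# Hypothetical schema, for illustration only.
SCHEMA = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "venue", "year"],
    "stadium": ["stadium_id", "location", "capacity"],
}

ranked = rank_tables("What is the average age of each singer by country?", SCHEMA)
print(ranked)
```

A real system would likely replace the overlap score with a learned retriever, but the interface (question in, ranked tables out) would stay the same.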
Language models have shown promising performance on the task of translating natural language questions into SQL queries (text-to-SQL). However, most state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language models (LLMs), such as ChatGPT and GPT-4, which can suffer from unclear model architectures, data privacy risks, and expensive inference overheads. To address these limitations, we introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B, specifically designed for the text-to-SQL task. CodeS is a fully open-source language model that achieves superior accuracy with much smaller parameter sizes. This paper studies the research challenges in building CodeS. To enhance the SQL generation abilities of CodeS, we adopt an incremental pre-training approach using a specifically curated SQL-centric corpus. Building on this, we address the challenges of schema linking and rapid domain adaptation through strategic prompt construction and a bi-directional data augmentation technique. We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark, the newly released BIRD benchmark, robustness-diagnostic benchmarks such as Spider-DK, Spider-Syn, Spider-Realistic, and ***, as well as two real-world datasets created for financial and academic applications. The experimental results show that CodeS achieves new SOTA accuracy and robustness on nearly all challenging text-to-SQL benchmarks.
A common problem with adopting text-to-SQL translation in database systems is poor generalization. Specifically, when training data on new datasets is limited, existing few-shot text-to-SQL techniques tend to be ineffective, even with carefully designed textual prompts on pre-trained language models (PLMs). In this paper, we present a divide-and-conquer framework to better support few-shot text-to-SQL translation, which divides the translation into two stages (or sub-tasks), such that each sub-task is simpler to tackle. The first stage, called the structure stage, steers a PLM to generate an SQL structure (including SQL commands such as SELECT, FROM, WHERE and SQL operators such as "<", ">") with placeholders for missing identifiers. The second stage, called the content stage, guides a PLM to populate the placeholders in the generated SQL structure with concrete values (including SQL identifiers such as table names and column names, and constant values). We propose a hybrid prompt strategy that combines learnable vectors and fixed vectors (i.e., word embeddings of textual prompts), such that the hybrid prompt can learn contextual information to better guide PLMs for prediction in both stages. In addition, we design keyword-constrained decoding to ensure the validity of generated SQL structures, and structure-guided decoding to ensure that the model fills in correct content. Extensive experiments, comparing against ten state-of-the-art text-to-SQL solutions at the time of writing, show that SC-Prompt significantly outperforms them in the few-shot scenario. In particular, on the widely adopted Spider dataset, given fewer than 500 labeled training examples (5% of the official training set), SC-Prompt outperforms the previous SOTA methods by around 5% in accuracy.
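The two-stage split described above can be pictured with a toy example. The sketch below is not SC-Prompt: no PLM or prompt tuning is involved, the bracketed placeholder notation (`[col]`, `[tab]`, `[val]`) is an assumption made for illustration, and the keyword check is only a loose analogue of keyword-constrained decoding.

```python
import re

# Toy keyword and operator vocabularies (assumed, far from complete SQL).
SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY", "AND", "OR"}
SQL_OPERATORS = {"<", ">", "=", ",", "*"}
# Hypothetical placeholder notation for columns, tables, and values.
PLACEHOLDER = re.compile(r"\[(col|tab|val)\]")

def is_valid_structure(structure):
    """Structure stage check: every whitespace-separated token must be
    a keyword, an operator, or a placeholder (no concrete identifiers)."""
    for token in structure.split():
        if token in SQL_KEYWORDS or token in SQL_OPERATORS:
            continue
        if PLACEHOLDER.fullmatch(token):
            continue
        return False
    return True

def fill_content(structure, values):
    """Content stage: replace placeholders left to right with values."""
    remaining = iter(values)
    return PLACEHOLDER.sub(lambda match: next(remaining), structure)

structure = "SELECT [col] FROM [tab] WHERE [col] > [val]"
sql = fill_content(structure, ["name", "singer", "age", "30"])
print(sql)
```

Separating the two steps means the structure can be validated before any schema-specific content is chosen, which is the intuition behind making each sub-task simpler.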
Authors: Thanakrit Julavanich; Akiko Aizawa
Affiliations: Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Japan; Aizawa Laboratory, National Institute of Informatics, Japan and Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Japan
ISBN: 9781450397629 (print)
One of the challenges in NLP tasks, such as text-to-SQL semantic parsing, is generalization. In the text-to-SQL task, keeping training and testing data separate measures one aspect of generalization: how well the model generalizes to unseen databases. Other aspects, however, remain unaccounted for. We propose a new dataset and a more challenging and thorough evaluation process that focuses on two challenges in generalizing text-to-SQL models: database content references and question patterns. To assess generalizability, we create SPIDER-QG, an augmented dataset built with three techniques. First, we replace the set of values in the existing test set with other values from the same column in the same database. Second, we instead use a synonym of each value as the replacement. Third, we generate new questions for the existing SQL query by back-translating the original question. Our evaluation setup demonstrates the generalization challenges that current models struggle with.
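The first augmentation technique (swapping a value for another one from the same column) can be sketched as follows. This is a minimal illustration under stated assumptions: the `swap_value` helper, the example question, and the column contents are hypothetical, and plain string replacement stands in for whatever alignment the authors actually use.

```python
import random

def swap_value(question, sql, old_value, column_values, rng):
    """Replace old_value in both the question and the SQL query with a
    different value drawn from the same column.

    Uses naive string replacement, so every occurrence of old_value is
    changed; a real pipeline would need proper value alignment."""
    candidates = [v for v in column_values if v != old_value]
    if not candidates:
        # Nothing to swap in; return the pair unchanged.
        return question, sql
    new_value = rng.choice(candidates)
    return question.replace(old_value, new_value), sql.replace(old_value, new_value)

# Hypothetical example pair and column contents.
question = "How many singers are from France?"
sql = "SELECT COUNT(*) FROM singer WHERE country = 'France'"
countries = ["France", "Japan", "Brazil"]

new_question, new_sql = swap_value(question, sql, "France", countries, random.Random(0))
print(new_question)
print(new_sql)
```

Because the question and the query are edited with the same replacement, the augmented pair stays consistent while the referenced database content changes.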
ISBN: 9798400704369 (print)
The text-to-SQL problem aims at developing natural language query interfaces for relational database systems by converting text input into executable SQL queries. Recently, the use of large language models (LLMs) has emerged as a new paradigm for the text-to-SQL problem. To this end, the LLM needs to understand not only the user input but also information from the database. In this demo, we present Multi-agent SQL (MageSQL), an LLM-based text-to-SQL approach that tackles the task by orchestrating multiple agents in a pipeline. We will showcase a user-friendly interface that demonstrates the inner workings of our approach and allows users to add and modify agents with different functionalities, customize prompts, and see their impact on specific examples. Through several use cases, we will demonstrate how to (i) construct a text-to-SQL pipeline with multiple agents; (ii) generate prompts for the LLM with various templates and strategies; and (iii) monitor the results of natural language queries and perform debugging.
Stakeholders of software development projects have various information needs for making rational decisions during their daily work. Satisfying these needs requires substantial knowledge of where and how the relevant information is stored, and consumes valuable time that is often not available. Easing the need for this knowledge is an ideal text-to-SQL benchmark problem, a field where public datasets are scarce and needed. We propose the SEOSSQueries dataset, consisting of natural language utterances and accompanying SQL queries extracted from previous studies, software projects, and issue tracking tools, and collected through expert surveys, to cover a large variety of information need perspectives. Our dataset consists of 1,162 English utterances translating into 166 SQL queries; each query has four precise utterances and three more general ones (166 × 7 = 1,162). Furthermore, the dataset contains 393,086 labeled utterances extracted from issue tracker comments. We provide pre-trained SQLNet and RAT-SQL baseline models for benchmark comparisons and a replication package facilitating a seamless application, and we discuss various other tasks that may be solved and evaluated using the dataset. The whole dataset with paraphrased natural language utterances and SQL queries is hosted at ***/s/75ed49ef01ac2f83b3e2. (C) 2022 The Authors. Published by Elsevier Inc.