ISBN: 9781450393850 (print)
Generalizability to new databases is vital for text-to-SQL systems, which aim to parse human utterances into SQL statements. Existing works pursue this goal by using exact matching to identify lexical matches between question words and schema items. However, these methods fail in more challenging scenarios, such as synonym substitution, in which the surface forms of corresponding question words and schema items differ. In this paper, we propose a framework named ISESL-SQL that iteratively builds a semantically enhanced schema-linking graph between question tokens and database schemas. First, we extract a schema-linking graph from pre-trained language models (PLMs) through an unsupervised probing procedure. Then the schema-linking graph is further optimized during training through a deep graph learning method. Meanwhile, we also design an auxiliary task called graph regularization to improve the schema information mentioned in the schema-linking graph. Extensive experiments on three benchmarks demonstrate that ISESL-SQL consistently outperforms the baselines, and further investigations show its generalizability and robustness.
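To make the synonym-substitution failure concrete, here is a minimal sketch of exact-match schema linking. It is not the ISESL-SQL method; the function name `exact_match_links` and the toy schema are assumptions introduced only for illustration.

```python
import re

def exact_match_links(question, schema_items):
    """Link schema items to a question by exact lexical match only."""
    tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    links = {}
    for item in schema_items:
        # A schema item is linked only if every one of its words occurs in the question.
        words = item.lower().replace("_", " ").split()
        if words and all(w in tokens for w in words):
            links[item] = words
    return links

schema = ["singer", "age", "country"]  # hypothetical schema items

# Surface forms match, so links are found.
print(exact_match_links("What is the age of each singer?", schema))
# Synonyms ("old", "vocalist") share no surface form with the schema, so nothing is linked.
print(exact_match_links("How old is each vocalist?", schema))
```

The second question asks for the same information as the first, yet exact matching returns no links, which is the gap a semantically enhanced linking graph is meant to close.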
ISBN: 9781728186719 (digital)
ISBN: 9781728186719 (print)
Text-to-SQL aims to map natural language questions to SQL queries. The sketch-based SQLova model combined with an execution-guided (EG) decoding strategy has achieved super-human performance on the WikiSQL dataset. However, through fine-grained error analysis, we find that SQLova handles aggregation operator selection for numeric columns poorly, because it lacks the column type information needed to distinguish textual from numeric columns. In addition, most predicted value spans in the WHERE clause have the same meaning as the ground truth but do not match it exactly, leading to unnecessary errors. We therefore propose the RuleSQLova model, which enhances the SQLova base model with logic rules to address these two major weaknesses. It first incorporates four logic rules into the model to constrain aggregation operator prediction for numeric columns, using the general framework of iterative rule knowledge distillation. It then applies another logic rule as post-processing before EG decoding to ensure consistency between predicted value spans and the values in the table column. Experimental results indicate that RuleSQLova offers significant and consistent improvements over SQLova, outperforming competitive sketch-based models on the WikiSQL dataset, and our method also improves sketch-based models on the Spider dataset.
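The two ideas above (type-aware aggregation constraints and value-span post-processing) can be sketched as follows. This is an illustration under stated assumptions, not the paper's rules: the abstract does not list its four logic rules, so the rule "SUM and AVG apply only to numeric columns" is a stand-in, and `allowed_aggs` and `snap_value` are hypothetical helper names. The value matching uses `difflib` from the Python standard library.

```python
import difflib

# Assumption: an illustrative type rule, not one of the paper's four logic rules.
NUMERIC_ONLY_AGGS = {"SUM", "AVG"}

def allowed_aggs(column_type):
    """Return the aggregation operators permitted for a column type."""
    aggs = ["NONE", "COUNT", "MAX", "MIN", "SUM", "AVG"]
    if column_type == "text":
        return [a for a in aggs if a not in NUMERIC_ONLY_AGGS]
    return aggs

def snap_value(predicted_span, column_values, cutoff=0.6):
    """Replace a predicted WHERE value with the closest cell in the column, if one is close enough."""
    lowered = {str(v).lower(): v for v in column_values}
    matches = difflib.get_close_matches(
        predicted_span.lower(), list(lowered), n=1, cutoff=cutoff
    )
    # Fall back to the raw prediction when no cell is similar enough.
    return lowered[matches[0]] if matches else predicted_span

print(allowed_aggs("text"))
print(snap_value("new york", ["New York City", "Boston"]))
```

Snapping the span to an actual cell value before execution-guided decoding means a near-miss such as "new york" can still produce a query that returns rows.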
ISBN: 9783030923099; 9783030923105 (print)
In text-to-SQL tasks, the model needs to learn context-based representations of the schema along with the natural language utterances. We present a simple and effective method for text-to-SQL tasks, Column-Mask-Augmented Training (CMAT), to compensate for insufficient training data. To exploit the synthesized data, we propose a clause prediction (CP) objective for multi-task learning, which forces the model to capture contextual features of the schema items. In addition, we add fuzzy match and subword match to the schema linking strategy in RAT-SQL. As a result, our method significantly increases the recall and F1 of schema linking and achieves results competitive with RAT-SQL and GraPPa on Spider.
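One way to picture the extra match types added to schema linking is the sketch below. The abstract does not define "subword match" or "fuzzy match" precisely, so the definitions here are assumptions: subword match is approximated as substring containment, and fuzzy match as a `difflib.SequenceMatcher` similarity above a hypothetical threshold. The function name `match_type` is also hypothetical.

```python
from difflib import SequenceMatcher

def match_type(question_word, column_word, fuzzy_threshold=0.8):
    """Classify the link between a question word and a column word."""
    q, c = question_word.lower(), column_word.lower()
    if q == c:
        return "exact"
    # Assumed definition of subword match: one word contains the other.
    if min(len(q), len(c)) >= 3 and (q in c or c in q):
        return "subword"
    # Assumed definition of fuzzy match: high character-level similarity.
    if SequenceMatcher(None, q, c).ratio() >= fuzzy_threshold:
        return "fuzzy"
    return None

for pair in [("Name", "name"), ("birth", "birthdate"), ("adress", "address"), ("singer", "country")]:
    print(pair, match_type(*pair))
```

Compared with exact matching alone, the two extra categories recover links that a misspelling or a compound column name would otherwise hide, which is consistent with the reported gain in schema-linking recall.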
Existing text-to-SQL research assumes that a gold table is available when generating SQL queries. With information from the gold table, complex and difficult queries can be generated effectively. However, in real-world scenarios, determining which of the numerous tables in a database should be referenced is challenging. Existing models therefore fall short of the core objective of practicality in text-to-SQL research. In response, we propose a practical framework that can effectively convert user questions into queries even when reference tables are not provided. By adding a table-finding phase, it can generate queries using only information from the question, mitigating the limitations that arise when the reference is restricted to a single table. We demonstrate that our methods are suitable for practical use in text-to-SQL systems by achieving performance comparable to that of existing models while using simple structures.
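A table-finding phase can be pictured with the following minimal sketch, which ranks candidate tables by how many question tokens appear in their table and column names. The abstract does not describe the retrieval method, so token overlap is a stand-in technique, and `rank_tables` and the toy database are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_tables(question, tables):
    """Rank tables ({table_name: [column names]}) by token overlap with the question."""
    q = tokenize(question)
    scores = {}
    for name, columns in tables.items():
        vocab = tokenize(name)
        for col in columns:
            vocab |= tokenize(col)
        scores[name] = len(q & vocab)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))

tables = {  # hypothetical database
    "employees": ["name", "salary", "department_id"],
    "orders": ["order_id", "customer", "total"],
}
print(rank_tables("What is the salary of each employee by department?", tables))
```

The top-ranked tables would then be passed to the query generator in place of a gold table.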
This paper studies multi-turn text-to-SQL generation, a new but important task in semantic parsing. To deal with its two challenges, multi-turn interaction and cross-domain evaluation, this paper proposes a multiple-integration encoder, which derives the vector representations of user utterances and database schemas using three custom-designed modules for information integration. First, an utterance representation enhancing module integrates information from history utterances into the representation of each token in the current utterance by attentive selection. Second, a schema discrepancy enhancing module integrates the previously predicted SQL query into the representation of schema items. Third, a latent schema linking module integrates schema information into utterance representations to better handle unseen database schemas. All three modules are implemented with a lightweight multi-head attention mechanism, which has fewer parameters than conventional multi-head attention. Experimental results on the SParC dataset show that our method achieves better accuracy on multi-turn text-to-SQL generation than the most advanced benchmarks. Further ablation studies and analysis also demonstrate the effectiveness of the three modules designed for information integration in the encoder.
Language models have shown promising performance on the task of translating natural language questions into SQL queries (text-to-SQL). However, most state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language models (LLMs), such as ChatGPT and GPT-4, which have the limitations of unclear model architectures, data privacy risks, and expensive inference overheads. To address these limitations, we introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B, specifically designed for the text-to-SQL task. CodeS is a fully open-source language model that achieves superior accuracy with much smaller parameter sizes. This paper studies the research challenges in building CodeS. To enhance the SQL generation abilities of CodeS, we adopt an incremental pre-training approach using a specifically curated SQL-centric corpus. Building on this, we address the challenges of schema linking and rapid domain adaptation through strategic prompt construction and a bi-directional data augmentation technique. We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark, the newly released BIRD benchmark, robustness-diagnostic benchmarks such as Spider-DK, Spider-Syn, Spider-Realistic, and ***, as well as two real-world datasets created for financial and academic applications. The experimental results show that CodeS achieves new SOTA accuracy and robustness on nearly all challenging text-to-SQL benchmarks.
The text-to-SQL task has significant application prospects in automating relational database query interfaces. It can reduce user learning costs and improve data query efficiency. However, text-to-SQL tasks often suffer from semantic gaps and insufficient information, because the columns or condition values required by the SQL statement are not explicitly mentioned in the natural language query. In this paper, a deep learning approach based on sketch filling is proposed to address the issues of insufficient information and semantic gaps in natural language queries. To tackle insufficient information, the model preprocesses the natural language queries, marks the named entities associated with the database table schema and content, and augments the data by randomly swapping entities. This augmentation strengthens the training of common natural language query templates, improving the model's accuracy on typical questions. To address semantic gaps, the model introduces the table content missing from the natural language queries during semantic encoding. An attention mechanism is used to enhance the representation of table content, enabling the text-to-SQL model to better understand queries and improve performance. The proposed model achieves better results on two benchmarks. Ablation experiments show that both the data augmentation and the table content enhancement schemes improve the model's performance.
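The entity-swapping augmentation described above can be sketched as follows, assuming the marked entity is a cell value that appears verbatim in both the question and its SQL. The function name `swap_entity`, its signature, and the example data are hypothetical and are not taken from the paper.

```python
import random

def swap_entity(question, sql, entity, column_values, rng):
    """Create a new (question, SQL) pair by swapping a marked entity for another value from the same column."""
    candidates = [v for v in column_values if v != entity]
    if not candidates or entity not in question:
        # Nothing to swap with, or the entity was not marked in the question.
        return question, sql
    new_value = rng.choice(candidates)
    # The same template is kept; only the entity changes on both sides.
    return question.replace(entity, new_value), sql.replace(entity, new_value)

rng = random.Random(0)  # seeded for reproducibility
question = "Which films were directed by Nolan?"
sql = "SELECT title FROM films WHERE director = 'Nolan'"
print(swap_entity(question, sql, "Nolan", ["Nolan", "Scott"], rng))
```

Because the question template stays fixed while the entity varies, each swap gives the model another training example of the same query pattern.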
Text-to-SQL plays an important role in interactive data analysis, providing a friendly interface for converting natural language into a relational database language (i.e., SQL). Semantic parsing is essential for translating a user's query into an executable SQL statement. Existing efforts provide some feasible solutions, and state-of-the-art models mainly adopt the sketch-based paradigm, in which template values are to be filled. To this end, most methods extract values based on column representations. However, if the query contains multiple values that belong to different columns, these methods may fail to extract the values accurately. Moreover, it can be difficult to infer the right values when the query does not explicitly mention the corresponding column names. To bridge the gap, we propose a novel neural architecture, ER-SQL, for learning enhanced representations for text-to-SQL. Based on the pre-trained model BERT, ER-SQL uses column contents to better extract features of columns. Moreover, ER-SQL harnesses the column representations to latently reformulate the query. To verify the effectiveness of ER-SQL, we conduct comprehensive experiments, which demonstrate that ER-SQL achieves better results than existing models on the benchmark dataset WikiSQL, as well as on a representative Chinese dataset, TableQA. (C) 2021 The Author(s). Published by Elsevier B.V.
A common problem with adopting text-to-SQL translation in database systems is poor generalization. Specifically, when there is limited training data on new datasets, existing few-shot text-to-SQL techniques, even with carefully designed textual prompts on pre-trained language models (PLMs), tend to be ineffective. In this paper, we present a divide-and-conquer framework to better support few-shot text-to-SQL translation, which divides text-to-SQL translation into two stages (or sub-tasks), such that each sub-task is simpler to tackle. The first stage, called the structure stage, steers a PLM to generate an SQL structure (including SQL commands such as SELECT, FROM, WHERE and SQL operators such as "<", ">") with placeholders for missing identifiers. The second stage, called the content stage, guides a PLM to populate the placeholders in the generated SQL structure with concrete values (including SQL identifiers such as table names, column names, and constant values). We propose a hybrid prompt strategy that combines learnable vectors and fixed vectors (i.e., word embeddings of textual prompts), such that the hybrid prompt can learn contextual information to better guide PLMs for prediction in both stages. In addition, we design keyword-constrained decoding to ensure the validity of generated SQL structures, and structure-guided decoding to guarantee that the model fills in correct content. Extensive experiments, comparing with ten state-of-the-art text-to-SQL solutions at the time of writing, show that SC-Prompt significantly outperforms them in the few-shot scenario. In particular, on the widely adopted Spider dataset, given fewer than 500 labeled training examples (5% of the official training set), SC-Prompt outperforms the previous SOTA methods by around 5% in accuracy.
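The structure-then-content split can be illustrated with a small sketch in which the structure stage output is a token sequence of SQL keywords and placeholders, and the content stage output is a list of values that fill the placeholders in order. The placeholder tokens `[col]`, `[tab]`, `[val]`, the keyword set, and both function names are assumptions for illustration; the abstract does not specify the actual formats, and no language model is involved here.

```python
import re

SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "<", ">", "="}  # illustrative subset
PLACEHOLDER = re.compile(r"\[(?:col|tab|val)\]")  # assumed placeholder format

def is_valid_structure(structure):
    """Check that a structure contains only SQL keywords, operators, and placeholders."""
    return all(
        tok in SQL_KEYWORDS or PLACEHOLDER.fullmatch(tok)
        for tok in structure.split()
    )

def fill_structure(structure, contents):
    """Fill placeholders left to right with the content-stage predictions."""
    slots = PLACEHOLDER.findall(structure)
    if len(slots) != len(contents):
        raise ValueError("number of contents must match number of placeholders")
    values = iter(contents)
    return PLACEHOLDER.sub(lambda _match: next(values), structure)

structure = "SELECT [col] FROM [tab] WHERE [col] > [val]"
print(is_valid_structure(structure))
print(fill_structure(structure, ["name", "singer", "age", "30"]))
```

The validity check plays the role that keyword-constrained decoding plays in the abstract: a structure containing anything other than keywords and placeholders is rejected before any content is filled in.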
ISBN: 9781450383431 (print)
Text-to-SQL systems allow users to explore relational databases by posing free-form queries, removing the need to use structured languages such as SQL. Although numerous systems have been developed so far, existing system evaluations lack rigour. In this work, we build a text-to-SQL benchmark that covers different classes of queries, and we evaluate the effectiveness of several systems in the field. To evaluate system efficiency, we measure execution time and resource consumption for the different query classes. Our comprehensive evaluation aims to fill a large gap in understanding the capabilities and boundaries of existing systems, and it reveals several open challenges.
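Measuring execution time and resource consumption for a single call could look like the sketch below, which uses `time.perf_counter` and `tracemalloc` from the Python standard library. The benchmark's actual harness is not described in the abstract; `measure` and `toy_translate` are hypothetical names, and peak Python heap usage is only one possible proxy for resource consumption.

```python
import time
import tracemalloc

def measure(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds, peak traced memory in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

def toy_translate(question):
    # Placeholder for a real text-to-SQL system under test.
    return "SELECT COUNT(*) FROM singer"

result, seconds, peak_bytes = measure(toy_translate, "How many singers are there?")
print(result, f"{seconds:.6f}s", f"{peak_bytes} bytes")
```

Running this once per question and grouping the measurements by query class would give per-class efficiency figures of the kind the abstract mentions.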