This survey focuses on text-to-SQL, the automated translation of natural language queries into SQL queries. We first describe the problem and its main challenges. Then, following the PRISMA systematic review methodology, we survey the existing text-to-SQL review papers in the literature. We apply the same method to extract proposed text-to-SQL models and classify them with respect to the evaluation metrics and benchmarks used. We highlight the accuracies achieved by various models on text-to-SQL datasets and discuss execution-guided evaluation strategies. We present insights into model training times and implementations of different models. We also explore the availability of text-to-SQL datasets in non-English languages. Additionally, we focus on large language model (LLM) based approaches for the text-to-SQL task, where we examine LLM-based studies in the literature and subsequently evaluate the LLMs on the cross-domain Spider dataset. Finally, we conclude with a discussion of future directions for text-to-SQL research, identifying potential areas of improvement and advancement in this field.
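Execution-based evaluation, as mentioned in the abstract above, can be illustrated with a minimal sketch. It assumes a predicted query counts as correct when it returns the same rows as the gold query on the same database; the table, the query pairs, and the helper names `run_query` and `execution_accuracy` are hypothetical and are not taken from the survey or from any benchmark's official scorer.

```python
import sqlite3


def run_query(conn, sql):
    """Return result rows in a canonical order, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose result sets match."""
    hits = 0
    for predicted, gold in pairs:
        pred_rows = run_query(conn, predicted)
        gold_rows = run_query(conn, gold)
        if pred_rows is not None and pred_rows == gold_rows:
            hits += 1
    return hits / len(pairs) if pairs else 0.0


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Ben', 45), (3, 'Cy', 27);
""")

pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(id) FROM singer"),
    # Misspelled column: the query fails and is counted as wrong.
    ("SELECT nme FROM singer", "SELECT name FROM singer"),
]
score = execution_accuracy(conn, pairs)
print(f"execution accuracy: {score:.3f}")
```

Comparing result sets rather than query strings credits semantically equivalent queries, at the cost of occasionally accepting a wrong query that happens to return the same rows on the test database.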
Text-to-SQL transforms natural language text into SQL queries, a task complicated by SQL's complex syntax and the need for specialized knowledge. Retrieval from a repository of SQL generation cases assists large language models (LLMs) by providing relevant examples. However, complex queries often involve complicated SQL syntax, which can confuse LLMs because they are designed for natural language rather than SQL. In this paper, we propose a multi-pattern retrieval-augmented framework for SQL generation, which dynamically selects relevant examples based on the query and reasoning patterns. To retrieve similar query patterns, we construct question skeletons in the Poincaré model, which better distinguishes entities aligned with a question's needs. To provide tailored examples of reasoning patterns for each logical step, especially for complex problems, we design meta-instruction-based retrieval repositories for multi-category chain-of-thought fragments. To mitigate biases from initial retrievals, we implement a revision strategy that leverages LLMs to interact with databases, enabling LLMs to self-correct errors during SQL generation. Experiments on four benchmarks show that our method outperforms strong baseline models, increasing execution accuracy by 13% to 23.1% with the same LLM. Ablation studies provide insights into the framework's performance sensitivity to different components and strategies. Further analysis reveals a correlation between the framework's effectiveness and the LLMs' quality and stability, and shows that the most significant performance gains come from modifications made in the initial iterations.
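The idea of retrieving examples by question skeleton can be sketched roughly as follows. This is not the paper's method: it swaps the Poincaré-model representation for plain token-overlap (Jaccard) similarity, and the entity list, the example repository, and the helper names `skeleton`, `jaccard`, and `top_k` are all hypothetical.

```python
def skeleton(question, entity_terms):
    """Mask entity mentions so that only the question's pattern remains."""
    tokens = question.lower().rstrip("?").split()
    return ["_" if tok in entity_terms else tok for tok in tokens]


def jaccard(a, b):
    """Token-set overlap between two skeletons (stand-in similarity)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_k(question, repository, entity_terms, k=1):
    """Return the k stored examples whose skeletons best match the question."""
    query = skeleton(question, entity_terms)
    ranked = sorted(
        repository,
        key=lambda ex: jaccard(query, skeleton(ex["question"], entity_terms)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical entity terms and example repository, for illustration only.
entity_terms = {"singers", "concerts", "stadiums", "30", "2014"}
repository = [
    {"question": "How many singers are older than 30?",
     "sql": "SELECT COUNT(*) FROM singer WHERE age > 30"},
    {"question": "List the names of all stadiums",
     "sql": "SELECT name FROM stadium"},
]

best = top_k("How many concerts are later than 2014?", repository, entity_terms)[0]
print(best["sql"])
```

Masking entities first means the counting question about concerts retrieves the counting example about singers, even though the two questions share no table or value names.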
Intelligent question answering over industrial databases is a challenging task due to the multicolumn context and complex questions. Existing methods leave room for improvement in SQL generation accuracy. In this paper, we propose a question-aware few-shot text-to-SQL approach based on the SDCUP pretrained model. Specifically, an attention-based filtering approach is proposed to reduce the redundant information from multiple columns in the industrial database scenario. We further propose an operator semantics enhancement method to improve the ability to identify complex conditions in queries. Experimental results on industrial benchmarks in the fields of electric energy and structural inspection show that the proposed model outperforms the baseline models across all few-shot settings.
Software analytics integrated with complex databases can deliver project intelligence into the hands of software engineering (SE) experts for satisfying their information needs. A new and promising machine learning technique known as text-to-SQL automatically extracts information for users of complex databases without the need to fully understand the database structure or the accompanying query language. Users pose their request as a so-called natural language utterance, i.e., a question. Our goal was to evaluate the performance and applicability of text-to-SQL approaches on data derived from tools typically used in the workflow of software engineers for satisfying their information needs. We carefully selected and discussed five seminal as well as state-of-the-art text-to-SQL approaches and conducted a comparative assessment using the large-scale, cross-domain Spider dataset and the SE domain-specific SEOSS-Queries dataset. Furthermore, we study via a survey how SE professionals perform in satisfying their information needs and how they perceive text-to-SQL approaches. For the best performing approach, we observe a high accuracy of 94% in query prediction when training specifically on SE data. This accuracy is almost independent of the query's complexity. At the same time, we observe that SE professionals have substantial deficits in satisfying their information needs directly via SQL queries. Furthermore, SE professionals are open to using text-to-SQL approaches in their daily work, considering them helpful and less time-consuming. We conclude that state-of-the-art text-to-SQL approaches are applicable in SE practice for day-to-day information needs.
Existing text-to-SQL semantic parsers are typically designed for particular settings, such as handling queries that span multiple tables, domains, or turns, which makes them ineffective when applied to different settings. We present UniSAr (Unified Structure-Aware Autoregressive Language Model), which benefits from directly using an off-the-shelf language model architecture and demonstrates consistently high performance under different settings. Specifically, UniSAr extends existing autoregressive language models with two non-invasive extensions that make them structure-aware: (1) adding structure marks to encode the database schema, the conversation context, and their relationships; (2) constrained decoding to decode well-structured SQL for a given database schema. On seven well-known text-to-SQL datasets covering multi-domain, multi-table, and multi-turn settings, UniSAr demonstrates performance comparable to or better than that of the most advanced specifically designed text-to-SQL models.
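The general idea behind schema-constrained decoding can be sketched in a few lines. This is a toy illustration under stated assumptions, not UniSAr's implementation: the schema, the candidate scores, and the helper names `allowed_columns` and `constrained_pick` are hypothetical, and a real decoder would apply the same masking to token logits at every generation step.

```python
# Hypothetical schema: table name -> column names.
schema = {
    "singer": ["name", "age"],
    "concert": ["venue", "year"],
}


def allowed_columns(table):
    """Columns that are valid to emit once the table is known."""
    return set(schema[table])


def constrained_pick(scores, allowed):
    """Pick the highest-scoring candidate among the allowed ones."""
    valid = {tok: s for tok, s in scores.items() if tok in allowed}
    if not valid:
        raise ValueError("no candidate satisfies the schema constraint")
    return max(valid, key=valid.get)


# Assumed model scores for the next column token in "SELECT ? FROM singer".
scores = {"venue": 0.6, "name": 0.3, "age": 0.1}

unconstrained = max(scores, key=scores.get)
constrained = constrained_pick(scores, allowed_columns("singer"))
print(unconstrained, constrained)
```

Without the constraint, the top-scoring candidate is a column of the wrong table; masking to the columns of `singer` forces a choice that is valid for the given schema.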
To bridge the gap between users and data, numerous text-to-SQL systems have been developed that allow users to pose natural language questions over relational databases. Recently, novel text-to-SQL systems have been adopting deep learning methods with very promising results. At the same time, several challenges remain open, making this area an active and flourishing field of research and development. To make real progress in building text-to-SQL systems, we need to demystify what has been done, understand how and when each approach can be used, and, finally, identify the research challenges ahead of us. The purpose of this survey is to present a detailed taxonomy of neural text-to-SQL systems that will enable a deeper study of all the parts of such a system. This taxonomy will allow us to make a better comparison between different approaches, as well as highlight specific challenges in each step of the process, thus enabling researchers to better strategise their quest towards the "holy grail" of database accessibility.
Recently, the text-to-SQL task has received much attention. Many sophisticated neural models have been invented that achieve significant results. Most current work assumes that all inputs are legal and that the model should generate an SQL query for any input. However, in real scenarios, users are allowed to enter arbitrary text that may not be answerable by an SQL query. In this article, we focus on the issue of answerability classification for text-to-SQL systems, which aims to determine the answerability of a question according to the given database schema. Existing methods concatenate the question and the database schema into a sentence, then fine-tune the pre-trained language model on the answerability classification task. In this way, the database schema is treated as sequential text, which may ignore the intrinsic structural relationships of the schema data, and the attention that represents the correlation between the question tokens and the database schema items is not well designed. To this end, we propose a relational Question-Schema graph framework that can effectively model the attention and relations between question and schema. In addition, a conditional layer normalization mechanism is employed to modulate the pre-trained language model to generate a better question representation. Experiments demonstrate that the proposed framework outperforms all existing models by large margins, achieving a new state of the art on the TriageSQL benchmark. Specifically, the model attains 88.41%, 78.24%, and 75.98% in Precision, Recall, and F1, respectively. Additionally, it outperforms the baseline by approximately 4.05% in Precision, 6.96% in Recall, and 6.01% in F1.
ISBN: (print) 9789819794331; 9789819794348
With the widespread usage of large language models (LLMs), LLM-based methods have become the mainstream approach for text-to-SQL tasks, achieving leading performance on text-to-SQL leaderboards. However, generating complex SQL queries correctly has always been a main challenge. Current LLM-based models primarily apply prompting-based methods to large-scale closed-source LLMs (e.g., GPT-4 and ChatGPT), which may raise concerns about usage costs and data privacy. For fine-tuning based methods, it is difficult to generate complex SQL accurately in only one fine-tuning step. To address this, we propose TSP-SQL, a Two-Stage Progressive learning method for text-to-SQL. TSP-SQL decomposes the text-to-SQL task into two stages: an auxiliary task of SQL element generation and a main task of SQL query generation. The two tasks are progressively fine-tuned on a single model, effectively reducing the difficulty of SQL generation and improving accuracy. TSP-SQL achieves state-of-the-art performance among open-source fine-tuning based methods on the Spider dev set, and surpasses most of the methods based on large-scale closed-source LLMs.
Text-to-SQL aims at translating textual questions into the corresponding SQL *** tables are widely created for high-frequent *** text-to-SQL has emerged as an important task, recent studies paid little attention to the task over aggregate *** increased aggregate tables bring two challenges: (1) mapping of natural language questions and relational databases will suffer from more ambiguity, (2) modern models usually adopt self-attention mechanism to encode database schema and *** mechanism is of quadratic time complexity, which will make inferring more time-consuming as input sequence length *** this paper, we introduce a novel approach named WAGG for text-to-SQL over aggregate *** effectively select among ambiguous items, we propose a relation selection mechanism for relation *** deal with high computation costs, we introduce a dynamical pruning strategy to discard unrelated items that are common for aggregate *** also construct a new large-scale dataset SpiderwAGG extended from Spider dataset for validation, where extensive experiments show the effectiveness and efficiency of our proposed method with a 4% increase in accuracy and a 15% decrease in inference time w.r.t. a strong baseline, RAT-SQL.
ISBN: (print) 9783031758713; 9783031758720
This paper investigates how model size affects the ability of a Generative AI Language Model, or GLM for short, to support the text-to-SQL task for databases with large schemas typical of real-world applications. The paper first introduces a text-to-SQL framework that combines a prompt strategy and a Retrieval-Augmented Generation (RAG) technique, leaving the GLM and the database as points of flexibility. Then, it describes a benchmark based on an open-source database featuring a schema much larger than the schemas of most of the databases in familiar text-to-SQL benchmarks. The paper proceeds with experiments to assess the performance of the text-to-SQL framework instantiated with the benchmark database and GLMs of different sizes. The paper concludes with recommendations to help select which GLM size is appropriate for a text-to-SQL scenario, characterized by the difficulty of the expected NL questions and the data privacy requirements, among other characteristics.