A Case Study in Finetuning Open Source LLMs: Training LLaMA 2 for the Text-to-SQL Task
Introduction
Below is a write-up of a consulting project I did for a client in late 2023 (with all client names removed). In the report, I detail the models, datasets, and approaches needed to create a state-of-the-art text-to-SQL model that outperforms GPT-4. If you don’t have time to read the full report, the three main takeaways are the following:
The dataset is everything. By far my highest-ROI activity came from inspecting, correcting, and augmenting my dataset. This included: spot-checking and fixing errors in the table schemas in the training data, identifying additional non-SQL datasets to include in training that helped performance, implementing curriculum learning to improve convergence, and more.
Don’t just validate next-token prediction; test real task performance. You want your validation set to be as close to the real thing as possible. To that end, I made two modifications:
I biased the validation set towards harder SQL queries rather than having the same distribution as the training set
I created real SQL tables using GPT-4 to match the schemas in the data. I then executed the SQL statements produced by the model against these tables and compared the results to those of executing the ground-truth SQL. This gave me a metric that tracked the model’s performance in a real-world setting, rather than just how accurately it can guess the next token
Experimentation is key. All of these insights took many training runs to arrive at (likely about 50 in total). Thus, make sure to use a parameter-efficient finetuning method such as QLoRA so you don’t bankrupt yourself on all of the experimental training runs. Also take meticulous notes - sometimes the insights coalesce over the course of many trials. For tracking purposes, I used Weights & Biases and kept all of my notes in a running Google Doc.
These three insights together allowed me to develop a model based on Code LLaMA that outperformed GPT-4 by 30 percentage points and cost 99% less to run at inference.
Now, if you’d like all the details, the full consulting report is below.
Overview
Currently, usage of many database & data warehouse management tools requires SQL knowledge in order to write queries. As a result, the number of seats per software license for these tools is constrained by the number of employees within a given organization who have a strong grasp of SQL, which caps the potential revenue per client. The recent emergence of large language models (LLMs) can help alleviate this problem, as they demonstrate near human-level ability to translate from natural language to code. However, usage of common LLMs such as ChatGPT presents several challenges, including:
Cost: OpenAI charges per 1,000 tokens of input and output (where a token roughly corresponds to three-quarters of a word). As a result, costs can skyrocket as the user base increases. For example, 100,000 users making on average 10 document summarization requests per day will cost about $5,800 per day in API fees alone. This can significantly reduce profit margins on a product. In addition, sudden unforeseen spikes in API usage can result in large losses without appropriate API throttling checks in place.
Security: When using the OpenAI API endpoint, requests to ChatGPT are sent to an external endpoint. This increases the probability of data leakage and the associated negative downstream effects for a company, such as fines, legal fees, and loss of goodwill. Even if using an OpenAI endpoint deployed within a customer’s Azure environment, there will still be security concerns over using an external LLM on valuable, confidential data.
Output Control: When using OpenAI’s API endpoints, you are at the mercy of any changes they decide to make to their models. GPT-3.5 and GPT-4 are constantly updated with new training data pouring in from the millions of user chats. While this is intended to improve the service, performance can end up degrading for your specific task & prompt combination (see Figure 1 below), resulting in sudden, unexpected drops in quality for your tool.

Open source LLMs, however, have the potential to address each of these key drawbacks of closed source LLMs by providing: (1) smaller models that reduce the cost required to serve the LLM, (2) the ability to deploy the model in any environment, whether in a secure cloud VPC or even on a local machine, and (3) complete control over all model parameters, ensuring consistent output quality. In order to leverage these advantages, we have developed a state-of-the-art text-to-SQL model based on Meta’s Code LLaMA. Our new model, dubbed LLaMA2-SQL, significantly outperforms GPT-4 on the text-to-SQL task, while being lightweight enough to run on a CPU.
Methodology
Base Model Choice
We began the project with three main desired characteristics for our base model:
Strong general reasoning capabilities
Commercially-permissive licensing
Large open source community ecosystem
Of the models released during the main phase of the project (March 2023 to December 2023), Meta’s LLaMA 2 and Code LLaMA models were the highest-performing LLMs to satisfy these three criteria. LLaMA 2 is a general-purpose LLM pretrained on 2 trillion tokens. Code LLaMA extends LLaMA 2 by further pretraining the model on an additional 520 billion tokens of code, significantly improving performance on coding tasks (see Figure 2).

Given Code LLaMA’s strong general reasoning & coding performance, we hypothesized that it would provide a good starting point from which to finetune a SQL-specific model. Specifically, we selected Code LLaMA - Instruct 34B to leverage its instruction-following capability when creating LLaMA2-SQL.
Dataset Curation & Augmentation
Creation of the dataset was the most significant piece of the project and resulted in the largest gains in performance. To begin, we combined several datasets, including WikiSQL, Spider, and Code Instructions Alpaca 120K, to form the LLaMA2-SQL AI dataset. This was not a simple concatenation of data; the blend was chosen deliberately to cover a broad spectrum of SQL queries and schema structures.
We employed several key strategies to refine the dataset and improve the model’s performance. First, we adopted the instruct dataset format, pairing each natural-language instruction (and, where available, table schema) with the desired response; this format helps align the model’s outputs with the way users actually phrase requests. We also mixed general coding problems in alongside the SQL generation tasks. General coding questions tended to be longer and require more reasoning steps than the typical SQL queries found in the WikiSQL and Spider datasets. As a result, this mixture improved the model’s reasoning capability on more complicated user questions, despite lowering the overall proportion of SQL queries in the training set.
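To make the format concrete, here is a minimal sketch of how a raw (question, schema, SQL) triple can be cast into an instruct-style record; the prompt template and field names shown are illustrative rather than the exact ones used in our pipeline.

```python
# Illustrative instruct-format conversion; the template wording and field
# names are assumptions rather than the exact ones used in our pipeline.
PROMPT_TEMPLATE = (
    "### Instruction:\n{question}\n\n"
    "### Input:\n{schema}\n\n"
    "### Response:\n"
)

def to_instruct_example(question: str, schema: str, sql: str) -> dict:
    """Cast a raw text-to-SQL triple into a prompt/completion pair."""
    return {
        "prompt": PROMPT_TEMPLATE.format(question=question, schema=schema),
        "completion": sql,
    }

example = to_instruct_example(
    question="How many singers are older than 30?",
    schema="CREATE TABLE singer (singer_id INT, name TEXT, age INT);",
    sql="SELECT COUNT(*) FROM singer WHERE age > 30;",
)
print(example["prompt"] + example["completion"])
```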
One of the more technical challenges we addressed was a major table schema issue: all columns in the training & validation sets were indiscriminately typed as VARCHAR. By resolving this, we ensured that the model could recognize and handle a variety of data types, thereby increasing the accuracy and reliability of its SQL output. Additionally, we removed examples from the dataset where the response was less than 10 characters, since such short responses rarely contain the complexity needed to teach SQL generation.
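The cleaning steps themselves reduce to simple filters once the issues are spotted. Below is a rough sketch of the two checks described above; the record field names and the regex-based VARCHAR check are assumptions about how the data is laid out, not the exact code we ran.

```python
# Illustrative cleaning filters; field names ("schema", "response") and the
# regex-based VARCHAR check are assumptions about the record layout.
import re

TYPE_PATTERN = re.compile(
    r"\b(VARCHAR|INT|INTEGER|REAL|FLOAT|TEXT|DATE|DECIMAL)\b", re.IGNORECASE
)

def schema_is_all_varchar(schema: str) -> bool:
    """Flag schemas where every declared column type is VARCHAR (the bug we fixed)."""
    types = TYPE_PATTERN.findall(schema)
    return bool(types) and all(t.upper() == "VARCHAR" for t in types)

def response_too_short(example: dict, min_chars: int = 10) -> bool:
    """Responses under 10 characters carry little training signal for SQL generation."""
    return len(example["response"].strip()) < min_chars

dataset = [
    {"schema": "CREATE TABLE t (a VARCHAR, b VARCHAR);", "response": "SELECT a FROM t;"},
    {"schema": "CREATE TABLE t (a INT, b TEXT);", "response": "ok"},
]
flagged = [ex for ex in dataset if schema_is_all_varchar(ex["schema"])]  # sent for correction
dataset = [ex for ex in dataset if not response_too_short(ex)]           # dropped outright
```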
To further refine the dataset, we sorted examples in order of instruction length. This sorting approach (known as curriculum learning) exposes the model to increasingly complex queries as training progresses, which improved convergence. We also used an embedding model to identify and remove near-duplicate data points, ensuring that the dataset was diverse rather than padded with redundant examples.
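As a sketch of these two steps: the curriculum ordering is a simple sort on instruction length, and the deduplication can be done with any off-the-shelf embedding model. The model name and similarity threshold below are illustrative choices, not the exact ones we used.

```python
# Curriculum ordering plus embedding-based near-duplicate removal (sketch).
# The embedding model and the 0.95 cosine threshold are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

def curriculum_sort(examples: list[dict]) -> list[dict]:
    """Order examples from shortest to longest instruction as an easy-to-hard proxy."""
    return sorted(examples, key=lambda ex: len(ex["instruction"]))

def drop_near_duplicates(examples: list[dict], threshold: float = 0.95) -> list[dict]:
    """Keep an example only if its instruction embedding is not nearly identical
    to one already kept (a simple O(n^2) scan, fine at this dataset size)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode([ex["instruction"] for ex in examples],
                           normalize_embeddings=True)
    kept, kept_vecs = [], []
    for ex, vec in zip(examples, vectors):
        if all(float(np.dot(vec, k)) < threshold for k in kept_vecs):
            kept.append(ex)
            kept_vecs.append(vec)
    return kept
```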
We also randomized the SQL schema order within the dataset so that the model could not develop biases or shortcuts based on schema ordering. Lastly, we intentionally biased the validation set to consist mostly of difficult SQL problems. This ensured that the model was rigorously tested against complex, challenging queries, which is essential for real-world applications where query complexity varies greatly.
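One way to implement both steps is sketched below, assuming each schema is a sequence of CREATE TABLE statements; the difficulty heuristic (joins, grouping, subqueries) is an illustrative proxy for "hard", not the exact criterion we applied.

```python
# Schema-order randomization and a hard-query-biased validation split (sketch).
# The is_hard() heuristic and the 80% hard fraction are illustrative assumptions.
import random

def shuffle_schema_order(schema: str) -> str:
    """Randomize the order of CREATE TABLE statements so the model cannot
    learn shortcuts tied to a fixed schema ordering."""
    statements = [s.strip() for s in schema.split(";") if s.strip()]
    random.shuffle(statements)
    return "; ".join(statements) + ";"

def is_hard(sql: str) -> bool:
    """Rough difficulty proxy: joins, grouping, set operations, or nested SELECTs."""
    upper = sql.upper()
    return (any(kw in upper for kw in ("JOIN", "GROUP BY", "HAVING", "UNION"))
            or upper.count("SELECT") > 1)

def split_with_hard_validation(examples, val_size=500, hard_fraction=0.8):
    """Build a validation set dominated by hard queries; train on everything else."""
    random.shuffle(examples)
    hard = [ex for ex in examples if is_hard(ex["sql"])]
    easy = [ex for ex in examples if not is_hard(ex["sql"])]
    n_hard = int(val_size * hard_fraction)
    val = hard[:n_hard] + easy[:val_size - n_hard]
    val_ids = {id(ex) for ex in val}
    train = [ex for ex in examples if id(ex) not in val_ids]
    return train, val
```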
Model Training
Model training, particularly the fine-tuning of large language models, typically demands substantial GPU memory, posing significant challenges in terms of cost and efficiency. Even the smallest LLaMA model, with 7 billion parameters, requires approximately 140 GB of GPU RAM to fully fine-tune with a standard optimizer, while the 70B LLaMA model demands a staggering 1,400 GB. This high resource requirement makes conventional fine-tuning both expensive and time-consuming. To address these challenges, we employed Parameter Efficient Fine-Tuning (PEFT), a family of techniques that significantly reduces the GPU RAM required to fine-tune open source models. PEFT is available as part of the HuggingFace library ecosystem and offers various out-of-the-box implementations.
Among the PEFT methods, we chose Quantized Low-Rank Adaptation (QLoRA). QLoRA builds upon the core ideas of PEFT and combines three critical memory-saving techniques (see Figure 3). First, it quantizes the base model: since the base model parameters are frozen, they do not need to be kept in a high-precision format, which substantially reduces memory usage. Second, it uses a paged optimizer that offloads some of the optimizer state to CPU memory when GPU memory is insufficient, bringing it back to the GPU as needed. Third, it uses a low-rank trick from linear algebra to reduce the number of trainable parameters. The trick works as follows: assume the weight matrix W has dimension (d x d). If we set the dimensions of matrix A to be (d x r) and matrix B to be (r x d), then the product AB has dimension (d x d), the same as our original weight matrix W, but with potentially far fewer parameters. W has d*d = d^2 parameters, whereas A and B together have (d*r) + (r*d) = 2dr parameters. Whenever r is less than d/2, this yields a reduction in trainable parameters, and low values of r make the reduction dramatic. The combination of these three techniques makes QLoRA extremely memory efficient for fine-tuning while still maintaining strong performance.

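To make the parameter arithmetic concrete, here is the count for a single d x d projection matrix, taking d = 8192 as a representative hidden dimension and an illustrative rank of r = 16:

```python
# LoRA parameter count for one d x d weight matrix (d and r are illustrative).
d, r = 8192, 16

full_params = d * d        # training W directly: d^2 parameters
lora_params = 2 * d * r    # training A (d x r) and B (r x d) instead: 2dr parameters

print(f"full W:   {full_params:,}")                  # 67,108,864
print(f"LoRA A+B: {lora_params:,}")                  # 262,144
print(f"ratio:    {lora_params / full_params:.2%}")  # ~0.39% of the original
```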
To facilitate PEFT fine-tuning, we utilized several libraries. Axolotl was instrumental in orchestrating the training process using common PEFT methods. Its user-friendly setup, requiring only a simple YAML file of training parameters, streamlined our workflow. Additionally, Axolotl's support for distributed training was crucial for managing the computational demands of our project.
Another key component in our training setup was Flash Attention. This library implements a fast, memory-efficient exact attention mechanism, allowing training to complete faster with lower memory usage, which translates directly into cost savings.
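Putting the pieces together, the sketch below shows what an equivalent QLoRA + Flash Attention setup looks like when written directly against the HuggingFace transformers, peft, and bitsandbytes APIs (Axolotl expresses the same choices through its YAML config). The LoRA rank, target modules, and exact flags are illustrative assumptions rather than our production settings, and the flash-attention flag requires the flash-attn package and a recent transformers release.

```python
# Sketch of the QLoRA + Flash Attention setup using transformers/peft/bitsandbytes
# directly (Axolotl wraps equivalent settings in a YAML file). Rank, alpha, and
# target modules are illustrative, not our exact training configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "codellama/CodeLlama-34b-Instruct-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # needs flash-attn installed
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The paged optimizer (CPU offload of optimizer state) is selected at the
# trainer level, e.g. optim="paged_adamw_8bit" in TrainingArguments.
```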
Google Colab played a pivotal role in our training process. By renting A100 instances at an affordable rate, we could train models up to 34 billion parameters using a combination of QLoRA and Flash Attention. We particularly recommend Colab Pro+ for its background execution capabilities and priority access to A100 instances.
The combination of QLoRA and Flash Attention proved extremely effective. We fine-tuned the Code LLaMA 34B model on 67,000 training examples over five epochs in under 22 hours, costing approximately $28 in Google compute credits. The result was a fine-tuned model that surpassed GPT-4 on our specific task.
Model Results
Accuracy
LLaMA2-SQL was evaluated against its two main competing models: GPT-3.5 and GPT-4. All three models were evaluated on a dataset consisting of a natural language instruction as input with a desired SQL statement as output. Some examples included the table schema as input, while others excluded the schema to test the model’s ability to infer schema information from the text alone. Each text and SQL example was accompanied by a sample SQL table. An example data point is displayed below in Figure 4 (the contained text is fairly small, so I recommend zooming in on the PDF to see the data more clearly).
To produce our accuracy numbers, we executed the ground truth SQL statements against this sample table and compared the results to the tables produced by executing the SQL output of each of the LLMs tested in our benchmark suite. The accuracy thus represents the percent of evaluation examples where the model produced executable SQL that arrived at the correct answer. By examining the output accuracy rather than simply checking if the model’s SQL statement matches the desired SQL statement, we are able to account for examples where the model uses a different approach to arrive at the correct answer. As a result, our accuracy more closely reflects the accuracy that the model would achieve in the real world.
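A minimal sketch of this execution-based check, using Python’s built-in sqlite3, is shown below; the field names and the unordered comparison of result sets are simplifications of the full benchmark harness.

```python
# Execution-based accuracy check (sketch). Field names ("db", "prediction",
# "gold") are illustrative; result sets are compared as unordered multisets.
import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    """Run a query and return its rows, or None if the SQL fails to execute."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Correct only if the predicted SQL runs and returns the same rows as the
    ground-truth SQL, regardless of how the query was written."""
    pred = execute(db_path, predicted_sql)
    gold = execute(db_path, gold_sql)
    return pred is not None and Counter(pred) == Counter(gold)

def execution_accuracy(examples: list[dict]) -> float:
    """Fraction of examples whose predicted SQL reproduces the gold result."""
    hits = sum(execution_match(ex["db"], ex["prediction"], ex["gold"]) for ex in examples)
    return hits / len(examples)
```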
Using this evaluation setup, we see that LLaMA2-SQL significantly outperforms both GPT-4 and GPT-3.5, beating them by about 30 and 37 percentage points on the evaluation set, respectively (see Figure 5 below). It achieves an overall accuracy of 68.82%, which includes examples where no schema is provided at all. When table schemas are provided along with every user question, LLaMA2-SQL’s accuracy exceeds 85%. As a result, we are able to provide a better user experience for the text-to-SQL task than is offered by closed source models, despite their larger model and training data sizes.
Cost & Compute Efficiency
In addition to beating GPT-3.5 and GPT-4 handily in accuracy, LLaMA2-SQL has 1/10th as many parameters as GPT-3.5 and 1/100th as many as GPT-4. Moreover, we have optimized LLaMA2-SQL to further reduce the parameter size and enable it to run on CPUs. This translates to massive reductions in the compute needed to serve LLaMA2-SQL to customers, resulting in large cost savings. The LLaMA2-SQL model can either be served on a CPU machine, with response times of about 30-60 seconds, or on a cheap GPU machine, with response times of about 5 seconds. The deployment hardware can be determined by the desired usage – if response times do not need to be near instant to produce a strong user experience, then the cheaper CPU environment can be used. Estimated costs and theoretical maximum requests served for a single representative instance hosting the LLaMA2-SQL model are as follows:
GPU:
AWS EC2 Instance Type: g4dn.xlarge
Daily Cost: $12.60
Maximum Requests per Day: About 29,000
CPU:
AWS EC2 Instance Type: t4g.2xlarge
Daily Cost: $6.47
Maximum Requests per Day: About 1,900
LLaMA2-SQL’s compute efficiency allows thousands of daily user requests to be served at a cost of only a few dollars per day. By comparison, 29,000 daily requests to the GPT-4 API endpoint would cost roughly $1,500 per day. As a result, when deploying our model on a GPU instance, we achieve a 99% reduction in cost while also improving accuracy by 30 percentage points compared to GPT-4.
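As a rough sanity check on that figure, here is a back-of-the-envelope calculation using late-2023 GPT-4 8K list prices; the per-request token counts are assumptions about a typical schema-heavy text-to-SQL prompt, not measured values.

```python
# Back-of-the-envelope check of the GPT-4 API cost figure. The per-request
# token counts are rough assumptions (schema-heavy prompt, short SQL output);
# prices are GPT-4 8K-context list prices as of late 2023.
PROMPT_TOKENS_PER_REQUEST = 1_500      # assumption: question + table schema
COMPLETION_TOKENS_PER_REQUEST = 100    # assumption: a short SQL statement
PRICE_PER_1K_PROMPT = 0.03             # USD per 1K prompt tokens
PRICE_PER_1K_COMPLETION = 0.06         # USD per 1K completion tokens

cost_per_request = (PROMPT_TOKENS_PER_REQUEST / 1000 * PRICE_PER_1K_PROMPT
                    + COMPLETION_TOKENS_PER_REQUEST / 1000 * PRICE_PER_1K_COMPLETION)
daily_cost = 29_000 * cost_per_request
print(f"${cost_per_request:.3f} per request, ~${daily_cost:,.0f} per day")
# -> $0.051 per request, ~$1,479 per day (roughly the $1,500/day quoted above)
```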
Conclusion & Next Steps
This project’s key objective was to validate the potential of a custom, low-cost text-to-SQL model, with the goal of increasing Client’s market penetration by making the tool accessible to non-technical users. LLaMA2-SQL achieved this objective, setting a new state of the art in text-to-SQL generation from both a performance and a cost standpoint. Future work building on LLaMA2-SQL could take a number of directions, including:
Retraining LLaMA2-SQL using newly-released open source models that are more powerful than Code LLaMA
Improving LLaMA2-SQL’s training dataset to more closely match the requests that will be seen in a production environment from Client’s users
Optimizing & automating cloud infrastructure using Terraform to support rapid deployment of LLaMA2-SQL on AWS, Azure, and GCP
Developing UI/UX for LLaMA2-SQL to support its integration into the tool