Fine-Tuning LLMs on a Budget: The Story Behind Muwa-OPT
In the ever-changing field of Machine Learning, Large Language Models (LLMs) have emerged as game changers, revolutionizing natural language processing. These sophisticated models have the astonishing ability to generate human-like text, making them suitable for a wide range of both day-to-day and niche tasks.
However, as the name suggests, these models are large. The models that came before LLMs cannot hold a candle to these giants, which usually have billions of trainable parameters. With great models come great computational requirements: training an LLM demands substantial computing power and financial investment. For some models, like OpenAI’s GPT-4, “substantial” would be an understatement; the cost is prohibitively expensive for most individuals and organizations. Similar to training, fine-tuning these models the traditional way requires an immense amount of computing power. Notice that I added the phrase “the traditional way”; it will be explained later on.
If you are someone like me, with no access to expensive computing hardware and no money to rent some from a cloud provider, fine-tuning an LLM can be tricky. Even organizations with access to decent GPUs can struggle to handle so much data when an LLM is being fine-tuned. So how can such a feat be achieved by us, the mere mortals?
With the rise in popularity of transformer-based large language models, more and more research has been conducted on ways to efficiently “mould” huge pre-trained LLMs to perform specific tasks. A pre-trained LLM is a model that has been trained on a huge general-purpose dataset; such a model can be further fine-tuned on a custom dataset to make it more suitable for specific tasks. More efficient fine-tuning processes mean that these models can be fine-tuned without breaking the bank.
PEFT (Parameter-Efficient Fine-Tuning) is a set of approaches meant to reduce the cost of fine-tuning, storing, and deploying large models. This Hugging Face article about PEFT describes it as follows:
“PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLMs, thereby greatly decreasing the computational and storage costs. This also overcomes the issues of catastrophic forgetting, a behaviour observed during the full finetuning of LLMs. PEFT approaches have also shown to be better than fine-tuning in the low-data regimes and generalize better to out-of-domain scenarios. It can be applied to various modalities, e.g., image classification and stable diffusion dreambooth.”
Hugging Face has also released a Python package of the same name, which, according to its documentation, implements several PEFT methods:
- LoRA
- Prefix Tuning
- P-Tuning
- Prompt Tuning
- AdaLoRA
LoRA (Low-Rank Adaptation) is a method proposed for adapting large pre-trained language models to specific tasks or domains. It involves freezing the pre-trained model weights and adding trainable rank decomposition matrices to each layer of the Transformer architecture, which significantly reduces the number of trainable parameters for downstream tasks. This approach allows for efficient adaptation of language models with fewer trainable parameters and reduced GPU memory requirements. More information on LoRA can be found in the paper that introduced the method, which can be accessed here. There is also a video that explains the paper in simple terms, which I found very useful. LoRA is the method I used in my quest to fine-tune LLMs using free resources.
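To make this concrete, here is a minimal sketch of attaching a LoRA adapter to a model with the peft package. The hyperparameter values (r, lora_alpha, target_modules, and so on) are illustrative choices, not the exact settings used for Muwa:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the pre-trained base model (OPT 1.3b in this case).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Configure LoRA: trainable rank decomposition matrices are injected
# into the attention projections while the original weights stay frozen.
lora_config = LoraConfig(
    r=16,                                 # rank of the decomposition matrices
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights is trainable
```

The printout typically reports well under 1% of the parameters as trainable, which is exactly what makes fine-tuning on free hardware feasible.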
Muwa is a LoRA model fine-tuned on top of Facebook’s OPT 1.3b model. Muwa was fine-tuned using databricks-dolly-15k, a dataset of instruction-following records that belong to multiple categories like brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. The specialty of Muwa is that only free resources were used to fine-tune the model; no paid GPUs were involved. Muwa was fine-tuned using only the Google Colab free tier.
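The dataset is available on the Hugging Face Hub. Below is a sketch of loading it and combining each record’s fields into a single prompt string; the prompt template shown is an assumed format for illustration, not necessarily the exact one used for Muwa:

```python
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_record(example):
    # Each record has "instruction", "context", "response", and "category".
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example["context"]:
        prompt += f"### Context:\n{example['context']}\n\n"
    prompt += f"### Response:\n{example['response']}"
    return {"text": prompt}

dolly = dolly.map(format_record)
```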
This work is heavily inspired by the Eluwa model by Yudhanjaya et al. Most of the model fine-tuning and benchmarking code is taken from their repository; I made some adjustments to the code and changed some parameters to make sure the fine-tuning process could run on the free resources available to me at the time.
Training
This model was fine-tuned for 2 epochs using the aforementioned Databricks Dolly 15k dataset. Both this model and its base model (OPT 1.3b) can be loaded in 8-bit. The notebook that was used for training this model can be found in the GitHub repo, including my notes on each code block.
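Loading the base model in 8-bit looks roughly like this. It is a sketch assuming the bitsandbytes integration in transformers; prepare_model_for_kbit_training is the helper in recent peft versions (older releases called it prepare_model_for_int8_training):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training

# Load OPT 1.3b with 8-bit weights to cut GPU memory usage substantially.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    load_in_8bit=True,   # requires the bitsandbytes package
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Casts the layers that must stay in higher precision and enables
# gradient checkpointing, so LoRA adapters can be trained on top.
model = prepare_model_for_kbit_training(model)
```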
The model was trained using only the T4 GPU provided by Google Colab. To fit the whole model and the dataset into it, the dataset had an input limit of 1024 tokens per query; with the default value, the GPU RAM was not enough to fine-tune the model.
With the limit on input tokens, the model training took ~12 GB of GPU RAM.
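Capping the input length is a small change at tokenization time. A sketch, assuming the formatted prompts live in a "text" field as in the dataset example above:

```python
def tokenize(example):
    # Truncate every query to at most 1024 tokens so that training
    # fits within the T4's GPU memory.
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized_dolly = dolly.map(tokenize)
```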
Evaluation
Muwa was tested and evaluated using the SQuAD mini, wikitext, and piqa datasets. Both Muwa and its base model, OPT 1.3b, were evaluated separately using all of the mentioned datasets, and the results can be found in the GitHub repo.
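To give a flavour of what the wikitext evaluation involves, here is a rough perplexity sketch, reusing the model and tokenizer from the earlier snippets. This is a simplified illustration, not the exact benchmarking code from the repository:

```python
import torch
from datasets import load_dataset

# Perplexity on wikitext-2 as a simple language-modelling benchmark.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 1024
nlls = []
for begin in range(0, encodings.input_ids.size(1), max_length):
    input_ids = encodings.input_ids[:, begin:begin + max_length].to(model.device)
    if input_ids.size(1) < 2:
        break  # need at least two tokens for a shifted next-token loss
    with torch.no_grad():
        # Passing the inputs as labels yields the average next-token loss.
        loss = model(input_ids, labels=input_ids).loss
    nlls.append(loss)

print("Perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```

Lower perplexity is better: it means the model assigns higher probability to the held-out text.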
Why “Muwa”?
As mentioned above, Muwa was heavily inspired by the Eluwa model developed by Yudhanjaya et al. “Eluwa” means goat in Sinhalese. Continuing the trend of naming LLMs after even-toed ungulates, this model is named “Muwa”, the Sinhalese word for deer.
Deer aren’t as fearsome as goats, or even llamas and alpacas, but they are still an impressive species. They are graceful, agile, and known for their antlers, which they shed and regrow every year. In some cultures, deer are considered a symbol of gentleness and kindness.
There you have it. That’s a summary of my journey in using free resources to fine-tune Large Language Models to prepare them to perform a specific set of tasks. More details about the model can be found in both the GitHub and Hugging Face repositories, which contain documentation relevant to the model and some examples of running inference with it.
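For reference, loading the published LoRA adapter for inference looks something like the sketch below. The adapter repo id is a placeholder; check the model card on Hugging Face for the real one:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", load_in_8bit=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Attach the fine-tuned LoRA weights on top of the frozen base model.
# NOTE: "<muwa-adapter-repo>" is a placeholder; see the model card.
model = PeftModel.from_pretrained(base, "<muwa-adapter-repo>")

prompt = "Give me some tips for writing a good cover letter."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```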
If you have made it to this point of the article, I would very much like to request your feedback on the work presented here. My ultimate goal for this project was to test whether it was possible to fine-tune an LLM using only the free resources that are available to anyone with an internet connection. Despite OPT 1.3b being a small model compared to giants like GPT-4, I see this as a win: since smaller models can be meddled with using free resources, the same should be possible with bigger models, provided you spend some money on computing resources.