
Fine-Tuning Falcon40b for Code Generation

Falcon40b is an open-source large language model with a commercial-use-friendly license. We've fine-tuned it to write code.

Sami Jaghouar, Alaeddine Abdessalem, Sebastian Weisshaar, Scott Martens

Jul 24, 2023 • 7 min read

Falcon40b is one of the biggest open-source LLMs currently available and comes with a commercial-use-friendly Apache 2.0 license. This makes it interesting for developers looking to bring NLP products to market.

The Falcon-40b-Instruct model, which is the base Falcon40b model fine-tuned for instruction-taking and chat, ranks near the top of the Hugging Face LLM benchmark, with only models based on Meta’s non-open-source Llama family performing better. Falcon40b’s performance is far superior to that of other open-source models.

The Falcon family of models belongs to the recent wave of open-source LLMs inspired by the Llama family. The underlying insight behind them is to train smaller models longer (for more epochs) and with more data (over a trillion tokens in this case). The empirical literature on LLM scaling suggests that model size, training compute, and training dataset size should rise in tandem, but this family of models uses far more compute and training data than would be expected for a model of this size.

Fine-tuning with CodeAlpaca

Falcon40b is a pre-trained model but has not been trained for any specific task. Its zero-shot performance is not especially good, and it needs to be trained to function as a chatbot or to follow instructions.

When we want it to follow instructions, we train it using a collection of instruction-response pairs like the Alpaca dataset.

For example, this instruction-response pair comes from Alpaca:

Below is an instruction that describes a task. Write a response that 
appropriately completes the request.

### Instruction:
Create an array of length 5 which contains all even numbers between 1 
and 10.

### Response:
arr = [2, 4, 6, 8, 10]

The Falcon-40b-Instruct model has already been trained as a general instruction-following model, so we decided to go in a different direction. There has been a lot of recent interest in training LLMs to write code, and there are a number of open-source projects in this field, notably Starcoder, Replit3b, CodeGen and CodeGen2. So we decided to try to train Falcon40b to write code from natural language prompts by fine-tuning it with the CodeAlpaca dataset.

CodeAlpaca is a set of 20,000 instruction-input-code triplets that, like the Alpaca dataset, were generated by a closed-source language model. An example:

Below is an instruction that describes a task, paired with an input 
that provides further context. Write a response that appropriately 
completes the request.

### Instruction:
Generate a snippet of code to extract all the URLs from the given 
string.

### Input:
This string contains some urls such as <https://www.google.com> and 
<https://www.facebook.com>.

### Response:
import re

string = "This string contains some urls such as <https://www.google.com> and <https://www.facebook.com>."
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\\\(\\\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
print(urls)
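
The two examples above share one prompt template, with an optional input section. The following is a minimal sketch of how such prompts could be assembled. The template wording is copied from the examples, but the function name `build_prompt` and its parameters are illustrative assumptions, not the code used for this article.

```python
from typing import Optional

# Template wording copied from the Alpaca and CodeAlpaca examples above.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(
    instruction: str,
    input_text: Optional[str] = None,
    response: str = "",
) -> str:
    """Render one example; leave `response` empty to get an inference prompt.

    `build_prompt` is a hypothetical helper written for illustration.
    """
    if input_text:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + response


if __name__ == "__main__":
    print(
        build_prompt(
            "Create an array of length 5 which contains all even numbers "
            "between 1 and 10.",
            response="arr = [2, 4, 6, 8, 10]",
        )
    )
```

At training time the response is appended after the final marker. At inference time the prompt ends at "### Response:" and the model is expected to fill in the rest.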

Training Falcon40b specifically on code construction, rather than on the more general task of instruction-following, results in better performance on code-generation tasks.

Fine-tuning hyperparameters

We fine-tuned Falcon40b using LoRA with 8-bit quantization on four NVIDIA A100 Tensor Core GPUs with 80GB of VRAM. Fine-tuning took approximately 4 hours, at a cost of approximately 25 euros on runpod.io.

We used the following hyperparameters:

Epochs: 2
Layer to add LoRA: full attention layer (QKV)
LoRA rank: 8
Learning rate: 3.0e-4
Batch size: 128
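
To make the batch size concrete, here is a small sketch of how an effective batch of 128 could be split into per-GPU micro-batches with gradient accumulation. The micro-batch size of 4 comes from the `--micro-batch-size 4` flag in the reproduction command later in this post. The assumption that the four GPUs run data-parallel, and the names `FinetuneConfig` and `gradient_accumulation_steps`, are ours for illustration and may not match what the training script actually does.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FinetuneConfig:
    """Hyperparameters listed above, plus two assumed fields (marked)."""

    epochs: int = 2
    lora_rank: int = 8
    lora_target_modules: Tuple[str, ...] = ("query_key_value",)
    learning_rate: float = 3.0e-4
    batch_size: int = 128      # effective batch size per optimizer step
    micro_batch_size: int = 4  # from --micro-batch-size in the command line
    num_gpus: int = 4          # assumption: all four GPUs run data-parallel


def gradient_accumulation_steps(cfg: FinetuneConfig) -> int:
    """Number of forward/backward passes to accumulate before each optimizer step."""
    examples_per_pass = cfg.micro_batch_size * cfg.num_gpus
    if cfg.batch_size % examples_per_pass != 0:
        raise ValueError("batch_size must be a multiple of micro_batch_size * num_gpus")
    return cfg.batch_size // examples_per_pass


if __name__ == "__main__":
    # Data-parallel across 4 GPUs (assumed).
    print(gradient_accumulation_steps(FinetuneConfig()))
    # Model sharded across GPUs, so only one data-parallel replica (also assumed).
    print(gradient_accumulation_steps(FinetuneConfig(num_gpus=1)))
```

Under the data-parallel assumption this gives 8 accumulation steps. If the model were instead sharded across the four GPUs as a single replica, it would be 32.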

We noticed that increasing the number of epochs to 2 helped the model “learn” when to stop. With less training, it often kept generating and repeated the instruction template instead of ending its response.
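
Independently of training, a common safeguard against this failure mode is to cut the generated text at the first template marker. The sketch below is not something we did for this article. The marker list and the function name `truncate_at_template` are illustrative assumptions.

```python
from typing import Sequence

# Markers taken from the prompt template shown earlier. If any of them shows up
# in the generated continuation, the model has started a new example.
STOP_MARKERS = (
    "### Instruction:",
    "### Input:",
    "### Response:",
    "Below is an instruction",
)


def truncate_at_template(generated: str, markers: Sequence[str] = STOP_MARKERS) -> str:
    """Return `generated` cut at the earliest template marker, if one occurs.

    `generated` must be the continuation only, without the original prompt.
    """
    cut = len(generated)
    for marker in markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].rstrip()


if __name__ == "__main__":
    raw = "arr = [2, 4, 6, 8, 10]\n\n### Instruction:\nCreate an array"
    print(truncate_at_template(raw))
```

This only hides the symptom at inference time. The extra epoch addresses the cause by teaching the model to end its response.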

Results

We selected 45 coding instructions to assess the model's ability to follow instructions. Our fine-tuned models can generate code from natural language instructions, answer questions about segments of code, and answer general computer programming questions.

For example:

Generic programming questions

Question: 

Explain the difference between an iterator and a generator 
in Python. Provide examples.

Response: 

An iterator is an object that can be used to iterate over a 
sequence of items, such as a list or a string. It provides a way to 
access each item in the sequence one at a time. A generator is a 
function that can be used to generate a sequence of items, such as a 
list or a string. It provides a way to generate each item in the 
sequence one at a time. The main difference between an iterator and 
a generator is that an iterator is an object that can be used to 
iterate over a sequence, while a generator is a function that can be 
used to generate a sequence.

Code generation from natural language prompts

Question: 

Write a Python function that accepts a list of words, and returns a 
dictionary where keys are the words and values are the frequencies of 
the words in the list. Use a for loop, if clause, and dictionary in 
your solution.

Response: 

def word_frequency(words):
    frequencies = {}
    for word in words:
        if word in frequencies:
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies

Answering questions about specific pieces of code

Question:

Explain the following Python script that uses FastAPI to create a basic 
'Hello World' API.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}

Response:

This Python script uses FastAPI to create a basic 'Hello World' API. 
FastAPI is a web framework that makes it easy to create APIs in Python. 
The @app.get("/") decorator defines a route for the API, and the 
read_root() function is called when the route is accessed. The function 
returns a dictionary with the key "Hello" and the value "World".

Writing code to handle AI models

Question: 

Write a Python script using transformers that tokenizes an input 
text and generates model output.

Response: 

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

input_text = "This is an example text."

tokenized_text = tokenizer.tokenize(input_text)
model_output = model.generate(input_text)

print(tokenized_text)
print(model_output)

In quantitative terms, our headline result is a pass@1 score of 22% on the HumanEval benchmark (that is, 22% of the problems were solved on the first attempt), scored using this codebase. Although this is a good result, it is well below the best documented performance of AI models specialized in writing code.
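
For readers unfamiliar with the metric, the sketch below shows how pass@k is commonly estimated for HumanEval-style benchmarks: for each problem, n samples are generated, c of them pass the unit tests, and the estimate is 1 - C(n-c, k) / C(n, k). This illustrates the general formula, not the scoring code we used. The per-problem counts in the example are made up.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; `results` holds (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: four problems, one sample each, two solved.
    toy_results = [(1, 1), (1, 0), (1, 0), (1, 1)]
    print(benchmark_pass_at_k(toy_results, k=1))  # prints 0.5
```

With one sample per problem, pass@1 reduces to the fraction of problems whose single generated solution passes all tests.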

Falcon-40b-Instruct scored 18% on the same tasks. We suspect that its training on a much larger dataset made it better at following instructions in general, without making it better at writing code or solving programming problems. Most of its knowledge appears to have been acquired in the pre-training phase.

In a nutshell

We have released two models via Hugging Face: the full-weight model and the LoRA adapter weights for our fine-tuned Falcon40b.

Although Falcon40b is not specialized for coding, it performs quite well on coding problems. With our fine-tuned LoRA adapter, you can turn any Falcon40b installation into a coding assistant by loading only a few megabytes of additional data into memory.
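
As a back-of-the-envelope check on why the adapter is so small, the sketch below counts the parameters a rank-8 LoRA adds to the fused QKV projection in every layer. The Falcon40b dimensions are assumptions taken from the publicly released model configuration, not figures from this article, so treat the result as an order-of-magnitude estimate.

```python
# Assumed Falcon40b dimensions (from the public model config, not this article).
HIDDEN_SIZE = 8192
NUM_LAYERS = 60
NUM_QUERY_HEADS = 128
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_QUERY_HEADS

LORA_RANK = 8  # from the hyperparameters listed in this article


def lora_param_count(in_features: int, out_features: int, rank: int, layers: int) -> int:
    """LoRA adds two matrices per adapted layer: (in x rank) and (rank x out)."""
    return layers * rank * (in_features + out_features)


# Output width of the fused query_key_value projection: all query heads plus
# one key head and one value head per KV group.
QKV_OUT = (NUM_QUERY_HEADS + 2 * NUM_KV_HEADS) * HEAD_DIM

if __name__ == "__main__":
    total = lora_param_count(HIDDEN_SIZE, QKV_OUT, LORA_RANK, NUM_LAYERS)
    print(f"LoRA parameters: {total:,}")
    print(f"Size at 16-bit precision: {total * 2 / 1e6:.1f} MB")
```

Under these assumptions the adapter holds about 8.4 million parameters, or roughly 17 MB at 16-bit precision, which is tiny next to the tens of billions of parameters in the base model.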

Downloading our code and reproducing our results

You can reproduce what we did by checking out our codebase from GitHub. The code consists of fairly transparent wrappers around Hugging Face’s transformers library.

First, get our code from GitHub:

git clone https://github.com/jina-ai/jerboa.git

Then, go into the root directory of the git repository and run the following:

cd jerboa

finetune.py --base-model tiiuae/falcon-40b --lora-target-modules query_key_value --data-path sahil2801/CodeAlpaca-20k --output-dir ./lora-alpaca-code --batch-size 128 --micro-batch-size 4 --eval-limit 45 --eval-file code_eval.jsonl --wandb-project jerboa --wandb-log-model --wandb-watch gradients --num-epochs 2

If you have trouble, you can check out the exact commit we used for this article:

git checkout abe1a23a4e9f5e141e19be0336ca8a4c888dd024

You may also be able to reduce compute and training time with LLM Foundry or some other tool that optimizes for low training costs.

Get involved

Check out Jina AI's website, GitHub repo, and Discord to explore what AI can do for you.