Introduction

Model training is done using the Hugging Face transformers library, with logging on Weights & Biases (wandb). After the syncing pipeline has been run to construct the databases, and prompts have been constructed on the basis of those databases, we are ready to use that data to train our models. We finetune our models on the basis of bloomz and alpaca, both open-source instruction-based LLMs.

Downloading Weights

By the "model weights" we mean a directory containing the weights, some configuration files and the tokenizer. See here for more information. We need these for both bloomz and alpaca.

Bloomz

The weights for bloomz 7b1 are available on Hugging Face; downloading them is as simple as:

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")

The weights will end up in the Hugging Face cache. Move the directory to your BASE_MODELS_DIRECTORY and give the model directory a relevant name.
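As an alternative to moving files out of the cache, here is a minimal sketch that saves the model and tokenizer straight into a named directory. It assumes BASE_MODELS_DIRECTORY is available as an environment variable and uses the hypothetical directory name "bloomz_base"; adapt both to however the project actually configures its paths.

import os

import transformers

# Hypothetical target directory under BASE_MODELS_DIRECTORY (assumed to be an env var here).
target_dir = os.path.join(os.environ["BASE_MODELS_DIRECTORY"], "bloomz_base")

# Download the weights and the tokenizer from the Hugging Face Hub.
model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloomz-7b1")

# Write both into the target directory so the model directory is self-contained.
model.save_pretrained(target_dir)
tokenizer.save_pretrained(target_dir)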

Alpaca

Follow the instructions here to download the alpaca weights and their diff. Give the directory a name like "alpaca_base". Also place these weights in the BASE_MODELS_DIRECTORY.

Training

There are two training scripts, one for bloom (scripts/train_bloom.py) and one for alpaca (scripts/train_alpaca.py). There are many arguments we can pass to each script; for the transformers training arguments, see here. These arguments specify hyperparameters such as the number of epochs, the optimizer, etc. We have chosen default values that we used for the training, so we need not worry about specifying these (unless you want to change a specific parameter, of course).

In order to train the models we need only specify the training data path using the --data_path flag. This should be the output file of the data generation script generate_data.py.
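For example, a single-GPU run could look like the line below. The extra flags are standard transformers training arguments and the values are purely illustrative, not the defaults baked into the scripts:

python scripts/train_bloom.py --data_path <path_to_data> --num_train_epochs 3 --per_device_train_batch_size 4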

Multi-GPU training is all handled by Hugging Face if we invoke the script with torchrun, e.g.

torchrun --nproc_per_node=<n_gpus> --master_port=<your_random_port> train_alpaca.py  --data_path <path_to_data>

where we specify the number of GPUs (we used four) and give the processes a port like "4321" they can use to communicate.
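Filled in with the values mentioned above (the data path remains a placeholder), the call would look like:

torchrun --nproc_per_node=4 --master_port=4321 train_alpaca.py --data_path <path_to_data>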

The output model's weight directory will be placed in the TRAINED_MODELS_DIRECTORY, in a subdirectory whose name is a number that increases by 1 with each run. So the first model you train will end up in the directory TRAINED_MODELS_DIRECTORY/0.
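After three training runs, for example, the layout would look like this (only the numbered subdirectories are guaranteed; the files inside depend on the model and the transformers version):

TRAINED_MODELS_DIRECTORY/
    0/
    1/
    2/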

Training will be logged on wandb, a great tool for monitoring model performance and bookkeeping. The run name will be the same as the output directory number.

Inference

For convenience and verification purposes there is an inference script available in the scripts directory. It expects the same input json as the training, and will use the data under "val". Beyond that, we need only specify an output path, where the output json containing the predictions will be written, and a model path, the path to the directory of the model we want to use to run the inference.

python inference.py  --data_path <path_to_data> --model_path <path_to_model> --output_path <output_path>.json
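A quick sanity check, as a sketch: it assumes the data json holds the validation examples in a list under "val" (as described above) and that the output file is a json list of predictions; the exact per-example fields depend on generate_data.py and are not shown.

import json

# The same json that was passed to training; inference uses the "val" split.
with open("<path_to_data>") as f:
    data = json.load(f)
print("validation examples:", len(data["val"]))

# After running inference.py, inspect the predictions the same way.
with open("<output_path>.json") as f:
    predictions = json.load(f)
print("predictions written:", len(predictions))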

Deploying

For model deployment we use replicate, which gives access to on-demand GPU instances. In order to deploy, follow the instructions here; you will need to use cog, docker and the replicate package.

There are two files (cog.yaml and predict.py) in the replicate_resources directory. They describe the environment the model should run in and how to use it to run a prediction, respectively. Both are necessary to deploy the models using cog.
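For illustration only (this is not the contents of the actual replicate_resources/predict.py), a minimal cog predictor for a causal LM could look like the sketch below; the weights path "model_weights" and the generation settings are assumptions.

from cog import BasePredictor, Input
import torch
import transformers


class Predictor(BasePredictor):
    def setup(self):
        # Load the trained weights that were copied into the image next to predict.py.
        self.tokenizer = transformers.AutoTokenizer.from_pretrained("model_weights")
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            "model_weights", torch_dtype=torch.float16
        ).to("cuda")

    def predict(self, prompt: str = Input(description="Prompt to complete")) -> str:
        # Tokenize, generate a continuation, and return the decoded text.
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        output = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

With cog installed, cog predict -i prompt="..." runs such a predictor locally inside the environment described by cog.yaml, and cog push uploads the image to Replicate.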

Date: February 21, 2024