Introduction
Model training is done using the Hugging Face transformers library, with logging on Weights & Biases (wandb). After the syncing pipeline has been run to construct the databases, and prompts have been constructed on the basis of those databases, we are ready to use that data to train our models. We fine-tune our models on the basis of Bloomz and Alpaca, both open-source instruction-based LLMs.
Downloading Weights
When we refer to the "model weights", we mean a directory containing the weights, some configuration files, and the tokenizer. See here for more information. We need these for both Bloomz and Alpaca.
Bloomz
The weights for Bloomz 7B are available on the Hugging Face Hub; downloading them is as simple as:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")
```
The weights will end up in the Hugging Face cache. Move the directory to your `BASE_MODELS_DIRECTORY` and give the model directory a relevant name.
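Alternatively, you can skip the manual move by saving the model and its tokenizer straight into the base models directory. The sketch below assumes a placeholder path for `BASE_MODELS_DIRECTORY` and a hypothetical directory name; adjust both to your setup.

```python
# Hedged sketch: download Bloomz 7B1 and save it directly into the base
# models directory instead of moving it out of the cache by hand.
# "path/to/BASE_MODELS_DIRECTORY" and "bloomz_7b1" are placeholders.
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloomz-7b1")

target_dir = "path/to/BASE_MODELS_DIRECTORY/bloomz_7b1"  # give it a relevant name
model.save_pretrained(target_dir)
tokenizer.save_pretrained(target_dir)
```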
Alpaca
Follow the instructions to download the Alpaca weights and their diff here. Give the directory a name like "alpaca_base". Also place these weights in the `BASE_MODELS_DIRECTORY`.
Training
There are two training scripts, one for Bloom (`scripts/train_bloom.py`) and one for Alpaca (`scripts/train_alpaca.py`). There are many arguments we can pass to each script; for the transformers training arguments, see here. These arguments specify hyperparameters such as the number of epochs, the optimizer, etc. The defaults are set to the values we used for training, so we need not worry about specifying them (unless you want to change a specific parameter, of course).
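For reference, these flags map onto fields of the transformers `TrainingArguments` dataclass. The values below are illustrative placeholders only, not the defaults used by our scripts.

```python
# Hedged sketch: the training flags correspond to fields of
# transformers.TrainingArguments. The values here are placeholders for
# illustration, not the defaults baked into our training scripts.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="trained_models/0",   # hypothetical output directory
    num_train_epochs=3,              # number of epochs
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    optim="adamw_torch",             # optimizer choice
    logging_steps=10,
    report_to="wandb",               # log to Weights & Biases
)
```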
In order to train the models we need only specify the training data path using the `--data_path` flag. This should be the output file of the data generation script `generate_data.py`.
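As a quick sanity check before training, you can confirm the generated file is valid JSON and look at the split sizes. The snippet only assumes the file contains a "val" split (which the inference script reads, see below) and presumably a "train" split; the per-example fields depend on `generate_data.py`.

```python
# Hedged sketch: sanity-check the generated data file. We only assume a
# "val" split exists (used later by inference) and presumably a "train"
# split; the example fields themselves depend on generate_data.py.
import json

with open("path/to/generated_data.json") as f:  # placeholder path
    data = json.load(f)

for split in ("train", "val"):
    if split in data:
        print(f"{split}: {len(data[split])} examples")
```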
Multi-GPU training is handled by Hugging Face if we invoke the script like e.g.:
```
torchrun --nproc_per_node=<n_gpus> --master_port=<your_random_port> train_alpaca.py --data_path <path_to_data>
```
Here we allocate a certain number of GPUs (we used four) and give them a port like "4321" that they can use to communicate.
The output model weights directory will be placed in the `TRAINED_MODELS_DIRECTORY`, in a subdirectory whose name is a number that increases by 1 each time. So the first model you train will end up in the directory `TRAINED_MODELS_DIRECTORY/0`.
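Since runs are just numbered subdirectories, locating your most recent model is a few lines of Python. This is a sketch of the convention only, not code taken from the training scripts, and the directory path is a placeholder.

```python
# Hedged sketch of the numbering convention: subdirectories of
# TRAINED_MODELS_DIRECTORY are named 0, 1, 2, ... and the newest run has
# the highest number. Not taken from the training scripts.
from pathlib import Path

trained_models_dir = Path("path/to/TRAINED_MODELS_DIRECTORY")  # placeholder
run_numbers = [int(p.name) for p in trained_models_dir.iterdir()
               if p.is_dir() and p.name.isdigit()]
latest_run = trained_models_dir / str(max(run_numbers))
print(f"Most recent trained model: {latest_run}")
```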
Training will be logged on wandb, a great tool for monitoring model performance and bookkeeping. The run name will match the output directory number.
Inference
For convenience and verification purposes there is an inference script available in the scripts directory. It expects the same input JSON as the training scripts and will use the data under "val". Beyond that, we need only specify an output path, where the output JSON containing the predictions will be written, and a model_path, the path to the directory of the model we want to use for inference.
```
python inference.py --data_path <path_to_data> --model_path <path_to_model> --output_path <output_path>.json
```
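Under the hood, the model directory loads like any other transformers checkpoint, and a completion is generated per validation example. The heavily simplified sketch below is not the actual `inference.py`; the "prompt" field name and the paths are assumptions.

```python
# Heavily simplified sketch of what inference boils down to; the real
# scripts/inference.py handles arguments, batching, etc. The "prompt"
# field name inside each "val" example and the paths are assumptions.
import json
import transformers

model_path = "path/to/TRAINED_MODELS_DIRECTORY/0"  # placeholder
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForCausalLM.from_pretrained(model_path)

with open("path/to/generated_data.json") as f:  # placeholder
    val_examples = json.load(f)["val"]

predictions = []
for example in val_examples:
    inputs = tokenizer(example["prompt"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```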
Deploying
For model deployment we use Replicate, which allows access to on-demand GPU instances. In order to deploy, follow the instructions here; you will need to use cog, Docker, and the replicate package.
There are two files (`cog.yaml` and `predict.py`) in the `replicate_resources` directory. These two files detail the environment that the model should run in, and how to use it to run a prediction, respectively. Both are necessary to deploy the models using cog.
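For orientation, a cog `predict.py` defines a predictor class with a `setup` method that loads the model once and a `predict` method that serves a single request. The sketch below follows cog's documented `BasePredictor` interface but is not the actual file from `replicate_resources`; the model path and input parameter are assumptions.

```python
# Hedged sketch of a cog predictor, following cog's documented
# BasePredictor interface; the real replicate_resources/predict.py may
# differ. The model path inside the image is a placeholder.
from cog import BasePredictor, Input
import transformers


class Predictor(BasePredictor):
    def setup(self):
        # Load the trained model once when the container starts.
        model_path = "model_weights"  # placeholder path inside the image
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
        self.model = transformers.AutoModelForCausalLM.from_pretrained(model_path)

    def predict(self, prompt: str = Input(description="Prompt for the model")) -> str:
        # Generate a completion for a single prompt.
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```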