Introduction
This file provides an overview of the documentation available for the debot project, along with some general information that is not covered in the documentation of any specific component.
In the debot project, we finetune two LLMs (alpaca and bloomz) on Dutch parliamentary debates. From a bird's-eye perspective, the project consists of the following components:
- A data pipeline that downloads and processes PDF documents from a dataset hosted at https://llms.openstate.eu, constructing a hybrid database in the process.
- A prompt generation process to convert the parsed debates into LLM training data.
- A model training step, in which the generated training data is used to finetune models on third-party GPU providers. The finetuned models are hosted on third-party GPU providers as well; currently we use Replicate (a minimal example of calling a hosted model follows this list).
- A simple debate generation/completion API that communicates with the hosted finetuned models and takes care of some parsing logic.
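To make the last two components concrete, below is a minimal sketch of requesting a debate completion from a finetuned model hosted on Replicate, using the official `replicate` Python client. The model identifier, prompt format, and input parameters shown here are assumptions for illustration only; the actual interface is documented in api.html.

```python
# Minimal sketch of a debate completion via a Replicate-hosted model.
# Requires REPLICATE_API_TOKEN to be set in the environment.
import replicate

# Hypothetical identifier of one of the finetuned models.
MODEL = "your-org/debot-alpaca:0123456789abcdef"

# A debate prefix for the model to complete (the format is an assumption).
prompt = "De voorzitter: Aan de orde is het debat over de begroting.\n"

# replicate.run() invokes the hosted model; for language models the
# output is typically an iterator of generated text chunks.
output = replicate.run(MODEL, input={"prompt": prompt})
print("".join(output))
```

The API component wraps calls like this one and applies the parsing logic described in api.html before returning results.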
Below, we link to the documentation of the specific components of this project:
- dataset.html
- Brief README on the general structure of the source dataset of documents used in this project.
- pipeline.html
- Documents the implementation of the syncing pipeline, responsible for processing the dataset. Refers to merge_instructions.html and action_conditions.html for documentation on specific components of the pipeline.
- document_parsers.html
- Documents the main idea behind how the document parsers work.
- model_training.html
- Details how models are trained using third-party GPU providers.
- data_generation.html
- Describes the process of going from the pipeline-produced hybrid database to textual training data on which the models are finetuned.
- api.html
- General instructions on how to interact with the API and details on the parsed structure of the data.