Table of Contents

Introduction

This file provides an overview of the documentation available for the debot project, along with some general information that is not covered in the documentation of any specific component.

In the debot project, we finetune two LLMs (alpaca and bloomz) on Dutch parliamentary debates. From a bird's-eye view, the project consists of the following components:

  1. A data pipeline that downloads and processes PDF documents from a dataset hosted on https://llms.openstate.eu, constructing a hybrid database in the process.
  2. A prompt generation process that converts the parsed debates into LLM training data (a sketch of this idea follows after this list).
  3. A training step that uses the generated training data to finetune the models on third-party GPU providers. The finetuned models are hosted on third-party infrastructure as well; currently we use Replicate.
  4. A simple debate generation/completion API that communicates with the hosted finetuned models and takes care of some parsing logic.

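To make step 2 a little more concrete, below is a minimal, hypothetical sketch of rendering a parsed debate into a prompt/completion training pair. The Turn dataclass, its field names, and the prompt layout are illustrative assumptions rather than the project's actual data model; the real process is documented in data_generation.html.

    from dataclasses import dataclass

    @dataclass
    class Turn:
        """One speaker turn in a parsed debate (hypothetical structure)."""
        speaker: str
        party: str
        text: str

    def debate_to_training_example(topic: str, turns: list[Turn]) -> dict:
        """Render a debate as a single prompt/completion pair.

        The opening turn goes into the prompt; the remaining turns form
        the completion the model is trained to reproduce.
        """
        def line(t: Turn) -> str:
            return f"{t.speaker} ({t.party}): {t.text}" if t.party else f"{t.speaker}: {t.text}"

        rendered = [line(t) for t in turns]
        return {
            "prompt": f"Debat over: {topic}\n{rendered[0]}\n",
            "completion": "\n".join(rendered[1:]),
        }

    example = debate_to_training_example(
        "de begroting",
        [
            Turn("De voorzitter", "", "Aan de orde is het debat over de begroting."),
            Turn("De heer Jansen", "VVD", "Voorzitter, dank u wel."),
        ],
    )

Keeping each example as a flat prompt/completion pair keeps the generated data agnostic to whichever finetuning stack the GPU provider expects.
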
The documentation for the specific components of this project is listed below:

dataset.html
Brief README on the general structure of the source dataset of documents used in this project.
pipeline.html
Documents the implementation of the syncing pipeline, responsible for processing the dataset. Refers to merge_instructions.html and action_conditions.html for documentation on specific components of the pipeline.
document_parsers.html
Documents the main idea behind how the document parsers work.
model_training.html
Details how models are trained using third-party GPU providers.
data_generation.html
Describes the process of going from the pipeline-produced hybrid database to textual training data on which the models are finetuned.
api.html
General instructions on how to interact with the API and details on the parsed structure of the data (an illustrative sketch of calling a hosted model directly follows below).
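
Since the finetuned models are hosted on Replicate (step 3 above), talking to one of them directly comes down to a single call through the replicate Python client. The sketch below is an illustration under stated assumptions: the model identifier and the "prompt" input key are placeholders, not the project's actual model reference or input schema. In practice, the debate generation/completion API described in api.html wraps this call and handles the parsing logic around it.

    import replicate  # requires the REPLICATE_API_TOKEN environment variable

    # Placeholder model reference -- the real owner/name/version of the
    # finetuned debot models is not part of this overview.
    MODEL = "owner/debot-alpaca:version-hash"

    # A debate opening used as the completion prompt.
    prompt = (
        "De voorzitter: Aan de orde is het debat over de begroting.\n"
        "De heer Jansen (VVD):"
    )

    # replicate.run() invokes the hosted model; language models on
    # Replicate typically stream their output as an iterable of text chunks.
    output = replicate.run(MODEL, input={"prompt": prompt})
    print("".join(output))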