Introduction
In this document, we explain the use and conceptual operation of the TrainPromptGenerator and the PromptGenerator. The TrainPromptGenerator is responsible for converting the database of parsed debates, or sessions, into training data for the models, using the PromptGenerator for this purpose. The PromptGenerator converts a JSON representation of a single session into the desired textual format for our models.
Data Format
The TrainPromptGenerator generates the data with which our models are fine-tuned. The training data must be provided to our scripts as a JSON file containing training and validation data.
Prompt Dictionary
The basic element in the training data is a prompt dictionary, in the same format as the stanford_alpaca repo. The "instruction" field contains the instruction the model should follow, with any necessary context going in the "input" field. The desired output goes in the "output" field.
{
    "instruction": "Je bent een Tweede Kamer debatssimulator. Je simuleert een vragenuur waarin een Tweede kamerlid (PvdA) vragen stelt aan de Minister (Financiën) betreffende het voorkomen van witwassen van Russische vermogens. Het vergaderjaar is 2013-2014.",
    "input": "",
    "output": "Tweede kamerlid (PvdA):\n Herinnert u zich de motie-Servaes/Klaver ....."
}
Training data JSON
The JSON provided to our training scripts must contain a "train" and a "val" field. Each of these consists of a list of prompt dictionaries as described above, intended for training and validation, respectively.
{ "val":[{"instruction":"Je bent een... ","input":"Minister van...", "output": "VVD:"},... ], "train":[{"instruction":"Je bent een... ","input":"De heer...", "output": "GL"},... ] }
The output of our TrainPromptGenerator is in this format: a JSON with a validation and a train set in the form of prompt dictionaries.
TrainPromptGenerator
Training and Validation
The first step of the TrainPromptGenerator (TPG) is assigning each document to the training or the validation data. Splitting at the document level ensures that no data leakage occurs: no part of a single debate can end up in both sets.
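A minimal sketch of such a document-level split (the function name, ratio, and seed here are illustrative, not the actual implementation):

import random

def split_documents(doc_ids, val_fraction=0.1, seed=42):
    """Assign whole documents to train or validation, so that no part
    of a single session can end up in both sets."""
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (train_ids, val_ids)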
Creating Session Objects
The next step of the TPG is creating session representations as described in the API documentation (api.html). In short, a session object is a JSON representation of a partial or entire debate session, including the type of debate, the topic, and what was said (the session entries). We convert all sessions in the database into session objects, filtering the post, role, and party of each speaker against the mappings in ../debot/value_mappings.py. If a raw value does not explicitly exist in our mappings, it is not used and the session is ignored.
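To make the shape concrete, a session object might look roughly like the dictionary below; the field names are illustrative, and the authoritative schema is in api.html:

session = {
    "debate_type": "Commissie debat",
    "topic": "Seksuele en Reproductieve Gezondheid en Rechten",
    "date": "2012-07-04",
    "entries": [
        {"party": "VVD", "role": "Tweede kamerlid", "text": "Onze plannen zijn als volgt. ..."},
        {"party": "PVV", "role": "Tweede kamerlid", "text": "Dat is weer typisch de VVD: ..."},
    ],
}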
Integrating Government Status
Next, based on the date of the debate, the state of the government at that time is determined and added to the session object as extra information: for example, whether the cabinet is demissionair (caretaker), whether a speaking party is in opposition or coalition, and which cabinet is in office at the time.
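Conceptually, this lookup maps the debate date onto a cabinet table, roughly as follows (the table excerpt and function are illustrative; the real data and logic live elsewhere in the codebase):

from datetime import date

# Illustrative excerpt: (name, sworn in, became demissionair, left office, coalition parties).
CABINETS = [
    ("Rutte I", date(2010, 10, 14), date(2012, 4, 23), date(2012, 11, 5), {"VVD", "CDA"}),
    ("Rutte II", date(2012, 11, 5), date(2017, 3, 15), date(2017, 10, 26), {"VVD", "PvdA"}),
]

def government_status(debate_date, party):
    """Return the sitting cabinet, whether it is demissionair on the
    given date, and whether the party is in coalition or opposition."""
    for name, start, fell, end, coalition in CABINETS:
        if start <= debate_date < end:
            demissionair = debate_date >= fell
            side = "coalitie" if party in coalition else "oppositie"
            return name, demissionair, side
    return None, None, None

With this sketch, government_status(date(2012, 7, 4), "VVD") would yield ("Rutte I", True, "coalitie"), matching the demissionair Rutte I in the example instruction further below.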
Converting to a Prompt Dictionary
In the next step, each session object is converted into a prompt dictionary. This is the same step that needs to happen when an API request comes in, so it happens in a separate module used by both the API and the TPG: the PromptGenerator.
We use two types of prompts: a continuation prompt and a generation prompt. In a continuation prompt, the model is given part of the debate that has already taken place as context and is asked to continue the debate. In a generation prompt, the model must start the debate itself. When a session object is converted into a prompt dictionary, the PromptGenerator produces both a continuation and a generation prompt from it.
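As an illustration, generating both prompt types from one session could look like the sketch below; the instruction wording and the choice of split point are assumptions, not the actual implementation:

import random

def make_prompts(session, instruction):
    """Build a generation prompt (start from scratch) and a continuation
    prompt (continue from partial context) for one session object."""
    msgs = [f"**{e['party']}**: {e['text']}" for e in session["entries"]]
    generation = {
        "instruction": instruction + " Genereer het debat vanaf het begin.",
        "input": "",
        "output": " ".join(msgs),
    }
    cut = random.randint(1, len(msgs) - 1)  # assumed split point; needs >= 2 messages
    continuation = {
        "instruction": instruction + " Zet het debat voort.",  # illustrative wording
        "input": " ".join(msgs[:cut]),
        "output": " ".join(msgs[cut:]),
    }
    return generation, continuation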
Single Message Prompt
For every prompt dictionary, there is a 20% chance that we also create a single-message prompt dictionary, in which the model is instructed to generate only the next message in the debate; the desired output then also contains only one message. This lets us explicitly ask for just one message to be generated when necessary.
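In sketch form, with the 20% threshold taken from the text above and everything else illustrative:

import random

def maybe_single_message_prompt(instruction, msgs, cut):
    """With 20% probability, also build a prompt whose desired output is
    only the next message after the given context (cut < len(msgs))."""
    if random.random() < 0.2:
        return {
            "instruction": instruction + " Genereer alleen het volgende bericht.",  # illustrative wording
            "input": " ".join(msgs[:cut]),
            "output": msgs[cut],
        }
    return None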
PromptGenerator
The PromptGenerator (PG) is used by both the API and the TPG to convert a session object into a prompt dictionary. It does this through some light parsing and string formatting. An example of the format of a final instruction would be:
Je bent een Tweede Kamer debatssimulator. Je simuleert een 'Commissie debat' betreffend 'Verslag van een algemeen overleg, gehouden op 4 juli 2012, inzake Seksuele en Reproductieve Gezondheid en Rechten'. Aanwezig zijn: PVV (None), D66 (oppositie), VVD (coalitie), voorzitter. De regering is Rutte I (demissionair). Genereer het debat vanaf het begin.
This string would be the "instruction" field in the prompt dictionary.
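Assembled from the session object, such an instruction string could come out of a template like the following (a sketch that mirrors the example above; the function and its parameters are assumptions, not the real interface):

def build_instruction(session, attendees, cabinet_line):
    """Compose the instruction string following the example format above.
    `attendees` is a pre-formatted string such as
    'PVV (None), D66 (oppositie), VVD (coalitie), voorzitter'."""
    return (
        "Je bent een Tweede Kamer debatssimulator. "
        f"Je simuleert een '{session['debate_type']}' betreffend '{session['topic']}'. "
        f"Aanwezig zijn: {attendees}. "
        f"De regering is {cabinet_line}. "
        "Genereer het debat vanaf het begin."
    )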
Previous messages are represented in the "input" field as a string like this:
**VVD**: Onze plannen zijn als volgt. We willen nog ongeveer 100 mln. aan noodhulp doen en ongeveer 800 mln. multilateraal doen. Dan blijft er nog 500 mln. over om bilateraal te doen. **PVV**: Dat is weer typisch de VVD: heel stoere verhalen in de krant schrijven, maar in elk AO, zoals afgelopen maandag en ook vandaag weer, halfzachte pro-ontwikkelingshulpverhaaltjes houden. De VVD blaft, maar bijt nooit. **VVD**: Als we 70% bezuinigen op wat we nu uitgeven, dan zijn het geen halfbakken verhalen. We willen het alleen niet doen zoals de PVV. We willen niet alleen inzetten op noodhulp, omdat je dan steeds meer noodhulp gaat geven. We willen wel de ontwikkelings clubs zelf hun geld laten verdienen. Vervolgens willen we nog een aantal grote problemen helpen aanpakken, want we leven niet op een eiland. Daarin onderscheiden we ons waarschijnlijk van de PVV.
The prompt generator can function in either API mode or training-data mode, indicated by a boolean flag when initializing the object. This is necessary because slightly different logic is needed when generating a prompt for the training data versus a prompt for the API: for example, the "output" field must be populated when generating training data, but not when running inference through the API.
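A simplified sketch of how such a flag might gate the logic (method and parameter names are assumptions, not the real interface):

class PromptGenerator:
    """Sketch only: the real class performs the full parsing and formatting."""

    def __init__(self, api_mode: bool):
        self.api_mode = api_mode

    def to_prompt(self, instruction, input_text, output_text):
        prompt = {"instruction": instruction, "input": input_text, "output": ""}
        if not self.api_mode:
            # Training-data mode: the desired completion must be present.
            prompt["output"] = output_text
        return prompt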
Usage
Creating the training data for our task is done on the basis of the database created by the sync script. Run the generate_data.py script like this:
python scripts/generate_data.py --model_name alpaca --output_path train_data_alpaca.json
This will create a JSON file in the format described above, containing both training and validation data. We need to generate the training data separately for each model, as the models have different tokenizers and therefore different constraints on input length, among other things.
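To sanity-check the result, the generated file can be loaded and inspected (assuming the output path from the command above):

import json

with open("train_data_alpaca.json") as f:
    data = json.load(f)

print(len(data["train"]), "training prompts")
print(len(data["val"]), "validation prompts")
print(sorted(data["train"][0]))  # ['input', 'instruction', 'output']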