Introduction
In this document, we explain the use and conceptual operation of the TrainPromptGenerator and the PromptGenerator. The TrainPromptGenerator is responsible for converting the database of parsed debates, or sessions, into training data for the models, using the PromptGenerator for this purpose. The PromptGenerator converts a JSON representation of a single session into the desired textual format for our models.
Data Format
The TrainPromptGenerator generates the data with which our models are fine-tuned. The training data must be provided to our scripts as a JSON file containing training and validation data.
Prompt Dictionary
The basic element in the training data is a prompt dictionary, in the same format as the stanford_alpaca repo. The "instruction" field contains the instruction the model should follow, with any necessary context going in the "input" field. The desired output goes in the "output" field.
{
    "instruction": "Je bent een Tweede Kamer debatssimulator. Je simuleert een vragenuur waarin een Tweede kamerlid (PvdA) vragen stelt aan de Minister (Financiën) betreffende het voorkomen van witwassen van Russische vermogens. Het vergaderjaar is 2013-2014.",
    "input": "",
    "output": "Tweede kamerlid (PvdA):\n Herinnert u zich de motie-Servaes/Klaver ....."
}
Training data JSON
The JSON provided to our training scripts must contain a "train" and a "val" field. Each of these consists of a list of prompt dictionaries as described above, intended for training and validation, respectively.
{ "val":[{"instruction":"Je bent een... ","input":"Minister van...", "output": "VVD:"},... ], "train":[{"instruction":"Je bent een... ","input":"De heer...", "output": "GL"},... ] }
The output of our TrainPromptGenerator is in this format: a JSON with a validation and a train set in the form of prompt dictionaries.
TrainPromptGenerator
Training and Validation
The first step of the TrainPromptGenerator (TPG) is assigning each document to the training or the validation data. Splitting at the document level ensures that no data leakage occurs: no part of a single debate can end up in both sets.
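A minimal sketch of such a document-level split (the function name, ratio, and seed here are illustrative, not the actual implementation):

import random

def split_documents(doc_ids, val_fraction=0.1, seed=42):
    """Assign whole documents to train or validation, so that no part
    of a single session can end up in both sets."""
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (train_ids, val_ids)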
Creating Session Objects
The next step of the TPG is creating session representations as described in the API documentation (api.html). In short, a session object is a JSON representation of a partial or entire debate session, including the type of debate, the topic, and what was said (the session entries). We convert all sessions in the database into session objects, filtering the post, role, and party of each speaker against the mappings in ../debot/value_mappings.py. If a raw value does not explicitly exist in our mappings, it is not used and the session is ignored.
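To make the shape concrete, a session object might look roughly like the dictionary below; the field names are illustrative, and the authoritative schema is in api.html:

session = {
    "debate_type": "Commissie debat",
    "topic": "Seksuele en Reproductieve Gezondheid en Rechten",
    "date": "2012-07-04",
    "entries": [
        {"party": "VVD", "role": "Tweede kamerlid", "text": "Onze plannen zijn als volgt. ..."},
        {"party": "PVV", "role": "Tweede kamerlid", "text": "Dat is weer typisch de VVD: ..."},
    ],
}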
Integrating Government Status
Next, based on the date of the debate, the state of the government at that time is determined and added to the session object as extra information: for example, whether the cabinet is demissionair (caretaker), whether a speaking party is in opposition or coalition, and which cabinet is in office at the time.
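Conceptually, this lookup maps the debate date onto a cabinet table, roughly as follows (the table excerpt and function are illustrative; the real data and logic live elsewhere in the codebase):

from datetime import date

# Illustrative excerpt: (name, sworn in, became demissionair, left office, coalition parties).
CABINETS = [
    ("Rutte I", date(2010, 10, 14), date(2012, 4, 23), date(2012, 11, 5), {"VVD", "CDA"}),
    ("Rutte II", date(2012, 11, 5), date(2017, 3, 15), date(2017, 10, 26), {"VVD", "PvdA"}),
]

def government_status(debate_date, party):
    """Return the sitting cabinet, whether it is demissionair on the
    given date, and whether the party is in coalition or opposition."""
    for name, start, fell, end, coalition in CABINETS:
        if start <= debate_date < end:
            demissionair = debate_date >= fell
            side = "coalitie" if party in coalition else "oppositie"
            return name, demissionair, side
    return None, None, None

With this sketch, government_status(date(2012, 7, 4), "VVD") would yield ("Rutte I", True, "coalitie"), matching the demissionair Rutte I in the example instruction further below.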
Converting to a Prompt Dictionary
In the next step, each session object is converted into a prompt dictionary. This is the same step that needs to happen when an API request comes in, so it happens in a separate module used by both the API and the TPG: the PromptGenerator.
We use two types of prompts: a continuation prompt and a generation prompt. In a continuation prompt, the model is given part of the debate that has already taken place as context and is asked to continue the debate. In a generation prompt, the model must start the debate itself. When a session object is converted into a prompt dictionary, the PromptGenerator produces both a continuation and a generation prompt from it.
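As an illustration, generating both prompt types from one session could look like the sketch below; the instruction wording and the choice of split point are assumptions, not the actual implementation:

import random

def make_prompts(session, instruction):
    """Build a generation prompt (start from scratch) and a continuation
    prompt (continue from partial context) for one session object."""
    msgs = [f"**{e['party']}**: {e['text']}" for e in session["entries"]]
    generation = {
        "instruction": instruction + " Genereer het debat vanaf het begin.",
        "input": "",
        "output": " ".join(msgs),
    }
    cut = random.randint(1, len(msgs) - 1)  # assumed split point; needs >= 2 messages
    continuation = {
        "instruction": instruction + " Zet het debat voort.",  # illustrative wording
        "input": " ".join(msgs[:cut]),
        "output": " ".join(msgs[cut:]),
    }
    return generation, continuation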
Single Message Prompt
For every prompt dictionary, there is a 20% chance that we also create a single-message prompt dictionary, in which the model is instructed to generate only the next message in the debate; the desired output then also contains only one message. This lets us explicitly ask for just one message to be generated when necessary.
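In sketch form, with the 20% threshold taken from the text above and everything else illustrative:

import random

def maybe_single_message_prompt(instruction, msgs, cut):
    """With 20% probability, also build a prompt whose desired output is
    only the next message after the given context (cut < len(msgs))."""
    if random.random() < 0.2:
        return {
            "instruction": instruction + " Genereer alleen het volgende bericht.",  # illustrative wording
            "input": " ".join(msgs[:cut]),
            "output": msgs[cut],
        }
    return None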
PromptGenerator
The PromptGenerator (PG) is used by both the API and the TPG to convert a session object into a prompt dictionary. It does this through some light parsing and string formatting. An example of the format of a final instruction would be:
Je bent een Tweede Kamer debatssimulator. Je simuleert een 'Commissie debat' betreffend 'Verslag van een algemeen overleg, gehouden op 4 juli 2012, inzake Seksuele en Reproductieve Gezondheid en Rechten'. Aanwezig zijn: PVV (None), D66 (oppositie), VVD (coalitie), voorzitter. De regering is Rutte I (demissionair). Genereer het debat vanaf het begin.
This string would be the "instruction" field in the prompt dictionary.
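Assembled from the session object, such an instruction string could come out of a template like the following (a sketch that mirrors the example above; the function and its parameters are assumptions, not the real interface):

def build_instruction(session, attendees, cabinet_line):
    """Compose the instruction string following the example format above.
    `attendees` is a pre-formatted string such as
    'PVV (None), D66 (oppositie), VVD (coalitie), voorzitter'."""
    return (
        "Je bent een Tweede Kamer debatssimulator. "
        f"Je simuleert een '{session['debate_type']}' betreffend '{session['topic']}'. "
        f"Aanwezig zijn: {attendees}. "
        f"De regering is {cabinet_line}. "
        "Genereer het debat vanaf het begin."
    )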
Previous messages are represented in the "input" field as a string like this:
**VVD**: Onze plannen zijn als volgt. We willen nog ongeveer 100 mln. aan noodhulp doen en ongeveer 800 mln. multilateraal doen. Dan blijft er nog 500 mln. over om bilateraal te doen. **PVV**: Dat is weer typisch de VVD: heel stoere verhalen in de krant schrijven, maar in elk AO, zoals afgelopen maandag en ook vandaag weer, halfzachte pro-ontwikkelingshulpverhaaltjes houden. De VVD blaft, maar bijt nooit. **VVD**: Als we 70% bezuinigen op wat we nu uitgeven, dan zijn het geen halfbakken verhalen. We willen het alleen niet doen zoals de PVV. We willen niet alleen inzetten op noodhulp, omdat je dan steeds meer noodhulp gaat geven. We willen wel de ontwikkelings clubs zelf hun geld laten verdienen. Vervolgens willen we nog een aantal grote problemen helpen aanpakken, want we leven niet op een eiland. Daarin onderscheiden we ons waarschijnlijk van de PVV.
The prompt generator can function in either API mode or training-data mode, indicated by a boolean flag when initializing the object. This is necessary because slightly different logic is needed when generating a prompt for the training data versus a prompt for the API: for example, the "output" field must be populated when generating training data, but not when running inference through the API.
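A simplified sketch of how such a flag might gate the logic (method and parameter names are assumptions, not the real interface):

class PromptGenerator:
    """Sketch only: the real class performs the full parsing and formatting."""

    def __init__(self, api_mode: bool):
        self.api_mode = api_mode

    def to_prompt(self, instruction, input_text, output_text):
        prompt = {"instruction": instruction, "input": input_text, "output": ""}
        if not self.api_mode:
            # Training-data mode: the desired completion must be present.
            prompt["output"] = output_text
        return prompt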
Usage
Creating the training data for our task is done on the basis of the database created by the sync script. Run the generate_data.py script like this:
python scripts/generate_data.py --model_name alpaca --output_path train_data_alpaca.json
This will create a JSON file in the format described above, containing both training and validation data. We need to generate the training data separately for each model, as the models have different tokenizers and therefore different constraints on input length, among other things.
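To sanity-check the result, the generated file can be loaded and inspected (assuming the output path from the command above):

import json

with open("train_data_alpaca.json") as f:
    data = json.load(f)

print(len(data["train"]), "training prompts")
print(len(data["val"]), "validation prompts")
print(sorted(data["train"][0]))  # ['input', 'instruction', 'output']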