Dataset

Before describing the structure and operation of the syncing pipeline, we give a brief overview of the hosted dataset. The dataset at https://llms.openstate.eu/ consists of a docs/ folder and a json/ folder. The json/ folder contains about 2000 JSON files, each holding a large list of JSON objects that describe documents of the Dutch House of Representatives. For example, an object might look like this:

{
    "Id": "ec9315bf-e1d1-422e-a8fb-0000932ed927",
    "Soort": "Antwoord schriftelijke vragen",
    "DocumentNummer": "2019D16854",
    "Titel": null,
    "Onderwerp": "Antwoord op vragen van het lid Van Nispen over het mogelijk strafrechtelijke karakter van het sluiten van drugspanden door de burgemeester ",
    "Datum": "2019-04-24T00:00:00+02:00",
    "Vergaderjaar": "2018-2019",
    "Kamer": 2,
    "Volgnummer": -1,
    "Citeertitel": null,
    "Alias": null,
    "DatumRegistratie": "2019-04-23T00:00:00+02:00",
    "DatumOntvangst": null,
    "Aanhangselnummer": "181902411",
    "KenmerkAfzender": null,
    "Organisatie": "Tweede Kamer",
    "ContentType": "application/pdf",
    "ContentLength": 32150,
    "GewijzigdOp": "2022-04-26T09:55:26.683+02:00",
    "ApiGewijzigdOp": "2022-04-30T16:43:22.8081028Z",
    "Verwijderd": false
}

The Id field in these JSON objects can be used to form the unique path on the server where the actual file is hosted. The first two characters of the Id form the first directory, the next two characters form the subdirectory, and the full Id is the filename, followed by the extension that matches the ContentType. The file described in the example above is therefore located at: https://llms.openstate.eu/docs/ec/93/ec9315bf-e1d1-422e-a8fb-0000932ed927.pdf
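
The path rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the names BASE_URL, EXTENSIONS, and document_url are hypothetical, and the extension mapping only covers application/pdf because that is the only ContentType shown in the example.

```python
BASE_URL = "https://llms.openstate.eu/docs"

# Hypothetical mapping from ContentType to file extension.
# Only application/pdf is confirmed by the example object; extend as needed.
EXTENSIONS = {"application/pdf": ".pdf"}


def document_url(doc: dict) -> str:
    """Build the download URL for one JSON index object."""
    doc_id = doc["Id"]
    extension = EXTENSIONS[doc["ContentType"]]
    # First two characters -> directory, next two -> subdirectory,
    # full Id plus extension -> filename.
    return f"{BASE_URL}/{doc_id[:2]}/{doc_id[2:4]}/{doc_id}{extension}"


example = {
    "Id": "ec9315bf-e1d1-422e-a8fb-0000932ed927",
    "ContentType": "application/pdf",
}
print(document_url(example))
```

A ContentType that is missing from the mapping raises a KeyError here, which makes unexpected file types visible instead of silently producing a wrong path.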

doctypes

The files under json/ thus serve as an index of the complete set of files and contain the information needed to locate each one. The JSON objects also carry important metadata about each document, such as its type (Soort), which we call doctype internally. We convert each doctype name to a doctype_slug by lowercasing it and replacing whitespace, slashes, parentheses, and dots with dashes. This makes the names easier to work with in file systems.
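
A rough sketch of such a conversion is given below. The function name doctype_slug is used for illustration only and may differ from the project's own helper. Two details are assumptions that the description does not state: runs of adjacent replaced characters are collapsed into a single dash, and leading or trailing dashes are stripped.

```python
import re


def doctype_slug(name: str) -> str:
    """Turn a doctype name (Soort) into a file-system friendly slug."""
    # Replace whitespace, slashes, parentheses, and dots with dashes.
    # Assumption: a run of such characters becomes one dash, not several.
    slug = re.sub(r"[\s/().]+", "-", name)
    # Assumption: dashes at the start or end of the slug are dropped.
    return slug.strip("-").lower()


print(doctype_slug("Antwoord schriftelijke vragen"))
```

For the doctype in the example object, this yields antwoord-schriftelijke-vragen.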

Within the debot project, we focused on the following types of debate (with their corresponding doctype slugs):

Antwoord schriftelijke vragen: antwoord-schriftelijke-vragen
Plenair debat: stenogram
Commissiedebat: verslag-van-een-commissiedebat, verslag-van-een-algemeen-overleg, verslag-van-een-hoorzitting-rondetafelgesprek, verslag-van-een-wetgevingsoverleg, verslag-van-een-notaoverleg

We therefore only downloaded and processed files whose doctype slug appears in this list. The filter on doctype slugs is defined in debot/scope.py.
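
To illustrate how such a scope filter might be applied to the JSON index, here is a hypothetical sketch. It does not reproduce the contents of debot/scope.py: the names IN_SCOPE_SLUGS, doctype_slug, and in_scope are made up for this example, and the slug helper assumes that runs of replaced characters collapse into a single dash.

```python
import re

# Slugs taken from the list of doctypes in scope.
IN_SCOPE_SLUGS = frozenset({
    "antwoord-schriftelijke-vragen",
    "stenogram",
    "verslag-van-een-commissiedebat",
    "verslag-van-een-algemeen-overleg",
    "verslag-van-een-hoorzitting-rondetafelgesprek",
    "verslag-van-een-wetgevingsoverleg",
    "verslag-van-een-notaoverleg",
})


def doctype_slug(name: str) -> str:
    """Hypothetical slug helper; collapses runs of separators into one dash."""
    return re.sub(r"[\s/().]+", "-", name).strip("-").lower()


def in_scope(records: list[dict]) -> list[dict]:
    """Keep only index objects whose Soort maps to a slug in scope."""
    return [
        record
        for record in records
        # Skip objects without a Soort value instead of failing on them.
        if record.get("Soort")
        and doctype_slug(record["Soort"]) in IN_SCOPE_SLUGS
    ]
```

Applying the filter before downloading means that only the index files under json/ have to be read in full, while documents outside the scope are never fetched.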