Dataset
Before describing how the syncing pipeline is structured and how it works, we
give a brief overview of the hosted dataset. The dataset at
https://llms.openstate.eu/ consists of a docs/ folder and a json/ folder. The
json/ folder contains about 2000 JSON files, each containing a large list of
JSON objects that describe documents of the Dutch House of Representatives.
For example, an object might look like this:
{
  "Id": "ec9315bf-e1d1-422e-a8fb-0000932ed927",
  "Soort": "Antwoord schriftelijke vragen",
  "DocumentNummer": "2019D16854",
  "Titel": null,
  "Onderwerp": "Antwoord op vragen van het lid Van Nispen over het mogelijk strafrechtelijke karakter van het sluiten van drugspanden door de burgemeester ",
  "Datum": "2019-04-24T00:00:00+02:00",
  "Vergaderjaar": "2018-2019",
  "Kamer": 2,
  "Volgnummer": -1,
  "Citeertitel": null,
  "Alias": null,
  "DatumRegistratie": "2019-04-23T00:00:00+02:00",
  "DatumOntvangst": null,
  "Aanhangselnummer": "181902411",
  "KenmerkAfzender": null,
  "Organisatie": "Tweede Kamer",
  "ContentType": "application/pdf",
  "ContentLength": 32150,
  "GewijzigdOp": "2022-04-26T09:55:26.683+02:00",
  "ApiGewijzigdOp": "2022-04-30T16:43:22.8081028Z",
  "Verwijderd": false
}
The Id field in these JSON objects can be used to form the unique path on the
server where the actual file is hosted. The first two characters of the Id
form the first directory, the next two characters form the subdirectory, and
the entire Id serves as the filename, with the extension implied by the
ContentType. Thus, the file described in the above example would be located at:
https://llms.openstate.eu/docs/ec/93/ec9315bf-e1d1-422e-a8fb-0000932ed927.pdf
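The path rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the function name document_url and the EXTENSIONS mapping are hypothetical, and only the application/pdf entry is confirmed by the example object; other content types would need their own entries.

```python
# Assumed mapping from ContentType to file extension. Only the PDF entry
# is confirmed by the example object above; extend it as needed.
EXTENSIONS = {"application/pdf": ".pdf"}

BASE_URL = "https://llms.openstate.eu/docs"


def document_url(entry: dict) -> str:
    """Build the hosted file URL for one index entry (hypothetical helper)."""
    doc_id = entry["Id"]
    # Raises KeyError for content types that are not in the mapping.
    extension = EXTENSIONS[entry["ContentType"]]
    # First two characters -> directory, next two -> subdirectory,
    # full Id plus extension -> filename.
    return f"{BASE_URL}/{doc_id[:2]}/{doc_id[2:4]}/{doc_id}{extension}"


example = {
    "Id": "ec9315bf-e1d1-422e-a8fb-0000932ed927",
    "ContentType": "application/pdf",
}
print(document_url(example))
# https://llms.openstate.eu/docs/ec/93/ec9315bf-e1d1-422e-a8fb-0000932ed927.pdf
```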
doctypes
The files under json/ thus serve as an index for the full set of files and
contain the information needed to locate each of them. The JSON objects also
contain important metadata about each document, such as its type (Soort),
which we call doctype internally. We convert each doctype name to a
doctype_slug by replacing whitespace, slashes, parentheses, and dots with
dashes and lowercasing the name. This makes the names easier to work with in
the context of file systems.
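The slug conversion might look roughly like the sketch below. It assumes that a run of consecutive replaced characters collapses into a single dash, which the text does not state explicitly, and it does not trim leading or trailing dashes. The function name and the second input string are hypothetical and serve only as illustration.

```python
import re


def doctype_slug(doctype: str) -> str:
    """Turn a doctype name into a filesystem-friendly slug (illustrative)."""
    # Assumption: each run of whitespace, slashes, parentheses, or dots
    # becomes one dash; the result is then lowercased.
    return re.sub(r"[\s/().]+", "-", doctype).lower()


print(doctype_slug("Antwoord schriftelijke vragen"))
# antwoord-schriftelijke-vragen

# Hypothetical input showing how " / " collapses into a single dash.
print(doctype_slug("Verslag van een hoorzitting / rondetafelgesprek"))
# verslag-van-een-hoorzitting-rondetafelgesprek
```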
Within the context of the debot project, we focused on the following types
of debate (with corresponding doctype slugs):
- Antwoord schriftelijke vragen: antwoord-schriftelijke-vragen
- Plenair debat: stenogram
- Commissiedebat: verslag-van-een-commissiedebat,
  verslag-van-een-algemeen-overleg,
  verslag-van-een-hoorzitting-rondetafelgesprek,
  verslag-van-een-wetgevingsoverleg, verslag-van-een-notaoverleg
We thus only downloaded and processed files that correspond to these doctype
slugs. This filter on doctype slugs is defined in debot/scope.py.
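As a rough illustration of such a filter, the sketch below keeps only the index entries whose slug is in scope. It is not the contents of debot/scope.py: the names IN_SCOPE_SLUGS, in_scope, and filter_index are hypothetical, the slug helper assumes that runs of replaced characters collapse into one dash, and the sample entries are made up for the example.

```python
import json
import re

# The doctype slugs named in this section.
IN_SCOPE_SLUGS = frozenset({
    "antwoord-schriftelijke-vragen",
    "stenogram",
    "verslag-van-een-commissiedebat",
    "verslag-van-een-algemeen-overleg",
    "verslag-van-een-hoorzitting-rondetafelgesprek",
    "verslag-van-een-wetgevingsoverleg",
    "verslag-van-een-notaoverleg",
})


def doctype_slug(doctype: str) -> str:
    # Assumption: runs of whitespace, slashes, parentheses, or dots
    # collapse into a single dash before lowercasing.
    return re.sub(r"[\s/().]+", "-", doctype).lower()


def in_scope(entry: dict) -> bool:
    """Return True if the entry's doctype slug is one we process."""
    # Treat a missing or null Soort as out of scope.
    return doctype_slug(entry.get("Soort") or "") in IN_SCOPE_SLUGS


def filter_index(raw_json: str) -> list[dict]:
    """Parse one index file (a JSON list) and keep in-scope entries."""
    return [entry for entry in json.loads(raw_json) if in_scope(entry)]


# Made-up sample data standing in for one file from the json/ folder.
sample = json.dumps([
    {"Id": "example-1", "Soort": "Antwoord schriftelijke vragen"},
    {"Id": "example-2", "Soort": "Motie"},
    {"Id": "example-3", "Soort": None},
])
print([entry["Id"] for entry in filter_index(sample)])
# ['example-1']
```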