Introduction
The document parsers convert collections of text spans found within documents into structured debates, termed Sessions. These Sessions consist of a series of SessionEntries, each representing a contribution by a participant in the debate. For an in-depth understanding of the Session and SessionEntry constructs, refer to api.html.
Designed with flexibility in mind, these parsers facilitate easy adaptation to new types of documents. The dataset includes a diverse array of documents, each with its unique layout and structure, necessitating distinct parsing strategies.
Text Span Classification Based on Size and Font
The essence of document parser functionality lies in their ability to classify text spans within PDFs by their size and font attributes. This classification enables the parsers to distinguish between various types of textual content, enhancing the parsing process's adaptability and efficiency across different document formats.
For each specific parser implementation (e.g. antwoord_schriftelijke_vragen.py, stenogram.py, commissiedebat.py), a font__size2text_type mapping is established. This mapping associates specific font and size combinations with a textual category (such as Normal, Bold, Title, Irrelevant), for instance:
    font__size2text_type = {
        ("Univers", "9.05950"): "NORMAL",
        ("Univers-Black", "12.25700"): "TITLE",
        ("Univers-Black", "9.05950"): "BOLD",
        ...
    }
These textual types are then used within the parser logic to determine how specific text spans should be treated, such as whether they should contribute to the content of a session entry or be ignored (marked as irrelevant).
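As a minimal sketch of how such a lookup might be applied (the Span type and classify_span helper are illustrative assumptions, not the actual parser API; the mapping entries mirror the example above):

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A text span as extracted from a PDF (hypothetical shape)."""
    text: str
    font: str
    size: str  # sizes kept as strings to avoid float-rounding mismatches


font__size2text_type = {
    ("Univers", "9.05950"): "NORMAL",
    ("Univers-Black", "12.25700"): "TITLE",
    ("Univers-Black", "9.05950"): "BOLD",
}


def classify_span(span: Span) -> str:
    # Unknown font/size combinations default to IRRELEVANT, so unexpected
    # layout elements (headers, footers, page numbers) are dropped rather
    # than misattributed to a session entry.
    return font__size2text_type.get((span.font, span.size), "IRRELEVANT")


print(classify_span(Span("De voorzitter:", "Univers-Black", "9.05950")))  # BOLD
print(classify_span(Span("pagina 3", "Arial", "8.00000")))  # IRRELEVANT
```

Defaulting unmapped combinations to an irrelevant category matches the fail-closed stance described in the conclusion: unrecognized styling is discarded rather than guessed at.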
A tailor-made parser is developed for each doctype_slug, taking into account the variability in fonts and sizes across documents belonging to that type. Despite this variance within a doctype_slug, the approach has proven effective for parsing needs.
Hierarchical Tree Structure for Document Parsing
Text spans are organized within a hierarchical structure comprising five levels:
- Page Index: Specifies the page number of the text span.
- Column Index: Differentiates between columns on a page, organizing text spans into separate textual columns.
- In-col Index (Paragraph Index): Segments text within a column into paragraphs or cohesive blocks of text.
- Line Index: Divides paragraphs into individual lines.
- Inline Index: Identifies the specific text span or segment within a line.
This hierarchical organization excludes any non-essential text (like footers and headers, as they are marked as irrelevant), thus focusing on the text of interest. By organizing text spans into this tree-like hierarchy, parsers can effectively navigate through pages, columns, paragraphs, lines, and textual fragments, ensuring a logical and spatial understanding of the document's layout.
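The five-level hierarchy above can be sketched as a nested mapping keyed by each index level. The build_tree helper and the five-tuple addressing below are illustrative assumptions, not the parser's actual data structures:

```python
# Each relevant text span is addressed by a five-level index:
# (page, column, paragraph, line, inline).
spans = [
    ((0, 0, 0, 0, 0), "De"),
    ((0, 0, 0, 0, 1), "voorzitter:"),
    ((0, 0, 0, 1, 0), "Aan de orde is het debat."),
    ((0, 1, 0, 0, 0), "Tweede kolom."),
]


def build_tree(spans):
    """Nest spans into page -> column -> paragraph -> line -> {inline: text}."""
    tree = {}
    for (page, col, par, line, inline), text in spans:
        node = tree
        for key in (page, col, par, line):
            node = node.setdefault(key, {})
        node[inline] = text
    return tree


tree = build_tree(spans)
# Reassemble line 0 of page 0, column 0, paragraph 0:
line0 = " ".join(tree[0][0][0][0][i] for i in sorted(tree[0][0][0][0]))
print(line0)  # De voorzitter:
```

Because irrelevant spans are filtered out before the tree is built, any subtree can be traversed with the guarantee that everything in it is content of interest.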
Extractors and Traversal of Document Structure
Extractors are specialized classes designed to parse specific segments of the document based on the hierarchical structure. Each extractor implements a parse method to extract information from a given sub-document structure (a subtree within the overall document structure) and a starts method to determine whether a particular point in the document structure marks the beginning of a relevant extraction point.
For instance, an EntryExtractor might employ the starts method to identify the beginning of a speaker's contribution by looking for a line (a sub-document structure at the line level) containing boldly emphasized text that ends with a colon (:). This allows the identification of textual patterns that signify semantically important segments within the document.
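A hedged sketch of such a starts check follows. A line-level subtree is represented here simply as a list of (text_type, text) spans; the function name and this representation are assumptions based on the description above, not the real extractor interface:

```python
def line_starts_entry(line_spans):
    """True if the line opens a speaker contribution: bold text ending in ':'.

    `line_spans` is a list of (text_type, text) pairs, where text_type comes
    from the font__size2text_type classification (e.g. "BOLD", "NORMAL").
    """
    if not line_spans:
        return False
    text_type, text = line_spans[0]
    return text_type == "BOLD" and text.rstrip().endswith(":")


print(line_starts_entry([("BOLD", "De heer Jansen (VVD):")]))  # True
print(line_starts_entry([("NORMAL", "Dank u wel, voorzitter.")]))  # False
```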
The document parser orchestrates the extraction process by maintaining a list of extractors. It partitions the tree structure into disjoint subtrees, each associated with an extractor as determined by that extractor's starts method. This partitioning effectively isolates the segments of the document relevant to each extractor, allowing for targeted parsing.
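The partitioning step can be sketched over a flat sequence of line subtrees: each line that matches some extractor's starts opens a new segment owned by that extractor, and subsequent lines are appended to it. The stub class and partition helper below are illustrative assumptions, not the parser's actual code:

```python
class EntryExtractorStub:
    """Stand-in for a real extractor; only the `starts` contract matters here."""
    name = "entry"

    def starts(self, line_spans):
        text_type, text = line_spans[0]
        return text_type == "BOLD" and text.rstrip().endswith(":")


def partition(lines, extractors):
    """Split lines into disjoint (extractor_name, [lines]) segments."""
    segments = []
    current = None
    for line in lines:
        owner = next((e for e in extractors if e.starts(line)), None)
        if owner is not None:        # a new segment begins at this line
            current = (owner.name, [line])
            segments.append(current)
        elif current is not None:    # continue the currently open segment
            current[1].append(line)
        # lines before the first match belong to no extractor and are skipped
    return segments


lines = [
    [("BOLD", "De voorzitter:")],
    [("NORMAL", "Ik open de vergadering.")],
    [("BOLD", "Minister Kaag:")],
    [("NORMAL", "Dank u wel.")],
]
segments = partition(lines, [EntryExtractorStub()])
print(len(segments))  # 2
```

Because segments are disjoint and ordered, each extractor's parse method can later run on its own subtree without seeing material owned by another extractor.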
Representation Parsing and Extraction
After segmenting and targeting specific document parts for extraction, each subtree is processed by its respective extractor, which returns a structured JSON representation of the extracted information. This yields a set of representations, each encapsulating data from segments important to various extractors.
The document parser's parse_extractions method is then responsible for aggregating these extracted representations, combining them into a cohesive session and list of session entries, in accordance with the structure of the Session and SessionEntry objects described in api.html.
Conclusion
Document parsers offer a scalable solution for transforming vast volumes of PDF documents into structured Session and SessionEntry objects. Setting up the text span classification and constructing the extractors is all that is needed to support a new document type, which, given the potential volume of documents per type, is notably efficient.
To err on the side of caution, the document parsers are intentionally strict and fail easily. We would rather leave a document unparsed than let malformed data end up in the training data. As a consequence, the syncing pipeline keeps trying to re-parse the documents that have not been parsed yet. The majority of documents (~90%) are nonetheless parsed successfully.