Introduction
The document parsers convert collections of text spans found within documents into structured debates, termed Sessions. These Sessions consist of a series of SessionEntries, each representing a contribution by a participant in the debate. For an in-depth understanding of the Session and SessionEntry constructs, refer to api.html.
Designed with flexibility in mind, these parsers facilitate easy adaptation to new types of documents. The dataset includes a diverse array of documents, each with its unique layout and structure, necessitating distinct parsing strategies.
Text Span Classification Based on Size and Font
The essence of document parser functionality lies in their ability to classify text spans within PDFs by their size and font attributes. This classification enables the parsers to distinguish between various types of textual content, enhancing the parsing process's adaptability and efficiency across different document formats.
For each specific parser implementation (e.g. antwoord_schriftelijke_vragen.py, stenogram.py, commissiedebat.py), a font__size2text_type mapping is established. This mapping associates specific font and size combinations with a textual category (such as Normal, Bold, Title, Irrelevant), for instance:
    font__size2text_type = {
        ("Univers", "9.05950"): "NORMAL",
        ("Univers-Black", "12.25700"): "TITLE",
        ("Univers-Black", "9.05950"): "BOLD",
        ...
    }
These textual types are then used within the parser logic to determine how specific text spans should be treated, such as whether they should contribute to the content of a session entry or be ignored (marked as irrelevant).
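As a minimal sketch of how such a lookup might be applied (the Span type and classify_span helper are illustrative assumptions, not the actual parser API; the mapping entries mirror the example above):

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A text span as extracted from a PDF (hypothetical shape)."""
    text: str
    font: str
    size: str  # sizes kept as strings to avoid float-rounding mismatches


font__size2text_type = {
    ("Univers", "9.05950"): "NORMAL",
    ("Univers-Black", "12.25700"): "TITLE",
    ("Univers-Black", "9.05950"): "BOLD",
}


def classify_span(span: Span) -> str:
    # Unknown font/size combinations default to IRRELEVANT, so unexpected
    # layout elements (headers, footers, page numbers) are dropped rather
    # than misattributed to a session entry.
    return font__size2text_type.get((span.font, span.size), "IRRELEVANT")


print(classify_span(Span("De voorzitter:", "Univers-Black", "9.05950")))  # BOLD
print(classify_span(Span("pagina 3", "Arial", "8.00000")))  # IRRELEVANT
```

Defaulting unmapped combinations to an irrelevant category matches the fail-closed stance described in the conclusion: unrecognized styling is discarded rather than guessed at.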
A tailor-made parser is developed for each doctype_slug, taking into account the variability in fonts and sizes across documents belonging to that type. Despite this variance within a doctype_slug, the approach has proven effective for parsing needs.
Hierarchical Tree Structure for Document Parsing
Text spans are organized within a hierarchical structure comprising five levels:
- Page Index: Specifies the page number of the text span.
- Column Index: Differentiates between columns on a page, organizing text spans into separate textual columns.
- In-col Index (Paragraph Index): Segments text within a column into paragraphs or cohesive blocks of text.
- Line Index: Divides paragraphs into individual lines.
- Inline Index: Identifies the specific text span or segment within a line.
This hierarchical organization excludes any non-essential text (like footers and headers, as they are marked as irrelevant), thus focusing on the text of interest. By organizing text spans into this tree-like hierarchy, parsers can effectively navigate through pages, columns, paragraphs, lines, and textual fragments, ensuring a logical and spatial understanding of the document's layout.
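The five-level hierarchy above can be sketched as a nested mapping keyed by each index level. The build_tree helper and the five-tuple addressing below are illustrative assumptions, not the parser's actual data structures:

```python
# Each relevant text span is addressed by a five-level index:
# (page, column, paragraph, line, inline).
spans = [
    ((0, 0, 0, 0, 0), "De"),
    ((0, 0, 0, 0, 1), "voorzitter:"),
    ((0, 0, 0, 1, 0), "Aan de orde is het debat."),
    ((0, 1, 0, 0, 0), "Tweede kolom."),
]


def build_tree(spans):
    """Nest spans into page -> column -> paragraph -> line -> {inline: text}."""
    tree = {}
    for (page, col, par, line, inline), text in spans:
        node = tree
        for key in (page, col, par, line):
            node = node.setdefault(key, {})
        node[inline] = text
    return tree


tree = build_tree(spans)
# Reassemble line 0 of page 0, column 0, paragraph 0:
line0 = " ".join(tree[0][0][0][0][i] for i in sorted(tree[0][0][0][0]))
print(line0)  # De voorzitter:
```

Because irrelevant spans are filtered out before the tree is built, any subtree can be traversed with the guarantee that everything in it is content of interest.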
Extractors and Traversal of Document Structure
Extractors are specialized classes designed to parse specific segments of the document based on the hierarchical structure. Each extractor implements a parse method to extract information from a given sub-document structure (a subtree within the overall document structure) and a starts method to determine whether a particular point in the document structure marks the beginning of a relevant extraction point.
For instance, an EntryExtractor might employ the starts method to identify the beginning of a speaker's contribution by looking for a line (a sub-document structure at the line level) containing boldly emphasized text that ends with a colon (:). This allows the identification of textual patterns that signify semantically important segments within the document.
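A hedged sketch of such a starts check follows. A line-level subtree is represented here simply as a list of (text_type, text) spans; the function name and this representation are assumptions based on the description above, not the real extractor interface:

```python
def line_starts_entry(line_spans):
    """True if the line opens a speaker contribution: bold text ending in ':'.

    `line_spans` is a list of (text_type, text) pairs, where text_type comes
    from the font__size2text_type classification (e.g. "BOLD", "NORMAL").
    """
    if not line_spans:
        return False
    text_type, text = line_spans[0]
    return text_type == "BOLD" and text.rstrip().endswith(":")


print(line_starts_entry([("BOLD", "De heer Jansen (VVD):")]))  # True
print(line_starts_entry([("NORMAL", "Dank u wel, voorzitter.")]))  # False
```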
The document parser orchestrates the extraction process by maintaining a list of extractors. It partitions the tree structure into disjoint subtrees, each associated with an extractor as determined by that extractor's starts method. This partitioning effectively isolates the segments of the document relevant to each extractor, allowing for targeted parsing.
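The partitioning step can be sketched over a flat sequence of line subtrees: each line that matches some extractor's starts opens a new segment owned by that extractor, and subsequent lines are appended to it. The stub class and partition helper below are illustrative assumptions, not the parser's actual code:

```python
class EntryExtractorStub:
    """Stand-in for a real extractor; only the `starts` contract matters here."""
    name = "entry"

    def starts(self, line_spans):
        text_type, text = line_spans[0]
        return text_type == "BOLD" and text.rstrip().endswith(":")


def partition(lines, extractors):
    """Split lines into disjoint (extractor_name, [lines]) segments."""
    segments = []
    current = None
    for line in lines:
        owner = next((e for e in extractors if e.starts(line)), None)
        if owner is not None:        # a new segment begins at this line
            current = (owner.name, [line])
            segments.append(current)
        elif current is not None:    # continue the currently open segment
            current[1].append(line)
        # lines before the first match belong to no extractor and are skipped
    return segments


lines = [
    [("BOLD", "De voorzitter:")],
    [("NORMAL", "Ik open de vergadering.")],
    [("BOLD", "Minister Kaag:")],
    [("NORMAL", "Dank u wel.")],
]
segments = partition(lines, [EntryExtractorStub()])
print(len(segments))  # 2
```

Because segments are disjoint and ordered, each extractor's parse method can later run on its own subtree without seeing material owned by another extractor.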
Representation Parsing and Extraction
After segmenting and targeting specific document parts for extraction, each subtree is processed by its respective extractor, which returns a structured JSON representation of the extracted information. This yields a set of representations, each encapsulating data from segments important to various extractors.
The document parser's parse_extractions method is then responsible for aggregating these extracted representations, combining them into a cohesive session and list of session entries, in accordance with the structure of the Session and SessionEntry objects described in api.html.
Conclusion
Document parsers offer a scalable solution for transforming vast volumes of PDF documents into structured Session and SessionEntry objects. Setting up the text span classification and constructing the extractors is all that is needed to support a new document type, which, given the potential volume of documents per type, is notably efficient.
To err on the side of caution, the document parsers are intentionally strict and fail easily. We would rather leave a document unparsed than let malformed data end up in the training data. As a consequence, the syncing pipeline keeps trying to re-parse the documents that have not been parsed yet. The majority of documents (~90%) are nonetheless parsed successfully.