Introduction

The Merge Instructions Framework combines data from multiple pandas DataFrames through configurable merge operations. It models each merge as an object, so merge strategies can be defined clearly, configured, reused, and nested to describe complex data relationships.

At its core, the framework provides a base class for defining and executing merge operations on top of pandas' merging capabilities. Operations can also be serialized to and from JSON, which allows dynamic configuration and sharing.

Core Components

MergeOperation

  • Purpose: Serves as the base class for all merge operations, encapsulating the common logic for executing a merge between two `DataFrame` objects or nested `MergeOperation` instances.
  • Key Methods:
    • `merge(source2df)`: Executes the configured merge operation, utilizing `source2df`, a dictionary mapping source identifiers to `DataFrame` objects.
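
To make the description above concrete, here is a minimal sketch of how such a base class could resolve its inputs and delegate to `pandas.merge`. It is an illustration under stated assumptions, not the framework's actual implementation: the class name `MergeOperationSketch`, the helper `_resolve`, the `how` class attribute, and the sample frames are all hypothetical.

```python
import pandas as pd


class MergeOperationSketch:
    """Hypothetical stand-in for MergeOperation; not the framework's code."""

    # Assumption: each subclass would override this with "outer", "left", or "right".
    how = "inner"

    def __init__(self, left, right, left_on, right_on):
        # left/right are either source identifiers or nested operations.
        self.left = left
        self.right = right
        self.left_on = left_on
        self.right_on = right_on

    def _resolve(self, source, source2df):
        # A nested operation is executed first; an identifier is looked up.
        if isinstance(source, MergeOperationSketch):
            return source.merge(source2df)
        return source2df[source]

    def merge(self, source2df):
        left_df = self._resolve(self.left, source2df)
        right_df = self._resolve(self.right, source2df)
        return pd.merge(
            left_df,
            right_df,
            how=self.how,
            left_on=self.left_on,
            right_on=self.right_on,
        )


# Made-up sample data for the illustration.
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 20]})
customers = pd.DataFrame({"cust_id": [10, 30], "name": ["Ada", "Bo"]})
regions = pd.DataFrame({"cust_id": [10], "region": ["EU"]})

# The inner operation is used as the left input of a second operation.
inner = MergeOperationSketch("orders", "customers", "customer_id", "cust_id")
nested = MergeOperationSketch(inner, "regions", "cust_id", "cust_id")

sources = {"orders": orders, "customers": customers, "regions": regions}
nested_result = nested.merge(sources)
```

Because a nested operation is resolved recursively, a single `source2df` dictionary is enough to feed an arbitrarily deep tree of merges in this sketch.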

Subclasses for Merge Strategies

InnerMerge

  • Description: Executes an inner join between two data sources, returning rows with matching keys in both.

OuterMerge

  • Description: Performs a full outer join, returning all rows from both data sources; where a row has no match in the other source, the columns from that source are filled with `NaN`.

LeftMerge

  • Description: Executes a left outer join, returning all rows from the left data source together with matched rows from the right; right-hand columns are filled with `NaN` where no match exists.

RightMerge

  • Description: Conducts a right outer join, returning all rows from the right data source together with matched rows from the left; left-hand columns are filled with `NaN` where no match exists.
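
The four strategies above appear to correspond to the values of the `how` argument of `pandas.merge`. The following sketch uses plain pandas rather than the framework's classes to show how the row sets differ; the two sample frames and their column names are made up for the illustration.

```python
import pandas as pd

# Keys 2 and 3 appear on both sides; key 1 only on the left, key 4 only on the right.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Run the same merge once per join type.
results = {
    how: pd.merge(left, right, on="key", how=how)
    for how in ("inner", "outer", "left", "right")
}

# Number of rows each join type returns.
row_counts = {how: len(df) for how, df in results.items()}

# In the left join, key 1 has no partner, so its right_val is NaN.
missing_in_left_join = int(results["left"]["right_val"].isna().sum())
```

With this data the inner join keeps 2 rows, the left and right joins keep 3 rows each, and the outer join keeps 4.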

Serialization and Deserialization

  • Purpose: Enables the conversion of merge operations to a dictionary format for JSON serialization (`to_dict()`) and the construction of merge operations from such dictionaries (`from_dict(data)`).
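
One possible way to implement such a round trip is to store a type tag in the dictionary and use it to pick the right subclass when loading. The sketch below is pandas-free and purely illustrative: the names `MergeSpec`, `InnerSpec`, `LeftSpec`, the `registry` and `kind` attributes, and the dictionary keys are assumptions, not the framework's actual schema.

```python
import json


class MergeSpec:
    """Hypothetical stand-in that only demonstrates the serialization round trip."""

    registry = {}  # maps a type tag to the class that handles it
    kind = "base"

    def __init_subclass__(cls, **kwargs):
        # Register every subclass under its own type tag.
        super().__init_subclass__(**kwargs)
        MergeSpec.registry[cls.kind] = cls

    def __init__(self, left, right, left_on, right_on):
        self.left = left
        self.right = right
        self.left_on = left_on
        self.right_on = right_on

    def to_dict(self):
        def encode(source):
            # Nested operations are serialized recursively; identifiers stay strings.
            return source.to_dict() if isinstance(source, MergeSpec) else source

        return {
            "type": self.kind,
            "left": encode(self.left),
            "right": encode(self.right),
            "left_on": self.left_on,
            "right_on": self.right_on,
        }

    @classmethod
    def from_dict(cls, data):
        def decode(source):
            # A dictionary marks a nested operation; anything else is an identifier.
            return cls.from_dict(source) if isinstance(source, dict) else source

        target = MergeSpec.registry[data["type"]]
        return target(
            decode(data["left"]),
            decode(data["right"]),
            data["left_on"],
            data["right_on"],
        )


class InnerSpec(MergeSpec):
    kind = "inner"


class LeftSpec(MergeSpec):
    kind = "left"


# A left merge whose left input is itself an inner merge.
spec = LeftSpec(InnerSpec("a", "b", "id", "id"), "c", "id", "c_id")
payload = json.dumps(spec.to_dict())
restored = MergeSpec.from_dict(json.loads(payload))
```

Keeping the type tag inside the dictionary lets the base class reconstruct any registered strategy, which would match the way `MergeOperation.from_dict` is called on the base class in the usage examples.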

Usage Examples

Creating a Merge Operation

To combine two data frames with an inner merge based on shared keys:

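# df1 and df2 are pandas DataFrames that have already been loaded.
merge_operation = InnerMerge(
    left="data_source_1",
    right="data_source_2",
    left_on="key_column_1",
    right_on="key_column_2",
)
# The dictionary keys correspond to the source identifiers given above.
result_df = merge_operation.merge({"data_source_1": df1, "data_source_2": df2})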

Serialization to JSON

To serialize a merge operation for configuration or sharing, convert it to a dictionary, which can then be written out as JSON with the standard `json` module:

merge_config = merge_operation.to_dict()

Deserialization from JSON

To restore a merge operation from a serialized configuration:

# source2df maps source identifiers to DataFrames, as in the first example.
restored_merge_operation = MergeOperation.from_dict(merge_config)
merged_df = restored_merge_operation.merge(source2df)

Conclusion

The Merge Instructions Framework provides a robust foundation for managing and executing DataFrame merge operations in a structured, object-oriented manner. Its design prioritizes clarity, configurability, and reusability, making it a convenient asset in large data pipelines that depend on merge operations.