Introduction
The Merge Instructions Framework combines data from multiple pandas DataFrames through a variety of merge operations. It takes an object-oriented approach, so merge strategies can be defined in a clear, configurable, and reusable way and adapted to complex data relationships.
At its core, the framework provides a base for defining and executing merge operations on top of pandas' merging capabilities, and it adds serialization to and from JSON for dynamic configuration and sharing.
Core Components
MergeOperation
- Purpose: Serves as the base class for all merge operations, encapsulating the common logic for executing a merge between two `DataFrame` objects or nested `MergeOperation` instances.
- Key Methods:
  - `merge(source2df)`: Executes the configured merge operation using `source2df`, a dictionary mapping source identifiers to `DataFrame` objects.
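The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: the constructor signature mirrors the usage examples in this document, while the `how` class attribute, the `_resolve` helper, and the sample frames are assumptions made for the sketch.

```python
from typing import Dict, Union

import pandas as pd

# A source is either an identifier (looked up in source2df) or a nested operation.
Source = Union[str, "MergeOperation"]


class MergeOperation:
    # Assumption: each subclass would override this with its pandas join type.
    how = "inner"

    def __init__(self, left: Source, right: Source, left_on: str, right_on: str):
        self.left = left
        self.right = right
        self.left_on = left_on
        self.right_on = right_on

    def _resolve(self, source: Source, source2df: Dict[str, pd.DataFrame]) -> pd.DataFrame:
        # Hypothetical helper: run a nested operation first, otherwise look up the frame.
        if isinstance(source, MergeOperation):
            return source.merge(source2df)
        return source2df[source]

    def merge(self, source2df: Dict[str, pd.DataFrame]) -> pd.DataFrame:
        left_df = self._resolve(self.left, source2df)
        right_df = self._resolve(self.right, source2df)
        return pd.merge(
            left_df, right_df, how=self.how,
            left_on=self.left_on, right_on=self.right_on,
        )


# Made-up sample data for the sketch.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [10, 20], "region_id": [100, 200]})
regions = pd.DataFrame({"region_id": [100], "region": ["north"]})

# The left side of the outer operation is itself a merge operation.
inner = MergeOperation(left="orders", right="customers",
                       left_on="customer_id", right_on="customer_id")
nested = MergeOperation(left=inner, right="regions",
                        left_on="region_id", right_on="region_id")
result = nested.merge({"orders": orders, "customers": customers, "regions": regions})
```

Resolving each side recursively is what lets one operation take another as its input, so a chain of merges can be described as a single nested object.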
Subclasses for Merge Strategies
InnerMerge
- Description: Executes an inner join between two data sources, returning rows with matching keys in both.
OuterMerge
- Description: Performs a full outer join, returning all rows from both data sources. Where a row has no match, the columns from the other source are filled with `NaN`.
LeftMerge
- Description: Executes a left outer join, returning all rows from the left data source, and matched rows from the right, with `NaN` for non-matching rows.
RightMerge
- Description: Conducts a right outer join, returning all rows from the right data source, matched rows from the left, and `NaN` for non-matching rows.
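The four strategies presumably correspond to the `how` argument of `pandas.merge` (`"inner"`, `"outer"`, `"left"`, `"right"`); that mapping is an assumption, since the subclass internals are not shown here. The sketch below uses plain pandas and made-up frames to show how the row sets differ.

```python
import pandas as pd

# Made-up frames: keys 2 and 3 appear on both sides, 1 only left, 4 only right.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Run the same merge once per join type.
results = {
    how: pd.merge(left, right, how=how, on="key")
    for how in ("inner", "outer", "left", "right")
}
row_counts = {how: len(df) for how, df in results.items()}

# In the outer join, key 4 has no left-hand match, so its left_val is NaN.
missing_left = int(results["outer"]["left_val"].isna().sum())
```

With this data the inner join keeps only the two shared keys, the left and right joins each keep three rows, and the outer join keeps all four keys.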
Serialization and Deserialization
- Purpose: Enables the conversion of merge operations to a dictionary format for JSON serialization (`to_dict()`) and the construction of merge operations from such dictionaries (`from_dict(data)`).
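One way such a round trip could work is sketched below. The dictionary layout is an assumption: the `"type"` key, the subclass registry, and the nested encoding are illustrative choices, not the framework's documented format. The merge logic is left out so the sketch can focus on serialization.

```python
import json


class MergeOperation:
    # Hypothetical registry mapping class names to subclasses, used by from_dict.
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        MergeOperation.registry[cls.__name__] = cls

    def __init__(self, left, right, left_on, right_on):
        self.left = left
        self.right = right
        self.left_on = left_on
        self.right_on = right_on

    def to_dict(self):
        def encode(source):
            # Nested operations become nested dictionaries; identifiers stay strings.
            return source.to_dict() if isinstance(source, MergeOperation) else source

        return {
            "type": type(self).__name__,  # assumed discriminator key
            "left": encode(self.left),
            "right": encode(self.right),
            "left_on": self.left_on,
            "right_on": self.right_on,
        }

    @classmethod
    def from_dict(cls, data):
        def decode(source):
            return MergeOperation.from_dict(source) if isinstance(source, dict) else source

        # Look up the concrete subclass by the name stored under "type".
        op_cls = MergeOperation.registry[data["type"]]
        return op_cls(
            left=decode(data["left"]),
            right=decode(data["right"]),
            left_on=data["left_on"],
            right_on=data["right_on"],
        )


class InnerMerge(MergeOperation):
    pass


class LeftMerge(MergeOperation):
    pass


operation = LeftMerge(
    left=InnerMerge(left="a", right="b", left_on="id", right_on="id"),
    right="c",
    left_on="id",
    right_on="c_id",
)
payload = json.dumps(operation.to_dict())  # plain JSON text
restored = MergeOperation.from_dict(json.loads(payload))
```

Storing the class name in the dictionary lets `MergeOperation.from_dict` return the right subclass without the caller knowing which strategy was serialized.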
Usage Examples
Creating a Merge Operation
To combine two data frames with an inner merge based on shared keys:
    merge_operation = InnerMerge(
        left="data_source_1",
        right="data_source_2",
        left_on="key_column_1",
        right_on="key_column_2",
    )
    result_df = merge_operation.merge({"data_source_1": df1, "data_source_2": df2})
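For orientation, the call above would plausibly reduce to the plain pandas merge below. This assumes `InnerMerge` delegates to `pandas.merge` with `how="inner"`; the contents of `df1` and `df2` are made up for the example.

```python
import pandas as pd

# Hypothetical sample data using the column names from the example above.
df1 = pd.DataFrame({"key_column_1": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"key_column_2": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

# Inner join on differently named key columns.
result_df = pd.merge(
    df1, df2, how="inner",
    left_on="key_column_1", right_on="key_column_2",
)
```

Because the key columns have different names, pandas keeps both `key_column_1` and `key_column_2` in the result; only the rows with keys 2 and 3 survive the inner join.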
Serialization to JSON
To serialize a merge operation for configuration or sharing, convert it to a dictionary, which can then be written out as JSON:
    merge_config = merge_operation.to_dict()
Deserialization from JSON
To restore a merge operation from a serialized configuration:
    restored_merge_operation = MergeOperation.from_dict(merge_config)
    merged_df = restored_merge_operation.merge(source2df)
Conclusion
The Merge Instructions Framework provides a robust foundation for managing and executing DataFrame merge operations in a structured, object-oriented manner. Its design prioritizes clarity, configurability, and reusability, making it a convenient asset in large data pipelines that depend on merge operations.