Transform Partitioning
In Ascend, each Transform can adopt one of three previously discussed strategies to process your data set. The strategy you select will influence not only the number of output partitions but also which input partitions are considered to compute an output partition.
Transform Partitioning Strategies
Transforms use one of three strategies: full reduction, mapping, or timestamp partitioning. You've encountered these before with Read Connector partitioning strategies and there's not much difference in behavior.
- Full Reduction: This strategy amalgamates all input partitions into a single output partition, processed in one Spark task. It's always limited to producing one output partition. Any modification in the input triggers a complete reprocessing of the transform with all input partitions.
- Mapping: Here, each input partition is processed individually, with each resulting in a corresponding output partition. This "one-to-one" approach means the number of output partitions mirrors the number of input partitions. Only the input partitions that have undergone changes are processed.
- Timestamp Partition: Ascend segments input partitions based on a date/time column, processing each time-specific segment individually. This approach is adept at handling late-arriving data, with output partitions corresponding to each timestamp granule. Changes in input partitions prompt a review of relevant timestamp partitions, ensuring all pertinent records are processed.
Advanced Partition Handling and Multiple Inputs
Advanced manipulation of input partitions includes metadata augmentation, partition filtering based on specific conditions, and union operations to combine partitions from different components. S
-
Metadata Integration: Ascend enables the enrichment of input partition records with selected metadata fields like
Ascend URL
,Created At
, and/orUpload At
. This feature facilitates a deeper understanding of data provenance and temporal context. -
Partition Filters: This powerful tool filters input partitions based on specific conditions, preserving data incrementality and pipeline efficiency. It supports both static filtering for single inputs and dynamic filtering for scenarios involving multiple inputs, ensuring that only relevant partitions trigger data processing.
-
Union All and Incremental Processing with Multiple Inputs: Ascend provides the capability to merge partitions from various inputs, a strategy that shines in multi-source data integration scenarios. When dealing with multiple inputs, the default behavior treats every additional input beyond the primary as a full dataset. However, Ascend's advanced partition handling allows for defining specific relationships between partitions of different inputs, maintaining pipeline incrementality and ensuring data integrity through accurate joins.
Partitioning with Multiple Inputs
The partitioning strategy you choose primarily affects input[0]
(your primary input), which serves as your main data source. By default, any additional inputs you incorporate into the component are processed as a full dataset, known as a full reduction.
Should any part of these additional inputs change—for instance, input[1]
—Ascend triggers a reprocessing of all parts of input[0]
. This method ensures your data joins are accurate, safeguarding data integrity. However, it means that, by default, your pipeline doesn't process data incrementally.
But there's flexibility. If your operational logic permits—for example, if only a specific segment of input[0]
needs to interact with certain segments of input[1]
—Ascend offers advanced partition handling. You can define these precise relationships, enabling you to keep your pipeline both incremental and your data integrity intact.
Let’s say you have an existing encyclopedia (your primary input) and a collection of supplementary articles from a second collection you’d like to include (your secondary input).
By default, incorporating new articles requires revisiting and possibly revising the entire main collection, a time-consuming process ensuring everything aligns but hindering quick updates.
However, you can specify which articles need updating based on changes in supplementary articles, selectively revising only relevant sections of your encyclopedia. This approach keeps your work current without overhauling the entire collection for every small addition or change.
Here’s how you could do that:
Metadata Integration: Tag articles with details about their origin and history, enriching the reader's understanding.
Partition Filters: Selectively include articles based on specific criteria, ensuring only the most relevant information is featured.
Union All and Incremental Processing: Merge relevant articles from the encyclopedia and collection of articles to update sections without redoing the whole volume.
Execution Parallelism and Strategy Selection
Regardless of the chosen strategy, Transforms execute in parallel across numerous compute resources. This parallelism is an extension of Spark's inherent capabilities, adding a layer of efficiency and flexibility in data processing. When setting up your transform in PySpark or SnowPark, you'll specify the partitioning strategy, with options for advanced settings enhancing your control over the process.
Updated 9 months ago