Read Connector Partitioning
Read connectors determine the initial partitioning strategy for the dataset.
Blob Store Connector Partitioning
Ascend's native blob store connectors typically create one partition for each file discovered. However, you have the flexibility to modify this behavior. By implementing Object Aggregation Strategies, it's possible to group multiple smaller files into a single partition, enhancing data management efficiency.
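To make the idea of grouping small files concrete, here is a minimal sketch of one possible aggregation rule: greedily packing files into groups up to a target size. It is not Ascend's Object Aggregation Strategies configuration; the function name group_files_by_size, the (name, size) input shape, and the target_bytes parameter are all assumptions made for illustration.

```python
from typing import List, Tuple


def group_files_by_size(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack (name, size) pairs into groups of roughly target_bytes.

    Hypothetical helper: a file larger than target_bytes still gets a group
    of its own, so no file is ever dropped.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sorted(files):  # sort by name for a stable grouping
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Illustrative file listing: sizes are in bytes and entirely made up.
files = [("a.csv", 40), ("b.csv", 30), ("c.csv", 50), ("d.csv", 10)]
partitions = group_files_by_size(files, target_bytes=100)
```

Each inner list stands for one partition that would cover several files. A stable ordering matters here: if the same files landed in different groups on every listing, every partition would look changed each time.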
Partitioning Strategies for Other Connectors
When working with other connectors, such as databases, warehouses, and APIs, the responsibility of designing the partitioning strategy falls to you. In Ascend, each object returned by the list_objects function represents a separate data partition. Here are some common partitioning strategies to consider:
- Single Partition Strategy: Ideal for smaller data sets under a few gigabytes that don't require incremental processing, this strategy treats the entire data set as a single partition.
- Time-Based Partitioning: This involves creating partitions based on the created_at timestamp of records, organized into hour, day, or month groupings. It suits data sets where records are updated for a while after creation and then become static, like a user's orders table.
- Update-Based Partitioning: Similar to time-based partitioning but focuses on the updated_at timestamp. This strategy is useful for data sets where any record might be updated, making it essential to selectively retrieve the updated records, such as in a user table.
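The time-based and update-based strategies above can be sketched as a function that turns records into one object per calendar day. This is an illustration under stated assumptions, not Ascend's actual list_objects signature: the function name list_daily_partitions, the record fields id, created_at, and updated_at, and the returned name/fingerprint keys are all hypothetical.

```python
import hashlib
from collections import defaultdict
from datetime import datetime


def list_daily_partitions(records, timestamp_field="created_at"):
    """Return one {'name', 'fingerprint'} dict per calendar day in `records`.

    Hypothetical helper: pass timestamp_field="updated_at" to get
    update-based rather than time-based partitioning.
    """
    by_day = defaultdict(list)
    for rec in records:
        by_day[rec[timestamp_field].strftime("%Y-%m-%d")].append(rec)

    objects = []
    for day in sorted(by_day):
        # The fingerprint covers every row's id and updated_at, so editing
        # any row assigned to this day changes the day's fingerprint.
        digest = hashlib.sha256()
        for rec in sorted(by_day[day], key=lambda r: r["id"]):
            digest.update(f"{rec['id']}|{rec['updated_at'].isoformat()}".encode())
        objects.append({"name": day, "fingerprint": digest.hexdigest()})
    return objects


# Made-up sample rows standing in for an orders table.
orders = [
    {"id": 1, "created_at": datetime(2024, 1, 1, 9), "updated_at": datetime(2024, 1, 1, 9)},
    {"id": 2, "created_at": datetime(2024, 1, 1, 17), "updated_at": datetime(2024, 1, 2, 8)},
    {"id": 3, "created_at": datetime(2024, 1, 2, 11), "updated_at": datetime(2024, 1, 2, 11)},
]
objects = list_daily_partitions(orders)
```

In a real connector the grouping and fingerprint would more likely come from an aggregate query against the source system than from rows held in memory. The single partition strategy is the degenerate case: one object covering the whole data set.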
If you need guidance in designing a partition strategy tailored to your specific data set, don't hesitate to reach out to your Ascend field engineer for a detailed discussion.
Ascend’s Processing of Partitions
With the list of objects returned, Ascend can identify whether partitions have been added, deleted, or remain unchanged. For existing partitions, Ascend uses a fingerprint SHA generated from each object to determine whether the partition's data has changed. Only the partitions found to be "changed" (those that are newly created, updated, or deleted) are forwarded for downstream processing in Transforms.
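The comparison described here can be sketched as a diff between two listings. This is a simplified model, not Ascend's internal implementation: the function name diff_partitions and the representation of a listing as a mapping from partition name to fingerprint are assumptions.

```python
def diff_partitions(previous, current):
    """Classify partitions by comparing two {name: fingerprint} mappings."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "deleted": sorted(previous.keys() - current.keys()),
        "updated": sorted(n for n in shared if previous[n] != current[n]),
        "unchanged": sorted(n for n in shared if previous[n] == current[n]),
    }


# Fingerprints are shortened placeholders, not real SHA values.
previous = {"2024-01-01": "aaa", "2024-01-02": "bbb", "2024-01-03": "ccc"}
current = {"2024-01-02": "bbb", "2024-01-03": "zzz", "2024-01-04": "ddd"}

result = diff_partitions(previous, current)
# Only these partitions would need downstream reprocessing.
changed = result["added"] + result["updated"] + result["deleted"]
```

In this example the partition for 2024-01-02 keeps the same fingerprint, so it is the only one left out of the changed set.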
Let's go back to our encyclopedia set for a minute.
By default, Ascend’s native blob store connectors create an encyclopedia where each book is dedicated to a specific, narrow topic. However, you can use object aggregation strategies to condense these topics into comprehensive volumes, making it easier to manage and reference.
When it comes to other connectors like databases and APIs, where you design the partitioning strategy, think of it as being the editor of your encyclopedia set. You decide how to categorize and divide the content. You can have one large book encompassing all topics (single partition), or you can organize the books chronologically, by either when a topic was created or when it was last updated (time- or update-based partitioning).