D-SHAs, or Data SHAs, are Ascend’s way of knowing whether data has changed. Ascend calculates a SHA over all the data in each partition of a component. By keeping track of these values, Ascend knows which partitions need to be reprocessed downstream. This saves compute (and cost) by avoiding redundant processing where data has not changed. The process works as follows:
- An Ascend component is triggered to run and reprocesses all or some of its partitions
- While processing its partitions, Ascend calculates the D-SHA of the data in each output partition
- These D-SHAs are then passed to the downstream component, which uses them to determine which of its partitions need to be processed. If the D-SHA for a given partition is unchanged, that partition does not need to be reprocessed.
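The per-partition hashing step above can be sketched as follows. This is only an illustration, not Ascend's implementation: the choice of SHA-256, the string representation of rows, and the names `partition_dsha` and `compute_dshas` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterable, List


def partition_dsha(rows: Iterable[str]) -> str:
    """Hash every row of one partition into a single hex digest.

    Hypothetical helper: SHA-256 over length-prefixed UTF-8 rows is an
    assumption, not Ascend's documented algorithm.
    """
    digest = hashlib.sha256()
    for row in rows:
        data = row.encode("utf-8")
        # Length-prefix each row so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def compute_dshas(partitions: Dict[str, List[str]]) -> Dict[str, str]:
    """Return one digest per output partition, keyed by partition id."""
    return {pid: partition_dsha(rows) for pid, rows in partitions.items()}


# Two runs of the same component: only partition "p2" has different data.
first_run = compute_dshas({"p1": ["a,1", "b,2"], "p2": ["c,3"]})
second_run = compute_dshas({"p1": ["a,1", "b,2"], "p2": ["c,4"]})
```

Comparing `first_run` and `second_run` key by key shows that `p1` keeps its digest while `p2` gets a new one, which is the signal a downstream component would act on.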
Let’s say you have a component A with partitions 1, 2, 3, 4, and 5, and a downstream component B whose partitions map one-to-one to those of A.
When component A processes, only the data in output partitions 2 and 5 has changed. This means that the D-SHAs for partitions 2 and 5 have changed, while the D-SHAs for partitions 1, 3, and 4 have remained the same.
Downstream component B then compares the upstream D-SHAs against the ones it last saw and reprocesses only partitions 2 and 5. Partitions 1, 3, and 4 immediately come up to date, since they do not need to be reprocessed.
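The comparison in this example can be sketched in a few lines. The digest strings below are made-up placeholders, and `stale_partitions` is a hypothetical helper written for illustration. It is not part of any Ascend API.

```python
from typing import Dict, List


def stale_partitions(previous: Dict[int, str], current: Dict[int, str]) -> List[int]:
    """Return ids of partitions whose D-SHA is new or differs from the last run."""
    return sorted(pid for pid, sha in current.items() if previous.get(pid) != sha)


# D-SHAs that component B recorded for A's partitions on the previous run.
previous = {1: "a1", 2: "b2", 3: "c3", 4: "d4", 5: "e5"}
# D-SHAs after A reprocesses: only partitions 2 and 5 produced different data.
current = {1: "a1", 2: "b2-new", 3: "c3", 4: "d4", 5: "e5-new"}

to_reprocess = stale_partitions(previous, current)
up_to_date = sorted(set(current) - set(to_reprocess))

print(to_reprocess)  # [2, 5]
print(up_to_date)    # [1, 3, 4]
```

Because the partitions map one-to-one, the list of changed upstream partitions is also the list of partitions B must reprocess. A many-to-one mapping would need an extra step to translate upstream ids into downstream ids.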
Because D-SHAs are computed in parallel using Spark, they are currently available only in Spark data plane environments. For more information on D-SHA availability in other data planes, reach out to your Ascend representative or email [email protected]