Data Partitioning (Legacy SQL Transforms)
How to partition your data with legacy SQL
Partitioning in SQL Transforms
Ascend will automatically determine the partition strategy from analyzing the SQL operation:
- Mapping if the SQL does not require any aggregation, ordering, or limits.
- Timestamp Partition if:
- The SQL has a
GROUP BYclause that includes a date or timestamp column; the input data must either already be partitioned by this value, or, the column needs to be truncated to a single partition value (via
DATE_TRUNCfor timestamps or
- The SQL does not include a window function, limit, or ordering clauses.
- The SQL has a
- Full Reduction in all other cases.
Mapping in SQL Transforms
The mapping operation is used by default when there is no aggregation needed. 1 SQL transform task is processed for each input partition.
Below is an example of a mapping operation where the transform uses a SQL WHERE clause.
The example above demonstrates the 249 partitions from the Read Connector are maintained as 249 partitions in the Mapping transform. Ascend generated 249 tasks in the underlying engine to process this particular transform.
Timestamp Partitioning in SQL Transforms
Ascend automatically enables Timestamp Partitioning when the SQL code aggregates on a timestamp column. When Timestamp Partitioning is used, 1 task is sent to the backend compute engine for each timestamp granule.
The below example demonstrates Timestamp partitioning by aggregating data on a monthly basis, creating monthly partitions.
This transform generated 2 partitions because the dataset contains records from 2 different months. We can inspect the partitions see the exactly how many records are in each month as well as the rest of the data profile.
Full Reduction in SQL Transforms
Ascend falls back to a full reduction when the SQL requires an operation that cannot be distributed across multiple workers, or data cannot be partitioned by any partitioning unit. A full reduction executes as 1 task sent to the backend engine; the task will process across an aggregate of all input partitions.
The below example demonstrates a full reduction through its use of DISTINCT, which is still a query that can be executed in parallel, but requires the entire input data set to compute.
Other example operations that will trigger a full reduction:
- GROUP BY where no columns are date/timestamp type.
- Calculating row number with a ROW_NUMBER()OVER(PARTITIONED BY) window function.
- Adding ORDER BY to the SQL code.
Updated over 1 year ago