Transforms
Creating and Updating Transforms
An Ascend Transform is created and connected to a Read Connector, a Data Feed, or another Transform component. A Transform component performs operations such as cleaning, filtering, joining, and aggregating across data sets. These operations are written in SQL, PySpark, or Scala / Java.
Metadata Columns
Transforms have access to the metadata of their inputs, exposed through metadata columns.
For SQL Transforms, enable the metadata column in Advanced Settings, then reference its name directly in the SQL statement. The column is automatically detected and included.
For PySpark Transforms, as well as Scala / Java Transforms, the user-supplied function must implement an additional method to request the specified metadata column in the inputs list. Please refer to the PySpark Transforms or Scala & Java Transforms documentation pages for the interfaces.
Use Alias for Metadata Columns
Metadata columns aren't materialized columns. Ascend populates them for all input dataframes, so to prevent naming conflicts, an alias is required whenever these columns are used.
NOTE: Using SELECT * in SQL Transforms will automatically include metadata columns without an alias, causing the component to fail. List all non-metadata columns before the aliased metadata column to properly include the metadata column in the transform.
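The query shape described in the note can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for the Ascend SQL engine so that the statement can run locally, and the table name `orders_read_connector`, the columns `order_id` and `amount`, the alias `source_file`, and the file paths are all hypothetical. In an actual SQL Transform, Ascend populates `__ascend__url` itself and the input is referenced the way the SQL Transforms documentation describes.

```python
import sqlite3

# Hypothetical stand-in for the output of an upstream Read Connector.
# In Ascend the metadata column is populated by the platform; here it is
# an ordinary column so that the query can run locally.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_read_connector "
    "(order_id INTEGER, amount REAL, __ascend__url TEXT)"
)
conn.executemany(
    "INSERT INTO orders_read_connector VALUES (?, ?, ?)",
    [
        (1, 19.99, "s3://example-bucket/orders/2023-01-01.csv"),
        (2, 5.00, "s3://example-bucket/orders/2023-01-02.csv"),
    ],
)

# List the non-metadata columns explicitly, then the aliased metadata column.
# SELECT * is avoided because it would pull in the metadata column unaliased.
QUERY = """
SELECT
    o.order_id,
    o.amount,
    o.__ascend__url AS source_file
FROM orders_read_connector AS o
ORDER BY o.order_id
"""

cursor = conn.execute(QUERY)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
for row in rows:
    print(row)
```

The output columns carry only the alias, not the raw metadata column name, which is the property the note asks for.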
Supported Metadata Columns:
Column Name | Description |
---|---|
__ascend__url | The full original filename of the input partition. This column must be used directly downstream of a Read Connector and requires the transformation to be a 'mapping' transform (see partitioning strategies for more detail). In all other cases, any attempt to use this column results in an error. |
__ascend__object_created_at | The creation timestamp of the original object read. This column must be used directly downstream of an Azure Blob or Google Cloud Storage Read Connector and requires the transformation to be a 'mapping' transform (see partitioning strategies for more detail). In all other cases, any attempt to use this column results in an error. |
__ascend__object_updated_at | The last-updated (last-modified) timestamp of the original object read. This column must be used directly downstream of an AWS S3, Azure Blob, or Google Cloud Storage Read Connector and requires the transformation to be a 'mapping' transform (see partitioning strategies for more detail). In all other cases, any attempt to use this column results in an error. |