Transforms

Creating and Updating Transforms

An Ascend Transform is created and connected to a Read Connector, a Data Feed, or another Transform component. A Transform component performs operations such as cleaning, filtering, joining, and aggregating across data sets. These operations are written in SQL, PySpark, or Scala/Java.

Metadata Columns

Transforms have access to the metadata of their inputs, exposed through metadata columns.

For SQL Transforms, enable the metadata column in Advanced Settings, then reference its name directly in the SQL statement. Ascend detects the reference automatically and includes the column.

For PySpark Transforms, as well as Scala / Java Transforms, the user-supplied function must implement an additional method to request the specified metadata column in the inputs list. Please refer to the PySpark Transforms or Scala & Java Transforms documentation pages for the interfaces.

Use Alias for Metadata Columns

Metadata columns are not materialized columns. Ascend populates them for every input dataframe, so you must alias a metadata column whenever you use it to prevent naming conflicts.

NOTE: Using SELECT * in a SQL Transform automatically includes the metadata columns without an alias, which causes the component to fail. Instead, list every non-metadata column explicitly, followed by the aliased metadata column.
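The aliasing rule can be sketched as a SQL Transform like the one below. This is a minimal illustration, not a definitive query: the upstream Read Connector name (orders_read_connector), its columns (order_id, customer_id, amount), and the alias (source_file) are all hypothetical, and the {{...}} reference to the upstream component is assumed to follow the usual Ascend SQL Transform convention. The query assumes __ascend__url has been enabled in Advanced Settings.

```sql
-- Hypothetical upstream Read Connector: orders_read_connector
-- Hypothetical data columns: order_id, customer_id, amount
SELECT
  order_id,                      -- list each non-metadata column explicitly
  customer_id,
  amount,
  __ascend__url AS source_file   -- metadata column, always given an alias
FROM {{orders_read_connector}}   -- assumed component reference syntax
```

The query avoids SELECT * and performs no joins or aggregation, which keeps it consistent with the 'mapping' transform requirement that applies to __ascend__url.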

Supported Metadata Columns:

| Column Name | Description |
| --- | --- |
| `__ascend__url` | The full original filename of the input partition. This column must be used directly downstream of a Read Connector and requires the transformation to be a 'mapping' transform (see partitioning strategies for more detail). In all other cases, an attempt to use this column results in an error. |
| `__ascend__object_created_at` | The creation timestamp of the original object read. This column must be used directly downstream of an Azure Blob or Google Cloud Storage Read Connector and requires the transformation to be a 'mapping' transform (see partitioning strategies for more detail). In all other cases, an attempt to use this column results in an error. |
| `__ascend__object_updated_at` | The last-updated (last-modified) timestamp of the original object read. This column must be used directly downstream of an AWS S3, Azure Blob, or Google Cloud Storage Read Connector and requires the transformation to be a 'mapping' transform (see partitioning strategies for more detail). In all other cases, an attempt to use this column results in an error. |