Data Quality is currently available on Snowflake, Databricks, and BigQuery Data Planes.
Data Quality checks are assertions about the dataset associated with a component. It measures objective elements such as completeness, accuracy, and consistency. Ascend provides built-in Standard and Custom Data Quality Checks using SQL queries. Once the data is ingested or transformed, the Quality Checks run before Ascend marks the component as UP TO DATE.
There are two types of Data Quality Checks in Ascend:
- A Standard Check uses a set of predefined operators to express the validation rule. The following operators are supported:
NOT IN, or
UNIQUE. At runtime, Ascend constructs a
SELECT COUNT(*)query with the configured operators and fails the check if the result count is greater than the configured threshold.
- A Custom Check allows the user to write a custom SQL statement that returns records that do not pass the check. The check is failed if the number of records returned exceeds the configured failure threshold.
Each Data Quality check can be configured to run either at the component level or at the partition level. Component-level checks run on the entire output dataset of the component. Partition-level checks run on individual partitions whenever they are added to the dataset. Partition-level checks are more efficient for incremental data pipelines by avoiding repeatedly running the same queries on the entire dataset.
For Standard Checks, the Ascend UI sets the Default Run On setting to By Partition by default. For a comprehensive overview of how partitioning with Ascend works, see Data Partitioning. For Custom Checks, the Ascend UI sets the Run On setting to By Component by default. These settings are also configurable through the API and SDK. The default is partition-level if the test level is not configured.
Each Data Quality check can be configured to respond to a failure by Returning a Warning, Raising an Error, or Doing Nothing. When
Return a Warning is selected, failures of the Data Quality check result in a warning getting logged but allow processing to continue. When
Raise an Error is selected, failed checks result in processing for the component to stop. When configuring a Data Quality check in the Ascend UI, the failure behavior is set to
Return a Warning by default.
The threshold at which a check is considered to have failed is also configurable. Currently, you can only access and set failure thresholds for warning and failure behaviors of checks through the Ascend SDK or API. In the Ascend UI, the threshold is automatically set to
Example: For a Standard Check of
NOT NULL configured with a threshold of
10 and failure behavior set to Return a Warning, Ascend will check how many rows have null values, raising a warning if more than 10 such rows are found.
Standard Checks allow you to set specific checks for each column with the option to allow for a default value of "run on" and "failure behavior." This allows greater control over prioritizing data integrity by column. When configuring a Read Connector or Transform, complete the following steps before clicking on CREATE.
The Default Run On setting configures the default validation target (either Each Partition or the Whole Component) for all column checks. The Default Failure Behavior setting configures the default failure behavior (either Return Warning, Raise Error, or Do Nothing) for all column checks.
Select + Add Standard Check to add a new standard check row. Using the generated schema, identify the column to run the quality check on. Insert the column name into the
Column Name field of the new row.
Configure the desired failure conditions for the check. Available operators are:
NOT IN, or
Run On and
Failure Behavior columns allow you to override the default behavior set in Step 1.
Repeat Steps 2 and 3 to add any additional column checks.
Once your Standard Check parameters are set, continue the configuration for your Read Connector or Transform.
Custom Checks provide the ability to create custom SQL queries to run against the data when the capabilities of Standard Checks are not sufficient to describe the failure conditions. When configuring a Read Connector or Transform Connector, complete the following steps before selecting CREATE.
The Run On setting configures the validation target (either Each Partition or Whole Component) for the Custom Check. The Failure Behavior setting configures the failure behavior (either Return Warning, Raise Error, or Do Nothing) for the custom check.
Create and insert your custom SQL query.
Once your Custom Check query is set, continue the configuration for your Read or Transform Connector.
The results of Data Quality checks in two ways in the Ascend UI: through visual indicators on the component graph, and in the Data Quality Tab of the Component Details pane. Additionally, Ascend stores results of the Data Quality Results in a table in the Data Plane alongside the output data tables.
Warning and Error states can be quickly gleaned from a Component in the Component Graph.
The Data Quality Tab in the Component Details Pane provides details for each check that was run against the data set.
When configured with
Raise Error, a failed check also results in an error message displayed in the Errors and Warnings tab.
For any Data Plane, both Standard Checks and Custom Checks will store data quality test queries and results in a new table. The table will appear in the same database/schema as the original data table with the name
In this example, data is being ingested into a Snowflake Data Plane by a component named
YELLOW_CAB. The table name containing the quality test results is
Data Quality Checks can be configured through the SDK and API. These are also the only way in which non-default failure thresholds for checks can be configured (in the Ascend UI, the threshold is always set to
Updated 4 months ago