Data Quality

πŸ“˜

Data Quality is currently available on Snowflake, Databricks, and BigQuery Data Planes.

Overview

Data Quality checks are assertions about the dataset associated with a component. It measures objective elements such as completeness, accuracy, and consistency. Ascend provides built-in Standard and Custom Data Quality Checks using SQL queries. Once the data is ingested or transformed, the Quality Checks run before Ascend marks the component as UP TO DATE.

There are two types of Data Quality Checks in Ascend:

  • A Standard Check is predefined. Standard Checks include NOT NULL, >, >=, <, <=, IN, NOT IN, or UNIQUE.
  • A Custom Check is a custom SQL users write that returns failing records.

Data Quality Checks at the Component & Partition Level

Each Data Quality check can run on a component or partition level. Component-level checks run SQL queries on the entire dataset for the component. Partition-level checks run SQL queries on the partition whenever a new partition is added to the dataset. So it's more efficient for incremental data pipelines by avoiding repeatedly running the same queries on the entire dataset.

For Standard Checks, Ascend UI sets the Default Run On to By Partition by default. For a comprehensive overview of how partitioning with Ascend works, see Data Partitioning. For Custom Checks, Ascend UI sets the Default Run On to By Component by default. They are configurable on the UI, API, and SDK. The default is partition-level if the test level is not configured.

Data Quality Checks Threshold

For each Data Quality check, the failure effect can be warning or error for the component, and the thresholds for the failure effects are configurable. The current UI default failure effect is warning and the threshold is set to 0. This means a check fails if any rows are returned for either a Standard Check or Custom Check and the component shows as a warning status. Currently, you can only access and set failure thresholds for warning and failure behaviors of checks through the Ascend SDK or API.

Example: For a Standard Check of NOT NULL, you can set a threshold of "10" and failure effect as warning. Ascend will check how many rows have null values. If a single row containing null values is found, no error or warning is displayed because the threshold was set to "10".

How to Set Standard Checks

Standard Checks allow you to set specific checks for each column with the option to allow for a default value of "run on" and "failure behavior." This allows greater control over prioritizing data integrity by column. When configuring a Read Connector or Transform, complete the following steps before selecting CREATE.

Step 1: Set the Default Run On and Default Failure Behavior (Optional)

The Default Run On sets the default validation target for every column check (either Each Partition or the Whole Component). The Default Failure Behavior sets the failure for every column check (either Return Warning, Raise Error, or Do Nothing).

Step 2: Select columns for running Standard Checks

Select + Add Standard Check to add a new standard check row. Using the generated schema, identify the columns to run quality checks on. Insert the column name into the new row.

Step 3: Set the quality check parameters

Standard Checks allows Ascend to construct SQL queries using a SELECT COUNT(*). If the query returns a count outside of a specified number in either NOT NULL, >, >=, <, <=, IN, NOT IN, or UNIQUE, the component displays your selected default failure behavior.

The Run On and Failure Behavior columns allow you to override the default behavior previously set for specific column checks.

15351535

Figure 1. Creating Standard Checks.

Once your Standard Check parameters are set, continue the configuration for your Read Connector or Transform.

How to Set Custom Checks

Ability to create custom SQL queries to run against the data. Greater control over specific Standard Checks allow you to set specific checks for each column with the option to allow for a default value of "run on" and "failure behavior." This allows greater control over prioritizing data integrity by column. When configuring a Read Connector or Transform Connector, complete the following steps before selecting CREATE.

Step 1: Set the Default Run On and Default Failure Behavior

The Default Run On sets the default validation target for every column check (either Each Partition or Whole Component). The Default Failure Behavior sets the failure for every column check (either Return Warning, Raise Error, or Do Nothing).

Step 2: Custom SQL

Create and insert your custom SQL query. Once your Custom Check query is set, continue the configuration for your Read or Transform Connector.

13461346

Figure 2. Creating Custom Checks.

Reviewing Data Quality Check Errors and Warnings

Errors and warnings for Data Quality checks appear on the component itself and within the component view. Ascend also writes the failed rows to a separate table.

Component Errors and Warnings

The first is on the component itself within the graph. For any component, Return Warning will log the failure and allow the component to continue. Raise Error will log the failure and stop the component from running.

392392

Figure 3. Return Warning will present a warning on the component and allow the component to continue.

392392

Figure 4. Raise Error will present with a red icon.

A description of the warning also appears in the details pane within the component view. First, the Data Quality tab will indicate if a check failed for the partition or component.

15311531

Figure 5. Standard Check set to Raise Error showing the partition error in the Data Quality tab.

The error is also detailed in the Errors and Warnings tab.

11871187

Figure 6. Standard Check set to Raise Error showing partition error on Errors and Warnings tab.

Data Quality Checks Table

For any Data Plane, both Standard Checks and Custom Checks will store data quality test queries and results in a new table. The table will appear in the same database as the original data table and with the name TABLENAME_TESTS__ASCEND_META.

Ex. In Snowflake, the table name ingested is YELLOW_CAB. The table name containing the quality test results is YELLOW_CAB_TESTS__ASCEND_META.

10311031

Figure 7. Ascend data quality checks table in Snowflake.

Data Quality Checks through Ascend SDK

Data Quality Checks are available through the SDK. Additionally, you can access and set thresholds for warning and failure behaviors of checks.

Questions?

We're always happy to help with any other questions you might have! Send us an email, ping us on Slack, or chat with us to your right :point-right:.