Object Aggregation Strategies for Blob Storage

🚧
Not compatible with Data Replication
Data replication strategies are not compatible with Object Aggregation: a connector cannot have both features active at the same time. See Data Replication Strategies for Blob Store.

Overview

Ascend aggregates data during ingestion from blob storage data sources into larger data partitions. This is known as object aggregation. Object Aggregation takes multiple data source objects (files, in the case of blob stores) and aggregates those objects into a single file.

📘
Object Aggregation is currently only supported by the following Read Connectors:

Amazon S3 Read Connector

Azure Data Lake Storage Gen 2 Read Connector

Google Cloud Storage Read Connector

What Is an Object Aggregation Strategy

Object aggregation is a data processing strategy for big data systems that reduces the number of files in a given job by combining many small files into a smaller number of larger files. This can be done in different ways, or strategies, depending on the needs of the system. For example, some systems may combine files based on their content, while others may combine files based on time or location. When one file in an aggregation changes (it is removed or modified, or a new file is added), all files previously aggregated with it must also be re-aggregated, because the system tracks the aggregation as a single file. The goal is to store data in structures that are as efficient as possible while limiting the number of files the system must keep track of.

Object Aggregation and Ascend

Ascend utilizes Spark clusters to read, transform, and write data. Spinning up a Spark cluster for many small data files can result in wasted (and costly) compute resources. For example, when a data set contains a few hundred files, a few hundred Spark jobs are generated.

Ascend uses object aggregation to ensure efficient use of resources and to provide the most cost-effective compute processing for our customers. Object aggregation also speeds up processing time.

By default, one partition is generated per object listed by a Read Connector. If your source directory structure has many small files (which can lead to inefficiencies in big data pipelines), the objects can be aggregated together for parsing, resulting in a smaller final number of partitions. Aggregation can occur for objects that share a common prefix in the same directory (as determined by the delimiter) and whose total size is under the specified maximum aggregate object size. When aggregation occurs, the reported filename for the aggregated object, which is accessible to downstream transforms, is the common prefix of the original objects' filenames.
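
The grouping described above can be illustrated with a minimal Python sketch. It is not Ascend's implementation: the function name `aggregate_by_directory`, the greedy packing order, and the use of a character-wise common prefix (`os.path.commonprefix`) are all assumptions made for illustration.

```python
import os
from collections import defaultdict


def aggregate_by_directory(objects, max_bytes, delimiter="/"):
    """Group (name, size) pairs by parent directory, keeping each group within max_bytes.

    Hypothetical helper: returns a list of (reported_name, member_names) tuples.
    """
    by_dir = defaultdict(list)
    for name, size in objects:
        parent = name.rsplit(delimiter, 1)[0] if delimiter in name else ""
        by_dir[parent].append((name, size))

    partitions = []
    for parent in sorted(by_dir):
        batch, batch_bytes = [], 0
        for name, size in sorted(by_dir[parent]):
            # Start a new batch when adding this object would exceed the limit.
            if batch and batch_bytes + size > max_bytes:
                partitions.append((os.path.commonprefix(batch), batch))
                batch, batch_bytes = [], 0
            batch.append(name)
            batch_bytes += size
        if batch:
            partitions.append((os.path.commonprefix(batch), batch))
    return partitions


objects = [
    ("logs/2021/01/part-0001.csv", 40),
    ("logs/2021/01/part-0002.csv", 40),
    ("logs/2021/01/part-0003.csv", 40),
    ("logs/2021/02/part-0001.csv", 40),
]
partitions = aggregate_by_directory(objects, max_bytes=100)
for reported_name, members in partitions:
    print(reported_name, len(members))
```

In this sketch the four input objects become three partitions instead of four, and the first partition is reported under the shared prefix of its two members.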

Adaptive

Adaptive object aggregation improves performance when working with large object hierarchies. It lets you compress the object hierarchy below a chosen directory level, which reduces the number of files that need to be accessed.

Field	Description
Column Name to Include Original File Name	The column name may contain only alphanumeric characters and underscores, and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema.
Max # of Input Bytes per Aggregation	There is no default value.
Max # of Files per Aggregation	The default value is `1024`.
Max # of Records per Aggregation	There is no default value.
Max # of Directory Levels per Aggregation	The default value is `8`.
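
The naming rule for the Column Name to Include Original File Name field can be checked before you configure a connector. The sketch below is only an illustration: the helper `is_valid_column_name` is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# Assumption: ASCII letters, digits, and underscores only; the first
# character must not be a digit.
COLUMN_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_column_name(name: str) -> bool:
    """Return True when the name satisfies the rule described in the table."""
    return COLUMN_NAME_RE.fullmatch(name) is not None


for candidate in ("source_file", "1st_file", "file-name"):
    print(candidate, is_valid_column_name(candidate))
```

An empty string is rejected by this check; in the connector itself, an empty field simply means the column is not included.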

Leaf Directory

Leaf directories in big data environments often store files produced at similar times, such as within a given hour. These files tend to change at around the same time and, once their time window has passed, never change again, which makes them ideal for object aggregation. With Leaf Directory object aggregation, Ascend combines all files in each leaf (deepest-level) directory into a single file per directory.

Field	Description
Column Name to Include Original File Name	The column name may contain only alphanumeric characters and underscores, and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema.
Max # of Input Bytes per Aggregation	There is no default limit.
Max # of Files per Aggregation	The default value is `1024`.
Max # of Records per Aggregation	There is no default limit.
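
As a rough illustration of the Leaf Directory idea, the following Python sketch groups object paths by their leaf directory. The helper `group_by_leaf_directory` is hypothetical, and how files in non-leaf directories are handled is an assumption: here they are simply left out of the result.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_leaf_directory(paths):
    """Map each leaf directory to the sorted list of files it contains."""
    by_dir = defaultdict(list)
    for path in paths:
        by_dir[str(PurePosixPath(path).parent)].append(path)

    # A directory is a leaf when no other file-bearing directory sits beneath it.
    ancestors = set()
    for directory in by_dir:
        for ancestor in PurePosixPath(directory).parents:
            ancestors.add(str(ancestor))

    return {d: sorted(files) for d, files in by_dir.items() if d not in ancestors}


paths = [
    "events/2023/01/01/00/a.json",
    "events/2023/01/01/00/b.json",
    "events/2023/01/01/01/c.json",
    "events/README.txt",
]
leaf_groups = group_by_leaf_directory(paths)
for directory, files in sorted(leaf_groups.items()):
    print(directory, files)
```

Each key in `leaf_groups` corresponds to one aggregated file in this sketch, before any of the per-aggregation limits in the table are applied.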

Prefix Regex Match

Aggregating object data with regular expressions allows you to group similar files, regardless of their location. This makes it easier to aggregate data that is not split across directories but is instead placed in a single directory, or whose file names share specific prefixes. With a regular expression, you specify the pattern of characters you are interested in, and the aggregator collects all the files that match that pattern.

Unlike Leaf Directory and Adaptive strategies, Prefix Regex Match does not have a max number of bytes, files, or records. Instead, data is grouped by timestamp.

Field	Required/Optional	Description
Regex to Match	Required	The regex used to extract common prefixes. For example, if files are named like `/foo/bar/2021-01-01-03-04.log` (a file that could hold data from Jan 1st, 2021 at 3:04am), a regex of `/foo/bar/\d{4}-\d{2}-\d{2}` aggregates all objects for a given day into a single object.
Column Name to Include Original File Name	Optional	The column name may contain only alphanumeric characters and underscores, and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema.
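
To see how the example regex in the table groups files by day, here is a small Python sketch. The function `group_by_prefix` is hypothetical, and anchoring the match at the start of the path (`re.match`) is an assumption about how the pattern is applied.

```python
import re
from collections import defaultdict

# The example pattern from the table above.
PREFIX_RE = re.compile(r"/foo/bar/\d{4}-\d{2}-\d{2}")


def group_by_prefix(paths, pattern=PREFIX_RE):
    """Group paths by the prefix the regex matches; collect non-matching paths separately."""
    groups = defaultdict(list)
    unmatched = []
    for path in paths:
        match = pattern.match(path)
        if match:
            groups[match.group(0)].append(path)
        else:
            unmatched.append(path)
    return dict(groups), unmatched


paths = [
    "/foo/bar/2021-01-01-03-04.log",
    "/foo/bar/2021-01-01-17-30.log",
    "/foo/bar/2021-01-02-00-15.log",
    "/foo/bar/readme.txt",
]
groups, unmatched = group_by_prefix(paths)
for prefix, members in sorted(groups.items()):
    print(prefix, len(members))
```

Both files from Jan 1st fall under the prefix `/foo/bar/2021-01-01`, so they would be aggregated together, while the Jan 2nd file forms its own group.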

Reshape Aggregation

Reshape object aggregation lets us take any data structure and reshape it into a hierarchy. Ascend has two ways to implement this:

Read the last modified timestamp from each file's metadata, "reshape" the files into a virtual folder structure of time-based hierarchical folders based on that timestamp, and then aggregate.
Use a regex to parse a timestamp from the file path, "reshape" the files into a virtual folder structure of time-based hierarchical folders based on the extracted timestamp, and then aggregate.

Reshape by Last Modified Time

If no selection is made for Reshape by, Reshape Aggregation defaults to last modified time at the day granularity.

Field	Required/Optional	Description
Reshape Granularity	Optional	Reshape by year, month, day, or hour. Default granularity is day.
Column Name to Include Original File Name	Optional	The column name may contain only alphanumeric characters and underscores, and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema.
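
A minimal Python sketch of reshaping by last modified time follows. It is an illustration under stated assumptions, not Ascend's implementation: the helper `reshape_by_last_modified` is hypothetical, and the virtual folder naming (`year=.../month=.../day=...`) is borrowed from the regex-based example on this page and assumed to apply here as well.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumed virtual folder layout for each supported granularity.
FOLDER_FORMATS = {
    "year": "year=%Y",
    "month": "year=%Y/month=%m",
    "day": "year=%Y/month=%m/day=%d",
    "hour": "year=%Y/month=%m/day=%d/hour=%H",
}


def reshape_by_last_modified(objects, granularity="day"):
    """Group (name, last_modified) pairs into virtual time-based folders."""
    folder_format = FOLDER_FORMATS[granularity]
    folders = defaultdict(list)
    for name, last_modified in objects:
        folders[last_modified.strftime(folder_format)].append(name)
    return dict(folders)


objects = [
    ("inbox/a.csv", datetime(2022, 11, 2, 3, 4, tzinfo=timezone.utc)),
    ("inbox/b.csv", datetime(2022, 11, 2, 21, 0, tzinfo=timezone.utc)),
    ("inbox/c.csv", datetime(2022, 11, 3, 0, 30, tzinfo=timezone.utc)),
]
by_day = reshape_by_last_modified(objects)
by_hour = reshape_by_last_modified(objects, granularity="hour")
print(by_day)
```

With the default day granularity, the two files modified on Nov 2nd share a virtual folder; at hour granularity, each file lands in its own folder.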

Extract Time via Regex

Field	Required/Optional	Description
Regex to match	Required	Extracts datetime parts using regex group names. For example, if the object name is `my_prefix/date=2022-11-2/file.parquet`, use the regex `date=(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})`; the aggregated object name will be `year=2022/month=11/day=02`.
Parse Failure Mode	Optional	How to handle an object whose name cannot be parsed: Raise Error, Include File (but not aggregated), or Skip File.
Reshape Granularity	Optional	Reshape by year, month, day, or hour. Default granularity is day.
Column Name to Include Original File Name	Optional	The column name may contain only alphanumeric characters and underscores, and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema.
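
The named-group regex from the table uses the same syntax as Python's `re` module, so its behavior can be tried out locally. The sketch below is illustrative only: the helper `reshaped_name` is hypothetical, zero-padding is inferred from the `day=02` example above, and returning `None` stands in for the point where the configured Parse Failure Mode would apply.

```python
import re

# The example pattern from the table above.
DATE_RE = re.compile(r"date=(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})")
PARTS = ("year", "month", "day", "hour")


def reshaped_name(object_name, granularity="day", pattern=DATE_RE):
    """Build the virtual folder name for an object, or None if parsing fails."""
    match = pattern.search(object_name)
    if match is None:
        return None
    groups = match.groupdict()
    segments = []
    for part in PARTS[: PARTS.index(granularity) + 1]:
        value = groups.get(part)
        if value is None:
            # The regex did not capture a part required by this granularity.
            return None
        width = 4 if part == "year" else 2
        segments.append(f"{part}={int(value):0{width}d}")
    return "/".join(segments)


print(reshaped_name("my_prefix/date=2022-11-2/file.parquet"))
print(reshaped_name("my_prefix/date=2022-11-2/file.parquet", granularity="month"))
print(reshaped_name("my_prefix/no_date_here/file.parquet"))
```

For the object name in the table, the sketch produces `year=2022/month=11/day=02`, matching the documented aggregated object name.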

📘
If multiple parameters are set for Object Aggregation, all of them are taken into account when attempting to aggregate a prefix. If any parameter's value is exceeded, Ascend will not aggregate that object. No parameter takes precedence over another: the conditions for all parameters must be met in order to aggregate the objects in a given prefix.
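
The "all parameters must be met" rule in the note above can be expressed as a short Python sketch. The names `AggregationLimits` and `can_aggregate` are hypothetical, and treating an unset parameter as "no limit" is an assumption based on the defaults listed in the tables on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AggregationLimits:
    """Hypothetical container for the per-aggregation limits; None means no limit."""
    max_bytes: Optional[int] = None
    max_files: Optional[int] = 1024
    max_records: Optional[int] = None


def can_aggregate(total_bytes, file_count, record_count, limits):
    """Return True only when every configured limit is respected."""
    checks = [
        (total_bytes, limits.max_bytes),
        (file_count, limits.max_files),
        (record_count, limits.max_records),
    ]
    # No limit takes precedence: a single exceeded limit blocks aggregation.
    return all(limit is None or value <= limit for value, limit in checks)


print(can_aggregate(10_000, 100, 500, AggregationLimits(max_bytes=50_000)))
print(can_aggregate(10_000, 2_000, 500, AggregationLimits(max_bytes=50_000)))
```

In the second call the byte limit is satisfied but the file count exceeds the default of `1024`, so the prefix would not be aggregated in this sketch.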