Object Aggregation for Blob Storage

Overview

Ascend aggregates data during ingestion from blob storage data sources into larger data partitions. This is known as object aggregation. Object Aggregation takes multiple data source objects (files, in the case of blob stores) and aggregates those objects into a single file.

📘

Object Aggregation is currently only supported by the following Read Connectors:

What is Object Aggregation Strategy

Object aggregation is a data processing strategy designed for big data systems. Aggregation can be done in different ways, called strategies, depending on the needs of the system. For example, some systems combine files based on their content, while others combine files based on time or location. The goal is to store as much data as possible in a layout that is easy to process in parallel.

Unlike partial reduction or full reduction, object aggregation strategies don't require reprocessing all data to apply transformations. Object aggregation is also more resilient and less error-prone than partial reduction.

Object Aggregation and Ascend

Ascend utilizes Spark clusters to read, transform, and write data. Spinning up a Spark cluster for many small data files can result in wasted (and costly) compute resources. For example, when a data set contains a few hundred files, a few hundred Spark jobs are generated.

Ascend uses object aggregation to make efficient use of resources and to provide the most cost-effective compute processing for our customers. Object aggregation also speeds up processing time.

By default, one partition is generated per object listed by a Read Connector. If your source directory structure has many small files (which can lead to inefficiencies in big data pipelines), the objects can be aggregated together for parsing, resulting in a smaller final number of partitions. Aggregation can occur for objects that share a common prefix in the same directory (as determined by the delimiter) and whose total size is under the specified maximum aggregate object size. When aggregation occurs, the reported filename for the aggregated object, which is accessible to downstream transforms, is the common prefix of the original objects' filenames.
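The grouping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ascend's implementation: the function name `aggregate_by_directory`, the `(name, size)` input shape, and the greedy batching order are all hypothetical, and a batch is allowed to reach the size cap exactly.

```python
from collections import defaultdict
from os.path import commonprefix


def aggregate_by_directory(objects, max_bytes, delimiter="/"):
    """Batch (name, size) pairs per directory, keeping each batch within max_bytes.

    Returns a list of (reported_name, [object names]) tuples, where the
    reported name is the common prefix of the names in the batch.
    """
    # Group objects by the directory part of their name (hypothetical rule).
    by_dir = defaultdict(list)
    for name, size in sorted(objects):
        directory = name.rsplit(delimiter, 1)[0] if delimiter in name else ""
        by_dir[directory].append((name, size))

    groups = []
    for members in by_dir.values():
        batch, batch_bytes = [], 0
        for name, size in members:
            # Close the current batch when this object would push it over the cap.
            if batch and batch_bytes + size > max_bytes:
                groups.append((commonprefix(batch), batch))
                batch, batch_bytes = [], 0
            batch.append(name)
            batch_bytes += size
        if batch:
            groups.append((commonprefix(batch), batch))
    return groups


if __name__ == "__main__":
    listing = [("logs/a-1.csv", 40), ("logs/a-2.csv", 40), ("logs/a-3.csv", 40)]
    for reported_name, names in aggregate_by_directory(listing, max_bytes=100):
        print(reported_name, names)
```

With a 100-byte cap, the first two 40-byte objects form one aggregate reported as `logs/a-`, and the third stays on its own because adding it would exceed the cap.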

Adaptive

Adaptive object aggregation is a technique for improving performance when working with large object hierarchies. With adaptive object aggregation, you can choose to compress the object hierarchy below a certain directory level, which reduces the number of files that need to be accessed.

| Field | Description |
|---|---|
| Column Name to Include Original File Name | This column cannot have a non-alphanumeric character (aside from underscore) and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema. |
| Max # of Input Bytes per Aggregation | There is no default value. |
| Max # of Files per Aggregation | The default value is 1024. |
| Max # of Records per Aggregation | There is no default value. |
| Max # of Directory Levels per Aggregation | The default value is 8. |
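The naming rule for the original-file-name column can be checked before you enter it in the connector. The helper below is a hypothetical sketch, not part of Ascend; it assumes "alphanumeric" means ASCII letters and digits.

```python
import re

# ASCII letters, digits, and underscores only; the first character is not a digit.
_COLUMN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_column_name(name: str) -> bool:
    """Return True if `name` satisfies the column-name rule from the table."""
    return _COLUMN_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("source_file", "1st_file", "file-name"):
        print(candidate, is_valid_column_name(candidate))
```

An empty string is rejected by this check; in the connector, leaving the field empty simply means the column is not included.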

Leaf Directory

Leaf directories are often used in big data environments to store files that are accessed infrequently. This is because leaf directories can be quickly and easily scanned, which makes them ideal for storing large files. Leaf Directory object aggregation means Ascend will compress data down to the very last directory.

| Field | Description |
|---|---|
| Column Name to Include Original File Name | This column cannot have a non-alphanumeric character (aside from underscore) and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema. |
| Max # of Input Bytes per Aggregation | There is no default limit. |
| Max # of Files per Aggregation | The default value is 1024. |
| Max # of Records per Aggregation | There is no default limit. |

Prefix Regex Match

Aggregating object data with regular expressions allows you to group similar files, regardless of their location. This makes it easier to read and manage data that is spread out across different directories or storage services. By using a regular expression, you can specify the pattern of characters you are interested in, and the aggregator will collect all the files that match that pattern.

Unlike the Leaf Directory and Adaptive strategies, Prefix Regex Match does not have a maximum number of bytes, files, or records. Instead, objects are grouped by the prefix that the regex matches, typically a timestamp.

| Field | Required/Optional | Description |
|---|---|---|
| Regex to Match | Required | The regex to extract common prefixes. Ex. If files are named such as /foo/bar/2021-01-01-03-04.log (where such a file could represent data from Jan 1st, 2021 @ 3:04am), a regex of /foo/bar/\d{4}-\d{2}-\d{2} would aggregate all objects for a given day into a single object. |
| Column Name to Include Original File Name | Optional | This column cannot have a non-alphanumeric character (aside from underscore) and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema. |
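To see which objects a given regex would place together, you can try it locally before configuring the connector. The sketch below uses the example regex from the table. The function name and the handling of non-matching objects (returned separately) are assumptions for illustration, since the source does not say what Ascend does with them.

```python
import re
from collections import defaultdict


def group_by_prefix_regex(names, pattern):
    """Group object names by the prefix that `pattern` matches at the start."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    unmatched = []  # assumption: non-matching objects are reported separately
    for name in names:
        match = regex.match(name)
        if match:
            groups[match.group(0)].append(name)
        else:
            unmatched.append(name)
    return dict(groups), unmatched


if __name__ == "__main__":
    names = [
        "/foo/bar/2021-01-01-03-04.log",
        "/foo/bar/2021-01-01-05-30.log",
        "/foo/bar/2021-01-02-00-00.log",
    ]
    groups, _ = group_by_prefix_regex(names, r"/foo/bar/\d{4}-\d{2}-\d{2}")
    for prefix, members in groups.items():
        print(prefix, len(members))
```

Both files from Jan 1st share the matched prefix `/foo/bar/2021-01-01`, so they fall into the same group, while the Jan 2nd file forms its own.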

Reshape Aggregation

Reshape object aggregation lets us take any data structure and reshape it into a hierarchy. Ascend has two ways to implement this:

  1. Read metadata for a last modified timestamp, then place flat files into hierarchical folders for ingest.
  2. Use Regex to parse the file name for a timestamp, then place flat files into hierarchical folders for ingest.

Reshape by Last Modified Time

If no selection is made for Reshape by, Reshape Aggregation defaults to last modified time at the day granularity.

| Field | Required/Optional | Description |
|---|---|---|
| Reshape Granularity | Optional | Reshape by year, month, day, or hour. Default granularity is day. |
| Column Name to Include Original File Name | Optional | This column cannot have a non-alphanumeric character (aside from underscore) and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema. |
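As a rough illustration of how a last modified timestamp could map to a hierarchical folder at each granularity, consider the sketch below. The `year=.../month=.../day=...` layout is borrowed from the regex example elsewhere in this document and is an assumption for this mode. The function name is also hypothetical.

```python
from datetime import datetime, timezone

_PARTS = ["year", "month", "day", "hour"]


def reshape_prefix(last_modified: datetime, granularity: str = "day") -> str:
    """Build a hierarchical prefix from a timestamp, down to `granularity`."""
    depth = _PARTS.index(granularity) + 1  # raises ValueError for unknown values
    values = {
        "year": f"{last_modified.year:04d}",
        "month": f"{last_modified.month:02d}",
        "day": f"{last_modified.day:02d}",
        "hour": f"{last_modified.hour:02d}",
    }
    return "/".join(f"{part}={values[part]}" for part in _PARTS[:depth])


if __name__ == "__main__":
    modified = datetime(2022, 11, 2, 7, 30, tzinfo=timezone.utc)
    print(reshape_prefix(modified))          # default day granularity
    print(reshape_prefix(modified, "hour"))
```

Objects whose timestamps map to the same prefix would land in the same folder, so a coarser granularity produces fewer, larger groups.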

Extract Time via Regex

| Field | Required/Optional | Description |
|---|---|---|
| Regex to Match | Required | Extracts datetime parts using regex group names. Ex. If the object name is my_prefix/date=2022-11-2/file.parquet, use this regex: date=(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2}), and the aggregated object name will be year=2022/month=11/day=02. |
| Parse Failure Mode | Optional | One of: Raise Error, Include File (but not aggregated), Skip File. |
| Reshape Granularity | Optional | Reshape by year, month, day, or hour. Default granularity is day. |
| Column Name to Include Original File Name | Optional | This column cannot have a non-alphanumeric character (aside from underscore) and cannot begin with a number. If the field is left empty, Ascend will not include this column. NOTE: Using this field requires you to manually add the column to the schema. |
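The example regex can be tested locally with Python's `re` module, which uses the same named-group syntax. The sketch below is an assumption-laden illustration rather than Ascend's code: the function name, the `on_failure` values (`"raise"`, `"include"`, `"skip"`, mirroring the three Parse Failure Mode options), and the use of `search` to find the pattern anywhere in the object name are all hypothetical.

```python
import re

PATTERN = re.compile(r"date=(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})")
_PARTS = ["year", "month", "day", "hour"]


def _pad(part, value):
    # Zero-pad so that day "2" becomes "02"; years use four digits.
    return str(int(value)).zfill(4 if part == "year" else 2)


def reshaped_name(object_name, pattern=PATTERN, granularity="day", on_failure="raise"):
    """Return the aggregated object name, or apply the failure mode."""
    wanted = _PARTS[: _PARTS.index(granularity) + 1]
    match = pattern.search(object_name)
    parts = match.groupdict() if match else {}
    if any(parts.get(part) is None for part in wanted):
        if on_failure == "raise":
            raise ValueError(f"cannot extract {wanted} from {object_name!r}")
        if on_failure == "include":
            return object_name  # kept under its own name, not aggregated
        return None  # "skip": the object is dropped
    return "/".join(f"{part}={_pad(part, parts[part])}" for part in wanted)


if __name__ == "__main__":
    print(reshaped_name("my_prefix/date=2022-11-2/file.parquet"))
```

For the object name from the table, this prints `year=2022/month=11/day=02`, matching the documented result.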

📘

If multiple parameters are set for Object Aggregation, all of them are taken into account when attempting to aggregate a prefix. If any parameter's value is exceeded, the objects in that prefix are not aggregated. No parameter takes precedence over another: the conditions for all parameters must be met in order to aggregate the objects in a given prefix.
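The "all parameters must be met" rule amounts to a logical AND over every configured limit. A minimal sketch, assuming hypothetical `AggregationLimits` and `PrefixStats` types, treating an unset limit as "no limit", and leaving out the directory-level limit for brevity:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AggregationLimits:
    max_bytes: Optional[int] = None    # no default, per the tables above
    max_files: Optional[int] = 1024    # documented default
    max_records: Optional[int] = None  # no default, per the tables above


@dataclass
class PrefixStats:
    total_bytes: int
    file_count: int
    record_count: int


def can_aggregate(stats: PrefixStats, limits: AggregationLimits) -> bool:
    """True only if the prefix stays within every limit that is set."""
    checks = [
        (limits.max_bytes, stats.total_bytes),
        (limits.max_files, stats.file_count),
        (limits.max_records, stats.record_count),
    ]
    # An unset limit (None) never blocks aggregation.
    return all(limit is None or value <= limit for limit, value in checks)


if __name__ == "__main__":
    limits = AggregationLimits(max_bytes=100, max_records=50)
    print(can_aggregate(PrefixStats(100, 10, 50), limits))
    print(can_aggregate(PrefixStats(100, 10, 51), limits))
```

In the second call only the record count is over its limit, yet the prefix is still rejected, because no single parameter can override another.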


More Reading