Google Cloud Storage Write Connector (Legacy)

Creating a Google Cloud Storage (GCS) Write Connector

Prerequisites:

Access credentials
Data location on Google Cloud Storage
Partition column from upstream, if applicable

Specify the Upstream and GCS path

📘
Suggestion
For optimal performance, ensure the upstream transform is partitioned.

Location

GCS Write Connectors have the following location settings:

Bucket: The GCS bucket name, e.g. my_data_bucket. Do not include any folder information in this field.
Object Prefix: The hierarchical folder prefix within the bucket, e.g. good_data/my_awesome_table. Do not include a leading forward slash.
Partition Folder Pattern: The folder pattern that will be generated based on the values from the upstream partition column, e.g. you can use {{at_hour_ts(yyyy-MM-dd/HH)}} where column at_hour_ts is a partition column from the upstream transform.
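
As an illustration of how a Partition Folder Pattern might resolve to a folder, here is a minimal Python sketch. It is not Ascend's implementation: the function names, the four supported date tokens (yyyy, MM, dd, HH), and the way the Object Prefix is joined to the expanded pattern are all assumptions made for this example.

```python
import re
from datetime import datetime

# Assumed mapping from the date tokens in the example pattern to Python
# strftime codes. Only these four tokens are handled in this sketch.
_TOKEN_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}

# Matches placeholders of the form {{column_name(date_format)}}.
_PLACEHOLDER = re.compile(r"\{\{(\w+)\(([^)]*)\)\}\}")


def expand_partition_pattern(pattern, row):
    """Replace each {{column(format)}} placeholder with the formatted column value."""

    def _format(match):
        column, date_format = match.group(1), match.group(2)
        strftime_format = re.sub(
            "yyyy|MM|dd|HH",
            lambda token: _TOKEN_TO_STRFTIME[token.group(0)],
            date_format,
        )
        return row[column].strftime(strftime_format)

    return _PLACEHOLDER.sub(_format, pattern)


def object_folder(object_prefix, pattern, row):
    """Join the Object Prefix and the expanded pattern (assumed layout)."""
    return f"{object_prefix}/{expand_partition_pattern(pattern, row)}"


folder = object_folder(
    "good_data/my_awesome_table",
    "{{at_hour_ts(yyyy-MM-dd/HH)}}",
    {"at_hour_ts": datetime(2021, 3, 4, 15, 0)},
)
print(folder)  # good_data/my_awesome_table/2021-03-04/15
```

Under these assumptions, every distinct hour in at_hour_ts would map to its own date/hour folder beneath the Object Prefix.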

❗️
Warning
Any data under the object prefix that is not propagated by the upstream transform will be deleted automatically.
For example, if the write connector produces three files A, B and C in the object prefix and a file called output.txt already exists at the same location, Ascend will delete output.txt because Ascend did not generate it.
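
The deletion rule in the warning above can be pictured as a set difference. The sketch below only models that rule; the function name is hypothetical and nothing here calls GCS or reflects Ascend's actual code.

```python
def objects_to_delete(existing, generated):
    """Return the existing object names that the write connector did not generate."""
    return set(existing) - set(generated)


# Objects already under the object prefix before the write.
existing = {"A", "B", "output.txt"}
# Objects produced by the write connector.
generated = {"A", "B", "C"}

print(sorted(objects_to_delete(existing, generated)))  # ['output.txt']
```

Running a check like this against a listing of the target prefix before pointing a connector at it can show which existing objects would be at risk.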

Manifest file (optional)

If selected, a manifest file is generated or updated every time the Write Connector becomes Up-To-Date. It contains a list of the file names of all data files that are ready to be consumed by downstream applications. To create a Manifest File, specify the full path, including the file name, where the Manifest File should be created, and choose whether it should be a CSV or a JSON file.

🚧
Important
Make sure the Manifest File is located under the same folder as the data files.
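
One way to sanity-check a manifest path before saving the connector is sketched below. It assumes that "the same folder as the data files" means the Object Prefix and that both paths are written relative to the bucket; the function name and example paths are hypothetical.

```python
from pathlib import PurePosixPath


def manifest_is_under_prefix(manifest_path, object_prefix):
    """Return True if the manifest file sits somewhere below the object prefix."""
    return PurePosixPath(object_prefix) in PurePosixPath(manifest_path).parents


prefix = "good_data/my_awesome_table"
print(manifest_is_under_prefix("good_data/my_awesome_table/manifest.json", prefix))  # True
print(manifest_is_under_prefix("other_folder/manifest.json", prefix))  # False
```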

Google Cloud Storage Credentials

Enter the JSON key for a service account that has the Storage Admin and Storage Object Admin roles for the GCS path. Refer to the Google documentation for more details on GCS authentication.
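
Before pasting a key, it can help to confirm that the text is a well-formed service account key. The sketch below checks only the JSON structure, using field names found in standard Google service account key files. It does not contact Google or verify roles, and the function name and sample values are placeholders invented for this example.

```python
import json

# Fields present in a standard Google service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_key):
    """Return a list of problems found in the key text (empty if none)."""
    try:
        key = json.loads(raw_key)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    if key.get("type") and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems


# Placeholder values only; a real key is downloaded from the GCP console.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "writer@my-project.iam.gserviceaccount.com",
})

print(check_service_account_key(sample_key))  # []
print(check_service_account_key("{}"))
```

An empty list only means the key looks structurally complete; Test Connection remains the way to confirm that the permissions are correct.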

🚧
Storage API
Ascend relies on the Google Storage API to write to GCS locations, so the Storage API must be enabled. It should be enabled by default in GCP, but if it is not, you can enable it at https://console.developers.google.com/apis/api/storage-api.googleapis.com/overview?project={your gcp id}.

Testing Connection

Use Test Connection to check whether all GCS permissions are correctly configured.

Selecting a formatter

JSON, Parquet and XSV data formats are supported for S3 and GCS write connectors.

XSV formatter: Supports 3 delimiters and 9 line terminators, and lets you specify whether a Header Row should be included. The generated XSV is RFC 4180 compliant.

🚧
Important
The XSV Formatter will NOT replace newline characters within values. Replace newline characters in the upstream transform if you require XSV files to contain only single-line records.
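
To show why embedded newlines matter, the sketch below uses Python's standard csv module as a stand-in for an RFC 4180 style writer. It is not the Ascend XSV Formatter, and the helper names are hypothetical. A value containing a newline is quoted and spans two physical lines unless the newline is replaced beforehand, as you would do in the upstream transform.

```python
import csv
import io


def to_xsv(rows, delimiter=",", line_terminator="\r\n"):
    """Serialize rows with the standard csv writer (quotes fields when needed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator=line_terminator)
    writer.writerows(rows)
    return buffer.getvalue()


def strip_newlines(rows, replacement=" "):
    """Replace newline characters inside every value, mimicking an upstream cleanup."""
    return [
        [
            str(value).replace("\r\n", replacement).replace("\n", replacement).replace("\r", replacement)
            for value in row
        ]
        for row in rows
    ]


rows = [["id", "comment"], [1, "first line\nsecond line"]]

raw = to_xsv(rows)
clean = to_xsv(strip_newlines(rows))

print(repr(raw))    # the comment is quoted and still contains a newline
print(repr(clean))  # every record fits on one line
```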

JSON formatter: Generates a file in which each line is a valid JSON object representing one data row from upstream.

🚧
Important
The JSON Formatter will automatically replace newline characters in column values with \n to guarantee that the JSON file has single-line records.
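
The one-object-per-line layout and the \n escaping can be reproduced with Python's standard json module, which escapes newline characters inside strings in the same way. This is a sketch of the output format only, not the Ascend JSON Formatter, and the function name and sample rows are made up for illustration.

```python
import json


def to_json_lines(rows):
    """Serialize each row as one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


rows = [
    {"id": 1, "comment": "first line\nsecond line"},
    {"id": 2, "comment": "ok"},
]

output = to_json_lines(rows)
lines = output.splitlines()

print(len(lines))  # 2
print(lines[0])    # {"id": 1, "comment": "first line\nsecond line"}

# Parsing a line restores the original newline in the value.
restored = json.loads(lines[0])
```

A downstream consumer can therefore read the file line by line and parse each line independently.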

Parquet formatter: Automatically applies snappy compression to the output files.