Bytes Connector Code Properties
When creating a Python Read Connector, you must choose a connector code interface. With the Bytes Connector Code interface, Ascend reads in a stream of bytes.
Prerequisites
- A Custom Python Connection
Required Python Functions
The following table describes the functions available when creating a new Python Read Connector using the Bytes Connector Code interface. Create a Python Read Connector using the information below and these step-by-step instructions.
Function | Description |
---|---|
context | Lets Ascend generate context before running list_objects. This is where you can store the credentials needed to access objects. |
list_objects | Defines which objects you want to read. Each object is read in as a separate partition unless objects are defined by metadata. You can specify multiple reads and partition objects into a hierarchy by using metadata and is_prefix. |
metadata | Assigns metadata to objects for Ascend to parse. metadata includes name, is_prefix, and fingerprint. |
name | The object or file name, which is converted to the partition name. |
is_prefix | Tells Ascend whether to group objects together by a prefix. If set to false, Ascend places an object in its own partition. If set to true, a prefix must be defined, and all objects with that prefix are placed in the same partition. |
fingerprint | The fingerprint is the SHA of the object, which allows Ascend to determine whether the object has changed. A common practice is to assign a date/time stamp as the fingerprint. However, the fingerprint must be a string, and it must be unique for each yield/partition. |
yield | Tells Ascend when to create a new partition. To create many partitions, use multiple yield statements or yield inside a loop. |
read_function | Tells Ascend to return the objects as one of the available interfaces. |
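As a rough illustration of how the functions in the table fit together, the sketch below lists two objects and yields one metadata dictionary per object, so each would become its own partition. The in-memory OBJECTS dictionary and its file names are hypothetical stand-ins for a real data source, and using a SHA-256 hex digest as the fingerprint is one possible choice, not a requirement stated on this page.

```python
import hashlib

# Hypothetical in-memory "source": object name -> raw bytes.
OBJECTS = {
    "orders.csv": b"id,total\n1,9.50\n",
    "users.csv": b"id,name\n1,Ada\n",
}


def context(credentials):
    # Return whatever list_objects and read_bytes need; here, just the fake source.
    return {"objects": OBJECTS}


def list_objects(context, metadata):
    # One yield per object, so each object becomes its own partition.
    for name, payload in sorted(context["objects"].items()):
        yield {
            "name": name,
            # The fingerprint must be a string and unique per yield.
            "fingerprint": hashlib.sha256(payload).hexdigest(),
            "is_prefix": False,
        }


partitions = list(list_objects(context(None), None))
```

Because the fingerprint is derived from the object's content, it changes only when the content changes, which is what lets a changed object be detected.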
If you want new lines to be interpreted within the string of bytes, you must explicitly return new lines with \n.
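A minimal sketch of the newline rule, assuming a read_bytes generator that yields one record at a time. The rows are hypothetical sample data, and the function yields text in the same way as the example on this page; each record carries its own explicit \n so that record boundaries survive when the chunks are concatenated.

```python
def read_bytes(context, metadata):
    # Hypothetical rows; a real connector would fetch them using `context`.
    rows = [["Alice", "Math"], ["Bob", "Art"]]
    for row in rows:
        # Terminate every record with an explicit newline.
        yield ",".join(row) + "\n"


# Concatenating the chunks shows what the downstream parser would see.
stream = "".join(read_bytes(None, {"name": "demo"}))
```

Without the trailing "\n", the two records would run together as a single line.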
Recursive list_objects
Metadata is a Python dictionary that defines a partition. Metadata is used in both list_objects and read_bytes. To trigger the recursive behavior within list_objects and create partitions, set is_prefix to True. If a previously created partition is not returned again when list_objects runs, all previous partition metadata will be deleted.
When constructing your Python code, list_objects must return the partition metadata for all the partitions you expect to be in the component.
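The recursive pattern can be sketched as follows. This page does not spell out the calling convention, so the sketch makes two loud assumptions: that the first call receives metadata set to None, and that the platform calls list_objects again with the metadata of every entry yielded with is_prefix set to True. The STORE dictionary and the walk helper are hypothetical; walk only simulates the platform so the example can run on its own.

```python
# Hypothetical object store: prefix -> {object name: last-modified timestamp}.
STORE = {
    "2023/": {"2023/jan.csv": "2023-01-31T00:00:00Z"},
    "2024/": {
        "2024/jan.csv": "2024-01-31T00:00:00Z",
        "2024/feb.csv": "2024-02-29T00:00:00Z",
    },
}


def list_objects(context, metadata):
    if metadata is None:
        # Assumed top-level call: yield one prefix entry per group.
        for prefix in sorted(STORE):
            newest = max(STORE[prefix].values())
            yield {"name": prefix, "fingerprint": newest, "is_prefix": True}
    else:
        # Assumed recursive call for one prefix: yield the objects under it.
        for name, modified in sorted(STORE[metadata["name"]].items()):
            yield {"name": name, "fingerprint": modified, "is_prefix": False}


def walk(context, metadata=None):
    """Hypothetical stand-in for the platform: recurse into every prefix entry."""
    for entry in list_objects(context, metadata):
        if entry["is_prefix"]:
            yield from walk(context, entry)
        else:
            yield entry


leaves = [entry["name"] for entry in walk(context=None)]
```

Every run returns metadata for all prefixes and all objects, which matches the requirement that list_objects return every partition you expect to remain in the component.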
Example Bytes Connector Code
The following code example reads a spreadsheet from Google Sheets.
# This example reads a spreadsheet from Google Sheets to illustrate the functions to implement.
import csv
import io
import json

from google.oauth2 import service_account
from googleapiclient.discovery import build

SPREADSHEET_ID = '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms'
RANGE_NAME = 'Class Data!A2:E'
SCOPES = [
    'https://www.googleapis.com/auth/spreadsheets.readonly',
    'https://www.googleapis.com/auth/drive.metadata.readonly',
]


def context(credentials):
    """
    Sets up the context for reading and listing data from the data source.
    This is where the Python Connection information will be passed through.
    Avoid opening a database connection.
    """
    service_account_info = json.loads(credentials)
    creds = service_account.Credentials.from_service_account_info(service_account_info, scopes=SCOPES)
    g_sheet = build('sheets', 'v4', credentials=creds)
    drive = build('drive', 'v3', credentials=creds)
    return {
        'g_sheet_client': g_sheet,
        'drive_client': drive,
    }


def list_objects(context, metadata):
    """
    This custom read connector processes one Google spreadsheet with the ID listed above.
    """
    # Use the file's last-modified time (a string) as the fingerprint.
    fingerprint = context['drive_client'].files().get(fileId=SPREADSHEET_ID, fields='modifiedTime').execute()['modifiedTime']
    yield {'name': SPREADSHEET_ID, 'fingerprint': fingerprint, 'is_prefix': False}


def read_bytes(context, metadata):
    """
    Returns a byte stream that represents all data to return for a given Read Connector configuration.
    """
    sheet = context['g_sheet_client'].spreadsheets()
    result = sheet.values().get(spreadsheetId=metadata['name'], range=RANGE_NAME).execute()
    values = result.get('values') or []
    for row in values:
        strbuf = io.StringIO()
        # writerow appends the line terminator, so each yielded row ends with an explicit '\n'.
        w = csv.writer(strbuf, lineterminator='\n')
        w.writerow(row)
        yield strbuf.getvalue()
External Keys and Secrets Management through Credentials
To use external keys or a secrets manager, define the context function as def context(credentials: str), where the str argument is the name you assigned to the credential. Any connector code interface receives the external key as a result of that function call.
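A minimal sketch of a context function that consumes the credential string, assuming (as in the Google Sheets example on this page) that the credential value arrives as JSON. The field names api_key and base_url and the default URL are hypothetical and exist only for illustration.

```python
import json


def context(credentials: str):
    # Assumption: the credential value arrives as a JSON string.
    secrets = json.loads(credentials)
    # Keep only what list_objects and read_bytes need; field names are hypothetical.
    return {
        "api_key": secrets["api_key"],
        "base_url": secrets.get("base_url", "https://example.invalid"),
    }


ctx = context('{"api_key": "abc123"}')
```

Returning a plain dictionary keeps the secret out of module-level globals and makes it available to the other functions through their context argument.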
Parsers
Some Python Read Connector parsers have additional required or optional properties. For more information, see Read Connector Parsers.
Code Fingerprint Strategy
The Code Fingerprint Strategy controls Ascend's DataAware function. Ascend reads the code and the configuration and calculates the hash of the component. If the code changes in any way, the data must be reprocessed. Currently, the only available setting is Automatic content-based fingerprint.
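To make the idea of a content-based fingerprint concrete, here is an illustrative sketch that hashes code and configuration together. It is not Ascend's actual algorithm; the component_fingerprint function, the choice of SHA-256, and the sample inputs are all assumptions made for this example.

```python
import hashlib
import json


def component_fingerprint(code: str, config: dict) -> str:
    # Serialize the config deterministically so key order does not affect the hash.
    payload = code + "\n" + json.dumps(config, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


before = component_fingerprint("def read_bytes(c, m): yield 'a'", {"parser": "csv"})
after = component_fingerprint("def read_bytes(c, m): yield 'b'", {"parser": "csv"})
same = component_fingerprint("def read_bytes(c, m): yield 'a'", {"parser": "csv"})
```

Here before and same are equal, while after differs because one character of the code changed. A mismatch like that is the kind of signal that would mark the data for reprocessing.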
Schema Generation
Schema generation reads the first few rows of the dataset and generates a schema, just as with other Ascend Read Connectors. However, to sample the dataset, schema generation runs the code in the Connector Code Interface across the entire dataset.