SDK Definition Files
Using Dataflow / DataService Definitions
This section covers the higher-level API of the Ascend SDK. The SDK represents a pipeline as Python classes that can be "applied" to Ascend to true up its state. For example, a single Ascend Dataflow is represented as a Dataflow class, with a name, id, embedded transforms, and so on. This definition can be tweaked and then applied back to Ascend as part of a build/deploy process.
There are a couple of ways to get started. You could write a Dataflow definition from scratch based on the API reference and the examples in this tutorial. However, the typical workflow for a new user is to first create DataServices and Dataflows on the Ascend UI, and then download/export these definitions to their local file-system.
Download an existing Dataflow
To download a Dataflow:
from ascend.sdk.client import Client
from ascend.sdk.render import download_dataflow, TEMPLATES_V2
client = Client("myhost.ascend.io")
download_dataflow(client,
                  data_service_id="my_data_service",
                  dataflow_id="my_dataflow",
                  resource_base_path="./",
                  template_dir=TEMPLATES_V2)
This generates a Dataflow definition in a Python file named my_dataflow.py, and stores inline code from Transforms and ReadConnectors as separate files in a sub-folder named my_dataflow/ under resource_base_path.
Feel free to inspect and refactor the generated files. You'll notice that the Dataflow definition consists roughly of:
- A series of import statements to pull in the required SDK and protobuf classes.
- A Dataflow resource definition that includes component definitions, data feeds, data feed connectors, and component groups. Note how relationships between resources are expressed. For instance, dependencies between components are expressed via fields such as input_id (for WriteConnectors) and input_ids (for Transforms).
- Creation of an SDK Client instance.
- Creation of a DataflowApplier and application of the constructed Dataflow definition.
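As a rough sketch (with the component definitions elided, and assuming the DataflowApplier import path ascend.sdk.applier), a generated my_dataflow.py has roughly this shape:
from ascend.sdk.applier import DataflowApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import Dataflow

dataflow = Dataflow(
    id="my_dataflow",
    name="My Dataflow",
    components=[
        # ReadConnector, Transform, and WriteConnector definitions appear here
    ],
)

client = Client("myhost.ascend.io")
DataflowApplier(client).apply(data_service_id="my_data_service", dataflow=dataflow)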
Once downloaded, this Python script can be run any number of times to achieve the desired end-state for the Dataflow.
The steps for downloading a DataService are similar:
from ascend.sdk.client import Client
from ascend.sdk.render import download_data_service, TEMPLATES_V2
client = Client("myhost.ascend.io")
download_data_service(client,
                      data_service_id="my_data_service",
                      resource_base_path="./",
                      template_dir=TEMPLATES_V2)
This generates a DataService definition in a Python file my_data_service.py under resource_base_path, creates a sub-folder for each Dataflow, and stores inline code for Transforms and ReadConnectors in these sub-folders.
Ascend SDK Modules Reference
There are four modules used when interacting with Ascend through these "definition"-based APIs.
Module Name | Description |
---|---|
definitions | Used to express the definitions of the key parts of Ascend, such as a Dataflow, DataService, etc. |
applier | Used to 'apply' definitions back to Ascend (typically as part of pushing or deploying to Ascend). |
render | Used to download templates of the existing resources in an Ascend environment to bootstrap / facilitate development. |
builder | Utility functions to create Ascend definitions from protobuf API responses. |
Module definitions
Ascend resource definition classes.
Classes
Component(id: str)
Components are the functional units in a Dataflow.
There are three main types of Components: ReadConnectors, Transforms, and WriteConnectors.
ComponentGroup(id: str, name: str, component_ids: List[str], description: str = None)
A component group
Create a ComponentGroup definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- component_ids: list of ids of components that make up the group
Returns: a new ComponentGroup instance.
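For example, a group collecting two hypothetical components might be defined as:
from ascend.sdk.definitions import ComponentGroup

staging_group = ComponentGroup(
    id="staging",
    name="Staging",
    component_ids=["read_raw_events", "clean_events"],  # ids of existing components in the Dataflow
    description="Ingestion and cleanup components",
)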
DataFeed(id: str, name: str, input_id: str, description: str = None, shared_with_all: bool = False, data_services_shared_with: List[str] = [], hidden_from_host_data_service: bool = False)
A data feed
Create a DataFeed definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_id: input component id
- shared_with_all: if set to true, this data feed is shared with all data services, including the host data service
- data_services_shared_with: a list of ids of data services that we want to share data with (not including the host data service)
- hidden_from_host_data_service: if set to true, the host data service cannot subscribe to the contents of this data feed
Returns: a new DataFeed instance.
Note - if shared_with_all is set to true, the values of data_services_shared_with and hidden_from_host_data_service are ignored.
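For example, a data feed that publishes the output of a transform to two other data services (all ids here are hypothetical):
from ascend.sdk.definitions import DataFeed

events_feed = DataFeed(
    id="events_feed",
    name="Events feed",
    input_id="clean_events",  # component whose output is published
    data_services_shared_with=["analytics", "reporting"],
)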
DataFeedConnector(id: str, name: str, input_data_service_id: str, input_dataflow_id: str, input_data_feed_id: str, description: str = None)
A data feed connector
Create a DataFeedConnector definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_data_service_id: id of the DataService that the input DataFeed belongs to
- input_dataflow_id: id of the Dataflow that the input DataFeed belongs to
- input_data_feed_id: input DataFeed id
Returns: a new DataFeedConnector instance.
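For example, subscribing to a hypothetical events_feed published by another data service:
from ascend.sdk.definitions import DataFeedConnector

events_subscription = DataFeedConnector(
    id="events_subscription",
    name="Events subscription",
    input_data_service_id="ingestion",   # DataService that hosts the feed
    input_dataflow_id="events",          # Dataflow that hosts the feed
    input_data_feed_id="events_feed",    # the DataFeed being subscribed to
)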
DataService(id: str, name: str, description: str = None, dataflows: List[definitions.Dataflow] = [])
A DataService is the highest-level organizational structure in Ascend, and the primary
means of controlling access. It contains one or more Dataflows and can communicate with
other Data Services via Data Feeds.
Create a DataService definition.
Parameters:
- id: DataService identifier
- name: name
- description: brief description
- dataflows: list of dataflows
Returns: a new DataService instance.
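For example, a DataService containing a single (empty) Dataflow, with illustrative ids:
from ascend.sdk.definitions import DataService, Dataflow

data_service = DataService(
    id="my_data_service",
    name="My Data Service",
    description="Owns the event-processing dataflows",
    dataflows=[Dataflow(id="my_dataflow", name="My Dataflow")],
)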
Dataflow(id: str, name: str, description: str = None, components: List[definitions.Component] = [], data_feed_connectors: List[definitions.DataFeedConnector] = [], data_feeds: List[definitions.DataFeed] = [], groups: List[definitions.ComponentGroup] = [])
A Dataflow is a representation of the operations being performed on data, expressed
as a dependency graph. Operations are expressed declaratively via components, and are
broadly categorized as read connectors (data ingestion), transforms (transformation),
and write connectors. Data feeds expose the results of transforms to other dataflows
and data services. Data feed connectors are subscriptions to data feeds. Component groups
offer users a convenient way to organize components that have a similar purpose.
Create a Dataflow definition.
Parameters
- id: identifier for the Dataflow that is unique within its DataService
- name: user-friendly name
- description: brief description
- components: list of components
- data_feed_connectors: list of data feed connectors
- data_feeds: list of data feeds
- groups: list of component groups
Returns: a new Dataflow instance.
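A skeletal Dataflow definition, showing where each kind of resource goes (the lists are left empty here for brevity):
from ascend.sdk.definitions import Dataflow

dataflow = Dataflow(
    id="my_dataflow",
    name="My Dataflow",
    description="Ingests, cleans, and publishes event data",
    components=[],            # ReadConnector, Transform, and WriteConnector definitions
    data_feeds=[],            # DataFeed definitions
    data_feed_connectors=[],  # DataFeedConnector definitions
    groups=[],                # ComponentGroup definitions
)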
Definition()
Base class for all definitions.
ReadConnector(id: str, name: str, container: ascend.protos.io.io_pb2.Container, description: str = None, pattern: ascend.protos.pattern.pattern_pb2.Pattern = None, update_periodical: ascend.protos.core.core_pb2.Periodical = None, assigned_priority: ascend.protos.component.component_pb2.Priority = None, last_manual_refresh_time: google.protobuf.timestamp_pb2.Timestamp = None, aggregation_limit: google.protobuf.wrappers_pb2.Int64Value = None, bytes: ascend.protos.component.component_pb2.FromBytes = None, records: ascend.protos.component.component_pb2.FromRecords = None, compute_configurations: List[ascend.protos.expression.expression_pb2.StageComputeConfiguration] = None)
A read connector
Create a ReadConnector definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- container (proto): a typed representation of the source of data
- pattern (proto): defines the set of objects in the source that will be ingested
- update_periodical: specifies source update frequency
- assigned_priority: scheduling priority for the read connector
- last_manual_refresh_time:
- aggregation_limit:
- bytes: defines the operator for parsing bytes read from the source (example - json or xsv parsing)
- records: defines the schema for records read from a database or warehouse
- compute_configurations: custom configuration overrides by stage
Returns: a new ReadConnector instance.
Note - a read connector can have either bytes or records defined, but not both.
Transform(id: str, name: str, input_ids: List[str], operator: ascend.protos.operator.operator_pb2.Operator, description: str = None, assigned_priority: ascend.protos.component.component_pb2.Priority = None)
A transform
Create a Transform definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_ids: list of input component ids
- operator: transform operator definition - example - SQL/PySpark/Scala transform
- assigned_priority: scheduling priority for transform
Returns: a new Transform instance.
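For example, a transform with one upstream input. The operator message below is left empty as a placeholder - its exact fields come from the generated operator_pb2 schema, and a downloaded Dataflow definition is the easiest reference for how to populate it:
from ascend.protos.operator import operator_pb2
from ascend.sdk.definitions import Transform

operator = operator_pb2.Operator()  # placeholder - populate with the transform logic (e.g. a SQL query)

transform = Transform(
    id="clean_events",
    name="Clean events",
    input_ids=["read_raw_events"],  # ids of upstream components
    operator=operator,
)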
WriteConnector(id: str, name: str, input_id: str, container: ascend.protos.io.io_pb2.Container, description: str = None, assigned_priority: ascend.protos.component.component_pb2.Priority = None, bytes: ascend.protos.component.component_pb2.ToBytes = None, records: ascend.protos.component.component_pb2.ToRecords = None, compute_configurations: List[ascend.protos.expression.expression_pb2.StageComputeConfiguration] = None)
A write connector
Create a WriteConnector definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_id: input component id
- container (proto): a typed representation of the destination of data
- assigned_priority: scheduling priority of write connector
- bytes: defines the operator for formatting bytes written to the object store destination
- records: indicates that the destination is a structured record store like a database or a data warehouse
- compute_configurations: custom configuration overrides by stage
Returns: a new WriteConnector instance.
Note - a write connector can have either bytes or records defined, but not both.
Module applier
Ascend resource applier classes - used for applying resource definitions.
Classes
DataServiceApplier(client: ascend.sdk.client.Client)
DataServiceApplier is a utility class that accepts a DataService definition and
'applies' it - ensuring that the DataService is created if it does not already exist,
deleting any existing Dataflows, components, and members of the DataService that
are not part of the supplied definition, and applying any configuration changes needed
to match the definition.
Creates a new DataServiceApplier.
Methods
apply(self, data_service: ascend.sdk.definitions.DataService, delete=True, dry_run=False)
Create or update the specified DataService.
Parameters:
- data_service: DataService definition
- delete (optional): If set to True (which is the default) - delete any Dataflow that is not part of data_service. At the Dataflow level, remove any components that are not part of the Dataflow definition.
- dry_run (optional): If set to True (False by default) - skips any create or update operations.
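For example, previewing and then applying a minimal DataService definition (the ascend.sdk.applier import path is assumed here):
from ascend.sdk.applier import DataServiceApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import DataService

client = Client("myhost.ascend.io")
data_service = DataService(id="my_data_service", name="My Data Service")

applier = DataServiceApplier(client)
applier.apply(data_service, dry_run=True)   # report what would change without modifying anything
applier.apply(data_service, delete=False)   # apply, but keep Dataflows not listed in the definition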
DataflowApplier(client: ascend.sdk.client.Client)
DataflowApplier is a utility class that accepts a Dataflow definition and 'applies' it - ensuring that a Dataflow is created if it does not already exist and binding it to the DataService identified by data_service_id, deleting any components and members of the Dataflow that are not part of the supplied definition, and applying any configuration changes needed to match the definition.
Creates a new DataflowApplier.
Methods
apply(self, data_service_id: str, dataflow: ascend.sdk.definitions.Dataflow, delete=True, dry_run=False)
Accepts a Dataflow definition, and ensures that it is created, bound to the DataService identified by data_service_id, and has all of the constituent elements included in the Dataflow definition - components (read connectors, transforms, write connectors), component groups, data feeds, and data feed connectors. The specified DataService must already exist.
Parameters:
- data_service_id: DataService id
- dataflow: Dataflow definition
- delete: if set to True (default=True) - delete any components, data feeds, data feed connectors, or groups not defined in dataflow
- dry_run: if set to True (default=False) - skips any create or update operations
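For example, a dry-run apply of a minimal Dataflow definition (the ascend.sdk.applier import path is assumed):
from ascend.sdk.applier import DataflowApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import Dataflow

client = Client("myhost.ascend.io")
dataflow = Dataflow(id="my_dataflow", name="My Dataflow")

DataflowApplier(client).apply("my_data_service", dataflow, dry_run=True)  # preview only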
ComponentApplier.build(client: ascend.sdk.client.Client, data_service_id: str, dataflow_id: str)
ComponentApplier is a utility class that accepts a Component definition (ReadConnector, WriteConnector, Transform) and 'applies' it - ensuring that the Component is created if it does not already exist and binding it to the DataService identified by data_service_id and the Dataflow identified by dataflow_id, creating or applying any configuration changes needed to match the definition.
Creates a new ComponentApplier that has the apply function.
Methods
apply(self, data_service_id: str, dataflow_id: str, component: Component)
Accepts a Component definition, and ensures that it is created, bound to the DataService identified by data_service_id and the Dataflow identified by dataflow_id, and has all of the properties in the Component definition. The specified DataService and Dataflow must already exist.
Parameters:
- data_service_id: DataService id
- dataflow_id: Dataflow id
- component: Component Definition (one of ReadConnector, Transform, WriteConnector)
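For example, applying a single Transform (the ascend.sdk.applier import path and the empty placeholder operator are assumptions for illustration):
from ascend.protos.operator import operator_pb2
from ascend.sdk.applier import ComponentApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import Transform

client = Client("myhost.ascend.io")
transform = Transform(id="clean_events",
                      name="Clean events",
                      input_ids=["read_raw_events"],
                      operator=operator_pb2.Operator())  # placeholder operator

applier = ComponentApplier.build(client, "my_data_service", "my_dataflow")
applier.apply("my_data_service", "my_dataflow", transform)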
Module render
Utility functions to download Ascend resource definitions.
Functions
from ascend.sdk.render import download_data_service
download_data_service(client: ascend.sdk.client.Client, data_service_id: str, resource_base_path: str = '.')
Downloads a DataService and writes its definition to a file named {data_service_id}.py under resource_base_path. Inline code for Transforms and ReadConnectors is written as separate files to sub-folders - resource_base_path/{dataflow_id}/components/ - with the file name derived from the id of the component to which the code belongs. Creates resource_base_path if it does not exist.
Parameters:
- client: SDK client
- data_service_id: DataService id
- resource_base_path: base path to which DataService definition and resource files are written
- template_dir: directory that contains templates for rendering. Defaults to v1 template directory.
from ascend.sdk.render import download_dataflow
download_dataflow(client: ascend.sdk.client.Client, data_service_id: str, dataflow_id: str, resource_base_path: str = '.')
Downloads a Dataflow and writes its definition to a file named {dataflow_id}.py under resource_base_path. Inline code for Transforms and ReadConnectors is written as separate files to a sub-folder - resource_base_path/components/ - with the file names derived from the id of the component to which the code belongs. Creates resource_base_path if it does not exist.
Parameters:
- client: SDK client
- data_service_id: DataService id
- dataflow_id: Dataflow id
- resource_base_path: path to which Dataflow definition and resource files are written
- template_dir: directory that contains templates for rendering. Defaults to v1 template directory.
from ascend.sdk.render import download_component
download_component(client: ascend.sdk.client.Client, data_service_id: str, dataflow_id: str, component_id: str, resource_base_path: str = '.')
Downloads a Component and writes its definition to a file named {component_id}.py under resource_base_path. Inline code for PySpark and SQL Transforms is written as separate files in the same folder, with the file names derived from the id of the component to which the code belongs. Creates resource_base_path if it does not exist.
Parameters:
- client: SDK client
- data_service_id: DataService id
- dataflow_id: Dataflow id
- component_id: Component id
- resource_base_path: path to which the Component definition is written
- template_dir: directory that contains templates for rendering. Defaults to v1 template directory.
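For example, downloading a single component (ids are illustrative):
from ascend.sdk.client import Client
from ascend.sdk.render import download_component

client = Client("myhost.ascend.io")
download_component(client,
                   data_service_id="my_data_service",
                   dataflow_id="my_dataflow",
                   component_id="my_component",
                   resource_base_path="./")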
Classes
InlineCode(code: str, attribute_path: Tuple[str, ...], resource_path: str, base_path: str, base64_encoded: bool = False)
Represents the result of extraction of inline code from a component - such as
a sql statement or PySpark code. Contains all of the metadata needed to render
a component definition, with the inline code written to a separate file, and a reference
to this code stored in the component.
Parameters:
- code: inline code
- attribute_path: path of the attribute in the component definition that contains the inline code, represented as a tuple of path components. For instance, the path for sql code in a Transform is ("operator", "sql_query", "sql")
- resource_path: file path to which inline code is written
- base_path: base path of dataflow or data service resource definition
- base64_encoded: if set to True, inline code is base64 encoded
Module builder
Utility functions to create Ascend resource definitions from the corresponding protobuf
objects. The protobuf objects can be obtained via the 'imperative' API accessible through
the SDK client -
client.get_data_service(<data_service_id>)
client.get_dataflow(<dataflow_id>)
Functions
data_service_from_proto(client: ascend.sdk.client.Client, proto: ascend.protos.api.api_pb2.DataService) ‑> ascend.sdk.definitions.DataService
Build a DataService definition from the API DataService protobuf representation.
This is a utility function that helps bridge the declarative and imperative forms
of the SDK by transforming a DataService proto message to a DataService definition
that can be 'applied'.
Parameters:
- client: SDK client
- proto: DataService protobuf object
Returns: a DataService definition
dataflow_from_proto(client: ascend.sdk.client.Client, data_service_id: str, proto: ascend.protos.api.api_pb2.Dataflow) ‑> ascend.sdk.definitions.Dataflow
Build a Dataflow definition from the API Dataflow protobuf representation.
Parameters:
- client: SDK client
- data_service_id: DataService id
- proto: Dataflow protobuf object
Returns: a Dataflow definition
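For example, round-tripping a DataService from the imperative API into a definition that can be re-applied (the ascend.sdk.builder import path, and the assumption that client.get_data_service returns the proto directly, are illustrative):
from ascend.sdk.applier import DataServiceApplier
from ascend.sdk.builder import data_service_from_proto
from ascend.sdk.client import Client

client = Client("myhost.ascend.io")
data_service_proto = client.get_data_service("my_data_service")  # assumed to return the DataService proto
data_service = data_service_from_proto(client, data_service_proto)

# The resulting definition can be tweaked and re-applied.
DataServiceApplier(client).apply(data_service, dry_run=True)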