SDK Definition Files
Using Dataflow / DataService Definitions
This section covers the higher-level API of the Ascend SDK. The SDK represents a pipeline as Python classes that can be "applied" to Ascend to true up its state. For example, a single Ascend Dataflow is represented as a Dataflow class, with a name, id, embedded transforms, and so on. This definition can be tweaked and then applied back to Ascend as part of a build/deploy process.
There are a couple of ways to get started. You could write a Dataflow definition from scratch based on the API reference and the examples in this tutorial. However, the typical workflow for a new user is to first create DataServices and Dataflows on the Ascend UI, and then download/export these definitions to their local file-system.
Download an existing Dataflow
To download a Dataflow:
from ascend.sdk.client import Client
from ascend.sdk.render import download_dataflow, TEMPLATES_V2
client = Client("myhost.ascend.io")
download_dataflow(client,
                  data_service_id="my_data_service",
                  dataflow_id="my_dataflow",
                  resource_base_path="./",
                  template_dir=TEMPLATES_V2)
This generates a Dataflow definition in a Python file named my_dataflow.py, and stores inline code from Transforms and ReadConnectors as separate files in a sub-folder named my_dataflow/ under resource_base_path.
Feel free to inspect and refactor the generated files. You'll notice that the Dataflow definition consists roughly of:
- A series of import statements to pull in the required SDK and protobuf classes.
- A Dataflow resource definition that includes component definitions, data feeds, data feed connectors, and component groups. Note how relationships between resources are expressed. For instance, dependencies between components are expressed via fields such as input_id (for WriteConnectors) and input_ids (for Transforms).
- Creation of an SDK Client instance.
- Creation of a DataflowApplier and application of the constructed Dataflow definition.
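As a rough sketch (with the component definitions elided, and assuming the DataflowApplier import path ascend.sdk.applier), a generated my_dataflow.py has roughly this shape:
from ascend.sdk.applier import DataflowApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import Dataflow

dataflow = Dataflow(
    id="my_dataflow",
    name="My Dataflow",
    components=[
        # ReadConnector, Transform, and WriteConnector definitions appear here
    ],
)

client = Client("myhost.ascend.io")
DataflowApplier(client).apply(data_service_id="my_data_service", dataflow=dataflow)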
Once downloaded, this Python script can be run any number of times to achieve the desired end-state for the Dataflow.
The steps for downloading a DataService are similar:
from ascend.sdk.client import Client
from ascend.sdk.render import download_data_service, TEMPLATES_V2
client = Client("myhost.ascend.io")
download_data_service(client,
                      data_service_id="my_data_service",
                      resource_base_path="./",
                      template_dir=TEMPLATES_V2)
This generates a DataService definition in a Python file my_data_service.py under resource_base_path, creates a sub-folder for each Dataflow, and stores inline code for Transforms and ReadConnectors in these sub-folders.
Ascend SDK Modules Reference
There are four modules used when interacting with Ascend through these "definition"-based APIs.
Module Name | Description |
---|---|
definitions | Used to express the definitions of the key parts of Ascend, such as a Dataflow, DataService, etc. |
applier | Used to 'apply' definitions back to Ascend (typically as part of pushing or deploying to Ascend). |
render | Used to download templates of the existing resources in an Ascend environment to bootstrap / facilitate development. |
builder | Utility functions to create Ascend definitions from protobuf API responses. |
Module definitions
Ascend resource definition classes.
Classes
Component(id: str)
Components are the functional units in a Dataflow.
There are three main types of Components: ReadConnectors, Transforms, and WriteConnectors.
ComponentGroup(id: str, name: str, component_ids: List[str], description: str = None)
A component group
Create a ComponentGroup definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- component_ids: list of ids of components that make up the group
Returns: a new ComponentGroup instance.
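For example, a group collecting two hypothetical components might be defined as:
from ascend.sdk.definitions import ComponentGroup

staging_group = ComponentGroup(
    id="staging",
    name="Staging",
    component_ids=["read_raw_events", "clean_events"],  # ids of existing components in the Dataflow
    description="Ingestion and cleanup components",
)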
DataFeed(id: str, name: str, input_id: str, description: str = None, shared_with_all: bool = False, data_services_shared_with: List[str] = [], hidden_from_host_data_service: bool = False)
A data feed
Create a DataFeed definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_id: input component id
- shared_with_all: if set to true, this data feed is shared with all data services, including the host data service
- data_services_shared_with: a list of ids of data services that we want to share data with (not including the host data service)
- hidden_from_host_data_service: if set to true, the host data service cannot subscribe to the contents of this data feed
Returns: a new DataFeed instance.
Note - if shared_with_all is set to true, the values of data_services_shared_with and hidden_from_host_data_service are ignored.
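For example, a data feed that publishes the output of a transform to two other data services (all ids here are hypothetical):
from ascend.sdk.definitions import DataFeed

events_feed = DataFeed(
    id="events_feed",
    name="Events feed",
    input_id="clean_events",  # component whose output is published
    data_services_shared_with=["analytics", "reporting"],
)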
DataFeedConnector(id: str, name: str, input_data_service_id: str, input_dataflow_id: str, input_data_feed_id: str, description: str = None)
A data feed connector
Create a DataFeedConnector definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_data_service_id: id of the DataService that the input DataFeed belongs to
- input_dataflow_id: id of the Dataflow that the input DataFeed belongs to
- input_data_feed_id: input DataFeed id
Returns: a new DataFeedConnector instance.
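For example, subscribing to a hypothetical events_feed published by another data service:
from ascend.sdk.definitions import DataFeedConnector

events_subscription = DataFeedConnector(
    id="events_subscription",
    name="Events subscription",
    input_data_service_id="ingestion",   # DataService that hosts the feed
    input_dataflow_id="events",          # Dataflow that hosts the feed
    input_data_feed_id="events_feed",    # the DataFeed being subscribed to
)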
DataService(id: str, name: str, description: str = None, dataflows: List[definitions.Dataflow] = [])
A DataService is the highest-level organizational structure in Ascend, and the primary
means of controlling access. It contains one or more Dataflows and can communicate with
other Data Services via Data Feeds.
Create a DataService definition.
Parameters:
- id: DataService identifier
- name: name
- description: brief description
- dataflows: list of dataflows
Returns: a new DataService instance.
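For example, a DataService containing a single (empty) Dataflow, with illustrative ids:
from ascend.sdk.definitions import DataService, Dataflow

data_service = DataService(
    id="my_data_service",
    name="My Data Service",
    description="Owns the event-processing dataflows",
    dataflows=[Dataflow(id="my_dataflow", name="My Dataflow")],
)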
Dataflow(id: str, name: str, description: str = None, components: List[definitions.Component] = [], data_feed_connectors: List[definitions.DataFeedConnector] = [], data_feeds: List[definitions.DataFeed] = [], groups: List[definitions.ComponentGroup] = [])
A Dataflow is a representation of the operations being performed on data, expressed
as a dependency graph. Operations are expressed declaratively via components, and are
broadly categorized as read connectors (data ingestion), transforms (transformation),
and write connectors. Data feeds expose the results of transforms to other dataflows
and data services. Data feed connectors are subscriptions to data feeds. Component groups
offer users a convenient way to organize components that have a similar purpose.
Create a Dataflow definition.
Parameters
- id: identifier for the Dataflow that is unique within its DataService
- name: user-friendly name
- description: brief description
- components: list of components
- data_feed_connectors: list of data feed connectors
- data_feeds: list of data feeds
- groups: list of component groups
Returns: a new Dataflow instance.
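A skeletal Dataflow definition, showing where each kind of resource goes (the lists are left empty here for brevity):
from ascend.sdk.definitions import Dataflow

dataflow = Dataflow(
    id="my_dataflow",
    name="My Dataflow",
    description="Ingests, cleans, and publishes event data",
    components=[],            # ReadConnector, Transform, and WriteConnector definitions
    data_feeds=[],            # DataFeed definitions
    data_feed_connectors=[],  # DataFeedConnector definitions
    groups=[],                # ComponentGroup definitions
)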
Definition()
Base class for all definitions.
ReadConnector(id: str, name: str, container: ascend.protos.io.io_pb2.Container, description: str = None, pattern: ascend.protos.pattern.pattern_pb2.Pattern = None, update_periodical: ascend.protos.core.core_pb2.Periodical = None, assigned_priority: ascend.protos.component.component_pb2.Priority = None, last_manual_refresh_time: google.protobuf.timestamp_pb2.Timestamp = None, aggregation_limit: google.protobuf.wrappers_pb2.Int64Value = None, bytes: ascend.protos.component.component_pb2.FromBytes = None, records: ascend.protos.component.component_pb2.FromRecords = None, compute_configurations: List[ascend.protos.expression.expression_pb2.StageComputeConfiguration] = None)
A read connector
Create a ReadConnector definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- container (proto): a typed representation of the source of data
- pattern (proto): defines the set of objects in the source that will be ingested
- update_periodical: specifies source update frequency
- assigned_priority: scheduling priority for the read connector
- last_manual_refresh_time:
- aggregation_limit:
- bytes: defines the operator for parsing bytes read from the source (example - json or xsv parsing)
- records: defines the schema for records read from a database or warehouse
- compute_configurations: custom configuration overrides by stage
Returns: a new ReadConnector instance.
Note - a read connector can have either bytes or records defined, but not both.
Transform(id: str, name: str, input_ids: List[str], operator: ascend.protos.operator.operator_pb2.Operator, description: str = None, assigned_priority: ascend.protos.component.component_pb2.Priority = None)
A transform
Create a Transform definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_ids: list of input component ids
- operator: transform operator definition - example - SQL/PySpark/Scala transform
- assigned_priority: scheduling priority for transform
Returns: a new Transform instance.
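For example, a transform with one upstream input. The operator message below is left empty as a placeholder - its exact fields come from the generated operator_pb2 schema, and a downloaded Dataflow definition is the easiest reference for how to populate it:
from ascend.protos.operator import operator_pb2
from ascend.sdk.definitions import Transform

operator = operator_pb2.Operator()  # placeholder - populate with the transform logic (e.g. a SQL query)

transform = Transform(
    id="clean_events",
    name="Clean events",
    input_ids=["read_raw_events"],  # ids of upstream components
    operator=operator,
)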
WriteConnector(id: str, name: str, input_id: str, container: ascend.protos.io.io_pb2.Container, description: str = None, assigned_priority: ascend.protos.component.component_pb2.Priority = None, bytes: ascend.protos.component.component_pb2.ToBytes = None, records: ascend.protos.component.component_pb2.ToRecords = None, compute_configurations: List[ascend.protos.expression.expression_pb2.StageComputeConfiguration] = None)
A write connector
Create a WriteConnector definition.
Parameters:
- id: identifier that is unique within Dataflow
- name: user-friendly name
- description: brief description
- input_id: input component id
- container (proto): a typed representation of the destination of data
- assigned_priority: scheduling priority of write connector
- bytes: defines the operator for formatting bytes written to the object store destination
- records: indicates that the destination is a structured record store like a database or a data warehouse
- compute_configurations: custom configuration overrides by stage
Returns: a new WriteConnector instance.
Note - a write connector can have either bytes or records defined, but not both.
Module applier
Ascend resource applier classes - used for applying resource definitions.
Classes
DataServiceApplier(client: ascend.sdk.client.Client)
DataServiceApplier is a utility class that accepts a DataService definition and
'applies' it - ensuring that the DataService is created if it does not already exist,
deleting any existing Dataflows, components, and members of the DataService that
are not part of the supplied definition, and applying any configuration changes needed
to match the definition.
Creates a new DataServiceApplier.
Methods
apply(self, data_service: ascend.sdk.definitions.DataService, delete=True, dry_run=False)
Create or update the specified DataService.
Parameters:
- data_service: DataService definition
- delete (optional): If set to True (which is the default) - delete any Dataflow that is not part of data_service. At the Dataflow level, remove any components that are not part of the Dataflow definition.
- dry_run (optional): If set to True (False by default) - skips any create or update operations.
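For example, previewing and then applying a minimal DataService definition (the ascend.sdk.applier import path is assumed here):
from ascend.sdk.applier import DataServiceApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import DataService

client = Client("myhost.ascend.io")
data_service = DataService(id="my_data_service", name="My Data Service")

applier = DataServiceApplier(client)
applier.apply(data_service, dry_run=True)   # report what would change without modifying anything
applier.apply(data_service, delete=False)   # apply, but keep Dataflows not listed in the definition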
DataflowApplier(client: ascend.sdk.client.Client)
DataflowApplier is a utility class that accepts a Dataflow definition and 'applies' it - ensuring that a Dataflow is created if it does not already exist and binding it to the DataService identified by data_service_id, deleting any components and members of the Dataflow that are not part of the supplied definition, and applying any configuration changes needed to match the definition.
Creates a new DataflowApplier.
Methods
apply(self, data_service_id: str, dataflow: ascend.sdk.definitions.Dataflow, delete=True, dry_run=False)
Accepts a Dataflow definition, and ensures that it is created, bound to the DataService identified by data_service_id, and has all of the constituent elements included in the Dataflow definition - components (read connectors, transforms, write connectors), component groups, data feeds, and data feed connectors. The specified DataService must already exist.
Parameters:
- data_service_id: DataService id
- dataflow: Dataflow definition
- delete: if set to True (default=True) - delete any components, data feeds, data feed connectors, or groups not defined in dataflow
- dry_run: if set to True (default=False) - skips any create or update operations
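For example, a dry-run apply of a minimal Dataflow definition (the ascend.sdk.applier import path is assumed):
from ascend.sdk.applier import DataflowApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import Dataflow

client = Client("myhost.ascend.io")
dataflow = Dataflow(id="my_dataflow", name="My Dataflow")

DataflowApplier(client).apply("my_data_service", dataflow, dry_run=True)  # preview only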
ComponentApplier.build(client: ascend.sdk.client.Client, data_service_id: str, dataflow_id: str)
ComponentApplier is a utility class that accepts a Component definition (ReadConnector, WriteConnector, Transform) and 'applies' it - ensuring that the Component is created if it does not already exist and binding it to the DataService identified by data_service_id and the Dataflow identified by dataflow_id, creating or applying any configuration changes needed to match the definition.
Creates a new ComponentApplier that has the apply function.
Methods
apply(self, data_service_id: str, dataflow_id: str, component: Component)
Accepts a Component definition, and ensures that it is created, bound to the DataService identified by data_service_id and the Dataflow identified by dataflow_id, and has all of the properties in the Component definition. The specified DataService and Dataflow must already exist.
Parameters:
- data_service_id: DataService id
- dataflow_id: Dataflow id
- component: Component Definition (one of ReadConnector, Transform, WriteConnector)
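For example, applying a single Transform (the ascend.sdk.applier import path and the empty placeholder operator are assumptions for illustration):
from ascend.protos.operator import operator_pb2
from ascend.sdk.applier import ComponentApplier
from ascend.sdk.client import Client
from ascend.sdk.definitions import Transform

client = Client("myhost.ascend.io")
transform = Transform(id="clean_events",
                      name="Clean events",
                      input_ids=["read_raw_events"],
                      operator=operator_pb2.Operator())  # placeholder operator

applier = ComponentApplier.build(client, "my_data_service", "my_dataflow")
applier.apply("my_data_service", "my_dataflow", transform)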
Module render
Utility functions to download Ascend resource definitions.
Functions
from ascend.sdk.render import download_data_service
download_data_service(client: ascend.sdk.client.Client, data_service_id: str, resource_base_path: str = '.')
Downloads a DataService and writes its definition to a file named {data_service_id}.py under resource_base_path. Inline code for Transforms and ReadConnectors is written as separate files to sub-folders - resource_base_path/{dataflow_id}/components/ - with the file name derived from the id of the component to which the code belongs. Creates resource_base_path if it does not exist.
Parameters:
- client: SDK client
- data_service_id: DataService id
- resource_base_path: base path to which DataService definition and resource files are written
- template_dir: directory that contains templates for rendering. Defaults to v1 template directory.
from ascend.sdk.render import download_dataflow
download_dataflow(client: ascend.sdk.client.Client, data_service_id: str, dataflow_id: str, resource_base_path: str = '.')
Downloads a Dataflow and writes its definition to a file named {dataflow_id}.py under resource_base_path. Inline code for Transforms and ReadConnectors is written as separate files to a sub-folder - resource_base_path/components/ - with the file names derived from the id of the component to which the code belongs. Creates resource_base_path if it does not exist.
Parameters:
- client: SDK client
- data_service_id: DataService id
- dataflow_id: Dataflow id
- resource_base_path: path to which Dataflow definition and resource files are written
- template_dir: directory that contains templates for rendering. Defaults to v1 template directory.
from ascend.sdk.render import download_component
download_component(client: ascend.sdk.client.Client, data_service_id: str, dataflow_id: str, component_id: str, resource_base_path: str = '.')
Downloads a Component and writes its definition to a file named {component_id}.py under resource_base_path. Inline code for PySpark and SQL Transforms is written as separate files in the same folder, with the file names derived from the id of the component to which the code belongs. Creates resource_base_path if it does not exist.
Parameters:
- client: SDK client
- data_service_id: DataService id
- dataflow_id: Dataflow id
- component_id: Component id
- resource_base_path: path to which the Component definition is written
- template_dir: directory that contains templates for rendering. Defaults to v1 template directory.
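For example, downloading a single component (ids are illustrative):
from ascend.sdk.client import Client
from ascend.sdk.render import download_component

client = Client("myhost.ascend.io")
download_component(client,
                   data_service_id="my_data_service",
                   dataflow_id="my_dataflow",
                   component_id="my_component",
                   resource_base_path="./")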
Classes
InlineCode(code: str, attribute_path: Tuple[str, ...], resource_path: str, base_path: str, base64_encoded: bool = False)
Represents the result of extraction of inline code from a component - such as
a sql statement or PySpark code. Contains all of the metadata needed to render
a component definition, with the inline code written to a separate file, and a reference
to this code stored in the component.
Parameters:
- code: inline code
- attribute_path: path of the attribute in the component definition that contains the inline code, represented as a tuple of path components. For instance, the path for sql code in a Transform is ("operator", "sql_query", "sql")
- resource_path: file path to which inline code is written
- base_path: base path of dataflow or data service resource definition
- base64_encoded: if set to True, inline code is base64 encoded
Module builder
Utility functions to create Ascend resource definitions from the corresponding protobuf
objects. The protobuf objects can be obtained via the 'imperative' API accessible through
the SDK client -
client.get_data_service(<data_service_id>)
client.get_dataflow(<dataflow_id>)
Functions
data_service_from_proto(client: ascend.sdk.client.Client, proto: ascend.protos.api.api_pb2.DataService) ‑> ascend.sdk.definitions.DataService
Build a DataService definition from the API DataService protobuf representation.
This is a utility function that helps bridge the declarative and imperative forms
of the SDK by transforming a DataService proto message to a DataService definition
that can be 'applied'.
Parameters:
- client: SDK client
- proto: DataService protobuf object
Returns: a DataService definition
dataflow_from_proto(client: ascend.sdk.client.Client, data_service_id: str, proto: ascend.protos.api.api_pb2.Dataflow) ‑> ascend.sdk.definitions.Dataflow
Build a Dataflow definition from the API Dataflow protobuf representation.
Parameters:
- client: SDK client
- data_service_id: DataService id
- proto: Dataflow protobuf object
Returns: a Dataflow definition
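For example, round-tripping a DataService from the imperative API into a definition that can be re-applied (the ascend.sdk.builder import path, and the assumption that client.get_data_service returns the proto directly, are illustrative):
from ascend.sdk.applier import DataServiceApplier
from ascend.sdk.builder import data_service_from_proto
from ascend.sdk.client import Client

client = Client("myhost.ascend.io")
data_service_proto = client.get_data_service("my_data_service")  # assumed to return the DataService proto
data_service = data_service_from_proto(client, data_service_proto)

# The resulting definition can be tweaked and re-applied.
DataServiceApplier(client).apply(data_service, dry_run=True)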