Jupyter notebooks offer a web-based interactive development environment for coding and accessing data. They support a wide range of workflows in data science, scientific computing, and machine learning. Integrating Jupyter notebooks with Ascend is a natural extension, and there are two primary ways to use them to access data on the Ascend platform:
- External data access via the Records API (using the Python SDK)
- External data access via the Bytes API
The Records API is provided via the Ascend Python SDK, a library of methods for externally accessing the various components and their data partitions within the Ascend environment. Please note: the Ascend Python SDK is based on Python 3.
Using the Python SDK, you can navigate and access Dataflows within Ascend from a Jupyter notebook, and stream records from any component within a Dataflow into the notebook for additional processing.
The Ascend SDK for Python, including documentation on installation, authorization, and usage, can be found here. With this SDK you can read data from Ascend Components and Data Feeds, and examine Dataflow metadata.
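As a rough illustration of the "stream records into the notebook for additional processing" step, the sketch below consumes a record stream in fixed-size batches so the notebook never has to hold the whole stream in memory. It does not call the Ascend SDK: `fake_record_stream` is a hypothetical stand-in for whatever iterator the SDK returns, and the assumption that records arrive as dictionaries is ours. Refer to the SDK documentation for the actual calls.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def batched(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield lists of up to `size` records without loading the whole stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_record_stream(n: int) -> Iterator[Record]:
    """Hypothetical stand-in for a record stream returned by the SDK."""
    for i in range(n):
        yield {"id": i, "value": i * i}


# Process five records two at a time; the last batch is shorter.
batch_sizes = [len(batch) for batch in batched(fake_record_stream(5), size=2)]
print(batch_sizes)  # [2, 2, 1]
```

Each batch is a plain list of dictionaries, so it can be passed to further processing code one batch at a time.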
The Bytes API is a high-speed, byte-level API for externally accessing data directly within Ascend via an S3-compatible interface. The data is formatted as Parquet, which makes this API most useful in environments that have separate Spark infrastructure or applications that can natively ingest Parquet.
Using the Bytes API, you can access Parquet data directly from Ascend in a Jupyter notebook configured with Spark, and work with it natively as Spark dataframes. This gives you an interface that is more flexible than SQL, along with a secure, high-bandwidth path for processing fragment data with custom code. For example, a data scientist can read data from Ascend into a Jupyter notebook and perform further analysis via a dataframe in Spark or even Pandas. Other use cases for the Bytes API include:
- Any external Spark infrastructure, such as Databricks Spark SQL, TensorFlow, or Delta Lake.
- Any application that can natively work with Parquet-formatted data, such as Presto.
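Because the Bytes API is exposed through an S3-compatible interface, one plausible way to reach it from Spark is through Hadoop's S3A connector. The following is a minimal sketch under that assumption, not Ascend's documented setup: the endpoint URL, the credentials, and the `s3a://` path are placeholders, `s3a_options` and `read_parquet` are hypothetical helper names, and whether path-style access is required depends on the service.

```python
from typing import Dict


def s3a_options(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Build Spark settings that point Hadoop's S3A connector at an S3-compatible endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Assumption: many S3-compatible services expect path-style URLs.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def read_parquet(path: str, options: Dict[str, str]):
    """Read a Parquet path into a Spark DataFrame (needs pyspark and the hadoop-aws package)."""
    # Imported lazily so s3a_options can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("bytes-api-sketch")
    for key, value in options.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    return spark.read.parquet(path)


# Placeholder endpoint and keys; substitute your own service account values.
options = s3a_options("https://s3.example.invalid", "ACCESS_KEY_ID", "SECRET_KEY")
# df = read_parquet("s3a://<bucket>/<path>", options)  # placeholder path, left commented out
```

Keeping the settings in a plain dictionary makes them easy to inspect in a notebook cell before a Spark session is created.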
Before you can use either the Records API (through the Ascend Python SDK) or the Bytes API, you need to create a service account. Your notebook uses this service account to securely access components and data within your Ascend environment. See the Service Accounts documentation for more information on creating and managing service accounts. Once you have created your service account and its keys, you can proceed to the examples below.
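One common way to keep service account keys out of a notebook's source is to read them from environment variables. The sketch below assumes that approach; the variable names `ASCEND_ACCESS_ID` and `ASCEND_SECRET_KEY` and the `ServiceAccountKeys` and `load_keys` helpers are hypothetical and are not defined by Ascend.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceAccountKeys:
    access_id: str
    # Excluded from repr so the secret is not echoed in notebook output.
    secret_key: str = field(repr=False)


def load_keys(env: Optional[Mapping[str, str]] = None) -> ServiceAccountKeys:
    """Read the key pair from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    names = ("ASCEND_ACCESS_ID", "ASCEND_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ServiceAccountKeys(env["ASCEND_ACCESS_ID"], env["ASCEND_SECRET_KEY"])


# Demonstration with an explicit mapping instead of the real environment.
keys = load_keys({"ASCEND_ACCESS_ID": "demo-id", "ASCEND_SECRET_KEY": "demo-secret"})
print(keys)  # ServiceAccountKeys(access_id='demo-id')
```

In a real notebook you would call `load_keys()` with no arguments after exporting the two variables in the environment that launches Jupyter.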
Data Feeds - This example shows how to:
- List available Data Feeds
- Connect to a Data Feed
- Stream Records
- Connect to a Data Service
- List the Dataflows in a Data Service
- Get a Dataflow from a Data Service
- List Components in a Dataflow
- Get a Component from a Dataflow
- Stream the Records Generated by a Component
Here is a demonstration of a notebook that uses Ascend's Structured Data Lake to access component data with PySpark.
Ascend Structured Data Lake from PySpark - This example shows how to:
- Read data from a data feed
- Read data from a component