Apache Spark is a fundamental part of Ascend, powering various operations including Spark SQL/PySpark transformations, schema inference, data ingestion via Read Connectors, writing data to sinks with Write Connectors, and previewing records from different sources.
Previously, Ascend utilized different backend services for each operation, each launching one or more Spark jobs. Now, all Spark workloads are run on consolidated, multi-purpose Spark clusters, which users can configure. These clusters are grouped into an elastic pool, known as a Spark Cluster Pool. Each Spark Cluster within the pool can run multiple jobs/tasks concurrently.
You have the flexibility to adjust various settings for Spark Cluster Pools. This includes:
- Setting the number of executors
- Adjusting the number of driver and executor CPUs
- Defining the maximum number of clusters in a pool
- Runtime settings: spark version, custom image
For example, a default Cluster Pool might have 16 executors, 5 executor CPUs, 10 driver CPUs, and a maximum of 2 clusters. At peak load, each cluster could utilize up to 6 Standard nodes, meaning the entire Cluster Pool could use up to 12 nodes.
You also have the ability to over-subscribe their cloud compute resources (Node pool) by configuring their Cluster Pools in a way that not all jobs can be scheduled. Over time, Ascend may introduce additional guardrails or warnings to better manage this aspect.
Spark Cluster Pools are demand-driven, scaling up or down based on pipeline activity or user activity. A Cluster Pool will scale down if clusters within the pool have been idle for longer than the 'Cluster Terminate After (minutes)' value.
An exception is when the 'Min Clusters in Pool' is set to
1. In this case, there will always be 1 Spark Cluster running in the Cluster Pool, regardless of activity. However, the number of executors in this "always-on" cluster will still scale up and down based on usage.
Cluster Pools can be assigned to a Data Service. When assigned, all Spark processing in that Data Service is executed on clusters within that Cluster Pool. If no specific Cluster Pool is assigned, the default Cluster Pool is used.
The Default Cluster Pool is the default processing cluster pool used by a Data Service and applies to any Dataflow created within that Service. Once a Dataflow is created, you can then change the default cluster pool of that Dataflow within the Data Plane Configuration.
The spark cluster pool to use for Read Connector activity in a Data Service. This pool will be used for data ingestion workloads.
The spark cluster pool to use for interactive workloads in this Data Service. Interactive work is work Ascend does to process your data such as generating schema and running data quality checks.
Updated 27 days ago