Pre-defined Spark Cluster Pool Sizes

👍

You've found a new feature! 👀

Pre-defined Spark Cluster sizing is available for all Gen2 users. Let us know if you have questions or comments in the chat box to your right! 👉

Selecting the Best Spark Cluster Size for Your Needs

Introduction

Ascend.io offers a flexible Spark cluster management feature that allows users to select an optimal cluster size based on their computational needs and budget. The platform provides a range of pre-defined cluster sizes, as well as the ability to manually adjust cluster settings for more granular control.

When to Use Spark Cluster Pools

Use Spark Cluster Pools whenever you need scalable, efficient data processing. They are particularly useful for workloads that require different amounts of resources at different times.

Pre-defined Spark Cluster Pool Sizes

📘

The sizes available for selection depend on your environment type/tier. Not all sizes are offered for every environment type/tier; you will only see the sizes your environment supports.

Ascend.io provides a set of pre-defined sizes to simplify the selection process. Below are the available size ranges, along with the workloads they suit best:

  • 3XSmall to XSmall: These sizes are ideal for small-scale processing tasks or development work where minimal resources are required.

  • Small to Medium: Suitable for moderate workloads, allowing for some parallel processing while still being cost-efficient.

  • Large to 4XLarge: These sizes cater to large-scale, resource-intensive jobs that require high parallelism and are time-sensitive.

Each size has a designated number of driver and executor CPUs, minimum and maximum executor counts, and ephemeral volume sizes, which are outlined in the table below.

| Size    | Driver CPUs | Executor CPUs | Minimum Executors | Maximum Executors | Ephemeral Driver Volume Size | Ephemeral Executor Volume Size |
|---------|-------------|---------------|-------------------|-------------------|------------------------------|--------------------------------|
| 3XSmall | 3           | 0             | 0                 | 0                 | 96                           | 0                              |
| 2XSmall | 7           | 0             | 0                 | 0                 | 224                          | 0                              |
| XSmall  | 15          | 0             | 0                 | 0                 | 480                          | 0                              |
| Small   | 3           | 15            | 2                 | 2                 | 96                           | 480                            |
| Medium  | 7           | 15            | 4                 | 4                 | 224                          | 480                            |
| Large   | 15          | 15            | 7                 | 7                 | 480                          | 480                            |
| XLarge  | 15          | 15            | 15                | 15                | 480                          | 480                            |
| 2XLarge | 15          | 15            | 30                | 30                | 480                          | 480                            |
| 3XLarge | 30          | 15            | 62                | 62                | 480                          | 480                            |
| 4XLarge | 60          | 15            | 124               | 124               | 480                          | 480                            |
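
A quick way to compare sizes is to compute the total CPU capacity a pool provides when fully scaled out: driver CPUs plus executor CPUs multiplied by the maximum executor count. The short sketch below performs this calculation by restating the table rows; it is purely illustrative and does not use any Ascend.io API.

```python
# Illustrative only: total CPU capacity of each pre-defined size at maximum scale.
# The SIZES mapping restates the table above; it is not an Ascend.io API.
SIZES = {
    # size: (driver_cpus, executor_cpus, min_executors, max_executors)
    "3XSmall": (3, 0, 0, 0),
    "2XSmall": (7, 0, 0, 0),
    "XSmall":  (15, 0, 0, 0),
    "Small":   (3, 15, 2, 2),
    "Medium":  (7, 15, 4, 4),
    "Large":   (15, 15, 7, 7),
    "XLarge":  (15, 15, 15, 15),
    "2XLarge": (15, 15, 30, 30),
    "3XLarge": (30, 15, 62, 62),
    "4XLarge": (60, 15, 124, 124),
}

for size, (driver_cpus, executor_cpus, _min_ex, max_ex) in SIZES.items():
    # Total CPUs when the pool is fully scaled out.
    total_cpus = driver_cpus + executor_cpus * max_ex
    print(f"{size:>8}: {total_cpus} CPUs at maximum scale")
```

For example, a Large pool tops out at 15 + 15 × 7 = 120 CPUs, while a 4XLarge reaches 60 + 15 × 124 = 1,920 CPUs. Note that 3XSmall through XSmall run with zero executors, so they are best suited to jobs small enough to run entirely on the driver.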

Choosing the Right Cluster Size

The choice of the right Spark cluster size depends on several factors including the size of the data, the complexity of the jobs, the expected execution time, and cost considerations. In a nutshell, here's how to approach the selection:

  1. Start Small: Begin with a smaller cluster size, like 2XSmall, to understand the performance and cost implications of your jobs.

  2. Monitor Performance: Use the Observe features, such as Billing, Cluster Pool Utilization (new dashboard coming soon!), and Jobs Usage, to track speed and performance. Look for metrics like running vs. queued tasks, execution time, and driver CPU percentage.

  3. Scale Accordingly: If jobs are taking too long or CPU usage is consistently climbing, consider scaling up to a larger cluster size (see the sketch after this list).

  4. Consider Data Profile: If you're performing a large backfill and understand the performance profile of your data, you may preemptively choose a larger cluster size.

  5. Adjust Before Launch: Make adjustments to the cluster size before launching jobs. Once a job is sent, the size can only be adjusted for future jobs, not current ones.
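
To make steps 2 and 3 more concrete, here is a minimal sketch of what an automated scale-up check could look like if you collect metrics such as queued tasks, running tasks, and driver CPU percentage. The metric fields, thresholds, and the `recommend_next_size` helper are hypothetical illustrations, not part of the Ascend.io platform.

```python
# Hypothetical sketch of a scale-up check based on the metrics named above.
# The thresholds and metric fields are illustrative, not Ascend.io APIs.
from dataclasses import dataclass

# Ordered smallest to largest, matching the table above.
SIZE_ORDER = ["3XSmall", "2XSmall", "XSmall", "Small", "Medium",
              "Large", "XLarge", "2XLarge", "3XLarge", "4XLarge"]

@dataclass
class PoolMetrics:
    queued_tasks: int        # tasks waiting for an executor slot
    running_tasks: int       # tasks currently executing
    driver_cpu_pct: float    # driver CPU utilization, 0-100

def recommend_next_size(current_size: str, metrics: PoolMetrics) -> str:
    """Suggest the next size up when the pool looks saturated."""
    saturated = (
        metrics.queued_tasks > metrics.running_tasks  # work is piling up
        or metrics.driver_cpu_pct > 85                # driver is the bottleneck
    )
    if not saturated:
        return current_size
    idx = SIZE_ORDER.index(current_size)
    return SIZE_ORDER[min(idx + 1, len(SIZE_ORDER) - 1)]

# Example: a Medium pool with a long queue would be bumped to Large.
print(recommend_next_size("Medium", PoolMetrics(queued_tasks=40,
                                                running_tasks=4,
                                                driver_cpu_pct=60.0)))
```

As step 5 notes, any size change made this way applies only to jobs launched afterward.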

Optimizing Cluster Selection

The new sizing system complements the existing manual adjustment capabilities by providing a convenient starting point for cluster sizing. By understanding your application's needs and closely monitoring performance, you can leverage the Spark Cluster Pools to achieve an optimal balance between performance and cost.