Pre-defined Spark Cluster Pool Sizes

👍

You've found a new feature! 👀

Pre-defined Spark Cluster sizing is available for all Gen2 users. Let us know if you have questions or comments in the chat box to your right! 👉

Selecting the Best Spark Cluster Size for Your Needs

Introduction

Ascend.io offers a flexible Spark cluster management feature that allows users to select an optimal cluster size based on their computational needs and budget. The platform provides a range of pre-defined cluster sizes, as well as the ability to manually adjust cluster settings for more granular control.

When to Use Spark Cluster Pools

Use Spark Cluster Pools whenever you need scalable, efficient data processing. They are particularly useful for workloads that require different amounts of resources at different times.

Pre-defined Spark Cluster Pool Sizes

📘

The sizes available for selection depend on your environment type/tier. Not all sizes are offered for every environment type/tier; you will only see the sizes your environment supports.

Ascend.io provides a set of pre-defined sizes to simplify the selection process. Below are the available size ranges, along with the workloads they suit best:

  • 3XSmall to XSmall: These sizes are ideal for small-scale processing tasks or development work where minimal resources are required.

  • Small to Medium: Suitable for moderate workloads, allowing for some parallel processing while still being cost-efficient.

  • Large to 4XLarge: These sizes cater to large-scale, resource-intensive jobs that require high parallelism and are time-sensitive.

Each size has a designated number of driver and executor CPUs, minimum and maximum executor counts, and ephemeral volume sizes, which are outlined in the table below.

| Size    | Driver CPUs | Executor CPUs | Minimum Executors | Maximum Executors | Ephemeral Driver Volume Size | Ephemeral Executor Volume Size |
|---------|-------------|---------------|-------------------|-------------------|------------------------------|--------------------------------|
| 3XSmall | 3           | 0             | 0                 | 0                 | 96                           | 0                              |
| 2XSmall | 7           | 0             | 0                 | 0                 | 224                          | 0                              |
| XSmall  | 15          | 0             | 0                 | 0                 | 480                          | 0                              |
| Small   | 3           | 15            | 2                 | 2                 | 96                           | 480                            |
| Medium  | 7           | 15            | 4                 | 4                 | 224                          | 480                            |
| Large   | 15          | 15            | 7                 | 7                 | 480                          | 480                            |
| XLarge  | 15          | 15            | 15                | 15                | 480                          | 480                            |
| 2XLarge | 15          | 15            | 30                | 30                | 480                          | 480                            |
| 3XLarge | 30          | 15            | 62                | 62                | 480                          | 480                            |
| 4XLarge | 60          | 15            | 124               | 124               | 480                          | 480                            |
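
A quick way to compare sizes is to compute the total CPU capacity a pool provides when fully scaled out: driver CPUs plus executor CPUs multiplied by the maximum executor count. The short sketch below performs this calculation by restating the table rows; it is purely illustrative and does not use any Ascend.io API.

```python
# Illustrative only: total CPU capacity of each pre-defined size at maximum scale.
# The SIZES mapping restates the table above; it is not an Ascend.io API.
SIZES = {
    # size: (driver_cpus, executor_cpus, min_executors, max_executors)
    "3XSmall": (3, 0, 0, 0),
    "2XSmall": (7, 0, 0, 0),
    "XSmall":  (15, 0, 0, 0),
    "Small":   (3, 15, 2, 2),
    "Medium":  (7, 15, 4, 4),
    "Large":   (15, 15, 7, 7),
    "XLarge":  (15, 15, 15, 15),
    "2XLarge": (15, 15, 30, 30),
    "3XLarge": (30, 15, 62, 62),
    "4XLarge": (60, 15, 124, 124),
}

for size, (driver_cpus, executor_cpus, _min_ex, max_ex) in SIZES.items():
    # Total CPUs when the pool is fully scaled out.
    total_cpus = driver_cpus + executor_cpus * max_ex
    print(f"{size:>8}: {total_cpus} CPUs at maximum scale")
```

For example, a Large pool tops out at 15 + 15 × 7 = 120 CPUs, while a 4XLarge reaches 60 + 15 × 124 = 1,920 CPUs. Note that 3XSmall through XSmall run with zero executors, so they are best suited to jobs small enough to run entirely on the driver.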

Choosing the Right Cluster Size

The choice of the right Spark cluster size depends on several factors including the size of the data, the complexity of the jobs, the expected execution time, and cost considerations. In a nutshell, here's how to approach the selection:

  1. Start Small: Begin with a smaller cluster size, like 2XSmall, to understand the performance and cost implications of your jobs.

  2. Monitor Performance: Use the Observe features, such as Billing, Cluster Pool Utilization (new dashboard coming soon!), and Jobs Usage, to track speed and performance. Look for metrics like running vs. queued tasks, execution time, and driver CPU percentage.

  3. Scale Accordingly: If jobs are taking too long or CPU usage is consistently climbing, consider scaling up to a larger cluster size (see the sketch after this list).

  4. Consider Data Profile: If you're performing a large backfill and understand the performance profile of your data, you may preemptively choose a larger cluster size.

  5. Adjust Before Launch: Make adjustments to the cluster size before launching jobs. Once a job is sent, the size can only be adjusted for future jobs, not current ones.
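
To make steps 2 and 3 more concrete, here is a minimal sketch of what an automated scale-up check could look like if you collect metrics such as queued tasks, running tasks, and driver CPU percentage. The metric fields, thresholds, and the `recommend_next_size` helper are hypothetical illustrations, not part of the Ascend.io platform.

```python
# Hypothetical sketch of a scale-up check based on the metrics named above.
# The thresholds and metric fields are illustrative, not Ascend.io APIs.
from dataclasses import dataclass

# Ordered smallest to largest, matching the table above.
SIZE_ORDER = ["3XSmall", "2XSmall", "XSmall", "Small", "Medium",
              "Large", "XLarge", "2XLarge", "3XLarge", "4XLarge"]

@dataclass
class PoolMetrics:
    queued_tasks: int        # tasks waiting for an executor slot
    running_tasks: int       # tasks currently executing
    driver_cpu_pct: float    # driver CPU utilization, 0-100

def recommend_next_size(current_size: str, metrics: PoolMetrics) -> str:
    """Suggest the next size up when the pool looks saturated."""
    saturated = (
        metrics.queued_tasks > metrics.running_tasks  # work is piling up
        or metrics.driver_cpu_pct > 85                # driver is the bottleneck
    )
    if not saturated:
        return current_size
    idx = SIZE_ORDER.index(current_size)
    return SIZE_ORDER[min(idx + 1, len(SIZE_ORDER) - 1)]

# Example: a Medium pool with a long queue would be bumped to Large.
print(recommend_next_size("Medium", PoolMetrics(queued_tasks=40,
                                                running_tasks=4,
                                                driver_cpu_pct=60.0)))
```

As step 5 notes, any size change made this way applies only to jobs launched afterward.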

Optimizing Cluster Selection

The new sizing system complements the existing manual adjustment capabilities by providing a convenient starting point for cluster sizing. By understanding your application's needs and closely monitoring performance, you can leverage the Spark Cluster Pools to achieve an optimal balance between performance and cost.