Cluster executors can also be referred to as cluster workers. If you’re familiar with Hadoop MapReduce, you can think of the driver as the master node that divides a job into tasks and allocates those tasks to the worker nodes. Executors are responsible for executing tasks and reporting the results back to the driver.
You can see the difference between executors and the driver in the summary below.
Getting Number of Cluster Cores Within Spark Application
Note: I will be using Scala
// sparkSn is the active SparkSession
val processor = Runtime.getRuntime.availableProcessors() *
  (sparkSn.sparkContext.statusTracker.getExecutorInfos.length - 1)
The availableProcessors() call returns the number of cores available to the JVM it runs in, which here is the driver. You can treat it as the number of cores per executor, provided the driver and the workers are provisioned with the same number of cores.
The (sparkSn.sparkContext.statusTracker.getExecutorInfos.length - 1) expression is the total number of nodes minus the driver: getExecutorInfos includes the driver itself, so subtracting 1 leaves only the executors. For example, if you want to repartition based on the number of cores, you want to use just the executor cores, since the executors are the ones that execute the tasks.
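Putting the pieces together, here is a minimal sketch of how the computation above could be used end to end. The application name, the sample DataFrame, and the object name are hypothetical; it assumes a cluster where the driver and executors have the same number of cores.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone sketch: compute the total executor cores
// and repartition a DataFrame accordingly.
object RepartitionByCores {
  def main(args: Array[String]): Unit = {
    val sparkSn = SparkSession.builder()
      .appName("repartition-by-cores") // assumed name
      .getOrCreate()

    // Cores visible to the driver JVM; assumes the driver and the
    // executors are provisioned with the same number of cores.
    val coresPerNode = Runtime.getRuntime.availableProcessors()

    // getExecutorInfos includes the driver, so subtract 1
    // to count only the executors.
    val numExecutors =
      sparkSn.sparkContext.statusTracker.getExecutorInfos.length - 1

    val totalExecutorCores = coresPerNode * numExecutors

    // Repartition sample data so each executor core gets a partition.
    val df = sparkSn.range(0, 1000000).toDF("id")
    val repartitioned = df.repartition(totalExecutorCores)
    println(s"Partitions: ${repartitioned.rdd.getNumPartitions}")

    sparkSn.stop()
  }
}
```

Matching the partition count to the total executor cores is a common starting heuristic; workloads with skewed or large partitions often use a small multiple of that number instead.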
Difference Between Executor and Driver
- Cluster executors are worker nodes within a distributed computing cluster. They are responsible for executing tasks in parallel on the data distributed across the cluster.
- Each executor typically runs on a separate machine or container within the cluster.
- Executors are responsible for processing data, running computations, and performing tasks such as map and reduce operations in parallel.
- They communicate with the driver (the master node) to receive instructions and report the results of their computations.
- Executors are typically numerous and can scale horizontally based on the size of the cluster and the computational needs of the job.
- The driver is the central control node of a distributed computing application.
- It is responsible for coordinating the entire job or application, including distributing tasks to the cluster executors, monitoring progress, and aggregating results.
- The driver initiates the execution of a distributed job, divides it into tasks, and schedules those tasks to be executed by the cluster executors.
- It collects and processes the results from the executors, making high-level decisions about the job’s flow and handling failures or retries.
- In many distributed frameworks, the driver also manages the job’s configuration and can interact with external data sources or storage systems.
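The division of labor described above can be sketched in a short Spark program. The object name and the sample data are hypothetical; the comments mark which part runs on the driver and which runs on the executors.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the driver/executor split described above.
object DriverExecutorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("driver-executor-demo") // assumed name
      .getOrCreate()
    val sc = spark.sparkContext

    // Runs on the driver: defines the job and divides the data
    // into partitions, which become tasks.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // The function inside map() is serialized and shipped to the
    // executors, which run it in parallel on their partitions.
    val squared = rdd.map(x => x * x)

    // collect() pulls the executors' results back to the driver,
    // which aggregates them into a local array.
    val result = squared.collect()
    println(s"Sum of squares on the driver: ${result.sum}")

    spark.stop()
  }
}
```

Note that collect() brings all results into driver memory, so it is only safe for small result sets; the driver's coordinating role does not mean it should hold the full dataset.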