Deployment modes and job submission in Apache Spark

There are several ways to submit an application in Spark. In addition to the client and cluster modes of execution, there is also a local mode for submitting a Spark job. We must understand these modes before we start running our jobs. Before we jump in, let us recall a few important things we learned in the previous lesson (see Introduction to Apache Spark to know more).


Spark is a scheduling, monitoring, and distribution engine, i.e. Spark is not only a processing engine; it can also act as a resource manager for the jobs submitted to it. Spark can run by itself (Standalone) using its own cluster manager, and it can also run on top of other cluster/resource managers.

How does Spark support different cluster managers, and why?

This is made possible by the SparkContext object, which lives in the main driver program of a Spark application. The SparkContext can connect to several types of cluster managers, which enables Spark to run on top of other cluster manager frameworks like YARN or Mesos. It is this object that coordinates the independent sets of processes that make up the application on the cluster.
Figure: Spark cluster components (Spark driver and workers).
Source: https://spark.apache.org/docs/latest/cluster-overview.html
A Spark installation can be launched in three different ways:
        1.   Local (pseudo-cluster mode)
        2.   Standalone (cluster with Spark's default cluster manager)
        3.   On top of another cluster manager (cluster with YARN, Mesos, or Kubernetes as cluster manager)


Local:-

Local mode is a pseudo-cluster mode generally used for testing and demonstration. In local mode, all the execution components run on a single node.


Standalone: - 

In Standalone mode, the default cluster manager that ships with the official Apache Spark distribution handles resource and cluster management for Spark jobs. It has a Standalone master for resource management and Standalone workers that run the tasks.

Do not get confused here: Standalone mode does not mean a single-node Spark deployment. It is also a cluster deployment of Spark; the only difference is that the cluster is managed by Spark itself.


On top of other Cluster Manager: -

Apache Spark can also run on other cluster managers like YARN, Mesos, or Kubernetes. In industry, YARN is a very common choice of cluster manager for Spark, largely because it integrates well with HDFS and brings benefits such as data locality.


The command used to submit a Spark job is the same in Standalone mode and on other cluster managers.
Scala Spark:

spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other Spark properties and options
  <application-jar> \
  [application-arguments]

PySpark:

spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other Spark properties and options
  --py-files <python-modules-jars> \
  my_application.py \
  [application-arguments]
Table 1: Spark-submit command in Scala and Python

For Python applications, simply pass the .py file in place of <application-jar>, and add Python dependencies such as .zip, .egg, or .py files with --py-files.
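
To make the shape of the command concrete, here is a minimal Python sketch that assembles a spark-submit argument list from its parts. The helper name build_spark_submit, its parameters, and the file names deps.zip and helpers.py are assumptions made for illustration; they are not part of Spark. Only flags that appear in Table 1 are used.

```python
import shlex
from typing import Dict, List, Optional, Sequence


def build_spark_submit(
    application: str,
    master: str,
    deploy_mode: str = "client",
    main_class: Optional[str] = None,
    conf: Optional[Dict[str, str]] = None,
    py_files: Sequence[str] = (),
    app_args: Sequence[str] = (),
) -> List[str]:
    """Return the argument list for a spark-submit call (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:  # only JVM (Scala/Java) applications need --class
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    if py_files:  # Python dependencies go in one comma-separated value
        cmd += ["--py-files", ",".join(py_files)]
    # Options come before the application file; everything after the
    # application file is handed to the application itself.
    cmd.append(application)
    cmd += list(app_args)
    return cmd


scala_cmd = build_spark_submit(
    "/path/to/examples.jar",
    master="yarn",
    deploy_mode="cluster",
    main_class="main_class",
    conf={"spark.executor.memory": "10g"},
)
py_cmd = build_spark_submit(
    "my_job.py",
    master="local[8]",
    py_files=["deps.zip", "helpers.py"],  # hypothetical dependency files
    app_args=["1000"],
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(scala_cmd))
print(shlex.join(py_cmd))
```

Keeping the command as a list rather than one long string avoids quoting mistakes, and the same list could be handed to subprocess.run if you wanted to launch the job from Python.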

How to submit a Spark job on a Standalone cluster vs. a cluster managed by another cluster manager?

The answer to the above question is very simple. Use the "--master" option shown in the spark-submit command above and pass the master URL of the cluster, e.g.
Mode                          Value of "--master"
Standalone deployment mode    --master spark://HOST:PORT
Mesos                         --master mesos://HOST:PORT
YARN                          --master yarn
Local                         --master local[N]  (N = number of worker threads; local[*] uses as many threads as there are logical cores)
Table 2: Spark-submit "--master" for different Spark deployment modes
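
The mapping in Table 2 can be sketched as a small Python function that reports which cluster manager a given --master value selects. The function name cluster_manager_for is hypothetical, and the k8s:// prefix for Kubernetes comes from the Spark documentation rather than from the table above. The sketch only covers the forms listed here, not every master URL variant Spark accepts.

```python
import re


def cluster_manager_for(master: str) -> str:
    """Name the cluster manager implied by a --master value (illustrative)."""
    # local, local[N], and local[*] all run Spark on a single machine.
    if master == "local" or re.fullmatch(r"local\[(\d+|\*)\]", master):
        return "local"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"
    if master.startswith("k8s://"):  # not in Table 2; see the Spark docs
        return "kubernetes"
    raise ValueError(f"unrecognized master value: {master!r}")


for value in ["local[*]", "spark://HOST:7077", "mesos://HOST:5050", "yarn"]:
    print(value, "->", cluster_manager_for(value))
```

Note that "yarn" carries no host or port: the cluster location is read from the Hadoop configuration on the submitting machine instead of from the master value.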

When you submit a job, the application JAR (the code you have written for the job) is distributed to all worker nodes, along with any additional JAR files you specify (for example with --jars).

We have talked enough about cluster deployment; now we need to understand the application "--deploy-mode". The modes discussed so far are cluster deployment modes, which are different from the "--deploy-mode" option in the spark-submit command (Table 1). --deploy-mode is the application (or driver) deploy mode, and it tells Spark how to run the job on the cluster (which, as already mentioned, can be Standalone, YARN, or Mesos). An application (Spark job) can run on a cluster in one of two deploy modes: client or cluster.

Spark Application deploy modes: -

Cluster: - When the driver runs inside the cluster, it is cluster deploy mode. In this case the Resource Manager or Master decides which node the driver will run on.

Client: - In client mode the driver runs on the machine from which the job is submitted.

Now the question arises:

"How to submit a job in Cluster or Client mode, and which is better?"



How to submit:-

In the spark-submit command above, just pass "--deploy-mode client" for client mode and "--deploy-mode cluster" for cluster mode.

 

Which one is better, Client or Cluster mode:

In client mode, unlike cluster mode, the job fails if the client machine is disconnected. Client mode is a good fit if you want to work with Spark interactively, or if you do not want the driver to use up any resources on your cluster. When you deal with huge data sets and call actions on RDDs or DataFrames, make sure the client has sufficient resources available, since the results of actions are returned to the driver, which runs on the client. Neither deploy mode is strictly better than the other; choose whichever suits the requirements of your application.

Client: The driver runs on the machine from which the job is submitted.
Cluster: The driver runs inside the cluster. The Resource Manager or Master decides which node the driver will run on.

Client: The job fails if the client machine is disconnected.
Cluster: The client can disconnect after submitting the job.

Client: Can be used to work with Spark interactively. Performing actions on an RDD or DataFrame (like count) and capturing the results in logs is easy.
Cluster: Cannot be used to work with Spark interactively.

Client: JARs can be accessed from the client machine.
Cluster: Since the driver runs on a different machine than the client, JARs that exist only on the local machine will not work. They must be made available to all nodes, either by placing them on each node or by passing them with --jars or --py-files during spark-submit.

Client (YARN): The Spark driver does not run on the YARN cluster; only the executors run inside the YARN cluster.
Cluster (YARN): Both the Spark driver and the executors run on the YARN cluster.

Client (YARN): The local directory used by the driver is spark.local.dir; for the executors it is the YARN config yarn.nodemanager.local-dirs.
Cluster (YARN): The local directories used by both the Spark executors and the Spark driver are the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs).
Table 3: Spark Client Vs Cluster Mode
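
The trade-offs in Table 3 can be condensed into a rough decision helper. This is only a sketch of the reasoning above, not a rule from Spark itself: the function name choose_deploy_mode and its two yes/no inputs are assumptions made for illustration.

```python
def choose_deploy_mode(interactive: bool, client_may_disconnect: bool) -> str:
    """Suggest a --deploy-mode value from two questions about the job."""
    if interactive and client_may_disconnect:
        # Interactive work needs the driver on the client, and a job with
        # a client-side driver fails when the client disconnects.
        raise ValueError("interactive work needs the client to stay connected")
    if interactive:
        return "client"
    if client_may_disconnect:
        return "cluster"
    # Either mode would work here; fall back to spark-submit's default.
    return "client"


shell_session = choose_deploy_mode(interactive=True, client_may_disconnect=False)
nightly_batch = choose_deploy_mode(interactive=False, client_may_disconnect=True)
print(shell_session, nightly_batch)
```

In practice you would also weigh the other rows of the table, such as where your JARs live and how much memory the client machine has for the driver.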

Here are some examples of submitting a Spark job in different modes:
Local (Scala):

./bin/spark-submit \
  --class main_class \
  --master local[8] \
  /path/to/examples.jar

Local (PySpark):

./bin/spark-submit \
  --master local[8] \
  my_job.py

Spark Standalone (Scala):

./bin/spark-submit \
  --class main_class \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  /path/to/examples.jar

Spark Standalone (PySpark):

# Standalone mode does not support cluster deploy mode for Python
# applications, so this example runs in client mode without --supervise.
./bin/spark-submit \
  --master spark://<ip-address>:7077 \
  --deploy-mode client \
  --executor-memory 10G \
  --total-executor-cores 100 \
  --py-files <python-modules-jars> \
  my_job.py

YARN cluster mode (Scala):

# Use --deploy-mode client instead for client mode.
./bin/spark-submit \
  --class main_class \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  /path/to/examples.jar

YARN cluster mode (PySpark):

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  --py-files <python-modules-jars> \
  my_job.py

Table 4: Spark-submit examples for different modes


