This application deploys a Spark standalone cluster.
Figure: Apache Spark standalone cluster architecture in client mode.
The cluster architecture comprises one master node (node1), which acts as the cluster manager, and one or more worker nodes. By default, one worker always runs on node1.
The cluster resources are specified using the parameters Number of nodes and Machine type.
The master process accepts the applications to be run and schedules the worker resources (available CPU cores and memory) among them. Worker processes execute the job's tasks. Applications are submitted either from the node1 terminal interface or from another UCloud client app, connected to node1, which also runs Apache Spark (see, e.g., JupyterLab). Finally, the Spark driver is the program that creates the SparkContext, connecting to the master node.
The standalone Spark cluster supports two deploy modes:

- In client mode (the default), the Spark driver is launched in the same process as the client that submits the application (see the figure above).
- In cluster mode, the driver is launched from one of the worker processes inside the cluster, and the client process exits as soon as it has submitted the application, without waiting for the application to finish.
The deploy mode is specified via the configuration property spark.submit.deployMode. The complete list of application properties available for a Spark standalone cluster is reported here.
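As an illustration, the following sketch builds a spark-submit invocation that sets the deploy mode through spark.submit.deployMode. The master URL spark://node1:7077 and the application path /work/app.py are placeholder values, not taken from this guide:

```shell
# Hypothetical example: assemble a spark-submit command that selects the
# deploy mode via the spark.submit.deployMode property.
# spark://node1:7077 and /work/app.py are placeholder values.
MASTER_URL="spark://node1:7077"
DEPLOY_MODE="cluster"   # the default is "client"
SUBMIT_CMD="spark-submit --master $MASTER_URL --conf spark.submit.deployMode=$DEPLOY_MODE /work/app.py"
echo "$SUBMIT_CMD"
```

On the cluster, the assembled command would be run directly in the node1 terminal; it is echoed here only so the flags can be inspected.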
Information about completed and ongoing Spark applications is accessible via the app web interface, by clicking on the button
The Spark web user interface (UI) is used to monitor all the applications submitted to the cluster. The master and each worker have their own web UIs, showing cluster and job statistics.
For each application, the Spark UI includes:

- A list of scheduler stages and tasks.
- A summary of RDD sizes and memory usage.
- Information about the running executors.
This information is only available for the duration of the Spark application.
Spark history server
The Spark history logs are saved in /work/spark_logs, a directory created by default when the job starts. A different directory can be specified using the parameter Spark history.
The Spark history server can be accessed by running the following command from the node1 terminal interface:
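A minimal sketch of such a command, assuming a standard Spark distribution layout on node1 (the SPARK_HOME location and fallback path are assumptions) and the default log directory:

```shell
# Sketch: point the history server at the event-log directory and start it.
# The sbin script path assumes a standard Spark distribution; /opt/spark is
# only a fallback guess for SPARK_HOME.
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/work/spark_logs"
if [ -x "${SPARK_HOME:-/opt/spark}/sbin/start-history-server.sh" ]; then
    "${SPARK_HOME:-/opt/spark}/sbin/start-history-server.sh"
fi
```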
The history server is accessible only from the node1 web interface.
If the client runs in another UCloud app, the same logs folder must be mounted on both cluster and client apps.
The mandatory parameter Input folder is used to mount the data directory on all the cluster nodes.
If the client runs in another UCloud app, the same data folder must be mounted on both cluster and client apps.
Additional packages may be installed on all the worker nodes using the optional Initialization parameter.
For information on how to use this parameter, please refer to the Initialization - Bash script, Initialization - Conda packages, and Initialization - pip packages sections of the documentation.
The same packages must be installed on a client node connected to the Spark cluster.
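For illustration, a minimal Initialization bash script could install extra Python packages on every node. The package names below are placeholders, not prescribed by the app; the command is echoed here so it can be inspected without performing a real installation:

```shell
#!/bin/bash
# Hypothetical Initialization script for the worker nodes.
# numpy and pandas are placeholder package names.
set -e
PACKAGES="numpy pandas"
PIP_CMD="pip install $PACKAGES"
echo "$PIP_CMD"   # on each node this command would be executed directly
```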
Run in batch mode
The parameter Batch processing is used to submit Spark applications via a Bash script executed on the master node.
The job terminates after the script is executed, regardless of the script's exit status.
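A batch script of this kind might look like the following sketch, under the assumption that the application is submitted with spark-submit. The log location, master URL, and application path are placeholders; the spark-submit line is commented out so the script's structure can be inspected without a live cluster:

```shell
#!/bin/bash
# Hypothetical batch script for the Batch processing parameter; it would run
# on the master node. /tmp/batch_job.log, spark://node1:7077, and
# /work/app.py are placeholder values.
LOG="/tmp/batch_job.log"
echo "batch job started: $(date)" > "$LOG"
# spark-submit --master spark://node1:7077 /work/app.py >> "$LOG" 2>&1
echo "batch job finished: $(date)" >> "$LOG"
cat "$LOG"
```

Note that the job ends as soon as this script returns, whether or not the submitted application succeeded.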