Spark Cluster¶

4.1.1

access

Operating System:
Terminal:
Shell:
Editor:
Package Manager:
Programming Language:
Database:
Utility:
Extension:

4.0.1

access

Operating System:
Terminal:
Shell:
Editor:
Package Manager:
Programming Language:
Database:
Utility:
Extension:

3.5.5

access

Operating System:
Terminal:
Shell:
Editor:
Package Manager:
Programming Language:
Database:
Utility:
Extension:

3.4.1

access

Operating System:
Terminal:
Shell:
Editor:
Package Manager:
Programming Language:
Database:
Utility:
Extension:

3.3.2

access

Operating System:
Shell:
Editor:
Package Manager:
Programming Language:
Database:

3.2.2

access

Operating System:
Shell:
Editor:
Package Manager:
Programming Language:
Database:

3.1.3

access

Operating System:
Shell:
Editor:
Package Manager:
Programming Language:
Database:

3.0.0

access

Operating System:
Shell:
Editor:
Package Manager:
Programming Language:
Database:

2.4.5

access

Operating System:
Shell:
Editor:
Package Manager:
Programming Language:
Database:

This app deploys a Spark standalone cluster.

Cluster Architecture¶

Apache Spark standalone cluster architecture in client mode

The cluster architecture consists of one master node (node1), acting as the cluster manager, and one or more worker nodes. By default, a worker process also runs on node1. The cluster configuration is defined by two mandatory parameters: Number of nodes and Machine type.

The master process receives application submissions and allocates worker resources (CPU cores and memory) accordingly. Worker processes then execute the tasks of each job. AApplications can be submitted either directly from the master node’s terminal or through another UCloud client application connected to node1, which also runs Apache Spark (see, e.g., JupyterLab). The Spark driver program creates the SparkContext and connects to the master node.

A standalone Spark cluster supports two deployment modes:

Client mode (default): The Spark driver runs in the same process as the client submitting the application (see figure above).
Cluster mode: The driver runs within one of the cluster’s worker processes. In this case, the client exits once it has submitted the application, without waiting for execution to complete.

The deployment mode is controlled by the property spark.submit.deployMode. A full list of configurable application properties for a Spark standalone cluster is available here.

Submit Applications¶

Applications can be submitted from the node1 terminal using spark-submit with the cluster’s master URL:

spark://node1:7077

Examples¶

Python (client mode, with event logging):

$ spark-submit \
--master spark://node1:7077 \
--deploy-mode client \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/work/spark_logs \
your_app.py --arg1 value1

Scala example (SparkPi from bundled examples):

$ spark-submit \
--master spark://node1:7077 \
--class org.apache.spark.examples.SparkPi \
"$SPARK_HOME"/examples/jars/spark-examples_*.jar 1000

Switch deploy mode¶

Add the following option to run in cluster mode:

--deploy-mode cluster

In cluster mode, the driver runs on a worker node and the client exits immediately after submission.

Resource configuration¶

Adjust these settings to match your cluster size and workload:

--executor-cores 4 --executor-memory 8g --total-executor-cores 32

Monitoring¶

Ongoing and completed Spark applications can be monitored through the web interfaces available on the cluster.

Spark UI¶

Each cluster node provides web interfaces for monitoring Spark components:

On node1 (master node):
- Spark Master UI – displays cluster-wide status and running applications.
- Spark History UI – displays completed applications based on event logs.
- Spark Worker UI – monitors the worker process running on the master node.
On all other worker nodes:
- Spark Worker UI – monitors the worker process on the respective node.

For each application, the Spark UI provides:

A list of scheduler stages and tasks.
A summary of RDD sizes and memory usage.
Environmental information.
Information about running executors.

Spark history server¶

The Spark history logs are stored in /work/spark_logs, which is created by default when the job starts. A different directory can be specified using the parameter Spark history logs.

The Spark History UI is available exclusively from the node1 web interface.

Note

When the client runs in another UCloud app, the same logs folder must be mounted on both the cluster and the client applications.

Connect From Other Apps¶

Applications can also be submitted from other UCloud client apps (for example, JupyterLab and Airflow) by connecting them to the Spark Cluster job using Connect to other jobs.

Use the master URL spark://node1:7077 when creating your SparkSession or invoking spark-submit.
Mount the same input folder on both the cluster and the client app to ensure consistent code and data paths.
To access the Spark history server, mount the same Spark history directory (/work/spark_logs by default) on both the cluster and the client app.

Note

The client app must contain a compatible Spark installation and any additional packages required by the application.

Initialization¶

Additional packages may be installed on all the worker nodes by specifying the optional Initialization parameter.

For information on how to use this parameter, please refer to the Initialization: Bash Script, Initialization: Conda Packages, and Initialization: PyPI Packages section of the documentation.

Note

The same packages must also be installed on any client node connected to the Spark cluster.

Example¶

Initialization Bash script snippet to install a PyPI package on all nodes:

#!/usr/bin/env bash
set -euxo pipefail
pip install --no-cache-dir pyarrow==21.0.0

Run in Batch Mode¶

The Batch processing parameter enables Spark applications to be submitted through a Bash script executed on the master node.

Note

The job terminates once the script finishes executing, regardless of the script’s exit status.

Example¶

Minimal batch processing script:

#!/usr/bin/env bash
set -euxo pipefail

export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/work/spark_logs"

spark-submit \
  --master spark://node1:7077 \
  --deploy-mode client \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/work/spark_logs \
  /work/your_project/your_app.py

Spark Cluster¶

Cluster Architecture¶

Submit Applications¶

Examples¶

Switch deploy mode¶

Resource configuration¶

Monitoring¶

Spark UI¶

Spark history server¶

Connect From Other Apps¶

Initialization¶

Example¶

Run in Batch Mode¶

Example¶

Contents