Spark Cluster

This application deploys a Spark standalone cluster.

Cluster architecture

[Figure: Apache Spark standalone cluster architecture in client mode]

The cluster consists of one master node (node1), which acts as the cluster manager, and one or more worker nodes. By default, one worker always runs on node1. The cluster resources are specified using the parameters Number of nodes and Machine type.

The master process accepts the applications to be run and schedules the worker resources (available CPU cores and memory) among them. Worker processes execute the job's tasks. Applications are submitted either from the node1 terminal interface or from another UCloud client app connected to node1, which also runs Apache Spark (see, e.g., JupyterLab). Finally, the Spark driver is the program that creates the SparkContext, connecting to the master node.

The standalone Spark cluster supports two deploy modes:

  • In client mode (default), the Spark driver is launched in the same process as the client that submits the application (see, e.g., the figure above).

  • In cluster mode, the driver is launched from one of the worker processes inside the cluster, and the client process exits as soon as it has submitted the application, without waiting for it to finish.

The deploy mode is specified via the property parameter spark.submit.deployMode. The complete list of application properties available for a Spark standalone cluster can be found in the official Spark configuration documentation.
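As a sketch, assuming a hypothetical application script app.py and the master reachable at spark://node1:7077 (the default standalone port), the deploy mode could be selected at submission time like this:

```shell
# Client mode (default): the driver runs in the submitting process.
spark-submit \
  --master spark://node1:7077 \
  --conf spark.submit.deployMode=client \
  app.py

# Cluster mode: the driver runs on one of the workers and the client
# process returns immediately after submission.
spark-submit \
  --master spark://node1:7077 \
  --conf spark.submit.deployMode=cluster \
  app.py
```

The same setting can also be passed with the equivalent spark-submit flag --deploy-mode.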

Monitoring

Information about completed and ongoing Spark applications is accessible via the app web interface, by clicking on the Spark UI button.

Spark UI

The Spark web user interface (UI) is used to monitor all the applications submitted to the cluster. The master and each worker have their own web UIs, showing cluster and job statistics.

For each application the Spark UI includes:

  • A list of scheduler stages and tasks.

  • A summary of RDD sizes and memory usage.

  • Environmental information.

  • Information about the running executors.

Note

This information is only available for the duration of the Spark application.

Spark history server

The Spark history logs are saved in /work/spark_logs, which is created by default when the job starts. A different directory can be specified using the parameter Spark history.
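For reference, event logging and the history server location are controlled by standard Spark properties; a minimal sketch, assuming the default /work/spark_logs directory, would be:

```
spark.eventLog.enabled           true
spark.eventLog.dir               file:/work/spark_logs
spark.history.fs.logDirectory    file:/work/spark_logs
```

Changing the Spark history parameter effectively points these properties at a different directory.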

The Spark history server can be accessed by running the following command from the node1 terminal interface:

$ display_spark_history

The history server is accessible only from the node1 web interface.

Note

If the client runs in another UCloud app, the same logs folder must be mounted on both cluster and client apps.

Import data

The mandatory parameter Input folder is used to mount the data directory on all the cluster nodes.

Note

If the client runs in another UCloud app, the same data folder must be mounted on both cluster and client apps.

Install new packages

Additional packages may be installed on all the worker nodes using the Dependencies parameter. The user should provide the list of packages to be installed either via a text file (.txt) or a YAML file (.yml/.yaml). The installation is done via the Conda command line package and environment manager. Alternatively, it is possible to load a Bash script (.sh) with the list of installation commands.

The examples below show three different ways to specify the packages to install.

Requirements file (.txt):

numpy==1.19.5
pandas==1.3.2
keras==2.6.0
matplotlib==3.4.3
seaborn==0.11.2
plotly==5.2.1

Conda environment file (.yml):

name: base
channels:
  - conda-forge
  - defaults
  - numba

dependencies:
  - conda-forge::numpy=1.19.5
  - conda-forge::pandas=1.3.2
  - conda-forge::keras=2.6.0
  - conda-forge::matplotlib=3.4.3
  - conda-forge::seaborn=0.11.2
  - numba::numba=1.20.3
  - pip:
    - plotly==5.2.1

Bash script (.sh):

#!/usr/bin/env bash

ENV=base
set -eux

conda install -y -n $ENV -c conda-forge \
  numpy=1.19.5 \
  pandas=1.3.2 \
  keras=2.6.0 \
  matplotlib=3.4.3 \
  seaborn=0.11.2

conda install -y -n $ENV -c numba numba=1.20.3

pip install plotly==5.2.1

Note

The same packages must be installed on a client node connected to the Spark cluster.

Run in batch mode

The parameter Batch processing is used to submit Spark applications via a Bash script, which is executed on the master node.

Note

The job is terminated after the script is executed, regardless of the script's exit status.
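A minimal batch script might look like the following sketch, where app.py and the input folder path are hypothetical names standing in for the user's own application and mounted data directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Submit the application to the standalone master and wait for it to
# finish; the job terminates once this script exits, whatever the outcome.
spark-submit \
  --master spark://node1:7077 \
  /work/my-input-folder/app.py
```

Because the whole job ends when the script returns, long-running or multi-step pipelines should chain all of their spark-submit calls inside this one script.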