Spark Cluster

This application deploys a Spark standalone cluster.

Cluster architecture

[Figure: Apache Spark standalone cluster architecture in client mode]

The cluster consists of one master node (node1), which acts as the cluster manager, and one or more worker nodes. By default, one worker always runs on node1. The cluster resources are specified using the parameters Number of nodes and Machine type.

The master process accepts the applications to be run and schedules the worker resources (available CPU cores and memory) among them. Worker processes execute the job's tasks. Applications are submitted either from the node1 terminal interface or from another UCloud client app connected to node1, which also runs Apache Spark (see, e.g., JupyterLab). Finally, the Spark driver is the program that creates the SparkContext, connecting to the master node.

The standalone Spark cluster supports two deploy modes:

  • In client mode (default), the Spark driver is launched in the same process as the client that submits the application (see, e.g., the figure above).

  • In cluster mode, the driver is launched from one of the worker processes inside the cluster, and the client process exits as soon as it has submitted the application, without waiting for it to finish.

The deploy mode is specified via the property parameter spark.submit.deployMode. The complete list of application properties available for a Spark standalone cluster can be found in the official Spark configuration documentation.
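As a sketch, assuming a hypothetical application script app.py and the master reachable at spark://node1:7077 (the default standalone port), the deploy mode could be selected at submission time like this:

```shell
# Client mode (default): the driver runs in the submitting process.
spark-submit \
  --master spark://node1:7077 \
  --conf spark.submit.deployMode=client \
  app.py

# Cluster mode: the driver runs on one of the workers and the client
# process returns immediately after submission.
spark-submit \
  --master spark://node1:7077 \
  --conf spark.submit.deployMode=cluster \
  app.py
```

The same setting can also be passed with the equivalent spark-submit flag --deploy-mode.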

Monitoring

Information about completed and ongoing Spark applications is accessible via the app web interface, by clicking on the Spark UI button.

Spark UI

The Spark web user interface (UI) is used to monitor all the applications submitted to the cluster. The master and each worker have their own web UIs, showing cluster and job statistics.

For each application the Spark UI includes:

  • A list of scheduler stages and tasks.

  • A summary of RDD sizes and memory usage.

  • Environmental information.

  • Information about the running executors.

Note

This information is only available for the duration of the Spark application.

Spark history server

The Spark history logs are saved in /work/spark_logs, which is created by default when the job starts. A different directory can be specified using the parameter Spark history.
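For reference, event logging and the history server location are controlled by standard Spark properties; a minimal sketch, assuming the default /work/spark_logs directory, would be:

```
spark.eventLog.enabled           true
spark.eventLog.dir               file:/work/spark_logs
spark.history.fs.logDirectory    file:/work/spark_logs
```

Changing the Spark history parameter effectively points these properties at a different directory.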

The Spark history server can be accessed by running the following command from the node1 terminal interface:

$ display_spark_history

The history server is accessible only from the node1 web interface.

Note

If the client runs in another UCloud app, the same logs folder must be mounted on both cluster and client apps.

Import data

The mandatory parameter Input folder is used to mount the data directory on all the cluster nodes.

Note

If the client runs in another UCloud app, the same data folder must be mounted on both cluster and client apps.

Install new packages

Additional packages may be installed on all the worker nodes using the Dependencies parameter. The user should provide the list of packages to be installed either via a text file (.txt) or a YAML file (.yml/.yaml). The installation is done via the Conda command line package and environment manager. Alternatively, it is possible to load a Bash script (.sh) with the list of installation commands.

The examples below show three different ways to specify the packages to install.

Requirements file (.txt):

numpy==1.19.5
pandas==1.3.2
keras==2.6.0
matplotlib==3.4.3
seaborn==0.11.2
plotly==5.2.1

Conda environment file (.yml):

name: base
channels:
  - conda-forge
  - defaults
  - numba

dependencies:
  - conda-forge::numpy=1.19.5
  - conda-forge::pandas=1.3.2
  - conda-forge::keras=2.6.0
  - conda-forge::matplotlib=3.4.3
  - conda-forge::seaborn=0.11.2
  - numba::numba=1.20.3
  - pip:
    - plotly==5.2.1

Bash script (.sh):

#!/usr/bin/env bash

ENV=base
set -eux

conda install -y -n $ENV -c conda-forge \
  numpy=1.19.5 \
  pandas=1.3.2 \
  keras=2.6.0 \
  matplotlib=3.4.3 \
  seaborn=0.11.2

conda install -y -n $ENV -c numba numba=1.20.3

pip install plotly==5.2.1

Note

The same packages must be installed on a client node connected to the Spark cluster.

Run in batch mode

The parameter Batch processing is used to submit Spark applications via a Bash script, which is executed on the master node.

Note

The job is terminated after the script is executed, regardless of the script's exit status.
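A minimal batch script might look like the following sketch, where app.py and the input folder path are hypothetical names standing in for the user's own application and mounted data directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Submit the application to the standalone master and wait for it to
# finish; the job terminates once this script exits, whatever the outcome.
spark-submit \
  --master spark://node1:7077 \
  /work/my-input-folder/app.py
```

Because the whole job ends when the script returns, long-running or multi-step pipelines should chain all of their spark-submit calls inside this one script.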