JupyterLab

JupyterLab is a web-based integrated development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: the user interface can be configured and arranged to support a wide range of workflows in data science, scientific computing, and machine learning.

The JupyterLab app provides three starting modes, which can be selected via the Deployment mode parameter.

Several programming language kernels are pre-loaded in the app.
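The kernels currently available can be checked from the app terminal using the standard Jupyter command line (the exact list depends on the app version and deployment mode):

$ jupyter kernelspec list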

Install new packages

Additional packages can be installed inside the application container using the Dependencies parameter. The user should provide the list of packages either via a text file (.txt) or a YAML file (.yml/.yaml); the installation is done with the Conda command-line package and environment manager. Alternatively, it is possible to provide a Bash script (.sh) containing the shell commands to be used for the installation.

The example below shows three different ways to install the same packages:

Text file (.txt):

numpy==1.18.1
pandas==1.0.2
keras==2.3.1
matplotlib==3.2.0
seaborn==0.10.0
plotly==4.5.4

YAML file (.yml):

name: base
channels:
  - conda-forge
  - defaults
  - numba

dependencies:
  - conda-forge::numpy=1.18.1
  - conda-forge::pandas=1.0.2
  - conda-forge::keras=2.3.1
  - conda-forge::matplotlib=3.2.0
  - conda-forge::seaborn=0.10.0
  - numba::numba=0.48.0
  - pip:
    - plotly==4.5.4

Bash script (.sh):

#!/usr/bin/env bash

ENV=base
set -eux

conda install -y -n $ENV -c conda-forge \
    numpy=1.18.1 \
    pandas=1.0.2 \
    keras=2.3.1 \
    matplotlib=3.2.0 \
    seaborn=0.10.0

conda install -y -n $ENV -c numba numba=0.48.0

pip install plotly==4.5.4

Packages can also be installed using the terminal interface available in all the starting modes.
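For example, a single package can be added interactively from the terminal, either with Conda or with pip (the package names below are purely illustrative):

$ conda install -y -c conda-forge scipy
$ pip install requests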

Run in batch mode

The parameter Batch processing is used to execute a Bash script in batch mode.

Note

The job is terminated after the script is executed, independently of the script exit status.
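A minimal sketch of a batch script is shown below; the script path, data paths, and commands are hypothetical and should be adapted to the actual workflow:

#!/usr/bin/env bash
set -eux

# install the dependencies needed by the job (illustrative versions)
pip install pandas==1.0.2

# run the analysis; results written under /work are exported to the
# job output folder on UCloud when the job terminates
python /work/scripts/analysis.py --output /work/results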

Manage environments

Conda virtual environment

The default Conda environment at startup is called base. The user can create a new environment via the JupyterLab terminal interface:

$ conda create --name myenv

By default, this command creates a basic Python v3.9 environment. A different Python version can be specified as follows:

$ conda create --name myenv python=3.6

By default, new environments are created in the folder /opt/conda/envs/. The user can change the environment path to a folder inside /work, so that all the installed packages are exported to the job output folder on UCloud after the job is completed. An example is shown below:

$ conda create --prefix /work/myenv python=3.6

Warning

After job completion, the user should move (not copy) myenv from the corresponding job folder to a new path inside a UCloud workspace.

To activate environments other than base, the shell must first be initialized with the command:

$ eval "$(conda shell.bash hook)"

Tip

(base) ucloud:/work$

Then, the new environment can be activated as follows:

$ conda activate myenv

Tip

(myenv) ucloud:/work$

Complete documentation on managing Conda environments can be found here.
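A few frequently used commands, for quick reference (the environment name myenv is just an example):

$ conda env export > environment.yml    # save the active environment to a YAML file
$ conda env remove --name myenv         # delete an environment
$ conda clean --all                     # free disk space used by package caches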

Hint

In case the installation of very recent packages via conda fails due to strict channel priority settings, run the command:

$ conda config --set channel_priority false

Python virtual environment

A lightweight virtual environment can also be set up using the venv module of Python. For example:

$ python -m venv /work/myenv

This command creates the environment in the folder /work/myenv. It is activated as follows:

$ source /work/myenv/bin/activate

Tip

(myenv) ucloud:/work$

New modules are installed via pip, e.g.

$ pip install tensorflow-gpu==2.5.0

The default Python interpreter, /opt/conda/bin/python, is restored with the command:

$ deactivate

Add new kernels

The user can add new kernels to the JupyterLab launcher via the command line. Here are some working examples:

Python kernel:

$ conda create -y --name py36 python=3.6 ipykernel
$ conda env update -n py36 --file environment.yml
$ conda activate py36
$ python -m ipykernel install --user --name ipy36 --display-name "Python (v3.6)"
$ conda deactivate

R kernel:

$ conda create -y --name r351 r-essentials=3.5.1 r-base=3.5.1
$ conda env update -n r351 --file environment.yml
$ conda activate r351
$ R -e "IRkernel::installspec(name = 'ir351', displayname = 'R (v3.5.1)')"
$ conda deactivate

Julia kernel:

$ conda create -y --name jl julia=1.0.3
$ conda activate jl
$ julia -e 'import Pkg; Pkg.update(); Pkg.add("IJulia")'
$ julia -e "using Pkg; pkg\"add IJulia\"; pkg\"precompile\""
$ conda deactivate

The list of all the installed environments is given by the command:

$ conda env list

Tip

# conda environments:
#
base                  *  /opt/conda
jl                       /opt/conda/envs/jl
py36                     /opt/conda/envs/py36
r351                     /opt/conda/envs/r351

After the installation, it might be necessary to refresh the web page to see the new kernels in the launcher.


Hint

A new kernel and the corresponding virtual environment can also be created at startup by submitting a Bash script (*.sh) via the Dependencies parameter. For example:

#!/usr/bin/env bash

conda create -y -n tf2 python=3.9 cudatoolkit=11.2 cudnn=8.2 ipykernel
eval "$(conda shell.bash hook)"
conda activate tf2
pip install tensorflow_gpu==2.5
python -m ipykernel install --user --name tf2 --display-name "TF v2.5"

Submit a Spark application

A local instance of Apache Spark is already installed in the app container, within the directory $SPARK_HOME. Spark can be used to quickly perform processing tasks on very large datasets.
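The installation can be verified from the terminal (a quick check, assuming the standard Spark layout under $SPARK_HOME):

$ $SPARK_HOME/bin/spark-submit --version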

The user can choose to submit Spark applications either in local mode or in cluster mode, depending on the size of the dataset.

Local deployment

For smaller or sampled datasets it is convenient to run Spark applications in local mode, that is, using only the resources allocated to the JupyterLab app. To connect to the local Spark instance, the SparkContext should be defined as in the following example:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAll(
    [
        ("spark.master", "local[16]"),
        ("spark.eventLog.enabled", True),
        ("spark.eventLog.dir", "/work/spark_logs"),
        ("spark.history.fs.logDirectory", "/work/spark_logs"),
    ]
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

where it is assumed that the selected machine type has at least 16 cores.

In this mode the applications can be monitored using the SparkMonitor extension of JupyterLab, which is available in the app and accessible from the top menu. By default, the Spark UI is available on port 4040.
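Since the event logs are written to /work/spark_logs (see the configuration above), completed applications can also be browsed with the Spark history server, which can be started from the terminal. This is a sketch assuming the standard Spark distribution under $SPARK_HOME; the history server UI listens on port 18080 by default:

$ export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/work/spark_logs"
$ $SPARK_HOME/sbin/start-history-server.sh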

Cluster deployment

Spark applications that require distributed computational resources can be submitted directly to a Spark standalone cluster, which distributes data processing tasks across multiple nodes. In this case, the JupyterLab app should be connected to a Spark Cluster instance using the optional parameter Connect to other jobs, as shown in the following example:

[Figure: Connect to other jobs parameter with the Job and Hostname entries]

where the Job entry is used to select the job ID of the Spark Cluster instance, created in advance. The Hostname parameter assigns the URL of the master node to the SparkContext created in the JupyterLab app. The default port on the master node is 7077.

An example of a Spark application deployed on a standalone cluster is shown in the code snippet below.

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

from random import random
from operator import add

MASTER_HOST = "spark://my-cluster:7077"

NODES = 3
CLUSTER_CORES_MAX = 63 * NODES  # set cluster total number of cores
CLUSTER_MEMORY_MAX = 371 * NODES  # set cluster total memory in GB

EXECUTOR_CORES = 21  # set cores per executor on worker node
EXECUTOR_MEMORY = int(
    371 / (63 / EXECUTOR_CORES) * 0.5
)  # set executor memory in GB on each worker node

conf = SparkConf().setAll(
    [
        ("spark.master", MASTER_HOST),
        ("spark.cores.max", CLUSTER_CORES_MAX),
        ("spark.executor.cores", EXECUTOR_CORES),
        ("spark.executor.memory", str(EXECUTOR_MEMORY) + "g"),
        ("spark.eventLog.enabled", True),
        ("spark.eventLog.dir", "/work/spark_logs"),
        ("spark.history.fs.logDirectory", "/work/spark_logs"),
        ("spark.deploy.mode", "cluster"),
    ]
)

## check executor memory, taking into account 10% of memory overhead (minimum 384 MiB)
CHECK = (CLUSTER_CORES_MAX / EXECUTOR_CORES) * (
    EXECUTOR_MEMORY + max(EXECUTOR_MEMORY * 0.10, 0.403)
)

assert (
    int(CHECK) <= CLUSTER_MEMORY_MAX
), "Executor memory larger than cluster total memory!"

spark = SparkSession.builder.config(conf=conf).appName("PI Calc").getOrCreate()
sc = spark.sparkContext

partitions = 100000
n = 100000 * partitions

def f(_):
    x = random()
    y = random()
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))

sc.stop()

After submitting the application to the cluster, the Spark UI looks like the figure below.

[Figure: Spark UI of the standalone cluster with the submitted application]

In this example the cluster architecture consists of 3 worker nodes.