JupyterLab is a web-based integrated development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: the user interface can be configured and arranged to support a wide range of workflows in data science, scientific computing, and machine learning.
The JupyterLab app provides three starting modes, which can be selected via the Deployment mode parameter.
Several programming language kernels are pre-loaded in the app.
Install new packages¶
Additional packages can be installed inside the application container using the Dependencies parameter. The user should provide the list of packages either via a text file (.txt) or a YAML file (.yaml). The installation is done via the Conda command-line package and environment manager. Alternatively, it is possible to load a Bash script (.sh) with the list of shell commands to be used for the installation.
The example below shows three different ways to install the same packages:
numpy==1.18.1
pandas==1.0.2
keras==2.3.1
matplotlib==3.2.0
seaborn==0.10.0
plotly==4.5.4
name: base
channels:
  - conda-forge
  - defaults
  - numba
dependencies:
  - conda-forge::numpy=1.18.1
  - conda-forge::pandas=1.0.2
  - conda-forge::keras=2.3.1
  - conda-forge::matplotlib=3.2.0
  - conda-forge::seaborn=0.10.0
  - numba::numba=0.48.0
  - pip:
      - plotly==4.5.4
#!/usr/bin/env bash
ENV=base
set -eux
conda install -y -n $ENV -c conda-forge \
    numpy=1.18.1 \
    pandas=1.0.2 \
    keras=2.3.1 \
    matplotlib=3.2.0 \
    seaborn=0.10.0
conda install -y -n $ENV -c numba numba=0.48.0
pip install plotly==4.5.4
Packages can also be installed using the terminal interface available in all the starting modes.
Run in batch mode¶
The parameter Batch processing is used to execute a Bash script in batch mode.
The job is terminated after the script is executed, regardless of the script's exit status.
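As an illustration, a minimal batch script might look like the sketch below; the commands and file names are hypothetical placeholders, and any Bash script supplied via the Batch processing parameter follows the same pattern:

```shell
#!/usr/bin/env bash
# Hypothetical batch job: stop on the first error, run a short computation,
# and write the result to a file.
set -eu

echo "Batch job started"

# Any command available in the container can be used here, e.g. a Python step:
python3 -c "print('2 + 2 =', 2 + 2)" > /tmp/result.txt

cat /tmp/result.txt
echo "Batch job finished"
```

Since the job terminates once the script returns, any output worth keeping should be written under /work so that it is exported with the job output.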
Conda virtual environment¶
The default Conda environment at startup is called base. The user can create a new environment via the JupyterLab terminal interface:
$ conda create --name myenv
This command creates a basic Python v3.9 workspace by default. A different version of Python can be specified as follows:
$ conda create --name myenv python=3.6
By default, new environments are created in the folder /opt/conda/envs/. The user can change the environment path to a new folder inside /work, so that all the installed packages are exported to the job output folder on UCloud after the job is completed. An example is shown below:
$ conda create --prefix /work/myenv python=3.6
After job completion, the user should move (not copy) myenv from the corresponding job folder to a new path inside a UCloud workspace.
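The move can be done from the terminal; in the sketch below the job-output and workspace paths are hypothetical stand-ins (created under /tmp so the example is self-contained) and should be replaced with the actual UCloud paths:

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical stand-ins for the job output folder and the target workspace path:
mkdir -p /tmp/job-output/myenv /tmp/workspace
# Move (not copy) the environment, so the job folder does not keep a duplicate:
mv /tmp/job-output/myenv /tmp/workspace/myenv
ls /tmp/workspace
```

Moving rather than copying avoids duplicating a potentially large environment in the job output.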
To activate other environments, one first has to initialize the shell with the command:
$ eval "$(conda shell.bash hook)"
Then, the new environment can be activated as follows:
$ conda activate myenv
Complete documentation on how to manage Conda environments can be found here.
In case the installation of very recent packages via conda fails due to strict channel priority settings, run the command:
$ conda config --set channel_priority false
Python virtual environment¶
A lightweight virtual environment can also be set up using the venv module of Python. For example:
$ python -m venv /work/myenv
This command will create all the dependencies in the folder /work/myenv. The environment is activated as follows:
$ source /work/myenv/bin/activate
New modules are installed via pip:
$ pip install tensorflow-gpu==2.5.0
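As a quick sanity check, the sketch below (assuming a standard venv layout, with a throwaway environment under /tmp) shows that activation switches the interpreter found on the PATH:

```shell
#!/usr/bin/env bash
set -e
python3 -m venv /tmp/myenv       # create a throwaway environment
source /tmp/myenv/bin/activate   # activation prepends its bin/ to the PATH
command -v python                # now resolves inside /tmp/myenv
deactivate                       # restores the previous interpreter
command -v python3
```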
The default Python interpreter /opt/conda/bin/python is restored with the command:
$ deactivate
Add new kernels¶
The user can add new kernels in the JupyterLab launcher via command line. These are some working examples:
$ conda create -y --name py36 python=3.6 ipykernel
$ conda env update -n py36 --file environment.yml
$ conda activate py36
$ python -m ipykernel install --user --name ipy36 --display-name "Python (v3.6)"
$ conda deactivate
$ conda create -y --name r351 r-essentials=3.5.1 r-base=3.5.1
$ conda env update -n r351 --file environment.yml
$ conda activate r351
$ R -e "IRkernel::installspec(name = 'ir351', displayname = 'R (v3.5.1)')"
$ conda deactivate
$ conda create -y --name jl julia=1.0.3
$ conda activate jl
$ julia -e 'import Pkg; Pkg.update(); Pkg.add("IJulia"); Pkg.precompile()'
$ conda deactivate
The list of all the installed environments is given by the command:
$ conda env list
# conda environments:
base * /opt/conda
After the installation, it might be necessary to refresh the web page to visualize the new kernels in the launcher.
A new kernel and the corresponding virtual environment can also be created at startup by submitting a Bash script (*.sh) via the Dependencies parameter. For example:
#!/usr/bin/env bash
conda create -y -n tf2 python=3.9 cudatoolkit=11.2 cudnn=8.2 ipykernel
eval "$(conda shell.bash hook)"
conda activate tf2
pip install tensorflow_gpu==2.5
python -m ipykernel install --user --name tf2 --display-name "TF v2.5"
Submit a Spark application¶
A local instance of Apache Spark is already installed in the app container, within the directory $SPARK_HOME. Spark can be used to quickly perform processing tasks on very large datasets.
The user can choose to submit Spark applications either in local mode or in cluster mode, depending on the size of the dataset.
In the case of smaller/sampled datasets it is convenient to run Spark applications in local mode, that is, using only the resources allocated to the JupyterLab app. To connect to the local Spark instance, the SparkContext should be defined as in the following example:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAll(
    [
        ("spark.master", "local[16]"),
        ("spark.eventLog.enabled", True),
        ("spark.eventLog.dir", "/work/spark_logs"),
        ("spark.history.fs.logDirectory", "/work/spark_logs"),
    ]
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
where it is assumed that the selected machine type has at least 16 cores.
In this mode the applications can be monitored using the SparkMonitor extension of JupyterLab, which is available in the app and accessible from the top menu. By default, the Spark UI connects to port 4040.
Spark applications which require distributed computational resources can be submitted directly to a Spark standalone cluster, which allows data processing tasks to be distributed across multiple nodes.
In this case, the JupyterLab app should be connected to a Spark cluster instance using the optional parameter Connect to other jobs, as shown in the following example:
where the Job entry is used to select the job ID of the Spark Cluster instance, created in advance. In addition, the Hostname parameter is used to assign the URL of the master node to the SparkContext created in the JupyterLab app. The default port on the master node is 7077.
An example of a Spark application deployed on a standalone cluster is shown in the code snippet below.
from operator import add
from random import random

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

MASTER_HOST = "spark://my-cluster:7077"

NODES = 3
CLUSTER_CORES_MAX = 63 * NODES  # set cluster total number of cores
CLUSTER_MEMORY_MAX = 371 * NODES  # set cluster total memory in GB
EXECUTOR_CORES = 21  # set cores per executor on worker node
EXECUTOR_MEMORY = int(
    371 / (63 / EXECUTOR_CORES) * 0.5
)  # set executor memory in GB on each worker node

conf = SparkConf().setAll(
    [
        ("spark.master", MASTER_HOST),
        ("spark.cores.max", CLUSTER_CORES_MAX),
        ("spark.executor.cores", EXECUTOR_CORES),
        ("spark.executor.memory", str(EXECUTOR_MEMORY) + "g"),
        ("spark.eventLog.enabled", True),
        ("spark.eventLog.dir", "/work/spark_logs"),
        ("spark.history.fs.logDirectory", "/work/spark_logs"),
        ("spark.deploy.mode", "cluster"),
    ]
)

## check executor memory, taking into account 10% of memory overhead (minimum 384 MiB)
CHECK = (CLUSTER_CORES_MAX / EXECUTOR_CORES) * (
    EXECUTOR_MEMORY + max(EXECUTOR_MEMORY * 0.10, 0.403)
)
assert (
    int(CHECK) <= CLUSTER_MEMORY_MAX
), "Executor memory larger than cluster total memory!"

spark = SparkSession.builder.config(conf=conf).appName("PI Calc").getOrCreate()
sc = spark.sparkContext

partitions = 100000
n = 100000 * partitions


def f(_):
    x = random()
    y = random()
    return 1 if x ** 2 + y ** 2 <= 1 else 0


count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

sc.stop()
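The executor sizing in the application is plain arithmetic and can be sanity-checked without a Spark installation; the sketch below reproduces the same numbers in pure Python:

```python
# Reproduce the cluster sizing arithmetic from the Spark application (no Spark needed).
NODES = 3
CLUSTER_CORES_MAX = 63 * NODES    # 189 cores in total
CLUSTER_MEMORY_MAX = 371 * NODES  # 1113 GB in total
EXECUTOR_CORES = 21               # cores per executor on a worker node

# Memory per executor: a node's 371 GB divided among its executors, times a 0.5 safety factor
EXECUTOR_MEMORY = int(371 / (63 / EXECUTOR_CORES) * 0.5)

executors = CLUSTER_CORES_MAX / EXECUTOR_CORES  # number of executors in the cluster
# Add 10% memory overhead per executor, with a 384 MiB (~0.403 GB) minimum
check = executors * (EXECUTOR_MEMORY + max(EXECUTOR_MEMORY * 0.10, 0.403))

print(EXECUTOR_MEMORY, int(executors), int(check))  # 61 GB, 9 executors, 603 GB total
assert int(check) <= CLUSTER_MEMORY_MAX
```

With these settings the cluster runs 9 executors of 61 GB each, for roughly 603 GB including overhead, comfortably below the 1113 GB total.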
After submitting the application to the cluster, the Spark UI looks like the following figure.
In this example the cluster architecture consists of 3 worker nodes.