Modern HPC Workflow Example (Guix)
This guide applies the workflow presented at Modern HPC Workflow with Containers to run an application container built with Guix.
This tutorial will focus on using Grid5000 for both building the container with Guix and deploying it with Singularity, as it provides both tools.
The container may be built on any computer with Guix installed. You may refer to the documentation if you wish to install Guix on your machine. Beware that if you build it on your local machine, you’ll have to copy it to Grid5000.
Additional instructions will be provided for deployment on Jean-Zay; they can easily be adapted to any cluster that supports Singularity and uses SLURM as its job management system.
The application chosen as an example is Chameleon, a dense linear algebra library for heterogeneous architectures that supports MPI and NVIDIA GPUs through CUDA or AMD GPUs through ROCm.
Chameleon on NVIDIA GPUs
Build the container on Grid5000
- Log in to Grid5000 (detailed instructions here). The full list of resources shows where to find an NVIDIA GPU and an x86_64 CPU (for Singularity). For instance, the `chifflot` cluster, located in Lille, contains nodes with NVIDIA P100 GPUs.
- Get the channels file. The `chameleon-cuda` package (the `chameleon` package variant with CUDA support) is defined in the Guix-HPC non-free channel, which is not activated by default. The channels.scm file contains the following:
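A sketch of such a file, assuming the channel is fetched from the public Guix-HPC non-free repository (in practice you would pin exact commits, e.g. with `guix describe -f channels`, for reproducibility):

```scheme
;; channels.scm -- add the Guix-HPC non-free channel on top of
;; the default Guix channels.
(cons (channel
        (name 'guix-hpc-non-free)
        (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
      %default-channels)
```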
- Generate the Singularity container image with the `guix pack` command, prefixed with `guix time-machine` in order to use our channels.scm file. The `-r` option creates a symbolic link, named chameleon.sif, to the resulting container image in the Guix store.
Tip
guix pack can generate different formats, like
Singularity (squashfs), Docker or relocatable binaries.
Tip
Singularity needs bash to be in the package list.
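The build step above can be sketched as follows (the `chameleon-cuda` package, channels.scm, and chameleon.sif names follow the text; adjust to your setup):

```shell
# Build a Singularity (SquashFS) image pinned to the revisions in
# channels.scm. -r creates the chameleon.sif symlink to the image
# in the store; bash is listed because Singularity needs it.
guix time-machine -C channels.scm -- \
  pack -f squashfs -r chameleon.sif \
  chameleon-cuda bash
```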
Deploy the container on Grid5000
- Start an interactive job using OAR.
Tip
CUDA applications deployed with Guix need LD_PRELOAD to
be set with the path to libcuda.so since the library is
provided by the proprietary CUDA driver, installed on the
machine, and not part of the Guix software stack.
Tip
The OPENBLAS_NUM_THREADS environment variable is set to improve computation performance; it is not compulsory.
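Putting the tips together, the interactive session might look like this sketch (the `chifflot` property, walltime, libcuda.so path, and the `chameleon_stesting` test driver are assumptions to adapt):

```shell
# Request a chifflot node interactively with OAR (Lille site).
oarsub -I -p "cluster='chifflot'" -l host=1,walltime=1:00:00

# On the node: --nv bind-mounts the NVIDIA driver libraries; the
# SINGULARITYENV_ prefix sets the variables inside the container,
# where the bind-mounted libcuda.so path exists.
SINGULARITYENV_LD_PRELOAD=/.singularity.d/libs/libcuda.so \
SINGULARITYENV_OPENBLAS_NUM_THREADS=1 \
singularity exec --nv chameleon.sif \
  chameleon_stesting -o gemm -n 9600
```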
Deploy the container on Jean-Zay
- Copy the image to Jean-Zay. Depending on your SSH setup, you might have to adapt the commands below.
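A minimal sketch (the host name and login are placeholders for your Jean-Zay account details):

```shell
# Copy the image from the build machine to Jean-Zay.
scp chameleon.sif mylogin@jean-zay.idris.fr:
```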
- Set up the container image on Jean-Zay. First, the image has to be copied to the allowed space ($SINGULARITY_ALLOWED_DIR) in order to be accessible to Singularity. This step is specific to Jean-Zay; more details in the documentation. Then the `singularity` module needs to be loaded (this step is not always necessary, depending on the supercomputer, but is not specific to Jean-Zay).
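A sketch of the setup, assuming the image was copied to your home directory and the module is simply named `singularity` (check `module avail`):

```shell
# Make the image visible to Singularity by placing it in the
# allowed directory, then load the singularity module.
cp chameleon.sif $SINGULARITY_ALLOWED_DIR/
module load singularity
```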
- Start the job using the container with SLURM.
Tip
Environment variables are propagated to the Singularity container context, but since the path to libcuda.so doesn't exist outside of the container context (the path is bind-mounted by Singularity due to the --nv flag), declaring LD_PRELOAD outside of the container context leads to an error.
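A sketch of the submission; the account, GPU count, and the `chameleon_stesting` driver are placeholders to adapt:

```shell
# The SINGULARITYENV_ prefix sets LD_PRELOAD *inside* the container
# context, where /.singularity.d/libs/libcuda.so exists once
# Singularity has bind-mounted the driver (--nv).
SINGULARITYENV_LD_PRELOAD=/.singularity.d/libs/libcuda.so \
srun --account=myproject@gpu --gres=gpu:1 --ntasks=1 \
  singularity exec --nv $SINGULARITY_ALLOWED_DIR/chameleon.sif \
  chameleon_stesting -o gemm -n 9600
```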
Deploy the container on Vega (EuroHPC)
- Copy the image to Vega. Depending on your SSH setup, you might have to adapt the commands below.
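A sketch; the login node name and user are assumptions based on the usual Vega access details:

```shell
scp chameleon.sif mylogin@login.vega.izum.si:
```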
- Start the job using the container with SLURM.
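A sketch of the run; the partition name and the `chameleon_stesting` driver are assumptions:

```shell
# One GPU task; LD_PRELOAD is set inside the container context
# (same reasoning as on Jean-Zay).
SINGULARITYENV_LD_PRELOAD=/.singularity.d/libs/libcuda.so \
srun --partition=gpu --gres=gpu:1 --ntasks=1 \
  singularity exec --nv chameleon.sif \
  chameleon_stesting -o gemm -n 9600
```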
Deploy the container on MeluXina (EuroHPC)
- Copy the image to MeluXina.
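A sketch; the host, port, and login are assumptions based on the usual MeluXina access setup:

```shell
# MeluXina SSH access typically uses a non-default port.
scp -P 8822 chameleon.sif mylogin@login.lxp.lu:
```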
- Start an interactive allocation with SLURM and load Singularity/Apptainer. On MeluXina, the `singularity` command is available through a module, and the `module` command is only accessible on a compute node.
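A sketch; the account, partition, QOS, and module name are assumptions to adapt to your MeluXina project:

```shell
# Interactive allocation on a GPU node, then load Singularity on
# the compute node (module is not available on the login node).
salloc --account=p200000 --partition=gpu --qos=default \
       --nodes=1 --time=00:30:00
module load Singularity-CE   # module name is an assumption
```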
- Start the computation using Singularity.
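A sketch of the run, with the same caveats as before (libcuda.so path and the `chameleon_stesting` driver are assumptions):

```shell
SINGULARITYENV_LD_PRELOAD=/.singularity.d/libs/libcuda.so \
singularity exec --nv chameleon.sif \
  chameleon_stesting -o gemm -n 9600
```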
Tip
In this example, we use a single node. In order to use multiple nodes, a
script should be submitted using sbatch (not covered in this tutorial).
Build a Docker image on Grid5000
- Build a Docker container on Grid5000.
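The same recipe as the Singularity build, with the Docker format (archive name is a suggestion):

```shell
# -f docker produces a tarball loadable with `docker load` or
# importable by pcocc-rs; -r names the resulting archive.
guix time-machine -C channels.scm -- \
  pack -f docker -r chameleon-docker.tar.gz \
  chameleon-cuda bash
```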
Deploy a Docker container on Irene (TGCC)
- Copy the container to Irene. Depending on your SSH setup, you might have to adapt the commands below.
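A sketch; the login node name is an assumption, so use the one given by the TGCC:

```shell
scp chameleon-docker.tar.gz mylogin@irene-fr.ccc.cea.fr:
```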
- SSH to Irene and import the image using `pcocc-rs`.
Tip
The TGCC uses a specific tool to deploy Docker images called
pcocc-rs. See the
documentation.
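The import might look like the following sketch; the exact `pcocc-rs` subcommand syntax is an assumption, so check the TGCC documentation:

```shell
# Import the Docker archive into the pcocc-rs image repository
# (subcommand syntax is an assumption -- see the TGCC docs).
pcocc-rs image import docker-archive:chameleon-docker.tar.gz chameleon
```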
- Start a job in interactive mode.
Tip
On Irene, resources are allocated using ccc_mprun. See the documentation.
For instance, the -s option spawns an interactive session directly on a compute node.
Tip
On Irene, the number of allocated GPUs is directly tied to the number of allocated cores on the node. Here, 20 cores are allocated on a V100 node, which has 40 cores in total, so 50% of the GPUs available on the node (4 x V100) are allocated. See the documentation.
Tip
The --module nvidia option makes the CUDA libraries available inside the image, in the /pcocc/nvidia/usr/lib64 folder.
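A sketch of the allocation described by the tips; the partition name is an assumption:

```shell
# -c 20 asks for 20 of the 40 cores (hence 2 of the 4 GPUs);
# -s spawns an interactive shell directly on the compute node.
ccc_mprun -p v100 -c 20 -s
```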
Chameleon on AMD GPUs
Build the image on Grid5000
- Connect to Grid5000 and build the Singularity container.
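The same workflow as the CUDA image, using the ROCm variant of Chameleon (the `chameleon-hip` package name follows the "Bonus" section below):

```shell
guix time-machine -C channels.scm -- \
  pack -f squashfs -r chameleon-hip.sif \
  chameleon-hip bash
```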
Deploy on Adastra
- Copy the Singularity image to Adastra. Depending on your SSH setup, you might have to adapt the commands below.
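A sketch; the host name is an assumption, so use your CINES login node:

```shell
scp chameleon-hip.sif mylogin@adastra.cines.fr:
```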
Warning
Before being able to use a custom Singularity image, it has to be manually copied to an authorized path by the support team, which should be contacted by email. See the documentation.
- Start a job in interactive mode.
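A sketch of the run; the account, constraint, and the `chameleon_stesting` driver are assumptions to adapt to your Adastra project:

```shell
# --rocm bind-mounts the host ROCm stack into the container.
srun --account=myproject --constraint=MI250 --gpus=1 --ntasks=1 \
  singularity exec --rocm chameleon-hip.sif \
  chameleon_stesting -o gemm -n 9600
```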
Deploy on LUMI
- Copy the Singularity image to LUMI. Depending on your SSH setup, you might have to adapt the commands below.
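A sketch; the login is a placeholder for your CSC account:

```shell
scp chameleon-hip.sif mylogin@lumi.csc.fi:
```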
- Start a job in interactive mode.
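A sketch of the run on the LUMI-G GPU nodes; the project account, partition, and the `chameleon_stesting` driver are placeholders:

```shell
srun --account=project_465000000 --partition=dev-g \
     --gpus=1 --ntasks=1 \
  singularity exec --rocm chameleon-hip.sif \
  chameleon_stesting -o gemm -n 9600
```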
Bonus: relocatable binaries
For machines where Singularity is not available (or where you have to ask support to deploy your custom image), an alternative is a relocatable binary archive. The command below generates an archive containing chameleon-hip for AMD GPUs that can be run on e.g. Adastra:
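A sketch of the pack command (archive name is a suggestion):

```shell
# -RR makes the binaries relocatable even without user namespaces
# (PRoot fallback); -S adds a top-level bin/ symlink for convenience.
guix time-machine -C channels.scm -- \
  pack -RR -S /bin=bin -r chameleon-hip.tar.gz \
  chameleon-hip bash
```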
This archive can then be uploaded to a supercomputer (e.g. Adastra) and deployed:
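For instance, assuming the archive built above and an Adastra account (host, paths, and the `chameleon_stesting` driver are placeholders):

```shell
# Upload, unpack anywhere, and run the relocatable binaries.
scp chameleon-hip.tar.gz mylogin@adastra.cines.fr:
ssh mylogin@adastra.cines.fr
mkdir -p ~/chameleon && tar -xf chameleon-hip.tar.gz -C ~/chameleon
~/chameleon/bin/chameleon_stesting -o gemm -n 9600
```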