Compute Backend Config
The Compute Backend configuration can be given either in YAML or JSON format and may contain information about the resources that are allocated for the job.
The possible config entries are:
UnicoreApiUrl
(optional): Specifies the API URL of UNICORE for the respective HPC site. See Supported HPC Sites for a list of supported HPC sites and their UNICORE API URLs.
Example:
UnicoreApiUrl: https://zam2125.zam.kfa-juelich.de:9112/JUWELS/rest/core
Firecrest
(optional): Specifies the firecREST details for the respective HPC site.
ApiUrl
(required): firecREST API URL of the HPC site.
TokenUrl
(required): Token URL used for creating authentication tokens.
Machine
(required): Name of the machine to use for submission.
See Supported HPC Sites for a list of supported HPC sites and their firecREST URLs and machines.
Example:
Firecrest:
  ApiUrl: https://firecrest.cscs.ch
  TokenUrl: https://auth.cscs.ch/auth/realms/firecrest-clients/protocol/openid-connect/token
  Machine: daint
Note
Either the UnicoreApiUrl or the Firecrest section has to be provided.
It is not possible to provide both.
See Supported HPC Sites for a list of supported HPC sites and their respective UNICORE/firecREST URLs and machines.
Environment
(optional): Anything related to the execution environment of the MLflow project. Choose between Apptainer, Python, or none. Currently, we support Apptainer images or local environments built with the venv module, respectively.
When using no environment for execution, i.e. neither Apptainer nor Python, the MLproject entry point will be executed as is. This might be a viable option, for example, when only making use of software modules (see Environment.Modules) instead of a venv.
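For illustration, a minimal sketch of such a modules-only Environment section (the module name is taken from the Modules example below):
Environment:
  Modules:
    - Python/3.9.6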
Apptainer
(optional): Defines which Apptainer image is used to run the project in. The given image will be executed using srun apptainer run <image>.
Path
(required): Path to the Apptainer image file.
Apptainer:
  Path: image.sif
Type
(optional, default is local): Whether the image is stored locally or remotely.
local: Local path to the image file, either relative or absolute. If the given path is relative, it is assumed to be relative to the MLflow project directory.
remote: Absolute path to the image file on the remote system.
Apptainer:
  Path: image.sif
  Type: local
Note
If Apptainer.Type is local (which is the default), and Apptainer.Path is given as a relative path (e.g. image.sif, sub_dir/image.sif, or ../image.sif), that relative path will be assumed to be relative to the MLflow project directory.
Apptainer:
  Path: /path/on/remote/system/image.sif
  Type: remote
Options
(optional): Options as a string or list of strings that will be passed to the Apptainer executable. The options are passed as srun apptainer run <options> <image>.
Apptainer:
  Path: /path/on/remote/system/image.sif
  Type: remote
  Options:
    - --nv
    - -B /data:/data
Apptainer:
  Path: /path/on/remote/system/image.sif
  Type: remote
  Options: --nv -B /data:/data
Apptainer:
  Path: /path/on/remote/system/image.sif
  Type: remote
  Options: >
    --nv
    -B /data:/data
Note
Options allow the usage of environment variables (e.g. Options: -B $HOME/data:/data). The value of the respective environment variables will be read from the run environment. Environment variables may reference variable names given in the Config.Environment.Variables section, since they will be available in the run environment (see the sketch below).
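As a hedged sketch of this behaviour, the following combines both sections; the variable name DATA_DIR and its value are assumptions:
Environment:
  Variables:
    DATA_DIR: /path/to/data
  Apptainer:
    Path: /path/on/remote/system/image.sif
    Type: remote
    Options: -B $DATA_DIR:/data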
Python
(optional): Path to the virtual environment to load before executing the project. The given path will be activated using source <path to venv>/bin/activate before running the main command.
Python: /path/to/venv
Python:
  Path: /path/to/venv
Note
If the venv is located on the remote system, the given path must be absolute.
If you put a venv in the MLflow project directory, the path must be relative to the MLflow project directory.
When you submit runs to the CSCS cluster, it's not possible to access the $HOME directory. Therefore, you should create any venv in the $SCRATCH directory (e.g. /scratch/snx3000/$USER for daint). (This is because firecREST, which is used by Mantik to submit a run at CSCS, doesn't have access to the $HOME directory.)
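For example, a venv created under $SCRATCH on daint could be referenced as follows (the venv directory name is a placeholder):
Python: /scratch/snx3000/$USER/my-venv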
Modules
(optional): List of modules to load before executing the project. They will be loaded using module load. Note that the modules specified here will be loaded on the compute node and not on the login node.
Modules:
  - Python/3.9.6
  - PyTorch/1.8.1-Python-3.8.5
Variables
(optional): Pass environment variables as key-value pairs that will be available at runtime.
Environment:
  Variables:
    TEST_ENV_VAR: test value
    ANOTHER_VAR: another value
PreRunCommandOnLoginNode / PostRunCommandOnLoginNode / PreRunCommandOnComputeNode / PostRunCommandOnComputeNode
(optional): A single command or a series of commands in the form of a list. As their names indicate, these commands will be executed either before or after the main application runs, and they can be set to run on either the login node or the compute node.
Pre/PostRunCommandOnLoginNode cannot be used with a site that uses firecREST (e.g. CSCS, see Supported HPC Sites).
All four command types can be used simultaneously within the Config. Their execution sequence is organized as follows:
PreRunCommandOnLoginNode
PreRunCommandOnComputeNode
PostRunCommandOnComputeNode
PostRunCommandOnLoginNode
PreRunCommandOnLoginNode:
  - echo running on login node
PostRunCommandOnComputeNode:
  - echo running on compute node
  - echo cleaning up
Resources
(required): Specify resources to allocate for the application.
Queue
(required): Queue to schedule the job to.
Runtime
(optional): Maximum runtime of the job. Required format is <value><unit>. Valid units are s (seconds), m (minutes), h (hours), d (days). Multiple time units are not allowed (e.g. 1h30m).
Resources:
  Runtime: 12h
Nodes
(optional): Number of nodes to use for job execution.
TotalCPUs
(optional): Total number of CPUs to use.
CPUsPerNode
(optional): Number of CPUs per node.
Attention
If CPUsPerNode is given and SRUN_CPUS_PER_TASK is not explicitly set in Environment.Variables, the value of CPUsPerNode will be used for SRUN_CPUS_PER_TASK to make srun inherit from CPUsPerNode (see here for further details).
This can be prevented by manually setting SRUN_CPUS_PER_TASK in Environment.Variables, as shown in the sketch below.
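As an illustrative sketch (the values are assumptions), CPUsPerNode can be combined with an explicitly pinned SRUN_CPUS_PER_TASK:
Resources:
  Queue: batch
  CPUsPerNode: 8
Environment:
  Variables:
    SRUN_CPUS_PER_TASK: 1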
GPUsPerNode
(optional): Number of GPUs per node.
MemoryPerNode
(optional): Memory per node to allocate for the job.
Resources:
  MemoryPerNode: 12GiB
Reservation
(optional): Batch system reservation ID.
NodeConstraints
(optional): Batch system node constraints.
Resources:
  NodeConstraints: gpu
QoS
(optional): Batch system QoS.
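Putting several of these fields together, an illustrative Resources section might look as follows; the queue name, reservation ID, and QoS value are assumptions for a generic Slurm site:
Resources:
  Queue: batch
  Runtime: 12h
  Nodes: 2
  GPUsPerNode: 4
  MemoryPerNode: 64GiB
  Reservation: my-reservation
  NodeConstraints: gpu
  QoS: normal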
Exclude
(optional): List of file names, directories, or patterns to exclude from uploading.
Exclude:
  - data.csv
  - "*.sif"
  - sub-directory/
  - "**/*.png"
See also
For more details about each option see the UNICORE job description.
Information on the UnicoreApiUrl for Juelich Computing Center, e.g. for access to JUWELS, can be found here and here.
Examples
UnicoreApiUrl: https://zam2125.zam.kfa-juelich.de:9112/JUWELS/rest/core
Environment:
  Apptainer:
    Path: image.sif
    Type: local
    Options: --nv
  Variables:
    TEST_ENV_VAR: variable value
Resources:
  Queue: batch
  Nodes: 2
Exclude:
  - another-image.sif
{
"UnicoreApiUrl": "https://zam2125.zam.kfa-juelich.de:9112/JUWELS/rest/core",
"Environment": {
"Python": "/path/to/venv",
"Variables": {
"TEST_ENV_VAR": "variable value"
}
},
"Resources": {
"Queue": "batch",
"Nodes": 2
},
"Exclude": ["*.sif"]
}