Compute Backend Config

The Compute Backend configuration can be given in either YAML or JSON format. It describes the HPC site to submit to, the execution environment, and the resources allocated for the job.

The possible config entries are:

  • UnicoreApiUrl (optional): Specifies the API URL of UNICORE for the respective HPC site.

    See Supported HPC Sites for a list of supported HPC sites and their UNICORE API URLs.

    Example:

    UnicoreApiUrl: https://zam2125.zam.kfa-juelich.de:9112/JUWELS/rest/core
    
  • Firecrest (optional): Specifies the firecREST details for the respective HPC site.

    • ApiUrl (required): firecREST API URL of the HPC site.

    • TokenUrl (required): Token URL used for creating authentication tokens.

    • Machine (required): Name of the machine to use for submission.

    See Supported HPC Sites for a list of supported HPC sites and their firecREST URLs and machines.

    Example:

    Firecrest:
      ApiUrl: https://firecrest.cscs.ch
      TokenUrl: https://auth.cscs.ch/auth/realms/firecrest-clients/protocol/openid-connect/token
      Machine: daint
    

Note

Either the UnicoreApiUrl entry or the Firecrest section must be provided. It is not possible to provide both.

See Supported HPC Sites for a list of supported HPC sites and their respective UNICORE/firecREST URLs and machines.
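
The rule in the note above can be expressed as a small pre-submission check. The following is a minimal sketch in Python, not Mantik's own validation code: the function name check_backend_section and the error messages are assumptions made for illustration, and the config is assumed to be already parsed into a dictionary.

```python
def check_backend_section(config: dict) -> str:
    """Return which backend section is configured, or raise if the rule is violated.

    Hypothetical helper: the name and the error messages are illustrative only.
    """
    has_unicore = "UnicoreApiUrl" in config
    has_firecrest = "Firecrest" in config

    if has_unicore and has_firecrest:
        raise ValueError("Provide either UnicoreApiUrl or Firecrest, not both")
    if not has_unicore and not has_firecrest:
        raise ValueError("One of UnicoreApiUrl or Firecrest must be provided")

    if has_firecrest:
        # ApiUrl, TokenUrl, and Machine are all documented as required.
        missing = [
            key
            for key in ("ApiUrl", "TokenUrl", "Machine")
            if key not in config["Firecrest"]
        ]
        if missing:
            raise ValueError(f"Firecrest is missing: {', '.join(missing)}")
        return "Firecrest"

    return "UnicoreApiUrl"
```

Running such a check locally before submitting can surface a misconfigured backend section early, without waiting for a response from the HPC site.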

  • Environment (optional): Anything related to the execution environment of the MLflow project.

    Choose between Apptainer, Python, or none. Currently, we support Apptainer images or local environments built with the venv module, respectively.

    When using no environment for execution, i.e. neither Apptainer nor Python, the MLproject entry point will be executed as is. This might be a viable option, for example, when only making use of software modules (see Environment.Modules) instead of a venv.

    • Apptainer (optional): Defines which Apptainer image is used to run the project in.

      The given image will be executed using srun apptainer run <image>.

      • Path (required): Path to the Apptainer image file.

        Apptainer:
          Path: image.sif
        
      • Type (optional, default is local): Whether the image is stored locally or remotely.

        • local: Local path to the image file, either relative or absolute. If the given path is relative, it is assumed to be relative to the MLflow project directory.

        • remote: Absolute path to the image file on the remote system.

        Apptainer:
          Path: image.sif
          Type: local
        

        Note

        If Apptainer.Type is local (which is the default), and Apptainer.Path is given as a relative path (e.g. image.sif, sub_dir/image.sif, or ../image.sif), that relative path will be assumed to be relative to the MLflow project directory.

        Apptainer:
          Path: /path/on/remote/system/image.sif
          Type: remote
        
      • Options (optional): Options as a string or list of strings that will be passed to the Apptainer executable.

        The options are passed as srun apptainer run <options> <image>.

        Apptainer:
          Path: /path/on/remote/system/image.sif
          Type: remote
          Options:
            - --nv
            - -B /data:/data
        
        Apptainer:
          Path: /path/on/remote/system/image.sif
          Type: remote
          Options: --nv -B /data:/data
        
        Apptainer:
          Path: /path/on/remote/system/image.sif
          Type: remote
          Options: >
            --nv
            -B /data:/data
        

        Note

        • Options allow the usage of environment variables (e.g. Options: -B $HOME/data:/data). The value of the respective environment variables will be read from the run environment.

        • Environment variables may reference variable names given in the Config.Environment.Variables section since they will be available in the run environment.

    • Python (optional): Path to the virtual environment to load before executing the project.

      The given path will be activated using source <path to venv>/bin/activate before running the main command.

      Python: /path/to/venv
      
      Python:
        Path: /path/to/venv
      

      Note

      • If the venv is located on the remote system, the given path must be absolute.

      • If you put a venv in the MLflow project directory, the path must be relative to the MLflow project directory.

      • When you submit runs to the CSCS cluster, the $HOME directory is not accessible because firecREST, which Mantik uses to submit runs at CSCS, does not have access to it. Therefore, create any venv in the $SCRATCH directory (e.g. /scratch/snx3000/$USER for daint).

    • Modules (optional): List of modules to load before executing the project.

      The modules are loaded using module load. Note that they are loaded on the compute node, not on the login node.

      Modules:
        - Python/3.9.6
        - PyTorch/1.8.1-Python-3.8.5
      
    • Variables (optional): Pass environment variables as key, value pairs that will be available at runtime.

      Environment:
        Variables:
          TEST_ENV_VAR: test value
          ANOTHER_VAR: another value
      
    • PreRunCommandOnLoginNode/PostRunCommandOnLoginNode/PreRunCommandOnComputeNode/PostRunCommandOnComputeNode (optional): A single command or a list of commands. As their names indicate, these commands are executed either before or after the main application runs, on either the login node or the compute node.

      Pre/PostRunCommandOnLoginNode cannot be used with a site that uses firecREST (e.g. CSCS, see Supported HPC Sites).

      All four command types can be used simultaneously within the Config. Their execution sequence is organized as follows:

      1. PreRunCommandOnLoginNode

      2. PreRunCommandOnComputeNode

      3. PostRunCommandOnComputeNode

      4. PostRunCommandOnLoginNode

      PreRunCommandOnLoginNode:
        - echo running on login node
      PostRunCommandOnComputeNode:
        - echo running on compute node
        - echo cleaning up
      
  • Resources (required): Specify resources to allocate for the application.

    • Queue (required): Queue to schedule the job to.

    • Runtime (optional): Maximum runtime of the job.

      Required format is <value><unit>. Valid units are s (seconds), m (minutes), h (hours), d (days).

      Combining multiple time units (e.g. 1h30m) is not allowed.

      Resources:
        Runtime: 12h
      
    • Nodes (optional): Number of nodes to use for job execution.

    • TotalCPUs (optional): Total number of CPUs to use.

    • CPUsPerNode (optional): Number of CPUs per node.

      Attention

      If CPUsPerNode is given and SRUN_CPUS_PER_TASK is not explicitly set in Environment.Variables, the value of CPUsPerNode will be used for SRUN_CPUS_PER_TASK to make srun inherit from CPUsPerNode (see here for further details).

      This can be prevented by manually setting SRUN_CPUS_PER_TASK in Environment.Variables.

    • GPUsPerNode (optional): Number of GPUs per node.

    • MemoryPerNode (optional): Memory per node to allocate for the job.

      Resources:
        MemoryPerNode: 12GiB
      
    • Reservation (optional): Batch system reservation ID.

    • NodeConstraints (optional): Batch system node constraints.

      Resources:
        NodeConstraints: gpu
      
    • QoS (optional): Batch system QoS.

  • Exclude (optional): List of file names, directories, or patterns to exclude from uploading.

    Exclude:
      - data.csv
      - "*.sif"
      - sub-directory/
      - "**/*.png"
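
The format rules for Resources.Runtime above (one value followed by one unit, with no combined units) can be checked with a short regular expression. This is a sketch under stated assumptions rather than Mantik's actual parser: the function name runtime_to_seconds is hypothetical, and the value is assumed to be a non-negative integer, which the description above does not state explicitly.

```python
import re

# Seconds per unit for the documented units s, m, h, and d.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Exactly one integer value followed by exactly one unit.
# Assumption: only integer values are accepted.
_RUNTIME_PATTERN = re.compile(r"(\d+)([smhd])")


def runtime_to_seconds(runtime: str) -> int:
    """Convert a <value><unit> runtime string to seconds (hypothetical helper)."""
    match = _RUNTIME_PATTERN.fullmatch(runtime)
    if match is None:
        # Rejects combined units such as 1h30m as well as missing units.
        raise ValueError(
            f"Invalid runtime {runtime!r}: expected <value><unit> "
            "with a single unit out of s, m, h, d"
        )
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]
```

With this sketch, runtime_to_seconds("12h") returns 43200, while runtime_to_seconds("1h30m") raises a ValueError.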
    

See also

  • For more details about each option see the UNICORE job description.

  • Information on the UnicoreApiUrl for Juelich Computing Center, e.g. for access to JUWELS, can be found here and here.

Examples

UnicoreApiUrl: https://zam2125.zam.kfa-juelich.de:9112/JUWELS/rest/core
Environment:
  Apptainer:
    Path: image.sif
    Type: local
    Options: --nv
  Variables:
    TEST_ENV_VAR: variable value
Resources:
  Queue: batch
  Nodes: 2
Exclude:
  - another-image.sif

{
  "UnicoreApiUrl": "https://zam2125.zam.kfa-juelich.de:9112/JUWELS/rest/core",
  "Environment": {
    "Python": "/path/to/venv",
    "Variables": {
      "TEST_ENV_VAR": "variable value"
    }
  },
  "Resources": {
    "Queue": "batch",
    "Nodes": 2
  },
  "Exclude": ["*.sif"]
}
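
Since the config may be written in either format, a tool that reads it needs to pick a parser. The sketch below shows one way to do that in Python. It is an illustration only and makes several assumptions: the function name parse_backend_config is hypothetical, the format is chosen by file extension (Mantik may decide differently), and YAML parsing relies on the third-party PyYAML package. The check for Resources.Queue reflects the entries marked as required above.

```python
import json
from pathlib import PurePath


def parse_backend_config(text: str, file_name: str) -> dict:
    """Parse config text as JSON or YAML, chosen by file extension (an assumption)."""
    suffix = PurePath(file_name).suffix.lower()
    if suffix == ".json":
        config = json.loads(text)
    elif suffix in (".yml", ".yaml"):
        # PyYAML is a third-party package; it is imported only when needed.
        import yaml

        config = yaml.safe_load(text)
    else:
        raise ValueError(f"Unsupported config file type: {suffix!r}")

    if not isinstance(config, dict):
        raise ValueError("The config must be a mapping of entries")

    # Resources and Resources.Queue are documented as required.
    resources = config.get("Resources")
    if not isinstance(resources, dict) or "Queue" not in resources:
        raise ValueError("Resources.Queue is required")
    return config
```

Passing the text of the JSON example above together with a file name ending in .json would return a dictionary whose Resources entry contains the queue and node count.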