Submitting an Application¶
Mantik allows you to execute your ML applications on remote HPC systems using UNICORE or firecREST. Runs can be submitted via the Mantik platform, the CLI, or the Python API.
Mantik expects the ML project you want to use to follow a certain format; we essentially follow the MLflow conventions for MLprojects (see the previous section, Preparing Your Application).
Submitting Runs¶
A Mantik run represents an application submitted for execution on an external system. A run might, for example, train a ML model, or load and use a trained ML model for inference.
First, set your credentials as environment variables (see Credentials and Environment Variables).
On top of that, Mantik requires the MANTIK_COMPUTE_BUDGET_ACCOUNT environment variable, which defines the account that will be billed for the compute resources used.
Moreover, Mantik requires your credentials for accessing the remote system. These can be provided in two ways:
Store your credentials as a Connection on the Mantik platform. This is required if you want to submit your application from the platform in the future.
Set the required credentials for the HPC site as environment variables, depending on whether it’s accessible via UNICORE or firecREST. (see Credentials and Environment Variables).
The preferred solution is to store the credentials on Mantik (see Credentials and Environment Variables).
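If you set the variables in your shell instead, it might look as follows. Only MANTIK_COMPUTE_BUDGET_ACCOUNT is named on this page; the exact credential variable names depend on whether the site uses UNICORE or firecREST (see Credentials and Environment Variables), and the value below is a hypothetical example.

```shell
# MANTIK_COMPUTE_BUDGET_ACCOUNT is documented above; "myproject" is a
# hypothetical example value for your HPC billing account.
export MANTIK_COMPUTE_BUDGET_ACCOUNT="myproject"
# The credential variables for the HPC site (UNICORE or firecREST) are
# listed in Credentials and Environment Variables; set them here as well.
```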
Then you can submit a run using Mantik's UI as shown here, or via the CLI:
mantik runs submit <relative path to mlproject file> \
--name <Name of the Run> \
--connection-id <Mantik Connection UUID> \
--entry-point <entry point name> \
--backend-config <path to backend configuration file relative to mlproject path> \
--project-id <ID of the project to which the run should be linked> \
--experiment-repository-id <ID of the experiment repository to which the run should be linked> \
--code-repository-id <ID of the code repository where the mlproject is located> \
--branch <Name of the code repository's branch> \
--compute-budget-account <Name of the compute budget account on HPC> \
-P <key>=<value> \
-P <key>=<value>
where <relative path to mlproject file> refers to the path to the MLproject file, with the MLproject directory as base directory. You will find all the needed IDs in Mantik's UI.
The response contains the run ID and the timestamp, so that you can find your runs easily in the UI and interact with them via the Mantik API.
Mantik automatically creates a working directory on the remote system before it starts executing the designated job. This is <scratch directory>/unicore-jobs/<UNICORE job ID> for UNICORE and <scratch directory>/mantik/<Mantik run ID> for firecREST.
Within this directory, you can find a file named mantik.log. This file gathers and records all the output and error messages (stderr and stdout) generated by your application during its execution.
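As an illustration, the firecREST path pattern above can be assembled in the shell like this; the scratch directory and run ID values are hypothetical placeholders.

```shell
# Hypothetical values; substitute your own scratch directory and run ID.
SCRATCH_DIR="/scratch/myproject"
MANTIK_RUN_ID="1a2b3c4d"
# firecREST run directory pattern: <scratch directory>/mantik/<Mantik run ID>
LOG_FILE="$SCRATCH_DIR/mantik/$MANTIK_RUN_ID/mantik.log"
echo "$LOG_FILE"
# On the remote system you could then follow the log, e.g.:
# tail -f "$LOG_FILE"
```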
To avoid repeatedly specifying the project, experiment repository, code repository, and data repository IDs as options when submitting a run, you can either use the re-run feature of the respective run on the platform (Project > Runs > Submissions > ), or store these IDs in the environment variables MANTIK_PROJECT_ID, MANTIK_EXPERIMENT_REPOSITORY_ID, MANTIK_CODE_REPOSITORY_ID, and MANTIK_DATA_REPOSITORY_ID.
For further details see the CLI reference.
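For example, the ID variables named above could be exported once per shell session; the variable names are the ones documented here, while the UUID values are hypothetical placeholders.

```shell
# Variable names as documented above; UUID values are hypothetical examples.
export MANTIK_PROJECT_ID="3fa85f64-5717-4562-b3fc-2c963f66afa6"
export MANTIK_EXPERIMENT_REPOSITORY_ID="9c1d2e3f-0a1b-4c2d-8e3f-4a5b6c7d8e9f"
export MANTIK_CODE_REPOSITORY_ID="1b2c3d4e-5f60-4718-9a0b-1c2d3e4f5a6b"
export MANTIK_DATA_REPOSITORY_ID="7e8f9a0b-1c2d-4e3f-8a5b-6c7d8e9f0a1b"
# With these set, the corresponding --*-id options can be omitted from
# `mantik runs submit`.
```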
import mantik

mantik.submit_run(
    experiment_id=<experiment id>,
    mlflow_parameters={<key: value pairs for mlflow parameters and values>},
    mlproject_path="<relative path to mlproject file>",
    backend_config="<path to backend configuration file relative to mlproject path>",
    project_id=<ID of the project to which the run should be linked>,
    experiment_repository_id=<ID of the experiment repository to which the run should be linked>,
    code_repository_id=<ID of the code repository where the mlproject is located>,
    branch="<Name of the code repository's branch>",
    compute_budget_account="<Name of the compute budget account on HPC>",
    connection_id=<Mantik Connection UUID>,
)
where <relative path to mlproject file> refers to the path to the MLproject file, with the MLproject directory as base directory. You will find all the needed IDs in Mantik's UI.
For further details see the Python API documentation.
mlflow run --backend unicore --backend-config <path to backend config> <mlproject directory path>
where <mlproject directory path> is the path to the MLflow project directory.
Remember that when you submit a run, the code is retrieved from your remote Git code repository, so make sure to commit and push your changes before submitting a run! When using the CLI, the Python API, or the MLflow CLI, the only file read from your local system is the backend config.
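A minimal sketch of that workflow, using a throwaway repository purely for illustration (the push is commented out because it needs a configured remote):

```shell
# Create a throwaway repository purely for illustration.
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"   # required in a fresh repository
git config user.name "Your Name"
echo "name: my-project" > MLproject
git add MLproject
git commit -q -m "Prepare run for submission"
# In a real project you would now push, so the remote repository is current:
# git push
```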
Warning
As with the MLflow Docker backend, where the MLproject directory is mounted, all files in this directory are transferred alongside the Apptainer image (or venv environment) and are accessible at runtime.
Make sure not to keep unnecessary files in that directory, to avoid slowing down the file upload, or exclude them using the Backend Config (see Compute Backend Config).
Interacting With Submitted Runs¶
In the Runs view, you can find all the runs related to a specific project. Here, you will find details like the experiment repository, the connection used, the run’s start time, and its current status. You can interact with a run by clicking on any of the icons in the “Action” column.
To re-submit the run with the exact same mlflow parameters and backend config, click on
To access detailed information about a specific run, such as the submission info and the logs, click on
To delete the run from your project, click on
To cancel a run, i.e. interrupt its execution, click on
Attention
When submitting a run, the client by default uploads the entire MLflow project directory to the remote run directory. To avoid uploading specific files or folders, it is possible to give a list of files, directories, and/or patterns to exclude from the upload in the Backend Config Exclude field (see Compute Backend Config).
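As a sketch, a YAML backend config using the Exclude field might be written like this; only the Exclude field is taken from this page, and the entries are hypothetical examples (see Compute Backend Config for the full set of supported fields).

```shell
# Write a minimal backend-config fragment demonstrating the Exclude field.
# The excluded entries are hypothetical examples.
cat > backend-config.yaml <<'EOF'
Exclude:
  - data/
  - "*.ipynb"
  - large_model.sif
EOF
cat backend-config.yaml
```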
Running in containers¶
To guarantee portability of projects, we recommend container-based projects built with Apptainer.
The two possibilities to configure container-based projects with Mantik are:
build a Docker image locally and reference it in the MLflow Docker container environment, or
directly build an Apptainer image and reference it in the Compute Backend Config.
Apptainer images can easily reach multiple gigabytes in size. We highly encourage users to store these images on the remote infrastructure beforehand and reference them accordingly in the Backend Config, to avoid long upload times and redundant disk usage on the remote system when images are reused across runs.
Docker¶
MLflow offers the possibility to define Docker container environments for your MLproject. We strongly suggest setting up your projects in this way.
Apptainer¶
Mantik allows you to use either local Apptainer images (up to a certain size limit) or remote images stored on the file system of the respective HPC site. In the first case, the Mantik client transfers the image to the compute resource provider.
See also
For installation instructions, see the Apptainer documentation. For more information on Apptainer images, see the documentation.
Configuration¶
The Compute Backend config can be given in either YAML or JSON format and may contain information about the resources allocated for the job.
For detailed information see Compute Backend Config.
Example Projects¶
For a set of example projects using Mantik for remote runs see Examples.