Using MLflow for Tracking

This tutorial helps you get started with MLflow for experiment tracking by giving a short overview of its main features. For detailed information, see the MLflow documentation.

Authentication at the Mantik Platform

You can use all MLflow tracking functions with the Mantik platform. However, in order to initialize tracking with Mantik, a token has to be retrieved from the Mantik API to allow authentication at the MLflow Tracking Server.

The command

import mantik

mantik.init_tracking()

reads environment variables (see Credentials and Environment Variables) and acquires an API token for secure communication with the MLflow Tracking Server.

The required environment variables for tracking are described in Credentials and Environment Variables.
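
As a rough sketch, the credentials can also be set programmatically before initializing tracking. Note that the variable names below are placeholders; consult Credentials and Environment Variables for the names that Mantik actually expects.

import os

import mantik

# Placeholder variable names -- see "Credentials and Environment Variables"
# for the variables actually required by Mantik.
os.environ["MANTIK_USERNAME"] = "<your-username>"
os.environ["MANTIK_PASSWORD"] = "<your-password>"

mantik.init_tracking()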

Attention

In some cases, MLflow sends API requests before executing any code. Thus, it might be required to use the CLI command mantik init before running any Python code.

Being Independent of API Call Success

Sometimes invoking a Mantik or MLflow method may fail for various reasons. To avoid crashing the actual application because of a single failing API call, we provide a set of convenience functions in mantik.mlflow that wrap the Mantik and MLflow functions and log a warning if an invocation of an MLflow method fails.

The functions cover mantik.mlflow.init_tracking() and the most important MLflow methods and can be used as follows:

import mantik.mlflow

with mantik.mlflow.start_run():
    ...
    mantik.mlflow.log_param("<name>", "<value>")

If a method is not provided, there is a general mantik.mlflow.call_method function that allows you to call any MLflow method and fail gracefully with a log message if the call does not succeed:

import mantik.mlflow

with mantik.mlflow.start_run():
    mantik.mlflow.call_method("log_param", "<name>", "<value>")

In the following, we will mostly make use of the convenience functions that come with Mantik. Each of these methods, however, simply attempts to invoke the respective MLflow method.

Creating an MLflow Run

Once you have authenticated with the Mantik platform and created an experiment, you can use all MLflow tracking commands in your scripts to configure what is tracked when you e.g. train your ML models.

However, MLflow only allows tracking within the context of an active run. An MLflow run can be created using a context manager, which is the recommended way:

import mlflow

with mlflow.start_run():
    ...

or via two separate method calls:

import mlflow

mlflow.start_run()
...
mlflow.end_run()

After mlflow.start_run(), and before mlflow.end_run() (or before exit of the context), any MLflow method can be used for tracking to the respective run.

Danger

An MLflow run is not necessarily identical to a Mantik run. While a Mantik run via the Compute Backend always automatically creates an MLflow run that can be used for tracking, an MLflow run might also be created manually to track e.g. training of a ML model.

The Mantik Compute Backend starts an MLflow run and sets the environment variable MLFLOW_RUN_ID to the ID of that run.

Therefore, use mlflow.start_run() without passing a run_id; otherwise, MLflow will create an additional run. Nested runs can still be created.

Note that mlflow.active_run() will only detect Mantik’s run once mlflow.start_run() or any other of the tracking functions has been called.
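
For illustration, a minimal sketch of tracking inside a Compute Backend run (where MLFLOW_RUN_ID is already set), including an additional nested run, might look as follows; the parameter and metric names are purely illustrative:

import mlflow

# MLFLOW_RUN_ID is set by the Compute Backend, so start_run() attaches to the
# existing run instead of creating a new one.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)

    # Nested runs can still be created explicitly, e.g. for sub-experiments.
    with mlflow.start_run(nested=True):
        mlflow.log_metric("validation_loss", 0.42)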

Remote Tracking

By default, MLflow tracks experiments to your local filesystem. In order to use a remote tracking server, the environment variable MLFLOW_TRACKING_URI must be set.
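
For example, the tracking URI can be set from within Python before any MLflow call is made (the URI below is a placeholder):

import os

# Placeholder -- replace with the address of your MLflow Tracking Server.
os.environ["MLFLOW_TRACKING_URI"] = "https://<your-tracking-server>"

# Equivalently, mlflow.set_tracking_uri("https://<your-tracking-server>") can be used.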

When submitting an application (see Submitting an Application), the tracking URI (MLFLOW_TRACKING_URI), the API token (MLFLOW_TRACKING_TOKEN), and the MLflow run ID (MLFLOW_RUN_ID) are provided to the run environment automatically by setting the respective environment variables. Hence, you do not need to call mantik.init_tracking() inside your MLproject code.

Experiment Specification

Experiments can be referenced by ID or name. We recommend creating experiments on the platform (see Experiments) and then using their IDs to refer to them. With standard MLflow usage, you can set the MLFLOW_EXPERIMENT_ID environment variable to track current runs to the specified experiment. Experiment IDs can be found either in the detail view of an experiment on the Mantik platform or in the MLflow GUI, where the ID is shown above the description section of an experiment.
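
For instance (a minimal sketch; the ID is a placeholder), the experiment can be selected before starting a run:

import os

import mlflow

# Placeholder -- use the experiment ID shown on the Mantik platform.
os.environ["MLFLOW_EXPERIMENT_ID"] = "<experiment-id>"

with mlflow.start_run():
    ...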

When using the Compute Backend, the experiment_id is a required argument in the mantik.ComputeBackendClient.submit_run method.

Tracking Functions

MLflow offers a multitude of tracking functions, all of which are integrated with the Mantik platform. Below, the most important methods are explained briefly. For a detailed reference, refer to the MLflow documentation.

Autologging

MLflow supports many Python ML frameworks out of the box. Autologging enables automatic logging of model configurations from these frameworks, see the autolog documentation. You can enable it with:

import mlflow

mlflow.autolog()
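
As a minimal sketch (assuming scikit-learn is installed), fitting a model with autologging enabled records its hyperparameters and training metrics without any explicit logging calls:

import mlflow
from sklearn.linear_model import LinearRegression

mlflow.autolog()

with mlflow.start_run():
    # The fit call is intercepted by autologging, which records the model's
    # parameters and training metrics to the active run.
    model = LinearRegression()
    model.fit([[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0])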

Parameters

Parameters can be logged explicitly with the log_params method. This is especially useful if you have custom code and parameters that are not tracked with autologging:

import mlflow

with mlflow.start_run():
    mlflow.log_params({"parameter1_name": value1, "parameter2_name": value2})

or alternatively

import mlflow

with mlflow.start_run():
    mlflow.log_param("parameter_name", value)

to log individual parameters.

We recommend log_params over log_param due to much better performance (at least 3X faster, and up to 20X faster for a large number of parameters).

Metrics

Metrics, mostly regarding the performance of ML models, can be tracked explicitly with the log_metrics method.

import mlflow

with mlflow.start_run():
    ...
    mlflow.log_metrics(metrics={"mse": 2500.00, "rmse": 50.00})

Analogously to parameters, individual metrics can also be logged:

import mlflow

with mlflow.start_run():
    ...
    mlflow.log_metric("metric_name", value)

We recommend log_metrics over log_metric due to much better performance (at least 3X faster, and up to 20X faster for a large number of metrics).

Real-time Monitoring of Training Processes

mlflow.log_metrics/mlflow.log_metric also accept an optional step parameter for tracking metrics per step (e.g. per epoch). Metrics logged with step are available in real time in the MLflow GUI, which allows monitoring the training process via, for example, the training loss.

You may either call mlflow.log_metrics() inside your training loop (e.g. for PyTorch), or write a callback (e.g. for TensorFlow/Keras) that is invoked at the end of each epoch to automatically track the desired metrics:

import mlflow

with mlflow.start_run():
    ...
    for epoch in range(10):
        loss = ...
        mlflow.log_metrics(metrics={"loss": loss}, step=epoch)
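
Alternatively, a Keras callback can log the desired metrics at the end of every epoch:
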
import keras.callbacks
import mlflow


class Metrics(keras.callbacks.Callback):
    """Log the training loss to MLflow at the end of every epoch."""

    def on_epoch_end(self, epoch, logs=None):
        mlflow.log_metrics({"loss": logs["loss"]}, step=epoch)


# Pass an instance to your training call, e.g. model.fit(..., callbacks=[Metrics()]).

Artifacts

Artifacts are arbitrary files that you might want to keep from your experiments, e.g. images, diagrams, or generated text files. Custom artifacts can be logged with the log_artifact method.

import mlflow

with mlflow.start_run():
    mlflow.log_artifact("<path_to_local_file>")

For example:

import mlflow
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0.2, 0.4], [0.2, 0.2])
fig.savefig('confusion_matrix.pdf')

with mlflow.start_run():
    mlflow.log_artifact("confusion_matrix.pdf")

Alternatively, you can log all files in a local directory at once with log_artifacts:

import mlflow

with mlflow.start_run():
    mlflow.log_artifacts("/local/path/to/upload")

We recommend log_artifact over log_artifacts due to better performance (up to 2X faster). Note that this is the opposite of the recommendation for log_params vs. log_param and log_metrics vs. log_metric.

Models

Trained models can be stored for later reuse. Autologging also logs models from supported frameworks. Since every framework might have a custom way to save models, model storage is handled by the corresponding mlflow submodules. E.g. if you want to explicitly log an sklearn model, you can use the mlflow.sklearn.log_model method.

import mlflow.sklearn

mlflow.sklearn.log_model(model, "<model_name>")

See ../inference.md for more details on how to save and load models.

Saving and Loading Models with MLflow

MLflow introduces a standard format for packaging ML models known as the MLflow Model. The format defines a convention that lets you save and load a model in multiple flavors (e.g. pytorch, keras, sklearn) that can be interpreted by different downstream platforms. For example, mlflow.sklearn contains save_model, log_model, and load_model functions for scikit-learn models.
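
As a minimal sketch (assuming scikit-learn is installed), a model can be saved to and loaded from a local path with the sklearn flavor; the path "my_model" is a placeholder:

import mlflow.sklearn
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0])

# Save the model to a local directory in the MLflow Model format ...
mlflow.sklearn.save_model(model, "my_model")

# ... and load it again later.
loaded_model = mlflow.sklearn.load_model("my_model")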

Saving Models

Models can be saved and loaded in different ways. Many popular ML libraries (e.g., TensorFlow, PyTorch, Scikit-Learn) are integrated into MLflow, which makes logging models especially easy. For example, a PyTorch model can be saved to the MLflow tracking server just by calling mlflow.pytorch.log_model(model, "<model_name>"). Currently supported built-in flavors can be found here.

MLflow can be extended to support additional flavors as explained here.

Loading Models

You can load a previously logged model for inference in any script or notebook. To load the model, use the built-in utilities of the same ML framework that you used for saving it (e.g., mlflow.pytorch.load_model). There are two ways to load a model: via an MLflow run, or via MLflow's model registry.

When referencing a model from a run, pass the following path to the load_model function: "runs:/<mlflow_run_id>/<path_to_model>", where mlflow_run_id is the ID of the run under which your model was saved and path_to_model is the relative path to the model within the run's artifacts directory. You can inspect the respective run and the logged model in the Mantik MLflow UI, where you can also find the run ID.

import mlflow.pytorch

logged_model = "runs:/<mlflow_run_id>/<path_to_model>"
loaded_model = mlflow.pytorch.load_model(logged_model)

The model can also be loaded through the MLflow Model Registry. The Model Registry is MLflow's central hub for managing models; it provides model versioning and utilities for model deployment. To use it, the model first needs to be registered. You can register your saved model from a completed run in the MLflow UI: when viewing the run's artifacts, press the "Register Model" button that appears when selecting the saved model artifact. After this step, you can load the model with mlflow.pytorch.load_model using a slightly different reference path. To load a model from the Model Registry, the path should be "models:/<model_name>/<version>", where model_name is the name of your model in the registry and version is its version in the registry.
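
A minimal sketch of loading a registered model (the name and version are placeholders):

import mlflow.pytorch

# Placeholder name and version -- taken from the Model Registry.
model_uri = "models:/<model_name>/<version>"
loaded_model = mlflow.pytorch.load_model(model_uri)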

Prediction

After loading the model, you can make predictions on new input data:

prediction = loaded_model(input_data)

The exact function call differs based on the framework used to create the model. For more information on the various function calls, please refer to the corresponding MLflow documentation.
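
For illustration (a rough sketch; loaded_sklearn_model and loaded_pytorch_model are assumed to have been loaded as described above), the call differs between flavors:

# scikit-learn flavor: use the estimator's predict method.
prediction = loaded_sklearn_model.predict(input_data)

# PyTorch flavor: call the model directly on an input tensor.
prediction = loaded_pytorch_model(input_tensor)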