Machine learning (ML) frameworks have come a long way, and now it is easier than ever to train a model with just a few lines of code. However, how do you go from a Jupyter Notebook to a real-world product that uses your trained model?
This is where machine learning model packaging comes in. ML model packaging is the step of saving a trained model to disk so that it can be easily retrieved for later re-use.
Although it might seem trivial at first, saving a trained model is not always a walk in the park. In fact, from my experience, it’s a common pain point for ML teams and the decisions made at this step have ripple effects in every other stage downstream, from the infrastructure requirements to the model API design.
In this post, we will explore the fundamental components of machine learning model packaging.
ML model package components
Even though, at the lowest level, ML models boil down to mathematical functions, saving a model to disk involves more than saving just a formula. Furthermore, an ML model never runs in isolation. It depends on different pieces of information that need to fit together.
This is why people refer to it as an ML model package. It’s a package with multiple components.
Regardless of whether you use a library such as MLflow or BentoML or have implemented something on your own, the same four pieces of information go into packaging an ML model.
The components are:
- Model artifacts;
- Environment information;
- Model interface;
- Metadata.
In the next sections, we discuss the details of each one of them.
Model artifacts
At the end of a training pipeline, your trained model is a (Python) object. For example, in scikit-learn, you might have:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Any training data works here; the iris dataset keeps the example self-contained
x_train, y_train = load_iris(return_X_y=True)
clf = LogisticRegression()
clf.fit(x_train, y_train)
# 'clf' is the trained model object
The first component of the model package consists of the files that result from serializing (saving) this object. These files are usually referred to as model artifacts.
The process of saving the model artifacts varies depending on the framework you use. Therefore, you should look up the save method (or its equivalent) for your specific framework and version.
Furthermore, the number of resulting files from the serialization process also varies. While in scikit-learn you’ll generally end up with a single file that contains the object information, for other frameworks (such as Hugging Face Transformers or TensorFlow), you’ll end up with a directory containing many files and sub-directories.
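For instance, with Hugging Face Transformers, saving a model writes a whole directory rather than a single file. A minimal sketch (the checkpoint name and target directory below are just examples):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Writes config.json, the weights file, and more into a directory
model.save_pretrained("my_model_dir")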
If we google “sklearn model persistence,” we end up at a dedicated documentation page, which discusses how one can save a sklearn model to disk. The recommended approach involves using pickle, a common module for serializing Python objects. Thus, we can save our model artifacts with:
import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
# saves a 'model.pkl' file to disk
After saving the artifacts, you can load them back with:
import pickle

with open("model.pkl", "rb") as f:
    clf = pickle.load(f)
And voilà! You have a way of retrieving an object identical to your trained model.
Now, something interesting can happen in this process. Imagine the Notebook you were using to train and save the model artifacts was running in an environment with version 0.24.2 of scikit-learn. Then, in the future, you tried to load the model artifacts in a Notebook running in an environment with a newer version of scikit-learn, such as 1.2.2.
This is what you would see as you tried to load the model back:
UserWarning: Trying to unpickle estimator LogisticRegression from
version 0.24.2 when using version 1.2.2.
This might lead to breaking code or invalid results.
Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
The warning tells us that we have serialized (pickled) a trained model with one version of scikit-learn, and now we’re loading it back (unpickling) using a different version. Note how the object’s attributes, methods, etc., might change from one version to the next. Consequently, the code might break (e.g., if we try to access an attribute that no longer exists) or produce unexpected results (e.g., if the internals of a method changed).
With the above versions and using a logistic regression, we only received a user warning. With other versions and estimators, we might not even manage to load the object.
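One defensive pattern is to store the library version alongside the model artifacts and compare it at load time. Here is a minimal sketch of the idea; the wrapping dictionary and the file name are illustrative, not part of any library's API:

import pickle
import warnings

import sklearn

# Save the model together with the library version used to train it
with open("model_with_version.pkl", "wb") as f:
    pickle.dump({"sklearn_version": sklearn.__version__, "model": clf}, f)

# Later, at load time, compare the stored version against the installed one
with open("model_with_version.pkl", "rb") as f:
    payload = pickle.load(f)

if payload["sklearn_version"] != sklearn.__version__:
    warnings.warn(
        f"Model trained with scikit-learn {payload['sklearn_version']}, "
        f"but {sklearn.__version__} is installed."
    )
clf = payload["model"]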
As mentioned earlier, the model artifacts are not the only piece of the model package. The environment information is also critical to re-using a model, which leads us to the next section.
Environment information
In the previous section, we saw how the modules that make up the environment are fundamental for model re-use.
The model and its artifacts don’t live in isolation. The ML model is always used inside an environment, which contains dependencies such as libraries, modules, packages, and environment variables.
Capturing the environment information needed by the model is important to ensure reproducible results.
In concrete terms, what does the environment information look like inside a model package?
The environment information will depend on the task you have at hand.
In its simplest form, it can be a requirements.txt file pinning all the module versions needed by your model. By doing so, it is clear that prior to loading your model artifacts, the environment should look just like the one defined by the requirements file. Note how this would prevent the warning we encountered in the previous section.
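For instance, running pip freeze > requirements.txt in the training environment of our earlier example would capture pinned versions along these lines (the exact package list and version numbers depend on your environment; these are illustrative):

scikit-learn==0.24.2
numpy==1.19.5
pandas==1.2.4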
For more complex use cases, a requirements.txt might not be enough. In that case, it is common to write a setup.py or setup.sh script responsible for running all the commands needed to prepare the environment for the model.
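As a sketch, a setup.sh for such a case could look like the following; the system package and environment variable below are placeholders for whatever your model actually needs:

#!/usr/bin/env bash
# setup.sh -- prepares the environment for the model package

# Install system-level dependencies (placeholder example)
apt-get update && apt-get install -y libgomp1

# Install the pinned Python dependencies
pip install -r requirements.txt

# Set environment variables the model expects (placeholder example)
export MODEL_LOG_LEVEL=info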
Model interface
Imagine that one of your colleagues read this blog post up to this point. Then, they prepared and sent you an (incomplete) model package.
You open it up, and there they are: the model artifacts and a requirements.txt file. Great, so you know what to do next -- create a new environment and install the requirements. But what happens next? How do you use the model to get the predictions you need?
You start digging through the model artifacts, trying to figure out which framework was used to develop it. Maybe you're able to load the model into memory, but then you're left scratching your head trying to figure out how to get the predictions -- is it the predict method? Or maybe it's predict_proba, or run, or something else entirely?
The process is frustrating and time-consuming. Interfacing with the model is still an issue.
That’s why it can be a good idea to include a standardized model interface in the model package.
I find that, in broad strokes, two interactions with the model are important:
- Loading the model artifacts;
- Using the model to get predictions.
How can we standardize the interface for these two interactions?
I like the idea of adding a .py file inside the model package that looks something like this:
import pickle
from pathlib import Path

import numpy as np
import pandas as pd

PACKAGE_PATH = Path(__file__).parent


class MyModel:
    def __init__(self):
        """This is where the serialized objects needed should
        be loaded as class attributes."""
        with open(PACKAGE_PATH / "model.pkl", "rb") as model_file:
            self.model = pickle.load(model_file)

    def _preprocess_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Internal method needed for this particular case."""
        # TODO: implement this if needed
        return df

    def predict(self, input_data_df: pd.DataFrame) -> np.ndarray:
        """Makes predictions with the model.

        Returns the model's predictions."""
        encoded_df = self._preprocess_data(input_data_df)
        return self.model.predict(encoded_df)


def load_model():
    """Function that returns the wrapped model object."""
    return MyModel()
This file has a function called load_model. This simple function removes the burden of having to know about the framework’s idiosyncrasies in loading a model. I know that if I call load_model, it will return the model object.

The file also contains a class that wraps the specific framework’s model object and provides a uniform interface. I know that with the object I receive back from the load_model function, I can always call predict to get the model's predictions.

The standardization happens if, for every model you work with, you include a load_model function, and for the returned object, you implement a predict method.
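With this convention in place, downstream code stays framework-agnostic. Assuming the interface file is named interface.py (as in the package layout shown at the end of this post), using any model boils down to:

import pandas as pd

from interface import load_model

# Load the model without knowing anything about the underlying framework
model = load_model()

# Illustrative input; the real columns depend on what the model was trained on
input_df = pd.DataFrame({"tenure": [12.0], "monthly_charges": [70.5]})
predictions = model.predict(input_df)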
A model interface like this is also needed if you want to analyze your model with an external tool, since the tool must be able to use the model in a standard way. An interface file like the one above is almost identical to what Openlayer asks for in full model uploads. This allows the platform to work with every framework effortlessly: the user only needs to make their framework adhere to a common interface. It is also very similar to creating a custom flavor in MLflow.
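For comparison, a minimal sketch of the same pattern with MLflow's pyfunc flavor might look like this (the class name and the "model" artifact key are illustrative):

import pickle

import mlflow.pyfunc


class MyModelWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # MLflow calls this once, passing local paths to the logged artifacts
        with open(context.artifacts["model"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, context, model_input):
        # Uniform prediction entry point, regardless of the underlying framework
        return self.model.predict(model_input)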
Metadata
With the pieces of the model package outlined above, we have the necessary components to load a trained model reliably and use it for predictions in most cases. However, to provide a comprehensive understanding of the model and how to run it, it's a good practice to include a file with model metadata in the model package.
This metadata file is typically written in YAML format and provides additional information on the model's architecture, training process, and usage instructions. It can contain crucial details such as the model's hyperparameters, details about the input and output formats (e.g., number of features, feature names, output format, how to decode the output), and any other relevant information.
By including this metadata file in the model package, you provide a more complete picture of the model's characteristics and how to use it, making it easier for other developers and tools to work with the model.
A metadata file might look something like this:
name: Example Model
version: 1.0
description: A model that predicts the probability of a customer churning.
author: John Doe
created_date: 2023-03-01
trained_on_data: churn_data.csv
input_features:
  - name: tenure
    type: float
    description: Number of months the customer has been with the company.
  - name: payment_method
    type: categorical
    description: The payment method used by the customer.
    categories: ['Credit Card', 'Debit Card', 'PayPal']
  - name: monthly_charges
    type: float
    description: The customer's monthly charges.
  - name: gender
    type: categorical
    description: The customer's gender.
    categories: ['Male', 'Female', 'Other']
output:
  - name: churn_probability
    type: float
    description: The predicted probability of the customer churning.
    format: percentage
hyperparameters:
  optimizer: Adam
  learning_rate: 0.001
  batch_size: 32
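Because the metadata is plain YAML, other developers and tools can consume it programmatically. A small sketch, assuming PyYAML is installed:

import yaml

with open("metadata.yaml") as f:
    metadata = yaml.safe_load(f)

# e.g., check that incoming data has the expected columns
expected_columns = [feature["name"] for feature in metadata["input_features"]]
print(expected_columns)  # ['tenure', 'payment_method', 'monthly_charges', 'gender']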
The resulting model package may look something like this:
.
├── model.pkl
├── requirements.txt
├── interface.py
└── metadata.yaml
In conclusion, model packaging is a critical step on the path toward using an ML model in the real world. It involves saving the trained model to disk so that it can be easily and reproducibly retrieved for later use.

In this post, we explored the four key components of a model package: model artifacts, environment information, model interface, and metadata. These four components are present regardless of whether you use a tool such as MLflow or BentoML to create the model package or do it on your own.

After having a model package, the next step is using it to power your application! I might write something about that in the future. If you’re interested in knowing more about it, let me know!