Serving an ML Model

Objective

After completing this lesson, you will be able to deploy an ML model in SAP AI Core and use the deployment URL for inferencing.

Model Deployment in SAP AI Core

After a model is trained, that is, once it has learned (hidden) patterns from the provided dataset, it needs to be deployed. A deployed model can receive new data and return a "prediction" for each given data record. This is also called Model Serving or Inferencing.

A Kubernetes cluster can be configured to provide both CPU containers (less expensive) and GPU containers for Model Serving.

In addition, a high number of requests can be sent to the Model Server at the same time. To process these "inference" requests in a timely fashion, Kubernetes allows the Model Server to be scaled "on demand".

Here we have two cases:

  • Autoscaling: adds (clones) new containers on demand as the request load increases.

  • Scale to Zero: enables cost efficiency and pay-per-use by shutting down idle containers.

Deploying a model in SAP AI Core consists of writing a web application that serves inference requests through an endpoint exposed on the internet and that can easily be scaled on the Kubernetes infrastructure.

Steps for deploying a trained model: submit a Serving Template to SAP AI Core to obtain the deployment URL, which can then be used in any app for model inference (a sketch of this step follows below).
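For illustration only, the following Python sketch shows how such a deployment might be requested through the AI API REST interface using the requests library. The host, OAuth token, resource group header, configuration ID, and response field names (for example, deploymentUrl) are placeholders or assumptions; check the SAP AI Core documentation for the exact endpoints and payloads.

```python
import requests

# Illustrative values: replace with your own AI API URL, OAuth token,
# resource group, and the ID of a configuration that references your
# serving executable and model artifact.
AI_API_URL = "https://<your-ai-api-host>/v2/lm"
TOKEN = "<oauth-access-token>"
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "AI-Resource-Group": "default",  # resource group header (assumed value)
}

# Request a new deployment for an existing serving configuration.
response = requests.post(
    f"{AI_API_URL}/deployments",
    headers=HEADERS,
    json={"configurationId": "<configuration-id>"},
)
response.raise_for_status()
deployment = response.json()
print("Deployment ID:", deployment.get("id"))

# Once the deployment reaches the RUNNING state, its details contain the
# deployment URL that client applications call for inference.
details = requests.get(
    f"{AI_API_URL}/deployments/{deployment['id']}", headers=HEADERS
)
print("Deployment URL:", details.json().get("deploymentUrl"))
```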

Serving Application

To serve a model, you develop a serving application that runs in the form of a container.

Everything starts with an inference request sent to an endpoint. Internally, the web application interprets the data contained in the body of the call, retrieves the model from the hyperscaler object store, applies it to the data, and packs the prediction into a response that is consumed by a custom service.

Serving application workflow: when data is received by the model server, it can be preprocessed (for example, normalized) before being fed into the model to obtain the inference result.
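The following minimal Flask sketch illustrates what such a serving application could look like. It is not the SAP-provided implementation: the endpoint path, port, payload shape, and the assumption that the model is a joblib-serialized scikit-learn estimator available at a local path (MODEL_PATH) are all illustrative choices.

```python
import os

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumption: the model artifact was serialized with joblib and is available
# inside the container at the path given by MODEL_PATH (illustrative default).
model = joblib.load(os.environ.get("MODEL_PATH", "/mnt/models/model.pkl"))

@app.route("/v2/predict", methods=["POST"])
def predict():
    payload = request.get_json()        # interpret the request body
    features = payload["features"]      # e.g. a list of feature vectors
    # Optional preprocessing (normalization, encoding, ...) would happen here.
    prediction = model.predict(features).tolist()  # assumes a scikit-learn model
    return jsonify({"prediction": prediction})     # pack the result into a response

if __name__ == "__main__":
    # Listen on all interfaces so the containerized app is reachable.
    app.run(host="0.0.0.0", port=9001)
```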

Note

While there are different ways of inferencing (for example, batch inference), we are mainly looking into online inferencing with an exposed endpoint (AI API) that the end user calls using an HTTP request.

Coding is fundamental, but, as mentioned before, the model server to be deployed is defined by a specific template. This self-contained template creates an executable that defines the required parameters, the container to be executed, the resources needed to start the web application, and the number of replicas of the model server to be started.

The combination of the proper serving executable and the reference to the model to be used enables SAP AI Core to start your deployment.

When the model server is running and the deployment URL is ready, the very final step of the ML workflow in SAP AI Core is the consumption of the model through the exposed endpoint. The API can be easily integrated into any business application using an HTTP request, for example, from a Jupyter notebook, Postman, or a CAP application.
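As an example, the snippet below shows how a deployed model might be called from Python using the requests library; the same call could be made from Postman or a CAP application. The endpoint path (/v2/predict), the AI-Resource-Group header, and the payload shape are assumptions that must match your own serving application and SAP AI Core setup.

```python
import requests

# Illustrative consumption of the deployed model. Replace the deployment URL
# and token with the values obtained from SAP AI Core.
DEPLOYMENT_URL = "https://<deployment-url-from-sap-ai-core>"
TOKEN = "<oauth-access-token>"

response = requests.post(
    f"{DEPLOYMENT_URL}/v2/predict",        # path defined by the serving app (assumed)
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "AI-Resource-Group": "default",    # resource group header (assumed value)
        "Content-Type": "application/json",
    },
    json={"features": [[5.1, 3.5, 1.4, 0.2]]},  # one example data record
)
print(response.json())  # e.g. {"prediction": [...]}
```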
