Serving an ML Model

After completing this lesson, you will be able to:

  • Deploy an ML model in SAP AI Core and use the deployment URL for inferencing

Model Deployment in SAP AI Core

After a model has been trained, that is, once it has learned (hidden) patterns from the provided dataset, it needs to be deployed. Deployment lets you send new data to the model and receive a "prediction" for each data record. This is also called model serving or inferencing.

A Kubernetes cluster can be configured to provide both CPU containers (cheaper) and GPU containers for model serving.

In addition, a large number of requests can reach the model server at the same time. To process these inference requests in a timely fashion, Kubernetes allows the model server to be scaled on demand.

There are two cases:

  • Autoscaling: adding (cloning) new containers on demand.

  • Scale to Zero: enables cost efficiency and pay per use, by shutting down idle containers.

Deploying a model in SAP AI Core consists of writing a web application that serves inference requests through an endpoint exposed on the internet and that can be scaled easily on the Kubernetes infrastructure.

Serving Application

To serve a model, you code and develop a serving application that will be run in the form of a container.

Everything starts with an inference request sent to an endpoint. Internally, the web application must interpret the data contained in the body of the call, retrieve the model from the hyperscaler object store, apply it to the data, and pack the prediction into a response that is then consumed by a custom service.
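The request-handling flow described above can be sketched in Python. This is a minimal illustration, not SAP AI Core's actual serving code: the web framework (plain standard-library `http.server`), the model path `/mnt/models/model.pkl`, the `MODEL_PATH` environment variable, and the `{"data": ...}` payload shape are all assumptions chosen for the sketch.

```python
import json
import os
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the model artifact from the object store is made available
# to the container at this path; adjust it to your own setup.
MODEL_PATH = os.environ.get("MODEL_PATH", "/mnt/models/model.pkl")


def load_model(path=MODEL_PATH):
    # Load the serialized model once at startup, not per request.
    with open(path, "rb") as f:
        return pickle.load(f)


class InferenceHandler(BaseHTTPRequestHandler):
    model = None  # set once before the server starts

    def do_POST(self):
        # 1. Interpret the data contained in the body of the call.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # 2. Apply the model to the given data records.
        prediction = self.model.predict(payload["data"])
        # 3. Pack the prediction into a JSON response.
        body = json.dumps({"prediction": prediction}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# To start the server in the container (assuming the model file exists):
#   InferenceHandler.model = load_model()
#   HTTPServer(("0.0.0.0", 9001), InferenceHandler).serve_forever()
```

In a real serving application you would typically use a framework such as Flask, but the three steps (interpret the body, apply the model, pack the response) are the same.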

While there are other ways of inferencing (for example, batch inference), we focus mainly on online inferencing with an exposed endpoint (AI API) that the end user calls using an HTTP request.

Coding is fundamental, but, as mentioned before, the model server to be deployed is defined by a specific template. This self-contained template creates an executable that defines the required parameters, the container to be run, the resources needed to start the web application, and the number of replicas of the model server to be started.
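A serving template along these lines could look as follows. This is an illustrative sketch following the pattern of SAP AI Core serving templates; all concrete names, the image reference, and the replica counts are placeholders you would replace with your own values.

```yaml
apiVersion: ai.sap.com/v1alpha1
kind: ServingTemplate
metadata:
  name: house-price-server                 # executable ID (placeholder)
  annotations:
    scenarios.ai.sap.com/name: "house-price"
    executables.ai.sap.com/name: "house-price-server"
  labels:
    scenarios.ai.sap.com/id: "house-price"
    ai.sap.com/version: "1.0"
spec:
  inputs:
    artifacts:
      - name: housemodel                   # reference to the trained model
  template:
    apiVersion: serving.kserve.io/v1beta1
    metadata:
      labels: |
        ai.sap.com/resourcePlan: starter   # resources for the web application
    spec: |
      predictor:
        minReplicas: 0                     # Scale to Zero when idle
        maxReplicas: 3                     # Autoscaling upper bound
        containers:
        - name: kserve-container
          image: docker.io/<your-repo>/house-price-server:1.0
          ports:
            - containerPort: 9001
              protocol: TCP
          env:
            - name: STORAGE_URI
              value: "{{inputs.artifacts.housemodel}}"
```

Note how the template ties together the container to be executed, the resource plan, the replica counts (covering both autoscaling and Scale to Zero), and the reference to the model artifact.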

This combination of the proper serving executable and a reference to the model to be used enables SAP AI Core to start your deployment.

When the model server is running and the deployment URL is ready, the very final step of the ML workflow in SAP AI Core is the consumption of the model through the exposed endpoint. The API can easily be integrated into any business application using an HTTP request, for example from a Jupyter notebook, Postman, or a CAP application.
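Consuming the deployment boils down to an authenticated HTTP call. The sketch below builds such a request with the Python standard library; the deployment URL, the token, the `/v2/predict` path, and the payload shape are placeholders (in practice, the URL comes from your deployment and the token from your SAP AI Core service key).

```python
import json
import urllib.request

# Hypothetical values: replace with your own deployment URL and a valid
# OAuth bearer token from your SAP AI Core service key.
DEPLOYMENT_URL = "https://api.ai.example.ondemand.com/v2/inference/deployments/d123abc"
TOKEN = "<bearer-token>"


def build_inference_request(records, url=DEPLOYMENT_URL, token=TOKEN):
    """Build (but do not send) the HTTP POST request for online inferencing.

    The "/v2/predict" path and the {"data": ...} payload shape are
    assumptions; use the endpoint your serving application exposes.
    """
    body = json.dumps({"data": records}).encode()
    return urllib.request.Request(
        url + "/v2/predict",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "AI-Resource-Group": "default",  # resource group of the deployment
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is then one call:
#   urllib.request.urlopen(build_inference_request([[120.0, 3]]))
```

The same request can be issued from Postman or a CAP application; only the HTTP client changes, not the endpoint or the headers.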

Build a House Price Predictor with SAP AI Core
