Machine Learning in Production: Serving Up Multiple ML Models at Scale the TensorFlow Serving + Kubernetes + Google Cloud + Microservices Way

Dr Stephen Odaibo
The Blog of RETINA-AI Health, Inc.
Nov 10, 2019


The field of machine learning is experiencing such an epic boom because of its real potential to revolutionize industries and change lives for the better. Once machine learning models have been trained, the next step is to deploy them for use, making them accessible to those who need them, be they hospitals, self-driving car manufacturers, high-tech farms, banks, airlines, or everyday smartphone users. In production the stakes are high, and one cannot afford server crashes, connection slowdowns, and the like. As our customers increase their demand for our machine learning services, we want to meet that demand seamlessly, be it at 3AM or 3PM. Similarly, if demand decreases we want to scale down the committed resources to save cost, because, as we all know, cloud resources are very expensive.

In this write-up we will demonstrate the following:

  1. How to serve multiple custom machine learning models with TensorFlow Serving
  2. How to package that TensorFlow Serving instance into a Docker container
  3. How to push that Docker container onto the Google Container Registry
  4. How to create a node (VM) cluster in Google Kubernetes Engine (GKE)
  5. How to deploy our custom TensorFlow Server into pods in the nodes
  6. How to create a load balancing service that exposes an external IP address with which to access the TensorFlow Serving pods

Microservices Architecture

Microservices are an emerging way to organize software services and their development. The idea is that instead of a monolithic or linear block of logic, one separates the total service into microservices such as machine learning models, authentication & authorization, database, front-end interface, etc. This allows modular programming at the microservices level, enabling code re-use and re-combination as needed.

Format of ML Model and Directory

Remember to save your ML model in the format <Model_Name>/0001/saved_model.pb, where <Model_Name> is a folder bearing the name we choose for the model and 0001 is a folder indicating the version of the model. The version can be any number; TensorFlow Serving will always load the model stored in the folder with the highest number at that level. And finally, the name of our model file must be exactly saved_model.pb. Note that if we trained our model in something like Keras and saved it in HDF5 (.h5) format, we must convert it to the SavedModel (.pb, "protobuf") format with the appropriate TensorFlow Serving signatures so that the model can be served by TensorFlow Serving. Consult the TensorFlow Serving documentation for more details on this.
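
As a concrete illustration, the sketch below converts a Keras HDF5 model to the SavedModel layout described above. It is a minimal sketch, assuming TensorFlow 2.x and a hypothetical file named model.h5; adjust the paths, and verify the exported signatures, for your own models.

import tensorflow as tf

# Load the trained Keras model from its HDF5 file (model.h5 is a placeholder name)
model = tf.keras.models.load_model("model.h5")

# Export to <Model_Name>/<version>/, which produces saved_model.pb plus a variables/
# folder; for a built Keras model a default "serving_default" signature is generated
tf.saved_model.save(model, "Model1/0001")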

Docker & The Google Container Registry (gcr.io)

Microservices are often isolated and packaged into Docker containers, which can then be deployed as pods running within virtual machines (nodes) on the Google cloud. One of the big advantages of Docker containers is that they bundle all the dependencies needed to build and run the particular software they package, so we no longer face missing dependencies, version incompatibilities, and the like. A container runs on top of the native operating system rather than bundling a full guest OS, and is therefore lightweight. Additionally, it isolates our logic and makes it portable. For instructions on how to install nvidia-docker 2.0 and on how to use TensorFlow Serving to serve multiple ML models on your local machine, see my prior article (Reference 1 below).

Docker images can be registered on a number of registries, such as the public Docker Hub, or on private registries such as a private Docker Hub repository, the Google Container Registry, Amazon Elastic Container Registry (Amazon ECR), or the Azure Container Registry on Microsoft Azure, to name a few. Here we will be using the Google Container Registry on Google Cloud.

Preparing our Custom TensorFlow Serving Image

Obtain the latest TensorFlow Serving container image from the public Docker Hub with the following command:

$ docker pull tensorflow/serving:latest-gpu

The "latest-gpu" tag gives us the latest version of the image that can run on a GPU.

Next, git clone tensorflow/serving as follows:

$ mkdir -p /tmp/tfserving
$ cd /tmp/tfserving
$ git clone https://github.com/tensorflow/serving

At this point, checking with the sudo docker ps command for running Docker containers, we see that there are none.

Checking for Docker container images (AKA repositories) with the sudo docker images command, we find, as we expect, the tensorflow/serving image.

Next we prepare our models_config.config file, which stores the locations and names of our multiple models and which we will pass as an argument to TensorFlow Serving:

model_config_list: {
  config: {
    name: "Model1",
    base_path: "/models/Model1",
    model_platform: "tensorflow"
  },
  config: {
    name: "Model2",
    base_path: "/models/Model2",
    model_platform: "tensorflow"
  },
  config: {
    name: "Model3",
    base_path: "/models/Model3",
    model_platform: "tensorflow"
  }
}

And we can save the config file as models_config.config.

Next, we start a Docker container in detached mode (-d) and name it something, e.g. serving_base3. We use detached mode because we have yet to populate the container with our ML models, and the program would otherwise go searching in the default model folder for default model names. This would cause error messages to print endlessly and waste a terminal. We don't want that.

$ sudo docker run -d -p 8501:8501 --name serving_base3 tensorflow/serving:latest-gpu

Using sudo docker ps we can verify that our container was created, and we can take note of the container ID.

You can enter the container to inspect it. You'll see that the /models directory is indeed empty, as we expect:

$ sudo docker exec -it 2366e74af244 sh

where 2366e74af244 is the container ID.

We now load up our ML models into the /models folder of the serving base as follows:

$ sudo docker cp /DIR1/Model1/ serving_base3:/models/Model1
$ sudo docker cp /DIR2/Model2/ serving_base3:/models/Model2
$ sudo docker cp /DIR3/Model3/ serving_base3:/models/Model3
$ sudo docker cp /CONFIG_DIR/models_config.config serving_base3:/models/models_config.config

You can re-enter the container to verify that all the models loaded in correctly. Next, we commit the newly loaded container and name it something, e.g. my-new-server, as follows:

$ sudo docker commit serving_base3 my-new-server

Now a sudo docker images check of the repositories reveals that our new commit succeeded.

To free up space and avoid conflicts, we can now delete serving_base3, which merely served as an intermediary vehicle:

$ sudo docker kill serving_base3
$ sudo docker rm serving_base3

If we attempt to run my-new-server without the model_config file, we will get an error, because the program will check the default names and location for the models and will find nothing.

$ sudo docker run -p 8501:8501 -t my-new-server

will yield an error such as: no servable models available in /models/model

Instead, we must do:

$ sudo docker run -p 8501:8501 --runtime=nvidia -t my-new-server --model_config_file=/models/models_config.config

which works. The --runtime=nvidia flag is optional and is only needed when serving on a GPU.
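
As a quick sanity check, TensorFlow Serving's REST API also exposes a model-status endpoint, so we can confirm from the host that each model listed in the config file actually loaded. Below is a minimal sketch, assuming the container is running locally with port 8501 mapped as above:

import requests

# Query TensorFlow Serving's model-status endpoint for each model named in
# models_config.config; a successfully loaded model reports state "AVAILABLE"
for model_name in ["Model1", "Model2", "Model3"]:
    resp = requests.get(f"http://localhost:8501/v1/models/{model_name}")
    print(model_name, resp.json())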

Now, keeping in mind that we want to simplify our code on the cloud cluster side, we can minimize the need to pass in parameters there. Therefore we will take the currently running instance, which has already incorporated our model_config_file, and re-commit it under a new name, e.g. my-new-running-server; after that, any re-run of the new image will no longer need to be passed a model_config_file flag.

$ sudo docker stop 9013863808a9
$ sudo docker commit 9013863808a9 my-new-running-server
$ sudo docker run -p 8501:8501 -t my-new-running-server

Great! It works.

Now that we have a running instance of TensorFlow Serving loaded with all our custom models, it is a good idea to test the server on our localhost to ensure that it works well. Assuming our models are for image classification, we'll simply send an image to the server from our RESTful Python API client. Look up how to do so in my previous write-up (Reference 1).
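
For completeness, here is a minimal local smoke test along those lines. It assumes the container is running locally on port 8501, that the model is named Model1 as in our config file, that a test image exists at the hypothetical path cat.jpg, and that the model accepts base64-encoded JPEGs under a "b64" key (the same request format as the client at the end of this article); adapt it to your own model's input signature.

import base64
import requests

# Read a local test image and base64-encode it (cat.jpg is a placeholder path)
with open("cat.jpg", "rb") as f:
    jpeg_bytes = base64.b64encode(f.read()).decode("utf-8")

# Build the JSON body expected by TensorFlow Serving's REST predict endpoint
predict_request = '{"instances": [{"b64": "%s"}]}' % jpeg_bytes

# POST to the locally running server, targeting Model1's predict endpoint
resp = requests.post(
    "http://localhost:8501/v1/models/Model1:predict",
    data=predict_request,
)
print(resp.status_code)
print(resp.content.decode("utf-8"))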

Next we will register our custom TensorFlow Serving image to the Google Container Registry:

Registering Our Custom TensorFlow Server Image to Google Container Registry

First, we tag the image in preparation for a push to the registry:

$ sudo docker tag my-new-running-server gcr.io/ml-production-257721/my-new-running-server

Checking our list of images now shows the newly tagged image as well.

The image is ready to be pushed to the Google Container Registry, and we do so as follows:

$ sudo docker push gcr.io/ml-production-257721/my-new-running-server

Now we can check the Google Container Registry to see if our image is there. On the Google Cloud side panel we navigate to “Container Registry,” and then to “Images.” Clicking on that reveals the image was successfully pushed and now resides in the registry.

Creating the Node Cluster

If you are not within the project you want and are not yet authenticated, then perform the following commands:

$ gcloud config set project [project name]
$ gcloud auth login

If instead you had, say, launched a Cloud Shell from within your project, then you will already be authenticated. In that case the above steps should not be performed, and doing so can even cause problems (permission/authentication errors).

Next we will create a node cluster on the cloud using the following command:

$ gcloud container clusters create [cluster-name] --zone [zone-name] --num-nodes [number of nodes] --accelerator [type],[count]

In this particular example we used:

$ gcloud container clusters create my-cluster --zone us-west1-b --num-nodes 2 --accelerator type=nvidia-tesla-k80,count=1

Upon running the above command, two VMs (nodes) become visible in the "VM instances" panel of Compute Engine.

Inspecting further by clicking on one of the nodes, we see more of its specs, such as the 1x NVIDIA Tesla K80 GPU, the n1-standard machine type, and the zone, here us-west1-b.

Of note, the default for --num-nodes is 3, i.e. if we don't specify the number of VMs we want in the cluster, a 3-VM cluster will be created. Other arguments to know are --machine-type, which defaults to n1-standard-1 (1 vCPU, 3.75 GB memory), and --accelerator, which indicates that we want GPUs attached to the nodes; by default no GPU is commissioned.

Now that we have created our cluster, we will make it the default cluster and transfer its credentials to the Kubernetes CLI, kubectl, as follows:

$ gcloud config set container/cluster my-cluster
$ gcloud container clusters get-credentials my-cluster --zone us-west1-b

Deploying TensorFlow Serving and Creating a Load Balancer

Here we will deploy our custom TensorFlow Server onto pods, and we will also create a load balancer that exposes an external IP address and balances traffic between the nodes in our cluster. The configurations for both of these tasks are stored in YAML. They can go in separate files, or we can place both configurations in the same file, such that one is the Deployment and the other is the load-balancing Service. That is what we will do here, and the file is as shown:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: my-server
    spec:
      containers:
      - name: my-container
        image: gcr.io/ml-production-257721/my-new-running-server@sha256:*****...
        ports:
        - containerPort: 8501
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: my-service
  name: my-service
spec:
  ports:
  - port: 8501
    targetPort: 8501
  selector:
    app: my-server
  type: LoadBalancer

Before we create the deployment, we can inspect the state of our cluster with the following commands:

$ kubectl get pods
$ kubectl get services

From the above we see, as expected, that there are no pods on our cluster, because we have not yet deployed any. The only service running is the Kubernetes master service, which coordinates the base activities of the cluster and was created automatically along with the cluster.

Now onwards to deploying the TF server and the load balancer. We do so with a single command as follows:

$ kubectl create -f my-deployment.yaml

where my-deployment.yaml is our configurations yaml file.

We see above that two pods are being created, and shortly afterwards they are up and running.

The load balancer service is also now up and running and listening on port 8501, with an external IP address assigned.

Now we can test the TF server via a RESTful API Python client as follows:

import base64
import requests

# URL of a sample cat image to send to the classifier
IMAGE_URL = "http://tensorflow.org/images/blogs/serving/cat.jpg"

# Download the image and base64-encode it
dl_request = requests.get(IMAGE_URL, stream=True)
dl_request.raise_for_status()
jpeg_bytes = base64.b64encode(dl_request.content).decode('utf-8')

# Build the JSON request body expected by TensorFlow Serving's REST API
predict_request = '{"instances" : [{"b64": "%s"}]}' % jpeg_bytes

# POST to the load balancer's external IP, targeting Model1's predict endpoint
resp = requests.post('http://104.196.253.190:8501/v1/models/Model1:predict', data=predict_request)
print(resp.content.decode('utf-8'))

SUCCESS!!! And that is how to Deploy Multiple Machine Learning Models in Production at Scale, the TensorFlow Serving + Kubernetes + Google Cloud + Microservices Way.

BIO

Dr. Stephen G. Odaibo is CEO & Founder of RETINA-AI Health, Inc., and is on the Faculty of the MD Anderson Cancer Center. He is a Physician, Retina Specialist, Mathematician, Computer Scientist, and Full Stack AI Engineer. In 2017 he received UAB College of Arts & Sciences' highest honor, the Distinguished Alumni Achievement Award. And in 2005 he won the Barrie Hurwitz Award for Excellence in Neurology at Duke University School of Medicine, where he topped the class in Neurology and in Pediatrics. He is author of the books "Quantum Mechanics & The MRI Machine" and "The Form of Finite Groups: A Course on Finite Group Theory." Dr. Odaibo chaired the "Artificial Intelligence & Tech in Medicine Symposium" at the 2019 National Medical Association Meeting. Through RETINA-AI, he and his team are building AI solutions to address the world's most pressing healthcare problems. He resides in Houston, Texas with his family.

www.retina-ai.com

REFERENCES
1) TensorFlow Serving of Multiple ML Models Simultaneously to a REST API Python Client
2) https://www.tensorflow.org/tfx/serving/docker
3) Serving Tensorflow Model and SavedModel format: https://www.tensorflow.org/tfx/serving/serving_basic
4) Signature Definitions for SavedModel: https://www.tensorflow.org/tfx/serving/signature_defs
5) TensorFlow Serving + Kubernetes https://towardsdatascience.com/deploy-your-machine-learning-models-with-tensorflow-serving-and-kubernetes-9d9e78e569db
6) Kubernetes: https://github.com/kubernetes
7) https://kubernetes.io
