Quantize TensorFlow Model for Inference using Accelerator

You can use the InferenceOptimizer.quantize(..., accelerator=xxx) API, specifying ONNXRuntime or OpenVINO as the accelerator, to apply post-training quantization to your TensorFlow Keras models in only a few lines of code.

Let’s take an EfficientNetB0 model pretrained on the ImageNet dataset and finetuned on the Imagenette dataset as an example (see the full definitions of prepare_dataset and create_model in the runnable example):

[ ]:

train_set, test_set, calibration_set, ds_info = prepare_dataset()
ori_model = create_model()
ori_model.fit(train_set,
          epochs=10,
          steps_per_epoch=(ds_info.splits['train'].num_examples // 512 + 1),
          )

To quantize the model with an accelerator for inference, import BigDL-Nano's InferenceOptimizer and call InferenceOptimizer.quantize(..., accelerator=xxx) with the accelerator of your choice:

[ ]:

from bigdl.nano.tf.keras import InferenceOptimizer

# quantize using accelerator ONNXRuntime
q_model = InferenceOptimizer.quantize(ori_model, x=calibration_set, accelerator='onnxruntime')

# or using accelerator OpenVINO
q_model = InferenceOptimizer.quantize(ori_model, x=calibration_set, accelerator='openvino')

When using the OpenVINO accelerator, you can also choose bf16 precision instead of the default int8:

[ ]:

from bigdl.nano.tf.keras import InferenceOptimizer

q_model = InferenceOptimizer.quantize(ori_model,
                                      precision='bf16',
                                      accelerator='openvino')
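To give a feel for what bf16 precision means numerically, here is a minimal sketch in plain Python. It is an illustration only and says nothing about how OpenVINO or BigDL-Nano implement the conversion internally. The helper name to_bf16 is hypothetical; it rounds a float32 value to the nearest value representable in bfloat16, which keeps the float32 sign and exponent bits but only 7 mantissa bits. Special values such as NaN and infinity are not handled.

```python
import struct


def to_bf16(value: float) -> float:
    """Round a finite value to the nearest bfloat16-representable number.

    Hypothetical helper for illustration: bfloat16 is the upper 16 bits of
    the float32 encoding, so we round the lower 16 bits away
    (round-to-nearest-even) and zero them out.
    """
    # Reinterpret the float32 encoding of `value` as an unsigned 32-bit int.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    # Add a rounding bias that implements round-to-nearest-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    # Reinterpret the truncated bit pattern as a float32 again.
    return struct.unpack(">f", struct.pack(">I", bits))[0]


original = 3.14159
rounded = to_bf16(original)
print(f"float32 input: {original}, after bf16 rounding: {rounded}")
```

The rounding error is small relative to the value itself, and no calibration data is involved, which is consistent with x being omitted in the bf16 call above.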

📝 Note

By default, InferenceOptimizer quantizes your TensorFlow models to int8 precision using static post-training quantization. The ‘dynamic’ approach is not supported yet. x (the calibration data) is only required when using int8 precision. To avoid data leakage during calibration, use the training dataset or a subset of it.

Please refer to the API documentation for more information on InferenceOptimizer.quantize.
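To illustrate why int8 static quantization needs calibration data, the following is a conceptual sketch in plain Python. It is not the algorithm that ONNXRuntime, OpenVINO, or BigDL-Nano actually run; it shows a simple min/max affine scheme, and the function names calibrate, quantize, and dequantize as well as the sample values are assumptions made for this example. The idea is that a value range is measured once on representative data, and that fixed range then determines how floats map to 8-bit integers at inference time.

```python
def calibrate(samples):
    """Derive a fixed (scale, zero_point) pair from calibration values."""
    # The range always includes 0.0 so that zero is represented exactly.
    low = min(min(samples), 0.0)
    high = max(max(samples), 0.0)
    scale = (high - low) / 255.0 or 1.0  # fall back to 1.0 for all-zero data
    zero_point = round(-128 - low / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to int8, clipping values outside the calibrated range."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical activation values observed on a calibration subset.
calibration_values = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.55]
scale, zero_point = calibrate(calibration_values)

for value in (0.0, 0.5, 5.0):
    q = quantize(value, scale, zero_point)
    restored = dequantize(q, scale, zero_point)
    print(f"{value} -> int8 {q} -> {restored:.4f}")
```

Values inside the calibrated range are recovered to within half a quantization step, while 5.0, which lies outside it, is clipped. This is why the calibration data should be representative of the inputs the model will see.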

You can then run inference with the quantized model as usual:

[ ]:

import tensorflow as tf

x = tf.random.normal(shape=(2, 224, 224, 3))
# use the optimized model here
y_hat = q_model(x)
predictions = tf.argmax(y_hat, axis=1)
print(predictions)
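Since quantization trades some numerical precision for speed, it can be worth checking how often the quantized model agrees with the original one. The sketch below is one simple way to do that; the helper top1_agreement is hypothetical (it is not part of BigDL-Nano), and the class indices are toy values standing in for the argmax outputs of the two models.

```python
def top1_agreement(reference, candidate):
    """Return the fraction of positions where two prediction lists match."""
    reference = list(reference)
    candidate = list(candidate)
    if len(reference) != len(candidate):
        raise ValueError("prediction lists must have the same length")
    if not reference:
        raise ValueError("need at least one prediction")
    matches = sum(int(a == b) for a, b in zip(reference, candidate))
    return matches / len(reference)


# Toy class indices standing in for the argmax outputs of the two models.
ori_predictions = [3, 1, 7, 7, 0]
q_predictions = [3, 1, 7, 2, 0]

agreement = top1_agreement(ori_predictions, q_predictions)
print(f"top-1 agreement: {agreement:.2f}")
```

With the models from this guide, the two lists could be obtained as tf.argmax(ori_model(x), axis=1).numpy() and tf.argmax(q_model(x), axis=1).numpy() on the same batch x, preferably drawn from held-out data rather than random noise.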

📚 Related Readings

How to install BigDL-Nano

How to install BigDL-Nano in Google Colab