View the runnable example on GitHub

Quantize TensorFlow Model for Inference using Intel Neural Compressor

With Intel Neural Compressor (INC) as the quantization engine, you can apply the InferenceOptimizer.quantize API to perform post-training quantization on your TensorFlow Keras models in only a few lines of code.

Let’s take an EfficientNetB0 model pretrained on the ImageNet dataset and finetuned on the Imagenette dataset as an example (see the full definitions of prepare_dataset and create_model in the runnable example):

[ ]:
train_set, test_set, calibration_set, ds_info = prepare_dataset()
ori_model = create_model()
ori_model.fit(train_set,
              epochs=10,
              steps_per_epoch=(ds_info.splits['train'].num_examples // 512 + 1))

To enable quantization using INC for inference, import BigDL-Nano InferenceOptimizer and use it to quantize your TensorFlow model:

[ ]:
from bigdl.nano.tf.keras import InferenceOptimizer

q_model = InferenceOptimizer.quantize(ori_model,
                                      x=calibration_set)

📝 Note

By default, InferenceOptimizer quantizes your TensorFlow model to int8 precision through static post-training quantization; the ‘dynamic’ approach is not supported yet. Static quantization needs calibration data, so x is required in this case. To avoid data leakage during calibration, use the training dataset or a subset of it.

Please refer to the API documentation for more information on InferenceOptimizer.quantize.
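To give an intuition for why static post-training quantization needs calibration data, here is a minimal pure-Python sketch of a textbook affine min/max int8 scheme. It is an illustration only: the helper names calibrate, quantize and dequantize are hypothetical, and the scheme is an assumption chosen for clarity, not necessarily what INC does internally.

```python
# Illustrative sketch only: a textbook affine (min/max) int8 scheme.
# The helper names are hypothetical and not part of BigDL-Nano or INC.

def calibrate(values, num_bits=8):
    """Derive a (scale, zero_point) pair from observed calibration values."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127
    lo = min(min(values), 0.0)  # keep 0.0 inside the range so it maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero data
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers, saturating at the int8 limits."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# The value range is fixed once, from the calibration data.
calibration_values = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = calibrate(calibration_values)

# New inputs reuse that fixed range; 10.0 lies outside it and saturates.
q = quantize([0.0, 0.5, 3.0, 10.0], scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(scale, zero_point)
print(q)
print(restored)
```

Because the range is fixed from the calibration data, values outside it are clipped at inference time, which is one reason the calibration set should be representative of real inputs.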

You can then run inference with the quantized model as usual:

[ ]:
import tensorflow as tf

# a random batch of 2 images with shape 224x224x3
x = tf.random.normal(shape=(2, 224, 224, 3))
# use the optimized model here
y_hat = q_model(x)
predictions = tf.argmax(y_hat, axis=1)
print(predictions)
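Since int8 quantization can change a model's outputs slightly, it may be worth checking how often the quantized model picks the same class as the original one. The sketch below is one way to do that; top1_agreement is a hypothetical helper (not a BigDL-Nano API) that takes plain Python lists of predicted class ids, and the TensorFlow usage is shown only in comments.

```python
# Illustrative sketch only: top1_agreement is a hypothetical helper,
# not part of BigDL-Nano or TensorFlow.

def top1_agreement(labels_a, labels_b):
    """Return the fraction of positions where two lists of class ids match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("prediction lists must have the same length")
    if not labels_a:
        raise ValueError("prediction lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Possible usage with the models above (assumed, not executed here):
#   ref = tf.argmax(ori_model(x), axis=1).numpy().tolist()
#   quant = tf.argmax(q_model(x), axis=1).numpy().tolist()
#   print(top1_agreement(ref, quant))

# Stand-in class ids so the sketch runs without a model.
print(top1_agreement([3, 1, 4, 1], [3, 1, 4, 2]))  # 0.75
```

Random inputs like the x above only exercise the code path; for a meaningful comparison, use batches from a held-out dataset such as test_set.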