Find Acceleration Method with the Minimum Inference Latency for TensorFlow model using InferenceOptimizer#

This example shows how to use InferenceOptimizer to quickly find the acceleration method with the minimum inference latency for a trained TensorFlow model, with or without restrictions such as a tolerable accuracy drop. By calling optimize(), we can try all available acceleration combinations provided by BigDL-Nano for inference. By calling get_best_model(), we can then get the best model, with or without those restrictions.

First, prepare the model and dataset. In this example, we take an EfficientNetB0 model pretrained on the ImageNet dataset and train it on Imagenette.

[ ]:

train_set, test_set, calibration_set, validation_set, ds_info = prepare_dataset()
ori_model = create_model()
ori_model.fit(train_set,
              epochs=5,
              steps_per_epoch=(ds_info.splits['train'].num_examples // 512 + 1),
              )

The full definitions of the functions prepare_dataset and create_model can be found in the runnable example.

Obtain available acceleration combinations by `optimize`#

1. Default search mode#

To find the acceleration method with the minimum inference latency, import InferenceOptimizer and call its optimize method. The optimize method runs all possible acceleration combinations and outputs the result of each.

[ ]:

from bigdl.nano.tf.keras import InferenceOptimizer

opt = InferenceOptimizer()
opt.optimize(ori_model,
             x=calibration_set,
             latency_sample_num=10)

The example output of opt.optimize is shown below.

==========================Optimization Results==========================
 -------------------------------- ---------------------- --------------
|             method             |        status        | latency(ms)  |
 -------------------------------- ---------------------- --------------
|            original            |      successful      |   110.198    |
|              int8              |      successful      |    55.621    |
|         openvino_fp32          |      successful      |    30.763    |
|         openvino_int8          |      successful      |    33.872    |
|        onnxruntime_fp32        |      successful      |    23.38     |
|    onnxruntime_int8_qlinear    |      successful      |    9.836     |
|    onnxruntime_int8_integer    |      successful      |    12.899    |
 -------------------------------- ---------------------- --------------
Optimization cost 347.9s in total.
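The latency column is reported in milliseconds, and the call above passes latency_sample_num=10. As a rough illustration of what such a figure could mean, the sketch below averages the wall-clock time of repeated calls. It assumes that the reported latency is a mean over latency_sample_num repetitions; the helper mean_latency_ms and the stand-in dummy_predict are hypothetical names and are not part of BigDL-Nano.

```python
import time


def mean_latency_ms(predict, sample, sample_num=10):
    """Average wall-clock time of predict(sample) over sample_num calls, in ms.

    Hypothetical helper for illustration; not BigDL-Nano's implementation.
    """
    predict(sample)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(sample_num):
        predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed / sample_num * 1000.0


def dummy_predict(batch):
    """Hypothetical stand-in for a model's forward pass."""
    return [value * 2 for value in batch]


latency = mean_latency_ms(dummy_predict, list(range(1000)), sample_num=10)
print(f"mean latency: {latency:.3f} ms")
```

A larger sample count gives a more stable average at the cost of a longer search, which is one reason the total optimization time grows with the number of methods tried.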

2. Search with accuracy supervision#

To take a possible accuracy drop into account when calling optimize, specify the validation_data, metric, and direction parameters to enable validation:

[ ]:

from tensorflow.keras.metrics import CategoricalAccuracy

opt.optimize(ori_model,
             x=calibration_set,
             validation_data=validation_set,
             metric=CategoricalAccuracy(),
             direction="max",
             latency_sample_num=10)

The example output of opt.optimize is shown below.

==========================Optimization Results==========================
 -------------------------------- ---------------------- -------------- ----------------------
|             method             |        status        | latency(ms)  |     metric value     |
 -------------------------------- ---------------------- -------------- ----------------------
|            original            |      successful      |   106.692    |        0.996         |
|              int8              |      successful      |    55.652    |        0.996         |
|         openvino_fp32          |      successful      |    32.002    |        0.996*        |
|         openvino_int8          |      successful      |    33.648    |        0.995         |
|        onnxruntime_fp32        |      successful      |    25.639    |        0.996*        |
|    onnxruntime_int8_qlinear    |      successful      |    9.877     |        0.971         |
|    onnxruntime_int8_integer    |      successful      |     9.85     |        0.956         |
 -------------------------------- ---------------------- -------------- ----------------------
* means we assume the metric value of the traced model does not change, so we don't recompute metric value to save time.
Optimization cost 465.5s in total.
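To read the table above as a trade-off, it can help to express each method as a speedup over the original model together with its relative metric drop. The following sketch does this in plain Python with the latency and metric values copied from the example output; the dictionary layout and variable names are illustrative only and are not produced by BigDL-Nano.

```python
# (latency in ms, metric value) copied from the example output above.
results = {
    "original": (106.692, 0.996),
    "int8": (55.652, 0.996),
    "openvino_fp32": (32.002, 0.996),
    "openvino_int8": (33.648, 0.995),
    "onnxruntime_fp32": (25.639, 0.996),
    "onnxruntime_int8_qlinear": (9.877, 0.971),
    "onnxruntime_int8_integer": (9.85, 0.956),
}

base_latency, base_metric = results["original"]

tradeoff = {}
for method, (latency, metric) in results.items():
    speedup = base_latency / latency                      # higher is faster
    relative_drop = (base_metric - metric) / base_metric  # direction="max"
    tradeoff[method] = (speedup, relative_drop)
    print(f"{method:26s} speedup {speedup:5.2f}x  relative drop {relative_drop:6.2%}")
```

In this run the two ONNX Runtime int8 variants are the fastest, but they are also the only ones with a metric drop of more than one percent.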

3. Filter acceleration methods#

In some cases, you may want to test or compare only a few specific methods. There are two ways to achieve this.

If you want to test only a few methods, set the includes parameter:

[ ]:

from tensorflow.keras.metrics import CategoricalAccuracy

opt.optimize(ori_model,
             x=calibration_set,
             validation_data=validation_set,
             metric=CategoricalAccuracy(),
             direction="max",
             includes=["openvino_fp32", "onnxruntime_fp32"],
             latency_sample_num=10)

The example output of opt.optimize is shown below.

==========================Optimization Results==========================
 -------------------------------- ---------------------- -------------- ----------------------
|             method             |        status        | latency(ms)  |     metric value     |
 -------------------------------- ---------------------- -------------- ----------------------
|            original            |      successful      |   108.209    |        0.994         |
|         openvino_fp32          |      successful      |    30.325    |        0.994*        |
|        onnxruntime_fp32        |      successful      |    31.313    |        0.994*        |
 -------------------------------- ---------------------- -------------- ----------------------
* means we assume the metric value of the traced model does not change, so we don't recompute metric value to save time.
Optimization cost 133.9s in total.

If you expect that some acceleration methods will not work for your model, will not work well, will run for too long, or will cause exceptions, you can skip them by specifying the excludes parameter:

[ ]:

from tensorflow.keras.metrics import CategoricalAccuracy

opt.optimize(ori_model,
             x=calibration_set,
             validation_data=validation_set,
             metric=CategoricalAccuracy(),
             direction="max",
             excludes=["int8", "onnxruntime_int8_integer"],
             latency_sample_num=10)

The example output of opt.optimize is shown below.

==========================Optimization Results==========================
 -------------------------------- ---------------------- -------------- ----------------------
|             method             |        status        | latency(ms)  |     metric value     |
 -------------------------------- ---------------------- -------------- ----------------------
|            original            |      successful      |   110.496    |        0.994         |
|         openvino_fp32          |      successful      |    30.778    |        0.994*        |
|         openvino_int8          |      successful      |    38.152    |        0.994         |
|        onnxruntime_fp32        |      successful      |    23.143    |        0.994*        |
|    onnxruntime_int8_qlinear    |      successful      |    12.721    |        0.963         |
 -------------------------------- ---------------------- -------------- ----------------------
* means we assume the metric value of the traced model does not change, so we don't recompute metric value to save time.
Optimization cost 323.0s in total.
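The two example outputs in this section suggest how the filters behave: with includes, only the listed methods plus the original model are run, and with excludes, every method except the listed ones is run. The sketch below mimics that behaviour in plain Python. The function select_methods and the ALL_METHODS list are hypothetical, and keeping "original" in every case is an assumption drawn from the outputs shown here rather than from the library's source.

```python
# Method names as they appear in the example outputs of this guide.
ALL_METHODS = [
    "original",
    "int8",
    "openvino_fp32",
    "openvino_int8",
    "onnxruntime_fp32",
    "onnxruntime_int8_qlinear",
    "onnxruntime_int8_integer",
]


def select_methods(all_methods, includes=None, excludes=None):
    """Return the methods that would be run (hypothetical helper)."""
    if includes is None:
        selected = list(all_methods)
    else:
        # Assumption: the original model is always kept as the baseline.
        selected = [m for m in all_methods if m == "original" or m in includes]
    if excludes:
        selected = [m for m in selected if m == "original" or m not in excludes]
    return selected


print(select_methods(ALL_METHODS, includes=["openvino_fp32", "onnxruntime_fp32"]))
print(select_methods(ALL_METHODS, excludes=["int8", "onnxruntime_int8_integer"]))
```

Use includes when the list of candidates is short, and excludes when most methods are of interest and only a few need to be skipped.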

Obtain specific model#

You can call the get_best_model method to obtain the best model, with or without restrictions. Here we get the model with the minimal latency whose accuracy drop is less than 5%.

[5]:

from tensorflow.keras.metrics import CategoricalAccuracy

opt.optimize(ori_model,
             x=calibration_set,
             validation_data=validation_set,
             metric=CategoricalAccuracy(),
             direction="max",
             latency_sample_num=10)

acc_model, option = opt.get_best_model(accuracy_criterion=0.05)
print("When accuracy drop less than 5%, the model with minimal latency is: ", option)

When accuracy drop less than 5%, the model with minimal latency is:  inc + onnxruntime + integer

📝 Note

If you want to find the best model with the accuracy_criterion parameter, make sure you have called optimize with validation data.
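The selection that get_best_model performs can be pictured as "keep the methods whose metric drop is within the tolerance, then take the fastest". The sketch below implements that rule in plain Python on the values from the accuracy-supervised example output earlier in this guide. The function pick_best is hypothetical, and treating accuracy_criterion as a relative drop against the original model is an assumption, not a statement about BigDL-Nano's internals.

```python
# (latency in ms, metric value) from the accuracy-supervised example output.
results = {
    "original": (106.692, 0.996),
    "int8": (55.652, 0.996),
    "openvino_fp32": (32.002, 0.996),
    "openvino_int8": (33.648, 0.995),
    "onnxruntime_fp32": (25.639, 0.996),
    "onnxruntime_int8_qlinear": (9.877, 0.971),
    "onnxruntime_int8_integer": (9.85, 0.956),
}


def pick_best(results, direction="max", accuracy_criterion=None):
    """Return the lowest-latency method within the tolerated metric drop."""
    base_metric = results["original"][1]
    best_name, best_latency = None, float("inf")
    for name, (latency, metric) in results.items():
        if accuracy_criterion is not None:
            # Assumption: the drop is measured relative to the original model.
            if direction == "max":
                drop = (base_metric - metric) / base_metric
            else:
                drop = (metric - base_metric) / base_metric
            if drop > accuracy_criterion:
                continue  # too much degradation, skip this method
        if latency < best_latency:
            best_name, best_latency = name, latency
    return best_name


print(pick_best(results, accuracy_criterion=0.05))
print(pick_best(results, accuracy_criterion=0.01))
```

Tightening the tolerance trades latency for accuracy: on these numbers, a 5% tolerance admits the int8 ONNX Runtime variants, while a 1% tolerance falls back to a slower fp32 method.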

If you want to obtain a specific model even though it does not have the minimal latency, call the get_model method and specify method_name. Here we take openvino_fp32 as an example:

[ ]:

openvino_model = opt.get_model(method_name='openvino_fp32')

Inference#

Then you can use the obtained model for inference.

[6]:

from tqdm import tqdm

for img, _ in tqdm(test_set):
    acc_model(img)

📚 Related Readings

How to install BigDL-Nano

How to install BigDL-Nano in Google Colab
