Neural Network Quantization Resources
黎明灰烬, 2019-03-15
This page lists resources on neural network quantization. Quantization is moving from research into industry (that is, real applications) nowadays (as of the beginning of 2019). Hopefully this list helps :)
The resources are categorized into sections, which may contain several subsections. The categories should be easy to understand. Recommended materials are marked with ★. Leave comments to collaborate :)
Introductions
Resources that help readers gain a basic understanding of this field.
 ★ Neural Network Quantization Introduction (2019) pays special attention to the arithmetic behind quantization.
 ★ Quantization document of Nervana Neural Network Distiller (2018) introduces the key concepts of quantization.
 Making Neural Nets Work With Low Precision mainly talks about TensorFlow Lite, with a brief introduction to quantization.
 What I’ve learned about neural network quantization summarizes quantization-related hardware support and software trends as of 2017.
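The arithmetic these introductions focus on is, in its most common form, an affine mapping between real values and small integers. Below is a minimal sketch in plain Python, assuming the usual scale/zero-point scheme. The function names (`choose_qparams`, `quantize`, `dequantize`) and the unsigned 8-bit range are illustrative choices of mine, not taken from any of the linked resources.

```python
def choose_qparams(rmin, rmax, qmin=0, qmax=255):
    """Pick (scale, zero_point) so that [rmin, rmax] maps onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that real zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        # Degenerate all-zero range: any positive scale works.
        return 1.0, qmin
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    # Keep the zero point inside the integer range.
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Real value -> integer: round to the nearest step, shift, then saturate."""
    # Note: Python's round() rounds exact ties to the nearest even integer.
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Integer -> approximate real value."""
    return scale * (q - zero_point)
```

For instance, with the range [-32.0, 95.5] the scale is 0.5 and the zero point is 64, so 1.3 is stored as 67 and read back as 1.5. For values inside the range, the round-trip error is at most half a step (`scale / 2`); values outside the range saturate.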
Research
Quantization caught researchers’ attention even in the early days of deep learning.
 Binary Network. The most significant advantage of binary networks is that they no longer need multiplications, which are transformed into logic operations.
 Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 (2016): neural networks with binary weights and activations at run-time and when computing the parameters’ gradient at train-time.
 Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations (2016) uses 1 bit in both the forward and backward passes (weight updates remain 32-bit). It reaches 51% Top-1 accuracy with a quantized AlexNet.
 BinaryConnect: Training Deep Neural Networks with binary weights during propagations (2016).
 Ternary Weight Networks (2016): neural networks with weights constrained to +1, 0 and -1.
 XNOR Network (2016): both the filters and the inputs to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations.
 ★ Deep Compression (2016) combines pruning, quantization and encoding to reduce the storage requirement by 35x (AlexNet) to 49x (VGG-16) without affecting accuracy. The paper shows that 8 bits are required for quantized convolution layers to avoid significant accuracy loss, while 4 bits are sufficient for fully connected layers.
 Fixed point quantization of deep convolutional networks (Qualcomm, 2016) collects statistics of weights, activations and biases, and then performs an SQNR analysis to figure out the best bit-width for each layer. Their experiments show that, in comparison to equal bit-width settings, the fixed-point DCNs with optimized bit-width allocation offer a more than 20% reduction in model size without any loss in accuracy on the CIFAR-10 benchmark.
 Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy shows only a 0.1% accuracy drop with a ternary network, which is pretty impressive.
 Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks still appears to be a floating-point format, but the point (exponent) of the quantized tensor is flexible and learned from the model.
 ★ TensorRT Calibration uses KL divergence (2017) to find the best scale for mapping FP32 to INT8. The KL divergence is measured between the distributions of the quantized and non-quantized activation outputs of each operator, to evaluate the information loss caused by quantization. The mapping scale with the minimal KL divergence is chosen.
 High-Accuracy Low-Precision Training aims to maintain accuracy by using stochastic variance-reduced gradient to reduce gradient variance, and by combining this with a novel technique called bit centering to reduce quantization error.
 Mixed precision training of CNNs using integer operations (2018) uses a Dynamic Fixed Point technique that matches or exceeds SOTA results for ResNet-50, GoogLeNet-v1, VGG-16 and AlexNet on ImageNet, with a 1.8X performance improvement.
 Two-Step Quantization for Low-bit Neural Networks (2018): the two steps are code learning and transformation function learning based on the learned codes. The authors tried their method with different bit-widths, and for binary and ternary weight quantization of AlexNet they outperform prior SOTA work.
 Weighted-Entropy-based Quantization for Deep Neural Networks proposes a scheme that can choose the number of quantization bits according to the accuracy target. The authors performed experiments not only on image classification but also on segmentation and natural language modeling.
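The KL-divergence calibration described in the TensorRT entry above can be illustrated with a small histogram search. This is only a rough sketch under my own assumptions, not TensorRT's implementation: the function names, the bin counts (512 histogram bins, 128 quantized levels) and the `eps` floor for empty bins are all illustrative, and only activation magnitudes are considered. For every candidate clipping threshold, it compares the clipped reference histogram P with a coarsened version Q and keeps the threshold with the smallest KL(P || Q).

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two histograms of equal length (normalized internally)."""
    p_sum, q_sum = sum(p), sum(q)
    if q_sum == 0:
        return float("inf")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            # eps avoids log(0) when clipped outliers land in a bin that is
            # empty in Q (a simplification made for this sketch).
            total += (pi / p_sum) * math.log((pi / p_sum) / max(qi / q_sum, eps))
    return total


def quantize_histogram(p, num_levels):
    """Merge the bins of p into num_levels groups, then spread each group's
    mass back uniformly over the bins that were non-empty."""
    n = len(p)
    q = [0.0] * n
    for level in range(num_levels):
        start = level * n // num_levels
        end = (level + 1) * n // num_levels
        group = p[start:end]
        nonzero = sum(1 for v in group if v > 0)
        if nonzero == 0:
            continue
        avg = sum(group) / nonzero
        for j in range(start, end):
            if p[j] > 0:
                q[j] = avg
    return q


def find_threshold(values, num_bins=512, num_levels=128):
    """Return the clipping threshold (on |value|) with minimal KL divergence."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0.0
    bin_width = max_abs / num_bins
    hist = [0] * num_bins
    for v in values:
        hist[min(int(abs(v) / bin_width), num_bins - 1)] += 1

    best_kl, best_i = float("inf"), num_bins
    for i in range(num_levels, num_bins + 1):
        p = hist[:i]
        p[-1] += sum(hist[i:])  # clipped outliers saturate into the last bin
        q = quantize_histogram(hist[:i], num_levels)
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_i = kl, i
    return best_i * bin_width
```

The scale then follows from the chosen threshold, for example `threshold / 127` for symmetric INT8. With data that has a few large outliers, a search like this tends to pick a threshold below the maximum, trading a little saturation for finer resolution on the bulk of the values.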
Software
Mainly software that enables quantized neural networks.
 TensorQuant allows a transparent quantization simulation of existing DNN topologies during training and inference. TensorQuant supports generic quantization methods and allows experimental evaluation of the impact of the quantization on single layers as well as on the full topology.
 ★ Nervana Neural Network Distiller (2018) is a Python package for neural network compression research.
 Nvidia TensorRT (2017) uses calibration to improve the accuracy of quantized networks.
 Post-training quantization is supported by TensorFlow, PyTorch, MXNet and so on.
 ★ Quantization-aware Training (CVPR paper, 2018) simulates quantization arithmetic in the forward pass during training.
 MXNet provides example usage of quantization based on MKL-DNN Model Optimization and cuDNN.
 MKL-DNN, as the acceleration library for Intel CPUs, provides post-training quantization techniques and sound performance. See Lower Numerical Precision Deep Learning Inference and Training. MKL-DNN has been integrated into most popular frameworks such as TensorFlow, Caffe(2) and MXNet. Its authors also claim support for 16-bit low precision during training.
 ★ Gemmlowp (2015) is not a full linear algebra library, but focuses on low-precision computing. Gemmlowp is used in TensorFlow Lite (conv and fc) to accelerate quantized arithmetic on CPU. Gemmlowp also provides many quantization utilities, such as SaturatingRoundingDoublingHighMul, which is used in Conv and so on.
 ★ QNNPACK (news, 2018) is a mobile-optimized implementation of quantized neural network operators. QNNPACK is integrated into PyTorch/Caffe2. QNNPACK aims to improve performance for quantized neural networks only, and probably for mobile platforms only. It assumes that the model size is small, and has specially designed kernels for that case. We observed that QNNPACK outperforms most libraries dedicated to accelerating quantization.
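To give a feel for what a utility like SaturatingRoundingDoublingHighMul from the Gemmlowp entry does, here is a Python emulation of its 32-bit semantics as I understand them: multiply two int32 values, double the product, keep the rounded high 32 bits, and saturate the one overflowing case. The snake_case name is my own, and this is a sketch to read alongside the Gemmlowp source, not a replacement for it.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturating_rounding_doubling_high_mul(a, b):
    """Return the rounded high 32 bits of 2 * a * b for int32 inputs.

    If a and b are read as Q31 fixed-point numbers in [-1, 1), the result is
    their Q31 product.
    """
    # The only overflowing case: (-1) * (-1) = +1 is not representable in Q31.
    if a == INT32_MIN and b == INT32_MIN:
        return INT32_MAX
    ab = a * b  # Python ints do not overflow; this fits in 64 bits in C++
    # Rounding nudge, chosen by the sign of the product.
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)
    total = ab + nudge
    # C++ integer division truncates toward zero, while Python's // floors,
    # so divide the magnitude and restore the sign afterwards.
    quotient = abs(total) // (1 << 31)
    return quotient if total >= 0 else -quotient
```

A typical use is requantization: a real multiplier in [0.5, 1) is stored as a Q31 integer and applied to an int32 accumulator. For example, 0.5 is stored as `1 << 30`, and `saturating_rounding_doubling_high_mul(100, 1 << 30)` gives 50.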