# Convert TensorFlow Lite Models to ONNX

Open Neural Network Exchange (ONNX) aims to bridge deep learning frameworks together. TF2ONNX was built to translate TensorFlow models to ONNX so that other deep learning systems can benefit from TensorFlow functionality. However, TF2ONNX currently doesn't support quantization. This article introduces TFLite2ONNX, which converts TensorFlow Lite models to ONNX with quantization semantics translated as well.

## Introduction

ONNX, originally created by Facebook and Microsoft, is an open format built to represent machine learning models; it has since developed into a community-driven organization.

Figure 1: The ONNX Vision

TF2ONNX was built to convert TensorFlow models to ONNX, bringing TensorFlow-trained models to virtually any system that supports ONNX.

TF2ONNX does have some limitations (as of v1.5.5, when we started to build TFLite2ONNX), such as no support for TensorFlow 2.0 or quantization. Converting the volatile TensorFlow model representation to ONNX takes significant effort. And, as quantization plays an increasingly important role in deep learning deployment, the lack of such support is a drawback.

On the other hand, the model representation of TensorFlow Lite (TFLite) is relatively stable, and the officially maintained converter from TensorFlow to TFLite is robust. The converter simplifies TensorFlow models with graph transformations such as Batch Normalization folding and activation function fusion. It also handles the FakeQuantWithMinMaxVars nodes generated during TensorFlow Quantization-aware Training (QAT).

TFLite2ONNX was created to convert TFLite models to ONNX. As of today (v0.3), it handles models generated by TensorFlow 2.0 (thanks to the TFLite converter) and supports quantization conversion. This article describes how TFLite2ONNX closes the semantic gap between the TFLite and ONNX model representations.

## Data Layout Semantic Conversion

The most obvious gap is the data layout issue: TFLite models use NHWC format while ONNX uses NCHW. This article refers to this difference as layout semantic divergence.

### The Problem and TF2ONNX

The data layout format of TFLite is mentioned in neither the documentation nor the model representation; it exists only as an implicit agreement between the TFLite converter (the source TensorFlow model needs to be NHWC) and the kernels.

In contrast, ONNX explicitly declares that it uses NCHW in both the operator representation and the documentation (which is generated from the operator representation).

Figure 2: Data layout handling of TF2ONNX - MobileNetV2 example

TF2ONNX adopts a transpose-based routine to close the gap: it converts the internal operators and tensors to NCHW data layout but leaves the graph inputs and outputs in NHWC, inserting Transpose operators to bridge the NHWC and NCHW subgraphs. Figure 2 above shows a MobileNetV2 TensorFlow model converted to ONNX with TF2ONNX. (More on TF2ONNX's handling of data layout can be found in the GitHub issue.)

During the development of TFLite2ONNX, we tried two approaches:

• Transpose-based approach (different from TF2ONNX's) - enabled in v0.1 and dropped in v0.3.
• Propagation-based approach - introduced in v0.2 and the default since then.

### Transpose based Approach

Regarding layout semantic divergence, one fact stands out: some operators have an implicit data layout, e.g. Conv, while others don't, e.g. Add.

Different from TF2ONNX, the transpose-based approach in TFLite2ONNX inserts a transpose pattern wherever an operator has layout semantic divergence. The transpose pattern is a Transpose operator that bridges the source layout (TFLite) and the target layout (ONNX).

For example, TFLite pattern $\left<Data_{nhwc}\right> \rightarrow [Conv]$ is converted to $\left<Data_{nhwc}\right> \rightarrow [Transpose] \rightarrow \left<Data_{nchw}\right> \rightarrow [Conv]$. (In this article, $\left<TensorName\right>$ denotes a tensor, and $[Operator]$ denotes an operator.) Figure 3 gives an example of converting the first Conv of MobileNetV2.

Figure 3: ONNX model converted by transpose based approach of TFLite2ONNX

With this approach, only a limited set of operators such as Conv and Pooling needs special processing; all other operator and tensor conversions are trivial.
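In NumPy terms, the inserted transpose pattern amounts to a fixed permutation of axes. The sketch below illustrates the idea with an assumed input shape; it is not the library's actual code.

```python
import numpy as np

# NHWC -> NCHW axis permutation that an inserted Transpose operator
# performs before a layout-sensitive operator like Conv.
NHWC_TO_NCHW = (0, 3, 1, 2)

data_nhwc = np.zeros((1, 224, 224, 3))  # e.g. a MobileNetV2-style input
data_nchw = np.transpose(data_nhwc, NHWC_TO_NCHW)
print(data_nchw.shape)  # (1, 3, 224, 224)
```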

### Propagation based Approach

Though the transpose-based approach can handle layout semantic divergence, it adds many operators and tensors, making the generated ONNX model unnecessarily complicated. The propagation-based approach resolves this by propagating layout semantic divergence across the graph and then performing the conversion.

In a given graph, some tensors have implicit layout semantics, e.g. tensors connected directly to Conv, while others do not and are likely transparent to layout, e.g. those around Abs and Add. When converting the layout of one tensor of an operator, the operator's other tensors may need a similar conversion.

For example, when converting TFLite graph (omitted kernel and bias) $\left< A_{nhwc} \right> \rightarrow [Conv] \rightarrow \left< B_{nhwc} \right> \rightarrow [Abs] \rightarrow \left< C_{?} \right>$ to ONNX, tensor $A_{nhwc}$ becomes $A_{nchw}$ and $B_{nhwc}$ becomes $B_{nchw}$. Here, the output $C$ of Abs should have the same format as input $B$. Propagation based approach propagates the conversion from $B$ to $C$. Therefore we have the ONNX graph $\left< A_{nchw} \right> \rightarrow [Conv] \rightarrow \left< B_{nchw} \right> \rightarrow [Abs] \rightarrow \left< C_{nchw} \right>$.

The conversion permutes the shapes of tensors that are activations, i.e. value info in ONNX. For weights, i.e. initializers in ONNX, the data itself also needs to be transposed.
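This distinction can be sketched in NumPy (illustrative shapes, not the library's actual code): an activation only has its declared shape permuted, while a weight tensor has its data physically rearranged.

```python
import numpy as np

perm = (0, 3, 1, 2)  # NHWC -> NCHW

# Activation (ONNX value info): only the declared shape is permuted.
activation_shape = [1, 224, 224, 3]
activation_shape_nchw = [activation_shape[i] for i in perm]

# Weight (ONNX initializer): the stored data must be transposed as well.
weight = np.arange(2 * 3 * 3 * 4, dtype=np.float32).reshape(2, 3, 3, 4)
weight_nchw = np.transpose(weight, perm)

print(activation_shape_nchw)  # [1, 3, 224, 224]
print(weight_nchw.shape)      # (2, 4, 3, 3)
```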

In this approach, operators are categorized into four classes (some are marked in Figure 5):

• Implicit: operators that have layout semantic divergence, e.g. Conv. They are the source of layout semantic divergence.
• Transparent: operators that are insensitive to layout, e.g. Abs. If any of their tensors has layout semantic divergence, it is propagated to all tensors connected to such operators.
• Attribute: operators that can propagate layout semantic divergence but have layout-sensitive attributes that need special handling, e.g. the axis attribute of Concat. An additional pass after propagation is needed to adjust these attributes.
• Terminate: operators that neither have nor can propagate layout semantic divergence, e.g. Reshape.

Figure 5: Part of the ONNX model generated by propagation based approach of TFLite2ONNX

When propagating layout semantic divergence across the graph, for a particular operator: if it is Transparent or Attribute, propagate the divergence among its tensors; if it is Implicit or Terminate, terminate the recursion in this direction. Figure 5 shows part of the ONNX model generated by the propagation-based approach from the NASNet TFLite model.
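The walk above can be sketched as a small recursion. The graph encoding (operator name, inputs, outputs) and the category map are illustrative, not the library's real data structures.

```python
# Illustrative four-way classification; real coverage is larger.
CATEGORY = {'Conv2D': 'Implicit', 'Abs': 'Transparent',
            'Concat': 'Attribute', 'Reshape': 'Terminate'}

def propagate(tensor, graph, converted):
    """Spread layout semantic divergence from `tensor` through the graph."""
    if tensor in converted:
        return
    converted.add(tensor)
    for op, inputs, outputs in graph:
        if tensor not in inputs and tensor not in outputs:
            continue
        if CATEGORY.get(op, 'Terminate') in ('Transparent', 'Attribute'):
            for t in inputs + outputs:
                propagate(t, graph, converted)
        # Implicit and Terminate operators end the recursion here.

# <A> -> [Conv2D] -> <B> -> [Abs] -> <C>: converting B also converts C,
# but the walk stops at the Implicit Conv2D.
g = [('Conv2D', ['A', 'W'], ['B']), ('Abs', ['B'], ['C'])]
seen = set()
propagate('B', g, seen)
print(sorted(seen))  # ['B', 'C']
```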

### Explicit Layout and Broadcast of Propagation

With the propagation-based approach, the converted ONNX model carries no extra machinery for layout semantic divergence, i.e. no additional operators are introduced.

However, incompatible layouts can sometimes arise. Consider Reshape, which is Terminate, as below. If $A$ is propagated while the other tensors are not, the output layout may not be what the user expects. (The transpose-based approach doesn't have this issue since its model-level layout remains TFLite style; layout semantic divergence is handled inside $[Transpose] \rightarrow [OP] \rightarrow [Transpose]$.)

$$\left. \begin{aligned} \{Graph\} \rightarrow \left< A \right> \rightarrow [Reshape] \rightarrow \left< B \right> \\ \left< C \right> \\ \end{aligned} \right\} \rightarrow [Concat] \rightarrow \left< D \right>$$

Explicit layout is introduced to handle such scenarios: users can feed TFLite2ONNX a mapping of tensor name, TFLite layout, and ONNX layout. Users are also free to define layout conversions for tensors that have no incompatible layout.
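A sketch of such a mapping and the permutation it implies is below. The exact argument name and format that TFLite2ONNX accepts may differ; treat the tensor names and the structure as illustrative.

```python
# Hypothetical explicit layout mapping: tensor name -> (TFLite layout,
# desired ONNX layout). Tensor names here are placeholders.
explicit_layouts = {
    'input':  ('NHWC', 'NCHW'),
    'output': ('NHWC', 'NCHW'),
}

def permutation(src: str, dst: str):
    """Derive the transpose permutation that maps layout `src` to `dst`."""
    return tuple(src.index(d) for d in dst)

print(permutation('NHWC', 'NCHW'))  # (0, 3, 1, 2)
```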

Another problem is broadcasting in binary operators such as Add (see this issue for more). Tensor $B$ in the TFLite graph below needs to be broadcast. If $A$ is converted from NHWC to NCHW, i.e. $A_{(2 \times 5 \times 3 \times 4)}$, $B$ is no longer broadcastable in ONNX. Even worse, propagating the layout semantic divergence to $B$ fails because $A$ and $B$ have different ranks.

$$\left. \begin{aligned} \{Graph\} \rightarrow \left< A_{(2 \times 3\times 4 \times5)} \right> \\ \left< B_{(4 \times 5)} \right> \\ \end{aligned} \right\} \rightarrow [Add] \rightarrow \left< C \right>$$

To manage broadcasting in the ONNX model, an additional Reshape pattern is introduced. A tensor like $B$ is reshaped to extend its rank to match the other operand, so that propagation and broadcasting work correctly. The intermediate graph looks as below.

$$\left. \begin{aligned} \{Graph\} \rightarrow \left< A_{(2 \times 3\times 4 \times5)} \right> \\ \left< B_{(4 \times 5)} \right> \rightarrow [Reshape] \rightarrow \left< B^{'}_{(1 \times 1 \times 4 \times 5)} \right>\\ \end{aligned} \right\} \rightarrow [Add] \rightarrow \left< C \right>$$
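The same fix can be checked in NumPy: once $B$ is extended to rank 4, the NHWC-to-NCHW permutation applies to both operands and the result still broadcasts.

```python
import numpy as np

A_nhwc = np.zeros((2, 3, 4, 5))
B = np.zeros((4, 5))

B4 = B.reshape(1, 1, 4, 5)           # the inserted Reshape pattern
perm = (0, 3, 1, 2)                  # NHWC -> NCHW
A_nchw = np.transpose(A_nhwc, perm)  # (2, 5, 3, 4)
B_nchw = np.transpose(B4, perm)      # (1, 5, 1, 4)

C = A_nchw + B_nchw                  # broadcasts again after conversion
print(C.shape)  # (2, 5, 3, 4)
```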

## Quantization Semantic Conversion

The TensorFlow stack was the first to provide production-ready quantization support. By converting quantized TFLite models to ONNX, we can bring this quantization capability to more systems. (If the data types in this section confuse you, see the discussion where we brought quantization to TVM.)

### The Problem and TF2ONNX

TensorFlow and TFLite provide several solutions for quantization: a quantization spec, post-training quantization, and quantization-aware training. All of these generate TFLite models whose tensors are quantized - uint8 in most cases - executed by the quantized versions of operators in the TFLite runtime.

On the other hand, quantization support in ONNX has two aspects (wiki): dedicated quantized operators, e.g. QLinearConv and QLinearMatMul, and the quantization utility operators QuantizeLinear and DequantizeLinear that convert between float and fixed-point tensors.

Obviously, the semantic gap between TensorFlow and ONNX is significant, and TF2ONNX doesn’t provide quantization support.

### Using Quantized Operators

Initially, a quantized TFLite operator was converted to the quantized ONNX operator if it had a peer, e.g. QLinearConv, and converted back to float otherwise.

However, as in many other systems, the ONNX tensor representation doesn't carry quantization semantics; a low-precision uint8 tensor is raw uint8 data, just like in NumPy - no zero point or scale description. And, as only Conv and MatMul have quantized operator versions in ONNX, an end-to-end quantized ONNX model is impossible. Therefore, each quantized ONNX operator needs to be enclosed by quantize and dequantize patterns (example below).

$$\left. \begin{aligned} \left< A_{float} \right> \\ \left< B_{float} \right> \\ \end{aligned} \right\} \rightarrow [Add] \rightarrow \left< C_{float} \right> \rightarrow [QuantizeLinear] \rightarrow \left< D_{uint8} \right> \rightarrow [QLinearConv] \rightarrow \left< E_{uint8} \right> \rightarrow [DequantizeLinear] \rightarrow \left< F_{float} \right>$$

For models that are mainly composed of Conv, e.g. MobileNetV1 (which we have tried to convert), this overhead is not very significant. But in most other models, Conv and MatMul account for only a small share of the operators. Worse, the quantization semantics of operators other than Conv and MatMul are lost - we no longer benefit much from mechanisms like quantization-aware training.

### Maintaining Quantization Information

Instead of using quantized operators, TFLite2ONNX preserves the quantization semantics - the zero point and scale of each tensor, which are key to maintaining prediction accuracy - in the ONNX model by inserting a quantization pattern.

$$[OP] \rightarrow \left< T_{f} \right> \rightarrow [Quantize] \rightarrow \left\{ \begin{aligned} \left< T_q \right> \\ \left< T_{zero\_point} \right> \\ \left< T_{scale} \right> \\ \end{aligned} \right\} \rightarrow [Dequantize] \rightarrow \left< T'_{f} \right> \rightarrow [OP]$$

Technically, every quantized tensor $[OP] \rightarrow \left< T_{q} \right> \rightarrow [OP]$ in the TFLite model is converted into the ONNX pattern above.

Figure 6: Quantized ONNX model generated by TFLite2ONNX

This mechanism adds many new operators and tensors: if the original TFLite model has $O$ operators and $T$ tensors, the generated model may have $O+2T$ operators and $3T$ tensors. Figure 6 is an example of converting a quantized TFLite Conv model to ONNX.
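The affine arithmetic that the zero point and scale encode can be written out directly; this is a sketch of the quantize/dequantize pair for uint8, with illustrative values, not the library's code.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """float -> uint8 using the tensor's scale and zero point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """uint8 -> float; recovers the value up to one quantization step."""
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = 0.05, 128
t = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
q = quantize(t, scale, zero_point)     # array([108, 128, 148], dtype=uint8)
t2 = dequantize(q, scale, zero_point)  # close to t again
```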

The framework that consumes the ONNX model can decide how to execute the quantized model.

## The Implementation

TFLite2ONNX is a fairly simple package of only ~2200 lines of code. The code is divided into several parts: each TFLite operator has a dedicated converter class; data layout and quantization handling are managed at the Graph level; the rest are helpers or wrappers such as Tensor and Layout.

As of v0.3, many Convolutional Neural Networks have been enabled, and we maintain tests covering a subset of them. About 20 TFLite operators are supported. A Python interface and a command-line tool are available.
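A typical invocation looks like the sketch below; the paths are placeholders, and the exact interface may have changed since v0.3, so check the project README.

```shell
# Hypothetical command-line usage (paths are placeholders); the Python
# interface offers an equivalent convert() entry point.
tflite2onnx model.tflite model.onnx
```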

If some operators are not supported yet, you may request a new operator. It would be great if you could help enable new operators - please join us via How to enable a new operator.