Chapter 5: End-to-End INT8 Quantization and Deployment of a Streaming ASR Model (Zipformer) on an Embedded Platform (RV1126B)

[Project Retrospective Report V5.0]

  • Project Name: Part 5: Sherpa ASR RKNN Quantization on the RV1126B Platform
  • Project Period: 2025-08-21 ~ 2025-09-23
  • Project Duration: 40 hours (approx. 5 person-days)
  • Review Date: 2025-09-27
  • Key Personnel: Potter White

1. Project Objective

To convert a streaming Zipformer ASR (Automatic Speech Recognition) ONNX model from FP32 precision to an INT8 precision RKNN model for an embedded device based on the Rockchip RV1126B SoC. The goal is to complete its deployment and validation within the sherpa-onnx C++ inference framework to meet the operational requirements of low latency and low power consumption.

2. Methodology & Workflow

This project adopted a systematic methodology, covering the entire process from model analysis and runtime environment modification to calibration data generation and final model validation.

```mermaid
graph TD
    classDef main fill:#2c3e50,color:white,stroke:#8e9eab;
    classDef task fill:#4b3e58,color:white,stroke:#b8a3d3;
    classDef check fill:#27ae60,color:white,stroke:#88d1a1;
    classDef fail fill:#c0392b,color:white,stroke:#e69a93;
    classDef success fill:#16a085,color:white,stroke:#72d6c1;
    classDef note fill:#f39c12,color:white,stroke:#f8c471;

    S[<b>Start: FP32 ONNX Model</b>]:::main
    A["<div style='text-align:left; padding:10px;'><b>1. Initial Model Analysis & Conversion</b><br>- Attempt direct conversion using pre-quantized INT8 ONNX model</div>"]:::task

    S --> A

    A --> R1{"<div style='text-align:center'>Conversion Successful?</div>"}:::check
    R1 -- "No: `DynamicQuantizeLinear` operator present" --> B

    B["<div style='text-align:left; padding:10px;'><b>2. C++ Runtime Adaptation</b><br>- Analyze `sherpa-onnx` RKNN backend code<br>- Add handling branch for `RKNN_TENSOR_INT8` type<br>- Implement manual dequantization logic for INT8 to FP32 output</div>"]:::task
    B --> C

    C["<div style='text-align:left; padding:10px;'><b>3. Calibration Dataset Iteration</b><br><b>3.1 (Snapshot Method):</b> Extract feature frames, zero-pad state tensors<br><b>3.2 (Streaming Simulation):</b> Use `onnxruntime` to simulate streaming inference, capture real non-zero state tensors</div>"]:::task
    C --> R2{"<div style='text-align:center'>INT8 Model Inference Result?</div>"}:::check

    R2 -- "3.1 -> Garbled Output" --> C
    R2 -- "3.2 (`interval=1`) -> <b>First Success</b>" --> D

    D["<div style='text-align:left; padding:10px;'><b>4. Quantization Strategy Tuning</b><br>- <b>Algorithm:</b> Test `normal` (default) vs. `kl_divergence` algorithms<br>- <b>Mixed Precision:</b> Attempt manual and automatic mixed-precision quantization</div>"]:::task
    D --> E

    E["<div style='text-align:left; padding:10px;'><b>5. Final Solution Validation</b><br>- Compare model accuracy and performance (RTF) across different strategies</div>"]:::task
    E --> F[<b>Finish: Deliver Deployable INT8 RKNN Model</b>]:::success

    subgraph "Key Technical Points"
      K1[Static Graph vs. Dynamic Graph]:::note
      K2[Calibration for Stateful Models]:::note
      K3[Memory Analysis of Quantization Algorithms]:::note
      K4[Mixed-Precision Quantization]:::note
    end

    A --> K1
    C --> K2
    D --> K3 & K4
```

3. Implementation & Key Findings

3.1 Initial Model Conversion

  • Action Taken: Attempted to directly convert the int8.onnx model released by the model provider using rknn-toolkit2.
  • Observation: The conversion failed. The toolchain reported an error, indicating the model contained a DynamicQuantizeLinear operator, which is incompatible with the static graph computation model of the RKNN NPU.
  • Measure Taken: Determined that the FP32 ONNX model must be used as the input source, and the static quantization functionality of rknn-toolkit2 itself must be utilized for the conversion.
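This conversion path can be sketched with the rknn-toolkit2 Python API. The file names and the configuration values below are illustrative assumptions, not values recorded by the project; the real run used the FP32 Zipformer ONNX files and the calibration list produced later in section 3.3:

```python
from rknn.api import RKNN  # rknn-toolkit2

rknn = RKNN(verbose=True)

# Hypothetical settings for illustration; target string per toolkit platform naming.
rknn.config(target_platform="rv1126b", quantized_algorithm="normal")
rknn.load_onnx(model="encoder-fp32.onnx")

# dataset.txt lists one calibration sample per line (paths to saved input tensors);
# do_quantization=True triggers the toolkit's own static post-training quantization.
rknn.build(do_quantization=True, dataset="dataset.txt")
rknn.export_rknn("encoder-int8.rknn")
rknn.release()
```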

3.2 C++ Inference Framework (sherpa-onnx) Adaptation

  • Action Taken: After successfully converting the FP32 model to an INT8 RKNN model, performed inference in sherpa-onnx.
  • Observation: A runtime error occurred with the message: Unsupported tensor type: 2, INT8.
  • Measures Taken:
    1. Modified the RKNN backend C++ source code of sherpa-onnx, adding branch handling for the RKNN_TENSOR_INT8 data type in the model input, output, and states processing logic.
    2. After obtaining the model output, implemented a manual dequantization step based on the scale and zp (zero_point) parameters from rknn_tensor_attr. This converted the INT8 output to the FP32 data required by the subsequent decoding logic.
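The manual dequantization step is the standard asymmetric affine mapping defined by the tensor's scale and zero point. A minimal numpy sketch of the same logic the C++ code performs on rknn_tensor_attr fields (names here are illustrative):

```python
import numpy as np

def dequantize_int8(data: np.ndarray, scale: float, zp: int) -> np.ndarray:
    """Map INT8 NPU output back to FP32: fp32 = (q - zero_point) * scale."""
    return (data.astype(np.float32) - np.float32(zp)) * np.float32(scale)

# Example: with scale=0.05 and zp=-3, the stored value 17 maps to (17 - (-3)) * 0.05
q = np.array([17, -3, -128, 127], dtype=np.int8)
fp = dequantize_int8(q, scale=0.05, zp=-3)
```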

3.3 Iteration and Optimization of the Quantization Calibration Dataset

This was the core phase of the project, where an effective calibration data generation scheme was determined through multiple iterations.

  • Iteration 1: Snapshot Dataset

    • Method: Extracted multiple feature frames (x) from audio as calibration samples. All state inputs (cached_* tensors) were padded with all-zero tensors.
    • Result: The generated INT8 model produced garbled output or silence during inference.
  • Iteration 2: Precise Feature Extraction + Snapshot Dataset

    • Method: Deeply analyzed the C++ source code of sherpa-onnx (features.h) and precisely replicated its audio preprocessing pipeline—including Remove DC Offset and Pre-emphasis—in the Python data generation script. State inputs remained all-zero tensors.
    • Result: No improvement in inference results, indicating that the authenticity of the state tensors was the more critical factor.
  • Iteration 3: Simulated Streaming Dataset

    • Method: Wrote a Python script to fully simulate the streaming inference process of sherpa-onnx using onnxruntime. At each time step, the real feature frame (x) and the real non-zero state (cached_*) computed by onnxruntime from the previous step were saved as a complete calibration sample.
    • Result:
      • After quantizing with the full dataset generated by this method (sampling_interval=1), the model successfully produced complete, readable recognition results for the first time, with a Real-Time Factor (RTF) of approximately 0.29.
      • The dataset was enormous (over 100GB), leading to conversion times of several hours.
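The essential control flow of the Iteration 3 script can be sketched as follows. The real script drives an onnxruntime InferenceSession (roughly `sess.run(None, {"x": x, **states})`); here a toy step function stands in for the model so the key ordering is visible: each step saves the inputs *as the model sees them* before advancing the state, so the first sample carries zero states and every later sample carries real cached_* history:

```python
import numpy as np

def collect_calibration_samples(run_step, x_chunks, init_states):
    """Simulate streaming inference and record (x, states) calibration samples."""
    samples = []
    states = {k: v.copy() for k, v in init_states.items()}
    for x in x_chunks:
        # Save inputs for this step first: zero states on step 1, real ones after.
        samples.append({"x": x, **{k: v.copy() for k, v in states.items()}})
        _, states = run_step(x, states)  # output is unused; only states carry over
    return samples

# Toy stand-in for the Zipformer step: new state mixes in the current chunk.
def toy_step(x, states):
    return x, {k: 0.9 * v + 0.1 * x.mean() for k, v in states.items()}

chunks = [np.full((1, 4), float(i + 1), dtype=np.float32) for i in range(3)]
init = {"cached_attn": np.zeros((1, 4), dtype=np.float32)}
samples = collect_calibration_samples(toy_step, chunks, init)
```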

3.4 Further Exploration of Quantization Strategies

Based on the successful interval=1 dataset from section 3.3, the following strategy optimizations were performed:

  • Quantization Algorithm:

    • normal (default): Successfully generated the model, with results as described in 3.3.
    • kl_divergence: Failed during conversion due to an Out of Memory (OOM) error caused by excessive memory consumption. Analysis indicated that this algorithm needs to load a large number of samples at once for statistical computation, making it incompatible with large-scale datasets.
  • Data Sampling Method:

    • Systematic Sampling (sampling_interval > 1) / Random Sampling: Attempted to reduce the dataset size through sampling to accommodate the kl_divergence algorithm.
    • Result: Although the conversion succeeded, the inference results of the generated model reverted to garbled output. This indicates that the Zipformer model is highly sensitive to the temporal continuity of states in the calibration data.
  • Mixed-Precision Quantization:

    • Method: Attempted to use the manual and automatic mixed-precision quantization features of rknn-toolkit2 to keep precision-sensitive operators like Softmax in FP16.
    • Result: Manual mixing did not significantly improve accuracy and introduced new recognition errors. The model generated by automatic mixing resulted in runtime errors or garbled output.
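The OOM failure of kl_divergence is consistent with a simple capacity estimate: a statistics-based algorithm that must see many samples at once quickly exceeds host memory on an interval=1 dataset. The numbers below are illustrative assumptions, not measurements from the project:

```python
# Hypothetical sizing: tens of thousands of streaming steps, each saving the
# feature chunk plus all cached_* state tensors in FP32.
num_samples = 50_000            # assumed number of captured streaming steps
bytes_per_sample = 2 * 1024**2  # assumed ~2 MiB of x + cached_* tensors per step
total_gib = num_samples * bytes_per_sample / 1024**3
print(f"~{total_gib:.0f} GiB would need to be visible to the algorithm at once")
```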

4. Project Results & Conclusion

  • Final Deliverables:

    1. An INT8 RKNN ASR model that runs stably on the RV1126B platform, achieving an RTF of 0.29, successfully completing the deployment from FP32 to INT8.
    2. A validated calibration dataset generation script (09_generate_calibration_dataset.py) for streaming stateful models, which ensures the authenticity and continuity of state tensors by simulating the streaming inference process.
  • Engineering Conclusions:

    1. For this Zipformer model, the state authenticity and temporal continuity of the calibration data are the decisive factors for successful quantization, outweighing minor differences in feature extraction.
    2. Statistical quantization algorithms in rknn-toolkit2, such as kl_divergence, have memory limitations when processing large-scale, high-dimensional, continuous datasets.
    3. Given the current toolchain and model architecture, using a complete simulated streaming dataset with the normal quantization algorithm represents the optimal engineering solution, balancing the constraints of accuracy, performance, and conversion feasibility.