[Project Retrospective Report V5.0] End-to-End INT8 Quantization and Deployment of a Streaming ASR Model (Zipformer) on an Embedded Platform (RV1126B)
| Project Attribute | Details |
|---|---|
| Project Name | Part 5: Sherpa ASR RKNN Quantization on the RV1126B Platform |
| Project Period | 2025-08-21 ~ 2025-09-23 |
| Project Duration | 40 hours (Approx. 5 person-days) |
| Review Date | 2025-09-27 |
| Key Personnel | Potter White |
1. Project Objective
To convert a streaming Zipformer ASR (Automatic Speech Recognition) ONNX model from FP32 precision to an INT8 precision RKNN model for an embedded device based on the Rockchip RV1126B SoC. The goal is to complete its deployment and validation within the sherpa-onnx C++ inference framework to meet the operational requirements of low latency and low power consumption.
2. Methodology & Workflow
This project adopted a systematic methodology, covering the entire process from model analysis and runtime environment modification to calibration data generation and final model validation.
```mermaid
graph TD
classDef main fill:#2c3e50,color:white,stroke:#8e9eab;
classDef task fill:#4b3e58,color:white,stroke:#b8a3d3;
classDef check fill:#27ae60,color:white,stroke:#88d1a1;
classDef fail fill:#c0392b,color:white,stroke:#e69a93;
classDef success fill:#16a085,color:white,stroke:#72d6c1;
classDef note fill:#f39c12,color:white,stroke:#f8c471;
S[<b>Start: FP32 ONNX Model</b>]:::main
A["<div style='text-align:left; padding:10px;'><b>1. Initial Model Analysis & Conversion</b><br>- Attempt direct conversion using pre-quantized INT8 ONNX model</div>"]:::task
S --> A
A --> R1{"<div style='text-align:center'>Conversion Successful?</div>"}:::check
R1 -- "No: `DynamicQuantizeLinear` operator present" --> B
B["<div style='text-align:left; padding:10px;'><b>2. C++ Runtime Adaptation</b><br>- Analyze `sherpa-onnx` RKNN backend code<br>- Add handling branch for `RKNN_TENSOR_INT8` type<br>- Implement manual dequantization logic for INT8 to FP32 output</div>"]:::task
B --> C
C["<div style='text-align:left; padding:10px;'><b>3. Calibration Dataset Iteration</b><br><b>3.1 (Snapshot Method):</b> Extract feature frames, zero-pad state tensors<br><b>3.2 (Streaming Simulation):</b> Use `onnxruntime` to simulate streaming inference, capture real non-zero state tensors</div>"]:::task
C --> R2{"<div style='text-align:center'>INT8 Model Inference Result?</div>"}:::check
R2 -- "3.1 -> Garbled Output" --> C
R2 -- "3.2 (`interval=1`) -> <b>First Success</b>" --> D
D["<div style='text-align:left; padding:10px;'><b>4. Quantization Strategy Tuning</b><br>- <b>Algorithm:</b> Test `normal` (default) vs. `kl_divergence` algorithms<br>- <b>Mixed Precision:</b> Attempt manual and automatic mixed-precision quantization</div>"]:::task
D --> E
E["<div style='text-align:left; padding:10px;'><b>5. Final Solution Validation</b><br>- Compare model accuracy and performance (RTF) across different strategies</div>"]:::task
E --> F[<b>Finish: Deliver Deployable INT8 RKNN Model</b>]:::success
subgraph "Key Technical Points"
K1[Static Graph vs. Dynamic Graph]:::note
K2[Calibration for Stateful Models]:::note
K3[Memory Analysis of Quantization Algorithms]:::note
K4[Mixed-Precision Quantization]:::note
end
A --> K1
C --> K2
D --> K3 & K4
```
3. Implementation & Key Findings
3.1 Initial Model Conversion
- Action Taken: Attempted to directly convert the `int8.onnx` model released by the model provider using `rknn-toolkit2`.
- Observation: The conversion failed. The toolchain reported an error indicating that the model contained a `DynamicQuantizeLinear` operator, which is incompatible with the static-graph computation model of the RKNN NPU (a quick check for this operator is sketched below).
- Measure Taken: Determined that the `FP32` ONNX model must be used as the input source, with the static quantization functionality of `rknn-toolkit2` itself performing the conversion.
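As a pre-flight check, the offending operator can be located directly in the ONNX graph. A minimal sketch using the standard `onnx` Python package; the file name is illustrative:

```python
import onnx

# Load the pre-quantized model and list the operators that compute their
# quantization parameters at runtime, which a static RKNN graph cannot do.
model = onnx.load("encoder-int8.onnx")  # illustrative file name
dynamic_ops = [
    node.name for node in model.graph.node
    if node.op_type == "DynamicQuantizeLinear"
]
print(f"DynamicQuantizeLinear nodes found: {len(dynamic_ops)}")
```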
3.2 C++ Inference Framework (sherpa-onnx) Adaptation
- Action Taken: After successfully converting the `FP32` model to an `INT8` RKNN model, performed inference in `sherpa-onnx`.
- Observation: A runtime error occurred with the message: `Unsupported tensor type: 2, INT8`.
- Measures Taken:
  - Modified the RKNN backend C++ source code of `sherpa-onnx`, adding branch handling for the `RKNN_TENSOR_INT8` data type in the model input, output, and states processing logic.
  - After obtaining the model output, implemented a manual dequantization step based on the `scale` and `zp` (zero_point) parameters from `rknn_tensor_attr`, converting the `INT8` output to the `FP32` data required by the subsequent decoding logic (the arithmetic is sketched below).
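The dequantization itself is the standard affine mapping. A minimal Python sketch of the arithmetic performed in the C++ backend (the actual change lives in the `sherpa-onnx` C++ code; the function name here is illustrative):

```python
import numpy as np

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Affine dequantization: fp32 = (int8 - zero_point) * scale.

    `scale` and `zero_point` correspond to the `scale` and `zp` fields of
    the tensor's rknn_tensor_attr, queried once at model load time.
    """
    return (q.astype(np.float32) - np.float32(zero_point)) * np.float32(scale)
```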
3.3 Iteration and Optimization of the Quantization Calibration Dataset
This was the core phase of the project, where an effective calibration data generation scheme was determined through multiple iterations.
Iteration 1: Snapshot Dataset
- Method: Extracted multiple feature frames (`x`) from audio as calibration samples. All state inputs (`cached_*` tensors) were padded with all-zero tensors.
- Result: The generated `INT8` model produced garbled output or silence during inference.
Iteration 2: Precise Feature Extraction + Snapshot Dataset
- Method: Analyzed the C++ source code of `sherpa-onnx` (`features.h`) in depth and precisely replicated its audio preprocessing pipeline, including DC offset removal and pre-emphasis, in the Python data generation script (see the sketch below). State inputs remained all-zero tensors.
- Result: No improvement in inference results, proving that the authenticity of the state tensors was the more critical factor.
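For reference, the two preprocessing steps are simple signal operations. A simplified sketch assuming Kaldi-style conventions as used by `sherpa-onnx`; the real pipeline applies them per frame rather than per utterance:

```python
import numpy as np

def remove_dc_and_preemphasize(samples: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    # Remove DC offset: subtract the mean of the window.
    samples = samples - samples.mean()
    # Pre-emphasis: y[n] = x[n] - coeff * x[n-1]; following the Kaldi
    # convention, the first sample is scaled by (1 - coeff) as the boundary case.
    out = samples.copy()
    out[1:] -= coeff * samples[:-1]
    out[0] -= coeff * samples[0]
    return out
```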
Iteration 3: Simulated Streaming Dataset
- Method: Wrote a Python script to fully simulate the streaming inference process of `sherpa-onnx` using `onnxruntime`. At each time step, the real feature frame (`x`) and the real non-zero states (`cached_*`) computed by `onnxruntime` in the previous step were saved together as one complete calibration sample (the capture loop is sketched below).
- Result:
  - After quantizing with the full dataset generated by this method (`sampling_interval=1`), the model produced complete, readable recognition results for the first time, with a Real-Time Factor (RTF) of approximately 0.29.
  - The dataset was enormous (over 100 GB), leading to conversion times of several hours.
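A sketch of the capture loop follows. The `x` and `cached_*` names are taken from this report; the file names, `feature_chunks`, `save_calibration_sample`, and the assumption that every output after the first is a next-step state in input order are all illustrative, not taken from the project script:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("encoder-fp32.onnx")
state_names = [i.name for i in sess.get_inputs() if i.name.startswith("cached_")]
output_names = [o.name for o in sess.get_outputs()]

# Start from all-zero states, using the dims the model declares
# (dynamic dims fall back to 1 for illustration).
states = {
    i.name: np.zeros([d if isinstance(d, int) else 1 for d in i.shape], np.float32)
    for i in sess.get_inputs() if i.name.startswith("cached_")
}

for step, x in enumerate(feature_chunks):  # hypothetical per-chunk fbank features
    # Persist the *inputs* of this step: a real feature chunk plus the real
    # states carried over from the previous step.
    save_calibration_sample(step, x, states)  # hypothetical sample writer
    outputs = sess.run(output_names, {"x": x, **states})
    # Feed the newly produced states back for the next chunk.
    states = dict(zip(state_names, outputs[1:]))
```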
3.4 Further Exploration of Quantization Strategies
Based on the successful `sampling_interval=1` dataset from Section 3.3, the following strategy optimizations were performed:
Quantization Algorithm:
- `normal` (default): Successfully generated the model, with results as described in 3.3 (a conversion sketch follows this list).
- `kl_divergence`: Failed during conversion with an Out-of-Memory (OOM) error caused by excessive memory consumption. Analysis indicated that this algorithm must load a large number of samples at once for its statistical computation, making it incompatible with large-scale datasets.
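For orientation, a minimal conversion sketch with `rknn-toolkit2`; the file paths are illustrative and the `target_platform` string is an assumption for the RV1126B:

```python
from rknn.api import RKNN

rknn = RKNN(verbose=True)
# quantized_algorithm accepts "normal", "mmse", or "kl_divergence";
# "normal" was the only one that survived the 100+ GB calibration set here.
rknn.config(target_platform="rv1126b", quantized_algorithm="normal")
rknn.load_onnx(model="encoder-fp32.onnx")
# dataset.txt lists one calibration sample (file path) per line.
rknn.build(do_quantization=True, dataset="dataset.txt")
rknn.export_rknn("encoder-int8.rknn")
rknn.release()
```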
Data Sampling Method:
- Systematic Sampling (`sampling_interval > 1`) / Random Sampling: Attempted to reduce the dataset size through sampling to accommodate the `kl_divergence` algorithm (see the sketch below).
- Result: Although the conversion succeeded, the inference results of the generated model reverted to garbled output. This indicates that the Zipformer model is highly sensitive to the temporal continuity of the states in the calibration data.
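The sampling itself is trivial; the point is what it destroys. A hypothetical down-sampling of the calibration list, assuming one sample path per line in `dataset.txt`:

```python
# Keep every k-th sample to shrink the set for kl_divergence. Because each
# sample's cached_* states were produced by the immediately preceding step,
# dropping intermediate steps breaks the temporal continuity of the retained
# state sequence, which this report identifies as the failure cause.
with open("dataset.txt") as f:
    lines = f.readlines()

sampling_interval = 8  # illustrative value > 1
with open("dataset_sampled.txt", "w") as f:
    f.writelines(lines[::sampling_interval])
```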
Mixed-Precision Quantization:
- Method: Attempted to use the manual and automatic mixed-precision quantization features of `rknn-toolkit2` to keep precision-sensitive operators such as `Softmax` in `FP16` (a sketch of the two-step flow follows).
- Result: Manual mixing did not significantly improve accuracy and introduced new recognition errors. The model generated by automatic mixing produced runtime errors or garbled output.
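For reference, `rknn-toolkit2` exposes mixed precision through a two-step hybrid-quantization flow. A sketch under the assumption that the generated file names follow the tool's defaults; the edit between the two steps is manual:

```python
from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform="rv1126b", quantized_algorithm="normal")
rknn.load_onnx(model="encoder-fp32.onnx")
# Step 1 writes a *.quantization.cfg describing per-layer dtypes.
# proposal=True asks the tool to suggest layers to keep in float
# (the "automatic" mixing tried here); proposal=False leaves the
# selection (e.g. Softmax layers) to manual editing.
rknn.hybrid_quantization_step1(dataset="dataset.txt", proposal=False)
# ... hand-edit encoder-fp32.quantization.cfg, then rebuild:
rknn.hybrid_quantization_step2(
    model_input="encoder-fp32.model",
    data_input="encoder-fp32.data",
    model_quantization_cfg="encoder-fp32.quantization.cfg",
)
rknn.export_rknn("encoder-int8-hybrid.rknn")
rknn.release()
```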
4. Project Results & Conclusion
Final Deliverables:
- An `INT8` RKNN ASR model that runs stably on the RV1126B platform, achieving an RTF of 0.29 and completing the deployment from `FP32` to `INT8`.
- A validated calibration dataset generation script (`09_generate_calibration_dataset.py`) for streaming stateful models, which ensures the authenticity and continuity of state tensors by simulating the streaming inference process.
Engineering Conclusions:
- For this Zipformer model, the state authenticity and temporal continuity of the calibration data are the decisive factors for successful quantization, outweighing minor differences in feature extraction.
- Statistical quantization algorithms in `rknn-toolkit2`, such as `kl_divergence`, have memory limitations when processing large-scale, high-dimensional, continuous datasets.
- Given the current toolchain and model architecture, using a complete simulated streaming dataset with the `normal` quantization algorithm represents the optimal engineering solution, balancing the constraints of accuracy, performance, and conversion feasibility.