Part 3: Independent RKNN Model Conversion for Sherpa ASR on the RV1126BP Platform

[Project Retrospective Report V3.0] Independent Model Conversion: Overcoming the Technical Challenges of Converting Dynamic ONNX Models to RKNN (Final Version)

| Project Attribute | Details |
| --- | --- |
| Project Name | Part 3: Independent RKNN Model Conversion for Sherpa ASR on the RV1126BP Platform |
| Project Timeline | 2025-07-31 ~ 2025-08-09 (Estimated) |
| Project Duration | 31.5 hours (Approx. 3.9 person-days) |
| Review Date | 2025-08-20 |
| Core Personnel | Potter White |

1. Project Background and Kick-off

In the V2.0 project, we verified that the sherpa-onnx framework could not directly use NPU acceleration due to the outdated RKNN SDK version on the target platform. At the same time, the pure CPU solution with sherpa-ncnn had already reached its performance bottleneck. Therefore, to fully leverage the NPU hardware of the RV1126b platform, the core objective of this project (V3.0) was to bypass the compilation limitations of sherpa-onnx and independently complete the conversion from ONNX models to RKNN models.

The starting point for this work was to address the dynamic input dimension issue prevalent in Sherpa ASR models.

2. Foundational Problem Analysis and General Conversion Strategy

2.1 Understanding the Dynamic Dimension [N, ...]

In ONNX models, the dimension N typically serves as a placeholder for a dynamic batch size. For example, an input shape of [N, 39, 80] means the model can process N input samples at once. However, to achieve efficient computation on embedded NPUs, the RKNN toolchain generally requires the model's input shapes to be fully static.
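As a quick illustration (a minimal sketch using the onnx Python package; the file name and printed shape are placeholders taken from the example above), this is how a dynamic dimension shows up when inspecting a model programmatically:

```python
import onnx

model = onnx.load("encoder.onnx")   # hypothetical path

for inp in model.graph.input:
    # A dynamic dimension is stored as a symbolic name (dim_param),
    # e.g. "N"; a static one is stored as a number (dim_value).
    dims = [d.dim_param or d.dim_value
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)           # e.g. x ['N', 39, 80]
```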

2.2 Establishing a General Technical Workflow

Based on this analysis, I established a general three-step strategy for model conversion:

  1. Statically Reshape the ONNX Model: Write a script to fix all dynamic dimensions N in the original ONNX model to a specific value, typically 1, to indicate single-input processing.
  2. Extract Model Metadata: Analyze and extract the unique custom_string metadata from the sherpa-onnx model. This data contains essential parameters for inference, such as vocab_size, which need to be provided to the RKNN toolchain during conversion.
  3. Perform the Conversion: Use rknn-toolkit2 to convert the statically reshaped ONNX model.

This general workflow is effective for model components with relatively simple structures.
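Below is a minimal sketch of step 1, assuming the dynamic dimensions appear as symbolic dim_param entries on the model's inputs and outputs (internal tensors may additionally require shape re-inference, which this sketch omits):

```python
import onnx

def fix_dynamic_batch(src_path: str, dst_path: str, static_value: int = 1) -> None:
    """Replace every symbolic input/output dimension (e.g. 'N') with a
    fixed value and save the statically reshaped model."""
    model = onnx.load(src_path)
    for tensor in list(model.graph.input) + list(model.graph.output):
        for dim in tensor.type.tensor_type.shape.dim:
            if dim.dim_param:                 # symbolic dimension such as 'N'
                dim.dim_value = static_value  # assigning the oneof clears dim_param
    onnx.checker.check_model(model)           # sanity-check the edited graph
    onnx.save(model, dst_path)

fix_dynamic_batch("encoder.onnx", "encoder_fixed.onnx")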


3. Standard Conversion Process for encoder and joiner Models

The encoder and joiner models do not contain complex dynamic control flow operators internally. Therefore, they could strictly follow the general conversion strategy outlined above. The automated script I designed for this purpose executes as follows:

```mermaid
graph TD
    subgraph "User Action"
        A[Execute conversion script: ./convert.py encoder]
    end

    subgraph "Automated Script Execution Flow"
        B(1. Load original encoder.onnx) --> C{2. Check input dimensions};
        C -- Dynamic dimension 'N' found --> D[3. Change 'N' to 1];
        D --> E[4. Save as encoder_fixed.onnx];
        E --> F(5. Load original onnx to get metadata);
        F --> G[6. Construct custom_string];
        G & E --> H(7. Call rknn-toolkit2);
        H --> I[8. Generate encoder.rknn];
    end

    A --> B
    I --> J((Success))
```

Workflow Description:

  1. The script first loads the original encoder.onnx model.
  2. It programmatically inspects the dimensions of its input tensors to identify the dynamic dimension N.
  3. N is then modified to the static value 1, generating an intermediate _fixed.onnx file.
  4. Concurrently, the script reads the original model’s metadata using onnxruntime and formats it into the custom_string required by RKNN.
  5. Finally, the statically reshaped model and the metadata string are fed into rknn-toolkit2, successfully completing the conversion and generating the final encoder.rknn file.
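The following sketch illustrates steps 4 and 5 under two assumptions: that the sherpa-onnx metadata lives in the model's custom metadata map (readable via onnxruntime), and that rknn-toolkit2's config() accepts a custom_string argument for embedding it. The key/value format and the target_platform string are illustrative, not taken from the actual script:

```python
import onnxruntime as ort
from rknn.api import RKNN

# Step 4: read sherpa-onnx metadata (e.g. vocab_size) from the original model.
sess = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
meta = sess.get_modelmeta().custom_metadata_map                # dict of str -> str
custom_string = ";".join(f"{k}={v}" for k, v in meta.items())  # illustrative format

# Step 5: convert the statically reshaped model with rknn-toolkit2.
rknn = RKNN()
rknn.config(target_platform="rv1126b", custom_string=custom_string)
if rknn.load_onnx(model="encoder_fixed.onnx") != 0:
    raise RuntimeError("load_onnx failed")
if rknn.build(do_quantization=False) != 0:
    raise RuntimeError("build failed")
rknn.export_rknn("encoder.rknn")
rknn.release()
```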

The conversion process for the joiner model is identical to that of the encoder. However, when this standard process was applied to the decoder model, it resulted in a critical conversion failure.


4. Special Challenges and In-depth Debugging of the decoder Model

The decoder model contains internal dynamic control flow (an If operator whose condition depends on input shapes), which caused the standard conversion process to fail; it required additional, more complex preprocessing.

4.1 Failure of Initial Attempts and Problem Identification

I first attempted two direct methods. Both failed, but the failures helped me pinpoint the root cause of the problem.

  • Attempt 1 - Direct Conversion of Dynamic Model: rknn-toolkit2 failed immediately during the model loading phase, explicitly stating that it does not support the dynamic input dimension N.
  • Attempt 2 - Specifying Input Size During Conversion: By forcing the input size through a parameter in rknn.load_onnx, the model loaded successfully. However, it failed during the build phase (rknn.build) with the error: All outputs ['decoder_out'] of model are constants.
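For reference, Attempt 2 likely amounted to something like the following sketch (assuming input_size_list is the load_onnx parameter used to force a static shape; the input name and shape are illustrative):

```python
from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform="rv1126b")

# Forcing a static shape at load time lets the dynamic model load...
ret = rknn.load_onnx(
    model="decoder.onnx",
    inputs=["y"],                  # illustrative input name
    input_size_list=[[1, 2]],      # illustrative static shape
)
assert ret == 0                    # loading now succeeds

# ...but the build phase then fails with:
#   All outputs ['decoder_out'] of model are constants.
ret = rknn.build(do_quantization=False)
```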

4.2 The Core Problem: Logical Failure Caused by Constant Folding Optimization

To resolve the core error, “All outputs are constants,” I shifted my focus to preprocessing the ONNX model itself.

  • Action: I first executed the initial step of the standard process, fixing the dynamic input dimension N of decoder.onnx to 1 and generating decoder_fixed.onnx.
  • Observation: When I used this static model for conversion, it reproduced the exact same error as in “Attempt 2.”
  • Root Cause Analysis:
    1. Dynamic Control Flow: By analyzing the decoder model with the Netron visualization tool, I discovered an If operator inside it. The condition for this If operator depended on the shape of an intermediate tensor calculated by preceding Shape and Gather operators.
    2. Logic Solidification: In the original dynamic model, this shape was variable, making the path of the If branch non-deterministic. However, once I fixed the input dimension N to 1, this shape-dependent condition also became a constant (always True or always False).
    3. Over-Optimization: During its build process, the RKNN toolchain performs an optimization called “constant folding” (fold_constant). When it detected that the If operator’s condition was now constant, it “intelligently” pruned the branch that would never be executed. In the decoder model, this pruning triggered a chain reaction, causing an entire computation path from Shape_7 to Gemm_15 to be removed. Ultimately, this led to the model’s output, decoder_out, being incorrectly identified as a constant value independent of the input, thus throwing the “invalid model” error.
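A small diagnostic sketch (hypothetical file name) that reproduces programmatically what Netron showed visually, listing the If node and the Shape/Gather chain feeding its condition:

```python
import onnx

model = onnx.load("decoder_fixed.onnx")   # hypothetical path

# List the control-flow node and the shape-related nodes feeding it,
# mirroring what Netron shows visually.
for node in model.graph.node:
    if node.op_type in ("If", "Shape", "Gather"):
        print(node.op_type, node.name, "inputs:", list(node.input))
```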

4.3 The Final Solution: Introducing a Professional Model Simplification Tool

The root of the problem was how to handle the now-static If operator. After failed attempts at disabling optimizations and manually modifying the computation graph, I identified the final solution.

  • Final Strategy: Add a crucial “model simplification” step to the standard workflow. I introduced the professional Python library onnx-simplifier.
  • The Role of onnx-simplifier: This tool correctly performs constant folding. It automatically evaluates the If operator’s condition, safely prunes the static branch, and, most importantly, ensures that the resulting simplified ONNX model is topologically valid.
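A minimal sketch of this simplification step, using onnx-simplifier's Python API (installed as the onnxsim package):

```python
import onnx
from onnxsim import simplify   # pip install onnxsim

model = onnx.load("decoder_fixed.onnx")

# simplify() folds constants, evaluates the now-static If condition,
# prunes the dead branch, and returns a topologically valid graph.
model_simplified, check_ok = simplify(model)
assert check_ok, "simplified model failed onnxsim's consistency check"

onnx.save(model_simplified, "decoder_simplified.onnx")
```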

4.4 Final Conversion Workflow for the decoder Model

Based on the above analysis, I designed a specialized and more robust four-step conversion process for the decoder model and codified it into the automation script.

```mermaid
graph TD
    subgraph "User Action"
        A[Execute conversion script: ./convert.py decoder]
    end

    subgraph "Automated Script Execution Flow (Special Handling for Decoder)"
        B(1. Load original decoder.onnx) --> C{2. Check input dimensions};
        C -- Dynamic dimension 'N' found --> D[3. Change 'N' to 1];
        D --> E[4. Save as decoder_fixed.onnx];
        E --> F(5. **Call onnx-simplifier**);
        F --> G[6. **Prune static branches & reorder graph**];
        G --> H[7. Save as decoder_simplified.onnx];
        H --> I(8. Load original onnx to get metadata);
        I --> J[9. Construct custom_string];
        J & H --> K(10. Call rknn-toolkit2);
        K --> L[11. Generate decoder.rknn];
    end

    A --> B
    L --> M((Success))
```

Workflow Description: Compared to the standard process for encoder, the decoder's workflow adds a critical simplification stage (steps 5-7): using onnx-simplifier to deeply optimize the statically reshaped _fixed.onnx model. This stage not only safely removed the problematic If operator but also ensured that the output _simplified.onnx was a valid model with a complete, topologically correct computation graph. This "sanitized" model could then be converted by rknn-toolkit2 without any issues.
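Putting the pieces together, here is a condensed sketch of the decoder-specific pipeline; it reuses the hypothetical fix_dynamic_batch helper from Section 2 and the same illustrative rknn-toolkit2 settings as earlier:

```python
import onnx
from onnxsim import simplify
from rknn.api import RKNN

def convert_decoder(src: str = "decoder.onnx",
                    dst: str = "decoder.rknn",
                    custom_string: str = "") -> None:
    # Steps 1-4: fix the dynamic dimension 'N' to 1.
    fix_dynamic_batch(src, "decoder_fixed.onnx")        # helper from Section 2
    # Steps 5-7: fold constants and prune the now-static If branch.
    simplified, ok = simplify(onnx.load("decoder_fixed.onnx"))
    assert ok, "onnxsim consistency check failed"
    onnx.save(simplified, "decoder_simplified.onnx")
    # Steps 8-11: convert the sanitized model with rknn-toolkit2.
    rknn = RKNN()
    rknn.config(target_platform="rv1126b", custom_string=custom_string)
    assert rknn.load_onnx(model="decoder_simplified.onnx") == 0
    assert rknn.build(do_quantization=False) == 0
    rknn.export_rknn(dst)
    rknn.release()
```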


5. Project Outcomes and Conclusion

5.1 Key Achievements

  1. Successfully established a conversion pipeline for complex ONNX models with dynamic control flow to RKNN models.
  2. Created a differentiated, standardized model conversion process: a standard workflow for simple models and an enhanced workflow with an extra simplification step for complex models (like decoder).
  3. Produced a reusable, automated model conversion script (unified_onnx_to_rknn_converter.py). This script can intelligently identify the model type and automatically apply the corresponding conversion workflow, enabling “one-click” conversion for all model components.
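The dispatch logic of such a unified script can be as simple as keying off the model file name; a hypothetical sketch (the helper names mirror the earlier snippets, not the actual script):

```python
def convert(model_path: str) -> None:
    # The decoder needs the extra onnx-simplifier pass; encoder and
    # joiner follow the standard fix-dims -> metadata -> rknn flow.
    if "decoder" in model_path:
        convert_decoder(model_path)      # enhanced workflow (Section 4.4)
    else:
        convert_standard(model_path)     # standard workflow (Section 3)
```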

5.2 Conclusion

This technical effort demonstrates that when converting models with dynamic control flow (like If operators) for embedded platforms, the core challenge lies in bridging the gap between a dynamic model and static hardware requirements. Simply fixing input dimensions often breaks the internal computation logic of the model, causing the conversion tool’s optimization process to make incorrect judgments. In such cases, introducing a professional, validated model optimization tool (like onnx-simplifier) to preprocess and sanitize the model before feeding it into the hardware vendor’s toolchain is a more reliable and efficient solution than attempting to manually modify the computation graph or tweak conversion tool parameters.

The success of this phase lays a solid foundation for the subsequent development of a high-performance, NPU-based inference engine on the RV1126b platform.