Part 2: Technical Evaluation and Implementation of Sherpa ASR Deployment on the RV1126 Platform

[Project Retrospective Report]

Project Name: Part 2: Technical Evaluation and Implementation of Sherpa ASR Deployment on the RV1126 Platform
Project Timeline: 2025-07-26 ~ 2025-08-06 (Estimated)
Project Duration: 50.17 hours (Approx. 6.3 person-days)
Review Date: 2025-08-20
Core Personnel: Potter White

1. Project Background and Initial Goals

1.1 Project Kick-off Background

This project aimed to integrate a high-performance, real-time Automatic Speech Recognition (ASR) function into the company’s product line on the Rockchip RV1126b embedded platform. Given the success of the V1.0 project in deploying sherpa-onnx on the RK3588s platform and achieving excellent performance with NPU acceleration, the initial plan for this project was to replicate that successful path. The goal was to implement a cost-effective AI feature with solid performance on the lower-cost RV1126b platform.

1.2 Core Technical Goals

  1. Functional Goal: Successfully deploy the Sherpa ASR engine and integrate it as a stable and reliable service module.
  2. Performance Goal: Achieve a Real-Time Factor (RTF, the ratio of processing time to audio duration) significantly below 1.0 to ensure a smooth user experience for features like real-time subtitles.

1.3 Initial Technical Plan and Assumptions

Based on the experience from the previous project, I formulated a technical plan centered around the sherpa-onnx framework. The advantage of this approach was its ability to directly leverage the Rockchip platform’s RKNN toolchain for NPU hardware acceleration, which, in theory, represented the optimal performance solution. This plan was built on the following key assumptions:

  • Assumption 1: Toolchain Compatibility. I assumed that the sherpa-onnx source code would be compatible with the native cross-compilation toolchain (GCC 8.3) provided by the RV1126b platform.
  • Assumption 2: SDK API Compatibility. I assumed that the API version of the librknnrt runtime library on the RV1126b platform would meet the requirements of the sherpa-onnx framework for calling RKNN functions.

However, as the project progressed, both of these core assumptions were proven false, forcing a major pivot in the project’s technical direction.


2. Technical Challenges and Problem Analysis

2.1 Phase 1: Compilation and Adaptation of the sherpa-onnx (NPU-Accelerated) Solution

The goal of this phase was to establish a successful cross-compilation pipeline for sherpa-onnx on the RV1126b platform. However, I encountered a series of tightly coupled compatibility barriers, from the toolchain and core dependencies to the hardware driver libraries.

  • Challenge 1: Toolchain Version Conflict

    • Problem: The sherpa-onnx source code failed to compile, with errors pointing to the use of newer C++ syntax features.
    • Analysis & Decision: After analysis, I determined that building sherpa-onnx required a compiler version of GCC 9.0 or higher. However, the official SDK for the RV1126b platform provided a cross-compiler version of GCC 8.3. Considering the product had already entered mass production, any major upgrade to the base SDK toolchain could introduce system-level instability risks and create significant difficulties for maintaining existing products. Therefore, I concluded that upgrading the SDK was not a viable option and a solution had to be found within the existing toolchain.
  • Challenge 2: Manual Porting of Core Dependency onnxruntime

    • Background: Since upgrading the compiler was not an option, I decided to tackle the core dependency of sherpa-onnx, onnxruntime, by attempting to build a version that could compile under GCC 8.3.
    • 2.1 GLIBC Version Mismatch (bug-i): When trying to link against the pre-compiled libonnxruntime.so provided by the official sherpa-onnx repository, the linker reported an undefined reference to 'log2@GLIBC_2.29'. This clearly indicated that the pre-compiled library depended on a higher GLIBC version than was present in the RV1126b system environment, proving the necessity of compiling onnxruntime from source.
    • 2.2 Bypassing onnxruntime Build Preconditions (bug-ii, bug-iii): While compiling the onnxruntime source, its build system (CMake) explicitly checked for and required a GCC version higher than 9.0. I bypassed this restriction by modifying the CMake script to comment out the version check logic. Additionally, its dependency Eigen failed a hash check during automatic download. I resolved this by manually downloading it and placing it in the specified cache directory.
    • 2.3 Source-Level Defect on the ARMv7 Architecture (bug-iv): The most critical obstacle appeared in cpuid_info.cc in the onnxruntime source, the file responsible for CPU feature detection: the cpuinfo library it depends on failed to compile on ARMv7 due to an undefined function. Reviewing onnxruntime’s community issues, I found this was a known problem that had already been fixed in a newer, unreleased version. I backported the fix into the version I was compiling, introducing an alternative code path guarded by the preprocessor macro CPUINFO_SUPPORTED (see the guard-pattern sketch after this list). This finally allowed me to compile libonnxruntime.so for the target platform.
  • Challenge 3: Version Gap in RKNN SDK API

    • Background: After successfully building a compatible onnxruntime library, I returned to the sherpa-onnx compilation process and enabled RKNN support.
    • 3.1 Header File Path Configuration (bug-v): The first issue was that the rknn_api.h header file could not be found. By analyzing the official sherpa-onnx aarch64 build script, I discovered that external header paths were passed to the compiler via the CPLUS_INCLUDE_PATH environment variable, not through CMake arguments. I adopted this method and resolved the path issue.
    • 3.2 Fatal Missing APIs (bug-vi): Subsequently, a series of fatal compilation errors occurred, such as ‘rknn_custom_string’ does not name a type. Digging into the RKNN-related C++ source in sherpa-onnx, I found that it relies heavily on APIs introduced in newer versions of the RKNN SDK, such as rknn_custom_string and rknn_dup_context, used primarily for fetching model metadata and efficiently duplicating contexts for multi-threading. However, the librknnrt library shipped with the RV1126b platform was an older version whose rknn_api.h header contained no definitions for these types and functions at all (the version-query sketch after this list shows how to confirm what a board actually ships).
  • Phase 1 Final Conclusion: After this in-depth investigation, I concluded that the outdated RKNN API version shipped in the RV1126b platform’s official SDK created an irreconcilable compatibility gap with the sherpa-onnx framework. The original NPU acceleration plan was therefore not feasible under the current conditions.
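
To make the fix from Challenge 2 concrete, below is a minimal sketch of the preprocessor guard pattern behind the cpuid_info.cc backport. CPUINFO_SUPPORTED is the macro named above; the function name and its fallback value are illustrative placeholders, not the actual onnxruntime patch.

```cpp
// Minimal sketch of the guard pattern used in the backported fix.
// DetectArmNeon() and its fallback behavior are hypothetical; only the
// CPUINFO_SUPPORTED guard mirrors the upstream approach.
#ifdef CPUINFO_SUPPORTED
#include <cpuinfo.h>
#endif

bool DetectArmNeon() {
#ifdef CPUINFO_SUPPORTED
  // Normal path: delegate CPU feature detection to the cpuinfo library.
  return cpuinfo_initialize() && cpuinfo_has_arm_neon();
#else
  // Fallback path for builds (like this ARMv7 one) where cpuinfo does not
  // compile: return a conservative default instead of failing the build.
  return false;
#endif
}
```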
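For Challenge 3, a quick runtime check can log which RKNN SDK a board actually ships before committing to a framework that assumes newer symbols. The sketch below uses Rockchip’s rknn_query with RKNN_QUERY_SDK_VERSION; it assumes the five-argument rknn_init found in librknnrt headers (older rknn_api.h versions expose a four-argument form), and model loading is left out.

```cpp
// Sketch: print the RKNN runtime's API and driver versions.
// Assumes librknnrt's rknn_init signature; error handling is minimal.
#include <cstdio>
#include <vector>
#include "rknn_api.h"

bool print_rknn_sdk_version(std::vector<char>& model) {
  rknn_context ctx = 0;
  if (rknn_init(&ctx, model.data(),
                static_cast<uint32_t>(model.size()), 0, nullptr) != RKNN_SUCC) {
    std::fprintf(stderr, "rknn_init failed\n");
    return false;
  }
  rknn_sdk_version ver{};
  if (rknn_query(ctx, RKNN_QUERY_SDK_VERSION, &ver, sizeof(ver)) == RKNN_SUCC) {
    // On this platform the reported api_version predates rknn_custom_string
    // and rknn_dup_context, which is exactly the gap sherpa-onnx trips over.
    std::printf("api: %s, driver: %s\n", ver.api_version, ver.drv_version);
  }
  rknn_destroy(ctx);
  return true;
}
```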

2.2 Phase 2: Implementation and Performance Tuning of the sherpa-ncnn (CPU) Solution

After the NPU path was proven infeasible, I immediately shifted the project’s focus to the alternative solution: pure CPU inference with the sherpa-ncnn framework.

  • Challenge 4: Severe Discrepancy Between Build Artifact Performance and Expectations

    • Observation: The cross-compilation of sherpa-ncnn was very smooth. However, when I conducted performance tests on the target platform, the results were shocking: the RTF was far greater than 1.0, rendering it completely unusable. Even the official sherpa-ncnn-alsa demo program would immediately encounter an overrun (audio buffer overflow) error upon startup due to slow processing speed.
    • Analysis & Resolution: This issue brought the project to a standstill. I investigated multiple angles, including the models and algorithm configurations, but found no cause. Just as I was about to declare the CPU solution a failure, I re-examined the entire build process and recalled that, for debugging convenience early on, I had changed the CMake build type (-DCMAKE_BUILD_TYPE) from the default Release to Debug. Debug mode disables all compiler optimizations (-O0) and adds extensive debugging information, which is fatal for compute-intensive neural network inference. Switching back to Release (-O3) solved the performance problem instantly: the RTF dropped below 1.0 (see the measurement sketch after this list), proving that the sherpa-ncnn solution was viable from a performance standpoint.
  • Challenge 5: Stability Issues After Integration with Self-Developed Service (bug-vii)

    • Observation: After integrating the performant sherpa-ncnn dynamic library into my self-developed C/S service framework, the program crashed with a Floating point exception during the initialization of the ASR engine.
    • Analysis & Resolution: I debugged the crash dump with GDB. The call stack traced the exception to a low-level convolution implementation inside the NCNN library. Analyzing the parameters passed to that function, I found that my own code, lacking strict validation when reading the num_threads parameter from the configuration file, had passed an invalid negative value. That negative number caused a division-by-zero error inside NCNN during certain calculations (such as tile partitioning). I added boundary checks for num_threads in the configuration loading module to ensure it is a positive integer (see the sketch after this list), which completely resolved the crash.
  • Challenge 6: CPU Resource Underutilization and Performance Bottleneck Assessment (bug-viii)

    • Observation: During stress testing of the integrated service, I noticed that although the client was continuously sending audio data, the server’s CPU usage hovered between 50% and 60%, far from the near-100% utilization (all cores maxed out) achieved by the official command-line tool.
    • Analysis & Decision: I determined this was not an issue with the NCNN library itself, but with my C/S application-layer architecture. The service used a synchronous “receive-process-respond” model: after processing an audio chunk, the worker thread blocked on network I/O operations such as send() and the next recv(), so the CPU was never continuously saturated. Even so, once the service ran in parallel with the product’s other multimedia modules (like video streaming), total system CPU usage was already at its limit, leading to client voice-data backlogs and increased processing latency.
  • Phase 2 Final Conclusion: I confirmed that the sherpa-ncnn solution is functionally feasible on the RV1126b platform, but its performance has reached the physical limits of the platform’s CPU. In a real-world, multi-tasking product environment, pure CPU inference cannot provide sufficient performance headroom to guarantee the service’s real-time responsiveness and stability.
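
For reference, the RTF in these tests is simply processing time divided by audio duration. Below is a minimal measurement sketch; the decode callback is a hypothetical stand-in for the actual sherpa-ncnn decode call, not its real API.

```cpp
// Sketch: measure the Real-Time Factor (RTF) of one decode run.
// RTF = processing time / audio duration; RTF < 1.0 keeps up with real time.
#include <chrono>
#include <functional>
#include <vector>

double measure_rtf(const std::vector<float>& samples, int sample_rate,
                   const std::function<void(const std::vector<float>&)>& decode) {
  const double audio_seconds =
      static_cast<double>(samples.size()) / sample_rate;
  const auto t0 = std::chrono::steady_clock::now();
  decode(samples);  // run the full decode over the captured audio
  const auto t1 = std::chrono::steady_clock::now();
  const double elapsed = std::chrono::duration<double>(t1 - t0).count();
  return elapsed / audio_seconds;
}
```

Per the lesson from Challenge 4, such measurements are only meaningful on a Release (-O3) build.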
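And for Challenge 5, here is a minimal sketch of the kind of boundary check added to the configuration loader. The function name and clamping policy are illustrative, not the exact production code.

```cpp
// Sketch: clamp num_threads to a sane positive range before handing it to
// NCNN, since a non-positive value reached a division inside NCNN's
// convolution tiling and triggered the floating point exception.
#include <algorithm>
#include <cstdio>
#include <thread>

int sanitize_num_threads(int configured) {
  const int hw = static_cast<int>(std::thread::hardware_concurrency());
  const int max_threads = hw > 0 ? hw : 1;  // hardware_concurrency may be 0
  if (configured < 1 || configured > max_threads) {
    std::fprintf(stderr,
                 "config: num_threads=%d out of range, clamping to [1, %d]\n",
                 configured, max_threads);
    return std::clamp(configured, 1, max_threads);
  }
  return configured;
}
```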

3. Project Summary and Future Plans

3.1 Core Achievements

  1. Delivered a validated sherpa-ncnn ASR solution and clearly defined its performance boundaries on the RV1126b platform.
  2. Built a highly automated cross-compilation framework for embedded AI projects. This framework uses Git Submodules to manage upstream dependencies and employs automated scripts to solve a series of tedious engineering problems, including configuration, patching, compilation, and packaging, creating a valuable engineering asset for future similar projects.
  3. Produced a detailed feasibility analysis report for deployment on the RV1126b platform. It systematically explained the obstacles of the NPU solution and the performance bottlenecks of the CPU solution, providing critical data support and factual basis for the company’s future technology roadmap decisions.

3.2 Lessons Learned and Reflections

  1. The Importance of Technical Research: Before starting any porting project, conducting in-depth preliminary research on the target platform’s toolchain (GCC/GLIBC) and core dependency libraries (especially hardware-related SDKs) is a critical step to mitigate major technical risks and save development time.
  2. Rigor in Performance Testing: Performance evaluation must be conducted in Release build mode, consistent with the production environment. Compilation parameters have a decisive impact on the performance of compute-intensive applications.
  3. Systems Thinking: The final performance of a program depends not only on the core algorithm but also on factors like application-layer architecture and the I/O model. Performance bottlenecks need to be analyzed and identified from a holistic system perspective.

3.3 Next Steps

Based on the conclusions of this retrospective, the project will now enter Phase V3.0. I will shelve further optimization of the pure CPU solution, and the work focus will shift entirely to overcoming the challenges of the NPU acceleration solution. The specific plan is as follows:

  • Initiate Independent Model Conversion Work: Conduct in-depth research into the ONNX model structure and the RKNN toolchain to bypass the sherpa-onnx framework and directly perform ONNX to RKNN model conversion, quantization, and deployment.
  • Explore Lightweight Inference Engines: If sherpa-onnx still cannot drive the independently converted model on the platform’s older RKNN API, I will go a step further and develop a lightweight inference engine from scratch. This engine will call the native librknnrt APIs on the RV1126b platform directly to perform model inference, achieving maximum control over the hardware; a minimal sketch of that direct-call loop follows.
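
As a feasibility sketch for that last option, the skeleton below shows the minimal single-run loop such an engine would drive against librknnrt. The calls (rknn_inputs_set, rknn_run, rknn_outputs_get, rknn_outputs_release) are the standard RKNN C API; the tensor layout, shapes, and the surrounding encoder/decoder state handling are placeholders that the real engine would have to fill in.

```cpp
// Sketch: one inference pass through librknnrt, bypassing sherpa-onnx.
// Single-input/single-output assumed for brevity; layout is a placeholder.
#include <vector>
#include "rknn_api.h"

bool run_once(rknn_context ctx, std::vector<float>& features,
              std::vector<float>& logits) {
  rknn_input in{};
  in.index = 0;
  in.buf = features.data();
  in.size = static_cast<uint32_t>(features.size() * sizeof(float));
  in.type = RKNN_TENSOR_FLOAT32;
  in.fmt = RKNN_TENSOR_NHWC;  // placeholder layout
  if (rknn_inputs_set(ctx, 1, &in) != RKNN_SUCC) return false;

  if (rknn_run(ctx, nullptr) != RKNN_SUCC) return false;

  rknn_output out{};
  out.index = 0;
  out.want_float = 1;  // let the runtime dequantize the output for us
  if (rknn_outputs_get(ctx, 1, &out, nullptr) != RKNN_SUCC) return false;

  logits.assign(static_cast<float*>(out.buf),
                static_cast<float*>(out.buf) + out.size / sizeof(float));
  rknn_outputs_release(ctx, 1, &out);
  return true;
}
```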