[Project Retrospective Report] Technical Evaluation and Implementation of Sherpa ASR Deployment on the RV1126 Platform
| Project Attribute | Details |
|---|---|
| Project Name | Part 2: Technical Evaluation and Implementation of Sherpa ASR Deployment on the RV1126 Platform |
| Project Timeline | 2025-07-26 ~ 2025-08-06 (Estimated) |
| Project Duration | 50.17 hours (Approx. 6.3 person-days) |
| Review Date | 2025-08-20 |
| Core Personnel | Potter White |
1. Project Background and Initial Goals
1.1 Project Kick-off Background
This project aimed to integrate a high-performance, real-time Automatic Speech Recognition (ASR) function into the company’s product line on the Rockchip RV1126b embedded platform. Given the success of the V1.0 project in deploying sherpa-onnx on the RK3588s platform and achieving excellent performance with NPU acceleration, the initial plan for this project was to replicate that successful path. The goal was to implement a cost-effective AI feature with solid performance on the lower-cost RV1126b platform.
1.2 Core Technical Goals
- Functional Goal: Successfully deploy the Sherpa ASR engine and integrate it as a stable and reliable service module.
- Performance Goal: Achieve a Real-Time Factor (RTF) significantly below 1.0 to ensure a smooth user experience for features like real-time subtitles.
1.3 Initial Technical Plan and Assumptions
Based on the experience from the previous project, I formulated a technical plan centered around the sherpa-onnx framework. The advantage of this approach was its ability to directly leverage the Rockchip platform’s RKNN toolchain for NPU hardware acceleration, which, in theory, represented the optimal performance solution. This plan was built on the following key assumptions:
- Assumption 1: Toolchain Compatibility. I assumed that the `sherpa-onnx` source code would be compatible with the native cross-compilation toolchain (GCC 8.3) provided by the RV1126b platform.
- Assumption 2: SDK API Compatibility. I assumed that the API version of the `librknnrt` runtime library on the RV1126b platform would meet the requirements of the `sherpa-onnx` framework for calling RKNN functions.
However, as the project progressed, both of these core assumptions were proven false, forcing a major pivot in the project’s technical direction.
2. Technical Challenges and Problem Analysis
2.1 Phase 1: Exploration of sherpa-onnx (NPU-accelerated) Solution Compilation and Adaptation
The goal of this phase was to establish a successful cross-compilation pipeline for sherpa-onnx on the RV1126b platform. However, I encountered a series of tightly coupled compatibility barriers, from the toolchain and core dependencies to the hardware driver libraries.
Challenge 1: Toolchain Version Conflict
- Problem: The `sherpa-onnx` source code failed to compile, with errors pointing to the use of newer C++ syntax features.
- Analysis & Decision: After analysis, I determined that building `sherpa-onnx` required a compiler version of GCC 9.0 or higher. However, the official SDK for the RV1126b platform provided a cross-compiler version of GCC 8.3. Considering the product had already entered mass production, any major upgrade to the base SDK toolchain could introduce system-level instability risks and create significant difficulties for maintaining existing products. Therefore, I concluded that upgrading the SDK was not a viable option and a solution had to be found within the existing toolchain.
Challenge 2: Manual Porting of Core Dependency `onnxruntime`
- Background: Since upgrading the compiler was not an option, I decided to tackle the core dependency of `sherpa-onnx`, `onnxruntime`, by attempting to build a version that could compile under GCC 8.3.
- 2.1 GLIBC Version Mismatch (`bug-i`): When trying to link against the pre-compiled `libonnxruntime.so` provided by the official `sherpa-onnx` repository, the linker reported an `undefined reference to 'log2@GLIBC_2.29'`. This clearly indicated that the pre-compiled library depended on a higher GLIBC version than was present in the RV1126b system environment, proving the necessity of compiling `onnxruntime` from source.
- 2.2 Bypassing `onnxruntime` Build Preconditions (`bug-ii`, `bug-iii`): While compiling the `onnxruntime` source, its build system (CMake) explicitly checked for and required a GCC version higher than 9.0. I bypassed this restriction by modifying the CMake script to comment out the version check logic. Additionally, its dependency `Eigen` failed a hash check during automatic download. I resolved this by manually downloading it and placing it in the specified cache directory.
- 2.3 Source-Level Defect on the ARMv7 Architecture (`bug-iv`): The most critical obstacle appeared in the `cpuid_info.cc` file within the `onnxruntime` source, which is responsible for CPU feature detection. The `cpuinfo` library it depends on had a compilation error due to an undefined function on the ARMv7 architecture. By reviewing `onnxruntime`'s community issues, I found this was a known problem that had been fixed in a newer, unreleased version. I backported the code snippet from the fix into the version I was compiling, introducing an alternative code path via the preprocessor macro `#ifdef CPUINFO_SUPPORTED`. This finally allowed me to successfully compile `libonnxruntime.so` for the target platform.
Challenge 3: Version Gap in RKNN SDK API
- Background: After successfully building a compatible `onnxruntime` library, I returned to the `sherpa-onnx` compilation process and enabled RKNN support.
- 3.1 Header File Path Configuration (`bug-v`): The first issue was that the `rknn_api.h` header file could not be found. By analyzing the official `sherpa-onnx` aarch64 build script, I discovered that external header paths were passed to the compiler via the `CPLUS_INCLUDE_PATH` environment variable, not through CMake arguments. I adopted this method and resolved the path issue.
- 3.2 Fatal Missing APIs (`bug-vi`): Subsequently, a series of fatal compilation errors occurred, such as `'rknn_custom_string' does not name a type`. I delved into the RKNN-related C++ source code in `sherpa-onnx` and found that it relies heavily on APIs introduced in newer versions of the RKNN SDK, such as `rknn_custom_string` and `rknn_dup_context`. These APIs are primarily used for fetching model metadata and for efficiently duplicating contexts for multi-threading. However, the `librknnrt` library provided by the RV1126b platform was an older version, and its header file `rknn_api.h` did not contain definitions for these types and functions at all.
Phase 1 Final Conclusion: After in-depth technical investigation, I concluded that, due to the outdated RKNN API version provided by the official SDK for the RV1126b platform, there is an irreconcilable compatibility gap with the `sherpa-onnx` framework. Therefore, the original NPU acceleration plan is not feasible under the current conditions.
2.2 Phase 2: Implementation and Performance Tuning of sherpa-ncnn (CPU) Solution
After the NPU path was proven unfeasible, I immediately shifted the project’s focus to the alternative solution: using the sherpa-ncnn framework for pure CPU inference.
Challenge 4: Severe Discrepancy Between Build Artifact Performance and Expectations
- Observation: The cross-compilation of `sherpa-ncnn` was very smooth. However, when I conducted performance tests on the target platform, the results were shocking: the RTF was far greater than 1.0, rendering it completely unusable. Even the official `sherpa-ncnn-alsa` demo program would immediately encounter an `overrun` (audio buffer overflow) error upon startup due to slow processing speed.
- Analysis & Resolution: This issue brought the project to a standstill. I investigated multiple angles, including models and algorithm configurations, but found no cause. Just as I was about to declare the CPU solution a failure, I re-examined the entire compilation process. I recalled that, for debugging convenience early on, I had changed the CMake build type (`-DCMAKE_BUILD_TYPE`) from the default `Release` to `Debug`. `Debug` mode disables all compiler optimizations (`-O0`) and adds extensive debugging information, which is a fatal performance killer for compute-intensive neural network inference tasks. When I switched the build type back to `Release` (`-O3`), the performance problem was instantly solved. The RTF successfully dropped below 1.0, proving that the `sherpa-ncnn` solution was viable from a performance standpoint.
Challenge 5: Stability Issues After Integration with the Self-Developed Service (`bug-vii`)
- Observation: After integrating the performant `sherpa-ncnn` dynamic library into my self-developed C/S service framework, the program crashed with a `Floating point exception` during the initialization of the ASR engine.
- Analysis & Resolution: I used GDB to debug the crash dump. The call stack traced the exception to a low-level convolution algorithm implementation within the NCNN library. By analyzing the parameters passed to this function, I discovered that my self-developed code, due to a lack of strict validation when reading the `num_threads` parameter from the configuration file, passed an invalid negative value. This negative number caused a division-by-zero error inside NCNN during certain calculations (like tile partitioning). I added boundary checks for the `num_threads` parameter in the configuration loading module to ensure it was a positive integer, which completely resolved the crash.
Challenge 6: CPU Resource Underutilization and Performance Bottleneck Assessment (`bug-viii`)
- Observation: During stress testing of the integrated service, I noticed that although the client was continuously sending audio data, the server's CPU usage hovered between 50% and 60%, far from the near-100% utilization (all cores maxed out) achieved by the official command-line tool.
- Analysis & Decision: I determined this was not an issue with the NCNN library itself, but rather with my C/S application-layer architecture. My service used a synchronous "receive-process-respond" model, meaning the worker thread would block on network I/O operations like `send()` and the next `recv()` after processing an audio chunk. This waiting time prevented the CPU from being continuously utilized. Despite this, when running in parallel with the product's other multimedia modules (like video streaming), the system's total CPU usage was already at its limit, leading to client voice data backlogs and increased processing latency.
- Phase 2 Final Conclusion: I confirmed that the `sherpa-ncnn` solution is functionally feasible on the RV1126b platform, but its performance has reached the physical limits of the platform's CPU. In a real-world, multi-tasking product environment, pure CPU inference cannot provide sufficient performance headroom to guarantee the service's real-time responsiveness and stability.
3. Project Summary and Future Plans
3.1 Core Achievements
- Delivered a validated `sherpa-ncnn` ASR solution and clearly defined its performance boundaries on the RV1126b platform.
- Built a highly automated cross-compilation framework for embedded AI projects. This framework uses Git submodules to manage upstream dependencies and employs automated scripts to solve a series of tedious engineering problems, including configuration, patching, compilation, and packaging, creating a valuable engineering asset for future similar projects.
- Produced a detailed feasibility analysis report for deployment on the RV1126b platform. It systematically explained the obstacles of the NPU solution and the performance bottlenecks of the CPU solution, providing critical data support and factual basis for the company’s future technology roadmap decisions.
3.2 Lessons Learned and Reflections
- The Importance of Technical Research: Before starting any porting project, conducting in-depth preliminary research on the target platform’s toolchain (GCC/GLIBC) and core dependency libraries (especially hardware-related SDKs) is a critical step to mitigate major technical risks and save development time.
- Rigor in Performance Testing: Performance evaluation must be conducted in `Release` build mode, consistent with the production environment. Compilation parameters have a decisive impact on the performance of compute-intensive applications.
- Systems Thinking: The final performance of a program depends not only on the core algorithm but also on factors like application-layer architecture and the I/O model. Performance bottlenecks need to be analyzed and identified from a holistic system perspective.
3.3 Next Steps
Based on the conclusions of this retrospective, the project will now enter Phase V3.0. I will shelve further optimization of the pure CPU solution, and the work focus will shift entirely to overcoming the challenges of the NPU acceleration solution. The specific plan is as follows:
- Initiate Independent Model Conversion Work: Conduct in-depth research into the ONNX model structure and the RKNN toolchain in order to bypass the `sherpa-onnx` framework and directly perform ONNX-to-RKNN model conversion, quantization, and deployment.
- Explore a Lightweight Inference Engine: If the independently converted model still cannot be driven through `sherpa-onnx` given the platform's older RKNN API, I will further explore developing a lightweight inference engine from scratch. This engine would directly call the native `librknnrt` library APIs on the RV1126b platform to perform model inference, achieving maximum control over the hardware.