Deploying Sherpa (Next Gen Kaldi) on Various Rockchip SoCs
Project Introduction
Multiple company products required on-device model deployment for offline Automatic Speech Recognition (ASR). After a comparison with competing technologies of the time (such as Whisper), Sherpa was found to better meet the requirements for rapid prototyping.
The inference engines for Sherpa are open-source, support multi-language inference, and have demonstrated superior performance metrics on various tested platforms. Consequently, the final decision was to adopt a solution based on the secondary development of Sherpa.
My role was a Model Deployment Engineer, with my core responsibilities covering the end-to-end iterative development cycle: solution design, implementation, self-testing, and optimization.
This portfolio showcases the deployment and optimization results I achieved on all SoCs used at the company.
For some modules, instructional and standard operating procedure (SOP) documents were produced during development. These were intended to help me get up to speed faster when facing a completely new development paradigm like AI, and to further internalize the more difficult concepts.
Project List
Chapter 1: System Architecture and Integration — Building a High-Performance sherpa-onnx Service on the RK3588s
- Core Question: “How do you build a complete, stable inference system for a production environment starting from an open-source Sherpa-ONNX SDK?”
- Synopsis: This chapter focuses on the V1.0 phase of the project, targeting the RK3588s platform. It details how, centered around a core AI SDK, I designed a client/server architecture from scratch, encapsulated core functionalities into a modular, reusable library, and addressed key engineering challenges in high-concurrency services, such as thread safety, resource management, and graceful shutdown.
- Read Project Details »
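The thread-safety, resource-management, and graceful-shutdown concerns named above can be sketched with a minimal worker-pool pattern. This is a hedged toy illustration in Python, not the project's actual C++ code: `AsrWorkerPool`, `recognize`, and the sentinel-based shutdown are hypothetical names standing in for the real sherpa-onnx service internals.

```python
import queue
import threading

class AsrWorkerPool:
    """Toy sketch of the server-side pattern described above: a fixed pool
    of worker threads pulls jobs from a bounded queue, so shared model
    resources are never hit by unbounded concurrency, and a sentinel-based
    shutdown lets in-flight jobs finish cleanly (graceful shutdown)."""

    _STOP = object()  # sentinel object that tells one worker to exit

    def __init__(self, recognize, num_workers=4, max_pending=32):
        self._recognize = recognize                     # stands in for a decode call
        self._jobs = queue.Queue(maxsize=max_pending)   # bounded: back-pressure
        self._workers = [threading.Thread(target=self._run, daemon=True)
                         for _ in range(num_workers)]
        for w in self._workers:
            w.start()

    def submit(self, audio, on_result):
        """Enqueue one utterance; blocks if the server is saturated."""
        self._jobs.put((audio, on_result))

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is self._STOP:
                break                                   # clean exit for this worker
            audio, on_result = job
            on_result(self._recognize(audio))

    def shutdown(self):
        """Graceful shutdown: queued jobs drain first (FIFO), then each
        worker receives one sentinel and the caller joins them all."""
        for _ in self._workers:
            self._jobs.put(self._STOP)
        for w in self._workers:
            w.join()
```

The bounded queue is the key design choice: it converts overload into back-pressure on callers instead of unbounded memory growth inside the service.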
Chapter 2: Platform Porting and Troubleshooting — Adapting to the RV1126 Platform and Implementing the sherpa-ncnn Solution
- Core Question: “Given the limited documentation and fewer available models for Sherpa-NCNN compared to Sherpa-ONNX, how can ASR be successfully implemented on the RV1126?”
- Synopsis: This chapter documents the complete debugging process of the project’s V2.0 phase. When porting the system to the RV1126 platform, the original NPU-based plan (`sherpa-onnx`) was halted due to SDK version incompatibility. The report provides a detailed retrospective on pivoting to a CPU-only solution (`sherpa-ncnn`), overcoming a series of cross-compilation challenges, and identifying and resolving a critical performance issue caused by the `Debug` build configuration.
- Read Project Details »
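The performance issue above was quantified with the real-time factor (RTF): wall-clock decode time divided by audio duration. A minimal sketch of that measurement, with `decode_fn` as a hypothetical stand-in for the actual sherpa-ncnn decode call:

```python
import time

def real_time_factor(decode_fn, audio, sample_rate=16000):
    """RTF = decode wall-clock time / audio duration.
    RTF < 1.0 means the recognizer keeps up with real time; comparing
    the RTF of a Debug build against a Release build is one way to
    quantify the kind of slowdown described in this chapter."""
    duration_s = len(audio) / sample_rate
    start = time.perf_counter()
    decode_fn(audio)
    elapsed_s = time.perf_counter() - start
    return elapsed_s / duration_s

# Example: a decoder that takes 10 ms on one second of audio → RTF ≈ 0.01.
audio = [0.0] * 16000  # one second of silence at 16 kHz
rtf = real_time_factor(lambda a: time.sleep(0.01), audio)
```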
Chapter 3: Independent Model Conversion — Achieving NPU Acceleration for the RV1126B
- Core Question: “The RV1126B is a new SoC for which Sherpa has not officially released RKNN models. How can the models be converted to RKNN, and does NPU inference beat the CPU on real-time factor (RTF)?”
- Synopsis: As the CPU performance on the RV1126 was confirmed to be insufficient in V2.0, project V3.0 directly tackled an NPU-based solution for the RV1126B platform, which features a new-generation NPU. This chapter provides a deep dive into the root cause of the RKNN conversion failure for the `decoder` model: the presence of dynamic control flow operators. It showcases how, through systematic technical exploration, an effective workflow was established and implemented: “Fixed Shape -> Professional Simplification with `onnx-simplifier` -> RKNN Conversion,” successfully achieving model conversion for this platform.
- Read Project Details »
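The first step of that workflow, pinning every dynamic dimension to a fixed value before simplification and RKNN conversion, can be illustrated with a small pure-Python sketch. This is a toy: `fix_dynamic_shape` and its inputs are hypothetical, standing in for the shape edits applied when re-exporting the ONNX model for a toolchain that only accepts fully static shapes.

```python
def fix_dynamic_shape(shape, overrides):
    """Replace symbolic/dynamic dims (e.g. 'N', 'T', -1, None) with fixed
    integers. Converters that cannot handle dynamic control flow need
    every axis to be static before the model is simplified (e.g. with
    onnx-simplifier) and handed to the RKNN conversion step."""
    fixed = []
    for dim in shape:
        if isinstance(dim, int) and dim > 0:
            fixed.append(dim)                 # already static, keep as-is
        else:
            if dim not in overrides:
                raise ValueError(f"no override for dynamic dim {dim!r}")
            fixed.append(overrides[dim])      # pin the dynamic axis
    return fixed

# e.g. a decoder input declared as (batch, 512, time):
static_shape = fix_dynamic_shape(["N", 512, "T"], {"N": 1, "T": 64})
```

Once every input shape is static, shape-dependent branches become constant-foldable, which is why the simplification pass can then strip the control flow operators that blocked conversion.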
Chapter 5: End-to-End INT8 Quantization and Deployment of a Streaming ASR Model (Zipformer) on an Embedded Platform (RV1126B)
- Core Question: “Can the RKNN performance on the RV1126B be squeezed further? How do you find a suitable calibration dataset, perform quantization, and validate the resulting inference accuracy?”
- Synopsis: This stage was the second step in enhancing the Zipformer model’s performance. It involved converting a streaming Zipformer ASR ONNX model from `FP32` precision to an `INT8`-precision RKNN model, and completing its deployment and validation within the `sherpa-onnx` C++ inference framework to meet the operational requirements of low latency and low power consumption.
- Read Project Details »
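What the `FP32` -> `INT8` conversion does to each tensor can be illustrated with per-tensor symmetric quantization, one common INT8 scheme (the exact scheme RKNN applies is toolchain-dependent and not claimed here). This is a toy sketch; the real pipeline additionally needs a calibration dataset to pick ranges for activations, which is what the chapter's "suitable dataset" question is about.

```python
def quantize_int8(values):
    """Per-tensor symmetric INT8 quantization:
    scale = max(|x|) / 127, q = round(x / scale), clamped to [-127, 127].
    Returns the integer codes and the scale needed to map them back."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return [0] * len(values), 1.0       # all-zero tensor: nothing to encode
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map INT8 codes back to floats; the gap between these and the
    original values is the quantization error that accuracy validation
    (e.g. WER on a test set) must confirm is acceptable."""
    return [x * scale for x in q]
```

Eight-bit codes quarter the memory traffic relative to FP32, which is where the latency and power savings targeted in this chapter come from.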