sherpa-onnx: Offline Speech Recognition, TTS, and VAD Without the Cloud

May 6. Published in Voice & Speech Tools

Built on next-generation Kaldi and ONNX Runtime, sherpa-onnx is a speech processing toolkit. It supports streaming and non-streaming ASR, TTS, speaker diarization, speech enhancement, and voice activity detection (VAD). All processing runs locally, so audio stays on the device and everything works without an internet connection.

Speech Processing Capabilities

  • Audio tasks: Speech recognition (ASR), speech synthesis (TTS), keyword spotting, and audio tagging.
  • Speaker analysis: Diarization, identification, and verification.
  • Additional utilities: Language identification, automatic punctuation insertion, and source separation (similar to Spleeter or UVR).

Supported Platforms

The toolkit runs on the following operating systems and architectures:

  • Operating Systems: Android, iOS, Windows, macOS, Linux, HarmonyOS
  • Architectures: x64, x86, arm64, arm32, riscv64

The official documentation provides a compatibility matrix detailing supported combinations. sherpa-onnx also extends to embedded hardware, including Raspberry Pi, NVIDIA Jetson, RV1126, and LicheePi4A. Additionally, it supports WebAssembly for in-browser execution.
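To look up a machine in that matrix, you first need its OS and architecture in the same vocabulary the lists above use. The sketch below is one way to do that with Python's standard `platform` module. The alias tables and function names are assumptions made for illustration and are not part of sherpa-onnx. Only desktop operating systems are mapped, and a recognized name does not by itself mean the combination is supported.

```python
import platform

# Names exactly as listed in this article.
SUPPORTED_OS = {"Android", "iOS", "Windows", "macOS", "Linux", "HarmonyOS"}
SUPPORTED_ARCH = {"x64", "x86", "arm64", "arm32", "riscv64"}

# Hypothetical alias tables: common platform.machine() / platform.system()
# values mapped onto the article's names. Extend as needed.
_ARCH_ALIASES = {
    "x86_64": "x64", "amd64": "x64",
    "i386": "x86", "i686": "x86",
    "aarch64": "arm64", "arm64": "arm64",
    "armv7l": "arm32", "armv6l": "arm32",
    "riscv64": "riscv64",
}
_OS_ALIASES = {"Darwin": "macOS", "Linux": "Linux", "Windows": "Windows"}


def normalize_arch(machine):
    """Return the article's architecture name, or None if unrecognized."""
    return _ARCH_ALIASES.get(machine.lower())


def normalize_os(system):
    """Return the article's OS name, or None if unrecognized."""
    return _OS_ALIASES.get(system)


def describe_host():
    """Return (os_name, arch_name) for the machine running this script."""
    return normalize_os(platform.system()), normalize_arch(platform.machine())


if __name__ == "__main__":
    os_name, arch = describe_host()
    print(f"Look up this pair in the compatibility matrix: {os_name} / {arch}")
```

The result is only a lookup key. Whether a given pair is actually supported still has to be confirmed against the official matrix.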

Supported Programming Languages

Developers can integrate sherpa-onnx into their projects using 12 different languages: C++, C, Python, JavaScript, Java, C#, Kotlin, Swift, Go, Dart, Rust, and Pascal (Object Pascal).

Code Repository Structure

The repository includes comprehensive examples and build scripts for each target platform:

  • android/: Android implementation details.
  • ios-swift/: Samples for iOS development using Swift.
  • python-api-examples/: Python-based demonstrations.
  • wasm/: WebAssembly support for web integration.

Each build script compiles the library for one target. For example, build-android-arm64-v8a.sh targets 64-bit ARM Android devices, and build-wasm-simd-asr.sh builds the ASR component for WebAssembly with SIMD enabled.

Pre-trained Models

A wide range of ready-to-use models is available for diverse languages and deployment scenarios:

ASR Models

  • Streaming: Zipformer (Chinese-English bilingual) and lightweight models optimized for Chinese and English (capable of running on Cortex A7 CPUs).
  • Non-streaming: Whisper (tiny.en), Paraformer (multilingual), and Telespeech CTC (specialized for Chinese dialects).
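The difference between the two modes is easiest to see in code. The sketch below assumes the `sherpa_onnx` Python package (installed with `pip install sherpa-onnx`) and NumPy, and uses recognizer calls as I understand them from the project's Python examples. Check the official documentation for the exact signatures. All model file paths are placeholders supplied by the caller, and the helper names are made up for this illustration.

```python
import array
import sys
import wave


def read_wave_mono16(path):
    """Read a 16-bit mono WAV file; return (sample_rate, floats in [-1, 1))."""
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        sample_rate = f.getframerate()
        pcm = array.array("h")
        pcm.frombytes(f.readframes(f.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    return sample_rate, [s / 32768.0 for s in pcm]


def split_into_chunks(samples, chunk_size):
    """Split samples into consecutive chunks; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def transcribe_offline(wav_path, encoder, decoder, tokens):
    """Non-streaming: give the recognizer the whole utterance, decode once."""
    import numpy as np
    import sherpa_onnx  # third-party; imported lazily

    sample_rate, samples = read_wave_mono16(wav_path)
    recognizer = sherpa_onnx.OfflineRecognizer.from_whisper(
        encoder=encoder, decoder=decoder, tokens=tokens
    )
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, np.array(samples, dtype=np.float32))
    recognizer.decode_stream(stream)
    return stream.result.text


def transcribe_streaming(wav_path, encoder, decoder, joiner, tokens, chunk_ms=100):
    """Streaming: feed short chunks and read partial results along the way."""
    import numpy as np
    import sherpa_onnx  # third-party; imported lazily

    sample_rate, samples = read_wave_mono16(wav_path)
    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens=tokens, encoder=encoder, decoder=decoder, joiner=joiner
    )
    stream = recognizer.create_stream()
    chunk_size = max(1, sample_rate * chunk_ms // 1000)
    partials = []
    for chunk in split_into_chunks(samples, chunk_size):
        stream.accept_waveform(sample_rate, np.array(chunk, dtype=np.float32))
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        partials.append(recognizer.get_result(stream))  # text so far
    stream.input_finished()  # signal that no more audio will arrive
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream), partials
```

In a live application the chunks would come from a microphone callback rather than a file, which is where the streaming variant pays off: partial text is available while the speaker is still talking.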

Other Models

The ecosystem includes models for VAD, speech enhancement (GTCRN), and source separation (Spleeter). Users can download these from the official links, with several models offering support for multiple languages and regional dialects.

Mobile and Desktop Applications

  • Android APK: A prebuilt package featuring ASR, TTS, and VAD. A download mirror hosted in China is available for users there.
  • Flutter App: A cross-platform application for real-time ASR. Note that the iOS version requires manual compilation.
  • Lazarus App: A desktop subtitle generator designed for video editing workflows.

Integration Projects

  • Open-LLM-VTuber: A cross-platform, local voice-driven virtual streamer solution.
  • voiceapi: A streaming ASR and TTS service built on FastAPI.
  • TMSpeech: A C#-based real-time captioning tool with a GUI, designed for generating live captions during Tencent Meetings.
  • Unity Integration: The sherpa-onnx-unity project brings advanced speech processing capabilities directly into the Unity engine.

For comprehensive implementation guides and technical details, visit the official documentation: https://k2-fsa.github.io/sherpa/onnx/