Qwen3-ASR-Studio: Real-Time Voice Recognition with PiP Mode

September 12, published in Voice & Speech Tools

Qwen3-ASR-Studio is a high-performance web application designed as a streamlined interface for Alibaba Cloud’s Qwen ASR model. Its primary objective is simple: to convert speech into text with minimal friction.

The application supports direct uploads of various audio formats, including WAV, MP3, FLAC, and M4A, as well as live microphone recording. During recording, a real-time waveform visualizer provides immediate feedback, and the captured audio is then passed to the Qwen ASR model for transcription.
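A waveform visualizer of this kind usually reduces each buffer of audio samples to a handful of bar heights before drawing. The sketch below shows one way to do that, under assumptions: the function name `waveformBars` is hypothetical, and the source does not describe how the app actually renders its waveform. In a browser, the samples would typically come from the Web Audio API's `AnalyserNode.getFloatTimeDomainData`.

```typescript
// Hypothetical helper (not taken from the Qwen3-ASR-Studio source):
// reduce time-domain samples in the range -1..1 to `barCount` peak values.
function waveformBars(samples: Float32Array, barCount: number): number[] {
  if (barCount <= 0 || samples.length === 0) return [];

  const bars: number[] = [];
  const step = samples.length / barCount; // samples covered by one bar

  for (let i = 0; i < barCount; i++) {
    const start = Math.floor(i * step);
    // Each bar covers at least one sample, even when barCount > samples.length.
    const end = Math.max(start + 1, Math.floor((i + 1) * step));

    let peak = 0;
    for (let j = start; j < end && j < samples.length; j++) {
      const magnitude = Math.abs(samples[j]);
      if (magnitude > peak) peak = magnitude;
    }
    bars.push(peak);
  }
  return bars;
}

// Example: four samples reduced to two bars.
const demoBars = waveformBars(Float32Array.from([0, 0.5, -1, 0.25]), 2);
console.log(demoBars);
```

Each returned value can be scaled to a pixel height and drawn on a canvas once per animation frame. Using the peak rather than the average keeps short, loud sounds visible.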

To improve accuracy in specialized domains, users can add context hints such as specific names or technical terminology. The app automatically detects multiple languages, including Chinese, English, and Japanese. Furthermore, enabling Inverse Text Normalization (ITN) allows the system to convert spoken phrases like "January fifth" into concise written formats like "Jan 5."

For a more efficient workflow, users can hold the spacebar to record and release it to stop and initiate transcription. To reduce wait times on slower internet connections, audio files are compressed locally on the user's machine before being processed.
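The hold-to-record behavior can be modeled as a small state machine. The following is a minimal sketch under assumptions: the names `pushToTalk`, `RecorderState`, and `KeyInput` are hypothetical and not taken from the app's code. The `KeyInput` fields mirror the `type`, `code`, and `repeat` properties of the browser's `KeyboardEvent`.

```typescript
type RecorderState = "idle" | "recording";
type RecorderAction = "start" | "stop" | null;

// Hypothetical input shape mirroring a subset of KeyboardEvent.
interface KeyInput {
  type: "keydown" | "keyup";
  code: string;
  repeat: boolean;
}

// Decide what the recorder should do for one key event.
function pushToTalk(
  state: RecorderState,
  key: KeyInput
): { state: RecorderState; action: RecorderAction } {
  // Only the spacebar controls recording.
  if (key.code !== "Space") return { state, action: null };

  // First physical press starts recording; auto-repeat keydowns are ignored.
  if (key.type === "keydown" && !key.repeat && state === "idle") {
    return { state: "recording", action: "start" };
  }

  // Releasing the key stops recording, which would then trigger transcription.
  if (key.type === "keyup" && state === "recording") {
    return { state: "idle", action: "stop" };
  }

  return { state, action: null };
}
```

Keeping the decision logic separate from the event listeners makes it easy to test. Ignoring `repeat` events matters because browsers fire `keydown` repeatedly while a key is held.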

The most distinctive feature is Picture-in-Picture (PiP) mode. This creates a floating window that stays on top of other applications. When you speak, the transcribed text can be sent directly into any active text field, effectively serving as a global voice input method across your system.
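One plausible way for a web app to open such a floating window is the Document Picture-in-Picture API, which is currently available in Chromium-based browsers. The source does not say which API the app uses, so the sketch below is an assumption, and `openPipWindow` and `PipApi` are hypothetical names. It feature-detects the API and returns `null` where it is unavailable.

```typescript
// Minimal local typing for the parts of the API used here, declared
// by hand so the example does not depend on DOM type definitions.
interface PipApi {
  requestWindow(options?: { width?: number; height?: number }): Promise<unknown>;
}

// Hypothetical helper: open an always-on-top window, or resolve to null
// when the browser (or runtime) does not expose the API.
async function openPipWindow(width = 360, height = 200): Promise<unknown> {
  const api = (globalThis as unknown as { documentPictureInPicture?: PipApi })
    .documentPictureInPicture;
  if (!api) return null;

  // In browsers this must be called from a user gesture such as a click.
  return api.requestWindow({ width, height });
}
```

In a supporting browser the resolved value is a `Window` object whose `document` can host the recording controls and the transcript view.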

Two distinct editing modes are available. Single-pass mode is designed for processing one audio file at a time to maintain a clean workspace. Notes mode aggregates multiple transcriptions into a single editable area, which is ideal for documenting long meetings or lectures.

Persistence is handled through automatic saving. Transcripts, audio files, notes, and settings are stored in the browser’s IndexedDB rather than on a central server. This local-first approach keeps stored data on the user's device and means that previously processed files do not need to be re-transcribed.
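Avoiding repeat transcription amounts to a cache lookup before calling the ASR backend. The sketch below illustrates the idea under stated assumptions: all names (`TranscriptStore`, `MemoryStore`, `fingerprint`, `transcribeCached`) are hypothetical, the fingerprint scheme is a guess, and an in-memory map stands in for IndexedDB so the example runs anywhere. In the browser, the same interface could be backed by an IndexedDB object store.

```typescript
// Asynchronous key-value interface, shaped so IndexedDB could implement it.
interface TranscriptStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, text: string): Promise<void>;
}

// In-memory stand-in for IndexedDB (assumption, for illustration only).
class MemoryStore implements TranscriptStore {
  private readonly entries = new Map<string, string>();
  async get(key: string): Promise<string | undefined> {
    return this.entries.get(key);
  }
  async set(key: string, text: string): Promise<void> {
    this.entries.set(key, text);
  }
}

// The subset of the browser's File properties used for the cache key.
interface AudioFileInfo {
  name: string;
  size: number;
  lastModified: number;
}

// Cheap fingerprint; a content hash would be stricter but slower.
function fingerprint(file: AudioFileInfo): string {
  return `${file.name}:${file.size}:${file.lastModified}`;
}

// Return the stored transcript if present, otherwise transcribe and store it.
async function transcribeCached(
  file: AudioFileInfo,
  store: TranscriptStore,
  transcribe: (file: AudioFileInfo) => Promise<string>
): Promise<string> {
  const key = fingerprint(file);
  const cached = await store.get(key);
  if (cached !== undefined) return cached;

  const text = await transcribe(file);
  await store.set(key, text);
  return text;
}
```

With this shape, the second request for the same file returns the stored text without contacting the backend.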

The History tab allows users to revisit previous transcriptions, while the Notes section keeps important results organized and separate from daily logs. A one-click option is available to clear all history when necessary.

Personalization options include a choice between light and dark themes, as well as an option to automatically copy results to the clipboard as soon as transcription finishes. The application remembers these preferences locally for a consistent experience across sessions.

Tech stack

  • Frontend: React + TypeScript
  • Styling: Tailwind CSS
  • ASR backend: Alibaba Qwen ASR model deployed on a Gradio Space
  • Client: Web Audio API for recording and visualization, IndexedDB for local storage

Local deployment

Requires Node.js v18 or higher. Use pnpm (recommended), npm, or yarn.

  1. Clone the repo and enter it: git clone https://github.com/yeahhe365/Qwen3-ASR-Studio.git && cd Qwen3-ASR-Studio

  2. Install dependencies: pnpm install (or npm install)

  3. Start dev server: pnpm dev (or npm run dev)

  4. Open http://localhost:5173 in your browser.