Qwen3-ASR-Toolkit: Transcribe Long Audio Files Beyond the 3-Minute Limit

Published September 18 in Video Tools

Qwen3-ASR-Toolkit is a Python command-line utility that extends the Qwen-ASR API to long recordings. Using Voice Activity Detection (VAD), it splits long audio or video files into chunks shorter than three minutes and transcribes those chunks concurrently on multiple threads. This works around the official API’s per-request duration limit and lets users transcribe hours of content much faster than processing the chunks one at a time.

The toolkit supports nearly any audio or video format through its FFmpeg integration. It handles technical requirements automatically, such as resampling audio to 16kHz mono to ensure compatibility with the API. With a straightforward command-line interface, users only need to provide a DashScope API Key to access its full range of features.
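
To make the resampling step concrete, here is a minimal sketch of how a 16 kHz mono conversion can be expressed as an FFmpeg call from Python. The function names and the WAV output are assumptions for illustration; the toolkit's internal implementation may differ. The FFmpeg options used (`-vn`, `-ac`, `-ar`, `-y`) are standard.

```python
import subprocess
from pathlib import Path
from typing import List


def build_resample_command(src: str, dst: str) -> List[str]:
    """Build an FFmpeg command that converts `src` to 16 kHz mono audio at `dst`.

    Hypothetical helper, not part of the toolkit's documented interface.
    """
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # any audio or video file FFmpeg can decode
        "-vn",           # ignore the video stream, if there is one
        "-ac", "1",      # downmix to a single channel (mono)
        "-ar", "16000",  # resample to 16 kHz
        dst,
    ]


def resample(src: str, dst: str) -> Path:
    """Run the conversion; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_resample_command(src, dst), check=True, capture_output=True)
    return Path(dst)
```

Calling `resample("lecture.mp4", "lecture_16k.wav")` would produce an audio-only file in the format the API expects, provided FFmpeg is on the PATH.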

Key Features

  • Bypass the 3-minute limit – Process files of any duration without interruption.
  • VAD-based segmentation – Splitting at natural pauses helps keep sentences intact instead of cutting them mid-word.
  • Concurrent processing – Multi-threaded uploads and processing significantly reduce total wait times.
  • Automated post-processing – Identifies and removes common transcription "hallucinations" and repetitive phrases automatically.
  • Automatic resampling – Converts any sample rate or channel configuration to the required 16kHz mono format.
  • Extensive format support – Compatible with mp4, mov, mkv, mp3, wav, m4a, and various other media types.
  • User-friendly interface – Initiate complex transcription tasks with a single command.
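
The article does not describe how the post-processing step detects repetition, so the following is only one plausible approach, sketched under that assumption: collapse any short phrase that repeats back-to-back several times into a single occurrence. The function name and thresholds are hypothetical.

```python
from typing import List


def collapse_repeats(text: str, max_phrase_len: int = 5, min_repeats: int = 3) -> str:
    """Collapse phrases of up to `max_phrase_len` words that repeat
    consecutively at least `min_repeats` times into one occurrence.

    Matching is exact (case- and punctuation-sensitive); a real cleaner
    would probably normalize the text first.
    """
    words = text.split()
    out: List[str] = []
    i = 0
    while i < len(words):
        collapsed = False
        for n in range(1, max_phrase_len + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break  # not enough words left for a phrase of this length
            # Count consecutive copies of the phrase starting at position i.
            count = 1
            while words[i + count * n:i + (count + 1) * n] == phrase:
                count += 1
            if count >= min_repeats:
                out.extend(phrase)  # keep a single copy
                i += count * n      # skip every repeated copy
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

The `min_repeats` threshold is deliberately conservative so that natural doubles such as "very very" are left alone.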

How It Works

  1. Media Input – The tool reads a local file or fetches data from a remote URL.
  2. VAD Analysis – It scans the audio to identify silent intervals.
  3. Intelligent Splitting – The file is cut at silent points to ensure every segment is under the three-minute threshold.
  4. Parallel API Calls – A thread pool sends requests for several segments concurrently, up to the configured number of threads.
  5. Data Aggregation – The tool collects, sequences, and cleans the individual transcriptions.
  6. Output Generation – The final consolidated transcript is displayed in the console and saved as a text file.
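
Steps 3 to 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the toolkit's actual code: `plan_segments` and `transcribe_all` are hypothetical names, the silence points are assumed to come from a VAD pass as timestamps in seconds, and `transcribe` stands in for whatever function calls the API on one segment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

MAX_SEGMENT_SECONDS = 180.0  # the three-minute limit

Segment = Tuple[float, float]  # (start, end) in seconds


def plan_segments(
    duration: float,
    silence_points: Sequence[float],
    max_len: float = MAX_SEGMENT_SECONDS,
) -> List[Segment]:
    """Cut at the latest silence point that keeps each segment within max_len.

    If no silence point falls inside the window, fall back to a hard cut.
    """
    points = sorted(p for p in silence_points if 0.0 < p < duration)
    segments: List[Segment] = []
    start = 0.0
    while duration - start > max_len:
        limit = start + max_len
        candidates = [p for p in points if start < p <= limit]
        cut = candidates[-1] if candidates else limit
        segments.append((start, cut))
        start = cut
    segments.append((start, duration))
    return segments


def transcribe_all(
    segments: Sequence[Segment],
    transcribe: Callable[[Segment], str],
    num_threads: int = 4,
) -> str:
    """Transcribe segments concurrently and join the text in original order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map returns results in input order, whichever call finishes first.
        texts = list(pool.map(transcribe, segments))
    return " ".join(t.strip() for t in texts if t.strip())
```

Because `Executor.map` preserves input order, no explicit re-sequencing is needed before joining the partial transcripts. A thread pool suits this workload because each call spends most of its time waiting on the network.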

Installation

Prerequisites

  • Python 3.8 or higher
  • FFmpeg (for media processing)
    • Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
    • macOS: brew install ffmpeg
    • Windows: Download the FFmpeg binary and add its folder to your system PATH
  • DashScope API Key (available via Alibaba Cloud)

Setting your API Key as an environment variable is recommended:

# Linux/macOS
export DASHSCOPE_API_KEY="your_api_key_here"

# Windows (PowerShell)
$env:DASHSCOPE_API_KEY="your_api_key_here"
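
For readers scripting around the tool, a small sketch of how a key lookup with an environment-variable fallback might look in Python. The precedence shown (explicit key first, then `DASHSCOPE_API_KEY`) is an assumption consistent with the `-key` option being optional when the variable is set; `resolve_api_key` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicitly passed key, else the DASHSCOPE_API_KEY variable."""
    key = cli_key or env.get("DASHSCOPE_API_KEY", "")
    if not key:
        raise RuntimeError(
            "No DashScope API key found: pass one explicitly or set DASHSCOPE_API_KEY"
        )
    return key
```

Keeping the key in the environment avoids leaving it in shell history or in scripts that get shared.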

Install

Option 1: Via PyPI (recommended)

pip install qwen3-asr-toolkit

Option 2: From source

git clone https://github.com/QwenLM/Qwen3-ASR-Toolkit.git
cd Qwen3-ASR-Toolkit
pip install .

Usage

The basic command syntax is as follows:

qwen3-asr -i <input_file_or_url> [-key <api_key>] [-j <num_threads>] [-c <context>] [-t <tmp_dir>] [-s]

Parameters

Parameter             Short  Description                                                        Required
--input-file          -i     Path to a local file or a remote URL                               Yes
--context             -c     Provide context/keywords to improve recognition of specific terms  No
--dashscope-api-key   -key   DashScope API key                                                  No (if env variable is set)
--num-threads         -j     Number of concurrent threads (default: 4)                          No
--tmp-dir             -t     Directory for temporary files (default: ~/qwen3-asr-cache)         No
--silence             -s     Silent mode – suppresses progress information                      No
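
The parameter table maps naturally onto Python's `argparse`. The sketch below is a hypothetical reconstruction built only from the flags, defaults, and descriptions listed above; it is not taken from the toolkit's source, and the default of `None` for `--context` is an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented qwen3-asr options."""
    parser = argparse.ArgumentParser(prog="qwen3-asr")
    parser.add_argument("-i", "--input-file", required=True,
                        help="Path to a local file or a remote URL")
    parser.add_argument("-c", "--context", default=None,
                        help="Context or keywords to improve recognition")
    parser.add_argument("-key", "--dashscope-api-key", default=None,
                        help="DashScope API key (optional if the env variable is set)")
    parser.add_argument("-j", "--num-threads", type=int, default=4,
                        help="Number of concurrent threads")
    parser.add_argument("-t", "--tmp-dir",
                        default=os.path.join(os.path.expanduser("~"), "qwen3-asr-cache"),
                        help="Directory for temporary files")
    parser.add_argument("-s", "--silence", action="store_true",
                        help="Suppress progress information")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

`argparse` derives attribute names from the long option, so `--num-threads` becomes `args.num_threads` and `--dashscope-api-key` becomes `args.dashscope_api_key`.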

Examples

1. Transcribe a local file

qwen3-asr -i "/path/to/my/long_lecture.mp4"

2. Transcribe a remote audio file

qwen3-asr -i "https://somewebsite.com/audios/podcast_episode.mp3"

3. Increase concurrency and provide an API key manually

qwen3-asr -i "/path/to/my/podcast.wav" -j 8 -key "your_api_key_here"

4. Improve accuracy with context hints

qwen3-asr -i "/path/to/my/tech_talk.mp4" -c "Qwen-ASR, DashScope, FFmpeg, VAD"

5. Execute in silent mode

qwen3-asr -i "/path/to/my/meeting_recording.m4a" -s