MOSS-Speech: Real Voice-to-Voice AI Without Text Bottlenecks

October 20 | Published in Voice Interaction Models

MOSS-Speech eliminates the intermediary. Traditional speech models typically rely on a text-based pipeline: transcribing audio into words, generating a text reply, and then synthesizing speech from that reply. This sequence introduces latency and discards essential nuances like tone, pauses, and inflection. MOSS-Speech bypasses this process entirely by modeling speech-to-speech directly, removing both the text bottleneck and the need for transcripts.

Technically, MOSS-Speech integrates speech-specific layers into a pretrained large language model (LLM) backbone. To preserve the LLM’s existing intelligence, the developers employ a "frozen weights" strategy. By training only the newly added modal layers while leaving the original weights untouched, the model retains the backbone's reasoning ability and knowledge while learning to process audio in addition to text tokens.
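
The frozen-weights idea can be illustrated with a minimal PyTorch sketch. This is not the MOSS-Speech training code: the class SpeechAugmentedLM, the helper freeze_trunk, the layer sizes, and the use of discrete speech tokens are all assumptions made for illustration. It only shows the general pattern of disabling gradients on a pretrained trunk and passing just the new layers to the optimizer.

```python
import torch
from torch import nn


class SpeechAugmentedLM(nn.Module):
    """Toy stand-in: a 'pretrained' trunk plus newly added speech layers."""

    def __init__(self, hidden: int = 32, speech_vocab: int = 64):
        super().__init__()
        # Hypothetical stand-in for the pretrained LLM backbone.
        self.trunk = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        # Hypothetical new speech-specific layers (names are illustrative).
        self.speech_embed = nn.Embedding(speech_vocab, hidden)
        self.speech_head = nn.Linear(hidden, speech_vocab)

    def forward(self, speech_tokens: torch.Tensor) -> torch.Tensor:
        hidden_states = self.trunk(self.speech_embed(speech_tokens))
        return self.speech_head(hidden_states)


def freeze_trunk(model: SpeechAugmentedLM) -> list:
    """Disable gradients on the trunk and return the parameters left trainable."""
    for param in model.trunk.parameters():
        param.requires_grad_(False)
    return [p for p in model.parameters() if p.requires_grad]


torch.manual_seed(0)
model = SpeechAugmentedLM()
trainable = freeze_trunk(model)
# Only the new speech layers are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# Snapshots used to check what one training step actually changes.
trunk_before = [p.detach().clone() for p in model.trunk.parameters()]
embed_before = model.speech_embed.weight.detach().clone()

# One next-token training step on random speech-token ids.
tokens = torch.randint(0, 64, (4, 10))
logits = model(tokens)  # shape: (batch, time, speech_vocab)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 64), tokens[:, 1:].reshape(-1)
)
loss.backward()
optimizer.step()

trunk_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(trunk_before, model.trunk.parameters())
)
speech_layers_changed = not torch.equal(embed_before, model.speech_embed.weight)
```

After the step, trunk_unchanged and speech_layers_changed can be inspected to confirm that gradients still flow through the frozen trunk into the new layers while the trunk's own weights stay fixed.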

Key Distinctions

  • Direct Voice-to-Voice Modeling: Text guidance is entirely optional. The model listens and responds in a single, continuous flow.
  • Layer-Split Architecture: New modal layers are positioned atop a frozen LLM trunk. The trunk maintains its linguistic capabilities while the specialized layers manage sound.
  • Non-Destructive Training: Core abilities of the base model do not degrade. Speech capability is added as an extension rather than a replacement.
  • Benchmark Performance: In spoken question-answering and voice-to-voice tasks, MOSS-Speech outperforms conventional hybrid approaches.

Installation

Clone the repository and enter the project directory:

git clone https://github.com/OpenMOSS/MOSS-Speech
cd MOSS-Speech

Install the required dependencies and initialize submodules:

pip install -r requirements.txt
git submodule update --init --recursive

Getting Started

Launch the Web Demo

You can start the Gradio interface with a single command:

python3 gradio_demo.py

Using the Demo

Interaction Modes

The interface supports four distinct workflows to cover various use cases:

  • Voice-in → Voice-out
  • Voice-in → Text-out
  • Text-in → Voice-out
  • Text-in → Text-out
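
The four modes are simply every pairing of an input modality with an output modality. As a small sketch of how a client script might represent them, the snippet below enumerates the combinations; the names Modality, InteractionMode, and ALL_MODES are hypothetical and do not come from the MOSS-Speech repository.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    VOICE = "Voice"
    TEXT = "Text"


@dataclass(frozen=True)
class InteractionMode:
    """One input/output pairing, e.g. voice in and text out."""

    source: Modality
    target: Modality

    @property
    def label(self) -> str:
        return f"{self.source.value}-in -> {self.target.value}-out"

    @property
    def needs_audio_input(self) -> bool:
        return self.source is Modality.VOICE

    @property
    def produces_audio(self) -> bool:
        return self.target is Modality.VOICE


# Enum iteration follows definition order, so voice-first modes come first.
ALL_MODES = [InteractionMode(src, dst) for src in Modality for dst in Modality]

for mode in ALL_MODES:
    print(mode.label)
```

A structure like this makes it easy to decide, per request, whether an audio file must be supplied and whether an audio response should be expected.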

System Prompt

The default configuration sets the model as a helpful voice assistant tasked with answering user questions via audio.

Generation and Controls

  • Input: Upload or drag audio files into the interface and click "Submit."
  • Chat History: Previous exchanges are saved in the conversation panel. Use "Clear History" to reset the session.
  • Output: Audio responses are rendered in the output section for immediate playback.

The model is also accessible via API. You can further customize behavior through Gradio’s configuration settings to better integrate the tool into your specific workflow. Consult the repository for advanced documentation.