Fay: Build and Deploy Your Own Talking Digital Human for Free

Published June 3 in AI Tools

Fay is an open-source digital human framework that integrates large language models with interactive digital characters. It is available in three specialized versions—Retail, Assistant, and Agent—to suit different project requirements. Typical applications include virtual shopping guides, broadcast hosts, service staff, tutors, voice assistants, and mobile text-based helpers.

The framework allows developers to assemble digital humans or voice assistants without the complications of tightly coupled systems. Its architecture cleanly separates core concerns: audio input, speech recognition (ASR), sentiment analysis, NLP processing, emotional speech synthesis (TTS), audio output, and facial expression control. Each module operates independently. The framework is organized into three layers:

Model Adapter Layer: Handles high-level integration with digital human models, including photorealistic 3D drivers and Live2D anime-style characters. It also facilitates low-level communication with mainstream large language models, including DeepSeek and other reasoning-focused LLMs.

Function Component Layer: Manages ASR, TTS, NLP, and expression control. These components are modular, allowing developers to swap individual elements to meet specific needs.

Terminal Adapter Layer: Provides standardized interfaces for connecting the framework to microcontrollers, mobile applications, websites, and large-scale displays without requiring a major codebase overhaul.
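
To make the idea of swappable components concrete, here is a minimal sketch of a pipeline whose ASR, NLP, and TTS stages are independent and replaceable. The class and method names (`Pipeline`, `transcribe`, `reply`, `synthesize`) are hypothetical and are not taken from Fay's codebase; the toy stages only stand in for real speech and LLM backends.

```python
from dataclasses import dataclass
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class NLP(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Pipeline:
    """Chains three independent stages; any one can be replaced on its own."""
    asr: ASR
    nlp: NLP
    tts: TTS

    def handle(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)
        answer = self.nlp.reply(text)
        return self.tts.synthesize(answer)


# Toy stand-ins (hypothetical) so the sketch runs without any real backend.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperNLP:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = Pipeline(asr=EchoASR(), nlp=UpperNLP(), tts=BytesTTS())
```

Swapping a stage means passing a different object that offers the same method, for example a cloud ASR client in place of `EchoASR`, while the rest of the pipeline stays unchanged.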

What Fay Does

• Supports full offline operation for environments without network access.

• Enables continuous streaming interactions for low-latency voice conversations.

• Manages multi-user concurrency to handle high-load scenarios.

• Supports custom knowledge bases via a simple qa.csv file.

• Provides configurable wake words and interaction rules.

• Features an agent decision system built on the Model Context Protocol (MCP).

• Offers APIs for both text and voice interaction.

• Includes a dedicated control interface for digital human drivers.

• Provides an interface for automated broadcast tasks.

• Supports driving a digital human from real photos via the "xuniren" method.

• Works with Live2D character models.

• Integrates with Unreal Engine 5 (UE5) and Unity 3D models.
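
As an illustration of the qa.csv knowledge base mentioned in the list above, the sketch below loads question and answer pairs and looks up an exact match. The two-column layout (question, answer, no header row) and the function names are assumptions made for this example; the article does not specify the file's actual format or how Fay matches questions.

```python
import csv
import io


def load_qa(stream) -> dict[str, str]:
    """Read question/answer rows into a dict keyed by normalized question."""
    table = {}
    for row in csv.reader(stream):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        table[row[0].strip().lower()] = row[1].strip()
    return table


def answer(table: dict[str, str], question: str, default: str = "") -> str:
    """Exact-match lookup after normalizing case and surrounding whitespace."""
    return table.get(question.strip().lower(), default)


# In-memory sample standing in for a real qa.csv file (assumed layout).
sample = io.StringIO(
    "What are your hours?,We are open 9 to 5.\n"
    "Who are you?,I am Fay.\n"
)
qa = load_qa(sample)
```

With a real file, the same function could be called as `load_qa(open("qa.csv", newline="", encoding="utf-8"))`. A production system would likely use fuzzy or semantic matching rather than exact lookup.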

Development and Deployment

License: MIT. Open for commercial use.

Environment: Lightweight footprint. Built on Python 3.12. Dependencies are managed via a standard requirements.txt.

Startup Options:

• Source Code: Execute main.py as the primary controller.

• Docker: GPU-accelerated images are available for containerized deployment.

Remote Communication: Integrated Ngrok support facilitates cross-device messaging, allowing smartphones, PCs, and wearables to connect to the core framework.

| Category | Directory/File | Description |
| --- | --- | --- |
| Core Modules | ai_module | Contains AI algorithms and LLM integration logic |
| Core Modules | core | The framework engine; manages interaction flow and state |
| Core Modules | genagents | Manages agents using ReAct-based decision logic |
| Core Modules | simulation_engine | Controls digital human motion and facial expressions |
| Functional Components | asr | Handles speech-to-text recognition |
| Functional Components | tts | Manages speech synthesis; supports voices like Azure's Xiaoxiao |
| Functional Components | gui | Graphical configuration panel for persona settings (name, role, wake word) |
| Tools and Configs | utils | General utilities, including config_util.py for parsing settings |
| Tools and Configs | config.json | Stores core parameters like model paths and API endpoints |
| Tools and Configs | system.conf.bak | Backup file for system configurations |
| Deployment | requirements.txt | Lists Python dependencies, such as PyAudio |
| Deployment | fay_booter.py | Boot script for silent background startup |
| Deployment | main.py | The main controller entry point |

Quick Setup (Windows)

Prepare the Environment
  1. Download and install Python 3.12 from [python.org](https://www.python.org/downloads/release/python-3120), selecting the version appropriate for your system.

  2. Install the Visual Studio Build Tools from [Microsoft](https://learn.microsoft.com/zh-cn/visualstudio/releases/2022/release-notes). During setup, select the "Desktop development with C++" workload to install the necessary MSVC compiler.

Deploy the Framework
  1. Clone the repository.

    git clone https://github.com/xszyou/Fay.git
    cd Fay
    
  2. Install the required dependencies.

    pip install -r requirements.txt
    

    On Ubuntu, install the build tools before running pip: sudo apt install build-essential portaudio19-dev

  3. Configure the system.

    • Copy the system.conf.bak file and rename it to system.conf.

    • Update the following key settings in the configuration file (the trailing // notes are annotations for the reader):

    "llm_model_path": "path/to/your/local/LLM",   // Required for offline model usage
    "asr_provider": "local ASR service address",   // Can be swapped for Alibaba, Tencent, etc.
    "tts_voice": "Xiaoxiao"                        // Select any supported Azure voice
    
  4. Launch the application.

    python main.py
    

    To run in background mode on Linux or macOS, use: nohup python main.py &
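
The configuration step above can also be scripted. The following is a minimal sketch, not part of Fay itself, that creates system.conf from system.conf.bak when it is missing and checks the interpreter version the article names. The helper names `ensure_config` and `python_ok` are hypothetical.

```python
import shutil
import sys
from pathlib import Path


def ensure_config(root: Path) -> bool:
    """Create system.conf from system.conf.bak if it is missing.

    Returns True if a copy was made, False if system.conf already existed.
    Raises FileNotFoundError if system.conf.bak is absent.
    """
    target = root / "system.conf"
    if target.exists():
        return False
    shutil.copyfile(root / "system.conf.bak", target)
    return True


def python_ok(minimum=(3, 12)) -> bool:
    """Check the running interpreter against the version the article names."""
    return sys.version_info[:2] >= minimum
```

Calling `ensure_config(Path.cwd())` from inside the cloned Fay directory leaves an existing system.conf untouched, so it is safe to run before every launch.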

Example Use Cases

Expanding Digital Human Types

1. Photorealistic 3D Avatars

Use the [fay-ue5](https://github.com/xszyou/fay-ue5) integration for Unreal Engine 5 to generate and drive lifelike models created from a single photograph.

2. Anime-Style Characters

Load Live2D models and manage their expressions using facial capture technology. Detailed integration documentation is available here.

Cross-Device Communication via Ngrok
  1. Sign up for an Ngrok account and obtain your authentication token.

  2. Initialize the tunnel.

    ngrok http 6000
    

    This command maps local port 6000 to a public URL.

  3. Configure your mobile app or smart device to connect to the framework using the generated Ngrok address.
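
To find the generated address without copying it from the terminal, the tunnel list can be read from the ngrok agent's local inspection API. This is a sketch under assumptions: it relies on the agent's default local address (127.0.0.1:4040) and on the `tunnels`, `public_url`, and `config.addr` fields as I understand the agent API to return them; verify both against your ngrok version. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Default address of the ngrok agent's local API (assumed; it is configurable).
NGROK_API = "http://127.0.0.1:4040/api/tunnels"


def public_urls(payload: dict, local_port: int) -> list[str]:
    """Pick the public URLs of tunnels that forward to the given local port."""
    urls = []
    for tunnel in payload.get("tunnels", []):
        addr = tunnel.get("config", {}).get("addr", "")
        if addr.endswith(f":{local_port}"):
            urls.append(tunnel["public_url"])
    return urls


def fetch_public_urls(local_port: int = 6000) -> list[str]:
    """Query the running ngrok agent; raises URLError if ngrok is not running."""
    with urlopen(NGROK_API, timeout=5) as response:
        return public_urls(json.load(response), local_port)
```

With the tunnel from step 2 running, `fetch_public_urls(6000)` should return the address to enter in the mobile app or smart device.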

Custom Persona Configuration

The GUI exposes the following persona settings:

Identity Basics: Set the name (e.g., "Faye"), assign a role (e.g., Assistant), and select a gender.

Interaction Logic:

• Wake Word: Set a trigger phrase like "Hello" (supports prefix wake mode).

• Sensitivity: Adjust the responsiveness of the assistant to user input.

Media Settings:

• Voice: Select a preferred voice, such as Azure's "Xiaoxiao."

• Auto-broadcast URL: Set to http://127.0.0.1:6000 for local testing purposes.
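
The article does not define "prefix wake mode". One plausible reading is that the wake word must open the utterance and the remainder is treated as the command. The sketch below illustrates that interpretation only; `strip_wake_word` is a hypothetical helper, not Fay's implementation.

```python
from typing import Optional


def strip_wake_word(utterance: str, wake_word: str) -> Optional[str]:
    """Return the command that follows the wake word, or None if it is absent.

    Matching is case-insensitive and requires the wake word to be a whole
    word at the start of the utterance.
    """
    text = utterance.strip()
    if not text.lower().startswith(wake_word.lower()):
        return None
    rest = text[len(wake_word):]
    if rest and rest[0].isalnum():
        return None  # wake word is only a prefix of a longer word
    return rest.lstrip(" ,.!?:;")
```

Under this reading, "Hello, what time is it?" yields the command "what time is it?", while an utterance that does not begin with the wake word is ignored.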

By configuring these components, developers can deploy a voice-interactive digital human system capable of listening, speaking, and operating across multiple devices. This framework is well-suited for intelligent customer service, virtual training, brand marketing, or any application where a human face and voice enhance the user experience.