KVoiceWalk: Clone Any Voice for Kokoro TTS Using Random Walks

May 23. Published in Voice & Speech Tools

KVoiceWalk is a specialized tool designed to clone voice styles for the Kokoro text-to-speech (TTS) system. It utilizes a random walk algorithm paired with a hybrid scoring method that integrates Resemblyzer similarity, audio feature extraction, and self-similarity metrics. The final output is a new Kokoro voice style tensor that closely mimics the characteristics of a target reference voice.

Earlier iterations of KVoiceWalk relied exclusively on Resemblyzer similarity scores, which often resulted in overfitting and inconsistent output quality. To address this, the developers introduced self-similarity to stabilize model outputs and prevent quality fluctuations across different text inputs. They also implemented audio feature similarity comparisons to prevent the algorithm from chasing high similarity scores at the expense of natural sound, effectively eliminating the metallic screeching artifacts common in over-optimized audio.

The overall score combines the three metrics using a weighted harmonic mean. This approach allows self-similarity, feature similarity, and target similarity to fluctuate within reasonable bounds, preventing the algorithm from stalling when a single metric fails to improve immediately. Feature similarity is assigned a lower weight, acting primarily as a guardrail to stop voice characteristics from drifting too far from the source.
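To make the harmonic-mean idea concrete, here is a minimal sketch of a weighted harmonic mean over three similarity values. The function name, the example weights, and the assumption that each metric lies in (0, 1] are illustrative only; the actual formula and weights live in fitness_scorer.py and may differ.

```python
def weighted_harmonic_mean(values, weights, eps=1e-9):
    """Weighted harmonic mean: sum(w) / sum(w / v).

    Values are clamped to `eps` so a zero similarity does not divide by zero.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    denominator = sum(w / max(v, eps) for v, w in zip(values, weights))
    return total_weight / denominator


# Hypothetical inputs: target similarity, self-similarity, feature similarity.
# The weights are made up for illustration; feature similarity gets the
# smallest one, matching its "guardrail" role described above.
score = weighted_harmonic_mean([0.90, 0.95, 0.60], [1.0, 1.0, 0.2])
print(f"hybrid score: {score:.4f}")
```

With a small weight, a mediocre feature similarity lowers the score only slightly, while a metric that collapses toward zero still drags the whole result down.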

Installing and Using KVoiceWalk

Requirements

  1. Clone the repository:
git clone https://github.com/RobViren/kvoicewalk.git
cd kvoicewalk
  2. Prepare your target audio:
  • Format: WAV file with a 24,000 Hz sample rate.
  • Duration: 20–30 seconds of clean, single-speaker audio is ideal.
  • Conversion: Use FFmpeg to ensure the correct sample rate:
ffmpeg -i input_file.wav -ar 24000 target.wav
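After converting, it can be worth confirming that the file really matches the guidance above. The following is a small sketch using only Python's standard-library wave module; the helper name check_target_wav and its thresholds are assumptions for illustration and are not part of KVoiceWalk.

```python
import wave


def check_target_wav(path, expected_rate=24000, min_seconds=20.0, max_seconds=30.0):
    """Return (ok, message) for a PCM WAV file.

    `ok` is False only when the sample rate is wrong; a duration outside the
    suggested range is reported but not treated as an error.
    Note: the wave module reads uncompressed PCM WAV files only.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    if rate != expected_rate:
        return False, f"sample rate is {rate} Hz, expected {expected_rate} Hz"
    if not (min_seconds <= seconds <= max_seconds):
        return True, f"rate OK, but duration is {seconds:.1f} s (20-30 s is ideal)"
    return True, f"rate OK, duration {seconds:.1f} s"
```

Calling check_target_wav("target.wav") before a long run is a cheap way to catch a file that was never resampled.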

Basic Run Command

uv run main.py --target_text "Text from your audio" --target_audio ./path/to/target.wav

The script scans the voices directory for pretrained voice samples and identifies the one most similar to your target audio. It then iteratively refines the tensor using random walks and saves the final result in the out folder.
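The refinement loop described above can be sketched as a simple random-walk hill climb. This is a toy illustration, not the project's code: the real tool scores each candidate by synthesizing speech and comparing it with the target, whereas score_fn here is a stand-in, and the Gaussian step and "keep only improvements" rule are assumptions.

```python
import random


def random_walk(start, score_fn, steps=1000, step_size=0.05, seed=None):
    """Perturb a vector at random and keep a candidate only if it scores higher."""
    rng = random.Random(seed)
    best = list(start)
    best_score = score_fn(best)
    for _ in range(steps):
        # Take a small random step away from the current best vector.
        candidate = [x + rng.gauss(0.0, step_size) for x in best]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy score: negative squared distance to a hidden "target" vector.
hidden_target = [0.3, -0.7, 1.2]


def toy_score(vec):
    return -sum((v - t) ** 2 for v, t in zip(vec, hidden_target))


start = [0.0, 0.0, 0.0]
best, best_score = random_walk(start, toy_score, steps=2000, seed=42)
print(f"start score {toy_score(start):.4f}, final score {best_score:.4f}")
```

Because only improvements are accepted, the score never decreases, but it can sit unchanged for many steps before a lucky perturbation is found.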

Advanced: Interpolation Start

uv run main.py --target_text "Text from your audio" --target_audio ./path/to/target.wav --interpolate_start

This mode initiates a search by interpolating across existing pretrained voices. It generates an optimized initial population of tensors, saved in the interpolated folder, before beginning the random walk process. This method is computationally intensive and benefits significantly from GPU acceleration.
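As a rough picture of what an interpolation search might look like, the sketch below blends every pair of voices at a few mixing factors, scores each blend, and keeps the best few as starting points. Everything here is an assumption for illustration: voices are plain lists rather than Kokoro tensors, the blend is simple linear interpolation, and the factors, naming pattern, and score_fn are hypothetical.

```python
from itertools import combinations


def lerp(a, b, t):
    """Linear interpolation: t=0 returns a, t=1 returns b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_search(voices, score_fn, factors=(0.1, 0.5, 0.9), keep=3):
    """Score blends of every voice pair and return the top `keep` candidates.

    Each result is a (score, label, vector) tuple, best first.
    """
    candidates = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(voices.items(), 2):
        for t in factors:
            blended = lerp(vec_a, vec_b, t)
            label = f"{name_a}_{name_b}_{t:.2f}"
            candidates.append((score_fn(blended), label, blended))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:keep]


# Toy data: three two-dimensional "voices" and a score that prefers [0.5, 0.5].
toy_voices = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [2.0, 0.0]}


def toy_score(vec):
    return -((vec[0] - 0.5) ** 2 + (vec[1] - 0.5) ** 2)


top = interpolation_search(toy_voices, toy_score)
for score, label, _ in top:
    print(f"{label}: {score:.3f}")
```

The number of blends grows with the square of the number of voices, and each one must be scored, which is consistent with this mode being the more expensive way to start.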

Performance Comparison

Using example/target.wav as the reference point:

Baseline Model: Using Kokoro’s built-in af_heart.pt yields a Resemblyzer similarity of 71% and a score of 81.22.

Interpolation Search Result: Using af_jessica.pt_if_sara.pt_0.10.pt, similarity increases to 78% with a score of 84.20.

After 10,000 Random Walk Steps: Similarity reaches 93% with a final score of 92.99. This demonstrates a much closer voice match without sacrificing the stability of the underlying model.

KVoiceWalk Features

  1. Concurrency Constraints: While the process is not inherently parallel-friendly, users can manually terminate low-quality tensor generations early. A single GPU, such as an RTX 3070, can typically handle two instances running in parallel.

  2. Stochastic Nature: Because the process is based on random walks, improvements may stall for long periods before seeing a sudden breakthrough. Running multiple attempts often yields better results.

  3. Future Roadmap:

    • Developing a database of results to train a similarity prediction model that can better guide voice generation.
    • Investigating voice generation techniques beyond Principal Component Analysis (PCA).
    • Transitioning from random walks to genetic algorithms to improve optimization efficiency.

KVoiceWalk File Structure

File/Folder          Purpose
example              Contains sample audio and configuration files
voices               Repository for pretrained voice tensors
main.py              The main entry point and core execution logic
voice_generator.py   Handles the generation of voice tensors
speech_generator.py  The module responsible for speech synthesis
fitness_scorer.py    Implementation of the hybrid scoring function