Training a Tiny Piper TTS Model for Any Language
This guide will walk you through training your own tiny Piper text-to-speech model in any language. Piper has been successfully tested with languages like Hebrew and Korean, so it should work well for many languages.
What is Piper?
Piper is a neural text-to-speech system that can generate natural-sounding speech. We’ll be training a small version that can run efficiently on various devices.
Setting Up the Environment
First, we need to get the Piper training code and set up our working environment.
git clone https://github.com/neurlang/piper -b hebrew
This command downloads the Piper code from GitHub. The -b hebrew flag specifies the “hebrew” branch, which has been successfully tested on various languages, including Hebrew.
Navigate into the Piper directory:
cd piper
Now we’ll create a Python virtual environment. This keeps our project’s dependencies separate from other Python projects on your system:
python3 -m venv venv
venv/bin/pip install uv
UV is a fast Python package manager that we’ll use to handle dependencies.
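If you want to confirm uv installed correctly, you can print its version (optional check):
venv/bin/uv --version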
Create another virtual environment using UV:
venv/bin/uv venv
Activate the UV virtual environment:
source .venv/bin/activate
Now let’s install the required Python packages for training:
cd src/python
../../venv/bin/uv pip install -e .
The -e flag means we’re installing in “editable” mode, so changes to the source code are immediately reflected.
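As an optional sanity check, you can verify that the piper_train package is importable from the environment (this mirrors the uv run pattern used later in this guide):
../../venv/bin/uv run python -c "import piper_train; print('piper_train OK')"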
Finally, we need to build a special algorithm used during training:
./build_monotonic_align.sh
This builds a small native extension (the monotonic alignment search) used to align text with audio during training.
Preparing Your Dataset
You need a dataset of audio files with their corresponding text transcripts. The dataset should be in LJSpeech format, which is a common format for TTS datasets.
If you already have a dataset, you can create a symbolic link to it:
ln -s -T ~/coqui-ai-TTS/recipes/slovakspeech/slovakspeech_female_dataset/ dataset
This creates a shortcut called “dataset” that points to your actual dataset folder.
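For reference, an LJSpeech-style dataset is typically just a folder containing a metadata.csv file plus the audio clips in a wav/ or wavs/ subfolder, something like this (file names are illustrative):
dataset/
├── metadata.csv
└── wavs/
    ├── audio_file_001.wav
    ├── audio_file_002.wav
    └── ...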
Getting a Starting Model
Instead of training from scratch (which takes a long time), we’ll fine-tune an existing model. This is much faster and requires less data.
Download a pre-trained checkpoint:
wget https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/ryan/medium/epoch=4641-step=3104302.ckpt
This downloads a medium-quality English checkpoint that we’ll fine-tune so it adapts to your language.
Preprocessing Your Data
Before training, we need to convert your dataset into a format Piper can understand:
../../venv/bin/uv run python -m piper_train.preprocess \
--language Slovak \
--input-dir ./dataset \
--output-dir ./train \
--dataset-format ljspeech \
--single-speaker \
--sample-rate 22050 \
--phoneme-type pygoruut \
--max-workers 6
Let’s break down what each option does:
- --language Slovak: Sets the language (change this to your language)
- --input-dir ./dataset: Where your dataset is located
- --output-dir ./train: Where to save the processed data
- --dataset-format ljspeech: The format of your dataset
- --single-speaker: Your dataset has only one speaker
- --sample-rate 22050: Audio quality setting (22050 Hz is standard for TTS)
- --phoneme-type pygoruut: The system that converts text to sounds (phonemes)
- --max-workers 6: How many CPU cores to use (adjust based on your computer; see the check below)
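If you’re not sure how many CPU cores you have available for --max-workers, you can check with:
nproc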
Choosing the right language code: You can use:
- Two-letter ISO codes (like “en” for English)
- Three-letter ISO codes (when two-letter isn’t available)
- Check available languages at https://hashtron.cloud/
Dataset format options:
- ljspeech: Transcript is in the last column of metadata.csv
- coqui: Transcript is in the second column of metadata.csv
Example of ljspeech format in metadata.csv:
audio_file_001|This is the text that will be spoken.|This is the text that will be spoken.
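For comparison, a coqui-format line keeps the transcript in the second column; it might look something like this (the exact columns depend on your dataset, so treat this as illustrative):
audio_file_001|This is the text that will be spoken.|speaker_01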
This preprocessing step creates the ./train folder with config.json and the training files (including dataset.jsonl).
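If you want to sanity-check the preprocessing, open train/config.json; it records settings such as the sample rate, the phoneme type, and the phoneme-to-ID mapping. The exact fields vary between Piper versions and branches, but it looks roughly like this abbreviated sketch:
{
  "audio": { "sample_rate": 22050 },
  "phoneme_type": "pygoruut",
  "num_speakers": 1,
  "num_symbols": 256,
  "phoneme_id_map": { "a": [14], "b": [15] }
}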
Starting the Training
Now we begin the actual training process:
../../venv/bin/uv run python3 -m piper_train \
--dataset-dir "./train" \
--accelerator 'gpu' \
--devices 1 \
--batch-size 24 \
--validation-split 0 \
--num-test-examples 0 \
--max_epochs 990000 \
--resume_from_checkpoint ./epoch=4641-step=3104302.ckpt \
--checkpoint-epochs 1 \
--precision 32
Key settings to adjust:
- --batch-size: How many audio samples to process at once; 16 uses about 12GB of GPU memory, 24 uses about 16GB (you can check what your GPU has with nvidia-smi, shown below)
- --resume_from_checkpoint: The checkpoint we’re fine-tuning from
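To see how much GPU memory you have before picking a batch size (on NVIDIA GPUs), you can run:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv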
Training will take time - potentially hours or days depending on your dataset size and hardware.
Monitoring Training Progress
While training runs, you can check how well your model is learning.
First, open a new terminal window, change into the piper/src/python directory, and activate the environment:
source ../../.venv/bin/activate
Then test the current model:
head -n 1 train/dataset.jsonl | \
../../venv/bin/uv run python3 -m piper_train.infer \
--sample-rate 22050 \
--checkpoint ./train/lightning_logs/version_0/checkpoints/*.ckpt \
--output-dir ./output \
--length-scale 1.3
Note: If this fails because the checkpoint file is being written to, just run it again.
This command takes the first example from your dataset and generates speech using your current model. A file called 0.wav will appear in the output/ directory.
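To listen to the generated sample on a typical Linux machine, you can use any audio player, for example:
aplay output/0.wav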
You can also monitor training with TensorBoard, which shows graphs of training progress:
../../venv/bin/uv run tensorboard --logdir ./train/lightning_logs/
Then open your web browser to the address shown in the output (usually http://localhost:6006).
Exporting Your Trained Model
Once training is complete, convert your model to ONNX format for use in applications:
../../venv/bin/uv run python -m piper_train.export_onnx ./train/lightning_logs/version_0/checkpoints/*.ckpt model.onnx
This creates model.onnx - your final trained text-to-speech model, ready to use! Distribute it along with train/config.json.
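As a quick end-to-end test of the exported model, here is a sketch using the standalone piper command-line tool (assuming you have the piper binary installed; adjust the sample sentence to your language and double-check the flag names for your piper version):
echo 'Toto je test.' | piper --model model.onnx --config train/config.json --output_file test.wav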
Tips for Success
- Start with a small dataset to test the process
- Make sure your audio files are good quality
- Be patient - training takes time
- Adjust batch size based on your GPU memory
- Regularly check the generated samples during training to monitor quality
Congratulations! You now have the knowledge to train your own text-to-speech model.
For inference, you can use the Piper Rust code, as I do in my model: https://huggingface.co/neurlang/piper-onnx-kss-korean