Training a Tiny Piper TTS Model for Any Language

Neurlang

2025/09/28

This guide will walk you through training your own tiny Piper text-to-speech model in any language. Piper has been successfully tested with languages like Hebrew and Korean, so it should work well for many languages.

What is Piper?

Piper is a neural text-to-speech system that can generate natural-sounding speech. We’ll be training a small version that can run efficiently on various devices.

Setting Up the Environment

First, we need to get the Piper training code and set up our working environment.

git clone https://github.com/neurlang/piper -b hebrew

This command downloads the Piper code from GitHub. The -b hebrew flag selects the “hebrew” branch, which has been successfully tested on various languages, including Hebrew.

Navigate into the Piper directory:

cd piper

Now we’ll create a Python virtual environment. This keeps our project’s dependencies separate from other Python projects on your system:

python3 -m venv venv
venv/bin/pip install uv

UV is a fast Python package manager that we’ll use to handle dependencies.

Create another virtual environment using UV:

venv/bin/uv venv

Activate the UV virtual environment:

source .venv/bin/activate

Now let’s install the required Python packages for training:

cd src/python
../../venv/bin/uv pip install -e .

The -e flag means we’re installing in “editable” mode, so changes to the source code will be immediately reflected.

Finally, we need to build a special algorithm used during training:

./build_monotonic_align.sh

This compiles the monotonic alignment extension (Cython code compiled to a native module) that aligns text with audio during training.

Preparing Your Dataset

You need a dataset of audio files with their corresponding text transcripts. The dataset should be in LJSpeech format, which is a common format for TTS datasets.

If you already have a dataset, you can create a symbolic link to it:

ln -s -T ~/coqui-ai-TTS/recipes/slovakspeech/slovakspeech_female_dataset/ dataset

This creates a shortcut called “dataset” that points to your actual dataset folder.
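
If you are assembling the dataset yourself, a typical LJSpeech-style layout looks like this. This is only a sketch: the wavs/ subfolder name is the LJSpeech convention, and the file IDs must match the first column of metadata.csv:

dataset/
├── metadata.csv
└── wavs/
    ├── audio_file_001.wav
    ├── audio_file_002.wav
    └── ...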

Getting a Starting Model

Instead of training from scratch (which takes a long time), we’ll fine-tune an existing model. This is much faster and requires less data.

Download a pre-trained checkpoint:

wget https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/ryan/medium/epoch=4641-step=3104302.ckpt

This downloads a medium-quality English (Ryan) checkpoint that we will fine-tune so it adapts to your language.

Preprocessing Your Data

Before training, we need to convert your dataset into a format Piper can understand:

../../venv/bin/uv run python -m piper_train.preprocess \
        --language Slovak \
        --input-dir ./dataset \
        --output-dir ./train \
        --dataset-format ljspeech \
        --single-speaker \
        --sample-rate 22050 \
        --phoneme-type pygoruut \
        --max-workers 6

Let’s break down what each option does:

--language: the language of your dataset. With the pygoruut phonemizer you can usually pass the plain English name of the language (Slovak here); check the pygoruut documentation for the exact names and codes it accepts.
--input-dir and --output-dir: where your dataset lives and where the preprocessed training files will be written.
--dataset-format: the layout of the dataset; ljspeech is the metadata.csv-based layout used here.
--single-speaker: the dataset contains a single voice.
--sample-rate: the audio sample rate in Hz; 22050 matches the medium-quality checkpoint we downloaded.
--phoneme-type: how text is turned into phonemes; pygoruut supports a wide range of languages.
--max-workers: how many parallel worker processes to use during preprocessing.

Example of the ljspeech format in metadata.csv (audio file ID, raw text, normalized text, separated by |; the two text fields may be identical):

audio_file_001|This is the text that will be spoken.|This is the text that will be spoken.

This preprocessing step creates the ./train folder with config.json and training files.
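
If preprocessing fails or silently skips files, a quick shell check of metadata.csv often finds the problem. This is only a sketch: it assumes the wavs/ layout shown earlier and |-separated rows with at least an ID and a transcript.

# flag rows that do not have at least two |-separated fields
awk -F'|' 'NF < 2 { print "line " NR " is malformed" }' dataset/metadata.csv

# flag IDs whose audio file is missing
cut -d'|' -f1 dataset/metadata.csv | while read -r id; do
    [ -f "dataset/wavs/$id.wav" ] || echo "missing: $id.wav"
done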

Starting the Training

Now we begin the actual training process:

../../venv/bin/uv run python3 -m piper_train \
        --dataset-dir "./train" \
        --accelerator 'gpu' \
        --devices 1 \
        --batch-size 24 \
        --validation-split 0 \
        --num-test-examples 0 \
        --max_epochs 990000 \
        --resume_from_checkpoint ./epoch=4641-step=3104302.ckpt \
        --checkpoint-epochs 1 \
        --precision 32

Key settings to adjust:

--batch-size: lower this if you run out of GPU memory, raise it if you have memory to spare.
--max_epochs: deliberately huge so training simply keeps going; stop it manually once the voice sounds good.
--validation-split and --num-test-examples: set to 0 so every example is used for training, which helps with small datasets.
--resume_from_checkpoint: the checkpoint to fine-tune from; here it is the downloaded English model (see the sketch right after this list for resuming from your own run).
--checkpoint-epochs: how often, in epochs, a checkpoint is written.
--precision: 32-bit training is the safest default.
--accelerator and --devices: 'gpu' and 1 for a single GPU; use 'cpu' if you have no GPU, but expect much slower training.
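
If training is interrupted, you can restart from your own latest checkpoint instead of the downloaded English one. This is a sketch assuming the default lightning_logs location that also appears in the monitoring section below:

../../venv/bin/uv run python3 -m piper_train \
        --dataset-dir "./train" \
        --accelerator 'gpu' \
        --devices 1 \
        --batch-size 24 \
        --validation-split 0 \
        --num-test-examples 0 \
        --max_epochs 990000 \
        --resume_from_checkpoint ./train/lightning_logs/version_0/checkpoints/*.ckpt \
        --checkpoint-epochs 1 \
        --precision 32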

Training will take time - potentially hours or days depending on your dataset size and hardware.

Monitoring Training Progress

While training runs, you can check how well your model is learning.

First, open a new terminal window, change into the src/python directory, and activate the environment:

source ../../.venv/bin/activate

Then test the current model:

head -n 1 train/dataset.jsonl  | \
        ../../venv/bin/uv run python3 -m piper_train.infer \
            --sample-rate 22050 \
            --checkpoint ./train/lightning_logs/version_0/checkpoints/*.ckpt \
            --output-dir ./output \
            --length-scale 1.3

Note: If this fails because the checkpoint file is being written to, just run it again.

This command takes the first example from your dataset and generates speech using your current model. A file called 0.wav will appear in the output/ directory.
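
To actually listen to the result, or to synthesize a few more samples at once, something like the following works. This is a sketch: aplay is just one example of an audio player, and the head -n 5 variant writes one numbered wav file per line into output/:

aplay output/0.wav

head -n 5 train/dataset.jsonl | \
        ../../venv/bin/uv run python3 -m piper_train.infer \
            --sample-rate 22050 \
            --checkpoint ./train/lightning_logs/version_0/checkpoints/*.ckpt \
            --output-dir ./output \
            --length-scale 1.3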

You can also monitor training with TensorBoard, which shows graphs of training progress:

../../venv/bin/uv run tensorboard --logdir ./train/lightning_logs/

Then open your web browser to the address shown in the output (usually http://localhost:6006).

Exporting Your Trained Model

Once training is complete, convert your model to ONNX format for use in applications:

../../venv/bin/uv run python -m piper_train.export_onnx ./train/lightning_logs/version_0/checkpoints/*.ckpt model.onnx

This creates model.onnx - your final trained text-to-speech model ready to use! Distribute it along with train/config.json!
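
For a quick end-to-end check of the exported files, the standard piper command-line tool can load them directly. This is a sketch that assumes a piper binary is installed and on your PATH; the text is just a sample Slovak sentence:

echo 'Ahoj, toto je môj nový hlas.' | \
    piper --model model.onnx --config train/config.json --output_file test.wav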

Tips for Success

A few things make a big difference: start from a pre-trained checkpoint rather than from scratch, keep the dataset clean and consistently single-speaker, and listen to test samples regularly so you can stop training once the voice sounds good.

Congratulations! You now have the knowledge to train your own text-to-speech model.

For inference you can use the Piper Rust code, as I do in my model: https://huggingface.co/neurlang/piper-onnx-kss-korean