Training a Tiny Piper TTS Model for Any Language
This guide will walk you through training your own tiny Piper text-to-speech model in any language. Piper has been successfully tested with languages like Hebrew and Korean, so it should work well for many languages.
What is Piper?
Piper is a neural text-to-speech system that can generate natural-sounding speech. We’ll be training a small version that can run efficiently on various devices.
Setting Up the Environment
First, we need to get the Piper training code and set up our working environment.
git clone https://github.com/neurlang/piper -b hebrew
This command downloads the Piper code from GitHub. The -b hebrew flag specifies the “hebrew” branch, which has been successfully tested on various languages, including Hebrew.
Navigate into the Piper directory:
cd piper
Now we’ll create a Python virtual environment. This keeps our project’s dependencies separate from other Python projects on your system:
python3 -m venv venv
venv/bin/pip install uv
UV is a fast Python package manager that we’ll use to handle dependencies.
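If you want to confirm uv installed correctly, you can print its version (optional check):
venv/bin/uv --version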
Create another virtual environment using UV:
venv/bin/uv venv
Activate the UV virtual environment:
source .venv/bin/activate
Now let’s install the required Python packages for training:
cd src/python
../../venv/bin/uv pip install -e .
The -e flag means we’re installing in “editable” mode, so changes to the source code are immediately reflected.
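As an optional sanity check, you can verify that the piper_train package is importable from the environment (this mirrors the uv run pattern used later in this guide):
../../venv/bin/uv run python -c "import piper_train; print('piper_train OK')"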
Finally, we need to build a special algorithm used during training:
./build_monotonic_align.sh
This builds a small native extension (the monotonic alignment search) used to align text with audio during training.
Preparing Your Dataset
You need a dataset of audio files with their corresponding text transcripts. The dataset should be in LJSpeech format, which is a common format for TTS datasets.
If you already have a dataset, you can create a symbolic link to it:
ln -s -T ~/coqui-ai-TTS/recipes/slovakspeech/slovakspeech_female_dataset/ dataset
This creates a shortcut called “dataset” that points to your actual dataset folder.
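For reference, an LJSpeech-style dataset is typically just a folder containing a metadata.csv file plus the audio clips in a wav/ or wavs/ subfolder, something like this (file names are illustrative):
dataset/
├── metadata.csv
└── wavs/
    ├── audio_file_001.wav
    ├── audio_file_002.wav
    └── ...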
Getting a Starting Model
Instead of training from scratch (which takes a long time), we’ll fine-tune an existing model. This is much faster and requires less data.
Download a pre-trained checkpoint:
wget https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/ryan/medium/epoch=4641-step=3104302.ckpt
This downloads a medium-quality English checkpoint that we’ll fine-tune so it adapts to your language.
Preprocessing Your Data
Before training, we need to convert your dataset into a format Piper can understand:
../../venv/bin/uv run python -m piper_train.preprocess \
--language Slovak \
--input-dir ./dataset \
--output-dir ./train \
--dataset-format ljspeech \
--single-speaker \
--sample-rate 22050 \
--phoneme-type pygoruut \
--max-workers 6
Let’s break down what each option does:
- --language Slovak: Sets the language (change this to your language)
- --input-dir ./dataset: Where your dataset is located
- --output-dir ./train: Where to save the processed data
- --dataset-format ljspeech: The format of your dataset
- --single-speaker: Your dataset has only one speaker
- --sample-rate 22050: Audio quality setting (22050 Hz is standard for TTS)
- --phoneme-type pygoruut: The system that converts text to sounds (phonemes)
- --max-workers 6: How many CPU cores to use (adjust based on your computer; see the check below)
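If you’re not sure how many CPU cores you have available for --max-workers, you can check with:
nproc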
Choosing the right language code: You can use:
- Two-letter ISO codes (like “en” for English)
- Three-letter ISO codes (when two-letter isn’t available)
- Check available languages at https://hashtron.cloud/
Dataset format options:
- ljspeech: Transcript is in the last column of metadata.csv
- coqui: Transcript is in the second column of metadata.csv
Example of ljspeech format in metadata.csv:
audio_file_001|This is the text that will be spoken.|This is the text that will be spoken.
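For comparison, a coqui-format line keeps the transcript in the second column; it might look something like this (the exact columns depend on your dataset, so treat this as illustrative):
audio_file_001|This is the text that will be spoken.|speaker_01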
This preprocessing step creates the ./train folder with config.json and the training files (including dataset.jsonl).
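If you want to sanity-check the preprocessing, open train/config.json; it records settings such as the sample rate, the phoneme type, and the phoneme-to-ID mapping. The exact fields vary between Piper versions and branches, but it looks roughly like this abbreviated sketch:
{
  "audio": { "sample_rate": 22050 },
  "phoneme_type": "pygoruut",
  "num_speakers": 1,
  "num_symbols": 256,
  "phoneme_id_map": { "a": [14], "b": [15] }
}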
Starting the Training
Now we begin the actual training process:
../../venv/bin/uv run python3 -m piper_train \
--dataset-dir "./train" \
--accelerator 'gpu' \
--devices 1 \
--batch-size 24 \
--validation-split 0 \
--num-test-examples 0 \
--max_epochs 990000 \
--resume_from_checkpoint ./epoch=4641-step=3104302.ckpt \
--checkpoint-epochs 1 \
--precision 32
Key settings to adjust:
- --batch-size: How many audio samples to process at once; 16 uses about 12GB of GPU memory, 24 uses about 16GB (you can check what your GPU has with nvidia-smi, shown below)
- --resume_from_checkpoint: The checkpoint we’re fine-tuning from
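To see how much GPU memory you have before picking a batch size (on NVIDIA GPUs), you can run:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv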
Training will take time - potentially hours or days depending on your dataset size and hardware.
Monitoring Training Progress
While training runs, you can check how well your model is learning.
First, open a new terminal window, change into the piper/src/python directory, and activate the environment:
source ../../.venv/bin/activate
Then test the current model:
head -n 1 train/dataset.jsonl | \
../../venv/bin/uv run python3 -m piper_train.infer \
--sample-rate 22050 \
--checkpoint ./train/lightning_logs/version_0/checkpoints/*.ckpt \
--output-dir ./output \
--length-scale 1.3
Note: If this fails because the checkpoint file is being written to, just run it again.
This command takes the first example from your dataset and generates speech using your current model. A file called 0.wav will appear in the output/ directory.
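To listen to the generated sample on a typical Linux machine, you can use any audio player, for example:
aplay output/0.wav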
You can also monitor training with TensorBoard, which shows graphs of training progress:
../../venv/bin/uv run tensorboard --logdir ./train/lightning_logs/
Then open your web browser to the address shown in the output (usually http://localhost:6006).
Exporting Your Trained Model
Once training is complete, convert your model to ONNX format for use in applications:
../../venv/bin/uv run python -m piper_train.export_onnx ./train/lightning_logs/version_0/checkpoints/*.ckpt model.onnx
This creates model.onnx - your final trained text-to-speech model, ready to use! Distribute it along with train/config.json.
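As a quick end-to-end test of the exported model, here is a sketch using the standalone piper command-line tool (assuming you have the piper binary installed; adjust the sample sentence to your language and double-check the flag names for your piper version):
echo 'Toto je test.' | piper --model model.onnx --config train/config.json --output_file test.wav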
Tips for Success
- Start with a small dataset to test the process
- Make sure your audio files are good quality
- Be patient - training takes time
- Adjust batch size based on your GPU memory
- Regularly check the generated samples during training to monitor quality
Congratulations! You now have the knowledge to train your own text-to-speech model.
For inference, you can use the Piper Rust code, as I do in my model: https://huggingface.co/neurlang/piper-onnx-kss-korean