Take Me Down to the KMEANS City Where VOCODING Is Cheap and Reversible FFT Is Pretty

Neurlang

2025/05/03

Or: How I Learned to Stop Worrying About Griffin-Lim and Love Phase-Aware Clustering


The Problem: Why Does Traditional Vocoding Sound So Bad?

If you’ve ever worked with speech synthesis, you’ve probably heard the metallic, robotic artifacts of the Griffin-Lim Algorithm (GLA). It’s the go-to method for reconstructing audio from magnitude spectrograms when phase is discarded—but it’s also the reason so many early TTS systems sounded like a Speak & Spell left in the microwave.

Conventional wisdom blames three things for GLA’s shortcomings:

  1. Mel-Scale Spectrograms: Most vocoders warp the frequency axis into the mel scale, which approximates human hearing but throws away high-frequency detail. This is lossy by design—like JPEG for audio. (A quick numerical check of this claim follows the list.)
  2. Phase Discard: Since phase looks like noise and is hard to interpret, researchers have historically ignored it. But phase is critical for accurate waveform reconstruction. Without it, GLA has to guess, leading to smeared transients and unnatural reverberation.
  3. Algorithmic Approximations: STFT windowing and overlap-add introduce minor errors, but these are negligible compared to the first two issues.
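
To make point 1 concrete, here's a minimal sketch (assuming numpy and librosa are installed; the 80-band and 1024-point sizes are arbitrary illustrative choices): project a full-resolution magnitude spectrum onto a mel filterbank, then try to climb back up with a pseudo-inverse.

```python
import numpy as np
import librosa

# Hypothetical sizes for illustration: 80 mel bands from 513 FFT bins.
sr, n_fft, n_mels = 22050, 1024, 80
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (80, 513)

# An 80x513 matrix has no true inverse, so whatever the mel projection
# discards is gone for good.
spectrum = np.abs(np.random.randn(n_fft // 2 + 1))
approx = np.linalg.pinv(mel_fb) @ (mel_fb @ spectrum)

err = np.linalg.norm(spectrum - approx) / np.linalg.norm(spectrum)
print(f"relative reconstruction error: {err:.2f}")  # far from zero
```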

What if we could avoid all of these problems?


The Solution: A Phase-Aware, K-Means-Powered Vocoder

We built a near-lossless vocoder that keeps the phase, runs on a full-resolution (non-mel) STFT, and compresses the magnitudes with a plain K-means codebook. No Griffin-Lim anywhere.

Here’s how it works.

Step 1: Pretty Reversible FFT with Phase Retention

For each audio file, we compute a full-resolution STFT (not mel!) and store both magnitude and phase. Unlike mel-spectrograms, this is fully invertible—no approximation needed.
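
As a sanity check, here's a minimal sketch with scipy (the window and overlap sizes are arbitrary choices, not part of any spec): compute the STFT, split it into magnitude and phase, recombine, and confirm the inverse hands the signal back.

```python
import numpy as np
from scipy.signal import stft, istft

sr = 22050
x = np.random.randn(sr)  # stand-in for one second of audio

# Full-resolution STFT; store BOTH parts.
f, t, Z = stft(x, fs=sr, nperseg=1024, noverlap=768)
magnitude, phase = np.abs(Z), np.angle(Z)

# Recombine and invert. With a COLA-compliant window (the default Hann
# at 75% overlap qualifies), the round trip is exact up to float error.
_, x_back = istft(magnitude * np.exp(1j * phase), fs=sr,
                  nperseg=1024, noverlap=768)
print("max abs error:", np.max(np.abs(x - x_back[:len(x)])))
```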

Step 2: K-Means Clustering (Magnitude-Only)

We strip out the phase and perform K-means clustering on the magnitude spectrogram frames. This gives us k centroids, which act as a “codebook” of common spectral patterns (e.g., vowels, consonants, noise).
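
In code, this step is a few lines of scikit-learn. A sketch, assuming the magnitude array from Step 1 (a random stand-in here so the snippet runs on its own) and an arbitrary codebook size of k = 256:

```python
import numpy as np
from sklearn.cluster import KMeans

# `magnitude` would be the (n_bins, n_frames) array from Step 1.
magnitude = np.abs(np.random.randn(513, 2000))

k = 256  # arbitrary codebook size
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(magnitude.T)  # one sample per time frame

codebook = kmeans.cluster_centers_  # (k, n_bins): common spectral shapes
print(codebook.shape)
```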

Step 3: Phase-Anchored Vocoding

Now, for synthesis:

  1. Take a new audio signal, compute its STFT, and keep the phase.
  2. Replace each magnitude frame with its nearest K-means centroid.
  3. Reconstruct the waveform by combining the centroid magnitudes with the original phase.

No Griffin-Lim. No mel-scale warping. Just clean, phase-aware reconstruction.
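
Here's the whole synthesis path as one sketch, assuming the kmeans model fit in Step 2 and scipy's STFT with the same window settings; vocode is just a hypothetical helper name:

```python
import numpy as np
from scipy.signal import stft, istft

def vocode(x, kmeans, sr=22050, nperseg=1024, noverlap=768):
    # 1. Full-resolution STFT; keep the phase.
    _, _, Z = stft(x, fs=sr, nperseg=nperseg, noverlap=noverlap)
    phase = np.angle(Z)

    # 2. Snap each magnitude frame to its nearest K-means centroid.
    labels = kmeans.predict(np.abs(Z).T)
    quantized = kmeans.cluster_centers_[labels].T  # (n_bins, n_frames)

    # 3. Recombine the quantized magnitudes with the ORIGINAL phase.
    _, y = istft(quantized * np.exp(1j * phase), fs=sr,
                 nperseg=nperseg, noverlap=noverlap)
    return y
```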


Why This Works Better Than Griffin-Lim

Griffin-Lim only ever sees magnitudes, so it iterates between the STFT and its inverse, guessing at a phase that is merely consistent with those magnitudes, never the phase the signal actually had. That guesswork is where the smeared transients and metallic ringing come from. Here, the true phase is never discarded, so the only reconstruction error is the magnitude quantization introduced by the K-means codebook.
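
If you'd rather hear the difference than take our word for it, here's a small comparison sketch with librosa and soundfile ("speech.wav" is a hypothetical input file): reconstruct the same magnitudes once with Griffin-Lim's estimated phase and once with the retained phase.

```python
import numpy as np
import librosa
import soundfile as sf

x, sr = librosa.load("speech.wav", sr=None)  # hypothetical input file
Z = librosa.stft(x, n_fft=1024)
mag = np.abs(Z)

# Griffin-Lim: iterate to guess a phase that merely fits the magnitudes.
y_gla = librosa.griffinlim(mag, n_iter=32, n_fft=1024)

# Phase retention: just put the true phase back.
y_kept = librosa.istft(mag * np.exp(1j * np.angle(Z)))

sf.write("gla.wav", y_gla, sr)
sf.write("phase_kept.wav", y_kept, sr)
```
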
Trade-Offs?

There are a few. The method needs the source signal's phase, so it's an analysis-resynthesis scheme (think compression or timbre quantization), not a way to generate audio from magnitudes alone. Fidelity depends on the codebook size k: too few centroids and the quantization becomes audible. And a full-resolution STFT takes more storage than a mel spectrogram, which is the price of reversibility.


Results & Implications

Our experiments bear out the "near-lossless" billing: reconstructions carry none of Griffin-Lim's metallic artifacts, and the only degradation is the codebook's magnitude quantization.

This isn't a neural network. It's not even that fancy. It's just a pretty reversible FFT + K-means + phase retention, proving that sometimes the best solutions are hiding in plain sight.


Conclusion: Vocoding Doesn’t Have to Be Hard

The history of audio processing is littered with over-engineered solutions that ignore fundamentals. Phase matters. Mel-scale isn’t always better. And sometimes, a simple clustering algorithm gets you 90% of the way there.

So next time you’re wrestling with a neural vocoder’s training time or Griffin-Lim’s artifacts, remember:

🎶 Take me down to the KMEANS city, where vocoding is cheap, and the reversible FFT is pretty! 🎶


Try it yourself. The code is here. Blah blah blah—happy clustering! 🚀