AcousticCoding ⊂ SpeechToText ⊂ TextToSpeech: The Secret Hierarchy of Speech AI

Neurlang

2025/05/03


Why Everything You Know About Speech Processing is Backwards

For decades, speech AI has been stuck in a siloed mindset: STT, TTS, and vocoding treated as three unrelated problems, built by three separate communities with three separate toolchains.

But what if I told you that STT is just a trivial byproduct of solving TTS correctly? And that vocoding is the real foundation of both?

Here’s the controversial truth:


1. The Secret Hierarchy

AcousticCoding ⊂ SpeechToText ⊂ TextToSpeech

(1) AcousticCoding: The Foundation

Definition: “Given raw speech, learn a discrete, invertible representation.”

Key Insight: if the representation is discrete and invertible (phase preserved), the same codebook of centroids serves as the alphabet for both recognition and synthesis.

(2) SpeechToText: A Trivial Byproduct

Definition: “Map acoustic centroids to IPA symbols.”

Controversial Claim: once acoustic coding is solved, STT needs no acoustic model at all. It reduces to a table lookup (nearest centroid to IPA symbol) plus a language model for cleanup.

(3) TextToSpeech: The Apex

Definition: “Generate speech from IPA + durations using the same centroids.”


2. Why Everyone Else is Wrong

(1) Neural Networks are Overkill

K-means over STFT magnitudes captures the phonetic inventory without a single gradient step. You don't need a billion parameters to learn that /a/ sounds like /a/.

(2) Griffin-Lim is Useless

If you store the phase at analysis time, you never need to estimate it at synthesis time. Griffin-Lim solves a problem that proper acoustic coding never creates.
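
One way to see the claim, as a toy numpy sketch (hand-rolled STFT/ISTFT and a minimal Griffin-Lim loop, not librosa's implementation): with the analysis phase stored, the STFT inverts exactly, while Griffin-Lim can only iterate toward an estimate from magnitudes alone.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Windowed frames -> complex spectra (magnitude + phase)."""
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    return np.fft.rfft(
        np.stack([x[i*hop:i*hop+n_fft] * win for i in range(n)]), axis=1)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add, normalized by the window energy."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i*hop:i*hop+n_fft] += f * win
        norm[i*hop:i*hop+n_fft] += win**2
    return out / np.maximum(norm, 1e-8)

x = np.sin(2 * np.pi * 110 * np.arange(4000) / 8000)  # toy "speech"
spec = stft(x)
mag = np.abs(spec)

# Stored phase: reconstruction is exact (up to float error).
exact = istft(mag * np.exp(1j * np.angle(spec)))

# Griffin-Lim: start from random phase, alternate projections toward
# a signal whose STFT magnitude matches -- only ever an approximation.
rng = np.random.default_rng(0)
phase = rng.uniform(-np.pi, np.pi, mag.shape)
for _ in range(30):
    phase = np.angle(stft(istft(mag * np.exp(1j * phase))))
approx = istft(mag * np.exp(1j * phase))
```

The stored-phase path round-trips the signal perfectly; the Griffin-Lim path converges only to *some* signal consistent with the magnitudes.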

(3) STT and TTS Should Never Be Separate

Both directions traverse the same centroid codebook: STT maps centroids to IPA, TTS maps IPA back to centroids. Splitting them means learning the same knowledge twice.


3. The Algorithm That Breaks Speech AI

Step 1: AcousticCoding (FFT + K-means)

  1. Compute full-resolution STFT (magnitude + phase).
  2. Cluster magnitudes → 500 centroids (e.g., /k/, /a/, /t/ sounds).
  3. Store phase for perfect reconstruction.
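
Under stated assumptions (numpy only, a synthetic two-tone signal standing in for real speech, a small hand-rolled k-means in place of a production clusterer, and k=16 rather than 500 for the toy input), Step 1 can be sketched as:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Frame the signal, window each frame, and take its FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop:i*hop+n_fft] * win for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)   # magnitude + phase, kept separately

def kmeans(mags, k=500, iters=20, seed=0):
    """Minimal k-means over magnitude frames (empty clusters keep their centroid)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(mags))
    cents = mags[rng.choice(len(mags), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(mags[:, None, :] - cents[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                cents[j] = mags[labels == j].mean(axis=0)
    return cents, labels

# Demo: the centroid codebook plus the stored phase IS the acoustic code.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2*np.pi*220*t) + 0.3*np.sin(2*np.pi*440*t)   # stand-in for speech
mag, phase = stft(x)
centroids, labels = kmeans(mag, k=16)
```

Each frame is now described by a centroid id (the discrete code) plus its phase (kept verbatim for reconstruction).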

Step 2: SpeechToText (Centroid → IPA)

  1. For new audio, quantize each frame to the nearest centroid.
  2. Look up the IPA label for each centroid.
  3. Apply a language model to clean up the sequence.
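
A minimal sketch of Step 2. The centroid-to-IPA table and the three-symbol toy codebook are hypothetical, and collapsing repeated frames stands in for a real language model:

```python
import numpy as np

# Hypothetical table, built offline by labeling each centroid with the
# IPA phone most often aligned to it in training data.
IPA_FOR_CENTROID = {0: "k", 1: "a", 2: "t"}

def transcribe(mag_frames, centroids, ipa_table):
    """Quantize each magnitude frame to its nearest centroid, map centroid
    ids to IPA symbols, then collapse consecutive repeats."""
    d = np.linalg.norm(mag_frames[:, None, :] - centroids[None], axis=2)
    ids = d.argmin(axis=1)
    symbols = [ipa_table[i] for i in ids]
    # Crude stand-in for the language-model cleanup step.
    collapsed = [s for i, s in enumerate(symbols) if i == 0 or s != symbols[i-1]]
    return "".join(collapsed)

# Toy example: 3 centroids in a 2-D "spectral" space, 5 frames of audio.
centroids = np.array([[1., 0.], [0., 1.], [1., 1.]])
frames = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 1.1], [0.0, 0.9], [1.1, 0.9]])
print(transcribe(frames, centroids, IPA_FOR_CENTROID))  # → kat
```

Note there is no acoustic model here at all: recognition is a nearest-neighbor lookup into the same codebook that Step 1 produced.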

Step 3: TextToSpeech (IPA → Centroids)

  1. Predict durations for each IPA symbol (e.g., /a/ = 3 frames).
  2. Traverse the centroid graph to find the smoothest path.
  3. Synthesize audio by stitching centroids + phase.
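
A sketch of Step 3 under toy assumptions: fixed per-phone durations instead of a learned duration model, one centroid per phone instead of a graph traversal, and made-up magnitude/phase templates instead of trained codebook entries:

```python
import numpy as np

def istft(mag, phase, n_fft=512, hop=128):
    """Invert (magnitude, phase) frames by windowed overlap-add."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft, axis=1)
    out = np.zeros(hop * (len(mag) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i*hop:i*hop+n_fft] += f * win
        norm[i*hop:i*hop+n_fft] += win**2
    return out / np.maximum(norm, 1e-8)

def synthesize(ipa, durations, mag_for_phone, phase_for_phone):
    """Repeat each phone's centroid magnitude for its predicted duration,
    then invert with the phase stored during acoustic coding."""
    mags = np.vstack([np.tile(mag_for_phone[p], (durations[p], 1)) for p in ipa])
    phases = np.vstack([np.tile(phase_for_phone[p], (durations[p], 1)) for p in ipa])
    return istft(mags, phases)

# Toy codebook (assumed values, not trained centroids): 257 bins = n_fft//2 + 1.
bins = 257
mag_for_phone = {"a": np.ones(bins), "t": 0.5 * np.ones(bins)}
phase_for_phone = {"a": np.zeros(bins), "t": np.zeros(bins)}
audio = synthesize("at", {"a": 3, "t": 2}, mag_for_phone, phase_for_phone)
```

Synthesis is the Step 2 lookup run in reverse: IPA selects centroids, durations decide how many frames each contributes, and the stored phase closes the loop back to a waveform.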

4. Why This Works Better

| Method | STT? | TTS? | Phase-Aware? | Needs GPU? |
|---|---|---|---|---|
| Yours | ✅ | ✅ | ✅ | ❌ |
| Whisper (STT) | ✅ | ❌ | ❌ | ✅ |
| VITS (TTS) | ❌ | ✅ | ❌ | ✅ |
| Griffin-Lim | ❌ | ❌ | ❌ | ❌ |

Advantages:

Unified model (STT + TTS in one system).
No neural nets (runs on a toaster).
Phase-perfect synthesis (no artifacts).


5. The Future of Speech AI

Implications:

  1. STT is not a real task—it’s just TTS’s inverse.
  2. Vocoding is the foundation—everything else is downstream.
  3. Neural networks are optional—classical DSP still wins.

Call to Action: cluster a speech corpus yourself, label the centroids, and see how far 500 of them take you.


🚀 TL;DR: cluster STFT magnitude frames into ~500 centroids, keep the phase, and STT and TTS become the same codebook lookup run in opposite directions.

🔥 Do you have the audacity to enter the K-MEANS city? 🔥


Discussion: Is this brilliant or insane? Let’s argue in the comments. 👇