The Math Question That Could Break Speech AI
When it comes to processing audio signals, one of the key challenges is accurately clustering magnitude spectrograms in a way that mimics human hearing. The current approach uses a straightforward function to compute magnitudes from FFT (Fast Fourier Transform) coefficients:
math.Sqrt(math.Pow(math.Exp2(real(x)), 2) + math.Pow(math.Exp2(imag(x)), 2))
Here’s the given Go code converted into a clean LaTeX equation:
$$M(x) = \sqrt{ \left(2^{\Re(x)}\right)^2 + \left(2^{\Im(x)}\right)^2 }$$
Where:
- \(\Re(x)\) is the real part of the complex FFT coefficient \(x\) (already log₂-scaled by the FFT library)
- \(\Im(x)\) is the imaginary part (also log₂-scaled)
- The function reverses the log scaling via \(2^{(\cdot)}\), then computes the Euclidean magnitude in linear space
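For concreteness, here is a minimal runnable version of that computation. The function name `logMagnitude` and the `complex128` signature are my assumptions about how the snippet would be wrapped:

```go
package main

import (
	"fmt"
	"math"
)

// logMagnitude undoes the log₂ scaling on the real and imaginary
// parts of an FFT coefficient and returns the Euclidean magnitude
// in linear space.
func logMagnitude(x complex128) float64 {
	re := math.Exp2(real(x)) // 2^Re(x): back to linear scale
	im := math.Exp2(imag(x)) // 2^Im(x): back to linear scale
	return math.Sqrt(re*re + im*im)
}

func main() {
	// Example: a coefficient whose parts are log₂-scaled.
	x := complex(3.0, 4.0)       // linear parts are 2³ = 8 and 2⁴ = 16
	fmt.Println(logMagnitude(x)) // √(64 + 256) ≈ 17.889
}
```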
Frequency-Dependent Variant (if using \(f\)):
$$M(x,\,f) = \sqrt{ w(f) \cdot \left(2^{\Re(x)}\right)^2 + w(f) \cdot \left(2^{\Im(x)}\right)^2 }$$
where \(w(f)\) is a perceptual weighting function (e.g., A-weighting, Bark scale).
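One step worth making explicit: because the same weight multiplies both squared terms, \(w(f)\) factors out of the square root, so the variant is just a per-bin rescaling of the unweighted magnitude:

$$
M(x,\,f) = \sqrt{w(f)} \cdot M(x)
$$

The weighting therefore never changes anything within a single bin; it only changes how bins trade off against each other during clustering.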
But is this really the best we can do?
The Problem
The FFT gives us complex numbers in log₂ scale (real and imaginary parts are already log-transformed). The current function simply converts them back to linear scale, computes the Euclidean magnitude, and uses that for clustering. However, human hearing isn’t purely Euclidean: it is frequency-dependent and nonlinear.
Two Potential Solutions
- Frequency-Independent Solution – Maybe the current formula is already optimal, but we’re missing a normalization step or a better distance metric for clustering.
- Frequency-Dependent Solution – Since we know the frequency (\(f\)) of each FFT bin, we could weight the magnitude calculation based on psychoacoustic models (e.g., A-weighting, critical bands, or masking effects); a weighting sketch follows this list.
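To make the frequency-dependent route concrete, here is a sketch that uses the standard A-weighting curve (IEC 61672) as \(\sqrt{w(f)}\). The function names `aWeight` and `weightedMagnitude` are mine, and treating the A-weighting gain as a direct multiplier on the linear magnitude is one convention among several:

```go
package main

import (
	"fmt"
	"math"
)

// aWeight returns the A-weighting curve (IEC 61672) as a linear
// gain, normalized to 1.0 at 1 kHz. It attenuates low and very
// high frequencies, roughly matching the ear's sensitivity.
func aWeight(f float64) float64 {
	f2 := f * f
	ra := 12194.0 * 12194.0 * f2 * f2 /
		((f2 + 20.6*20.6) *
			math.Sqrt((f2+107.7*107.7)*(f2+737.9*737.9)) *
			(f2 + 12194.0*12194.0))
	return ra * math.Pow(10, 2.0/20.0) // +2.00 dB normalization at 1 kHz
}

// weightedMagnitude is the frequency-dependent variant: it undoes
// the log₂ scaling, then scales the Euclidean magnitude by the
// perceptual gain, i.e. sqrt(w(f)) = aWeight(f).
func weightedMagnitude(x complex128, f float64) float64 {
	re := math.Exp2(real(x))
	im := math.Exp2(imag(x))
	return aWeight(f) * math.Sqrt(re*re+im*im)
}

func main() {
	x := complex(3.0, 4.0) // log₂-scaled parts; linear magnitude ≈ 17.889
	for _, f := range []float64{100, 1000, 4000, 12000} {
		fmt.Printf("f = %5.0f Hz  gain = %.3f  M = %7.3f\n",
			f, aWeight(f), weightedMagnitude(x, f))
	}
}
```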
Ideas for Improvement
- Logarithmic Scaling: Since human hearing is logarithmic (decibels), perhaps we should avoid math.Exp2 and work directly in the log domain (see the sketch after this list).
- Perceptual Weighting: Modify the magnitude computation using frequency-dependent weights (e.g., emphasize mid-range frequencies where human hearing is most sensitive).
- Alternative Distance Metric: K-means uses sum of squares, but maybe a different similarity measure (e.g., cosine distance, KL divergence) would work better (also sketched below).
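For the first and third ideas, here is a hedged sketch: `log2Magnitude` computes the log-magnitude without ever leaving the log domain (using the usual log-sum-exp stabilization), and `cosineDistance` is a drop-in alternative to squared Euclidean distance. Both names are mine, and note that swapping k-means' metric for cosine distance technically makes it spherical k-means rather than standard k-means:

```go
package main

import (
	"fmt"
	"math"
)

// log2Magnitude returns log₂|x| directly in the log domain:
//   log₂ √(2^{2a} + 2^{2b}) = ½ (hi + log₂(1 + 2^{lo−hi}))
// with a = Re(x), b = Im(x), hi = max(2a, 2b), lo = min(2a, 2b).
// The exponent lo−hi is ≤ 0, so math.Exp2 never overflows.
func log2Magnitude(x complex128) float64 {
	a, b := 2*real(x), 2*imag(x)
	hi, lo := math.Max(a, b), math.Min(a, b)
	return 0.5 * (hi + math.Log2(1+math.Exp2(lo-hi)))
}

// cosineDistance returns 1 − cosine similarity between two feature
// vectors. Unlike squared Euclidean distance, it is insensitive to
// overall loudness and compares only spectral shape.
func cosineDistance(u, v []float64) float64 {
	var dot, nu, nv float64
	for i := range u {
		dot += u[i] * v[i]
		nu += u[i] * u[i]
		nv += v[i] * v[i]
	}
	return 1 - dot/(math.Sqrt(nu)*math.Sqrt(nv))
}

func main() {
	x := complex(3.0, 4.0)
	fmt.Println(math.Exp2(log2Magnitude(x))) // ≈ 17.889, matches the linear formula

	u := []float64{1, 2, 3}
	v := []float64{2, 4, 6}           // same shape, double the loudness
	fmt.Println(cosineDistance(u, v)) // ≈ 0: identical spectral shape
}
```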
The Challenge
If we crack this, we could get a better speech codec—one that clusters audio features in a way that truly matches human perception. But is the problem in the math function, or is it somewhere else in the pipeline?
What do you think? Is there a better formula? Let’s brainstorm! 🚀