The Math Question That Could Break Speech AI
When it comes to processing audio signals, one of the key challenges is accurately clustering magnitude spectrograms in a way that mimics human hearing. The current approach uses a straightforward function to compute magnitudes from FFT (Fast Fourier Transform) coefficients:
math.Sqrt(math.Pow(math.Exp2(real(x)), 2) + math.Pow(math.Exp2(imag(x)), 2))
Here’s the given Go code converted into a clean LaTeX equation:
$$M(x) = \sqrt{ \left(2^{\Re(x)}\right)^2 + \left(2^{\Im(x)}\right)^2 }$$
Where:
- \(\Re(x)\) is the real part of the complex FFT coefficient \(x\) (already log₂-scaled by the FFT library)
- \(\Im(x)\) is the imaginary part (also log₂-scaled)
- The function reverses the log scaling via \(2^{(\cdot)}\), then computes the Euclidean magnitude in linear space
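For concreteness, here is a minimal runnable version of that computation. The function name `logMagnitude` and the `complex128` signature are my assumptions about how the snippet would be wrapped:

```go
package main

import (
	"fmt"
	"math"
)

// logMagnitude undoes the log₂ scaling on the real and imaginary
// parts of an FFT coefficient and returns the Euclidean magnitude
// in linear space.
func logMagnitude(x complex128) float64 {
	re := math.Exp2(real(x)) // 2^Re(x): back to linear scale
	im := math.Exp2(imag(x)) // 2^Im(x): back to linear scale
	return math.Sqrt(re*re + im*im)
}

func main() {
	// Example: a coefficient whose parts are log₂-scaled.
	x := complex(3.0, 4.0)       // linear parts are 2³ = 8 and 2⁴ = 16
	fmt.Println(logMagnitude(x)) // √(64 + 256) ≈ 17.889
}
```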
Frequency-Dependent Variant (if using \(f\)):
$$M(x,\,f) = \sqrt{ w(f) \cdot \left(2^{\Re(x)}\right)^2 + w(f) \cdot \left(2^{\Im(x)}\right)^2 }$$
where \(w(f)\) is a perceptual weighting function (e.g., A-weighting, Bark scale).
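One step worth making explicit: because the same weight multiplies both squared terms, \(w(f)\) factors out of the square root, so the variant is just a per-bin rescaling of the unweighted magnitude:

$$
M(x,\,f) = \sqrt{w(f)} \cdot M(x)
$$

The weighting therefore never changes anything within a single bin; it only changes how bins trade off against each other during clustering.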
But is this really the best we can do?
The Problem
The FFT gives us complex numbers in log₂ scale (real and imaginary parts are already log-transformed). The current function simply converts them back to linear scale, computes the Euclidean magnitude, and uses that for clustering. However, human hearing isn’t purely Euclidean: it is frequency-dependent and nonlinear.
Two Potential Solutions
- Frequency-Independent Solution – Maybe the current formula is already optimal, but we’re missing a normalization step or a better distance metric for clustering.
- Frequency-Dependent Solution – Since we know the frequency (\(f\)) of each FFT bin, we could weight the magnitude calculation based on psychoacoustic models (e.g., A-weighting, critical bands, or masking effects); a weighting sketch follows this list.
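To make the frequency-dependent route concrete, here is a sketch that uses the standard A-weighting curve (IEC 61672) as \(\sqrt{w(f)}\). The function names `aWeight` and `weightedMagnitude` are mine, and treating the A-weighting gain as a direct multiplier on the linear magnitude is one convention among several:

```go
package main

import (
	"fmt"
	"math"
)

// aWeight returns the A-weighting curve (IEC 61672) as a linear
// gain, normalized to 1.0 at 1 kHz. It attenuates low and very
// high frequencies, roughly matching the ear's sensitivity.
func aWeight(f float64) float64 {
	f2 := f * f
	ra := 12194.0 * 12194.0 * f2 * f2 /
		((f2 + 20.6*20.6) *
			math.Sqrt((f2+107.7*107.7)*(f2+737.9*737.9)) *
			(f2 + 12194.0*12194.0))
	return ra * math.Pow(10, 2.0/20.0) // +2.00 dB normalization at 1 kHz
}

// weightedMagnitude is the frequency-dependent variant: it undoes
// the log₂ scaling, then scales the Euclidean magnitude by the
// perceptual gain, i.e. sqrt(w(f)) = aWeight(f).
func weightedMagnitude(x complex128, f float64) float64 {
	re := math.Exp2(real(x))
	im := math.Exp2(imag(x))
	return aWeight(f) * math.Sqrt(re*re+im*im)
}

func main() {
	x := complex(3.0, 4.0) // log₂-scaled parts; linear magnitude ≈ 17.889
	for _, f := range []float64{100, 1000, 4000, 12000} {
		fmt.Printf("f = %5.0f Hz  gain = %.3f  M = %7.3f\n",
			f, aWeight(f), weightedMagnitude(x, f))
	}
}
```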
Ideas for Improvement
- Logarithmic Scaling: Since human hearing is logarithmic (decibels), perhaps we should avoid math.Exp2 and work directly in the log domain (see the sketch after this list).
- Perceptual Weighting: Modify the magnitude computation using frequency-dependent weights (e.g., emphasize mid-range frequencies where human hearing is most sensitive).
- Alternative Distance Metric: K-means uses sum of squares, but maybe a different similarity measure (e.g., cosine distance, KL divergence) would work better (also sketched below).
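For the first and third ideas, here is a hedged sketch: `log2Magnitude` computes the log-magnitude without ever leaving the log domain (using the usual log-sum-exp stabilization), and `cosineDistance` is a drop-in alternative to squared Euclidean distance. Both names are mine, and note that swapping k-means' metric for cosine distance technically makes it spherical k-means rather than standard k-means:

```go
package main

import (
	"fmt"
	"math"
)

// log2Magnitude returns log₂|x| directly in the log domain:
//   log₂ √(2^{2a} + 2^{2b}) = ½ (hi + log₂(1 + 2^{lo−hi}))
// with a = Re(x), b = Im(x), hi = max(2a, 2b), lo = min(2a, 2b).
// The exponent lo−hi is ≤ 0, so math.Exp2 never overflows.
func log2Magnitude(x complex128) float64 {
	a, b := 2*real(x), 2*imag(x)
	hi, lo := math.Max(a, b), math.Min(a, b)
	return 0.5 * (hi + math.Log2(1+math.Exp2(lo-hi)))
}

// cosineDistance returns 1 − cosine similarity between two feature
// vectors. Unlike squared Euclidean distance, it is insensitive to
// overall loudness and compares only spectral shape.
func cosineDistance(u, v []float64) float64 {
	var dot, nu, nv float64
	for i := range u {
		dot += u[i] * v[i]
		nu += u[i] * u[i]
		nv += v[i] * v[i]
	}
	return 1 - dot/(math.Sqrt(nu)*math.Sqrt(nv))
}

func main() {
	x := complex(3.0, 4.0)
	fmt.Println(math.Exp2(log2Magnitude(x))) // ≈ 17.889, matches the linear formula

	u := []float64{1, 2, 3}
	v := []float64{2, 4, 6}           // same shape, double the loudness
	fmt.Println(cosineDistance(u, v)) // ≈ 0: identical spectral shape
}
```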
The Challenge
If we crack this, we could get a better speech codec—one that clusters audio features in a way that truly matches human perception. But is the problem in the math function, or is it somewhere else in the pipeline?
What do you think? Is there a better formula? Let’s brainstorm! 🚀