The Transformer That Learned Without Gradients and with No Floating Point at All
Homograph disambiguation — choosing the right pronunciation for words that are spelled the same but mean different things — is a tough nut to crack in speech technology. It goes beyond simple grapheme-to-phoneme (G2P) conversion because context matters.
Take Hebrew or Icelandic, for example. You can’t just map letters to sounds: the same string of letters might sound completely different depending on sentence context. That’s why traditional systems (like eSpeak) often stumble, and why more advanced systems (like Phonikud) use transformers or BERT embeddings.
But what if you could solve this problem without BERT, without floating point arithmetic, and without backpropagation? That’s what I’ve been experimenting with: a weightless transformer.
What Does “Weightless” Mean?
In classical neural networks, each connection has a numeric weight that is continuously adjusted through backpropagation and gradient descent. A weightless model doesn’t store those continuous values. Instead, it uses lookup tables or filters that respond with discrete yes/no answers.
This idea isn’t new — it has been reinvented under different names in the literature:
- Weightless Neural Networks (WNNs)
- Lookup Table (LUT) Cells
- Bloomier Filters
I call my version Hashtron, since each processing unit is based on a quaternary lookup filter — think of it like a Bloom filter, but designed to give deterministic yes/no answers for values it has seen, and pseudorandom answers otherwise.
So when I say “weightless,” I mean there are no floating-point weights at all. Each unit (cell) is essentially a hash-based boolean table: a new kind of artificial neuron, an alternative to the good old perceptron.
How Does Learning Work Without Backpropagation?
Normally, transformers rely on backpropagation: compute a gradient, adjust weights, rinse, repeat. My network doesn’t do gradients at all. Instead, each cell learns independently through a process I call tallying.
Here’s the idea:
- Pick the cell that should learn.
- For each training example, run the network twice:
  - once with the cell as-is,
  - once with the cell negated.
- Compare the outcomes against the correct answer.
- Increment the tally (for the current cell) in whichever direction is better.
Over time, the cell’s lookup table stabilizes into the pattern that best optimizes the whole network’s output. In other words, the network converges not through gradients but through averaging the effect of each possible reaction.
It’s slower in the sense that each training pass involves multiple forwards, but it’s extremely transparent: you can literally inspect how a single boolean in a table is influencing results.
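To make this concrete, here is a minimal sketch of the tallying loop in Go. The types and the ForwardWith hook are hypothetical stand-ins, not the actual Hashtron API; only the run-twice-and-tally logic follows the description above.

```go
package weightless

// Example is one labeled training pair; the types here are
// illustrative, not the real Hashtron structures.
type Example struct {
	Input  []uint32
	Target bool
}

// Network is a stand-in for the full pipeline of cells.
type Network interface {
	// ForwardWith runs the whole network, optionally negating the
	// output of the chosen cell.
	ForwardWith(input []uint32, cell int, negate bool) bool
}

// tallyCell decides whether a single cell should keep or flip its
// current response by comparing whole-network outcomes on the data.
func tallyCell(net Network, cell int, data []Example) int {
	tally := 0
	for _, ex := range data {
		asIs := net.ForwardWith(ex.Input, cell, false) == ex.Target
		negated := net.ForwardWith(ex.Input, cell, true) == ex.Target
		switch {
		case asIs && !negated:
			tally++ // keeping the cell as-is helps on this example
		case negated && !asIs:
			tally-- // negating the cell helps on this example
		}
		// If both or neither are correct, this example casts no vote.
	}
	return tally // positive: keep the cell's entry; negative: flip it
}
```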
Preventing Overfitting
“But if every cell just memorizes independently, won’t it overfit?”
That’s a good question. In my design, overfitting is kept in check structurally, by how queries, keys, and values interact. Identical keys must always map to identical values. That constraint forces the network to behave consistently: it can’t just memorize random exceptions, because its boolean logic would break down.
So instead of dropout or weight decay, I rely on the Query/Key/Value structure itself, together with double descent, to enforce generalization.
Boolean Attention vs. Dot-Product Attention
Standard attention works like this:
- Convert tokens into vectors in continuous space.
- Compute dot products between queries and keys to get similarity scores.
- Use those scores to weight values.
My boolean-based attention works differently:
- Take query, key, and value.
- Apply bitwise AND.
- Store results in a boolean attention matrix.
- Convert true/false into 1/0, then sum across columns.
Instead of continuous similarity scores, it’s more like a voting system: “How many agreements do we have?”
Yes, this is a coarser mechanism — you could say dot-product attention works in shades of gray, while boolean attention is black and white. But for homograph disambiguation, that works surprisingly well, because the problem is often binary: this pronunciation or that one.
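Here is a rough sketch of that voting in Go, under the assumption that each token’s query, key, and value are packed into 64-bit words; the function name and layout are mine, not the actual Goruut code.

```go
package weightless

import "math/bits"

// BooleanAttention is a sketch of the bitwise attention described
// above. Each token is a bit-packed uint64; true/false becomes 1/0
// via a population count, and the column sums act as vote tallies.
func BooleanAttention(queries, keys, values []uint64) []int {
	n := len(queries)
	scores := make([]int, n)
	for col := 0; col < n; col++ {
		sum := 0
		for row := 0; row < n; row++ {
			// Bitwise AND plays the role of the dot product: a bit
			// "votes" only where query, key and value all agree.
			agree := queries[row] & keys[col] & values[col]
			sum += bits.OnesCount64(agree)
		}
		scores[col] = sum // how many agreements this column collected
	}
	return scores
}
```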
Why No Positional Encoding?
Transformers usually add sinusoidal or learned positional encodings so that tokens “know” where they are in a sequence. I don’t need that. Each cell already has its position inherently — its place in the pipeline tells it where it is.
When I tested positional encodings, they didn’t help. In fact, they sometimes just added noise. So I dropped them. Sometimes, “it works without it” is all the justification you need.
Quaternary Filters in Practice
A quaternary filter is basically a probabilistic hash structure, similar to a Bloom filter. Here’s how it works in my context:
- If a value has been seen before, the filter answers yes/no deterministically.
- If a value hasn’t been seen, it produces a pseudorandom answer.
This makes it incredibly lightweight. Each cell can “know” millions of patterns with very little storage, and it’s extremely fast to query. Did I mention there is no XOR problem, like in biological neurons?
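A toy version of such a cell in Go might look like the sketch below. The map-backed storage and the maphash fallback are illustrative simplifications; the real quaternary filter is far more compact.

```go
package weightless

import "hash/maphash"

// Cell is a toy stand-in for a quaternary filter: it answers
// deterministically for keys it has learned and pseudorandomly
// (but repeatably) for everything else.
type Cell struct {
	seed  maphash.Seed
	known map[uint32]bool
}

func NewCell() *Cell {
	return &Cell{seed: maphash.MakeSeed(), known: make(map[uint32]bool)}
}

// Learn stores a deterministic answer for a key.
func (c *Cell) Learn(key uint32, answer bool) { c.known[key] = answer }

// Query returns the stored answer for seen keys, otherwise a
// pseudorandom bit derived from hashing the key.
func (c *Cell) Query(key uint32) bool {
	if answer, ok := c.known[key]; ok {
		return answer
	}
	var h maphash.Hash
	h.SetSeed(c.seed)
	h.Write([]byte{byte(key), byte(key >> 8), byte(key >> 16), byte(key >> 24)})
	return h.Sum64()&1 == 1
}
```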
Why Go Instead of Python?
Because the whole system is basically one giant hot loop. It’s hashing and array lookups, not matrix multiplications.
Go’s concurrency model makes it trivial to spread this across 32 threads and saturate the CPU. Python, with its GIL, just can’t match that efficiency without dropping down into C or CUDA. I wanted the simplicity of writing the entire system in one language, so Go was the natural choice.
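As a rough illustration (the evaluate callback and helper are hypothetical, not part of Goruut), fanning the per-example work out over all cores takes only a few lines of Go:

```go
package weightless

import (
	"runtime"
	"sync"
)

// parallelForward runs an arbitrary per-example function across all
// available CPU cores. Since the hot loop is hashing and table
// lookups, this scales almost linearly with core count.
func parallelForward(inputs [][]uint32, evaluate func([]uint32) bool) []bool {
	out := make([]bool, len(inputs))
	workers := runtime.NumCPU() // e.g. 32 on a 32-thread machine
	chunk := (len(inputs) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start, end := w*chunk, (w+1)*chunk
		if end > len(inputs) {
			end = len(inputs)
		}
		if start >= end {
			break
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				out[i] = evaluate(inputs[i]) // independent work, no locking needed
			}
		}(start, end)
	}
	wg.Wait()
	return out
}
```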
And about Go’s “small ML ecosystem”: I just built what I needed. If the libraries don’t exist, I write them.
Training and Dataset Scaling
Training scales linearly with dataset size. If you give it 10× more data, it will take ~10× more time and memory.
The main bottleneck is RAM. With Hebrew homographs, I trained on ~700 MB of data and it ran fine. For larger datasets, you either:
- add more RAM,
- or rely on swap (slower but workable).
There’s no mysterious exponential slowdown, but yes — at some point, hardware is the limit.
Why Not ONNX?
Because ONNX expects a standard tensor-math model. My models aren’t tensor graphs. They’re arrays of hash-based lookup cells.
Could I write a custom runtime in ONNX format? Maybe. But it would mostly be emulation — you wouldn’t gain much. For now, JSON serialization of my model format is good enough, and much easier to load and run natively.
Speed and Accuracy Compared to Transformers
So how does this all stack up?
- Size: My Hebrew model is ~100 MB.
- Speed: Training on CPU took a few hours. Inference is extremely fast because it’s just boolean lookups.
- Accuracy: On Hebrew homographs, it outperforms eSpeak and comes close to Phonikud’s benchmark results — without using BERT or GPU training.
For now, that’s enough for me: it’s competitive where it matters.
Generalization Limits
What about generalization beyond homographs? Honestly, I don’t know yet. The system works well for classification-style G2P problems. For utterance-level G2P, my early experiments overfit because I made a design mistake (merging Key and Value in attention). That’s fixable, but it shows the architecture has quirks.
Could this replace GPT-style large language models? Probably not without more work. Could it become a lightweight alternative for speech tasks? That’s where it shines.
Homographs, Homophones, and Collisions
Linguists will point out the difference between homographs (same spelling, different pronunciation), homophones (same pronunciation, different meaning), and heteronyms (a subset of homographs).
In my system, I don’t worry about those fine distinctions. To me, they’re just collisions:
- Grapheme collisions (homographs).
- Phoneme collisions (homophones).
The model is trained to resolve collisions in either direction: text → IPA or IPA → text. That’s enough to make it useful in phonemization, dephonemization, and ASR contexts.
Real-World Deployment
Yes, this system can be deployed. Models are stored as compressed JSON, they run on CPU, and they don’t require GPUs. That makes them especially attractive for embedded or mobile use cases.
The exact deployment path depends on your pipeline (TTS, ASR, G2P, etc.), but the core properties are:
- fast,
- compact,
- and low-resource.
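For concreteness, loading such a model in Go could look like the sketch below; the struct fields and file layout are hypothetical, only the compressed-JSON, CPU-only part comes from the description above.

```go
package weightless

import (
	"compress/gzip"
	"encoding/json"
	"os"
)

// Model is a hypothetical on-disk layout; the real Goruut model
// schema may differ.
type Model struct {
	Cells [][]uint64 `json:"cells"`
}

// LoadModel reads a gzip-compressed JSON model from disk.
func LoadModel(path string) (*Model, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var m Model
	if err := json.NewDecoder(zr).Decode(&m); err != nil {
		return nil, err
	}
	return &m, nil
}
```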
Theoretical Justification for Integer-Only Transformers
Why does this work at all? The short answer: language is discrete. Letters, phonemes, words — they’re all symbols. Modeling them as integers feels natural.
Traditional transformers embed tokens into continuous vector spaces because it makes dot products easy and differentiable. I sidestep that: if all I need is discrete classification, then booleans and integers, and attention built on them, are enough.
Do I have a formal proof? No. For now, it’s an empirical fact: it works.
Novelty
Is this groundbreaking? Not entirely. Weightless networks have been explored in the past. Bloom filters are decades old. What I’ve done is combine these ideas into a transformer-like structure with attention and sequence modeling, and shown that it can solve real tasks competitively.
So call it a remix: not a revolution, but a revival of forgotten ideas, reframed for the transformer era.
Test Results for the Phonemizers, Including Our Hebrew Homograph Transformer in 0.6.3
| Model | Avg CER | Avg WER | Avg WER (no stress) |
|---|---|---|---|
| Phonikud (SOTA) | 0.044 | 0.2014 | 0.1507 |
| Goruut | 0.4894 | 1.0011 | 0.9544 |
| Goruut 0.6.3 | 0.0982 | 0.382 | 0.3465 |
| eSpeak NG | 0.47 | 1.00 | 0.96 |
| CharsiuG2P | 0.71 | 1.00 | 0.99 |
As you can see, after integrating the transformer in 0.6.3 we’re quite close to the state of the art and already better than competing non-neural models like eSpeak. Note that Phonikud is a two-pass model, and in practice it is higher-latency and larger than Goruut. For details, see the Phonikud whitepaper at https://arxiv.org/pdf/2506.12311.
Closing Thoughts
Weightless transformers won’t replace GPT. But for specific, structured problems like homograph disambiguation, they’re lightweight, fast, and effective.
They run entirely on integers, train on CPUs, store as JSON, and still beat classical baselines. To me, that’s proof that there’s room for radically different architectures in speech and language processing — architectures that don’t look like the deep learning we’ve all grown used to.
Sometimes, throwing away weights is the heaviest idea of all.