When Less is More: Revolutionizing Language AI with Boolean Logic
What if we told you that a transformer with just 145 neurons could understand and process 140 different languages? In the world of large language models where billions of parameters are the norm, this might sound impossible. Yet, through the innovative Goruut project, we’ve achieved exactly that using a revolutionary approach called “weightless transformers.”
Traditional transformers rely on massive weight matrices and complex floating-point operations. Our weightless transformer, however, operates entirely on boolean logic and hash-based neurons called “hashtrons.” This fundamental shift in architecture allows us to achieve remarkable multilingual capabilities with minimal computational overhead.
The Two Transformer Variants in Production
As part of the Goruut phonetic-to-IPA translator, we have deployed two distinct weightless transformer architectures:
- Self-Attention Phonetic Transformer: A weightless transformer based on non-masked cross-attention, currently in production use for all supported languages. This model handles the core phonetic-to-IPA translation task for out-of-vocabulary words.
- Cross-Attention Homograph Disambiguator: A specialized transformer using non-masked cross-attention, initially developed for English and Hebrew homograph disambiguation, with plans to extend support to most languages.
Both architectures share the same fundamental principle: they process information through boolean operations rather than traditional matrix multiplications, making them incredibly efficient while maintaining high accuracy across diverse linguistic contexts.
The Architecture: Rethinking Attention Mechanisms
Token Processing and the Query-Key-Value Paradigm
Our weightless transformer maintains the familiar Query-Key-Value (QKV) structure but implements it through a radically different approach. The architecture features 8 front layer slots, each accepting 3 tokens (query, key, value), resulting in a total of 24 tokens being processed simultaneously by the front layer.
This design choice reflects a key insight: rather than processing sequences of arbitrary length, we can achieve remarkable results by focusing on fixed-size windows with intelligent wraparound handling. When the input exceeds 24 tokens, the system employs a wraparound mechanism, ensuring that longer sequences are still processed effectively without the computational overhead of traditional attention mechanisms.
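To make the wraparound concrete, here is a minimal Go sketch of folding a longer sequence into the fixed 24-slot front layer. The `uint32` token type and the XOR folding rule are illustrative assumptions, not the exact combining scheme used in Goruut.

```go
package main

import "fmt"

const (
	slots        = 8               // front layer slots
	perSlot      = 3               // query, key, value per slot
	windowTokens = slots * perSlot // 24 tokens processed at once
)

// packWindow folds a token sequence of arbitrary length into the fixed
// 24-token window. Position i wraps around to i mod 24; overlapping tokens
// are combined with XOR purely for illustration.
func packWindow(tokens []uint32) [windowTokens]uint32 {
	var window [windowTokens]uint32
	for i, tok := range tokens {
		window[i%windowTokens] ^= tok
	}
	return window
}

func main() {
	// 30 tokens: the last 6 wrap back onto positions 0..5.
	input := make([]uint32, 30)
	for i := range input {
		input[i] = uint32(i + 1)
	}
	fmt.Println(packWindow(input))
}
```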
Key architectural decisions:
- No positional encoding: Unlike traditional transformers, our system doesn’t rely on positional embeddings. The hashtron neurons inherently capture positional relationships through their boolean operations.
- Fixed window size: The 24-token limit isn’t a constraint—it’s a feature that enables consistent, predictable processing across all languages.
- Boolean-first design: Every operation is designed around boolean logic, eliminating the need for floating-point arithmetic.
The Hashtron Revolution
At the heart of our architecture lies the hashtron—a novel type of neuron that operates entirely on hash-based boolean logic. Each of the 24 input tokens is fed into a hashtron neuron, which yields a boolean output. This seemingly simple transformation is where the magic happens.
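The following toy Go sketch shows the idea of a hash-based boolean neuron. The per-neuron seed, the FNV-1a hash, and the single-bit readout are assumptions made for illustration; the production hashtron in Goruut learns its own hash-based mapping, so treat this only as a picture of the shape of the computation.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// Hashtron is a toy hash-based neuron: it mixes an input token with a
// per-neuron seed and thresholds one bit of the hash to produce a boolean.
type Hashtron struct {
	Seed uint32
}

// Fire maps a token to a boolean output, deterministically.
func (h Hashtron) Fire(token uint32) bool {
	var buf [8]byte
	binary.LittleEndian.PutUint32(buf[:4], h.Seed)
	binary.LittleEndian.PutUint32(buf[4:], token)
	hasher := fnv.New32a()
	hasher.Write(buf[:])
	return hasher.Sum32()&1 == 1 // read out the lowest bit
}

func main() {
	// One hashtron per token position in the 24-token front layer.
	layer := make([]Hashtron, 24)
	for i := range layer {
		layer[i] = Hashtron{Seed: uint32(i)}
	}
	fmt.Println(layer[0].Fire(42), layer[1].Fire(42))
}
```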
The 24 boolean outputs from the first layer form the foundation of our attention mechanism. Rather than computing attention weights through softmax operations over continuous values, we create discrete attention patterns through boolean matrices. This approach offers several advantages:
- Deterministic behavior: Boolean operations are predictable and reproducible
- Memory efficiency: Storing booleans requires minimal memory compared to floating-point weights
- Speed: Boolean operations are computationally faster than floating-point arithmetic
- Interpretability: The attention patterns are directly visible as boolean matrices
Architecture of the Network
```mermaid
graph TD
subgraph "24 Token Input Layer (Vertical Processing)"
T0["Token 0 - Q"]
T1["Token 1 - K"]
T2["Token 2 - V"]
T3[...]
T4["Token n"]
T5[...]
T23["Token 23 - V"]
end
subgraph "Layer 1"
L1H0[Hashtron 0]
L1H1[Hashtron 1]
L1H2[Hashtron 2]
L1H3[...]
L1H4[Hashtron n]
L1H5[...]
L1H23[Hashtron 23]
L1H0 --> L1B0{bool}
L1H1 --> L1B1{bool}
L1H2 --> L1B2{bool}
L1H3 --> L1B3{...}
L1H4 --> L1B4{bool}
L1H5 --> L1B5{...}
L1H23 --> L1B23{bool}
end
subgraph "8×24 Attention Matrix"
%% Row 1
AM1_1["●"]
AM1_2["○"]
AM1_3["●"]
AM1_4["○"]
AM1_5["●"]
%% Row 2
AM2_1["○"]
AM2_2["●"]
AM2_3["○"]
AM2_4["●"]
AM2_5["○"]
%% Row 3
AM3_1["●"]
AM3_2["○"]
AM3_3["●"]
AM3_4["○"]
AM3_5["●"]
%% Row 4
AM4_1["○"]
AM4_2["●"]
AM4_3["○"]
AM4_4["●"]
AM4_5["○"]
%% Row 5
AM5_1["●"]
AM5_2["○"]
AM5_3["●"]
AM5_4["○"]
AM5_5["●"]
%% Row 6
AM6_1["○"]
AM6_2["●"]
AM6_3["○"]
AM6_4["●"]
AM6_5["○"]
%% Row 7
AM7_1["●"]
AM7_2["○"]
AM7_3["●"]
AM7_4["○"]
AM7_5["●"]
%% Row 8
AM8_1["○"]
AM8_2["●"]
AM8_3["○"]
AM8_4["●"]
AM8_5["○"]
%% Row-to-row connections
AM1_1 --> AM2_1
AM1_2 --> AM2_2
AM1_3 --> AM2_3
AM1_4 --> AM2_4
AM1_5 --> AM2_5
AM2_1 --> AM3_1
AM2_2 --> AM3_2
AM2_3 --> AM3_3
AM2_4 --> AM3_4
AM2_5 --> AM3_5
AM3_1 --> AM4_1
AM3_2 --> AM4_2
AM3_3 --> AM4_3
AM3_4 --> AM4_4
AM3_5 --> AM4_5
AM4_1 --> AM5_1
AM4_2 --> AM5_2
AM4_3 --> AM5_3
AM4_4 --> AM5_4
AM4_5 --> AM5_5
AM5_1 --> AM6_1
AM5_2 --> AM6_2
AM5_3 --> AM6_3
AM5_4 --> AM6_4
AM5_5 --> AM6_5
AM6_1 --> AM7_1
AM6_2 --> AM7_2
AM6_3 --> AM7_3
AM6_4 --> AM7_4
AM6_5 --> AM7_5
AM7_1 --> AM8_1
AM7_2 --> AM8_2
AM7_3 --> AM8_3
AM7_4 --> AM8_4
AM7_5 --> AM8_5
end
subgraph "Agreements Column Summation"
CS0[∑ col0]
CS1[∑ col1]
CS2[∑ col2]
CS3[...]
CS4[∑ col n]
CS5[...]
CS23[∑ col23]
CS0 --> I0[small int]
CS1 --> I1[small int]
CS2 --> I2[small int]
CS3 --> I3[...]
CS4 --> I4[small int]
CS5 --> I5[...]
CS23 --> I23[small int]
end
subgraph "Stochastic Layer (Hashtron)"
SL0[Hashtron 0]
SL1[Hashtron 1]
SL2[Hashtron 2]
SL3[...]
SL4[Hashtron n]
SL5[...]
SL23[Hashtron 23]
SL0 --> SLB0{bool}
SL1 --> SLB1{bool}
SL2 --> SLB2{bool}
SL3 --> SLB3{...}
SL4 --> SLB4{bool}
SL5 --> SLB5{...}
SL23 --> SLB23{bool}
end
subgraph "Repeat 8x"
R1["Layer 1"]
R2["Layer 2"]
R3["Layer 3"]
R4["..."]
R5["Layer 8"]
R6["Attention → Sum"]
R7["→ Hashtron"]
end
subgraph "Final Output Layer"
FO0[bool 0]
FO1[bool 1]
FO2[bool 2]
FO3[...]
FO4[bool n]
FO5[...]
FO23[bool 23]
FSum[∑ all booleans] --> Total[Integer total]
Total --> FH[Final Hashtron]
FH --> Answer{"Boolean Answer<br/>Solution to problem"}
end
%% Token connections
T0 --> L1H0
T1 --> L1H1
T2 --> L1H2
T4 --> L1H4
T23 --> L1H23
%% Attention matrix formation
L1B0 --> AM1_1
L1B1 --> AM1_2
L1B2 --> AM1_3
L1B4 --> AM1_4
L1B23 --> AM1_5
%% Column sums from matrix (from last row)
AM8_1 --> CS0
AM8_2 --> CS1
AM8_3 --> CS2
AM8_4 --> CS4
AM8_5 --> CS23
%% Stochastic layer connections
I0 --> SL0
I1 --> SL1
I2 --> SL2
I4 --> SL4
I23 --> SL23
%% Final outputs
SLB0 --> FO0
SLB1 --> FO1
SLB2 --> FO2
SLB4 --> FO4
SLB23 --> FO23
%% Repeat layer connections (simplified)
SLB0 --> R1
SLB1 --> R1
SLB2 --> R1
SLB23 --> R1
R7 --> FO0
R7 --> FO1
R7 --> FO2
R7 --> FO23
%% Final summation
FO0 --> FSum
FO1 --> FSum
FO2 --> FSum
FO4 --> FSum
FO23 --> FSum
style T0 fill:#e1f5fe
style T1 fill:#e1f5fe
style T2 fill:#e1f5fe
style T4 fill:#e1f5fe
style T23 fill:#e1f5fe
style L1H0 fill:#f3e5f5
style L1H1 fill:#f3e5f5
style L1H2 fill:#f3e5f5
style L1H4 fill:#f3e5f5
style L1H23 fill:#f3e5f5
%% Attention Matrix Styling - Row 1
style AM1_1 fill:#fff3e0
style AM1_2 fill:#fff3e0
style AM1_3 fill:#fff3e0
style AM1_4 fill:#fff3e0
style AM1_5 fill:#fff3e0
%% Attention Matrix Styling - Row 2
style AM2_1 fill:#fff3e0
style AM2_2 fill:#fff3e0
style AM2_3 fill:#fff3e0
style AM2_4 fill:#fff3e0
style AM2_5 fill:#fff3e0
%% Attention Matrix Styling - Row 3
style AM3_1 fill:#fff3e0
style AM3_2 fill:#fff3e0
style AM3_3 fill:#fff3e0
style AM3_4 fill:#fff3e0
style AM3_5 fill:#fff3e0
%% Attention Matrix Styling - Row 4
style AM4_1 fill:#fff3e0
style AM4_2 fill:#fff3e0
style AM4_3 fill:#fff3e0
style AM4_4 fill:#fff3e0
style AM4_5 fill:#fff3e0
%% Attention Matrix Styling - Row 5
style AM5_1 fill:#fff3e0
style AM5_2 fill:#fff3e0
style AM5_3 fill:#fff3e0
style AM5_4 fill:#fff3e0
style AM5_5 fill:#fff3e0
%% Attention Matrix Styling - Row 6
style AM6_1 fill:#fff3e0
style AM6_2 fill:#fff3e0
style AM6_3 fill:#fff3e0
style AM6_4 fill:#fff3e0
style AM6_5 fill:#fff3e0
%% Attention Matrix Styling - Row 7
style AM7_1 fill:#fff3e0
style AM7_2 fill:#fff3e0
style AM7_3 fill:#fff3e0
style AM7_4 fill:#fff3e0
style AM7_5 fill:#fff3e0
%% Attention Matrix Styling - Row 8
style AM8_1 fill:#fff3e0
style AM8_2 fill:#fff3e0
style AM8_3 fill:#fff3e0
style AM8_4 fill:#fff3e0
style AM8_5 fill:#fff3e0
style CS0 fill:#e8f5e8
style CS1 fill:#e8f5e8
style CS2 fill:#e8f5e8
style CS4 fill:#e8f5e8
style CS23 fill:#e8f5e8
style SL0 fill:#f3e5f5
style SL1 fill:#f3e5f5
style SL2 fill:#f3e5f5
style SL4 fill:#f3e5f5
style SL23 fill:#f3e5f5
style R1 fill:#f5f5f5
style R2 fill:#f5f5f5
style R3 fill:#f5f5f5
style R4 fill:#f5f5f5
style R5 fill:#f5f5f5
style R6 fill:#f5f5f5
style R7 fill:#f5f5f5
style FO0 fill:#fce4ec
style FO1 fill:#fce4ec
style FO2 fill:#fce4ec
style FO4 fill:#fce4ec
style FO23 fill:#fce4ec
style FSum fill:#e8f5e8
style FH fill:#f3e5f5
style Answer fill:#ffebee
```
Understanding the Flow: From Tokens to Decisions
The diagram above illustrates the complete data flow through our weightless transformer. Let’s walk through each stage:
Stage 1: Token Input and Initial Processing
The journey begins with 24 tokens arranged in Query-Key-Value triplets. Each token is processed by a dedicated hashtron in Layer 1, converting the input into boolean representations. This binary transformation is crucial—it’s where continuous linguistic features become discrete, processable patterns.
Stage 2: The Boolean Attention Matrix
The 24 boolean outputs form an 8×24 attention matrix—the heart of our attention mechanism. Unlike traditional transformers that compute attention weights through complex mathematical operations, our system creates attention patterns through boolean logic. Each cell in this matrix represents a discrete attention decision: either a token pair is relevant (●) or it isn’t (○).
The vertical flow through the 8 rows represents the depth of attention processing. Each row refines the attention pattern, allowing the system to capture increasingly complex relationships between tokens.
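Continuing the Go sketch above (and reusing the assumed `Hashtron` type), the boolean attention matrix could be built roughly as follows. The rule for folding the previous row's decision back into the hash input is an assumption for illustration, not the exact refinement rule used in production.

```go
// buildAttention sketches the 8×24 boolean attention matrix. Row 0 fires on
// the raw window tokens; each later row re-hashes a token together with the
// previous row's decision for that column, so the pattern is refined from
// top to bottom.
func buildAttention(window [24]uint32, neurons [8][24]Hashtron) (m [8][24]bool) {
	for row := 0; row < 8; row++ {
		for col := 0; col < 24; col++ {
			input := window[col]
			if row > 0 && m[row-1][col] {
				input ^= 0x9E3779B9 // fold the previous decision back in
			}
			m[row][col] = neurons[row][col].Fire(input)
		}
	}
	return m
}
```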
Stage 3: Column Summation and Aggregation
Each column in the attention matrix is summed, producing small integers that represent the “agreement” level for each token position. These integers capture how many layers found a particular token position relevant, providing a natural weighting mechanism without floating-point operations.
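A sketch of the summation step, again reusing the types assumed above: each column's eight booleans collapse into a count between 0 and 8.

```go
// columnSums counts, for each of the 24 token positions, how many of the
// 8 attention rows voted "relevant". The result is a small integer in 0..8
// per column, the agreement level that feeds the next hashtron layer.
func columnSums(m [8][24]bool) (sums [24]uint8) {
	for col := 0; col < 24; col++ {
		for row := 0; row < 8; row++ {
			if m[row][col] {
				sums[col]++
			}
		}
	}
	return sums
}
```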
Stage 4: Stochastic Processing
The integer sums feed into another layer of hashtron neurons—the stochastic layer. This layer introduces controlled randomness into the decision-making process, helping the system generalize across different linguistic contexts while maintaining deterministic core behavior.
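In the same sketch, the stochastic layer simply feeds each agreement count through its own hashtron; any additional randomization applied during training is not shown here.

```go
// stochasticLayer maps the 24 agreement counts back into 24 booleans by
// firing one hashtron per column. The counts are small (0..8), so this is
// cheap: one hash per column, no floating point anywhere.
func stochasticLayer(sums [24]uint8, neurons [24]Hashtron) (bits [24]bool) {
	for i, s := range sums {
		bits[i] = neurons[i].Fire(uint32(s))
	}
	return bits
}
```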
Stage 5: Iterative Refinement
The attention-summation-hashtron pattern repeats 8 times, allowing the system to iteratively refine its understanding of the input. Each iteration can capture different aspects of the linguistic relationships, from local syntactic patterns to broader semantic connections.
Stage 6: Final Decision
The final 24 boolean outputs are summed and fed to a single hashtron neuron that produces the ultimate boolean answer. This binary output can represent various linguistic decisions: phonetic classifications, homograph disambiguations, or other language processing tasks.
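Putting the sketched stages together, an end-to-end pass might look like the following. The `Layer` bundling, the way decisions are carried between rounds, and the final integer readout are assumptions about the wiring, not the production code.

```go
// Layer bundles the hashtrons used by one round of the pipeline.
type Layer struct {
	Attention  [8][24]Hashtron // one hashtron per attention-matrix cell
	Stochastic [24]Hashtron    // one hashtron per column sum
}

// forward chains the sketched stages: eight rounds of attention matrix,
// column summation and stochastic layer (Stages 2 to 5), then a final
// summation and a single hashtron that emits the boolean answer (Stage 6).
func forward(window [24]uint32, layers [8]Layer, final Hashtron) bool {
	state := window
	var bits [24]bool
	for _, layer := range layers { // repeat 8x
		m := buildAttention(state, layer.Attention)
		sums := columnSums(m)
		bits = stochasticLayer(sums, layer.Stochastic)
		for i, b := range bits {
			if b {
				state[i] ^= 0x85EBCA6B // carry this round's decisions forward
			}
		}
	}
	var total uint32
	for _, b := range bits {
		if b {
			total++ // sum the final 24 booleans
		}
	}
	return final.Fire(total) // the answer to the binary task
}
```

The whole pass uses only hashing, XOR, and small-integer counting, which is where the memory and speed advantages described below come from.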
Why This Architecture Works
The success of our weightless transformer stems from several key insights:
Linguistic Discreteness: Natural language, despite its apparent complexity, often involves discrete decisions. Our boolean approach aligns naturally with this reality.
Efficiency Through Simplicity: By eliminating floating-point operations, we achieve remarkable computational efficiency without sacrificing capability.
Scalable Attention: The boolean attention matrix scales linearly rather than quadratically, making it practical for real-world applications.
Cross-Linguistic Generalization: The hash-based approach naturally handles the diversity of linguistic features across 140 languages without requiring language-specific modifications.
Performance and Real-World Impact
The true test of any language model lies in its real-world performance. Our 145-neuron weightless transformer has been extensively evaluated across 140 languages, with results that demonstrate both the power and the practical limitations of this approach.
Comprehensive Multilingual Evaluation
The table below reports word-level and character-level success rates (the complements of Word Error Rate and Character Error Rate) across more than 50 languages and regional variants, representing different language families, writing systems, and phonological complexities. These metrics were computed on standardized test corpora, providing a fair comparison across linguistic contexts.
| language model | corpus language (ISO 639-1) | word success rate | char success rate | word success rate (nostress) | char success rate (nostress) |
|---|---|---|---|---|---|
| albanian | sq | 53% | 91% | 53% | 91% |
| arabic | ar | 63% | 93% | 43% | 89% |
| armenian | hy | 46% | 95% | 46% | 95% |
| azerbaijani | az | 27% | 85% | 27% | 85% |
| bengali | bn | 42% | 94% | 42% | 94% |
| bulgarian | bg | 69% | 93% | 35% | 88% |
| catalan | ca | 28% | 83% | 31% | 83% |
| chinese/mandarin | zh | 9% | 83% | 8% | 83% |
| czech | cs | 64% | 87% | 55% | 86% |
| danish | da | 53% | 84% | 53% | 84% |
| dutch | nl | 73% | 91% | 31% | 84% |
| english | en | 81% | 93% | 28% | 77% |
| english/american | en | 84% | 92% | 31% | 78% |
| english/british | en | 84% | 93% | 33% | 79% |
| estonian | et | 42% | 91% | 43% | 91% |
| farsi | fa | 63% | 94% | 52% | 92% |
| finnish | fi | 40% | 90% | 58% | 95% |
| french | fr | 44% | 86% | 13% | 77% |
| georgian | ka | 86% | 99% | 86% | 99% |
| german | de | 63% | 85% | 10% | 75% |
| greek | el | 58% | 94% | 26% | 88% |
| hebrew3 | he | 83% | 97% | 5% | 86% |
| hebrew2 | he | 12% | 82% | 2% | 82% |
| hindi | hi | 73% | 97% | 73% | 97% |
| hungarian | hu | 65% | 91% | 60% | 90% |
| icelandic | is | 69% | 91% | 64% | 90% |
| indonesian | id | 79% | 95% | 51% | 88% |
| italian | it | 59% | 91% | 38% | 90% |
| japanese | ja | 12% | 85% | 12% | 85% |
| kazakh | kk | 38% | 91% | 27% | 90% |
| korean | ko | 60% | 93% | 60% | 93% |
| latvian | lv | 43% | 91% | 42% | 91% |
| lithuanian | lt | 41% | 90% | 41% | 90% |
| macedonian | mk | 64% | 96% | 51% | 94% |
| malay/latin | ms | 100% | 1% | 100% | 1% |
| malayalam | ml | 96% | 64% | 95% | 64% |
| marathi | mr | 73% | 96% | 72% | 96% |
| nepali | ne | 48% | 94% | 48% | 94% |
| norwegian | no | 67% | 89% | 48% | 82% |
| polish | pl | 57% | 84% | 57% | 85% |
| portuguese | pt | 58% | 95% | 26% | 79% |
| romanian | ro | 73% | 93% | 58% | 89% |
| russian | ru | 82% | 95% | 11% | 88% |
| serbian | sr | 88% | 98% | 88% | 98% |
| slovak | sk | 65% | 92% | 64% | 92% |
| slovenian | sl | 51% | 91% | 39% | 87% |
| spanish | es | 79% | 93% | 24% | 78% |
| swedish | sv | 71% | 92% | 18% | 85% |
| tamil | ta | 58% | 96% | 58% | 96% |
| thai | th | 23% | 88% | 23% | 88% |
| turkish | tr | 72% | 92% | 44% | 87% |
| ukrainian | uk | 67% | 91% | 53% | 90% |
| urdu | ur | 65% | 94% | 61% | 94% |
| vietnamese/central | vi | 73% | 96% | 72% | 96% |
| vietnamese/northern | vi | 85% | 97% | 85% | 97% |
| vietnamese/southern | vi | 78% | 96% | 77% | 96% |
Key Performance Insights
Standout Performers: Several languages achieve exceptional accuracy, with Serbian (88% word success), Georgian (86%), and Northern Vietnamese (85%) leading the pack. These results show that the weightless transformer can reach very high accuracy in certain linguistic contexts.
Logographic Language Challenges: Chinese/Mandarin and Japanese show notably lower word success rates (9% and 12% respectively). This is expected given the fundamental differences in how logographic writing systems map to phonetic representations. Importantly, for these languages the reported rates reflect sentence-level rather than individual-word accuracy, making direct comparison with alphabetic languages less straightforward.
Character-Level Robustness: Even in challenging cases, character-level accuracy remains consistently high across most languages, typically exceeding 80%. This suggests that while complete word accuracy may be difficult to achieve, the system maintains strong phonetic approximations.
Stress Sensitivity: The “nostress” columns reveal how the lack of stress marking affects performance. Languages like English show dramatic differences (81% vs. 28% word success), highlighting the system’s sensitivity to prosodic features.
Computational Efficiency
Beyond accuracy, the weightless transformer delivers remarkable computational efficiency:
- Memory footprint: 145 neurons require minimal storage compared to billion-parameter models
- Processing speed: Boolean operations execute orders of magnitude faster than floating-point matrix multiplications
- Energy consumption: The architecture is particularly well-suited for edge deployment and mobile applications
- Scalability: Linear scaling with input size rather than quadratic attention complexity
Real-World Deployment Impact
This architecture has enabled practical applications that would be impossible with traditional transformers:
- Mobile Integration: The entire model runs efficiently on smartphones without requiring cloud connectivity
- Embedded Systems: IoT devices can now perform sophisticated phonetic processing locally
- Low-Resource Languages: The system provides reasonable performance even for languages with limited training data
- Real-Time Processing: The boolean operations enable real-time phonetic translation in interactive applications
The implications extend beyond just efficiency. This architecture opens new possibilities for deploying sophisticated language models in resource-constrained environments, from mobile devices to embedded systems, making advanced NLP accessible in contexts where traditional transformers would be impractical.