I wrote some prompts to get Claude to examine the direction I'm thinking in, and I think you'll find this interesting. What stood out for me the most was that proximity hashes don't appear as a core primitive in any current ML/AI system, which seems strange to me. But I think they're key to the decomposition your model encoding scheme would require:
# Algebraic Decomposition and Deterministic Compression in LLM Training
A discussion exploring alternatives to gradient-based training using lossless compression schemes and locality-sensitive hashing.
---
## Part 1: Deterministic Compression vs. Sampling/Quantization
### The Problem
Current LLM training suffers from a fundamental mismatch:
- **Input**: Discrete tokens (finite vocabulary, combinatorial structure)
- **Processing**: Continuous floating-point operations
- **Output**: Discrete tokens (via sampling/argmax)
Each new parameter affects almost the entire graph, its updates are applied at finite precision and are therefore imprecise, and that precision is ultimately limited by memory bandwidth. This raises the question: if language is inherently discrete, why use approximation-based methods?
### Lossy vs. Lossless Approaches
| Type | Examples | Characteristics |
|------|----------|-----------------|
| **Lossy** | DCT (JPEG, MP3), wavelets (Dirac) | Approximation, quantization |
| **Lossless** | Huffman, arithmetic coding, LZ77/78, RLE, FLAC, PNG | Exact reconstruction, algebraic |
For discrete data like human language or computer code, approximation may be the wrong approach if we want to minimize computation, memory copying, and rounding errors.
### Research in Deterministic/Algebraic Directions
#### 1. Minimum Description Length / Kolmogorov Complexity
The most principled connection to compression. Solomonoff induction states: the best model is the shortest program generating the data. This IS deterministic. Work by Rissanen and Grünwald on MDL formalized this. The limitation: Kolmogorov complexity is uncomputable in general, so it must be approximated.
#### 2. Hopfield Networks and Modern Associative Memory
Classical Hopfield networks store patterns via outer products—algebraically exact for capacity up to ~0.14N patterns. Recent work (Ramsauer et al., 2020) connects modern Hopfield networks to transformer attention. These have more algebraic flavor—energy minimization with discrete attractors.
#### 3. Transformers as Weighted Finite Automata
Theoretical work (Merrill, Sabharwal, and others) analyzes what formal languages transformers can represent. If a transformer IS a finite automaton, there might be ways to construct/train it more directly. Related: whether transformers can be represented as tensor networks with exact decompositions.
#### 4. Sparse and Binary Networks
Binary neural networks (Courbariaux et al.), ternary quantization—keeping weights in {-1, 0, +1}. These ARE discrete, but typically still trained via gradient descent with straight-through estimators. Not truly algebraic training.
#### 5. Program Synthesis / Neuro-Symbolic AI
If language models are "compressed programs," why not search program space directly? Work on neural program induction, differentiable interpreters. The DREAMCODER system (Ellis et al.) learns compositional libraries. More algebraic, but doesn't scale to LLM-level complexity yet.
#### 6. Arithmetic Coding Connection
A well-trained language model IS optimal compression via arithmetic coding. If a model assigns probability p(x_i | x_{<i}) to each token, arithmetic coding with those probabilities achieves a code length within a couple of bits of the model's cross-entropy on the sequence. Recent work ("Language Modeling Is Compression," Delétang et al.) formalizes this.
The implication: training a better model = learning better compression. But we still learn the compressor via gradient descent rather than constructing it algebraically.
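To make that correspondence concrete, here is a minimal sketch: the ideal arithmetic-coding length of a sequence is just the sum of the negative log-probabilities the model assigns, so cross-entropy and compressed size are the same quantity. The `p_next` interface and the toy unigram "model" are illustrative assumptions, not any particular library's API.

```python
import math

def code_length_bits(tokens, p_next):
    """Ideal arithmetic-coding length of a token sequence, in bits.

    p_next(context) -> dict mapping candidate tokens to probabilities.
    Arithmetic coding gets within ~2 bits of this total, so a lower
    cross-entropy model is, almost literally, a better compressor.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = p_next(tuple(tokens[:i]))
        total += -math.log2(probs[tok])
    return total

# Toy example: a fixed unigram "model" over a three-word vocabulary.
unigram = {"the": 0.5, "cat": 0.25, "sat": 0.25}
print(code_length_bits(["the", "cat", "sat"], lambda ctx: unigram))  # 1 + 2 + 2 = 5.0 bits
```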
#### 7. Tensor Decomposition Methods
CP decomposition, Tucker decomposition, tensor-train. These give algebraic factorizations of weight matrices. Mostly used post-training for compression, but some work on training directly in decomposed form.
#### 8. Incremental/Delta Updates
Rather than retraining everything, store differences (a low-rank sketch follows the list):
- **LoRA** (low-rank adaptation)—learning a small additive matrix
- **Model merging techniques** (TIES, DARE)—algebraically combining model deltas
- **Continual learning** without catastrophic forgetting
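As a concrete illustration of the delta idea, here is a minimal numpy sketch of a LoRA-style low-rank update: the base matrix stays frozen and the change is stored as two small factors. The shapes and initialization are illustrative assumptions rather than any particular implementation.

```python
import numpy as np

d, r = 1024, 8                            # layer width vs. delta rank (r << d)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))           # frozen base weight matrix
A = rng.standard_normal((r, d)) * 0.01    # small trainable factor
B = np.zeros((d, r))                      # second factor, initialised to zero

def forward(x):
    # Base path plus the low-rank delta: W x + B (A x).
    return W @ x + B @ (A @ x)

# The delta costs 2*d*r numbers instead of d*d, and folding it back into
# the base weights is a single algebraic update:
W_merged = W + B @ A
```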
### The Lossless Compression Insight
For lossless schemes like FLAC/PNG, the key insight is: exploit *predictable structure* to encode *residuals* cheaply. In language:
- **Huffman/arithmetic coding**: Assign short codes to frequent symbols
- **LZ77/78**: Reference prior occurrences exactly
- **BWT**: Group similar contexts for run-length encoding
These are deterministic transformations. An analogous approach for transformers (sketched after this list) might:
1. Build a deterministic structure capturing exact relationships (like a suffix trie for context)
2. Encode "exceptions" or residuals where the structure fails
3. Update incrementally when new data arrives
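A minimal sketch of what steps 1–3 could look like, assuming a PPM-flavoured suffix table: exact counts are stored only for contexts actually observed, prediction falls back from longer to shorter suffixes, and adding data touches only the contexts it contains. The class and its interface are hypothetical.

```python
from collections import Counter, defaultdict

class SuffixTable:
    """Exact context -> continuation counts with longest-suffix fallback.

    Counts exist only for contexts actually seen, prediction falls back
    from longer to shorter suffixes, and updates touch only the contexts
    present in the new data.
    """
    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)   # context tuple -> Counter of next tokens

    def update(self, tokens):
        for i, tok in enumerate(tokens):
            for n in range(self.max_order + 1):
                if i - n < 0:
                    break
                self.counts[tuple(tokens[i - n:i])][tok] += 1

    def predict(self, context):
        for n in range(self.max_order, -1, -1):
            suffix = tuple(context[-n:]) if n else ()
            if suffix in self.counts:
                return self.counts[suffix].most_common(1)[0][0]
        return None

model = SuffixTable()
model.update("the cat sat on the mat".split())
print(model.predict("sat on the".split()))  # 'mat'
```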
### Why Gradient Descent Still Dominates
The core issue: **generalization**. Deterministic/lossless schemes excel at reproducing *seen* patterns. Language requires:
- Compositional generalization to unseen combinations
- Semantic similarity (synonymy, paraphrase)
- Pragmatic inference
These seem to require the "soft" interpolation that continuous representations provide. But perhaps there are algebraic structures (lattices, type systems, category theory) that could capture semantic relationships discretely.
### Further Reading
1. "Scaling Laws for Neural Language Models" (Kaplan et al.)
2. "Language Models as Knowledge Bases?"
3. "Gzip is All You Need" — compression-based classifiers
4. "Neural Networks and Kolmogorov Complexity"
5. Work on mechanistic interpretability
---
## Part 2: Locality-Sensitive Hashing and Collision-Based Structure
### The Key Insight
In cryptographic hashes, collision = failure. In locality-sensitive hashing (LSH):
- **Collision = similarity** (by design)
- Hamming distance between hashes ≈ semantic/structural distance between inputs
- No training required—the similarity structure is baked into the algorithm
### Nilsimsa and Related Schemes
**Nilsimsa** specifically:
- 256-bit output
- Trigram-based analysis
- Designed for near-duplicate text detection
- Deterministic, reproducible
### Hash Family Comparison
| Hash Type | Similarity Preserved | Properties |
|-----------|---------------------|------------|
| **Nilsimsa** | Edit distance (text) | 256-bit, trigram-based |
| **SimHash** | Cosine similarity | Bit-sampling of projections |
| **MinHash** | Jaccard (set overlap) | Permutation-based |
| **GeoHash** | Spatial proximity | Hierarchical, prefix-comparable |
| **TLSH** | Byte distribution | Locality-sensitive, diff-friendly |
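For concreteness, here is a minimal SimHash-style sketch over character trigrams. It is illustrative only and is not the actual Nilsimsa algorithm (which has its own trigram windowing and accumulator), but it shows the collision-equals-similarity property the table describes.

```python
import hashlib

def simhash(text, bits=64):
    """Similarity-preserving digest over character trigrams (illustrative;
    not the actual Nilsimsa algorithm). Each trigram votes on every bit
    position via an ordinary hash; near-duplicates share most trigrams,
    so their digests differ in only a few bits."""
    votes = [0] * bits
    for i in range(len(text) - 2):
        tri = text[i:i + 3].encode()
        h = int.from_bytes(hashlib.blake2b(tri, digest_size=8).digest(), "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

h1 = simhash("the quick brown fox jumps over the lazy dog")
h2 = simhash("the quick brown fox jumped over the lazy dog")
h3 = simhash("arithmetic coding achieves the entropy rate")
print(hamming(h1, h2), hamming(h1, h3))  # near-duplicates: small distance; unrelated: larger
```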
### Connection to Neural Attention
Transformer attention as "find similar keys for this query" can be compared to LSH:
```
Neural attention: softmax(QK^T / √d) → learned similarity
LSH attention: hash(q) collides with hash(k) → designed similarity
```
The Reformer paper (Kitaev et al., 2020) exploited exactly this—using LSH to reduce attention from O(n²) to O(n log n). But they still learned the embeddings; LSH just accelerated the lookup.
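A simplified sketch of the bucketing idea, assuming random-hyperplane hashing: queries and keys that point in similar directions land in the same bucket, and each query attends only within its bucket. Reformer itself uses shared query/key projections and multiple hash rounds; this sketch only shows the partitioning.

```python
import numpy as np

def lsh_buckets(vectors, n_planes=4, seed=0):
    """Bucket id for each row via random-hyperplane signs: vectors with high
    cosine similarity tend to fall on the same side of every plane."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, vectors.shape[1]))
    signs = (vectors @ planes.T) > 0                        # (n, n_planes) booleans
    return (signs * (1 << np.arange(n_planes))).sum(axis=1)

def bucketed_attention(q, k, v, buckets_q, buckets_k):
    """Each query attends only to keys in its own bucket: roughly
    O(n * bucket_size) instead of O(n^2)."""
    out = np.zeros((len(q), v.shape[1]))
    for i, b in enumerate(buckets_q):
        mask = buckets_k == b
        if not mask.any():
            continue                                        # empty bucket: no context
        scores = (k[mask] @ q[i]) / np.sqrt(q.shape[1])
        weights = np.exp(scores - scores.max())
        out[i] = (weights / weights.sum()) @ v[mask]
    return out

q = k = np.random.default_rng(1).standard_normal((16, 32))
v = np.random.default_rng(2).standard_normal((16, 32))
buckets = lsh_buckets(k)
out = bucketed_attention(q, k, v, buckets, buckets)
```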
### A More Radical Proposal
Replace learned embeddings entirely with deterministic similarity hashes (a minimal sketch follows below):
1. Hash each token/context with Nilsimsa or SimHash
2. Similar contexts → similar hashes (no learning needed)
3. Store associations in a hash-addressed memory
4. Retrieval via Hamming distance lookup
5. Incremental updates: just add new hashes, no retraining
This resembles older approaches:
- **Random Indexing**: Assign random sparse vectors to words, accumulate context
- **Count-based distributional semantics** (LSA, HAL)
- **Bloom filters for set membership**
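A minimal sketch of the hash-addressed memory in steps 1–5 above. It works with any int-valued locality-sensitive hash (for instance the `simhash` sketch earlier); the linear scan over buckets is for clarity, whereas a real index would organise digests for sub-linear Hamming search.

```python
from collections import defaultdict

class HashAddressedMemory:
    """(context hash -> stored continuations), queried by Hamming distance.

    hash_fn is any int-valued locality-sensitive hash. Adding data never
    rewrites existing entries, so updates are purely incremental.
    """
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.table = defaultdict(list)              # digest -> list of continuations

    def add(self, context, continuation):
        self.table[self.hash_fn(context)].append(continuation)

    def query(self, context, radius=8):
        h = self.hash_fn(context)
        hits = []
        for digest, items in self.table.items():
            d = bin(h ^ digest).count("1")          # Hamming distance between digests
            if d <= radius:
                hits.extend((d, item) for item in items)
        return sorted(hits, key=lambda t: t[0])     # nearest buckets first
```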
### Collision-as-Structure
If collisions encode similarity, then:
- The hash function defines a **fixed partition** of input space
- Each bucket contains "similar" items (by the hash's definition)
- Exact lookup tables can be built per bucket
- Updates are local—adding data only affects its bucket
This is essentially **product quantization** or **vector quantization**, but deterministic rather than learned.
### Building a Generative Model
For retrieval/classification, hash-based approaches work. "Gzip is All You Need" showed compression-distance classifiers are surprisingly competitive.
For generation:
1. Given context hash, retrieve associated continuations
2. Combine/blend when multiple buckets are relevant
3. Handle novel combinations (generalization)
The challenge: LSH captures **surface** similarity (n-grams, character patterns). Semantic similarity ("happy" ≈ "joyful") requires either:
- A semantically-aware hash function (but how without learning?)
- A composition mechanism that derives semantics from surface patterns
### Hybrid Architectures
**1. Hash-addressed memory + minimal learned integration**
```
context → LSH hash → retrieve candidates → small learned model → output
```
RETRO, REALM, RAG do variants of this, but with learned retrievers.
**2. Hierarchical hashing** (sketched below)
- Coarse hash for topic/domain
- Fine hash for local context
- Exact storage at leaves
- Like a trie but with similarity-aware branching
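A small sketch of the hierarchical idea, using the top bits of a single LSH digest as a GeoHash-style coarse prefix and comparing full digests only within that shard. `hash_fn` is an assumed int-valued LSH; a real system would also probe neighbouring prefixes, since similar digests can straddle a prefix boundary.

```python
from collections import defaultdict

class PrefixShardedStore:
    """Hierarchical lookup from a single LSH digest: the top prefix_bits act
    as a coarse shard key, full digests are compared only within that shard,
    and exact items live at the leaves."""
    def __init__(self, hash_fn, bits=64, prefix_bits=8):
        self.hash_fn, self.bits, self.prefix_bits = hash_fn, bits, prefix_bits
        self.shards = defaultdict(list)             # prefix -> [(digest, item), ...]

    def _prefix(self, digest):
        return digest >> (self.bits - self.prefix_bits)

    def add(self, context, item):
        h = self.hash_fn(context)
        self.shards[self._prefix(h)].append((h, item))

    def query(self, context, radius=8):
        h = self.hash_fn(context)
        hits = []
        for digest, item in self.shards[self._prefix(h)]:
            d = bin(h ^ digest).count("1")
            if d <= radius:
                hits.append((d, item))
        return sorted(hits, key=lambda t: t[0])
```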
**3. Collision-based training signal**
Instead of gradient descent, update based on whether the right things collide:
- If semantically similar items don't collide → adjust hash parameters
- This is "learning to hash" but could potentially be done algebraically
### A Concrete Experiment
To test this approach:
1. Build a Nilsimsa-indexed store of (context_hash → next_token) pairs
2. At inference: hash context, retrieve from all buckets within Hamming distance k
3. Vote/blend retrieved continuations (weighted by inverse Hamming distance?)
4. Measure perplexity vs. n-gram baseline vs. small transformer
The "blend" step is where something learned might still be needed, or might admit an algebraic solution.
---
## Open Questions
1. Can purely deterministic, LSH-based language models achieve competitive performance without gradient-trained components?
2. What algebraic structures (lattices, type systems, category theory) might capture semantic relationships discretely?
3. Is there a way to construct (not train) the similarity structure that transformers learn?
4. Can incremental/delta encoding approaches (like LoRA) be made fully algebraic?
5. What's the minimal learned component needed when combining hash-based retrieval with generation?
---
## Summary
The current paradigm of gradient-based training on continuous representations may be suboptimal for discrete data like language. Deterministic approaches from compression theory (lossless coding) and similarity hashing (LSH, Nilsimsa) offer potential alternatives that could:
- Eliminate rounding errors and precision limits
- Enable true incremental updates
- Provide more interpretable, algebraic structure
- Reduce computational overhead
The key research gap is in **generation** and **compositional generalization**—can algebraic operations on discrete representations achieve what soft attention accomplishes through interpolation?
Replies (1)
Here’s a tight, on-point pivot:
---
This is still retrieval + similarity partitioning.
LSH gives locality.
Compression gives efficiency.
But neither gives compositional closure.
If two contexts collide, you can retrieve them —
but you can’t compose them into new valid states without blending or interpolation.
That’s still similarity space thinking.
Elliptic traversal isn’t about buckets.
It’s about motion in a closed algebra:
deterministic composition
defined inverses
bounded state growth
structure preserved under operation
Hashing partitions space.
Group operations generate space.
Different primitive.
That’s the shift.