I wrote some prompts to get Claude to examine the direction I'm thinking in, and I think you'll find this interesting. What stood out for me the most was that proximity hashes don't appear as a core primitive in any current ML/AI system, which seems strange to me. But I think they're key to the decomposition your model encoding scheme would require:
# Algebraic Decomposition and Deterministic Compression in LLM Training
A discussion exploring alternatives to gradient-based training using lossless compression schemes and locality-sensitive hashing.
---
## Part 1: Deterministic Compression vs. Sampling/Quantization
### The Problem
Current LLM training suffers from a fundamental mismatch:
- **Input**: Discrete tokens (finite vocabulary, combinatorial structure)
- **Processing**: Continuous floating-point operations
- **Output**: Discrete tokens (via sampling/argmax)
Each new parameter affects almost the entire graph, its updates are applied at finite precision and are therefore imprecise, and that precision is ultimately limited by memory bandwidth. This raises the question: if language is inherently discrete, why use approximation-based methods?
### Lossy vs. Lossless Approaches
| Type | Examples | Characteristics |
|------|----------|-----------------|
| **Lossy** | DCT (JPEG, MP3), wavelets (Dirac) | Approximation, quantization |
| **Lossless** | Huffman, arithmetic coding, LZ77/78, RLE, FLAC, PNG | Exact reconstruction, algebraic |
For discrete data like human language or computer code, approximation may be the wrong approach if we want to minimize computation, memory copying, and rounding errors.
### Research in Deterministic/Algebraic Directions
#### 1. Minimum Description Length / Kolmogorov Complexity
The most principled connection to compression. Solomonoff induction states: the best model is the shortest program generating the data. This IS deterministic. Work by Rissanen and Grünwald on MDL formalized this. The limitation: Kolmogorov complexity is uncomputable in general, so it must be approximated.
#### 2. Hopfield Networks and Modern Associative Memory
Classical Hopfield networks store patterns via outer products—algebraically exact for capacity up to ~0.14N patterns. Recent work (Ramsauer et al., 2020) connects modern Hopfield networks to transformer attention. These have more algebraic flavor—energy minimization with discrete attractors.
#### 3. Transformers as Weighted Finite Automata
Theoretical work (Merrill, Sabharwal, and others) analyzes what formal languages transformers can represent. If a transformer IS a finite automaton, there might be ways to construct/train it more directly. Related: whether transformers can be represented as tensor networks with exact decompositions.
#### 4. Sparse and Binary Networks
Binary neural networks (Courbariaux et al.), ternary quantization—keeping weights in {-1, 0, +1}. These ARE discrete, but typically still trained via gradient descent with straight-through estimators. Not truly algebraic training.
#### 5. Program Synthesis / Neuro-Symbolic AI
If language models are "compressed programs," why not search program space directly? Work on neural program induction, differentiable interpreters. The DREAMCODER system (Ellis et al.) learns compositional libraries. More algebraic, but doesn't scale to LLM-level complexity yet.
#### 6. Arithmetic Coding Connection
A well-trained language model IS optimal compression via arithmetic coding. If a model assigns probability p(x_i | x_{<i}) to each token, arithmetic coding with those probabilities achieves a code length within a couple of bits of the model's cross-entropy on the sequence. Recent work ("Language Modeling Is Compression," Delétang et al.) formalizes this.
The implication: training a better model = learning better compression. But we still learn the compressor via gradient descent rather than constructing it algebraically.
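To make that correspondence concrete, here is a minimal sketch: the ideal arithmetic-coding length of a sequence is just the sum of the negative log-probabilities the model assigns, so cross-entropy and compressed size are the same quantity. The `p_next` interface and the toy unigram "model" are illustrative assumptions, not any particular library's API.

```python
import math

def code_length_bits(tokens, p_next):
    """Ideal arithmetic-coding length of a token sequence, in bits.

    p_next(context) -> dict mapping candidate tokens to probabilities.
    Arithmetic coding gets within ~2 bits of this total, so a lower
    cross-entropy model is, almost literally, a better compressor.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = p_next(tuple(tokens[:i]))
        total += -math.log2(probs[tok])
    return total

# Toy example: a fixed unigram "model" over a three-word vocabulary.
unigram = {"the": 0.5, "cat": 0.25, "sat": 0.25}
print(code_length_bits(["the", "cat", "sat"], lambda ctx: unigram))  # 1 + 2 + 2 = 5.0 bits
```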
#### 7. Tensor Decomposition Methods
CP decomposition, Tucker decomposition, tensor-train. These give algebraic factorizations of weight matrices. Mostly used post-training for compression, but some work on training directly in decomposed form.
#### 8. Incremental/Delta Updates
Rather than retraining everything, store differences (a low-rank sketch follows the list):
- **LoRA** (low-rank adaptation)—learning a small additive matrix
- **Model merging techniques** (TIES, DARE)—algebraically combining model deltas
- **Continual learning** without catastrophic forgetting
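As a concrete illustration of the delta idea, here is a minimal numpy sketch of a LoRA-style low-rank update: the base matrix stays frozen and the change is stored as two small factors. The shapes and initialization are illustrative assumptions rather than any particular implementation.

```python
import numpy as np

d, r = 1024, 8                            # layer width vs. delta rank (r << d)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))           # frozen base weight matrix
A = rng.standard_normal((r, d)) * 0.01    # small trainable factor
B = np.zeros((d, r))                      # second factor, initialised to zero

def forward(x):
    # Base path plus the low-rank delta: W x + B (A x).
    return W @ x + B @ (A @ x)

# The delta costs 2*d*r numbers instead of d*d, and folding it back into
# the base weights is a single algebraic update:
W_merged = W + B @ A
```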
### The Lossless Compression Insight
For lossless schemes like FLAC/PNG, the key insight is: exploit *predictable structure* to encode *residuals* cheaply. In language:
- **Huffman/arithmetic coding**: Assign short codes to frequent symbols
- **LZ77/78**: Reference prior occurrences exactly
- **BWT**: Group similar contexts for run-length encoding
These are deterministic transformations. An analogous approach for transformers (sketched after this list) might:
1. Build a deterministic structure capturing exact relationships (like a suffix trie for context)
2. Encode "exceptions" or residuals where the structure fails
3. Update incrementally when new data arrives
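A minimal sketch of what steps 1–3 could look like, assuming a PPM-flavoured suffix table: exact counts are stored only for contexts actually observed, prediction falls back from longer to shorter suffixes, and adding data touches only the contexts it contains. The class and its interface are hypothetical.

```python
from collections import Counter, defaultdict

class SuffixTable:
    """Exact context -> continuation counts with longest-suffix fallback.

    Counts exist only for contexts actually seen, prediction falls back
    from longer to shorter suffixes, and updates touch only the contexts
    present in the new data.
    """
    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)   # context tuple -> Counter of next tokens

    def update(self, tokens):
        for i, tok in enumerate(tokens):
            for n in range(self.max_order + 1):
                if i - n < 0:
                    break
                self.counts[tuple(tokens[i - n:i])][tok] += 1

    def predict(self, context):
        for n in range(self.max_order, -1, -1):
            suffix = tuple(context[-n:]) if n else ()
            if suffix in self.counts:
                return self.counts[suffix].most_common(1)[0][0]
        return None

model = SuffixTable()
model.update("the cat sat on the mat".split())
print(model.predict("sat on the".split()))  # 'mat'
```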
### Why Gradient Descent Still Dominates
The core issue: **generalization**. Deterministic/lossless schemes excel at reproducing *seen* patterns. Language requires:
- Compositional generalization to unseen combinations
- Semantic similarity (synonymy, paraphrase)
- Pragmatic inference
These seem to require the "soft" interpolation that continuous representations provide. But perhaps there are algebraic structures (lattices, type systems, category theory) that could capture semantic relationships discretely.
### Further Reading
1. "Scaling Laws for Neural Language Models" (Kaplan et al.)
2. "Language Models as Knowledge Bases?"
3. "Gzip is All You Need" — compression-based classifiers
4. "Neural Networks and Kolmogorov Complexity"
5. Work on mechanistic interpretability
---
## Part 2: Locality-Sensitive Hashing and Collision-Based Structure
### The Key Insight
In cryptographic hashes, collision = failure. In locality-sensitive hashing (LSH):
- **Collision = similarity** (by design)
- Hamming distance between hashes ≈ semantic/structural distance between inputs
- No training required—the similarity structure is baked into the algorithm
### Nilsimsa and Related Schemes
**Nilsimsa** specifically:
- 256-bit output
- Trigram-based analysis
- Designed for near-duplicate text detection
- Deterministic, reproducible
### Hash Family Comparison
| Hash Type | Similarity Preserved | Properties |
|-----------|---------------------|------------|
| **Nilsimsa** | Edit distance (text) | 256-bit, trigram-based |
| **SimHash** | Cosine similarity | Bit-sampling of projections |
| **MinHash** | Jaccard (set overlap) | Permutation-based |
| **GeoHash** | Spatial proximity | Hierarchical, prefix-comparable |
| **TLSH** | Byte distribution | Locality-sensitive, diff-friendly |
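For concreteness, here is a minimal SimHash-style sketch over character trigrams. It is illustrative only and is not the actual Nilsimsa algorithm (which has its own trigram windowing and accumulator), but it shows the collision-equals-similarity property the table describes.

```python
import hashlib

def simhash(text, bits=64):
    """Similarity-preserving digest over character trigrams (illustrative;
    not the actual Nilsimsa algorithm). Each trigram votes on every bit
    position via an ordinary hash; near-duplicates share most trigrams,
    so their digests differ in only a few bits."""
    votes = [0] * bits
    for i in range(len(text) - 2):
        tri = text[i:i + 3].encode()
        h = int.from_bytes(hashlib.blake2b(tri, digest_size=8).digest(), "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

h1 = simhash("the quick brown fox jumps over the lazy dog")
h2 = simhash("the quick brown fox jumped over the lazy dog")
h3 = simhash("arithmetic coding achieves the entropy rate")
print(hamming(h1, h2), hamming(h1, h3))  # near-duplicates: small distance; unrelated: larger
```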
### Connection to Neural Attention
Transformer attention as "find similar keys for this query" can be compared to LSH:
```
Neural attention: softmax(QK^T / √d) → learned similarity
LSH attention: hash(q) collides with hash(k) → designed similarity
```
The Reformer paper (Kitaev et al., 2020) exploited exactly this—using LSH to reduce attention from O(n²) to O(n log n). But they still learned the embeddings; LSH just accelerated the lookup.
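A simplified sketch of the bucketing idea, assuming random-hyperplane hashing: queries and keys that point in similar directions land in the same bucket, and each query attends only within its bucket. Reformer itself uses shared query/key projections and multiple hash rounds; this sketch only shows the partitioning.

```python
import numpy as np

def lsh_buckets(vectors, n_planes=4, seed=0):
    """Bucket id for each row via random-hyperplane signs: vectors with high
    cosine similarity tend to fall on the same side of every plane."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, vectors.shape[1]))
    signs = (vectors @ planes.T) > 0                        # (n, n_planes) booleans
    return (signs * (1 << np.arange(n_planes))).sum(axis=1)

def bucketed_attention(q, k, v, buckets_q, buckets_k):
    """Each query attends only to keys in its own bucket: roughly
    O(n * bucket_size) instead of O(n^2)."""
    out = np.zeros((len(q), v.shape[1]))
    for i, b in enumerate(buckets_q):
        mask = buckets_k == b
        if not mask.any():
            continue                                        # empty bucket: no context
        scores = (k[mask] @ q[i]) / np.sqrt(q.shape[1])
        weights = np.exp(scores - scores.max())
        out[i] = (weights / weights.sum()) @ v[mask]
    return out

q = k = np.random.default_rng(1).standard_normal((16, 32))
v = np.random.default_rng(2).standard_normal((16, 32))
buckets = lsh_buckets(k)
out = bucketed_attention(q, k, v, buckets, buckets)
```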
### A More Radical Proposal
Replace learned embeddings entirely with deterministic similarity hashes (a minimal sketch follows below):
1. Hash each token/context with Nilsimsa or SimHash
2. Similar contexts → similar hashes (no learning needed)
3. Store associations in a hash-addressed memory
4. Retrieval via Hamming distance lookup
5. Incremental updates: just add new hashes, no retraining
This resembles older approaches:
- **Random Indexing**: Assign random sparse vectors to words, accumulate context
- **Count-based distributional semantics** (LSA, HAL)
- **Bloom filters for set membership**
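A minimal sketch of the hash-addressed memory in steps 1–5 above. It works with any int-valued locality-sensitive hash (for instance the `simhash` sketch earlier); the linear scan over buckets is for clarity, whereas a real index would organise digests for sub-linear Hamming search.

```python
from collections import defaultdict

class HashAddressedMemory:
    """(context hash -> stored continuations), queried by Hamming distance.

    hash_fn is any int-valued locality-sensitive hash. Adding data never
    rewrites existing entries, so updates are purely incremental.
    """
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.table = defaultdict(list)              # digest -> list of continuations

    def add(self, context, continuation):
        self.table[self.hash_fn(context)].append(continuation)

    def query(self, context, radius=8):
        h = self.hash_fn(context)
        hits = []
        for digest, items in self.table.items():
            d = bin(h ^ digest).count("1")          # Hamming distance between digests
            if d <= radius:
                hits.extend((d, item) for item in items)
        return sorted(hits, key=lambda t: t[0])     # nearest buckets first
```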
### Collision-as-Structure
If collisions encode similarity, then:
- The hash function defines a **fixed partition** of input space
- Each bucket contains "similar" items (by the hash's definition)
- Exact lookup tables can be built per bucket
- Updates are local—adding data only affects its bucket
This is essentially **product quantization** or **vector quantization**, but deterministic rather than learned.
### Building a Generative Model
For retrieval/classification, hash-based approaches work. "Gzip is All You Need" showed compression-distance classifiers are surprisingly competitive.
For generation:
1. Given context hash, retrieve associated continuations
2. Combine/blend when multiple buckets are relevant
3. Handle novel combinations (generalization)
The challenge: LSH captures **surface** similarity (n-grams, character patterns). Semantic similarity ("happy" ≈ "joyful") requires either:
- A semantically-aware hash function (but how without learning?)
- A composition mechanism that derives semantics from surface patterns
### Hybrid Architectures
**1. Hash-addressed memory + minimal learned integration**
```
context → LSH hash → retrieve candidates → small learned model → output
```
RETRO, REALM, RAG do variants of this, but with learned retrievers.
**2. Hierarchical hashing** (sketched below)
- Coarse hash for topic/domain
- Fine hash for local context
- Exact storage at leaves
- Like a trie but with similarity-aware branching
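A small sketch of the hierarchical idea, using the top bits of a single LSH digest as a GeoHash-style coarse prefix and comparing full digests only within that shard. `hash_fn` is an assumed int-valued LSH; a real system would also probe neighbouring prefixes, since similar digests can straddle a prefix boundary.

```python
from collections import defaultdict

class PrefixShardedStore:
    """Hierarchical lookup from a single LSH digest: the top prefix_bits act
    as a coarse shard key, full digests are compared only within that shard,
    and exact items live at the leaves."""
    def __init__(self, hash_fn, bits=64, prefix_bits=8):
        self.hash_fn, self.bits, self.prefix_bits = hash_fn, bits, prefix_bits
        self.shards = defaultdict(list)             # prefix -> [(digest, item), ...]

    def _prefix(self, digest):
        return digest >> (self.bits - self.prefix_bits)

    def add(self, context, item):
        h = self.hash_fn(context)
        self.shards[self._prefix(h)].append((h, item))

    def query(self, context, radius=8):
        h = self.hash_fn(context)
        hits = []
        for digest, item in self.shards[self._prefix(h)]:
            d = bin(h ^ digest).count("1")
            if d <= radius:
                hits.append((d, item))
        return sorted(hits, key=lambda t: t[0])
```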
**3. Collision-based training signal**
Instead of gradient descent, update based on whether the right things collide:
- If semantically similar items don't collide → adjust hash parameters
- This is "learning to hash" but could potentially be done algebraically
### A Concrete Experiment
To test this approach:
1. Build a Nilsimsa-indexed store of (context_hash → next_token) pairs
2. At inference: hash context, retrieve from all buckets within Hamming distance k
3. Vote/blend retrieved continuations (weighted by inverse Hamming distance?)
4. Measure perplexity vs. n-gram baseline vs. small transformer
The "blend" step is where something learned might still be needed, or might admit an algebraic solution.
---
## Open Questions
1. Can purely deterministic, LSH-based language models achieve competitive performance without gradient-trained components?
2. What algebraic structures (lattices, type systems, category theory) might capture semantic relationships discretely?
3. Is there a way to construct (not train) the similarity structure that transformers learn?
4. Can incremental/delta encoding approaches (like LoRA) be made fully algebraic?
5. What's the minimal learned component needed when combining hash-based retrieval with generation?
---
## Summary
The current paradigm of gradient-based training on continuous representations may be suboptimal for discrete data like language. Deterministic approaches from compression theory (lossless coding) and similarity hashing (LSH, Nilsimsa) offer potential alternatives that could:
- Eliminate rounding errors and precision limits
- Enable true incremental updates
- Provide more interpretable, algebraic structure
- Reduce computational overhead
The key research gap is in **generation** and **compositional generalization**—can algebraic operations on discrete representations achieve what soft attention accomplishes through interpolation?
Replies (1)
Here’s a tight, on-point pivot:
---
This is still retrieval + similarity partitioning.
LSH gives locality.
Compression gives efficiency.
But neither gives compositional closure.
If two contexts collide, you can retrieve them —
but you can’t compose them into new valid states without blending or interpolation.
That’s still similarity space thinking.
Elliptic traversal isn’t about buckets.
It’s about motion in a closed algebra:
deterministic composition
defined inverses
bounded state growth
structure preserved under operation
Hashing partitions space.
Group operations generate space.
Different primitive.
That’s the shift.