NVIDIA just dropped a paper that might solve the biggest trade-off in LLMs.
Speed vs. Quality.
Autoregressive models (like GPT) are smart but slow - they generate one token at a time, leaving most of your GPU sitting idle.
Diffusion models are fast but often produce incoherent outputs.
TiDAR gets you both in a single forward pass.
Here's the genius part:
Modern GPUs have spare capacity in every forward pass: they can score far more token positions than one-token-at-a-time decoding actually uses. TiDAR exploits these "free slots" by:
1. Drafting multiple tokens at once using diffusion (the "thinking" phase)
2. Verifying them using autoregression (the "talking" phase)
Both happen simultaneously using smart attention masks - bidirectional for drafting, causal for verification.
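The mask idea can be sketched in a few lines. This is a toy illustration of combining a causal region (committed tokens) with a bidirectional region (draft slots) in one attention mask, not the paper's exact layout; `n_ctx`, `n_draft`, and the function name are made up for the example.

```python
def tidar_attention_mask(n_ctx: int, n_draft: int) -> list[list[bool]]:
    """Toy hybrid mask: mask[i][j] is True if position i may attend to j.
    Committed tokens (first n_ctx positions) attend causally; draft slots
    (last n_draft positions) see the full context plus each other."""
    n = n_ctx + n_draft
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_ctx:
                # committed tokens: standard causal visibility
                mask[i][j] = j <= i
            else:
                # draft slots: bidirectional, but only draft slots see them
                mask[i][j] = i >= n_ctx
    return mask
```

Because both regions live in one mask, a single forward pass scores the causal (verification) positions and the bidirectional (drafting) positions at the same time.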
The results:
↳ 4.71x faster at 1.5B parameters with zero quality loss
↳ Nearly 6x faster at 8B parameters
↳ First architecture to outperform speculative decoding (EAGLE-3)
↳ Works with standard KV caching, unlike pure diffusion models
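The verification side works like speculative decoding's accept/reject step: keep the longest prefix of the draft that the autoregressive pass agrees with. A minimal sketch (function name and token representation are my own, purely illustrative):

```python
def accept_drafts(draft: list[int], verified: list[int]) -> list[int]:
    """Speculative-style acceptance: keep the longest prefix of the
    diffusion draft that the causal verifier also predicts, then take
    the verifier's token at the first disagreement."""
    out = []
    for d, v in zip(draft, verified):
        if d != v:
            out.append(v)  # verifier wins at the first mismatch
            break
        out.append(d)
    return out
```

On a good draft, several tokens are committed per forward pass; on a bad one, you still make one token of progress, so quality never drops below the autoregressive baseline.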
The training trick is clever too - instead of randomly masking tokens, they mask everything. This gives stronger learning signals and enables efficient single-step drafting.
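The all-masked objective is easy to picture: instead of masking a random subset, the entire draft window is masked, so every slot produces a loss term. A toy sketch (the `MASK` id and helper are hypothetical, not from the paper):

```python
MASK = -1  # hypothetical mask-token id

def make_training_example(tokens: list[int], n_draft: int):
    """Toy all-masked drafting example: the model sees the prefix plus a
    fully masked draft window and must predict every slot, so each
    position contributes a gradient (no random masking ratio to tune)."""
    prefix, targets = tokens[:-n_draft], tokens[-n_draft:]
    model_input = prefix + [MASK] * n_draft
    return model_input, targets
```

This also matches inference, where the draft window starts fully masked and is filled in one diffusion step.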
If you're building real-time AI agents where latency kills the experience, this architecture is worth paying attention to.
@akshay_pachaar X.com
Replies (1)
There is another undiscussed "dilemma".
https://primal.net/e/nevent1qqs9scce7n3460gdusru38wf5kg4s92rfr67300nn2vescg9jcccmsqx0upsu