NVIDIA just dropped a paper that might solve the biggest trade-off in LLMs.
Speed vs. Quality.
Autoregressive models (like GPT) are smart but slow - they generate one token at a time, leaving most of your GPU sitting idle.
Diffusion models are fast but often produce incoherent outputs.
TiDAR gets you both in a single forward pass.
Here's the genius part:
Modern GPUs have spare capacity in every forward pass: they can score far more token positions than one-token-at-a-time decoding actually uses. TiDAR exploits these "free slots" by:
1. Drafting multiple tokens at once using diffusion (the "thinking" phase)
2. Verifying them using autoregression (the "talking" phase)
Both happen simultaneously using smart attention masks - bidirectional for drafting, causal for verification.
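The mask idea can be sketched in a few lines. This is a toy illustration of combining a causal region (committed tokens) with a bidirectional region (draft slots) in one attention mask, not the paper's exact layout; `n_ctx`, `n_draft`, and the function name are made up for the example.

```python
def tidar_attention_mask(n_ctx: int, n_draft: int) -> list[list[bool]]:
    """Toy hybrid mask: mask[i][j] is True if position i may attend to j.
    Committed tokens (first n_ctx positions) attend causally; draft slots
    (last n_draft positions) see the full context plus each other."""
    n = n_ctx + n_draft
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_ctx:
                # committed tokens: standard causal visibility
                mask[i][j] = j <= i
            else:
                # draft slots: bidirectional, but only draft slots see them
                mask[i][j] = i >= n_ctx
    return mask
```

Because both regions live in one mask, a single forward pass scores the causal (verification) positions and the bidirectional (drafting) positions at the same time.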
The results:
↳ 4.71x faster at 1.5B parameters with zero quality loss
↳ Nearly 6x faster at 8B parameters
↳ First architecture to outperform speculative decoding (EAGLE-3)
↳ Works with standard KV caching, unlike pure diffusion models
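The verification side works like speculative decoding's accept/reject step: keep the longest prefix of the draft that the autoregressive pass agrees with. A minimal sketch (function name and token representation are my own, purely illustrative):

```python
def accept_drafts(draft: list[int], verified: list[int]) -> list[int]:
    """Speculative-style acceptance: keep the longest prefix of the
    diffusion draft that the causal verifier also predicts, then take
    the verifier's token at the first disagreement."""
    out = []
    for d, v in zip(draft, verified):
        if d != v:
            out.append(v)  # verifier wins at the first mismatch
            break
        out.append(d)
    return out
```

On a good draft, several tokens are committed per forward pass; on a bad one, you still make one token of progress, so quality never drops below the autoregressive baseline.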
The training trick is clever too - instead of randomly masking tokens, they mask everything. This gives stronger learning signals and enables efficient single-step drafting.
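The all-masked objective is easy to picture: instead of masking a random subset, the entire draft window is masked, so every slot produces a loss term. A toy sketch (the `MASK` id and helper are hypothetical, not from the paper):

```python
MASK = -1  # hypothetical mask-token id

def make_training_example(tokens: list[int], n_draft: int):
    """Toy all-masked drafting example: the model sees the prefix plus a
    fully masked draft window and must predict every slot, so each
    position contributes a gradient (no random masking ratio to tune)."""
    prefix, targets = tokens[:-n_draft], tokens[-n_draft:]
    model_input = prefix + [MASK] * n_draft
    return model_input, targets
```

This also matches inference, where the draft window starts fully masked and is filled in one diffusion step.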
If you're building real-time AI agents where latency kills the experience, this architecture is worth paying attention to.
@akshay_pachaar X.com
Replies (1)
There is another undiscussed "dilemma".
https://primal.net/e/nevent1qqs9scce7n3460gdusru38wf5kg4s92rfr67300nn2vescg9jcccmsqx0upsu