I have a Framework Desktop with 128GB of VRAM.
Even the gpt-oss:120b model runs with like half my VRAM still free.
I don't think it's a raw hardware problem, but the tooling around it seems to break more. Like once the model calls a tool, I lose all context... it's strange.
Ahhhh that explains everything. You don’t have a discrete graphics card. Your computer uses an integrated GPU on the APU.
Having 128GB of actual VRAM would be extraordinary. An RTX 5090 has 32GB of VRAM. An H200 ($40k AI GPU) has 141GB.
Your computer uses unified memory. So it uses regular RAM (LPDDR5x I assume) as VRAM.
Unified memory is great for capacity, since you can load models far larger than any consumer card could hold, but its bandwidth is much lower than a dedicated graphics card's, roughly 2 to 4 times lower. LPDDR5X is up to four times slower than GDDR7 and up to two times slower than GDDR6. You should still get usable speeds, around reading speed (slow reading speed in some cases).
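To put rough numbers on that: when generating, the bottleneck is usually how fast the weights can be streamed out of memory, so tokens per second scales roughly with bandwidth. A quick back-of-envelope sketch (the bandwidths and the GB-read-per-token figure are assumptions I made up for illustration, not measured specs for your machine):

```python
# Back-of-envelope decode speed, assuming generation is memory-bound:
# every new token requires streaming the model's active weights once,
# so tokens/sec is roughly bandwidth divided by bytes read per token.
# All figures below are rough assumptions, not specs for any real machine.

def est_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper-bound tokens/sec ~= usable bandwidth / GB read per token."""
    return bandwidth_gb_s / gb_read_per_token

# Assumed bandwidths (ballpark): LPDDR5X unified memory ~256 GB/s,
# a GDDR6 card ~500 GB/s, a GDDR7 card ~1000 GB/s.
# Assume the model reads ~4 GB of weights per token (this varies a lot
# with model architecture and quantization).
for label, bw in [("LPDDR5X", 256), ("GDDR6", 500), ("GDDR7", 1000)]:
    print(f"{label:8s} ceiling: ~{est_tps(bw, 4.0):.0f} tok/s")
```

That ceiling ignores compute, prompt processing, and overhead, so real numbers land well below it, but the ratios line up with the 2-4x bandwidth gap above.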
I expect 8-20 tokens per second when I'm running from RAM and 30-80 tokens per second in VRAM (I have DDR5 RAM and GDDR6 VRAM). 10 tps is about reading speed and 80 is more like whole paragraphs appearing at once. I haven't tried running a fast model like gemma3:4b on CPU/RAM. You might be able to go faster than my CPU numbers, since your machine is built for exactly that. For reference, I have a Ryzen 7 7700X.
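If you want real numbers instead of guesses, and assuming you're running these through Ollama (the gpt-oss:120b / gemma3:4b naming suggests that), the API already reports token counts and timings you can turn into tokens per second:

```python
# Measure what you're actually getting, assuming the model runs in Ollama
# (the gpt-oss:120b / gemma3:4b names suggest it). Model and prompt are
# just examples; swap in whatever you're testing.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:120b",
        "prompt": "Explain unified memory in two sentences.",
        "stream": False,
    },
    timeout=600,
).json()

# Ollama reports eval_count (tokens generated) and eval_duration (nanoseconds)
# in the non-streaming response.
tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tps:.1f} tokens/sec")
```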
I’m not sure about the tooling and context thing.
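One guess, though: a very common cause of "losing all context" after a tool call is the client not appending the assistant's tool-call turn and the tool's result back into the message history before the next request. Here's a minimal sketch of a loop that does keep them, assuming Ollama's /api/chat endpoint; the get_weather tool and its schema are made up for illustration:

```python
# Sketch of a tool-call loop that keeps context, assuming Ollama's /api/chat
# endpoint and an OpenAI-style tool schema. The get_weather function and its
# schema are made-up examples; the key point is that BOTH the assistant's
# tool-call message and the tool's result get appended to the history.
import json
import requests

def get_weather(city: str) -> str:          # hypothetical local tool
    return json.dumps({"city": city, "temp_c": 21})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Berlin?"}]

while True:
    reply = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "gpt-oss:120b", "messages": messages,
              "tools": tools, "stream": False},
        timeout=600,
    ).json()["message"]

    messages.append(reply)                   # keep the assistant's turn
    if not reply.get("tool_calls"):
        break                                # no tool requested: final answer

    for call in reply["tool_calls"]:
        args = call["function"]["arguments"]
        result = get_weather(**args)         # run the requested tool
        # Feed the result back in; dropping this is what "loses" context.
        messages.append({"role": "tool", "content": result})

print(reply["content"])
```

If whatever frontend you're using doesn't do the equivalent of that append step, every tool call effectively starts the conversation over.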