Anyone have any luck getting local LLMs to perform well enough to actually do work with?
Would love to replace Claude Code, etc. and go without AI subscriptions, but so far the experience is very bad.
Replies (29)
I'm in the same boat. Tried Qwen and so forth, but the parameter size was always limited by my hardware. Would be dope to keep it local.
I just got a Framework desktop with lots of beefy hardware and still get bad experiences. I can't tell if we're just too early for local AI.
Who knows if local will ever catch up. Either way, I'm jealous of the new Framework. Sounds sweet!
Omarchy ftw!
I gotta find time this long weekend to figure out whether it's really a technical reason these local LLMs suck, or if I'm just ignorant and need to do a better job configuring this stuff.
CC: nostr:nprofile1qqsvyxc6dndjglxtmyudevttzkj05wpdqrla0vfdtja669e2pn2dzuqppemhxue69uhkummn9ekx7mp0qy2hwumn8ghj7un9d3shjtnyv9kh2uewd9hj7dkhm43
My best stack is Cline with Codestral. It's not good enough for complete coding, but it's enough for bootstrapping things and small agentic stuff.
lol DHH posted a blog post about exactly this today.
https://world.hey.com/dhh/local-llms-are-how-nerds-now-justify-a-big-computer-they-don-t-need-af2fcb7b
OSS models just aren’t good enough today. I use one on my machine, but I primarily save my more “sensitive” questions for it. I get decent output for coding tasks, but I’m not asking it to actually generate code, just asking questions about syntax, structure, design, etc.
My article explains how to install Ollama and Open WebUI through Docker. You need to give the model web search capability and feed it relevant docs.
I'll be starting research into Docker and SearXNG so I can write more guides and maybe eventually develop an open source app.
Most tutorials online are extremely insecure.
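For reference, the standard tutorial setup looks roughly like this (a minimal sketch based on the upstream Ollama and Open WebUI images; `--gpus=all` assumes an NVIDIA card with the container toolkit installed). Note that it publishes Ollama's API port on the host, which is exactly the kind of exposure I'm talking about:

```sh
# Ollama in a container, API published on port 11434 (the usual tutorial default)
docker run -d --name ollama --gpus=all \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# Open WebUI, pointed at Ollama through the host gateway
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

# Pull a model inside the Ollama container
docker exec ollama ollama pull gpt-oss:20b
```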
When you’re running a model, run `ollama ps` (or `docker exec ollama ollama ps`) to see how much GPU/CPU it’s using. Models that fit entirely in VRAM run at 40+ tokens per second. Models that offload to CPU/RAM are *much* slower, 8-20 tokens per second. You want `ollama ps` to show the model as 100% loaded on the GPU.
But I haven’t messed much with AI coding. I assume Qwen3, Gemma 3, and GPT-OSS 20B are all good. GPT-OSS 20B is a mixture-of-experts model, meaning it only ever has about 3.6B active parameters and takes something like 14GB of RAM. You can probably run it on CPU; it is extremely good. You'll also want RAG.
Qwen can be pretty dumb, though.
But yeah, if you have less than 8GB of VRAM, it’s pretty rough.
However, if you have a good amount of RAM and a good CPU, you can get decent speeds on CPU alone. I only have 8GB of VRAM, so when I run GPT-OSS 20B it offloads to the CPU (I’d need 16GB of VRAM to run it fully on the GPU). It’s much smarter than Qwen and it still runs at usable speeds.
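If you want actual numbers for your own box, `ollama run --verbose` prints timing stats after each reply; a quick sketch (gpt-oss:20b is just the tag I happen to run):

```sh
ollama pull gpt-oss:20b
ollama run gpt-oss:20b --verbose
# After each reply it prints stats: "eval rate" is the generation speed in tokens/s,
# and "prompt eval rate" is how fast it chews through your prompt/context.
# In another terminal, `ollama ps` shows whether the model landed on GPU, CPU, or a split.
```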
But yeah, this whole project is focused on making an assistant that’s as close to Gemini 3 Pro in helpfulness as possible, if not better.
Not yet. You can use quantized models and get OK performance on a good machine, but if you're looking to replace something like Claude, you're probably still a year or two out; even a high-end machine is effectively running models a year or two behind the frontier.
I built a machine just for LLMs and it's good enough as a search engine replacement, but way too slow for coding or other highly complex tasks.
I just read this after I posted my note 😂
I love DHH but don't want to simply take his word for it... verify!
Which local models do you use?
I just got a beefy desktop from Framework (128GB of VRAM) but I can't even get OK performance. I'm not sure if I'm expecting too much or if I'm just a n00b when it comes to self-hosted AI.
Great article, thanks for sharing!
Not using Docker personally for Ollama, just running it in a shell tab locally on my Linux box. I have more than enough VRAM, still bad results... might be me doing something stupid.
Any other articles that helped you out in your learning journey?
I'll check out Cline, thanks!
Help me be a sovereign vibe-coder
nostr:nevent1qvzqqqqqqypzph683mxll6gahyzxntkwmr26x0hu4tm4yvdmf4mguug3s3y4zpa8qyv8wumn8ghj7enfd36x2u3wdehhxarj9emkjmn99uq3jamnwvaz7tmswfjk66t4d5h8qunfd4skctnwv46z7qguwaehxw309aex2mrp0yhx5etjwdjhjurvv438xtnrdakj7qpqp03hatnlvkvw3gl6w2g90m5rc52vvswl9u2dsq7s6zgqlq0cwmasuwpfxd
Only mini models on non-purpose-built hardware. To run full models you'd need a machine tailored for running AI models. My experience is with Ollama only.
I’ve switched around a bunch but mostly the OpenAI OSS models or Qwen coding models
Running Ollama directly may introduce security vulnerabilities. From my research, it’s best to run it through Docker; performance should be the same.
I haven’t found many good guides. I wrote mine because none of the guides I followed worked without exposing either app to the host network.
My guide was inspired by this video, which might help. His setup didn’t work for me, though:
https://youtu.be/qY1W1iaF0yA
I'll be updating the guide as I learn how to improve my process. I might switch to Docker Compose, or I might make a startup script that sets everything up and hardens it for security. I might take this so far as to develop a full app so people stop potentially exposing their machines to the internet just to run local AI.
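Roughly the direction I mean for the startup script, as a sketch (the network and container names are just placeholders I picked): put both containers on a private Docker network, don't publish Ollama's port at all, and bind the WebUI to localhost only.

```sh
#!/usr/bin/env sh
# Sketch of a hardened startup script: Ollama is never reachable from the LAN,
# and the two containers talk over their own Docker network.
set -eu

docker network create llm 2>/dev/null || true

# Ollama: no published ports; only reachable as http://ollama:11434 inside the network
docker run -d --name ollama --network llm --gpus=all \
  -v ollama:/root/.ollama \
  ollama/ollama

# Open WebUI: the only published port, bound to the loopback interface only
docker run -d --name open-webui --network llm \
  -p 127.0.0.1:3000:8080 \
  -e OLLAMA_BASE_URL=http://ollama:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```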
You probably don’t have the GPU configured correctly. I recommend just starting over lol
And remember, models take more space than their weights. So if a model has 7GB of weights, it still might not fit on an 8GB VRAM card, because it also needs memory for the context window and other overhead. For example, an 8GB model like gemma3:12b actually needs around 10GB.
Run `ollama ps` to see if the model is loaded to your CPU or GPU (or both)
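A quick way to sanity-check the fit before you commit to a model (nvidia-smi assumes an NVIDIA card; gemma3:12b is just the example from above):

```sh
ollama show gemma3:12b    # parameter count, quantization, context length
nvidia-smi --query-gpu=memory.total,memory.free --format=csv   # what your card actually has free
ollama ps                 # after loading: SIZE includes the context overhead,
                          # and PROCESSOR shows how much spilled over to the CPU
```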
The minimum requirement is 32GB of fast memory. I have 16GB with AI accelerators onboard, and its limit is 20B models, which is barely enough to write decent, non-hallucinatory text.
I'm exploring a new training technique called HRM, which involves "augmenting" the data by shuffling and rearranging it to expand the dataset, then verifying the modified versions still pass all the tests on the code. The technique seems to dramatically improve reasoning on combinatorial problems, and source code is dramatically more complex than any common type of game system.
There's also a crop of mini PCs and laptops with fast unified memory (LPDDR-8000 and such) and memory partitioning. These likely aren't great for standard training programs, but maybe specialized training like HRM could yield mini models around 7B that know C better than Linus Torvalds and Ken Thompson.
Also, aside from Nvidia's chips, AMD's offerings are quite competitive, and Google is about to start releasing hardware built on its TPU engines, whose systolic arrays eliminate the von Neumann bottleneck of shuttling data between processor and memory, traversing the model graphs maybe as much as 2x more efficiently (in speed or power).
I'd say within two years you'll be able to host your own coding agent. Agent design will likely progress a lot too. We even have an agent dev here on Nostr, the Gleasonator, with his Shakespeare model and agent UI.
I have a Framework desktop with 128GB of VRAM.
Even the gpt-oss:120b model runs with like half my VRAM still free.
I don't think it's a raw hardware problem, but the tooling around it seems to break more. Like, once the model calls a tool I lose all context... it's strange.
I have 128GB of VRAM (Framework desktop). Watching it process regular prompts in btop suggests I have more than enough resources for a single user.
Most of my issues arise when I use coding tools like codename-goose to actually make the model read and write files. It seems to lose its train of thought whenever it tries to use a tool, and then it starts over.
Well, that's pretty awesome to confirm. Most likely the cause is that the model you used doesn't understand tools, and the agent wipes the chat log in some kind of defensive response to the tool failure.
Try some other models. In LM Studio, the ones with a green hammer icon are what you should be using, ideally with the reasoning icon as well (I forget exactly what it's called). Coding agents need tools and reasoning to function.
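If you're on Ollama rather than LM Studio, you can also test tool support directly against its chat API, outside of any agent; a rough sketch (the model tag and the weather function are just made-up examples):

```sh
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:14b",
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "stream": false,
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
# A tool-capable model replies with a "tool_calls" entry in the message;
# one that can't will just answer in plain text (or choke), which is when agents tend to bail.
```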
Ahhhh that explains everything. You don’t have a discrete graphics card. Your computer uses an integrated GPU on the APU.
Having 128GB of actual VRAM would be extraordinary. An RTX 5090 has 32GB of VRAM. An H200 (a $40k AI GPU) has 141GB.
Your computer uses unified memory. So it uses regular RAM (LPDDR5x I assume) as VRAM.
This is extremely efficient and improves capability, but it's slow compared to a dedicated graphics card, roughly two to four times slower: LPDDR5x is up to four times slower than GDDR7 and up to two times slower than GDDR6. You should still be able to run at usable speeds, around reading speed (slow reading speed in some cases).
I expect 8-20 tokens per second when I'm using RAM and 30-80 tokens per second in VRAM (I have DDR5 RAM and GDDR6 VRAM). 10 tps is about reading speed, and 80 is whole-paragraphs-appearing speed. I haven't tried running a fast model like gemma3:4b on CPU/RAM. You might be able to go faster on CPU than I do, considering that's what yours is built for. For reference, I have a Ryzen 7 7700X.
I’m not sure about the tooling and context thing.
I might have misunderstood. Not sure where I got that your models were going slow.
Extending functionality is indeed confusing.
My mistake for not communicating that
I looked into this as well but haven't found a good solution yet.
A model like Kimi K2 should surely have enough punch, but it demands quite a lot of hardware.