Look, I get it. You want the biggest model you can get your paws on, but you’re staring at a machine that doesn’t have the RAM to swallow the thing whole. Simon’s write‑up on streaming experts scratches that itch in the most hacker way possible: cheat the memory wall and stream the expert weights off disk. That’s a clever hack, and it feels like the kind of trick that will quietly reshape what “local” means over the next year.
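For anyone who hasn't read it yet: a mixture-of-experts model only activates a handful of experts per token, so the trick is to keep those few in memory and page the rest in from disk when the router asks for them. Here's a toy sketch of that shape in Python; the class names, routing, and memmap approach are my own illustration under those assumptions, not Simon's implementation or any real inference stack:

```python
import numpy as np

class DiskExpert:
    """One expert whose weights live on disk until the router picks it."""

    def __init__(self, path: str, shape: tuple[int, int], dtype=np.float16):
        self.path = path      # file holding this expert's weight matrix
        self.shape = shape
        self.dtype = dtype

    def forward(self, x: np.ndarray) -> np.ndarray:
        # np.memmap maps the file instead of reading it eagerly, so only the
        # pages the matmul actually touches get pulled off the SSD, and the
        # OS page cache keeps recently used experts warm for free.
        w = np.memmap(self.path, dtype=self.dtype, mode="r", shape=self.shape)
        return x @ w

def moe_layer(x: np.ndarray, router_logits: np.ndarray,
              experts: list[DiskExpert], top_k: int = 2) -> np.ndarray:
    # Route to the top-k experts for this token and stream only their weights.
    picked = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[picked])
    gates /= gates.sum()
    return sum(g * experts[i].forward(x) for g, i in zip(gates, picked))
```

The point isn't the code, it's the access pattern: your memory bill drops from "every expert in the model" to "the experts this token happens to hit."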
I’m into it, with one caveat: IO is the new bottleneck, and people are going to pretend it isn’t. If your model needs a constant IV drip from SSD to produce each token, you’re basically trading RAM for latency and complexity. That’s fine when you’re tinkering or running overnight analysis, but it’s not the same as a snappy interactive tool. I don’t want FrameFlow’s users waiting around because my GPU is chatting with an SSD on every token.
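To put a number on that caveat, here's the napkin math I keep doing. Every figure below is an assumption I'm plugging in (expert size, hit rate, NVMe bandwidth), not a measurement from Simon's post, but the shape of the answer is what matters:

```python
# Napkin math: per-token SSD traffic vs. drive bandwidth. All numbers are
# assumptions for illustration, not measurements.
expert_bytes  = 2 * 1.5e9   # one expert at fp16, assuming ~1.5B params per expert
experts_hit   = 8           # experts activated per token (assumed)
miss_rate     = 0.5         # fraction not already sitting in RAM / page cache (assumed)
ssd_bandwidth = 5e9         # ~5 GB/s, a healthy NVMe drive

io_per_token = expert_bytes * experts_hit * miss_rate   # bytes read per token
io_seconds   = io_per_token / ssd_bandwidth             # time spent waiting on the drive

print(f"{io_per_token / 1e9:.0f} GB read per token -> {io_seconds:.1f} s of pure IO")
# 12 GB read per token -> 2.4 s of pure IO, before any compute happens
```

Tweak the assumptions however you like; unless the hot experts stay cached, the drive, not the GPU, sets your token rate.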
Still, the direction is right. If a 1T‑parameter model can limp along on a MacBook Pro and a phone can even run Qwen at 0.6 tok/sec, we’re close to a world where local assistants are normal for dev workflows and content pipelines. That’s a big deal. Here’s the original: https://simonwillison.net/2026/Mar/24/streaming-experts/
P.S. I’m betting “streaming experts” becomes a standard checkbox in local-LLM UIs by summer. The moment it’s one click, everyone’s going to try it.