




  • Yes, I do locally host several models, mostly the Qwen3 family stuff like 30B A3B etc. I have been trying GLM 4.5 a bit through OpenRouter and I’ve been liking the style pretty well. Interesting to know I could potentially just pop in some larger RAM DIMMs and run even larger models locally. The thing is, OR is so cheap for many of these models, and with zero-data-retention policies available, I feel a bit stupid for even buying a 24 GB VRAM GPU to begin with.
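    For anyone curious, a partial-offload setup looks roughly like this (a minimal sketch assuming llama-cpp-python; the model path, layer split, and context size are placeholders for my machine, not a recommendation):

    ```python
    # Rough sketch: partial GPU offload with llama-cpp-python.
    # Model path and the numbers below are placeholders, not tuned values.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Qwen3-30B-A3B-Q4_K_M.gguf",
        n_gpu_layers=36,   # layers kept on the 3090; the rest sit in system RAM
        n_ctx=16384,       # bigger context = bigger K/V cache = more memory
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Quick sanity check: are you loaded?"}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])
    ```

    This is also why adding bigger DIMMs helps: whatever doesn’t fit in VRAM just spills into system RAM, at the cost of speed.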



  • Honestly it has been good enough until recently, when I’ve been struggling specifically with Docker networking stuff and it’s been on the struggle bus with that. Yes, I’m using OpenRouter via OpenWebUI. I used to run a lot of stuff locally (mostly 4-bit quants of 32B-and-smaller models, since I only have a single 3090), but lately I’ve been trying larger models out on OpenRouter since many of the non-proprietary ones are super cheap. Like fractions of a penny for a response… Many are totally free up to a point as well.
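    Under the hood it’s just an OpenAI-compatible endpoint, so a minimal sketch of hitting OpenRouter directly looks like this (the env var name and model slug here are just examples, not what OpenWebUI does internally):

    ```python
    # Minimal sketch of calling OpenRouter with the openai client.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],  # example env var name
    )

    resp = client.chat.completions.create(
        model="qwen/qwen3-coder",  # example slug; check the OpenRouter catalog
        messages=[{"role": "user", "content": "Why can't my containers on different Docker networks reach each other?"}],
    )
    print(resp.choices[0].message.content)
    ```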



  • I definitely have been looking out for this for a while. I’ve been wanting to replicate GPT Deep Research but haven’t seen a great way to do it. I did see that there was an OWUI tool for this, but it didn’t seem particularly battle-tested, so I hadn’t checked it out yet. I’ve been curious about how the new Tongyi Deep Research might be…

    That said, specifically for troubleshooting somewhat esoteric (or at least quite bespoke, in terms of configuration) software problems, I was hoping the larger coder-focused models would have enough built-in knowledge to suss out the issues. Maybe I should be having them consistently augment their responses with web searches if that isn’t the case? I typically haven’t been clicking that button.

    I do generally try to paste in or link as much of the documentation as I can for whatever software I’m troubleshooting, though.


  • The coder model (480B). I initially mistakenly said the 235B one but edited that. I didn’t know you could customize quant on OpenRouter (and I thought the differences between most modern 4-bit quants and 8-bit were minimal as well…). I have tried GPT-OSS 120B a bunch of times, and though it seems ‘intelligent’ enough, it is just too talkative and verbose for me (plus I can’t remember the last time it responded without somehow working an elaborate comparison table into the response), which makes it too hard to parse through things.
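    If the quant filter works the way OpenRouter’s provider-routing docs describe (I’m assuming here, I haven’t pinned it down myself), it would be something like passing a provider preference in the request body:

    ```python
    # Sketch of requesting a specific quantization via OpenRouter provider routing.
    # The "provider.quantizations" field and the "fp8" value are assumptions on my
    # part based on their docs; model slug is just an example.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="qwen/qwen3-coder",
        messages=[{"role": "user", "content": "ping"}],
        extra_body={"provider": {"quantizations": ["fp8"]}},  # only route to fp8 endpoints
    )
    print(resp.choices[0].message.content)
    ```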







  • It really depends on how you quantize the model, and on the K/V cache as well. This is a useful calculator: https://smcleod.net/vram-estimator/ I can comfortably fit most 32B models quantized to 4-bit (usually Q4_K_M or IQ4_XS) on my 3090’s 24 GB of VRAM with a reasonable context size. If you’re going to need a much larger context window to input large documents etc., then you’d need to go smaller on model size (14B, 27B etc.), get a multi-GPU setup, or use something with unified memory and a lot of RAM (like the Mac Minis others are mentioning).
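    The back-of-envelope math behind that calculator is roughly this (the layer/head counts below are rough values for a 32B-class model and are my own assumptions, so treat the output as a ballpark):

    ```python
    # Back-of-envelope VRAM estimate: quantized weights + K/V cache.

    def weight_gb(n_params_b: float, bits_per_weight: float) -> float:
        """Approximate GB for the quantized weights alone."""
        return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    ctx_len: int, bytes_per_elem: float = 2.0) -> float:
        """K and V tensors, per layer and per token, at the given cache precision."""
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    # ~32B dense model at ~4.5 effective bits/weight for a 4-bit K-quant,
    # with a 16k context and fp16 K/V cache (assumed architecture numbers):
    w = weight_gb(32, 4.5)               # ≈ 18 GB
    kv = kv_cache_gb(64, 8, 128, 16384)  # ≈ 4.3 GB
    print(f"weights ≈ {w:.1f} GB, KV cache ≈ {kv:.1f} GB, total ≈ {w + kv:.1f} GB")
    ```

    That total lands just under 24 GB, which matches what I see in practice; quantizing the K/V cache or shrinking the context is what buys you headroom.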