I mean it in the sense that I can upload a low quality phone photo of a page from a Chinese cookbook and it will OCR it, translate it into English and give me a summary of the ingredients.
I’ve been looking into vision models, but they seem daunting to set up, and the specs say things like 384x384 image resolution, so it doesn’t seem like they would be able to do what I’m looking for. Am I even searching in the right direction?
Try the gemma3 models. They’ve improved quite a bit and can now handle my grocery receipts, which are sometimes barely readable even to the human eye.
Sounds like what I’m looking for! What do you use for inference?
Ok, turned out to be as simple as downloading the llama.cpp binaries, a gemma3 GGUF and an mmproj file, and running it all like this:

./llama-server -m ~/LLM-models/gemma-3-4b-it-qat-IQ4_NL.gguf --mmproj ~/LLM-models/gemma-3-4b-it-qat-mmproj-F16.gguf --port 5002

(Could be even easier if I’d let it download the weights itself and just used the -hf option instead of -m and --mmproj.)
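For reference, the -hf variant should look something like this. I haven’t run this exact line myself, and the repo name is just my guess at the ggml-org upload, so check what’s actually on Hugging Face:

```shell
# Let llama-server fetch the weights from Hugging Face itself.
# Repo name is an assumption -- verify it exists before relying on it.
# Recent llama.cpp builds also pull the matching mmproj automatically
# for multimodal repos, so no separate --mmproj should be needed.
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --port 5002
```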
And now I can use it from my browser at localhost:5002, llama.cpp already provides an interface there that supports images!
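If you’d rather script it than click around in the browser, the same server also speaks the OpenAI-style chat completions format, so something like this should do the OCR + translate in one call (photo.jpg is just an example filename; base64 flags differ between GNU and macOS):

```shell
# Send an image plus a prompt to the local llama-server.
# "-w0" is the GNU coreutils flag for no line wrapping;
# on macOS use plain "base64 -i photo.jpg" instead.
IMG=$(base64 -w0 photo.jpg)

curl -s http://localhost:5002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "OCR this page, translate it to English, and summarize the ingredients."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG"'"}}
      ]
    }]
  }'
```

The reply comes back as JSON; the text is under choices[0].message.content.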
Tested high resolution images and it seems to either downscale them or cut them into chunks, or both, but the main thing is that 20-megapixel photos work fine, even on my laptop with no GPU; they just take a couple of minutes to process. And while the 4b model is not very smart (especially quantized), it could still read and translate text for me.
Need to test more with other models but just wanted to leave this here already in case someone stumbles upon this question and wants to do it themselves. It turned out to be much more accessible than expected.
Check out open webui 10/10 do recommend
No idea your skill level, but try installing open webui and downloading any of the ollama vision models.
There’s a bit of a learning curve to running docker, but ChatGPT can easily get you to the point where it’s running.
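For what it’s worth, the docker one-liner is roughly this, going from memory of the open webui README, so double-check their docs for the current version:

```shell
# Run open webui in the background, persisting its data in a named volume.
# The web UI then lives at http://localhost:3000.
# --add-host lets the container reach an ollama instance running on the host.
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```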
I’m not sure if I’m doing something wrong here, but openwebui has been weird for me. I tried running nanonets-ocr, but it only read the last few lines visible in the photo. And other models would reprocess the whole chat and ignore the last image I posted, answering with the context of the previous reply instead… Using the web search is easy with it though, so I think I’ll keep an eye on it and maybe try again later.
Not necessarily the ingredient summary, but Google Translate and several other translator apps will do the “scan picture for text” thing and translate that.

