@uis@lemm.ee • 2 points • 10d

Aren’t LLMs external-memory algorithms at this point? As in, all the data will not fit in RAM.

@brucethemoose@lemmy.world • 5 points • edited • 10d

No, all the weights, all the “data” essentially, have to be in RAM. If you “talk to” an LLM on your GPU, it is not making any calls to the internet; it is making a pass through all the weights every time a word is generated.
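
A minimal sketch of that per-word loop (the model name and greedy decoding are illustrative assumptions, not a recommendation):

```python
# Sketch of autoregressive generation: every new token requires a full
# forward pass through the model's weights, which sit entirely in (V)RAM.
# Model name is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

ids = tok("The weights live in", return_tensors="pt").input_ids
for _ in range(20):                    # one full pass per generated token
    logits = model(ids).logits        # touches every weight matrix
    next_id = logits[0, -1].argmax()  # greedy pick; no network calls anywhere
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```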

There are systems to augment the prompt with external data (RAG is one term for this), but fundamentally the system is closed.
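
To make the distinction concrete, here is a hypothetical RAG sketch: retrieved text only ever enters through the prompt, and the weights stay fixed (the `retrieve` stub stands in for a real vector-store lookup):

```python
# Hypothetical RAG sketch: external data enters only as prompt text;
# the model's weights never change at inference time.
def retrieve(question: str) -> list[str]:
    # stand-in for a real vector-database lookup
    return ["Snippet: model weights are loaded fully into (V)RAM at startup."]

def rag_answer(question: str, llm) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    return llm(prompt)  # ordinary forward passes over the same closed weights
```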

@Hackworth@lemmy.world • 2 points • 10d

Yeah, I’ve had decent results running the 7B/8B models, particularly the ones fine-tuned for specific use cases. But as ya mentioned, they’re only really good within their scope for a single prompt or maybe a few follow-ups. I’ve seen little improvement with the 13B/14B models and find them mostly not worth the performance hit.

Depends which 14B. Arcee’s 14B SuperNova Medius model (a Qwen 2.5 with some training distilled from larger models) is really incredible, but the old Llama 2-based 13B models are awful.

@Hackworth@lemmy.world • 2 points • 10d

I’ll try it out! It’s been a hot minute, and it seems like there are new options all the time.

@brucethemoose@lemmy.world • 3 points • edited • 10d

Try a new quantization as well! Like an IQ4-M, depending on the size of your GPU, or even better, a 4.5bpw exl2 with Q6 cache if you can manage to set up TabbyAPI.
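
If you stick with GGUF quants, loading one from Python looks roughly like this (file name, context size, and offload settings are assumptions for illustration):

```python
# Illustrative llama-cpp-python usage with a hypothetical IQ4_M GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="SuperNova-Medius-IQ4_M.gguf",  # hypothetical file name
    n_gpu_layers=-1,  # offload every layer to the GPU if it fits
    n_ctx=8192,       # context window; scale down if VRAM is tight
)
print(llm("Hello!", max_tokens=32)["choices"][0]["text"])
```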

@uis@lemm.ee • 1 point • 10d

> If you “talk to” a LLM on your GPU, it is not making any calls to the internet…

No, I’m talking about https://en.m.wikipedia.org/wiki/External_memory_algorithm

Unrelated to RAG.

@brucethemoose@lemmy.world • 1 point • edited • 10d

> https://en.m.wikipedia.org/wiki/External_memory_algorithm

Unfortunately that’s not really relevant to LLMs beyond inserting things into the text you feed them. For every single word they predict, they make a pass through the multi-gigabyte weights. It’s largely memory-bound, and not integrated with any kind of sane external-memory algorithm.
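
The memory-bound claim falls out of simple arithmetic: generation speed is capped by how fast the weights can be streamed past the compute units (numbers below are illustrative, not measurements):

```python
# Back-of-envelope: each generated token reads (roughly) every weight once,
# so memory bandwidth / model size bounds tokens per second.
weights_gb = 8.0         # e.g. a ~14B model quantized to ~4.5 bits/weight
bandwidth_gb_s = 1000.0  # e.g. ~1 TB/s on a high-end consumer GPU

print(f"~{bandwidth_gb_s / weights_gb:.0f} tokens/s upper bound")  # ~125
```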

There are some techniques that muddy this a bit, like MoE and dynamic LoRA loading, but the principle is the same.
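
For instance, an MoE layer only reads the weights of the few experts its router selects for each token; a minimal sketch (shapes and k are arbitrary assumptions):

```python
# Hypothetical MoE routing sketch: only the top-k experts' weights are
# touched per token, shrinking per-token memory traffic, but each pass
# still runs over a fixed, closed set of weights.
import torch

def moe_layer(x, experts, router, k=2):
    scores = router(x)            # one routing score per expert
    top = scores.topk(k).indices  # the k experts this token activates
    return sum(experts[i](x) for i in top.tolist()) / k
```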
