Guide to Self Hosting LLMs Faster/Better than Ollama

@Grimy@lemmy.world

vLLM can only run on linux but it’s my personal favorite because of the speed gain when doing batch inference.

@brucethemoose@lemmy.world

Aphrodite is a fork of vllm. You should check it out!

If you are looking for raw batched speed, especially with some redundant context, I would actually recommend sglang instead. Check out its experimental flags too.

@morrowind@lemmy.ml

Honestly, I’m just gonna stick to llamafile. I really don’t want to mess around with python. It also causes way more trouble than I anticipate

@brucethemoose@lemmy.world

Llamafile is fine, but it still leaves a lot of performance on the table.

You can setup kobold.cpp with Q8 flash attention without ever having to install pytorch, which is the real headache. It does have a little python launch script, but its super minimal.

You can use the native llama.cpp server for absolutely zero python usage.

@banghida@lemm.ee

Does any of this work on Intel GPUs?

@brucethemoose@lemmy.world

Nope, Intel is more of a pain. Your best bet is the llama.cpp server’s SYCL backend (or kobold.cpp if they have a build for that).

They have an OpenVINO server, but its not for the faint of heart.

@banghida@lemm.ee

Just thinking down the road, I am thinking of buying a next gen Intel GPU.

@brucethemoose@lemmy.world

Support will get much better if they sell something decent (read: >32GB and cheap), but if they don’t, TBH I would not expect great support. Arc is in kinda a make or break moment, and right now Intel is breaking it hard by delaying Battlemage so much. I too an am Arc hopeful, but it really feels like Intel is going to cancel it.

@vividspecter@lemm.ee

Do you have any recommendations for a Perplexity.ai type setup? It’s one of the few recent innovations I’ve found useful. I’ve heard of Perplexica and a few others, but not sure what is the best approach.

LiveLM

What does Perplexity do different than other AI solutions?
Heard about it but haven’t tried yet

@Caboose12000@lemmy.world

I haven’t heard about it before today but I tried asking it what separates it from other LLMs and apparently the answer is just that it does a google search and shows you the source its summarizing, which if true is not very compelling, and if a hallucination or missing details then its at least not very compelling as a search replacement

projectmoon

Perplexica works. It can understand ollama and custom OpenAI providers.

@kitnaht@lemmy.world

If your “FIRST STEP” is to choose an OS: Fuck that.

You should never have to change your OS just to use this crap. It’s all written in Python. It should work on every OS available. Your first step is installing the prerequisites.

If you’re using something like Continue for local coding tasks, CodeQwen is awesome, and you’ll generally want a context window of 120k or so because for coding, you want all the code context - or else the LLM starts spitting out repetitious stuff, or can’t ingest all of your context so it’ll rewrite stuff that’s already there.

@brucethemoose@lemmy.world

CodeQwen 1.5 is pretty old at this point, afaik made obsolete by their latest release.

The Qwen models (at least 2.5) are really only good to like 32K, which is still a ton of context. But I’ve been testing Qwen 32B at 64K -90K and even that larger model is… Not great.

32K is generally enough to get the jist of whatever you’re trying to fill in.

@gravitas_deficiency@sh.itjust.works

Wtf are you talking about. PCIe passthrough exists.

@kitnaht@lemmy.world

Why would you even bother trying to run this all through a VM when you can just run it directly? If you’re to the point of using VMs, you don’t need this tutorial anyways.

Are you seriously telling me you’re jumping through all the hoops to spin up a VM on Linux, and then doing all the configuration for GPU passthrough, because you can’t just figure out how to run it locally?

@gravitas_deficiency@sh.itjust.works

Bro this is a community for sharing knowledge and increasing the technical aptitude of fellow users by doing said sharing. Maybe instead of shitting on a pretty solid digest of the fundamentals of setting up something like this, try adding to the body of knowledge instead.

@brucethemoose@lemmy.world

I would not recommend that for performance reasons, AFAIK.

Windows is fine, I should make that more clear.

@gravitas_deficiency@sh.itjust.works

Huh, really? Is there that much of a perf hit using passthrough? I’d have assumed that the bottleneck isn’t actually the PCIE, so much as it is the beefiness of the GPU crunching the model.

@brucethemoose@lemmy.world

I have not tested WSL or VMs in Windows in awhile, but my impression is that “it depends” and you should use the native windows version unless you are having some major installation issues.

@sturlabragason@lemmy.world

Choose OS is very relevant when doing cloud stuff.

@brucethemoose@lemmy.world

Or setting up a home server, which I figured some here would do.

@L_Acacia@lemmy.one

llama.cpp works on windows too (or any os for that matter), though linux will vive you better performances

@kwa@lemmy.zip

Thanks!

For people on MacOS, is there a better alternative than croco.cpp?

@brucethemoose@lemmy.world

If you download the source, you should be able to build it for metal? Croco.cpp is just a fork of kobold.cpp

I think lmstudio added MLX support, but otherwise you are stuck with anything llama.cpp based. I’d probably download llama.cpp directly and use the llama server first.

@kwa@lemmy.zip

I tried llama.cpp with llama-server and Qwen2.5 Coder 1.5B. Higher parameters just output garbage and I can see an OutOfMemory error in the logs. When trying the 1.5B model, I have an issue where the model will just stop outputting the answer, it will stop mid sentence or in the middle of a class. Is it an issue with my hardware not being performant enough or is it something I can tweak with some parameters?

@brucethemoose@lemmy.world

You can only allocate so much to metal backends, and if you are on (say) an 8GB Mac there won’t be much RAM left for the LLM itself.

But still, use a tighter quantization (like an IQ4 or IQ3_KM) of Qwen Coder 7B, and close as many background programs as you can. It should be small enough to fit.

@kwa@lemmy.zip

I have a MacBook Pro M1 Pro with 16GB RAM. I closed a lot of things and managed to have 10GB free, but that seems to still not be enough to run the 7B model. For the answer being truncated, it seems to be a frontend issue. I tried open-webui connected to llama-server and it seems to be working great, thank you!

@brucethemoose@lemmy.world

Try reducing the context size, and make sure Q8/Q8 flash attention is enabled with flags.

I’d link a specific GGUF quantization, but huggingface seems to be down for me!

@brucethemoose@lemmy.world

Try this one at least, it should still leave plenty of RAM free: https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-IQ4_XS.gguf

@kwa@lemmy.zip

Indeed, this model is working on my machine. Can you explain the difference with the one I tried before?

@brucethemoose@lemmy.world

It’s probably much smaller than whatever other GGUF you got, aka more tightly quantized.

Look at the filesize, thats basically how much RAM it takes.

@brucethemoose@lemmy.world

deleted by creator

@AliasAKA@lemmy.world

Bookmarked and will come back to this. One thing that may be if interest to add is for AMD cards with 20gb of ram. I’d suppose that it would be Qwen 2.5 34B with maybe less strict quant or something.

Also, it may be interesting to look at the AllenAI molmo related models. I’m kind of planning to do this myself but haven’t had time as yet.

@brucethemoose@lemmy.world

Yep. 20GB is basically 24GB, though its too tight for 70B models.

One quirk for 7900 owners is that installing flash attention for long context usage can be a pain. Apparently it is doable now, I need to dig up the link, but it might just be easier to use kobold.cpp rocm with its native flash attention.

As for vision models, that is a whole different can of worms. Exllama does not support this, so you’d need a framework that does.

If you are looking for niche models, check out MiniG (which is a continued pretrain of the already very excellent GLM4-9B): https://huggingface.co/bartowski/miniG-GGUF

Llama.cpp support is recent, though I’m not 100% sure its completely fixed. It should work in Aphrodite as well.

@sleep_deprived@lemmy.world

I’d be interested in setting up the highest quality models to run locally, and I don’t have the budget for a GPU with anywhere near enough VRAM, but my main server PC has a 7900x and I could afford to upgrade its RAM - is it possible, and if so how difficult, to get this stuff running on CPU? Inference speed isn’t a sticking point as long as it’s not unusably slow, but I do have access to an OpenAI subscription so there just wouldn’t be much point with lower quality models except as a toy.

@brucethemoose@lemmy.world

CPU inference is, unfortunately, slow, even on my 7800X3D.

The one that might be interesting is deepseek code v2 lite, as its a very fast MoE model. IIRC microsoft also released a Phi MoE thats good for CPU.

Keep an eye out for upcoming bitnet models.

Dont bother upgrading RAM though. You will be bandwidth limited anyway, and it doesn’t make a huge difference.

fmstrat

Do you have any recommendation for integration into VSCode, specifically with something like Continue?

@brucethemoose@lemmy.world

I am “between” VScode extensions TBH, but any model that supports FIM (like Qwen or Mistral Code) should work fine.

@sturlabragason@lemmy.world

Frontendwise; Librechat is pretty cool.

@shaserlark@sh.itjust.works

I run a Mac Mini as a home server because it’s great for hardware transcoding, I was wondering if I could host an LLM locally. I work with python so that wouldn’t be an issue but I have no idea how to do CUDA or work on low level code. Is there anything I need to consider? Would probably start with a really small model.

@thirdBreakfast@lemmy.world

If it’s an M1, you def can and it will work great. With Ollama.

@shaserlark@sh.itjust.works

Yeah it’s an M1 16GB, sounds awesome I’ll try, thanks a lot for the guide it’s super helpful. I just got the Mac Mini for jellyfin but this is an unexpected use case where the server comes in very handy.

@brucethemoose@lemmy.world

For that you probably want the llama.cpp server and a Qwen2 14B IQ3 quantization.

16GB is kinda tight though, especially if you’re running other stuff in the background.

Eskuero

Ollama has had for a while an issue opened abou the vulkan backend but sadly it doesn’t seem to be going anywhere.

@brucethemoose@lemmy.world

Thats because llama.cpp’s vulkan backend is kinda slow and funky, unfortunately.

Eskuero

Better than anything. I run through vulkan on lm studio because rocm on my rx 5600xt is a heavy pain

@brucethemoose@lemmy.world

The best hope for you is ZLUDA’s revival. It’s explicitly targeting LLM runtimes now, and RDNA1 (aka your 5600XT) is the oldest supported generation.

https://www.phoronix.com/news/ZLUDA-Third-Life

TBH you should consider using free llama/qwen APIs as well, when appropriate.

DarkThoughts

I just can’t get ROCm / gpu generation to work on Bazzite, like at all. It seems completely cursed. I tried koboldcpp through a Fedora distrobox and it didn’t even show any hardware options. Tried through an Arch AUR package through distrobox and the ROCm option is there but ends with a CUDA error. lol The Vulkan option works but seems to still use the CPU more than the GPU and is consequently still kinda slow and I struggle to find a good model for my 8GB card. Fimbulvetr-10.7B-v1-Q5_K_M for example was still too slow to be practical.

Tried LM Studio directly in Bazzite and it also just uses the CPU. It also is very obtuse on how to connect to it with SillyTavern, as it asks for an API key? I managed it once in the past but I can’t remember how but it also ended up stopping generating anything after a few replies.

Krita’s diffusion also only runs on the CPU, which is abysmally slow, but I’m not sure if they expect Krita to be build directly on the system for ROCm support to work.

I’m not even trying to get SDXL or something to run at this point, since that seems to be still complicated enough even on a regular distro.

@brucethemoose@lemmy.world

I don’t like Fedora because its CUDA support is third party, and AFAIK they dont natively package ROCm. And its too complex to use through something like distrobox… I don’t want to tell you to switch OSes, but you’d have a much better time with CachyOS, which is also optimized for Steam gaming.

Alternatively you could try installing rocm images through docker, but you have to make sure GPU passthrough is working).

It also depends on your GPU. If you are on an RX 580, you can basically kiss rocm support goodbye, and might want to investigate mlc-llm’s vulkan backend.

Fimbulvetr is ancient now, your go to models are Qwen 2.5 14B at short context or llama 3.1 8B/Qwen 2.5 7B at longer context.

DarkThoughts

I distrohopped so much after each previous distro eventually broke and me clearly not being smart enough to recover. I’m honestly kinda sick of it, even if the immutable nature also annoys the shit out of me.

My GPU is a 6650 XT, which should in principle work with ROCm.

Which model specifically are you recommending? Llama-3.1-8B-Lexi-Uncensored-V2-GGUF? Because the original meta-llama ones are censored to all hell and Huggingface is not particularly easy to navigate, on top of figuring out the right model size & quantization being extremely confusing.

@brucethemoose@lemmy.world

Depends what you mean by censored. I never have a problem with Qwen or llama as long as I give them the right prompt and system prompt. Its not like an API model, they have to continue whatever response you give them.

And… For what? If you are just looking for like ERP, check out drummer’s finetunes. Otherwise I tend to avoid “uncensored” finetunes as they dumb the model down a bit, but take your pick: https://huggingface.co/models?sort=modified&search=14B

But you are going to struggle if you can’t get rocm working beyond very small context, as that means no flash attention anywhere.

Also, assuming you end up using kobold.cpp-rocm instead, I would use a IQ3_M or IQ3_XS GGUF quantization of a 14B model.

DarkThoughts

Well, anything remotely raunchy gets a “I cannot participate in explicit content” default reply.

I am using the rocm install of koboldcpp but as said, the ROCm option errors out with a CUDA error for some reason.

@brucethemoose@lemmy.world

Thonking What’s the error? Did you manually override your architecture as an environment variable?

https://old.reddit.com/r/ROCm/comments/18z29l6/comment/kgeuguq/

https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU?tab=readme-ov-file#additional-information--installation-tips

You are gfx1032

DarkThoughts

ggml_cuda_compute_forward: ADD failed
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_compute_forward at ggml/src/ggml-cuda.cu:2365
  err
ggml/src/ggml-cuda.cu:107: CUDA error

I didn’t do anything past using yay to install the AUR koboldcpp-hipblas package, and customtkinter, since the UI wouldn’t work otherwise. The koboldcpp-rocm page very specifically does not mention any other steps in the Arch section and the AUR page only mentions the UI issue.

@brucethemoose@lemmy.world

mmmm I would not use the AUR version, especially on Fedora. It probably relies on a bunch of arch system packages, among other things.

Try installing the rocm fork directly, with its script: https://github.com/YellowRoseCx/koboldcpp-rocm?tab=readme-ov-file#linux

EDIT: There does seem to be a specific quirk related to Fedora.

@brucethemoose@lemmy.world

Oh, and again, for raunchy, there are explicit “RP” finetunes, like: https://huggingface.co/TheDrummer

But you just need to set a good system prompt or start a reply with “Sure,” and plain qwen or llama will write out unspeakable things.

@thirdBreakfast@lemmy.world

Guide to Self Hosting LLMs with Ollama.

Download and run Ollama
Open a terminal, type ollama run llama3.2

projectmoon

Super useful guide. However after playing around with TabbyAPI, the responses from models quickly become jibberish, usually halfway through or towards the end. I’m using exl2 models off of HuggingFace, with Q4, Q6, and FP16 cache. Any tips? Also, how do I control context length on a per-model basis? max_seq_len in config.json?

@brucethemoose@lemmy.world

What model, specifically? What other settings?

Context length is in the TabbyAPI config, yes.

projectmoon

I tried it with both Qwen 14b and Llama 3.1. Both were exl2 quants produced by bartowski.

@brucethemoose@lemmy.world

What context length? Neither of them likes to go over 32K.

And what kind of jibberish? If they are repeating, you need to change sampling settings. Incoherence… Also probably sampling settings, lol.

projectmoon

Context was set to anywhere between 8k and 16k. It was responding in English properly, and then about halfway to 3/4s of the way through a response, it would start outputting tokens in either a foreign language (Russian/Chinese in the case of Qwen 2.5) or things that don’t make sense (random code snippets, improperly formatted text). Sometimes the text was repeating as well. But I thought that might have been a template problem, because it seemed to be answering the question twice.

Otherwise, all settings are the defaults.

@brucethemoose@lemmy.world

Hmm, what’s the frontend?

And the defaults can sometimes be really bad lol. Qwen absolutely outputs chinese for me with a high temperature.

projectmoon

OpenWebUI connected tabbyUI’s OpenAI endpoint. I will try reducing temperature and seeing if that makes it more accurate.

Guide to Self Hosting LLMs Faster/Better than Ollama

Guide to Self Hosting LLMs Faster/Better than Ollama

Selfhosted