I’m currently shopping around for something a bit faster than ollama and because I could not get it to use a different context and output length, which seems to be a known and long ignored issue. Somehow everything I’ve tried so far did miss one or more critical features, like:

  • “Hot” model replacement, so loading and unloading models on demand
  • Function calling
  • Support of most models
  • OpenAI API compatibility (to work well with Open WebUI)

I’d be happy about any recommendations!

hendrik
link
fedilink
English
117d

I’m also aware of LocalAI with automatic model swapping and OpenAI compatible API.

But unless I’m mistaken, they all use ggml behind the scenes? So you might want to look for something that uses vllm or exllama or something if you want a completely different backend.

I would not recommend LocalAI. There documentation is somewhat lacking and it’s an all in one utility with many moving parts. The parts also tend to break, quite often.

@CaptnBook@feddit.org
link
fedilink
English
217d

Vllm unfortunately doesn’t support switching the model without a restart.

@theunknownmuncher@lemmy.world
link
fedilink
English
7
edit-2
18d

Ummm… did you try /set parameter num_ctx # and /set parameter num_predict #? Are you using a model that actually supports the context length that you desire…?

@RandomlyRight@sh.itjust.works
creator
link
fedilink
English
217d

Yeah, but there are many open issues on GitHub related to these settings not working right. I’m using the API, and just couldn’t get it to work. I used a request to generate a json file, and it never generated one longer than about 500 lines. With the same model on vllm, it worked instantly and generated about 2000 lines

@theunknownmuncher@lemmy.world
link
fedilink
English
2
edit-2
17d

Are you using a tiny model (1.5B-7B parameters)? ollama pulls 4bit quant by default. It looks like vllm does not used quantized models by default so this is likely the difference. Tiny models are impacted more by quantization

I have no problems with changing num_ctx or num_predict

@RandomlyRight@sh.itjust.works
creator
link
fedilink
English
117d

It was multiple models, mainly 32-70B

@theunknownmuncher@lemmy.world
link
fedilink
English
1
edit-2
17d

Can you try setting the num_ctx and num_predict using a Modelfile with ollama? https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter

@RandomlyRight@sh.itjust.works
creator
link
fedilink
English
117d

I’ve read about this method in the GitHub issues, but to me it seemed impractical to have different models just to change the context size, and that was the point I started looking for alternatives

You can overwrite the model by using the same name instead of creating one with a new name if it bothers you. Either way there is no duplication of the llm model file

Possibly linux
link
fedilink
English
317d

I don’t think you are going to find anything faster. Ollama is pretty much as fast as it gets

@CaptnBook@feddit.org
link
fedilink
English
1
edit-2
17d

It’s not, by far. But vllm or SGLang don’t support switching the model… such a shame.

@RandomlyRight@sh.itjust.works
creator
link
fedilink
English
417d

There are many projects out there optimizing the speed significantly. Ollama is unbeaten in the convenience though

Try llamafile from Mozilla.

@Arehandoro@lemmy.ml
link
fedilink
English
-117d

I don’t think it’s OpenAI compatible, but deepseek is faster.

hendrik
link
fedilink
English
617d

Btw, Ollama is a software to run AI models. Deepseek is just a company. Or a model file or a service. But that’s not what OP is looking for. They want to run a model. And that needs software like Ollama.

@Arehandoro@lemmy.ml
link
fedilink
English
116d

Isn’t this a model? https://github.com/deepseek-ai/DeepSeek-V3

(Honest question, not an expert in AI)

hendrik
link
fedilink
English
3
edit-2
16d

Yes, Deepseek V3 is a model. But what I was trying to say, you download the file. But then what? Just having the file stored on your harddisk doesn’t do much. You need to run it. That’s called “inference” in machine learning/AI terms. The repository you linked, contains some example code how to do it with Huggingface’s Transformer library. But there are quite some frameworks out there for running AI models. Ollama would be another one. And it’s not just some example code where to start with your own Python program, but a ready-made project/framework with tools and frontends available and an interface for other software to hook into.

And generally, you need some software to actually do something. And how fast it is, depends on the software used, the hardware it’s executed on. And in this case, also on the size of the AI model and its architecture. But yeah, Deepseek V3 has some tricks up it’s sleeves to make it very efficient. Though, it is really big for home use. I think we’re looking at a six-figure price for the hardware to run it. Usually, people use Deepseek R1 models. Or other smaller AI models if they run them themselves.

@Arehandoro@lemmy.ml
link
fedilink
English
315d

I see, had no idea! Thanks for the detailed answer!

Create a post

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don’t control.

Rules:

  1. Be civil: we’re here to support and learn from one another. Insults won’t be tolerated. Flame wars are frowned upon.

  2. No spam posting.

  3. Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it’s not obvious why your post topic revolves around selfhosting, please include details to make it clear.

  4. Don’t duplicate the full text of your blog or github here. Just post the link for folks to click.

  5. Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).

  6. No trolling.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

  • 1 user online
  • 249 users / day
  • 652 users / week
  • 1.46K users / month
  • 3.93K users / 6 months
  • 1 subscriber
  • 4.18K Posts
  • 86.9K Comments
  • Modlog