• 1 Post
  • 5 Comments
Joined 2Y ago
Cake day: Jul 14, 2023


I’ve read about this method in the GitHub issues, but to me it seemed impractical to maintain separate models just to change the context size, and that was the point where I started looking for alternatives.
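For reference, the method in question boils down to registering a separate derived model per context size via a Modelfile. A rough sketch of what that looks like, assuming a local Ollama install and a base model called `llama3` (placeholder name):

```python
import subprocess
import tempfile
from pathlib import Path

# The Modelfile workaround: bake a fixed context size into a derived
# model, then load that model instead of the base one.
BASE_MODEL = "llama3"         # placeholder: any locally pulled base model
DERIVED_MODEL = "llama3-16k"  # one extra model entry per desired context size
NUM_CTX = 16384

modelfile = f"FROM {BASE_MODEL}\nPARAMETER num_ctx {NUM_CTX}\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Modelfile"
    path.write_text(modelfile)
    # Registers a whole new model just to change num_ctx.
    subprocess.run(
        ["ollama", "create", DERIVED_MODEL, "-f", str(path)],
        check=True,
    )
```

You end up with one model entry per context size you ever need, which is exactly what felt impractical.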


It was multiple models, mainly in the 32-70B range.


There are many projects out there that optimize speed significantly. Ollama is still unbeaten in convenience, though.


Yeah, but there are many open issues on GitHub about these settings not working correctly. I’m using the API and just couldn’t get it to work. I used a request to generate a JSON file, and it never produced one longer than about 500 lines. With the same model on vLLM, it worked instantly and generated about 2,000 lines.
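To illustrate, this is roughly the shape of request I mean: passing the limits per request through Ollama’s native API versus vLLM’s OpenAI-compatible endpoint. Just a sketch, assuming an Ollama server on localhost:11434 and a vLLM server on localhost:8000; the model names and token limits are placeholders:

```python
import requests

PROMPT = "Generate a large JSON file describing 500 products."

# Ollama native API: context and output length go into "options".
ollama_resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b",      # placeholder model name
        "prompt": PROMPT,
        "stream": False,
        "options": {
            "num_ctx": 16384,        # requested context window
            "num_predict": 8192,     # requested max output tokens
        },
    },
    timeout=600,
)
print(len(ollama_resp.json()["response"].splitlines()), "lines from Ollama")

# vLLM's OpenAI-compatible endpoint: max_tokens is honored per request.
vllm_resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-32B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 8192,
    },
    timeout=600,
)
content = vllm_resp.json()["choices"][0]["message"]["content"]
print(len(content.splitlines()), "lines from vLLM")
```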


Faster Ollama alternative
I'm currently shopping around for something a bit faster than Ollama, also because I could not get it to use a different context and output length, which seems to be a known and long-ignored issue. Somehow everything I’ve tried so far missed one or more critical features, like:

- "Hot" model replacement, i.e. loading and unloading models on demand
- Function calling
- Support for most models
- OpenAI API compatibility (to work well with Open WebUI)

I'd be happy about any recommendations!
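To make the last two points concrete, this is roughly how I’d sanity-check OpenAI API compatibility and function calling against any candidate server. A sketch assuming an OpenAI-compatible endpoint on localhost:8000; the model name and tool definition are placeholders:

```python
from openai import OpenAI

# Any OpenAI-compatible server should be reachable like this.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Placeholder tool definition, just to exercise function calling.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="placeholder-model-name",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the server supports function calling, tool_calls should be populated.
print(response.choices[0].message.tool_calls)
```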

Take a look at NVIDIA Project Digits. It’s supposed to be released in May for $3,000 and will arguably be the only sensible way to host LLMs then:

https://www.nvidia.com/en-us/project-digits/