I decided to see how vLLM stacks up against Ollama. Lucky for me, I have a friend who already figured this out, so I didn’t have to put much effort into getting vLLM working on the Framework. I’m running Fedora, which ships with podman and a tool called toolbox. A person named kyuz0 did the hard work of putting all the pieces together to make vLLM run on this hardware:

toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

Then I can run

toolbox enter vllm

And run commands inside that container. The image includes a helper called start-vllm that gets everything running. It downloaded the GLM-4.7-Flash model, then I served it with

vllm serve zai-org/GLM-4.7-Flash --host 0.0.0.0 \
--port 8000 --tensor-parallel-size 1 --max-num-seqs 1 \
--gpu-memory-utilization 0.95 --dtype auto \
--trust-remote-code --enable-auto-tool-choice \
--tool-call-parser glm47 --max-model-len auto

And things mostly worked. I could point OpenCode and Goose at vLLM and they worked. I wouldn’t say the results were substantially different from what I saw with Ollama. It was a lot slower, but that’s because Ollama is “quantizing” models: reducing the numerical precision of the weights, which makes a model smaller and faster at some cost in accuracy. I don’t really understand this yet, so it will be a topic for another day.
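The reason tools like OpenCode and Goose “just work” is that vLLM exposes an OpenAI-compatible API on the host and port you gave to vllm serve. As a rough sketch (assuming the server above is still running on localhost:8000 with the same model loaded), you can poke it with plain curl:

```shell
# Ask the vLLM server what models it has loaded
curl -s http://localhost:8000/v1/models

# Send a chat completion request, same shape as the OpenAI API.
# The model name matches what I passed to `vllm serve`.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```

Any tool that speaks the OpenAI API can be pointed at that base URL instead of api.openai.com.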

I have no doubt that a smarter person could make vLLM do much more than Ollama; there are a lot of knobs to turn. But there is a pretty big downside: vLLM runs one model at a time. If I want to switch to a different model, I have to stop vLLM and start it again with different options, and loading a new model takes quite a while.

With Ollama, I can pull and run different models without having to do anything special; it’s a feature built into Ollama. Being both stupid and lazy, this might be a killer feature for me. I’ve found that switching between models is pretty useful: there are things GPT-OSS does better than Qwen Coder and vice versa.
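For contrast, here’s what that switching looks like in practice (assuming the Ollama daemon is running; the model tags are just the ones I’d reach for, and yours may differ, so check ollama list):

```shell
# Ollama loads and unloads models for you; each run can name a
# different model with no restart or reconfiguration in between.
ollama run gpt-oss "Write a hello world program in Go"
ollama run qwen3-coder "Write a hello world program in Go"
```

With vLLM, each of those would mean stopping the server and relaunching it with a different model argument.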

I’ll probably poke at vLLM every now and then, but I don’t see myself running it often unless I find a very good reason to do so.