Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I didn't know about llama-swap until yesterday. Apparently you can set it up such that it gives different 'model' choices which are the same model with different parameters. So, e.g. you can have 'thinking high', 'thinking medium' and 'no reasoning' versions of the same model, but only one copy of the model weights would be loaded into llama server's RAM.

Regarding mlx, I haven't tried it with this model. Does it work with unsloth dynamic quantization? I looked at mlx-community and found this one, but I'm not sure how it was quantized. The weights are about the same size as unsloth's 4-bit XL model: https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit/tr...



Yes that's right. The config is described by the developer here:

https://www.reddit.com/r/LocalLLaMA/comments/1rhohqk/comment...

And is in the sample config too:

https://github.com/mostlygeek/llama-swap/blob/main/config.ex...

iiuc MLX quants are not GGUFs for llama.cpp. They are a different file format which you use with the MLX inference server. LM Studio abstracts all that away so you can just pick an MLX quant and it does all the hard work for you. I don't have a Mac so I have not looked into this in detail.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: