Are you running an MLX model? If not, try that. My M4 MacBook runs qwen3.6-35b-a3b lightning fast. It has its issues, but it's fast nonetheless.
What kind of context length can you get with that, and how much ram?
I have a model with 64GB of RAM. I've limited context to 16k in an effort to make it more stable, but tbh it is rather unreliable no matter what I do. With my setup (mlx_lm and webui), it frequently collapses or loops, regardless of the settings. I have done a lot of debugging and have concluded it is probably inherent model behavior.