- cross-posted to:
- [email protected]
A note that this setup runs the 671B model at Q4 quantization at 3-4 TPS (tokens per second); running Q8 would need something beefier. To run a 671B model at the original Q8 at 6-8 TPS you'd need a dual-socket EPYC server motherboard with 768GB of RAM.
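A quick back-of-the-envelope sketch of why those RAM figures line up, assuming an idealized N bits per weight (real GGUF quants like Q4_K_M average a bit more due to scale factors, and KV cache adds overhead on top):

```python
# Rough memory-footprint estimate for quantized model weights.
# Idealized: exactly `bits_per_weight` bits per parameter, no
# quantization-scale or KV-cache overhead included.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"671B @ Q4: ~{weight_gb(671, 4):.0f} GB")  # ~336 GB of weights
print(f"671B @ Q8: ~{weight_gb(671, 8):.0f} GB")  # ~671 GB, hence the 768GB box
```

So Q4 roughly halves the footprint versus Q8, which is why the full-precision-quant run pushes you into dual-socket server territory.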
btw, do you recommend running a quantized higher-parameter model (locally), or a lower-parameter but unquantized one, if I had to pick between the two?
I find higher-parameter models tend to produce better output, but it depends on what you're doing too. For stuff like code generation, accuracy matters more, so a smaller unquantized model might do better there. It also depends on the specific model.
Thanks, I'll have to try them both then, it seems.