Note that this setup runs a 671B model in Q4 quantization at 3-4 TPS; running it in Q8 would need something beefier. To run a 671B model in the original Q8 at 6-8 TPS, you’d need a dual-socket EPYC server motherboard with 768GB of RAM.
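For a rough sense of why Q8 needs so much more RAM than Q4, here is a back-of-the-envelope sketch. It assumes exactly 4 or 8 bits per weight and counts the weights only; real quantization formats carry some extra overhead per block, and the KV cache and runtime need memory on top, so treat the numbers as lower bounds. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 671e9  # 671B parameters, as in the comment above

q4 = weight_memory_gb(N_PARAMS, 4)  # assumed nominal 4 bits per weight
q8 = weight_memory_gb(N_PARAMS, 8)  # assumed nominal 8 bits per weight

print(f"Q4 weights: ~{q4:.1f} GB")
print(f"Q8 weights: ~{q8:.1f} GB")
```

Under those assumptions the Q8 weights alone come to roughly 671 GB, which is consistent with the 768GB figure once you leave headroom for context and the OS.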

  • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 2 months ago

    I find that higher parameter counts tend to produce better output, but it depends on what you’re doing too. For example, for stuff like code generation, accuracy is more important, so even a smaller model that’s not quantized might do better. It also depends on the specific model.