Note that this setup runs a 671B model at Q4 quantization at 3-4 tokens/s; running Q8 would need something beefier. To run a 671B model at the original Q8 at 6-8 tokens/s you'd need a dual-socket EPYC server motherboard with 768GB of RAM.
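As a rough sanity check on those RAM figures, here is a back-of-the-envelope estimate of weight storage at different quantization levels. The effective bits-per-weight values (~4.5 for Q4_K-style quants, ~8.5 for Q8_0) are assumptions based on typical GGUF formats, not figures from this thread, and KV cache and runtime overhead are ignored:

```python
# Rough memory-footprint estimate for quantized LLM weights.
# Assumed effective bits per weight (typical of llama.cpp GGUF quants,
# not stated in the thread): ~4.5 for Q4_K, ~8.5 for Q8_0.
GIB = 1024 ** 3

def model_size_gib(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring KV cache and overhead."""
    return params * bits_per_weight / 8 / GIB

params_671b = 671e9
q4 = model_size_gib(params_671b, 4.5)
q8 = model_size_gib(params_671b, 8.5)

print(f"Q4: ~{q4:.0f} GiB, Q8: ~{q8:.0f} GiB")
```

This puts Q4 in the ~350 GiB range and Q8 in the ~660 GiB range, which is consistent with Q8 needing a 768GB server board once you add KV cache and OS headroom.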

  • CriticalResist8@lemmygrad.ml
    2 months ago

    btw do you recommend running a quantized higher-parameter model (locally) or lower-parameter but not quantized, if I had to pick between the two?

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
      2 months ago

      I find a higher parameter count tends to produce better output, but it depends on what you're doing. For stuff like code generation, accuracy matters more, so a smaller model that's not quantized might do better there. It also depends on the specific model.