• cecilkorik@piefed.ca · 2 days ago (edited)

      I dabble in local AI and this always blows my mind. How do people just casually throw 135B-parameter models around? Are people renting datacenter hardware or GPU time or something, or building personal AI servers with six 5090s in them, or quantizing them down to 0.025 bits, or what? What's the secret? How does this work? Am I missing something? The Q4 of Qwen3.5 122B is 60-80GB for the model weights alone. That's 3x 5090s minimum, unless I'm doing the math wrong, and then you also need to fit the huge context windows these things have. I don't get it.
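
      For reference, here's the napkin math in Python. The quant width and the GQA config numbers (layers, KV heads, head dim) are illustrative assumptions, not the real architecture:

      ```python
      # Back-of-envelope VRAM math for a quantized dense model.
      # All model-config numbers below are illustrative assumptions,
      # not the actual Qwen3.5 122B architecture.

      GiB = 1024**3

      params = 122e9          # parameter count
      bits_per_weight = 4.5   # Q4-style quants average a bit over 4 bits/weight
      weights_gib = params * bits_per_weight / 8 / GiB

      # KV cache grows linearly with context length:
      # 2 (K and V) * layers * kv_heads * head_dim * bytes per element
      n_layers, n_kv_heads, head_dim = 80, 8, 128   # assumed GQA config
      kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2  # fp16
      context = 128_000
      kv_gib = kv_bytes_per_token * context / GiB

      print(f"weights: ~{weights_gib:.0f} GiB")                   # ~64 GiB
      print(f"KV cache @ {context:,} tokens: ~{kv_gib:.0f} GiB")  # ~39 GiB
      ```

      So even before activations and runtime overhead, a long context can nearly match the weights themselves, which is why three 32GB cards still feels tight.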

      Meanwhile I’m over here nearly burning my house down trying to get my poor consumer cards to run glm-4.7-flash.

      • Septimaeus@infosec.pub · 16 hours ago

        I drafted several local-only designs for testing these models, and that estimate looks about right (for pure sharding, at least; most ancillary strategies come with some pretty big tradeoffs).

        It turns out the answer in practice is specialized private compute rental, meaning the underlying hardware is generally shared. That made a lot more sense to me once it clicked that every second of idle time on a platform capable of running these bigger models well is an intolerable expense, simply because these models are built for specialized enterprise data center infrastructure.

        Put another way, to host locally you would end up needing to rent out the idle time to make the infrastructure investment financially viable, at which point you're running your own data center and have arrived at essentially the same destination.
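
        A toy break-even sketch makes the shape of that argument concrete. Every number here (rig price, power draw, rental rate, utilization) is an invented placeholder, not a real quote:

        ```python
        # Toy amortization math: local rig vs. renting shared GPU time.
        # All figures are made-up placeholders for illustration only.

        rig_cost = 12_000.0                  # assumed multi-GPU build
        rig_lifetime_hours = 3 * 365 * 24    # assume a 3-year useful life
        power_cost_per_hour = 0.45           # assumed ~1.5 kW at $0.30/kWh

        rental_rate = 2.50                   # assumed $/hour for comparable compute

        utilization = 0.05                   # fraction of hours you actually infer
        used_hours = rig_lifetime_hours * utilization

        local_cost = rig_cost + power_cost_per_hour * used_hours
        rental_cost = rental_rate * used_hours

        print(f"hours actually used: {used_hours:,.0f}")   # 1,314
        print(f"local:  ${local_cost:,.0f}")               # ~$12,591
        print(f"rental: ${rental_cost:,.0f}")              # ~$3,285
        # At low utilization the rig's idle capital dominates; the crossover
        # only arrives when you keep it busy, i.e., rent out the idle time.
        ```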