• brucethemoose@lemmy.world

    Kinda already done:

    https://arxiv.org/abs/2504.07866

    Huawei’s model splits its experts into 8 groups, routed so that each group always has the same number of experts active. This means that (on an 8-NPU server) inter-device communication is minimized and the load stays balanced.
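
    Rough sketch of that grouped routing as I understand it (toy NumPy, not Huawei’s actual code, and all the names are mine):

    ```python
    import numpy as np

    def grouped_topk_route(router_logits, n_groups=8, k_per_group=1):
        """Toy group-balanced MoE routing for a single token.

        Experts are split into n_groups equal groups and we pick the top
        k_per_group experts *within each group*, so every group (and the
        NPU/GPU hosting it) always does the same amount of work.
        """
        n_experts = router_logits.shape[0]
        group_size = n_experts // n_groups
        groups = router_logits.reshape(n_groups, group_size)

        chosen = []
        for g in range(n_groups):
            top = np.argsort(groups[g])[-k_per_group:]   # best experts in this group only
            chosen.extend(g * group_size + top)

        weights = np.exp(router_logits[chosen])
        return np.array(chosen), weights / weights.sum()

    # 64 experts, 8 groups of 8, 1 active per group -> always 8 active, perfectly balanced
    logits = np.random.randn(64)
    experts, weights = grouped_topk_route(logits)
    print(experts, weights.round(3))
    ```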


    There’s another big MoE (ERNIE? Don’t quote me) that ships with native 2-bit QAT, too. It’s basically built to cram into 8 gaming GPUs.
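
    For context, 2-bit QAT just means the forward pass already sees weights rounded to 4 levels during training, so the model learns to live with the rounding before you ship it. A toy fake-quant step (my illustration, not that model’s actual recipe):

    ```python
    import numpy as np

    def fake_quant_2bit(w, group_size=64):
        """Round weights to 4 integer levels (2 bits) per group, then dequantize.

        During QAT the forward pass uses these rounded weights, so the model
        learns to tolerate the error; at deploy time you only store the 2-bit
        codes plus one scale per group.
        """
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 2    # map roughly into [-2, 2]
        q = np.clip(np.round(w / scale), -2, 1)             # the 4 representable levels
        return (q * scale).reshape(-1)

    w = np.random.randn(256).astype(np.float32)
    print("mean abs error:", np.abs(w - fake_quant_2bit(w)).mean().round(3))
    ```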


    > If you can get good results on gaming cards, then suddenly ordinary gaming hardware, run in parallel, may be quite capable of running the important models.

    I mean, I can run GLM 4.6 350B at 7 tokens/sec on a single 3090 + Ryzen CPU, with only modest divergence from the full model’s output. Most people can run GLM Air and replace base-tier ChatGPT.
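
    The napkin math on why that works (my rough numbers, assuming a ~355B-total / ~32B-active MoE shape and an aggressive ~3.5 bit/weight quant; don’t hold me to the exact figures):

    ```python
    # Why a huge MoE fits on a desktop: the full weights only have to fit in
    # system RAM, and only the *active* experts' weights get read per token.

    total_params  = 355e9   # total parameters (roughly GLM 4.5/4.6 class)
    active_params = 32e9    # parameters actually used per token
    bits_per_w    = 3.5     # aggressive GGUF-style quantization

    total_gb  = total_params  * bits_per_w / 8 / 1e9
    active_gb = active_params * bits_per_w / 8 / 1e9

    print(f"whole model in RAM:     ~{total_gb:.0f} GB")   # ~155 GB -> fits in 192 GB DDR5
    print(f"weights read per token: ~{active_gb:.0f} GB")  # ~14 GB  -> RAM bandwidth sets the pace
    ```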

    Some businesses are already serving models split across cheap GPUs. It can be done, but it’s not turnkey the way it is for NVLink-connected HBM cards.
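
    The splitting itself isn’t magic; the crude version is just pipeline parallelism, e.g. half the layers per card with only activations crossing the PCIe bus. Toy PyTorch sketch (needs two CUDA devices; real serving stacks handle batching, KV cache, scheduling, etc.):

    ```python
    import torch
    import torch.nn as nn

    class TwoGPUPipeline(nn.Module):
        """Naive pipeline parallelism: stage 0 on one card, stage 1 on the other."""
        def __init__(self, d=4096, n_layers=8):
            super().__init__()
            half = n_layers // 2
            self.stage0 = nn.Sequential(*[nn.Linear(d, d) for _ in range(half)]).to("cuda:0")
            self.stage1 = nn.Sequential(*[nn.Linear(d, d) for _ in range(half)]).to("cuda:1")

        def forward(self, x):
            x = self.stage0(x.to("cuda:0"))
            return self.stage1(x.to("cuda:1"))   # only activations cross the PCIe bus

    model = TwoGPUPipeline()
    print(model(torch.randn(2, 4096)).shape)     # torch.Size([2, 4096]), lives on cuda:1
    ```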

    Honestly the only things keeping OpenAI in place are name recognition, a timing lead, SEO/convenience, and… hype. Basically inertia + anticompetitiveness. The tech to displace them is there, it’s just inaccessible and unknown.