• brucethemoose@lemmy.world
    link
    fedilink
    English
    arrow-up
    3
    ·
    edit-2
    1 hour ago

    I did find this calculator the other day

    That calculator is total nonsense. Don’t trust anything like that; at best, its obsolete the week after its posted.

    I’d be hesitant to buy something just for AI that doesn’t also have RTX cores because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive RTX cores

    Yeah, that’s a huge caveat. AMD Blender might be better than you think though, and you can use your RTX 4060 on a Strix Halo motherboard just fine. The CPU itself is incredible for any kind of workstation workload.

    along with NPU cores, so I guess my ideal would be a SoC with a ton of RAM

    So far, NPUs have been useless. Don’t buy any of that marketing.

    I’m also not sure under 10 tokens per second will be usable, though I’ve never really tried it.

    That’s still 5 words/second. That’s not a bad reading speed.

    Whether its enough? That depends. GLM 350B without thinking is smarter than most models with thinking, so I end up with better answers faster.

    But anyway, I’m get more like 20 tokens a second with models that aren’t squeezed into my rig within an inch of their life. If you buy an HEDT/Server CPU with more RAM channels, it’s even faster.

    If you want to look into the bleeding edge, start with https://github.com/ikawrakow/ik_llama.cpp/

    And all the models on huggingface with the ik tag: https://huggingface.co/models?other=ik_llama.cpp&sort=modified

    You’ll see instructions for running big models on a 4060 + RAM.

    If you’re trying to like batch process documents quickly (so no CPU offloading), look at exl3s instead: https://huggingface.co/models?num_parameters=min%3A12B%2Cmax%3A32B&sort=modified&search=exl3

    And run them with this: https://github.com/theroyallab/tabbyAPI