• CriticalResist8@lemmygrad.ml · 1 day ago

    Oh man you’re underselling it to the rest of the website haha.

    But it’s tough to overstate just how fast this is without seeing it. 15,749 tokens/s is what I get, and most responses from the big models run a bit over 1,000 tokens, maybe 2,000 if they’re stretching it (including chain of thought). The longest I got DeepSeek to go recently was just under 5,000 tokens.

    But at such speeds, all of these generation lengths - 1,000, 2,000, 5,000 - are basically done in the blink of an eye. 5,000 tokens get written in about a third of a second, only slightly longer than the average human reaction time (quick sanity check below).
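    If you want to check the arithmetic yourself, here’s a minimal sketch; the 15,749 tokens/s figure is from my run above, and the ~250 ms reaction time is just the commonly cited ballpark:

    ```python
    # Back-of-the-envelope: generation time = tokens / throughput.
    THROUGHPUT_TPS = 15_749  # measured tokens/s (figure from the comment above)

    for n_tokens in (1_000, 2_000, 5_000):
        ms = n_tokens / THROUGHPUT_TPS * 1000
        print(f"{n_tokens:>5} tokens -> {ms:.0f} ms")

    # Output:
    #  1000 tokens -> 63 ms
    #  2000 tokens -> 127 ms
    #  5000 tokens -> 317 ms   (vs. ~250 ms average human reaction time)
    ```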

    Unfortunately Jimmy seems limited to ~1,000 tokens per generation, so we won’t really be able to push it to the limit lol.

      • CriticalResist8@lemmygrad.ml · 22 hours ago

        I hope they can scale it, and not only that, but that others are able to replicate it. This definitely has potential: it would bypass the entire GPU/TPU problem, and the layer architecture, which is very inefficient.

        Speed is not the be-all and end-all, but it’s not just the speed either; it’s also being able to run this fully locally. Imagine a PCI card for these chips, where you just swap the chip out for another when you want to switch models.

        I’m just hopium-posting, mind you lol; they clearly ran into bottlenecks if all they can offer is a ‘tiny’ Llama 8B model. The microfabrication required to etch an 800B model onto a chip like that is orders of magnitude beyond this, and at that point it might cost as much as a new CPU. But it does leave the GPU free for other things, and it lets everyone run SOTA models.

        Really hope this goes somewhere, or if not this, then something similar enough does.

        • Che's Motorcycle@lemmygrad.ml
          link
          fedilink
          arrow-up
          2
          ·
          10 hours ago

          I think that last bit is exactly right. It doesn’t have to be exactly this that catches on, but the model of massive data centers that run their chips into oblivion every 6-12 months is peak monopoly capital irrationality.