That ASIC chip prototype is pretty impressive. You can try it at https://chatjimmy.ai/ without an account - ask it to write something big like an essay or guide. Literally the longest you’ll wait is getting connected to the API; the answer itself appears instantly.
Only limitation right now is that they put a small Llama 8B model on the chip, but it’s a prototype and proof of concept, of course. I’m sure China will soon print a full DeepSeek model on such a chip lol.
Right now there isn’t much interest in making AI more efficient to run, but yeah, there’s no reason we won’t find advances there. China is already doing a lot to squeeze models into smaller hardware.
I don’t run LLMs locally because what my hardware limits me to is not great (the context size especially), but the way things are going we’ll definitely start to see open options open up, I think. If only because academia requires it.
Impressive speed! I only tried a couple small coding questions, but it’s faster than anything I’ve seen.
Oh man you’re underselling it to the rest of the website haha.
But it’s tough to overstate just how fast this is without seeing it. 15,749 tokens/s is what I get, and most responses from the big models might be a bit over 1000 tokens, maybe 2000 if they’re stretching it (including chain of thought). The longest I got DeepSeek to go recently was just a bit below 5000 tokens.
But at such speeds, all of these generation lengths - 1000, 2000, 5000 - are basically done in the blink of an eye. 5000 tokens get written in about a third of a second, which is just slightly above the average human reaction time.
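If you want to sanity-check that, here’s a quick back-of-the-envelope sketch (my own numbers - it assumes a constant 15,749 tok/s and ignores the connection overhead, which in practice is the only wait):

```python
# Back-of-the-envelope: wall-clock time to generate a response at the
# throughput I measured on chatjimmy.ai (~15,749 tokens/s).
# Assumes a constant rate and ignores connection/API overhead.

THROUGHPUT_TOK_PER_S = 15_749  # my measurement; will vary per request

for n_tokens in (1000, 2000, 5000):
    seconds = n_tokens / THROUGHPUT_TOK_PER_S
    print(f"{n_tokens:>5} tokens -> {seconds * 1000:.0f} ms")

# Prints:
#  1000 tokens -> 63 ms
#  2000 tokens -> 127 ms
#  5000 tokens -> 317 ms
# Average human visual reaction time is roughly 200-250 ms, so even a
# 5000-token answer lands in about one reaction time.
```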
Unfortunately Jimmy seems limited to ~1000 tokens generation so we won’t be able to really push it to the limits lol.
Fills out modified DPRK form for Jimmy, lol
I hope they can scale it, and not only that, but that others are able to replicate it. This definitely has potential and would bypass the entire GPU/TPU problem and the layer architecture, which is very inefficient.
Speed is not the be-all and end-all, but it’s not just the speed, it’s also being able to run this fully locally. Imagine a PCI card for these chips, where you just swap out the chip for another when you want to switch models.
I’m just hopium-posting, mind you, lol - they clearly ran into bottlenecks if all they can offer is a ‘tiny’ Llama 8B model. The silicon required to etch an 800B model onto such a chip is orders of magnitude more, and at that point it might cost as much as a new CPU. But it does leave the GPU available for other things, and lets everyone run SOTA models.
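Rough math on why that’s such a jump (my own assumed quantization levels here, nothing official from them - and the 800B figure is just my guess at frontier scale):

```python
# Rough scaling sketch (my assumptions, not the vendor's specs): how many
# bytes of weights you'd have to bake into silicon at a couple of plausible
# quantization levels, comparing the 8B prototype to an ~800B model.

def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_weight / 8

GIB = 1024 ** 3

for name, n_params in [("Llama 8B", 8e9), ("~800B model", 800e9)]:
    for bits in (4, 8):
        gib = weight_bytes(n_params, bits) / GIB
        print(f"{name} @ {bits}-bit: {gib:.0f} GiB of weights")

# Prints:
# Llama 8B @ 4-bit: 4 GiB of weights
# Llama 8B @ 8-bit: 7 GiB of weights
# ~800B model @ 4-bit: 373 GiB of weights
# ~800B model @ 8-bit: 745 GiB of weights
# i.e. ~100x more weight storage to etch, before you even get to the
# routing and compute around it.
```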
Really hope this goes somewhere, or if not that, something similar enough.
I think that last bit is exactly right. It doesn’t have to be exactly this that catches on, but the model of massive data centers that run their chips into oblivion every 6-12 months is peak monopoly capital irrationality.
That’s what I’m thinking too. There’s no reason why you couldn’t make a chip like this for a full-blown DeepSeek model, and then when new models come out you just print new chips for them. The really nice part is that their approach doesn’t need DRAM either, because the state of each transistor acts as memory; it just needs a bit of SRAM, which we don’t have a shortage of.
I’m fully convinced that the whole AI-as-a-service business model is going to be very short-lived. Ultimately, nobody really likes their data going out to some company, or having to pay subscription fees to use the models. If we start getting these kinds of specialized chips, they’re going to be a game changer.
I could, however, totally see an economy where the chips themselves, while cheap to produce, cost a premium based on the model and the number of parameters.
Because the tech is certainly impressive and they have a proof of concept. I don’t know how scalable this is for them (or others), but it clearly works and shows immediate advantages. If it could integrate with existing consumer hardware - say, a PCI card you plug the chip into and swap out when you want to change the model - anybody could easily have this at home.
But with capitalism we’d probably have to settle for DRM’d chips that self-destruct after X many tokens generated lol.
That’s a disgustingly plausible scenario.