Taalas HC1: 17,000 tokens/sec on Llama 3.1 8B vs Nvidia H200’s 233 tokens/sec. 73x faster at one-tenth the power. Each chip runs ONE model, hardwired into the transistors.

  • tal@lemmy.today
    11 hours ago

    The HC1 chip doesn’t load model weights from memory. It etches them directly into the transistors. Every weight becomes a physical circuit.

    That’s one way to avoid memory bandwidth constraints!
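
A rough back-of-the-envelope sketch of why memory bandwidth matters here, and of the 73x figure in the title. The inputs are assumptions, not numbers from the post: roughly 8e9 parameters for Llama 3.1 8B, 16-bit weights, Nvidia's published 4.8 TB/s memory bandwidth for the H200, and a single request stream where every weight is read from memory once per generated token (KV cache traffic and batching are ignored). The function names are made up for illustration.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two throughput figures in tokens/sec."""
    return fast_tps / slow_tps


def bandwidth_bound_tps(params: float, bytes_per_param: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Ceiling on single-stream decode speed if every weight must be
    read from memory once per generated token."""
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Throughput figures quoted in the post title.
ratio = speedup(17_000, 233)

# Assumed inputs: 8e9 parameters, 2 bytes each (16-bit weights),
# 4.8e12 bytes/sec of memory bandwidth (H200 spec sheet figure).
bound = bandwidth_bound_tps(8e9, 2, 4.8e12)

print(f"speedup: {ratio:.0f}x")
print(f"bandwidth-bound ceiling: {bound:.0f} tokens/sec")
```

Under those assumptions the ceiling lands in the low hundreds of tokens per second, the same ballpark as the quoted H200 number. That is only a sanity check on the order of magnitude, not a claim about how the 233 tokens/sec was measured. A chip with the weights baked into the circuitry never performs that per-token weight read, so this particular ceiling does not apply to it.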