wake up

open twitter to catch up

see deepseek did it again

(and as a reminder, DeepSeek-R1 only came out in January, so it’s been less than 12 months since their last bombshell)

One more graph:

What this all means

Traditional AI models are trained to be “rewarded” for a correct final answer. Get the expected answer, win points, be incentivized to get the answer more often. This has a major flaw: a correct answer does not guarantee correct reasoning. A model can guess, use a shortcut, or even have flawed logic but still output the right answer. This approach completely fails for tasks like theorem proving, where the process is the product. DeepSeekMath-V2 tackles this with a novel self-verifying reasoning framework:

  • The Generator: one part of the model generates mathematical proofs and solutions.
  • The Verifier: another part acts as the critic, checking every step of the reasoning for logical rigor and correctness.
  • The Loop: if the verifier finds a flaw, it provides feedback and the generator revises the proof. This creates a co-evolution cycle where both components push each other to become smarter (a rough sketch of this loop follows below).
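
To make the loop concrete, here is a minimal Python sketch of the generate-verify-revise cycle described above. The function names (generate_proof, verify_proof, revise_proof) are placeholders I made up for illustration; they stand in for model calls and are not DeepSeek's actual API or training code.

```python
# Hypothetical sketch of a self-verifying reasoning loop.
# generate_proof / verify_proof / revise_proof are stand-ins for model calls,
# NOT DeepSeek's real implementation.

from dataclasses import dataclass


@dataclass
class Verdict:
    ok: bool        # did the verifier accept every step?
    feedback: str   # critique to feed back to the generator


def generate_proof(problem: str) -> str:
    # Placeholder: in the real system this is a forward pass of the generator.
    return f"Proof attempt for: {problem}"


def verify_proof(problem: str, proof: str) -> Verdict:
    # Placeholder: the verifier scores step-by-step rigor, not just the final answer.
    return Verdict(ok=True, feedback="")


def revise_proof(problem: str, proof: str, feedback: str) -> str:
    # Placeholder: the generator conditions on the critique and rewrites the proof.
    return proof + f"\n[revised after feedback: {feedback}]"


def prove(problem: str, max_rounds: int = 4) -> str:
    """Iterate generation and verification until the critic accepts or we give up."""
    proof = generate_proof(problem)
    for _ in range(max_rounds):
        verdict = verify_proof(problem, proof)
        if verdict.ok:
            return proof        # verifier found no flaw in any step
        proof = revise_proof(problem, proof, verdict.feedback)
    return proof                # best effort after max_rounds


if __name__ == "__main__":
    print(prove("Show that the sum of two even integers is even."))
```

The point of the structure is that the reward signal comes from the verifier's step-by-step judgment rather than from matching a final answer, which is exactly what the answer-driven setup above cannot provide.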

This new approach lets the model deliver record-breaking performance. As you can see from the charts above, it takes second place on ProofBench-Advanced, just behind Gemini. But Gemini isn’t open-source; DeepSeekMath-V2 is.

The model weights are available on Hugging Face under an Apache 2.0 license: https://huggingface.co/deepseek-ai/DeepSeek-Math-V2.

This means researchers, developers, and enthusiasts around the world can download, study, and build upon this model right now. They can fine-tune or modify it to fit their needs and research, which should lead to some exciting math discoveries soon. I predict (on no basis, mind you) that it will help crack computing problems first, either practical or theoretical.
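
If the checkpoint follows the usual Hugging Face conventions, loading it could look roughly like the sketch below. This is an assumption on my part, not official usage: check the model card for the exact loading code, and note that a model this size will likely need serious GPU memory (device_map="auto" here also relies on the accelerate package being installed).

```python
# Hypothetical quick start, assuming the checkpoint works with the standard
# transformers causal-LM interface; consult the model card for real usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Math-V2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the precision stored in the checkpoint
    device_map="auto",    # spread weights across available GPUs (needs accelerate)
)

prompt = "Prove that the square root of 2 is irrational."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```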

Beyond just the math, the self-verification mechanism is a crucial step toward building AI systems whose reasoning we can trust, which is vital for applications such as scientific research, formal verification, and safety-critical systems. It also shows that ‘verification-driven’ training is a viable and powerful alternative to the ‘answer-driven’ approach that has dominated until now.

  • Conselheiro@lemmygrad.ml · 6 days ago

    A second open source LLM has hit the S&P500.

    I find it fascinating how much of the progress in DNNs for the last 5 years has been simply “how can we converge on training this in an Adversarial framework”. Somebody more versed in Hegel could probably write some relation between the success of GANs and the nature of contradictions.

    I also hope this training method removes the annoying artefact of the chatbot replying to all my comments with “You are absolutely correct!” like it wants some favour.

    I’ll check out the paper later, but at a glance I’m somewhat sceptical of the critic increasing reliability in non-math stuff. Depending on how it’s modelled, it could just become a more convincing bullshitter.