I tested 9 flagships (Claude 4.6, GPT-5.2, Gemini 3.1 Pro, Kimi K2.5, etc.) in my own mini-benchmark with novel tasks, web search disabled and zero training contamination and no cheating possible.

TL;DR: Claude 4.6 is currently the best reasoning model, GPT-5.2 is overrated, and open-source is catching up fast, in particular Moonshot.ai’s Kimi K2.5 seems very capable.

  • MagicShel@lemmy.zip
    link
    fedilink
    English
    arrow-up
    2
    ·
    16 hours ago

    I have to admit, this is more entertaining than counting 'r’s in strawberry. Novel logic puzzles really are about impossible because there is no “logic” input in token selection.

    That being said, the first thing that came to my mind is that at some point the (presumable) adults, me and the priest, are going to be on the boat at some point, which would necessarily leave the baby alone on one shore or another.

    Clearly, the only viable solution is the baby eats the candy, and then the priest eats the baby.