Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

    • SuspciousCarrot78@lemmy.world
      6 hours ago

Qwen3-4B HIVEMIND (abliterated) got it in 2, though it scores a lot higher on the PIQA, HellaSwag and Winogrande benchmarks than the stock Qwen3-30B. I think the new abliteration methods actually strengthen real-world understanding.

      https://imgur.com/a/7YZme4i

      https://imgur.com/a/25ApzDN

I wonder if an abliterated VL model could do even better? They tend to score best on real-world-understanding benchmarks. Perhaps a Qwen3-VL-30B ablit (if such a thing exists) could one-shot this.

I’d like to think a lot of these gotcha prompts rely on verbal misunderstanding rather than a failure of world models, but I can’t say that for certain.

PS: Saw a pearler of a response to this: ChatGPT recommended “yeah, lift the car and carry it on your back. Make sure to bend your knees” (though I’m guessing someone edited that for the lulz).