A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

  • WraithGear@lemmy.world
    English
    arrow-up
    21
    ·
    edit-2
    2 hours ago

    and what is going to happen is that some engineer will band-aid the issue and all the ai crazy people will shout “see! it’s learnding!” and the ai snake oil salesman will use that as justification for all the waste and demand more from all systems

    just like what they did with the full glass of wine test. and no ai fundamentally did not improve. the issue is fundamental with its design, not an issue of the data set

    • turmacar@lemmy.world
      English
      arrow-up
      2
      ·
      edit-2
      22 minutes ago

      Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

      A sample size of 10 is nothing.

      Frankly I would like to see some error bars on the “human polling”. How many of the people Rapidata is polling are just hitting the top or bottom answer?
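
      To put rough numbers on the sample-size point, here is a minimal sketch of a 95% Wilson score interval for a 7/10 model result next to the human poll. The count of 7,150 “drive” answers is an assumption derived from 71.5% of 10,000 (the exact tally is not given in the thread), and `wilson_interval` is a helper written for this illustration, not something from the article.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width

# A model that answered "drive" 7 times out of 10.
model_low, model_high = wilson_interval(7, 10)

# Human poll: ASSUMED 7,150 of 10,000 (71.5% of 10,000; exact count not published here).
human_low, human_high = wilson_interval(7150, 10000)

print(f"model 7/10:        {model_low:.1%} to {model_high:.1%}")
print(f"humans 7150/10000: {human_low:.1%} to {human_high:.1%}")
```

      With ten trials the interval spans roughly 40% to 89%, so a 7/10 model cannot really be told apart from the human rate, or from a much better or much worse model. This only covers sampling noise; it says nothing about people clicking at random.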

  • timestatic@feddit.org
    English
    arrow-up
    1
    ·
    54 minutes ago

    Yeah seems like the training on human data makes it so most AIs will answer at least as unreliably as humans. Only 71% saying drive on the human side is crazy

  • Bluewing@lemmy.world
    English
    arrow-up
    15
    ·
    3 hours ago

    I just asked Google Gemini 3 “The car is 50 miles away. Should I walk or drive?”

    In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

    And under reasons to walk, “You are a character in a post-apocalyptic novel.”

    Methinks I detect notes of sarcasm…

    • driving_crooner@lemmy.eco.br
      English
      arrow-up
      1
      ·
      edit-2
      46 minutes ago

      Gemini 3 pro said that this was a “great logic puzzle” and then said that if my goal is to wash the car, then I need to drive there.

    • XeroxCool@lemmy.world
      English
      arrow-up
      1
      ·
      2 hours ago

      I feel like we’re the only ones who expect that “all-knowing information sources” should write more seriously than these edgelord-level rizzy chatbots do, and yet, here they are, blatantly proving they are chatbots that should not be blindly trusted as authoritative sources of knowledge.

    • NewNewAugustEast@lemmy.zip
      English
      arrow-up
      2
      ·
      1 hour ago

      What is the wrong answer though? It is a stupid question. I would look at you sideways if you asked me this, because the obvious answer is “walk silly, the car is already at the car wash”. Otherwise why would you ask it?

      Which is telling because when asked to review the answer, the AIs that I have seen said: you asked me how you were going to get to the car wash. The assumption was that the car was already there.

    • Hazzard@lemmy.zip
      English
      arrow-up
      5
      ·
      2 hours ago

      They also polled 10,000 people to compare against a human baseline:

      Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher “drive” rate.

      That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.

      • Modern_medicine_isnt@lemmy.world
        English
        arrow-up
        2
        arrow-down
        1
        ·
        2 hours ago

        This here is the point most people fail to grasp. The AI was taught by people. And people are wrong a lot of the time. So the AI is more like us than what we think it should be. Right down to it getting the right answer for all the wrong reasons. We should call it human AI. Lol.

        • NewNewAugustEast@lemmy.zip
          English
          arrow-up
          1
          arrow-down
          2
          ·
          1 hour ago

          Like I said to the person above, there is no wrong answer. It’s all about assumptions. It is a stupid trick question that no one would ask.

    • eronth@lemmy.world
      English
      arrow-up
      1
      arrow-down
      1
      ·
      3 hours ago

      Yeah I straight up misread the question, so I would have gotten it wrong.

  • TankovayaDiviziya@lemmy.world
    English
    arrow-up
    9
    arrow-down
    6
    ·
    edit-2
    7 minutes ago

    We poked fun at this meme, but it goes to show that the LLM is still like a child that needs to be taught to make implicit assumptions and possess contextual knowledge. The current model of LLM needs a lot more input and instructions to do what you want it to do specifically, like a child.

    Edit: I know Lemmy scoffs at LLMs, but people probably also used to scoff at Verbiest’s steam machine, saying it would never amount to anything. Give it time and it will improve. I’m not endorsing AI by the way, I am on the fence about its long-term consequences, but whether people like it or not, AI will impact human lives.

    • Rob T Firefly@lemmy.world
      English
      arrow-up
      12
      arrow-down
      1
      ·
      edit-2
      3 hours ago

      LLMs are not children. Children can have experiences, learn things, know things, and grow. Spicy autocomplete will never actually do any of these things.

    • kshade@lemmy.world
      English
      arrow-up
      11
      ·
      3 hours ago

      We have already thrown just about all of the Internet and then some at them. It shows that LLMs cannot think or reason. Which isn’t surprising, they weren’t meant to.

      • eronth@lemmy.world
        English
        arrow-up
        1
        arrow-down
        5
        ·
        3 hours ago

        Or at least they can’t reason the way we do about our physical world.

        • Nalivai@lemmy.world
          English
          arrow-up
          3
          ·
          1 hour ago

          You’re falling into the same trap. When the letters on the screen tell you something, it’s not necessarily the truth. When there is “I’m reasoning” written in a chatbot window, it doesn’t mean that there is something that’s reasoning.

        • zalgotext@sh.itjust.works
          English
          arrow-up
          12
          ·
          3 hours ago

          No, they cannot reason, by any definition of the word. LLMs are statistics-based autocomplete tools. They don’t understand what they generate, they’re just really good at guessing how words should be strung together based on complicated statistics.

      • Nalivai@lemmy.world
        English
        arrow-up
        4
        ·
        1 hour ago

        By now it’s kind of getting clear that fundamentally this is the best version of the thing that we’ll get. This is primetime.
        For some time, there was a legit question of “if we give it enough data, will there be a qualitative jump”, and as far as we can see right now, we’re way past this jump. A predictive algorithm can form grammatically correct sentences that are related to the context. That’s it, that’s the jump.
        Now a bunch of salespeople are trying to convince us that if there was one jump, there necessarily will be others, while there is no real indication of that.

  • vane@lemmy.world
    English
    arrow-up
    16
    ·
    8 hours ago

    I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

  • melsaskca@lemmy.ca
    English
    arrow-up
    3
    arrow-down
    1
    ·
    5 hours ago

    I don’t use AI but read a lot about it. I now want to google how it attacks the trolley problem.

  • imetators@lemmy.dbzer0.com
    English
    arrow-up
    20
    ·
    9 hours ago

    Went to test Google AI first and it says “You can’t wash your car at a carwash if it is parked at home, dummy”

    ChatGPT and DeepSeek say it is dumb to drive because it is fuel-inefficient.

    I am honestly surprised that Google AI got it right.

    • locahosr443@lemmy.world
      English
      arrow-up
      1
      ·
      26 minutes ago

      I’ve been feeding a bunch of documents I wrote into gemini last week to spit out some scripts for validation I couldn’t be arsed to write. It’s done a surprisingly comprehensive job and when wrong has been nudged right with just a little abuse…

      I’m still all fuck this shit and can’t wait for the pop, but for comparison openai was utterly brain dead given the same task. I think I actually made the model worse it was so useless.

    • rumba@lemmy.zip
      English
      arrow-up
      67
      ·
      9 hours ago

      They probably added a system guardrail as soon as they heard about this test. It’s been going around for a while now :)

      • imetators@lemmy.dbzer0.com
        English
        arrow-up
        3
        ·
        9 hours ago

        The article mentions that Gemini 2.0 Flash Lite, Gemini 3 Flash and Gemini 3 Pro have passed the test. All three also did it 10 out of 10 times without being wrong. Even Gemini 2.5 shares the highest score in the category of “below 6 right answers”. Guess Gemini is the closest to “intelligence” out of the bunch.

        • timestatic@feddit.org
          English
          arrow-up
          2
          ·
          53 minutes ago

          I mean if they fix specific reasoning test answers (like the strawberry one) this doesn’t actually make reasoning better tho. It just optimizes for benchmarks

  • Slashme@lemmy.world
    English
    arrow-up
    58
    arrow-down
    1
    ·
    12 hours ago

    The most common pushback on the car wash test: “Humans would fail this too.”

    Fair point. We didn’t have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between “drive” and “walk,” no additional context, past 10,000 real people through their human feedback platform.

    71.5% said drive.

    So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

    • JcbAzPx@lemmy.world
      English
      arrow-up
      1
      ·
      13 minutes ago

      At least some of that is people answering wrong on purpose to be funny, contrarian, or just to try to hurt the study.

    • bluesheep@sh.itjust.works
      English
      arrow-up
      6
      ·
      7 hours ago

      I saw that and hoped it’s because of the dead Internet theory. At least I hope so, because I’ll be losing the last bit of faith in humanity if it isn’t

    • T156@lemmy.world
      English
      arrow-up
      29
      ·
      11 hours ago

      It is an online poll. You also have to consider that some people don’t care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

      • Brave Little Hitachi Wand@feddit.uk
        English
        arrow-up
        2
        arrow-down
        1
        ·
        9 hours ago

        I wonder… If humans were all super serious, direct, and not funny, would LLMs trained on their stolen data actually function as intended? Maybe. But such people do not use LLMs.

    • masterofn001@lemmy.ca
      English
      arrow-up
      14
      arrow-down
      10
      ·
      edit-2
      11 hours ago

      Without reading the article, the title just says wash the car.

      I could go for a walk and wash my car in my driveway.

      Reading the article… That is exactly the question asked. It is a very ambiguous question.

      • bluesheep@sh.itjust.works
        English
        arrow-up
        12
        arrow-down
        1
        ·
        7 hours ago

        Without reading the article, the title just says wash the car.

        No it doesn’t? It says:

        I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

        In which world is that an ambiguous question?

        • NewNewAugustEast@lemmy.zip
          English
          arrow-up
          1
          ·
          1 hour ago

          Where is the car?

          This is the exact question a person would ask when they want to have a gotcha answer. Nobody would ask this question, which makes a straightforward answer suspect.

      • Geth@lemmy.dbzer0.com
        English
        arrow-up
        2
        arrow-down
        1
        ·
        6 hours ago

        Mentioning the car wash and washing the car plus the possibility of driving the car in the same context pretty much eliminates any ambiguity. All of the puzzle pieces are there already.

        I guess this is an unintended autism test as well if this is not enough context for someone to understand the question.

    • SuspciousCarrot78@lemmy.world
      English
      arrow-up
      3
      ·
      edit-2
      4 hours ago

      Qwen3-4B HIVEMIND (abliterated) got it in 2, though it scores a lot higher on PIQA, HellaSwag and Winogrande benchmarks than normal Qwen3-30B. I think the new abliteration methods actually strengthen real world understanding.

      https://imgur.com/a/7YZme4i

      https://imgur.com/a/25ApzDN

      I wonder if an abliterated VL model could do even better? They tend to have the best real world model benchmarks. Perhaps a Qwen3-VL-30B ablit (if such a thing exists) could one shot this.

      I’d like to think a lot of these gotcha prompts rely on verbal misunderstanding, rather than failure in world models, but I can’t say that for certain.

      PS: Saw a pearler of a response to this: ChatGPT recommended “yeah, lift the car and carry it on your back. Make sure to bend your knees” (though I’m guessing someone edited that for the lulz)

  • tover153@lemmy.world
    English
    arrow-up
    8
    arrow-down
    1
    ·
    8 hours ago

    After the LLM I use most got it wrong:

    Me: You can’t wash your car if it isn’t there.

    Ah. Yes. That is an excellent and devastatingly practical correction.

    In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

    This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

    So:

    Start engine.

    Travel 50 meters.

    Avoid eye contact with pedestrians.

    Commit fully.

    You are not lazy. You are complying with system requirements.

    • teft@piefed.social
      English
      arrow-up
      3
      ·
      3 hours ago

      You are not lazy. You are complying with system requirements.

      How does this AI know me so well?

    • SaltySalamander@fedia.io
      arrow-up
      3
      ·
      4 hours ago

      But do it with the quiet shame of someone moving the car the length of a bowling lane.

      A bowling lane is a bit over 18 meters. =)

    • ne0phyte@feddit.org
      English
      arrow-up
      2
      arrow-down
      2
      ·
      6 hours ago

      Thank you! Finally an answer to my problem that didn’t end with me going to the car wash and being utterly confused how to proceed.

  • 73ms@sopuli.xyz
    English
    arrow-up
    1
    ·
    6 hours ago

    Did this say whether the reasoning models get this right more than the others? Was curious about that but missed it if it was mentioned.

  • Greg Fawcett@piefed.social
    English
    arrow-up
    84
    arrow-down
    1
    ·
    16 hours ago

    What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

    One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say “must have been the AI” instead of doing the legwork to track down the actual bug.

    I think we’re heading for a period of serious software instability.
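
    A minimal sketch of that kind of consistency test, for anyone who wants to see the shape of it. `fake_model` is a hypothetical stand-in that answers “drive” 70% of the time; it is not any real chatbot API, and the 70% figure is made up. The point is only that seeded randomness can be re-run exactly, which is the property debugging relies on and which a hosted model may not offer.

```python
import random
from collections import Counter
from typing import Callable

PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

def fake_model(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for a chatbot call: "drive" with probability 0.7.
    return "drive" if rng.random() < 0.7 else "walk"

def consistency_check(ask: Callable[[str], str], prompt: str, runs: int = 10) -> Counter:
    # Ask the same question `runs` times and tally the answers.
    return Counter(ask(prompt) for _ in range(runs))

def run_with_seed(seed: int) -> Counter:
    # A fresh, seeded generator makes the whole batch reproducible.
    rng = random.Random(seed)
    return consistency_check(lambda p: fake_model(p, rng), PROMPT)

first = run_with_seed(42)
second = run_with_seed(42)

print("run 1:", dict(first))
print("run 2:", dict(second))
print("identical tallies:", first == second)
```

    Swapping `fake_model` for a real API call keeps the tally logic the same, but whether two batches match then depends on what sampling controls the provider exposes.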

    • JcbAzPx@lemmy.world
      English
      arrow-up
      1
      ·
      8 minutes ago

      This is necessary for sounding like reasonable language and an inherent reason for “hallucinations”. If it didn’t have variation it would inevitably output the same answer to any given input.

    • XLE@piefed.social
      English
      arrow-up
      6
      arrow-down
      1
      ·
      3 hours ago

      AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, “temperature” can be controlled), you can change a single letter and get a totally different and wrong result too. It’s an unfixable “feature” of the chatbot system

    • Fmstrat@lemmy.world
      English
      arrow-up
      3
      arrow-down
      2
      ·
      5 hours ago

      This is adjustable via temperature. It is set higher on chatbots, causing the answers to be more random. It’s set lower on code assistants to make things more deterministic.
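
      For reference, a rough sketch of what the temperature knob does in the usual softmax formulation: the scores are divided by the temperature before being turned into probabilities, so a low temperature sharpens the distribution toward the top answer and a high one flattens it. The two scores below are made up for illustration and do not come from any real model.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for the two candidate answers (illustrative assumption).
answers = ["drive", "walk"]
logits = [2.0, 1.0]

results = {}
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    results[t] = dict(zip(answers, probs))
    print(f"T={t}: " + ", ".join(f"{a}={p:.3f}" for a, p in results[t].items()))
```

      At the lowest temperature nearly all of the probability lands on the higher-scoring answer, so repeated runs mostly agree; at the highest, the second answer gets picked often enough that ten runs can easily disagree.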

    • bss03@infosec.pub
      English
      arrow-up
      3
      arrow-down
      1
      ·
      edit-2
      13 hours ago

      Yeah, software is already not as deterministic as I’d like. I’ve encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have “the wrong” values – not zero values, and not the fences that the debugger might try to use. And, mocking or stubbing remote API calls is another way replicable behavior evades realization.

      Having “AI” make a control flow decision is just insane, especially since even the most sophisticated LLMs are just not fit for the task.

      What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.

      But, I’m very biased (I’m sure “AI” has “stolen” my IP, and “AI” is coming for my (programming) job(s)), and quite unimpressed with the “AI” models I’ve interacted with, especially in areas I’m an expert in, but also in areas where I’m not an expert but am very interested and capable of doing any sort of critical verification.