The models tested are so old they’re from the era when they couldn’t pass a math test or count letters in words.
Afaik that is handled through tool use in modern models (i.e. they didn’t learn to do maths, they learnt to use a calculator). Assuming that’s true and I haven’t missed some advance, the article’s conclusions are likely still relevant.
Edit: though the article does seem to discard chain-of-thought techniques a little readily. It feels like they could come close to fitting the role of executive control, but perhaps that’s just the article lacking detail from the original work.
What I see in modern models is that you can often ask them to write a program or script to do a task, and they do that much more successfully than doing the task directly. Once they have debugged the program, it is usually 100% reliable for the specified tasks. Ask them to do those simple tasks directly and you get all kinds of creatively wrong answers.
My high school math teachers would be so disappointed in them.
If I could wire a calculator into my brain I would have cheated on all the maths tests tbf
So… last week then?
I get that you hate AI but there’s no reason to lie about its capabilities.
None of these features are something the models themselves can do; they are grafted on.
I could easily write a Home Assistant automation that pattern-matches nearly every way someone could say “how many Rs are in strawberry”, depluralizes the plural letter, and runs the count through “wc” in a bash terminal.
That doesn’t mean it’s smarter. It’s that I’ve added something specific to it.
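For what it’s worth, here is roughly what that kind of bolt-on could look like. This is only a sketch: it is plain Python rather than a real Home Assistant automation, the regex and the `count_letter_question` name are made up for illustration, and it swaps the bash “wc” step for Python’s `str.count`.

```python
import re

# Hypothetical pattern, not taken from any real product. It covers phrasings like
# "how many Rs are in strawberry" or "how many r's are there in the word strawberry".
QUESTION = re.compile(
    r"how many\s+(?P<letter>[a-z])(?:['’]?s)?\s+"   # the letter, optionally pluralized
    r"(?:are\s+)?(?:there\s+)?in\s+(?:the word\s+)?"
    r"[\"“']?(?P<word>[a-z]+)",                      # the word, optionally quoted
    re.IGNORECASE,
)

def count_letter_question(text):
    """Return the letter count if text looks like a letter-counting question, else None."""
    match = QUESTION.search(text)
    if match is None:
        return None  # not a question this bolt-on knows about
    letter = match.group("letter").lower()
    word = match.group("word").lower()
    return word.count(letter)

print(count_letter_question("how many Rs are in strawberry"))  # 3
```

Anything the pattern doesn’t anticipate falls straight through to `None`, which is the whole point: the handler answers one narrow question and nothing else.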
MCP and the like are just that too: gluing on functions, or the ability to hopefully invoke a function. That’s why so many hilariously mundane ones exist.
At the core, it’s still a large language model: a statistical model of the frequency of word and word-chunk (token) patterns.
Sometimes one model can invoke another via that tooling, but it’s still grafted on. It isn’t a single thing or system, but disjointed pieces, completely detached from how brains work.
This isn’t AI hate, it’s reality. I love the field of artificial intelligence and machine learning. It’s cool as hell. But an LLM is fundamentally incapable of being anything more than an LLM with glued-on pieces that invoke functionality.
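To make the “gluing on functions” idea concrete, here is a minimal sketch of a harness that dispatches a tool call. The JSON shape, the `TOOLS` registry and the `dispatch` name are all assumptions for illustration; this is not the actual MCP wire format (which is built on JSON-RPC), just the general idea of text in, function call out.

```python
import json

def count_letters(word, letter):
    """An ordinary function the harness exposes as a tool."""
    return word.lower().count(letter.lower())

# Hypothetical registry mapping tool names to plain Python functions.
TOOLS = {"count_letters": count_letters}

def dispatch(model_output):
    """Run the tool named in a JSON tool call; return None if the text is not a valid call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # ordinary prose, not a tool call
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    tool = TOOLS.get(call["name"])
    if tool is None:
        return None  # the model "hopefully" invoked something that does not exist
    return tool(**call.get("arguments", {}))

request = '{"name": "count_letters", "arguments": {"word": "strawberry", "letter": "r"}}'
print(dispatch(request))  # 3
```

The counting happens in `count_letters`, an ordinary function. In this sketch the model’s only job is to emit text that names the right tool with the right arguments.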
OpenAI saw people mock the inability to count so they wrote a specialized tool to count letters and glued it on.
The world is full of endless edge cases. The inability to simply resolve them without gluing on every single one means it just isn’t doing anything new.
I believe the progress of the last year is largely attributable to the appropriate “grafting on” of these wrappers around the LLM cores.
They regularly win at olympiad mathematics, up from not standing a chance, and just produced a novel solution to the Erdős conjecture. Counting the r’s in strawberry is inconsequential, but it’s also something they can do even if you just use the raw API or a local model.
Using computers to search for a counterexample to a conjecture isn’t exactly new ground, and I suspect they did so with the aid of some harness tweaks like some numerical LSP. Like, cool, it pushed the envelope, but as the parent said, they grafted on the ability to do a specific task.
A lot of tools like Claude or ChatGPT have internal tools they call when they do math (or they use a Python script) rather than having the model actually compute anything.
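As an illustration of the kind of calculator tool being described, here is a small sketch that evaluates arithmetic exactly by walking Python’s `ast` instead of predicting digits. The `calculate` name and the set of allowed operators are my own choices; I’m not claiming this is what Claude or ChatGPT actually ship.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculate(expression):
    """Evaluate an arithmetic expression without eval(); raise ValueError on anything else."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

print(calculate("2 ** 10"))  # 1024
```

A real tool would also want limits on exponent size and input length; this sketch skips that.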
The underlying tech itself can’t do it because you can’t do math by token probability.
Whether they use tools to do it or not is entirely unimportant; that’s just how they do it?
That’s not lying. There’s nothing linguistic about numerical computation.
No.
https://www.nature.com/articles/d41586-025-02343-x
It’s lying
You know, the “DeepMind and OpenAI models” wording is the hint that the LLM is not the one doing the math. The LLM provides a hypothesis and the DeepMind model provides grounding or feedback on whether the hypothesis even makes sense or works.
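The split described here, one component proposing and another checking, can be sketched as a toy loop. To be clear, this is an assumption-laden illustration and not how any DeepMind or OpenAI system works: `propose` is a dumb stand-in for the model, and `verify` checks candidates against the classic claim that n² + n + 41 is always prime.

```python
def is_prime(n):
    """Trial-division primality check; fine for small numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def propose():
    """Stand-in for the model: yields candidate values of n, one at a time."""
    n = 0
    while True:
        yield n
        n += 1

def verify(n):
    """Grounding step: True only if n really is a counterexample to the claim."""
    return not is_prime(n * n + n + 41)

def search(limit=1000):
    """Feed proposals to the verifier until one checks out or the budget runs out."""
    for attempts, candidate in enumerate(propose(), start=1):
        if attempts > limit:
            return None
        if verify(candidate):
            return candidate

print(search())  # 40, since 40*40 + 40 + 41 = 1681 = 41 * 41
```

However clever or dumb the proposer is, nothing counts as a result until `verify` accepts it.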