• hikaru755@lemmy.world

Quick note on terminology: there's no such thing as a "math engine". Most models can run custom computer code as one of the "tools" they have available, and that's what's used if a model decides to offload a calculation rather than answer directly.
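For a rough idea of what "running code as a tool" means mechanically, here's a minimal sketch. The function `run_python` and its interface are made up for illustration; real harnesses wrap this in a structured tool-call protocol, but the core idea is the same: the model emits code as text, the harness executes it, and the output goes back to the model.

```python
import subprocess
import sys

def run_python(code: str) -> str:
    """Hypothetical tool a harness might expose to the model: run
    model-written Python in a subprocess and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()

# Rather than computing 123456789 * 987654321 "in its head", the model
# can emit a tool call like this, and the harness runs it:
print(run_python("print(123456789 * 987654321)"))
```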

    This is what that looks like in Claude Code:

Notice the lines starting with a green dot and the text Bash(python3...). Those are the model calling the "Bash" tool to run Python code to answer the second and third questions. The first question it answered (correctly, btw) without any tool call; that's just the LLM itself getting it right in a straight shot, similar to DeepSeek in your example. Current models are actually good enough to generally get this kind of simple math correct on their own. I still wouldn't want to rely on that, but I'm not surprised it got it right without any tool calls.

So I tested my more complex calculations against DeepSeek, and it seems like (at least in the web UI) it doesn't have access to any math or code-running tool. It just works through them in verbose text, basically explaining to itself how to do manual addition the way you learn it in school, and then doing that. Incredibly wasteful, but it did actually arrive at the correct answers.
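What it narrates in prose is just schoolbook addition with carries. As my own illustration (not anything the model actually runs), here's that procedure as code, which makes clear how many steps the model is spelling out token by token:

```python
def schoolbook_add(a: str, b: str) -> str:
    """Digit-by-digit addition with carries, right to left, the way the
    model narrates it in text. Inputs are decimal digit strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(schoolbook_add("123456789", "987654321"))
```

Each loop iteration here costs the model a sentence or more of generated text, which is why doing this "by hand" is so wasteful compared to one tool call.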

Gemini is the only web-based AI app I thought to test right now that seems to have access to a code-running tool; here's what that looks like:

    It’s hidden by default, but you can click on “Show code” in the top right to see what it did.

    This is what I mean when I say the harness matters. The models are all pretty similar, but the app you’re using to interact with them determines what tools are even made available to the LLM in the first place, and whether/how you’re shown when it calls those tools.
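To illustrate that point in miniature (everything here is hypothetical; real harnesses use a structured tool-call API, not string dispatch): the same "model" behaves differently depending purely on which tools the harness registers.

```python
# Toy sketch: the harness, not the model, decides which tools exist.
# eval() stands in for a real sandboxed code runner; toy use only.
def run_tool(code: str) -> str:
    return str(eval(code))

def answer(question: str, tools: dict) -> str:
    """Toy dispatcher: offload arithmetic if a code tool is available,
    otherwise the model has to answer from its weights alone."""
    if "python" in tools:
        return tools["python"](question)
    return "(model answers directly, no tool call)"

print(answer("17 * 23", {"python": run_tool}))  # harness with a code tool
print(answer("17 * 23", {}))                    # bare chat UI, no tools
```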