• fubarx@lemmy.world
    1 day ago

    Simon may want to randomize his Pelican/Bicycle test.

    There is a long tradition in tech of firms tuning their products to score higher on well-known tests. The ultimate example is VW's Dieselgate, where the cars detected when they were being emissions-tested and behaved differently.

    But in AI, it’s even easier to game benchmarks: just add the best answers to the training set for the next version.
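
    For what it's worth, here is a rough sketch of what randomizing could look like. The word lists, the prompt wording, and the `random_prompt` helper are all my own assumptions for illustration, not anything Simon actually uses:

```python
import random

# Hypothetical word lists (assumptions, not part of the real test).
# All words start with a consonant sound so the article "a" stays correct.
ANIMALS = ["pelican", "walrus", "heron", "badger", "flamingo"]
VEHICLES = ["bicycle", "skateboard", "kayak", "tractor", "scooter"]


def random_prompt(rng: random.Random) -> str:
    """Build a prompt from a randomly chosen animal/vehicle pair."""
    animal = rng.choice(ANIMALS)
    vehicle = rng.choice(VEHICLES)
    return f"Generate an SVG of a {animal} riding a {vehicle}"


if __name__ == "__main__":
    # An unseeded generator gives a different pairing on most runs.
    print(random_prompt(random.Random()))
```

    With 5 x 5 combinations this is still easy to memorize, so a real version would want much longer lists, but the idea is that no single answer is worth adding to a training set.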