• MartianSands@sh.itjust.works
    16 days ago

    That depends on whether you consider an LLM to be reading the text, or reproducing it.

    Outside of the kind of malfunctions caused by overfitting, like when the same text appears again and again in the training data, it’s not difficult to construct an argument that an LLM does the former, not the latter.

    • awesomesauce309@midwest.social
      16 days ago

      It’s rare that a person on social media understands that these models turn the input into predictive weights, and don’t selectively copy and paste out of them.
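
      To make that concrete, here’s a toy sketch of what “turning the input into predictive weights” means. It’s plain Python with a made-up three-line corpus, and simple next-word counts stand in for the weights; a real LLM is a neural network trained on billions of documents, but the point is the same: training keeps aggregate statistics, not the source lines.

          from collections import Counter, defaultdict
          import random

          # Made-up corpus, purely for illustration.
          corpus = [
              "the cat sat on the mat",
              "the dog sat on the rug",
              "the cat chased the dog",
          ]

          # "Training": reduce the text to next-word counts (the "weights").
          weights = defaultdict(Counter)
          for line in corpus:
              tokens = line.split()
              for prev, nxt in zip(tokens, tokens[1:]):
                  weights[prev][nxt] += 1   # only counts survive, not the lines

          # "Generation": sample from the statistics.
          def generate(word, length=5):
              out = [word]
              for _ in range(length):
                  options = weights.get(out[-1])
                  if not options:
                      break
                  out.append(random.choices(list(options), weights=list(options.values()))[0])
              return " ".join(out)

          print(dict(weights["the"]))   # {'cat': 2, 'mat': 1, 'dog': 2, 'rug': 1}
          print(generate("the"))        # remixes the statistics; may or may not match a source line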

      • Baggins [he/him]@lemmy.ca
        16 days ago

        You’re saying that if I encode a copyrighted work into a JPEG, it isn’t infringement? JPEG also uses statistics to produce an approximation of the input.

        • awesomesauce309@midwest.social
          16 days ago

          You’re describing saving one JPEG with the intent of reproducing exactly that image. I’m saying that if you have a million images turned into weights, the model won’t exactly reproduce anything unless there’s very limited training data for whatever you’re asking it to predict.
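
          A rough sketch of that difference, using numpy, random arrays as stand-ins for images, and a low-rank approximation as a stand-in for JPEG’s DCT-plus-quantisation (so it’s an analogy, not the actual formats):

              import numpy as np

              rng = np.random.default_rng(0)
              images = rng.random((1000, 8, 8))        # random stand-ins for a big image library

              # "One JPEG" path: lossily compress a single image (low-rank approximation
              # here; real JPEG uses block DCT + quantisation, but both discard detail).
              target = images[0]
              U, s, Vt = np.linalg.svd(target)
              k = 3                                     # keep only a few components = lossy
              approx = (U[:, :k] * s[:k]) @ Vt[:k]
              print(np.abs(approx - target).mean())     # small error: still recognisably that image

              # "Model" path: one set of weights fit across every image (here, just the mean).
              weights = images.mean(axis=0)
              print(np.abs(weights - target).mean())    # large error: resembles no particular input

              # Unless the data behind what you ask for is tiny:
              memorised = images[:1].mean(axis=0)       # "trained" on a single example
              print(np.abs(memorised - target).mean())  # 0.0: overfitting hands the input back

          The last case is the overfitting scenario: with effectively one training example, the “weights” and the input are the same thing.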

              • Baggins [he/him]@lemmy.ca
                16 days ago

                Isn’t it? Both methods just produced a data structure you can query to obtain a statistical approximation of a subset of the input data.

                Just because you moved the statistics from the JPEG to the ZIP file? That makes it ok?

                • mindbleach@sh.itjust.works
                  16 days ago

                  Do you think two students writing an essay on the same topic is plagiarism? No? Then congratulations, you understand why a lossy copy is not remotely the same thing as a statistical model.

                  Really, you just chucked the word “statistical” into a poor description of JPEG, and brushed off every attempt to explain why that comparison doesn’t work.

    • Arthur Besse@lemmy.mlOP
      16 days ago

      models can and do sometimes produce verbatim copies of individual items in their training data, and more frequently produce outputs that are close enough to them that they would clearly constitute copyright infringement if a human produced them.
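
      for example, here’s a toy sketch (made-up corpus, simple next-word counts, nothing like a real transformer, but the mechanism is the same): when one passage is repeated often enough in the training data, even bare statistics hand it back verbatim.

          from collections import Counter, defaultdict

          # made-up corpus: one passage repeated many times, plus a little other text
          corpus = ["call me ishmael some years ago never mind how long precisely"] * 500
          corpus += ["call your mother", "never give up"]

          weights = defaultdict(Counter)
          for line in corpus:
              tokens = line.split()
              for prev, nxt in zip(tokens, tokens[1:]):
                  weights[prev][nxt] += 1

          # greedy decoding: always take the most likely next word
          out = ["call"]
          for _ in range(10):
              out.append(weights[out[-1]].most_common(1)[0][0])
          print(" ".join(out))   # the repeated passage comes back word for word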

      the argument that models are not derivative works of their training data is absurd, and the fact that it is being accepted by courts is yet another confirmation that the “justice system” is anything but just and the law simply doesn’t apply when there is enough money at stake.