AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

yoasif@fedia.io · 7 months ago

AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

atzanteol@sh.itjust.works · 7 months ago

destroy the bargain that made free software spread like wildfire

If you didn’t want your code to be used by others then don’t make it open source.

yoasif@fedia.io · 7 months ago

Do you understand how free software works? Did you read the post? I’d love to clarify, but I’m not going to rewrite the article.

atzanteol@sh.itjust.works · 7 months ago

Also - this conclusion is ridiculous:

By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.

That is absolutely not true. It doesn’t remove the copyright from the original work and no court has ruled as such.

If I wrote a “random code generator” that just happened to create the source code for Microsoft Windows in entirety it wouldn’t strip Microsoft of its copyright.

yoasif@fedia.io · 6 months ago

That is absolutely not true. It doesn’t remove the copyright from the original work and no court has ruled as such.

Sorry, I just got around to this message. That is the idea of the provenance – clearly, the canonical work is copyrighted. It is the version that has been stripped of its provenance via the LLM that no longer retains its copyright (because as I pointed out, LLM outputs cannot be copyrighted).

atzanteol@sh.itjust.works · 6 months ago

That doesn’t make it “no longer copyrighted” though. The original copyright holder retains their copyright on it. I can’t see any court ruling otherwise.

yoasif@fedia.io · 6 months ago

The output of the LLM can be incorporated into copyrighted material and is copyright free. I never claimed that the copyright on the original work was lost.

atzanteol@sh.itjust.works · 6 months ago

I highly doubt the law is settled on this topic and you’re assuming it is. I can’t see the courts accepting that your duplicate version of my work, created through “magic”, is not a violation of my copyright. Especially if my work was included as input to the “magic box” that created the output.

atzanteol@sh.itjust.works · 7 months ago

Yes. And this is kinda hand-wavy bullshit.

By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.

That’s not how it works. Your code is not “incorporated” into the model in any recognizable form. It trains a model of vectors. There isn’t a file with your for loop in there though.

I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license. So can an LLM.

calcopiritus@lemmy.world · 7 months ago

No you can’t. In the same way you can’t watch a Mickey Mouse movie and then draw your own Mickey Mouse from what you recall from the movie.

Copying can be done manually by memory, it doesn’t need to be a 1:1 match. Otherwise you could take a GPL licensed file, change the name of 1 variable, and make it proprietary code.

LLMs are just fancy lossy compression algorithms you can interact with. If I save a Netflix series on my hard drive, then re-encode it, it is still protected by copyright, even if the bytes don’t match.

atzanteol@sh.itjust.works · 7 months ago

No you can’t. In the same way you can’t watch a Mickey Mouse movie and then draw your own Mickey Mouse from what you recall from the movie

Yes, I can. I can create a legally distinct mouse-based cartoon.

You’re right that if an LLM gives you copyrighted code, that would be a potential problem. But the article saying that it somehow “strips the code of any copyright” is ridiculous.

calcopiritus@lemmy.world · 7 months ago

Is there anything in the LLM’s code preventing it from emitting copyrighted code? Nobody outside LLM companies knows, but I’m willing to bet there isn’t.

Therefore, LLMs DO emit copyrighted code, because they are trained on copyrighted code and because of the statistical nature of LLMs.

Does the LLM tell its users that the code it output is copyrighted? I’m not aware of any instance of that happening. In fact, LLMs are probably programmed to not put a copyright header at the start of files, even if the code they “learnt” from had one. So in the literal sense, they are stripping the code of copyright notices.

Does the justice system prosecute LLMs for outputting copyrighted code? No it doesn’t.

I don’t know what definition you use for “strip X of copyright” but I’d say if you can copy something openly and nobody does anything against it, you are stripping its copyright.

atzanteol@sh.itjust.works · 7 months ago

I don’t know what definition you use for “strip X of copyright” but I’d say if you can copy something openly and nobody does anything against it, you are stripping its copyright.

Just what was stated in the fucking article

By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.

That’s bullshit.

VoterFrog@lemmy.world · 7 months ago

I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license. So can an LLM.

Not even just an OSS license. No license backed by law is any stronger than copyright. And you are allowed to learn from or statistically analyze even fully copyrighted work.

Copyright is just a lot more permissive than I think many people realize. And there’s a lot of good that comes from that. It’s enabled things like API emulation and reverse engineering and being able to leave our programming job to go work somewhere else without getting sued.

yoasif@fedia.io · 7 months ago

I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license.

Why is Clean-room design a thing then?

atzanteol@sh.itjust.works · 7 months ago

create my own code with the knowledge gained from your code

Not copy your code. Use it to learn what algorithms it uses and ideas on how to implement it.

a_non_monotonic_function@lemmy.world · 7 months ago

No, sometimes they spit out shit verbatim.

You are assuming way too much about how the models work.

atzanteol@sh.itjust.works · 7 months ago

No, sometimes they spit out shit verbatim.

Then that code would still be under the OSS copyright. There’s no “licence washing” going on.

magic_lobster_party@fedia.io · 7 months ago

The article is about how LLMs circumvent copyleft licenses like GPL

ImgurRefugee114@reddthat.com · edit-2 7 months ago

“If you didn’t want people in your house then don’t have doors” buddy… That’s not how anything works.

atzanteol@sh.itjust.works · 7 months ago

If you put a fucking sign on your door saying “come on in!” then don’t be angry when people do?

ImgurRefugee114@reddthat.com · edit-2 7 months ago

We do hang signs on the doors but they say something slightly different.

https://en.wikipedia.org/wiki/Open-source_license

Public domain licenses are truly as you describe, but copyleft licenses are far from that. There are also many “source available” licenses which aren’t open at all. Just because you can read a book doesn’t mean you can print and sell it.

atzanteol@sh.itjust.works · 7 months ago

Who is wholesale copying OSS code and releasing it under a non-compliant license with an LLM?

ImgurRefugee114@reddthat.com · 7 months ago

Uh… Lots of people? That’s kinda the problem. Maybe use a search engine. There are plenty of cases of LLMs ‘laundering’ copyleft code into (often) proprietary codebases. And that’s just the most blatant and brain-dead obvious example; the use of GPL code to train commercial models is a bit more subtle and nuanced but no less nefarious, and the laws are currently unequipped to handle that part at all.

atzanteol@sh.itjust.works · 7 months ago

You don’t need an LLM to find and copy GPL code. The LLM isn’t adding anything new here.

ImgurRefugee114@reddthat.com · 7 months ago

Oh I’m sorry I didn’t realize you had the intelligence of an LLM. The conversation is now over. I hope you have a pleasant day.

atzanteol@sh.itjust.works · 7 months ago

🙄