What are the platforms on the Fediverse doing to prevent data scraping and prevent bots?

thesharky@piefed.blahaj.zone · 22 days ago

What are the platforms on the Fediverse doing to prevent data scraping and prevent bots?

CombatWombat@feddit.online · edit-2 22 days ago

I feel pretty confident, despite a complete lack of evidence, that at least one state actor has had a listener running on the fediverse continuously since the W3C started publishing specs, and I would be surprised if the big LLM providers like Anthropic and OpenAI don’t run them as well – they certainly have the resources and motivation to develop them. You’re certainly correct that the vast majority of scrapers are attempting to harvest historical data using the web frontend, but those are the scrapers I am least afraid of. As a mental model for the average user, I think “assume every post is scraped” is the best stance.

irelephant [he/him]@lemmy.dbzer0.com · 21 days ago

You are right: https://www.404media.co/the-200-sites-an-ice-surveillance-contractor-is-monitoring/

The fediverse and atproto are both easily scraped.

frongt@lemmy.zip · 21 days ago

I don’t think Anthropic or OpenAI have spent the time developing a custom ingest pipeline for such a small dataset. It doesn’t seem like it’d give enough of a return on investment.

cynar@lemmy.world · 21 days ago

Given that they are scrabbling around like drug addicts looking for anything they’ve spilt, including checking the cracks in the floorboards…

For some models, it’s obvious they’ve long since scraped the erotic fan fic sites!

CombatWombat@feddit.online · edit-2 21 days ago

I dunno, we had 1.8 billion posts and 50 million comments from 1.1 million MAUs in June according to the fediverse observer. It’s not nothing.

frongt@lemmy.zip · 21 days ago

Yeah, for them that’s small potatoes.