Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.
We use NGINX’s 444 response A LOT (it’s a non-standard code that tells NGINX to close the connection without sending any response).
In coordination with careful rate-limiting, it’s been a dramatic improvement.
The worst of the bots don’t advertise their User Agent (or worse, pretend to be a normal user while making hundreds of requests a second), but there’s lots of low-hanging fruit.
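
For anyone who hasn’t seen it, here is a minimal sketch of what that combination might look like. It is an illustration under assumptions, not anyone’s production config: the crawler names, the `perip` zone name, the hostname, and the rates are all placeholders to adjust for your own traffic.

```nginx
# Sketch only: bot names, zone name, hostname, and rates are placeholders.
http {
    # Flag requests whose User-Agent matches a self-identifying crawler.
    map $http_user_agent $blocked_ua {
        default           0;
        ~*ExampleBot      1;   # hypothetical crawler name
        ~*OtherScraper    1;   # hypothetical crawler name
    }

    # Per-client-IP rate limit: 10 MB of shared state, 5 requests/second.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;
    limit_req_status 429;

    server {
        listen 80;
        server_name wiki.example.org;

        # 444 is NGINX-specific: drop the connection, send nothing back.
        if ($blocked_ua) {
            return 444;
        }

        location / {
            # Allow short bursts, reject sustained floods with 429.
            limit_req zone=perip burst=20 nodelay;
            # ... usual proxy_pass / fastcgi_pass for the wiki goes here ...
        }
    }
}
```

The User-Agent map only catches the low-hanging fruit, meaning bots that identify themselves. The per-IP limit is what slows down the ones pretending to be ordinary browsers, and even that does little against crawlers spread across many addresses.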
Hmm, interesting. I wasn’t aware of this one.