The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
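To make the failure mode in the quote concrete, here is a minimal sketch of a loader with a hard cap on the number of features. This is not Cloudflare's code: `MAX_FEATURES`, the one-feature-per-line file format, and every function name are assumptions made up for illustration. It shows the hard-fail path the quote describes, plus one possible alternative (keeping the last good file) for comparison.

```python
MAX_FEATURES = 200  # hypothetical limit; the quoted text does not give the real value


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the loader allows."""


def parse_feature_file(text, max_features=MAX_FEATURES):
    """Parse one feature name per line (assumed format), rejecting oversized files."""
    features = [line.strip() for line in text.splitlines() if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


def load_or_keep_previous(text, previous, max_features=MAX_FEATURES):
    """Return the new feature list, or the last good one if the new file is rejected."""
    try:
        return parse_feature_file(text, max_features)
    except FeatureFileTooLarge:
        return previous


normal = "\n".join(f"feature_{i}" for i in range(120))
doubled = normal + "\n" + normal  # duplicate entries double the file, as in the quote

current = load_or_keep_previous(normal, previous=[])
current = load_or_keep_previous(doubled, previous=current)
print(len(current))  # 120: the oversized file is rejected and the previous list is kept
```

Calling `parse_feature_file(doubled)` directly raises, which is roughly the behavior the quote describes. Whether falling back to the previous file is the right call in production is a separate question; it trades a crash for running on stale data.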

  • floquant@lemmy.dbzer0.com
    2 hours ago

    Congratulations, now your “good” servers are dead from the extra load and you also have a queue of shit to go through once you’re back up, making the problem worse. Running a terabit-scale proxy network isn’t exactly easy; the number of moving parts interacting with each other is insane. I highly suggest reading some of their postmortems. They’re usually really well written and very informative if you want to learn more about the failures they’ve encountered, the processes to handle them, and their immediate remediations.