
Comments (104)

  • fancyfredbot
    Who are these aggressive scrapers run by? It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS-like attack? If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to scrape 10,000 sites at 10Mb/s than to DDOS a smaller number of sites with more traffic. Of course the big labs want this data, but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
  • iamnothere
    I am starting to think these are not just AI scrapers blindly seeking out data. All kinds of FOSS sites including low volume forums and blogs have been under this kind of persistent pressure for a while now. Given the cost involved in maintaining this kind of widespread constant scraping, the economics don’t seem to line up. Surely even big budget projects would adjust their scraping rates based on how many changes they see on a given site. At scale this could save a lot of money and would reduce the chance of blocking. I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks? Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.
  • jacquesm
    AI allows companies to resell open source code as if they wrote it themselves, doing an end run around all license terms. This is a major problem. Of course they're not going to stop at just code. They need all the rest of it as well.
  • tedivm
    I solved this problem for my blog by simply not being interesting.
  • blakesterz
    "It is a DDOS attack involving tens of thousands of addresses" It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.
  • sgc
    Can somebody tell me what is a normal "cost of doing business" level of bot traffic these days? I have way too much bot traffic like everybody else, but I don't know if I am an outlier or just run of the mill. I get about 100k bot hits a day, presumably because I have about 350k pages on my site.
  • Havoc
    That makes no sense. There is no reason for AI scrapers to use tens of thousands of IPs to scrape one site over and over. That just sounds like a classic DDOS.
  • zahlman
    Is it still ongoing? The thread appears to be over 24 hours old and as a quick test I had no issue loading the main page (which is as snappy and responsive as expected from a low-bandwidth site like LWN).
  • blibble
    the perverse incentive is if you ddos the website such that it shuts down, no other "AI" parasites can get the valuable data. big tech incentivised to ddos... what a world they've built
  • gulugawa
    I've had luck blocking scrapers by overwriting JavaScript methods: "a.getElementsByTagName = function (...args) { /* clear page content */ }". One can also hide components inside Shadow DOM to make it harder to scrape. However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.
  • bloppe
    I'm curious how they concluded this was done to scrape for AI training. If the traffic was easily distinguishable from regular users, they would be able to firewall it. If it was not, then how can they be sure it wasn't just a regular old malicious DDOS? Happens way more often than you might think. Sometimes a poorly-managed botnet can even misfire.
  • 2OEH8eoCRo0
    When are we going to start suing these assholes? Why isn't anybody leveraging the legal system? You're all searching for technical solutions to a legal problem and fighting with one hand behind your back.
  • chrisjj
    So which is it? DDOS attack or "AI" scrapers?
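
For anyone curious what the method-overwriting trick from gulugawa's comment might look like in practice, here is a minimal sketch. It is not gulugawa's actual code: `trapMethod`, `onCall`, and `restore` are hypothetical names made up for illustration, and the sketch assumes the scraper runs the page's JavaScript and calls the trapped method through the same object. A scraper that only parses the raw HTML would not be affected at all.

```javascript
// Hypothetical helper (not from the thread): replace a method on `target`
// so that any call runs `onCall` and gets an empty array back instead of
// the real result.
function trapMethod(target, methodName, onCall) {
  const original = target[methodName];
  if (typeof original !== "function") {
    throw new TypeError(`${methodName} is not a function on target`);
  }
  target[methodName] = function (...args) {
    onCall(methodName, args); // e.g. clear the page content here
    return [];                // the caller sees no matching elements
  };
  // Return a function that puts the original method back.
  return function restore() {
    target[methodName] = original;
  };
}

// Possible browser usage (assumption: this runs before any scraper code):
//
//   trapMethod(document, "getElementsByTagName", () => {
//     document.body.textContent = "";
//   });
```

As the comment itself warns, anything that legitimately calls the trapped method (Playwright, Selenium, possibly search engine crawlers) would trip the same trap, so the `restore` function is there to undo the override when needed.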