I'm actually not entirely against AIs #scraping the web.
Once the genie is out of the bottle, you can't put it back in. If there's some content out there that is freely accessible, and it can be used to make large models better, it will certainly be used - we shouldn't be too naive or ideological about that.
I've always supported total freedom of scraping for everyone. I've always supported a world where all the content on the Internet can also be parsed by machines (that was the entire idea behind the semantic web). Once public content is out there, we lose control over who accesses it and for what purposes - that's simply how the web works.
But if Google and Meta are suddenly in this "we ♥ scraping" mood, I'd expect them to stick to their word and at least allow bidirectional scraping.
As an AI geek, I'd love to train my models on large corpora of audio extracted from YouTube videos. Or on what people post in public Facebook groups when particular events happen. Or on how the price of a product fluctuates on Amazon as a result of external factors.
But I can't legally do any of these things. Those platforms are sealed, their APIs are very limited by design, only a limited number of researchers can access some of that data (after signing lengthy NDAs and agreeing that the parent company will decide whether the research can be published), and they have tons of frontend-only checks to ensure that only a human downloads that content - and that they watch a sufficient number of ads in the process. On top of that, the developers of scraping software like youtube-dl also get regularly harassed by Google.
So why should I tolerate a world where, if you're big enough, you can afford to scrape the shit out of everyone and use that knowledge to become even bigger and more powerful, but nobody is allowed to do the same with your own content?
We urgently need regulation that creates a level playing field when it comes to automated access to online information. Freedom of scraping means freedom of growing. We can't give this freedom only to those who are big enough.
We need to make web scraping a fundamental human right.
And large companies should be compelled to share their data with scrapers without barriers too, if they aren't willing to build proper APIs.
Until that happens, I'll keep scraping the shit out of those bastards without feeling an inch of guilt.
https://www.indiehackers.com/post/it-will-be-the-greatest-theft-in-the-entire-history-of-humanity-indie-hackers-weigh-in-on-big-ai-companies-scraping-the-web-6e78a4a4b7