According to Cloudflare Radar, AI crawlers send more than 10 billion requests across the web every week, most of them pulling content for training datasets. Wikipedia absorbed a 150% surge in multimedia bandwidth from this traffic alone. The bots generating those requests are not indexing the web for search. They are extracting the accumulated writing, opinions, and personal disclosures of millions of people who never agreed to any of it.

The word "scraping" makes this sound technical and distant. It is neither. When Anthropic allegedly scraped Reddit starting in 2021, refusing licensing deals the platform offered, it was pulling years of personal posts: arguments, confessions, medical questions, relationship disclosures. Reddit's 26-page suit details the refusal to pay. The University of Tübingen found in June 2025 that large language models memorize between 0.1% and 10% of their training data verbatim. That is not a rounding error. That is a re-identification risk hiding inside a product marketed as anonymous and general.

The Evasion Is the Tell

A Hacker News thread from late March 2026 describes AI crawlers routing through millions of residential proxy IPs, cycling user agents to mimic Chrome browsers, sending 40 to 100 redundant requests per page, and treating robots.txt as a suggestion. One commenter put it plainly: all their efficiency attempts are directed solely toward bypassing blocks. These are not companies optimizing for good data. They are companies optimizing against the people trying to stop them. The incentive structure is visible in the behavior.
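For scale, the opt-out these crawlers are ignoring is a few lines of plain text. GPTBot, ClaudeBot, and CCBot are the user agents that OpenAI, Anthropic, and Common Crawl respectively document for their crawlers; a minimal robots.txt asking all three to stay out looks like this:

```
# robots.txt: a published request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

The thread's complaint is that this file carries no technical force: a crawler rotating through residential proxies under a spoofed Chrome user agent never matches any of these names in the first place.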

Regulators have noticed. Brazil's ANPD ordered X to stop processing minors' data for AI training in December 2024. Italy's Garante temporarily blocked ChatGPT in 2023 and later fined OpenAI €15 million. France's CNIL concluded that mass web scraping fails the GDPR's reasonable expectations test. Germany's LfDI found that third-party data sharing for training purposes is something users would not anticipate. Nineteen separate regulatory guidelines issued between 2020 and 2024 have converged on the same basic finding: the people whose data is being used would not expect it to be used this way.

The AI industry's standard defense is that public data is fair game. The hiQ v. LinkedIn precedent, reaffirmed by the Ninth Circuit in 2022, does limit how the Computer Fraud and Abuse Act applies to publicly accessible information. That is a fair legal point. But "publicly accessible" and "consented to AI training" are not the same category, and the industry has spent considerable energy conflating them.

Who Captures the Value

A recent analysis in International Data Privacy Law describes consent as "difficult to operationalize" at training scale. That framing treats operationalization as a technical problem. It is actually a business decision. Opt-in consent would shrink training datasets. Smaller datasets cost money and slow model development. The difficulty is not engineering. It is that consent, properly implemented, would transfer some negotiating power back to the people generating the data, and the current model captures that value without paying for it.
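The operationalization claim is easy to test against the actual engineering. A minimal sketch in Python, assuming each training record carries a hypothetical consented_to_training flag (the field name and record shape are illustrative, not any platform's real schema):

```python
from typing import Iterable, Iterator

def consented_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only records whose author explicitly opted in to AI training.

    The consented_to_training flag is hypothetical: in practice it would be
    populated from a platform's consent system at ingestion time. Defaulting
    to False means absence of a recorded opt-in excludes the record.
    """
    for record in records:
        if record.get("consented_to_training", False):
            yield record

corpus = [
    {"text": "a public post", "consented_to_training": False},
    {"text": "an opted-in post", "consented_to_training": True},
    {"text": "a post with no consent record at all"},
]

training_set = list(consented_records(corpus))
print(len(training_set))  # 1: only the explicit opt-in survives
```

The filter itself is a dozen lines. What opt-in actually costs is everything the filter throws away, which is the business decision the "difficult to operationalize" framing obscures.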

Enforcement has been episodic and underfunded. Brazil threatens fines of roughly €8,389 per day for deficiencies. Against companies valued in the hundreds of billions, that is not a deterrent. It is a line item.
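The line-item claim is checkable in two lines of arithmetic. A sketch, taking "hundreds of billions" as an illustrative €300 billion valuation (a placeholder, not any specific company's market cap):

```python
daily_fine_eur = 8_389
annual_fine_eur = daily_fine_eur * 365  # 3,061,985 EUR per year

# "Hundreds of billions" from the text, taken as an illustrative 300 billion.
company_valuation_eur = 300e9

share = annual_fine_eur / company_valuation_eur
print(f"{annual_fine_eur:,} EUR/year is {share:.6%} of valuation")
# 3,061,985 EUR/year is 0.001021% of valuation
```

Roughly a thousandth of a percent of valuation per year prices noncompliance well below the cost of compliance.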

Congress should require opt-in consent for personal data used in AI training, with penalties scaled to company revenue rather than fixed amounts. The EU's EDPB Opinion 28/2024 signals that stricter baselines are coming in Europe. The U.S. is watching that happen from the outside, which means American users carry the cost of the regulatory gap while American companies capture the profit from it. That asymmetry is a policy choice, not an inevitability.