A quarter of the new web is written by AI — and AI is training on it
Amara Diallo, Henrik Vos, Yuki Tanabe et al. | ~60s read | arXiv:2605.18920
Make a photocopy of a photocopy, then copy that one, and by the tenth generation the page is gray mush. AI models face the same danger. The web is filling with AI-written text, new models train on that text, and researchers worry quality quietly rots with each generation. The missing piece was measurement: nobody could say how much of the web is actually machine-made. This paper finally puts a number on it.
The tool is called SynthCheck — a detector built to scan billions of pages and estimate, page by page, the odds the text came from a model. It scores 94% accuracy across twelve domains and eight languages. Pointed at Common Crawl, the giant web scrape most AI labs train on, it estimates that 23% of newly published English web text in 2025 was machine-generated — up from 7% just two years earlier.
The catch: detection is an arms race. Accuracy drops when humans lightly edit AI drafts, and light editing is how a large share of this text gets made. The 23% figure also rests on calibration choices the authors themselves flag as debatable, so treat it as a careful estimate, not a hard count.
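To see why calibration matters so much, here is a minimal sketch. It is not SynthCheck's method, which this summary does not describe. It uses the standard Rogan-Gladen correction, which turns the share of pages a detector flags into an estimate of the true share, given the detector's sensitivity and specificity. The function name, the 25% flagged share, and the sensitivity/specificity pairs are all assumptions made up for illustration.

```python
def corrected_prevalence(flagged_rate, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence from an imperfect detector.

    flagged_rate: share of pages the detector labels machine-generated.
    sensitivity:  share of machine-generated pages it correctly flags.
    specificity:  share of human-written pages it correctly leaves alone.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("detector must perform better than chance")
    # Subtract the expected false-positive share, then rescale.
    estimate = (flagged_rate - (1.0 - specificity)) / denom
    # Clip to a valid proportion.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    flagged = 0.25  # hypothetical raw flagged share, not a figure from the paper
    # Hypothetical calibration settings, chosen only to show the spread.
    for sens, spec in [(0.94, 0.94), (0.80, 0.94), (0.94, 0.90)]:
        est = corrected_prevalence(flagged, sens, spec)
        print(f"sensitivity={sens:.2f} specificity={spec:.2f} -> {est:.1%}")
```

Under these made-up settings, the same flagged share yields noticeably different corrected estimates, which is the kind of sensitivity that makes a headline number an estimate rather than a count.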
Why you should care: every chatbot you will use in 2027 is being trained right now, on this web. Whether AI keeps getting smarter or starts eating its own tail depends on unglamorous tools like this one actually working.