AI-generated text is now ~23% of the new web — training data quality is becoming auditable

Amara Diallo, Henrik Vos, Yuki Tanabe et al. | ~35s read | arXiv:2605.18920

Bottom line: an open-source detector called SynthCheck estimates that 23% of newly published English web text is AI-generated, triple the 2023 share — and every model trained on web data inherits this contamination.

Details: 94% detection accuracy across twelve domains and eight languages; validated against a full Common Crawl snapshot, the corpus most labs train on. The authors also release decontaminated corpus indices, making clean training data a concrete deliverable rather than a vendor promise.

Caveat: accuracy degrades on human-edited AI text, so treat absolute percentages as estimates with error bars.
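
To make the caveat concrete, here is a minimal sketch of how detector error can shift a prevalence estimate. It applies the standard Rogan-Gladen correction for an imperfect classifier. The brief reports only a single 94% accuracy figure, so using it as both sensitivity and specificity is an assumption, as is reading 23% as a raw flag rate rather than an already-corrected estimate. The function name and the 0.80 sensitivity for human-edited text are hypothetical values chosen for illustration, not figures from the paper.

```python
def corrected_prevalence(flag_rate: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of true prevalence from an observed flag rate."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("detector must perform better than chance")
    estimate = (flag_rate + specificity - 1.0) / denom
    # Clamp to a valid proportion, since sampling noise can push it outside [0, 1].
    return min(1.0, max(0.0, estimate))


FLAG_RATE = 0.23  # assumption: the brief's 23% treated as the raw share flagged

# Assumption: the reported 94% accuracy holds for both AI and human text.
baseline = corrected_prevalence(FLAG_RATE, sensitivity=0.94, specificity=0.94)

# Hypothetical: sensitivity drops to 0.80 once human-edited AI text is included.
degraded = corrected_prevalence(FLAG_RATE, sensitivity=0.80, specificity=0.94)

print(f"symmetric 94%: {baseline:.3f}")
print(f"sensitivity 80%: {degraded:.3f}")
```

Under these assumptions the implied share moves between roughly 19% and 23% depending on the detector's error rates. That spread is why the headline figure is better read as an estimate than as a measurement.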

The strategic shift is that training-data provenance is now measurable, which means it will become a procurement question.

Recommended action: add a data-provenance item to your next AI vendor review — ask what share of their training corpus is synthetic and how they measured it.