Clean training data becomes a priced asset as synthetic contamination triples

Amara Diallo, Henrik Vos, Yuki Tanabe et al. · ~45s read · arXiv:2605.18920

Winners: data-curation and provenance startups; owners of verifiably human corpora (publishers, forums, and archives with pre-2023 moats). If contamination keeps climbing, licensed clean text commands a rising premium.

Pressured: labs whose pipelines lean on uncurated web scrapes, which now face a measurable quality discount; SEO content farms built on synthetic output, which become exposed as detection gets cheap enough to deploy at ranking time.

Signals: watch licensing deals for human-verified corpora and their price per token; whether major labs begin publishing synthetic-share audits of training data; follow-up work on detecting human-edited AI text, the current blind spot.

Difficulty to commercialize: 5/10. The detector is open source, so the tool itself is not the moat. The business is enterprise audit and certification, which needs benchmarks, trust, and constant retraining to keep pace in an adversarial arms race.