AI Oct 22, 2025

Addressing the Challenges of Public Web Data

Greg Lindahl, Common Crawl

The Common Crawl Foundation is dedicated to preserving humanity’s knowledge and making it accessible through its free public web dataset, a vital resource since 2008. As AI development accelerates, concerns have emerged about the accessibility and transparency of public web data. Open datasets are affected in three key ways: robots.txt exclusions, legal demands, and “bot defenses.” Two of these are not publicly visible and are poorly understood. In this talk, Greg will present insights from a new data product that uses Common Crawl’s crawl metadata to visually explore these three problems, and will advocate for greater transparency and informed solutions for the future of public web data.