
Here’s How to Keep Google from Using your Content for AI Training

Good news for publishers who don’t want Google using their web content for AI training: you can now block the AI-training crawler and say no without shutting the tech giant out completely.


Using public data to improve AI models is a subject that’s certainly caused some controversy. In case you missed it, here’s the quick version of what went down:


  • In July, Google updated its privacy policy to allow for the use of public information to “help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” 


  • Many publishing companies started coming forward with concerns about privacy, plagiarism, and the muddiness surrounding what AI models are legally allowed to do with public content. 


  • Google added an option, called Google-Extended, that lets publishers decline the use of their data for training Google’s AI models, including Bard and Vertex AI. 


Several large publishers, including CNN and Reuters, had already made similar moves against GPTBot, the OpenAI crawler used to gather training data for ChatGPT. Some took it a step further: The New York Times, for example, updated its terms of service to legally bar companies from using its content to train AI. But the question remains, what kind of fallout will be caused by opting out? 
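For reference, publishers who opted out of OpenAI's data collection did so by disallowing GPTBot in their robots.txt file. A minimal sketch (the `GPTBot` user-agent token is documented by OpenAI; the file lives at the root of the site):

```txt
# robots.txt — block OpenAI's GPTBot from crawling the entire site
User-agent: GPTBot
Disallow: /
```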


How to Block AI-Training Crawlers

First things first: If you want to keep your content out of Google’s AI training data, you’ll need to block the token Google provides for this purpose, Google-Extended, described by Google as, “a new control that web publishers can use to manage whether their sites help improve Bard and Vertex AI generative APIs, including future generations of models that power those products.”


According to the tech giant, Google-Extended still allows your web content to be scraped and indexed; Google just doesn’t use the data it collects to train Bard and Vertex AI. You can access it through the same text file you typically use to block crawlers, robots.txt. Google Search Central states, “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.”
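In practice, blocking Google-Extended looks like any other robots.txt rule; the token just controls AI-training use rather than crawling itself. A minimal robots.txt sketch that opts a whole site out of Bard and Vertex AI training while leaving normal Google Search crawling untouched:

```txt
# Opt out of AI training via the Google-Extended control token
User-agent: Google-Extended
Disallow: /

# Regular Google Search crawling and indexing remain allowed
User-agent: Googlebot
Allow: /
```

Because Google-Extended is a control token rather than a separate crawler, this rule doesn't change what Googlebot fetches; it only changes what Google is permitted to do with the fetched data.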


You can read more about Google-Extended, and how to control which crawlers have access to your information, in Google’s Search Central documentation.


What’s the Catch?

In a nutshell, your rankings may be affected if you prevent Google from using your content to train AI. The company says your web material will still be crawled and indexed, which means it will still show up on search engine results pages (SERPs). But keep in mind: the better Google knows your site, the higher you’re likely to rank, and crawling for AI training data gives Google a deep look at your content. As for the potential disadvantages, there are several:


  • Lower Visibility — As mentioned earlier, blocking Google from using your data means your content may not be as well-understood by search algorithms. This could mean lower search visibility and reduced organic traffic.


  • Impaired SEO — Google’s AI algorithms play a significant role in determining rankings. By blocking data usage for AI training, you could make your site less competitive and harder to discover overall.


  • Fewer Featured Snippets — AI-trained algorithms are responsible for generating featured snippets. Blocking data usage for AI training could make your content less likely to appear as a featured snippet, meaning you miss the chance to capture attention and establish authority.


  • Diminished User Experience — Personalized search results and recommendations are AI-driven features that improve overall user experience. If you block Google from using your data to train AI, you could limit Google’s ability to surface your most relevant content to the right people.


  • Loss in Advertising Revenue — AI-driven algorithms make for more precise ad targeting based on user behavior and interests. If Google doesn’t have access to this data, the impact of your ads may suffer, which means you could be at risk of losing ad revenue. 


  • Competitive Disadvantage — While many publishers have decided to block Google from using their data for AI training, others haven’t. If your competitors are collaborating with Google, they may have higher search rankings, better SEO, and more user engagement.


What’s Next?

Prepare to be adaptable, because AI is evolving quickly and so are the rules surrounding what tech companies are allowed to do with your online content. Google claims that as AI applications expand, it will continue exploring choice and control options for web publishers. 


Content scraping practices are a double-edged sword. On one hand, Google is thoroughly evaluating your content, potentially leading to higher rankings. On the other, there are concerns about privacy and plagiarism, not to mention the impact AI models will have on publishers as they become more sophisticated. 


Despite these concerns, it remains clear that AI is at the forefront of technological innovation: it could very well revolutionize the way your company creates content and engages with users. Rejecting AI could make it more difficult for you to innovate and stay on top of industry trends.


The Bottom Line when Blocking the AI Web Crawler

As you decide which direction your company should take, remember: There is no right or wrong answer. Striking a balance between privacy considerations and the benefits of AI-driven technologies is complex. You need to have a thorough understanding of what either choice could mean for your business.


At the end of the day, your primary goal should stay the same: provide consistent value to your audience. That’s the golden ticket to long-term success in the ever-evolving world of SEO.
