How to protect content from AI crawlers and bad bots with a robots.txt file.


TL;DR: How to use your robots.txt file to dissuade AI crawlers and bad bots from scraping, indexing, and taking your content.

Ever since OpenAI announced in early August 2023 that it had released its crawler into the wild, content creators have been nervously looking over their shoulders for ways to protect their content.

Yes, you can use plugins to try to discourage people from stealing content. Some humans are like magpies: it’s shiny and they want it, so stopping them from right-clicking and copy’n’pasting is just an inconvenience.

Now there’s a new thief on the block: artificial intelligence (AI).

The various players in this quickly evolving marketplace (OpenAI, Google, Meta, Webz.io, and so on) are all scraping data to train their large language models (LLMs) and chatbots.

The easiest way to do this? Just scrape content from websites already out there. (Don’t worry about the copyright implications right now; that’s a fight for another day.)

So, what’s the simplest way to discourage content theft and scraping? Unfortunately, there is no magic bullet, but there is one simple thing you can do, and it’s often connected with SEO: your robots.txt file.

Ordinarily, the robots.txt file tells search engine crawlers which pages on your site to crawl and index. As time has gone by, however, these rules have been exploited by both white hats and black hats.
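By way of illustration, robots.txt is just a plain-text file served from the root of your site (e.g. example.com/robots.txt). A minimal example follows; the bot name and directory path are made up for demonstration:

```
# Keep one (hypothetical) crawler out of the whole site
User-agent: ExampleBot
Disallow: /

# All other crawlers: crawl everything except a private directory
User-agent: *
Disallow: /private/
```

Each `User-agent` line names a crawler (or `*` for all of them), and the `Disallow` lines beneath it list paths that crawler is asked not to fetch.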

Good bots & Bad bots

According to the latest Imperva report on bad bots, it’s easy to see why content creators feel the need for some level of protection.

Source: Imperva Bad Bot Report

At this point, it is important to say that Imperva defines bad bots as those that are malicious or harmful to websites, not as content creators would probably define AI crawlers.

It is also important to note that good bots generally help human website traffic reach a destination with a high probability of fulfilling the search request.

How to implement a robots.txt file.

There are a couple of methods for editing or implementing a robots.txt file, but by far the easiest is to use an SEO plugin.

We use Rank Math, and it’s pretty easy to edit your robots.txt file from there. Other SEO plugins offer something similar; just search the documentation for whatever plugin you use.

  1. Enable Advanced mode on the Rank Math dashboard (top right)
  2. Then follow this navigation: WordPress Dashboard → Rank Math SEO Dashboard → General Settings → Edit robots.txt
  3. Paste in your new file from a format-free text editor (Notepad on Windows is fine)
  4. Save.

NOTE:
Editing your robots.txt file isn’t a thing to be done on a whim. Doing it incorrectly can damage your website’s visibility to search engine crawlers, and your indexing can suffer as a result.

Rank Math SEO plugin – how to edit robots.txt file

As much as modifying your robots.txt file is a step in the right direction, it isn’t a foolproof way to block these AI and other crawlers. They are under no obligation to follow your wishes; robots.txt is an honour system. Most will honour it, but let’s face it: if someone wants your content, they’re going to take it.
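To see what that honour system looks like in practice, here is a minimal sketch of how a well-behaved crawler checks robots.txt before fetching a page, using Python’s standard-library `urllib.robotparser`. The rules and URLs are placeholders:

```python
# A well-behaved crawler parses robots.txt and checks each URL
# against it before fetching. Nothing forces a bad bot to do this.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is asked to stay away; ordinary crawlers are still welcome.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

A compliant bot simply skips any URL for which `can_fetch` returns `False`; a scraper that ignores robots.txt never runs this check at all.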

Which crawlers should you dissuade from crawling your site?

Through several hours of research, I managed to compile a couple of handy lists of crawler user agent names & IP addresses that you might want to consider using to protect your content:

| Company/Org | Bot/Crawler | Notes |
| --- | --- | --- |
| Common Crawl | CCBot | Not-for-profit org aimed at the democratisation of internet data so that everyone, not just big companies, can do high-quality research and analysis. |
| OpenAI | GPTBot | OpenAI’s crawler crawls websites and gathers content to train its proprietary large language models (LLMs), such as GPT-4 and GPT-5. |
| OpenAI | ChatGPT-User | OpenAI’s crawler which is used ‘on-demand’ when responding to user prompts. |
| Meta | Meta-AI | An image-based user-generated tool for Facebook, Instagram & WhatsApp. |
| Meta | FacebookBot | “FacebookBot crawls public web pages to improve language models for our (Facebook’s) speech recognition technology.” |
| Google Gemini | Google-Extended | Replaces Bard. The normal Googlebot checks robots.txt for rules listing Google-Extended. |
| Webz.io | Omgilibot & Omgili | Webz.io originally developed Omgilibot for their now-defunct search engine. The crawler collects data which Webz.io then sells to its clients. |
| Anthropic | anthropic-ai | Collects information for their artificial intelligence products, such as Claude. |
| Cohere | cohere-ai | Scrapes data to power their commercial LLMs, each of which can be applied in different scenarios. |
| Imagesift / Hive | ImagesiftBot | A reverse image search tool that searches the web for publicly accessible images and uses them to enhance Hive’s online analytics tools. |
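Putting those user agents together, a ready-made robots.txt block might look like the sketch below. The tokens are taken from the table above; vendors do change them over time, so check each company’s own documentation before relying on these names:

```
# Dissuade AI and content-scraping crawlers (honour system only)
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: Omgili
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: ImagesiftBot
Disallow: /
```

Listing several `User-agent` lines above a single `Disallow: /` applies that rule to every crawler named in the group, which keeps the file short.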

Once you’ve decided to try to block these content scrapers, you might want to think about bad bots. Thankfully, there are several resources for that.

| Resource | Link | Notes |
| --- | --- | --- |
| Dark Visitors | Updated website | Features a sign-up facility and the ability to submit new crawlers. |
| Recent list of crawlers & bots | robots.txt file | A ready-made robots.txt file. |

