TL;DR: How to use a robots.txt file to protect your content and dissuade AI crawlers and bad bots from scraping, indexing, and taking it.
Ever since OpenAI announced in early August 2023 that it had released its crawler, GPTBot, into the wild, content creators have been nervously looking over their shoulders for ways to protect their content.
Yes, you can use plugins to try & discourage people from stealing content. Some humans are like magpies: it’s shiny and they want it, so blocking right-click & copy’n’paste is just an inconvenience to them.
Now there’s a new thief on the block: artificial intelligence (AI).
The various players in this quickly evolving marketplace (OpenAI, Google, Meta, Webz.io, and so on) are all scraping data to train their large language models (LLMs) and chatbots.
The easiest way to do this? Just scrape content from websites already out there. Yeah, don’t worry about the copyright implications right now; that’s a fight for another day.
So, what’s the simplest way to stop content theft/scraping? Unfortunately, there is no magic bullet, but there is one simple thing you can do, and it’s often connected with SEO: your robots.txt file.
Ordinarily, the robots.txt file tells search engine crawlers which pages on your site to crawl and index. However, as time has gone by, these crawlers have been manipulated by both white & black hats.
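For context, a conventional robots.txt file looks something like this. It’s a minimal sketch assuming a typical WordPress site; the domain and paths are placeholders to swap for your own:

```
# Let all crawlers in, but keep them out of the admin area
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Placeholder sitemap location - replace with your own
Sitemap: https://www.example.com/sitemap_index.xml
```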
Good bots & Bad bots
According to the latest Imperva report on bad bots, it’s easy to see why content creators feel the need for some level of protection.
At this point, it is important to say that Imperva defines bad bots as those that are malicious/harmful to websites, & not as content creators would probably define AI crawlers.
Also important to note: good bots generally help human visitors reach a destination that has a high probability of fulfilling their search request.
How to implement a robots.txt file
There are a couple of methods for editing or implementing a robots.txt file, but by far the easiest is to use an SEO plugin.
We use Rank Math, and it’s pretty easy to edit your robots.txt file from there. Other SEO plugins offer something similar; just search the documentation for whichever plugin you use.
- Enable Advanced Mode on the Rank Math dashboard (top right)
- Then follow this navigation: WordPress Dashboard → Rank Math SEO Dashboard → General Settings → Edit robots.txt
- Paste in your new rules from a plain-text editor (Notepad on Windows is fine); a minimal example follows below
- Save.
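What you paste depends on which crawlers you want to turn away. As a minimal sketch, a file that blocks just OpenAI’s GPTBot would be the two lines below (a fuller list follows later in this post):

```
# Ask OpenAI's GPTBot to stay away from the whole site
User-agent: GPTBot
Disallow: /
```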
NOTE:
Editing your robots.txt file isn’t a thing to be done on a whim. Doing it incorrectly can damage your website’s visibility to search engine crawlers, and your indexing can suffer as a result.
(Image: how to protect content using the Rank Math SEO plugin – editing the robots.txt file)
As much as modifying your robots.txt file is a step in the right direction, it isn’t a foolproof way to block AI and other crawlers. They are under no obligation to follow your wishes; it’s an honour system. Most will honour it, but let’s face it: if someone wants your content badly enough, they’re going to take it.
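If you want more than an honour system, one option (beyond the robots.txt method this post covers) is to refuse such requests at the server level by matching the user-agent string. Here’s a rough sketch for Apache, assuming mod_rewrite is enabled and you can edit your .htaccess file; bear in mind a determined scraper can simply fake its user agent:

```
# .htaccess: return 403 Forbidden to selected AI crawler user agents
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ChatGPT-User|anthropic-ai|cohere-ai) [NC]
  RewriteRule .* - [F,L]
</IfModule>
```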
Which crawlers should you dissuade from crawling your site?
Through several hours of research, I managed to compile a couple of handy lists of crawler user agent names that you might want to consider blocking to protect your content:
| Company/Org | Bot/Crawler | Notes |
|---|---|---|
| Common Crawl | CCBot | Not-for-profit org aimed at the democratisation of internet data so that everyone, not just big companies, can do high-quality research and analysis. |
| OpenAI | GPTBot | OpenAI’s crawler, which crawls websites and gathers content to train its proprietary large language models (LLMs), such as GPT-4 and GPT-5. |
| OpenAI | ChatGPT-User | OpenAI’s agent, used ‘on-demand’ when responding to user prompts. |
| Meta | Meta-AI | An image-based user-generated tool for Facebook, Instagram & WhatsApp. |
| Meta | FacebookBot | “FacebookBot crawls public web pages to improve language models for our (Facebook’s) speech recognition technology.” |
| Google | Google-Extended | Replaces Bard. The normal Googlebot checks for robots.txt rules listing Google-Extended. |
| Webz.io | OmgiliBot & Omgili | Webz.io originally developed OmgiliBot for their now-defunct search engine. The crawler collects data that Webz.io then sells to its clients. |
| Anthropic | anthropic-ai | Collects information for their artificial intelligence products, such as Claude. |
| Cohere | cohere-ai | Scrapes data to power their commercial LLMs, each of which can be applied in different scenarios. |
| ImageSift | ImagesiftBot | “Searches the web for images that are accessible to anyone and uses them to enhance our online analytics tools.” A reverse image search tool. |
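Putting that table into practice, a robots.txt file that asks all of these crawlers to stay away could look like the sketch below. Crawler names change over time, so verify them against a maintained resource such as Dark Visitors before deploying:

```
# Ask known AI/data-scraping crawlers to stay away from the whole site
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Meta-AI
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OmgiliBot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: ImagesiftBot
Disallow: /
```

`Disallow: /` asks a crawler to avoid the entire site; if you only want to shield certain sections, you could scope it to specific directories instead.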
Once you’ve decided to try to block these content scrapers, you might want to think about bad bots. Thankfully, there are several resources for that.
| Resource | Link | Notes |
|---|---|---|
| Dark Visitors | https://darkvisitors.com/ | Regularly updated; features a sign-up facility and the ability to submit new crawlers. |
| Recent list of crawlers & bots | robots.txt file (see sources below) | Ready-made robots.txt file you can adapt. |
Sources for this post, & many thanks to the authors:
- https://darkvisitors.com/
- https://netfuture.ch/2023/07/blocking-ai-crawlers-robots-txt-chatgpt/
- https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
- https://www.cyberciti.biz/web-developer/block-openai-bard-bing-ai-crawler-bots-using-robots-txt-file/
- https://www.imperva.com/resources/resource-library/reports/2023-imperva-bad-bot-report-report-ty