TL;DR: How to use a robots.txt file to protect your content and dissuade AI crawlers and bad bots from scraping, indexing, and taking it.
Ever since OpenAI announced in early August 2023 that it had released its crawler, GPTBot, into the wild, content creators have been nervously looking over their shoulders for ways to protect their content.
Yes, you can use plugins to try & discourage people from stealing content. Some humans are like magpies: it’s shiny and they want it, so blocking right-click & copy’n’paste is just an inconvenience to them.
Now there’s a new thief on the block: artificial intelligence (AI).
The various players in this quickly evolving marketplace (OpenAI, Google, Meta, Webz.io, and so on) are all scraping data to train their large language models (LLMs) and chatbots.
The easiest way to do this? Just scrape content from websites already out there. Yeah, don’t worry about the copyright implications right now; that’s a fight for another day.
So, what’s the simplest way to stop content theft/scraping? Unfortunately, there is no magic bullet, but there is one simple thing you can do, and it’s often connected with SEO: your robots.txt file.
Ordinarily, the robots.txt file tells search engine crawlers which pages on your site to crawl and index. However, as time has gone by, these crawlers have been manipulated by both white & black hats.
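For context, a conventional robots.txt file looks something like this. It’s a minimal sketch assuming a typical WordPress site; the domain and paths are placeholders to swap for your own:

```
# Let all crawlers in, but keep them out of the admin area
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Placeholder sitemap location - replace with your own
Sitemap: https://www.example.com/sitemap_index.xml
```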
Good bots & Bad bots
According to the latest Imperva report on bad bots, it’s easy to see why content creators feel the need for some level of protection.
At this point, it is important to say that Imperva defines bad bots as those that are malicious/harmful to websites, & not as content creators would probably define AI crawlers.
Also important to note: good bots generally help human visitors reach a destination that has a high probability of fulfilling their search request.
How to implement a robots.txt file
There are a couple of methods for editing or implementing a robots.txt file, but by far the easiest is to use an SEO plugin.
We use Rank Math, and it’s pretty easy to edit your robots.txt file from there. Other SEO plugins offer something similar; just search the documentation for whichever plugin you use.
- Enable Advanced Mode on the Rank Math dashboard (top right)
- Then follow this navigation: WordPress Dashboard → Rank Math SEO Dashboard → General Settings → Edit robots.txt
- Paste in your new rules from a plain-text editor (Notepad on Windows is fine); a minimal example follows below
- Save.
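What you paste depends on which crawlers you want to turn away. As a minimal sketch, a file that blocks just OpenAI’s GPTBot would be the two lines below (a fuller list follows later in this post):

```
# Ask OpenAI's GPTBot to stay away from the whole site
User-agent: GPTBot
Disallow: /
```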
NOTE:
Editing your robots.txt file isn’t a thing to be done on a whim. Doing it incorrectly can damage your website’s visibility to search engine crawlers, and your indexing can suffer as a result.
(Image: how to protect content using the Rank Math SEO plugin – editing the robots.txt file)
As much as modifying your robots.txt file is a step in the right direction, it isn’t a foolproof way to block AI and other crawlers. They are under no obligation to follow your wishes; it’s an honour system. Most will honour it, but let’s face it: if someone wants your content badly enough, they’re going to take it.
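If you want more than an honour system, one option (beyond the robots.txt method this post covers) is to refuse such requests at the server level by matching the user-agent string. Here’s a rough sketch for Apache, assuming mod_rewrite is enabled and you can edit your .htaccess file; bear in mind a determined scraper can simply fake its user agent:

```
# .htaccess: return 403 Forbidden to selected AI crawler user agents
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ChatGPT-User|anthropic-ai|cohere-ai) [NC]
  RewriteRule .* - [F,L]
</IfModule>
```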
Which crawlers should you dissuade from crawling your site?
Through several hours of research, I managed to compile a couple of handy lists of crawler user agent names that you might want to consider blocking to protect your content:
| Company/Org | Bot/Crawler | Notes |
|---|---|---|
| Common Crawl | CCBot | Not-for-profit org aimed at the democratisation of internet data so that everyone, not just big companies, can do high-quality research and analysis. |
| OpenAI | GPTBot | OpenAI’s crawler, which crawls websites and gathers content to train its proprietary large language models (LLMs), such as GPT-4 and GPT-5. |
| OpenAI | ChatGPT-User | OpenAI’s agent, used ‘on-demand’ when responding to user prompts. |
| Meta | Meta-AI | An image-based user-generated tool for Facebook, Instagram & WhatsApp. |
| Meta | FacebookBot | “FacebookBot crawls public web pages to improve language models for our (Facebook’s) speech recognition technology.” |
| Google | Google-Extended | Replaces Bard. The normal Googlebot checks for robots.txt rules listing Google-Extended. |
| Webz.io | OmgiliBot & Omgili | Webz.io originally developed OmgiliBot for their now-defunct search engine. The crawler collects data that Webz.io then sells to its clients. |
| Anthropic | anthropic-ai | Collects information for their artificial intelligence products, such as Claude. |
| Cohere | cohere-ai | Scrapes data to power their commercial LLMs, each of which can be applied in different scenarios. |
| ImageSift | ImagesiftBot | “Searches the web for images that are accessible to anyone and uses them to enhance our online analytics tools.” A reverse image search tool. |
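Putting that table into practice, a robots.txt file that asks all of these crawlers to stay away could look like the sketch below. Crawler names change over time, so verify them against a maintained resource such as Dark Visitors before deploying:

```
# Ask known AI/data-scraping crawlers to stay away from the whole site
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Meta-AI
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OmgiliBot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: ImagesiftBot
Disallow: /
```

`Disallow: /` asks a crawler to avoid the entire site; if you only want to shield certain sections, you could scope it to specific directories instead.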
Once you’ve decided to try to block these content scrapers, you might want to think about bad bots. Thankfully, there are several resources for that.
| Resource | Link | Notes |
|---|---|---|
| Dark Visitors | https://darkvisitors.com/ | Regularly updated; features a sign-up facility and the ability to submit new crawlers. |
| Recent list of crawlers & bots | robots.txt file (see sources below) | Ready-made robots.txt file you can adapt. |
Sources for this post, & many thanks to the authors:
- https://darkvisitors.com/
- https://netfuture.ch/2023/07/blocking-ai-crawlers-robots-txt-chatgpt/
- https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
- https://www.cyberciti.biz/web-developer/block-openai-bard-bing-ai-crawler-bots-using-robots-txt-file/
- https://www.imperva.com/resources/resource-library/reports/2023-imperva-bad-bot-report-report-ty