OpenAI quietly unveils GPTBot dedicated web crawler
Website administrators have the power to prevent GPTBot from collecting information


OpenAI has quietly introduced a way for website administrators to stop the company's web crawler from collecting data from their sites.
The firm behind ChatGPT published instructions for turning off its web crawler in its online documentation. Members of the AI community spotted the addition on Monday, but it arrived without an official announcement.
GPTBot can be identified by the user agent token ‘GPTBot’. The full user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
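For administrators who want to spot the crawler in their own logs or request handlers, a minimal Python sketch of matching on that token might look like the following. The function name `is_gptbot` is hypothetical, and matching on the header alone is an assumption: any client can send an arbitrary User-Agent, so this identifies requests that claim to be GPTBot rather than proving they are.

```python
import re
from typing import Optional

# Token as quoted in the article. Matching is case-insensitive as a
# precaution; this is a choice made for the sketch, not a documented rule.
GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains the GPTBot token."""
    if not user_agent:
        return False
    return GPTBOT_TOKEN.search(user_agent) is not None


if __name__ == "__main__":
    full_string = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
        "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    )
    print(is_gptbot(full_string))  # True
    print(is_gptbot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```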
Stopping GPTBot from crawling a site requires adding it to the robots.txt file along with the parts of the site off-limits to the crawler. The same technique is used to stop crawlers, such as Googlebot, from accessing all or part of a domain.
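As a rough illustration, the sketch below embeds a sample robots.txt in a Python script and checks it with the standard library's `urllib.robotparser`. The `/private/` path and the `example.com` URLs are made up for the example; a site wanting to block GPTBot entirely would use `Disallow: /` instead.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: GPTBot is kept out of /private/ (a hypothetical path),
# while every other crawler is allowed everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    # GPTBot is refused the disallowed path but may fetch the rest.
    print(parser.can_fetch("GPTBot", "https://example.com/private/report.html"))  # False
    print(parser.can_fetch("GPTBot", "https://example.com/blog/"))  # True
    # Other crawlers fall through to the wildcard rule.
    print(parser.can_fetch("Googlebot", "https://example.com/private/report.html"))  # True
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works for crawlers that choose to honour it.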
The company also confirmed the IP address block used by the crawler. Rather than taking the robots.txt route, an administrator could simply block those addresses.
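In practice, address blocking is usually done at the firewall or web server, but the check itself can be sketched in Python with the standard `ipaddress` module. The range below is a placeholder from the block reserved for documentation, not OpenAI's actual range, which should be taken from the company's published documentation; `is_blocked` is a hypothetical helper name.

```python
from ipaddress import ip_address, ip_network

# PLACEHOLDER: 192.0.2.0/24 is a range reserved for documentation examples.
# Replace it with the range(s) OpenAI publishes for GPTBot.
BLOCKED_NETWORKS = [ip_network("192.0.2.0/24")]


def is_blocked(remote_addr: str) -> bool:
    """Return True if the client address falls inside a blocked range."""
    try:
        addr = ip_address(remote_addr)
    except ValueError:
        # Not a valid IP address; leave the decision to other checks.
        return False
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))    # True: inside the placeholder range
    print(is_blocked("198.51.100.4"))  # False: outside it
```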
There is currently no way to remove data already included in training sets. GPT-3.5 and GPT-4 are based on data dated up to September 2021.
GPTBot's approach is essentially ‘opt-out’: crawling goes ahead unless website administrators take proactive measures. Data could be used in future models unless an admin specifically adds GPTBot to a site’s robots.txt file to stop the crawler.
Some commentators have speculated that OpenAI’s move could permit the company to lobby for anti-scraping regulation or defend itself against future actions.
However, data that has already been collected is unlikely to escape the attention of lawmakers. GPT-4, for example, was launched in March 2023 based on data already added to training sets.
OpenAI has used other datasets to train its models, including Common Crawl. The CCBot crawler used to generate that data can also be blocked with entries in robots.txt. However, GPTBot represents a dedicated crawler for the company.
As well as being able to block the crawler, there are other possible uses for the detection of the GPTBot. One suggestion has been serving up different responses to OpenAI following the identification of the crawler.
Being able to direct OpenAI’s crawler to pages of deliberate misinformation could result in training datasets lacking accuracy.
OpenAI’s published intention for the crawler is for its AI models to become more accurate and feature improved capabilities and safety.
What is a crawler and why does OpenAI need one?
A web crawler is a bot that systematically works its way through the World Wide Web, collecting data as it does so.
For a search engine such as Google, this information is used to build an index for query purposes. Other uses include archiving web pages.
The robots.txt file is used to request that crawler bots only index certain parts of a website or nothing at all. Omitting a crawler from this file will result in public-facing information being collected.
Large language models, such as OpenAI's, require training datasets to provide accurate responses to user queries. Web crawlers are an ideal method for generating these datasets. The Common Crawl bot, for example, seeks to provide a copy of the internet for research and analysis.
ITPro contacted OpenAI for comment.

Richard Speed is an expert in databases, DevOps and IT regulations and governance. He was previously a Staff Writer for ITPro, CloudPro and ChannelPro, before going freelance. He first joined Future in 2023 having worked as a reporter for The Register. He has also attended numerous domestic and international events, including Microsoft's Build and Ignite conferences and both US and EU KubeCons.
Prior to joining The Register, he spent a number of years working in IT in the pharmaceutical and financial sectors.