How to block OpenAI’s new AI-training web crawler from ingesting your data


A man is seen using the OpenAI ChatGPT artificial intelligence chat website in this illustration photo on 18 July, 2023. (Photo by Jaap Arriens/NurPhoto via Getty Images)


ChatGPT creator OpenAI has released a new web crawler — called GPTBot — along with directions on how to block it. 

ChatGPT is one of the most capable AI systems ever built, despite recent reports of its wavering intelligence. OpenAI, the company behind the AI chatbot, continues to train its large language models (LLMs), like GPT-3.5 and GPT-4.


Web crawlers, used by search engines like Google and Bing to scan websites and index content, are also used by AI companies to gather training data for LLMs. These models learn from the content of websites and any other data their developers choose to train them on. A web crawler speeds up this process by collecting massive amounts of data for the models to train on.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI notes in its GPTBot documentation. The company says it filters out web pages that require paywall access, gather personally identifiable information, or contain text that violates OpenAI’s policies.

Developers have the option of blocking the GPTBot from accessing their sites and using their information to train AI systems. 

OpenAI explains how to disallow or customize GPTBot access to your site.


Screenshot: OpenAI | Image Composition: Maria Diaz/ZDNET

To block GPTBot from accessing a site altogether, the site owner can add a “User-agent: GPTBot” entry to the site’s robots.txt file, followed by “Disallow: /”.
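As a rough sketch of what that rule does, the snippet below feeds the two-line robots.txt entry to Python's standard-library `urllib.robotparser` and asks whether GPTBot may fetch a page. The helper name `gptbot_can_fetch` and the `example.com` URL are placeholders for illustration, and this is only a local check with Python's parser, not a statement about how OpenAI's crawler evaluates rules internally.

```python
from urllib.robotparser import RobotFileParser

# robots.txt contents that block GPTBot from the entire site.
BLOCK_ALL_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL.

    Hypothetical helper for illustration; it relies on Python's own parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real target.
    print(gptbot_can_fetch(BLOCK_ALL_ROBOTS_TXT, "https://example.com/blog/post"))
```

Because the rule names only GPTBot, other crawlers that read the same file are not affected by it.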

OpenAI also lets site owners customize GPTBot’s access so that it crawls only certain parts of a site. To do so, add a “User-agent: GPTBot” entry to the site’s robots.txt file, followed by “Allow: /directory-1/” for each directory the crawler may access and “Disallow: /directory-2/” for each directory it may not, adjusting the paths as needed.
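The partial-access rules can be tried out the same way. This is a minimal sketch, assuming the placeholder directory names from OpenAI's example and a hypothetical helper called `check_paths`; it uses Python's standard-library `urllib.robotparser`, which may not resolve every edge case exactly as GPTBot does.

```python
from urllib.robotparser import RobotFileParser

# robots.txt contents that open one directory to GPTBot and close another.
# The directory names are the placeholders from OpenAI's example.
PARTIAL_ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def check_paths(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to whether the user agent may fetch it.

    Hypothetical helper for illustration; example.com is a placeholder host.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        path: parser.can_fetch(user_agent, "https://example.com" + path)
        for path in paths
    }


if __name__ == "__main__":
    paths = ["/directory-1/page.html", "/directory-2/page.html"]
    for path, allowed in check_paths(PARTIAL_ROBOTS_TXT, "GPTBot", paths).items():
        print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Running a check like this before deploying a robots.txt change is a cheap way to catch a typo in a path or a rule placed under the wrong user agent.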


OpenAI had not previously announced the use of web crawlers to train GPT-3.5, the LLM behind the free version of ChatGPT, or GPT-4, its newest LLM, which is available to ChatGPT Plus subscribers and powers Bing AI.

Though it’s unclear whether GPTBot was used to train OpenAI’s currently available LLMs, it could be the web crawler gathering training data for GPT-5, especially as the company filed to trademark the name in July. While OpenAI has not announced a release date for GPT-5, the new LLM is expected to be more powerful and larger than GPT-4, which is currently the largest LLM available.


Since the launch of ChatGPT, OpenAI has been hit with several lawsuits alleging that the AI tool is stealing data from users, including a copyright infringement case that made the company the target of an FTC investigation. Websites like Stack Overflow, Reddit, and Twitter have said they plan to begin charging AI companies to access their data.


