OpenAI, the developer of the generative artificial intelligence chatbot ChatGPT, is facing refusals from many websites to let it collect data from their pages, data it gathers on the stated grounds of improving the accuracy of the AI models it develops.

 


Edited by Alexander Yaxina

 

Technology section – CJ journalist

 

World – September 7, 2023

 


On August 8, 2023, OpenAI quietly unveiled GPTBot, a web crawler: a program that scans web pages in a systematic, automated, and organized way in order to index them and extract data. Its purpose, according to OpenAI, is to help its AI models "become more accurate" and to "improve their overall capabilities and reliability". OpenAI also announced that GPTBot avoids collecting data from sources that require paid access and removes data containing personally identifiable information (PII).

The purpose of this announcement is precisely to reassure media sites and news platforms that GPTBot will not collect data protected by a paywall.

But many media outlets soon restricted GPTBot's access to the content of their websites. The New York Times adopted this approach, as did CNN in the United States, the Australian ABC group, the British Guardian, and the news agencies Reuters and Bloomberg.

In France, the approach has been adopted by Agence France-Presse and by public media groups such as France Médias Monde, the parent organization of Radio Monte Carlo International, which also encompasses France 24 and Radio France Internationale. They were followed by the Radio France group and France Télévisions: the French public media groups block GPTBot's access to the content of their websites "as a precautionary measure". Private media organizations have done the same, among them the TF1 group and the newspaper Le Figaro.

According to a study conducted by Originality.ai, a plagiarism-detection tool, which examined the robots.txt files (the files that govern how crawling bots operate on websites) of more than a thousand sites worldwide, 9.2% of the platforms blocked GPTBot within the first two weeks of its operation.

The most prominent sites that have restricted GPTBot's access include Amazon.com, WikiHow.com, Quora.com, and Shutterstock, as well as Foursquare, Tumblr, Ikea, Airbnb, and Lonely Planet.

The share of sites blocking the GPTBot web crawler is expected to keep rising.

Originality.ai estimates that the percentage of websites blocking GPTBot's access will grow by 5% weekly, and that blocking rates are higher among the websites that record the most visits.

Most media organizations that have restricted GPTBot's access consider OpenAI's practice an unauthorized looting of content. They are also wary of how ChatGPT, the generative natural-language model, operates: as a predictive generative model, it mixes into its answers material collected from serious, reliable, and costly journalistic work alongside information that may be uncertain or produced by the model's hallucinations.

"Generative artificial intelligence works according to a probabilistic model, which can correlate our location data with other more or less accurate, or even erroneous data,"says Vincent Fleury, director of digital environments at France mediamund.

"Today, there are hundreds of startups in various media-related fields that rely on generating texts from artificial intelligence models to generate conversation in natural language, and then publish them for profit, which is a process of stealing the work of journalists and also misinformation through unreliable information,"adds Vincent Fleury.

The British Guardian considers that what GPTBot is doing "is commercial exploitation through the collection of data protected by copyright".

Several press organizations, including the New York Times, are considering legal action against OpenAI to protect their copyrights, while others intend to negotiate with companies specializing in generative artificial intelligence to sell access to their data for a fee.

GPTBot's access can be blocked completely or restricted partially. For content creators, journalists, and others who run free or subscription-based sites, OpenAI has published on its blog the directives to add to a website's robots.txt file either to block the GPTBot crawler entirely or to allow it access to only certain parts of the site's content.
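As an illustration, such directives follow the standard robots.txt syntax, targeting GPTBot by its user-agent name. A minimal sketch (the directory paths here are hypothetical examples, not paths from OpenAI's documentation) might look like:

```
# Block GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Or, alternatively, allow GPTBot into one section only
# (example paths; adapt to the site's own structure)
User-agent: GPTBot
Allow: /public-articles/
Disallow: /
```

Only one of the two rule groups would be used on a given site; the file is placed at the site root (e.g. example.com/robots.txt), where compliant crawlers check it before fetching pages.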

 




Castle Journal Group