When OpenAI released the large language model GPT-3 in July 2020, an accompanying technical paper offered a glimpse of the data used to train it: millions of pages scraped from the web, Reddit posts, books, and more. The personal information of millions of Italians included in that training data has now gotten OpenAI into trouble. On March 31, Italy’s data regulator, the Garante per la Protezione dei Dati Personali, issued an emergency decision demanding that OpenAI stop using it.
According to the regulator, OpenAI does not have the legal right to use people’s personal information in ChatGPT. In response, OpenAI has blocked access to the chatbot in Italy while it answers the officials investigating the matter. It is the first action taken against ChatGPT by a Western regulator, and it highlights the privacy tensions around building giant generative AI models, which are often trained on vast swathes of internet data.
Elizabeth Renieris, a senior research associate at Oxford’s Institute for Ethics in AI who writes about data practices, suggests the problem is fundamental to the technology itself and may be difficult to resolve. Many of the data sets used to train machine learning systems have existed for years, and privacy was likely given little consideration when they were first compiled.
Renieris notes that data reaches systems like GPT-4 through a complex supply chain, with little to no data protection by design or default. In 2022, for example, the creators of a widely used image database suggested that images of people’s faces in the data set should be blurred. In Europe and California, privacy rules allow individuals to request that their information be deleted, or corrected if it is inaccurate. But deleting inaccurate data from an AI system may not be straightforward, especially when the origins of the data are unclear, and experts question whether GDPR, which has no specific provisions for large language models, will be able to uphold people’s rights in the long term.
There is at least one relevant precedent: the US Federal Trade Commission has ordered a company to delete algorithms built from data it had no right to use, and with increased scrutiny such orders could become more common. Even so, it may be difficult to fully purge an AI model of all the personal data used to train it, especially if that data was unlawfully collected.