Data pipelines powering generative AI systems rooted in invasions of privacy

Global: Enormous data pipelines powering major generative AI systems are rooted in mass invasions of privacy by design - Amnesty International

@Kassy Cho

Companies are extracting vast troves of online data through unlawful web scraping to build their generative artificial intelligence (AI) products in a way that is enabling a mass invasion of privacy, making these systems unlawful by design, Amnesty International said in a new briefing today.

The briefing, Unlawful by Design: Exposing the Human Rights Costs of Generative AI, documents serious risks in the large-scale data scraping and processing used to build and train these systems, including violations of the right to privacy by design and adverse consequences for the environment and historically marginalized communities.

“Companies across the world are supplying generative AI products under the veneer of efficiency and sophistication, but in reality, these systems perpetuate mass invasions of privacy through unlawful web scraping: an automated process for extracting data from websites, including personal data, such as images and social media activity, to train AI models,” said Likhita Banerji, Head of the Algorithmic Accountability Lab, Amnesty International.

“The extractive data pipeline, inherent design choices made by tech companies and exploitative supply chains, to build generative AI systems have enabled a paradigm of technology development that opens up a risk of mass abuse of human rights.”

Amnesty International researched the models powering some of the most popular publicly available standalone generative AI tools, including GPT-3 by OpenAI, Google’s Gemini, Meta’s Llama, DeepSeek and tools by Midjourney and Stable Diffusion.

Such systems rely on extracting information from billions of public online posts and images, often without the explicit consent of the individuals appearing in or creating them. Not only does this infringe on privacy by design, but as the datasets powering AI models scale up, the hateful and discriminatory content in their outputs is also amplified, along with negative stereotypes and prejudices, especially along racial and gendered lines.

“These choices are not inevitable. We must challenge the design choices adopted by companies who build generative AI systems by relying on training data, including personal data, that is extracted non-consensually and on a grand scale.”

Likhita Banerji, Head of the Algorithmic Accountability Lab, Amnesty International

Racial, gender and cultural biases are consistent features of generative AI systems, a product of the training data that is largely pulled from the web and therefore polluted with real-world biases which harm historically marginalized communities. Additionally, generative AI systems pose risks to the right to freedom of thought as they are capable of influencing users’ thoughts and shaping their personal beliefs through predictive suggestions. This is especially true for larger models reliant on expansive training data.

“This is one of the most egregious practices among AI companies operating with disregard for human rights and must urgently be addressed. A different trajectory of technology development is possible if authorities act urgently to course correct.”

Heavy environmental costs

As the scale and speed of development have picked up at generative AI companies, so have the infrastructure requirements and associated environmental costs.

The higher processing needs of larger models require more energy-intensive chips and larger data centres, and consequently more energy and water to operate them. Generative AI production often harms historically marginalized communities, as the lands and resources that belong to these communities are exploited to build data centres and fulfill processing requirements.

Google’s own sustainability report from 2024 noted a staggering 48 per cent increase in the company’s greenhouse gas emissions since 2019, attributable to data centre and supply chain emissions. Similarly, Microsoft’s emissions increased by 29 per cent between 2020 and 2024, attributable to data centres carrying out AI-supporting processes.

The intensive use of resources in generative AI production has led communities, from Cerrillos in Chile and Querétaro in Mexico to Arizona in the United States of America, to resist data centres in areas that are already heavily affected by droughts and electricity shortages.

As part of...
