As it weighs any move to regulate the creation or use of generative artificial intelligence (AI) technologies, Congress should look to the state of U.S. data privacy law, or the lack thereof, as an important guidepost in setting the rules of the road for AI developers, who must train their algorithms on massive amounts of data.

That’s the bottom-line takeaway from a report issued on May 23 by the Congressional Research Service (CRS), a nonpartisan public policy research institute that serves members of Congress and their staffs.

The report offers a primer on generative AI tech such as OpenAI’s ChatGPT, and then points to policy and legislative issues that Congress may want to consider.

“Generative AI models have received significant attention and scrutiny due to their potential harms, such as risks involving privacy, misinformation, copyright, and non-consensual sexual imagery,” the report says.

“This report focuses on privacy issues and relevant policy considerations for Congress,” it continues. “Some policymakers and stakeholders have raised privacy concerns about how individual data may be used to develop and deploy generative models. These concerns are not new or unique to generative AI, but the scale, scope, and capacity of such technologies may present new privacy challenges for Congress.”

On the data front, the report says that generative AI models, particularly those built on large language models (LLMs), “require massive amounts of data.”

“For example, OpenAI’s ChatGPT was built on a LLM that trained, in part, on over 45 terabytes of text data obtained (or ‘scraped’) from the internet. The LLM was also trained on entries from Wikipedia and corpora of digitized books,” CRS said. “OpenAI’s GPT-3 models were trained on approximately 300 billion ‘tokens’ (or pieces of words) scraped from the web and had over 175 billion parameters, which are variables that influence properties of the training and resulting model.”

“Critics contend that such models rely on privacy-invasive methods for mass data collection, typically without the consent or compensation of the original user, creator, or owner,” the report says. “Additionally, some models may be trained on sensitive data and reveal personal information to users.”

“Academic and industry research has found that some existing LLMs may reveal sensitive data or personal information from their training datasets,” CRS said, adding that “some models are used for commercial purposes or embedded in other downstream applications.”

In the face of those data privacy-related concerns, CRS pointed out that “the United States does not currently have a comprehensive data privacy law. Congress enacted several laws that create data requirements for certain industries and subcategories of data, but these statutory protections are not comprehensive.”

The report identifies healthcare-related data, data collected from minors, and differing state data laws as areas with implications for generative AI applications.

“In many cases, the collection of personal information typically implicates certain state privacy laws that provide an individual a ‘right to know’ what a business collects about them; how data is used and shared; the ‘right to access and delete’ their data; or the right to opt-out of data transfers and sales,” the report says. “However, some of these laws include exemptions for the collection of public data, which may raise questions about how and whether they apply to generative AI tools that use information scraped from the internet.”

“In the absence of a comprehensive federal data privacy law, some individuals and groups have turned to other legal frameworks (e.g., copyright, defamation, right of publicity) to address potential privacy violations from generative AI and other AI tools,” the report says.

“Congress may consider enacting comprehensive federal privacy legislation that specifically addresses generative AI tools and related concerns,” CRS said, while adding that legislators may want to consider a proposed AI Act in the European Union that covers data regulation, disclosures, and documentation, and has a category for AI systems.

John Curran
John Curran is MeriTalk's Managing Editor covering the intersection of government and technology.