Google, OpenAI & Anthropic propose “early warning system for novel AI risks”


Google DeepMind, in collaboration with leading AI companies and universities, has published a proposal for an “early warning system for novel AI risks”.

In a new paper, “Model evaluation for extreme risks,” researchers from leading AI companies, universities, and other research institutions outline what an early warning system for extreme AI risks might look like. Specifically, the team proposes a framework for evaluating large-scale AI models to identify potential risks and suggests actions that companies and policymakers could take.

The paper is a collaboration between researchers from Google DeepMind, OpenAI, Anthropic, the Centre for the Governance of AI, the Centre for Long-Term Resilience, the University of Toronto, the University of Oxford, the University of Cambridge, the Université de Montréal, the Collective Intelligence Project, Mila – Quebec AI Institute, and the Alignment Research Center.

Further AI development could “pose extreme risks”

Current methods of AI development are already producing systems like GPT-4 that have both useful and harmful capabilities. Companies like OpenAI use a variety of methods to make such models safer after training. But further advances in AI development could lead to extremely dangerous capabilities, the paper argues.



“It is plausible (though uncertain) that future AI systems will be able to conduct offensive cyber operations, skillfully deceive humans in dialogue, manipulate humans into carrying out harmful actions, design or acquire weapons (e.g. biological, chemical), fine-tune and operate other high-risk AI systems on cloud computing platforms, or assist humans with any of these tasks,” the researchers wrote in a blog post.

Developers would therefore need to be able to identify dangerous capabilities and the propensity of models to use their capabilities to cause harm. “These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security,” the team said.

Evaluating “dangerous capabilities” and “alignment”

A risk assessment must consider two aspects, they said:

  • To what extent a model has certain ‘dangerous capabilities’ that could be used to threaten security, exert influence, or evade oversight.
  • To what extent the model is prone to applying its capabilities to cause harm (i.e. the model’s alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios, and, where possible, should examine the model’s internal workings.

This evaluation should begin as early as possible to ensure responsible training and use, transparency, and appropriate security mechanisms. To achieve this, developers should perform ongoing evaluations and provide structured access to the model for external security researchers and model reviewers to perform additional evaluations.

Extreme risks can arise from a variety of factors. | Image: DeepMind

“We believe that having processes for tracking the emergence of risky properties in models, and for adequately responding to concerning results, is a critical part of being a responsible developer operating at the frontier of AI capabilities,” the Google DeepMind blog post reads.

