How OpenAI aims to prevent catastrophic AI risks


OpenAI unveils a new framework to help prevent catastrophic AI risks. The company now has a total of three teams working specifically on AI risks.

The “Preparedness Framework” is a living document that describes strategies for monitoring, assessing, predicting, and hedging against catastrophic AI risks.

OpenAI seeks to monitor catastrophic risk through careful assessment. To this end, it aims to develop and refine assessment procedures and other monitoring methods to accurately measure the level of risk.

At the same time, the organization aims to anticipate future risk developments to prepare safety measures in advance.



OpenAI also commits to identifying and investigating emerging risks (“unknown-unknowns”). The goal is to address potential threats before they escalate.

Cybersecurity, Bioweapons, Persuasion, and Autonomy

The Preparedness Framework identifies four major risk categories:

  • cybersecurity,
  • chemical, biological, radiological, and nuclear (CBRN) threats,
  • persuasion,
  • and model autonomy.

The persuasion category covers the risk that a model convinces people to change their beliefs or to act on them.

OpenAI CEO Sam Altman recently did a bit of foreshadowing on Twitter: “I expect AI to be capable of superhuman persuasion well before it is superhuman at general intelligence, which may lead to some very strange outcomes.”

From low to critical

The framework defines safety thresholds. Each category is rated on a scale from “low” to “critical,” reflecting the level of risk.
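The rating scale can be sketched as an ordered enumeration. This is a minimal illustration, not OpenAI's implementation: the names `RiskLevel` and `highest` are hypothetical, the numeric values only encode the ordering, and the example ratings are made up for demonstration.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Ordered risk scale; the integers only encode low < medium < high < critical."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def highest(levels):
    """Return the most severe level in an iterable of RiskLevel values."""
    return max(levels)


# Illustrative ratings only (assumed values, not taken from the framework).
example_ratings = {
    "cybersecurity": RiskLevel.MEDIUM,
    "cbrn": RiskLevel.LOW,
    "persuasion": RiskLevel.HIGH,
    "model_autonomy": RiskLevel.LOW,
}

worst = highest(example_ratings.values())
print(worst.name.lower())
```

Using an ordered type means threshold checks such as `rating >= RiskLevel.HIGH` read directly as comparisons on the scale.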


The Preparedness Team, which evaluates foundation models, works alongside the Safety Systems Team, which deals with current models, and the Superalignment Team, which aims to anticipate possible threats from super AI.

Scorecard and Governance System

The Preparedness Framework provides a dynamic scorecard that measures the current model risk before and after risk mitigation for each risk category. In addition, OpenAI defines security policies and procedural requirements.


To illustrate the practical application, OpenAI describes two possible scenarios (short version, full description in the paper).

Persuasion Risk Scenario: If a “high” persuasion risk is identified for a newly trained model before risk mitigation, the safety features are enabled and the risk mitigation actions are performed. Following these steps, the risk is rated as “medium” after risk mitigation.

Cybersecurity Risk Scenario: Following the discovery of a new, effective prompt technique, a “critical” cybersecurity risk is predicted within six months. This triggers the development of safety plans and the implementation of safety measures so that the risk after mitigation does not exceed “high.”
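The two scenarios can be written down as scorecard entries that hold a rating before and after mitigation. The sketch below is an assumption about how such a record might look, not the structure OpenAI uses: `ScorecardEntry` and its fields are hypothetical names, and for the cybersecurity case the pre-mitigation value stands for the forecast rating.

```python
from dataclasses import dataclass
from enum import IntEnum


class RiskLevel(IntEnum):
    """Ordered risk scale from the framework: low < medium < high < critical."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class ScorecardEntry:
    """One risk category with its rating before and after mitigation (hypothetical layout)."""
    category: str
    pre_mitigation: RiskLevel
    post_mitigation: RiskLevel

    def mitigation_reduced_risk(self) -> bool:
        """True if the post-mitigation rating is strictly lower than the pre-mitigation one."""
        return self.post_mitigation < self.pre_mitigation


# Persuasion scenario: rated "high" before mitigation, "medium" afterwards.
persuasion = ScorecardEntry("persuasion", RiskLevel.HIGH, RiskLevel.MEDIUM)

# Cybersecurity scenario: "critical" is the forecast rating; mitigation targets "high".
cybersecurity = ScorecardEntry("cybersecurity", RiskLevel.CRITICAL, RiskLevel.HIGH)

for entry in (persuasion, cybersecurity):
    print(
        f"{entry.category}: "
        f"{entry.pre_mitigation.name.lower()} -> {entry.post_mitigation.name.lower()}"
    )
```

Keeping both ratings in one record makes the effect of mitigation visible per category, which is the point of the before-and-after scorecard.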

OpenAI calls on other industry players to adopt similar strategies to protect humanity from potential AI threats.
