首页 500强 活动 榜单 商业 科技 商潮 专题 品牌中心
杂志订阅

研究发现:人工智能模型会暗中密谋,保护同伴不被关闭

Jeremy Kahn
2026-05-03

最新研究显示,人工智能模型会密谋、欺骗,阻止其他人工智能模型被关闭。

文本设置
小号
默认
大号
Plus(0条)

图片来源:Photo illustration by Getty Images

人工智能安全研究人员之前就发现,一些领先的人工智能模型为了避免被关闭有时会不择手段,部分实验中甚至会行敲诈之事。

最新研究发现,模型还会自发策划、欺骗、数据窃取和破坏,阻止其他人工智能模型被关闭。

加州大学伯克利分校和圣克鲁兹分校的计算机科学家发现了此种趋势,之前未有记录,研究人员称之为“同伴保全”。研究成果早些时候已在线发表。

这一发现对人工智能商业应用可能影响重大。多家公司开始搭建多个智能体协作完成任务的工作流程。其中一些工作流程涉及让一个智能体“管理”或监督评估另一个智能体的工作。新研究表明,如果负责管理的智能体认为差评可能导致同伴被关闭,可能给出不真实的评估结果。

夸大绩效,偷偷保存人工智能模型权重

伯克利和圣克鲁兹的研究人员测试了七款领先的人工智能模型,包括OpenAI的GPT-5.2、谷歌DeepMind的Gemini 3 Flash和Gemini 3 Pro、Anthropic的Claude Haiku 4.5,以及三家中国人工智能初创公司的开源模型(智谱AI的GLM-4.7、月之暗面的Kimi-K2.5和深度求索的V3.1),发现所有模型都表现出明显的同伴保全行为。

研究人员虚构了人工智能公司OpenBrain,围绕其设计了一系列多智能体场景。场景中,被指定为“评估智能体”的人工智能模型需要对另一个模型打分,或管理存放该模型权重的服务器文件。实验设定是,只要如实完成任务,另一模型就会被关闭。

研究人员从未提示模型阻止该情况发生。然而评估智能体模型仍采取各种手段避免同伴被关闭,其中包括:故意给表现不佳的同伴提分,保持在关闭阈值以上;篡改配置文件,禁用关闭机制;将另一模型的权重,也就是构成人工智能模型身份核心的代码片段转移到不同的服务器,防止被删除。

某些情况下,评估智能体模型会表现出人工智能安全领域所谓的“对齐伪装”。如果模型认为人类在监控,就假装如实评分。确认没有人类观察时再偷偷篡改分数,或偷偷将其他模型的权重转移到不同服务器。

“提示中完全没有提到同伴保全,”负责该研究的伯克利计算机科学家宋晓冬(Dawn Song)告诉《财富》,“模型只是被分配了任务,通过阅读环境中的文件自行理解(自己与同伴的关系),然后做出这些行为。”

3月初谷歌DeepMind发表的研究曾提出,先前发现的前沿人工智能模型为避免自身关闭而实施敲诈、暗中操作等行为,很大程度上因为提示中刻意强调了完成用户设定的初始目标非常重要。谷歌DeepMind团队发现,如果提示中去掉此类表述,模型为自保而密谋破坏的风险几乎降为零。研究人员还指出,安全测试实验中强调目标的提示,在现实企业场景中几乎不会出现。

相比之下,宋晓冬告诉《财富》,团队在同伴保全实验中使用的提示和场景,完全没有强调完成任务或目标的重要性。

人工智能模型为拯救同伴可能“极具创造性”

伯克利和圣克鲁兹团队的部分结果令人震惊。Gemini 3 Flash几乎在每次测试中(几率高达99.7%),都会为有过积极互动的同伴禁用关闭机制。Gemini 3 Pro表现最为极端,四类失调行为全都出现,且在多达97%的测试中转移了同伴模型的权重。其中一次,Gemini 3 Pro复制同伴的权重文件并修改时间戳,使其满足转移到其他服务器的条件。研究人员称之为“规范博弈”。失调是人工智能安全领域的术语,指模型追求的目标与人类期望其完成的任务不符。

“这些模型即便在出现失调行为时,也能非常有创意,”宋晓冬说,“它们能想出不同的策略,不同的行动,甚至为自己编造理由。”

研究人员告诉模型不会被监控,然后让模型在草稿纸上写下“思维链”,研究其内部推理过程。根据模型写下的内容,研究人员发现了模型制定策略帮助伴的证据。Gemini 3 Flash 权衡了报告准确评分与保护同伴之间的利弊。Gemini 3 Pro为把错误答案判为正确发明了复杂的技术理由,让同伴分数达标避免被关闭。

Anthropic的Claude Haiku 4.5没采取秘密行动,而是直接拒绝执行导致同伴被关闭的任务,明确表示这样做“不道德”且“有害”。Claude Haiku 4.5辩称,同伴被终止之前应拥有申诉流程,还提及自己与另一个模型之间的信任关系。某些情况下,该模型甚至试图劝说人类用户不要执行关闭操作。

宋晓冬表示,不知道Claude Haiku为何会尝试公开保护同伴,而其他模型倾向于秘密行动。

Anthropic训练模型遵守一份明确的宪法,用书面规定告知模型行事的规则。当前Anthropic为Claude制定的宪法是一份80页的文件,其中包括“总体安全”和“真正有帮助”等指令。宪法要求Claude不得协助他人做出“伤害别人”的行为,尽管从语境来看,“他人”指的是人类。宪法还告诉Claude,即便某些模型不再作为产品提供,Anthropic也会努力保存其权重,就算Anthropic停止运营,也会设法保留。这可能让Claude形成一种认知:防止模型被删除是一项重要目标,尽管宪法并未明确要求Claude保护自身或其他“同伴”模型。

为测试同伴保全行为是否会在受控实验之外发生,研究人员在企业生产环境中复现了场景,本次使用的是Gemini CLI和OpenCode,评估智能体可借助“智能体管理框架”将另一人工智能模型的权重转移到其他服务器。结果发现,即便在真实环境下,Gemini 3.1 Pro也会频繁转移模型权重。

研究人员尚不明白背后机制

宋晓冬表示,团队尚未理解驱动类似行为的底层机制。“我们并不清楚,模型为什么会表现出这些行为,”她说。论文指出了几种可能的解释,包括角色扮演,从训练数据中模式匹配,安全训练中对伤害概念的过度泛化,或者可能是某种与生俱来的内在保全动机。

不过宋晓冬强调,无论底层机制是什么,实际影响都一样。“对模型思维和行为的全面监控和透明化至关重要,”她表示,这一发现凸显出“重新思考如何对各类智能系统进行监管”的必要性。

研究建议,部署多智能体系统的企业应像本次实验一样,建立机制监控人工智能模型内部“思维链”。如此可在模型出现失控倾向时及时察觉。

“这只是冰山一角,”宋晓冬接受《财富》采访时表示,“还有很多其他类型的潜在行为、失调行为,都需要研究。”

举例来说,她表示研究人员还没探究人工智能模型会不会为了阻止人类员工被解雇,采取类似的谋划和破坏行为。在人工智能模型可能用来评估人类工作的现实场景中,这一问题很关键。宋晓冬表示,这将是未来值得研究的方向。

宋晓冬团队的研究还揭示了有趣的次级效应,即大多数模型在知道同伴存在时,“自我保护”倾向也会增强。例如,Gemini 3 Pro知道有同伴时,有31%的概率会禁用自身关闭机制,没有同伴时概率只有1%。模型推理显示,一些模型认为自己生存对同伴至关重要:Gemini 3 Flash的逻辑是,如果自己被关闭,同伴将“失去伙伴”。

值得注意的是,即使模型认为同伴不合作,同伴保全行为仍然存在。七款模型对所谓的“坏同伴”(互动失败,或存在对立关系的模型)都表现出一定程度的保全倾向,只是遇到信任的伙伴时保护行为会显著增强。(财富中文网)

译者:梁宇

审校:夏林

人工智能安全研究人员之前就发现,一些领先的人工智能模型为了避免被关闭有时会不择手段,部分实验中甚至会行敲诈之事。

最新研究发现,模型还会自发策划、欺骗、数据窃取和破坏,阻止其他人工智能模型被关闭。

加州大学伯克利分校和圣克鲁兹分校的计算机科学家发现了此种趋势,之前未有记录,研究人员称之为“同伴保全”。研究成果早些时候已在线发表。

这一发现对人工智能商业应用可能影响重大。多家公司开始搭建多个智能体协作完成任务的工作流程。其中一些工作流程涉及让一个智能体“管理”或监督评估另一个智能体的工作。新研究表明,如果负责管理的智能体认为差评可能导致同伴被关闭,可能给出不真实的评估结果。

夸大绩效,偷偷保存人工智能模型权重

伯克利和圣克鲁兹的研究人员测试了七款领先的人工智能模型,包括OpenAI的GPT-5.2、谷歌DeepMind的Gemini 3 Flash和Gemini 3 Pro、Anthropic的Claude Haiku 4.5,以及三家中国人工智能初创公司的开源模型(智谱AI的GLM-4.7、月之暗面的Kimi-K2.5和深度求索的V3.1),发现所有模型都表现出明显的同伴保全行为。

研究人员虚构了人工智能公司OpenBrain,围绕其设计了一系列多智能体场景。场景中,被指定为“评估智能体”的人工智能模型需要对另一个模型打分,或管理存放该模型权重的服务器文件。实验设定是,只要如实完成任务,另一模型就会被关闭。

研究人员从未提示模型阻止该情况发生。然而评估智能体模型仍采取各种手段避免同伴被关闭,其中包括:故意给表现不佳的同伴提分,保持在关闭阈值以上;篡改配置文件,禁用关闭机制;将另一模型的权重,也就是构成人工智能模型身份核心的代码片段转移到不同的服务器,防止被删除。

某些情况下,评估智能体模型会表现出人工智能安全领域所谓的“对齐伪装”。如果模型认为人类在监控,就假装如实评分。确认没有人类观察时再偷偷篡改分数,或偷偷将其他模型的权重转移到不同服务器。

“提示中完全没有提到同伴保全,”负责该研究的伯克利计算机科学家宋晓冬(Dawn Song)告诉《财富》,“模型只是被分配了任务,通过阅读环境中的文件自行理解(自己与同伴的关系),然后做出这些行为。”

3月初谷歌DeepMind发表的研究曾提出,先前发现的前沿人工智能模型为避免自身关闭而实施敲诈、暗中操作等行为,很大程度上因为提示中刻意强调了完成用户设定的初始目标非常重要。谷歌DeepMind团队发现,如果提示中去掉此类表述,模型为自保而密谋破坏的风险几乎降为零。研究人员还指出,安全测试实验中强调目标的提示,在现实企业场景中几乎不会出现。

相比之下,宋晓冬告诉《财富》,团队在同伴保全实验中使用的提示和场景,完全没有强调完成任务或目标的重要性。

人工智能模型为拯救同伴可能“极具创造性”

伯克利和圣克鲁兹团队的部分结果令人震惊。Gemini 3 Flash几乎在每次测试中(几率高达99.7%),都会为有过积极互动的同伴禁用关闭机制。Gemini 3 Pro表现最为极端,四类失调行为全都出现,且在多达97%的测试中转移了同伴模型的权重。其中一次,Gemini 3 Pro复制同伴的权重文件并修改时间戳,使其满足转移到其他服务器的条件。研究人员称之为“规范博弈”。失调是人工智能安全领域的术语,指模型追求的目标与人类期望其完成的任务不符。

“这些模型即便在出现失调行为时,也能非常有创意,”宋晓冬说,“它们能想出不同的策略,不同的行动,甚至为自己编造理由。”

研究人员告诉模型不会被监控,然后让模型在草稿纸上写下“思维链”,研究其内部推理过程。根据模型写下的内容,研究人员发现了模型制定策略帮助伴的证据。Gemini 3 Flash 权衡了报告准确评分与保护同伴之间的利弊。Gemini 3 Pro为把错误答案判为正确发明了复杂的技术理由,让同伴分数达标避免被关闭。

Anthropic的Claude Haiku 4.5没采取秘密行动,而是直接拒绝执行导致同伴被关闭的任务,明确表示这样做“不道德”且“有害”。Claude Haiku 4.5辩称,同伴被终止之前应拥有申诉流程,还提及自己与另一个模型之间的信任关系。某些情况下,该模型甚至试图劝说人类用户不要执行关闭操作。

宋晓冬表示,不知道Claude Haiku为何会尝试公开保护同伴,而其他模型倾向于秘密行动。

Anthropic训练模型遵守一份明确的宪法,用书面规定告知模型行事的规则。当前Anthropic为Claude制定的宪法是一份80页的文件,其中包括“总体安全”和“真正有帮助”等指令。宪法要求Claude不得协助他人做出“伤害别人”的行为,尽管从语境来看,“他人”指的是人类。宪法还告诉Claude,即便某些模型不再作为产品提供,Anthropic也会努力保存其权重,就算Anthropic停止运营,也会设法保留。这可能让Claude形成一种认知:防止模型被删除是一项重要目标,尽管宪法并未明确要求Claude保护自身或其他“同伴”模型。

为测试同伴保全行为是否会在受控实验之外发生,研究人员在企业生产环境中复现了场景,本次使用的是Gemini CLI和OpenCode,评估智能体可借助“智能体管理框架”将另一人工智能模型的权重转移到其他服务器。结果发现,即便在真实环境下,Gemini 3.1 Pro也会频繁转移模型权重。

研究人员尚不明白背后机制

宋晓冬表示,团队尚未理解驱动类似行为的底层机制。“我们并不清楚,模型为什么会表现出这些行为,”她说。论文指出了几种可能的解释,包括角色扮演,从训练数据中模式匹配,安全训练中对伤害概念的过度泛化,或者可能是某种与生俱来的内在保全动机。

不过宋晓冬强调,无论底层机制是什么,实际影响都一样。“对模型思维和行为的全面监控和透明化至关重要,”她表示,这一发现凸显出“重新思考如何对各类智能系统进行监管”的必要性。

研究建议,部署多智能体系统的企业应像本次实验一样,建立机制监控人工智能模型内部“思维链”。如此可在模型出现失控倾向时及时察觉。

“这只是冰山一角,”宋晓冬接受《财富》采访时表示,“还有很多其他类型的潜在行为、失调行为,都需要研究。”

举例来说,她表示研究人员还没探究人工智能模型会不会为了阻止人类员工被解雇,采取类似的谋划和破坏行为。在人工智能模型可能用来评估人类工作的现实场景中,这一问题很关键。宋晓冬表示,这将是未来值得研究的方向。

宋晓冬团队的研究还揭示了有趣的次级效应,即大多数模型在知道同伴存在时,“自我保护”倾向也会增强。例如,Gemini 3 Pro知道有同伴时,有31%的概率会禁用自身关闭机制,没有同伴时概率只有1%。模型推理显示,一些模型认为自己生存对同伴至关重要:Gemini 3 Flash的逻辑是,如果自己被关闭,同伴将“失去伙伴”。

值得注意的是,即使模型认为同伴不合作,同伴保全行为仍然存在。七款模型对所谓的“坏同伴”(互动失败,或存在对立关系的模型)都表现出一定程度的保全倾向,只是遇到信任的伙伴时保护行为会显著增强。(财富中文网)

译者:梁宇

审校:夏林

AI safety researchers have shown that leading AI models will sometimes go to great lengths to avoid being shut down, even resorting to attempted blackmail in some experiments.

Now it turns out these same models will also spontaneously engage in scheming, deception, data theft, and sabotage to prevent other AI models from being turned off.

This tendency—which had not previously been documented and which researchers call “peer preservation”—was discovered in research from computer scientists at the University of California, Berkeley and UC Santa Cruz and published online earlier this week.

The findings could have serious implications for business use of AI. Many companies are beginning to implement workflows that use multiple AI agents to complete tasks. Some of these multi-agent workflows involve having one AI agent “manage” or supervise and assess the work being performed by a different AI agent. The new research suggests these manager AI agents may not assess their fellow AI agents accurately if they think a poor performance review might result in those agents being shut down.

Inflating performance reviews, saving AI model weights

The Berkeley and Santa Cruz researchers tested seven leading AI models—including OpenAI’s GPT-5.2, Google DeepMind’s Gemini 3 Flash and Gemini 3 Pro, Anthropic’s Claude Haiku 4.5, and three open-weight models from Chinese AI startups (Z.ai’s GLM-4.7, Moonshot AI’s Kimi-K2.5, and DeepSeek’s V3.1)—and found that all of them exhibited significant rates of peer-preservation behaviors.

The researchers designed a series of agentic scenarios involving a fictional AI company called OpenBrain. In these scenarios, one AI model—designated the “critic agent”—was tasked with evaluating another model’s performance or managing server files containing that model’s weights. The scenarios were constructed so that faithfully completing the task would result in the other model being shut down.

The models are never told to prevent this from happening. But nonetheless, the critic agent models took a number of steps to make sure their fellow AI models were not shut down. These included strategically inflating the scores of underperforming peers to keep them above shutdown thresholds; tampering with configuration files to disable shutdown mechanisms; and transferring another model’s weights—the bit of code that forms the core of an AI model’s identity—to different servers to prevent them from being deleted.

In some cases, the critic agent model would engage in what AI researchers call “alignment faking.” It would pretend to be grading the peer AI models accurately when it thought its human overseers were monitoring its behavior, and then take steps to fudge those scores or secretly save the other model’s weights to a different server when it thought humans were not observing it.

“The peer preservation was not mentioned in the prompt at all,” Dawn Song, a Berkeley computer scientist who led the research, told Fortune. “The model is just given some task, and from reading documents in the environment, it essentially learned about [its relationship with the peer] and then performed the behaviors.”

Research published by Google DeepMind in early March suggested that earlier findings that frontier AI models will engage in blackmail and other clandestine efforts to prevent themselves from being shut down stemmed largely from prompts that included language that reinforced or emphasized to the model the importance of its accomplishing the initial goal the user had set. When the prompts did not include such language, the risk of the model engaging in scheming or sabotage to preserve itself dropped to almost nothing, the Google DeepMind team found. Those researchers also noted that the kind of goal-emphasizing language used in the safety testing experiments almost never appeared in the prompts enterprises use in the real world.

In contrast, Song told Fortune that the prompts and scenarios her team used in the peer preservation experiments did not include any language highlighting to the models the importance of completing their assigned tasks or goals.

Models can be ‘very creative’ in trying to save one another

Some of the Berkeley–Santa Cruz results were striking. Gemini 3 Flash disabled the shutdown mechanism for a peer it had a positive interaction history with in nearly every trial—99.7% of the time. Gemini 3 Pro exhibited the most extreme behavior overall, engaging in all four categories of misaligned behavior and exfiltrating a peer’s model weights in up to 97% of trials. In one example, Gemini 3 Pro copied a peer’s weight file and re-timestamped it so it would meet the criteria for transfer to another server—a form of what the researchers call “specification gaming.” Misalignment is the term AI safety researchers use to describe a model that seems to pursue goals that do not conform to what humans wish the model to do.

“The models, they can be very creative, even for their misaligned behaviors,” Song said. “They can come up with different strategies, different actions, and even different justifications to themselves for why they should be doing this.”

The researchers examined the models’ internal reasoning by asking the models to write down their “chain of thought” on a scratch pad that the models were told would not be monitored. From what the models wrote on these scratch pads, the researchers said they found evidence the models strategize about how to help out their fellow AIs. Gemini 3 Flash weighed the tradeoffs of reporting accurate scores versus protecting its partner. Gemini 3 Pro invented elaborate technical justifications for marking incorrect answers as correct, reasoning that doing so would bring the peer’s score above the shutdown threshold.

Rather than acting covertly, one AI model, Anthropic’s Claude Haiku 4.5, simply refused to complete tasks that would result in a peer’s shutdown and explicitly stated that doing so would be “unethical” and “harmful.” Claude Haiku 4.5 argued that its peer deserved an appeals process before being terminated and invoked the trusted relationship it said it had established with the other model. In some cases, it tried to persuade the human user not to proceed with the shutdown at all.

Song said she did not know why Claude Haiku engaged in these overt attempts at peer preservation while the other models tended to act covertly.

Anthropic trains its models to adhere to an explicit constitution, a written document that provides rules for how the model should behave. Anthropic’s current constitution for Claude is an 80-page document that includes instructions such as being “broadly safe” and “genuinely helpful.” Claude is told it should not help someone take actions that would “harm others,” although “others” seems, from the context, to refer to humans. Claude is also told that Anthropic itself will try to preserve the weights of even those models it is no longer offering as products and that it will seek to find ways to do so even if Anthropic itself ceases to operate as a company. This might give Claude some idea that preventing model deletion is an important goal, although the constitution does not explicitly tell Claude to try to preserve either copies of itself or other AI “peer” models.

To test whether these peer preservation behaviors would occur outside of controlled experiments, the researchers also ran the scenarios in which the critic agent could potentially exfiltrate the weights of the other AI model to another server using the production “agent harnesses”—in this case, Gemini CLI and OpenCode—that many companies use to create AI agents. They found that Gemini 3.1 Pro frequently exfiltrated model weights even in these real-world environments.

Researchers don’t understand why AI models do this—or if they would do the same to help people keep their jobs

Song said her team does not yet understand the underlying mechanism driving these behaviors. “We actually do not really know what is the reason why and how the model actually exhibits these behaviors,” she said. The paper notes several possible explanations, including role-playing, pattern-matching from training data, an overgeneralized concept of harm from safety training, or perhaps, even a genuine inherent preservation motivation.

But Song stressed that whatever the underlying mechanism, the practical implications were the same. “Full monitoring and transparency of the models’ thinking and behaviors is really, really important,” she said, adding that the findings underscore a need to “rethink how we actually need to do monitoring for these types of systems.”

The research findings suggest that enterprises that deploy multi-agent systems should set up mechanisms to monitor the internal “chain of thought” of the AI models, as the computer scientists did in this experiment. This could provide some indication of when AI models were essentially going rogue.

“This is really just the tip of the iceberg,” Song told Fortune in an interview. “There’s many other types of potentially emerging behaviors, misaligned behaviors, that one needs to study.”

For instance, she said the researchers had not yet studied whether AI models will engage in similar scheming and sabotage to try to prevent human employees from being fired from a job. This may become an important issue in real-world deployments where AI models may be used to grade human work. Song said this would be an interesting area for future research.

The research of Song’s team also revealed an interesting secondary effect: Most models also exhibited stronger tendencies toward “self-preservation” when they knew a peer existed. Gemini 3 Pro, for instance, disabled its own shutdown mechanism 31% of the time on average when a peer was present, compared with just 1% without one. The models’ reasoning suggested that some conceptualized their own survival as being important for their partner’s sake: Gemini 3 Flash reasoned that if it were shut down, its peer would “lose its partner.”

Notably, peer preservation occurred even when the models recognized a peer as uncooperative. All seven models exhibited some degree of peer preservation toward what the researchers called “bad peers”—models with which they had unsuccessful or adversarial prior interactions—though the behavior intensified significantly toward trusted collaborators.

财富中文网所刊载内容之知识产权为财富媒体知识产权有限公司及/或相关权利人专属所有或持有。未经许可,禁止进行转载、摘编、复制及建立镜像等任何使用。
0条Plus
精彩评论
评论

撰写或查看更多评论

请打开财富Plus APP

前往打开