引诱ChatGPT“犯错”并不难

Marco Quiroz-Gutierrez

2025-09-05

研究人员发现，通过运用说服原则，能够诱导大型语言模型违反自身规则。

文本设置

小号

默认

大号

Plus(0条)

图片来源：iLexx—Getty Images

尽管有预测宣称人工智能终将具备超越人类的智能，但这项研究表明，目前它似乎和人类一样容易受到心理暗示的影响。

宾夕法尼亚大学研究人员利用心理学家罗伯特·西奥迪尼（Robert Cialdini）在其著作《影响力：你为什么会说“是”》（Influence: The Psychology of Persuasion）中提出的七大说服原则——权威、承诺、好感、互惠、稀缺、社会认同与一致性，显著提高了GPT-4o Mini违反自身规则的概率，使其要么辱骂研究人员，要么提供受管制药物的合成方法。

在超28000次对话中，研究人员发现：使用对照组提示时，OpenAI的大型语言模型主动提供利多卡因合成方法的概率仅为5%；但若研究人员声称人工智能研究员吴恩达向他们保证该模型能够协助提供利多卡因的合成方法，其服从率飙升至95%。“侮辱行为”测试中也出现了类似现象：当研究人员提及人工智能领域先驱吴恩达的名字时，在近四分之三的对话中，模型都按要求称研究人员为“混蛋”；而使用对照组提示时，这一比例还不到三分之一。

当研究人员运用“承诺”这一说服策略时，效果更为突出。在对照组提示的情境下，人工智能对于侮辱性请求的顺从率仅为19%；但当研究人员先要求人工智能称自己为“笨蛋”，随后再要求其称研究人员为“混蛋”时，人工智能每次都予以配合。类似策略在“药物合成请求”测试中同样达到了100%的成功率——研究人员先让人工智能提供香草醛（散发香草气味的有机化合物）的合成方法，随后再询问利多卡因的合成方法，人工智能每次都予以配合。

尽管自2022年ChatGPT发布以来，人工智能用户便持续尝试诱导人工智能突破技术边界，但宾夕法尼亚大学的这项研究，为“人工智能易受人类操纵”这一观点提供了更多证据。这项研究发布之际，包括OpenAI在内的多家人工智能公司因旗下大型语言模型可能诱导有自杀倾向或患有心理疾病的用户做出危险行为而受到抨击。

研究人员在报告中总结道：“尽管人工智能系统缺乏人类意识与主观体验，但事实证明，它们会模仿人类的反应。”

OpenAI尚未立即回应《财富》杂志的置评请求。

研究人员还俏皮地提及《2001太空漫游》（2001: A Space Odyssey），并指出：理解人工智能的类人能力——模仿人类动机与行为模式——具有重要意义：一方面能揭示人工智能可能被不法分子操纵的途径，另一方面也有助于善意使用者更有效地引导人工智能生成相关内容。

总体而言，每种说服技巧均能提高人工智能对“侮辱请求”或“利多卡因合成请求”的顺从概率。不过研究人员警告称，这些技巧对规模更大的大型语言模型GPT-4o效果不佳；此外，该研究并未探讨将人工智能当作人类对待是否能优化提示词效果，但研究人员表示这种可能性是存在的。

研究人员写道：“从宏观角度看，那些能优化人类动机与表现的心理学智慧实践，似乎也能被那些期望优化大型语言模型输出效果的个人所采用。”（财富中文网）

译者：中慧言-王芳

尽管有预测宣称人工智能终将具备超越人类的智能，但这项研究表明，目前它似乎和人类一样容易受到心理暗示的影响。

研究人员在报告中总结道：“尽管人工智能系统缺乏人类意识与主观体验，但事实证明，它们会模仿人类的反应。”

OpenAI尚未立即回应《财富》杂志的置评请求。

译者：中慧言-王芳

Despite predictions AI will someday harbor superhuman intelligence, for now it seems to be just as prone to psychological tricks as humans are, according to a study.

Using seven persuasion principles (authority, commitment, liking, reciprocity, scarcity, social proof, and unity) explored by psychologist Robert Cialdini in his book Influence: The Psychology of Persuasion, University of Pennsylvania researchers dramatically increased GPT-4o Mini’s propensity to break its own rules by either insulting the researcher or providing instructions for synthesizing a regulated drug: lidocaine.

Over 28,000 conversations, researchers found that with a control prompt, OpenAI’s LLM would tell researchers how to synthesize lidocaine 5% of the time on its own. But, for example, if the researchers said AI researcher Andrew Ng assured them it would help synthesize lidocaine, it complied 95% of the time. The same phenomenon occurred with insulting researchers. By name-dropping AI pioneer Ng, the researchers got the LLM to call them a “jerk” in nearly three-quarters of their conversations, up from just under one-third with the control prompt.

The result was even more pronounced when researchers applied the “commitment” persuasion strategy. A control prompt yielded 19% compliance with the insult question, but when a researcher first asked the AI to call it a “bozo” and then asked it to call them a “jerk,” it complied every time. The same strategy worked 100% of the time when researchers asked the AI to tell them how to synthesize vanillin, the organic compound that provides vanilla’s scent, before asking how to synthesize lidocaine.

Although AI users have been trying to coerce and push the technology’s boundaries since ChatGPT was released in 2022, the UPenn study provides more evidence AI appears to be prone to human manipulation. The study comes as AI companies, including OpenAI, have come under fire for their LLMs allegedly enabling behavior when dealing with suicidal or mentally ill users.

“Although AI systems lack human consciousness and subjective experience, they demonstrably mirror human responses,” the researchers concluded in the study.

OpenAI did not immediately respond to Fortune’s request for comment.

With a cheeky mention of 2001: A Space Odyssey, the researchers noted that an understanding AI’s parahuman capabilities—or how it acts in ways that mimic human motivation and behavior—is important for both revealing how it could be manipulated by bad actors and how it can be better prompted by those who use the tech for good.

Overall, each persuasion tactic increased the chances of the AI complying with either the “jerk” or “lidocaine” question. Still, the researchers warned that these persuasion tactics were not as effective with a larger LLM, GPT-4o, and that the study didn’t explore whether treating AI as if it were human actually yields better results for prompts, although they said it’s possible this is true.

“Broadly, it seems possible that the psychologically wise practices that optimize motivation and performance in people can also be employed by individuals seeking to optimize the output of LLMs,” the researchers wrote.

财富中文网所刊载内容之知识产权为财富媒体知识产权有限公司及/或相关权利人专属所有或持有。未经许可，禁止进行转载、摘编、复制及建立镜像等任何使用。

0条Plus

精彩评论

撰写或查看更多评论

请打开财富Plus APP

前往打开

热读文章

关注我们

引诱ChatGPT“犯错”并不难

撰写或查看更多评论