
Is a team of AI agents always better than using just one agent? New Google research suggests the answer depends on exactly what you want the agents to do.
Hello. 2025 was supposed to be the year of AI agents. But as the year draws to a close, it is clear such prognostications from tech vendors were overly optimistic. Yes, some companies have started to use AI agents. But most are not yet doing so, especially not in company-wide deployments.
A McKinsey “State of AI” survey from last month found that a majority of businesses had yet to begin using AI agents, while 40% said they were experimenting. Fewer than a quarter said they had deployed AI agents at scale in at least one use case; and when the consulting firm asked whether respondents were using AI in specific functions, such as marketing and sales or human resources, the results were even worse. No more than 10% of survey respondents said they had AI agents “fully scaled” or were “in the process of scaling” in any of these areas. The function with the most usage of scaled agents was IT (where agents are often used to automatically resolve service tickets or install software for employees), and even here only 2% reported having agents “fully scaled,” with an additional 8% saying they were “scaling.”
A big part of the problem is that designing workflows for AI agents that will enable them to produce reliable results turns out to be difficult. Even the most capable of today’s AI models sit on a strange boundary—capable of doing certain tasks in a workflow as well as humans, but unable to do others. Complex tasks that involve gathering data from multiple sources and using software tools over many steps represent a particular challenge. The longer the workflow, the more risk that an error in one of the early steps in a process will compound, resulting in a failed outcome. Plus, the most capable AI models can be expensive to use at scale, especially if the workflow involves the agent having to do a lot of planning and reasoning.
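The compounding-error point is easy to see with a little arithmetic. A toy sketch (the per-step success rate here is an assumption for illustration, not a figure from the article): if each step in a workflow succeeds independently with probability p, an n-step workflow only succeeds with probability p**n, so reliability decays quickly as workflows get longer.

```python
# Toy model of error compounding in an agent workflow.
# Assumption (illustrative only): each step succeeds independently
# with the same probability p_step.

def workflow_success(p_step: float, n_steps: int) -> float:
    """Probability that every step in an n-step workflow succeeds."""
    return p_step ** n_steps

# Even a 95%-reliable step collapses over a long workflow.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 95% per step -> {workflow_success(0.95, n):.0%}")
```

At 95% per-step reliability, a 10-step workflow succeeds only about 60% of the time, and a 20-step workflow barely a third of the time.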
Many firms have sought to solve these problems by designing “multi-agent workflows,” where different agents are spun up, with each assigned just one discrete step in the workflow, including sometimes using one agent to check the work of another agent. This can improve performance, but it too can wind up being expensive—sometimes too expensive to make the workflow worth automating.
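A minimal sketch of such a workflow, with each agent reduced to a stub function. The agent names and internals here are hypothetical, purely to show the shape: one agent per discrete step, plus a reviewer agent that checks another agent's output before it is accepted.

```python
# Hypothetical multi-agent workflow sketch: one agent per discrete step,
# with a reviewer agent checking the drafting agent's work. Real systems
# would call an LLM in each function; these are stubs.

def research_agent(question: str) -> str:
    """Step 1: gather raw facts (stubbed)."""
    return f"facts about {question}"

def drafting_agent(facts: str) -> str:
    """Step 2: turn the gathered facts into a draft answer (stubbed)."""
    return f"draft based on: {facts}"

def review_agent(draft: str) -> bool:
    """Step 3: a second agent checks the first agent's work (stubbed)."""
    return "draft" in draft  # stand-in for a real quality check

def run_workflow(question: str) -> str:
    facts = research_agent(question)
    draft = drafting_agent(facts)
    if not review_agent(draft):
        raise ValueError("review agent rejected the draft")
    return draft

print(run_workflow("Q3 revenue"))
```

Every extra agent in a chain like this costs additional model calls and tokens, which is exactly how the expense piles up.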
Are two AI agents always better than one?
Now a team at Google has conducted research that aims to give businesses a good rubric for deciding when it is better to use a single agent, as opposed to building a multi-agent workflow, and what type of multi-agent workflows might be best for a particular task.
The researchers conducted 180 controlled experiments using AI models from Google, OpenAI, and Anthropic. They tested them against four different agentic AI benchmarks that covered a diverse set of goals: retrieving information from multiple websites; planning in a Minecraft game environment; planning and tool use to accomplish common business tasks such as answering emails, scheduling meetings, and using project management software; and a finance agent benchmark. That finance test requires agents to retrieve information from SEC filings and perform basic analytics, such as comparing actual results to management’s forecasts from the prior quarter, figuring out how revenue derived from a specific product segment has changed over time, or figuring out how much cash a company might have free for M&A activity.
In the past year, the conventional wisdom has been that multi-agent workflows produce more reliable results. (I’ve previously written about this view, which has been backed up by the experience of some companies, such as Prosus, here in Eye on AI.) But the Google researchers found instead that whether the conventional wisdom held was highly contingent on exactly what the task was.
Single agents do better at sequential steps, worse at parallel ones
If the task was sequential, as was the case for many of the Minecraft benchmark tasks, it turned out that so long as a single AI agent could perform the task accurately at least 45% of the time (a pretty low bar, in my opinion), it was better to deploy just one agent. Using multiple agents, in any configuration, reduced overall performance by huge amounts, ranging between 39% and 70%. The reason, according to the researchers, is that if a company has a limited token budget for completing the entire task, the overhead of multiple agents each trying to figure out how to use different tools quickly overwhelms that budget.
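The budget dynamic can be sketched with toy numbers (all figures here are illustrative assumptions, not from the paper): each additional agent pays a fixed overhead for tool discovery, planning, and hand-offs, so under a fixed token budget, fewer tokens remain for the actual task as the agent count grows.

```python
# Toy token-budget model. All numbers are illustrative assumptions,
# not figures from the Google study.

BUDGET = 100_000             # total tokens allowed for the whole task
TASK_WORK = 60_000           # tokens of "real work" the task needs
OVERHEAD_PER_AGENT = 15_000  # tool discovery, planning, hand-offs

def tokens_left(num_agents: int) -> int:
    """Tokens remaining for real work after per-agent overhead."""
    return BUDGET - num_agents * OVERHEAD_PER_AGENT

for n in (1, 2, 3):
    remaining = tokens_left(n)
    status = "fits" if remaining >= TASK_WORK else "budget exhausted"
    print(f"{n} agent(s): {remaining:,} tokens left for work -> {status}")
```

With these (made-up) numbers, a single agent leaves ample room for the task, but by the third agent the overhead alone has eaten past the point where the work still fits.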
But if a task involved steps that could be performed in parallel, as was true for many of the financial-analysis tasks, then multi-agent systems conferred big advantages. What’s more, the researchers found that exactly how the agents are configured to work with one another makes a big difference, too. For the financial-analysis tasks, a centralized multi-agent system, in which a single coordinator agent directs and oversees the activity of multiple sub-agents and all communication flows to and from the coordinator, produced the best result. This system performed 80% better than a single agent. Meanwhile, an independent multi-agent system, in which there is no coordinator and each agent is simply assigned a narrow role that it completes in parallel, was only 57% better than a single agent.
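The two topologies can be contrasted in a short sketch. The sub-agent work is stubbed out (the task names and synthesis step are hypothetical); the point is the communication pattern: the centralized version routes everything through a coordinator that merges the replies into one answer, while the independent version just returns each agent's raw output.

```python
# Sketch of the two multi-agent topologies described above.
# Sub-agent logic is stubbed; a real system would call an LLM per agent.
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    """A narrow-role worker agent (stubbed)."""
    return f"result[{task}]"

def centralized(tasks: list[str]) -> str:
    # Coordinator fans tasks out, collects every reply, and synthesizes
    # one answer; all communication flows through the coordinator.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(sub_agent, tasks))
    return " | ".join(results)  # stand-in for coordinator synthesis

def independent(tasks: list[str]) -> list[str]:
    # No coordinator: each agent completes its narrow role in parallel
    # and the raw outputs are returned unmerged.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sub_agent, tasks))

tasks = ["revenue trend", "cash position", "guidance vs actuals"]
print(centralized(tasks))
print(independent(tasks))
```

`Executor.map` preserves input order, so the coordinator can merge results deterministically; in the independent design, that merging burden falls on whoever consumes the outputs.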
Research like this should help companies figure out the best ways to configure AI agents and enable the technology to finally begin to deliver on last year’s promises. For those selling AI agent technology, late is better than never. For the people working in the businesses using AI agents, we’ll have to see what impact these agents have on the labor market. That’s a story we’ll be watching closely as we head into 2026.