How would Google’s Bard score on the SAT?

ELEANOR PRINGLE 2023-04-04
Unfortunately for Google, it looks like Bard won’t be making it to Harvard just yet.

Google has already paid a price for Bard’s mistakes, but it is learning every day. Image credit: Jonathan Raa/NurPhoto/Getty Images

Google has been pretty open about the fact that Bard isn’t perfect.

Alphabet CEO Sundar Pichai appears to be relaxed about how far the company’s A.I. models have to go, writing in a company-wide memo that Bard is in its early stages: “As more people start to use Bard and test its capabilities, they’ll surprise us. Things will go wrong.”

Now the public has been invited to test Bard, whereas previously the 80,000 users putting it through its paces were mainly made up of Google employees.

Fortune’s spot on the wait list finally came up, so we tested Bard ahead of the SATs American teenagers will be facing this spring.

The SAT is a globally recognized test used for U.S. college admissions, covering skills including reading, writing, and math.

Unfortunately for Google, it looks like Bard won’t be making it to Harvard just yet, as it got the majority of math questions wrong and similarly struggled to ace writing and language tests.

Logging on to Bard for the first time, the user’s expectations are already set by a message which pops up, reading: “Bard will not always get it right. Bard may give inaccurate or inappropriate responses. When in doubt, use the ‘Google it’ button to check Bard’s responses. Bard will get better with your feedback. Please rate responses and flag anything that may be offensive or unsafe.”

How did Bard do?

On to the questions.

Fortune sourced practice SAT math questions from online learning resources and found that Bard got anywhere from 50% to 75% of them wrong—even when multiple-choice answers were provided.

Often Bard gave answers which were not even a multiple-choice option, though it sometimes got them correct when asked the same question again.

The A.I.’s inaccuracy has already wiped somewhere in the region of $100 billion off the market value of Google’s parent company, Alphabet.

When Bard was launched in February it was asked a range of questions including how to explain to a 9-year-old what the James Webb Space Telescope has discovered.

Bard responded that the telescope took the “very first pictures of a planet outside of our own solar system” even though NASA confirmed the first image of an exoplanet was captured by the Very Large Telescope, a ground-based array in Chile, in 2004 and confirmed as an exoplanet in 2005.

Science and math aren’t Bard’s strong points, although the A.I. did fare better when it came to reading and writing exercises.

On its first written language test with Fortune, Bard got around 30% of the answers correct, and it often needed to be asked a question twice before it understood.

Even when it was wrong, Bard’s tone was confident, and it frequently framed responses as “The correct answer is,” a common feature of large language models.

Bizarrely, Bard’s best test across both math and written skills was a passage that focused on Harry Potter writer J.K. Rowling.

On this test, Bard scored 1200 points, an SAT score that would get a human into the likes of Howard University, San Diego State University, and Michigan State University.

The more Bard was asked language-based questions by Fortune—around 45 in total—the less frequently it struggled to understand or needed the question to be repeated.

On reading tests, Bard similarly performed better than it did in math—getting around half the answers correct on average.

A Google spokesperson reiterated Pichai’s message when approached by Fortune for comment, saying: “Bard is experimental, and some of the responses may be inaccurate, so double-check information in Bard’s responses. With your feedback, Bard is getting better every day. Before Bard launched publicly, thousands of testers were involved to provide feedback to help Bard improve its quality, safety, and accuracy.

“Accelerating people’s ideas with generative A.I. is truly exciting, but it’s still early days, and Bard is an experiment. While Bard has built-in safety controls and clear mechanisms for feedback in line with our A.I. Principles, be aware that it may display inaccurate information.”

In the space of a couple of days of questioning Bard, the A.I. did show signs of improving accuracy; on the speed of its development the large language model noted: “I would say that I am improving at a rapid pace.

“I am able to do things that I was not able to do just a few months ago. I am excited to see what the future holds for me. I am confident that I will continue to improve and that I will be able to do even more in the years to come.”
