Big Data Has Big Problems

Joshua Klein 2013-11-07
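Supercomputing rests on models of every kind. Many of those models are inherently flawed, and when they fail in the era of Big Data they can cause serious trouble that no one saw coming.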

    (Fortune China, 财富中文网; translated into Chinese by Qingyuan)

    Here's another example. One t-shirt seller on Amazon.co.uk put up a shirt for sale emblazoned with the statement, "Keep Calm and Rape a Lot." One might wonder who thought such a shirt would be a good idea. But Solid Gold Bomb, the company that made the shirt, wasn't necessarily aware that it was even selling it. The company apologized publicly and copiously, but in its defense the only mistake it made was a small coding error. That's because the shirt wasn't designed by anyone. Nor were the shirts even necessarily ever printed. Solid Gold Bomb's business isn't in artfully designing t-shirts. Instead, it writes code that takes libraries of words and slots them into popular phrases (such as "Keep Calm and Carry On," which enjoyed a brief mimetic popularity online). The resulting derivations get dropped onto a t-shirt template and are automatically posted as Amazon items for sale. Its mistake was overlooking a single word in a list of 4,000 or so others (the company was lucky no other offensive words or phrases made it onto the site). The problem was context.

    Again, a simple model, with serious social consequences. The program that made the Solid Gold Bomb t-shirt isn't aware of how its intended audience perceives the concept of rape, let alone how the business process that rendered the t-shirt works. And yet that context turned a one-word oversight into a massively damaging event.
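    To make the mechanism concrete, here is a minimal sketch of a phrase generator of the kind described above. It is not Solid Gold Bomb's actual code, which the source does not show. The template string, the function name `generate_slogans`, the short word list, and the blocklist entry are all hypothetical stand-ins chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical template, modeled on the "Keep Calm and Carry On" pattern.
TEMPLATE = "Keep Calm and {verb} {ending}"


def generate_slogans(
    verbs: Iterable[str], blocklist: Set[str], ending: str = "On"
) -> Tuple[List[str], List[str]]:
    """Slot each verb into the template; hold back any verb on the blocklist.

    Returns (accepted, held). Held slogans would need human review
    before anything is listed for sale.
    """
    accepted: List[str] = []
    held: List[str] = []
    for verb in verbs:
        slogan = TEMPLATE.format(verb=verb.capitalize(), ending=ending)
        if verb.lower() in blocklist:
            held.append(slogan)
        else:
            accepted.append(slogan)
    return accepted, held


# Illustrative data only: a real list might hold thousands of words.
verbs = ["carry", "dance", "laugh", "punch"]
accepted, held = generate_slogans(verbs, blocklist={"punch"})
```

    The filter catches only the words someone thought to put on the blocklist. With a list of 4,000 or so entries, a single overlooked word passes straight through to the storefront, which is exactly the kind of context the program has no way of knowing.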

    In both these instances, an inability to anticipate how the program would interact with other programs, or to account for the broader context in which it would operate, caused significant harm. Those are just two ways in which a model on which code is based can be flawed.

    Big Data still has big issues. For example, the information we're gathering is often not being properly normalized (put into a format where all data is apples-to-apples), the models we're making aren't often peer tested or reviewed (witness the problems with the ranking tool Klout as a standard for social media influence), and, most crucially, the information itself is usually siloed inside of large corporations instead of being democratically available and verifiable.
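    As a small illustration of what "apples-to-apples" normalization means, the sketch below converts distance readings recorded in different units into kilometers before they are compared. The function name `to_kilometers`, the unit labels, and the sample readings are assumptions made up for this example; they do not come from any dataset mentioned here.

```python
from typing import List, Tuple

# Conversion factors to kilometers (1 mile is defined as 1.609344 km).
KM_PER_UNIT = {"km": 1.0, "m": 0.001, "mi": 1.609344}


def to_kilometers(value: float, unit: str) -> float:
    """Convert a distance to kilometers, rejecting units we don't know."""
    if unit not in KM_PER_UNIT:
        # Failing loudly is safer than silently mixing incompatible data.
        raise ValueError(f"unknown unit: {unit!r}")
    return value * KM_PER_UNIT[unit]


# Hypothetical readings gathered from sources that use different units.
readings: List[Tuple[float, str]] = [(5.0, "km"), (2500.0, "m"), (1.0, "mi")]
normalized = [to_kilometers(value, unit) for value, unit in readings]
```

    Without this step, a naive comparison would rank 2500.0 above 5.0 even though the first reading is the shorter distance.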

    Which isn't to say our technology is doomed. Most of the applications we use every day work tremendously well, and in some cases really do produce amazing capabilities that improve our lives in countless ways. But it behooves us to examine the models that underpin them. Because someday, somehow, they will fail.

    Joshua Klein is a hacker, consultant, television host and author of Reputation Economics: Why Who You Know is Worth More than What You Have (Palgrave Macmillan), from which this essay is adapted.
