
Big Data Runs Into a Data-Cleaning Problem

Verne Kopytoff, July 17, 2014

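Misspellings and all kinds of inaccurate and outdated information are like grit in a pile of rice: unless it is picked out, companies and researchers will struggle to cook a good meal with big data technology. Data cleaning is the job of removing the dross and keeping what is valuable.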

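Many doctors do not record a patient’s blood pressure in the medical record, a problem that no data-cleaning method can fix. Simply working out what ails a patient from the information in existing records is already extremely difficult for a computer. When doctors enter the code for diabetes, they may forget to indicate clearly whether it is the patient or a family member who has the disease. Or they may simply enter “insulin” without mentioning the diagnosis, because to them it is perfectly obvious.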

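Doctors also rely heavily on their own idiosyncratic shorthand when recording diagnoses, medications and basic patient details. Deciphering it is a headache even for humans and all but impossible for a computer. Keshavjee, for example, mentioned a doctor who wrote the three letters “gpa” in a record, which left him baffled. Only when he found “gma” a little further on did it dawn on him: they were abbreviations for “grandpa” and “grandma.”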

“It took me a long time to work out what they meant,” Keshavjee said.

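Keshavjee believes one of the ultimate fixes for dirty data is to establish “data discipline” for medical records. Doctors need to be trained to enter information correctly so that cleaning the data afterward is not such a mess. Keshavjee said Google has a useful tool that shows users how to spell uncommon words as they type, and a tool like it could well be added to electronic medical record software. Computers can pick out spelling errors, but getting doctors to drop bad habits is a step in the right direction.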

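Another of Keshavjee’s suggestions is to build more standardized fields into electronic medical records. The computer would then know where to look for specific information, which would reduce errors. In practice it is not that simple, of course, because many patients have several illnesses at once. A standard form must therefore be flexible enough to take all of these complications into account.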

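Still, doctors sometimes need to write free-form notes in a record that would never fit in a small box. Why a patient fell, for example, and not only the injury that resulted, can matter a great deal. But without context, software’s grasp of free-form text is hit or miss. People searching by keyword may do a better job of sifting the data, but they will still inevitably miss many relevant records.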

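Of course, in some cases data that looks dirty is not really dirty. Sullivan, a vice president at Booz Allen, gave the example of a time his team was analyzing customer demographic data for a luxury hotel chain and suddenly found that the data showed teenagers from a wealthy Middle Eastern country were regular guests.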

“There was a whole group of 17-year-olds staying at this hotel chain all over the world,” Sullivan recalled. “We thought, ‘That can’t be true.’”

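But after some digging, they found the information was in fact correct. The hotel had a large number of teenage guests that even the hotel itself had not noticed, and it had never run any promotions or marketing aimed at them. Every guest under 22 was automatically classed as “low-income” by the company’s computers, and the hotel’s executives had never considered how deep these teenagers’ pockets might be.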

“I think it’s harder to build models if you don’t have outliers,” Sullivan said.

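Even when data is clearly dirty, it can sometimes still be put to good use. Take Google’s spelling-correction technology, mentioned above. It automatically recognizes misspelled words and offers alternative spellings. The tool works so well because Google has collected millions, perhaps billions, of misspelled queries over the years. Dirty data, in other words, can be turned from waste into something valuable.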

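In the end, it is people, not machines, who draw conclusions from big data. A computer can sort through millions of documents, but it cannot truly interpret them. Data cleaning is a process of repeated trial and error that makes it easier for people to draw conclusions from the data. For all that big data has been hailed as a wonder that can lift business profits and benefit humanity, it is also a real headache.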

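“The idea of failure is completely different in data science,” Sullivan said. “If they don’t fail 10 or 12 times a day to get to where they should be, they’re not doing it right.” (Fortune China)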

Translator: 朴成奎

Many doctors neglect to note a patient’s blood pressure in their medical records, something that no amount of data cleaning can fix. Simply determining what ails patients—based on what’s in their files—is surprisingly difficult for computers. Doctors may enter the proper code for diabetes without clearly indicating whether it’s the patient who has the disease or a family member. Or they may just enter “insulin” without mentioning the underlying diagnosis because, to them, it’s obvious.

Physicians also use a lot of idiosyncratic shorthand for medications, illnesses and basic patient details. Deciphering it takes a lot of head scratching for humans and is nearly impossible for a computer. For example, Keshavjee came across one doctor who used the abbreviation “gpa.” Only after coming across a variation, “gma,” did he finally solve the puzzle: they were shorthand for “grandpa” and “grandma.”

“It took a while to figure that one out,” he said.

Ultimately, Keshavjee said one of the only ways to solve the problem of dirty data in medical records is “data discipline.” Doctors need to be trained to enter information correctly so that cleaning up after them is less of a chore. Incorporating something like Google’s helpful tool that suggests how to spell words as users type them would be a great addition for electronic medical records, he said. Computers can learn to pick out spelling errors, but minimizing the need is a step in the right direction.

Another of Keshavjee’s suggestions is to create medical records with more standardized fields. A computer would then know where to look for specific information, reducing the chance of error. Of course, doing so is not as easy as it sounds because many patients suffer from multiple illnesses, he said. A standard form would have to be flexible enough to take such complications into account.
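One way to picture such a flexible form is a record that holds a list of diagnoses, each tagged with who actually has the condition, alongside a free-text note. The Python sketch below is only an illustration under that assumption: the class names, field names and condition labels are hypothetical and do not come from any real medical records system or coding standard.

```python
from dataclasses import dataclass, field
from enum import Enum


class Subject(Enum):
    """Who the diagnosis applies to (hypothetical categories)."""
    PATIENT = "patient"
    FAMILY_MEMBER = "family_member"


@dataclass
class Diagnosis:
    label: str        # placeholder label, not a real coding-system code
    subject: Subject  # removes the "patient or relative?" ambiguity


@dataclass
class MedicalRecord:
    patient_id: str
    # A list, so a patient with several illnesses still fits the form.
    diagnoses: list[Diagnosis] = field(default_factory=list)
    free_text_note: str = ""

    def patient_conditions(self) -> list[str]:
        """Labels of conditions the patient has, excluding family history."""
        return [d.label for d in self.diagnoses if d.subject is Subject.PATIENT]


record = MedicalRecord(
    patient_id="p-001",
    diagnoses=[
        Diagnosis("diabetes", Subject.PATIENT),
        Diagnosis("asthma", Subject.PATIENT),
        Diagnosis("diabetes", Subject.FAMILY_MEMBER),
    ],
    free_text_note="Fell at home after feeling dizzy.",
)
conditions = record.patient_conditions()
```

Here `conditions` contains the patient's two conditions and leaves out the family member's diabetes, while the circumstances of the fall stay in the free-text note.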

Still, doctors would need to be able to jot down more free-form electronic notes that could never fit in a small box. Nuance like why a patient fell, for example, and not just the injury suffered, is critical for research. But software is hit and miss in understanding free-form writing without context. Humans searching by keyword may do a better job, but they still inevitably miss many relevant records.
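A small Python sketch can show why keyword search falls short on free-form notes. Everything in it is hypothetical: the sample notes, the shorthand table and the function names are invented for illustration, and a real shorthand table would have to be compiled by people who know how a given clinic's doctors write.

```python
import re

# Hypothetical shorthand table, for illustration only.
SHORTHAND = {"gpa": "grandpa", "gma": "grandma", "dm": "diabetes"}


def expand_shorthand(note: str) -> str:
    """Replace known shorthand words with their long forms (case-insensitive)."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return SHORTHAND.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, note)


def keyword_search(notes: list[str], keyword: str) -> list[int]:
    """Return the indexes of notes that contain the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [i for i, note in enumerate(notes) if pattern.search(note)]


NOTES = [
    "Patient has diabetes, on insulin.",
    "gma had DM; patient asymptomatic.",
    "Started insulin last month.",
]

raw_hits = keyword_search(NOTES, "diabetes")
expanded_hits = keyword_search([expand_shorthand(n) for n in NOTES], "diabetes")
```

On the raw notes the search finds only the first record. After shorthand expansion it also finds the second, which is about a grandmother rather than the patient, and it still misses the third, where insulin implies a diagnosis that was never written down.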

Of course, in some cases, what appears to be dirty data really isn’t. Sullivan, from Booz Allen, gave the example of the time his team was analyzing demographic information about customers for a luxury hotel chain and came across data showing that teens from a wealthy Middle Eastern country were frequent guests.

“There were a whole group of 17-year-olds staying at the properties worldwide,” Sullivan said. “We thought, ‘That can’t be true.’”

But after some digging, they found that the information was, in fact, correct. The hotel had legions of young customers that it didn’t even realize were there, and had never done anything to market to them. All guests under 22 were automatically logged as “low-income” in the company’s computers. Hotel executives had never considered the possibility of teens with deep pockets.

“I think it’s harder to build models if you don’t have outliers,” Sullivan said.

Even when data is clearly dirty, it can sometimes be put to good use. Take the example, again, of Google’s spelling suggestion technology. It automatically recognizes misspelled words and offers alternative spellings. It’s only possible because Google has collected millions and perhaps billions of misspelled queries over the years. Instead of garbage, the dirty data is an opportunity.
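The general idea of learning corrections from logged mistakes can be sketched in a few lines of Python. This is not a description of how Google's system works. It is a toy illustration that assumes a hypothetical log pairing each query as typed with the query the user switched to afterward, and all names and data in it are made up.

```python
from collections import Counter, defaultdict

# Hypothetical log of (query as typed, query the user switched to) pairs.
QUERY_LOG = [
    ("diabetis", "diabetes"),
    ("diabetis", "diabetes"),
    ("diabetis", "diabetic"),
    ("insullin", "insulin"),
]


def build_suggestions(log):
    """Count, for each typed query, which rewrites users chose."""
    table = defaultdict(Counter)
    for typed, rewritten in log:
        table[typed][rewritten] += 1
    return table


def suggest(table, query):
    """Return the most frequent rewrite seen for this query, or None if unseen."""
    if query not in table:
        return None
    return table[query].most_common(1)[0][0]


TABLE = build_suggestions(QUERY_LOG)
best_guess = suggest(TABLE, "diabetis")
unseen = suggest(TABLE, "hypertenshun")
```

The more misspellings the log contains, the more queries the table can answer, which is the sense in which dirty data becomes an asset. A misspelling that has never been logged, like the last one here, gets no suggestion.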

Ultimately, humans, and not machines, draw conclusions from the data they crunch. Computers can sort through millions of documents, but they can’t interpret the findings. Cleaning data is just one step in a long trial-and-error process to get to that point. Big data, for all the hype about its ability to lift business profits and help humanity, is a big headache.

“The idea of failure is completely different in data science,” Sullivan said. “If they don’t fail 10 or 12 times a day to get to where they should be, they’re not doing it right.”
