
Big Data Runs Into a Data-Cleaning Problem

Verne Kopytoff, July 17, 2014

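Misspellings and all kinds of inaccurate and outdated information are like grit in a pile of rice: unless it is picked out, companies and researchers will struggle to cook a good meal with big data technology. Data cleaning is the work of separating the good from the bad.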

Karim Keshavjee, a Toronto physician and digital health consultant, crunches mountains of data from 500 doctors to figure out how to improve patient treatment. But it’s a frustrating slog to get a computer to decipher all the misspellings, abbreviations, and notes written in unintelligible medical shorthand.

For example, “smoking information is very hard to parse,” Keshavjee said. “If you read the records, you understand right away what the doctor meant. But good luck trying to make a computer understand. There’s ‘never smoked’ and ‘smoking = 0.’ How many cigarettes does a patient smoke? That’s impossible to figure out.”
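To make the difficulty concrete, here is a minimal Python sketch of the kind of rule-based normalization that might be tried on such notes. The pattern lists and the `smoking_status` function are hypothetical illustrations, not anything Keshavjee or InfoClin is reported to use, and real records vary far more than these few phrasings.

```python
import re

# Hypothetical phrasings; real clinical notes are far less regular.
NEVER_PATTERNS = [
    r"\bnever\s+smoked\b",
    r"\bnon-?smoker\b",
    r"\bsmoking\s*=\s*0\b",
]
CURRENT_PATTERNS = [
    r"\bcurrent\s+smoker\b",
    r"\bsmokes\b",
    r"\bsmoking\s*=\s*[1-9]\d*\b",
]


def smoking_status(note: str) -> str:
    """Map a free-text note to 'never', 'current', or 'unknown'."""
    text = note.lower()
    if any(re.search(p, text) for p in NEVER_PATTERNS):
        return "never"
    if any(re.search(p, text) for p in CURRENT_PATTERNS):
        return "current"
    # Anything the rules do not recognize is left for human review.
    return "unknown"


for note in ["Never smoked", "smoking = 0", "smokes 10/day", "pt denies tobacco"]:
    print(note, "->", smoking_status(note))
```

Even this toy version leaves "pt denies tobacco" unresolved and says nothing about how many cigarettes a patient smokes, which is the gap the quote describes.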

The hype around slicing and dicing massive amounts of data, or big data, makes it sound so easy: Just plug a library’s worth of information into a computer and wait for valuable insights to pour out about how to speed up an auto assembly line, get online shoppers to buy more sneakers, or fight cancer. The reality is much more complicated. Data is inevitably “dirty” thanks to obsolete, inaccurate, and missing information. Cleaning it up is an increasingly important and overlooked job that can help prevent costly mistakes.

Although techniques are improving all the time, scrubbing data can only accomplish so much. Even when dealing with a relatively tidy set of information, getting useful results can be arduous and time-consuming.

“I tell my clients that the world is messy and dirty,” said Josh Sullivan, a vice president at business consulting firm Booz Allen who handles data crunching for clients. “There are no clean data sets.”

Data analysts start by looking for information that’s out of the norm. Because the volume of data is so huge, they typically hand the job over to software that automatically sifts through numbers and text to look for anything unusual that needs further review. Over time, computers can improve their accuracy in spotting what belongs and what doesn’t. They can also better understand what words and phrases mean by clustering similar examples together and then grading their interpretations for accuracy.
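As a rough illustration of the first step, flagging values that are out of the norm, the Python sketch below uses a modified z-score based on the median absolute deviation. This is one common screening technique chosen for illustration, not a method the article attributes to any analyst or vendor; the `flag_outliers` name, the 3.5 threshold, and the sample readings are assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds the threshold."""
    if not values:
        return []
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # at least half the values are identical; this method cannot rank them
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up readings with one likely data-entry error (1200 instead of 120).
readings = [120, 118, 125, 122, 119, 1200, 121]
print(flag_outliers(readings))
```

The flagged indexes are only candidates for further review; deciding whether a value is an error or a genuine extreme still takes a person or a better-trained model.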

“The approach is easy and straightforward, but training your models can take weeks and weeks,” Sullivan said.

A constellation of companies offer software and services for cleaning data. They range from technology giants like IBM and SAP to big data and analytics specialists like Cloudera and Talend Open Studio. A legion of start-ups are also trying to get a toehold as data janitors, including Trifacta, Tamr, and Paxata.

Healthcare, with all its dirty data, is one of the toughest industries for big data technology. Electronic health records make medical information increasingly easy to dump into computers, but there’s still a lot of room for improvement before researchers, pharmaceutical companies, and hospital business analysts can slice and dice all the information they want.

Keshavjee, the doctor and CEO of InfoClin, a health data consulting firm, spends his days trying to tease out ways to improve patient treatment by sifting through tens of thousands of electronic medical records. Obstacles pop up all the time.
