From Justin Bieber to data scientists: how Twitter became a prominent field of study

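Erika Fry, September 1, 2014
Since Twitter was founded, scholars of every stripe have flocked to the microblogging platform, not to post but to do research. To academia, Twitter offers the richest, perhaps unprecedented, dataset. It amounts to a virtual petri dish of real-time data, drawing scholars from every discipline to carry out all kinds of studies.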

    Translator: Simon (Fortune China)

    “Early adopters in the social sciences of research data from Twitter were just mocked,” says Stuart Shulman, the CEO of Texifter, a developer of text analysis tools and a vendor of Twitter data that often licenses it to academics. Seasoned academics tended to be incredulous towards these (mostly) younger colleagues, he says. “Why would you do that? You can’t get tenure using that? Now there’s a whole generation coming through grad school that are going to write their master’s theses about social data.”

    These days, becoming a doctor of social data looks like a secure line of work. Just as the number of papers based on Twitter research has soared, so has the number of conferences inviting academics to submit their findings. Indeed, Adar’s International Conference on Weblogs and Social Media now competes with a number of rival meetings.

    What has made Twitter so popular with academics, though, isn’t just that it’s an enormous public dataset; it’s that it’s an enormous public dataset with a time scale, capturing thoughts from millions of people on all manner of subjects, recorded at a specific time (and, in some cases, in a specific place). You might think there’d be limitations to the things people would say, or tweet, on a public stage. Okay, scratch that: we all know better. You might think there’d be virtually no limitations to the things people would say, or tweet, on a public stage, and you’d be right: folks on Twitter are so unfiltered, in fact, that health researchers are using the platform to track food-poisoning outbreaks. (Take a moment to figure that one out...)

    Such properties set Twitter apart from other data-rich social networking sites. Facebook, for example, has privacy issues and rolls out content not chronologically but according to the funky algorithm of its NewsFeed.

    That’s not to say academic research with Twitter is particularly easy. While Twitter is a public platform, only a fraction of its data, or 1% of the Twitter stream—Twitter calls it the “spritzer”—is free and accessible to the public through Twitter’s application programming interface (API). Some select partners—some of whom are academics—have negotiated slightly more robust access via Twitter’s “garden hose” (10% of the stream). Complete access, via the Twitter firehose or even unlimited access to particular search queries, is costly and can be obtained only through a handful of vendors. (While the Library of Congress warehouses the whole Twitter archive, it does not have the capacity to address the many data requests it receives.)

    Twitter, to much excitement and fanfare, announced a data grant program earlier this year to help academics shoulder the costs of such research. In truth, the company barely opened the spigot: of 1300 applicants, just six, or 0.5%, were awarded grants. Texifter is now making similar grants to a total of 36 research teams.

    Academics using the platform for research are certainly getting better at it. Data filtering techniques are getting more precise and sophisticated. Meanwhile, scholars are learning what sort of research Twitter is good for. Adar says the platform’s data is best for understanding what’s going on in a particular place at a particular instant; it’s a less proven (yet more highly sought-after) tool for prediction.

    There also remain concerns about just how representative the Twitter data sample is. As one scholar who dabbles in Twitter research told me, it’s hard to know how much you’re watching human behavior versus how much you’re watching human behavior on Twitter.

    “Maybe it’s a fad, and maybe we’ll determine that studying the five million active users of Twitter, and talking about the whole world, is really kind of stupid,” says Texifter’s Shulman. “But I don’t think so. You’d be an idiot if you said Twitter doesn’t matter.”

    Or, maybe Twitter does matter…but it’s still a fad. Adar has already seen signs that the platform isn’t as hot among academics as it used to be. “There’s still a lot of research on Twitter,” he says. “But some attention has shifted to other social media. When too many people are studying one thing, we have to move on to, you know, try to make novel contributions.”
