From Grassroots Project to Industry Standard: The Remarkable Evolution of a Small Open Source Project

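Katherine Noyes, July 3, 2014
The open source project Hadoop was originally created simply to solve a technical data management problem. Today it has become an industry standard, and the market it drives is expected to be worth $50.2 billion by 2020.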

    There are countless open source projects with crazy names in the software world today, but the vast majority of them never make it onto enterprises’ collective radar. Hadoop is an exception of pachydermic proportions.

    Named after a child’s toy elephant, Hadoop is now powering big data applications at companies such as Yahoo and Facebook; more than half of the Fortune 50 use it, providers say.

    The software’s “refreshingly unique approach to data management is transforming how companies store, process, analyze and share big data,” according to Forrester analyst Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”

    Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020, it is expected to reach $50.2 billion.

    It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?

    ‘A market that was in desperate need’

    “Hadoop was a happy coincidence of a fundamentally differentiated technology, a permissively licensed open source codebase and a market that was in desperate need of a solution for exploding volumes of data,” said RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”

    Created by Doug Cutting and Mike Cafarella, the software—like so many other inventions—was born of necessity. In 2002, the pair were working on an open source search engine called Nutch. “We were making progress and running it on a small cluster, but it was hard to imagine how we’d scale it up to running on thousands of machines the way we suspected Google was,” Cutting said.

    Shortly thereafter Google published a series of academic papers on its own Google File System and MapReduce infrastructure systems, and “it was immediately clear that we needed some similar infrastructure for Nutch,” Cafarella said.

    “The way Google was approaching things was different and powerful,” Cutting explained. Until then, “you had to build a special-purpose system for each distributed thing you wanted to do,” whereas Google’s approach offered a general-purpose automated framework for distributed computing. “It took care of the hard part of distributed computing so you could focus just on your application,” Cutting said.

    Both Cutting and Cafarella (who are now chief architect at Cloudera and University of Michigan assistant professor of computer science and engineering, respectively) knew they wanted to make a version of their own—not just for Nutch, but for the benefit of others as well—and they knew they wanted to make it open source.

    “I don’t enjoy the business aspects,” Cutting said. “I’m a technical guy. I enjoy working on the code, tackling the problems with peers and trying to improve it, not trying to sell it. I’d much rather tell people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can make it better.’ To be able to be brutally honest is really nice—it’s much harder to be that way in a commercial setting.”

    But the pair knew that the potential upside of success could be staggering. “If I was right and it was useful technology that lots of people wanted to use, I’d be able to pay my rent—and without having to risk my shirt on a startup,” Cutting said.

    For Cafarella, “Making Nutch open source was part of a desire to see search engine technology outside the control of a few companies, but also a tactical decision that would maximize the likelihood of getting contributions from engineers at big companies. We specifically chose an open source license that made it easy for a company to contribute.”
