A Major A.I. Breakthrough: Predicting the Shape of Proteins

Jeremy Kahn 2020-12-02
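DeepMind scientists have developed A.I. software that uses a protein’s DNA sequence to predict its three-dimensional structure to within an atom’s width of accuracy.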

Chinese translation (Fortune China, 财富中文网): 刘进龙 (Liu Jinlong)

Reviewed by: 汪皓 (Wang Hao)

Researchers have made a major breakthrough using artificial intelligence that could revolutionize the hunt for new medicines.

The scientists have created A.I. software that uses a protein’s DNA sequence to predict its three-dimensional structure to within an atom’s width of accuracy.

The achievement, which solves a 50-year-old challenge in molecular biology, was accomplished by a team from DeepMind, the London-based artificial intelligence company that is part of Google parent Alphabet. Until now, DeepMind was best known for creating A.I. that could beat the best human players at the strategy game Go, a major milestone in computer science.

DeepMind achieved the protein shape breakthrough in a biennial competition for algorithms that can be used to predict protein structures. The competition asks participants to take a protein’s DNA sequence and then use it to determine the protein’s three-dimensional shape.

Across more than 100 proteins, DeepMind’s A.I. software, called AlphaFold 2, predicted the structure to within about an atom’s width in two-thirds of cases and was highly accurate in most of the remaining third, according to John Moult, a molecular biologist at the University of Maryland who directs the competition, known as the Critical Assessment of Structure Prediction, or CASP. It was far better than any other method in the competition, he said.

Demis Hassabis, DeepMind’s cofounder and chief executive officer, said the company wants “to make the maximal positive societal impact with these technologies.” But he said DeepMind had not yet determined how it would provide academic researchers with access to the protein structure prediction software or whether it would seek commercial collaborations with pharmaceutical and biotechnology firms. He said the company would announce “further details on how we’re going to be able to give access to the system in a scalable way” sometime next year.

“This computational work represents a stunning advance on the protein-folding problem,” Venki Ramakrishnan, a Nobel Prize–winning structural biologist who is also the outgoing president of the Royal Society, Britain’s most prestigious scientific body, said of AlphaFold 2.

Janet Thornton, an expert in protein structure and former director of the European Molecular Biology Laboratory’s European Bioinformatics Institute, said that DeepMind’s breakthrough opened up the way to mapping the entire “human proteome”—the set of all proteins found within the human body. Currently, only about a quarter of human proteins have been used as targets for medicines, she said. Now, many more proteins could be targeted, creating a huge opportunity to invent new medicines.

Thornton also said that DeepMind’s A.I. system would have profound implications for scientists who create synthetic proteins and that these could have big impacts too: everything from creating new genetically modified crop strains that will be far more nutritious to new enzymes that could help clean up the environment by digesting plastics.

Proteins are the basic mechanisms of biological processes. They are formed from long chains of amino acids, coded for in DNA, but once manufactured by a cell, they fold themselves spontaneously into complex shapes that often resemble a tangle of cord, with ribbons and curlicue-like appendages. The exact structure of a protein is essential to its function. It is also critical for designing small molecules that might be able to bind with the protein and alter this function, which is how new medicines are created.

Until now, the primary way to obtain a high-resolution model of a protein’s structure was through a method called X-ray crystallography. In this technique, a solution of proteins is turned into a crystal, itself a difficult and time-consuming process, and then this crystal is bombarded with X-rays, often from a large circular particle accelerator called a synchrotron. The diffraction pattern of the X-rays allows researchers to build up a picture of the internal structure of the protein. It takes about a year and costs about $120,000 to obtain the structure of a single protein through X-ray crystallography, according to an estimate from the University of Toronto.

More recently, two other experimental methods—nuclear magnetic resonance and cryogenic electron microscopy—have also been used. They can be faster and less expensive but tend to produce models that are less precise than X-ray crystallography.

It takes AlphaFold 2 “a matter of days” to calculate each protein structure using what John Jumper, the researcher who leads the protein-folding team at DeepMind, characterized as “modest” computing resources. Training the system required 128 specialized A.I. computing units on 16 chips created by Google, called tensor processing units, running continuously for “roughly a few weeks,” Jumper said. He noted that this is much less computing power than has been required for many other recent A.I. breakthroughs, including DeepMind’s previous work on Go.

In 1972, Nobel Prize–winning chemist Christian Anfinsen postulated that a protein’s sequence, as encoded in DNA, should by itself fully determine the final structure the protein takes. That supposition set off a decades-long quest for a mathematical model that could do what Anfinsen proposed. The problem, however, was that even though the laws of physics control how a protein folds, there are so many possible permutations that biologist Cyrus Levinthal famously estimated it would take longer than the age of the known universe to puzzle out a single protein’s structure through random trial and error.
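Levinthal’s estimate can be reproduced as a back-of-envelope calculation. The sketch below is illustrative only: the chain length, the number of conformations per amino acid, and the sampling rate are assumed round numbers, not figures from Levinthal or from this article.

```python
# Back-of-envelope version of Levinthal's argument.
# All three inputs below are illustrative assumptions.
RESIDUES = 100               # assumed length of a small protein chain
STATES_PER_RESIDUE = 3       # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13    # assumed rate of trying conformations at random

SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years


def levinthal_years(residues: int = RESIDUES,
                    states: int = STATES_PER_RESIDUE,
                    rate: float = SAMPLES_PER_SECOND) -> float:
    """Years needed to try every conformation once at the given rate."""
    conformations = states ** residues  # exact integer, about 5e47 here
    return conformations / rate / SECONDS_PER_YEAR


years = levinthal_years()
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"Exhaustive search: about {years:.1e} years")
print(f"That is about {ratio:.1e} times the age of the universe")
```

Even with these generous assumptions, an exhaustive search would take on the order of 10^27 years, which is why a random trial-and-error search is not a practical way to find a structure.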

But DeepMind’s AlphaFold 2 has now essentially done what Anfinsen suggested. AlphaFold 2 is “on par” with X-ray crystallography across more than two-thirds of the proteins in the CASP competition, Moult said. Now the hope is that researchers will be able to use AlphaFold 2, or at least the same method, to go directly from a protein’s DNA sequence, which has become relatively easy and inexpensive to obtain, to knowing its 3D shape, without having to use X-ray crystallography or other physical experiments at all.

Andrei Lupas, director of the department of protein evolution at the Max Planck Institute for Developmental Biology in Tübingen, Germany, who served as one of the assessors for this year’s CASP competition, called DeepMind’s results “astonishing.”

As part of CASP’s efforts to verify the capabilities of DeepMind’s system, Lupas used the predictions from AlphaFold 2 to see if it could solve the final portion of a protein’s structure that he had been unable to complete using X-ray crystallography for more than a decade. With the predictions generated by AlphaFold 2, Lupas said he was able to determine the shape of the final protein segment in just half an hour.

AlphaFold 2 has also already been used to accurately predict the structure of a protein called ORF3a that is found in SARS-CoV-2, the virus that causes COVID-19, which scientists might be able to use as a target for future treatments.

Lupas said he thought the A.I. software would “change the game entirely” for those who work on proteins. Currently, DNA sequences are known for about 200 million proteins, and tens of millions more are being discovered every year. But 3D structures have been mapped for fewer than 200,000 of them.
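To put those numbers in perspective, the short calculation below combines them with the University of Toronto cost estimate quoted earlier. It is a rough sketch that assumes, unrealistically, that every remaining protein could be crystallized at the same cost. The variable names are made up for illustration.

```python
# Rough arithmetic on the figures quoted in the article.
KNOWN_SEQUENCES = 200_000_000      # "about 200 million proteins"
MAPPED_STRUCTURES = 200_000        # upper bound: "fewer than 200,000"
COST_PER_STRUCTURE_USD = 120_000   # University of Toronto estimate

# Share of sequenced proteins with an experimentally mapped structure.
coverage = MAPPED_STRUCTURES / KNOWN_SEQUENCES

# Hypothetical cost of mapping the rest by X-ray crystallography,
# assuming the same per-structure cost applies to every protein.
unmapped = KNOWN_SEQUENCES - MAPPED_STRUCTURES
total_cost_usd = unmapped * COST_PER_STRUCTURE_USD

print(f"Mapped share: at most {coverage:.1%}")
print(f"Hypothetical cost for the rest: about ${total_cost_usd:.2e}")
```

Under these assumptions, at most about 0.1% of known protein sequences have a mapped structure, and closing the gap experimentally would cost on the order of tens of trillions of dollars.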

AlphaFold 2 was only trained to predict the structure of single proteins. But in nature, proteins are often present in complex arrangements with other proteins. Jumper said the next step was to develop an A.I. system that could predict complicated dynamics between proteins—such as how two proteins will bind to one another or the way that proteins in close proximity morph one another’s shapes.

DeepMind had entered and won the CASP competition two years ago. But at the time, using a differently configured A.I. system called AlphaFold, it achieved an average “global distance test total score” (GDT) of only 58 on the hardest class of proteins. GDT is a measure roughly equivalent to the percentage of each protein that the system maps accurately.

Although this was about six points better than the next best team, it was not a result that was competitive with empirical methods like X-ray crystallography. This year, even on these hardest proteins, DeepMind achieved a median GDT of 87, which is close to being as good as crystallography and was about 26 points better than its nearest competitor.
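For readers curious how a score like GDT is put together, here is a minimal sketch. The total score is commonly computed as the average, over distance cutoffs of 1, 2, 4 and 8 angstroms, of the percentage of amino acids whose predicted position falls within that cutoff of the experimental one. The sketch assumes the two structures have already been superimposed and the per-residue distances measured. The real assessment also searches for the best superposition at each cutoff, which is omitted here. The function name and example distances are hypothetical.

```python
from typing import Sequence

# Distance cutoffs, in angstroms, averaged by the total score.
CUTOFFS_ANGSTROM = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float]) -> float:
    """Simplified GDT total score (0 to 100).

    `distances` holds, for each residue, the distance in angstroms between
    its predicted and experimental position after one fixed superposition.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff.
    fractions = [sum(d <= cutoff for d in distances) / n
                 for cutoff in CUTOFFS_ANGSTROM]
    return 100.0 * sum(fractions) / len(CUTOFFS_ANGSTROM)


# Hypothetical per-residue errors for a six-residue toy example.
example_distances = [0.5, 0.9, 1.5, 3.0, 6.0, 12.0]
score = gdt_ts(example_distances)
print(f"Simplified GDT total score: {score:.1f}")
```

A perfect prediction scores 100, and a prediction in which every residue is more than 8 angstroms off scores 0.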
