Accelerating Science: Modelling Protein-Protein Interactions With The Francis Crick Institute
The Francis Crick Institute is a London-based consortium of six of the UK’s preeminent life sciences research organisations, and its mission is to solve some of the world’s most fundamental problems in biomedical science. However, despite access to top-of-the-range tools such as on-premises HPC facilities, researchers increasingly face computational engineering challenges as the scale and complexity of data continue to grow, and they are often not taught the skills needed to approach them.
We realised that because our simulation engine, Aether Engine, is powered by Hadean (a platform designed to natively distribute tasks into the cloud), we can unlock the computational power available in commodity clouds without needing researchers to learn esoteric skills like MPI. We were also curious to see how production engineering practices could improve the workflows of computational biologists. To find out, we sought out the makers of public servers to see what challenges they face, and that is how we discovered SwarmDock Server.
Paul Bates, leader of the Biomolecular Modelling Laboratory at The Francis Crick Institute, has for many years been spearheading the creation of generalised tools for protein-protein docking via his team’s SwarmDock Server. The lab’s focus on creating algorithms and simulations to decipher the natural patterns of biological systems, in conjunction with clinical research into complex diseases like cancers, made them the perfect partners for this project.
SwarmDock is a web-based tool for modelling how proteins interact. The protein crystal structures to be docked are fed in as PDB files, and the server generates a set of protein-protein complex structures, ranked by energy with the lowest first. However, proteins are intrinsically dynamic molecules, and a PDB file represents only a snapshot of a protein’s structure, so SwarmDock improves the accuracy of its results by also allowing a degree of flexibility (using normal modes).
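As a rough illustration of what flexibility via normal modes can mean in practice, the sketch below shifts a set of atomic coordinates along a linear combination of mode vectors. This is not SwarmDock’s implementation: the function name `displace_along_modes`, the toy two-atom structure and the single mode are all assumptions made purely for illustration.

```python
def displace_along_modes(coords, modes, amplitudes):
    """Shift each atom by a weighted sum of mode displacement vectors.

    coords:     list of (x, y, z) tuples, one per atom (the rigid snapshot)
    modes:      list of modes; each mode is a list of (dx, dy, dz), one per atom
    amplitudes: one weight per mode (hypothetical values chosen by the caller)
    """
    if len(modes) != len(amplitudes):
        raise ValueError("need exactly one amplitude per mode")
    displaced = []
    for atom_index, atom in enumerate(coords):
        shifted = tuple(
            x + sum(a * mode[atom_index][axis] for a, mode in zip(amplitudes, modes))
            for axis, x in enumerate(atom)
        )
        displaced.append(shifted)
    return displaced


# Toy example: two atoms and one mode that moves them towards each other.
snapshot = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
closing_mode = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
flexed = displace_along_modes(snapshot, [closing_mode], [0.25])
```

Setting every amplitude to zero returns the original snapshot, so the rigid structure is simply the special case with no flexibility applied.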
Proteins are the building blocks of life. Understanding their form, function and behaviour will help to solve the grand challenges in life sciences and medicine today (such as cancer, flu and ageing). Pharmaceutical research employs docking techniques for a variety of purposes, most notably drug design. The Bates Lab have been involved in the development of a new technique, Cross-Docking, to push this further. Here, a variety of input protein structures are used and the results of different flexible-body docking runs are combined. However, choosing the ‘best’ inputs and increasing the number of runs both increase the computational time required.
We hypothesised that the more diverse the protein structures used in the multiple Cross-Docking runs, the better the final docked structures would be. We used Aether Engine to sample tens of thousands of possible conformations for the input proteins, profiled them by potential energy, and selected candidates for docking according to features in this energy space. Aether Engine accelerated the potential energy calculation by partitioning space so that the short-range interactions between atoms are considered within each localised area, which enables the calculations to run in parallel.
We investigated 56 protein-protein pairs and compared our final docked structures against their known complexes (available as PDB files). We also compared these results against docked structures obtained from two alternative sets of inputs: theoretically calculated ‘best’ input proteins, and a naive approach of simply stretching the input proteins to their maximum extension (as determined with normal modes).
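The idea of partitioning space for short-range interactions, described above, can be sketched with a simple cell list. This is a minimal illustration and not Aether Engine’s API: the Lennard-Jones potential is a stand-in (the potential actually used is not stated here), and every function name, the cutoff and the toy lattice of ‘atoms’ are assumptions. Each cell’s energy is computed independently of the others, which is what makes the work easy to spread across workers.

```python
import itertools
import math
import random


def lj_energy(r2, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy from a squared distance (stand-in potential)."""
    s6 = (sigma * sigma / r2) ** 3
    return 4.0 * epsilon * (s6 * s6 - s6)


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def brute_force_energy(coords, cutoff):
    """Reference O(N^2) sum over every pair closer than the cutoff."""
    cutoff2 = cutoff * cutoff
    total = 0.0
    for a, b in itertools.combinations(coords, 2):
        r2 = squared_distance(a, b)
        if r2 < cutoff2:
            total += lj_energy(r2)
    return total


def build_cells(coords, cutoff):
    """Bin atom indices into cubic cells whose edge length equals the cutoff."""
    cells = {}
    for index, position in enumerate(coords):
        key = tuple(math.floor(c / cutoff) for c in position)
        cells.setdefault(key, []).append(index)
    return cells


def cell_energy(key, cells, coords, cutoff):
    """Energy owned by one cell; each call is independent of the others."""
    cutoff2 = cutoff * cutoff
    total = 0.0
    # Interacting pairs can only sit in the same cell or one of its 26 neighbours.
    for offset in itertools.product((-1, 0, 1), repeat=3):
        neighbour = tuple(k + o for k, o in zip(key, offset))
        for i in cells[key]:
            for j in cells.get(neighbour, ()):
                if i < j:  # the pair is owned by the cell holding the lower index
                    r2 = squared_distance(coords[i], coords[j])
                    if r2 < cutoff2:
                        total += lj_energy(r2)
    return total


def partitioned_energy(coords, cutoff):
    """Sum the per-cell energies; the per-cell tasks could run in parallel."""
    cells = build_cells(coords, cutoff)
    return sum(cell_energy(key, cells, coords, cutoff) for key in cells)


def jittered_lattice(n_side, spacing=1.2, jitter=0.1, seed=0):
    """Toy 'atoms' on a perturbed cubic lattice, so no two points overlap."""
    rng = random.Random(seed)
    return [
        tuple(spacing * g + rng.uniform(-jitter, jitter) for g in grid)
        for grid in itertools.product(range(n_side), repeat=3)
    ]
```

Calling `partitioned_energy(jittered_lattice(4), 2.5)` should agree with `brute_force_energy` on the same inputs up to floating-point rounding. The partitioned version only compares atoms in neighbouring cells, and in this sketch the loop over cells is the natural place to distribute work.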
Our results demonstrated that Cross-Docking is a promising approach, showing a 10% uplift in docked structure quality. The sophisticated sampling of inputs using Aether Engine also led to a significant reduction in computation time: sampling can be easily parallelised with no extra effort for the developer, and using more diverse inputs means fewer docking runs are needed, as hypothesised. Given that publicly available servers like SwarmDock rely on shared resources, and a docking run of all 56 protein pairs can take weeks to complete, we concluded that the benefits of sophisticated sampling with Aether Engine to generate the most diverse input structures possible will outweigh any additional burden brought on by this preprocessing step.
This investigation included a difficult target from this year’s CAPRI competition, a large community-driven initiative to blind test protein-protein docking algorithms, and generated an ‘Acceptable’ result. Large-scale, blind, standardised initiatives like CAPRI (protein docking) and CASP (protein folding) are important for computational biology because they increase the reliability and reproducibility of results and encourage engineering best practice. As both are seen as the gold standard for assessing techniques, major players like Google DeepMind have also chosen to publicly evaluate their work with this community. Their paper on AlphaFold, the CASP13 winner, is available in the same issue as our work described here.
Our research, conducted by The Francis Crick Institute and Hadean, has been published in Proteins, in a paper entitled “Enhanced sampling of protein conformational states for dynamic cross-docking within the protein-protein docking server SwarmDock”. We have also made a set of diverse docked results, DYNACROSS, available for scoring communities on figshare.
I’d like to thank Paul Bates and his team, as well as my colleagues at Hadean who worked on the project: Dr. Mieczyslaw Torchala, Patrick Gordon and Dr. Francis Russell.
Thanks also go to Innovate UK for funding the project. If you’re interested in downloading and reading the paper, it’s now available for free from the Wiley Online Library.