Thanks to funding from the Welch Foundation, Rice University professors from BioSciences and Computer Science are using machine learning methods to continue solving the phase problem first identified by physicists in the 1950s.
“A protein structure can be compared to a line of beads, in perhaps 20 different colors, collected on a string and folded up into a unique three-dimensional shape that determines its function. These microscopic molecular 3D machines tackle jobs like digesting food, guiding muscle movement and firing neurons, actions that are all determined by how a specific sequence of the amino acid beads folds into a distinctive shape.
“X-ray crystallography is used to reveal the 3D shape that explains how these machines will work, but the direct revelation is incomplete in most cases because the X-ray measurements don’t include the phase data, a critical part of the calculation,” said George N. Phillips, Jr., Rice’s Ralph and Dorothy Looney Professor and Associate Dean of Research for the Wiess School of Natural Sciences.
Phillips uses a water ripple analogy to explain X-ray crystallography. He said if two rocks are thrown into a pond, the ripples across the water’s surface will create patterns that overlap and even cancel each other out in some directions. If someone captured a snapshot of those two sets of ripples, a physicist could work backwards, like solving a puzzle, to determine exactly where the rocks first hit the water. Shining X-rays on the molecules of a crystal is similar: The X-rays bounce off the atoms somewhat like ripples from the rocks, enabling scientists to determine where the atoms are positioned in their 3D structure.
“Physicists have worked out the mathematics to calculate the diffraction pattern — but without the phases the known arrangement of the atoms indicated by the X-ray crystallography snapshot cannot be discerned,” said Phillips. “In mathematical terms, what we don’t know is how much of the measured diffracted ray is a real number and how much of it is imaginary. That is another way to describe the phase problem.”
“When we can identify a protein’s phase data, and carry out the math to make an image of the molecules, we consider the protein structure solved. Hauptman and Karle were awarded the Nobel Prize in Chemistry for solving the phase problem for simple chemical compounds back in the 1950s. Most complex compounds, like proteins, are still not able to be solved directly and each one requires extra experiments to estimate its phases. Using experiments, my students and I have solved around 500 protein structures. With machine learning, we hope to solve many more.”
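As a rough illustration of the phase problem Phillips describes, the Python sketch below builds a one-dimensional toy crystal and computes its diffraction pattern with a discrete Fourier transform. The grid size, atom positions and function names are invented for this example and do not come from the Rice project; real crystallography works in three dimensions with far more data. Each diffracted ray is a complex number. The detector records only its amplitude, not how that amplitude splits into real and imaginary parts.

```python
import cmath
import math

N = 16  # grid points in a hypothetical 1D unit cell

def structure_factors(density):
    """Discrete Fourier transform: one complex number per diffracted ray."""
    n = len(density)
    return [sum(density[x] * cmath.exp(-2j * math.pi * h * x / n) for x in range(n))
            for h in range(n)]

def density_from(factors):
    """Inverse transform: turns complex diffracted rays back into a density map."""
    n = len(factors)
    return [sum(factors[h] * cmath.exp(2j * math.pi * h * x / n) for h in range(n)).real / n
            for x in range(n)]

# Toy structure: three "atoms" of different weight at made-up positions.
true_density = [0.0] * N
true_density[2], true_density[5], true_density[11] = 1.0, 2.0, 1.5

rays = structure_factors(true_density)
amplitudes = [abs(f) for f in rays]          # what an experiment measures
phases = [cmath.phase(f) for f in rays]      # what the measurement loses

# With amplitudes AND phases, the structure comes back exactly.
recovered = density_from([cmath.rect(a, p) for a, p in zip(amplitudes, phases)])

# With amplitudes only (phases guessed as zero), the map is wrong.
guess = density_from([cmath.rect(a, 0.0) for a in amplitudes])

print("true     :", [round(v, 2) for v in true_density])
print("recovered:", [round(v, 2) for v in recovered])
print("no phases:", [round(v, 2) for v in guess])
```

The map rebuilt with phases matches the starting structure, while the zero-phase map puts its strongest peak at the origin, where the toy structure has no atom at all.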
At its simplest, machine learning is the process of training a computer system to predict answers to new questions by first feeding it thousands of known question-and-answer sets. For machine learning to be useful in identifying phase data for unsolved proteins, thousands of known protein structures would have to be available for training. Those experimentally determined structures are available in the Protein Data Bank (PDB). In fact, Google’s AlphaFold project used machine learning linked with the PDB and other databases to successfully predict millions of 3D protein structures.
“When they participated in a biennial competition on modeling protein structures, Google proved with AlphaFold and AlphaFold2 how ample computer resources, fantastic machine learning algorithms, and thousands of PDB datasets could solve millions of protein structures. Their results were far better than any previous attempts in the competition,” said Phillips.
“The folks at Google want to make everything they discover about protein structures available at the click of a button, and researchers around the world are already benefitting from that data. When it works, it works great. However, there are some cases in which AlphaFold2 can’t provide the precise detail about the workings of 3D molecular machines that X-ray crystallography can, and that is where our research at Rice is focused.”
Phillips was interested in using a physics-based approach augmented with machine learning methods, so he reached out to Anastasios Kyrillidis, Rice’s Noah Harding Assistant Professor of Computer Science, for input.
Kyrillidis directs the OptimaLab at Rice, a research team that specializes in optimizing, or increasing the efficiency of, very complicated calculations. “I was intrigued by George’s ideas, and I’m happy to explore optimization for use in an area that is new to me,” he said. “We believe we can train our deep neural network (DNN) with enough data from solved protein structures in the PDB so that the model can learn to solve structures for a new object. George knows all about the science of the structures and our job is multifaceted: Understand the problem at hand and its intricacies; map this knowledge in math relations and machine learning model design; optimize the model and its parameters, having in mind that such problems involve massive amounts of data and millions to billions of parameters. To be successful for such a challenge, new model designs and optimization methods need to be devised, as well as proper system-level design to handle such large-scale problems.”
According to Kyrillidis, the “Machine Learning” description of the project is rather straightforward: Given a mathematical model (like a neural network), the goal is to train it on data. Fed hundreds of thousands or millions of data points, such as the measurements of a diffraction pattern, the neural network learns, after spending lots of computer time, the best mapping from the initial input data to the final answer.
He said, “That is the learning part. If the system is working well, it will accurately identify answers, even on data that were not included during the training phase. What is challenging and exciting is what phase retrieval as an application brings to the table: What should be the neural network model design? What additional prior knowledge does phase retrieval data provide that can help towards new models and algorithms? Given the vast amount of data, what is a good algorithm to train such models that works well and is fast? Such questions and more are pending and wide open, and the goal of this proposal is to provide satisfactory answers.”
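The learn-then-check idea Kyrillidis describes can be sketched in a few lines of Python. This is only a toy under stated assumptions: the model is a straight line with two parameters, not a deep neural network, and the data, learning rate and variable names are made up for the illustration. The model is fitted on one set of examples and then scored on examples it never saw during training.

```python
import random

random.seed(0)  # fixed seed so the toy run is repeatable

def make_example():
    """Hypothetical data: the 'right answer' is y = 3x - 0.5 plus a little noise."""
    x = random.uniform(-1.0, 1.0)
    y = 3.0 * x - 0.5 + random.gauss(0.0, 0.05)
    return x, y

data = [make_example() for _ in range(1200)]
train, test = data[:1000], data[1000:]  # the test examples are held out of training

def mse(w, b, examples):
    """Mean squared error of the line w*x + b on a set of examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

# Training: plain gradient descent on the training set only.
w, b = 0.0, 0.0
learning_rate = 0.1
for epoch in range(200):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Scoring: a low error on unseen examples suggests the mapping generalizes.
train_error = mse(w, b, train)
test_error = mse(w, b, test)
print(f"learned w={w:.3f}, b={b:.3f}")
print(f"train error={train_error:.4f}, test error={test_error:.4f}")
```

A test error close to the training error is the signal that the model has learned the underlying mapping and has not simply memorized its training examples.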
Kyrillidis is pleased with the progress of their research and eager to bring more of his team members into the work.
“We’ve been doing quite well, getting correct answers to our test set. We’re now gradually increasing the complexity of the problems to get closer to a real-world case. Each time we train on hundreds of thousands of data sets, and then test on tens of thousands of data sets. That’s the way these deep learning networks work. Train them up and then score to see if they are working. Next steps include finding better and more targeted models for phase retrieval,” said Kyrillidis.
Phillips is also satisfied with the trajectory of their research. He said, “I think our work was partially inspired by the success of AlphaFold2, which demonstrated the power of machine learning in solving very complicated structural biology questions. We did some initial testing of our hypothesis with funding from the Rice University Faculty Innovation Fund Program and then we reached out to the Welch Foundation when the trial proved successful.”
The Welch Foundation is one of the United States’ largest private funding sources for fundamental chemical research at universities, colleges and other educational institutions in Texas. The 2022 machine learning-crystallography grant is Phillips’ third Welch Foundation award.