A recent paper by Rice CS Associate Professor Anastasios Kyrillidis and Ph.D. student Fangshuo (Jasper) Liao proves that faster neural network training is achievable. The authors show that using Nesterov’s momentum provably accelerates the convergence rate in non-trivial neural networks (that is, non-linear, and having more than one hidden layer) that use the rectified linear unit (ReLU) activation function.
When a machine learning model is “optimized,” it stably predicts accurate results, and the “convergence rate” is how fast it gets to that point. “Accelerated convergence” means a faster, more efficient route to a solution.
That Kyrillidis and Liao prove Nesterov’s momentum accelerates convergence rates in deep neural networks (DNNs) is significant for several reasons. “Nesterov’s momentum was initially designed with convex optimization in mind,” Liao explained, and DNNs are not convex. Yet, Nesterov’s momentum has been applied to DNNs with favorable empirical results. No one really knew how or why, though.
“It has been a mystery why the approach we have been using for so long really works,” said Liao, “so the purpose of this work is to take a further step into solving that mystery — investigate why the methods that we designed for another class of functions would work on this more complicated class of functions.
“Proofwise, even on convex functions, the theoretical guarantee of this algorithm is hard to derive and to understand,” Liao continued.
Kyrillidis echoed the significance of their research bridging theory and practice. “There has been a gap between the empirical success of momentum methods in training neural networks and the theoretical understanding of why they work,” he said. “In essence, this work provides a significant step forward in our theoretical understanding of why popular optimization methods work well in deep learning, aligning theory more closely with observed practical performance. This improved understanding can drive future innovations in both the theory and practice of machine learning optimization.”
Novelty and significance of the research
“On a complicated objective such as a neural network,” said Liao, “proving convergence or acceleration, that is, showing that Nesterov’s momentum has superior performance to other optimization algorithms such as gradient descent, is in itself a novel task.”
“I have seen only two previous papers that do this, but these two papers only work for shallow neural networks,” he continued. They assume the neural network is simple, “whereas our work considers a deep neural network; you can have an arbitrary number of hidden layers.”
“Previous theoretical work on momentum methods in deep learning was limited to shallow ReLU networks or deep linear networks,” Kyrillidis added. “This is the first work to prove an accelerated convergence rate for Nesterov's momentum in non-trivial neural network architectures, specifically deep ReLU networks.”
Kyrillidis said the work also presents a new problem class and new assumptions. “We introduce a new class of objective functions where only a subset of parameters satisfies strong convexity,” he explained. “This is really helpful in machine learning problems, including neural networks, which are typically nonconvex. This also extends to conditions we often assume for the problems we work on. This research goes beyond that, showing acceleration under new conditions.”
But what is Nesterov’s momentum?
Nesterov’s momentum, which was introduced by Russian mathematician Yurii Nesterov in 1983, “is a generic optimization procedure,” Liao explained. “So, any real-world scenarios that involve solving an optimization problem” — such as scheduling flights or developing an investment portfolio — “you can use Nesterov’s momentum.”
Kyrillidis likens the momentum to “skiing down a hill, trying to reach the lowest point. Regular gradient descent is like skiing straight down, always moving in the direction of the steepest slope. You'll eventually reach the bottom, but you might zigzag a lot, which slows you down. Standard momentum is like skiing down with some speed, building velocity as you go. This helps you move past small bumps and get to the bottom faster; however, you might overshoot the lowest point because of your speed.”
In comparison, Nesterov's momentum is “a smarter way of skiing,” where, “instead of just using your current position to decide where to go next, you make a prediction of where you'll end up based on your current speed and direction, then look at the slope at that predicted position, and use that information to adjust your path.”
It is this "look-ahead" feature that allows Nesterov's Momentum to make intelligent decisions about which direction to move, Kyrillidis said. “It can slow down earlier when approaching the bottom, reducing overshooting and allowing for faster overall progress.”
Nesterov’s momentum is “hard to characterize, because it is kind of counterintuitive,” said Liao. “You're not going in a direction pointing to the thing you want to achieve. You are going in a little bit of a different direction, but we can prove that this is actually faster.” He added, “I find it a very elegant algorithm.”
Theoretical meets empirical
As mentioned, a key aspect of this research is providing a theoretical basis for empirical results. One way to advance the field of DNNs, the authors say, is to consider both approaches when tackling a problem.
“Both empirical and theoretical approaches play crucial roles in advancing our understanding of deep neural networks,” said Kyrillidis. “The empirical approach, often characterized by trial-and-error and extensive testing, has led to numerous breakthroughs in the field. It allows researchers to quickly test hypotheses and discover unexpected patterns. On the other hand, the mathematical approach, focusing on theorems and formal proofs, offers a more structured path to understanding the underlying principles of neural networks.
“Although some researchers may lean toward empirical methods due to their immediate results, there’s growing recognition of the value in combining both approaches,” he continued. “The mathematical foundation can guide empirical research, making it more targeted and efficient. Conversely, empirical findings can inspire new mathematical inquiries and validate theoretical predictions.”
“I believe that this is a very important open question, and we should keep doing this,” Liao said of approaching neural networks from a mathematical/theoretical angle. “One of my aspirations, and what I want to do in my Ph.D. life, is to solve more rooted problems in this field.”
“Ideally, increased collaboration between empiricists and theoreticians could lead to a more comprehensive understanding of deep neural networks,” said Kyrillidis. “By embracing both methodologies, the AI research community can leverage the strengths of each approach, potentially leading to more robust, efficient, and interpretable neural network models in the future.”
Future research
In their conclusion, the authors suggest, for future work, AI researchers “can extend the analysis to different neural network architectures,” such as convolutional neural networks (CNNs) and residual neural networks (ResNets).
“The deep neural network we focused on is called ‘multilayer perceptron’ [MLP], which is very simple,” Liao said. “You have neurons in each layer that are fully connected. A CNN selects only a subset of things that are connected, to suit vision tasks better. For example, a picture: each pixel in the picture should only be considered with respect to its neighbors, because the relative position really makes sense in this context. ResNet is different in that it has skip connections in between layers. For example, my second layer output is directly passed to the fourth layer, bypassing the third layer.”
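As a rough illustration of the difference Liao describes, here is a short sketch (our own, with made-up layer widths and random weights, not an architecture from the paper) of an MLP forward pass next to a ResNet-style block, where a skip connection lets a layer’s input bypass its transformation.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

# Illustrative sizes and random weights only.
rng = np.random.default_rng(0)
d = 16                                             # width of every layer
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)]

def mlp_forward(x, weights):
    """Multilayer perceptron: each layer's output feeds only the next layer."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

def resnet_forward(x, weights):
    """ResNet-style blocks: a skip connection adds each layer's input back
    to its output, so information can bypass the layer."""
    h = x
    for W in weights:
        h = h + relu(W @ h)                        # skip connection
    return h

x = rng.standard_normal(d)
print(mlp_forward(x, Ws).shape, resnet_forward(x, Ws).shape)
```

A CNN would instead restrict each weight matrix to local, shared connections, which is what makes it suit image inputs; that variant is omitted here to keep the sketch short.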
“This research opens up new avenues,” said Kyrillidis, “potentially leading to the development of more efficient training algorithms or extending these results to other types of neural networks and optimization methods.”