Customizing, or fine-tuning, a large AI model for a specific downstream task can be resource intensive when the model’s parameters have to be changed on the server side. The number of model parameters can run to hundreds of billions, and storing a separate set of parameters for every task can be prohibitive. “Prompt tuning” approaches offer an alternative: task-specific instructions on how to handle the task are added to the prompt itself, typically as learned “soft prompts” that are not human readable. While prompt tuning does not require changing the model itself, its performance can fall short of that achieved by actually changing model parameters.
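To make the idea concrete, the sketch below shows the general prompt-tuning pattern: a small set of trainable vectors is prepended to the input embeddings of a model whose own parameters stay frozen. It assumes a Hugging Face-style model that exposes get_input_embeddings() and accepts inputs_embeds; the class name, the number of virtual tokens, and the initialization scale are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Illustrative sketch of prompt tuning: learn a few 'soft prompt' vectors
    and prepend them to the input embeddings of a frozen language model."""

    def __init__(self, base_model, num_virtual_tokens=20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the model's own parameters never change
        d_model = base_model.config.hidden_size
        # The only trainable parameters: num_virtual_tokens x d_model values
        self.soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

    def forward(self, input_ids, attention_mask=None):
        tok_emb = self.base_model.get_input_embeddings()(input_ids)
        batch = tok_emb.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)  # prepend the soft prompt
        if attention_mask is not None:
            pad = torch.ones(batch, prompt.size(1),
                             dtype=attention_mask.dtype, device=attention_mask.device)
            attention_mask = torch.cat([pad, attention_mask], dim=1)
        return self.base_model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```

Only the small soft_prompt tensor is trained for each task; the shared model on the server is untouched.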
Now, a method called “low-rank prompt adaptation,” or LoPA, offers a solution that can achieve accuracy similar to that of actually modifying model parameters. LoPA allows the model to be customized on the user’s side, not the server side, making model tuning scalable and more accessible for researchers, companies, and individual users.
Rice CS Professor Chris Jermaine and his advisee Abhinav Jain, a fourth-year PhD student, discuss these findings in the paper “Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation.” The other contributors are Professor Thomas Reps, from the Department of Computer Science at the University of Wisconsin-Madison, and Professor Swarat Chaudhuri, from the Department of Computer Science at the University of Texas at Austin.
Jain presented LoPA at NeurIPS 2024 in Vancouver. “We were trying a completely different thing initially,” said Jain about the development of LoPA. The team was trying to use a variational autoencoder, which is a model that can learn to generate new data by looking at the examples in a training data set, to get large language models to write more useful programs. “We were trying to learn libraries of code in the latent space,” that is, to learn to map common patterns in programs to an abstract representation, which could be used to generate programs that use those patterns.
The researchers were “trying to see, with that compression, can you learn some high-level abstractions of the low-level data?” Jain said. “And while we were approaching that, we found out that the high-level abstractions, when interpreted as soft prompts and prepended to the input of the language model, improve its performance on a downstream task without changing any of its parameters.” That’s when the researchers decided, “Maybe we can investigate this further and not worry about the initial problem statement. This is a very interesting idea.”
How LoPA could be a more desirable alternative to LoRA and prompt tuning
There are a number of ways to customize a model for a specific task. The most thorough is full fine-tuning, in which every parameter is updated for the new task. “Personalization can usually be done using supervised fine-tuning or full fine-tuning, which is very expensive because, imagine the scale of these models: it’s billions of parameters,” explained Jain. More efficient and less memory-hungry is parameter-efficient fine-tuning, or PEFT. “It is an umbrella term,” Jain said, that encompasses many methods of customizing a model by altering only a subset of the parameters most relevant to the new task.
Low-rank adaptation, or LoRA, is the most commonly used type of PEFT, offering a faster, lower-cost alternative to full fine-tuning by altering a limited number of parameters. As Jain and his coauthors pointed out, however, LoRA “require[s] maintaining multiple adapter-like modules for each user-specific task on the FM [foundation model] server and the need to select and assemble a subset of these modules every time a batch of user requests is processed.”
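For readers unfamiliar with LoRA, its core idea is to freeze a pretrained weight matrix and learn only a low-rank correction to it. The sketch below shows that pattern for a single linear layer; the rank and scaling values are illustrative assumptions, and this is a generic illustration of low-rank adaptation rather than code from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch of LoRA: a frozen linear layer plus a trainable
    low-rank update, y = W x + (alpha / r) * B A x."""

    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_f, r))        # up-projection, starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus a low-rank, task-specific correction
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

Because A and B hold far fewer values than the full weight matrix, each task adds only a small adapter; as the quote above notes, though, those adapters still live on the server and must be selected and assembled for every batch of requests.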
“Clearly, it’s not scalable,” Jain said.
Prompt tuning, in which task-specific prefixes are added before processing begins, is a LoRA alternative that requires no changes on the server side. With the exception of prompt tuning, “all the existing methods that personalize a language model require storing these task-specific parameters on the server side,” Jain explained, “but prompt tuning is not as high performing as LoRA. So how do you bridge the gap?
“That's where LoPA comes in,” he continued. “It is better performing than prompt tuning; it comes close in performance to LoRA, and further, it is more parameter efficient than LoRA. It consumes fewer parameters.”
Achieving high performance: tasks and instances
LoPA is unique in that it constructs soft prompts “from two components: a task-specific element that shares task information across samples and an instance-specific element that incorporates information from each individual instance,” the authors wrote. Previous methods of constructing prompts were “fully task specific, or instance specific. Striking a balance between the two is something that we studied in this paper,” Jain said. Used individually, the approaches “are not optimal enough in personalizing the language model on a downstream task,” he said.
“A gating function turned out to be the most optimal way” to combine them, explained Jain—a “nonlinear function that combines the two turned out to give better performance than a basic linear composition or no composition at all. That's what we found.”
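Based on that description, a rough sketch of the gated composition might look like the following; the pooled-input encoder, the rank, the prompt length, and the use of a sigmoid gate are illustrative assumptions rather than the paper’s exact formulation.

```python
import torch
import torch.nn as nn

class GatedSoftPrompt(nn.Module):
    """Sketch of combining a shared, task-specific soft prompt with a
    low-rank, instance-specific component through a nonlinear gate."""

    def __init__(self, d_model: int, prompt_len: int = 10, rank: int = 4):
        super().__init__()
        # Task-specific component, shared by every sample of the task
        self.task_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        # Low-rank encoder that turns a summary of the instance into a gate
        self.down = nn.Linear(d_model, rank)
        self.up = nn.Linear(rank, prompt_len * d_model)

    def forward(self, pooled_input: torch.Tensor) -> torch.Tensor:
        # pooled_input: (batch, d_model), e.g. the mean of the instance's token embeddings
        batch = pooled_input.size(0)
        inst = self.up(torch.relu(self.down(pooled_input)))   # instance-specific signal
        gate = torch.sigmoid(inst).view(batch, -1, self.task_prompt.size(-1))
        # Nonlinear (gated) combination of the two components
        return gate * self.task_prompt.unsqueeze(0)           # (batch, prompt_len, d_model)
```

The resulting soft prompt for each instance would then be prepended to that instance’s input embeddings, as in the prompt-tuning sketch earlier, with the language model itself left frozen.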
Advantages of LoPA
“The end goal” of LoPA, said Jain, is “making the personalization of language models scalable.”
He explained, “Imagine you as a user are trying to use a language model on your task, but it does not do so well. So you want to update the parameters or personalize it. And now imagine this being a request from thousands of users or maybe millions of users. Any existing method that offers such personalization is not scalable… To make it computationally efficient, we have low-rank prompt adaptation. It achieves scalable personalization from the user end.”
Other important advantages to LoPA “are cost effectiveness plus computational efficiency,” said Jain. In addition, “a user gets more autonomy and also privacy. I don't have to store any user specific parameters on the server and risk it being accessed by someone else.”
Presenting the research
Jain, a first-time presenter, shared this research at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024) in Vancouver.
“It’s a very large, top-tier conference,” he said. “It will give visibility to our work, and it will also give us the opportunity to further investigate and maybe exercise this work in different domains that we haven't thought about.
“This conference gives us a very broad audience. People from varying backgrounds and different interests come here, and I see a lot of potential collaboration, and maybe some interesting takes on this approach. It’s huge.”