Tang, Jermaine and team pioneer breakthrough solution to big data challenges

ICML 2023 paper reveals groundbreaking techniques in differentiating relational algebra for large-scale ML

Yuxin Tang at ICML

Rice University Computer Science Ph.D. student Yuxin Tang wants to help data scientists solve large-scale machine learning problems more efficiently. The auto-differentiation paper he is presenting at ICML 2023 is a large step in that direction. 

“Auto-differentiation” software automatically computes derivatives that quantify how a change in a single vector, matrix, or tensor changes a complex machine learning model’s final output. Auto-differentiation has become a key enabling technology for machine learning, but it has been little studied in the context of the relational data model; extending auto-differentiation to that setting is the core contribution of Tang’s paper.
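To see what auto-differentiation does, here is a minimal sketch (not the paper's method) of the forward-mode variant in Python, using "dual numbers": each value carries its derivative along with it, so the derivative of a composite expression falls out of the chain rule automatically, with no manual calculus.

```python
class Dual:
    """A value paired with its derivative w.r.t. a chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (u * v)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# Differentiate f(x) = x*x + 3x at x = 2; analytically f'(x) = 2x + 3 = 7.
x = Dual(2.0, 1.0)      # seed dx/dx = 1
f = x * x + x * 3.0
print(f.value, f.deriv)  # 10.0 7.0
```

Production systems like PyTorch instead use reverse-mode differentiation over a recorded computation graph, which is far more efficient when a model has many inputs and one scalar loss, but the principle of mechanically propagating derivative rules is the same.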

In the relational data model, data are stored as massive tables, like very large spreadsheets. The relational data model is implemented by the most popular database engines, such as Oracle and Microsoft SQL Server.

“Before our work, data scientists had to differentiate computations over relational databases by hand. With our new auto-differentiation technology, that workload is significantly reduced because the process is automated, so there is no longer any need to evaluate derivatives and gradients manually. This new technology has the potential to enable widespread use of databases for machine learning,” said Tang.

Relational databases are attractive platforms for large-scale, data-oriented tasks because they are built to handle very large data sets. Programmers typically specify computations (or “queries”) over such databases using the popular SQL language. Using the methods described in Tang’s paper, it is now possible to differentiate SQL, making relational databases suitable for machine learning. 
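As a small illustration (again, not the paper's algorithm) of what it means to differentiate a relational computation, the sketch below uses Python's built-in sqlite3 engine. A squared-error loss over a table of points is written as a SQL aggregate, and its hand-derived gradient with respect to the parameter w turns out to be another SQL aggregate over the same table; auto-differentiation makes producing that second query automatic.

```python
import sqlite3

# A toy table of (x, y) observations in an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)])

w = 1.5  # current model parameter for the linear model y ≈ w * x

# Loss L(w) = SUM over rows of (w*x - y)^2, expressed as a SQL query.
(loss,) = conn.execute(
    "SELECT SUM((? * x - y) * (? * x - y)) FROM points", (w, w)).fetchone()

# Gradient dL/dw = SUM of 2 * x * (w*x - y): the derivative of a
# relational query is itself a relational query over the same table.
(grad,) = conn.execute(
    "SELECT SUM(2 * x * (? * x - y)) FROM points", (w,)).fetchone()

print(loss, grad)
```

Because both the loss and its gradient are ordinary SQL queries, the database engine can plan, parallelize, and distribute them exactly as it would any other query, which is what makes relational engines attractive hosts for large-scale ML.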

“Relational data models are often used to solve complex problems that involve managing and analyzing vast amounts of data in any company anywhere, from retail stores to investment firms to mass transportation providers. Many data scientists use relational models to manage and analyze data as they help their companies model the impact on revenue of changes to policies, products, or strategies. In addition, some deep learning models, like graph neural networks, can be expressed using relational data models,” said Tang.

“Relational systems have many benefits,” Tang continued. “They can distribute computations across multiple processors or machines easily, enabling automatic distribution, parallelization and optimization when handling large-scale data.”

Tang and his advisor, Rice CS Department Chair and J.S. Abercrombie Professor of Engineering Chris Jermaine, together with Rice CS Ph.D. students Zhimin Ding and Dimitrije Jankov, Statistics Ph.D. student Daniel Bourgeois, and Binhang Yuan, a CS postdoctoral researcher at ETH Zurich, developed a method for auto-differentiating relational computations for very large scale machine learning (ML). 

“We also demonstrated experimentally that a relational engine running an auto-diff algorithm can execute various ‘big data’ ML tasks as fast as special-purpose distributed ML systems,” said Tang. 

“Most computer scientists and data scientists are familiar with machine learning systems like PyTorch and TensorFlow or special graph-based machine learning systems like DGL. To be honest, the performance of general-purpose engines usually cannot outpace the optimized special engines. But in our work, we've shown that a relational engine like a database can be used to express general computations, matching the capabilities of specialized systems like PyTorch and DGL while achieving comparable performance.” 


Carlyn Chatfield, contributing writer