Archaeologists may dig through middens for details about past civilizations, but Rice University and Houston Health Department researchers have developed novel tools to search wastewater data for clues about a current community’s health.
“In the fall of 2021, when we were first asked if we could identify when the SARS-CoV-2 Omicron variant reached Houston, the research team was faced with a challenging problem. How can we provide the earliest possible information to the Houston Health Department while being confident we are correct?” said Todd Treangen, an associate professor in Rice’s Computer Science Department whose research focuses on computational methods and software tools used for real-time monitoring of microbial community dynamics, infectious disease, and biothreats.
“To answer this question, we were fortunately invited to join forces with Dr. Lauren Stadler in Civil and Environmental Engineering, along with Dr. Kathy Ensor in Statistics and Dr. Loren Hopkins in the Houston Health Department and Statistics. This exceptional team has been leading Houston Wastewater Epidemiology efforts, and Nicolae and I were very fortunate to collaborate with them on this important research question.”
Nicolae Sapoval, the first author of the team’s paper published last month in Nature Communications, came at the problem from a computational perspective. The Rice CS Ph.D. student said, “My main interest lies in developing novel algorithms for analyzing large and diverse collections of genomic data. On a day-to-day basis, I am mostly driven by my own curiosity and the pursuit of knowledge, but I deeply appreciate opportunities to make my work applicable to real world challenges. From that lens, I would argue that tracking down SARS-CoV-2 in wastewater felt like a call of duty moment, especially early in the project when we were not yet sure of how much information we could truly recover.”
Complicating the challenge of identifying new variants in the data was the sheer volume of insights from multiple hosts harboring a diversity of SARS-CoV-2 variants. The pool of samples was a messy mix of signals — what researchers call a “noisy” data source.
Sapoval said, “Dealing with a mixed set of signals is a common problem in computer science and engineering. From disentangling spectra of different chemicals in a solution to determining key frequencies in a transmission, this problem arises over and over in multiple application domains with their unique sets of nuances and challenges. In our case, there are two main sources of complexity: 1) the sample representing multiple individual signals combined and 2) the noise coming from the lab protocols such as SARS-CoV-2 RNA amplification, measurement devices like sequencing machines, and the subsequent processing of their data (e.g., variant calling software).”
“Our work is examining a mix of genomes, but think of it as a choir where you try to pick out if a specific soprano has been slacking off. Dealing with noisy and mixed sources is a very common problem in both bioinformatics and other areas of computer science. However, detecting ‘an under- or over-performing singer in a choir,’ while being a common problem, requires a good bit of domain knowledge. This is where we turn to the databases of what is known so far and begin mining for patterns that can help us distinguish the voices in the choir.”
Continuing his choir analogy, Sapoval said hearing an unrecognized sound or voice forces the listener to make a decision: attribute it to what it most resembles or create a new name for it. He said, “Another member of our group, Yunxi Liu, has been leading an effort that specifically focuses on tracking down ‘cryptic variants’ — the things that have evaded our databases, yet keep popping up in the local wastewater.
“However, the problems don't end here. As the metadata is often entered by hand, there is an inherent probability of error in it — like finding a tenor that was categorized as a bass in the database. Of course, that makes finding tenor-specific patterns more challenging. Finally, the dynamic nature of the database poses another set of challenges. For example, at some point a major UK effort ended up resulting in hundreds of thousands of new SARS-CoV-2 genomes being added to the database overnight. This kind of immediate and large-scale expansion required us to rapidly recompute variant specific signatures on the new expanded set of data.”
Treangen said, “QuaID was specifically designed to leverage all mutational signatures, including variable length deletions, to facilitate early and accurate detection of SARS-CoV-2 variants of concern in wastewater.” He underscored that the team’s advances depended heavily on the contributions of co-senior author Stadler and key collaborators Ensor and Hopkins. Stadler, Ensor, and Hopkins launched Houston Wastewater Epidemiology in the summer of 2020. Two years later, it was named a CDC National Wastewater Surveillance System (NWSS) Center of Excellence.
Stadler, a Rice CEVE assistant professor whose research focuses on wastewater disease monitoring and wastewater microbiology, was focused on SARS-CoV-2 variant detection using wastewater samples in early 2021. Her lab was using targeted assays that looked for individual characteristic mutations to identify variants, and she proposed sequencing the genomes of SARS-CoV-2 in wastewater samples to get a more comprehensive picture of circulating variants.
“The Houston Health Department Lab began sequencing SARS-CoV-2 genomes from wastewater samples, and once we had the data, we realized that there were no off-the-shelf computational pipelines for analyzing environmental samples that contain a mixture of variants and degraded and fragmented genomes, which are common to wastewater samples,” she said. Stadler and Treangen’s groups joined forces to develop QuaID, specifically designed for identifying low abundance/emerging variants of concern in environmental samples.
Ensor, Rice’s Noah G. Harding professor of Statistics and Director of the Center for Computational Finance and Economic Systems (CoFES), established and implemented the statistical system for assessing the pertinent health information from wastewater samples for SARS-CoV-2 in May 2020, and expanded her work through Houston Wastewater Epidemiology. She said, “The publication of our paper in Nature Communications gives us a global platform to share our results and methodology.”
Hopkins blends her role as a Rice professor in the practice of Statistics with her duties as Chief Environmental Science Officer and Bureau Chief for the Data Science Division in the Houston Health Department. Essentially, she translates research advances in science, engineering and higher education to inform city policymakers and improve public health.
The QuaID model provides Hopkins with a visual image of a spreading threat that her colleagues at the Houston Health Department can understand. She said, “The QuaID model allows our teams at the Houston Health Department to track when variants of concern first appeared in the city’s wastewater as well as where these variants of concern are present in the different parts of the city. Tracking not just SARS-CoV-2 in the wastewater, but also the variants, is important information for our public health leaders.”
For Sapoval, the QuaID model’s beauty is in the anonymity it provides for people shedding traces of viral genomes into wastewater. He said the importance of data collection and analysis should be balanced against the medical or social stigma of a single individual being identified as the first known carrier of a new variant.
“At Rice, we’ve been focusing on the question of ethics in computer science. With that perspective in mind, our creation of a universal yet anonymous type of surveillance allows us to assess large populations without imposing feelings of shame or guilt. I think one of the most valuable lessons learned from this project is that wastewater surveillance is a powerful tool that can be used to track the spread of a pathogen in a community without inflicting psychological damage on its people.”