Rice University data science grad students tap public datasets for ML models in capstone projects

Chevron Digital Scholar MDS teams predict renewable energy capacity, cancer markers, and technical documentation details

Rice University data science grad students tap public datasets for ML models in capstone projects

On August 9, 2023, the newest cohort of Rice University data scientists presented their capstone projects to wrap up their Master of Data Science degrees. These MDS alumni were Chevron Digital Scholars, an initiative launched by the energy company in 2019 to develop change agents and leaders who integrate their business knowledge and experience with emerging skill sets. This is the third cohort of Chevron Digital Scholars to complete their MDS degrees at Rice.

“No other energy company sends their employees to graduate school for a data science degree that takes 12 months to complete. This is a unique program and an amazing opportunity for us,” said one of the Chevron participants. 

During their final capstone showcase, the teams presented projects to demonstrate their skills in machine learning, data analysis and visualization, and other tools for exploring and understanding big data. The audience of over 100 guests included the students’ colleagues as well as executives and the next cohort of Chevron Digital Scholars to begin their MDS degrees at Rice this fall.

Data Science students tackle medical, energy, and documentation datasets in capstone projects

The three teams chose projects that relied on databases of medical images, technical engineering documents, and renewable energy sources. One of the students said the cohort purposely chose projects that were not necessarily related to their own day-to-day roles or teams in the oil and gas industry. 

“We can learn from all kinds of data science problems,” they said. “These projects in technical documentation searches, medical imaging, and renewable energy forecasts allowed us to think outside the box and determine just how we could come to the answer; we deliberated which tools to choose and how to implement them.”

Improved identification of medical imaging markers for brain cancer 

Chevron Digital Scholars stand next to research poster presentation

Survival rates and mortality stats for brain cancer (glioblastoma) have been virtually unchanged for decades, prompting Todd Engelder, Keith Pulmano, Huafeng Liu, Ben Dowdell, Zida Wang, and Nicolas Osa to tackle this issue by applying computer vision to the field of medical imaging. 

Using machine learning via neural networks, the team focused on leveraging the growing amount of publicly available medical imaging data to improve early detection, diagnosis, and treatment of glioblastoma.

“Our objectives included improving the segmentation of glioblastoma lesions using vision transformers and improving the classification of one biomarker that is associated with the effectiveness of chemical treatment,” said Liu.

The team also used convolutional neural networks (CNN) to investigate the robustness of 2D glioblastoma lesion detection; they wanted to improve the explainability of cancer classification through the application of Gradient-weighted Class Activation Mapping (Grad-CAM) to CNN.

"Our data set is curated and made publicly available by the University of Pennsylvania Medical System and is hosted on an amazing repository known as The Cancer Imaging Archive (TCIA), funded by the United States National Cancer Institute," said Dowdell. "The UPenn-GBM contains 3D multi-parametric magnetic resonance imaging (mpMRI) volumes for 611 unique pre-operative patients with four different imaging sequences for each patient. 

"Each of the four imaging sequences represent a slightly different 3D image of a patient's brain as each sequence type has different acquisition and recording parameters tailored to capturing different information. Additionally, 147 of the 611 patients have accompanying segmentation masks annotated by expert radiologists which define the sub-tumor regions as well as healthy brain tissue. We use the 3D imaging sequences to train MaskFormer, a recent state-of-the-art vision transformer model, to predict the expert-generated segmentation masks."

Wang felt drawn to image segmentation and classification and believes this aspect of data science –particularly medical imaging– has vast potential, although it also teems with challenges. He said, “Our team's decision to focus on brain cancer MRI image classification and segmentation for our summer capstone project was both a challenging and enriching journey. Applying our theoretical insights to a topic with such impactful real-world applications was truly fulfilling.”

Day-ahead prediction of Texas’ renewable energy supply

Chevron Digital Scholars stand next to research poster presentation

Texas has quadrupled the proportion of wind and solar power on its power grid in the last decade, but these renewable energy sources –like the natural forces they rely on—are intermittent and makes balancing the grid more difficult for its operators. Makamba Sackey, Kyle Sneed, Nasim Taheri, Taylor Chambers, and Arthur Ning worked to improve the ‘day-ahead’ market prediction for available renewable power generation by using tomorrow’s weather forecast with their data science innovation called WiSDOM: Wind and Solar Day-ahead operational model.

“WiSDOM is trained on historical weather data in key regions of Texas and the historical renewable power generation given those weather conditions,” said Ning.

Sneed added, “Our ultimate objective was to display a dashboard of the next 24 hours of renewable power generation, with an estimate of the most impactful weather conditions that generate that production.” 

Public data from the Electric Reliability Council of Texas (ERCOT) was supplemented by data from more than 40 weather stations. By incorporating traditional models and state-of-the-art tools, the team searched for patterns and variables that had the greatest impact on predicting available renewable energy availability, and discovered data from the previous 120 hours could be used to make a solid prediction for the next 24.

New ML search and find tool speeds up documentation research for engineers

Chevron Digital Scholars stand next to research poster presentation

“According to S&P Global, engineers spend 42% of their time searching for technical specs,” said Diyaz Yespayev. “So our team focused on Natural Language Processing (NLP), leveraging generative transformers and vector databases to improve information retrieval for engineering technical documents and specifications.” 

Yespayev worked with Andrew Meaux, Chris Leinweber, Rhett Johnson, Ruslan Kharko, and Zhanibek Issabekov to develop EnGenIR: Engineering Generative Transformers for Information Retrieval. Their initial project narrowed their scope to documents in the Facilities Engineering spectrum, especially in the pipeline design domain, with the expectation of creating an architecture that could generalize well across other technical domains and functions.

“Our pipeline combined vector-stored technical documentation, embedding models, cosine similarity-based information retrieval, and generative AI information synthesis. And yes, we incorporated GPT-4,” said Yespayev. 

Drawing on publicly available engineering code and standard documentation from sources like ASME and API, the team hoped EnGenIR would allow an engineer to get answers to questions like, “Does X, Y, and Z of my current pipeline design meet the required technical standards?” Their model performed well, but the team also recognized that creating appropriate questions or prompts was a critical component in achieving a usable answer.

Director declares MDS capstone showcase a success

Arko Barman, the director for Rice’s Data Science Capstone Program, expressed enthusiasm for the projects the MDS teams presented.

“The presentations we saw tonight are a culmination of a months-long project that each of the teams completed where they applied the skills and knowledge that the students and scholars have acquired through the duration of their master’s. Every person has contributed to their team for completing a challenging project to solve important problems and derive actionable insights from real-world data that is already publicly available,” said Barman.  He suggested that the students step back from their work and take a moment to look back on all they had accomplished. 

“You have taken all that you have learned in the courses throughout the degree program and applied it in your capstone project. You discovered holes or gaps in your knowledge and your faculty mentors have been there to help you bridge that gap. This may also have been the very first time you have written software with the goal of reusability and reproducibility that might be used by other people and data scientists.”

For more information about Rice University’s applied Master of Data Science program, offered both 100% online or on-campus, please visit:


Carlyn Chatfield, contributing writer