Enhancing STEM Retention by Identifying Cultural Capital Themes in Student Writing Using State-of-the-Art Language Models
Khalid Mehtab Khan
Department of Computer Science
Faculty Supervisor: Anagha Kulkarni
This project aims to improve STEM retention by identifying Cultural Capital Themes (CCTs) in student essays using state-of-the-art language models. These themes, such as Aspiration, Familial, Resistance, and Navigational, are deeply embedded in students' lived experiences and are critical to understanding educational persistence, especially among underrepresented groups. We fine-tuned DeBERTa, a transformer-based model, on a domain-specific corpus using masked language modeling and multilabel classification to detect these themes at the sentence level. Unlike traditional machine learning approaches, our method leverages contextual understanding and transfer learning to outperform classical models in both accuracy and interpretability. We also explore generative modeling with T5 to synthesize CCT-expressive text from prompts based on theme definitions. This dual approach enables both detection and generation of culturally relevant narratives. Our long-term goal is to integrate these insights into an interactive annotation and feedback system for training educators and researchers, contributing to more inclusive, data-driven interventions that support student success in STEM fields.
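
To make the sentence-level multilabel setup concrete, the following is a minimal sketch of how per-theme scores could be turned into labels. It is not the project's implementation: the logits, the 0.5 threshold, and the helper name `predict_themes` are illustrative assumptions, and only the four theme names come from the text above. The point it shows is that each theme is scored independently with a sigmoid, so a single sentence can carry several themes or none.

```python
import math

# Theme names are taken from the abstract; everything else here is hypothetical.
THEMES = ["Aspiration", "Familial", "Resistance", "Navigational"]


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_themes(logits, threshold=0.5):
    """Return every theme whose sigmoid probability meets the threshold.

    Themes are scored independently (no softmax across themes), so a
    sentence may receive zero, one, or several labels.
    """
    if len(logits) != len(THEMES):
        raise ValueError("expected one logit per theme")
    return [theme for theme, z in zip(THEMES, logits) if sigmoid(z) >= threshold]


# Made-up logits for one sentence, standing in for a classifier's output.
example_logits = [2.1, 0.3, -1.7, -0.2]
print(predict_themes(example_logits))  # ['Aspiration', 'Familial']
```

In a fine-tuned classifier the logits would come from the model's classification head, and the threshold would typically be tuned per theme on held-out annotated sentences rather than fixed at 0.5.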