Deep Structural Feature Extraction from Maize Proteins and Gene Sequences Using DNA Language Models (MDREN Participant)
Dr. Kyoung Tak Cho, Department of Computer Science, University of Maryland Eastern Shore
Understanding how the maize (Zea mays) genome relates to the structures of the proteins it encodes remains a major challenge in computational biology. This project applies machine learning to large-scale maize gene and protein sequence data, aiming to uncover structural and functional information encoded in DNA. We propose to develop a DNA language model trained on maize genomic sequences to learn biologically meaningful representations that capture the intrinsic syntax and semantics of DNA. These learned embeddings will be integrated with our custom k-mer distance–based model and deep neural architectures to predict tertiary structural and physicochemical features directly from gene and protein sequences.
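To make the k-mer component concrete, the following is a minimal sketch of one common way to compare two DNA sequences by their k-mer content: count overlapping substrings of length k and take the cosine distance between the two count profiles. It is an illustration only. The abstract does not specify the project's custom k-mer distance model, so the choice of cosine distance, the default k = 3, and the function names kmer_counts, cosine_distance, and kmer_distance are all assumptions made here.

```python
from collections import Counter
from math import sqrt

DNA_ALPHABET = set("ACGT")


def kmer_counts(seq, k=3):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= DNA_ALPHABET:
            counts[kmer] += 1
    return counts


def cosine_distance(a, b):
    """Cosine distance between two sparse count profiles (1.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmer_distance(seq1, seq2, k=3):
    """Hypothetical helper: k-mer profile distance between two sequences."""
    return cosine_distance(kmer_counts(seq1, k), kmer_counts(seq2, k))
```

In a pipeline like the one described, a distance of this kind could serve as one input feature alongside the language-model embeddings; how the two are actually combined is not stated in the abstract.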
The large-scale data processing, model training, and structural mapping tasks require access to GPU-enabled high-performance computing (HPC) resources with distributed processing capabilities. This project is conducted in collaboration with the Maize Genetics and Genomics Database (MaizeGDB), USDA-ARS, Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA, leveraging their extensive maize genomic and proteomic datasets. Expected outcomes include (1) novel computational methods for extracting structural features from biological sequences, (2) deeper understanding of gene-to-structure relationships in maize, and (3) new machine learning tools to support functional genomics and crop improvement research within the MaizeGDB community.