Abstract

Recent advances in deep learning have revolutionized macromolecular structure prediction, as exemplified by the success of AlphaFold and related frameworks. Despite these breakthroughs, accurately modeling biomolecular assemblies, particularly protein–RNA complexes, remains a major challenge, partly due to the scarcity of the evolutionary information used as input by existing approaches.

In this talk, I will focus on our recent work, ProRNA3D-single, a novel deep learning framework for predicting protein–RNA complex structures. ProRNA3D-single employs geometric attention-enabled pairing of biological language models of proteins and RNAs to predict interatomic interaction maps, which are subsequently transformed into multi-scale geometric restraints for 3D structure modeling. Benchmark results demonstrate that ProRNA3D-single outperforms state-of-the-art methods, including AlphaFold 3, particularly when evolutionary information is limited. I will conclude with a brief overview of my ongoing and future research directions.
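As a rough illustration only, not the ProRNA3D-single implementation, the sketch below shows the general idea of pairing per-residue protein language-model embeddings with per-nucleotide RNA language-model embeddings via cross-attention to predict a protein–RNA interaction map; all layer sizes, embedding dimensions, and model choices here are assumptions.

```python
import torch
import torch.nn as nn

class PairedInteractionHead(nn.Module):
    """Toy head that pairs per-residue protein embeddings with per-nucleotide
    RNA embeddings and predicts an inter-molecular interaction (contact) map.
    Dimensions and layers are illustrative, not those of ProRNA3D-single."""

    def __init__(self, d_prot=1280, d_rna=640, d_pair=128):
        super().__init__()
        self.proj_prot = nn.Linear(d_prot, d_pair)
        self.proj_rna = nn.Linear(d_rna, d_pair)
        self.attn = nn.MultiheadAttention(d_pair, num_heads=4, batch_first=True)
        self.out = nn.Sequential(nn.Linear(2 * d_pair, d_pair), nn.ReLU(),
                                 nn.Linear(d_pair, 1))

    def forward(self, prot_emb, rna_emb):
        # prot_emb: (L_p, d_prot) from a protein language model
        # rna_emb:  (L_r, d_rna)  from an RNA language model
        p = self.proj_prot(prot_emb).unsqueeze(0)        # (1, L_p, d_pair)
        r = self.proj_rna(rna_emb).unsqueeze(0)          # (1, L_r, d_pair)
        # Cross-attention lets protein positions attend to RNA positions.
        p_ctx, _ = self.attn(query=p, key=r, value=r)    # (1, L_p, d_pair)
        # Build a pairwise representation by broadcasting and concatenating.
        L_p, L_r = p.shape[1], r.shape[1]
        pair = torch.cat([p_ctx.squeeze(0).unsqueeze(1).expand(L_p, L_r, -1),
                          r.squeeze(0).unsqueeze(0).expand(L_p, L_r, -1)], dim=-1)
        return torch.sigmoid(self.out(pair)).squeeze(-1)  # (L_p, L_r) contact probs

# Example: probabilities for a 120-residue protein vs. an 80-nucleotide RNA.
head = PairedInteractionHead()
contact_map = head(torch.randn(120, 1280), torch.randn(80, 640))   # shape (120, 80)
```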

Speaker Bio

Rahmatullah Roche, Ph.D., is a tenure-track Assistant Professor in the Department of Computer Science at Columbus State University. His research focuses on computational biology, applied machine learning, data science, and human-computer interaction, with a particular emphasis on macromolecular predictive modeling using advanced artificial intelligence techniques. Dr. Roche earned his Ph.D. in Computer Science from Virginia Tech in 2024. He holds a Master of Science in Computer Science and Software Engineering from Auburn University (2021) and a Bachelor of Science in Computer Science and Engineering from Bangladesh University of Engineering and Technology (BUET) (2016). He is interested in interdisciplinary collaborations to advance scientific discovery and technological innovation.

Talk slides (pdf)

Abstract

Drug discovery is a complex, costly, and time-intensive process, often taking over a decade and billions of dollars to bring a single drug to market. Advances in artificial intelligence (AI) have transformed this field by accelerating early stages of drug discovery, including target and hit identification and lead optimization. However, the accuracy, generalizability, and interpretability of these predictions remain major challenges, especially when experimental data are limited.

In this talk, I will present our research on developing data-driven AI frameworks that help predict and evaluate biomolecular interactions for drug discovery, focusing on interaction prediction and reliability estimation. I will first present our methods EquiPPIS and EquiPNAS, which predict protein–protein and protein–nucleic acid interaction sites using equivariant graph neural networks (EGNNs) and protein language models, and EquiRank, our approach for estimating the quality of protein–protein interfaces in multimeric structures, all of which can support target identification in early-stage drug discovery. Moving to hit identification, I will present how graph neural network-based frameworks can further help predict drug–target binding affinities and interactions to accelerate virtual screening and drug repurposing.
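For readers unfamiliar with the symmetry-aware building block behind these methods, below is a minimal sketch of an E(n)-equivariant graph layer in the style of Satorras et al. (2021). It is not the EquiPPIS, EquiPNAS, or EquiRank code; feature sizes, the neighbor graph, and the use of C-alpha coordinates are all assumptions.

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """One E(n)-equivariant message-passing layer (after Satorras et al., 2021).
    Node features h (e.g., per-residue language-model embeddings) are updated
    invariantly; coordinates x (e.g., C-alpha positions) are updated equivariantly."""

    def __init__(self, d_h=64, d_m=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * d_h + 1, d_m), nn.SiLU(),
                                      nn.Linear(d_m, d_m), nn.SiLU())
        self.coord_mlp = nn.Sequential(nn.Linear(d_m, d_m), nn.SiLU(),
                                       nn.Linear(d_m, 1))
        self.node_mlp = nn.Sequential(nn.Linear(d_h + d_m, d_m), nn.SiLU(),
                                      nn.Linear(d_m, d_h))

    def forward(self, h, x, edge_index):
        # h: (N, d_h) node features; x: (N, 3) coordinates; edge_index: (2, E)
        src, dst = edge_index
        rel = x[src] - x[dst]                                  # relative vectors, (E, 3)
        dist2 = (rel ** 2).sum(dim=-1, keepdim=True)           # squared distances, (E, 1)
        m = self.edge_mlp(torch.cat([h[src], h[dst], dist2], dim=-1))  # messages, (E, d_m)
        # Equivariant coordinate update: shift nodes along weighted relative vectors.
        dx = x.new_zeros(x.shape).index_add_(0, dst, rel * self.coord_mlp(m))
        # Invariant feature update: aggregate incoming messages, then transform.
        agg = h.new_zeros(h.size(0), m.size(1)).index_add_(0, dst, m)
        h_new = h + self.node_mlp(torch.cat([h, agg], dim=-1))
        return h_new, x + dx

# Example: 50 residues with a random 200-edge neighbor graph.
h, x = torch.randn(50, 64), torch.randn(50, 3)
edges = torch.randint(0, 50, (2, 200))
h_new, x_new = EGNNLayer()(h, x, edges)
```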

Together, these models show how AI-driven, symmetry-aware, and biologically informed frameworks can improve the efficiency and reliability of the early stages of drug discovery.

Speaker Bio

Md Hossain Shuvo is an Assistant Professor in the Department of Computer Science at Prairie View A&M University. His research lies at the intersection of computational biology, bioinformatics, machine learning, and data science. He develops data-driven computational frameworks to model and evaluate biomolecular interactions, with a focus on improving the reliability of predictions for biomolecules. His work has been published in leading journals and conferences, including Nucleic Acids Research, Bioinformatics, and ISMB. For more information, please visit https://mdhossainshuvo.github.io.

Abstract

Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a. CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning through a data-distribution lens and investigate whether CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning along three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment in which to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond the training distribution. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.

Speaker Bio

Chengshuai Zhao (he/him) is a second-year Ph.D. student in Computer Science at Arizona State University (ASU). He is also a Graduate Research Associate in the Data Mining and Machine Learning Lab (DMML), advised by Prof. Huan Liu. His research spans data mining, AI for science, representation learning, and large language models, with the goal of building systems that are more generalizable, transparent, and capable of uncovering knowledge at the frontiers of human understanding. He also serves as a program committee member and reviewer for leading conferences, including NeurIPS, SIGKDD, ACL, and AAAI.

Abstract

Molecular classification of cancer using multi-omics data is central to precision oncology, enabling the identification of distinct subtypes based on gene expression, mutations, and methylation, beyond traditional histology. While machine learning methods such as Support Vector Machines and Random Forests have shown success, they struggle with high dimensionality, multi-omics integration, and limited generalizability. Deep learning offers promise but faces challenges such as overfitting on small datasets, lack of interpretability, and poor cross-cohort performance.

We present TabPFN, a Prior-Data Fitted Network designed for tabular data, as a new foundation model for cancer classification. Unlike traditional models that require dataset-specific training, TabPFN uses in-context learning enabled by pretraining on synthetic data. This eliminates hyperparameter tuning, allows fast inference, performs well on small datasets, and includes built-in uncertainty estimation, making it well suited for clinical use.
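For a concrete sense of why this workflow is attractive, here is a minimal usage sketch of TabPFN's scikit-learn-style interface on synthetic data; exact constructor arguments vary across TabPFN versions, and this is a stand-in rather than the cohort setup used in our study.

```python
# Minimal TabPFN usage sketch on synthetic data (not the bladder cancer cohorts).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

# Stand-in for a small expression matrix: 200 samples x 50 selected genes, 3 subtypes.
X, y = make_classification(n_samples=200, n_features=50, n_informative=20,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()            # no dataset-specific training or hyperparameter tuning
clf.fit(X_train, y_train)           # "fitting" stores the context used for in-context learning
proba = clf.predict_proba(X_test)   # class probabilities double as uncertainty estimates
print(proba.shape, clf.predict(X_test)[:5])
```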

Applied to multiple bladder cancer gene expression cohorts, including RNA-seq data from The Cancer Genome Atlas (TCGA), TabPFN achieves competitive or superior performance with significantly reduced computational time. Our results show robust classification of tumor subtypes using gene expression alone.

Foundation models like TabPFN represent a paradigm shift in computational oncology, addressing long-standing barriers to clinical translation. We conclude by outlining future directions, including multi-omics integration, automated feature selection, and cancer-specific foundation models.

Speaker Bio

Dr. Seungchan Kim is a Chief Scientist and Executive Professor in the Department of Electrical and Computer Engineering and the Director of the CRI Center for Computational Systems Biology at Prairie View A&M University (PVAMU). Prior to this appointment, he was the Head of the Biocomputing Unit and an Associate Professor in the Integrated Cancer Genomics Division of the Translational Genomics Research Institute (TGen). He was one of the founding faculty members of TGen, which was founded in 2002 by Dr. Trent, then Scientific Director of the National Human Genome Research Institute at the National Institutes of Health, and led computational systems biology research at the institute. He was also an Assistant Professor in the School of Computing, Informatics, and Decision Systems Engineering (CIDSE) at Arizona State University from 2004 to 2011. Dr. Kim received B.S. and M.S. degrees in Agricultural Engineering from Seoul National University and a Ph.D. in Electrical Engineering from Texas A&M University. He completed his postdoctoral training at the Cancer Genetics Branch of the National Human Genome Research Institute.

Dr. Kim’s research interests include: 1) mathematical modeling of genetic regulatory networks, 2) development of computational methods to analyze large volumes of high-throughput multi-omics data to identify disease biomarkers, and 3) computational models to diagnose patients or predict patient outcomes, such as disease subtypes or drug response. His studies have strongly influenced the development of computational tools for studying the mechanisms underlying cancer development and for better understanding the molecular basis of cancer biology and biological systems.

Abstract

Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities, such as offenders, victims, locations, and criminal instruments, that are vital for legal and investigative purposes. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. We also observe that increasing the shot count improves the performance of all models, with more substantial gains for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
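To make the few-shot prompting setup concrete, the sketch below shows one illustrative way to query an LLM for the entity types named above using the OpenAI Python SDK; the actual prompts, label schema, and evaluation protocol of the paper may differ, and the example text is fictional.

```python
# Illustrative few-shot NER prompt; not the paper's actual prompts or label schema.
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

FEW_SHOT = """Extract entities as JSON with keys offender, victim, location, weapon.
Text: "John Doe opened fire at Maple Street Mall, wounding two shoppers with a handgun."
Entities: {"offender": ["John Doe"], "victim": ["two shoppers"],
           "location": ["Maple Street Mall"], "weapon": ["handgun"]}
"""

def extract_entities(text: str, model: str = "gpt-4o") -> str:
    """Send the few-shot examples plus a new passage and return the model's JSON answer."""
    prompt = FEW_SHOT + f'Text: "{text}"\nEntities:'
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(extract_entities("A gunman fired into a crowd near the city park on Friday."))
```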

Speaker Bio

Dr. Xishuang Dong is a member of the CRI Center for Computational Systems Biology and CREDIT, and an Associate Professor in the Department of Electrical and Computer Engineering at Prairie View A&M University (PVAMU). His research interests include: (1) machine learning-based computational systems biology; (2) biomedical information processing; (3) deep learning for big data analysis; and (4) natural language processing.