Abstract

Drug discovery is a complex, costly, and time-intensive process, often taking over a decade and billions of dollars to bring a single drug to market. Advances in artificial intelligence (AI) have transformed this field by accelerating early stages of drug discovery, including target and hit identification and lead optimization. However, the accuracy, generalizability, and interpretability of these predictions remain major challenges, especially when experimental data are limited.

In this talk, I will present our research on developing data-driven AI frameworks that predict and evaluate biomolecular interactions for drug discovery, focusing on interaction prediction and reliability estimation. I will first present our methods EquiPPIS and EquiPNAS, which predict protein–protein and protein–nucleic acid interaction sites using equivariant graph neural networks (EGNNs) and protein language models, and EquiRank, our approach for estimating the quality of protein–protein interfaces in multimeric structures; all of these can support target identification in early-stage drug discovery. Moving to hit identification, I will show how graph neural network-based frameworks can further help predict drug–target binding affinities and interactions to accelerate virtual screening and drug repurposing.

Together, these models show how AI-driven, symmetry-aware, and biologically informed frameworks can improve the efficiency and reliability of the early stages of drug discovery.

Speaker Bio

Md Hossain Shuvo is an Assistant Professor in the Department of Computer Science at Prairie View A&M University. His research lies at the intersection of computational biology, bioinformatics, machine learning, and data science. He develops data-driven computational frameworks to model and evaluate biomolecular interactions, with a focus on improving the reliability of predictions for biomolecules. His work has been published in leading journals and conferences, including Nucleic Acids Research, Bioinformatics, and ISMB. For more information, please visit https://mdhossainshuvo.github.io.

Abstract

Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning through a data distribution lens and investigate whether CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training; on this view, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning along three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.

Speaker Bio

Chengshuai Zhao (he/him) is a second-year Ph.D. student in Computer Science at Arizona State University (ASU). He is also a Graduate Research Associate in the Data Mining and Machine Learning Lab (DMML), advised by Prof. Huan Liu. His research spans data mining, AI for science, representation learning, and large language models, with the goal of building systems that are more generalizable, transparent, and capable of uncovering knowledge at the frontiers of human understanding. He also serves as a program committee member and reviewer for leading conferences, including NeurIPS, SIGKDD, ACL, and AAAI.

Abstract

Molecular classification of cancer using multi-omics data is central to precision oncology, enabling identification of distinct subtypes based on gene expression, mutations, and methylation—beyond traditional histology. While machine learning methods like Support Vector Machines and Random Forests have shown success, they struggle with high dimensionality, multi-omics integration, and limited generalizability. Deep learning offers promise but faces challenges such as overfitting on small datasets, lack of interpretability, and poor cross-cohort performance.

We present TabPFN, a Prior-data Fitted Network designed for tabular data, as a new foundation model for cancer classification. Unlike traditional models requiring dataset-specific training, TabPFN uses in-context learning via pretraining on synthetic data. This eliminates hyperparameter tuning, allows fast inference, performs well on small datasets, and includes built-in uncertainty estimation—making it ideal for clinical use.
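The workflow this enables is worth making concrete: the labeled training set and the unlabeled queries are presented together in a single forward pass, with no per-dataset gradient training and no hyperparameter search. The toy sketch below imitates only that interface with a stdlib distance-weighted vote over the context set; the real TabPFN is a transformer pretrained on synthetic tabular tasks, and the "subtype" labels here are illustrative, not actual cohort data.

```python
# Toy stand-in for the in-context tabular prediction workflow: the "model"
# conditions on (context_X, context_y) at inference time instead of being
# fit to them. This is NOT TabPFN itself, only an interface caricature.
import math

def in_context_predict(context_X, context_y, query_X):
    """Predict a label for each query by conditioning on the labeled context."""
    preds = []
    for q in query_X:
        # Score each class by inverse-distance-weighted votes from the context.
        scores = {}
        for x, y in zip(context_X, context_y):
            d = math.dist(q, x)
            scores[y] = scores.get(y, 0.0) + 1.0 / (d + 1e-9)
        preds.append(max(scores, key=scores.get))
    return preds

# Two illustrative "subtypes" separated along the first feature.
train_X = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]
train_y = ["luminal", "luminal", "basal", "basal"]
print(in_context_predict(train_X, train_y, [(0.1, 0.0), (1.0, 1.0)]))
# → ['luminal', 'basal']
```

The point of the sketch is the calling convention, not the predictor: because nothing is fit per dataset, inference is fast and there is nothing to tune, which is what makes the approach attractive for small clinical cohorts.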

Applied to multiple bladder cancer gene expression cohorts, including The Cancer Genome Atlas (TCGA) RNA-seq data, TabPFN achieves competitive or superior performance with significantly reduced computational time. Our results show robust classification of tumor subtypes using gene expression alone.

Foundation models like TabPFN represent a paradigm shift in computational oncology, addressing long-standing barriers to clinical translation. We conclude by outlining future directions, including multi-omics integration, automated feature selection, and cancer-specific foundation models.

Speaker Bio

Dr. Seungchan Kim is a Chief Scientist and Executive Professor in the Department of Electrical and Computer Engineering and the Director of the CRI Center for Computational Systems Biology at Prairie View A&M University (PVAMU). Prior to this appointment, he was Head of the Biocomputing Unit and an Associate Professor in the Integrated Cancer Genomics Division of the Translational Genomics Research Institute (TGen). He was one of the founding faculty members of TGen, founded in 2002 by Dr. Trent, then Scientific Director of the National Human Genome Research Institute at the National Institutes of Health, and led computational systems biology research at the institute. He was also an Assistant Professor in the School of Computing, Informatics, and Decision Systems Engineering (CIDSE) at Arizona State University from 2004 to 2011. Dr. Kim received B.S. and M.S. degrees in Agricultural Engineering from Seoul National University and a Ph.D. in Electrical Engineering from Texas A&M University. He received his postdoctoral training at the Cancer Genetics Branch of the National Human Genome Research Institute.

Dr. Kim’s research interests include: 1) mathematical modeling of genetic regulatory networks, 2) development of computational methods to analyze a multitude of high-throughput multi-omics data to identify disease biomarkers, and 3) computational models to diagnose patients or predict patient outcomes, for example, disease subtypes or drug response. His studies have strongly influenced the development of computational tools for studying the molecular mechanisms underlying cancer development, cancer biology, and biological systems.

Abstract

Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities that are vital for legal and investigative purposes, such as offenders, victims, locations, and criminal instruments. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. We also observe that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
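The few-shot setup described above can be sketched as a prompt-assembly step: a handful of annotated examples are serialized ahead of the new text, and raising the shot count simply means including more examples. The entity labels, example sentences, and function names below are illustrative assumptions, not the paper's actual dataset, prompts, or code.

```python
# Hypothetical few-shot NER prompt builder. Each example pairs a text with
# its gold (TYPE, span) annotations; the query text is appended last for
# the LLM to tag in the same format.
def build_ner_prompt(examples, query,
                     labels=("OFFENDER", "VICTIM", "LOCATION", "WEAPON")):
    """Assemble a few-shot NER prompt string from annotated examples."""
    lines = [
        "Extract entities of types " + ", ".join(labels)
        + " from the text. Answer as `TYPE: span` lines.",
        "",
    ]
    for text, entities in examples:
        lines.append(f"Text: {text}")
        for etype, span in entities:       # serialize gold annotations
            lines.append(f"{etype}: {span}")
        lines.append("")
    lines.append(f"Text: {query}")          # the new report to be tagged
    return "\n".join(lines)

few_shot = [
    ("The suspect, John Roe, opened fire at Main Street Mall with a rifle.",
     [("OFFENDER", "John Roe"), ("LOCATION", "Main Street Mall"),
      ("WEAPON", "rifle")]),
]
prompt = build_ner_prompt(few_shot, "Police identified the shooter as Jane Doe.")
print(prompt)
```

In a real pipeline this string would be sent to the model (e.g., GPT-4o or o1-mini) and the `TYPE: span` lines in the reply parsed back into structured records; that call is omitted here to keep the sketch self-contained.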

Speaker Bio

Dr. Xishuang Dong is a member of the CRI Center for Computational Systems Biology and CREDIT, and an Associate Professor in the Department of Electrical and Computer Engineering at Prairie View A&M University (PVAMU). His research interests include: (1) machine learning-based computational systems biology; (2) biomedical information processing; (3) deep learning for big data analysis; and (4) natural language processing.

Abstract

Clinical care is inherently multimodal, with medical image data collected throughout the patient’s journey. For example, a patient at risk of cancer will undergo an ultrasound-guided biopsy, with MRI, when available, revealing regions to be targeted due to their higher risk of harboring aggressive disease. This biopsy procedure seeks to collect tissue samples for pathology and will inform treatment strategies for the best outcomes. This common scenario provides unique opportunities for Artificial Intelligence (AI) methods to effectively integrate multimodal data and learn imaging signatures in patients with known outcomes, enabling early cancer detection for patients at risk. My research focuses on developing AI methods that bridge the gap between highly informative modalities, e.g., pathology or MRI, and lower-resolution modalities, e.g., ultrasound. These methods combine multimodal image registration, image feature fusion, and the integration of patient-specific data with population-level information, relying on AI approaches for effective integration. While the learning is done with multiple imaging modalities, the inference requires only the low-resolution modality, e.g., ubiquitous conventional ultrasound, with applications in low-resource settings. These methods are applied to detect cancer and its aggressive extent in various cancers, e.g., prostate, kidney, or breast.

Speaker Bio

Dr. Rusu is an Assistant Professor in the Department of Radiology and, by courtesy, the Departments of Urology and Biomedical Data Science at Stanford University, where she leads the Personalized Integrative Medicine Laboratory (PIMed). The PIMed Laboratory has a multidisciplinary direction and focuses on developing analytic methods for biomedical data integration, with a particular interest in multimodal fusion, e.g., radiology-pathology fusion to facilitate radiology image labeling, or MRI-ultrasound fusion for guiding procedures. These fusion approaches enable the downstream training of advanced multimodal machine learning models for pixel-level cancer detection and subtype identification. The laboratory's approaches have been applied in oncologic (prostate, breast, kidney) and non-oncologic applications.

Dr. Rusu received a Master of Engineering in Bioinformatics from the National Institute of Applied Sciences in Lyon, France. She continued her training at the University of Texas Health Science Center in Houston, where she received a Master of Science and PhD degree in Health Informatics for her work in biomolecular structural data integration of cryo-electron micrographs and X-ray crystallography models.

During her postdoctoral training at Rutgers and Case Western Reserve University, Dr. Rusu developed computational tools for the integration and interpretation of multimodal medical imaging data, focusing on prostate and lung cancers. Prior to joining Stanford, Dr. Rusu was a Lead Engineer and Medical Image Analysis Scientist at GE Global Research in Niskayuna, NY, where she was involved in developing analytic methods to characterize biological samples in microscopy images and pathologic conditions in MRI or CT.