Poster Abstract: Large Language Model–Based Extraction of Structured Clinical and Phenotypic Data from Postmortem Clinical Narratives

Ajeet Mandal, Staff Scientist, National Institute of Mental Health

Abstract

Introduction: Postmortem human brain research depends on detailed clinical and phenotypic characterization to enable meaningful molecular and genomic interpretation. However, critical demographic, medical, and psychiatric information is often embedded within unstructured narrative summaries, limiting scalability and systematic integration into analytic datasets.

Methods: We implemented a large language model (LLM)-based framework, built on GPT-family models, to extract structured clinical and phenotypic variables from unstructured postmortem case summaries. The system was designed to identify demographic characteristics, psychiatric history, substance use, medical comorbidities, medication exposure, and other relevant clinical descriptors. Prompt templates were iteratively refined to improve extraction specificity and the handling of context across heterogeneous narrative formats.
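
As a rough illustration of the extraction step described above, the sketch below fills a prompt template for one case summary and validates a JSON reply against a fixed set of fields. The field names, the template wording, and the helpers `build_prompt` and `parse_reply` are hypothetical; the abstract does not specify the actual schema or prompts, and the model call itself is omitted.

```python
import json

# Hypothetical target fields; the actual variable set is not given in the abstract.
FIELDS = ["age", "sex", "psychiatric_history", "substance_use",
          "medical_comorbidities", "medications"]

# Hypothetical template wording, for illustration only.
PROMPT_TEMPLATE = (
    "Extract the following fields from the case summary and return a JSON "
    "object with exactly these keys: {keys}. Use null when a field is not "
    "stated in the text.\n\nCase summary:\n{summary}"
)

def build_prompt(summary: str) -> str:
    """Fill the template with the field list and one narrative summary."""
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), summary=summary.strip())

def parse_reply(reply: str) -> dict:
    """Parse a model reply, keeping only expected keys and filling gaps with None."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {field: data.get(field) for field in FIELDS}

# A hand-written reply stands in for real model output.
example_reply = '{"age": 54, "sex": "M", "substance_use": ["alcohol"], "extra": 1}'
record = parse_reply(example_reply)
```

Restricting the parsed record to a known key list is one simple way to keep unexpected model output out of downstream tables; fields the model omits are carried as missing rather than guessed.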

Model outputs were transformed into standardized, analysis-ready variables suitable for downstream integration with genomic and molecular datasets. Performance was evaluated through comparison with expert-curated annotations to assess extraction accuracy and consistency across key phenotypic domains.
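
The standardization and evaluation steps might be sketched as follows, assuming a small controlled vocabulary for one variable and simple per-field percent agreement as the metric. Both are assumptions: the abstract does not name the vocabularies or the agreement statistic used, and a chance-corrected measure such as Cohen's kappa could be substituted. The names `SEX_MAP`, `normalize_sex`, and `field_agreement` are hypothetical, and the data are toy values, not study results.

```python
from typing import Optional

# Hypothetical synonym map for one variable; real vocabularies would be larger.
SEX_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}

def normalize_sex(value: Optional[str]) -> Optional[str]:
    """Map free-text sex labels onto a controlled vocabulary; unknowns become None."""
    if value is None:
        return None
    return SEX_MAP.get(value.strip().lower())

def field_agreement(extracted: list[dict], curated: list[dict], field: str) -> float:
    """Fraction of aligned cases where extracted and curated values match."""
    if len(extracted) != len(curated):
        raise ValueError("case lists must be aligned and equal in length")
    if not extracted:
        raise ValueError("no cases to compare")
    matches = sum(e.get(field) == c.get(field) for e, c in zip(extracted, curated))
    return matches / len(extracted)

# Toy data for illustration only; not results from the study.
extracted = [
    {"sex": normalize_sex("M")},
    {"sex": normalize_sex("Female")},
    {"sex": normalize_sex("unk")},  # unrecognized label becomes None
]
curated = [{"sex": "male"}, {"sex": "female"}, {"sex": "male"}]
agreement = field_agreement(extracted, curated, "sex")
```

Computing agreement per field, rather than per record, makes it easier to see which phenotypic domains are extracted reliably and which need further prompt refinement.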

Results: The LLM-based approach substantially reduced manual abstraction time while maintaining strong agreement with curated data for core clinical variables. Structured outputs enabled improved cohort stratification, harmonization across studies, and enhanced reproducibility in translational analyses.

Conclusion: These findings demonstrate the feasibility of leveraging large language models to convert unstructured clinical documentation into scalable, structured datasets, facilitating more efficient and systematic integration of phenotypic information in postmortem neuroscience research.