Poster Abstract: Machine-learning models using population survey data for chronic disease risk prediction: a scalable framework for preventive health

Reagan Mogire, Postdoctoral research fellow, NHGRI

Abstract

Background: Early identification of individuals at risk of chronic diseases is critical for prevention, yet most machine-learning risk models depend on electronic health records or biomarkers, limiting scalability. We aimed to develop an interpretable machine-learning framework for estimating chronic disease risk using only population-based survey data.

Methods: We trained gradient-boosted decision tree models on data from 2·38 million adults in the US Behavioral Risk Factor Surveillance System (2011–2015). Outcomes included myocardial infarction, coronary heart disease, stroke, chronic kidney disease, diabetes, and depression. Data were harmonised across survey years and split into training (80%) and test (20%) sets. Model performance was assessed using AUROC and calibration metrics, and predictor contributions were examined using SHAP-based feature attribution.

Findings: Models achieved strong discrimination (AUROC 0·79–0·86) and good calibration (Brier scores 0·031–0·110). Specificity exceeded 96% at a default threshold of 0·5. Age, body-mass index, hypertension history, mobility limitation, and self-rated health were key predictors. Predictive uncertainty was lower for cardiometabolic outcomes than for depression.
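For readers less familiar with the metrics reported in Findings, the sketch below shows how a Brier score and specificity at a 0·5 threshold are computed. The labels and predicted risks are made-up toy values, not study results, and the convention that a prediction at or above the threshold counts as positive is an assumption.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def specificity(y_true, y_prob, threshold=0.5):
    """Share of true negatives among all individuals without the outcome.

    Assumption: a predicted risk >= threshold is classified as positive.
    """
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    true_negatives = sum(1 for p in negatives if p < threshold)
    return true_negatives / len(negatives)


# Toy example (hypothetical values): four people without the outcome, two with it.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]

bs = brier_score(y_true, y_prob)
spec = specificity(y_true, y_prob)
print(f"Brier score: {bs:.3f}")
print(f"Specificity at 0.5: {spec:.2f}")
```

A lower Brier score indicates predicted risks that sit closer to the observed outcomes; specificity depends on the chosen threshold and would change if a value other than 0·5 were used.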

Interpretation: Survey-based, interpretable machine-learning models can provide accurate, well-calibrated risk estimates for multiple chronic diseases without reliance on clinical or laboratory data. This approach offers a scalable framework for population-level risk stratification and preventive health programmes, particularly in resource-limited settings.