Purpose: Gene expression profiling followed by unsupervised machine learning (ML) has failed to provide specific, clinically actionable subgroups. Standard strategies sample and profile whole tumours, but this produces a mixed signal from cancer cells, cancer-infiltrating cells and unwanted adjacent tissue variably captured during surgery. As most ML approaches rely on highly variable features, patient classifiers built from such heterogeneous and poorly reproducible profiles do not significantly improve patient outcomes. Such strategies also ignore decades of mechanistic cancer biology when deciding on the most informative features. We focused on muscle-invasive urothelial carcinoma of the bladder (MIBC; 40% 5-year survival). To derive cell-type-specific profiles, we performed RNAseq on histologically normal, pure human urothelium across three differentiation states (n=88). We constructed a co-expression network from the 5,000 most variable genes expressed in over 90% of samples, prioritising the connections of transcription factors (TFs). Perturbation experiments revealed 98 consistently hyper-connected TFs, which we used to stratify MIBC samples from The Cancer Genome Atlas.
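The gene selection and network step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a genes-by-samples expression matrix, treats "expressed" as a value above zero, and links genes whose absolute Pearson correlation meets a hypothetical threshold. The function names, the threshold and the definition of expression are all assumptions made for illustration, and the perturbation experiments are not modelled.

```python
import numpy as np

def select_genes(expr, min_frac_expressed=0.9, n_top=5000):
    """Return sorted row indices of the n_top most variable genes
    among those expressed in more than min_frac_expressed of samples.

    expr: genes x samples array. "Expressed" is assumed to mean > 0.
    """
    frac_expressed = (expr > 0).mean(axis=1)
    candidates = np.flatnonzero(frac_expressed > min_frac_expressed)
    variances = expr[candidates].var(axis=1)
    top = np.argsort(variances)[::-1][:n_top]  # highest variance first
    return np.sort(candidates[top])

def tf_degrees(expr, gene_idx, tf_idx, threshold=0.8):
    """Count network neighbours of each transcription factor.

    An edge joins two selected genes when |Pearson r| >= threshold
    (a hypothetical cut-off, not taken from the source).
    """
    corr = np.corrcoef(expr[gene_idx])
    adjacency = np.abs(corr) >= threshold
    np.fill_diagonal(adjacency, False)  # ignore self-correlation
    degree = adjacency.sum(axis=1)
    position = {int(g): i for i, g in enumerate(gene_idx)}
    return {int(g): int(degree[position[int(g)]])
            for g in tf_idx if int(g) in position}
```

Ranking the resulting degrees would give candidate hyper-connected TFs. Judging whether they are consistently hyper-connected would then require repeating the construction under perturbation, which this sketch leaves out.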
Conclusions: This identified a novel chemo- and immunotherapy-resistant MIBC subgroup driven by NRF2 overactivity (15% 2-year survival). Driver mutations that modify NRF2 activity are well described but, crucially, our stratification identified a subset of drivers which respond positively to NRF2 inhibitor therapy. This delivers a new treatment strategy for aggressive MIBC, and improves pan-cancer treatment selection specificity for NRF2-overactive cancers. By employing a biologically informed ML approach, we have improved the specificity and pan-cancer range of an existing therapeutic. We propose a return to hypothesis-driven identification of cancer subgroups, optimised to confidently assign patients to personalised treatment.