Subclonal reconstruction from bulk DNA sequencing is essential for understanding tumor evolution, therapeutic resistance, and clinical outcomes in cancer. Multi-sample data from longitudinal or multi-region sequencing of a patient’s tumor offer detailed insights into tumor evolution. Yet existing methods scale poorly, failing to reconstruct phylogenies for datasets with as few as ten subclones. Here we present EMulSI-Phy (Efficient Multi-Sample Inference of cancer Phylogeny), a computational framework designed to efficiently reconstruct subclonal architecture and evolutionary relationships across multiple tumor samples. EMulSI-Phy addresses key limitations of current approaches through optimized clustering algorithms and rule-based phylogenetic inference suitable for large whole-exome and whole-genome sequencing datasets.
Benchmarked against DPClust, PyClone, PyClone-VI, CONIPHER, Pairtree, and PhyClone on 504 simulated datasets and 400 patients from the TRACERx NSCLC cohort, EMulSI-Phy achieved the highest task completion rates (100% for clustering, 98.4% for phylogeny reconstruction) and the lowest compute cost while producing results consistent with ground truth or published reconstructions. A one-at-a-time parameter sensitivity analysis identified the minimum SNV threshold and the clustering distance metric as the most influential parameters. We further show that a consensus approach across input perturbations increases cluster silhouette width by ~30% and reduces the magnitude of sum rule violations by ~20% relative to a single full-data run. Collectively, these results indicate that EMulSI-Phy delivers competitive accuracy and improved output stability on datasets where other tools fail. EMulSI-Phy is distributed as a Dockerised R package with a Nextflow pipeline for reproducible multi-tool analysis.