Poster Abstract: Transfer Learning of Supervised Sequence-to-Function Models Enables Cell Type-Specific Regulatory Sequence Prediction

Arif Rather, Research Fellow, Boston Children's Hospital

Abstract

Deep learning models trained on DNA sequences have demonstrated strong performance in predicting chromatin accessibility, providing valuable insights into gene regulation and the functional impact of genetic variants in regulatory regions. Large-scale models are trained on extensive genomic datasets across different modalities; however, they do not encompass all cell types or experimental conditions. Consequently, studying chromatin accessibility in new biological contexts often requires training additional models, which is computationally intensive and resource-demanding. We developed a kidney tissue-specific, state-of-the-art sequence model using scATAC-seq data. Specifically, we apply transfer learning to the recently published state-of-the-art models Enformer and Borzoi, using new single-cell ATAC-seq data generated from kidney tissue. Our results demonstrate that the transferred model effectively learns cell type-specific accessibility changes. Furthermore, the transferred model outperforms existing state-of-the-art general-purpose and specialized methods in predicting regulatory variant effects and in prioritizing GWAS fine-mapped causal variants. For instance, in predicting cell type-specific regulatory variant effects in proximal tubule cells, the transferred model achieves a relative improvement in AUROC of 18.5% over Enformer, 32.3% over Borzoi, 23% over Sei, and 14.2% over ChromeKid (a specialized model for kidney). Through systematic sequence context ablation, we find that an input window of approximately 34-64 kbp is sufficient for optimal regulatory variant effect prediction, capturing the essential local regulatory context within just 6-12% of the full model input. We further benchmarked single-task against multi-task transfer learning, demonstrating that the multi-task paradigm achieves comparable predictive performance at substantially reduced computational cost.
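
As an illustration of the sequence context ablation described above, the following is a minimal sketch, not the implementation used in this work. It assumes that context is ablated by keeping a window centered on the input and replacing the flanks with the ambiguous base N, and it assumes a full model input of 524,288 bp, a value chosen because it is consistent with the 6-12% figure quoted above. The names central_window, mask_flanks, window_fraction, and FULL_INPUT_BP are hypothetical.

```python
# Assumed full input length in base pairs (hypothetical constant for this sketch).
FULL_INPUT_BP = 524_288


def central_window(seq_len, window):
    """Return (start, end) indices of a window centered in a sequence."""
    if not 0 < window <= seq_len:
        raise ValueError("window must be between 1 and seq_len")
    start = (seq_len - window) // 2
    return start, start + window


def mask_flanks(seq, window, fill="N"):
    """Keep the central window of seq and replace both flanks with fill."""
    start, end = central_window(len(seq), window)
    return fill * start + seq[start:end] + fill * (len(seq) - end)


def window_fraction(window_bp, full_bp=FULL_INPUT_BP):
    """Fraction of the full model input covered by the retained window."""
    return window_bp / full_bp


if __name__ == "__main__":
    # Toy 10 bp sequence with a 4 bp central window retained.
    print(mask_flanks("ACGTACGTAC", 4))
    # Share of the assumed full input covered by the reported window sizes.
    for window_bp in (34_000, 64_000):
        print(window_bp, round(100 * window_fraction(window_bp), 1))
```

Under the assumed input length, windows of 34 kbp and 64 kbp cover roughly 6.5% and 12.2% of the input. Masking with N keeps the input length fixed, so the same model can be evaluated at every window size without architectural changes; cropping or shuffling the flanks are alternative ablation strategies.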