Poster Abstract: Haplograph: a targeted long-read assembly, variant calling, phasing and methylation analysis toolkit

Hang Su, Postdoc, Broad Institute of MIT and Harvard

Abstract

Background: Long-read sequencing produces reads longer than 15 kb that can span entire genes, enabling phased, full-length gene assemblies and the discovery of novel allele structures, while also carrying inherent 5mC modification signals. Targeted long-read sequencing, coupled with fast and efficient computational tools for local assembly, phased variant calling, and methylation profiling, is critical for large-scale genetic studies and clinical diagnosis, offering high accuracy, relatively low cost, and a low storage burden. However, comprehensive computational tools for these analyses remain limited.

Methods: We developed Haplograph, a graph-based tool that integrates long-read targeted assembly, variant calling and phasing, and methylation analysis into a single, automated toolkit. Haplograph identifies confident haplotype sequences as nodes in continuous genomic windows and ligates adjacent nodes using the reads that overlap them. Heterozygous nodes, identified by their allele frequencies and their co-occurrence among the read sets, are used to perform physical phasing in the graph. The major haplotypes are constructed by traversing the graph and are emitted as standard FASTA files. Phased germline variants, along with sub-clonal variants that are not on the phased assemblies, are output as standard VCF files. Methylation signals on each phased haplotype are further aggregated into the standard methyl-bed format.
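
The windowed graph construction and read-based phasing described above can be illustrated with a minimal Python sketch. It is not Haplograph's implementation: the function names (build_nodes, link_nodes, phase), the min_support threshold, the toy reads, and the assumption that windows are non-overlapping and that each read has already been cut into per-window sequences are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_nodes(read_windows, min_support=2):
    """Group reads by the sequence they carry in each window.

    read_windows: {read_id: {window_index: sequence}} (hypothetical input layout).
    Returns {window_index: {sequence: set(read_ids)}}, keeping only sequences
    supported by at least min_support reads (an assumed confidence filter).
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for read_id, windows in read_windows.items():
        for window, seq in windows.items():
            grouped[window][seq].add(read_id)
    nodes = {}
    for window, seqs in grouped.items():
        kept = {s: r for s, r in seqs.items() if len(r) >= min_support}
        if kept:  # windows with no confident node are dropped in this sketch
            nodes[window] = kept
    return nodes


def link_nodes(nodes):
    """Connect nodes in adjacent windows, weighted by the reads they share."""
    edges = {}
    for window in sorted(nodes):
        if window + 1 not in nodes:
            continue
        for s1, r1 in nodes[window].items():
            for s2, r2 in nodes[window + 1].items():
                shared = len(r1 & r2)
                if shared:
                    edges[((window, s1), (window + 1, s2))] = shared
    return edges


def phase(nodes):
    """Walk the windows once per haplotype, choosing nodes by read co-occurrence.

    Windows with more than one node are treated as heterozygous. Each haplotype
    is seeded from one node of the first heterozygous window and, at every
    window, takes the node sharing the most reads with the haplotype so far.
    """
    windows = sorted(nodes)
    het = [w for w in windows if len(nodes[w]) > 1]
    if not het:
        return ["".join(next(iter(nodes[w])) for w in windows)]
    haplotypes = []
    for _, seed_reads in sorted(nodes[het[0]].items()):
        hap_reads = set(seed_reads)
        path = []
        for w in windows:
            best = max(sorted(nodes[w]), key=lambda s: len(nodes[w][s] & hap_reads))
            path.append(best)
            if len(nodes[w]) > 1:
                # Only heterozygous nodes add phase information.
                hap_reads |= nodes[w][best]
        haplotypes.append("".join(path))
    return haplotypes


# Toy data: windows 1 and 3 are heterozygous; reads r1 and r4 span both.
example_reads = {
    "r1": {0: "AC", 1: "GT", 2: "AA", 3: "CC"},
    "r2": {0: "AC", 1: "GT"},
    "r3": {2: "AA", 3: "CC"},
    "r4": {0: "AC", 1: "GA", 2: "AA", 3: "CG"},
    "r5": {0: "AC", 1: "GA"},
    "r6": {2: "AA", 3: "CG"},
}
example_nodes = build_nodes(example_reads)
example_edges = link_nodes(example_nodes)
example_haplotypes = phase(example_nodes)
```

In this toy case the two heterozygous windows are separated by a homozygous one, so the phase is carried only by the reads that span both sites. A production tool would additionally need to handle sequencing errors, allele-frequency thresholds, and windows without a confident node, none of which this sketch attempts.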

Results: We benchmarked Haplograph on the Human Pangenome Reference Consortium year 2 (HPRC-Y2) datasets, comparing its output with the released high-quality assemblies. The benchmarking experiments indicate that Haplograph delivers performance equal or superior to that of current state-of-the-art tools across various metrics of accuracy and computational efficiency.

Conclusion: Haplograph addresses a critical unmet need in genomics by enabling precise and versatile analysis of targeted genomic regions. It facilitates scalable computational workflows and the creation of robust reference panels for these regions, expanding the boundaries of clinically actionable genomic analysis.

Code Availability: https://github.com/broadinstitute/haplograph