New method for resolving highly similar paralogous genes

Previously inaccessible regions of the human genome can now be analyzed using HiFi sequencing.
A collaborative team led by PacBio (CA, USA), GeneDx (MD, USA) and a consortium of genomics experts has developed a new method for analyzing notoriously complex regions of the human genome. The method applies the informatics tool Paraphase in combination with HiFi long-read sequencing to provide a single framework for resolving highly similar paralogous genes.
Population-wide whole-genome sequencing studies – based on short reads – have enabled variants in over 90% of the human genome to be characterized. However, some regions and variant classes remain largely inaccessible to short reads, with a large portion of these occurring within segmental duplications (SDs).
SDs are duplicated regions of the genome with high sequence similarity, which presents a persistent challenge for genetic analysis with short-read sequencing. These regions contain hundreds of medically relevant genes, including those involved in spinal muscular atrophy (SMN1/SMN2), congenital adrenal hyperplasia (CYP21A2) and red-green color blindness (OPN1LW/OPN1MW). As such, there remains a need to fully characterize these genes.
The team previously developed and validated a phasing approach, Paraphase, that accurately identifies haplotypes and their paralogs; however, the study was limited to one difficult region, SMN1/SMN2. The team has now extended and paired Paraphase with HiFi long-read sequencing to analyze 316 previously inaccessible paralogous genes that fall into 160 long (>10 kb) SD regions across the genome.
Paraphase resolves highly similar genes by realigning HiFi reads to one ‘archetype gene’ – a relevant gene chosen to represent all gene copies and paralogs. The aligned reads are then phased into haplotypes for variant calling. Applying this method to 259 individuals from five ancestral populations, the team gathered insights into the genetic diversity of SD regions across the populations.
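To make the idea of phasing realigned reads into haplotypes more concrete, the toy Python sketch below groups reads by the bases they carry at variable positions. It is an illustration only and not the tool's actual algorithm: the function names (`variant_sites`, `phase_reads`), the `min_count` threshold and the example reads are all hypothetical, and the reads are assumed to be error-light, equal in length and already aligned to a single archetype sequence without insertions or deletions.

```python
from collections import Counter, defaultdict


def variant_sites(reads, min_count=2):
    """Return positions where at least two different bases are each
    supported by min_count or more reads (hypothetical threshold)."""
    sites = []
    for pos in range(len(reads[0])):
        counts = Counter(read[pos] for read in reads)
        supported = [base for base, n in counts.items() if n >= min_count]
        if len(supported) >= 2:
            sites.append(pos)
    return sites


def phase_reads(reads, min_count=2):
    """Group reads by their bases at the variable sites.

    Each distinct combination of bases is treated as one haplotype;
    combinations seen in fewer than min_count reads are discarded.
    """
    sites = variant_sites(reads, min_count)
    groups = defaultdict(list)
    for read in reads:
        key = "".join(read[pos] for pos in sites)
        groups[key].append(read)
    haplotypes = {k: v for k, v in groups.items() if len(v) >= min_count}
    return sites, haplotypes


# Made-up reads, all aligned to the same (hypothetical) archetype sequence.
reads = (
    ["ACGTACGTAC"] * 3    # copy A
    + ["ACCTACGTTC"] * 3  # copy B: differs from A at positions 2 and 8
    + ["ACGTACGTTC"] * 2  # copy C: differs from A at position 8 only
    + ["ACGTAAGTAC"]      # copy A read with one isolated error at position 5
)

sites, haplotypes = phase_reads(reads)
for key, members in sorted(haplotypes.items()):
    print(key, len(members))
```

In this example the isolated error at position 5 is ignored because only one read supports it, so the nine reads fall into three haplotypes defined by positions 2 and 8. Real HiFi data would also require handling of indels, partial read overlap and sequencing errors, which this sketch deliberately leaves out.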
Key findings from the analysis include the identification of 11 previously undetected de novo events (seven single nucleotide variants and four gene conversion events) across 36 parent-offspring trios. In addition, the team identified 23 paralog groups with extremely low genetic diversity between genes and paralogs, suggesting that extensive gene conversion and unequal crossing over may contribute to highly similar gene copies.
“With Paraphase and HiFi sequencing, we now have a scalable way to accurately genotype SD-encoded genes across diverse populations, filling in long-standing gaps in genomic research and improving our ability to identify disease-linked variants,” commented senior author Michael A. Eberle (PacBio).
The team also applied their method to characterize the extensive genetic diversity of nine medically relevant genes that have been challenging to genotype. They found a previously overlooked duplication allele in the CYP21A2/CYP21A1P region that carries both a functional CYP21A2 copy and a pseudogene CYP21A2(Q319X) copy, which could lead to misclassification with standard testing.
“This study demonstrates that when we use HiFi sequencing we see a much richer and more complex picture of genetic variation,” said lead author Xiao Chen (PacBio).
The team has compiled a database of variant allele frequencies collected from the study, with the aim of expanding the database as more results are obtained from Paraphase and HiFi sequencing.