
Long-read RNA sequencing technologies, which can read RNA sequences more than 10,000 bases long, have the potential to propel our understanding of the transcriptome by overcoming the limitations of short-read sequencing. However, long-read RNA sequencing has a much higher per-base error rate than its short-read counterpart: between 5% and 20%, compared with just 0.1% for short reads. To improve the viability of long-read RNA sequencing for diagnostics and disease research, a team from the Children’s Hospital of Philadelphia (CHOP) has developed a new computational tool that yields more accurate information from error-prone RNA sequencing data.
The tool is named Error Statistics PRomoted Evaluator of Splice Site Options, or ESPRESSO, and enables the accurate discovery and quantification of RNA isoforms using long-read sequencing data. It does this by analyzing the alignment of all long RNA sequencing reads of a given gene to a corresponding reference genome in order to identify potential splice junction sites and areas of “perfect alignment” around these sites. By annotating “high-confidence” splice junctions, analyzing the error profiles of individual long reads, and borrowing information across all long reads for the given gene, ESPRESSO is able to identify splice junctions and RNA isoforms with high reliability, as well as discover novel isoforms not previously documented in existing databases.
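To make the idea of "borrowing information across reads" concrete, here is a minimal toy sketch in Python. It is not ESPRESSO's actual algorithm or API: the `Junction` type, the function names, and the `min_support` and `max_shift` thresholds are all hypothetical, and the sketch leaves out the per-read error statistics that the real tool relies on. It only illustrates the general pattern of trusting junctions that are annotated or supported by cleanly aligned reads, then snapping nearby uncertain junctions onto them before counting isoforms.

```python
from collections import Counter
from typing import NamedTuple


class Junction(NamedTuple):
    start: int           # donor-side coordinate of the splice junction
    end: int             # acceptor-side coordinate of the splice junction
    perfect_flank: bool  # True if the read aligns without errors around both sites


def high_confidence_junctions(reads, annotated, min_support=2):
    """Annotated junctions plus those seen with perfect flanks in enough reads."""
    support = Counter(
        (j.start, j.end) for read in reads for j in read if j.perfect_flank
    )
    confident = set(annotated)
    confident.update(site for site, n in support.items() if n >= min_support)
    return confident


def correct_read(read, confident, max_shift=10):
    """Snap each uncertain junction to the closest confident one within max_shift."""
    corrected = []
    for j in read:
        site = (j.start, j.end)
        if site in confident:
            corrected.append(site)
            continue
        candidates = [
            c for c in confident
            if abs(c[0] - j.start) <= max_shift and abs(c[1] - j.end) <= max_shift
        ]
        if candidates:
            # Closest candidate by total coordinate shift; ties broken by coordinate.
            best = min(
                candidates,
                key=lambda c: (abs(c[0] - j.start) + abs(c[1] - j.end), c),
            )
            corrected.append(best)
        else:
            corrected.append(site)  # nothing nearby: keep as a possible novel junction
    return tuple(corrected)


def count_isoforms(reads, annotated, min_support=2, max_shift=10):
    """Group reads by their corrected junction chain (a stand-in for an isoform)."""
    confident = high_confidence_junctions(reads, annotated, min_support)
    return Counter(correct_read(r, confident, max_shift) for r in reads)


# Made-up reads for a single hypothetical gene.
reads = [
    [Junction(100, 200, True), Junction(300, 400, True)],
    [Junction(100, 200, True), Junction(300, 400, True)],
    [Junction(102, 200, False), Junction(300, 397, False)],  # noisy copy of the above
    [Junction(100, 200, True), Junction(500, 600, False)],   # unsupported second junction
]
counts = count_isoforms(reads, annotated={(100, 200)})
for chain, n in counts.items():
    print(chain, n)
```

In this made-up example the third read, whose junctions are a few bases off, is reassigned to the same junction chain as the first two reads, while the fourth read keeps its unmatched junction as a separate candidate.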
The researchers tested ESPRESSO’s performance using both simulated data and data from real biological samples. When compared with other computational tools used to analyze long-read RNA sequencing data, ESPRESSO exhibited the highest precision and sensitivity for transcript isoform discovery, as well as the highest accuracy in transcript isoform quantification. Additionally, the team generated more than 1 billion long RNA sequencing reads, covering 30 human tissue types and three human cell lines, and analyzed them with ESPRESSO, providing a valuable resource for studying human transcriptome variation at higher resolution than short-read sequencing allows. The study was published in Science Advances.
“ESPRESSO addresses a long-standing problem of long-read RNA sequencing and could usher in new opportunities of discovery,” said senior author Yi Xing. “We envision that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of cells in various biomedical and clinical settings.”
Data from long-read RNA sequencing could be used to improve diagnoses of rare genetic diseases, as well as to discover new therapeutic targets for diseases such as cancer.
“We are probably at an inflection point in how we discover and analyze RNA molecules,” said Xing. “The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are urgently needed.”