Penguin: A Tool for Predicting Pseudouridine Sites in Direct RNA Nanopore Sequencing Data

Hassan, Doaa; Acevedo, Daniel; Daulatabad, Swapna Vidhur; Mir, Quoseena; Janga, Sarath Chandra

dc.contributor.author	Hassan, Doaa
dc.contributor.author	Acevedo, Daniel
dc.contributor.author	Daulatabad, Swapna Vidhur
dc.contributor.author	Mir, Quoseena
dc.contributor.author	Janga, Sarath Chandra
dc.contributor.department	BioHealth Informatics, School of Informatics and Computing
dc.date.accessioned	2024-01-31T17:46:29Z
dc.date.available	2024-01-31T17:46:29Z
dc.date.issued	2022
dc.description.abstract	Pseudouridine is one of the most abundant RNA modifications, formed when pseudouridine synthase proteins catalyze the isomerization of uridine. It plays an important role in many biological processes and has been reported to have applications in drug development. Recently, single-molecule sequencing techniques such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies have enabled direct detection of RNA modifications on the molecule being sequenced. In this study, we introduce a tool called Penguin that integrates several machine learning (ML) models to identify RNA pseudouridine sites on Nanopore direct RNA sequencing reads. Pseudouridine sites were identified on single-molecule data from direct RNA sequencing, which yielded 723K reads in the HEK293 cell line and 500K reads in the HeLa cell line. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore platform and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which, in the testing phase, predict whether a signal is modified by the presence of a pseudouridine site. Penguin includes several predictors: support vector machine (SVM), random forest (RF), and neural network (NN). Results on the two benchmark datasets, for the HEK293 and HeLa cell lines, show strong performance in both random split testing and independent validation testing. In random split testing, Penguin identifies pseudouridine sites with an accuracy of 93.38% when SVM is applied to the HEK293 benchmark dataset. In independent validation testing, Penguin achieves an accuracy of 92.61% when SVM is trained on the HEK293 benchmark dataset and tested on the HeLa benchmark dataset.
Penguin thus outperforms existing pseudouridine predictors in the literature, achieving 16% higher accuracy than those predictors in independent validation testing. Using Penguin to predict pseudouridine sites revealed a significant enrichment of “regulation of mRNA 3’-end processing” in the HEK293 cell line and of “positive regulation of transcription from RNA polymerase II promoter involved in cellular response to chemical stimulus” in the HeLa cell line. Penguin software and models are available on GitHub at https://github.com/Janga-Lab/Penguin and can be readily used to predict Ψ sites from Nanopore direct RNA sequencing datasets.
dc.eprint.version	Author's manuscript
dc.identifier.citation	Hassan D, Acevedo D, Daulatabad SV, Mir Q, Janga SC. Penguin: A tool for predicting pseudouridine sites in direct RNA nanopore sequencing data. Methods. 2022;203:478-487. doi:10.1016/j.ymeth.2022.02.005
dc.identifier.uri	https://hdl.handle.net/1805/38245
dc.language.iso	en_US
dc.publisher	Elsevier
dc.relation.isversionof	10.1016/j.ymeth.2022.02.005
dc.relation.journal	Methods
dc.rights	Publisher Policy
dc.source	PMC
dc.subject	RNA modifications
dc.subject	Pseudouridine
dc.subject	Nanopore
dc.title	Penguin: A Tool for Predicting Pseudouridine Sites in Direct RNA Nanopore Sequencing Data
dc.type	Article
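
The abstract distinguishes two evaluation protocols: random split testing (train and test on held-out parts of one cell line) and independent validation testing (train on one cell line, test on another). The sketch below illustrates only that distinction. It is not Penguin's code: the data are synthetic, the two features (signal mean and standard deviation) are hypothetical, and a nearest-centroid classifier stands in for the SVM, RF, and NN predictors. All function names are illustrative assumptions.

```python
import random
import statistics


def make_dataset(n, shift, seed):
    """Synthetic stand-in for per-site features: [signal mean, signal std].

    The features and their distributions are hypothetical, not Penguin's.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = modified site, 0 = unmodified
        mean = rng.gauss(100.0 + shift + 5.0 * label, 2.0)
        std = rng.gauss(3.0 + 0.5 * label, 0.5)
        rows.append(([mean, std], label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid model: a simple stand-in for SVM/RF/NN."""
    centroids = {}
    for label in (0, 1):
        feats = [x for x, y in rows if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*feats)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(centroids, x) == y for x, y in rows)
    return correct / len(rows)


def random_split(rows, test_fraction, seed):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Cell line A plays the role of the training cell line, B the unseen one.
cell_a = make_dataset(400, shift=0.0, seed=1)
cell_b = make_dataset(300, shift=0.5, seed=2)

# Random split testing: train and test within cell line A.
train_a, test_a = random_split(cell_a, 0.2, seed=3)
random_split_acc = accuracy(fit_centroids(train_a), test_a)

# Independent validation testing: train on all of A, test on B.
independent_acc = accuracy(fit_centroids(cell_a), cell_b)

print(f"random split accuracy: {random_split_acc:.4f}")
print(f"independent validation accuracy: {independent_acc:.4f}")
```

The printed accuracies describe the synthetic data only and say nothing about Penguin's reported results. The small `shift` applied to the second dataset is a crude way of mimicking differences between cell lines.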

Files

Original bundle

Name: nihms-1785721.pdf
Size: 1.4 MB
Format: Adobe Portable Document Format

Collections

Open Access Policy Articles
Department of Biomedical Engineering and Informatics Works