Swift: An R package for novel, conserved transcript discovery

Howard, Morgan; Quoseena, Mir; Janga, Sarath Chandra

Files

BISP-Howard&Quoseena-OCR.pdf (648.38 KB)

Authors

Howard, Morgan

Quoseena, Mir

Janga, Sarath Chandra

Abstract

Recent developments in short-read RNA sequencing technologies have enabled transcriptome-wide analysis of both model and non-model organisms. However, transcriptomic annotations resulting from short-read technologies have frequently led to several bioinformatics challenges, such as mis-assembly of transcripts, poor mapping of reads, and incorrect or incomplete annotation of the transcribed regions. Hence, there exists a critical gap in improving the transcriptomic annotations in both human and most model organisms. Fourth-generation single-molecule sequencing technologies such as nanopore and PacBio sequencing enable the discovery of full-length isoforms, including novel transcribed regions. However, RNA-seq data from these platforms are usually available in an unprocessed form, and tools are lacking for discovering sequences conserved between species and for determining the degree to which those sequences are presently annotated. To address this need and to obtain a comprehensive analysis of these newly discovered regions from single-molecule long-read sequencing datasets, we developed Swift, an R package for querying NCBI databases to determine sequence annotation, novelty, and conservation across species. Swift uses a collection of functions to take full advantage of NCBI's local BLAST program to query a locally installed database. Swift extracts the unmapped regions from the BAM file and generates a FASTA file to be used with BLAST. As an additional input option, the user can directly supply a FASTA file already containing unmapped reads to generate a comprehensive report for the newly identified transcripts in a given experiment. The user can specify the databases to search against, the number of species a sequence must be present in, the percent similarity of the submitted sequences in reference to the BLAST query sequence, and any known annotations for retrieved sequences. Several output files are generated after the analysis is complete to provide a report of the tool's analysis.
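The filtering step described in the abstract (keeping sequences that meet a percent similarity threshold in a minimum number of species) can be illustrated with a small sketch. This is not Swift's code or API, and it is written in Python rather than R purely for illustration. It assumes BLAST+ tabular output produced with `-outfmt "6 qseqid sseqid pident length evalue sscinames"`. The function name `conserved_queries`, its parameters, the default thresholds, and the example hits are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed column layout, matching a BLAST+ run with
#   -outfmt "6 qseqid sseqid pident length evalue sscinames"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "sscinames"]


def conserved_queries(tabular_text, min_identity=90.0, min_species=2):
    """Return {query_id: sorted species list} for queries whose hits at or
    above min_identity span at least min_species distinct species."""
    species_by_query = defaultdict(set)
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=COLUMNS, delimiter="\t"
    )
    for row in reader:
        if float(row["pident"]) >= min_identity:
            # sscinames may hold several names separated by ';'
            for name in row["sscinames"].split(";"):
                species_by_query[row["qseqid"]].add(name.strip())
    return {
        query: sorted(species)
        for query, species in species_by_query.items()
        if len(species) >= min_species
    }


# Made-up hits for illustration only; the identifiers are not real results.
example = "\n".join([
    "read1\tNM_0001\t98.5\t500\t1e-100\tHomo sapiens",
    "read1\tXM_0002\t93.0\t480\t1e-90\tPan troglodytes",
    "read2\tNM_0003\t99.0\t300\t1e-80\tHomo sapiens",
    "read3\tXM_0004\t70.0\t200\t1e-10\tMus musculus",
])

print(conserved_queries(example))
# prints {'read1': ['Homo sapiens', 'Pan troglodytes']}
```

In this sketch only `read1` is reported, because it is the only query with sufficiently similar hits in two species. A query with no hits at all would be a candidate novel sequence, which the sketch does not attempt to detect.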

Description

Digitized for IUPUI ScholarWorks inclusion in 2021.

Type

Poster

Permanent Link

https://hdl.handle.net/1805/28535

Collections

Department of Biomedical Engineering and Informatics Works
Sarath Janga
