Prediction and Evolutionary Analysis of RNA Binding Proteins Across Eukaryotic Genomes
Abstract
RNA Binding Proteins (RBPs) are key players in several post-transcriptional regulatory mechanisms and mediate the metabolism of RNA in the cell. High-throughput technologies such as cross-linking followed by Mass Spectrometry (MS) have led to the identification of a large number of RBPs and of the RNA binding domains (RBDs) they contain. Although experimental methods have expanded the known repertoire of RBPs in model systems, the catalog of RBPs across eukaryotic species remains far from complete. In this study, we developed a computational pipeline to predict RNA binding proteins using RNA binding domains and protein homology information. Our approach used RNA-binding peptides from 529 RBPs, together with a dataset of 1344 experimentally known human RBPs, as a reference set. Domain-based predictions using HMMER were integrated with homology information to produce genome-wide predictions of RBPs across 69 species. Benchmarking these predictions against mouse genes annotated as RBPs gave a precision of 60% and a recall of 75%. An average of 1750 RBPs was identified across eukaryotes, including mammals, birds, amphibians, insects and worms. Although RBPs were found to be highly conserved across the phylogenetic spectrum, a few lower-order species such as lamprey, Caenorhabditis elegans and yeast encoded fewer RBPs in their genomes, suggesting divergence of the RBP repertoire in distant relatives. In contrast to Transcription Factors (TFs) and kinases, the number of genes encoding RBPs increased with genome size (p-value: 0.0013). Although the majority (56%) of the RNA binding regions could be mapped to domains present in the Pfam database, a small fraction of the unmapped novel domains were detected in more than 1% of the protein-coding genes analyzed across genomes. A co-occurrence network of RBDs revealed prominent enrichment of Nup160, WD40 and RRM domains with other RBDs across eukaryotic genomes.
Our prediction pipeline and the corresponding repertoire of RBPs should serve as a valuable resource for studying post-transcriptional regulatory networks across eukaryotic species.
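
The integration and benchmarking steps described in the abstract can be illustrated with a minimal sketch. It is not the pipeline used in the study. It assumes that "integration" means taking the union of domain-based and homology-based gene sets, which the abstract does not specify, and it assumes the standard definitions of precision and recall. The function names and the toy gene identifiers are hypothetical, and the toy sets were chosen only so that the arithmetic is easy to follow.

```python
def integrate_predictions(domain_hits, homology_hits):
    """Combine two evidence sources into one prediction set.

    Assumption: a gene is predicted as an RBP if either source supports it
    (a union). The value stored per gene records which source(s) support it.
    """
    evidence = {}
    for gene in domain_hits:
        evidence.setdefault(gene, set()).add("domain")
    for gene in homology_hits:
        evidence.setdefault(gene, set()).add("homology")
    return evidence


def precision_recall(predicted, annotated):
    """Return (precision, recall) of predicted genes against annotated RBPs."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data (hypothetical identifiers, not results from the study).
    domain_hits = {"g1", "g2", "g3"}        # e.g. genes with an RBD match
    homology_hits = {"g3", "g4", "g5"}      # e.g. homologs of reference RBPs
    annotated = {"g1", "g3", "g4", "g6"}    # benchmark annotation

    evidence = integrate_predictions(domain_hits, homology_hits)
    precision, recall = precision_recall(evidence.keys(), annotated)
    print(f"predicted={len(evidence)} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, five genes are predicted and three of them are among the four annotated genes. Requiring support from both sources (an intersection) instead of either would typically trade recall for precision.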