MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny
Author(s): Janani Ravi, Jacob Dennis Krol
Affiliation(s): University of Colorado Anschutz
Background: The landscape of protein analysis software and databases is fragmented and siloed: e.g., the BLAST suite for homology searches, InterPro for domain scans, and several more packages for individual sequence feature analyses. Biologists often create in-house pipelines for data cleanup, wrangling, and interoperability, and for summarizing and visualizing the disparate outputs. Notably, there is an absence of methods/tools that integrate evolutionary methods with multiscale protein features. Furthermore, as demonstrated by the development of such in-house pipelines and the overall growth of bioinformatics, there is an increasing demand for programmatic access to analysis applications.
Approach: MolEvolvR [DOI: doi.org/10.1101/2022.02.18.461833; web-app: jravilab.org/molevolvr] is a web-app for characterizing proteins by domains, domain architectures, homology, and phylogeny, and for creating intuitive, interactive summaries and visualizations. In the first phase, MolEvolvR finds the domains of the input proteins and performs an iterative, domain-sensitive homology search across the tree of life. Next, the input proteins and their homologs are characterized by scanning for domains and assembling the domains of each protein into residue-ordered architectures. The front-end of the web-app is written in R/Shiny, while the back-end consists of R and shell scripts for command-line applications and server-side job scheduling. MolEvolvR has been tested on Mac, Windows, and Linux operating systems with the Chrome, Brave, Firefox, and Safari browsers.
Results: We developed a free, user-friendly web-app, MolEvolvR, to characterize proteins by domains and phylogeny on an evolutionary scale. Supported protein input formats include FASTA, BLAST and InterProScan output, MSA, and UniProt/NCBI accession numbers. Up to 1000 proteins are allowed per submission, and each submission is assigned a job code that is used to track job progress and retrieve results.
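To make the architecture-assembly step of the Approach concrete, the following is a minimal Python sketch, not MolEvolvR's actual R implementation. It assumes domain hits are available as (protein, domain, start, end) tuples; the protein IDs, domain names, coordinates, the `assemble_architectures` helper, and the "+" separator are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical domain hits: (protein_id, domain_name, start_residue, end_residue).
# IDs, domain names, and coordinates are invented for illustration only.
hits = [
    ("protA", "LRR", 120, 300),
    ("protA", "SignalP", 1, 25),
    ("protA", "LysM", 340, 390),
    ("protB", "LysM", 200, 250),
    ("protB", "SignalP", 1, 30),
]


def assemble_architectures(domain_hits, sep="+"):
    """Join each protein's domains in order of their start residue.

    Assumption: domains are ordered by start position only; overlapping
    hits are not resolved here.
    """
    by_protein = defaultdict(list)
    for protein_id, domain, start, _end in domain_hits:
        by_protein[protein_id].append((start, domain))
    return {
        protein_id: sep.join(domain for _start, domain in sorted(domains))
        for protein_id, domains in by_protein.items()
    }


architectures = assemble_architectures(hits)
print(architectures)
```

Sorting by start residue is the simplest reading of "residue-ordered"; a real pipeline would likely also need a rule for overlapping or redundant hits, which this sketch leaves out.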
Results can be viewed in any browser, split across analysis tabs for query and homolog data, domain architectures, and phylogeny. Notable analyses include an UpSet plot that shows both the distribution of constituent domains and the distribution of architectures across homologs. Networks visualize domain architectures across all homologs, with domains as nodes and edges connecting domains that co-occur within a protein; node and edge sizes are proportional to their frequencies of occurrence. The evolution of protein families is summarized with phylogenetic trees, multiple sequence alignments, sunburst plots, and other visualizations that summarize the phyletic spread of domain architectures across the tree of life. Results in each case can be filtered to provide custom summaries and visualizations. The app’s protein analysis methods have been applied to various pathogenic protein families, for example, Listeria internalin proteins, Staphylococcus aureus nutrient acquisition, Bacillus anthracis surface layer proteins, the Vibrio cholerae phage defense system, and helicase operators across bacteria. We are now actively developing a companion R package (and Docker container) to allow local analyses and customized summarization/visualization of any data from MolEvolvR. This will also allow users to host their own web instances, if needed. The primary MolEvolvR R package will also provide an easy-to-use API for these key functionalities, and the data packages will carry the complementary pre-processed databases.
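As an illustration of the domain network described above (domains as nodes, co-occurring domains joined by edges, sizes proportional to frequency), the following Python sketch computes node and edge counts from a set of architectures. It is a sketch under stated assumptions rather than MolEvolvR's own code: the architectures, the `domain_network` helper, and the "+" separator are hypothetical, each domain is counted once per protein, and every pair of domains in a protein is treated as co-occurring (not only adjacent ones).

```python
from collections import Counter
from itertools import combinations

# Hypothetical architectures (domains joined by "+"); names are illustrative only.
architectures = {
    "protA": "SignalP+LRR+LysM",
    "protB": "SignalP+LysM",
    "protC": "LRR+LRR+LysM",
}


def domain_network(archs, sep="+"):
    """Count how often each domain, and each domain pair, occurs across proteins.

    Assumptions: a domain repeated within one protein is counted once,
    and any two distinct domains in the same protein form an edge.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for arch in archs.values():
        domains = sorted(set(arch.split(sep)))  # unique domains, stable pair order
        node_counts.update(domains)
        edge_counts.update(combinations(domains, 2))
    return node_counts, edge_counts


nodes, edges = domain_network(architectures)
print(nodes)
print(edges)
```

The resulting counts could be passed to any graph-drawing library to scale node and edge sizes; restricting edges to adjacent domains would be an equally reasonable alternative design.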