SeqScreen: A biocuration platform for robust taxonomic and biological process characterization of nucleic acid sequences of interest


Rapid advancements in synthetic biology and nucleic acid synthesis, and in particular concerns about their intentional or accidental misuse, call for more sophisticated screening tools to identify genes of interest within short sequence fragments. One major gap in predicting genes of concern is the inadequacy of current tools and ontologies to describe the specific biological processes of pathogenic proteins. The objective of this work is to design software that sensitively assigns taxonomic classifications, functional annotations, and biological processes of interest to short nucleotide sequences of unknown origin (50 bp to 1,000 bp). The overarching goal is to perform sensitive characterization of short sequences and highlight specific pathogenic biological processes of interest (BPoIs). The SeqScreen software executes these tasks in analytical workflows with Nextflow and outputs results in a tab-delimited report. Local and global alignments differentiate hits to taxonomically related sequences from hits to similar but unrelated sequences, and an ensemble approach leverages multiple tools and databases to assign a variety of functional terms to each query sequence. Final biological process assessments are made from the predicted functional annotations, which leverage information in pre-existing databases as well as new custom biocurations. Machine learning models predict each biological process of interest on large protein databases before incorporation into the SeqScreen framework; precomputing these predictions streamlines computational efficiency, ensures reproducible results, allows for version control, and facilitates the review of the automated predictions by expert biocurators.
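The ensemble and reporting steps described in the abstract can be illustrated with a minimal sketch. This is not SeqScreen's implementation: the function names, the `TERM_TO_BPOI` lookup table, the term identifiers, and the report columns are all hypothetical placeholders, chosen only to show how functional terms from several tools might be pooled per query, mapped to BPoIs through a precomputed table, and written as a tab-delimited report.

```python
import csv
import io
from collections import defaultdict

# Hypothetical precomputed lookup from functional terms to biological
# processes of interest (BPoIs). The entries are placeholders, not curated data.
TERM_TO_BPOI = {
    "TERM:0001": "bpoi_a",
    "TERM:0002": "bpoi_b",
}


def merge_annotations(per_tool_hits):
    """Pool (set union) the functional terms each tool assigned to each query."""
    merged = defaultdict(set)
    for hits in per_tool_hits.values():
        for query_id, terms in hits.items():
            merged[query_id].update(terms)
    return merged


def assign_bpois(terms):
    """Look up the BPoIs implied by a collection of functional terms."""
    return sorted({TERM_TO_BPOI[t] for t in terms if t in TERM_TO_BPOI})


def write_report(merged, handle):
    """Write one tab-delimited row per query: ID, pooled terms, BPoIs."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query", "terms", "bpois"])
    for query_id in sorted(merged):
        terms = sorted(merged[query_id])
        bpois = ";".join(assign_bpois(terms)) or "-"
        writer.writerow([query_id, ";".join(terms), bpois])


def build_report(per_tool_hits):
    """Convenience wrapper that returns the report as a string."""
    buffer = io.StringIO()
    write_report(merge_annotations(per_tool_hits), buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical output of two annotation tools for two query sequences.
    example_hits = {
        "tool_a": {"q1": ["TERM:0001"]},
        "tool_b": {"q1": ["TERM:0003"], "q2": ["TERM:0003"]},
    }
    print(build_report(example_hits), end="")
```

Keeping the term-to-BPoI mapping as a static table mirrors the abstract's point that predictions are computed ahead of time, so the mapping can be versioned and reviewed by curators independently of any single run.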

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). [DOI:10.1109/BIBM47256.2019.8982987]
R.A. Leo Elworth
NLM Postdoctoral Fellow

Leo (NLM Postdoctoral Fellow, primary mentor Prof. Lauren Stadler, secondary mentor Prof. Todd Treangen) received his PhD in Computer Science from Rice University in 2019, working on statistical modeling of DNA sequence evolution. He was advised by Dr. Luay Nakhleh, the J.S. Abercrombie Professor and Chair of the Department of Computer Science at Rice. Since joining Rice, Leo has been awarded a graduate research fellowship from the National Library of Medicine, published work in computational biology in journals such as Bioinformatics, presented research at scientific conferences such as RECOMB-CG in Barcelona and WABI in Helsinki, and contributed to a soon-to-be-released book on computational modeling of the evolutionary histories of genomes.

Advait Balaji
PhD student

Advait (4th-year PhD student) obtained a dual degree, a B.E. in Computer Science and an M.S. in Biological Sciences, from BITS, Pilani in India. During his undergraduate degree, he received the Khorana Scholarship (2016) from the Indo-US Science and Technology Forum and a thesis fellowship (2017-18) to work at the Icahn School of Medicine at Mount Sinai, NY. At Mount Sinai, he worked on creating a subcellular process-based ontology that predicts whole-cell function using natural language processing. His research interests lie at the intersection of genomic data science and the design of efficient algorithms for analyzing genomic data.