SeqScreen: Accurate and Sensitive Functional Screening of Pathogenic Sequences via Ensemble Learning

Abstract

The COVID-19 pandemic has emphasized the importance of detecting known and emerging pathogens from clinical and environmental samples. However, robust characterization of pathogenic sequences remains an open challenge. To this end, we developed SeqScreen, which can accurately characterize short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed pathogen characterization and is available for download at: www.gitlab.com/treangenlab/seqscreen

Publication
Advait Balaji
PhD student

Advait (4th year PhD student) obtained a dual degree, a B.E. in Computer Science and an MS in Biological Sciences, from BITS, Pilani in India. During his undergraduate degree, he received the Khorana Scholarship (2016) from the Indo-US Science and Technology Forum and a thesis fellowship (2017-18) to work at the Icahn School of Medicine, Mount Sinai, NY. At Mount Sinai, he worked on creating a subcellular process-based ontology that predicts whole-cell function using natural language processing. His research interests lie at the intersection of genomic data science and the design of efficient algorithms for analyzing genomic data.

Bryce Kille
PhD student

Bryce (1st year PhD student) received his MS in Bioinformatics and BS in Computer Science + Chemistry from the University of Illinois at Urbana-Champaign. As an undergraduate, he worked at Dow Agrosciences in both the computational biology and cheminformatics groups. His projects included developing software for phylogeny analysis and creating models for compound activity prediction. During his Master’s program, Bryce worked in a biochemistry lab developing software for genome mining, as well as on a research project creating bit-wise algorithms for the C++ STL. One of his main interests is casting biological and chemical problems as theoretical computer science questions.

R.A. Leo Elworth
NLM Postdoctoral Fellow

Leo (NLM Postdoctoral Fellow, primary mentor Prof. Lauren Stadler, secondary mentor Prof. Todd Treangen) received his PhD in Computer Science from Rice University in 2019, working on statistical modeling of DNA sequence evolution. He was advised by Dr. Luay Nakhleh, the J.S. Abercrombie Professor and Chair of the Department of Computer Science at Rice. Since joining Rice, Leo has been awarded a graduate research fellowship from the National Library of Medicine, published work in computational biology in journals such as Bioinformatics, presented research at scientific conferences such as RECOMB-CG in Barcelona and WABI in Helsinki, and contributed to a soon-to-be-released book on computational modeling of the evolutionary histories of genomes.
