GenomeDepot: Computational Methods for Decoding Biological Information Encoded in Engineered DNA and Microbial Genomes

December 2021

Abstract

Although great advances have been made in DNA sequencing and genome engineering, our ability to fully elucidate the biological information encoded in genomic data, and to fully control biological systems, remains limited. My research has focused on deciphering signatures hidden in genomic data, specifically in engineered synthetic sequences and metagenomes. Recent advances in genome engineering and editing have enabled researchers to create novel genetic parts and redesign biological systems. As genome engineering develops, there is heightened awareness of its potential misuse and the associated biosafety concerns. In parallel, metagenomics now allows us to study microbial communities at unprecedented resolution. Previous efforts in this area allow us to identify the species composition of a given microbial community and estimate its metabolic functions. Despite this great progress, fine-grained knowledge of the bacteria driving microbial interactions within microbiomes is still lacking, limiting our ability to fully understand and control microbial communities. In the first part of my thesis, I developed PlasmidHawk, a linear-time, pan-genome alignment-based pipeline to predict the lab-of-origin of unknown sequences. Compared to the previous deep learning method, PlasmidHawk has higher prediction accuracy. PlasmidHawk correctly predicts the depositing lab of an unknown sequence 76% of the time, and 85% of the time the correct lab is among the top 10 candidates. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. In the second part of my thesis, I developed Bakdrive, a novel method for identifying driver species within microbiomes.
Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. Using simulated and real datasets, we demonstrate that by detecting driver species in healthy donor samples and introducing them into disease samples, we can restore the gut microbiome of recurrent Clostridioides difficile infection patients to a healthy state. In summary, Bakdrive provides a novel approach for teasing apart microbial interactions and facilitates future personalized probiotic design. In conclusion, GenomeDepot represents a collection of novel, computationally efficient software tools and algorithms suited for deciphering biological information encoded in engineered and microbial genomes. Real-world applications of GenomeDepot have included lab-of-origin prediction and detection of driver species in healthy and disease-associated microbiomes, informing biosecurity decisions and human health.

Publication

PhD Thesis, Rice University. Available online

Dr. Qi Wang

PhD student from September 2018 through January 2022 (currently Sr. Bioinformatics Scientist at Illumina)

Dr. Wang is a Bioinformatics Scientist at Illumina and finished her PhD in the Treangen Lab in December 2021. Previously, Dr. Wang obtained a B.S. in Biotechnology from Hong Kong Baptist University and an M.S. in Biotechnology from Northwestern University. During her undergraduate studies, she conducted research at the University of Chinese Academy of Sciences, Beijing University of Chemical Technology, and Capital Medical University, focusing on using bioinformatics and experimental approaches to solve various life science problems in areas including synthetic biology, developmental biology, oncology, and drug discovery. Her interest is in improving human health and the environment by understanding complex biological data.