Squeegee Pipeline


Computational analysis of host-associated microbiomes has opened the door to numerous discoveries relevant to human health and disease. However, contaminant sequences in metagenomic samples can distort the interpretation of findings reported in microbiome studies, especially in low biomass environments. Although this has been a known issue for some time, negative control data are often not available in public databases, making it nearly impossible to perform contamination removal on uploaded data.


Our hypothesis is that contamination from DNA extraction kits or sampling lab environments will leave taxonomic “bread crumbs” across multiple distinct sample types, allowing for the detection of microbial contaminants when negative controls are unavailable. To test this hypothesis, we implemented Squeegee, a de novo contamination detection tool. We tested Squeegee on simulated and real low biomass metagenomic datasets. On the low biomass samples, we compared Squeegee predictions to experimental negative control data and showed that Squeegee accurately recovers known contaminants. We also analyzed 749 metagenomic datasets from the Human Microbiome Project and identified likely kit contamination that had not been previously reported. Collectively, our results highlight that Squeegee can identify microbial contaminants with high precision.
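The intuition behind the hypothesis can be illustrated with a toy sketch. This is not Squeegee's actual algorithm: the function name, the thresholds, the input format, and the taxa in the example are all assumptions made for illustration. It simply flags taxa that are prevalent within several distinct sample types, which is the kind of cross-sample-type signal the hypothesis describes.

```python
from collections import defaultdict


def flag_candidate_contaminants(samples, min_sample_types=2, min_prevalence=0.5):
    """Toy illustration only (hypothetical, not the Squeegee implementation).

    samples: iterable of (sample_type, taxa) pairs, where taxa is a
             collection of taxon names detected in that sample.
    Returns a sorted list of taxa that reach min_prevalence within at
    least min_sample_types distinct sample types.
    """
    # Group the per-sample taxon sets by sample type.
    by_type = defaultdict(list)
    for sample_type, taxa in samples:
        by_type[sample_type].append(set(taxa))

    # Count, for each taxon, how many sample types it is prevalent in.
    type_hits = defaultdict(int)
    for taxa_sets in by_type.values():
        counts = defaultdict(int)
        for taxa in taxa_sets:
            for taxon in taxa:
                counts[taxon] += 1
        for taxon, n in counts.items():
            if n / len(taxa_sets) >= min_prevalence:
                type_hits[taxon] += 1

    return sorted(t for t, n in type_hits.items() if n >= min_sample_types)


# Illustrative input; the taxa and sample types are made up for this example.
example_samples = [
    ("oral", {"Streptococcus", "Ralstonia"}),
    ("oral", {"Streptococcus", "Ralstonia", "Veillonella"}),
    ("skin", {"Cutibacterium", "Ralstonia"}),
    ("skin", {"Staphylococcus", "Ralstonia"}),
    ("stool", {"Bacteroides", "Ralstonia"}),
]
candidates = flag_candidate_contaminants(example_samples, min_sample_types=3)
```

In this made-up input, only the taxon shared by all three sample types is flagged. A real tool would also need to separate genuine shared community members from contaminants, which this sketch does not attempt.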


In summary, Squeegee is the first computational method for identifying potential microbial contaminants in the absence of environmental negative control samples. Squeegee is open-source and available at: https://gitlab.com/treangenlab/squeegee


  • Dr. Kjersti Aagaard (BCM)
  • Dr. Michael Jochum (BCM)
Yunxi Liu
PhD student

Louis (fourth-year PhD student) obtained a B.S. degree in Computer Science from the University of Houston and a B.S. degree in Pharmacology from China Pharmaceutical University. During his undergraduate studies at UH, he did research in the Pattern Analysis Laboratory on image feature extraction. His current research interests include computational biology, metagenomics, and data science.

Dr. R.A. Leo Elworth
Postdoctoral Scientist from August 2019 through April 2022

Leo (NLM Postdoctoral Fellow, primary mentor Prof. Lauren Stadler, secondary mentor Prof. Todd Treangen) received his PhD in Computer Science from Rice University in 2019, working on statistical modeling of DNA sequence evolution. He was advised by Dr. Luay Nakhleh, the J.S. Abercrombie Professor and Chair of the Department of Computer Science at Rice. Since joining Rice, Leo has been awarded a graduate research fellowship from the National Library of Medicine, published work in computational biology in journals such as Bioinformatics, presented research at scientific conferences such as RECOMB-CG in Barcelona and WABI in Helsinki, and contributed to a soon-to-be-released book on computational modeling of evolutionary histories of genomes.

Todd J. Treangen
Associate Professor of Computer Science

My research interests include algorithms and data structures for efficient analysis of microbial genomes and metagenomes.