Minimal Data Filtering is Problematic for ASVs

First, I would like to say that I think MicrobiomeAnalyst has many amazing features that allow researchers with limited bioinformatics skills to investigate the microbiome. I have read prior posts in this forum regarding the Minimal Data Filtering step that is applied to all datasets in MicrobiomeAnalyst with Marker Data Profiling, but none address the real issue with the implementation of this step, especially for ASVs.
The primary issue is that ASVs, OTUs or taxa with 2 or more counts are being filtered indiscriminately. At first, I thought this step was just removing singleton features (features with a single read count across all samples in the dataset). However, after inspecting multiple datasets, it became clear that this step removes any ASV, OTU or taxon that is present in only one sample, regardless of its read count. This filtering is problematic for all downstream analyses in MicrobiomeAnalyst, especially for ASV data. I have assessed this issue with multiple ASV datasets from gut and vaginal samples.
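To make the distinction concrete, here is a minimal sketch of the two rules on a toy table. The data, feature IDs and function names are all hypothetical, and this is only my reading of the behaviour, not MicrobiomeAnalyst's actual code.

```python
# Toy counts table: feature ID -> read counts per sample (hypothetical data).
counts = {
    "ASV_1": [120, 0, 35, 8],    # present in several samples
    "ASV_2": [0, 20357, 0, 0],   # one sample, high abundance
    "ASV_3": [0, 0, 1, 0],       # true singleton: 1 read in the whole dataset
    "ASV_4": [0, 2, 0, 0],       # one sample, 2 reads
}


def prevalence(row):
    """Number of samples in which the feature has at least one read."""
    return sum(1 for c in row if c > 0)


def drop_singletons(table):
    """Drop only features whose total count across all samples is exactly 1."""
    return {f: row for f, row in table.items() if sum(row) != 1}


def drop_single_sample_features(table):
    """Drop every feature seen in fewer than two samples, whatever its count."""
    return {f: row for f, row in table.items() if prevalence(row) >= 2}


kept_singleton_rule = sorted(drop_singletons(counts))
kept_prevalence_rule = sorted(drop_single_sample_features(counts))
print(kept_singleton_rule)   # features kept when only singletons are removed
print(kept_prevalence_rule)  # features kept when single-sample features are removed
```

In this toy case the singleton rule removes only ASV_3, while the single-sample rule also removes ASV_2 and ASV_4, including the feature with 20,357 reads.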
To provide one example, I imported an ASV dataset derived from V4 amplicon sequencing of 168 vaginal samples, processed with QIIME 2 and DADA2. My initial raw counts table consists of 1829 ASVs. After the minimal data filtering was applied in MicrobiomeAnalyst, only 1028 ASVs passed, so 801 ASVs were removed by minimal data filtering. In the subsequent Data Filtering step, I set the options to 0 so that no further filtering was applied. I then downloaded the filtered ASV table from MicrobiomeAnalyst so I could compare it with my input ASV table and identify the ASVs removed by minimal data filtering.
Out of the 801 filtered ASVs, only 5 were true singletons. The other filtered ASVs ranged in counts from 2 reads all the way up to 20,357 reads in a single sample. The filtered ASVs were also spread across all samples in the dataset, so their removal, including that of the most abundant ones, will have a variable influence on the whole dataset. Obviously, the biggest concern is that high abundance ASVs are being filtered out. Of the 801 ASVs that were filtered, 10 had > 5000 reads, 23 had > 1000 reads, and 118 had > 100 reads.
Nearly all of these were classified to the species level (based on their V4 sequences, using sklearn and Greengenes2 in QIIME 2) as common and important vaginal microbes in the context of health and disease. In addition, NCBI BLAST analysis of most of these filtered sequences shows 99-100 percent identity with 99-100 percent query coverage to the V4 sequence of these classified species. Other ASVs for these same species are also present in my dataset but happen to be present in more than one sample, so they are retained. Thus, it is highly unlikely that these are sequencing artifacts. Rather, they are true ASVs and shouldn’t be filtered out, as their removal affects all downstream analyses, both at the ASV level and when assessing data at other taxonomic levels.
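For anyone who wants to run the same comparison on their own data, the check I did can be sketched roughly as follows. The tables here are toy stand-ins with hypothetical IDs and counts; in practice you would load your input table and the filtered table downloaded from MicrobiomeAnalyst.

```python
# Input table: feature ID -> read counts per sample (hypothetical toy data).
input_table = {
    "ASV_a": [10, 5, 0],
    "ASV_b": [0, 20357, 0],
    "ASV_c": [0, 0, 1],
    "ASV_d": [150, 0, 0],
    "ASV_e": [0, 2, 0],
    "ASV_f": [3, 0, 4],
}
# Feature IDs present in the table downloaded after filtering (hypothetical).
retained_ids = {"ASV_a", "ASV_f"}

# Features that were in the input but are missing from the filtered table.
removed = {f: row for f, row in input_table.items() if f not in retained_ids}

# True singletons: exactly one read across the whole dataset.
true_singletons = sorted(f for f, row in removed.items() if sum(row) == 1)

# Tally removed features by their highest per-sample read count.
max_reads = {f: max(row) for f, row in removed.items()}
over_threshold = {
    t: sum(1 for m in max_reads.values() if m > t) for t in (100, 1000, 5000)
}

print(len(removed), "features removed")
print("true singletons:", true_singletons)
print("removed features above each read threshold:", over_threshold)
```

The thresholds (100, 1000, 5000) are just the ones I used in my tally above and can be changed freely.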
Even if the argument is made that these ASVs shouldn’t be considered in differential abundance analysis, that is not the only downstream analysis carried out: rarefaction, transformation, relative abundance, diversity, and analyses at other taxonomic levels will all be adversely affected. The removal of these ASVs will also still affect differential abundance analyses, because the relative abundance or transformed values of the retained ASVs or higher-level taxa (e.g., genera, species) are changed by the loss of these ASVs, especially when high abundance ASVs are filtered.
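A small worked example of the relative abundance effect, using made-up counts for a single sample (the feature names and numbers are hypothetical):

```python
def relative_abundance(sample_counts):
    """Convert raw counts for one sample into fractions that sum to 1."""
    total = sum(sample_counts.values())
    return {f: c / total for f, c in sample_counts.items()}


# One sample in which ASV_C is abundant but found in no other sample (hypothetical).
sample = {"ASV_A": 500, "ASV_B": 1500, "ASV_C": 8000}

before = relative_abundance(sample)
after = relative_abundance({f: c for f, c in sample.items() if f != "ASV_C"})

print(before["ASV_A"], after["ASV_A"])
```

Here ASV_A goes from 5 percent of the sample to 25 percent purely because ASV_C was removed, even though its own counts did not change.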
The only way I can find around the minimal filtering is to add a pseudocount of 1 to my data. However, if I then need to rarefy my data, there are issues: the pseudocounts will be retained in the lowest sequencing depth samples and mostly removed from the highest depth samples, which can impact downstream diversity metrics.
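A rough back-of-the-envelope sketch of why this happens, assuming rarefaction is plain subsampling without replacement down to a common depth. The function name and the numbers are hypothetical and chosen only to keep the arithmetic simple.

```python
def expected_pseudocount_reads(n_features, raw_depth, rarefy_depth):
    """Expected number of pseudocount reads that survive rarefaction.

    Adding 1 to every feature adds n_features reads to the sample. Under
    subsampling without replacement, each read is kept with probability
    rarefy_depth / (raw_depth + n_features).
    """
    depth_with_pseudo = raw_depth + n_features
    return n_features * rarefy_depth / depth_with_pseudo


n_features = 1000      # hypothetical number of features in the table
rarefy_depth = 5000    # depth of the shallowest sample after adding pseudocounts

low_depth = expected_pseudocount_reads(n_features, 4000, rarefy_depth)
high_depth = expected_pseudocount_reads(n_features, 99000, rarefy_depth)

print(low_depth, high_depth)
```

With these numbers the shallowest sample keeps all 1000 pseudocount reads (20 percent of its rarefied depth), while the deep sample is expected to keep about 50, so the two samples no longer look comparable in richness.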
As most marker gene studies now utilize ASVs (this was probably less of an issue with OTUs or with taxon-level collapsed tables), I would argue that the minimal filtering step should either be disabled or amended to only remove singleton features. I greatly appreciate having MicrobiomeAnalyst as a resource and its varied options and capabilities to analyze microbiome data and hope you will consider my suggestion. I look forward to hearing back from you and am happy to provide any clarifications about the issues I’ve encountered.


Thanks for the comments. We have updated our website so that only singleton features (features with 1 count, present in only one sample) are automatically removed. All other features with more than one count are kept for further processing. You can then apply filtering and normalization based on your own needs.

Hope this helps!
