Low variance filter - IQR potential issues?

Hi all - I hope your research is going well.

I am working on processing a combined 16S dataset. For context, I realized when reviewing visualizations that I had skipped running a particular analysis we were interested in. When I re-ran my entire workflow with the same taxonomy and metadata files, I noticed different statistical values for alpha diversity.

We started asking why this would happen and concluded that it had to be something in the data processing steps. By process of elimination (i.e., uploading the exact same metadata and taxonomy files and choosing different filtering options), we identified that when the data are filtered with the Low Variance Filter based on IQR, different taxa are filtered out of the dataset each time. Because the taxonomy and metadata files are exactly the same, my understanding is that the IQR, and thus the set of taxa filtered out, should be consistent.

For reference, I have attached three different filtered data outputs produced with the same dataset and low-variance filter applied based on IQR (10%). I have also attached two filtered data outputs without the low-variance filter applied.

Additionally, for reference I have pasted the code through the filtering and normalization steps for each run below.

Of interest, when my PI applied the low variance filter based on IQR with a different dataset, the same issue arose. However, when she applied the low variance filter based on standard deviation instead of IQR, the same taxa were filtered out each time.

Any thoughts are appreciated.

-Scott

data_filtered.csv (82.7 KB)
data_filtered.csv (82.7 KB)
data_filtered.csv (82.8 KB)
data_filtered no low variance.csv (91.1 KB)
data_filtered no low variance.csv (91.1 KB)

Code Run 1 (low variance filter applied):

mbSet<-Init.mbSetObj()
mbSet<-SetModuleType(mbSet, "mdp")
mbSet<-ReadSampleTable(mbSet, "Phage2_CCTSI_microbiomeanalyst_MetaData_TC_noBeetRoot.csv");
mbSet<-Read16SAbundData(mbSet, "Phage2_CCTSI_ASV_TC_noBeetRoot.csv","text","Greengenes","T","false");
mbSet<-SanityCheckData(mbSet, "text");
mbSet<-SanityCheckSampleData(mbSet);
mbSet<-SetMetaAttributes(mbSet, "1")
mbSet<-PlotLibSizeView(mbSet, "norm_libsizes_0","png");
mbSet<-CreatePhyloseqObj(mbSet, "text","Greengenes","T" , "false")
mbSet<-ApplyAbundanceFilter(mbSet, "prevalence", 4, 0.1);
mbSet<-ApplyVarianceFilter(mbSet, "iqr", 0.1);
mbSet<-PerformNormalization(mbSet, "none", "colsum", "none", "true");

Code Run 2 (low variance filter applied):

mbSet<-Init.mbSetObj()
mbSet<-SetModuleType(mbSet, "mdp")
mbSet<-ReadSampleTable(mbSet, "Phage2_CCTSI_microbiomeanalyst_MetaData_TC_noBeetRoot.csv");
mbSet<-Read16SAbundData(mbSet, "Phage2_CCTSI_ASV_TC_noBeetRoot.csv","text","Greengenes","T","false");
mbSet<-SanityCheckData(mbSet, "text");
mbSet<-SanityCheckSampleData(mbSet);
mbSet<-SetMetaAttributes(mbSet, "1")
mbSet<-PlotLibSizeView(mbSet, "norm_libsizes_0","png");
mbSet<-CreatePhyloseqObj(mbSet, "text","Greengenes","T" , "false")
mbSet<-ApplyAbundanceFilter(mbSet, "prevalence", 4, 0.1);
mbSet<-ApplyVarianceFilter(mbSet, "iqr", 0.1);
mbSet<-PerformNormalization(mbSet, "none", "colsum", "none", "true");

Code Run 3 (low variance filter applied):

mbSet<-Init.mbSetObj()
mbSet<-SetModuleType(mbSet, "mdp")
mbSet<-ReadSampleTable(mbSet, "Phage2_CCTSI_microbiomeanalyst_MetaData_TC_noBeetRoot.csv");
mbSet<-Read16SAbundData(mbSet, "Phage2_CCTSI_ASV_TC_noBeetRoot.csv","text","Greengenes","T","false");
mbSet<-SanityCheckData(mbSet, "text");
mbSet<-SanityCheckSampleData(mbSet);
mbSet<-SetMetaAttributes(mbSet, "1")
mbSet<-PlotLibSizeView(mbSet, "norm_libsizes_0","png");
mbSet<-CreatePhyloseqObj(mbSet, "text","Greengenes","T" , "false")
mbSet<-ApplyAbundanceFilter(mbSet, "prevalence", 4, 0.1);
mbSet<-ApplyVarianceFilter(mbSet, "iqr", 0.1);
mbSet<-PerformNormalization(mbSet, "none", "colsum", "none", "true");

Code Run 4 (no low variance):

mbSet<-Init.mbSetObj()
mbSet<-SetModuleType(mbSet, "mdp")
mbSet<-ReadSampleTable(mbSet, "Phage2_CCTSI_microbiomeanalyst_MetaData_TC_noBeetRoot.csv");
mbSet<-Read16SAbundData(mbSet, "Phage2_CCTSI_ASV_TC_noBeetRoot.csv","text","Greengenes","T","false");
mbSet<-SanityCheckData(mbSet, "text");
mbSet<-SanityCheckSampleData(mbSet);
mbSet<-SetMetaAttributes(mbSet, "1")
mbSet<-PlotLibSizeView(mbSet, "norm_libsizes_0","png");
mbSet<-CreatePhyloseqObj(mbSet, "text","Greengenes","T" , "false")
mbSet<-ApplyAbundanceFilter(mbSet, "prevalence", 4, 0.1);
mbSet<-ApplyVarianceFilter(mbSet, "iqr", 0.0);
mbSet<-PerformNormalization(mbSet, "none", "colsum", "none", "true");

Code Run 5 (no low variance):

mbSet<-Init.mbSetObj()
mbSet<-SetModuleType(mbSet, "mdp")
mbSet<-ReadSampleTable(mbSet, "Phage2_CCTSI_microbiomeanalyst_MetaData_TC_noBeetRoot.csv");
mbSet<-Read16SAbundData(mbSet, "Phage2_CCTSI_ASV_TC_noBeetRoot.csv","text","Greengenes","T","false");
mbSet<-SanityCheckData(mbSet, "text");
mbSet<-SanityCheckSampleData(mbSet);
mbSet<-SetMetaAttributes(mbSet, "1")
mbSet<-PlotLibSizeView(mbSet, "norm_libsizes_0","png");
mbSet<-CreatePhyloseqObj(mbSet, "text","Greengenes","T" , "false")
mbSet<-CreatePhyloseqObj(mbSet, "text","Greengenes","T" , "false")
mbSet<-ApplyAbundanceFilter(mbSet, "prevalence", 4, 0.1);
mbSet<-ApplyVarianceFilter(mbSet, "iqr", 0.0);
mbSet<-PerformNormalization(mbSet, "none", "colsum", "none", "true");

Thank you for sharing your thoughts. I think you are right. Data filtering uses ranks based on abundance or variance. There will be cases where several features share the same rank and that rank falls exactly at the threshold. The algorithm then chooses at random, among the features of the same rank, which ones to exclude.
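To illustrate the tie-at-the-threshold behavior described above, here is a minimal sketch in base R. It is not the package's actual code: the toy IQR values, the taxon names, and the `filter_low_variance` helper are all hypothetical, and the rule it uses (rank by decreasing IQR, break ties at random, keep the top fraction) is only an assumption about how such a filter might work. Sparse count data often give many taxa an IQR of exactly 0, which may be why IQR produces ties at the cutoff while standard deviation usually does not.

```r
# Hypothetical toy data: six taxa, four of which share an IQR of exactly 0.
iqr_vals <- c(taxA = 5.0, taxB = 3.0, taxC = 0, taxD = 0, taxE = 0, taxF = 0)

# Assumed filtering rule (illustration only): rank features by decreasing
# value, break ties at random, and keep the top (1 - fraction) of features.
filter_low_variance <- function(vals, fraction) {
  rk <- rank(-vals, ties.method = "random")
  n_keep <- floor(length(vals) * (1 - fraction))
  names(vals)[rk <= n_keep]
}

# Repeat the same filter many times on identical input and collect
# the distinct sets of retained taxa.
runs <- replicate(
  200,
  paste(sort(filter_low_variance(iqr_vals, 0.5)), collapse = ",")
)
unique(runs)
```

In this sketch taxA and taxB are always retained, while the third retained taxon is drawn at random from the four tied taxa, so repeated runs on the same input return different feature sets.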

Since you are using R, setting a random seed may help solve the issue. I am not sure whether this would be good practice for our public server, as randomness is desirable in many cases (i.e., irreproducible features are not very robust).
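The seed suggestion can be sketched as follows, again with a hypothetical stand-in for the filter rather than the package's own function. The helper `filter_low_variance`, the toy values, and the seed value 42 are assumptions made purely for illustration. The second option, deterministic tie-breaking by input order, is an alternative technique and not something the filter is known to expose.

```r
# Toy stand-in for a rank-based variance filter (an assumption, not package code).
filter_low_variance <- function(vals, fraction, ties = "random") {
  rk <- rank(-vals, ties.method = ties)
  n_keep <- floor(length(vals) * (1 - fraction))
  names(vals)[rk <= n_keep]
}

# Hypothetical IQR values with four taxa tied at 0.
iqr_vals <- c(taxA = 5.0, taxB = 3.0, taxC = 0, taxD = 0, taxE = 0, taxF = 0)

# Option 1: fix the seed immediately before the filtering step.
set.seed(42)  # the seed value itself is arbitrary
run1 <- filter_low_variance(iqr_vals, 0.5)
set.seed(42)
run2 <- filter_low_variance(iqr_vals, 0.5)
identical(run1, run2)

# Option 2: break ties deterministically by input order instead of at random.
run3 <- filter_low_variance(iqr_vals, 0.5, ties = "first")
run4 <- filter_low_variance(iqr_vals, 0.5, ties = "first")
identical(run3, run4)
```

In a local R session, the equivalent step would presumably be to call `set.seed()` with a fixed value on the line before `ApplyVarianceFilter(mbSet, "iqr", 0.1)`. This only helps if nothing between the two calls consumes random numbers or resets the seed, which has not been verified here.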

Great, thank you for taking the time to reply, Dr. Xia.