Why are only the top 20% of features used by default for PCoA (PPD module)?

Beta diversity, and therefore PCoA, is mainly driven by abundant taxa that are shared across samples. As a result, most of the clustering patterns seen in a PCoA plot are determined by these abundant taxa or features.

Most reference studies in the Projection with Public Data (PPD) module contain a large number of samples, and many distance measures, such as Unweighted UniFrac, require considerable computation time to calculate the dissimilarity or distance between samples. To avoid this bottleneck, the PPD module in MicrobiomeAnalyst uses only the top 20% most abundant features by default, which keeps computation fast.
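
To make the filtering step concrete, here is a minimal Python sketch of keeping the top 20% most abundant features. It is not MicrobiomeAnalyst's own code: the function name `top_abundant_features`, the table layout, and the choice to rank features by their total count across samples are all assumptions made for illustration.

```python
import math

def top_abundant_features(table, fraction=0.2):
    """Keep the given fraction of features with the highest total abundance.

    `table` maps a feature name to its list of per-sample counts.
    Ranking by summed counts is an assumption for this sketch; the
    actual ranking criterion used by the PPD module may differ.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always keep at least one feature, rounding the cutoff up.
    n_keep = max(1, math.ceil(len(table) * fraction))
    ranked = sorted(table, key=lambda name: sum(table[name]), reverse=True)
    return {name: table[name] for name in ranked[:n_keep]}

# Toy table with made-up counts: 10 features measured in 3 samples.
toy_table = {
    "OTU1": [120, 95, 130],
    "OTU2": [3, 0, 1],
    "OTU3": [80, 110, 60],
    "OTU4": [0, 2, 0],
    "OTU5": [5, 7, 4],
    "OTU6": [1, 0, 0],
    "OTU7": [10, 12, 9],
    "OTU8": [0, 0, 1],
    "OTU9": [2, 1, 2],
    "OTU10": [6, 4, 8],
}

filtered = top_abundant_features(toy_table)
print(sorted(filtered))
```

With 10 features and the default fraction, only the 2 features with the largest totals are passed on to the distance calculation.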

In our experience, this is usually sufficient to reproduce the same patterns as those obtained with the full dataset.
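
One rough way to check this on your own data is to compare the pairwise distances computed from the full table with those computed from the reduced table. The sketch below uses Bray-Curtis dissimilarity and a plain Pearson correlation between the two sets of distances. Both choices, the helper names, and the toy counts are assumptions for illustration; this is not a procedure taken from MicrobiomeAnalyst.

```python
import math
from itertools import combinations

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0

def pairwise_distances(samples):
    """Distances for every unordered pair of samples, in a fixed order."""
    return [bray_curtis(samples[i], samples[j])
            for i, j in combinations(range(len(samples)), 2)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Toy data with made-up counts: rows are samples, columns are features,
# with the two most abundant features placed first.
full = [
    [50, 30, 2, 1, 0],
    [45, 35, 0, 2, 1],
    [10, 70, 1, 0, 3],
    [12, 65, 3, 1, 0],
]
# Hypothetical reduced table: keep only the two most abundant features.
reduced = [row[:2] for row in full]

r = pearson(pairwise_distances(full), pairwise_distances(reduced))
print(f"correlation between full and reduced distances: {r:.3f}")
```

A correlation close to 1 suggests the reduced table preserves the between-sample structure that PCoA would display. A formal comparison would use a Mantel test or Procrustes analysis instead of a raw correlation.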