Can I perform PCA using all features (i.e. not filtered to a max. of 5000)?

For statistical analysis of untargeted metabolomics data, MetaboAnalyst provides a data filtering step to help reduce noise and improve signal. Even if you choose the “None” option, only the top 5000 features (based on the chosen ranking method) will be used. This number is generally sufficient for current metabolomics.

PCA is an excellent visualization technique for getting an overview of the data. It is driven by the abundant features / peaks. We find that using the top 25% most abundant features gives almost the same patterns as using 100% of the features. If your data contains 8000 features (typical in LC-HRMS untargeted metabolomics) and you use the top 5000 features (ranked by mean/median intensity - the last two options in the Data Filtering page), the PCA results will be practically identical to the results based on all 8000 features. This is because, in such high-dimensional data, the remaining 3000 low-abundance features have coefficients (loadings) that are too small (essentially zero) to have any effect on the PCA results.
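If you want to check this on your own data, the comparison can be sketched in a few lines of Python with NumPy. This is only an illustration, not MetaboAnalyst's own code: the function names (`pca_scores`, `compare_full_vs_top`, `make_toy_data`) and the simulated data are hypothetical, and it assumes PCA on mean-centered, unscaled intensities with features ranked by mean intensity. With a different normalization or scaling (for example autoscaling), low-abundance features carry more weight and the agreement may be weaker.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Return PCA scores of mean-centered (unscaled) data via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def compare_full_vs_top(X, n_keep=5000, n_components=2):
    """Absolute correlation, per PC, between scores computed from all
    features and scores computed from the n_keep most abundant features."""
    order = np.argsort(X.mean(axis=0))[::-1]  # rank by mean intensity, descending
    top = X[:, order[:n_keep]]
    full_scores = pca_scores(X, n_components)
    top_scores = pca_scores(top, n_components)
    # The sign of a principal component is arbitrary, so compare |r|.
    return [
        abs(np.corrcoef(full_scores[:, k], top_scores[:, k])[0, 1])
        for k in range(n_components)
    ]


def make_toy_data(n_samples=40, n_features=8000, seed=0):
    """Simulated (hypothetical) peak table: two sample groups, log-normal
    feature abundances, and noise proportional to abundance."""
    rng = np.random.default_rng(seed)
    scale = np.exp(rng.normal(0.0, 1.0, n_features))       # per-feature abundance
    group = np.repeat([0.0, 1.0], n_samples // 2)          # two equal-sized groups
    responsive = (rng.random(n_features) < 0.10).astype(float)
    effect = 0.8 * np.outer(group, responsive)             # group difference in ~10% of features
    noise = 0.2 * rng.normal(size=(n_samples, n_features))
    return scale * (1.0 + effect + noise)


if __name__ == "__main__":
    X = make_toy_data()
    for k, r in enumerate(compare_full_vs_top(X), start=1):
        print(f"PC{k}: |r| between full and top-5000 scores = {r:.4f}")
```

To try it on real data, replace `make_toy_data()` with your own samples-by-features intensity matrix. A correlation close to 1 for the leading components means the score plot would look essentially the same with or without the low-abundance features.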
