Caution is needed with this practice.
- It is OK to use features selected based on background biological knowledge (domain knowledge);
- It is OK to use features selected based on overall data characteristics (such as abundance level, variance, etc.).
In MetaboAnalyst, one can use non-specific filtering (i.e., filtering that does not use the class labels) to select features. We recently introduced the Data filter function under the Data Process category, which lets users remove low-quality features using a variety of criteria. Users can also remove sample outliers, if any exist, using the Data editor. These are safe procedures that can potentially improve classification performance.
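As an illustration of non-specific filtering, the sketch below keeps the most variable features without ever looking at the class labels. It is only a minimal example: the function name `filter_by_variance`, the `keep_fraction` parameter, and the toy data are hypothetical, variance is just one possible criterion, and this is not MetaboAnalyst's own implementation.

```python
from statistics import pvariance


def filter_by_variance(rows, keep_fraction=0.5):
    """Keep the most variable features; class labels are never consulted.

    rows: list of samples, each a list of feature values.
    Returns the sorted column indices of the retained features.
    """
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    n_keep = max(1, round(n_features * keep_fraction))
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data (hypothetical values): 4 samples x 4 features.
data = [
    [1.0, 10.0, 5.0, 0.1],
    [1.0, 20.0, 5.1, 0.9],
    [1.0, 30.0, 4.9, 0.2],
    [1.0, 40.0, 5.0, 0.8],
]
kept = filter_by_variance(data, keep_fraction=0.5)
print(kept)  # the constant column 0 and near-constant column 2 are dropped
```

Because the filter depends only on the feature values, applying it to the whole dataset before cross-validation does not leak label information.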
- It is not appropriate to use features selected from the whole dataset with supervised methods (methods that use the class label information, such as t-tests, PLS-DA, or other supervised classification methods). Because the held-out samples have already influenced which features were chosen, such selection introduces selection bias and yields overly optimistic cross-validation performance due to “information leak”. Please refer to the paper by Ambroise C and McLachlan GJ for a more detailed discussion. To obtain an objective performance evaluation, one should include the feature selection procedure within the cross-validation. Alternatively, one can evaluate the classifier using an independent dataset that was not used for feature selection.
In the MetaboAnalyst Biomarker Analysis module (Multivariate Exploratory ROC curve analysis path), we have implemented feature selection embedded within the cross-validation. This is further enhanced with repeated, balanced subsampling to deal with the relatively small sample sizes and unbalanced classes that are common in clinical biomarker analysis.
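The difference between selecting features on the whole dataset and selecting them inside each cross-validation fold can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not MetaboAnalyst's procedure: it uses pure-noise data, a plain 5-fold split instead of repeated balanced subsampling, a difference-of-class-means ranking in place of a t-test, and a nearest-centroid classifier. All function names (`top_k_features`, `centroid_predict`, `cv_accuracy`) are hypothetical.

```python
import random
from statistics import mean


def top_k_features(X, y, rows, k):
    """Rank features by absolute difference of class means, using only `rows`."""
    def score(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_predict(X, y, train, feats, i):
    """Nearest-centroid prediction for sample i, fitted on `train` only."""
    dists = []
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        centroid = [mean(X[r][j] for r in members) for j in feats]
        dists.append(sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroid)))
    return 0 if dists[0] <= dists[1] else 1


def cv_accuracy(X, y, k, n_folds, select_inside):
    """k-fold CV accuracy with selection inside each fold or on all data."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    all_rows = list(range(n))
    # Selection on all samples sees the held-out labels: this is the leak.
    leaked = top_k_features(X, y, all_rows, k)
    correct = 0
    for test in folds:
        train = [i for i in all_rows if i not in test]
        feats = top_k_features(X, y, train, k) if select_inside else leaked
        correct += sum(centroid_predict(X, y, train, feats, i) == y[i] for i in test)
    return correct / n


rng = random.Random(0)
n_samples, n_features = 30, 300
# Pure noise: the labels carry no real relationship to the features.
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True)
print(f"selection outside CV: {biased:.2f}; selection inside CV: {honest:.2f}")
```

Since the data contain no real signal, an unbiased estimate should sit near chance (0.5). Selecting features outside the cross-validation typically reports an accuracy well above that, which is the optimistic bias described above; selecting inside each fold removes it.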