What should I consider when selecting proteins manually (in Manual Biomarker Selection & Evaluation)?

When manually building a biomarker panel, you should prioritize proteins that provide a balance between strong statistical performance and clear biological significance.

  • Consult Feature Importance Rankings: Before making selections based solely on biological interest, check the rankings generated by the platform’s machine learning models, such as Random Forest Variable Importance or SVM recursive feature elimination. These rankings identify which proteins are the most statistically powerful “drivers” for distinguishing your experimental groups.
  • Avoid Redundancy: ProteoAnalyst computes k-means clustering as part of the feature ranking process to help you identify and reduce redundancy when building biomarker panels. High-performing features that belong to the same cluster often provide overlapping information, so consider keeping one representative per cluster rather than several top-ranked proteins from the same cluster.
  • Prioritize High Univariate AUC: Review the performance of individual proteins in the Univariate ROC analysis view. Selecting proteins that already exhibit a high Area Under the Curve (AUC) (e.g., > 0.80) generally results in a more robust and accurate multivariate panel.
  • Favor Parsimony: While it is tempting to include many proteins, smaller panels reduce the risk of overfitting and are easier to validate clinically or experimentally. Use the Predicted Accuracy plot to find the “elbow” point where adding more proteins no longer yields significant performance gains.
  • Leverage Biological Context: Manual selection is your opportunity to include “hub” proteins from relevant pathways identified in the WGCNA or PPI Network modules, even if they have slightly lower statistical scores. Combining biomarkers from different biological processes often provides more complementary information.
  • Data Completeness: Proteins with a high percentage of missing values can introduce noise. For the most reliable predictions, prioritize proteins that were consistently detected across the majority of samples in your original quantification matrix.
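
Three of the checks above (data completeness, univariate AUC, and redundancy) can be sketched in a few lines of Python. This is a minimal illustration, not ProteoAnalyst's own implementation. The function names, the toy data, and the 20% missing-value cutoff are assumptions made for the example, and the cluster labels are assumed to have been copied by hand from the feature ranking results. The 0.80 AUC threshold is the example value mentioned above.

```python
import math

def is_missing(v):
    """True for None or NaN, the two missing-value encodings assumed here."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def missing_fraction(values):
    """Fraction of samples in which the protein was not quantified."""
    return sum(1 for v in values if is_missing(v)) / len(values)

def univariate_auc(values, labels):
    """AUC from pairwise comparisons (Mann-Whitney), ties counted as 0.5.

    Missing values are skipped. The result is direction-agnostic: a protein
    that is lower in the positive group scores the same as one that is higher.
    """
    pos = [v for v, y in zip(values, labels) if y == 1 and not is_missing(v)]
    neg = [v for v, y in zip(values, labels) if y == 0 and not is_missing(v)]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1.0 - auc)

def shortlist(matrix, labels, clusters, max_missing=0.2, min_auc=0.80):
    """Keep the highest-AUC protein per cluster among those passing both filters."""
    best = {}
    for protein, values in matrix.items():
        if missing_fraction(values) > max_missing:
            continue  # too many missing values (assumed cutoff)
        auc = univariate_auc(values, labels)
        if math.isnan(auc) or auc < min_auc:
            continue  # weak univariate performance
        cluster = clusters[protein]
        if cluster not in best or auc > best[cluster][1]:
            best[cluster] = (protein, auc)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)

# Toy data: 3 samples from group 1 followed by 3 samples from group 0.
labels = [1, 1, 1, 0, 0, 0]
matrix = {
    "P1": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "P2": [5.0, 6.0, 2.5, 1.0, 2.0, 3.0],     # same cluster as P1, lower AUC
    "P3": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],     # lower in group 1, still informative
    "P4": [9.0, None, None, 1.0, None, 2.0],  # half the values are missing
}
clusters = {"P1": 0, "P2": 0, "P3": 1, "P4": 2}  # hypothetical cluster labels

panel = shortlist(matrix, labels, clusters)
print(panel)
```

A shortlist like this is only a starting point. Biologically motivated “hub” proteins that fall just below a threshold can still be added by hand, and the final panel size is best judged from the Predicted Accuracy plot.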