Missing values in proteomics data arise from two distinct mechanisms, and ProteoAnalyst lets you choose accordingly:
Missing Not At Random (MNAR) — values missing because the protein abundance is below the detection limit:
- LoD Replace (simple): Replaces every missing value with the minimum observed value in the dataset. Fast, but all imputed values are identical
- MinDet (deterministic): Replaces missing values with a low percentile (the 1st percentile) of the observed values in each sample
- MinProb (stochastic): Draws random values from a Gaussian distribution centered on a low quantile of the observed values, so imputed values vary instead of sharing one constant
- QRILC (left-censored): Uses quantile regression to impute from a truncated distribution, explicitly modeling the left-censored nature of the data
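The difference between the deterministic and stochastic approaches above can be sketched in a few lines of NumPy. This is a minimal illustration, not ProteoAnalyst's implementation: the function names `min_det` and `min_prob`, the layout (proteins in rows, samples in columns, log-scale intensities), and the choice of spread for the Gaussian (median of the per-protein standard deviations) are all assumptions made for the example.

```python
import numpy as np

def min_det(X, q=0.01):
    """Replace NaNs in each sample (column) with that sample's q-th quantile."""
    X = np.array(X, dtype=float)                  # work on a copy
    floor = np.nanquantile(X, q, axis=0)          # one low value per sample
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = floor[cols]
    return X

def min_prob(X, q=0.01, scale=1.0, seed=0):
    """Replace NaNs with Gaussian draws centered on each sample's q-th quantile."""
    X = np.array(X, dtype=float)                  # work on a copy
    rng = np.random.default_rng(seed)
    center = np.nanquantile(X, q, axis=0)
    # Assumed spread: median of the per-protein standard deviations.
    sd = np.nanmedian(np.nanstd(X, axis=1, ddof=1)) * scale
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = rng.normal(center[cols], sd)
    return X

# Hypothetical log2 intensities: 4 proteins x 3 samples.
intensities = np.array([
    [20.0, 21.0, 20.5],
    [18.0, np.nan, 18.4],
    [25.0, 24.6, 25.2],
    [np.nan, 16.0, 16.3],
])

print(min_det(intensities))    # same low value for every gap in a sample
print(min_prob(intensities))   # gaps filled with values scattered around it
```

With MinDet, every gap in a given sample receives the same number; with MinProb, repeated gaps in the same sample receive different draws, which is what keeps some variability in the imputed data.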
Missing At Random (MAR) — values missing due to technical reasons unrelated to abundance:
- Feature Mean/Median/Min Replace: Simple replacement with the feature’s mean, median, or minimum value
- KNN by Feature: K-nearest neighbors imputation using similar features (proteins)
- KNN by Sample: K-nearest neighbors imputation using similar samples
- SeqKNN: Sequential KNN, which imputes the proteins with the fewest missing values first and reuses the imputed values as the data become more complete
- PPCA: Probabilistic Principal Component Analysis
- BPCA: Bayesian Principal Component Analysis — robust for small sample sizes
- SVD Impute: Singular Value Decomposition-based imputation
- ImpSeq: Sequential imputation using iterative regression
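To make the idea behind KNN by Feature concrete, here is a minimal sketch in NumPy. It is an illustration under stated assumptions, not the routine ProteoAnalyst runs: the function name `knn_impute_by_feature` is hypothetical, proteins are assumed to be in rows and samples in columns, similarity is the root-mean-square difference over the samples two proteins share, and the fill value is the plain mean of the k nearest donors.

```python
import numpy as np

def knn_impute_by_feature(X, k=2):
    """Fill each gap with the mean of that sample's values in the k most
    similar proteins (rows) that were observed in that sample."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        # Rank all other proteins by distance over the samples both observe.
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.sum() == 0:
                continue                          # nothing to compare on
            d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            dists.append((d, j))
        dists.sort()
        for c in np.where(np.isnan(X[i]))[0]:
            # Take the k nearest proteins that have a value in sample c.
            donors = [X[j, c] for _, j in dists if not np.isnan(X[j, c])][:k]
            if donors:
                out[i, c] = np.mean(donors)
    return out

# Hypothetical log2 intensities: 4 proteins x 4 samples.
intensities = np.array([
    [10.0, 11.0, 12.0, 13.0],
    [10.1, 11.1, np.nan, 13.1],
    [10.2, 10.9, 12.2, 12.9],
    [20.0, 21.0, 22.0, 23.0],
])

imputed = knn_impute_by_feature(intensities, k=2)
print(imputed[1, 2])   # mean of the two most similar proteins in sample 3
```

KNN by Sample follows the same logic with the roles of rows and columns swapped, which in this sketch amounts to passing the transposed matrix and transposing the result back.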
How to choose:
Start with the nature of your missing data. In label-free proteomics, missing values are often predominantly MNAR, because low-abundance proteins fall below the detection limit. MinProb is a widely recommended default for MNAR data because its random draws keep some variability among the imputed values.
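One rough way to check whether missingness in a dataset looks abundance-dependent is to compare, per protein, the fraction of missing values with the mean observed intensity. The sketch below is a heuristic only, not a formal test and not a ProteoAnalyst feature; the function name `missingness_vs_abundance` and the example matrix are hypothetical.

```python
import numpy as np

def missingness_vs_abundance(X):
    """Pearson correlation between each protein's missing fraction and its
    mean observed intensity (proteins in rows, samples in columns)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    keep = observed.any(axis=1)                   # skip fully missing proteins
    miss_frac = 1.0 - observed[keep].mean(axis=1)
    mean_abund = np.nanmean(X[keep], axis=1)
    return float(np.corrcoef(miss_frac, mean_abund)[0, 1])

# Hypothetical log2 intensities: gaps become more frequent as abundance drops.
intensities = np.array([
    [25.0, 25.2, 24.8, 25.1],
    [22.0, 21.8, np.nan, 22.1],
    [19.0, np.nan, np.nan, 18.7],
    [np.nan, 16.2, np.nan, np.nan],
])

r = missingness_vs_abundance(intensities)
print(f"correlation between missing fraction and abundance: {r:.2f}")
```

A strongly negative correlation is consistent with MNAR-dominated missingness, while a value near zero suggests that gaps are spread across the abundance range and an MAR method may be more appropriate.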
If your data contain a mix of MNAR and MAR missingness (e.g., from inconsistent peptide identification), consider KNN by Feature for the MAR component. For small datasets (fewer than 10 samples), BPCA tends to perform well. Avoid simple replacement methods (mean, min) for large-scale studies: filling many gaps with a single constant shrinks the variance of each protein and can bias downstream statistical tests.
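The effect of constant replacement on variance is easy to see on a single protein. The numbers below are made up for illustration; the example fills two gaps with the mean of the observed values and compares the sample variance before and after.

```python
import numpy as np

# Hypothetical log2 intensities for one protein across six samples.
x = np.array([10.0, 12.0, np.nan, 14.0, np.nan, 16.0])

# Sample variance of the four observed values.
var_observed = np.nanvar(x, ddof=1)

# Mean replacement: every gap receives the observed mean (13.0).
filled = np.where(np.isnan(x), np.nanmean(x), x)
var_filled = np.var(filled, ddof=1)

print(f"variance of observed values: {var_observed:.2f}")   # 6.67
print(f"variance after mean fill:    {var_filled:.2f}")     # 4.00
```

The imputed points sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the sample count. The estimated variance drops, and test statistics that divide by it become too large.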