What are the missing value imputation options in ProteoAnalyst and how should I choose?

Missing values in proteomics data arise from two distinct mechanisms, and ProteoAnalyst groups its imputation methods by the mechanism each one is designed for:

Missing Not At Random (MNAR) — values missing because the protein abundance is below the detection limit:

  • LoD Replace (simple): Replaces missing values with the minimum observed value in the dataset; fast, but every missing value receives the same constant
  • MinDet (deterministic): Replaces with a low percentile (1st percentile) of observed values per sample
  • MinProb (stochastic): Draws random values from a left-shifted Gaussian distribution, adding realistic variability
  • QRILC (left-censored): Uses quantile regression to impute from a truncated distribution, explicitly modeling the left-censored nature of the data
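
To make the difference between the deterministic and stochastic options concrete, here is a minimal sketch of MinDet-style and MinProb-style imputation for a single sample, written in plain Python. It is not ProteoAnalyst's implementation: the function names, the `sd_scale` parameter (which narrows the Gaussian), and the use of the sample's own standard deviation are assumptions made for illustration, and intensities are assumed to be log-transformed with missing values stored as `None`.

```python
import math
import random
import statistics


def percentile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a list of numbers."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def min_det(sample, q=0.01):
    """MinDet-style: fill every gap with one low percentile of the sample."""
    observed = [v for v in sample if v is not None]
    fill = percentile(observed, q)
    return [fill if v is None else v for v in sample]


def min_prob(sample, q=0.01, sd_scale=0.3, rng=None):
    """MinProb-style: draw each gap from a Gaussian centred on a low percentile.

    Assumption: the spread is sd_scale times the sample's observed standard
    deviation; real implementations may estimate it differently.
    """
    rng = rng or random.Random()
    observed = [v for v in sample if v is not None]
    center = percentile(observed, q)
    sd = statistics.stdev(observed) * sd_scale
    return [rng.gauss(center, sd) if v is None else v for v in sample]


# Hypothetical log2 intensities for one sample; None marks a missing value.
sample = [20.0, None, 22.0, 24.0, None, 26.0]
print(min_det(sample))                         # both gaps get the same value
print(min_prob(sample, rng=random.Random(0)))  # gaps get different draws
```

With MinDet both gaps receive an identical value, whereas MinProb gives each gap its own draw, which is the "realistic variability" referred to above.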

Missing At Random (MAR) — values missing due to technical reasons unrelated to abundance:

  • Feature Mean/Median/Min Replace: Simple replacement with the feature’s mean, median, or minimum value
  • KNN by Feature: K-nearest neighbors imputation using similar features (proteins)
  • KNN by Sample: K-nearest neighbors imputation using similar samples
  • SeqKNN: Sequential KNN, which iteratively imputes using increasingly complete data
  • PPCA: Probabilistic Principal Component Analysis
  • BPCA: Bayesian Principal Component Analysis — robust for small sample sizes
  • SVD Impute: Singular Value Decomposition-based imputation
  • ImpSeq: Sequential imputation using iterative regression
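
As an illustration of the neighbor-based idea behind KNN by Feature, the following sketch fills each missing value with the average of the k most similar proteins that were measured in the same sample. It is a simplified, dependency-free version under stated assumptions: the function name, the root-mean-square distance over shared samples, and the unweighted average are choices made here, not a description of ProteoAnalyst's internals.

```python
import math


def knn_impute_by_feature(matrix, k=2):
    """Impute gaps row by row; rows are proteins, columns are samples.

    matrix: list of lists holding floats, with None for missing values.
    Returns a new matrix; the input is left unchanged.
    """
    def distance(a, b):
        # Compare two proteins only on samples where both were observed.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            neighbours = [v for d, v in candidates[:k] if math.isfinite(d)]
            if neighbours:  # leave the gap if no usable neighbour exists
                imputed[i][j] = sum(neighbours) / len(neighbours)
    return imputed


# Hypothetical log2 intensities: 4 proteins x 3 samples.
data = [
    [10.0, 11.0, None],
    [10.1, 11.1, 12.1],
    [9.9, 10.9, 11.9],
    [20.0, 21.0, 22.0],
]
result = knn_impute_by_feature(data, k=2)
print(result[0])
```

KNN by Sample follows the same logic with the roles of rows and columns swapped, so transposing the matrix before and after the call gives a rough equivalent.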

How to choose:

First, consider the nature of your missing data. In label-free proteomics, missing values are often predominantly MNAR, because low-abundance proteins fall below the detection limit. MinProb is a commonly used default for MNAR data: its random draws keep some variability among the imputed values instead of filling every gap with a single constant.

If your data has a mix of MNAR and MAR missingness (e.g., from inconsistent peptide identification), consider KNN by Feature for the MAR component. For small datasets (fewer than 10 samples), BPCA tends to perform well. Avoid simple replacement methods (mean, median, min) for large-scale studies: filling every gap in a feature with the same value shrinks its variance and can bias downstream statistical tests.
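
The rules of thumb above can be written down as a small lookup, which is handy for recording the choice in an analysis script. This is only a sketch of the guidance in this answer: `suggest_imputation` is a hypothetical helper, not a ProteoAnalyst function, and the fallback to KNN by Feature for MAR data with 10 or more samples is an assumption made here rather than a stated recommendation.

```python
def suggest_imputation(n_samples, missingness):
    """Return candidate methods for a dataset.

    n_samples:   number of samples in the dataset
    missingness: "MNAR", "MAR", or "mixed"
    """
    if missingness == "MNAR":
        return ["MinProb"]
    if missingness == "mixed":
        # MinProb for the MNAR part, KNN by Feature for the MAR part.
        return ["MinProb", "KNN by Feature"]
    if missingness == "MAR":
        # Assumption: KNN by Feature as the fallback for larger MAR datasets.
        return ["BPCA"] if n_samples < 10 else ["KNN by Feature"]
    raise ValueError(f"unknown missingness type: {missingness!r}")


print(suggest_imputation(6, "MAR"))
print(suggest_imputation(48, "mixed"))
```

Treat the output as a starting point; comparing two or three methods on your own data remains the more reliable check.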