What are the missing value imputation options in ProteoAnalyst and how should I choose?

xia.lab · March 23, 2026, 4:15am

Missing values in proteomics data arise from two distinct mechanisms, and ProteoAnalyst lets you choose accordingly:

Missing Not At Random (MNAR) — values missing because the protein abundance is below the detection limit:

LoD Replace (simple): Replaces missing values with the minimum observed value in the dataset — fast and simple
MinDet (deterministic): Replaces missing values with a low percentile (the 1st percentile) of the observed values in each sample
MinProb (stochastic): Draws random values from a left-shifted Gaussian distribution, adding realistic variability
QRILC (left-censored): Uses quantile regression to impute from a truncated distribution, explicitly modeling the left-censored nature of the data
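
To make the difference between the deterministic and stochastic options concrete, here is a minimal sketch of MinDet-style and MinProb-style imputation for a single sample. It is not ProteoAnalyst's implementation: the function names, the `None`-for-missing convention, the linear-interpolation quantile, and the use of the sample's own standard deviation as the Gaussian spread are all assumptions made for illustration.

```python
import random
import statistics


def quantile(values, q):
    """Quantile of a non-empty list via linear interpolation (0 <= q <= 1)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def min_det(sample, q=0.01):
    """MinDet-style: fill every missing value with the sample's q-th quantile."""
    observed = [v for v in sample if v is not None]
    fill = quantile(observed, q)
    return [fill if v is None else v for v in sample]


def min_prob(sample, q=0.01, sd_scale=1.0, rng=None):
    """MinProb-style: draw each missing value from a Gaussian centered on the
    sample's q-th quantile. Assumption: the spread is the standard deviation of
    the observed values times sd_scale (needs at least two observed values)."""
    rng = rng or random.Random(0)
    observed = [v for v in sample if v is not None]
    center = quantile(observed, q)
    sd = statistics.stdev(observed) * sd_scale
    return [rng.gauss(center, sd) if v is None else v for v in sample]


# Hypothetical log2 intensities for one sample; None marks a missing protein.
sample = [20.0, 22.0, None, 25.0, None, 30.0]
print(min_det(sample))   # both gaps get the same low constant
print(min_prob(sample))  # each gap gets its own random low value
```

The deterministic version gives every missing protein in a sample the same value, while the stochastic version spreads the imputed values around that low point, which is the "realistic variability" mentioned above.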

Missing At Random (MAR) — values missing due to technical reasons unrelated to abundance:

Feature Mean/Median/Min Replace: Simple replacement with the feature’s mean, median, or minimum value
KNN by Feature: K-nearest neighbors imputation using similar features (proteins)
KNN by Sample: K-nearest neighbors imputation using similar samples
SeqKNN: Sequential KNN, which iteratively imputes using increasingly complete data
PPCA: Probabilistic Principal Component Analysis
BPCA: Bayesian Principal Component Analysis — robust for small sample sizes
SVD Impute: Singular Value Decomposition-based imputation
ImpSeq: Sequential imputation using iterative regression
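
As an illustration of the KNN by Feature idea, the sketch below fills each missing value with the average of the k most similar proteins that were observed in the same sample. It is a simplified stand-in, not the algorithm ProteoAnalyst runs: the function name, the root-mean-square distance over shared samples, and the default `k=2` are assumptions.

```python
import math


def knn_impute_by_feature(matrix, k=2):
    """matrix: list of rows (proteins) by columns (samples); None = missing.
    Returns a copy with gaps filled from the k nearest proteins."""

    def distance(a, b):
        # Root-mean-square difference over samples observed in both proteins.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors must have an observed value in sample j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            neighbors = [v for d, v in candidates[:k] if math.isfinite(d)]
            if neighbors:  # a gap stays None if no usable neighbor exists
                result[i][j] = sum(neighbors) / len(neighbors)
    return result


# Hypothetical log2 intensities: 4 proteins x 3 samples.
data = [
    [10.0, 11.0, None],
    [10.0, 11.0, 12.0],
    [10.2, 11.2, 12.4],
    [20.0, 21.0, 22.0],
]
print(knn_impute_by_feature(data)[0])
```

KNN by Sample follows the same logic with the matrix transposed, so neighbors are similar samples rather than similar proteins.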

How to choose:

First, consider the nature of your missing data. In label-free proteomics, most missing values are MNAR: low-abundance proteins fall below the detection limit and are not quantified. MinProb is a widely recommended default for MNAR data because it fills in low values while adding realistic stochastic variation.
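
One rough way to check whether missingness in your own data looks abundance-dependent is to compare each protein's mean observed intensity with its fraction of missing values. The sketch below is a heuristic under stated assumptions, not a ProteoAnalyst feature: the function names are hypothetical, missing values are encoded as `None`, and a strongly negative correlation is only suggestive of MNAR, not proof.

```python
import math


def missingness_vs_abundance(matrix):
    """Return (mean observed intensity, missing fraction) for each protein row."""
    pairs = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            continue  # fully missing proteins have no mean to compare
        pairs.append((sum(observed) / len(observed), 1 - len(observed) / len(row)))
    return pairs


def pearson(pairs):
    """Pearson correlation of (x, y) pairs; assumes both x and y vary."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(vx * vy)


# Hypothetical log2 intensities: 4 proteins x 4 samples.
data = [
    [30.0, 31.0, 30.0, 29.0],
    [25.0, None, 26.0, 25.0],
    [21.0, None, None, 20.0],
    [18.0, None, None, None],
]
r = pearson(missingness_vs_abundance(data))
print(f"correlation between abundance and missing fraction: {r:.2f}")
```

If the correlation is close to zero, missing values are spread across the abundance range and a MAR-oriented method may be the better fit.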

If your data has a mix of MNAR and MAR missingness (e.g., from inconsistent peptide identification), consider KNN by Feature for the MAR component. For small datasets (<10 samples), BPCA tends to perform well. Avoid simple replacement methods (mean, min) for large-scale studies, as they can distort variance and bias downstream statistical tests.