How to interpret the Prob. Overview in biomarker analysis?

Qiang · June 12, 2022, 7:58pm

In multivariate ROC curve analysis, the prediction results from PLS-DA, SVM, and Random Forests are presented as probabilities in the range [0, 1]. Because MetaboAnalyst uses balanced random subsampling, the cutoff point is always 0.5.

The prediction overview shows the predicted class probability (x-axis) of each sample (y-axis). Each probability score is the average over the 50 iterations and ranges from 0 to 1. For instance, a sample with a score below 0.5 is assigned to group A, and a sample with a score above 0.5 is assigned to group B. In theory, a sample could lie exactly on the 0.5 line, which means it was never selected for testing during the iterations.
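To make the averaging concrete, here is a minimal R sketch. It is not MetaboAnalyst's actual code: the matrix `pred.mat`, its dimensions, and the use of `NA` for "not in the test set this iteration" are all assumptions for illustration. It only shows how a per-sample average and the 0.5 cutoff could be computed, and why a sample that was never tested could end up at 0.5.

```r
# Hypothetical data: 6 samples x 50 iterations of predicted probabilities.
# NA marks iterations in which a sample was not part of the test set.
set.seed(1)
n.samples <- 6
n.iter <- 50
pred.mat <- matrix(runif(n.samples * n.iter), nrow = n.samples)
pred.mat[sample(length(pred.mat), 100)] <- NA  # tested only some of the time
pred.mat[6, ] <- NA                            # sample 6 was never tested

# Average each sample's probability over the iterations where it was tested.
prob.vec <- rowMeans(pred.mat, na.rm = TRUE)

# rowMeans() gives NaN for an all-NA row; assume such samples default to 0.5.
prob.vec[is.nan(prob.vec)] <- 0.5

# Apply the 0.5 cutoff: below -> group A, above -> group B, exactly 0.5 -> NA.
pred.class <- ifelse(prob.vec < 0.5, "A",
                     ifelse(prob.vec > 0.5, "B", NA))
```

In this sketch, sample 6 sits exactly on the 0.5 line and gets no class call, which matches the "never selected for testing" case described above.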

Users can also use the Pred. Overview to identify potential outliers. For instance, if a sample is consistently predicted with a high probability of belonging to the wrong group, the sample may have been labeled incorrectly. Users can check “Label samples classified to the wrong groups” to identify these potential outliers. An example output is shown below.

venturini.gabriela · July 21, 2022, 5:55pm

Hi! Thank you for the explanation. I still have one question: what does the y-axis stand for? Based on what information are the samples distributed along the y-axis? What is the difference between two samples with the same probability, one at -2 and the other at 1 on the y-axis?

Thank you!
Best

jeff.xia · July 25, 2022, 12:04pm

The Y values are just random values that help separate the samples so that users can better view the individual probability scores. They carry no information about the samples. If you know R, the R code is below:

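y <- rnorm(length(prob.vec))       # one random y-value per sample; prob.vec holds the x-values (all probabilities)
max.y <- max(abs(y))               # largest absolute y-value drawn
ylim <- max.y * c(-1.05, 1.05)     # symmetric y-range with 5% padding; larger sample sizes tend to be more spread out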
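
For anyone who wants to try this out, the snippet above can be extended into a small runnable sketch. The probabilities in `prob.vec`, the labels in `true.class`, and the plotting choices are made up for illustration and are not taken from MetaboAnalyst. The sketch also marks samples that fall on the wrong side of the 0.5 cutoff, which is roughly what the "Label samples classified to the wrong groups" option is described as doing.

```r
set.seed(42)

# Hypothetical averaged probabilities and true group labels for 7 samples.
prob.vec <- c(0.05, 0.12, 0.31, 0.48, 0.62, 0.77, 0.93)
true.class <- c("A", "A", "B", "A", "B", "B", "A")

# Random y-values: purely for visual separation, no meaning of their own.
y <- rnorm(length(prob.vec))
max.y <- max(abs(y))
ylim <- max.y * c(-1.05, 1.05)

# A sample is "wrong" if its side of the cutoff disagrees with its label
# (below 0.5 -> group A, above 0.5 -> group B).
wrong <- (prob.vec < 0.5) != (true.class == "A")

plot(prob.vec, y, xlim = c(0, 1), ylim = ylim, pch = 19,
     col = ifelse(true.class == "A", "blue", "red"),
     xlab = "Predicted class probability",
     ylab = "Samples (random y-values)")
abline(v = 0.5, lty = 2)  # the fixed 0.5 cutoff

# Label the samples predicted into the wrong group with their index.
text(prob.vec[wrong], y[wrong], labels = which(wrong), pos = 3)
```

Rerunning with a different seed moves the points up and down but never left or right, so two samples with the same probability at different heights are equivalent.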