The calculation of VIP score in PLS-DA module

boriskong · December 19, 2023, 8:14am

Dear MetaboAnalyst team,

The VIP scores calculated by the MetaboAnalyst PLS-DA module differ from those obtained using other calculation modules, such as scikit-learn in Python and mixOmics in R. While the scores from scikit-learn and mixOmics are consistent, the MetaboAnalyst scores highlight influential features more prominently. Are your calculations different from those of other modules?

Here is a test dataset with 6 samples and 8 features, and the feature XXXX is manually added with a dramatic difference between classes T and C.

sample T1 T2 T3 C1 C2 C3
class T T T C C C
ALBU 8.27E9 1.08E10 4.63E9 1.45E10 2.37E10 9.37E9
TRFE 9.11E8 8.53E8 5.34E8 5.57E8 1.51E9 1.06E9
CO3 1.53E8 1.05E8 1.18E8 2.95E8 2.75E8 8.10E7
IGG1 2.08E9 1.06E9 2.22E9 2.61E8 2.66E8 1.87E9
A2MG 4.63E7 3.45E7 1.54E8 1.03E8 1.50E8 1.18E8
XXXX 1.00E5 8.00E4 9.00E4 8.00E8 1.00E9 9.00E7
FIBB 2.81E8 1.89E8 2.30E8 2.48E8 2.93E8 3.60E8
CO4B 1.77E5 3.97E5 1.36E5 2.19E6 1.15E6 3.51E5

The VIP scores calculated by scikit-learn/mixOmics and MetaboAnalyst are shown below. MetaboAnalyst highlights the feature XXXX with a VIP score of 2.74, while scikit-learn/mixOmics assigns a score of only 1.24. Additionally, the order of the features is different.
sklearn/mixOmics MetaboAnalyst
XXXX 1.242534 2.7451
CO4B 1.120459 0.38577
ALBU 1.086513 0.10519
FIBB 1.029846 0.041263
IGG1 1.001192 0.53945
CO3 0.907570 0.013373
A2MG 0.808401 0.099583
TRFE 0.692824 0.040896

Scikit-learn (v1.3.2) python script

import numpy as np
from sklearn.cross_decomposition import PLSRegression, PLSCanonical
import pandas as pd

def vip_PLSR(model: PLSRegression):
t = model.x_scores_
w = model.x_rotations_
q = model.y_loadings_
p, h = w.shape
vips = np.zeros((p,))
s = np.diag(t.T @ t @ q.T @ q).reshape(h, -1)
total_s = np.sum(s)
for i in range(p):
weight = np.array([ (w[i,j] / np.linalg.norm(w[:,j]))*2 for j in range(h) ])
vips[i] = np.sqrt(p(s.T @ weight)/total_s)
return vips

def main():
df = pd.read_csv(‘demodata.tsv’, sep=‘\t’, index_col=0)
X = np.array(df.T)
Y = np.array([0,0,0,1,1,1])

n_comp = 1
pls = PLSRegression(n_components=n_comp)
pls.fit(X, Y)

vips = vip_PLSR(pls)
coef = pls.coef_[0]

for i, feature in enumerate(df.index):
    print(feature, vips[i], coef[i])

if name == ‘main’:
main()

MixOmics R script:

if (!require(“BiocManager”, quietly = TRUE))
install.packages(“BiocManager”)
BiocManager::install(“mixOmics”)
library(“mixOmics”)

data ← read.table(“demodata.tsv”, header = TRUE, row.names = 1, sep = “\t”)
features ← rownames(data)

data ← as.data.frame(sapply(data, function(x) as.numeric(as.character(x))))
X ← t(data.matrix(data))

response_variable ← factor(c(rep(0, 3), rep(1, 3)))
Y = as.numeric(response_variable)

plsda_model ← mixOmics::plsda(X, Y, ncomp = 5)

vip_scores ← vip(plsda_model)
rownames(vip_scores) ← features
print(vip_scores)