What is the meaning of Components in PLS-DA?

Robin · August 10, 2022, 7:02am

Hello everyone,
I have a dataset of concentrations from various metabolites taken from different parts of the digestive system of pigs.
I’d like to use the PLS-DA to see which metabolites are mostly influenced by the different parts of the digestive system. Unfortunately I don’t quite understand the meaning of the components in the PLS-DA charts.
I understand the PLS-DA as follows (In simple words):
In the PLS-DA the concentrations are taken as input for a model which tries to connect those to the different groups (=the different parts of the digestive system).
The metabolites get weighted via the VIP score based on how well they drive the separation of groups in the model.

My question is regarding the components of the following plot (example plot):

I’m having a hard time describing and evaluating the separation of these bubbles in B, as I don’t understand the meaning of the components (x-axis / y-axis). Also, for graph A, I don’t see how it helps me to know that the first component explains 33% of the variability in the three groups.
Are the components correlated to the VIP score (C.), so that component 1 is connected to the first metabolite of the vip score (Tryptophan)? Or are the components just variables of the model? If yes, what kind of information can I get from this graph for my metabolite analysis?
Can someone help me understand the “components”?

Thank you for your help!

jess.ewald · August 10, 2022, 4:14pm

It’s ok, PLS-DA is very useful, but also very difficult to understand! Here is my best attempt to explain the parts that are important for interpretation in simple terms:

Your dataset has some metabolites (e.g. metabolite A, B, C, D, E). There is some total amount of variability in your dataset (var(A) + var(B) + var(C) + var(D) + var(E)), with each variance being calculated from the concentrations across all samples.

If we have had a successful experiment, the metabolites with the largest variability should be the most interesting, i.e. the ones that differ between your experimental factors. We could make plot B with just two metabolites, e.g. x-axis = metabolite C conc, y-axis = metabolite E conc.

However, plotting component scores instead of individual metabolite concentrations can give a better overview because components capture information from many metabolites. A component score is just a set of coefficients, each multiplied by a metabolite concentration, all added together: coef1*A conc + coef2*B conc + coef3*C conc + … = component score. Inputting the concentrations from each sample produces one component score for each sample.
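To make the weighted sum concrete, here is a minimal Python sketch. The concentrations and coefficients are made up for illustration (they are not from Robin's data), and a real PLS-DA would normally centre and scale the concentrations before this step.

```python
# Hypothetical concentrations for three samples and three metabolites (A, B, C).
samples = {
    "sample_1": {"A": 1.0, "B": 2.0, "C": 0.5},
    "sample_2": {"A": 2.0, "B": 4.0, "C": 1.0},
    "sample_3": {"A": 0.0, "B": 4.0, "C": 0.0},
}

# Hypothetical coefficients for one component (one coefficient per metabolite).
coefs = {"A": 0.5, "B": -0.25, "C": 2.0}


def component_score(conc, coefs):
    """Weighted sum: coefficient * concentration, added over all metabolites."""
    return sum(coefs[m] * conc[m] for m in coefs)


# One score per sample; these are the values plotted on one axis of plot B.
scores = {name: component_score(conc, coefs) for name, conc in samples.items()}
print(scores)
```

Each sample ends up with a single number for this component; repeating the calculation with a second set of coefficients would give the second axis.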

In PLS-DA, we first find the set of coefficients that produces component scores that have the biggest difference between your experimental factors (this is component 1). The % variability is: var(component 1 scores)/total variability. Remember that total variability is: (var(A) + var(B) + var(C) + var(D) + var(E)). Then, we find the second set of coefficients that explains the second most variability between experimental factors to calculate component 2 scores, and so on for each component. Each component must be orthogonal to all the others, which just means that it must describe new patterns in the data that were not captured by the previous components.
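The "% variability" ratio described above can be sketched in a few lines of Python. This follows the simplified definition in this post, uses invented numbers for two metabolites, and assumes the coefficients are scaled to unit length so that the ratio stays between 0 and 1; software such as MetaboAnalyst may compute the reported percentage differently in detail.

```python
from statistics import variance

# Hypothetical concentrations of two metabolites across four samples.
conc_a = [1.0, 2.0, 3.0, 4.0]
conc_b = [2.0, 1.0, 4.0, 3.0]

# Hypothetical unit-length coefficients for component 1 (0.6^2 + 0.8^2 = 1).
coef_a, coef_b = 0.6, 0.8

# Component 1 score for each sample: coef_a * A conc + coef_b * B conc.
scores = [coef_a * a + coef_b * b for a, b in zip(conc_a, conc_b)]

# Total variability = var(A) + var(B); explained share = var(scores) / total.
total_variability = variance(conc_a) + variance(conc_b)
explained = variance(scores) / total_variability

print(f"Component 1 explains {explained:.1%} of the total variability")
```

With these made-up numbers the component captures roughly 79% of the total, which is the kind of figure shown on the axis labels of the scores plot.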

So for interpretation:

If there is no separation between sample bubbles, it means that there are no consistent patterns in the metabolites that explain differences between the treatment groups (sad!)
If there is separation, we can tell some other things.
First, sample groups that are closer together have more similar metabolic profiles to each other. E.g. if you had liver, small intestine, and large intestine, you would probably expect the two intestine groups to fall closer together on the plot compared to the liver.
Second, once we see the components that separate our samples, we can try to understand which individual metabolites are driving this through the VIP score. VIP scores are related to the coefficients for each component score. Metabolites with higher VIP had bigger coefficients, meaning they influenced that score the most.

A typical interpretation: Plots A and B show separation of sample groups, great, there are some consistent metabolic patterns! I see in component 1 that clusters 2 & 3 are similar to each other, and different from cluster 1. I wonder why? Which metabolites are driving this? I see from the VIP scores for component 1 that it is mainly 4 essential amino acids.

hanszamora · August 25, 2022, 2:21am

Hello jess.ewald

Nice answer! Now that Robin has brought up the interpretation of PLS-DA results, could you please help me?

How should graph D be interpreted? Since the separation between group 1 and groups 2 & 3 was in Component 1, do I only have to take into account the performance of Component 1 in graph D when validating the PLS-DA model?

Thank you for your help!

jess.ewald · August 25, 2022, 5:29pm

The header for plot D on MetaboAnalyst says “Select optimal number of components for classification” - in Robin’s plot D, there is a star for 2 components, meaning that a model built with only component 1 and 2 is best in some way for that dataset.

Usually in model selection, best means the least complicated model that still does a good job. As we add more complexity to a model, we run a higher risk of overfitting, so a common practice is to continue adding model terms (here, components) until we no longer see a substantial improvement in the model performance scores.
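The "keep adding components until the score stops improving" idea can be sketched as follows. The accuracy values, the `min_gain` threshold, and the function name are all hypothetical, and this is not necessarily the rule MetaboAnalyst uses to place its star; it only illustrates the general practice.

```python
def pick_n_components(scores, min_gain=0.05):
    """Return the smallest number of components after which adding one more
    component improves the performance score by less than min_gain.

    scores[0] is the score with 1 component, scores[1] with 2, and so on.
    """
    for n in range(1, len(scores)):
        if scores[n] - scores[n - 1] < min_gain:
            return n
    return len(scores)


# Hypothetical cross-validated accuracy for models with 1 to 5 components.
cv_accuracy = [0.70, 0.90, 0.91, 0.91, 0.90]

best_n = pick_n_components(cv_accuracy)
print(f"Suggested number of components: {best_n}")
```

Here the jump from 1 to 2 components is large, while the third adds almost nothing, so the sketch settles on 2 components.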

Looking at plot B, you can see why both component 1 and component 2 are important: component 1 shows separation between cluster 1 and clusters 2/3, while component 2 shows separation between cluster 2 and clusters 1/3. Only by using both are we able to do a good job of classifying all samples. Also, components after #2 do not add much value to the model, as we are already able to classify the majority of samples with just #1 and #2.