Does the order of your file upload affect the DEGs?

EmilyB · November 10, 2022, 1:08pm

Hi!
My PI and I were using the Xplorer, but when we uploaded the same data files (which were sorted differently) we got different DEG results… The main difference was the p-values, which were not the same.

For example, if you have samples A-H to upload, you will get different results depending on whether you sort them in your folder as A-H or H-A. The things that we noticed were different (not necessarily all results) were:

- The order in which the bars are displayed on the HKG graph (see attached for examples)
- The PCA plot (sometimes almost a mirror image, but not quite)
- The number of DEGs
- The FC values were all the same, but some of the p-values were different

Please resolve

Thank you.

jess.ewald · November 23, 2022, 4:33pm

Yes, these are expected differences based on the sample order.

The bars on some plots are impacted by the sample order, as the groups are ordered based on their order in the datasets.

For PCA, the algorithm in base R is partly sequential, so the calculations vary slightly depending on the order of the samples. This is outside of our control, and is always the case for PCA in R. The overall variability structure should be extremely similar, but the view might be rotated or mirrored.
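To illustrate why a mirrored PCA plot is not a different result: a principal component is only defined up to its sign, so an axis and its negative describe exactly the same variance. The sketch below is plain Python, not the R code that EcoToxXplorer runs, and the toy data and function names (`samples`, `covariance_2d`, `pc1_scores`) are made up for the example.

```python
import math

# Hypothetical toy data: 5 samples measured on 2 genes.
samples = [(2.0, 1.9), (0.5, 0.7), (3.1, 2.8), (1.2, 1.5), (4.0, 4.2)]

def covariance_2d(points):
    """Return the column means and the entries of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    cyy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (cxx, cxy, cyy)

def pc1_scores(points, flip=False):
    """Project centered points onto the first principal axis (closed form for 2D)."""
    (mx, my), (cxx, cxy, cyy) = covariance_2d(points)
    # Angle of the direction of maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    vx, vy = math.cos(theta), math.sin(theta)
    if flip:
        # The opposite direction is an equally valid principal axis.
        vx, vy = -vx, -vy
    return [(x - mx) * vx + (y - my) * vy for x, y in points]

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

scores = pc1_scores(samples)
mirrored = pc1_scores(samples, flip=True)

# Same variance explained, same distances between samples, opposite orientation.
print(sample_variance(scores), sample_variance(mirrored))
print(scores[0], mirrored[0])
```

Both sets of scores carry the same information; only the orientation of the axis differs, which is why the plot can look flipped while the grouping of samples stays the same.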

The differences in DEGs relate to missing value imputation. You can read our other posts for a description of how missing value imputation works, but there is a stochastic component to the process (R does some random sampling from a distribution to choose appropriate values). We have the randomness fixed in EcoToxXplorer, meaning that whenever the same scenario is encountered, R will choose the exact same sequence of random numbers. However, when the sample order is different, the same random numbers are applied to different samples, so the imputed values are slightly different. This could cause some genes that are very close to the significance cutoff to fall on the other side of the significance threshold. The results should still be very similar to each other (a difference of only a few genes).
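As a rough illustration of how a fixed seed can still give order-dependent results, here is a minimal Python sketch. It is not the imputation method used in EcoToxXplorer: the normal-distribution draw, the seed value, and the names `expression` and `impute` are all assumptions made up for the example.

```python
import random
import statistics

# Hypothetical expression values for one gene; samples B and F are missing.
expression = {"A": 5.1, "B": None, "C": 4.8, "D": 5.4,
              "E": 5.0, "F": None, "G": 4.9, "H": 5.2}

def impute(sample_order, seed=42):
    """Fill missing values with seeded random draws, walking samples in the given order."""
    rng = random.Random(seed)  # fixed seed: same sequence of random numbers every run
    observed = [expression[s] for s in sample_order if expression[s] is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    filled = {}
    for s in sample_order:
        value = expression[s]
        # Each missing value consumes the next number in the random sequence.
        filled[s] = value if value is not None else rng.gauss(mu, sd)
    return filled

forward = impute("ABCDEFGH")
again = impute("ABCDEFGH")
backward = impute("HGFEDCBA")

print(forward == again)                 # same order, same seed: identical results
print(forward["B"], backward["B"])      # sample B gets a different draw per order
print(forward["B"] == backward["F"])    # the first draw goes to whichever sample comes first
```

The sequence of random numbers is the same in every run, but which sample receives which number depends on the order the samples are processed in. A gene whose p-value sits right at the cutoff can therefore move across it when the upload order changes.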

If you are uncomfortable with the methods used for missing value imputation, there is the option to turn it off on the normalization page. In this case, genes with missing values will be excluded from the differential analysis.