Hello,
I’m observing a problem when performing a One-way ANOVA in the Statistical Analysis [one factor] module. Specifically, if I upload data in which the samples are not sorted by group (see attached: P24_0802_Exp_1_combined_for_MA-group_1_to_4_only.csv), I observe no statistically significant features.
If I sort the data by group (see attached: P24_0802_Exp_1_combined_for_MA-group_1_to_4_only_group_sorted.csv), then I get a very different result, with around 1000 significant features at 1% FDR. This is also the result I get on my MetaboAnalyst Pro account, and it is more in line with what I would have anticipated from this dataset.
I suspect that the samples are being assigned to the wrong groups for the ANOVA analysis.
I’m having trouble providing the files here, but if you would like them, please shoot me an email and I’ll send them through.
Can you reproduce this using our example data? You can also share your data with me. Please document every step (including the R command history) so we can investigate in detail.
Sorry for the late reply. I looked at this a while ago and figured out that it occurs with data sets above a certain size, but it then fell off my radar.
I’ve come across another example today with a dataset that is small enough to upload. The attached files contain the same data, but one has the samples (rows) in random order, while the other has been sorted so that all samples belonging to the same group are next to each other.
The scrambled data misassigns the groups. For example, here is a PCA score plot for the scrambled data:
You can see that the PCA plots show the points in the same positions, but the group assignments are different. This misassignment of groups also occurs in the other modules I tried.
Thanks for the data. Using the scrambled data and all default steps (log transformation), the PCA is correct (see below). Can you provide the steps you performed that led to the PCA results you generated? Please refer to our posting guide for more details on such requests.
Thanks for your response. My initial conditions would have been the default parameters:
i.e. no data filtering, imputation with 20% of the lowest positive value, median normalization, log10 transformation, and Pareto scaling.
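In case it helps, here is a rough base-R sketch of what those normalization steps amount to on a samples-in-rows matrix. This is only an illustration of the arithmetic, not MetaboAnalyst’s actual code, and it assumes a numeric matrix X with missing values already imputed:

# X: numeric matrix, samples in rows, features in columns, missing values already imputed
X_norm   <- sweep(X, 1, apply(X, 1, median), "/")    # median normalization: divide each sample by its median
X_log    <- log10(X_norm)                            # log10 transformation (assumes positive values)
X_pareto <- apply(X_log, 2, function(v) (v - mean(v)) / sqrt(sd(v)))   # Pareto scaling per feature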
I have tried to reproduce the result from your last post using:
no data filtering, default missing value imputation, and log10 transformation; the R history is attached.
I uploaded a slightly smaller version of the file so that I could attach the files from the Download page here. Following upload, I skipped straight from the Data check page to the Download page, to avoid any complications associated with differences in data processing steps. I can’t upload a .zip here, but I’ve attached all the files individually (I changed the file extension on the .Rhistory file to allow the upload).
The sample order in data_original.csv is different from that in the uploaded file (Two_group_order_scrambled_smaller.csv). Specifically, the samples have been sorted by group, but the group labels have not been reordered to match, so samples end up assigned to the wrong groups. Note that this is consistent with my earlier observation that there is no problem when the data are sorted by group prior to upload.
My recommendation for anybody reading this is to sort your data by group prior to upload.
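If it is useful, here is a small base-R sketch for doing that sort before upload. The file name is just the one from my attachments, and I’m assuming the usual layout with sample names in the first column and the group label in the second:

d <- read.csv("Two_group_order_scrambled_smaller.csv", check.names = FALSE)
# order the rows by group so that all samples from the same group sit together,
# keeping each sample's own group label attached to its row
d_sorted <- d[order(d[[2]]), ]
write.csv(d_sorted, "Two_group_order_group_sorted.csv", row.names = FALSE)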
There is some additional weirdness I’ve found while trying to solve this. If I upload data with samples in columns rather than rows, I get an error saying duplicate features are not allowed. This error doesn’t appear when uploading the same data with samples in rows. Are duplicate feature names allowed?
Hi Chris, thanks for the details. We were able to reproduce the bug. This is the first time we have noticed the production server giving results different from our development environment. The issue is now fixed …
For your second question, duplicated names are not allowed (for features or samples, whether they are uploaded in rows or columns). I can see there are special characters in your feature names that will lead to issues during parsing. MetaboAnalyst replaces spaces with underscores, and other special characters are removed. This process may introduce duplicates … this is hard to prevent, as there are so many exceptions.
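To illustrate how that can happen, here is a rough base-R sketch that mimics such a clean-up and flags feature names that only collide afterwards. The replacement rules below are an approximation rather than the actual parsing code, your_data.csv stands in for whatever file you upload, and the column layout (sample names first, group labels second, features in the remaining columns) is an assumption:

feat_names <- colnames(read.csv("your_data.csv", check.names = FALSE))[-(1:2)]
# approximate the clean-up: spaces become underscores, other special characters are dropped
clean <- gsub("[^[:alnum:]_.]", "", gsub(" ", "_", feat_names))
# report original names that only become duplicates after the clean-up
feat_names[duplicated(clean) | duplicated(clean, fromLast = TRUE)]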