Corrections for MetaboAnalystR directions and a question about the functionality of Covariate Analysis

steeny · May 26, 2023, 6:07pm

Hello, I am a new user of MetaboAnalystR. While attempting a workflow for multifactor metadata analysis with covariates there were a couple errors in the R directions from the MetaboAnalyst website. I have an example workflow below with fixes for these issue.

################################################################################
# MetaboAnalystR workflow
################################################################################

# Resest the workflow
mSet<-NULL

# Initialize object
mSet<-InitDataObjects("pktable", "mf", FALSE)
# metadata with covariates
mSet<-SetDesignType(mSet, "multi")
mSet<-Read.TextDataTs(mSetObj = mSet, filePath="feat.dlco.metab.txt", format="colmf");
mSet<-ReadMetaData(mSet, "class.dlco.txt");

# fix metadata types
# it seems MetaboAnalystR gets its datatypes from the metadata title.
# this section corrects all the metadata stuff, except for those labeled "orig"
mSet$dataSet$meta.types[1:3]<-c("cont","disc","cont")
mSet$dataSet$meta.info[,1]<-as.numeric(as.character(mSet$dataSet$meta.info[,1]))
mSet$dataSet$meta.info[,3]<-as.numeric(as.character(mSet$dataSet$meta.info[,3]))
mSet$dataSet$types.cls.lbl[1:3]<-c("cont","disc","cont")
mSet$dataSet$type.cls.lbl<-"cont"
mSet$dataSet$cls<-as.numeric(as.character(mSet$dataSet$cls))
# fixes some kind of missing internal variable
mSet$dataSet$cls.type<-"cont"
mSet<-SanityCheckData(mSet)

# replaces na's (but not 0's)
mSet<-ReplaceMin(mSet);

mSet<-SanityCheckMeta(mSet, 1)
mSet<-SetDataTypeOfMeta(mSet);
mSet<-SanityCheckData(mSet)

# Filter based on iqr, RSD, then logtransform then normalize
mSet<-FilterVariable(mSet, "iqr", 40, "F", 25, F)
mSet<-PreparePrenormData(mSet)
mSet<-Normalization(mSet, "QuantileNorm", "LogNorm", "NULL", ratio=FALSE, ratioNum=20)

# Covariate adjusted anova
# Directions in MetaboAnalystR indicate using square brackets instead of vector notation
# but that doesn't work
adj.vec <<- c("SmokingStatus", "AGE_YRS")
mSet<-CovariateScatter.Anal(mSet, "covariate_plot_0_dpi72.png", "png", "DLCO", "NA", "NA" , 0.05,"anova")
# Get plot for scatter plot
# Triple colon required for hidden functions
mSet <- MetaboAnalystR:::PlotCovariateMap(mSet, theme="default", imgName="", format="png", dpi=72)
# Get plot for individual metabolite
mSet <- MetaboAnalystR:::PlotMultiFacCmpdSummary(mSet, "134.8651018","DLCO", 0, "png", 72, width=NA)

This seems to get the intended analysis and figures (I am seeing similar issues for heatmap and pca, and will post solutions soon). However, while reading through the code on GitHub I found a couple lines of code that indicate a that for mixed models of continuous variables and factor variables the linear regression employed by MetaboAnalystR sets the intercept of the model to 0. The code that does this is shown below, and is in multifac_covariate.R at lines 503 through 507.

    if(sum(types == "cont") == length(vars)){ #in case of single, continuous variable, must use different intercept or limma will give unreasonable results
      design <- model.matrix(formula(paste0("~", paste0(" + ", vars, collapse = ""))), data = covariates);
    } else {
      design <- model.matrix(formula(paste0("~ 0", paste0(" + ", vars, collapse = ""))), data = covariates);
    }

I would appreciate a explanation of the reasoning for this model adjustment, especially the comment “in case of single, continuous variable, must use different intercept or limma will give unreasonable results”. My understanding is that removing the intercept should generally have bad effects on the results, and I could not find a case where it was done in this way. I understand that this behavior can be avoided by setting metadata types to continuous, but as it stands I worry that people using the main website are using this intercept-less model without any awareness.

jess.ewald · May 26, 2023, 8:35pm

Please see the linked comment on this thread, from the creator of the limma package: Limma when to use zero intercept

Basically, for a categorical metadata, ~x and ~ 0 + x are the exact same model. For continuous metadata, these are very different models. Explicitly setting the intercept to 0 forces the fitted linear model through the origin, which is usually not a reasonable assumption and causes weird results (ie. 100% of features are called differentially expressed, or all p-values are virtually identical).

MetaboAnalyst was originally designed for only categorical metadata, and so the formula didn’t matter. When we expanded it to accept continuous values, we added this exception after testing datasets with only continuous values and getting bad results. The if/else condition is really redundant: since ~x and ~ 0 + x are the same for categorical, we can use the second implementation for everything. It’s been simplified in some of our other tools, we just haven’t updated MetaboAnalystR yet.

jess.ewald · May 26, 2023, 8:39pm

I would also add - the intercept vs. no intercept gives the exact same model as long as you have at least one categorical metadata. This isn’t true in some other linear model R packages, I think its specific to limma.

steeny · May 30, 2023, 2:42pm

Thanks a ton! This clarifies the issue a lot.

My other comments about hidden functions and variables not represented in the MetaboAnalyst R code still holds, so I’ll use this thread to point out other instances as I try out different sections of the workflow.