What are the various normalization methods and which method should I choose?

MicrobiomeAnalyst supports a variety of methods for data normalization. Please note that data normalization is mainly used for visual data exploration such as beta-diversity and clustering analysis. It is also applied before comparative analyses based on statistical methods that have no established best-practice normalization (univariate statistics and LEfSe). Other comparative analyses use their own method-specific normalization: for example, cumulative sum scaling (CSS) is used for metagenomeSeq, and trimmed mean of M-values (TMM) is applied for edgeR.

A brief summary of the normalization methods is provided below:

  • Total Sum Scaling (TSS) normalization: this method removes technical bias related to different sequencing depths across libraries by simply dividing each feature count by the total library size, yielding the relative proportion of counts for that feature. For easier interpretation, the proportion can be multiplied by 1,000,000 to give the number of reads corresponding to that feature per million reads. LEfSe utilizes this kind of approach (see the R sketch after this list).
  • Relative log expression (RLE) normalization: this is the scaling-factor method proposed by Anders and Huber (2010). A reference (median) library is calculated from the per-feature geometric mean across all columns, and the median ratio of each sample to this reference library is taken as the scaling factor.
  • Trimmed mean of M-values (TMM) normalization: this is the weighted trimmed mean of M-values proposed by Robinson and Oshlack (2010), where the weights are from the delta method on Binomial data.
  • Upper Quantile normalization: this is the upper-quartile normalization method of Bullard et al. (2010), in which the scaling factors are calculated from the 75th percentile (upper quartile) of the counts for each library, after removing features that are zero in all libraries. The idea is generalized here to allow scaling by any quantile of the distribution.
  • Cumulative Sum Scaling (CSS) normalization: this method divides raw counts by the cumulative sum of counts up to a data-derived quantile of each sample's count distribution; up to this quantile, counts are assumed to be roughly equivalent across samples and derived from a common distribution. By default, metagenomeSeq utilizes this approach for differential analysis.
  • Centered Log-Ratio (CLR) Transformation: this method is specifically designed to normalize compositional data. It converts the relative abundances (or the raw counts) of each part into log-ratios relative to the geometric mean of all parts in that sample. The method is only robust when the data set is not sparse, because the geometric mean cannot be computed if any of the feature counts are zero.
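
To make the scaling arithmetic concrete, below is a minimal R sketch of the simpler methods (TSS, upper-quartile, and CLR) on a toy count matrix. This is not the MicrobiomeAnalyst source code; the matrix, the pseudocount, and the variable names are illustrative only.

```r
# Toy feature-by-sample count matrix (rows = features, columns = samples)
counts <- matrix(c(10,  0, 25,  4,
                    3, 12,  0, 40,
                    7,  5, 18,  9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("OTU", 1:3), paste0("S", 1:4)))

# Total Sum Scaling (TSS): divide each count by its library size,
# then scale to counts per million reads
lib_size <- colSums(counts)
tss <- sweep(counts, 2, lib_size, "/") * 1e6

# Upper-quartile scaling: drop features that are zero in all libraries,
# then use the 75th percentile of each library as the scaling factor
kept <- counts[rowSums(counts) > 0, , drop = FALSE]
uq_factor <- apply(kept, 2, quantile, probs = 0.75)
uq <- sweep(counts, 2, uq_factor, "/")

# Centered log-ratio (CLR): log of each value over the per-sample geometric
# mean; a pseudocount of 1 is added here because the geometric mean is
# undefined when any count is zero
pseudo <- counts + 1
clr <- apply(pseudo, 2, function(x) log(x) - mean(log(x)))
```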

At the moment, there is no consensus guideline on which normalization should be used. Users are advised to explore different approaches and then visually examine the separation patterns (e.g. in a PCoA plot) to assess the effect of each normalization procedure with regard to the experimental conditions or other metadata of interest. For a detailed discussion of these methods, users are referred to two recent papers by Paul J. McMurdie et al. and Jonathan Thorsen et al.
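
As one possible starting point for such an exploration, the scaling factors of the reference implementations can also be computed directly in R. This is a hedged sketch, assuming the edgeR and metagenomeSeq packages are installed and that `counts` is a raw feature-by-sample matrix as in the sketch above; the resulting normalized tables can then be fed into ordination (e.g. PCoA) to compare separation patterns.

```r
library(edgeR)          # TMM, RLE, and upper-quartile scaling factors
library(metagenomeSeq)  # cumulative sum scaling (CSS)

# counts: raw feature-by-sample matrix of reads
# Note: very sparse tables can make RLE factors unstable, since few
# features have non-zero counts in every sample
tmm_factors <- calcNormFactors(counts, method = "TMM")
rle_factors <- calcNormFactors(counts, method = "RLE")
uq_factors  <- calcNormFactors(counts, method = "upperquartile")

# CSS via metagenomeSeq: estimate the scaling quantile, normalize,
# and extract the normalized counts
mr <- newMRexperiment(counts)
mr <- cumNorm(mr, p = cumNormStatFast(mr))
css_counts <- MRcounts(mr, norm = TRUE, log = FALSE)
```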

data_normalized.csv (404.8 KB)

Thank you for the descriptions above. I have a question regarding the Total Sum Scaling approach. I pulled our normalized data (attached as “data_normalized.csv”) off of MicrobiomeAnalyst via the download.zip file. According to the post above, once total sum scaling is completed, the result for each feature is multiplied by 1,000,000 to give reads per million for that feature, similar to the LEfSe approach.

However, looking at our normalized dataset downloaded directly from MicrobiomeAnalyst, it appears the multiplier used was 10,000,000. Dividing the normalized data by 1,000,000 as indicated above and summing the results for each sample comes out to 10. Comparatively, dividing each feature by 10,000,000 and summing the results for each sample comes out to 1, which should be the outcome of the TSS methodology. This indicates to me that 10,000,000 was the multiplier used to generate the final normalized data output, rather than the 1,000,000 indicated above.
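
For reference, a quick way to reproduce this check in R (assuming the first column of the exported CSV contains the feature IDs and the remaining columns are samples):

```r
# Read the normalized table exported by MicrobiomeAnalyst
norm <- read.csv("data_normalized.csv", row.names = 1)

colSums(norm) / 1e7  # ~1 per sample, consistent with a 10,000,000 multiplier
colSums(norm) / 1e6  # ~10 per sample when assuming a 1,000,000 multiplier
```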

Is this correct? If so, why is 10,000,000 used instead of 1,000,000 as indicated in the summary?

Thank you for the note. I can confirm that it is indeed 10M, not 1M, that was used in TSS. I don’t think there is a specific reason other than an extra zero typed in the R code by mistake. The underlying R code has been updated to 1M; the website will be updated later this week.
