What are the various normalization methods and which method should I choose?

MicrobiomeAnalyst supports a variety of methods for data normalization. Please note that data normalization is mainly used for visual data exploration, such as beta-diversity and clustering analysis. It is also used for comparative analysis with statistical methods that have no established, best-performing normalization procedure of their own (univariate statistics and LEfSe). Other comparative analyses use their own specific normalization methods. For example, cumulative sum scaling (CSS) normalization is used for metagenomeSeq, and trimmed mean of M-values (TMM) is applied for edgeR.

A brief summary of the normalization methods is provided below:

  • Total Sum Scaling (TSS) normalization: this method removes technical bias related to different sequencing depths in different libraries by simply dividing each feature count by the total library size, which yields the relative proportion of counts for that feature. For easier interpretation, the proportion can be multiplied by 1,000,000 to give the number of reads corresponding to that feature per million reads. LEfSe utilizes this kind of approach.
  • Relative log expression (RLE) normalization: this is the scaling factor method proposed by Anders and Huber (2010). A reference ("median") library is calculated as the geometric mean of each feature across all samples (columns). The median ratio of each sample to this reference library is taken as the scaling factor.
  • Trimmed mean of M-values (TMM) normalization: this is the weighted trimmed mean of M-values proposed by Robinson and Oshlack (2010), where the weights are from the delta method on Binomial data.
  • Upper Quantile normalization: this is the upper-quartile normalization method of Bullard et al. (2010), in which the scale factors are calculated from the 75% quantile of the counts for each library, after removing features that are zero in all libraries. This idea is generalized here to allow scaling by any quantile of the distributions.
  • Cumulative Sum Scaling (CSS) normalization: this method finds the quantile of the count distribution up to which samples should be roughly equivalent and independent of each other, under the assumption that counts in this range are derived from a common distribution. Each sample is then scaled by the cumulative sum of its counts up to that quantile. By default, metagenomeSeq utilizes this approach for differential analysis.
  • Centered Log-Ratio (CLR) transformation: this method is specifically designed for compositional data. It converts the relative abundances of each part, or the values in the table of counts for each part, to log-ratios relative to the geometric mean of all values in the sample. It is reliable only when the data set is not sparse, because the geometric mean is zero, and the log-ratios therefore undefined, if any of the feature counts is zero.
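
To make the arithmetic behind three of these methods concrete, here is a minimal Python sketch of TSS, RLE scaling factors, and CLR. It is an illustration only and is not the code MicrobiomeAnalyst, edgeR, or metagenomeSeq actually run. The function names (tss, rle_size_factors, clr), the pseudocount parameter, and the toy counts are all hypothetical. Each sample is assumed to be a plain list of feature counts, with features in the same order in every sample. TMM and CSS are left out because they need more machinery than fits in a short example.

```python
import math
from statistics import median


def tss(sample, scale=1.0):
    """Total sum scaling: divide each count by the sample's library size.

    Use scale=1_000_000 to express the result as counts per million reads.
    """
    total = sum(sample)
    if total == 0:
        raise ValueError("library size is zero")
    return [count / total * scale for count in sample]


def rle_size_factors(samples):
    """Median-of-ratios scaling factors (after Anders and Huber), one per sample.

    Simplification (assumption of this sketch): only features with a positive
    count in every sample are used, so the geometric mean is well defined.
    """
    n_features = len(samples[0])
    usable = [j for j in range(n_features) if all(s[j] > 0 for s in samples)]
    if not usable:
        raise ValueError("no feature is positive in every sample")
    # Reference library: per-feature geometric mean across all samples.
    reference = {
        j: math.exp(sum(math.log(s[j]) for s in samples) / len(samples))
        for j in usable
    }
    # Scaling factor: median ratio of the sample to the reference library.
    return [median(s[j] / reference[j] for j in usable) for s in samples]


def clr(sample, pseudocount=0.0):
    """Centered log-ratio: log of each value over the sample's geometric mean.

    A zero count makes the log-ratio undefined, so this raises ValueError
    unless a positive pseudocount (a hypothetical workaround) is supplied.
    """
    values = [count + pseudocount for count in sample]
    if any(v <= 0 for v in values):
        raise ValueError("CLR needs strictly positive values")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [log_value - mean_log for log_value in logs]


if __name__ == "__main__":
    # Toy data: three samples, four features (made-up counts).
    samples = [[10, 20, 30, 40], [20, 40, 60, 80], [5, 50, 15, 30]]
    for sample in samples:
        print(tss(sample, scale=1_000_000))
    print(rle_size_factors(samples))
    print(clr(samples[0]))
```

Dividing each sample's counts by its RLE factor puts the samples on a common scale. CLR values within a sample always sum to zero, which is a quick sanity check when trying the transformation on your own data.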

At the moment, there is no consensus guideline on which normalization should be used. Users are advised to explore different approaches and then visually examine the separation patterns (e.g. in a PCoA plot) to assess the effects of different normalization procedures with regard to experimental conditions or other metadata of interest. For a detailed discussion of these methods, users are referred to two recent papers: Paul J. McMurdie et al. and Jonathan Thorsen et al.