What are the different types of normalization?

There are different normalization options displayed for microarray and RNAseq data because raw counts are typically processed differently than microarray intensity values. For both microarray and RNAseq data, there is a “None” option to give users the flexibility of uploading a gene expression matrix that has already been normalized using other methods.

Many microarray normalization methods are customized to deal with particular platforms. In FastBMD, we have only included methods that can be applied to any type of microarray platform. If you would like to use more customized methods, we encourage you to normalize your data outside of FastBMD and select the “None” option. The other options for normalizing microarray data are:

  • Log2 Transformation: Since log2(0) is undefined, gene expression values that equal 0 are replaced with the minimum non-zero expression value divided by 10. After this, the data matrix is log2-transformed.

  • Variance Stabilizing Normalization (VSN): This option uses the normalizeVSN() function within the “limma” R package. It has been observed that the variance of microarray probes depends on their mean values, with more highly expressed probes having higher variance. While log-transforming the data addresses this problem somewhat, it is less effective for values closer to zero. The VSN algorithm aims to transform the data such that all probes have roughly the same variance, and it changes values closer to zero more dramatically. VSN-transformed data can be treated the same as log2-transformed data for downstream analysis; in fact, VSN ratios are equal to log2 ratios for probes with higher intensity values. More details are given in this paper.

  • Quantile Normalization: This method is a non-parametric transformation that seeks to make the statistical distribution of every sample identical. To achieve this, the intensity values within each sample are ranked from high to low. Next, each value is replaced with the mean of the values that have the same rank across all samples in the matrix. This method relies on the assumption that the vast majority of genes should have roughly the same expression across samples, and thus any global increases or decreases in expression are due to technical variation and should be removed during normalization. FastBMD uses the quantile normalization function within the “preprocessCore” R package.

  • VSN followed by Quantile Normalization: This option applies the same method as the “VSN” option and then applies the quantile normalization method to the result.
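
As a rough illustration of the log2 transformation and quantile normalization steps described above, here is a minimal Python sketch. It is not the code FastBMD runs (FastBMD calls R packages); the function names are hypothetical, the matrix is assumed to be a list of rows (genes) by columns (samples) with non-negative values, and tied values are not averaged the way a production implementation may do.

```python
import math

def log2_transform(matrix):
    """Replace zeros with (smallest non-zero value / 10), then take log2."""
    nonzero = [v for row in matrix for v in row if v != 0]
    floor = min(nonzero) / 10
    return [[math.log2(v if v != 0 else floor) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values.

    Simplification: ties within a sample are broken by row order
    rather than averaged.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that share the same rank across all samples.
    rank_means = [sum(col[r] for col in sorted_cols) / n_samples
                  for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result
```

Ranking from low to high, as done here, gives the same result as ranking from high to low, because only the pairing of equal ranks across samples matters.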

All of the RNAseq normalization methods use the “voom” method from the “limma” R package. Each one starts by calculating the “normalization factors” with the calcNormFactors() function from the “edgeR” R package, and then applies the voom() function to the scaled counts matrix. The normalization factors are used to scale the samples for sequencing depth, and then limma-voom converts counts to log2-counts per million (logCPM), which can be analyzed using the same eBayes differential expression methods that are used for microarray data. The different normalization options simply use different methods to scale the raw counts for sequencing depth:

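  • Log2-counts per million: To account for sequencing depth, raw counts are divided by the total counts for that sample and multiplied by one million. This converts the raw counts to counts per million (CPM), which are then log2-transformed. Note that CPM is not the same as “Reads Per Kilobase of transcript, per Million mapped reads” (RPKM), which additionally adjusts for transcript length.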

  • Upper Quantile Normalization: Bullard et al. note that the total number of counts for a sample can be very dependent on a small number of highly abundant genes. If these genes are differentially expressed, scaling by the total counts decreases the sensitivity of differential expression analysis. To solve this problem, the authors recommend dividing by the 75th percentile (upper quartile) of the counts for each sample instead of by the total. This provides a sequencing depth scaling factor that does not depend on the small proportion (less than 5%) of highly expressed genes.

  • Trimmed Mean of M-Values (TMM): For this method, one sample is chosen as the reference. Then, the M-values are computed for each gene and sample as log2((sample gene counts/total sample counts) / (reference gene counts/total reference sample counts)). Genes with the most extreme expression values and fold-changes (the highest and the lowest) are trimmed. Finally, the scaling factor for each sample is derived from the weighted average M-value of the remaining genes. More details can be found in this paper.

  • Relative Log Expression Normalization: This method is very similar to the TMM method. Here, a “median sample” is created as a reference by setting the expression of each gene equal to its geometric mean across all samples. Next, the scaling factor for a sample is set to the median of all gene fold changes for that sample, calculated relative to the “median sample”. More details can be found in this paper.
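
To make the difference between scaling by total counts and scaling by the upper quartile more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not what calcNormFactors() or voom() do internally: the function names are hypothetical, counts are a list of rows (genes) by columns (samples), genes with zero counts in every sample are excluded before taking the 75th percentile, and the additional rescaling of factors that edgeR performs is not reproduced.

```python
import statistics

def library_sizes(counts):
    """Total counts per sample (column)."""
    return [sum(row[s] for row in counts) for s in range(len(counts[0]))]

def counts_per_million(counts):
    """Divide each count by its sample's total and multiply by one million."""
    sizes = library_sizes(counts)
    return [[row[s] * 1e6 / sizes[s] for s in range(len(sizes))]
            for row in counts]

def upper_quartile_scales(counts):
    """75th percentile of each sample's counts, used as a depth scale.

    Assumption of this sketch: genes that are zero in every sample
    are dropped first.
    """
    kept = [row for row in counts if any(v != 0 for v in row)]
    scales = []
    for s in range(len(counts[0])):
        column = [row[s] for row in kept]
        # method="inclusive" interpolates linearly between data points.
        q1, q2, q3 = statistics.quantiles(column, n=4, method="inclusive")
        scales.append(q3)
    return scales
```

For example, with `counts = [[10, 20], [30, 60], [60, 120], [0, 0]]`, the second sample has exactly twice the depth of the first, so both approaches give scales in a 1:2 ratio; the two only diverge when a few highly abundant genes dominate the totals.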