How does DESeq2 normalize the counts?

The workflow of DESeq2, part 1
BioBit
Adrià Mitjavila on May 31, 2022
Reposted from BioBit.




When studying RNA-seq data, one of the most common approaches is to perform differential expression analysis (or DE analysis), a method in which two different biological conditions (i.e. different developmental stages or disease status) are compared in order to see if the expression of genes changes between them. Counts, which are sequencing reads (for single-end sequencing) or fragments (for paired-end sequencing) that map to a given genomic feature (i.e. exons), are a proxy to estimate gene expression but, in order to compare the expression of the genes across different condition or even to compare different genes within the same condition, we have to normalize these counts.

Recently, I explained some RNA-seq count normalization methods, and their uses. Here, I am going to explain some theory behind the normalization used by DESeq2, a common R package for DE analysis.

If you want to read more about normalization methods, read this post.

Why another normalization method?

Remember the FPKM and TPM normalization methods we explained in the this post? They are nice normalization methods in normalizing read counts by sequencing depth (also called library size) and gene length, which makes them perfect to compare the expression of different genes in the same sample, as well as for comparison of a gene expression between similar conditions or between condition replicates.

When doing DE analysis, the normalization by sequencing depth is extremely important, since is one sample has, overall, the double of reads than the other, you will see differences among sample that are not due to biological causes, but due to the sequencing. On the other hand, since we compare the two samples gene by gene, one should expect that in both samples the same gene has the same length, so normalization by gene length is not necessary.

However, there is still one issue that must be addressed: the library composition (or RNA composition).

To explain differences in library composition:

  • Let’s say, for example, that we have two samples A and B that come from different cell types.
  • To make things easy, assume that only 3 genes are expressed (gene1, gene2, and gene3).
  • Imagine that gene1 and gene2 have equal, but very low expression in sample A and sample B, and that gene3 is highly expressed in sample A, but not in sample B.
  • If we sequence the same number of reads, we will probably have a lot of reads for gene3 in sample A and less in sample B, obtaining that gene3 is differentially expressed. No problem since this is what happens biologically.
  • However, in gene1 and gene2, despite being equally expressed in sample A and B, in sample A they will have less reads, since gene3 captures most of the sequenced reads. Due to this, we will obtain that gene1 and gene2 are upregulated in sample B compared to sample A, when this is not happening biologically.


Diagram of the example explained above. Top panel shows the real exression of the genes: gene1 and 2 are low but equally expressed in sample A and B, while gene3 is highly expressed in sample A but not in sample B. Bottom panel shows the sequenced transcripts: gene 3 shows similar trend as the real expression, however, since in sample A most reads map to gene3, gene1 and gene2 recieve less reads than in sample B, despite having equal real expression.

Figure 1: Diagram of the example explained above. Top panel shows the real exression of the genes: gene1 and 2 are low but equally expressed in sample A and B, while gene3 is highly expressed in sample A but not in sample B. Bottom panel shows the sequenced transcripts: gene 3 shows similar trend as the real expression, however, since in sample A most reads map to gene3, gene1 and gene2 recieve less reads than in sample B, despite having equal real expression.


How does DESeq2 adresses this differences?

To account for sequencing depth and library composition, DESeq2 calculates sample-specific scaling factors using the mean of rations method. For the user runing DESeq2 in R, this implies only one step but, the DESeq2 algorithm performs multiple steps:

  1. Compute the pseudo-reference: geometric mean for each gene across all samples.

    • For each gene, it computes the geometric mean of the raw counts accross all samples in order to create a pseudo-reference sample. The good thing about this mean is that is not as susceptible to outliers as arithmetic mean.

    • This means that DESeq2 calculates the product of the raw counts in a gene accross all the samples and then, it does a root with base equal to the number of samples.

    \[ Geometric \space mean = \sqrt[n]{\prod_{i = 1}^{n}counts_i} = \sqrt[n]{counts_1 \times counts_2 \times \cdots \times counts_n} \]

    • The geometric mean can also be calculated by computing the aritmetic mean of the natural logarithm transformed values and then raising the e number to the calculated aritmetich mean if the log values.

    \[ Geometric \space mean = e^{\sum_{i = 1}^{n}log(counts_i)} = e^{\frac{log(counts_1) + log(counts_2) + \cdots + log(counts_n)}{n}} \]

  2. Removing genes without counts.

    • If a gene has 0 counts in one sample, its geometric mean accross all samples will be 0. This can will lead to errors in the next steps, hence this gene must be removed from the calculation of the normalization factors.
    • Note that this gene will be introduced once more in the normalization step (step 5 here), since it’s only removed to calculate the normalization factors.
  3. Ratio of each sample to the pseudo-reference.

    • The next step is computing the ratio between the raw counts in each sample and the geometric mean previously calculated. This way, we will adress the differences in library composition.

    \[ Ratio_i = \frac{counts_i}{Geometric \space mean} \]

    • Have in mind that, since most genes won’t be differential expressed, the majority of genes should have ratios near to 1.
  4. Sample-specific size normalization: median of the ratios.

    • For each sample, take the median of the all the counts/reference ratios calculated before. With this median, we will address the sequence depth differences accross different samples which, added to the ratios, accounts for both, differences in library composition and sequence depth.

    • A cool thing about the median is that, unlike the mean, it is not very much affected by the presence of very or very low expressed genes.

    • Note that, since most genes won’t be differentiall expressed, the median of the ratios should be near one.

  5. Normalization of the raw counts.

    • For each gene in each samples, divide the raw counts by the median of the ratios calculated above.

A little example

To understand things better, let’s do a little example with an invented counts matrix with 6 genes and 3 samples. Note that a real counts matrix would have tenths of thousands of genes, hence this examples is not representative of a real situation.

set.seed(1234) ## set seed to get same result in sample()
counts_matrix <- 
  tibble(gene     = paste("gene", 1:6, sep = ""),
         sampleA  = sample(x = 0:100, size = 6, replace = F),
         sampleB  = sample(x = 0:50, size = 6, replace = F),
         sampleC  = sample(x = 50:150, size = 6, replace = F)) %>% 
  column_to_rownames("gene")
total = colSums(counts_matrix)

counts_matrix %>% 
  bind_rows(total) %>% 
  set_rownames(c(rownames(counts_matrix), "Total")) %>%
  knitr::kable(caption = "<span style='color:darkblue; font-size:12px'>Raw (unnormalized) counts in genes from each sample. Note that the total number of counts has been added to illustrate differences in sequencing depth, but it won't be used for the calculation of normalization (or scaling) factors. Moreover, no genes with 0 counts have been added to make the example easier.</span>", align = "c", format = "html") %>% kableExtra::kable_styling(full_width = T)
Table 1: Raw (unnormalized) counts in genes from each sample. Note that the total number of counts has been added to illustrate differences in sequencing depth, but it won’t be used for the calculation of normalization (or scaling) factors. Moreover, no genes with 0 counts have been added to make the example easier.
sampleA sampleB sampleC
gene1 27 15 119
gene2 79 3 128
gene3 21 33 127
gene4 8 38 63
gene5 4 21 105
gene6 37 25 111
Total 176 135 653


Note that in table 1 the total number of counts for each samples has been added to illustrate differences in sequencing depth, but it won’t be used for the calculation of normalization (or scaling) factors. Moreover, no genes with 0 counts have been added to make the example easier.


Step 1: build a pseudo-reference by computing the geometric mean

What we have to do here is to follow the equation given above to compute the geometric mean of the raw counts of each gene across all samples.

geometric_mean <- function(df) {
  
  gmean <- numeric(length = nrow(df))
  for(i in 1:nrow(df)){
    product  <- prod(df[i,])
    gmean[i] <- pracma::nthroot(product, n = length(df[i,]))
  }
  
  return(gmean)
}

gmean <- geometric_mean(counts_matrix)
counts_matrix %>% 
  mutate(pseudoRef = gmean) %>% 
  set_rownames(c(rownames(counts_matrix))) %>%
  knitr::kable(caption = "<span style='color:darkblue; font-size:12px'>Table with the raw counts in every gene in eac sample and the pseudo-reference for each gene, computed by using the geometric mean.</span>", align = "c", format = "html") %>%
  kableExtra::kable_styling(full_width = T)
Table 2: Table with the raw counts in every gene in eac sample and the pseudo-reference for each gene, computed by using the geometric mean.
sampleA sampleB sampleC pseudoRef
gene1 27 15 119 36.39156
gene2 79 3 128 31.18790
gene3 21 33 127 44.48146
gene4 8 38 63 26.75499
gene5 4 21 105 20.66123
gene6 37 25 111 46.82613


Step 2: filter 0 count genes

This step is not necessary here since we don’t have any genes with 0 counts.

Step 3: compute the ratios of the raw counts and the pseudo-reference

What we have to do here is to divide the raw counts of every gene in each sample for the pseudo-reference calculated in the step 1 (table 2).

ratios_matrix <- counts_matrix %>%
  mutate(pseudoRef = gmean,
         ratiosA = sampleA/pseudoRef,
         ratiosB = sampleB/pseudoRef,
         ratiosC = sampleC/pseudoRef) %>%
  select(contains("ratios"))

ratios_matrix %>% 
  knitr::kable(caption = "<span style='color:darkblue; font-size:12px'>Table showing, for each gene, the ratios of raw counts and the geometric mean.</span>", align = "c", format = "html") %>%
  kableExtra::kable_styling(full_width = T)
Table 3: Table showing, for each gene, the ratios of raw counts and the geometric mean.
ratiosA ratiosB ratiosC
gene1 0.7419303 0.4121835 3.269989
gene2 2.5330338 0.0961912 4.104156
gene3 0.4721069 0.7418822 2.855122
gene4 0.2990097 1.4202961 2.354701
gene5 0.1935993 1.0163964 5.081982
gene6 0.7901572 0.5338900 2.370471


Step 4: retrieve the median of the ratios

In this step, we have to calculate, in each sample, the median value for the ratios obtained in step 3 (table 3=. These values will be the scaling factor for each sample.

medianOfRatios <- function(df){
  
  mr <- numeric(length = ncol(df))
  for(i in 1:ncol(df)){
    mr[i] <- median(df[,i])
  }
  
  return(mr)
}

medianOfRatios <- medianOfRatios(ratios_matrix)

medianOfRatios %>% 
  as.data.frame() %>%
  set_colnames("Scaling factor") %>%
  set_rownames(paste("sample", LETTERS[1:3], sep = "")) %>% 
  knitr::kable(caption = "<span style='color:darkblue; font-size:12px'>Normalization (or scaling) factors for each of the samples in the experimental design.</span>", align = "c", format = "html") %>%
  kableExtra::kable_styling(full_width = T)
Table 4: Normalization (or scaling) factors for each of the samples in the experimental design.
Scaling factor
sampleA 0.6070186
sampleB 0.6378861
sampleC 3.0625557

As seen in table 3, the scaling factors are far from 1, specially for the sample3. This is not the real situation, since with thousands of genes, where most of them are not differentially expressed (similar in all conditions), the mean of the ratios would be near to 1.


Step 5. Normalize the raw counts.

To normalize the raw counts in each sample, we have to divide them by the sample scaling factor.

normalize <- function(df, scale_factor){
  
  norm_counts <- df
  for(i in 1:ncol(df)){
    norm_counts[,i] <- df[,i] / scale_factor[i]
  }
  
  return(norm_counts)
}

norm_counts <- normalize(counts_matrix, medianOfRatios)
norm_total  <- colSums(norm_counts)
norm_counts %>% 
  bind_rows(norm_total) %>% 
  set_rownames(c(rownames(norm_counts), "Total")) %>%
  knitr::kable(caption = "<span style='color:darkblue; font-size:12px'>Table of normalized counts for every gene in each sample</span>", 
               align = "c", format = "html") %>%
  kableExtra::kable_styling(full_width = T)
Table 5: Table of normalized counts for every gene in each sample
sampleA sampleB sampleC
gene1 44.479693 23.515170 38.85644
gene2 130.144288 4.703034 41.79516
gene3 34.595317 51.733374 41.46863
gene4 13.179168 59.571764 20.57105
gene5 6.589584 32.921238 34.28509
gene6 60.953654 39.191950 36.24424
Total 289.941704 211.636531 213.22061


Note that the total number of normalized counts for each sample has is much more similar across samples than the non normalized counts. However, in a real example, the total number of counts would be even more similar across all samples


More information

For a complete explanation of how DESeq normalizes counts, look at this video from StatQuest (here, the author of the video does the calculations using logarithmic values, but is essentially the same):

You can also take a look at this tutorial from Harvard Bioinformatics Core.

Finally, you can take a look to this video from Chipster, which covers the differntial expression analysis from raw count normalization to multiple testing correction:

Bibliography

Journal articles

Anders et al (2010). Differential expression analysis for sequence count data. Genome Biol. Online here.

Love et al (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. Online here.

Other sources

Harvard Bioinformatics Core (2019). Introduction to DGE: Count normalization with DESeq2. DGE workshop. Online here. Consulted on March 19, 2021.

Renesh Bedre (2021). Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq. Renesh Bedre Data Science Blog. Online here. Consulted on March 19, 2021.