Skip to content

Differential gene expression

Let’s continue with the Differential Gene Expression analysis.

Simple Comparison between two groups

To start let’s consider a simple comparison between all the treated samples and the control samples.

Note

That is why we passed the meta$Treatment variable when we created the dge object.

The first step is to get the dispersion in our data. Remember, the variance is dependent of the level of expression of a gene, we need to properly estimate it.

dge <- estimateDisp(dge)

Now the dge object is updated with the dispersion information.

We can quickly make a scatter plot of the dispersion as a function of the average gene expression with the use of the plotBCV function.

plotBCV(dge)

Does the observed trend of the dispersion make sense?

To detect the genes that are differentially expressed between the treated and control samples, we can use the exactTest function. We provide the dge object, and the groups that we want to compare (as indicated in the meta$Treatment variable).

dge.result <-
  exactTest(object = dge, pair = c("control", "dexamethasone"))

To know how many genes are differentially expressed at a FDR of 1%, we can run the following command.

summary(decideTests(object = dge.result, p.value = 0.01))

What is FDR correction, why is it needed? Repeat the DGE analysis and set FDR to 5%. Count the number of hits.

Finally, we can also make a MA plot, by using the plotMD function. A mean-difference plot (aka MA plot) is a scatter plot, where each point represent a gene in our analysis. The x-axis corresponds to the average expression, and the y-axis corresponds to the log-fold-change (logFC). The MA plot can help us detect a potential bias in our data.

plotMD(dge.result)

Multiple comparisons

Let’s now consider the samples treated for 2 hours and those treated for 4 hours separately. However, we want to detect the genes that are differentially expressed in either of the two comparisons, i.e. 2 hours vs control and 4 hours vs control.

To do this we need to recreate the dge object, this time providing the meta$Treatment_Duration variable for the group option.

Note

The control samples have 0 hours duration.

# Change group
dge <-
  DGEList(counts = raw_counts_filtered,
          group = meta$Treatment_Duration)

# Calculate again normalization factors
dge <- calcNormFactors(object = dge, method = "TMM")

Since, this comparison is more complicated, we will fit a Generalized Linear Model (GLM) to the data. To do this we should first create a design matrix.

dge.design <-
  model.matrix(object = ~ 0 + Treatment_Duration, data = meta)

Print the dge.design object. As you can see, its columns correspond to the groups in our dataset.

Then, we estimate the dispersion in our data as we did before.

dge <- estimateDisp(y = dge, design = dge.design)

Another thing that we need is a contrast. The contrast matrix will help us define the comparisons that we are interested in.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# Create contrasts for possible comparisons
dge.contrasts <-
  makeContrasts(
    "2vs0" = Treatment_Duration2hr - Treatment_Duration0hr,
    "4vs0" = Treatment_Duration4hr - Treatment_Duration0hr,
    "4vs2" = Treatment_Duration4hr - Treatment_Duration2hr,
    "TreatmentVsControl" =
      (Treatment_Duration4hr + Treatment_Duration2hr) / 2 -
        Treatment_Duration0hr,
    levels = dge.design
  )

We used the makeContrasts function to create a contrast for a number of possible comparisons among our groups. You can print it to understand how it works.

Do you understand which is the comparison indicated by the last column of the contrast?

Now, let’s fit the GLM model on our data, and detect differentially expressed genes by performing the appropriate statistical test.

dge.fit <- glmQLFit(dge, dge.design)

dge.result <-
  glmQLFTest(dge.fit,
    contrast = dge.contrasts[, c("2vs0", "4vs0")]
  )

As you see, we provided the part of the contrast that includes the 2vs0 and 4vs0 comparisons. That is going to test for the genes that are differentially expressed compared to the control in either of the two time points of the treated samples.

We can check again the number of differentially expressed genes (DEGs).

summary(decideTests(object = dge.result, p.value = 0.01))

And we can also print the “top” DEGs, in terms of FDR (adjusted p-value). Genes with negative log fold-change (logFC) are under-expressed, and gene with positive logFC are over-expressed.

topTags(object = dge.result)

Let’s save all DEGs in a table. Moreover, we usually filter for genes with high absolute logFC. Here, we use an arbitrary cutoff of 0.5. Finally, we also add a column for the gene name, and another for the gene biotype, taken from the gene_annotation object.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
allDeg <-
  # Get table of differetial expression results
  topTags(object = dge.result, n = Inf, p.value = 0.01)$table %>%
  # push rownames to a column
  rownames_to_column("gene_id") %>%
  # make the table a tibble
  as_tibble() %>%
  # filter for genes with an aboslute logFC of 0.5
  # or greater in at least one of the two comparisons
  filter(abs(logFC.2vs0) >= 0.5 | abs(logFC.4vs0) >= 0.5) %>%
  # Include columns for gene_name and
  # gene_type from the annotation table
  left_join(gene_annotation %>%
    filter(gene_id %in% rownames(dge$counts) &
      !(gene_id %in% y_genes & chrom == "chrY")) %>%
    select(gene_id, gene_name, gene_type))

Note

The y_genes vector, that is used here, was created in the Data preparation section.

What does a value of 1 for the logFC mean in terms of fold-change?

Which time point, 2 or 4 hours, has more DEGs?

Are there any DEGs that are over-expressed in 2vs0 but under-expressed in 4vs0?

Another method for correcting p-values is the Bonferroni correction. Try to use this instead, and compare the distribution of the adjusted p-values generated in the two cases.

Tip

The adjust.method option in the topTags function can be used to select a different correction method. You may also use the hist function of base R to visualise the distribution of the adjusted p-values with a histogram.

Can you identify the genes that are differentially expressed only in one of the two comparisons?

Example solution
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
two_hr_specific <-
  allDeg %>%
  filter(abs(logFC.2vs0) > 0.5) %>%
  pull(gene_id) %>%
  setdiff(allDeg %>%
    filter(abs(logFC.4vs0) > 0.5) %>%
    pull(gene_id))

four_hr_specific <-
  allDeg %>%
  filter(abs(logFC.4vs0) > 0.5) %>%
  pull(gene_id) %>%
  setdiff(allDeg %>%
    filter(abs(logFC.2vs0) > 0.5) %>%
    pull(gene_id))

DEG visualization

To visualize DEG expression we can make a heatmap to depict the FPKM values of the DEGs in our dataset.

fpkms.degs <- fpkms[allDeg %>% pull(gene_id), ]

pheatmap(
  mat = fpkms.degs,
  show_rownames = FALSE,
  annotation_col = forAnnotation,
  clustering_method = "single",
  scale = "row"
)

Another graph very common for illustrating differential gene expression and for identifying genes of interest is the volcano plot. The volcano plot is a scatter plot, where each point represents a gene in our dataset. The x-axis corresponds to the logFC value, and the y-axis corresponds to the negative logarithm of the p-value -log(p-value). It is named “volcano plot” because the plot typically resembles a volcano when the data points are scattered and the significant differentially expressed genes are highlighted.

Can you make a volcano plot?

Tip

In order to generate the volcano plot, you may use the plot function from base R, or even the ggplot and geom_point functions from the ggplot2 package.

Example solution
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# Get table with logFC and p-value for all genes tested
topTags(object = dge.result, n = Inf)$table %>%
  # Make a label column based on our significance threshold
  mutate(
    Significant =
      ifelse(FDR < 0.01, "FDR < 0.01", "Not Sig")
  ) %>%
  # Merge logFC in one column, add another one indicating the comparison
  pivot_longer(c("logFC.2vs0", "logFC.4vs0"),
    names_to = "comparison",
    values_to = "logFC"
  ) %>%
  separate(comparison, into = c("prefix", "comparison")) %>%
  select(-prefix) %>%
  # Make volcano plot using ggplot
  ggplot(aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes(color = Significant)) +
  scale_color_manual(values = c("red", "grey")) +
  theme_bw(base_size = 12) +
  theme(legend.position = "bottom") +
  # Split plot for the two comparisons
  facet_wrap(~comparison, nrow = 1, scales = "free_x") +
  # Add labels for the top30 degs
  geom_text(
    data =
      allDeg %>%
        # Do the same processing as before to the allDeg table
        pivot_longer(c("logFC.2vs0", "logFC.4vs0"),
          names_to = "comparison",
          values_to = "logFC"
        ) %>%
        separate(comparison, into = c("prefix", "comparison")) %>%
        select(-prefix) %>%
        arrange(desc(abs(logFC))) %>%
        head(30),
    aes(label = gene_name),
    size = 3,
    position = position_jitter(width = 1, height = 1, seed = 150)
  )