Differential gene expression
Let’s continue with the Differential Gene Expression analysis.
Simple Comparison between two groups
To start let’s consider a simple comparison between all the treated samples and the control samples.
Note
That is why we passed the meta$Treatment
variable when we
created the dge
object.
The first step is to get the dispersion in our data. Remember, the variance is dependent of the level of expression of a gene, we need to properly estimate it.
dge <- estimateDisp(dge)
Now the dge object is updated with the dispersion information.
We can quickly make a scatter plot of the dispersion as a function of
the average gene expression with the use of the plotBCV
function.
plotBCV(dge)
Does the observed trend of the dispersion make sense?
To detect the genes that are differentially expressed between the treated and
control samples, we can use the exactTest
function. We provide the dge
object, and the groups that we want to compare (as indicated in the
meta$Treatment
variable).
dge.result <-
exactTest(object = dge, pair = c("control", "dexamethasone"))
To know how many genes are differentially expressed at a FDR of 1%, we can run the following command.
summary(decideTests(object = dge.result, p.value = 0.01))
What is FDR correction, why is it needed? Repeat the DGE analysis and set FDR to 5%. Count the number of hits.
Finally, we can also make a MA plot, by using the plotMD
function. A
mean-difference plot (aka MA plot) is a scatter plot, where each point
represent a gene in our analysis. The x-axis corresponds to the
average expression, and the y-axis corresponds to the log-fold-change
(logFC). The MA plot can help us detect a potential bias in our data.
plotMD(dge.result)
Multiple comparisons
Let’s now consider the samples treated for 2 hours and those treated for 4 hours separately. However, we want to detect the genes that are differentially expressed in either of the two comparisons, i.e. 2 hours vs control and 4 hours vs control.
To do this we need to recreate the dge
object, this time providing
the meta$Treatment_Duration
variable for the group
option.
Note
The control samples have 0 hours duration.
# Change group
dge <-
DGEList(counts = raw_counts_filtered,
group = meta$Treatment_Duration)
# Calculate again normalization factors
dge <- calcNormFactors(object = dge, method = "TMM")
Since, this comparison is more complicated, we will fit a Generalized Linear Model (GLM) to the data. To do this we should first create a design matrix.
dge.design <-
model.matrix(object = ~ 0 + Treatment_Duration, data = meta)
Print the dge.design
object. As you can see, its columns correspond to the
groups in our dataset.
Then, we estimate the dispersion in our data as we did before.
dge <- estimateDisp(y = dge, design = dge.design)
Another thing that we need is a contrast. The contrast matrix will help us define the comparisons that we are interested in.
1 2 3 4 5 6 7 8 9 10 11 |
|
We used the makeContrasts
function to create a contrast for a number
of possible comparisons among our groups. You can print it to
understand how it works.
Do you understand which is the comparison indicated by the last column of the contrast?
Now, let’s fit the GLM model on our data, and detect differentially expressed genes by performing the appropriate statistical test.
dge.fit <- glmQLFit(dge, dge.design)
dge.result <-
glmQLFTest(dge.fit,
contrast = dge.contrasts[, c("2vs0", "4vs0")]
)
As you see, we provided the part of the contrast that includes the
2vs0
and 4vs0
comparisons. That is going to test for the genes
that are differentially expressed compared to the control in either of
the two time points of the treated samples.
We can check again the number of differentially expressed genes (DEGs).
summary(decideTests(object = dge.result, p.value = 0.01))
And we can also print the “top” DEGs, in terms of FDR (adjusted p-value). Genes with negative log fold-change (logFC) are under-expressed, and gene with positive logFC are over-expressed.
topTags(object = dge.result)
Let’s save all DEGs in a table. Moreover, we usually filter for genes
with high absolute logFC. Here, we use an arbitrary cutoff of 0.5
.
Finally, we also add a column for the gene name, and another for the
gene biotype, taken from the gene_annotation
object.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Note
The y_genes
vector, that is used here, was created in the Data
preparation section.
What does a value of 1 for the logFC mean in terms of fold-change?
Which time point, 2 or 4 hours, has more DEGs?
Are there any DEGs that are over-expressed in 2vs0
but under-expressed in 4vs0
?
Another method for correcting p-values is the Bonferroni correction. Try to use this instead, and compare the distribution of the adjusted p-values generated in the two cases.
Tip
The adjust.method
option in the topTags
function can be used to select
a different correction method. You may also use the hist
function of base
R
to visualise the distribution of the adjusted p-values with a
histogram.
Can you identify the genes that are differentially expressed only in one of the two comparisons?
Example solution
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
DEG visualization
To visualize DEG expression we can make a heatmap to depict the FPKM values of the DEGs in our dataset.
fpkms.degs <- fpkms[allDeg %>% pull(gene_id), ]
pheatmap(
mat = fpkms.degs,
show_rownames = FALSE,
annotation_col = forAnnotation,
clustering_method = "single",
scale = "row"
)
Another graph very common for illustrating differential gene
expression and for identifying genes of interest is the volcano plot. The
volcano plot is a scatter plot, where each point represents a gene in our
dataset. The x-axis corresponds to the logFC value, and the y-axis corresponds
to the negative logarithm of the p-value -log(p-value)
. It is named “volcano
plot” because the plot typically resembles a volcano when the data points are
scattered and the significant differentially expressed genes are highlighted.
Can you make a volcano plot?
Tip
In order to generate the volcano plot, you may use the plot
function from base R
, or even the ggplot
and geom_point
functions from the ggplot2
package.
Example solution
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
|