Functional enrichment
The goal of this section is to identify GO terms or biological pathways, that are statistically enriched in the set of our DEGs.
There are diverse packages for performing GO enrichment analysis from R
, which differ in the databases they consider, the statistical tests they perform, or the options for graphical representation they offer; some examples are gprofiler2
, topGO
, clusterProfiler
or GOfuncR
.
An alternative to those are web servers, such as g:Profiler –the one the gprofiler2
R
client derives from. This is the one we will use here, but you can find examples on how to use both gprofiler2
and topGO
for GO enrichment in the following section.
To use this tool, we need to have the IDs of the following set of genes:
- the genes of interest: up/down/all DE genes
- the background: the list of the genes that were used to performed the DE analysis
We can get the DEG IDs from the allDeg
table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Note
Here, we used the separate
function to clean the IDs, and isolate only
their major part that is recognised by gProfiler. The minor part after the
.
is related to versioning, and we want to discard it for the purposes of
this section.
We also need the background, i.e., the genes that are expressed.
# Set of all genes used in initially in our analysis
background <-
raw_counts_filtered %>%
rownames_to_column("gene_id") %>%
separate(gene_id, into = c("gene_id", "num")) %>%
pull(gene_id)
What is the background? How should you choose it?
Now, go to the g:Profiler website. As you can see, there are four different tools available, but we will only work with the Functional profiling (g:GOSt
). We will keep the basic options as they are, but select Bonferroni correction
as the Significance threshold and select a custom Statistical domain scope, where we will upload our background
genes.
We will perform the analysis on all three sets of genes of interest (up/down/all DE genes).
Note
To obtain the inputs needed to run g:GOSt
, we would:
-
generate the files containing our gene IDs of interest:
write.table(x = overexpressed, file = "../functional_analysis/overexpressed_genes.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE) write.table(x = underexpressed, file = "../functional_analysis/underexpressed_genes.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE) write.table(x = degs, file = "../functional_analysis/degs.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE) write.table(x = background, file = "../functional_analysis/background_genes.txt", sep = "\n", quote = FALSE, row.names = FALSE, col.names = FALSE)
-
Copy them to our local device using
scp
. -
Upload them to
g:Profiler
.
For this hands-on, you can download these four files from the GitHub repository of the course
Leave the rest of the options as they are and Run query.
Can a gene have more than 1 GO term associated? Can you think of any category of genes that are not suitable for this kind of analysis?
Which functions do you find enriched when considering all DEG genes? Do they make sense with the biological process we are looking at?
How does the result change if you change the method for p-value correction in the gost
function?
Can you make a barplot illustrating the number of DEGs in our dataset that belong to the top enriched KEGG and/or REAC pathways?
Tip
In order to get the results in an easy-to-parse table, you need to go to to the Detailed results tab and download the CSV file, which you can then copy to the cluster with scp
.
The intersection_size
column in the output is the number of
genes from the query that belong to each enriched term. You may use the
barplot
function from base R
, or the ggplot
and geom_bar
functions
from the ggplot2
package, in order to make the plot. Each one of the top
enriched terms should be represented by a different bar in the graph.
GO enrichment in R
gprofiler2
In this package, the function that performs the analysis is gost
. We can directly pipe the
output to the gostplot
function to easily generate an interactive plot
illustrating the enriched functional pathways and terms. You will see enriched,
terms from different databases, like KEGG, REAC, GO, and
WikiPathways.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
|
topGO
The procedure with this library is less straightforward, than gprofiler2
’s. First, we need to generate a database containing the GO terms of the genes included in our background
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |
|
The p-values reported by this analysis do not take into account multiple testing; for that, I refer you to topGO
’s documentation:
The p-values computed by the runTest function are unadjusted for multiple testing. We do not advocate against adjusting the p-values of the tested groups, however in many cases adjusted p-values might be misleading.