Functional enrichment

The goal of this section is to identify GO terms or biological pathways, that are statistically enriched in the set of our DEGs.

There are diverse packages for performing GO enrichment analysis from R, which differ in the databases they consider, the statistical tests they perform, or the options for graphical representation they offer; some examples are gprofiler2, topGO, clusterProfiler or GOfuncR.

An alternative to those are web servers, such as g:Profiler –the one the gprofiler2 R client derives from. This is the one we will use here, but you can find examples on how to use both gprofiler2 and topGO for GO enrichment in the following section.

To use this tool, we need to have the IDs of the following set of genes:

the genes of interest: up/down/all DE genes
the background: the list of the genes that were used to performed the DE analysis

We can get the DEG IDs from the allDeg table.

# Set of overexpressed genes
overexpressed <-
  allDeg %>%
  filter(logFC.2vs0 > 0 & logFC.4vs0 > 0) %>%
  separate(gene_id, into = c("gene_id", "num")) %>%
  pull(gene_id)

# Set of underexpressed genes
underexpressed <-
  allDeg %>%
  filter(logFC.2vs0 < 0 & logFC.4vs0 < 0) %>%
  separate(gene_id, into = c("gene_id", "num")) %>%
  pull(gene_id)

# Set if all DEGs
degs <-
  allDeg %>%
  separate(gene_id, into = c("gene_id", "num")) %>%
  pull(gene_id)

Note

Here, we used the separate function to clean the IDs, and isolate only their major part that is recognised by gProfiler. The minor part after the . is related to versioning, and we want to discard it for the purposes of this section.

We also need the background, i.e., the genes that are expressed.

# Set of all genes used in initially in our analysis
background <-
  raw_counts_filtered %>%
  rownames_to_column("gene_id") %>%
  separate(gene_id, into = c("gene_id", "num")) %>%
  pull(gene_id)

What is the background? How should you choose it?

Now, go to the g:Profiler website. As you can see, there are four different tools available, but we will only work with the Functional profiling (g:GOSt). We will keep the basic options as they are, but select Bonferroni correction as the Significance threshold and select a custom Statistical domain scope, where we will upload our background genes.

We will perform the analysis on all three sets of genes of interest (up/down/all DE genes).

Note

To obtain the inputs needed to run g:GOSt, we would:

generate the files containing our gene IDs of interest:

write.table(x = overexpressed, 
            file = "../functional_analysis/overexpressed_genes.txt", 
            sep = "\n", 
            quote = FALSE, 
            row.names = FALSE, 
            col.names = FALSE)

write.table(x = underexpressed, 
            file = "../functional_analysis/underexpressed_genes.txt", 
            sep = "\n", 
            quote = FALSE, 
            row.names = FALSE, 
            col.names = FALSE)

write.table(x = degs, 
            file = "../functional_analysis/degs.txt", 
            sep = "\n", 
            quote = FALSE, 
            row.names = FALSE, 
            col.names = FALSE)

write.table(x = background, 
            file = "../functional_analysis/background_genes.txt", 
            sep = "\n", 
            quote = FALSE, 
            row.names = FALSE, 
            col.names = FALSE)

Copy them to our local device using scp.
Upload them to g:Profiler.

For this hands-on, you can download these four files from the GitHub repository of the course

Leave the rest of the options as they are and Run query.

Can a gene have more than 1 GO term associated? Can you think of any category of genes that are not suitable for this kind of analysis?

Which functions do you find enriched when considering all DEG genes? Do they make sense with the biological process we are looking at?

How does the result change if you change the method for p-value correction in the gost function?

Can you make a barplot illustrating the number of DEGs in our dataset that belong to the top enriched KEGG and/or REAC pathways?

Tip

In order to get the results in an easy-to-parse table, you need to go to to the Detailed results tab and download the CSV file, which you can then copy to the cluster with scp.

The intersection_size column in the output is the number of genes from the query that belong to each enriched term. You may use the barplot function from base R, or the ggplot and geom_bar functions from the ggplot2 package, in order to make the plot. Each one of the top enriched terms should be represented by a different bar in the graph.

GO enrichment in R

gprofiler2

In this package, the function that performs the analysis is gost. We can directly pipe the output to the gostplot function to easily generate an interactive plot illustrating the enriched functional pathways and terms. You will see enriched, terms from different databases, like KEGG, REAC, GO, and WikiPathways.

library(gprofiler2)

# Functional overrepresentation analysis for overexpressed genes

gost(
  query = overexpressed,
  custom_bg = background,
  organism = "hsapiens",
  correction_method = "bonferroni"
) %>%
  gostplot(capped = FALSE, interactive = TRUE)


# Functional overrepresentation analysis for underexpressed genes
gost(
  query = underexpressed,
  custom_bg = background,
  organism = "hsapiens",
  correction_method = "bonferroni"
) %>%
  gostplot(capped = FALSE, interactive = TRUE)


# Functional overrepresentation analysis for all DEGs
gost(
  query = degs,
  custom_bg = background,
  organism = "hsapiens",
  correction_method = "bonferroni"
) %>%
  gostplot(capped = FALSE, interactive = TRUE)

topGO

The procedure with this library is less straightforward, than gprofiler2’s. First, we need to generate a database containing the GO terms of the genes included in our background:

library(topGO)
library(biomaRt)

# Load complete database from ENSEMBL
db <- useMart("ensembl",
              dataset = "hsapiens_gene_ensembl")

# Subset the GO terms from our background genes
go_db <- getBM(attributes = c("go_id", 
                                  "ensembl_gene_id"), 
                   filters = "ensembl_gene_id", 
                   values = background, 
                   mart = db) %>%
  unstack() # Convert it to a list 

# Functional overrepresentation analysis for overexpressed genes
## Keep only annotated genes 
over4go <- overexpressed[overexpressed %in% names(go_db)]

## And generate named factor to signal the candidate genes to be tested (1 for candidate genes)
over4go <- factor(as.integer(background %in% over4go))
names(overgo) <- background

## Generate topGO data object
over_GOdata <- new("topGOdata", 
                   ontology = "BP", # choosing the biological process ontology
                   allGenes = over4go, 
                   annot = annFUN.gene2GO, # annotation function when annotations are given in a gene-to-GO manner
                   gene2GO = go_db)

## Test for significant enrichments
over_result <- runTest(over_GOdata, 
                       algorithm = "weight01", # take into account GO hierarchy
                       statistic = "fisher") 

## Generate a table for the results
over_allGO <- usedGO(over_GOdata)
over_res <- GenTable(over_GOdata, 
                     weightFisher = over_result, 
                     orderBy = "weightFisher", 
                     topNodes = length(over_allGO))

# Functional overrepresentation analysis for underexpressed genes
## Keep only annotated genes 
under4go <- underexpressed[underexpressed %in% names(go_db)]

## And generate named factor to signal the candidate genes to be tested (1 for candidate genes)
under4go <- factor(as.integer(background %in% under4go))
names(undergo) <- background

## Generate topGO data object
under_GOdata <- new("topGOdata", 
                    ontology = "BP", # choosing the biological process ontology
                    allGenes = under4go, 
                    annot = annFUN.gene2GO, # annotation function when annotations are given in a gene-to-GO manner
                    gene2GO = go_db)

## Test for significant enrichments
under_result <- runTest(under_GOdata, 
                        algorithm = "weight01", # take into account GO hierarchy
                        statistic = "fisher") 

## Generate a table for the results
under_allGO <- usedGO(under_GOdata)
under_res <- GenTable(under_GOdata, 
                      weightFisher = under_result, 
                      orderBy = "weightFisher", 
                      topNodes = length(over_allGO))

# Functional overrepresentation analysis for all DEGs
## Keep only annotated genes 
degs4go <- degs[degs %in% names(go_db)]

## And generate named factor to signal the candidate genes to be tested (1 for candidate genes)
degs4go <- factor(as.integer(background %in% degs4go))
names(degs4go) <- background

## Generate topGO data object
degs_GOdata <- new("topGOdata", 
                   ontology = "BP", # choosing the biological process ontology
                   allGenes = degs4go, 
                   annot = annFUN.gene2GO, # annotation function when annotations are given in a gene-to-GO manner
                   gene2GO = go_db)

## Test for significant enrichments
degs_result <- runTest(degs_GOdata, 
                       algorithm = "weight01", # take into account GO hierarchy
                       statistic = "fisher") 

## Generate a table for the results
degs_allGO <- usedGO(degs_GOdata)
degs_res <- GenTable(degs_GOdata, 
                     weightFisher = degs_result, 
                     orderBy = "weightFisher", 
                     topNodes = length(degs_allGO))

The p-values reported by this analysis do not take into account multiple testing; for that, I refer you to topGO’s documentation:

The p-values computed by the runTest function are unadjusted for multiple testing. We do not advocate against adjusting the p-values of the tested groups, however in many cases adjusted p-values might be misleading.