12. References (enrichment)

12.1. Enrichment tools

class GSEA(gene_sets={})[source]

Constructor

Parameters:

species --

compute_enrichment(gene_list, background=None, verbose=False, outdir=None)[source]
Parameters:
  • gene_list -- list of genes (e.g. genes with significant fold change)

  • background -- expected background of the species. Should be number of genes for the species of interest.

  • verbose --

  • outdir (str) -- a temporary directory to store reports and intermediate results

gene_sets

attribute to store the gene sets to be checked for enrichment

class KEGGPathwayEnrichment(gene_lists, organism, progress=True, mapper=None, background=None, padj_cutoff=0.05, preload_directory=None, convert_input_gene_to_upper_case=False, color_node_with_annotation='Name', used_genes=None)[source]

KEGG Pathways enrichment from DGE results

DGE = Differentially Gene Expression

Current input is the output of the RNADiff analysis. This is a file than can be read by RNADiffResults

This class takes as input a dictionary with 3 lists of genes: 'up', 'down' and 'all'.

For example, using the output of the RNA-seq DGE, use:

from sequana import RNADiffResults
r = RNADiffResults("rnadiff.csv")
r._log2_fc = 1
gl = r.get_gene_lists(annot_col="gene_name")
gl['KO_vs_cont"]

ke = KEGGPathwayEnrichment(gl, "mmu") # mmu for mouse

this calls ke.compute_enrichment() that stores the up, down and all results in the attribute enrichment as a dictionary.

You can now plot the results:

ke.barplot('down')

and save enriched pathways as follows:

up = ke.save_significant_pathways("up")
down = ke.save_significant_pathways("down")
up.to_csv("kegg_pathway_up_regulated.csv")
down.to_csv("kegg_pathway_down_regulated.csv")

This class works like a charm for ecoli with GFF that uses gene names.

For mus musculus, organism is mmu (not **mus*). you will need to have a mapping of the Ensembl ID into KEGG IDs (actually gene name).

You can perform the conversion using BioServices/BioMart. We have implemented a simple function inside Sequana:

from sequana import Mart
conv = Mart("mmusculus_gene_ensembl")
df = conf.query()
conf.save(df)

You can then import the dataframe, as follows using the mapper argument:

import pandas as pd
df = pd.read_csv("biomart.csv")
df = df.rename({"external_gene_name":"name", "ensembl_gene_id": "ensembl"},
        axis=1)
df = df.set_index("ensembl", inplace=True)

ke = KEGGPathwayEnrichment("path_to_rnadiff", "mmu", mapper=df)

ke.scatterplot('down')
savefig("B4052_T1vsT0_KE_scatterplot_down.png")
ke.scatterplot('up')
savefig("B4052_T1vsT0_KE_scatterplot_up.png")

Pathways are loaded from KEGG which may take some time. For development or production, you may want to save the KEGG pathways in a given place. Note however, that the pathways that are enriched will still be downloaded for the final annotation and image creation

To save the pathways locally and load them later do as follows (here for human):

ke = KEGGPathwayEnrichment({}, organism="hsa")
ke.save_pathways("all_pathways/")

# load them back next time
ke = KEGGPathwayEnrichment({}, organism="hsa", preload_directory="all_pathways")

In some cases, the input identifiers are converted into names thanks to the input mapper (csv file). Yet, if the external name are from one species and you use another species in kegg, the kegg names may be upper case while your species' name are in lower case. In such situations, you may set input identifiers are upper case setting the convert_input_gene_to_upper_case parameter to True

barplot(category, nmax=15)[source]
barplot_up_and_down(nmax=15)[source]
compute_enrichment(background=None)[source]
find_pathways_by_gene(gene_name)[source]

Returns pathways that contain the gene name

ke.find_pathways_by_gene("ysgA")
plot_genesets_hist(bins=20)[source]
save_pathway(pathway_ID, df, scale=None, show=False, filename=None)[source]
save_pathways(out_directory)[source]
save_project(tag, outdir='.')[source]

Save tables and visualisations of the complete enrichmment analysis.

save_significant_pathways(category, nmax=20, tag='', outdir='.')[source]

category should be up, down or all

scatterplot(category, nmax=15)[source]
class Mart(dataset="mmusculus_gene_ensembl") # you could choose hsapiens_gene_ensembl for instance df = conv.query() df.set_index("ensembl_gene_id") conv.save(df)[source]

The file can now be loaded in e.g. KeggPathwayEnrichment as a mapper of the ensemble identifier to external names understood by Kegg.

For fungi (cryptococcus), you can use:

m = Mart(host="fungi.ensembl.org", dataset="cneoformans_eg_gene", mart="fungi_mart")
m.query(['cneoformans_eg_gene'])

See more information on https://bioservices.readthedocs.io

property dataset
query(attributes=['ensembl_gene_id', 'go_id', 'entrezgene_id', 'external_gene_name'])[source]
save(df, filename=None)[source]

df is the output of query(). This function save it keeping track of day/month/year and dataset.

class Ontology[source]

Simple place holder for ontology information

class PantherEnrichment(gene_lists, taxon, requests_per_sec=10, padj_threshold=0.05, log2_fc_threshold=0, fc_threshold=None, enrichment_fdr=0.05, annot_col='Name')[source]

By default, we keep only the genes with a adjusted pvalue <= 0.05. The fold_change threshold is on a log2 scale and set to 0 (meaning no filtering). Only one value is requested and used to filter out positive and negative fold change (on the log2 scale). In other word, a log2 fold change threshold of 2 means that we filter out values between -2 and 2.

If you prefer to work in natural scale, just set the parameter fc_threshold, which overwrite the log2_fc_threshold parameter.

pe = PantherEnrichment("input_file.tsv", taxon=10090, log2_fc_threshold=1)
# compute enrichment for genes down and up
pe.compute_enrichment()

# Results for up case is stored in pe.enrichment
# then, we plot the most mportat go terms
df_up = pe.plot_go_terms("up")

df = pe.plot_go_terms("up", pe.MF)
pe.save_chart(df, "chart_MF_up.png")

# all 3 main ontology
df = pe.plot_go_terms("up")
pe.save_chart(df, "chart_up.png")

e.stats contains some statistics. One important is the list of unmapped genes. The results from the GO enrichment are stored in the attributes enrichment. There, we have again adjusted p-value and a fold enrichment, which can in turn be filtered or not.

You can retrieve the cleaned data using the get_data method.

You can also plot the GO terms that are significantly enriched using:

e.plot_go_terms(['MF', 'CC', 'BP])

This function returns the dataframe used during the plotting.

If you want to look at the up regulated genes only:

e.compute_enrichment(pe.mygenes["up"], 83333)
df = e.plot_go_terms(['MF', 'CC', 'BP'],
        log=False, include_negative_enrichment=False,
        fontsize=8, sort_by='fold_enrichment',
        show_pvalues=True, fdr_threshold=0.05)

The number of genes is limited to about 3100 depending (don't ask me why, this seem to be a hard-coded limitation on PantherDB website). In such case, you should add a filter e.g on padj or fold change

rnadiff if provided, superseeds the input filename. This is useful for debugging

compute_enrichment(taxid=None, ontologies=None, enrichment_test='FISHER', correction='FDR')[source]
Parameters:
  • enrichment_test -- Fisher or Binomial

  • correction -- FDR or Bonferonni

The field number_in_reference indicates from the reference, the number of genes that have a given ontolgy term. For instance, 998 genes have the term. This is stored in number_in_reference. If the reference contains 4391 genes, and you provided 49 genes , the expected number of genes that have this ontology term is 49*998/4391 that is 11.1369, which is stored in "expected.

Now, if you actually find that 14 out of 49 genes have the term, you need to compare the numbers 11.1369 and 14. Are they really different ? The ratio 14 / 11.1369 is stored in fold_enrichment. The pvalue and FDR are stored as well.

Some genes may be missing If you provide 50 genes, you may end up with only 45 being mapped onto panther db database. This may explain some differences with the expected value.

Fold enrichment is the number_in_list / expected ratio. Another close metric is the fractional difference: (observed - expected) / expected. This metric is slighlty less than the fold enrichment

To get the number f genes from the database, simply take one output, and compute number_in_reference * number_in_list / expected

The fold enrichment is also called odd-ratio.

get_data(category, ontologies, include_negative_enrichment=True, fdr=0.05)[source]

From all input GO term that have been found and stored in enrichment[ONTOLOGY]['result'], we keep those with fdr<0.05. We also exclude UNCLASSIFIED entries. The final dataframe is returned

pe.get_data(""up", MF")
get_functional_classification(mygenes, taxon)[source]

Mapping information from pantherDB for the lisf of genes

We also store uniprot persistent id

get_mapping_stats()[source]
plot_piechart(df)[source]
save_chart("B4052-V1.T1vsT0.complete.xls", fc_threshold=5, padj_threshold=0.05) df = pe.plot_go_terms("down", log=True, compute_levels=False) pe.save_chart(df, "chart.png")[source]
property taxon