14. References (Viz)

14.1. Visualisation 

14.1.1. Anova 

class ANOVA(df)[source]

DRAFT

Testing if 3(+) population means are all equal.

The groups below look different, at least visually and naively.

from pylab import *
from sequana.viz import ANOVA
import pandas as pd

A = normal(0.5,size=10000)
B = normal(0.25, size=10000)
C = normal(0, 0.5,size=10000)
df = pd.DataFrame({"A":A, "B":B, "C":C})
a = ANOVA(df)
print(a.anova())
a.imshow_anova_pairs()

anova()[source]

Perform one-way ANOVA.

The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. The test is applied to samples from two or more groups. Since we are using a dataframe, all vectors have the same length.

Returns: the F value (the test statistic) and its p-value
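
As a rough illustration of what the F value measures, the sketch below computes the one-way ANOVA F statistic by hand in plain Python. It is not the sequana implementation: the helper name f_statistic and the toy groups are assumptions made for this example, and the p-value (which requires the F distribution) is left out.

```python
from statistics import mean


def f_statistic(groups):
    """Return the one-way ANOVA F statistic for a list of samples.

    Hypothetical helper written for illustration only.
    """
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Between-group sum of squares: spread of the group means
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread inside each group
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    # Ratio of the between-group and within-group mean squares
    return (ss_between / (k - 1)) / (ss_within / (n - k))


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
F = f_statistic(groups)
print(F)
```

A large F means that the group means are far apart compared with the spread inside each group; here the third group drives the value up.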

imshow_anova_pairs(log=True, **kargs)[source]

14.1.2. corrplot 

Corrplot utilities

references:: http://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

class Corrplot(data, na=0, compute_correlation=False)[source]

An implementation of correlation plotting tools (corrplot)

Here is a simple example with a correlation matrix as an input (stored in a pandas dataframe):

# create a correlation-like data set stored in a pandas dataframe.
import string
import numpy as np
import pandas as pd
# letters = string.uppercase[0:10] # python2
letters = string.ascii_uppercase[0:10]
df = pd.DataFrame({k: np.random.random(10) + ord(k) - 65 for k in letters})

# and use corrplot
from sequana.viz import corrplot
c = corrplot.Corrplot(df)
c.plot()

14.1.3. Heatmap, dendrogram 

Heatmap and dendrograms

class Clustermap(data_df, sample_groups_df=None, sample_groups_sel=[], sample_groups_palette=[(0.23843137254901958, 0.44549019607843127, 0.5890196078431373), (0.8109803921568628, 0.5098039215686274, 0.24392156862745096), (0.2635294117647059, 0.5364705882352941, 0.2635294117647059), (0.7019607843137255, 0.2901960784313725, 0.292549019607843), (0.5772549019607842, 0.4713725490196078, 0.6737254901960784), (0.4980392156862746, 0.3709803921568628, 0.3450980392156863), (0.8054901960784313, 0.551372549019608, 0.7278431372549017), (0.4980392156862745, 0.4980392156862745, 0.4980392156862745), (0.6172549019607845, 0.6196078431372549, 0.2549019607843137), (0.23450980392156862, 0.6274509803921567, 0.6674509803921569)], gene_groups_df=None, gene_groups_sel=[], gene_groups_palette=[(0.23843137254901958, 0.44549019607843127, 0.5890196078431373), (0.8109803921568628, 0.5098039215686274, 0.24392156862745096), (0.2635294117647059, 0.5364705882352941, 0.2635294117647059), (0.7019607843137255, 0.2901960784313725, 0.292549019607843), (0.5772549019607842, 0.4713725490196078, 0.6737254901960784), (0.4980392156862746, 0.3709803921568628, 0.3450980392156863), (0.8054901960784313, 0.551372549019608, 0.7278431372549017), (0.4980392156862745, 0.4980392156862745, 0.4980392156862745), (0.6172549019607845, 0.6196078431372549, 0.2549019607843137), (0.23450980392156862, 0.6274509803921567, 0.6674509803921569)], yticklabels='auto', **kwargs)[source]

Heatmap and dendrograms based on seaborn Clustermap

from sequana.viz.heatmap import Clustermap, get_clustermap_data
df, sample_groups_df, gene_groups_df = get_clustermap_data()
h = Clustermap(df, sample_groups_df=sample_groups_df, gene_groups_df=gene_groups_df)
h.plot()

Constructor

Parameters:

data_df -- a dataframe.
sample_groups_df -- a dataframe with sample ids as index (same as the data_df columns) and one group definition per column. Used to produce the x axis color groups.
sample_groups_sel -- a list of the columns to select from sample_groups_df.
sample_groups_palette -- the palette to use for sample color groups.
gene_groups_df -- a dataframe with gene ids as index (same as the data_df index) and one group definition per column. Used to produce the y axis color groups.
gene_groups_sel -- a list of the columns to select from gene_groups_df.
gene_groups_palette -- the palette to use for gene color groups.
yticklabels -- "auto" for classical heatmap behaviour, [] for no ticks, or a pandas Series giving the mapping between the index (gene names in data_df) and the gene names to be used for the heatmap.
kwargs -- all other kwargs are passed to seaborn.clustermap.

plot(cmap=None)[source]

class Heatmap(data=None, row_method='complete', column_method='complete', row_metric='euclidean', column_metric='euclidean', cmap='yellow_black_blue', col_side_colors=None, row_side_colors=None, verbose=True)[source]

Heatmap and dendrograms of an input matrix

A heat map is an image representation of a matrix with a dendrogram added to the left side and to the top. Typically, reordering of the rows and columns according to some set of values (row or column means) within the restrictions imposed by the dendrogram is carried out.

from sequana.viz import heatmap
df = heatmap.get_heatmap_df()
h = heatmap.Heatmap(df)
h.plot()

Side colors can be added:

h = heatmap.Heatmap(df, col_side_colors=['r', 'g', 'b', 'y', 'k'])
h.category_column = category
h.category_row = category

where category is a dictionary whose keys are the df.columns and whose values are categories defined by you. The number of colors in col_side_colors and row_side_colors should match the number of categories.
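
The relation between the category dictionary and the side colors can be checked before plotting. The sketch below is plain Python and does not call sequana; the column names, the category labels and the helper check_side_colors are assumptions made for illustration.

```python
def check_side_colors(category, side_colors):
    """Return True if there is exactly one color per distinct category."""
    n_categories = len(set(category.values()))
    return n_categories == len(side_colors)


# Keys play the role of df.columns; values are user-defined categories
category = {"A": 1, "B": 2, "C": 1, "D": 2}

print(check_side_colors(category, ["r", "g"]))        # two categories, two colors
print(check_side_colors(category, ["r", "g", "b"]))   # one color too many
```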

constructor

Parameters:

data -- a dataframe or possibly a numpy matrix.
row_method -- complete by default
column_method -- complete by default. See linkage module for details
row_metric -- euclidean by default
column_metric -- euclidean by default
cmap -- colormap. Any colormap accepted by matplotlib, or a combination of colors as defined in the colormap package (PyPI)
col_side_colors --
row_side_colors --

property column_method

property column_metric

property df

property frame

plot(num=1, cmap=None, colorbar=True, vmin=None, vmax=None, colorbar_position='right', gradient_span='None', figsize=(12, 8), fontsize=None)[source]

Using as input:

df = pd.DataFrame({'A':[1,0,1,1],
                   'B':[.9,0.1,.6,1],
                   'C':[.5,.2,0,1],
                   'D':[.5,.2,0,1]})

we can plot the heatmap + dendrogram as follows:

h = Heatmap(df)
h.plot(vmin=0, vmax=1.1)

from sequana.viz import heatmap
df = heatmap.get_heatmap_df()
h = heatmap.Heatmap(df)
h.category_column['A'] = 1
h.category_column['C'] = 1
h.category_column['D'] = 2
h.category_column['B'] = 2
h.plot()

property row_method

property row_metric

14.1.4. Hinton plot 

Hinton plot

author:: Thomas Cokelaer

hinton(df, fig=1, shrink=2, method='square', bgcolor='grey', cmap='gray_r', binarise_color=True)[source]

Hinton plot (simplified version of correlation plot)

Parameters:

df -- the input data as a dataframe or list of items (list, array). See Corrplot for details.
fig -- in which figure to plot the data
shrink -- factor to increase/decrease sizes of the symbols
method -- set the type of symbols for each coordinates. (default to square). See Corrplot for more details.
bgcolor -- set the background and label colors as grey
cmap -- gray color map used by default
binarise_color -- use only two colors. One for positive values and one for negative values.

import numpy as np
from sequana.viz import hinton
df = np.random.rand(20, 20) - 0.5
hinton(df)

Note

Idea taken from a matplotlib recipe http://matplotlib.org/examples/specialty_plots/hinton_demo.html but solely using the implementation within Corrplot.

Note

Values must be between -1 and 1. No sanity check performed.
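
Because no sanity check is performed, a caller may want to validate the range first. A minimal sketch in plain Python, working on a list of lists; the helper name in_unit_range is hypothetical and is not part of sequana.

```python
def in_unit_range(matrix):
    """Return True if every value of a 2D list lies in [-1, 1]."""
    return all(-1 <= value <= 1 for row in matrix for value in row)


ok = [[0.2, -0.5], [1.0, -1.0]]
bad = [[0.2, 1.5], [0.0, -0.3]]   # 1.5 is out of range

print(in_unit_range(ok))
print(in_unit_range(bad))
```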

14.1.5. 2D Histogram 

2D histograms

class Hist2D(x, y=None, verbose=False)[source]

2D histogram

from numpy import random
from sequana.viz import hist2d
X = random.randn(10000)
Y = random.randn(10000)
h = hist2d.Hist2D(X,Y)
h.plot(bins=100, contour=True)

constructor

Parameters:

x -- an array for X values. See VizInput2D for details.
y -- an array for Y values. See VizInput2D for details.

plot(bins=100, cmap='hot_r', fontsize=10, Nlevels=4, xlabel=None, ylabel=None, norm=None, range=None, normed=False, colorbar=True, contour=True, grid=True, **kargs)[source]

Plots a histogram of mean across replicates versus coefficient of variation.

Parameters:

bins (int) -- binning for the 2D histogram (either an integer or a list of 2 binning values).
cmap -- a valid colormap (defaults to hot_r)
fontsize -- fontsize for the labels
Nlevels (int) -- must be more than 2
xlabel (str) -- set the xlabel (overwrites content of the dataframe)
ylabel (str) -- set the ylabel (overwrites content of the dataframe)
norm -- set to 'log' to show the log10 of the values.
normed -- normalise the data
range -- as in pylab.hist2d: a 2x2 shape [[-3,3],[-4,4]]
contour -- show some contours (defaults to True)
grid (bool) -- Show underlying grid (defaults to True)

If the input is a dataframe, the xlabel and ylabel will be populated with the column names of the dataframe.
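
To make the bins and range parameters concrete, the sketch below counts points on a regular 2D grid in plain Python. It only illustrates the binning idea, not how Hist2D is implemented; hist2d_counts and its arguments are names invented for this example.

```python
def hist2d_counts(xs, ys, bins, value_range):
    """Count (x, y) points in a bins x bins grid.

    value_range follows the [[xmin, xmax], [ymin, ymax]] layout.
    Points outside the range are ignored.
    """
    (xmin, xmax), (ymin, ymax) = value_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        # Map each coordinate to a bin index; the upper edge goes in the last bin
        i = min(int((x - xmin) / (xmax - xmin) * bins), bins - 1)
        j = min(int((y - ymin) / (ymax - ymin) * bins), bins - 1)
        counts[i][j] += 1
    return counts


xs = [-2.5, -2.0, 0.1, 2.9, 3.0, 5.0]
ys = [-3.5, -3.0, 0.1, 3.9, 4.0, 0.0]
counts = hist2d_counts(xs, ys, bins=2, value_range=[[-3, 3], [-4, 4]])
print(counts)
```

The last point (5.0, 0.0) falls outside the requested range and is therefore not counted.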

14.1.6. Image 

Imshow utility

class Imshow(x, verbose=True)[source]

Wrapper around the matplotlib.imshow function

Very similar to matplotlib, but sets interpolation to None and aspect to automatic, and accepts a dataframe as input, in which case x and y labels are set automatically.

import numpy as np
import pandas as pd
data = dict([ (letter,np.random.randn(10)) for letter in 'ABCDEFGHIJK'])
df = pd.DataFrame(data)

from sequana.viz import Imshow
im = Imshow(df)
im.plot()

constructor

Parameters:: x -- input dataframe (or numpy matrix/array). Must be squared.

plot(interpolation='None', aspect='auto', cmap='hot', tight_layout=True, colorbar=True, fontsize_x=None, fontsize_y=None, rotation_x=90, xticks_on=True, yticks_on=True, **kargs)[source]

wrapper around imshow to plot a dataframe

Parameters:

interpolation -- set to None
aspect -- set to 'auto'
cmap -- colormap to be used.
tight_layout --
colorbar -- add a colorbar (defaults to True)
fontsize_x -- fontsize on xlabels
fontsize_y -- fontsize on ylabels
rotation_x -- rotate labels on xaxis
xticks_on -- set to False to switch off the xticks and labels
yticks_on -- set to False to switch off the yticks and labels

14.1.7. PCA 

class PCA(data, colors={})[source]

from sequana.viz.pca import PCA
from sequana import sequana_data
import pandas as pd

data = sequana_data("test_pca.csv")
df = pd.read_csv(data)
df = df.set_index("Id")
p = PCA(df, colors={
    "A1": 'r', "A2": 'r', 'A3': 'r',
    "B1": 'b', "B2": 'b', 'B3': 'b'})
p.plot(n_components=2)

As in common R workflows, the PCA is computed on the first 500 features ranked by variance (see the max_features parameter of plot).
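
The variance-based selection can be sketched in plain Python as follows. This is an illustration under assumptions, not the sequana code: the helper top_variance_features, the toy table and the gene names are all hypothetical.

```python
from statistics import variance


def top_variance_features(table, max_features):
    """Keep the max_features feature names with the largest variance.

    table maps a feature name to its values across samples.
    """
    ranked = sorted(table, key=lambda name: variance(table[name]), reverse=True)
    return ranked[:max_features]


table = {
    "gene1": [10, 10, 10, 10],   # constant across samples: variance 0
    "gene2": [1, 5, 9, 13],      # most variable feature
    "gene3": [4, 5, 4, 5],
}
selected = top_variance_features(table, max_features=2)
print(selected)
```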

constructor

Parameters:

data -- a dataframe; Each column being a sample.
colors -- a mapping of column/sample name to a color

plot(n_components=2, transform='log', switch_x=False, switch_y=False, switch_z=False, colors=None, max_features=500, show_plot=True, fontsize=10, adjust=True)[source]

Parameters:

n_components -- an integer starting at 2, or a value below 1; e.g. 0.95 means that the number of components is selected automatically to capture 95% of the variance
transform -- can be 'log' or 'anscombe'. 'log' is simply log10; zero counts are set to 1
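
The two parameters above can be illustrated with a short sketch in plain Python. It assumes that the explained-variance ratios of the components are already known and simply shows the two rules described here; log_transform and n_components_for are hypothetical helpers, not sequana functions.

```python
import math


def log_transform(counts):
    """log10 of the counts, with zeros replaced by 1 so that log10 is defined."""
    return [math.log10(c if c != 0 else 1) for c in counts]


def n_components_for(explained_ratios, fraction):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches the requested fraction."""
    total = 0.0
    for n, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= fraction:
            return n
    return len(explained_ratios)


transformed = log_transform([0, 1, 10, 1000])
needed = n_components_for([0.6, 0.25, 0.1, 0.05], 0.9)
print(transformed)
print(needed)
```

With these toy ratios, the first two components capture 85% of the variance, so a third one is needed to reach 90%.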

plot_pca_vs_max_features(step=100, n_components=2, transform='log')[source]

from sequana.viz.pca import PCA
from sequana import sequana_data
import pandas as pd

data = sequana_data("test_pca.csv")
df = pd.read_csv(data)
df = df.set_index("Id")

p = PCA(df)
p.plot_pca_vs_max_features()

14.1.8. ScatterPlot 

Scatter plots

author:: Thomas Cokelaer

class ScatterHist(x, y=None, verbose=True)[source]

Scatter plots and histograms

constructor

Parameters:

x -- if x is provided, it should be a dataframe with 2 columns. The first one will be used as your X data, and the second one as the Y data
y --
verbose --

plot(kargs_scatter={'c': 'b', 's': 20}, kargs_grids={}, kargs_histx={}, kargs_histy={}, scatter_position='bottom left', width=0.5, height=0.5, offset_x=0.1, offset_y=0.1, gap=0.06, facecolor='lightgrey', grid=True, show_labels=True, **kargs)[source]

Scatter plot of set of 2 vectors and their histograms.

Parameters:

x -- a dataframe or a numpy matrix (2 vectors) or a list of 2 items, which can be a mix of list or numpy array. if size and/or color are found in the columns dataframe, those columns will be used in the scatter plot. kargs_scatter keys c and s will then be ignored. If a list of lists, x will be the first row and y the second row.
y -- if x is a list or an array, then y must also be provided as a list or an array
kargs_scatter -- a dictionary of key/value pairs accepted by the matplotlib.scatter function. Examples are a list of colors or a list of sizes, as shown in the examples below.
kargs_grids -- a dictionary of key/value pairs accepted by the matplotlib grid function (applied to the histograms and the scatter axis at the same time)
kargs_histx -- a dictionary of key/value pairs accepted by the matplotlib hist function
kargs_histy -- a dictionary of key/value pairs accepted by the matplotlib hist function
kargs -- other optional parameters are hold, facecolor.
scatter_position -- can be 'bottom right/bottom left/top left/top right'
width -- width of the scatter plot (value between 0 and 1)
height -- height of the scatter plot (value between 0 and 1)
offset_x --
offset_y --
gap -- gap between the scatter and histogram plots.
grid -- defaults to True

Returns:

the scatter, histogram1 and histogram2 axes.

import pylab
import pandas as pd
X = pylab.randn(1000)
Y = pylab.randn(1000)
df = pd.DataFrame({'X':X, 'Y':Y})

from sequana.viz import ScatterHist
ScatterHist(df).plot()

from sequana.viz import ScatterHist
ScatterHist(x=[1,2,3,4], y=[3,5,6,4]).plot(
    kargs_scatter={
        's':[200,400,600,800],
        'c': ['red', 'green', 'blue', 'yellow'],
        'alpha':0.5},
    kargs_histx={'color': 'red'},
    kargs_histy={'color': 'green'})

14.1.9. Venn diagram 

plot_venn(subsets, labels=None, title=None, ax=None, alpha=0.8, weighted=False, colors=('r', 'b', 'y'))[source]

Plot a Venn diagram according to the number of groups.

Parameters:: subsets -- This parameter may be (1) a dict, providing sizes of three diagram regions. The regions are identified via two-letter binary codes ('10', '01', and '11'), hence a valid set could look like: {'10': 10, '01': 20, '11': 40}. Unmentioned codes are considered to map to 0. (2) a list (or a tuple) with three numbers, denoting the sizes of the regions in the following order: (10, 01, 11) and (3) a list containing the subsets of values.

The subsets can be a list (or a tuple) containing two set objects. For instance:

from sequana.viz.venn import plot_venn
A = set([1,2,3,4,5,6,7,8,9])
B = set([            7,8,9,10,11])
plot_venn((A, B), labels=("A", "B"))

This is the unweighted version by default, meaning that all circles have the same size. If you prefer to have circles scaled to the size of the sets, add the relevant parameter as follows:

from sequana.viz.venn import plot_venn
A = set([1,2,3,4,5,6,7,8,9])
B = set([            7,8,9,10,11])
plot_venn((A, B), labels=("A", "B"), weighted=True)

Similarly for 3 sets, a Venn diagram can be represented as follows. Note here that we also use the title parameter:

from sequana.viz.venn import plot_venn

A = set([1,2,3,4,5,6,7,8,9])
B = set([      4,5,6,7,8,9,10,11,12,13])
C = set([   3,4,5,6,7,8,9])
plot_venn((A, B, C), labels=("A", "B", "C"), title="my Venn3 diagram")

Input can be a list/tuple of 2 or 3 sets as described above.
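
The link between a pair of sets and the binary region codes used in the dict form of subsets can be sketched in plain Python. The helper venn2_regions is hypothetical and only illustrates the encoding; it is not part of sequana.

```python
def venn2_regions(a, b):
    """Sizes of the three regions of a two-set Venn diagram,
    keyed by the two-letter binary codes."""
    return {
        "10": len(a - b),   # only in the first set
        "01": len(b - a),   # only in the second set
        "11": len(a & b),   # in both sets
    }


A = {1, 2, 3, 4, 5, 6, 7, 8, 9}
B = {7, 8, 9, 10, 11}
regions = venn2_regions(A, B)
print(regions)
```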

14.1.10. Volcano plots 

Volcano plot

class Volcano(data=None, log2fc_col='log2FoldChange', pvalues_col='padj', annot_col='', color='auto', pvalue_threshold=1.3010299956639813, log2fc_threshold=1)[source]

Volcano plot

In essence, just a scatter plot with annotations.

import numpy as np
fc = np.random.randn(1000)
pvalue = np.random.uniform(0, 1, 1000)  # p-values must lie between 0 and 1
import pandas as pd
data = pd.DataFrame([fc, pvalue])
data = data.T
data.columns = ['log2FoldChange', 'padj']

from sequana.viz import Volcano
v = Volcano(data)
v.plot()

constructor

Parameters:

data (DataFrame) -- Pandas DataFrame with rnadiff results.
log2fc_col -- Name of the column with log2 Fold changes.
pvalues_col -- Name of the column with adjusted pvalues.
annot_col -- Name of the column with genes names for plot annotation.
color -- for color choice
pvalue_threshold -- Adjusted pvalue threshold to use for coloring/annotation.
log2fc_threshold -- Log2 Fold Change threshold to use for coloring/annotation.
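
The default pvalue_threshold in the signature equals -log10(0.05), which suggests that the threshold is expressed on the -log10 scale. The sketch below shows how such thresholds could be combined to flag a point; it is an assumption about the logic, not the sequana code, and is_significant (including its use of >= comparisons) is a hypothetical helper.

```python
import math

# The default pvalue_threshold in the signature corresponds to -log10(0.05)
pvalue_threshold = -math.log10(0.05)
log2fc_threshold = 1


def is_significant(log2fc, padj):
    """True when both the fold change and the adjusted p-value pass.

    Assumes padj is a strictly positive probability.
    """
    passes_fc = abs(log2fc) >= log2fc_threshold
    passes_pvalue = -math.log10(padj) >= pvalue_threshold
    return passes_fc and passes_pvalue


print(is_significant(2.5, 0.001))   # strong change, small adjusted p-value
print(is_significant(0.3, 0.001))   # fold change too small
print(is_significant(2.5, 0.2))     # adjusted p-value too large
```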

annotate(**kwargs)[source]

plot(size=10, alpha=0.7, marker='o', fontsize=16, xlabel='fold change', logy=False, threshold_lines={'color': 'black', 'ls': '--', 'width': 0.5}, ylabel='p-value', add_broken_axes=False, broken_axes={'ylims': ((0, 10), (50, 100))})[source]

Parameters:

size -- size of the markers
alpha -- transparency of the marker
fontsize --
xlabel --
ylabel --
center -- If centering the x axis