14. References (Viz)

14.1. Visualisation

14.1.1. Anova

class ANOVA(df)[source]

DRAFT

Testing if 3(+) population means are all equal.

Looks like the group are different, visually, and naively.

from pylab import *
from sequana.viz import ANOVA
import pandas as pd

A = normal(0.5,size=10000)
B = normal(0.25, size=10000)
C = normal(0, 0.5,size=10000)
df = pd.DataFrame({"A":A, "B":B, "C":C})
a = ANOVA(df)
print(a.anova())
a.imshow_anova_pairs()

(Source code, png, hires.png, pdf)

_images/references_viz-1.png
anova()[source]

Perform one-way ANOVA.

The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. The test is applied to samples from two or more groups. Since we ar using a dataframe, vector length are identical.

return: the F value (test itself), and its p-value

imshow_anova_pairs(log=True, **kargs)[source]

14.1.2. corrplot

Corrplot utilities

references:

http://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

class Corrplot(data, na=0, compute_correlation=False)[source]

An implementation of correlation plotting tools (corrplot)

Here is a simple example with a correlation matrix as an input (stored in a pandas dataframe):

# create a correlation-like data set stored in a Pandas' dataframe.
import string
# letters = string.uppercase[0:10] # python2
letters = string.ascii_uppercase[0:10]
import pandas as pd
df = pd.DataFrame(dict(( (k, np.random.random(10)+ord(k)-65) for k in letters)))

# and use corrplot
from sequana.viz import corrplot
c = corrplot.Corrplot(df)
c.plot()

(Source code, png, hires.png, pdf)

_images/references_viz-2.png

See also

All functionalities are covered in this notebook

Constructor

Plots the content of square matrix that contains correlation values.

Parameters:
  • data -- input can be a dataframe (Pandas), or list of lists (python) or a numpy matrix. Note, however, that values must be between -1 and 1. If not, or if the matrix (or list of lists) is not squared, then correlation is computed. The data or computed correlation is stored in df attribute.

  • compute_correlation (bool) -- if the matrix is non-squared or values are not bounded in -1,+1, correlation is computed. If you do not want that behaviour, set this parameter to False. (True by default).

  • na -- replace NA values with this value (default 0)

The params contains some tunable parameters for the colorbar in the plot() method.

# can be a list of lists, the correlation matrix is then a 2x2 matrix
c = corrplot.Corrplot([[1,1], [2,4], [3,3], [4,4]])
df

The input data is stored in a dataframe and must therefore be compatible (list of lists, dictionary, matrices...)

order(method='complete', metric='euclidean', inplace=False)[source]

Rearrange the order of rows and columns after clustering

Parameters:
  • method -- any scipy method (e.g., single, average, centroid, median, ward). See scipy.cluster.hierarchy.linkage

  • metric -- any scipy distance (euclidean, hamming, jaccard) See scipy.spatial.distance or scipy.cluster.hieararchy

  • inplace (bool) -- if set to True, the dataframe is replaced

You probably do not need to use that method. Use plot() and the two parameters order_metric and order_method instead.

params

tunable parameters for the plot() method.

plot(fig=None, grid=True, rotation=30, lower=None, upper=None, shrink=0.9, facecolor='white', colorbar=True, label_color='black', fontsize='small', edgecolor='black', method='ellipse', order_method='complete', order_metric='euclidean', cmap=None, ax=None, binarise_color=False)[source]

plot the correlation matrix from the content of df (dataframe)

By default, the correlation is shown on the upper and lower triangle and is symmetric wrt to the diagonal. The symbols are ellipses. The symbols can be changed to e.g. rectangle. The symbols are shown on upper and lower sides but you could choose a symbol for the upper side and another for the lower side using the lower and upper parameters.

Parameters:
  • fig -- Create a new figure by default. If an instance of an existing figure is provided, the corrplot is overlayed on the figure provided. Can also be the number of the figure.

  • grid -- add grid (Defaults to grey color). You can set it to False or a color.

  • rotation -- rotate labels on y-axis

  • lower -- if set to a valid method, plots the data on the lower left triangle

  • upper -- if set to a valid method, plots the data on the upper left triangle

  • shrink (float) -- maximum space used (in percent) by a symbol. If negative values are provided, the absolute value is taken. If greater than 1, the symbols wiill overlap.

  • facecolor -- color of the background (defaults to white).

  • colorbar -- add the colorbar (defaults to True).

  • label_color (str) -- (defaults to black).

  • fontsize -- size of the fonts defaults to 'small'.

  • method -- shape to be used in 'ellipse', 'square', 'rectangle', 'color', 'text', 'circle', 'number', 'pie'.

  • order_method -- see order().

  • order_metric -- see : meth:order.

  • cmap -- a valid cmap from matplotlib or colormap package (e.g., 'jet', or 'copper'). Default is red/white/blue colors.

  • ax -- a matplotlib axes.

The colorbar can be tuned with the parameters stored in params.

Here is an example. See notebook for other examples:

c = corrplot.Corrplot(dataframe)
c.plot(cmap=('Orange', 'white', 'green'))
c.plot(method='circle')
c.plot(colorbar=False, shrink=.8, upper='circle'  )

14.1.3. Heatmap, dendogram

Heatmap and dendograms

class Clustermap(data_df, sample_groups_df=None, sample_groups_sel=[], sample_groups_palette=[(0.23843137254901958, 0.44549019607843127, 0.5890196078431373), (0.8109803921568628, 0.5098039215686274, 0.24392156862745096), (0.2635294117647059, 0.5364705882352941, 0.2635294117647059), (0.7019607843137255, 0.2901960784313725, 0.292549019607843), (0.5772549019607842, 0.4713725490196078, 0.6737254901960784), (0.4980392156862746, 0.3709803921568628, 0.3450980392156863), (0.8054901960784313, 0.551372549019608, 0.7278431372549017), (0.4980392156862745, 0.4980392156862745, 0.4980392156862745), (0.6172549019607845, 0.6196078431372549, 0.2549019607843137), (0.23450980392156862, 0.6274509803921567, 0.6674509803921569)], gene_groups_df=None, gene_groups_sel=[], gene_groups_palette=[(0.23843137254901958, 0.44549019607843127, 0.5890196078431373), (0.8109803921568628, 0.5098039215686274, 0.24392156862745096), (0.2635294117647059, 0.5364705882352941, 0.2635294117647059), (0.7019607843137255, 0.2901960784313725, 0.292549019607843), (0.5772549019607842, 0.4713725490196078, 0.6737254901960784), (0.4980392156862746, 0.3709803921568628, 0.3450980392156863), (0.8054901960784313, 0.551372549019608, 0.7278431372549017), (0.4980392156862745, 0.4980392156862745, 0.4980392156862745), (0.6172549019607845, 0.6196078431372549, 0.2549019607843137), (0.23450980392156862, 0.6274509803921567, 0.6674509803921569)], yticklabels='auto', **kwargs)[source]

Heatmap and dendrograms based on seaborn Clustermap

from sequana.viz.heatmap import Clustermap, get_clustermap_data
df, sample_groups_df, gene_groups_df = get_clustermap_data()
h = Clustermap(df, sample_groups_df=sample_groups_df, gene_groups_df=gene_groups_df)
h.plot()

(Source code, png, hires.png, pdf)

_images/references_viz-3.png

Constructor

Parameters:
  • data_df -- a dataframe.

  • sample_groups_df -- a dataframe with sample id as index (same as in data_df columns) and a group definition per column. Use to produce the x axis color groups.

  • sample_group_sel -- a list of the columns to select from the sample_groups_df.

  • sample_groups_palette -- the palette to use for sample color groups.

  • gene_groups_df -- a dataframe with gene id as index (same as in data_df columns) and a group definition per column. Use to produce the y axis color groups.

  • gene_group_sel -- a list of the columns to select from the gene_groups_df.

  • gene_groups_palette -- the palette to use for gene color groups.

  • ytickslabels -- "auto" for classical heatmap behaviour, [] for no ticks or a pandas Series giving the mapping between the index (gene names in data_df) and the gene names to be used for the heatmap

  • kwargs -- All other kwargs are passed to seaborn.Clustermap.

plot(cmap=None)[source]
class Heatmap(data=None, row_method='complete', column_method='complete', row_metric='euclidean', column_metric='euclidean', cmap='yellow_black_blue', col_side_colors=None, row_side_colors=None, verbose=True)[source]

Heatmap and dendograms of an input matrix

A heat map is an image representation of a matrix with a dendrogram added to the left side and to the top. Typically, reordering of the rows and columns according to some set of values (row or column means) within the restrictions imposed by the dendrogram is carried out.

from sequana.viz import heatmap
df = heatmap.get_heatmap_df()
h = heatmap.Heatmap(df)
h.plot()

(Source code, png, hires.png, pdf)

_images/references_viz-4.png

side colors can be added:

h = viz.Heatmap(df, col_side_colors=['r', 'g', 'b', 'y', 'k']); h.category_column = category; h.category_row = category

where category is a dictionary with keys as df.columns and values as category defined by you. The number of colors in col_side_colors and row_side_colors should match the number of category

constructor

Parameters:
  • data -- a dataframe or possibly a numpy matrix.

  • row_method -- complete by default

  • column_method -- complete by default. See linkage module for details

  • row_metric -- euclidean by default

  • column_metric -- euclidean by default

  • cmap -- colormap. any matplotlib accepted or combo of colors as defined in colormap package (pypi)

  • col_side_colors --

  • row_side_colors --

property column_method
property column_metric
property df
property frame
plot(num=1, cmap=None, colorbar=True, vmin=None, vmax=None, colorbar_position='right', gradient_span='None', figsize=(12, 8), fontsize=None)[source]

Using as input:

df = pd.DataFrame({'A':[1,0,1,1],
                   'B':[.9,0.1,.6,1],
                'C':[.5,.2,0,1],
                'D':[.5,.2,0,1]})

we can plot the heatmap + dendogram as follows:

h = Heatmap(df)
h.plot(vmin=0, vmax=1.1)
from sequana.viz import heatmap
df = heatmap.get_heatmap_df()
h = heatmap.Heatmap(df)
h.category_column['A'] = 1
h.category_column['C'] = 1
h.category_column['D'] = 2
h.category_column['B'] = 2
h.plot()

(Source code, png, hires.png, pdf)

_images/references_viz-5.png
property row_method
property row_metric

14.1.4. Hinton plot

Hinton plot

author:

Thomas Cokelaer

hinton(df, fig=1, shrink=2, method='square', bgcolor='grey', cmap='gray_r', binarise_color=True)[source]

Hinton plot (simplified version of correlation plot)

Parameters:
  • df -- the input data as a dataframe or list of items (list, array). See Corrplot for details.

  • fig -- in which figure to plot the data

  • shrink -- factor to increase/decrease sizes of the symbols

  • method -- set the type of symbols for each coordinates. (default to square). See Corrplot for more details.

  • bgcolor -- set the background and label colors as grey

  • cmap -- gray color map used by default

  • binarise_color -- use only two colors. One for positive values and one for negative values.

from sequana.viz import hinton
df = np.random.rand(20, 20) - 0.5
hinton(df)

(Source code, png, hires.png, pdf)

_images/references_viz-6.png

Note

Idea taken from a matplotlib recipes http://matplotlib.org/examples/specialty_plots/hinton_demo.html but solely using the implementation within Corrplot

Note

Values must be between -1 and 1. No sanity check performed.

14.1.5. 2D Histogram

2D histograms

class Hist2D(x, y=None, verbose=False)[source]

2D histogram

from numpy import random
from sequana.viz import hist2d
X = random.randn(10000)
Y = random.randn(10000)
h = hist2d.Hist2D(X,Y)
h.plot(bins=100, contour=True)

(Source code, png, hires.png, pdf)

_images/references_viz-7.png

constructor

Parameters:
  • x -- an array for X values. See VizInput2D for details.

  • y -- an array for Y values. See VizInput2D for details.

plot(bins=100, cmap='hot_r', fontsize=10, Nlevels=4, xlabel=None, ylabel=None, norm=None, range=None, normed=False, colorbar=True, contour=True, grid=True, **kargs)[source]

plots histogram of mean across replicates versus coefficient variation

Parameters:
  • bins (int) -- binning for the 2D histogram (either a float or list of 2 binning values).

  • cmap -- a valid colormap (defaults to hot_r)

  • fontsize -- fontsize for the labels

  • Nlevels (int) -- must be more than 2

  • xlabel (str) -- set the xlabel (overwrites content of the dataframe)

  • ylabel (str) -- set the ylabel (overwrites content of the dataframe)

  • norm -- set to 'log' to show the log10 of the values.

  • normed -- normalise the data

  • range -- as in pylab.Hist2D : a 2x2 shape [[-3,3],[-4,4]]

  • contour -- show some contours (default to True)

  • grid (bool) -- Show unerlying grid (defaults to True)

If the input is a dataframe, the xlabel and ylabel will be populated with the column names of the dataframe.

14.1.6. Image

Imshow utility

class Imshow(x, verbose=True)[source]

Wrapper around the matplotlib.imshow function

Very similar to matplotlib but set interpolation to None, and aspect to automatic and accepts input as a dataframe, in whic case x and y labels are set automatically.

import pandas as pd
data = dict([ (letter,np.random.randn(10)) for letter in 'ABCDEFGHIJK'])
df = pd.DataFrame(data)

from sequana.viz import Imshow
im = Imshow(df)
im.plot()

(Source code, png, hires.png, pdf)

_images/references_viz-8.png

constructor

Parameters:

x -- input dataframe (or numpy matrix/array). Must be squared.

plot(interpolation='None', aspect='auto', cmap='hot', tight_layout=True, colorbar=True, fontsize_x=None, fontsize_y=None, rotation_x=90, xticks_on=True, yticks_on=True, **kargs)[source]

wrapper around imshow to plot a dataframe

Parameters:
  • interpolation -- set to None

  • aspect -- set to 'auto'

  • cmap -- colormap to be used.

  • tight_layout --

  • colorbar -- add a colobar (default to True)

  • fontsize_x -- fontsize on xlabels

  • fontsize_y -- fontsize on ylabels

  • rotation_x -- rotate labels on xaxis

  • xticks_on -- switch off the xticks and labels

  • yticks_on -- switch off the yticks and labels

14.1.7. PCA

class PCA(data, colors={})[source]
from sequana.viz.pca import PCA
from sequana import sequana_data
import pandas as pd

data = sequana_data("test_pca.csv")
df = pd.read_csv(data)
df = df.set_index("Id")
p = PCA(df, colors={
    "A1": 'r', "A2": 'r', 'A3': 'r',
    "B1": 'b', "B2": 'b', 'B3': 'b'})
p.plot(n_components=2)

(Source code, png, hires.png, pdf)

_images/references_viz-9.png

From R, a PCA is selecting the first 500 features based on variance.

constructor

Parameters:
  • data -- a dataframe; Each column being a sample.

  • colors -- a mapping of column/sample name a color

plot(n_components=2, transform='log', switch_x=False, switch_y=False, switch_z=False, colors=None, max_features=500, show_plot=True, fontsize=10, adjust=True)[source]
Parameters:
  • n_components -- at number starting at 2 or a value below 1 e.g. 0.95 means select automatically the number of components to capture 95% of the variance

  • transform -- can be 'log' or 'anscombe', log is just log10. count with zeros, are set to 1

plot_pca_vs_max_features(step=100, n_components=2, transform='log')[source]
from sequana.viz.pca import PCA
from sequana import sequana_data
import pandas as pd

data = sequana_data("test_pca.csv")
df = pd.read_csv(data)
df = df.set_index("Id")

p = PCA(df)
p.plot_pca_vs_max_features()

(Source code, png, hires.png, pdf)

_images/references_viz-10.png

14.1.8. ScatterPlot

Scatter plots

author:

Thomas Cokelaer

class ScatterHist(x, y=None, verbose=True)[source]

Scatter plots and histograms

constructor

Parameters:
  • x -- if x is provided, it should be a dataframe with 2 columns. The first one will be used as your X data, and the second one as the Y data

  • y --

  • verbose --

plot(kargs_scatter={'c': 'b', 's': 20}, kargs_grids={}, kargs_histx={}, kargs_histy={}, scatter_position='bottom left', width=0.5, height=0.5, offset_x=0.1, offset_y=0.1, gap=0.06, facecolor='lightgrey', grid=True, show_labels=True, **kargs)[source]

Scatter plot of set of 2 vectors and their histograms.

Parameters:
  • x -- a dataframe or a numpy matrix (2 vectors) or a list of 2 items, which can be a mix of list or numpy array. if size and/or color are found in the columns dataframe, those columns will be used in the scatter plot. kargs_scatter keys c and s will then be ignored. If a list of lists, x will be the first row and y the second row.

  • y -- if x is a list or an array, then y must also be provided as a list or an array

  • kargs_scatter -- a dictionary with pairs of key/value accepted by matplotlib.scatter function. Examples is a list of colors or a list of sizes as shown in the examples below.

  • kargs_grid -- a dictionary with pairs of key/value accepted by the maplotlib.grid (applied on histogram and axis at the same time)

  • kargs_histx -- a dictionary with pairs of key/value accepted by the matplotlib.histogram

  • kargs_histy -- a dictionary with pairs of key/value accepted by the matplotlib.histogram

  • kargs -- other optional parameters are hold, facecolor.

  • scatter_position -- can be 'bottom right/bottom left/top left/top right'

  • width -- width of the scatter plot (value between 0 and 1)

  • height -- height of the scatter plot (value between 0 and 1)

  • offset_x --

  • offset_y --

  • gap -- gap between the scatter and histogram plots.

  • grid -- defaults to True

Returns:

the scatter, histogram1 and histogram2 axes.

import pylab
import pandas as pd
X = pylab.randn(1000)
Y = pylab.randn(1000)
df = pd.DataFrame({'X':X, 'Y':Y})

from sequana.viz import ScatterHist
ScatterHist(df).plot()

(Source code, png, hires.png, pdf)

_images/references_viz-11.png
from sequana.viz import ScatterHist
ScatterHist(x=[1,2,3,4], y=[3,5,6,4]).plot(
    kargs_scatter={
        's':[200,400,600,800],
        'c': ['red', 'green', 'blue', 'yellow'],
        'alpha':0.5},
    kargs_histx={'color': 'red'},
    kargs_histy={'color': 'green'})

(Source code, png, hires.png, pdf)

_images/references_viz-12.png

See also

notebook

14.1.9. Venn diagram

plot_venn(subsets, labels=None, title=None, ax=None, alpha=0.8, weighted=False, colors=('r', 'b', 'y'))[source]

Plot venn diagramm according to number of groups.

Parameters:

subsets -- This parameter may be (1) a dict, providing sizes of three diagram regions. The regions are identified via two-letter binary codes ('10', '01', and '11'), hence a valid set could look like: {'10': 10, '01': 20, '11': 40}. Unmentioned codes are considered to map to 0. (2) a list (or a tuple) with three numbers, denoting the sizes of the regions in the following order: (10, 01, 11) and (3) a list containing the subsets of values.

The subsets can be a list (or a tuple) containing two set objects. For instance:

from sequana.viz.venn import plot_venn
A = set([1,2,3,4,5,6,7,8,9])
B = set([            7,8,9,10,11])
plot_venn((A, B), labels=("A", "B"))

(Source code, png, hires.png, pdf)

_images/references_viz-13.png

This is the unweighted version by default meaning all circles have the same size. If you prefer to have circle scaled to the size of the sets, add the relevant parameter as follows:

from sequana.viz.venn import plot_venn
A = set([1,2,3,4,5,6,7,8,9])
B = set([            7,8,9,10,11])
plot_venn((A, B), labels=("A", "B"), weighted=True)

(Source code, png, hires.png, pdf)

_images/references_viz-14.png

Similarly for 3 sets, a Venn diagram can be represented as follows. Note here that we also use the title parameter:

from sequana.viz.venn import plot_venn

A = set([1,2,3,4,5,6,7,8,9])
B = set([      4,5,6,7,8,9,10,11,12,13])
C = set([   3,4,5,6,7,8,9])
plot_venn((A, B, C), labels=("A", "B", "C"), title="my Venn3 diagram")

(Source code, png, hires.png, pdf)

_images/references_viz-15.png

Input can be a list/tuple of 2 or 3 sets as described above.

14.1.10. Volcano plots

Volcano plot

class Volcano(data=None, log2fc_col='log2FoldChange', pvalues_col='padj', annot_col='', color='auto', pvalue_threshold=1.3010299956639813, log2fc_threshold=1)[source]

Volcano plot

In essence, just a scatter plot with annotations.

import numpy as np
fc = np.random.randn(1000)
pvalue = np.random.randn(1000)
import pandas as pd
data = pd.DataFrame([fc, pvalue])
data = data.T
data.columns = ['log2FoldChange', 'padj']

from sequana.viz import Volcano
v = Volcano(data)
v.plot()

(Source code, png, hires.png, pdf)

_images/references_viz-16.png

constructor

Parameters:
  • data (DataFrame) -- Pandas DataFrame with rnadiff results.

  • log2fc_col -- Name of the column with log2 Fold changes.

  • pvalues_col -- Name of the column with adjusted pvalues.

  • annot_col -- Name of the column with genes names for plot annotation.

  • color -- for color choice

  • pvalue_threshold -- Adjusted pvalue threshold to use for coloring/annotation.

  • log2fc_threshold -- Log2 Fold Change threshold to use for coloring/annotation.

annotate(**kwargs)[source]
plot(size=10, alpha=0.7, marker='o', fontsize=16, xlabel='fold change', logy=False, threshold_lines={'color': 'black', 'ls': '--', 'width': 0.5}, ylabel='p-value', add_broken_axes=False, broken_axes={'ylims': ((0, 10), (50, 100))})[source]
Parameters:
  • size -- size of the markers

  • alpha -- transparency of the marker

  • fontsize --

  • xlabel --

  • ylabel --

  • center -- If centering the x axis