11. References

11.1. Adapters

Utilities to manipulate adapters

Adapter removal can be performed by many different tools, such as CutAdapt, AlienTrimmer or Trimmomatic. Unfortunately, they tend to accept adapters in different input formats, from FASTA to plain-text files, and their outputs are also reported in different formats.

Tools to extract specific adapters from FASTA files would also be handy. For instance, you may have all your adapters in a single file.

In this module, we provide:

  • tools to manipulate adapters stored in FASTA format (AdapterReader);
  • tools to export FASTA files with adapter content into the other formats required by various adapter removal software;
  • a tool to extract adapters from a FASTA file given their identifier or sequence (FindAdaptersFromDesign).

Our convention is to store lists of adapters in FASTA format, which can be read using the AdapterReader:

from sequana import sequana_data, AdapterReader
filename = sequana_data("adapters_Nextera_fwd.fa")
ar = AdapterReader(filename)

Given a design file (see the sequana.expdesign module) and a name for the type of adapters, one can easily extract the subset of adapters relevant for a given sample. Currently, the following sets of adapters/designs are available:

  • Nextera single and double indexing
  • Rubicon single indexing
  • PCRFree single indexing
  • TruSeq

Note that TruSeq indices 17, 24 and 26 are missing. This is expected: these are “reserved” Illumina indices.

For instance given a design file that gives the mapping between samples and a set of Nextera adapters, one would use:

>>> from sequana import *
>>> filename = sequana_data("test_expdesign_hiseq.csv")
>>> design = ExpDesignAdapter(filename)
>>> fa = FindAdaptersFromDesign(design, "PCRFree")
>>> print(fa.sample_names[0])
>>> fa.get_adapters_from_sample("553-iH2-1")

See FindAdaptersFromDesign for details.

class Adapter(identifier, sequence=None, comment=None)[source]

Class to store one adapter

An adapter is just a sequence from a FASTA file. It contains an identifier, a sequence and possibly a comment.


The identifier provided must not contain the starting “>” character, which is added automatically when needed.

One can check whether an adapter is equal to another; note that only the sequence is checked for equality.

Some Sequana notation has been added to the identifier to ease retrieval of the index name and index sequence (the name: and seq: tags described below). Of course, the string given in the seq: tag (e.g., CGATGT) must also be found in the sequence itself.

ar = AdapterReader(sequana_data("adapters_PCRFree_fwd.fa"))
adapter = Adapter(ar[0])

R/W adapter’s comment


R/W adapter’s identifier


Read-only access to the index sequence


Read-only access to the index name


R/W adapter’s sequence

class AdapterReader(filename)[source]

Reader of FASTA file dedicated to adapters

A FASTA file is just a set of paired lines of this kind:

>Nextera_index_N501|name:N501|seq:ACGT optional comment

where the optional comment is separated from the identifier by a tabulation.

In the FASTA identifier, the first pipe separates the official name (left-hand side) from the name tag. The information in this example may be redundant, but the name will be used throughout the Sequana code to ensure reproducibility.


Sequences are all in upper case.


The universal adapter has no index, so it does not need any tags for the index name or sequence. However, it must be named Universal_Adapter.
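As an illustration of this identifier convention, a minimal parser (an independent sketch, not the sequana AdapterReader) could split the tags as follows:

```python
# Illustrative parser for the identifier convention above: the first
# pipe-separated field is the official name, the remaining fields are
# key:value tags such as name: and seq:.  The optional tab-separated
# comment is not handled in this sketch.
def parse_identifier(identifier):
    fields = identifier.lstrip(">").split("|")
    info = {"identifier": fields[0]}
    for field in fields[1:]:
        key, value = field.split(":", 1)
        info[key] = value
    return info

print(parse_identifier(">Nextera_index_N501|name:N501|seq:ACGT"))
# -> {'identifier': 'Nextera_index_N501', 'name': 'N501', 'seq': 'ACGT'}
```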

>>> from sequana import sequana_data, AdapterReader
>>> filename = sequana_data("adapters_Nextera_fwd.fa")
>>> ar = AdapterReader(filename)
>>> candidate = ar.get_adapter_by_index_name("S505")
>>> print(candidate[0])
>>> len(ar)


Identifiers are checked for uniqueness; if they are not unique, an error is raised.

Sources: Illumina document #1000000002694 v01
Sources: for NextFlex PCR-Free adapters (48 barcodes), see http://www.biooscientific.com/Portals/0/IEM/Bioo-Scientific-PCR-Free-Barcode-Indices-v1-1-15.pdf


Parameters:filename (str) – the input FASTA file

Return the adapter whose identifier matches the user text

Parameters:index_identifier – the unique index identifier to be found. If several sequences match, this is an error, meaning the FASTA file with all adapters is not correctly formatted.
Returns:the adapter that matches the index identifier (if any), otherwise None

Return the adapter for the index name provided

Can be used only if the identifier contains the name: tag. For instance, identifiers such as

>Nextera_index_N501|name:N501|seq:ACGT

are valid.


See get_adapter_by_index_name().


Return one or several adapters whose sequence contains a given subsequence

Parameters:subsequence (str) – a string (ACGT letters)
Returns:adapters (name and sequence, in FASTA format) whose sequence contains the user-provided subsequence

If the subsequence is short, more than one adapter may be returned. Besides, the subsequence is currently searched for without positional information.


Reverse all sequences inplace

>>> from sequana import sequana_data, AdapterReader
>>> filename = sequana_data("adapters_Nextera_fwd.fa")
>>> filename2 = sequana_data("adapters_Nextera_rev.fa")
>>> ar = AdapterReader(filename)
>>> ar2 = AdapterReader(filename2)
>>> ar.reverse()
>>> ar == ar2

Reverse-complement all sequences inplace

>>> from sequana import sequana_data, AdapterReader
>>> filename = sequana_data("adapters_Nextera_fwd.fa")
>>> filename = sequana_data("adapters_Nextera_revcomp.fa")
>>> ar = AdapterReader(filename)
>>> ar.reverse_complement()
>>> ar.to_fasta()
>>> ar == ar2

Return a dictionary with identifiers as keys and lists of comments and sequences as values


Save sequences into a FASTA file

class FindAdaptersFromDesign(design_filename, adapters)[source]

Extract adapter(s) corresponding to an experimental design file

Used by sequana main script to build the adapter files for multi-samples projects as input to various adapter removal software.


  • design_filename (str) – a CSV file that is compatible with our sequana.expdesign.ExpDesignAdapter
  • adapters – the type of adapters (PCRFree, Nextera, Rubicon, TruSeq, SMARTer, Small)

The adapter files are stored in Sequana and accessible with the sequana_data function. For instance, if adapters is set to Nextera, the file adapters_Nextera_fwd.fa is used to identify the adapters.


New adapter files can be added on request. Currently, Nextera and PCRFree are available; Rubicon and TruSeq will be added soon.


Return a dictionary with adapters corresponding to the sample name

Parameters:sample_name (str) – a valid sample name as found in the design file. One can check the content of the sample_names attribute.
Returns:a dictionary with the adapters in forward, reverse, reverse complement for index1 and index2 (if relevant).

Return basic info about the sample name (from the design file)


Return all sample names contained in the design file

save_adapters_to_fasta(sample_name, output_dir='.')[source]

Get index1, index2 and universal adapter


Parses output of AdapterRemoval software

>>> from sequana import adapters, sequana_data
>>> data = sequana_data("test_adapter_removal_output.txt", "testing")
>>> results = adapters.adapter_removal_parser(data)
>>> results["adapter1"]
adapters_to_clean_ngs(input_filename, output_filename='adapters_ngs.txt')[source]

Convert a FASTA formatted file into clean_ngs format

  • input_filename (str) – the input FASTA file
  • output_filename (str) – a TSV formatted file
fasta_fwd_rev_to_columns(file1, file2=None, output_filename=None)[source]

From two FASTA files of forward and reverse adapters, write a two-column file

This is useful for some tools related to adapter removal that takes as input this kind of format

  • file1 (str) – FASTA format
  • file2 (str) – FASTA format (optional)

The files must have a one-to-one mapping

get_sequana_adapters(type_, direction)[source]

Return path to a list of adapters in FASTA format

  • type_ – PCRFree, Rubicon, Nextera
  • direction – fwd, rev, revcomp

Returns: path to the adapter file

11.2. Assembly related

class BUSCO(filename='full_table_testbusco.tsv')[source]

Wrapper of the BUSCO output

“BUSCO provides a quantitative measures for the assessment of a genome assembly, gene set, transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.” – BUSCO website 2017

This class reads the full report generated by BUSCO and provides some visualisation of it. The information is stored in a dataframe df. The score can be retrieved via the score attribute, a percentage in the range 0-100.



Filename:a valid BUSCO input file (full table). See example in sequana code source (testing)
pie_plot(filename=None, hold=False)[source]

Plot a pie chart of the status (complete / fragmented / missing)

from sequana import BUSCO, sequana_data
b = BUSCO(sequana_data("test_busco_full_table.tsv"))


scatter_plot(filename=None, hold=False)[source]

Scatter plot of the score versus length of each ortholog

from sequana import BUSCO, sequana_data
b = BUSCO(sequana_data("test_busco_full_table.tsv"))



Return summary information about the missing, complete and fragmented orthologs

11.3. CIGAR tools

class Cigar(cigarstring)[source]
>>> from sequana.cigar import Cigar
>>> c = Cigar("2S30M1I")
>>> len(c)
>>> c.items()

>>> c = Cigar("1S1S1S1S")
>>> c.compress()
>>> c.cigarstring

Possible CIGAR types are:

  • “M” for alignment MATCH (0)
  • “I” for insertion to the reference (1)
  • “D” for deletion from the reference (2)
  • “N” for skipped region from the reference (3)
  • “S” for soft clipping (clipped sequence present in SEQ) (4)
  • “H” for hard clipping (clipped sequence NOT present in SEQ) (5)
  • “P” for padding (silent deletion from padded reference) (6)
  • “=” for sequence match (7)
  • “X” for sequence mismatch (8)
  • “B” for back (9)

Note: get_cigar_stats returns a list with 11 items, the last of which is I + S + X. In BWA, the mismatch count (X) is not provided directly and should be deduced from this last item minus I and S. The difference between M and = is that M denotes an alignment match (which may be a sequence match or a mismatch), whereas = denotes an exact sequence match.


The length of the query sequence based on the CIGAR is calculated by summing the M, I, S, = and X operations; other operations are ignored (source: https://stackoverflow.com/questions/39710796/infer-the-length-of-a-sequence-using-the-cigar/39812985#39812985).
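Following this rule, the query-length computation can be sketched in pure Python (an independent illustration, not the sequana implementation):

```python
import re

# The query length implied by a CIGAR string is the sum of the lengths
# of the M, I, S, = and X operations; every other operation (including
# invalid ones) is ignored.
def query_length(cigarstring):
    total = 0
    for count, op in re.findall(r"(\d+)([A-Za-z=])", cigarstring):
        if op in "MIS=X":  # operations that consume the query sequence
            total += int(count)
    return total

print(query_length("2S30M1I"))  # -> 33
print(query_length("1S100Y"))   # -> 1  (Y is not a valid type)
```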



Parameters:cigarstring (str) – the CIGAR string.


The validity of the input CIGAR string is not checked. If an unknown type is found, it is generally ignored. For instance, the length of 1S100Y is 1, since Y is not a valid type.


Return cigar types and their count


Note that repeated types are added:

>>> c = Cigar('1S2M1S')
>>> c.as_dict()

Decompose the cigar string into tuples keeping track of repeated types

>>> from sequana import Cigar
>>> c = Cigar("1S2M1S")
(('S', 1), ('M', 2), ('S', 1))

the CIGAR string attribute


1S1S should become 2S (in-place modification).

pattern = '(\\d+)([A-Za-z])?'

Return the number of occurrences of each type found in types

>>> c = Cigar("1S2M1S")
>>> c.stats()
[2, 0, 0, 0, 2, 0, 0, 0, 0, 0]
types = 'MIDNSHP=XB'

valid CIGAR types

11.4. BAMTOOLS related

Tools to manipulate BAM/SAM files

Alignment(alignment) Helper class to retrieve info about Alignment
BAM(filename[, mode]) BAM data structure, statistics and plotting
SAMFlags([value]) Utility to extract bits from a SAM flag


BAM being the compressed version of SAM, we do not implement any functionality related to SAM files. We strongly encourage developers to convert their SAM files to BAM.

class BAM(filename, mode='rb', *args)[source]

BAM data structure, statistics and plotting


Python 2.7 and 3.5 behave differently and we recommend the Python 3.5 version. For instance, to_fastq() works only with Python 3.5.

We provide a test file in Sequana:

>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data("test.bam"))
>>> len(b)


Once you loop over this data structure, you must call reset() to force the next iterator to start at position 0 again. The methods implemented in this data structure take care of that for you thanks to a decorator called seek. If you want to use the next() function, call reset() to make sure you start at the beginning.



Create a JSON file with information related to the BAM file.

This includes some metrics (see get_stats(); e.g., MAPQ), combinations of flags, SAM flags, and counters about the read length.


Same as in sequana.fastq.FastQC


Return flags of all reads as a list


Returns flags as a dataframe

>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data('test.bam'))
>>> df = b.get_flags_as_df()
>>> df.sum()
1       1000
2        484
4          2
8          2
16       499
32       500
64       477
128      523
256       64
512        0
1024       0
2048       0
dtype: int64

See also

SAMFlags for meaning of each flag


Return a dictionary with full stats about the BAM file

The index of the dataframe contains the flags. The column contains the counts.

>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data("test.bam"))
>>> df = b.get_full_stats_as_df()
>>> df.query("description=='average quality'")


Uses samtools behind the scenes


Return GC content for all reads (mapped or not)


Return counter of all fragment lengths


Return dataframe with read length for each read



Return dataframe with mapq for each read


Count flags/mapq/read length in one pass.


Return the reads’ names


Count how many reads carry each SAM flag.


Return basic stats about the reads

Returns:dictionary with basic stats:
  • total_reads : number of reads
  • mapped_reads : number of mapped reads
  • unmapped_reads : number of unmapped reads
  • mapped_proper_pair : R1 and R2 mapped face to face
  • hard_clipped_reads : number of reads with supplementary alignment
  • reads_duplicated : number of duplicated reads


works only for BAM files. Use get_full_stats_as_df() for SAM files.

from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))



Return True if the BAM is sorted


Return an iterator on the reads that are mapped


Return an iterator on the reads that are unmapped

plot_acgt_content(stacked=False, fontsize=16, include_N=True)[source]

Plot ACGT content

from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))


plot_bar_flags(logy=True, fontsize=16, filename=None)[source]

Plot a histogram of the flags contained in the BAM

from sequana import BAM, sequana_data
b = BAM(sequana_data('test.bam', "testing"))



See also

SAMFlags for meaning of each flag

plot_bar_mapq(fontsize=16, filename=None)[source]

Plots bar plots of the MAPQ (quality) of alignments

from sequana import BAM, sequana_data
b = BAM(sequana_data('test.bam', "testing"))



Please use GenomeCov for more sophisticated tools to plot the genome coverage

from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))


plot_gc_content(fontsize=16, ec='k', bins=100)[source]

plot GC content histogram

Parameters:
  • bins – a value for the number of bins or an array (with a copy() method)
  • ec – add black contour on the bars
from sequana import BAM, sequana_data
b = BAM(sequana_data('test.bam'))



Plot indel count (+ ratio)

Returns:list of insertions, deletions and the insertion/deletion ratio for different lengths starting at 1
from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))



What you see on this figure is the presence of 10 insertions of length 1, 1 insertion of length 2, and 3 deletions of length 1.

Note that in samtools, several insertions or deletions within a single alignment are ignored and only the first one seems to be reported. For instance, 10M1I10M1I stores only one insertion in its report; the same applies to deletions.


Plot occurrences of aligned read lengths

from sequana import sequana_data, BAM
b = BAM(sequana_data("test.bam"))



Export the BAM to FastQ format


to be tested


comments from original reads ?

Method 1 (bedtools):

bedtools bamtofastq -i JB409847.bam  -fq test1.fastq

Method 2 (samtools):

samtools bam2fq JB409847.bam > test2.fastq

Method 3 (sequana):

from sequana import BAM
# output filename argument assumed here; see to_fastq()
BAM("JB409847.bam").to_fastq("test3.fastq")

Note that the samtools method removes duplicated reads so the output is not identical to method 1 or 3.

class Alignment(alignment)[source]

Helper class to retrieve info about Alignment

Takes an alignment as read by BAM and provides a simplified version of pysam.Alignment class.

>>> from sequana.bamtools import Alignment
>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data("test.bam"))
>>> segment = next(b)
>>> align = Alignment(segment)
>>> align.as_dict()
>>> align.FLAG

The original data is stored in hidden attribute _data and the following values are available as attributes or dictionary:

  • QNAME: the query template name. Reads/segments having the same QNAME come from the same template. A QNAME set to * indicates that the information is unavailable. In a SAM file, a read may occupy multiple alignment lines
  • FLAG: combination of bitwise flags. See SAMFlags
  • RNAME: reference sequence
  • POS
  • MAPQ: mapping quality if the segment is mapped; equals −10 log10 Pr{mapping position is wrong}
  • CIGAR: See sequana.cigar.Cigar
  • RNEXT: reference sequence name of the primary alignment of the NEXT read in the template
  • PNEXT: position of primary alignment
  • TLEN: signed observed template length
  • SEQ: segment sequence
  • QUAL: ascii of base quality


Parameters:alignment – alignment instance from BAM
class SAMFlags(value=4095)[source]

Utility to extract bits from a SAM flag

>>> from sequana import SAMFlags
>>> sf = SAMFlags(257)
>>> sf.get_flags()
[1, 256]

You can also print the bits and their description:

bit Meaning/description
1 template having multiple segments in sequencing
2 each segment properly aligned according to the aligner
4 segment unmapped
8 next segment in the template unmapped
16 SEQ being reverse complemented
32 SEQ of the next segment in the template being reverse complemented
64 the first segment in the template
128 the last segment in the template
256 secondary alignment
512 not passing filters, such as platform/vendor quality controls
1024 PCR or optical duplicate
2048 supplementary alignment
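The flag decomposition shown above can be sketched in a few lines of pure Python (an illustration, not the sequana code):

```python
# Decompose a SAM flag into its individual bits: a flag is a sum of
# powers of two, each carrying one of the meanings in the table above.
def get_flags(value):
    return [1 << i for i in range(12) if value & (1 << i)]

print(get_flags(257))  # -> [1, 256] (paired read, secondary alignment)
print(get_flags(4))    # -> [4]      (segment unmapped)
```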

Return the individual bits included in the flag


Return all description sorted by bit

11.5. BEDTOOLS related (coverage)

Utilities for the genome coverage

class GenomeCov(input_filename, genbank_file=None, low_threshold=-4, high_threshold=4, ldtr=0.5, hdtr=0.5, force=False, chunksize=5000000, quiet_progress=False, chromosome_list=[])[source]

Create a list of dataframe to hold data from a BED file generated with samtools depth.

This class can be used to plot the coverage resulting from a mapping, which is stored in BED format. The BED file may contain several chromosomes; they are handled independently and are accessible as a list of ChromosomeCov instances.


from sequana import GenomeCov, sequana_data

filename = sequana_data('JB409847.bed')
reference = sequana_data("JB409847.fasta")

gencov = GenomeCov(filename)

# you can change the thresholds:
gencov.thresholds.low = -4
gencov.thresholds.high = 4

gencov = GenomeCov(filename)
for chrom in gencov:
    chrom.running_median(n=3001, circular=True)



Results are stored in a list of ChromosomeCov named chr_list. For Prokaryotes and small genomes, this API is convenient but takes lots of memory for larger genomes.

Computational time information (scanning 24,000,000 rows):

  • constructor (scanning 40,000,000 rows): 45 s
  • selecting a contig of 24,000,000 rows: 1 min 20 s
  • running median: 16 s
  • computing the zscore: 9 s
  • c.get_rois()


  • input_filename (str) – the input data with results of a bedtools genomecov run. This is just a 3-column file. The first column is a string (chromosome), the second column is the base position and the third is the coverage.
  • genbank_file (str) – annotation file of your reference.
  • low_threshold (float) – threshold used to identify under-covered genomic regions of interest (ROIs). Must be negative.
  • high_threshold (float) – threshold used to identify over-covered genomic regions of interest (ROIs). Must be positive.
  • ldtr (float) – fraction of the low_threshold used to define the intermediate threshold in the double-threshold method. Must be between 0 and 1.
  • hdtr (float) – fraction of the high_threshold used to define the intermediate threshold in the double-threshold method. Must be between 0 and 1.
  • chunksize – size of the segments to analyse. If a chromosome is larger than the chunk size, it is split into N chunks. The segments are analysed independently and the ROIs and summary joined together. Note that the GC and plotting functionalities will only plot the first chunk.
  • force – some constraints are set in the code to prevent unwanted memory issues with specific data sets or parameters. Currently, by default, you cannot set the threshold below 2.5 (considered as noise).
  • chromosome_list – list of chromosome names to consider. This is useful for very large input data files (hundreds of millions of lines) where each chromosome can be analysed one by one. Used by the sequana_coverage standalone. The only advantage is to speed up the construction; it could also be used by the Snakemake implementation.

Get the circularity of chromosome(s). It must be a boolean.

compute_gc_content(fasta_file, window_size=101, circular=False, letters=['G', 'C', 'c', 'g'])[source]

Compute GC content of genome sequence.

  • fasta_file (str) – fasta file name.
  • window_size (int) – size of the sliding window.
  • circular (bool) – if the genome is circular (like bacteria chromosome)
Store the results in the ChromosomeCov.df attribute (dataframe) with a column named gc.
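A simplified sketch of such a sliding-window GC computation (ignoring the circular option and returning one value per full window, whereas sequana stores one value per genome position) might look like:

```python
# GC fraction over a sliding window: for each window of window_size
# consecutive bases, count the G and C bases and divide by the window size.
def gc_content(sequence, window_size):
    seq = sequence.upper()
    return [
        sum(base in "GC" for base in seq[i:i + window_size]) / window_size
        for i in range(len(seq) - window_size + 1)
    ]

print(gc_content("GCGCATAT", 4))  # -> [1.0, 0.75, 0.5, 0.25, 0.0]
```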

Get the features dictionary of the genbank.


Get or set the window size to compute the GC content.


Get or set the genbank filename to annotate ROI detected with ChromosomeCov.get_roi(). Changing the genbank filename will configure the GenomeCov.feature_dict.


Return basic statistics for each chromosome

Returns:dictionary with chromosome names as keys and statistics as values.

See also



used in sequana_summary standalone

hist(logx=True, logy=True, fignum=1, N=25, lw=2, **kwargs)[source]
to_csv(output_filename, **kwargs)[source]

Write all data in a csv.

  • output_filename (str) – csv output file name.
  • kwargs (dict) – parameters passed to pandas.DataFrame.to_csv().


Get or set the window size used to compute the running median. The size must be an integer.

class ChromosomeCov(genomecov, chrom_name, thresholds=None, chunksize=5000000)[source]

Factory to manipulate coverage and extract regions of interest.


from sequana import GenomeCov, sequana_data
filename = sequana_data("virus.bed")

gencov = GenomeCov(filename)

chrcov = gencov[0]

df = chrcov.get_rois().get_high_rois()



The df variable contains a dataframe with the high regions of interest (over-covered regions).

If the data is large, the input data set is split into chunks. See chunksize, which is 5,000,000 by default.

If your data is larger than a chunk, then you should use the run() method.

See also

sequana_coverage standalone application


  • df – dataframe with position for a chromosome used within GenomeCov. Must contain the following columns: [“pos”, “cov”]
  • genomecov
  • chrom_name – to save space, no need to store the chrom name in the dataframe.
  • thresholds – a data structure DoubleThresholds that holds the double threshold values.
  • chunksize – if the data is large, it is split and analysed by chunks. In such situations, you should use the run() method instead of calling the running_median and compute_zscore functions.

breadth of coverage


The coefficient of variation (CV) is defined as sigma / mu


depth of coverage


standard deviation of depth of coverage
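These three quantities (depth of coverage, breadth of coverage and coefficient of variation) can be illustrated on a per-base coverage list (an independent sketch, not the ChromosomeCov implementation):

```python
import statistics

# Depth of coverage (DOC) is the mean per-base coverage; breadth of
# coverage (BOC) is the fraction of positions covered at least once;
# the coefficient of variation (CV) is sigma / mu.
def coverage_metrics(cov):
    mu = statistics.mean(cov)
    sigma = statistics.pstdev(cov)
    return {
        "DOC": mu,                                  # depth of coverage
        "BOC": sum(c > 0 for c in cov) / len(cov),  # breadth of coverage
        "CV": sigma / mu,                           # coefficient of variation
    }

m = coverage_metrics([0, 2, 3, 0, 5])
print(m["DOC"], m["BOC"])
```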

compute_zscore(k=2, use_em=True, clip=4, verbose=True)[source]

Compute zscore of coverage and normalized coverage.

  • k (int) – number of Gaussians predicted in the mixture (default = 2)
  • clip (float) – ignore values above the clip threshold

Store the results in the df attribute (dataframe) with a column named zscore.


Needs to call running_median() beforehand.


Return the evenness of the coverage

Reference:Konrad Oexle, Journal of Human Genetics 2016, Evaluation of the evenness score in NGS.

Works before or after normalisation, but leads to different results.


Proportion of central (normal) genome coverage

This is 1 - (number of non normal data) / (total length)


depends on the thresholds attribute being used.


depends slightly on W the running median window


Return the correlation between the coverage and GC content

The GC content is the one computed in GenomeCov.compute_gc_content() (default window size is 101)

get_max_gc_correlation(reference, guess=100)[source]

Plot correlation between coverage and GC content by varying the GC window

The GC content uses a moving window of size W. This parameter affects the correlation between coverage and GC. This function finds the optimal window length.


Keep positions with zscore outside of the thresholds range.

Returns:a dataframe from FilteredGenomeCov


depends on the thresholds low and high values.


Return basic stats about the coverage data

only “cov” column is required.

get_summary(C3=None, C4=None, stats=None)[source]
moving_average(n, circular=False)[source]

Compute moving average of the genome coverage

  • n – window’s size. Must be odd
  • circular (bool) – is the chromosome circular or not

Store the results in the df attribute (dataframe) with a column named ma.

plot_coverage(filename=None, fontsize=16, rm_lw=1, rm_color='#0099cc', rm_label='Running median', th_lw=1, th_color='r', th_ls='--', main_color='k', main_lw=1, main_kwargs={}, sample=True, set_ylimits=True, x1=None, x2=None)[source]

Plot coverage as a function of base position.

  • filename
  • rm_lw – line width of the running median
  • rm_color – line color of the running median
  • rm_label – label for the running median
  • th_lw – line width of the thresholds
  • th_color – line color of the thresholds
  • main_color – line color of the coverage
  • main_lw – line width of the coverage
  • sample – if there are more than 1 000 000 points, we use an integer step to skip data points. We can still plot all points at your own risk by setting this option to False
  • set_ylimits – we want to focus on the “normal” coverage, ignoring unusual excesses. To do so, we set the y-axis range between 0 and a maximum value, set to the minimum of (6 times the mean coverage) and (1.5 times the maximum of the high-threshold curve). If you want to leave the y-limits free, set this argument to False
  • x1 – restrict the lower x value to x1
  • x2 – restrict the upper x value to x2 (x2 must be greater than x1)


If there are more than 1,000,000 points, only 1,000,000 are shown, using an integer step to skip points (for instance, every fifth point when there are 5,000,000 points).

In addition to the coverage, the running median and coverage confidence corresponding to the lower and upper zscore thresholds are shown.


uses the thresholds attribute.

plot_gc_vs_coverage(filename=None, bins=None, Nlevels=None, fontsize=20, norm='log', ymin=0, ymax=100, contour=True, cmap='BrBG', **kwargs)[source]

Plot histogram 2D of the GC content versus coverage

plot_hist_coverage(logx=True, logy=True, fontsize=16, N=25, fignum=1, hold=False, alpha=0.8, ec='k', filename=None, zorder=10, **kw_hist)[source]
  • N
  • ec
plot_hist_normalized_coverage(filename=None, binwidth=0.05, max_z=3)[source]

Barplot of the normalized coverage with gaussian fitting

plot_hist_zscore(fontsize=16, filename=None, max_z=6, binwidth=0.5, **hist_kargs)[source]

Barplot of the zscore values

plot_rois(x1, x2, set_ylimits=False, rois=None, fontsize=16)[source]
run(W, k=2, circular=False, binning=None, cnv_delta=None)[source]
running_median(n, circular=False)[source]

Compute running median of genome coverage

  • n (int) – window’s size.
  • circular (bool) – if a mapping is circular (e.g. bacteria whole genome sequencing), set to True

Store the results in the df attribute (dataframe) with a column named rm.

Changed in version 0.1.21: Use Pandas rolling function to speed up computation.
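A naive running median over an odd window size n can be sketched as follows; this illustration leaves the edges untouched and omits the circular option (sequana uses a much faster pandas rolling implementation):

```python
import statistics

# Running median: replace each position by the median of the window of
# n values centred on it; the first and last n//2 positions are kept as-is.
def running_median(data, n):
    half = n // 2
    result = list(data)
    for i in range(half, len(data) - half):
        result[i] = statistics.median(data[i - half:i + half + 1])
    return result

print(running_median([1, 2, 100, 2, 1], 3))  # -> [1, 2, 2, 2, 1]
```

Note how the spike of 100 is smoothed away, which is exactly why the running median is used as a baseline for the zscore computation.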

to_csv(filename=None, start=None, stop=None, **kwargs)[source]

Write CSV file of the dataframe.

  • filename (str) – csv output filename. If None, return string.
  • start (int) – start row index.
  • stop (int) – stop row index.

Params of pandas.DataFrame.to_csv():

  • columns (list) – columns you want to write.
  • header (bool) – determine if the header is written.
  • index (bool) – determine if the index is written.
  • float_format (str) – determine the float format.
class DoubleThresholds(low=-3, high=3, ldtr=0.5, hdtr=0.5)[source]

Simple structure to handle the double threshold for negative and positive sides

Used by GenomeCov and related classes.

dt = DoubleThresholds(-3,4,0.5,0.5)

This means the low threshold is -3 while the high threshold is 4. The two following values must be between 0 and 1; they define the secondary (double) thresholds as a fraction of the main thresholds (here, half their values).

Internally, the main thresholds are stored in the low and high attributes. The secondary thresholds are derived from the main thresholds and the two ratios. The ratios are named ldtr and hdtr, for low double-threshold ratio and high double-threshold ratio. The secondary thresholds, denoted low2 and high2, are updated automatically if low, high, ldtr or hdtr are changed.
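A hedged sketch of this behaviour, assuming the secondary thresholds are simply the product of each main threshold and its ratio (low2 = low * ldtr, high2 = high * hdtr), which is an assumption, not the verified sequana formula:

```python
# Sketch of the DoubleThresholds logic: the main thresholds low/high and
# the ratios ldtr/hdtr are stored, and the secondary thresholds low2/high2
# are derived on the fly so they stay up to date when any input changes.
class DoubleThresholds:
    def __init__(self, low=-3, high=3, ldtr=0.5, hdtr=0.5):
        assert low < 0 and high > 0
        assert 0 <= ldtr <= 1 and 0 <= hdtr <= 1
        self.low, self.high = low, high
        self.ldtr, self.hdtr = ldtr, hdtr

    @property
    def low2(self):
        # secondary low threshold, recomputed from low and ldtr
        return self.low * self.ldtr

    @property
    def high2(self):
        # secondary high threshold, recomputed from high and hdtr
        return self.high * self.hdtr

dt = DoubleThresholds(-3, 4, 0.5, 0.5)
print(dt.low2, dt.high2)  # -> -1.5 2.0
```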


11.6. Coverage (theoretical)

class Coverage(N=None, L=None, G=None, a=None)[source]

Utilities related to Lander and Waterman theory

We denote G the genome length in nucleotides and L the read length in nucleotides. These two numbers are in principle well defined since G is defined by biology and L by the sequencing machine.

The total number of reads sequenced during an experiment is denoted N. Therefore the total number of nucleotides is simply NL.

The depth of coverage (DOC) at a given nucleotide position is the number of times that a nucleotide is covered by a mapped read.

The theoretical fold-coverage is defined as :

a = NL/G

that is the average number of times each nucleotide is expected to be sequenced (in the whole genome). The fold-coverage is often denoted aX (e.g., 50X).

In the Coverage class, G and N are fixed at the beginning. Then, if one changes a, then N is updated and vice-versa so that the relation a=NL/G is always true:

>>> cover = Coverage(G=1000000, L=100)
>>> cover.N = 100000    # number of reads
>>> cover.a             # What is the mean coverage
>>> cover.a = 50
>>> cover.N
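The coupling between a and N described above can be sketched with two linked properties (an illustration of the behaviour, not the actual sequana.Coverage class):

```python
# a = N*L/G is kept true at all times: a is computed from N on read, and
# setting a recomputes N, so the two attributes can never disagree.
class Coverage:
    def __init__(self, G, L):
        self.G, self.L = G, L   # genome length, read length
        self._N = 0             # number of reads

    @property
    def N(self):
        return self._N

    @N.setter
    def N(self, value):
        self._N = value

    @property
    def a(self):
        return self._N * self.L / self.G

    @a.setter
    def a(self, value):
        self._N = value * self.G / self.L

cover = Coverage(G=1_000_000, L=100)
cover.N = 100_000
print(cover.a)  # -> 10.0
cover.a = 50
print(cover.N)  # -> 500000.0
```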

From the aforementioned equation, and assuming the reads are uniformly distributed, we can answer a few interesting questions using probabilities.

In each chromosome, a read of length L could start at any position (except the last L-1 positions). So in a genome G with n_c chromosomes, there are G - n_c (L-1) possible starting positions. In general G >> n_c (L-1), so the probability that one of the N reads starts at any specific nucleotide is N/G.

The probability that a read (of length L) covers a given position is L/G. The probability of not covering that location is 1-L/G. For N fragments, we obtain the probability \left(1-L/G\right)^N. So, the probability of covering a given location with at least one read is :

P = 1 - \left(1- \frac{L}{G}\right)^N

Since in general, N>>1, we have:

P = 1 - e^{-NL/G}
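As a quick check of this approximation, the exact and exponential forms of P give nearly identical values for realistic parameters (a standalone sketch using the symbols above, independent of the Coverage class):

```python
import math

# Probability that a given position is covered by at least one read,
# assuming uniformly distributed reads (Lander-Waterman).
G = 1_000_000   # genome length
L = 100         # read length
N = 100_000     # number of reads, so a = NL/G = 10

P_exact = 1 - (1 - L / G) ** N        # P = 1 - (1 - L/G)^N
P_approx = 1 - math.exp(-N * L / G)   # P = 1 - e^(-NL/G)

print(P_exact, P_approx)  # both are ~0.99995 at 10X
```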

From this equation, we can derive the fold-coverage required to have e.g., E=99\% of the genome covered:

a = log(-1/(E-1))

equivalent to

a = -log(1-E)

The method get_required_coverage() uses this equation. However, for numerical reasons, one should not provide E as an argument but 1-E. See get_required_coverage().

Other information can also be derived using the methods get_mean_number_contigs() and get_mean_contig_length().
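These quantities follow directly from the formulas listed below for the expected number of contigs, reads per contig, and percent covered; a minimal sketch (not the Coverage class itself) reproducing them:

```python
import math

G, L, a = 1_000_000, 100, 8   # genome size, read length, fold coverage
N = a * G / L                 # number of reads, from a = NL/G

mean_number_contigs = N * math.exp(-a)        # (aG/L) e^-a
mean_reads_per_contig = math.exp(a)           # N / (N e^-a) = e^a
percent_covered = 100 * (1 - math.exp(-a))    # 100 (1 - e^-a)

print(round(mean_number_contigs, 1))   # ~26.8 contigs expected at 8X
print(round(percent_covered, 2))       # ~99.97% of the genome covered
```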

See also

get_table() that provides a summary of all these quantities for a range of coverage.


genome length


length of the reads


number of reads defined as aG/L


coverage defined as NL/G


Expected length of the contigs



Expected number of contigs

A binomial distribution with parameters N and p

(aG/L) e^{-a}


Expected number of reads per contig

Number of reads divided by expected number of contigs:

N / (N e^{-a}) = e^{a}


Return percent of the genome covered

100 (1 - e^{-a})


Return the required coverage to ensure the genome is covered

A general question is: what should the coverage be to make sure that e.g. E=99% of the genome is covered by at least one read?

The answer is:

a = -log(1-E)

This equation is correct but has a limitation due to floating-point precision. If one provides E=0.99, the answer is 4.6, but we are limited to a maximum coverage of about 36 when one provides E=0.9999999999999999, after which E is rounded to 1 on most computers. Besides, it is not convenient to enter all those digits. A scientific notation would be better but requires working with M=1-E instead of E:

a = -log(M)

So instead of asking what fold coverage is required to have 99% of the genome covered, we ask what fold coverage is required to have 1% of the genome not covered. This allows us to use M values as low as 1e-300, that is a fold coverage as high as 690.

Parameters:M (float) – this is the fraction of the genome not covered by any reads (e.g. 0.01 for 1%). See note above.
Returns:the required fold coverage
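A sketch of this computation (just the formula a = -log(M), which is assumed to be what get_required_coverage() evaluates numerically):

```python
import math

def required_coverage(M):
    """Fold coverage so that only a fraction M of the genome
    remains uncovered (M = 1 - E)."""
    return -math.log(M)

print(required_coverage(0.01))    # E = 99% covered -> about 4.6X
print(required_coverage(1e-300))  # still numerically stable -> about 690X
```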



Return a summary (dictionary) for the current fold coverage a


Return a summary dataframe for a set of fold coverage

Parameters:coverage (list) – if None, coverage list starts at 0.5 and ends at 10 with a step of 0.5

11.7. Access to online database (e.g. ENA)

Utilities to access to online FASTA, taxon, lineage …

class ENADownload[source]

Downloader to retrieve genome fasta files from ENA amongst other things

In order to facilitate the download of FASTA files (e.g. to build a Kraken DB), this class can be used to download a bunch of FASTA files, or just one given its accession.

Pre-defined lists are available from ENA. We refer to them as virus, plasmid, phage, archaealvirus, archaea, bacteria, organelle, viroid. In addition we have predefined lists within Sequana. For now, there is one named macaca_fascicularis.


the headers of the FASTA files are changed to add the GI number instead of the EMBL accession.


download_accession(acc, output='Custom')[source]

Download a specific FASTA file given its ENA accession number


Organisms (May 2016)


this download method is the longest to complete. It took about 20 minutes on a good connection.

download_fasta(filelist, output_dir=None, from_ena=True)[source]

Download a FASTA (or list of)

Parameters:filelist – a name to find on the ENA web server OR the name of an accession number.


The output file is named after the accession without the .X suffix. If there are several variants (.1, .2), the latter will be used. This should not happen if the list is properly defined.


Download all standard lists of accession numbers from ENA


Kraken will only accept the GI from NCBI so we need to convert the ENA accession to GI numbers

class EUtilsTools[source]

Simple wrapper around EUtils to fetch basic information about an accession number

>>> from sequana.databases import EUtilsTools
>>> et = EUtilsTools()
>>> et.accession_to_info("K01711.1")
{'K01711.1': {'accession': '331784',
  'comment': 'Measles virus (strain Edmonston), complete genome',
  'gi': '331784',
  'identifier': 'gi|331784|gb|K01711.1|MEANPCG[331784]',
  'taxid': '11234'}}

An accession, or a list of them, returns a list of dictionaries.

11.8. Experimental design

Module to handle experimental design files (adapters)

Sequencers or experimentalists create so-called design files to store information about the sequencing experiment. For example the name of the samples, the sample well, and the index sequences.

The format used to store the design information may vary from one sequencing platform to another. The design file can be used for many different purposes. Currently, we only use them to retrieve information about adapters.

Since there are different formats possible, we decided on a minimal and common set of information. For now, this is a CSV file with the following minimal header:

Sample_ID, Index1_Seq

or, for double-indexing:

Sample_ID, Index1_Seq, Index2_Seq
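For illustration only, such a minimal design file can be parsed with the standard library alone; the sample names and index sequences below are made up and this sketch is independent of the ExpDesign classes:

```python
import csv
import io

# A hypothetical double-indexing design file content
design_csv = """Sample_ID,Index1_Seq,Index2_Seq
sample_A,ACGTACGT,TTGCAATG
sample_B,GGCCTTAA,CAGTCAGT
"""

reader = csv.DictReader(io.StringIO(design_csv))
design = {row["Sample_ID"]: (row["Index1_Seq"], row["Index2_Seq"])
          for row in reader}
print(design["sample_A"])  # ('ACGTACGT', 'TTGCAATG')
```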

Users should only use the ExpDesignAdapter class, which understands the different formats. Currently, design files created by Illumina MiSeq machines and by HiSeq demultiplexing software are understood (see the specialised classes below).

class ExpDesignAdapter(filename, verbose=True)[source]

Generic Experimental design class

This class is used to store the mapping between sample ID and adapter used.

The design information is stored as a dataframe in the attribute df.

The input format is not unique. There are currently 3 different possible inputs, as defined in the specialised classes below.

The dataframe index is the list of sample identifiers (Sample_ID). The columns contain at least the following:

Index1_Seq, Index1_ID, Index2_Seq, Index2_ID


from sequana import *
filename = sequana_data('test_expdesign_hiseq.csv')
eda = ExpDesignAdapter(filename)


class ExpDesignMiSeq(filename)[source]

Dedicated experimental design format from Illumina MiSeq sequencers

This MiSeq design format has the following format:





In the Data section, a CSV file is to be found with the following header:


The index column may be prefixed. For instance, NFXX where XX is the index, in which case the NF prefix should be dropped.

If double-indexing, the header is:

filename = sequana_data("test_expdesign_miseq_illumina_1.csv")
ff = ExpDesignMiSeq(filename)
class ExpDesignHiSeq(filename)[source]

Dedicated experimental design format created by demultiplexing software.

This format is used by demultiplexing software used locally at the Biomics platform. The format of the header is:

FCID,Lane,SampleID,SampleRef,Index Seq, Description,Control,Recipe,Operator,Project

This is a format that may change in the future.

The SampleID column is converted into Sample_ID. Note that “Index Seq” may be empty, filled with one index sequence, or with two index sequences separated by a “-” sign.

Note also that FCID stands for flowcell ID.

class ExpDesignBase(filename)[source]

The Base class for all ExpDesignAdapter classes

The input filename must be a CSV file with at least the following column in the header:

Sample_ID

Derived classes must define at least Index1_Seq and possibly Index2_Seq.

Examples of specialised classes are ExpDesignMiSeq, ExpDesignHiSeq.


Check the presence of the Sample_ID column


Read a CSV file

11.9. FASTQ module

Utilities to manipulate FASTQ and Reads

class Identifier(identifier, version='unknown')[source]

Class to interpret Read’s identifier


Implemented for Illumina 1.8+ and 1.4. Other cases will simply store the identifier without interpretation

>>> from sequana import Identifier
>>> ident = Identifier('@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG')
>>> ident.info['x_coordinate']

Currently, the following identifiers will be recognised automatically:


An example is:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Others that could be implemented are NCBI identifiers, e.g.:

@FSRRS4401BE7HA [length=395] [gc=36.46] [flows=800] [phred_min=0] [phred_max=40] [trimmed_length=95]

Information can also be found here http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

class FastQ(filename, verbose=False)[source]

Class to handle FastQ files

Some of the methods are based on pysam but a few are also original to sequana. In general, input can be zipped or not, and output can be zipped or not (based on the extension).

An example is the extract_head() method:

f = FastQ("input_file.fastq.gz")
f.extract_head(100000, output='test.fastq')
f.extract_head(100000, output='test.fastq.gz')

equivalent to:

zcat myreads.fastq.gz | head -100000 | gzip > test100k.fastq.gz

An efficient implementation to count the number of lines is also available:

f.count_lines()

or reads (assuming 4 lines per read):

f.count_reads()

Operators available:

  • equality ==

Return number of lines


Return count_lines divided by 4
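The relation between lines and reads is easy to reproduce outside the FastQ class: each read spans four lines (identifier, sequence, “+”, qualities). A minimal sketch handling both plain and gzipped files:

```python
import gzip

def count_reads(filename):
    """Count reads in a FastQ file (gzipped or not), assuming 4 lines per read."""
    # choose the opener based on the extension, as the FastQ class does
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4
```

For large files, reading line by line like this keeps memory usage constant.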

extract_head(N, output_filename)[source]

Extract the head of a FastQ file

  • N (int) –
  • output_filename (str) – Based on the extension the output file is zipped or not (.gz extension only)

This function is convenient since it takes into account the input file being compressed or not, and the output file being compressed or not. It is in general 2-3 times faster than the equivalent unix commands combined together, but is 10 times slower in the case of uncompressed input and uncompressed output.


this function extracts the first N lines and does not check whether there are empty lines in your FastQ/FastA files.

filter(identifiers_list=[], min_bp=None, max_bp=None, progressbar=True, output_filename='filtered.fastq', remove=True)[source]

Filter reads

  • min_bp (int) – ignore reads with length shorter than min_bp
  • max_bp (int) – ignore reads with length above max_bp
joining(pattern, output_filename)[source]

not implemented

zcat Block*.fastq.gz | gzip > combined.fastq.gz


return number of lines (should be 4 times number of reads)


return number of reads


Allows iterating from the beginning without reopening the file or creating a new instance.

select_random_reads(N=None, output_filename='random.fastq')[source]

Select random reads and save in a file

  • N (int) – number of random unique reads to select. You should provide a number, but a list can be used as well: you can select random reads for R1 and re-use the returned list as input for R2 (since pairs must be kept)
  • output_filename (str) –

If you have a pair of files, the same reads must be selected in R1 and R2.:

f1 = FastQ(file1)
selection = f1.select_random_reads(N=1000)
f2 = FastQ(file2)
f2.select_random_reads(N=selection)
Not implemented

split_lines(N=100000, gzip=True)[source]

Not implemented


Return a Series with kmer count across all reads

Parameters:k (int) – (default to 7-mers)
Returns:Pandas Series with index as kmer and values as count.

Takes about 30 seconds on a million reads.

to_krona(k=7, output_filename='fastq.krona')[source]

Save Krona file with ACGT content within all k-mers

Parameters:k (int) – (default to 7-mers)

Save results in file, which can then be translated into a HTML file using:

ktImportText fastq.krona 
open text.krona.html
class FastQC(filename, max_sample=500000, dotile=False, verbose=True)[source]

Simple QC diagnostic

Similarly to some of the plots of the FastQC tool, we scan the FastQ file and generate some diagnostic plots. The interest is that we will be able to create more advanced plots later on.

Here is an example of the boxplot quality across all bases:

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)




some plots will work for Illumina reads only right now


Although all reads are parsed (e.g. to count the number of nucleotides), some information uses a limited number of reads (e.g. qualities), which is set to 500,000 by default.


  • filename
  • max_sample (int) – Large files will not fit in memory. We therefore restrict the number of reads to be used for some of the statistics to 500,000. This also reduces the amount of time required to get a good feeling of the data quality. The entire input file is parsed though. This is required for instance to get the number of nucleotides.
boxplot_quality(hold=False, ax=None)[source]

Boxplot quality

Same plot as in FastQC, that is the average quality for all bases. In addition, a 1-sigma error envelope is shown (yellow).

Background colours separate zones of good, average and bad quality (arbitrary).


Plot histogram of GC content

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)



Histogram of the sequence coordinates on the plate

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)


in this data set all points have the same coordinates.


Histogram of sequence lengths

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)




from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
from pylab import tight_layout; tight_layout()

Plot histogram of GC content

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)



11.10. FASTA module

Utilities to manipulate FASTA files

class FastA(filename, verbose=False)[source]

Class to handle FastA files. Cannot be compressed

format_contigs_denovo(output_file, len_min=500)[source]

Replace NODE with the project name and remove contigs with a length lower than len_min.

  • output_file (str) – output file name.
  • len_min (int) – minimal length of contigs.


from sequana import FastA

contigs = FastA("denovo_assembly.fasta")
contigs.format_contigs_denovo("path/to/file.fasta", len_min=500)

Results are stored in “path/to/file.fasta”.


11.11. Sequence module

class DNA(sequence, codons_stop=['TAA', 'TGA', 'TAG'], codons_stop_rev=['TTA', 'TCA', 'CTA'], codons_start=['ATG'], codons_start_rev=['CAT'])[source]

Simple DNA class

>>> d = DNA("ACGTTTT")
>>> d.complement
>>> d.reverse_complement

Some long computations are done when setting the window size:

d.window = 100
barplot_count_ORF_CDS_by_frame(alpha=0.5, bins=40, xlabel='Frame', ylabel='#', bar_width=0.35)[source]
hist_ORF_CDS_linearscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]
hist_ORF_CDS_logscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]
plot_all_skews(figsize=(10, 12), fontsize=16, alpha=0.5)[source]
class RNA(sequence)[source]

Simple RNA class

>>> d = RNA("ACGUUUU")
>>> d.complement
>>> d.reverse_complement
class Repeats(filename_fasta, merge=False, name=None)[source]

Class for finding repeats in DNA or RNA linear sequences.

Computation is performed each time the threshold is set to a new value.

from sequana import sequana_data, Repeats
rr = Repeats(sequana_data("measles.fa"))
rr.threshold = 4




Works with shustring package from Bioconda (April 2017)


use a specific sequence (right now it is the first one)


Input must be a fasta file with valid DNA or RNA characters

  • filename_fasta (str) – a Fasta file, only the first sequence is used !
  • threshold (int) – Minimal length of repeat to output
  • name (str) – if name is provided, scan the Fasta file and select the corresponding sequence. If you want to analyse all sequences, you need to loop over them, setting _header for each sequence with the name found in its header.

Return a dataframe with the shortest unique substring length at each position. Shortest unique substrings are unique in the sequence and its complement. Uses the shustring tool.


Get the first line of the FASTA file (needed as input for shustring) and replace spaces by underscores

hist_length_repeats(bins=20, alpha=0.5, hold=False, fontsize=12, grid=True, title='Repeat length', xlabel='Repeat length', ylabel='#')[source]

Plots histogram of the repeat lengths

class Sequence(sequence, complement_in=b'ACGT', complement_out=b'TGCA', letters='ACGT')[source]

Abstract base class for other specialised sequences such as DNA.

Sequence is the base class for other classes such as DNA and RNA.

from sequana import Sequence
s = Sequence("ACGT")


You may use a Fasta file as input (see constructor)


A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

  • sequence (str) – May be a plain string or the name of a FastA file, in which case only the first sequence is used.
  • complement_in
  • complement_out
  • letters – authorised letters. Used in check() only.


use counter only once as a property


Check that all letters are valid


Alias to get_complement()


Return mean GC content


Return complement

get_occurences(pattern, overlap=False)[source]

Return position of the input pattern in the sequence

>>> from sequana import Sequence
>>> s = Sequence('ACGTTTTACGT')
>>> s.get_occurences("ACGT")
[0, 7]

Return reverse sequence


Return reverse complement


Alias to get_reverse()


Alias to get_reverse_complement


Return basic stats about the number of letters

11.12. Kmer module

build_kmer(length=6, letters='CG')[source]

Return the list of kmers of a given length based on a set of letters

Returns:list of kmers
get_kmer(sequence, k=7)[source]

Given a sequence, return consecutive kmers

Returns:iterator of kmers
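A rough re-implementation sketch of both helpers (the names mirror the API above, but the bodies here are assumptions, not the sequana source):

```python
from itertools import product

def build_kmer(length=6, letters="CG"):
    """All kmers of the given length over the given alphabet."""
    return ["".join(p) for p in product(letters, repeat=length)]

def get_kmer(sequence, k=7):
    """Yield consecutive (overlapping) kmers of the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

print(build_kmer(length=2, letters="CG"))  # ['CC', 'CG', 'GC', 'GG']
print(list(get_kmer("ACGTACGT", k=7)))     # ['ACGTACG', 'CGTACGT']
```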

11.13. IOTools module

class YamlDocParser(filename)[source]

A simple parser to extract block content to be found in YAML files

To create tooltips automatically in Sequanix (the GUI for Snakemake workflows), one can comment YAML configuration files with block comments (see the developers guide in Developer guide).

Once read and parsed, all block comments before top-level sections are to be found in the dictionary sections.

from sequana import snaketools
from sequana.iotools import YamlDocParser
module = snaketools.Module('quality_control')
r = YamlDocParser(module.config)

Those lines are removed from the docstring but available as a dictionary


Parameters:filename (str) – the YAML file to parse
# main documentation

# block comment
section1:
    - item

# block comment
section2:

section3:
    # a comment


Here, section1 and section2 have block comments but not section3

11.14. Taxonomy related (Kraken - Krona)

class KrakenResults(filename='kraken.out')[source]

Translate Kraken results into a Krona-compatible file

If you run a kraken analysis with KrakenAnalysis, you will end up with a file e.g. named kraken.out (by default).

You could use kraken-translate but then you need extra parsing to convert into a Krona-compatible file. Here, we take the output from kraken and directly transform it to a krona-compatible file.

k = KrakenResults("kraken.out")

The expected format looks like:

C    HISEQ:426:C5T65ACXX:5:2301:18719:16377    1    203    1:71 A:31 1:71
C    HISEQ:426:C5T65ACXX:5:2301:21238:16397    1    202    1:71 A:31 1:71

Where each row corresponds to one read.

"562:13 561:4 A:31 0:1 562:3" would indicate that:

the first 13 k-mers mapped to taxonomy ID #562
the next 4 k-mers mapped to taxonomy ID #561
the next 31 k-mers contained an ambiguous nucleotide
the next k-mer was not in the database
the last 3 k-mers mapped to taxonomy ID #562

See kraken documentation for details.
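The per-read mapping string can be tallied with a few lines (a sketch assuming the space-separated taxid:count format shown above, where “A” marks ambiguous k-mers and “0” unclassified ones):

```python
from collections import Counter

def parse_lca_string(mapping):
    """Aggregate k-mer counts per taxonomy ID from one Kraken LCA column."""
    counts = Counter()
    for pair in mapping.split():
        taxid, n = pair.split(":")
        counts[taxid] += int(n)
    return counts

counts = parse_lca_string("562:13 561:4 A:31 0:1 562:3")
print(counts["562"])  # 16 k-mers assigned to taxid 562 (13 + 3)
```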


a taxon ID of 1 (root) means that the read is classified but that its k-mers matched different domains. See https://github.com/DerrickWood/kraken/issues/100


This takes care of fetching taxons and the corresponding lineages from online web services.


Parameters:filename – the input from KrakenAnalysis class

Show distribution of the read length grouped by classified or not


Retrieve taxons given a list of taxons

Parameters:ids (list) – list of taxons as strings or integers. Could also be a single string or a single integer
Returns:a dataframe


the first call loads all taxons in memory and takes a few seconds, but subsequent calls are much faster

kraken_to_csv(filename, dbname)[source]
kraken_to_json(filename, dbname)[source]
kraken_to_krona(output_filename=None, mode=None, nofile=False)[source]
Returns:status: True if everything went fine, False otherwise
plot(kind='pie', cmap='copper', threshold=1, radius=0.9, textcolor='red', **kargs)[source]

A simple non-interactive plot of taxons

Returns:None if no taxon was found, and a dataframe otherwise

A Krona Javascript output is also available in kraken_to_krona()

from sequana import KrakenResults, sequana_data
test_file = sequana_data("test_kraken.out", "testing")
k = KrakenResults(test_file)
df = k.plot(kind='pie')



See also

to generate the data see KrakenPipeline or the standalone application sequana_taxonomy.

to_js(output='krona.html', onweb=False)[source]
class KrakenPipeline(fastq, database, threads=4, output_directory='kraken', dbname=None)[source]

Used by the standalone application sequana_taxonomy

This runs Kraken on a set of FastQ files, transforms the results into a format compatible with Krona, and creates a Krona HTML report.

from sequana import KrakenPipeline
kt = KrakenPipeline(["R1.fastq.gz", "R2.fastq.gz"], database="krakendb")


We do not provide a Kraken database within sequana. You may either download a database from https://ccb.jhu.edu/software/kraken/ or use this class to download a toy example that will be stored in e.g. .config/sequana under Unix platforms. See KrakenDownload.

See also

We provide a standalone application of this class, which is called sequana_taxonomy and can be used within a command shell.


  • fastq – either a fastq filename or a list of 2 fastq filenames
  • database – the path to a valid Kraken database
  • threads – number of threads to be used by Kraken
  • output_directory – output filename of the Krona HTML page
  • dbname

Description: internally, once Kraken has performed an analysis, reads are associated to a taxon (or not). We then find the corresponding lineage and scientific names to be stored within a Krona-formatted file. ktImportText is then used to create the Krona page.

run(output_filename_classified=None, output_filename_unclassified=None, only_classified_output=False)[source]

Run the analysis using Kraken and create the Krona output


reuse the KrakenResults code to simplify this method.


Opens the filename defined in the constructor

class KrakenAnalysis(fastq, database, threads=4)[source]

Run kraken on a set of FastQ files

In order to run a Kraken analysis, we first need a local database. We provide a toy example. The ToyDB is downloadable as follows (you will need to run the following code only once):

from sequana import KrakenDownload
kd = KrakenDownload()

See also

KrakenDownload for more database and sequana.kraken_builder.KrakenBuilder to build your own databases

The path to the database is required to run the analysis. It has been stored in the directory .config/sequana/kraken_toydb under Linux platforms. The following code should be platform independent:

import os
from sequana import sequana_config_path
database = sequana_config_path + os.sep + "kraken_toydb"

Finally, we can run the analysis on the toy data set:

from sequana import sequana_data
data = sequana_data("Hm2_GTGAAA_L005_R1_001.fastq.gz", "data")
ka = KrakenAnalysis(data, database=database)
ka.run()

This creates a file named kraken.out. It can be interpreted with KrakenResults


  • fastq – either a fastq filename or a list of 2 fastq filenames
  • database – the path to a valid Kraken database
  • threads – number of threads to be used by Kraken
  • output – output filename of the Krona HTML page
run(output_filename=None, output_filename_classified=None, output_filename_unclassified=None, only_classified_output=False)[source]

Performs the kraken analysis

  • output_filename (str) – if not provided, a temporary file is used and stored in kraken_output.
  • output_filename_classified (str) – not compressed
  • output_filename_unclassified (str) – not compressed
class KrakenDownload[source]

Utility to download Kraken DB and place them in a local directory

from sequana import KrakenDownload
kd = KrakenDownload()

A large database (8Gb) is available on synapse and has the following DOI:


It can be downloaded manually or if you have a Synapse login (https://www.synapse.org), you can use:

from sequana import KrakenDownload
kd = KrakenDownload()
download(name, verbose=True)[source]
dv = <easydev.tools.DevTools object>
class KrakenHierarchical(filename_fastq, fof_databases, threads=1, output_directory='./kraken_hierarchical/', keep_temp_files=False, force=False)[source]

Kraken Hierarchical Analysis

This runs Kraken on a FastQ file with multiple k-mer databases in a hierarchical way. Unclassified sequences with the first database are input for the second, and so on.

The input may be a single FastQ file or paired, gzipped or not. FastA are also accepted.


  • filename_fastq – FastQ file to analyse
  • fof_databases – file that contains a list of database paths (one per line). The order is important. Note that you may also provide a list of database paths.
  • threads – number of threads to be used by Kraken
  • output_directory – name of the output directory
  • keep_temp_files – bool, if True, will keep intermediate files from each Kraken analysis, and save html report at each step
  • force (bool) – if the output directory already exists, the instantiation fails so that the existing data is not overwritten. If you wish to overwrite the existing directory, set this parameter to True.
run(dbname='multiple', output_prefix='kraken_final')[source]

Run the hierarchical analysis

This method does not return anything but creates a set of files:

  • kraken_final.out
  • krona_final.html
  • kraken.png (pie plot of the classified/unclassified reads)


the databases are run in the order provided in the constructor.

class KronaMerger(filename)[source]

Utility to merge two Krona files

Imagine those two files (formatted for Krona; first column is a counter):

14011   Bacteria    Proteobacteria  species1
591 Bacteria    Proteobacteria  species4

184 Bacteria    Proteobacteria  species3
132 Bacteria    Proteobacteria  species2
32  Bacteria    Proteobacteria  species1

You can merge the two files. The first and last lines correspond to the same taxon (species1) so we should end up with a new Krona file with 4 lines only.

The test files are available within Sequana as test_krona_k1.tsv and test_krona_k2.tsv:

from sequana import KronaMerger, sequana_data
k1 = KronaMerger(sequana_data("test_krona_k1.tsv"))
k2 = KronaMerger(sequana_data("test_krona_k2.tsv"))
k1 += k2
# Save the results. Note that it must be tabulated for Krona external usage


separators must be tabulations


Parameters:filename (str) –

Save the content into a new file in TSV format

class KrakenBuilder(dbname)[source]

This class will help you building a custom Kraken database

You will need a few steps and, depending on the FASTA files you want to include, lots of resources (memory and disk space). In the following example, we will be reasonable and use only virus FASTA files.

First, we need to create the data structure directory. Let us call it virusdb:

from sequana import KrakenBuilder
kb = KrakenBuilder("virusdb")

We then need to download a large taxonomic database from NCBI. You may already have a local copy, in which case you would need to copy it in virusdb/taxonomy directory. If not, type:


The virusdb/taxonomy directory will contain about 8.5G of data.

Note that this currently requires the unix tools wget and tar.

Then, we need to add some fasta files. You may download specific FASTA files if you know the accession numbers using download_accession(). However, we also provide a method to download all viruses from ENA:


This will take a while to download the more than 4500 FASTA files (10 minutes on a good connection). You will end up with a data set of about 100 Mb of FASTA files.

If you wish to download other FASTA (e.g. all bacteria), you will need to use another class from the sequana.databases:

from sequana.databases import ENADownload
ena = ENADownload()
ena.download_fasta("bacteria.txt", output_dir="virusdb/library/added")

Please see the documentation for more options and list of species to download.

It is now time to build the DB itself. This is based on the kraken tool. You may do it yourself in a shell:

kraken-build --rebuild -db virusdb --minimizer-len 10 --max-db-size 4 --threads 4 \
    --kmer-len 26 --jellyfish-hash-size 500000000

Or use the KrakenBuilder class. First you need to look at the params attribute. The most important keys/values that affect the size of the DB are:

kb.params['kmer_length']   (max value is 31)
kb.params['max_db_size']   (the max size of the DB files in Gb)

To create a small DB quickly, we set those values:

kb.params['kmer_length']  = 26
kb.params['minimizer_len'] = 10

However, for production, we would recommend 31 and 13 (default)

This takes about 2 minutes to build and the final DB is about 800Mb.

Lots of useless files are left in the directory and can be removed using kraken itself. However, we do a little bit more and therefore have our own cleaning function:


Kraken-build uses jellyfish. The hash_size parameter is the jellyfish hash_size parameter. If you set it to 6400M, the memory required is about 6.9 bytes times 6400M, that is about 40Gb of memory. The default value used here means 3.5Gb are required.

The size to store the DB itself should be:

sD + 8 (4^M)

where s is about 12 bytes (used to store a kmer/taxon pair), D is the number of kmers in the final database (which cannot be estimated beforehand), and M is the minimiser length parameter.
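The estimate can be evaluated for typical parameters; the value of D below is made up since it cannot be known beforehand:

```python
def kraken_db_size_bytes(D, M, s=12):
    """Approximate Kraken DB storage: s*D + 8*4^M bytes."""
    return s * D + 8 * 4 ** M

# e.g. 50 million distinct kmer/taxon pairs, minimiser length 10
size = kraken_db_size_bytes(D=50_000_000, M=10)
print(round(size / 1e9, 2))  # ~0.61 Gb
```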


Parameters:dbname (str) – Create the Kraken DB in this directory

Once called, you will not be able to append more FASTA files


Download a specific Fasta from ENA given its accession number

Note that if you want to add a specific FASTA from ENA, you must use that function to make sure the header will be understood by Kraken; the header must use a GI number (not an ENA accession)


Download kraken data

The downloaded file is large (1.3Gb) and the unzipped file is about 9Gb.

If already present, do not download the file except if the force parameter is set to True.

get_taxons_from_gis(gis, filename='gi_taxid_nucl.dmp')[source]
run(dbs=[], download_taxon=True)[source]

Create the Custom Kraken DB

  1. download taxonomy files
  2. Load the DBs (e.g. viruses)
  3. Build DB with kraken-build
  4. Clean it up

11.15. Pacbio module

Pacbio QC and stats

class BAMPacbio(filename, testing=0)[source]

BAM reader for Pacbio (reads)

You can read a file as follows:

from sequana.pacbio import BAMPacbio
from sequana import sequana_data
filename = sequana_data("test_pacbio_subreads.bam")
b = BAMPacbio(filename)

A summary of the data is stored in the attribute df. It contains information such as the length of the reads, the ACGT content, the GC content.

Several plotting methods are available. For instance, hist_snr().

The BAM file used to store the Pacbio reads follows the BAM/SAM specification. Note that a sequence read is termed a query: a subsequence of an entire Pacbio ZMW read (a subread), which is the basecalls from a single pass of the insert DNA molecule.

In general, only a subsequence of the query will align to the reference genome, and that subsequence is referred to as the aligned query.

When introspecting the aligned BAM file, the extent of the query in ZMW read is denoted as [qStart, qEnd) and the extent of the aligned subinterval as [aStart, aEnd). The following graphic illustrates these intervals:

    qStart                          qEnd
0   |   aStart               aEnd   |
[---|---[--------------------)------|---------)   < "ZMW read" coord. system
    ~~~~---------------------~~~~~~~              < query; "-" = aligning subseq.
[-------[--------------------)----------------)   < "ref." / "target" coord. system
0       tStart               tEnd

In the BAM files, the qStart, qEnd are contained in the qs and qe tags, (and reflected in the QNAME); the bounds of the aligned query in the ZMW read can be determined by adjusting qs and qe by the number of soft-clipped bases at the ends of the alignment (as found in the CIGAR).

See also the comments in the code for other tags.



  • filename (str) – filename of the input pacbio BAM file. The content of the BAM file is not the output of a mapper. Instead, it is the output of a Pacbio (Sequel) sequencing (e.g., subreads).
  • testing (int) – for testing, you can set the number of subreads to read (0 means read all subreads)
filter_length(output_filename, threshold_min=0, threshold_max=inf)[source]

Select and Write reads within a given range

  • output_filename (str) – name of output file
  • threshold_min (int) – minimum length of the reads to keep
  • threshold_max (int) – maximum length of the reads to keep
filter_mapq(output_filename, threshold_min=0, threshold_max=255)[source]

Select and write reads within a given mapping quality range

  • output_filename (str) – name of output file
  • threshold_min (int) – minimum mapping quality of the reads to keep
  • threshold_max (int) – maximum mapping quality of the reads to keep
hist_GC(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='GC %', ylabel='#', label='', title=None)

Plot histogram GC content

  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) – fontsize of the x and y labels and title.
  • grid (bool) – add grid or not
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))

(Source code, png, hires.png, pdf)

hist_ZMW_subreads(alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='Number of ZMW passes', logy=True, ylabel='#', label='', title='Number of ZMW passes')[source]

Plot histogram of number of reads per ZMW (number of passes)

  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • logy (bool) – use log scale on the y axis (default to True)
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))

(Source code, png, hires.png, pdf)

hist_concordance(method, bins=100, fontsize=16)[source]
formula: 1 - (ins + del + mismatch) / (ins + del + mismatch + match)

Note that the get_cigar_stats results differ between BWA and BLASR: BWA stores no X operations, whereas Pacbio forbids the use of the M (CMATCH) tag and uses the X (CDIFF) and = (CEQUAL) characters instead.

Subread Accuracy: The post-mapping accuracy of the basecalls. Formula: [1 - (errors/subread length)], where errors = number of deletions + insertions + substitutions.
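The accuracy formula above can be written out in plain Python. This is an illustrative sketch, not the library's implementation:

```python
def subread_accuracy(matches, insertions, deletions, substitutions):
    """Post-mapping accuracy: 1 - errors / subread length."""
    errors = insertions + deletions + substitutions
    subread_length = matches + errors
    return 1 - errors / subread_length

# 90 matches and 10 errors over a 100-base subread
subread_accuracy(90, 4, 3, 3)  # 0.9
```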

hist_len(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='Read Length', ylabel='#', label='', title=None)

Plot histogram Read length

  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))

(Source code, png, hires.png, pdf)

hist_mean_ccs(bins=1000, **kwargs)[source]

Group subreads by ZMW and plot mean of read length for each polymerase

hist_median_ccs(bins=1000, **kwargs)[source]

Group subreads by ZMW and plot median of read length for each polymerase

hist_snr(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='SNR', ylabel='#', title='')[source]

Plot histogram of the ACGT SNRs for all reads

  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))

(Source code, png, hires.png, pdf)


number of passes (ZMW)

plot_GC_read_len(hold=False, fontsize=12, bins=[60, 60], grid=True, xlabel='GC %', ylabel='#', cmap='BrBG')

Plot GC content versus read length

  • hold (bool) –
  • fontsize (int) – for x and y labels and title
  • bins – an integer or a tuple of 2 integers to specify the binning of the 2D histogram along x and y.
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.plot_GC_read_len(bins=[10, 10])

(Source code, png, hires.png, pdf)

random_selection(output_filename, nreads=None, expected_coverage=None, reference_length=None)[source]

Select random reads

  • nreads – number of reads to select randomly. Must be less than the number of reads available in the original file.
  • expected_coverage
  • reference_length

If expected_coverage and reference_length are provided, nreads is computed automatically.
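When expected_coverage and reference_length are given, the number of reads can be derived from the mean read length. A hypothetical helper illustrating the relation (the function name is ours, not part of sequana):

```python
def nreads_for_coverage(expected_coverage, reference_length, mean_read_length):
    # choose nreads so that nreads * mean_read_length
    # ~ expected_coverage * reference_length
    return int(expected_coverage * reference_length / mean_read_length)

nreads_for_coverage(10, 1_000_000, 10_000)  # 1000 reads for 10X over 1 Mb
```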


return basic stats about the read length

stride(output_filename, stride=10, shift=0, random=False)[source]

Write a subset of reads to BAM output

  • output_filename (str) – name of output file
  • stride (int) – optional; output one read out of every stride reads
  • shift (int) – number of reads to skip at the beginning of the input file
  • random (bool) – if True, at each step the read to output is randomly selected
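The selection logic can be sketched on read indices (an illustration only; the actual method writes BAM records):

```python
import random

def stride_indices(n_reads, stride=10, shift=0, rand=False):
    """Indices of the reads that would be written to the output."""
    if rand:
        # within each window of `stride` reads, pick one at random
        return [i + random.randrange(stride)
                for i in range(shift, n_reads - stride + 1, stride)]
    return list(range(shift, n_reads, stride))

stride_indices(25, stride=10)           # [0, 10, 20]
stride_indices(25, stride=10, shift=5)  # [5, 15]
```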
to_fasta(output_filename, threads=2)

Export BAM reads into a Fasta file

  • output_filename – name of the output file (use .fasta extension)
  • threads (int) – number of threads to use


this executes a shell command based on samtools


this takes a few minutes for 500,000 reads

to_fastq(output_filename, threads=2)

Export BAM reads into FastQ file

class PBSim(input_bam, simul_bam)[source]

Filter an input BAM (simulated with pbsim) so as to keep reads that fit a target distribution.

This uses a Metropolis-Hastings (MH) algorithm behind the scenes.

ss = pacbio.PBSim("test10X.bam")
ss.run(bins=100, step=50)
run(bins=50, xmin=0, xmax=30000, step=1000, burn=1000, alpha=1, output_filename=None)[source]

The target distribution

Compute histogram. Get X, Y. Given xprime, interpolate to get yprime use e.g. np.interp

class BAMSimul(filename)[source]

BAM reader for Pacbio simulated reads (PBsim)

A summary of the data is stored in the attribute df. It contains information such as the length of the reads, the ACGT content, the GC content.


Parameters:filename (str) – filename of the input pacbio BAM file. The content of the BAM file is not the output of a mapper. Instead, it is the output of a Pacbio (Sequel) sequencing (e.g., subreads).
filter_bool(output_filename, mask)[source]

Select and Write reads using a mask

  • output_filename (str) – name of output file
  • list_bool (list) – True to write read to output, False to ignore it
filter_length(output_filename, threshold_min=0, threshold_max=inf)[source]

Select and Write reads within a given range

  • output_filename (str) – name of output file
  • threshold_min (int) – minimum length of the reads to keep
  • threshold_max (int) – maximum length of the reads to keep
hist_GC(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='GC %', ylabel='#', label='', title=None)

Plot histogram GC content

  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) – fontsize of the x and y labels and title.
  • grid (bool) – add grid or not
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))

(Source code, png, hires.png, pdf)

hist_len(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='Read Length', ylabel='#', label='', title=None)

Plot histogram Read length

  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))

(Source code, png, hires.png, pdf)

plot_GC_read_len(hold=False, fontsize=12, bins=[60, 60], grid=True, xlabel='GC %', ylabel='#', cmap='BrBG')

Plot GC content versus read length

  • hold (bool) –
  • fontsize (int) – for x and y labels and title
  • bins – a integer or tuple of 2 integers to specify the binning of the x and y 2D histogram.
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.plot_GC_read_len(bins=[10, 10])

(Source code, png, hires.png, pdf)

to_fasta(output_filename, threads=2)

Export BAM reads into a Fasta file

  • output_filename – name of the output file (use .fasta extension)
  • threads (int) – number of threads to use


this executes a shell command based on samtools


this takes a few minutes for 500,000 reads

to_fastq(output_filename, threads=2)

Export BAM reads into FastQ file

11.16. Phred quality

Manipulate phred quality of reads

FastQ qualities are stored as characters. The phred scale indicates the range of characters.

In general, characters go from ! to ~, that is from 33 to 126 in the ASCII table. This convention starts at 33 because characters before ! may cause trouble (e.g., white spaces). This is the Sanger scale. Two other scales exist, ranging from 59 to 126 (Solexa) and from 64 to 126 (Illumina 1.3+).

So, here are the offset to use:

Name            Offset   Numeric range
Sanger          33       0 to 93
Solexa          64       -5 to 62
Illumina 1.3+   64       0 to 62
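Converting a FastQ character to a numeric quality is a plain offset subtraction. A minimal illustration (the function name is ours, not part of the sequana API):

```python
def char_to_quality(character, offset=33):
    # Sanger: '!' (ASCII 33) encodes quality 0, '~' (ASCII 126) encodes 93
    return ord(character) - offset

char_to_quality("!")             # 0 on the Sanger scale
char_to_quality("@", offset=64)  # 0 on the Illumina 1.3+ scale
```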

Even though dedicated tools would have better performances, we provide a set of convenient functions. An example is provided here below to plot the quality corresponding to a character string extracted from a FastQ read.

In this example, we use the Quality class, where the default offset is 33 (Sanger), and compare the quality obtained with another offset:

from sequana.phred import Quality
q = Quality("ABCDEFGH")  # a quality string extracted from a FastQ read
q.plot()
q.offset = 64
q.plot()

(Source code, png, hires.png, pdf)

class Quality(seq, offset=33)[source]

Phred quality

>>> from sequana.phred import Quality
>>> q = Quality("ABCDEFGH")  # a quality string from a FastQ read
>>> q.plot()

(Source code, png, hires.png, pdf)


You can access to the quality as a list using the quality attribute and the mean quality from the mean_quality attribute.


return mean quality


plot quality versus base position


phred string into quality list


A value between 0 and 93

Parameters:pe – the probability of error.
Returns:Q is the quality score.
  • a high probability of error (0.99) gives Q = 0
  • a low probability of error (0.05) gives Q = 13
  • a low probability of error (0.01) gives Q = 20

Quality to probability (Sanger)
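Both conversions follow the standard Phred relation Q = -10 * log10(pe). A minimal sketch (function names are ours):

```python
import math

def proba_to_quality(pe):
    # quality score from a probability of error (Sanger scale)
    return -10 * math.log10(pe)

def quality_to_proba(quality):
    # inverse transform
    return 10 ** (-quality / 10)

round(proba_to_quality(0.01))  # 20
round(proba_to_quality(0.05))  # 13
```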

11.17. Running median

Data analysis tool

RunningMedian(data, width[, container]) Running median (fast)
class RunningMedian(data, width, container=<class 'list'>)[source]

Running median (fast)

This is an efficient running median implementation, faster than the SciPy implementation (v0.17) and a skip-list method.

The main idea comes from a recipe posted on this website: http://code.activestate.com/recipes/576930/#c3, which uses a simple list as proposed in https://gist.github.com/f0k/2f8402e4dfb6974bfcf1, and was adapted to our needs, including an object-oriented implementation.


a circular running median is implemented in sequana.bedtools.GenomeCov

from sequana.running_median import RunningMedian
rm = RunningMedian(data, 101)
results = rm.run()


the first W/2 and last W/2 positions should be ignored since they do not use W values. In this implementation, the last W/2 values are currently set to zero.
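The sorted-window idea behind the recipe can be sketched as follows; this is an illustration, not the sequana implementation, and it assumes an odd window width:

```python
from bisect import insort, bisect_left

def running_median_sketch(data, width):
    # maintain a sorted window: remove the oldest value, insert the newest,
    # and read the median in the middle of the sorted list
    window = sorted(data[:width])
    medians = [window[width // 2]]
    for i in range(width, len(data)):
        del window[bisect_left(window, data[i - width])]
        insort(window, data[i])
        medians.append(window[width // 2])
    return medians

running_median_sketch([1, 2, 3, 100, 5, 6, 7], 3)  # [2, 3, 5, 6, 6]
```

Note that this sketch returns len(data) - width + 1 values, i.e., it drops the edge positions discussed in the note above.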

This shows how the results agree with scipy

from pylab import *
import scipy.signal
from sequana.running_median import RunningMedian

x = randn(100)
plot(x, 'k')
plot(RunningMedian(x,9).run(), 'r', lw=4)
plot(scipy.signal.medfilt(x, 9), 'go')

(Source code, png, hires.png, pdf)

from sequana.running_median import RunningMedian
from pylab import *

N = 1000
X = linspace(0, N-1, N)

# Create some interesting data with SNP and longer over
# represented section.
data = 20 + randn(N) + sin(X*2*pi/1000.*5)
data[300:350] += 10
data[500:505] += 100
data[700] = 1000

plot(X, data, "k", label="data")
rm = RunningMedian(data, 201)
plot(X, rm.run(), 'r', label="median W=201")

from sequana.stats import moving_average as ma
plot(X[100:-100], ma(data, 201), 'g', label="mean W=201")
ylim([10, 50])

(Source code, png, hires.png, pdf)


Note that for visualisation, we set the y-limits to 50, but the data at position 500 goes up to 120 and there is a large outlier (1000) at position 700.

We see that the median is less sensitive to outliers, as expected. The median is highly interesting for large outliers of short duration (e.g., here, the peak at position 500) and is also less biased by larger regions.


The beginning and end of the running median are irrelevant. They are actually equal to the data in our implementation.


Using blist instead of list is not always faster; it depends on the width of the window. list and blist are equivalent for W below 20,000 (list being slightly faster). For larger W, however, blist has O(log(n)) complexity while list has O(n) complexity.


  • data – your data vector
  • width – running window length
  • container – a container (defaults to list). Could be a B-tree blist from the blist package but is 30% slower than a pure list for W < 20,000

Complexities: scipy in O(n), list in O(sqrt(n)), blist in O(log(n)).

running_median(data, width, container=<class 'list'>)[source]

11.18. Snakemake module

Set of tools to manipulate Snakefile and config files

Here is an overview (see details here below)

sequana.snaketools.DOTParser Utility to manipulate the dot file returned by Snakemake
sequana.snaketools.FastQFactory FastQ Factory tool
sequana.snaketools.FileFactory Factory to handle a set of files
sequana.snaketools.Module Data structure that holds metadata about a Module
sequana.snaketools.ModuleFinderSingleton Data structure to hold the Module names
sequana.snaketools.PipelineManager Utility to manage easily the snakemake pipeline
sequana.snaketools.SnakeMakeStats Interpret the snakemake stats file
sequana.snaketools.SequanaConfig Reads YAML config file and ease access to its contents
sequana.snaketools.message Dedicated print function to include in Snakefiles
class DOTParser(filename, mode='v2')[source]

Utility to manipulate the dot file returned by Snakemake

This class is used in the dag and rulegraph rules used in the snakemake pipeline. The input must be a dag/rulegraph created by snakemake.

Consider this example where the test file was created by snakemake --dag

from sequana import sequana_data
from sequana.snaketools import DOTParser

filename = sequana_data("test_dag.dot")
dot = DOTParser(filename)

# creates test_dag.ann.dot locally
dot.add_urls("test.dot", {"fastqc": "fastqc.html"})

You can then convert the dag in an unix shell:

dot -Tsvg test.ann.dot -o test.svg

(Source code, png, hires.png, pdf)



Parameters:filename (str) – a DAG in dot format created by snakemake
add_urls(output_filename=None, mapper={}, title=None)[source]

Change the dot file adding URL on some nodes

  • output_filename (str) – the DAG file in dot format (graphviz)
  • mapper (dict) – a dictionary where keys are named after the rule names for which an HTML will be available (provided here as keys)
class FastQFactory(pattern, extension=['fq.gz', 'fastq.gz'], read_tag='_R[12]_', verbose=False)[source]

FastQ Factory tool

In NGS experiments, reads are stored in a so-called FastQ file. The file is named:


where the _R1_ tag is always to be found. This is the single-ended case. In the paired case, a second file is expected:


The PREFIX indicates the sample name. The SUFFIX does not convey any information per se. The default read tag ("_R[12]_") handles this case. It can be changed if the data have another read tag (e.g. "[12].fastq.gz").

Yet, in long-read experiments (for instance), the naming convention differs and may not follow the single/paired-end convention.

In a directory (recursively or not), there could be lots of samples. This class can be used to get all the sample prefix in the tags attribute.

Given a tag, one can get the corresponding file(s):

ff = FastQFactory("*fastq.gz")


  • pattern (str) – a glob pattern (e.g., H*fastq.gz)
  • extension (list) – not used
  • read_tag (str) – regex tag used to join paired-end files. Some characters need to be escaped with a backslash to be interpreted literally (e.g. '_R[12]_.fastq.gz')
  • verbose (bool) –
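To illustrate how a read tag splits filenames into sample prefixes, here is a sketch with invented filenames (the logic is a simplification of what FastQFactory does):

```python
import re

# hypothetical filenames following the default _R[12]_ convention
files = ["sampleA_R1_001.fastq.gz", "sampleA_R2_001.fastq.gz",
         "sampleB_R1_001.fastq.gz"]

# the sample prefix is whatever precedes the read tag
samples = sorted({re.split("_R[12]_", f)[0] for f in files})
samples  # ['sampleA', 'sampleB']
```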
class FileFactory(pattern)[source]

Factory to handle a set of files

from sequana.snaketools import FileFactory
ff = FileFactory("H*.gz")

A set of useful methods are available based on this convention:

>>> fullpath = "/home/user/test/A.fastq.gz"
>>> dirname(fullpath)
>>> basename(fullpath)
>>> realpath(fullpath)  # symlinks and ".." are expanded, e.g. to /home/user/test

>>> all_extensions
>>> extensions


Parameters:pattern – can be a filename, a list of filenames, or a glob pattern (a Unix shell pattern with wildcards). For instance, "*/*fastq.gz"


Only Python 3.X supports the recursive glob pattern for now.


the extensions (list)


list of filenames and their extensions without the path


the last extension (list)


list of filenames (no path, no extension)


the common relative path


the relative path for each file (list)


the real path, that is, the full path + the filename + the extension

class Module(name)[source]

Data structure that holds metadata about a Module

In Sequana, we provide rules and pipelines to be used with snakemake. Snakemake rules look like:

rule <name>:
    :input: file1
    :output: file2
    :shell: "cp file1 file2"

A pipeline may look like:

include: "path_to_rule1"
include: "path_to_rule2"
rule all:
    input: FINAL_FILES

Note that the pipeline includes rules by providing the path to them.

All rules can be stored in a single directory. Similarly for pipelines. We decided not to use that convention. Instead, we bundle rules (and pipelines) in their own directories so that other files can be stored with them. We also consider that

  1. if the Snakefile includes other Snakefiles, then it is a Pipeline.
  2. Otherwise, it is a simple Rule.

So, a Module in sequana’s parlance is a directory that contains a rule or a pipeline and associated files. There is currently no strict conventions for rule Modules except for their own rule file. However, pipeline Modules should have the following files:

  • A snakemake file named after the directory with the extension .rules
  • A README.rst file in restructured text format
  • An optional config file in YAML format named config.yaml. Although the JSON format is possible, we use YAML throughout sequana for consistency. Rules do not have config files but pipelines do; if a pipeline does not provide a config.yaml, the one found in ./sequana/sequana/pipelines is used.
  • a requirements.txt


Developers who wish to include new rules should refer to the Developer guide.


it is important that the module's name be used to name both the directory and the rule/pipeline.

The Modules are stored in sequana/rules and sequana/pipelines directories. The modules’ names cannot be duplicated.



The Module class eases the retrieval of information linked to a rule or pipeline. For instance, if a pipeline has a config file, its path can be retrieved easily:

m = Module("quality_control")

This Module may be a rule or a pipeline; the method is_pipeline() can be used to get that information. Other useful methods are available, such as onweb(), which opens the web page of the pipeline (linked to the README).


Parameters:name (str) – the name of an available module.

full path to the config cluster file of the module


full path to the config file of the module


Content of the README file associated with


Is the module executable

A Pipeline Module should have a requirements.txt file that is introspected to check whether all executables are available.

Returns:a tuple. First element is a boolean to tell if it executable. Second element is the list of missing executables.

Return True if this module is a pipeline


return md5 of snakefile and its default configuration file

>>> from sequana import snaketools as sm
>>> m = sm.Module("variant_calling")
>>> m.md5()
{'config': 'e23b26a2ff45fa9ddb36c40670a8a00e',
 'snakefile': '7d3917743a6b123d9861ddbbb5f3baef'}

name of the module


full path to the module directory


full path to the README file of the module


list of requirements


full path to the schema config file of the module


full path to the Snakefile file of the module

class PipelineManager(name, config, pattern='*.fastq.gz', fastq=True)[source]

Utility to manage easily the snakemake pipeline

Inside a snakefile, use it as follows:

from sequana import PipelineManager
manager = PipelineManager("pipeline_name", "config.yaml")

config file must have these fields:

- input_directory:  #a_path
- input_extension:  fastq.gz  # this is the default. could be also fq.gz
- input_readtag: _R[12]_ # default
- input_pattern:    # a_global_pattern e.g. H*fastq.gz
- input_samples:
    - file1:
    - file2:

The manager can then easily access the data with a FastQFactory instance:


This can be further used to get wildcards with the proper directory.

The manager also tells you if the samples are paired or not, assuming all samples are homogeneous (either all paired or all single-ended).

If there is only one sample, the attribute mode is set to “nowc” meaning no wildcard. Otherwise, we assume that we are in a wildcard mode.

When the mode is set, two attributes are also set: sample and basename.

If the mode is nowc, sample and basename are hardcoded to the sample name and sample/rule/sample, respectively. In the wc mode, sample and basename are wildcards set to "{sample}" and "{sample}/rulename/{sample}". See the method getname().

For developers: the config attribute should be used as getter only.


  • name – name of the pipeline
  • config – name of a configuration file
  • pattern – a default pattern if not provided in the configuration file as an input_pattern field.

Create log directory: */sample/logs/sample_rule.logs

getname(rulename, suffix=None)[source]

Returns basename % rulename + suffix


Return list of raw data

If mode is nowc, a list of files is returned (one or two files) otherwise, a function compatible with snakemake is returned. This function contains a wildcard to each of the samples found by the manager.


Create the report directory.

class SnakeMakeStats(filename, N=1)[source]

Interpret the snakemake stats file

Run the Snakemake with this option:

--stats stats.txt


from sequana.snaketools import SnakeMakeStats
from sequana import sequana_data
filename = sequana_data("test_snakemake_stats.txt", "testing")
s = SnakeMakeStats(filename)

(Source code, png, hires.png, pdf)




Create the barplot from the stats file

plot_and_save(filename='snakemake_stats.png', outputdir='report')[source]
class SequanaConfig(data=None, converts_none_to_str=True)[source]

Reads YAML config file and ease access to its contents

This can also be used to check the validity of the config file

>>> sc = SequanaConfig(config)
>>> sc.config.pattern == "*.fastq.gz"

Input files should be stored into:

    - file1: FILE1
    - file2: FILE2

The second file may be optional.

Empty strings in a config are interpreted as None, but SequanaConfig replaces None with empty strings, which is probably what the user expected. Similarly, in snakemake, when setting the config file, one can override a value with False, but this is interpreted as the string "False"; SequanaConfig transforms such strings back into booleans.

Another feature is the automatic expansion of paths to directories and files starting with the special ~ (tilde) character, which are expanded transparently.

Could be a JSON or a YAML file

Parameters:filename (str) – filename to a config file in json or YAML format.

SEQUANA config files must have some specific fields:


Check the config file with respect to a schema file

Sequana pipelines should have a schema file in the Module.


Remove template elements and change None to empty string.


Copy files to run the pipeline

If a requirement file exists, it is copied into the target directory. If not, it can be either an HTTP resource or a sequana resource.

save(filename='config.yaml', cleanup=True)[source]

Save the yaml code in _yaml_code with comments

pipeline_names = ['rnaseq', 'smallrnaseq', 'quality_control', 'pacbio_qc', 'atac-seq', 'pacbio_denovo', 'coverage', 'compressor', 'pacbio_isoseq', 'variant_calling', 'denovo_assembly']

list of pipeline names found in the list of modules

11.19. Snpeff module

Tools to launch snpEff.

class SnpEff(reference, log=None)[source]

SnpEff is a tool dedicated to annotating detected variants in a VCF file. This wrapper eases the annotation with a GenBank file. It automatically creates the custom database, then runs snpEff in a subprocess. Caution: the locus name (or chromosome name) in the GenBank file and the sequence name in the VCF file must be the same; otherwise, snpEff is not able to link the information.


snpeff = SnpEff('file.gbk')
snpeff.launch_snpeff('variants.vcf', 'variant.ann.vcf')


  • reference – annotation reference.
  • file_format – format of your file (only GenBank is supported currently)
  • log – log file
add_locus_in_fasta(fasta, output_file)[source]

Add the locus of the annotation file to the description line of the FASTA file. Useful if the FASTA file and the GenBank file do not use the same names.

  • fasta (str) – input fasta file where you want to add locus.
  • output_file (str) – output file.
launch_snpeff(vcf_filename, output, html_output=None, options='')[source]

Launch snpEff with the custom genbank file.

  • vcf_filename (str) – input VCF filename.
  • output (str) – output VCF filename.
  • html_output (str) – filename of the HTML file created by snpEff.
  • options (str) – any options recognised by snpEff.
download_fasta_and_genbank(identifier, tag, genbank=True, fasta=True)[source]
  • identifier – valid identifier to retrieve from NCBI (genbank) and ENA (fasta)
  • tag – name of the filename for the genbank and fasta files.

11.20. General tools

misc utilities

textwrap(text, width=80, indent=0)[source]

Wrap a string to 80 characters per line

  • text – input text
  • width – (defaults to 80 characters)
  • indent – possible indentation (0 by default)

Converts a restructuredText document into HTML

Note that the returned object is a bytes object, so it needs to be decoded with decode()

wget(link, output)[source]

Retrieve a file from internet.

  • link (str) – a valid URL
  • output (str) – the output filename


no sanity check of any kind for now


move to easydev

findpos(seq, chr)[source]

Find position(s) of a substring in a longer string.

Note that this function is a generator:

>>> list(findpos("AACCGGAAGGTT", "GG"))

Used to check if we are on a cluster

"tars-" is one cluster's hostname prefix. Change the pattern argument, or append your cluster's hostname to it.

Parameters:pattern (str) – a list of names (strings) or a string

Statistical tools

moving_average(data, n)[source]

Compute moving average

Parameters:n – window’s size (odd or even).
>>> from sequana.stats import moving_average as ma
>>> ma([1,1,1,1,3,3,3,3], 4)
array([ 1. ,  1.5,  2. ,  2.5,  3. ])


the final vector does not have the same size as the input vector.
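The moving average can be sketched with a cumulative-sum trick; this is an illustration, and the sequana implementation may differ:

```python
import numpy as np

def moving_average(data, n):
    # output has len(data) - n + 1 points, hence the size caveat above
    csum = np.cumsum(data, dtype=float)
    csum[n:] = csum[n:] - csum[:-n]
    return csum[n - 1:] / n

moving_average([1, 1, 1, 1, 3, 3, 3, 3], 4)  # array([1. , 1.5, 2. , 2.5, 3. ])
```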


Return Evenness of the coverage

Reference:Konrad Oexle, Journal of Human Genetics 2016, Evaluation of the evenness score in NGS.

Works before or after normalisation, but the results differ.

C = mean(X)
D2 = X[X<=C]
N = len(X)
n = len(D2)
E = 1 - (n - sum(D2) / C) / N
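The pseudo-code above translates directly into Python; a sketch following the Oexle formula:

```python
import numpy as np

def evenness(X):
    X = np.asarray(X, dtype=float)
    C = X.mean()        # mean coverage
    D2 = X[X <= C]      # positions at or below the mean
    N = len(X)
    n = len(D2)
    return 1 - (n - D2.sum() / C) / N

evenness([10, 10, 10, 10])  # 1.0 for perfectly even coverage
```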

General tools

class StatsBAM2Mapped()[source]

A dictionary-like container (inherits from dict) for BAM mapping statistics.
bam_to_mapped_unmapped_fastq(filename, output_directory=None, verbose=True)[source]

Create mapped and unmapped fastq files from a BAM file


Given a reference, one or two FastQ files are mapped onto the reference to generate a BAM file. This BAM file is a compressed version of a SAM file, whose interpretation is eased within this function.

  • filename – input BAM file
  • output_directory – where to save the mapped and unmapped files

dictionary with number of reads for each file (mapped/unmapped for R1/R2) as well as the mode (paired or not), the number of unpaired reads, and the number of duplicated reads. The unpaired reads should be zero (sanity check)

Given a BAM file, create FASTQ with R1/R2 reads mapped and unmapped. In the paired-end case, 4 files are created.

Note that this function is efficient in that it does not create intermediate files limiting IO in the process. As compared to standard tools such as bedtools bamtofastq, it is 1.5 to 2X slower but it does create the mapped AND unmapped reads.


Secondary alignments (flag 256) are dropped so as to remove any ambiguous alignments. The output dictionary stores a "secondary" key to keep track of the total number of secondary reads that are dropped. If the flag is 256 and the read is unpaired, the "unpaired" key is also incremented.

If the flag is not equal to 256, we first reverse complement reads that are tagged as reverse in the BAM file. Then, reads that are not paired or not “proper pair” (neither flag 4 nor flag 8) are ignored.

If R1 is mapped or R2 is mapped then the reads are considered mapped. If both R1 and R2 are unmapped, then reads are unmapped.


About chimeric alignments: one is the representative and the other is the supplementary. This flag is not used in this function. Note also that chimeric alignments have the same QNAME and flags 4 and 8.


the contamination reported is based on R1 only.


comments are missing since they are not stored in the BAM file.


the mapped reads may not be synchronized because we also include the chimeric alignments (cf. the samtools documentation). However, total reads = unmapped reads + R1 mapped + R2 mapped - supplementary reads (those with flag 2048).

class GZLineCounter(filename)[source]

Fast GZipped line counter

Uses zcat if possible, otherwise the gzip library (twice as slow).

>>> from sequana import sequana_data
>>> from sequana.misc import GZLineCounter
>>> gz = GZLineCounter(sequana_data("test.fastq.gz"))
>>> len(gz)

11.21. VCF module

Analysis of VCF file generated by freebayes.

class Filtered_freebayes(variants, fb_vcf)[source]

Variants filtered with VCF_freebayes.


  • variants (list) – list of variants record.
  • fb_vcf (VCF_freebayes) – class parent.

Get columns index.


Get the data frame.

to_csv(output_filename, info_field=False)[source]

Write DataFrame in CSV format.

Params str output_filename:
 output CSV filename.

Write VCF file in VCF format.

Params str output_filename:
 output VCF filename.

Get the variant list.


Get the VCF_freebayes object.

class VCF_freebayes(filename, **kwargs)[source]

VCF class (Variant Calling Format)

This class is a wrapper around the vcf.Reader class from the pyVCF package. It is dedicated to VCF files generated by freebayes. A data frame with all variants is produced, which can be written as a CSV file. Variants can be filtered with a dictionary of filter parameters; filtered variants are written to a new VCF file.

from sequana import sequana_data
from sequana.freebayes_vcf_filter import VCF_freebayes
vcf_filename = sequana_data("JB409847.vcf")

# Read the data
v = VCF_freebayes(vcf_filename)

# Filter the data
filter_dict = {"freebayes_score": 200,
               "frequency": 0.8,
               "min_depth": 10,
               "strand_ratio": 0.2}
filter_v = v.filter_vcf(filter_dict)
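The filter dictionary acts as a set of per-variant thresholds. A minimal sketch of the kind of test it implies (hypothetical helper, not sequana's implementation; key names follow the example above):

```python
def keep_variant(variant, filters):
    """Return True if a variant passes all threshold filters.

    `variant` is a plain dict: score, frequency and depth must meet
    or exceed their thresholds, and the strand ratio must not be too
    skewed toward one strand.
    """
    return (
        variant["freebayes_score"] >= filters["freebayes_score"]
        and variant["frequency"] >= filters["frequency"]
        and variant["depth"] >= filters["min_depth"]
        and variant["strand_ratio"] >= filters["strand_ratio"]
    )

filters = {"freebayes_score": 200, "frequency": 0.8,
           "min_depth": 10, "strand_ratio": 0.2}
```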

Information about strand bias (aka strand balance, or strand ratio). This is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other.


  • filename (str) – a vcf file.
  • kwargs – any arguments accepted by vcf.Reader

Filter variants in the VCF file.

Parameters: filter_dict (dict) – dictionary of filters. It updates the attribute VCF_freebayes.filters

Return Filtered_freebayes object.


Get or set the filter parameters used to select variants of interest. The setter takes a dictionary as parameter to update the attribute VCF_freebayes.filters_params. Deleting resets the different variables to 0.

v = VCF_freebayes("input.vcf")
v.filters_params = {"freebayes_score": 200,
                   "frequency": 0.8,
                   "min_depth": 10,
                   "strand_ratio": 0.2}

Get VCF_freebayes.is_joint, which is True if the VCF file was generated by joint freebayes calling.


Rewind the reader

class Variant(record)[source]

Variant reader and dictionary that stores important variant information


  • record (RecordVariant) – variant record
  • resume (dict) – most important information about the variant
Compute frequency of alternate allele.
alt_freq = Count Alternate Allele / Depth
Parameters:vcf_line (vcf.model._Record) – variant record

Compute the strand balance of the alternate allele, in the range [0, 0.5]. strand_bal = Alt Forward / (Alt Forward + Alt Reverse)

Parameters:vcf_line (vcf.model._Record) – variant record

FYI: in freebayes, the allele balance (reported under AB), strand bias counts (SRF, SRR, SAF, SAR) and bias estimate (SAP) can be used as well for filtering. Here, we use the strand balance computed as SAF / (SAF + SAR)

strand_ratio(number1, number2)[source]

Compute the ratio between two numbers. The result lies in the range [0, 0.5].
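A minimal sketch of these two computations. The alt_freq formula follows the text above; folding the strand ratio into [0, 0.5] by taking the smaller strand count is an assumption about how the range is achieved, not sequana's verified code:

```python
def alt_freq(alt_count, depth):
    """Frequency of the alternate allele: alt count / depth."""
    return alt_count / depth

def strand_ratio(number1, number2):
    """Ratio between two strand counts, folded into [0, 0.5]."""
    total = number1 + number2
    if total == 0:
        return 0.0
    return min(number1, number2) / total
```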

11.22. Module Reports

The generic module is the parent of all other report modules

class SequanaBaseModule(template_fn='standard.html')[source]

Generic Module to write HTML reports.

add_code_section(content, language)[source]

Add code in your html.


Align content to the right.

copy_file(filename, target_dir)[source]

Copy a file to a target directory in report dir. Return the relative path of your file.

  • filename (str) – file to copy.
  • target_dir (str) – directory where to copy.

Return relative path of the new file location.
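The copy-and-return-relative-path behaviour can be sketched as follows (a hypothetical stand-alone helper; the actual method resolves the report directory from the module instance):

```python
import os
import shutil

def copy_file(filename, report_dir, target_dir):
    """Copy a file into report_dir/target_dir and return its path
    relative to report_dir (suitable for use in HTML links)."""
    dest_dir = os.path.join(report_dir, target_dir)
    os.makedirs(dest_dir, exist_ok=True)
    shutil.copy(filename, dest_dir)
    return os.path.join(target_dir, os.path.basename(filename))
```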

create_combobox(path_list, html_id, newtab=True)[source]

Create a dropdown menu with jQuery.

Parameters:path_list (list) – list of links.

return html div and js script as string.

create_embedded_png(plot_function, input_arg, style=None, **kwargs)[source]

Take a plot function as input and create an HTML-embedded PNG image. You must set the argument name used for the output so that it can be connected to the in-memory buffer.
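Embedding a PNG in HTML boils down to base64-encoding the image bytes into a data URI. A minimal sketch of that mechanism, shared by create_embedded_png and png_to_embedded_png (illustrative only, not sequana's code):

```python
import base64

def embed_png(png_bytes, style=None):
    """Return an <img> tag with the PNG embedded as a data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    style_attr = ' style="{}"'.format(style) if style else ""
    return '<img src="data:image/png;base64,{}"{}>'.format(encoded, style_attr)
```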

create_hide_section(html_id, name, content, hide=False)[source]

Create a hideable section.

  • html_id (str) – add short id to connect all elements.
  • name (str) – name of the hyperlink to hide or display the content.
  • content (str) – hideable HTML content.
  • hide (bool) – set if the first state is hiding or not.

Return tuple that contains HTML hyperlink and hideable section.


Create HTML file with Jinja2.

Parameters:output_filename (str) – HTML output filename

Create an HTML hyperlink with name and target.

  • target (str) – the target url.
  • newtab (bool) – open html page in a new tab.
  • download (bool) – download the target.

Return as string the HTML hyperlink to the target.


Include SVG image in the html.

png_to_embedded_png(png, style=None)[source]

Include a PNG file as embedded file.

required_dir = ('css', 'js', 'images')

Report dedicated to BAM file

BAMQCModule(bam_input[, output_filename]) Report dedicated to BAM file
class BAMQCModule(bam_input, output_filename=None)[source]

Report dedicated to BAM file

from sequana import sequana_data
from sequana.modules_report.bamqc import BAMQCModule
filename = sequana_data("test.bam")

r = BAMQCModule(filename, "bam.html")

# report/bam.html is now available


Right now, the computation is performed in the class. Ideally, the computation would happen elsewhere and be stored as a JSON file, which would then be the input to this class.


Module to write coverage report

class CoverageModule(data, region_window=200000)[source]

Write HTML report of coverage analysis. This class takes either a GenomeCov instance or a CSV file where analyses are stored.


Parameters: data – a CSV filename created by sequana_coverage, or a bedtools.GenomeCov object.


Create table with links to chromosome reports


Create HTML report for each chromosome present in data.


Initiate DataTableFunction to create a table that links each row to a sub HTML report. All tables will have the same appearance, so the ROIs can be initialised once for all chromosomes.

Parameters: rois – a ROIs instance from ChromosomeCov, or a simple dataframe
class ChromosomeCoverageModule(chromosome, datatable, directory='coverage_reports', region_window=200000, options=None, command='')[source]

Write HTML report of coverage analysis for each chromosome. It is created by CoverageModule.

  • chromosome
  • datatable
  • directory
  • region_window (int) – length of the sub coverage plot
  • options – should contain “W”, “k”, “circular”

Basics statistics section.


Coverage barplots section.


Coverage section.

create_report_content(directory, options=None)[source]

Generate the sections list to fill the HTML report.


3 dimensional plot of GC content versus coverage.


Barplot of normalized coverage section.

regions_of_interest(rois, links)[source]

Region of interest section.

subcoverage(rois, directory)[source]

Create subcoverage reports to have access to a zoomable line plot.

Params rois:

This method creates sub-reports for each region of 200,000 bases (the size can be changed). Usually positions start at 0, so for a genome of 2,300,000 bases one report is stored per window.


Note that if the BED file positions do not start at zero, the report names take that into account.
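The windowing described above can be sketched as follows (a hypothetical helper; sequana's actual partitioning and naming may differ at the boundaries):

```python
def coverage_windows(genome_length, region_window=200000):
    """Split [0, genome_length) into consecutive sub-report windows."""
    starts = range(0, genome_length, region_window)
    return [(start, min(start + region_window, genome_length))
            for start in starts]
```

For a genome of 2,300,000 bases this yields 12 windows, the last one covering the remaining 100,000 bases.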


Barplot of zscore distribution section.

11.23. Others

11.23.1. data related

Retrieve data from sequana library

sequana_data(filename=None, where=None)[source]

Return full path of a sequana resource data file.

  • filename (str) – a valid filename to be found
  • where (str) – one of the registered data directory (see below)

Returns the path of the file. See also below for the case where filename is set to "*".

from sequana import sequana_data
filename = sequana_data("test.bam")

Call the function with "*" as the filename to get a list of available files. With the where argument set, the function returns a list of files. Without the where argument, a dictionary is returned whose keys correspond to the registered directories:

filenames = sequana_data("*", where="images")

Registered directories are:

  • data
  • testing
  • data/adapters
  • images


This does not handle wildcards; "*" simply means retrieve all files.

Some useful data sets to be used in the analysis

The command sequana.sequana_data() may be used to retrieve data from this package. For example, a small but standard reference (phix) is used in some NGS experiments. The file is small enough to be provided within sequana, and its filename (full path) can be retrieved as follows:

from sequana import sequana_data
fullpath = sequana_data("phiX174.fa", "data")

Other files stored in this directory will be documented here.

11.23.2. report related

Utilities to create a Jquery DataTable for your HTML file.

DataTableFunction(df, html_id[, index]) Class that contains Jquery DataTables function and options.
DataTable(df, html_id[, datatable, index]) Class that contains html table which used a javascript function.
class DataTable(df, html_id, datatable=None, index=False)[source]

Class that contains an HTML table which uses a javascript function.

You must add in your HTML file the JS function (DataTable.create_javascript_function()) and the HTML code (DataTable.create_datatable()).


import pandas

df = pandas.read_csv('data.csv')
datatable = DataTable(df, 'data')
datatable.datatable.datatable_options = {'pageLength': 15,
                                         'dom': 'Bfrtip',
                                         'buttons': ['copy', 'csv']}
js = datatable.create_javascript_function()
html = datatable.create_datatable()

# Second CSV file with same format
df2 = pandas.read_csv('data2.csv')
datatable2 = DataTable(df2, 'data2', datatable.datatable)
html2 = datatable2.create_datatable()

The reason to include the JS manually is that you may include many HTML tables but need to include the JS only once.


  • df – data frame.
  • html_id (str) – the unique ID used in the HTML file.
  • datatable (DataTableFunction) – javascript function to create the Jquery Datatables. If None, a DataTableFunction is generated from the df.
  • index (bool) – indicates whether the index dataframe should be included in the CSV table
create_datatable(style='width:100%', **kwargs)[source]

Return a well-formatted string to include in an HTML page.

  • style (str) – CSS option of your table.
  • kwargs (**dict) –

    parameters of pandas.DataFrame.to_csv().


Generate the javascript function to create the DataTable in a HTML page.

class DataTableFunction(df, html_id, index=False)[source]

Class that contains Jquery DataTables function and options.


import pandas as pd
from sequana.utils import DataTableFunction

df = pd.read_csv('data.csv')
datatable_js = DataTableFunction(df, 'data')
datatable_js.datatable_options = {'pageLength': 15,
                                  'dom': 'Bfrtip',
                                  'buttons': ['copy', 'csv']}
js = datatable_js.create_javascript_function()
# df_list is assumed to be a list of dataframes sharing the same format
html_datatables = [DataTable(df, "data_{0}".format(i), datatable_js)
                   for i, df in enumerate(df_list)]

Here, the datatable_options dictionary is used to fine tune the appearance of the table.


DataTables adds a number of elements around the table to control it or show additional information about it. They are controlled by their order in the document (dom), defined as a string made of letters, each with a precise meaning. The order of the letters is important. For instance, if B is first, the buttons are placed before the table; if B is last, they are shown below it. Here are some of the valid letters and their meaning:

  • B: add the Buttons (copy/csv)
  • i: add showing 1 to N of M entries
  • f: add a search bar (f filtering)
  • r: processing display element
  • t: the table itself
  • p: pagination control

Each option can be specified multiple times (with the exception of the table itself).


other useful options are:

  • pageLength: 15
  • scrollX: “true”
  • paging: true (set to false to disable pagination)
  • buttons: [‘copy’, ‘csv’]

Note that buttons can also be excel, pdf, print, …

All DataTables options are described in the reference documentation: https://datatables.net/reference/option/


  • df – data frame.
  • html_id (str) – the ID used in the HTML file.

Return javascript to create the DataTable.


Get datatable_columns dictionary. It is automatically set from the dataframe you want to plot.


Get, set or delete the DataTable options. The setter takes a dict with the desired options and updates the current dictionary.


datatable = DataTableFunction("tab")
datatable.datatable_options = {'dom': 'Bfrtip',
                               'buttons': ['copy', 'csv']}

source: https://datatables.net/reference/option/


Get the html_id, which cannot be set by the user after the instantiation of the class.

set_links_to_column(link_col, target_col)[source]

Hide a column with URLs and connect it with a column.

  • link_col (str) – column with your URLs.
  • target_col (str) – column to connect.
set_tooltips_to_column(tooltips_col, target_col)[source]

Hide a column with tooltips and connect it with a column.

  • tooltips_col (str) – column with your tooltips.
  • target_col (str) – column to connect.