11. References

11.1. Adapters

Utilities to manipulate adapters

Adapter removal can be performed by many different tools such as CutAdapt, AlienTrimmer or Trimmomatic. Unfortunately, they tend to require different input formats, ranging from FASTA to plain-text files. Moreover, their outputs are also generally reported in different formats.

Tools to extract specific adapters from FASTA files would also be handy. For instance, you may have all your adapters in a single file.

In this module, we provide:

  • tools to manipulate adapters stored in FASTA format (AdapterReader);
  • tools to export FASTA files with adapter content into other formats required by various adapter removal software;
  • a tool to extract adapters from a FASTA file given their identifier or sequence (FindAdaptersFromDesign).

Our convention is to store lists of adapters in FASTA format, which can be read using the AdapterReader:

from sequana import sequana_data, AdapterReader
filename = sequana_data("adapters_Nextera_fwd.fa")
ar = AdapterReader(filename)
ar.get_adapter_by_index("N501")

Given a design file (see sequana.expdesign) and a name for the type of adapters, one can easily extract the subset of relevant adapters to be used for a sample. Currently, the following sets of adapters/designs are available:

  • Nextera single and double indexing
  • Rubicon single indexing
  • PCRFree single indexing
  • TruSeq

Note that TruSeq indices 17, 24 and 26 are missing. This is expected: those are “reserved” Illumina indices.

For instance given a design file that gives the mapping between samples and a set of Nextera adapters, one would use:

>>> from sequana import *
>>> filename = sequana_data("test_expdesign_hiseq.csv")
>>> design = ExpDesignAdapter(filename)
>>> fa = FindAdaptersFromDesign(design, "PCRFree")
>>> print(fa.sample_names[0])
553-iH2-1
>>> fa.get_adapters_from_sample("553-iH2-1")

See FindAdaptersFromDesign for details.

class Adapter(identifier, sequence=None, comment=None)[source]

Class to store one adapter

An adapter is just a sequence from a FASTA file. It contains an identifier, a sequence and possibly a comment.

Warning

The identifier provided must not contain the starting “>” character, which is added automatically when needed.

One can check if an adapter is equal to another. Only the sequence is checked for equality though.

Some Sequana notations have been added to the identifier to ease retrieval of the index’s name and sequence:

>NextFlex_PCR_Free_adapter1|name:1|seq:CGATGT

Of course the string CGATGT must be found in the sequence itself.
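This tag convention can be parsed with plain string operations. A minimal sketch, independent of the Adapter class (the parse_adapter_identifier helper below is hypothetical, not part of Sequana):

```python
def parse_adapter_identifier(identifier):
    """Split a Sequana-style adapter identifier into its parts.

    Fields are separated by pipes; tagged fields use a key:value syntax.
    """
    fields = identifier.lstrip(">").split("|")
    info = {"identifier": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition(":")
        info[key] = value
    return info

info = parse_adapter_identifier(">NextFlex_PCR_Free_adapter1|name:1|seq:CGATGT")
# info["name"] == "1" and info["seq"] == "CGATGT"
```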

ar = AdapterReader(sequana_data("adapters_PCRFree_fwd.fa"))
adapter = Adapter(ar[0])
adapter.identifier
adapter.comment
adapter.index_sequence
adapter.sequence
adapter.name
comment

R/W adapter’s identifier

identifier

R/W adapter’s identifier

index_sequence

Read only access to the index sequence

name

Read-only access to the index name

sequence

R/W adapter’s sequence

class AdapterReader(filename)[source]

Reader of FASTA file dedicated to adapters

A FASTA file is just a set of such paired lines:

>Nextera_index_N501|name:N501|seq:ACGT optional comment
ACGTACGTACGT

where the optional comment is separated from the identifier by a tabulation.

In the FASTA identifier, the first pipe delimits the official name (left-hand side) from the name tag. The information in this example may be redundant, but the name is used throughout the Sequana code to ensure reproducibility.

Note

Sequences are all stored in upper case.

Note

The universal adapter has no index, so it does not need the name or seq tags. However, it must be named Universal_Adapter.

>>> from sequana import sequana_data, AdapterReader
>>> filename = sequana_data("adapters_Nextera_fwd.fa")
>>> ar = AdapterReader(filename)
>>> candidate = ar.get_adapter_by_index_name("S505")
>>> print(candidate[0])
>Nextera_index_S505|name:S505|seq:GTAAGGAG
AATGATACGGCGACCACCGAGATCTACACGTAAGGAGTCGTCGGCAGCGTC
>>> len(ar)
56

Note

The constructor checks for uniqueness of the identifiers. If they are not unique, an error is raised.

Sources: Illumina document #1000000002694 v01
Sources: for NextFlex PCR-Free adapters, there are 48 barcodes. http://www.biooscientific.com/Portals/0/IEM/Bioo-Scientific-PCR-Free-Barcode-Indices-v1-1-15.pdf

Constructor

Parameters:filename (str) – the input FASTA file
comments
get_adapter_by_identifier(text)[source]

Return adapter whose identifier matches the user text

Parameters:index_identifier – the unique index identifier to be found. If several sequences match, this is an error, meaning the FASTA file with all adapters is not correctly formatted.
Returns:the adapter that matches the index_name (if any); otherwise returns None
get_adapter_by_index_name(index_name)[source]

Return adapter for the index name provided

Can be used only if the identifier contains the tag:

|name:an_index_to_be_found

For instance:

>Nextera_blabal|name:N505|seq:ACGT
>Nextera_blabal|seq:ACGT|name:N505
>Nextera_blabal|name:N505

are valid identifiers

get_adapter_by_index_seq(index_name)[source]

See get_adapter_by_index_name().

get_adapter_by_sequence(subsequence)[source]

Return one or several adapters with sub-sequence in their sequence

Parameters:subsequence (str) – a string (ACGT letters)
Returns:name and sequence in FASTA format that have the user sequence contained in their sequence

If the subsequence is short, it may return more than one adapter. Note that the sequence is currently searched for without position information.

identifiers
index_names
index_sequences
reverse()[source]

Reverse all sequences inplace

>>> from sequana import sequana_data, AdapterReader
>>> filename = sequana_data("adapters_Nextera_fwd.fa")
>>> filename2 = sequana_data("adapters_Nextera_rev.fa")
>>> ar = AdapterReader(filename)
>>> ar2 = AdapterReader(filename2)
>>> ar.reverse()
>>> ar == ar2
True
reverse_complement()[source]

Reverse-complement all sequences inplace

>>> from sequana import sequana_data, AdapterReader
>>> filename = sequana_data("adapters_Nextera_fwd.fa")
>>> filename2 = sequana_data("adapters_Nextera_revcomp.fa")
>>> ar = AdapterReader(filename)
>>> ar2 = AdapterReader(filename2)
>>> ar.reverse_complement()
>>> ar == ar2
True
sequences
to_dict()[source]

Return a dictionary with identifiers as keys and [comment, sequence] lists as values

to_fasta(filename)[source]

Save sequences into fasta file

class FindAdaptersFromDesign(design_filename, adapters)[source]

Extract adapter(s) corresponding to an experimental design file

Used by the sequana main script to build the adapter files for multi-sample projects, as input to various adapter removal software.

Constructor

Parameters:
  • design_filename (str) – a CSV file that is compatible with our sequana.expdesign.ExpDesignAdapter
  • adapters – the type of adapters (PCRFree, Nextera, Rubicon, TruSeq, SMARTer, Small)

The files of adapters are stored in Sequana and accessible with the sequana_data function. So, for instance if adapters is set to Nextera, the following file is used to identify the adapters:

sequana_data("adapters_Nextera_fwd.fa")

New adapters files can be added on request. Currently, Nextera and PCRFree are available. Rubicon and TruSeq will be added soon.

check()[source]
get_adapters_from_sample(sample_name)[source]

Return a dictionary with adapters corresponding to the sample name

Parameters:sample_name (str) – a valid sample name as found in the design file. One can check the content of the sample_names attribute.
Returns:a dictionary with the adapters in forward, reverse, reverse complement for index1 and index2 (if relevant).
get_sample(sample_name)[source]

Return basic info about the sample name (from the design file)

sample_names

return all sample names contained in the design file

save_adapters_to_fasta(sample_name, output_dir='.')[source]

Get index1, index2 and universal adapter

adapter_removal_parser(filename)[source]

Parses output of AdapterRemoval software

>>> from sequana import adapters, sequana_data
>>> data = sequana_data("test_adapter_removal_output.txt", "testing")
>>> results = adapters.adapter_removal_parser(data)
>>> results["adapter1"]
'AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG'
adapters_to_clean_ngs(input_filename, output_filename='adapters_ngs.txt')[source]

Convert a FASTA formatted file into clean_ngs format

Parameters:
  • input_filename (str) – the input FASTA file
  • output_filename (str) – a TSV formatted file
fasta_fwd_rev_to_columns(file1, file2=None, output_filename=None)[source]

From two FASTA files of adapters (forward and reverse), return a 2-column file

This is useful for some tools related to adapter removal that take this kind of format as input

Parameters:
  • file1 (str) – FASTA format
  • file2 (str) – FASTA format (optional)

The files must have a one-to-one mapping

get_sequana_adapters(type_, direction)[source]

Return path to a list of adapters in FASTA format

Parameters:
  • type_ – PCRFree, Rubicon, Nextera
  • direction – fwd, rev, revcomp
Returns:

path to the adapter filename

11.2. Assembly related

class BUSCO(filename='full_table_testbusco.tsv')[source]

Wrapper of the BUSCO output

“BUSCO provides quantitative measures for the assessment of a genome assembly, gene set, or transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.” – BUSCO website 2017

This class reads the full report generated by BUSCO and provides some visualisations of this report. The information is stored in a dataframe df. The score can be retrieved with the score attribute, a percentage in the range 0-100.

Reference:http://busco.ezlab.org/

constructor

Filename:a valid BUSCO input file (full table). See example in sequana code source (testing)
get_summary_string()[source]
pie_plot(filename=None, hold=False)[source]

Plot a pie chart of the status (complete / fragmented / missing)

from sequana import BUSCO, sequana_data
b = BUSCO(sequana_data("test_busco_full_table.tsv"))
b.pie_plot()

(Source code, png, hires.png, pdf)

_images/references-1.png
scatter_plot(filename=None, hold=False)[source]

Scatter plot of the score versus length of each ortholog

from sequana import BUSCO, sequana_data
b = BUSCO(sequana_data("test_busco_full_table.tsv"))
b.scatter_plot()

(Source code)

score
summary()[source]

Return summary information on the missing, complete and fragmented orthologs

11.3. CIGAR tools

class Cigar(cigarstring)[source]
>>> from sequana.cigar import Cigar
>>> c = Cigar("2S30M1I")
>>> len(c)
33
>>> c.items()

>>> c = Cigar("1S1S1S1S")
>>> c.compress()
>>> c.cigarstring
"4S"

Possible CIGAR types are:

  • “M” for match (0)
  • “I” for insertion (1)
  • “D” for deletion (2)
  • “N” for reference skipped (3)
  • “S” for soft clipping (4)
  • “H” for hard clipping (5)
  • “P” for padding (6)
  • “=” for sequence match (7)
  • “X” for sequence mismatch (8)
  • “B” for back (9)

Note

the length of the query sequence based on the CIGAR is calculated by adding the lengths of the M, I, S, = and X operations; the other operations are ignored. Source: https://stackoverflow.com/questions/39710796/infer-the-length-of-a-sequence-using-the-cigar/39812985#39812985

Reference:https://github.com/samtools/htslib/blob/develop/htslib/sam.h
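The query-length rule above translates directly into a few lines of Python. A standalone sketch (not the Cigar implementation itself):

```python
import re

def query_length(cigarstring):
    """Length of the query sequence implied by a CIGAR string.

    Only M, I, S, = and X operations consume query bases; the other
    operations (D, N, H, P, B) are ignored, as described above.
    """
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=XB])", cigarstring)
               if op in "MIS=X")

length = query_length("2S30M1I")  # 33, matching len(Cigar("2S30M1I"))
```

Note that an unknown type such as Y in 1S100Y is simply not matched, so it is ignored, consistent with the note below.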

Constructor

Parameters:cigarstring (str) – the CIGAR string.

Note

the input CIGAR string validity is not checked. If an unknown type is found, it is generally ignored. For instance, the length of 1S100Y is 1 since Y is not a valid type.

as_dict()[source]

Return cigar types and their count

Returns:dictionary

Note that repeated types are added:

>>> c = Cigar('1S2M1S')
>>> c.as_dict()
{"S":2,"M":2}
as_sequence()[source]
as_tuple()[source]

Decompose the cigar string into tuples keeping track of repeated types

Returns:tuple
>>> from sequana import Cigar
>>> c = Cigar("1S2M1S")
>>> c.as_tuple()
(('S', 1), ('M', 2), ('S', 1))
cigarstring

the CIGAR string attribute

compress()[source]

1S1S should become 2S. In-place modification.
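The compression amounts to a run-length merge of adjacent identical types. A minimal sketch, returning a new string rather than modifying in place:

```python
import re
from itertools import groupby

def compress_cigar(cigarstring):
    """Merge adjacent CIGAR operations of the same type (e.g. 1S1S -> 2S)."""
    items = [(op, int(n)) for n, op in re.findall(r"(\d+)([A-Za-z])", cigarstring)]
    # group consecutive items by operation type and sum their lengths
    return "".join("%d%s" % (sum(n for _, n in group), op)
                   for op, group in groupby(items, key=lambda item: item[0]))

compressed = compress_cigar("1S1S1S1S")  # "4S"
```

Non-adjacent repeats of a type are kept separate, so "1S2M1S" is unchanged.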

pattern = '(\\d+)([A-Za-z])?'
stats()[source]

Return the number of occurrences of each type listed in types

>>> c = Cigar("1S2M1S")
>>> c.stats()
[2, 0, 0, 0, 2, 0, 0, 0, 0, 0]
types = 'MIDNSHP=XB'

valid CIGAR types

11.4. BAMTOOLS related

Tools to manipulate BAM/SAM files

Alignment(alignment) Helper class to retrieve info about Alignment
BAM(filename[, mode]) BAM data structure, statistics and plotting
SAMFlags([value]) Utility to extract bits from a SAM flag

Note

BAM being the compressed version of SAM, we do not implement any functionality related to SAM files. We strongly encourage developers to convert their SAM files to BAM.

class BAM(filename, mode='rb', *args)[source]

BAM data structure, statistics and plotting

Note

Python2.7 and 3.5 behave differently and we would recommend the Python 3.5 version. For instance, to_fastq() would work only with Python 3.5.

We provide a test file in Sequana:

>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data("test.bam"))
>>> len(b)
1000

Note

Once you loop over this data structure, you must call reset() so that the next iteration starts at position 0 again. The methods implemented in this data structure take care of that for you, thanks to a decorator called seek. If you want to use the next() function, call reset() first to make sure you start at the beginning.

Constructor

bam_analysis_to_json(filename)[source]

Create a json file with information related to the bam file.

This includes some metrics (see get_stats(); e.g., MAPQ), combinations of flags, SAM flags, and counters of the read lengths.

boxplot_qualities(max_sample=500000)[source]

Same as in sequana.fastq.FastQC

get_actg_content(max_sample=500000)[source]
get_flags()[source]

Return flags of all reads as a list

get_flags_as_df()[source]

Returns flags as a dataframe

>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data('test.bam'))
>>> df = b.get_flags_as_df()
>>> df.sum()
1       1000
2        484
4          2
8          2
16       499
32       500
64       477
128      523
256       64
512        0
1024       0
2048       0
dtype: int64

See also

SAMFlags for meaning of each flag

get_full_stats_as_df()[source]

Return a dataframe with full stats about the BAM file

The index of the dataframe contains the flags. The column contains the counts.

>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data("test.bam"))
>>> df = b.get_full_stats_as_df()
>>> df.query("description=='average quality'")
36.9

Note

uses samtools behind the scene

get_gc_content()[source]

Return GC content for all reads (mapped or not)

get_length_count()[source]

Return counter of all fragment lengths

get_mapped_read_length()[source]

Return dataframe with read length for each read

(Source code, png, hires.png, pdf)

_images/references-3.png
get_mapq_as_df()[source]

Return dataframe with mapq for each read

get_metrics_count()[source]

Count flags/mapq/read length in one pass.

get_read_names()[source]

Return the reads’ names

get_samflags_count()[source]

Count how many reads have each SAM flag (computed from the metrics count).

get_stats()[source]

Return basic stats about the reads

Returns:dictionary with basic stats:
  • total_reads : number of reads
  • mapped_reads : number of mapped reads
  • unmapped_reads : number of unmapped reads
  • mapped_proper_pair : number of reads with R1 and R2 mapped face to face
  • hard_clipped_reads : number of reads with supplementary alignment
  • reads_duplicated : number of duplicated reads

Warning

works only for BAM files. Use get_full_stats_as_df() for SAM files.

hist_coverage(bins=100)[source]
from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))
b.hist_coverage()

(Source code, png, hires.png, pdf)

_images/references-4.png
is_sorted

Return True if the BAM is sorted

iter_mapped_reads()[source]

Return an iterator on the reads that are mapped

iter_unmapped_reads()[source]

Return an iterator on the reads that are unmapped

plot_acgt_content(stacked=False, fontsize=16, include_N=True)[source]

Plot ACGT content

from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))
b.plot_acgt_content()

(Source code, png, hires.png, pdf)

_images/references-5.png
plot_bar_flags(logy=True, fontsize=16, filename=None)[source]

Plot a histogram of the flags contained in the BAM

from sequana import BAM, sequana_data
b = BAM(sequana_data('test.bam', "testing"))
b.plot_bar_flags()

(Source code, png, hires.png, pdf)

_images/references-6.png

See also

SAMFlags for meaning of each flag

plot_bar_mapq(fontsize=16, filename=None)[source]

Plot bar plots of the MAPQ (mapping quality) of the alignments

from sequana import BAM, sequana_data
b = BAM(sequana_data('test.bam', "testing"))
b.plot_bar_mapq()

(Source code, png, hires.png, pdf)

_images/references-7.png
plot_coverage()[source]

Please use GenomeCov for more sophisticated tools to plot the genome coverage

from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))
b.plot_coverage()

(Source code, png, hires.png, pdf)

_images/references-8.png
plot_gc_content(fontsize=16, ec='k', bins=100)[source]

Plot GC content histogram

Parameters:
  • bins – number of bins or an array (with a copy() method)
  • ec – color of the bar contours (black by default)
from sequana import BAM, sequana_data
b = BAM(sequana_data('test.bam'))
b.plot_gc_content()

(Source code, png, hires.png, pdf)

_images/references-9.png
plot_indel_dist(fontsize=16)[source]

Plot indel count (+ ratio)

Returns:list of insertions, deletions and insertion/deletion ratios for different lengths, starting at 1
from sequana import sequana_data, BAM
b = BAM(sequana_data("measles.fa.sorted.bam"))
b.plot_indel_dist()

(Source code, png, hires.png, pdf)

_images/references-10.png

What you see on this figure is the presence of 10 insertions of length 1, 1 insertion of length 2 and 3 deletions of length 1

Note that in samtools, when several insertions or deletions occur in a single alignment, only the first one seems to be reported. For instance, 10M1I10M1I stores only one insertion in its report; the same comment holds for deletions.

plot_read_length()[source]

Plot occurrences of aligned read lengths

from sequana import sequana_data, BAM
b = BAM(sequana_data("test.bam"))
b.plot_read_length()

(Source code, png, hires.png, pdf)

_images/references-11.png
set_fast_stats()[source]
to_fastq(filename)[source]

Export the BAM to FastQ format

Warning

to be tested

Todo

comments from original reads ?

Method 1 (bedtools):

bedtools bamtofastq -i JB409847.bam  -fq test1.fastq

Method2 (samtools):

samtools bam2fq JB409847.bam > test2.fastq

Method3 (sequana):

from sequana import BAM
b = BAM(filename)
b.to_fastq("test3.fastq")

Note that the samtools method removes duplicated reads so the output is not identical to method 1 or 3.

class Alignment(alignment)[source]

Helper class to retrieve info about Alignment

Takes an alignment as read by BAM and provides a simplified version of pysam.Alignment class.

>>> from sequana.bamtools import Alignment
>>> from sequana import BAM, sequana_data
>>> b = BAM(sequana_data("test.bam"))
>>> segment = next(b)
>>> align = Alignment(segment)
>>> align.as_dict()
>>> align.FLAG
353

The original data is stored in hidden attribute _data and the following values are available as attributes or dictionary:

  • QNAME: the query template name. Reads/segments having the same QNAME come from the same template. A QNAME set to * indicates that the information is unavailable. In a SAM file, a read may occupy multiple alignment lines.
  • FLAG: combination of bitwise flags. See SAMFlags
  • RNAME: reference sequence name
  • POS: leftmost mapping position
  • MAPQ: mapping quality if the segment is mapped; equals -10 log10 Pr(mapping position is wrong)
  • CIGAR: see sequana.cigar.Cigar
  • RNEXT: reference sequence name of the primary alignment of the NEXT read in the template
  • PNEXT: position of the primary alignment of the NEXT read in the template
  • TLEN: signed observed template length
  • SEQ: segment sequence
  • QUAL: ASCII representation of the base quality

constructor

Parameters:alignment – alignment instance from BAM
as_dict()[source]
class SAMFlags(value=4095)[source]

Utility to extract bits from a SAM flag

>>> from sequana import SAMFlags
>>> sf = SAMFlags(257)
>>> sf.get_flags()
[1, 256]

You can also print the bits and their description:

print(sf)
bit Meaning/description
1 template having multiple segments in sequencing
2 each segment properly aligned according to the aligner
4 segment unmapped
8 next segment in the template unmapped
16 SEQ being reverse complemented
32 SEQ of the next segment in the template being reverse complemented
64 the first segment in the template
128 the last segment in the template
256 secondary alignment
512 not passing filters, such as platform/vendor quality controls
1024 PCR or optical duplicate
2048 supplementary alignment
Reference:http://samtools.github.io/hts-specs/SAMv1.pdf
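Decomposing a flag value into its bits is a simple bitwise loop, equivalent in spirit to get_flags():

```python
def decompose_flag(value):
    """Return the individual SAM flag bits set in value (12 bits maximum)."""
    return [1 << i for i in range(12) if value & (1 << i)]

bits = decompose_flag(257)  # [1, 256]: paired + secondary alignment
```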
get_flags()[source]

Return the individual bits included in the flag

get_meaning()[source]

Return all description sorted by bit

11.5. BEDTOOLS related (coverage)

Utilities for the genome coverage

class GenomeCov(input_filename, genbank_file=None, low_threshold=-3, high_threshold=3, ldtr=0.5, hdtr=0.5)[source]

Create a list of dataframe to hold data from a BED file generated with samtools depth.

This class can be used to plot the coverage resulting from a mapping, which is stored in BED format. The BED file may contain several chromosomes. They are handled independently and are accessible as a list of ChromosomeCov instances.

Example:

from sequana import GenomeCov, sequana_data

filename = sequana_data('JB409847.bed')
reference = sequana_data("JB409847.fasta")

gencov = GenomeCov(filename)
gencov.compute_gc_content(reference)

gencov = GenomeCov(filename)
for chrom in gencov:
    chrom.running_median(n=3001, circular=True)
    chrom.compute_zscore()
    chrom.plot_coverage()
gencov[0].plot_coverage()

(Source code, png, hires.png, pdf)

_images/references-12.png

Results are stored in a list of ChromosomeCov named chr_list.

constructor

Parameters:
  • input_filename (str) – the input data with results of a bedtools genomecov run. This is just a 3-column file. The first column is a string (chromosome), the second column is the base position and the third is the coverage.
  • genbank_file (str) – annotation file of your reference.
  • low_threshold (float) – threshold used to identify under-covered genomic regions of interest (ROI). Must be negative.
  • high_threshold (float) – threshold used to identify over-covered genomic regions of interest (ROI). Must be positive.
  • ldtr (float) – fraction of the low_threshold used to define the intermediate threshold in the double-threshold method. Must be between 0 and 1.
  • hdtr (float) – fraction of the high_threshold used to define the intermediate threshold in the double-threshold method. Must be between 0 and 1.
circular

Get the circularity of chromosome(s). It must be a boolean.

compute_coverage(window, circular=False, reference=None)[source]

Compute GC content (if reference provided), running_median/zscore for each chromosome.

compute_gc_content(fasta_file, window_size=101, circular=False, letters=['G', 'C', 'c', 'g'])[source]

Compute GC content of genome sequence.

Parameters:
  • fasta_file (str) – fasta file name.
  • window_size (int) – size of the sliding window.
  • circular (bool) – if the genome is circular (like bacteria chromosome)
Store the results in the ChromosomeCov.df attribute (dataframe) with a column named gc.
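The sliding-window computation can be sketched as follows for a linear genome (edge positions use a truncated window here, which is a simplification and may differ from Sequana's handling):

```python
import numpy as np

def gc_content(sequence, window_size=101):
    """Centered sliding-window GC fraction along a linear sequence."""
    # 1.0 where the base is G or C (either case), 0.0 elsewhere
    is_gc = np.array([base in "GCgc" for base in sequence], dtype=float)
    half = window_size // 2
    # mean of the window centered on each position, truncated at the edges
    return np.array([is_gc[max(0, i - half): i + half + 1].mean()
                     for i in range(len(is_gc))])

gc = gc_content("ATGCGCGCAT", window_size=5)
```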
feature_dict

Get the features dictionary of the genbank.

gc_window_size

Get or set the window size to compute the GC content.

genbank_filename

Get or set the genbank filename to annotate ROI detected with ChromosomeCov.get_roi(). Changing the genbank filename will configure the GenomeCov.feature_dict.

get_stats(output='json')[source]

Return basic statistics for each chromosome

Returns:dictionary with chromosome names as keys and statistics as values.

See also

ChromosomeCov.

hist(logx=True, logy=True, fignum=1, N=20, lw=2, **kwargs)[source]
to_csv(output_filename, **kwargs)[source]

Write all data in a csv.

Parameters:
  • output_filename (str) – csv output file name.
  • kwargs (**dict) –

    parameters of pandas.DataFrame.to_csv().

window_size

Get or set the window size to compute the running median. Size must be an integer.

class ChromosomeCov(df, genomecov, thresholds=None)[source]

Factory to manipulate coverage and extract region of interests.

Example:

from sequana import GenomeCov, sequana_data
filename = sequana_data("virus.bed")

gencov = GenomeCov(filename)

chrcov = gencov[0]
chrcov.running_median(n=3001)
chrcov.compute_zscore()
chrcov.plot_coverage()

df = chrcov.get_roi().get_high_roi()

(Source code, png, hires.png, pdf)

_images/references-13.png

The df variable contains a dataframe with the high regions of interest (over-covered)

See also

sequana_coverage standalone application

constructor

Parameters:
  • df – dataframe with position for a chromosome used within GenomeCov. Must contain the following columns: [“chr”, “pos”, “cov”]
  • thresholds – a data structure DoubleThresholds that holds the double threshold values.
bed
columns()[source]

Return immutable ndarray implementing an ordered, sliceable set.

compute_zscore(k=2, step=10, use_em=True, verbose=True)[source]

Compute zscore of coverage and normalized coverage.

Parameters:
  • k (int) – Number gaussian predicted in mixture (default = 2)
  • step (int) – (default = 10). This parameter is used to speed up computation and is ignored if the length of the coverage/sequence is below 100,000

Store the results in the df attribute (dataframe) with a column named zscore.

Note

running_median() needs to be called beforehand.

get_centralness()[source]

Proportion of central (normal) genome coverage

This is 1 - (number of non-normal data points) / (total length)

Note

depends on the thresholds attribute being used.

Note

depends slightly on W the running median window

get_cv()[source]

Return the coefficient of variation

The coefficient of variation (CV) is defined as sigma / mu

To get percentage, you must multiply by 100.

get_df()[source]
get_evenness()[source]

Return Evenness of the coverage

Reference:Konrad Oexle, Journal of Human Genetics 2016, Evaluation of the evenness score in NGS.

Works before or after normalisation, but leads to different results.

get_gaussians()[source]
get_gc_correlation()[source]

Return the correlation between the coverage and GC content

The GC content is the one computed in GenomeCov.compute_gc_content() (default window size is 101)

get_max_gc_correlation(reference)[source]

Plot correlation between coverage and GC content by varying the GC window

The GC content uses a moving window of size W. This parameter affects the correlation between coverage and GC. This function finds the optimal window length.

get_mean_cov()[source]
get_roi()[source]

Keep positions with zscore outside of the thresholds range.

Returns:a dataframe from FilteredGenomeCov

Note

depends on the thresholds low and high values.

get_size()[source]
get_stats(output='json')[source]

Return basic stats about the coverage data

get_var_coef()[source]
moving_average(n, circular=False)[source]

Compute moving average of the genome coverage

Parameters:
  • n – window’s size. Must be odd
  • circular (bool) – is the chromosome circular or not

Store the results in the df attribute (dataframe) with a column named ma.

plot_coverage(filename=None, fontsize=16, rm_lw=1, rm_color='#0099cc', rm_label='Running median', th_lw=1, th_color='r', th_ls='--', main_color='k', main_lw=1, main_kwargs={}, sample=True, set_ylimits=True)[source]

Plot coverage as a function of base position.

Parameters:
  • filename
  • rm_lw – line width of the running median
  • rm_color – line color of the running median
  • rm_label – label for the running median
  • th_lw – line width of the thresholds
  • th_color – line color of the thresholds
  • main_color – line color of the coverage
  • main_lw – line width of the coverage
  • sample – if there are more than 1,000,000 points, we use an integer step to skip data points. You can still plot all points at your own risk by setting this option to False
  • set_ylimits – we want to focus on the “normal” coverage, ignoring unusual excesses. To do so, we set the y-axis range between 0 and a maximum value, set to the minimum between 6 times the mean coverage and 1.5 times the maximum of the high-coverage threshold curve. If you want to leave the y-limits free, set this argument to False

Note

if there are more than 1,000,000 points, only 1,000,000 of them are shown, using an integer step (for instance, for 5,000,000 points, every fifth point is plotted).

In addition to the coverage, the running median and coverage confidence corresponding to the lower and upper zscore thresholds are shown.

Note

uses the thresholds attribute.

plot_gc_vs_coverage(filename=None, bins=None, Nlevels=6, fontsize=20, norm='log', ymin=0, ymax=100, contour=True, **kwargs)[source]
plot_hist_coverage(logx=True, logy=True, fontsize=16, N=20, fignum=1, hold=False, alpha=0.5, filename=None, **kw_hist)[source]
plot_hist_normalized_coverage(filename=None, binwidth=0.1, max_z=4)[source]

Barplot of the normalized coverage with gaussian fitting

plot_hist_zscore(fontsize=16, filename=None, max_z=6, binwidth=0.5, **hist_kargs)[source]

Barplot of the zscore values

running_median(n, circular=False)[source]

Compute running median of genome coverage

Parameters:
  • n (int) – window’s size.
  • circular (bool) – if a mapping is circular (e.g. bacteria whole genome sequencing), set to True

Store the results in the df attribute (dataframe) with a column named rm.

Changed in version 0.1.21: Use Pandas rolling function to speed up computation.
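With pandas rolling, the linear (non-circular) case can be sketched as below; a circular genome would additionally require padding the vector with its own ends before rolling (not shown here):

```python
import pandas as pd

def running_median(coverage, n):
    """Centered running median of a coverage vector (window size n, odd)."""
    # min_periods=1 lets edge positions use a truncated window
    return pd.Series(coverage).rolling(n, center=True, min_periods=1).median()

rm = running_median([10, 12, 11, 200, 11, 12, 10], n=3)
# the isolated spike at position 3 is absorbed by the median window
```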

set_gaussians(gaussians)[source]

Set the predicted gaussians when reading a CSV file generated by GenomeCov.

to_csv(filename=None, start=None, stop=None, **kwargs)[source]

Write CSV file of the dataframe.

Parameters:
  • filename (str) – csv output filename. If None, return string.
  • start (int) – start row index.
  • stop (int) – stop row index.

Params of pandas.DataFrame.to_csv():

Parameters:
  • columns (list) – columns you want to write.
  • header (bool) – determine if the header is written.
  • index (bool) – determine if the index is written.
  • float_format (str) – determine the float format.
class DoubleThresholds(low=-3, high=3, ldtr=0.5, hdtr=0.5)[source]

Simple structure to handle the double threshold for negative and positive sides

Used by GenomeCov and related classes.

dt = DoubleThresholds(-3,4,0.5,0.5)

This means the low threshold is -3 while the high threshold is 4. The two following values must be between 0 and 1 and are used to define the secondary thresholds, set to half the value of the main thresholds by default.

Internally, the main thresholds are stored in the low and high attributes. The secondary thresholds are derived from the main thresholds and the two ratios. The ratios are named ldtr and hdtr, for low double-threshold ratio and high double-threshold ratio. The secondary thresholds, denoted low2 and high2, are updated automatically if low, high, ldtr or hdtr are changed.
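The arithmetic behind the secondary thresholds can be sketched as follows (a minimal re-implementation for illustration, not the actual DoubleThresholds code):

```python
class DoubleThresholdsSketch:
    """low2/high2 derived from the main thresholds and the two ratios."""

    def __init__(self, low=-3, high=3, ldtr=0.5, hdtr=0.5):
        assert low < 0 and high > 0
        assert 0 <= ldtr <= 1 and 0 <= hdtr <= 1
        self.low, self.high, self.ldtr, self.hdtr = low, high, ldtr, hdtr

    @property
    def low2(self):
        # recomputed on access, so changing low or ldtr updates it
        return self.low * self.ldtr

    @property
    def high2(self):
        return self.high * self.hdtr

dt = DoubleThresholdsSketch(-3, 4, 0.5, 0.5)
# dt.low2 == -1.5 and dt.high2 == 2.0
```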

copy()[source]
get_args()[source]
hdtr
high
high2
ldtr
low
low2

11.6. Coverage (theoretical)

class Coverage(N=None, L=None, G=None, a=None)[source]

Utilities related to Lander and Waterman theory

We denote G the genome length in nucleotides and L the read length in nucleotides. These two numbers are in principle well defined since G is defined by biology and L by the sequencing machine.

The total number of reads sequenced during an experiment is denoted N. Therefore the total number of nucleotides is simply NL.

The depth of coverage (DOC) at a given nucleotide position is the number of times that a nucleotide is covered by a mapped read.

The theoretical fold-coverage is defined as :

a = NL/G

that is the average number of times each nucleotide is expected to be sequenced (in the whole genome). The fold-coverage is often denoted aX (e.g., 50X).

In the Coverage class, G and N are fixed at the beginning. Then, if one changes a, then N is updated and vice-versa so that the relation a=NL/G is always true:

>>> cover = Coverage(G=1000000, L=100)
>>> cover.N = 100000    # number of reads
>>> cover.a             # What is the mean coverage
10
>>> cover.a = 50
>>> cover.N
500000

From the equation aforementionned, and assuming the reads are uniformly distributed, we can answer a few interesting questions using probabilities.

In each chromosome, a read of length L could start at any position (except the last position L-1). So in a genome G with n_c chromosomes, there are G - n_c (L-1) possible starting positions. In general G >> n_c (L-1) so the probability that one of the N read starts at any specific nucleotide is N/G.

The probability that a read (of length L) covers a given position is L/G. The probability of not covering that location is 1-L/G. For N fragments, we obtain the probability \left(1-L/G\right)^N. So, the probability of covering a given location with at least one read is :

P = 1 - \left(1- \frac{L}{G}\right)^N

Since in general, N>>1, we have:

P = 1- \exp^{-NL/G}

From this equation, we can derive the fold-coverage required to have e.g., E=99\% of the genome covered:

a = log(-1/(E-1))

equivalent to

a = -log(1-E)
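The formulas above can be checked numerically with plain Python (no sequana dependency):

```python
import math

G, L, N = 1_000_000, 100, 100_000    # genome size, read length, number of reads
a = N * L / G                        # fold coverage: 10x
P = 1 - math.exp(-a)                 # probability that a given base is covered
# with a = 10, more than 99.99% of the genome is expected to be covered
```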

The method get_required_coverage() uses this equation. However, for numerical reasons, one should not provide E as an argument but (1-E). See get_required_coverage().

Other information can also be derived using the methods get_mean_number_contigs(), get_mean_contig_length() and get_mean_reads_per_contig().

See also

get_table() that provides a summary of all these quantities for a range of coverage.

Reference:http://www.math.ucsd.edu/~gptesler/283/FA14/slides/shotgun_14-handout.pdf
G

genome length

L

length of the reads

N

number of reads defined as aG/L

a

coverage defined as NL/G

get_mean_contig_length()[source]

Expected length of the contigs

\frac{(e^a-1)L}{a}

get_mean_number_contigs()[source]

Expected number of contigs

A binomial distribution with parameters N and p

(aG/L) \exp^{-a}

get_mean_reads_per_contig()[source]

Expected number of reads per contig

Number of reads divided by expected number of contigs:

N / (N\exp^{-a}) = e^a
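These expectations are easy to verify numerically; a quick sanity check based on the formulas above (my own arithmetic, not sequana output):

```python
import math

G, L, a = 1_000_000, 100, 10
N = a * G / L                        # 100,000 reads
n_contigs = N * math.exp(-a)         # expected number of contigs: about 4.54
reads_per_contig = math.exp(a)       # expected reads per contig: about 22026
# consistency check: n_contigs * reads_per_contig recovers N
```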

get_percent_genome_sequenced()[source]

Return percent of the genome covered

100 (1-\exp{-a})

get_required_coverage(M=0.01)[source]

Return the required coverage to ensure the genome is covered

A general question is what should be the coverage to make sure that e.g. E=99% of the genome is covered by at least a read.

The answer is:

a = \log(-1/(E-1))

This equation is correct but has a limitation due to floating-point precision. If one provides E=0.99, the answer is 4.6, but we are limited to a maximum coverage of about 36 when one provides E=0.9999999999999999, after which E is rounded to 1 on most computers. Besides, it is not convenient to enter all those nines. A scientific notation would be better, but that requires working with M=1-E instead of E.

a = \log(1/M)

So instead of asking the question what is the requested fold coverage to have 99% of the genome covered, we ask the question what is the requested fold coverage to have 1% of the genome not covered. This allows us to use M values as low as 1e-300 that is a fold coverage as high as 690.

Parameters:M (float) – this is the fraction of the genome not covered by any reads (e.g. 0.01 for 1%). See note above.
Returns:the required fold coverage
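The computation boils down to a single logarithm. A stand-alone sketch of the same idea (not the sequana code itself):

```python
import math

def required_coverage(M=0.01):
    # fold coverage a such that a fraction M of the genome stays uncovered:
    # M = exp(-a)  =>  a = log(1/M)
    return math.log(1.0 / M)
```

required_coverage(0.01) is about 4.6, and required_coverage(1e-300) is about 690, matching the limits quoted above.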


get_summary()[source]

Return a summary (dictionary) for the current fold coverage a

get_table(coverage=None)[source]

Return a summary dataframe for a set of fold coverage

Parameters:coverage (list) – if None, coverage list starts at 0.5 and ends at 10 with a step of 0.5

11.7. Access to online database (e.g. ENA)

Utilities to access to online FASTA, taxon, lineage …

class ENADownload[source]

Downloader to retrieve genome fasta files from ENA amongst other things

In order to facilitate the download of FASTA files (e.g. to build a Kraken DB), this class can be used to download a bunch of FASTA files, or just one given its accession.

Pre-defined lists are available from ENA. We refer to them as virus, plasmid, phage, archaealvirus, archaea, bacteria, organelle, viroid. In addition we have predefined lists within Sequana. For now, there is one named macaca_fascicularis.

Warning

the headers of the FASTA files are changed to add the GI number instead of the EMBL accession.

constructor

download_accession(acc, output='Custom')[source]

Download a specific FASTA file given its ENA accession number

download_archaea()[source]
download_archaealvirus()[source]
download_bacteria()[source]

organisms (May 2016)

Note

this download method takes the longest to complete: about 20 minutes on a good connection.

download_fasta(filelist, output_dir=None, from_ena=True)[source]

Download a FASTA (or list of)

Parameters:filelist – a name to find on the ENA web server OR the name of an accession number.

Warning

The file is named after the accession without the .X version suffix. If there are several variants (.1, .2), the latest one is used. This should not happen if the list is properly defined.

download_list()[source]

Download all standard lists of accession numbers from ENA

download_macaca()[source]
download_organelle()[source]
download_phage()[source]
download_plasmids()[source]
download_viroid()[source]
download_viruses()[source]
ena_id_to_gi_number(identifiers)[source]
switch_header_to_gi(acc)[source]

Kraken will only accept the GI from NCBI so we need to convert the ENA accession to GI numbers

class EUtilsTools[source]

Simple wrapper around EUtils to fetch basic information about an accession number

>>> from sequana.databases import EUtilsTools
>>> et = EUtilsTools()
>>> et.accession_to_info("K01711.1")
{'K01711.1': {'accession': '331784',
  'comment': 'Measles virus (strain Edmonston), complete genome',
  'gi': '331784',
  'identifier': 'gi|331784|gb|K01711.1|MEANPCG[331784]',
  'taxid': '11234'}}
accession_to_info(ids)[source]

An accession or list of them returns list of dictionaries

11.8. Experimental design

Module to handle experimental design files (adapters)

Sequencers or experimentalists create so-called design files to store information about the sequencing experiment. For example the name of the samples, the sample well, and the index sequences.

The format used to store the design information may vary from one sequencing platform to another. The design file can be used for many different purposes. Currently, we only use them to retrieve information about adapters.

Since there are different formats possible, we decided on a minimal and common set of information. For now, this is a CSV file with the following minimal header:

Sample_ID, Index1_Seq

or, for double-indexing:

Sample_ID, Index1_Seq, Index2_Seq
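Such a minimal design file can be parsed with the standard library alone; a sketch with a hypothetical sample (this is not the sequana parser):

```python
import csv
import io

# a minimal double-indexing design file, as described above
text = "Sample_ID,Index1_Seq,Index2_Seq\nS1,ATCACG,TTAGGC\n"
rows = list(csv.DictReader(io.StringIO(text)))
# rows[0] maps each header column to the sample's values
```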

Users should only use the ExpDesignAdapter class, which understands the different formats. Currently, design files created by Illumina MiSeq machines are among those recognised.

class ExpDesignAdapter(filename, verbose=True)[source]

Generic Experimental design class

This class is used to store the mapping between sample ID and adapter used.

The design information is stored as a dataframe in the attribute df.

The input format is not unique. There are currently 3 different possible input formats, handled by the specialised classes described below.

The dataframe index is the list of sample identifiers (Sample_ID). The columns contain at least the following:

Index1_Seq, Index1_ID, Index2_Seq, Index2_ID

Example:

from sequana import *
filename = sequana_data('test_expdesign_hiseq.csv')
eda = ExpDesignAdapter(filename)

constructor

Parameters:
class ExpDesignMiSeq(filename)[source]

Dedicated experimental design format from Illumina MiSeq sequencers

This MiSeq design format has the following format:

[Header]
blabla

[Reads]
blabla

[Settings]
blabla

[Data]
blabla

In the Data section, a CSV file is to be found with the following header:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,
Sample_Project,Description

The index column may be prefixed. For instance NFXX, where XX is the index, in which case the NF prefix should be dropped.
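Dropping such a prefix is a one-liner; a hypothetical helper (the NF value comes from the note above):

```python
def strip_index_prefix(value, prefix="NF"):
    # drop a known prefix (e.g. "NF") from an index identifier
    return value[len(prefix):] if value.startswith(prefix) else value
```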

If double-indexing, the header is:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,
index,I5_Index_ID,index2,Sample_Project,Description
filename = sequana_data("test_expdesign_miseq_illumina_1.csv")
ff = ExpDesignMiSeq(filename)
ff.df
class ExpDesignHiSeq(filename)[source]

Dedicated experimental design format created by a demultiplexing software.

This format is used by a demultiplexing software used locally at the biomics platform. The format of the header is:

FCID,Lane,SampleID,SampleRef,Index Seq, Description,Control,Recipe,Operator,Project

This is a format that may change in the future.

The SampleID column is converted into Sample_ID. Note that "Index Seq" may be empty, contain a single index sequence, or contain two index sequences separated by a "-" sign.

Note also that FCID stands for flowcell ID.
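The three possible states of the "Index Seq" field can be handled as follows (an illustrative helper, not the sequana code):

```python
def parse_index_seq(value):
    # "" -> (None, None); "ACGT" -> ("ACGT", None); "ACGT-TGCA" -> ("ACGT", "TGCA")
    if not value:
        return (None, None)
    parts = value.split("-")
    return (parts[0], parts[1] if len(parts) > 1 else None)
```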

class ExpDesignBase(filename)[source]

The Base class for all ExpDesignAdapter classes

The input filename must be a CSV file with at least the following column in the header:

Sample_ID

Derived classes must define at least Index1_Seq and possibly Index2_Seq.

Examples of specialised classes are ExpDesignMiSeq, ExpDesignHiSeq.

check()[source]

Check the presence of the Sample_ID column

read()[source]

Read a CSV file

11.9. FASTQ module

Utilities to manipulate FASTQ and Reads

class Identifier(identifier, version='unknown')[source]

Class to interpret Read’s identifier

Warning

Implemented for Illumina 1.8+ and 1.4. In other cases, the identifier is simply stored without interpretation

>>> from sequana import Identifier
>>> ident = Identifier('@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG')
>>> ident.info['x_coordinate']
'15343'

Currently, the following identifiers will be recognised automatically:

Illumina_1_4:

An example is

@HWUSI-EAS100R:6:73:941:1973#0/1
Illumina_1_8:

An example is:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Other that could be implemented are NCBI

@FSRRS4401BE7HA [length=395] [gc=36.46] [flows=800] [phred_min=0] [phred_max=40] [trimmed_length=95]

Information can also be found here http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm
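As an illustration of the Illumina 1.8+ layout, the colon-separated fields of the first part can be split as shown below. The field names are my own reading of the format and are not necessarily the exact keys used by Identifier:

```python
def parse_illumina_1_8(identifier):
    # layout: @instrument:run:flowcell:lane:tile:x:y pair:filtered:control:index
    left, _, _right = identifier.lstrip("@").partition(" ")
    keys = ["instrument", "run_id", "flowcell_id", "lane", "tile",
            "x_coordinate", "y_coordinate"]
    return dict(zip(keys, left.split(":")))

info = parse_illumina_1_8("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
```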

class FastQ(filename, verbose=False)[source]

Class to handle FastQ files

Some of the methods are based on pysam but a few are also original to sequana. In general, input can be zipped or not, and output can be zipped or not (based on the extension).

An example is the extract_head() method:

f = FastQ("input_file.fastq.gz")
f.extract_head(100000, output='test.fastq')
f.extract_head(100000, output='test.fastq.gz')

equivalent to:

zcat myreads.fastq.gz | head -100000 | gzip > test100k.fastq.gz
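A minimal re-implementation of the same idea, handling gzipped input and output based on the file extension (a sketch, not the sequana implementation):

```python
import gzip
import itertools

def extract_head(input_filename, N, output_filename):
    # copy the first N lines; gzip transparently based on the .gz extension
    opener_in = gzip.open if input_filename.endswith(".gz") else open
    opener_out = gzip.open if output_filename.endswith(".gz") else open
    with opener_in(input_filename, "rt") as fin:
        with opener_out(output_filename, "wt") as fout:
            fout.writelines(itertools.islice(fin, N))
```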

An efficient implementation to count the number of lines is also available:

f.count_lines()

or reads (assuming 4 lines per read):

f.count_reads()

Operators available:

  • equality ==
count_lines()[source]

Return number of lines

count_reads()[source]

Return count_lines divided by 4

extract_head(N, output_filename)[source]

Extract the heads of a FastQ files

Parameters:
  • N (int) –
  • output_filename (str) – Based on the extension the output file is zipped or not (.gz extension only)

This function is convenient since it takes into account whether the input file is compressed and whether the output file should be compressed. It is in general 2-3 times faster than the equivalent unix commands combined, but about 10 times slower in the case of uncompressed input and uncompressed output.

Warning

this function extracts the N first lines and does not check whether there are empty lines in your FastQ/FastA files.

filter(identifiers_list=[], min_bp=None, max_bp=None, progressbar=True, output_filename='filtered.fastq', remove=True)[source]

Filter reads

Parameters:
  • min_bp (int) – ignore reads with length shorter than min_bp
  • max_bp (int) – ignore reads with length above max_bp
joining(pattern, output_filename)[source]

not implemented

zcat Block*.fastq.gz | gzip > combined.fastq.gz

n_lines

return number of lines (should be 4 times number of reads)

n_reads

return number of reads

next()[source]
rewind()[source]

Allows iterating from the beginning without re-opening the file or creating a new instance.

select_random_reads(N=None, output_filename='random.fastq')[source]

Select random reads and save in a file

Parameters:
  • N (int) – number of random unique reads to select. You should provide a number, but a list of read indices can be used as well: you can select random reads for R1 and re-use the returned list as input for R2 (since pairs must be kept)
  • output_filename (str) –

If you have a pair of files, the same reads must be selected in R1 and R2.:

f1 = FastQ(file1)
selection = f1.select_random_reads(N=1000)
f2 = FastQ(file2)
f2.select_random_reads(selection)
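The pairing constraint can be met by drawing the indices once and reusing them; a sketch of the selection step (my helper, not the sequana code):

```python
import random

def select_random_read_indices(n_reads, N, seed=None):
    # draw N distinct read indices; reuse the same list for R1 and R2
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_reads), N))
```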
split_chunks(N=10)[source]

Not implemented

split_lines(N=100000, gzip=True)[source]

Not implemented

to_kmer_content(k=7)[source]

Return a Series with kmer count across all reads

Parameters:k (int) – (default to 7-mers)
Returns:Pandas Series with index as kmer and values as count.

Takes about 30 seconds on a million reads.

to_krona(k=7, output_filename='fastq.krona')[source]

Save Krona file with ACGT content within all k-mers

Parameters:k (int) – (default to 7-mers)

Save results in file, which can then be translated into a HTML file using:

ktImportText fastq.krona 
open text.krona.html
class FastQC(filename, max_sample=500000, dotile=False, verbose=True)[source]

Simple QC diagnostic

Similarly to some of the plots of the FastQC tool, we scan the FastQ file and generate some diagnostic plots. The interest is that we will be able to create more advanced plots later on.

Here is an example of the boxplot quality across all bases:

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
qc.boxplot_quality()


Warning

some plots will work for Illumina reads only right now

Note

Although all reads are parsed (e.g. to count the number of nucleotides), some information uses a limited number of reads (e.g. qualities), which is set to 500,000 by default.

constructor

Parameters:
  • filename
  • max_sample (int) – Large files will not fit in memory. We therefore restrict the number of reads used for some of the statistics to 500,000. This also reduces the amount of time required to get a good feeling of the data quality. The entire input file is parsed though; this is required, for instance, to get the number of nucleotides.
boxplot_quality(hold=False, ax=None)[source]

Boxplot quality

Same plot as in FastQC, that is the average quality for all bases. In addition, a 1-sigma error envelope is shown (yellow).

Background colours separate zones of good, average and bad quality (arbitrary).

get_actg_content()[source]
get_stats()[source]
histogram_gc_content()[source]

Plot histogram of GC content

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
qc.histogram_gc_content()

histogram_sequence_coordinates()[source]

Histogram of the sequence coordinates on the plate

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
qc.histogram_sequence_coordinates()

Note

in this data set all points have the same coordinates.

histogram_sequence_lengths(logy=True)[source]

Histogram sequence lengths

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
qc.histogram_sequence_lengths()

imshow_qualities()[source]

Qualities

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
qc.imshow_qualities()
from pylab import tight_layout; tight_layout()
plot_acgt_content(stacked=False)[source]

Plot ACGT content

from sequana import sequana_data
from sequana import FastQC
filename  = sequana_data("test.fastq", "testing")
qc = FastQC(filename)
qc.plot_acgt_content()


11.10. FASTA module

Utilities to manipulate FASTA files and sequences

class FastA(filename, verbose=False)[source]

Class to handle FastA files. Cannot be compressed

comments
format_contigs_denovo(output_file, len_min=500)[source]

Replace NODE with the project name and remove contigs with a length lower than len_min.

Parameters:
  • output_file (str) – output file name.
  • len_min (int) – minimal length of contigs.

Example:

from sequana import FastA

contigs = FastA("denovo_assembly.fasta")
contigs.format_contigs_denovo("path/to/file.fasta", len_min=500)

Results are stored in "path/to/file.fasta".

lengths
names
next()[source]
sequences

11.11. Sequence module

class DNA(sequence, codons_stop=['TAA', 'TGA', 'TAG'], codons_stop_rev=['TTA', 'TCA', 'CTA'], codons_start=['ATG'], codons_start_rev=['CAT'])[source]

Simple DNA class

>>> d = DNA("ACGTTTT")
>>> d.complement
>>> d.reverse_complement
AT_skew
GC_skew
ORF_pos
barplot_count_ORF_CDS_by_frame(alpha=0.5, bins=40, xlabel='Frame', ylabel='#', bar_width=0.35)[source]
hist_ORF_CDS_linearscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]
hist_ORF_CDS_logscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]
plot_all_skews(figsize=(10, 12), fontsize=16, alpha=0.5)[source]
threshold
type_filter
type_window
window
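The GC_skew attribute listed above is conventionally computed as (G-C)/(G+C) over a window; a stand-alone sketch of that convention (not necessarily the exact sequana definition):

```python
def gc_skew(sequence):
    # strand-asymmetry measure: (G - C) / (G + C)
    g, c = sequence.count("G"), sequence.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0
```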
class RNA(sequence)[source]

Simple RNA class

>>> d = RNA("ACGUUUU")
>>> d.complement
>>> d.reverse_complement
class Repeats(filename_fasta, merge=False, name=None)[source]

Class for finding repeats in DNA or RNA linear sequences.

Computation is performed each time the threshold is set to a new value.

from sequana import sequana_data, Repeats
rr = Repeats(sequana_data("measles.fa"))
rr.threshold = 4
rr.hist_length_repeats()


Note

Works with shustring package from Bioconda (April 2017)

Todo

use a specific sequence (right now it is the first one)

Constructor

Input must be a fasta file with valid DNA or RNA characters

Parameters:
  • filename_fasta (str) – a Fasta file, only the first sequence is used !
  • threshold (int) – Minimal length of repeat to output
  • name (str) – if name is provided, scan the Fasta file and select the corresponding sequence. if you want to analyse all sequences, you need to use a loop by setting _header for each sequence with the sequence name found in sequence header.
begin_end_repeat_position
df_shustring

Return a dataframe with the shortest unique substring length at each position. Shortest unique substrings are unique in the sequence and its complement. Uses the shustring tool.

do_merge
header

Get the first line of the FASTA file (needed as input for shustring), with spaces replaced by underscores.

hist_length_repeats(bins=20, alpha=0.5, hold=False, fontsize=12, grid=True, title='Repeat length', xlabel='Repeat length', ylabel='#')[source]

Plots histogram of the repeat lengths

length
list_len_repeats
longest_shustring
threshold
class Sequence(sequence, complement_in=b'ACGT', complement_out=b'TGCA', letters='ACGT')[source]

Abstract base class for other specialised sequences such as DNA.

Sequence is the base class for other classes such as DNA and RNA.

from sequana import Sequence
s = Sequence("ACGT")
s.stats()
s.get_complement()

Note

You may use a Fasta file as input (see constructor)

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:
  • sequence (str) – may be a plain sequence string or the name of a FASTA file, in which case only the first sequence is used.
  • complement_in
  • complement_out
  • letters – authorised letters. Used in check() only.

Todo

use counter only once as a property

check()[source]

Check that all letters are valid

complement()[source]

Alias to get_complement()

gc_content()[source]

Return mean GC content

get_complement()[source]

Return complement

get_occurences(pattern, overlap=False)[source]

Return position of the input pattern in the sequence

>>> from sequana import Sequence
>>> s = Sequence('ACGTTTTACGT')
>>> s.get_occurences("ACGT")
[0, 7]
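The doctest above can be reproduced with str.find; a sketch including an overlap option (my re-implementation, not the sequana source):

```python
def find_occurences(sequence, pattern, overlap=False):
    # return the start positions of pattern in sequence
    positions, start = [], 0
    while True:
        i = sequence.find(pattern, start)
        if i == -1:
            return positions
        positions.append(i)
        # with overlap, advance one base; otherwise jump past the match
        start = i + (1 if overlap else len(pattern))
```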
get_reverse()[source]

Return reverse sequence

get_reverse_complement()[source]

Return reverse complement

reverse()[source]

Alias to get_reverse()

reverse_complement()[source]

Alias to get_reverse_complement

sequence
stats()[source]

Return basic stats about the number of letters

11.12. Kmer module

build_kmer(length=6, letters='CG')[source]

Return list of kmer of given length based on a set of letters

Returns:list of kmers
get_kmer(sequence, k=7)[source]

Given a sequence, return consecutive kmers

Returns:iterator of kmers
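Both functions are short; minimal sketches consistent with the signatures above (itertools-based, not the sequana source):

```python
from itertools import product

def build_kmer(length=6, letters="CG"):
    # all k-mers of the given length over the given alphabet
    return ["".join(p) for p in product(letters, repeat=length)]

def get_kmer(sequence, k=7):
    # yield consecutive, overlapping k-mers of a sequence
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]
```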

11.13. IOTools module

class YamlDocParser(filename)[source]

A simple parser to extract block content to be found in YAML files

To create tooltips automatically in Sequanix (the GUI for Snakemake workflows), one can comment YAML configuration files with block comments (see the Developer guide).

Once read and parsed, all block comments before top-level sections are to be found in the dictionary sections.

from sequana import snaketools
from sequana.iotools import YamlDocParser
module = snaketools.Module('quality_control')
r = YamlDocParser(module.config)
r.sections['fastqc']

Those lines are removed from the docstring but available as a dictionary

constructor

Parameters:filename (str) – the YAML file to parse
# main documentation

# block comment
section1:
    - item

# block comment
section2:

# a comment

section3:

Here, section1 and section2 have block comments but not section3

11.14. Taxonomy related (Kraken - Krona)

class KrakenResults(filename='kraken.out')[source]

Translate Kraken results into a Krona-compatible file

If you run a kraken analysis with KrakenAnalysis, you will end up with a file e.g. named kraken.out (by default).

You could use kraken-translate but then you need extra parsing to convert into a Krona-compatible file. Here, we take the output from kraken and directly transform it to a krona-compatible file.

k = KrakenResults("kraken.out")
k.kraken_to_krona()

The expected input format looks like:

C    HISEQ:426:C5T65ACXX:5:2301:18719:16377    1    203    1:71 A:31 1:71
C    HISEQ:426:C5T65ACXX:5:2301:21238:16397    1    202    1:71 A:31 1:71

Where each row corresponds to one read.

"562:13 561:4 A:31 0:1 562:3" would indicate that:

the first 13 k-mers mapped to taxonomy ID #562
the next 4 k-mers mapped to taxonomy ID #561
the next 31 k-mers contained an ambiguous nucleotide
the next k-mer was not in the database
the last 3 k-mers mapped to taxonomy ID #562

See kraken documentation for details.
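Parsing that k-mer assignment string is straightforward; a sketch (my helper, not part of sequana):

```python
def parse_kmer_assignments(field):
    # "562:13 561:4 A:31 0:1 562:3" -> [("562", 13), ("561", 4), ...]
    return [(taxon, int(count))
            for taxon, count in (token.split(":") for token in field.split())]

pairs = parse_kmer_assignments("562:13 561:4 A:31 0:1 562:3")
```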

Note

This takes care of fetching taxons and the corresponding lineages from online web services.

constructor

Parameters:filename – the input from KrakenAnalysis class
boxplot_classified_vs_read_length()[source]

Show distribution of the read length grouped by classified or not

df
get_taxonomy_biokit(ids)[source]

Retrieve taxons given a list of taxons

Parameters:ids (list) – list of taxons as strings or integers. Could also be a single string or a single integer
Returns:a dataframe

Note

the first call first loads all taxons in memory and takes a few seconds but subsequent calls are much faster

kraken_to_csv(filename, dbname)[source]
kraken_to_json(filename, dbname)[source]
kraken_to_krona(output_filename=None, mode=None, nofile=False)[source]
Returns:status: True if everything went fine, otherwise False
plot(kind='pie', cmap='copper', threshold=1, radius=0.9, textcolor='red', **kargs)[source]

A simple non-interactive plot of taxons

Returns:None if no taxon was found, a dataframe otherwise

A Krona Javascript output is also available in kraken_to_krona()

from sequana import KrakenResults, sequana_data
test_file = sequana_data("test_kraken.out", "testing")
k = KrakenResults(test_file)
df = k.plot(kind='pie')


See also

to generate the data see KrakenPipeline or the standalone application sequana_taxonomy.

taxons
to_js(output='krona.html', onweb=False)[source]
class KrakenPipeline(fastq, database, threads=4, output_directory='kraken', dbname=None)[source]

Used by the standalone application sequana_taxonomy

This runs Kraken on a set of FastQ files, transform the results in a format compatible for Krona, and creates a Krona HTML report.

from sequana import KrakenPipeline
kt = KrakenPipeline(["R1.fastq.gz", "R2.fastq.gz"], database="krakendb")
kt.run()
kt.show()

Warning

We do not provide Kraken database within sequana. You may either download a database from https://ccb.jhu.edu/software/kraken/ or use this class to download a toy example that will be stored in e.g .config/sequana under Unix platforms. See KrakenDownload.

See also

We provide a standalone application of this class, which is called sequana_taxonomy and can be used within a command shell.

Constructor

Parameters:
  • fastq – either a fastq filename or a list of 2 fastq filenames
  • database – the path to a valid Kraken database
  • threads – number of threads to be used by Kraken
  • output – output filename of the Krona HTML page

Description: internally, once Kraken has performed an analysis, reads are associated to a taxon (or not). We then find the corresponding lineage and scientific names to store within a Krona-formatted file. ktImportText is then used to create the Krona page.

run(output_filename_classified=None, output_filename_unclassified=None, only_classified_output=False)[source]

Run the analysis using Kraken and create the Krona output

Todo

reuse the KrakenResults code to simplify this method.

show()[source]

Opens the filename defined in the constructor

class KrakenAnalysis(fastq, database, threads=4)[source]

Run kraken on a set of FastQ files

In order to run a Kraken analysis, we first need a local database. We provide a toy example. The ToyDB can be downloaded as follows (you will need to run the following code only once):

from sequana import KrakenDownload
kd = KrakenDownload()
kd.download_kraken_toydb()

See also

KrakenDownload for more database and sequana.kraken_builder.KrakenBuilder to build your own databases

The path to the database is required to run the analysis. It is stored in the directory ~/.config/sequana/kraken_toydb under Linux platforms. The following code should be platform independent:

import os
from sequana import sequana_config_path
database = sequana_config_path + os.sep + "kraken_toydb"

Finally, we can run the analysis on the toy data set:

from sequana import sequana_data
data = sequana_data("Hm2_GTGAAA_L005_R1_001.fastq.gz", "data")
ka = KrakenAnalysis(data, database=database)
ka.run()

This creates a file named kraken.out. It can be interpreted with KrakenResults

Constructor

Parameters:
  • fastq – either a fastq filename or a list of 2 fastq filenames
  • database – the path to a valid Kraken database
  • threads – number of threads to be used by Kraken
  • output – output filename of the Krona HTML page
  • return
run(output_filename=None, output_filename_classified=None, output_filename_unclassified=None, only_classified_output=False)[source]

Performs the kraken analysis

Parameters:
  • output_filename (str) – if not provided, a temporary file is used and stored in kraken_output.
  • output_filename_classified (str) – not compressed
  • output_filename_unclassified (str) – not compressed
class KrakenDownload[source]

Utility to download Kraken DB and place them in a local directory

from sequana import KrakenDownload
kd = KrakenDownload()
kd.download('toydb')
kd.download('minikraken')

A large database (8Gb) is available on synapse and has the following DOI:

doi:10.7303/syn6171000

It can be downloaded manually or if you have a Synapse login (https://www.synapse.org), you can use:

from sequana import KrakenDownload
kd = KrakenDownload()
kd.download("sequana_db1")
download(name, verbose=True)[source]
dv = <easydev.tools.DevTools object>
class KrakenHierarchical(filename_fastq, fof_databases, threads=1, output_directory='./kraken_hierarchical/', keep_temp_files=False, force=False)[source]

Kraken Hierarchical Analysis

This runs Kraken on a FastQ file with multiple k-mer databases in a hierarchical way. Unclassified sequences with the first database are input for the second, and so on.

The input may be a single FastQ file or paired, gzipped or not. FastA are also accepted.

constructor

Parameters:
  • filename_fastq – FastQ file to analyse
  • fof_databases – file that contains a list of database paths (one per line). The order is important. Note that you may also provide a list of database paths directly.
  • threads – number of threads to be used by Kraken
  • output_directory – name of the output directory
  • keep_temp_files – bool; if True, keep intermediate files from each Kraken analysis and save an HTML report at each step
  • force (bool) – if the output directory already exists, the instantiation fails so that existing data are not overwritten. If you wish to overwrite the existing directory, set this parameter to True.
run(dbname='multiple', output_prefix='kraken_final')[source]

Run the hierarchical analysis

This method does not return anything but creates a set of files:

  • kraken_final.out
  • krona_final.html
  • kraken.png (pie plot of the classified/unclassified reads)

Note

the databases are run in the order provided in the constructor.

class KronaMerger(filename)[source]

Utility to merge two Krona files

Imagine those two files (formatted for Krona; first column is a counter):

14011   Bacteria    Proteobacteria  species1
591 Bacteria    Proteobacteria  species4

184 Bacteria    Proteobacteria  species3
132 Bacteria    Proteobacteria  species2
32  Bacteria    Proteobacteria  species1

You can merge the two files. The first and last lines correspond to the same taxon (species1) so we should end up with a new Krona file with 4 lines only.

The test files are available within Sequana as test_krona_k1.tsv and test_krona_k2.tsv:

from sequana import KronaMerger, sequana_data
k1 = KronaMerger(sequana_data("test_krona_k1.tsv"))
k2 = KronaMerger(sequana_data("test_krona_k2.tsv"))
k1 += k2
# Save the results. Note that it must be tabulated for Krona external usage
k1.to_tsv("new.tsv")

Warning

the separator must be a tabulation (tab character)
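The merging logic amounts to summing counts per lineage; a stand-alone sketch using the two example files shown above (not the KronaMerger implementation):

```python
from collections import Counter

def merge_krona_lines(*files):
    # sum the first-column counts for identical (tab-separated) lineages
    counts = Counter()
    for lines in files:
        for line in lines:
            count, *lineage = line.rstrip("\n").split("\t")
            counts[tuple(lineage)] += int(count)
    return counts

k1 = ["14011\tBacteria\tProteobacteria\tspecies1",
      "591\tBacteria\tProteobacteria\tspecies4"]
k2 = ["184\tBacteria\tProteobacteria\tspecies3",
      "132\tBacteria\tProteobacteria\tspecies2",
      "32\tBacteria\tProteobacteria\tspecies1"]
merged = merge_krona_lines(k1, k2)   # 4 lineages; species1 = 14011 + 32
```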

constructor

Parameters:filename (str) –
to_tsv(output_filename)[source]

Save the content into a new file in TSV format

class KrakenBuilder(dbname)[source]

This class will help you building a custom Kraken database

You will need a few steps and, depending on the FASTA files you want to include, lots of resources (memory- and space-wise). In the following example, we will be reasonable and use only virus FASTA files.

First, we need to create the data structure directory. Let us call it virusdb:

from sequana import KrakenBuilder
kb = KrakenBuilder("virusdb")

We then need to download a large taxonomic database from NCBI. You may already have a local copy, in which case you would need to copy it in virusdb/taxonomy directory. If not, type:

kb.download_taxonomy()

The virusdb/taxonomy directory will contain about 8.5G of data.

Note that this currently requires the unix tools wget and tar.

Then, we need to add some fasta files. You may download specific FASTA files if you know the accession numbers using download_accession(). However, we also provide a method to download all viruses from ENA:

kb.download_viruses()

This will take a while to download the more than 4500 FASTA files (10 minutes on a good connection). You will end up with a data set of about 100 Mb of FASTA files.

If you wish to download other FASTA (e.g. all bacteria), you will need to use another class from the sequana.databases:

from sequana.databases import ENADownload
ena = ENADownload()
ena.download_fasta("bacteria.txt", output_dir="virusdb/library/added")

Please see the documentation for more options and list of species to download.

It is now time to build the DB itself. This is based on the kraken tool. You may do it yourself in a shell:

kraken-build --rebuild -db virusdb --minimizer-len 10 --max-db-size 4 --threads 4 \
    --kmer-len 26 --jellyfish-hash-size 500000000

Or use the KrakenBuilder. First, you need to look at the params attribute. The most important keys/values that affect the size of the DB are:

kb.params['kmer_length']     # max value is 31
kb.params['max_db_size']     # max size of the DB files in Gb
kb.params['minimizer_len']

To create a small DB quickly, we set those values:

kb.params['kmer_length']  = 26
kb.params['minimizer_len'] = 10

However, for production, we would recommend 31 and 13 (default)

This takes about 2 minutes to build and the final DB is about 800Mb.

Lots of useless files remain in the directory and can be removed using kraken itself. However, we do a little bit more and therefore have our own cleaning function:

kb.clean_db()

Kraken-build uses jellyfish. The hash_size parameter is the jellyfish hash_size parameter. If you set it to 6400M, the memory required is about 6.9 bytes times 6400M, that is 40Gb of memory. The default value used here means 3.5Gb are required.

The size required to store the DB itself should be

s*D + 8 * 4^M

where s is about 12 bytes (used to store a kmer/taxon pair), D is the number of distinct kmers in the final database (which cannot be estimated beforehand), and M is the minimizer length parameter.
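As a sanity check, the formula can be evaluated directly (kraken_db_size is an illustrative helper, not part of sequana):

```python
def kraken_db_size(D, M, s=12):
    """Estimated DB size in bytes: s*D + 8 * 4**M (formula above)."""
    return s * D + 8 * 4 ** M

# With the recommended minimizer length M=13, the 8 * 4**M term alone
# is already about half a gigabyte:
print(kraken_db_size(0, 13) / 1e9)  # 0.536870912
```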

Constructor

Parameters:dbname (str) – Create the Kraken DB in this directory
clean_db()[source]

Once called, you will not be able to append more FASTA files

download_accession(acc)[source]

Download a specific FASTA file from ENA given its accession number

Note that if you want to add specific FASTA files from ENA, you must use this function to make sure the header will be understood by Kraken: the header must use a GI number (not an ENA identifier)

download_taxonomy(force=False)[source]

Download kraken data

The downloaded file is large (1.3Gb) and the unzipped file is about 9Gb.

If already present, do not download the file except if the force parameter is set to True.

download_viruses()[source]
get_gis(extensions=['fa'])[source]
get_taxons_from_gis(gis, filename='gi_taxid_nucl.dmp')[source]
init()[source]
run(dbs=[], download_taxon=True)[source]

Create the Custom Kraken DB

  1. download taxonomy files
  2. Load the DBs (e.g. viruses)
  3. Build DB with kraken-build
  4. Clean it up

11.15. Pacbio module

Pacbio QC and stats

class BAMPacbio(filename)[source]

BAM reader for Pacbio (reads)

You can read a file as follows:

from sequana.pacbio import BAMPacbio
from sequana import sequana_data
filename = sequana_data("test_pacbio_subreads.bam")
b = BAMPacbio(filename)

A summary of the data is stored in the attribute df. It contains information such as the length of the reads, the ACGT content, the GC content.

Several plotting methods are available. For instance, hist_snr().

Constructor

Parameters:filename (str) – filename of the input pacbio BAM file. The content of the BAM file is not the output of a mapper. Instead, it is the output of a Pacbio (Sequel) sequencer (e.g., subreads).
df
filter_length(output_filename, threshold_min=0, threshold_max=inf)[source]

Select and Write reads within a given range

Parameters:
  • output_filename (str) – name of output file
  • threshold_min (int) – minimum length of the reads to keep
  • threshold_max (int) – maximum length of the reads to keep
hist_GC(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='GC %', ylabel='#', label='', title=None)

Plot histogram GC content

Parameters:
  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) – fontsize of the x and y labels and title.
  • grid (bool) – add grid or not
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.hist_GC()

(Source code, png, hires.png, pdf)

_images/references-21.png
hist_ZMW_subreads(alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='Number of ZMW passes', ylabel='#', label='', title='Number of ZMW passes')[source]

Plot histogram of number of reads per ZMW (number of passes)

Parameters:
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.hist_ZMW_subreads()

(Source code, png, hires.png, pdf)

_images/references-22.png
hist_len(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='Read Length', ylabel='#', label='', title=None)

Plot histogram Read length

Parameters:
  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.hist_len()

(Source code, png, hires.png, pdf)

_images/references-23.png
hist_snr(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='SNR', ylabel='#', title='')[source]

Plot histogram of the ACGT SNRs for all reads

Parameters:
  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.hist_snr()

(Source code, png, hires.png, pdf)

_images/references-24.png
nb_pass

number of passes (ZMW)

plot_GC_read_len(hold=False, fontsize=12, bins=[40, 40], grid=True, xlabel='GC %', ylabel='#')

Plot GC content versus read length

Parameters:
  • hold (bool) –
  • fontsize (int) – for x and y labels and title
  • bins – an integer or tuple of 2 integers to specify the binning of the x and y 2D histogram.
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.plot_GC_read_len(bins=[10, 10])

(Source code, png, hires.png, pdf)

_images/references-25.png
random_selection(output_filename, nreads=None, expected_coverage=None, reference_length=None)[source]

Select random reads

Parameters:
  • nreads – number of reads to select randomly. Must be less than the number of available reads in the original file.
  • expected_coverage
  • reference_length

If expected_coverage and reference_length are provided, nreads is computed automatically.
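The sketch below shows how nreads could be derived from an expected coverage; nreads_for_coverage and its mean_read_length argument are hypothetical illustrations, not the sequana API.

```python
def nreads_for_coverage(expected_coverage, reference_length, mean_read_length):
    # coverage = nreads * mean_read_length / reference_length, solved for nreads
    return int(expected_coverage * reference_length / mean_read_length)

# 10X coverage of a 1 Mb reference with 5 kb reads:
print(nreads_for_coverage(10, 1_000_000, 5000))  # 2000
```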

reset()
save_summary(filename)[source]
stats

return basic stats about the read length

stride(output_filename, stride=10, shift=0, random=False)[source]

Write a subset of reads to BAM output

Parameters:
  • output_filename (str) – name of output file
  • stride (int) – optional; for every stride reads read, output one read
  • shift (int) – number of reads to ignore at the beginning of the input file
  • random (bool) – if True, at each step the read to output is randomly selected
summary()[source]
to_fasta(output_filename, threads=2)

Export BAM reads into a Fasta file

Parameters:
  • output_filename – name of the output file (use .fasta extension)
  • threads (int) – number of threads to use

Note

this executes a shell command based on samtools

Warning

this takes a few minutes for 500,000 reads

to_fastq(output_filename, threads=2)

Export BAM reads into FastQ file

class PBSim(input_bam, simul_bam)[source]

Filter an input BAM file (simulated with pbsim) so as to keep reads that fit a target distribution.

This uses a Metropolis-Hastings (MH) algorithm behind the scenes.

from pylab import clf
# "simulated.bam" (output BAM) is an illustrative filename
ss = pacbio.PBSim("test10X.bam", "simulated.bam")
clf()
ss.run(bins=100, step=50)
run(bins=50, xmin=0, xmax=30000, step=1000, burn=1000, alpha=1, output_filename=None)[source]
target_distribution(xprime)[source]

The target distribution

Compute the histogram to get X, Y. Given xprime, interpolate to get yprime (using e.g. np.interp).

class BAMSimul(filename)[source]

BAM reader for Pacbio simulated reads (PBsim)

A summary of the data is stored in the attribute df. It contains information such as the length of the reads, the ACGT content, the GC content.

Constructor

Parameters:filename (str) – filename of the input pacbio BAM file. The content of the BAM file is not the output of a mapper. Instead, it is the output of a Pacbio (Sequel) sequencer (e.g., subreads).
df
filter_bool(output_filename, mask)[source]

Select and Write reads using a mask

Parameters:
  • output_filename (str) – name of output file
  • mask (list) – True to write the read to the output, False to ignore it
filter_length(output_filename, threshold_min=0, threshold_max=inf)[source]

Select and Write reads within a given range

Parameters:
  • output_filename (str) – name of output file
  • threshold_min (int) – minimum length of the reads to keep
  • threshold_max (int) – maximum length of the reads to keep
hist_GC(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='GC %', ylabel='#', label='', title=None)

Plot histogram GC content

Parameters:
  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) – fontsize of the x and y labels and title.
  • grid (bool) – add grid or not
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.hist_GC()

(Source code, png, hires.png, pdf)

_images/references-26.png
hist_len(bins=50, alpha=0.5, hold=False, fontsize=12, grid=True, xlabel='Read Length', ylabel='#', label='', title=None)

Plot histogram Read length

Parameters:
  • bins (int) – binning for the histogram
  • alpha (float) – transparency of the histograms
  • hold (bool) –
  • fontsize (int) –
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
  • label (str) – label of the histogram (for the legend)
  • title (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.hist_len()

(Source code, png, hires.png, pdf)

_images/references-27.png
plot_GC_read_len(hold=False, fontsize=12, bins=[40, 40], grid=True, xlabel='GC %', ylabel='#')

Plot GC content versus read length

Parameters:
  • hold (bool) –
  • fontsize (int) – for x and y labels and title
  • bins – an integer or tuple of 2 integers to specify the binning of the x and y 2D histogram.
  • grid (bool) –
  • xlabel (str) –
  • ylabel (str) –
from sequana.pacbio import BAMPacbio
from sequana import sequana_data
b = BAMPacbio(sequana_data("test_pacbio_subreads.bam"))
b.plot_GC_read_len(bins=[10, 10])

(Source code, png, hires.png, pdf)

_images/references-28.png
reset()
to_fasta(output_filename, threads=2)

Export BAM reads into a Fasta file

Parameters:
  • output_filename – name of the output file (use .fasta extension)
  • threads (int) – number of threads to use

Note

this executes a shell command based on samtools

Warning

this takes a few minutes for 500,000 reads

to_fastq(output_filename, threads=2)

Export BAM reads into FastQ file

11.16. Phred quality

Manipulate phred quality of reads

FastQ qualities are stored as characters. The phred scale indicates the range of characters.

In general, characters go from ! to ~, that is from 33 to 126 in the ASCII table. This convention starts at 33 because characters before ! may cause trouble (e.g., white spaces). This is the Sanger scale. There are 2 other scales: one ranging from 59 to 126 (Solexa) and one from 64 to 126 (Illumina 1.3+).

So, here are the offsets to use:

Name           Offset   Numeric range
Sanger         33       0 to 93
Solexa         64       -5 to 62
Illumina 1.3+  64       0 to 62

Reference: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/
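The offsets translate directly into arithmetic on the ASCII codes. The helper below is an illustrative sketch, not part of the sequana API:

```python
def phred_to_quality(seq, offset=33):
    """Convert a phred-encoded string into a list of integer qualities."""
    # Each character's ASCII code minus the offset gives the quality value.
    return [ord(c) - offset for c in seq]

print(phred_to_quality("II", offset=33))  # Sanger:        [40, 40]
print(phred_to_quality("II", offset=64))  # Illumina 1.3+: [9, 9]
```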

Even though dedicated tools would have better performance, we provide a set of convenient functions. An example is provided below to plot the quality corresponding to a character string extracted from a FastQ read.

In this example, we use the Quality class, whose default offset is 33 (Sanger), and compare with the quality obtained using an offset of 64.

from sequana import phred

from sequana.phred import Quality
q = Quality('BCCFFFFFHHHHHIIJJJJJJIIJJJJJJJJFH')
q.plot()
q.offset = 64
q.plot()
from pylab import legend
legend(loc="best")

(Source code, png, hires.png, pdf)

_images/references-29.png
class Quality(seq, offset=33)[source]

Phred quality

>>> from sequana.phred import Quality
>>> q = Quality('BCCFFFFFHHHHHIIJJJJJJIIJJJJJJJJFH')
>>> q.plot()

(Source code, png, hires.png, pdf)

_images/references-30.png

You can access to the quality as a list using the quality attribute and the mean quality from the mean_quality attribute.

mean_quality

return mean quality

plot(fontsize=16)[source]

plot quality versus base position

quality

phred string into quality list

proba_to_quality_sanger(pe)[source]

A value between 0 and 93

Parameters:pe – the probability of error.
Returns:Q is the quality score.
  • a high probability of error (0.99) gives Q = 0
  • a low probability of error (0.05) gives Q = 13
  • a low probability of error (0.01) gives Q = 20
quality_to_proba_sanger(quality)[source]

Quality to probability (Sanger)
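Both conversions follow the standard Sanger relation Q = -10 log10(pe). The helpers below are illustrative sketches consistent with the values quoted above, not sequana's implementation:

```python
from math import log10

def proba_to_quality(pe):
    # Q = -10 * log10(pe), clipped to the Sanger range [0, 93]
    return max(0, min(93, -10 * log10(pe)))

def quality_to_proba(quality):
    # Inverse transform: pe = 10 ** (-Q / 10)
    return 10 ** (-quality / 10)

print(round(proba_to_quality(0.01)))  # 20
print(round(proba_to_quality(0.05)))  # 13
print(quality_to_proba(20))           # 0.01
```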

11.17. Running median

Data analysis tool

RunningMedian(data, width[, container]) Running median (fast)
class RunningMedian(data, width, container=<class 'list'>)[source]

Running median (fast)

This is an efficient implementation of running median, faster than SciPy implementation v0.17 and a skip list method.

The main idea comes from a recipe posted on this website: http://code.activestate.com/recipes/576930/#c3 that uses a simple list as proposed in https://gist.github.com/f0k/2f8402e4dfb6974bfcf1, and was adapted to our needs, including an object-oriented implementation.
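The core idea of the recipe (a sorted window updated in place) can be sketched in a few lines of pure Python. This is a simplified illustration (odd width assumed, edge effects ignored), not sequana's implementation:

```python
from bisect import insort, bisect_left

def running_median(data, width):
    """Sorted-window running median; the median is the middle element."""
    window = sorted(data[:width])
    mid = width // 2
    result = [window[mid]]
    for i in range(width, len(data)):
        # Remove the value leaving the window, insert the new one in order.
        del window[bisect_left(window, data[i - width])]
        insort(window, data[i])
        result.append(window[mid])
    return result

print(running_median([1, 5, 2, 8, 3, 9, 4], 3))  # [2, 5, 3, 8, 4]
```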

Note

a circular running median is implemented in sequana.bedtools.GenomeCov

from sequana.running_median import RunningMedian
rm = RunningMedian(data, 101)
results = rm.run()

Warning

the first W/2 and last W/2 positions should be ignored since they do not use W values. In this implementation, the last W/2 values are currently set to zero.

This shows how the results agree with scipy

from pylab import *
import scipy.signal
from sequana.running_median import RunningMedian

clf()
x = randn(100)
plot(x, 'k')
plot(RunningMedian(x,9).run(), 'r', lw=4)
plot(scipy.signal.medfilt(x, 9), 'go')
grid()

(Source code, png, hires.png, pdf)

_images/references-31.png
from sequana.running_median import RunningMedian
from pylab import *

N = 1000
X = linspace(0, N-1, N)

# Create some interesting data with SNP and longer over
# represented section.
data = 20 + randn(N) + sin(X*2*pi/1000.*5)
data[300:350] += 10
data[500:505] += 100
data[700] = 1000

plot(X, data, "k", label="data")
rm = RunningMedian(data, 101)
plot(X, rm.run(), 'r', label="median W=101")

from sequana.stats import moving_average as ma
plot(X[100:-100], ma(data, 201), 'g', label="mean W=201")
grid()
legend()
ylim([10, 50])

(Source code, png, hires.png, pdf)

_images/references-32.png

Note that for visualisation we set the y-limits to 50, but the data at position 500 goes up to 120 and there is a large outlier (1000) at position 700.

We see that the median is less sensitive to the outliers, as expected. The median is highly interesting for large outliers of short duration (e.g., here the peak at position 500) but is also less biased by larger regions.

Note

The beginning and end of the running median are irrelevant. They are actually equal to the data in our implementation.

Note

using blist instead of list is not always faster. It depends on the width of the window being used. list and blist are equivalent for W below 20,000 (list is slightly faster). However, for large W, blist has an O(log(n)) complexity while list has an O(n) complexity.

constructor

Parameters:
  • data – your data vector
  • width – running window length
  • container – a container (defaults to list). Could be a B-tree blist from the blist package but is 30% slower than a pure list for W < 20,000

Complexity: scipy is O(n), list is O(sqrt(n)), blist is O(log(n)).

run()[source]
running_median(data, width, container=<class 'list'>)[source]

11.18. Snakemake module

Set of tools to manipulate Snakefile and config files

Here is an overview (see details here below)

sequana.snaketools.DOTParser Utility to manipulate the dot file returned by Snakemake
sequana.snaketools.FastQFactory FastQ Factory tool
sequana.snaketools.FileFactory Factory to handle a set of files
sequana.snaketools.Module Data structure that holds metadata about a Module
sequana.snaketools.ModuleFinderSingleton Data structure to hold the Module names
sequana.snaketools.PipelineManager Utility to manage easily the snakemake pipeline
sequana.snaketools.SnakeMakeStats Interpret the snakemake stats file
sequana.snaketools.SequanaConfig Reads YAML config file and ease access to its contents
sequana.snaketools.message Dedicated print function to include in Snakefiles
sequana.snaketools.modules
class DOTParser(filename, mode='v2')[source]

Utility to manipulate the dot file returned by Snakemake

This class is used in the dag and rulegraph rules used in the snakemake pipeline. The input must be a dag/rulegraph created by snakemake.

Consider this example where the test file was created by snakemake –dag

from sequana import sequana_data
from sequana.snaketools import DOTParser

filename = sequana_data("test_dag.dot")
dot = DOTParser(filename)

# creates test_dag.ann.dot locally
dot.add_urls("test.dot", {"fastqc": "fastqc.html"})

You can then convert the dag in an unix shell:

dot -Tsvg test.ann.dot -o test.svg

(Source code, png, hires.png, pdf)

_images/references-33.png

constructor

Parameters:filename (str) – a DAG in dot format created by snakemake
add_urls(output_filename=None, mapper={}, title=None)[source]

Change the dot file adding URL on some nodes

Parameters:
  • output_filename (str) – the DAG file in dot format (graphviz)
  • mapper (dict) – a dictionary where keys are named after the rule names for which an HTML will be available (provided here as keys)
class FastQFactory(pattern, extension=['fq.gz', 'fastq.gz'], read_tag='_R[12]_', verbose=False)[source]

FastQ Factory tool

In NGS experiments, reads are stored in a so-called FastQ file. The file is named:

PREFIX_R1_SUFFIX.fastq.gz

where _R1_ tag is always to be found. This is a single-ended case. In paired case, a second file is to be found:

PREFIX_R2_SUFFIX.fastq.gz

The PREFIX indicates the sample name. The SUFFIX does not convey any information per se. The default read tag (“_R[12]_”) handles this case. It can be changed if the data have other read tags (e.g., “[12].fastq.gz”).

Yet, in long-read experiments (for instance), the naming convention is different and may not follow the single/paired-end convention.

In a directory (recursively or not), there could be lots of samples. This class can be used to get all the sample prefix in the tags attribute.

Given a tag, one can get the corresponding file(s):

ff = FastQFactory("*fastq.gz")
ff.tags
ff.get_file1(ff.tags[0])
len(ff)
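The prefix extraction can be sketched as follows; sample_tag is an illustrative helper, and the actual implementation may differ:

```python
import re

def sample_tag(filename, read_tag="_R[12]_"):
    # The sample name is whatever precedes the read tag in the filename.
    return re.split(read_tag, filename)[0]

print(sample_tag("PREFIX_R1_SUFFIX.fastq.gz"))  # PREFIX
```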

Constructor

Parameters:
  • pattern (str) – a global pattern (e.g., H*fastq.gz)
  • extension (list) – not used
  • read_tag (str) – regex tag used to join paired-end files. Some characters need to be escaped with a '\' to be interpreted as literal characters (e.g., '_R[12]_\.fastq\.gz')
  • verbose (bool) –
get_file1(tag=None)[source]
get_file2(tag=None)[source]
class FileFactory(pattern)[source]

Factory to handle a set of files

from sequana.snaketools import FileFactory
ff = FileFactory("H*.gz")
ff.filenames

A set of useful methods are available based on this convention:

>>> fullpath = "/home/user/test/A.fastq.gz"
>>> dirname(fullpath)
'/home/user/test'
>>> basename(fullpath)
'A.fastq.gz'
>>> realpath(fullpath)  # relative components such as '..' are expanded

>>> all_extensions
"fastq.gz"
>>> extensions
".gz"

Constructor

Parameters:pattern – can be a filename, list of filenames, or a global pattern (a unix regular expression with wildcards). For instance, “*/*fastq.gz”

Warning

Only Python 3.X supports the recursive pattern for now.

all_extensions

the extensions (list)

basenames

list of filenames and their extensions without the path

extensions

the last extension (list)

filenames

list of filenames (no path, no extension)

pathname

the common relative path

pathnames

the relative path for each file (list)

realpaths

real path is the full path + the filename + the extension

class Module(name)[source]

Data structure that holds metadata about a Module

In Sequana, we provide rules and pipelines to be used with snakemake. Snakemake rules look like:

rule <name>:
    input: file1
    output: file2
    shell: "cp file1 file2"

A pipeline may look like:

include: "path_to_rule1"
include: "path_to_rule2"
rule all:
    input: FINAL_FILES

Note that the pipeline includes rules by providing the path to them.

All rules can be stored in a single directory. Similarly for pipelines. We decided not to use that convention. Instead, we bundle rules (and pipelines) in their own directories so that other files can be stored with them. We also consider that

  1. if the Snakefile includes other Snakefiles, then it is a Pipeline.
  2. Otherwise it is a simple Rule.

So, a Module in sequana’s parlance is a directory that contains a rule or a pipeline and associated files. There are currently no strict conventions for rule Modules except for their own rule file. However, pipeline Modules should have the following files:

  • A snakemake file named after the directory with the extension .rules
  • A README.rst file in restructured text format
  • An optional config file in YAML format named config.yaml. Although JSON format is possible, we use YAML throughout sequana for consistency. Rules do not have config files but pipelines do; if a pipeline does not provide a config.yaml, the one found in ./sequana/sequana/pipelines will be used.
  • a requirements.txt

Note

Developers who wish to include new rules should refer to the Developer guide.

Note

it is important that the module’s name be used to name both the directory and the rule/pipeline.

The Modules are stored in sequana/rules and sequana/pipelines directories. The modules’ names cannot be duplicated.

Example:

pipelines/test_pipe/test_pipe.rules
pipelines/test_pipe/README.rst
rules/rule1/rule1.rules
rules/rule1/README.rst

The Module will ease the retrieval of information linked to a rule or pipeline. For instance, if a pipeline has a config file, its path can be retrieved easily:

m = Module("quality_control")
m.config

A Module may be a rule or a pipeline; the method is_pipeline() can be used to get that information. Other useful methods are available, such as onweb(), which opens the web page of the pipeline (linked to the README).

Constructor

Parameters:name (str) – the name of an available module.
check(mode='warning')[source]
cluster_config

full path to the config cluster file of the module

config

full path to the config file of the module

description

Content of the README file associated with the module

is_executable(verbose=False)[source]

Is the module executable

A Pipeline Module should have a requirements.txt file that is introspected to check if all executables are available.

Parameters:verbose
Returns:a tuple. The first element is a boolean telling whether the module is executable. The second element is the list of missing executables.
is_pipeline()[source]

Return True if this module is a pipeline

md5()[source]

return md5 of snakefile and its default configuration file

>>> from sequana import snaketools as sm
>>> m = sm.Module("variant_calling")
>>> m.md5()
{'config': 'e23b26a2ff45fa9ddb36c40670a8a00e',
 'snakefile': '7d3917743a6b123d9861ddbbb5f3baef'}
name

name of the module

onweb()[source]
overview
path

full path to the module directory

readme

full path to the README file of the module

requirements

list of requirements

snakefile

full path to the Snakefile file of the module

class PipelineManager(name, config, pattern='*.fastq.gz', fastq=True)[source]

Utility to manage easily the snakemake pipeline

Inside a snakefile, use it as follows:

from sequana import PipelineManager
manager = PipelineManager("pipeline_name", "config.yaml")

config file must have these fields:

- input_directory:  #a_path
- input_extension:  fastq.gz  # this is the default. could be also fq.gz
- input_readtag: _R[12]_ # default
- input_pattern:    # a_global_pattern e.g. H*fastq.gz
- input_samples:
    - file1:
    - file2:

The manager can then easily access to the data with a FastQFactory instance:

manager.ff.filenames

This can be further used to get a wildcards with the proper directory.

The manager also tells you whether the samples are paired or not, assuming all samples are homogeneous (either all paired or all single-ended).

If there is only one sample, the attribute mode is set to “nowc” meaning no wildcard. Otherwise, we assume that we are in a wildcard mode.

When the mode is set, two attributes are also set: sample and basename.

If the mode is nowc, the sample and basename are hardcoded to the sample name and sample/rule/sample respectively. In the wc mode, the sample and basename are wildcards set to “{sample}” and “{sample}/rulename/{sample}”. See the method getname().

For developers: the config attribute should be used as getter only.

Constructor

Parameters:
  • name – name of the pipeline
  • config – name of a configuration file
  • pattern – a default pattern if not provided in the configuration file as an input_pattern field.
error(msg)[source]
getlogdir(rulename)[source]

Create log directory: */sample/logs/sample_rule.logs

getname(rulename, suffix=None)[source]

Returns basename % rulename + suffix

getrawdata()[source]

Return list of raw data

If mode is nowc, a list of files is returned (one or two files) otherwise, a function compatible with snakemake is returned. This function contains a wildcard to each of the samples found by the manager.

getreportdir(acronym)[source]

Create the report directory.

getwkdir(rulename)[source]
class SnakeMakeStats(filename, N=1)[source]

Interpret the snakemake stats file

Run the Snakemake with this option:

--stats stats.txt

Then:

from sequana.snaketools import SnakeMakeStats
from sequana import sequana_data
filename = sequana_data("test_snakemake_stats.txt", "testing")
s = SnakeMakeStats(filename)
s.plot()

(Source code, png, hires.png, pdf)

_images/references-34.png

Constructor

plot(fontsize=16)[source]

Create the barplot from the stats file

plot_and_save(filename='snakemake_stats.png', outputdir='report')[source]
class SequanaConfig(data=None, converts_none_to_str=True)[source]

Reads YAML config file and ease access to its contents

This can also be used to check the validity of the config file

>>> sc = SequanaConfig(config)
>>> sc.config.pattern == "*.fastq.gz"
True

Input files should be stored into:

input_samples:
    - file1: FILE1
    - file2: FILE2

The second file may be optional.

Empty strings in a config file are interpreted as None, but SequanaConfig replaces None with empty strings, which is probably what the user expected. Similarly, when setting the config from Snakemake, one can override a value with False, but this is interpreted as the string “False”; SequanaConfig transforms such strings (“True”/“False”) back into booleans.
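The normalisation described above can be sketched as follows; cleanup_value is an illustrative helper, not the actual SequanaConfig code:

```python
def cleanup_value(value):
    # None (from an empty YAML field) becomes an empty string;
    # booleans overridden on the command line arrive as strings.
    if value is None:
        return ""
    if value in ("True", "False"):
        return value == "True"
    return value

print(repr(cleanup_value(None)))     # ''
print(cleanup_value("False"))        # False
print(cleanup_value("fastq.gz"))     # fastq.gz
```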

Could be a JSON or a YAML file

Parameters:filename (str) – filename to a config file in json or YAML format.

SEQUANA config files must have some specific fields:

input_directory
input_samples...
check_sequana_fields()[source]
cleanup()[source]

Remove template elements and change None to empty string.

cleanup_config()[source]
copy_requirements(target)[source]

Copy files to run the pipeline

If a requirement file exists, it is copied in the target directory. If not, it can be either an http resources or a sequana resources.

save(filename='config.yaml', cleanup=True)[source]

Save the yaml code in _yaml_code with comments

pipeline_names = ['atac-seq', 'smallrnaseq', 'quality_control', 'compressor', 'pacbio_qc', 'rnaseq', 'variant_calling', 'denovo_assembly', 'pacbio_denovo']

list of pipeline names found in the list of modules

11.19. Snpeff module

Tools to launch snpEff.

class SnpEff(reference, log=None)[source]

SnpEff is a tool dedicated to annotating detected variants in a VCF file. This wrapper eases the annotation with a GenBank file. It automatically creates the custom database, then runs snpEff in a subprocess. Caution: the locus name (or chromosome name) in the GenBank file and the sequence name in the VCF file must be the same; otherwise, snpEff is not able to bind the information.

Example:

snpeff = SnpEff('file.gbk')
snpeff.launch_snpeff('variants.vcf', 'variant.ann.vcf')

Constructor

Parameters:
  • reference – annotation reference.
  • file_format – format of your file. (‘only genbank actually’)
  • log – log file
add_locus_in_fasta(fasta, output_file)[source]

Add the locus of the annotation file to the description line of the FASTA file, for cases where the FASTA file and the GenBank file do not have the same names.

Parameters:
  • fasta (str) – input fasta file where you want to add locus.
  • output_file (str) – output file.
launch_snpeff(vcf_filename, output, html_output=None, options='')[source]

Launch snpEff with the custom genbank file.

Parameters:
  • vcf_filename (str) – input VCF filename.
  • output (str) – output VCF filename.
  • html_output (str) – filename of the HTML file created by snpEff.
  • options (str) – any options recognised by snpEff.
download_fasta_and_genbank(identifier, tag, genbank=True, fasta=True)[source]
Parameters:
  • identifier – valid identifier to retrieve from NCBI (genbank) and ENA (fasta)
  • tag – name of the filename for the genbank and fasta files.

11.20. General tools

misc utilities

textwrap(text, width=80, indent=0)[source]

Wrap a string at a given width (80 characters by default)

Parameters:
  • text – input text
  • width – (defaults to 80 characters)
  • indent – possible indentation (0 by default)
rest2html(s)[source]

Converts a restructuredText document into HTML

Note that the returned object is a bytes object, so it needs to be decoded with decode()

wget(link, output)[source]

Retrieve a file from internet.

Parameters:
  • link (str) – a valid URL
  • output (str) – the output filename

Warning

no sanity check of any kind for now

Todo

move to easydev

findpos(seq, chr)[source]

Find position(s) of a substring into a longer string.

Note that this function is a generator:

>>> list(findpos("AACCGGAAGGTT", "GG"))
[4, 8]
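A sketch of such a generator (illustrative; the actual implementation may differ):

```python
def findpos(seq, sub):
    # Generator yielding every start index of sub within seq (overlaps included)
    i = seq.find(sub)
    while i != -1:
        yield i
        i = seq.find(sub, i + 1)

print(list(findpos("AACCGGAAGGTT", "GG")))  # [4, 8]
```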
on_cluster(pattern=['tars-'])[source]

Used to check if we are on a cluster

“tars-” is the prefix of a cluster’s hostname. Change or extend the pattern argument with your own cluster’s hostname.

Parameters:pattern (str) – a list of names (strings) or a string
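A minimal sketch of this hostname check (illustrative, not the actual implementation):

```python
import socket

def on_cluster(pattern=["tars-"]):
    # True when the hostname starts with one of the given prefixes
    if isinstance(pattern, str):
        pattern = [pattern]
    hostname = socket.gethostname()
    return any(hostname.startswith(p) for p in pattern)
```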

Statistical tools

moving_average(data, n)[source]

Compute moving average

Parameters:n – window’s size (odd or even).
>>> from sequana.stats import moving_average as ma
>>> ma([1,1,1,1,3,3,3,3], 4)
array([ 1. ,  1.5,  2. ,  2.5,  3. ])

Note

the final vector does not have the same size as the input vector.
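A pure-Python sliding-window sketch reproducing the example output (the real implementation may rely on numpy instead):

```python
def moving_average(data, n):
    # O(N) running sum over a window of size n; the output has
    # len(data) - n + 1 points, i.e. it is shorter than the input
    s = sum(data[:n])
    out = [s / n]
    for i in range(n, len(data)):
        s += data[i] - data[i - n]
        out.append(s / n)
    return out
```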

evenness(data)[source]

Return Evenness of the coverage

Reference:Konrad Oexle, Journal of Human Genetics 2016, Evaluation of the evenness score in NGS.

Works before or after normalisation, but leads to different results.

General tools

class StatsBAM2Mapped(dict)[source]
to_html()[source]
bam_to_mapped_unmapped_fastq(filename, output_directory=None, verbose=True)[source]

Create mapped and unmapped fastq files from a BAM file

Context:

given a reference, one or two FastQ files are mapped onto the reference to generate a BAM file. This BAM file is a compressed version of a SAM file; this function eases its interpretation.

Parameters:
  • filename – input BAM file
  • output_directory – where to save the mapped and unmapped files
Returns:

dictionary with number of reads for each file (mapped/unmapped for R1/R2) as well as the mode (paired or not), the number of unpaired reads, and the number of duplicated reads. The unpaired reads should be zero (sanity check)

Given a BAM file, create FASTQ with R1/R2 reads mapped and unmapped. In the paired-end case, 4 files are created.

Note that this function is efficient in that it does not create intermediate files, limiting IO in the process. Compared to standard tools such as bedtools bamtofastq, it is 1.5 to 2X slower, but it does create both the mapped AND unmapped reads.

Details:

Secondary alignments (flag 256) are dropped so as to remove any ambiguous alignments. The output dictionary stores a “secondary” key to keep track of the total number of secondary reads that are dropped. If the flag is 256 and the read is unpaired, the key “unpaired” is also incremented.

If the flag is not equal to 256, we first reverse complement reads that are tagged as reverse in the BAM file. Then, reads that are not paired or not “proper pair” (neither flag 4 nor flag 8) are ignored.

If R1 is mapped or R2 is mapped then the reads are considered mapped. If both R1 and R2 are unmapped, then reads are unmapped.
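The flag logic described above can be sketched with SAM flag bits; this is an illustrative simplification of the function's details, not its actual code:

```python
def classify_read(flag):
    # SAM flag bits used in the description above
    SECONDARY = 256      # secondary alignment: dropped
    UNMAPPED = 4         # this read is unmapped
    MATE_UNMAPPED = 8    # the mate is unmapped
    if flag & SECONDARY:
        return "dropped"
    if (flag & UNMAPPED) and (flag & MATE_UNMAPPED):
        return "unmapped"   # both R1 and R2 are unmapped
    return "mapped"         # at least one mate is mapped
```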

Note

about chimeric alignments: one is the representative and the other is the supplementary. This flag is not used in this function. Note also that chimeric alignments share the same QNAME and carry flags 4 and 8

Note

the contamination reported is based on R1 only.

Todo

comments are missing since they are not stored in the BAM file.

Note

the mapped reads may not be synchronized because we also include the chimeric alignments (cf. samtools documentation). However, total reads = unmapped reads + R1 mapped + R2 mapped - supplementary reads (those with flag 2048).

class GZLineCounter(filename)[source]

Fast GZipped line counter

Uses zcat if possible, otherwise the gzip library (twice as slow).

>>> from sequana import sequana_data
>>> from sequana.misc import GZLineCounter
>>> gz = GZLineCounter(sequana_data("test.fastq.gz"))
>>> len(gz)
100
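The pure-gzip fallback can be sketched as follows; a simplified illustration (the real class prefers zcat for speed):

```python
import gzip

def count_gz_lines(filename):
    # fallback path: stream the gzipped file and count its lines
    with gzip.open(filename, "rb") as fin:
        return sum(1 for _ in fin)
```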

11.21. VCF module

Analysis of VCF file generated by freebayes.

class Filtered_freebayes(variants, fb_vcf)[source]

Variants filtered with VCF_freebayes.

constructor

Parameters:
  • variants (list) – list of variants record.
  • fb_vcf (VCF_freebayes) – class parent.
columns

Get columns index.

df

Get the data frame.

to_csv(output_filename, info_field=False)[source]

Write DataFrame in CSV format.

Parameters:output_filename (str) – output CSV filename.
to_vcf(output_filename)[source]

Write the filtered variants in VCF format.

Parameters:output_filename (str) – output VCF filename.
variants

Get the variant list.

vcf

Get the VCF_freebayes object.

class VCF_freebayes(filename, **kwargs)[source]

VCF class (Variant Calling Format)

This class is a wrapper around the vcf.Reader class from the pyVCF package. It is dedicated to VCF files generated by freebayes. A data frame with all variants is produced, which can be written as a CSV file. Variants can be filtered with a dictionary of filter parameters; filtered variants are written to a new VCF file.

from sequana import sequana_data
from sequana.freebayes_vcf_filter import VCF_freebayes
vcf_filename = sequana_data("JB409847.vcf")

# Read the data
v = VCF_freebayes(vcf_filename)

# Filter the data
filter_dict = {"freebayes_score": 200,
               "frequency": 0.8,
               "min_depth": 10,
               "forward_depth":3,
               "reverse_depth":3,
               "strand_ratio": 0.2}
filter_v = v.filter_vcf(filter_dict)
filter_v.to_vcf('output.filter.vcf')

constructor

Parameters:
  • filename (str) – a vcf file.
  • kwargs – any arguments accepted by vcf.Reader
filter_vcf(filter_dict=None)[source]

Filter variants in the VCF file.

Parameters:filter_dict (dict) – dictionary of filters. It updates the attribute VCF_freebayes.filters

Return a Filtered_freebayes object.

filters_params

Get or set the filter parameters used to select variants of interest. The setter takes a dictionary as parameter to update the attribute VCF_freebayes.filters_params. The deleter resets the different variables to 0.

v = VCF_freebayes("input.vcf")
v.filters_params = {"freebayes_score": 200,
                   "frequency": 0.8,
                   "min_depth": 10,
                   "forward_depth":3,
                   "reverse_depth":3,
                   "strand_ratio": 0.2}
is_joint

Get VCF_freebayes.is_joint, which is True if the VCF file was generated by joint freebayes calling.

rewind()[source]

Rewind the reader

class Variant(record)[source]

Variant class to store the variant record and a dictionary summarising the most important information.

constructor

Parameters:
  • record (RecordVariant) – variant record
  • resume (dict) – summary of the most important information about the variant
record
resume
samples
compute_frequency(vcf_line)[source]

Compute the frequency of the alternate allele: alt_freq = Count Alternate Allele / Depth.

Parameters:vcf_line (vcf.model._Record) – variant record
compute_strand_balance(vcf_line)[source]

Compute the strand balance of the alternate allele, a value within [0, 0.5]: strand_bal = Alt Forward / (Alt Forward + Alt Reverse).

Parameters:vcf_line (vcf.model._Record) – variant record
strand_ratio(number1, number2)[source]

Compute the ratio between two numbers. The result lies within [0, 0.5].
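The formulas above can be sketched as plain functions; these are hypothetical helpers mirroring the documented formulas, not the sequana source:

```python
def compute_frequency(alt_count, depth):
    # alt_freq = Count Alternate Allele / Depth
    return alt_count / depth

def strand_ratio(number1, number2):
    # ratio of the minority strand to the total, hence within [0, 0.5]
    total = number1 + number2
    if total == 0:
        return 0.0
    return min(number1, number2) / total
```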

11.22. Module Reports

Generic module is the parent module of all other module

class SequanaBaseModule(template_fn='standard.html')[source]

Generic Module to write HTML reports.

add_code_section(content, language)[source]

Add code in your html.

add_float_right(content)[source]

Align content to the right.

copy_file(filename, target_dir)[source]

Copy a file to a target directory in report dir. Return the relative path of your file.

Parameters:
  • filename (str) – file to copy.
  • target_dir (str) – directory where to copy.

Return relative path of the new file location.

create_combobox(path_list, html_id, newtab=True)[source]

Create a dropdown menu with jQuery.

Parameters:path_list (list) – list of links.

return html div and js script as string.

create_embedded_png(plot_function, input_arg, style=None, **kwargs)[source]

Take a plot function as input and create an HTML-embedded PNG image. You must set the argument name for the output to connect to the buffer.

create_hide_section(html_id, name, content, hide=False)[source]

Create a hideable section.

Parameters:
  • html_id (str) – add short id to connect all elements.
  • name (str) – name of the hyperlink to hide or display the content.
  • content (str) – hideable HTML content.
  • hide (bool) – whether the section is initially hidden.

Return tuple that contains HTML hyperlink and hideable section.

create_html(output_filename)[source]

Create HTML file with Jinja2.

Parameters:output_filename (str) – HTML output filename

Create an HTML hyperlink with name and target.

Parameters:
  • target (str) – the target url.
  • newtab (bool) – open html page in a new tab.
  • download (bool) – download the target.

Return as string the HTML hyperlink to the target.

include_svg_image(filename)[source]

Include SVG image in the html.

png_to_embedded_png(png, style=None)[source]

Include a PNG file as embedded file.

required_dir = ('css', 'js', 'images')

Report dedicated to BAM file

BAMQCModule(bam_input[, output_filename]) Report dedicated to BAM file
class BAMQCModule(bam_input, output_filename=None)[source]

Report dedicated to BAM file

from sequana import sequana_data
from sequana.modules_report.bamqc import BAMQCModule
filename = sequana_data("test.bam")

r = BAMQCModule(filename)
r.create_html("test.html")

# report/bam.html is now available

Todo

right now, the computation is performed in the class. Ideally, we would like the computation to happen elsewhere, where a json is stored. The json would be the input to this class.

add_flag_section()[source]
add_images_section()[source]
create_report_content()[source]

Module to write coverage report

class CoverageModule(data)[source]

Write HTML reports of coverage analysis. This class takes either a GenomeCov instance or a CSV file where analyses are stored.

constructor

Parameters:data – either a csv filename created by sequana_coverage or a bedtools.GenomeCov object.

chromosome_table(html_list)[source]

Create table with links to chromosome reports

create_report_content(html_list)[source]
create_reports()[source]

Create HTML report for each chromosome present in data.

init_roi_datatable(chrom)[source]

Initiate the DataTableFunction used to create tables that link each row with a sub HTML report. All tables will have the same appearance, so it is initiated only once.

class ChromosomeCoverageModule(chromosome, datatable, directory='coverage_reports')[source]

Write HTML report of coverage analysis for each chromosome. It is created by CoverageModule.

basic_stats()[source]

Basics statistics section.

coverage_barplot()[source]

Coverage barplots section.

coverage_plot()[source]

Coverage section.

create_report_content(directory)[source]

Generate the sections list to fill the HTML report.

gc_vs_coverage()[source]

3 dimensional plot of GC content versus coverage.

normalized_coverage()[source]

Barplot of normalized coverage section.

regions_of_interest(rois, links)[source]

Region of interest section.

subcoverage(rois, directory)[source]

Create a subcoverage report to provide access to a zoomable line plot.

zscore_distribution()[source]

Barplot of zscore distribution section.

11.23. Others

11.23.1. data related

Retrieve data from sequana library

sequana_data(filename=None, where=None)[source]

Return full path of a sequana resource data file.

Parameters:
  • filename (str) – a valid filename to be found
  • where (str) – one of the registered data directory (see below)
Returns:

the path of the file. See also below for the case where filename is set to “*”.

from sequana import sequana_data
filename = sequana_data("test.bam")

Type the function name with the “*” parameter to get a list of available files. With the where argument set, the function returns a list of files. Without the where argument, a dictionary is returned whose keys correspond to the registered directories:

filenames = sequana_data("*", where="images")

Registered directories are:

  • data
  • testing
  • data/adapters
  • images

Note

this does not handle wildcards. The * means retrieve all files.

Some useful data sets to be used in the analysis

The command sequana.sequana_data() may be used to retrieve data from this package. For example, a small but standard reference (phix) is used in some NGS experiments. The file is small enough to be provided within sequana, and its filename (full path) can be retrieved as follows:

from sequana import sequana_data
fullpath = sequana_data("phiX174.fa", "data")

Other files stored in this directory will be documented here.

11.23.2. report related

Utilities to create a Jquery DataTable for your HTML file.

DataTableFunction(df, html_id[, index]) Class that contains Jquery DataTables function and options.
DataTable(df, html_id[, datatable, index]) Class that contains an HTML table which uses a javascript function.
class DataTable(df, html_id, datatable=None, index=False)[source]

Class that contains an HTML table which uses a javascript function.

You must add in your HTML file the JS function (DataTable.create_javascript_function()) and the HTML code (DataTable.create_datatable()).

Example:

df = pandas.read_csv('data.csv')
datatable = DataTable(df, 'data')
datatable.datatable.datatable_options = {'pageLength': 15,
                                         'dom': 'Bfrtip',
                                         'buttons': ['copy', 'csv']}
js = datatable.create_javascript_function()
html = datatable.create_datatable()

# Second CSV file with same format
df2 = pandas.read_csv('data2.csv')
datatable2 = DataTable(df2, 'data2', datatable.datatable)
html2 = datatable2.create_datatable()

The reason to include the JS manually is that you may include many HTML tables but need to include the JS only once.

constructor

Parameters:
  • df – data frame.
  • html_id (str) – the unique ID used in the HTML file.
  • datatable (DataTableFunction) – javascript function to create the Jquery Datatables. If None, a DataTableFunction is generated from the df.
  • index (bool) – whether the dataframe index should be included in the CSV table
create_datatable(style='width:100%', **kwargs)[source]

Return a well-formatted string to include in an HTML page.

Parameters:
  • style (str) – CSS option of your table.
  • kwargs (dict) – parameters of pandas.DataFrame.to_csv().

create_javascript_function()[source]

Generate the javascript function to create the DataTable in a HTML page.

df
html_id
class DataTableFunction(df, html_id, index=False)[source]

Class that contains Jquery DataTables function and options.

Example:

import pandas
from sequana.utils import DataTableFunction

df = pandas.read_csv('data.csv')
datatable_js = DataTableFunction(df, 'data')
datatable_js.datatable_options = {'pageLength': 15,
                                  'dom': 'Bfrtip',
                                  'buttons': ['copy', 'csv']}
js = datatable_js.create_javascript_function()

# df_list is a list of dataframes sharing the same format
html_datatables = [DataTable(df, "data_{0}".format(i), datatable_js)
                   for i, df in enumerate(df_list)]

Here, the datatable_options dictionary is used to fine tune the appearance of the table.

Note

DataTables adds a number of elements around the table to control it or show additional information about it. These are controlled by their order in the document (DOM), defined as a string made of letters, each of them having a precise meaning. The order of the letters is important. For instance, if B is first, the buttons are put before the table; if B is at the end, they are shown below the table. Here are some of the valid letters and their meaning:

  • B: add the Buttons (copy/csv)
  • i: add showing 1 to N of M entries
  • f: add a search bar (f filtering)
  • r: processing display element
  • t: the table itself
  • p: pagination control

Each option can be specified multiple times (with the exception of the table itself).

Note

other useful options are:

  • pageLength: 15
  • scrollX: “true”
  • paging: 15
  • buttons: [‘copy’, ‘csv’]

Note that buttons can also be excel, pdf, print, …

All options of datatable:
https://datatables.net/reference/option/

constructor

Parameters:
  • df – data frame.
  • html_id (str) – the ID used in the HTML file.
create_javascript_function()[source]

Return javascript to create the DataTable.

datatable_columns

Get the datatable_columns dictionary. It is automatically set from the dataframe you want to plot.

datatable_options

Get, set or delete the DataTable options. The setter takes a dict with the desired options and updates the current dictionary.

Example:

datatable = DataTableFunction(df, "tab")
datatable.datatable_options = {'dom': 'Bfrtip',
                               'buttons': ['copy', 'csv']}

source: https://datatables.net/reference/option/

html_id

Get the html_id, which cannot be set by the user after the instantiation of the class.

set_links_to_column(link_col, target_col)[source]

Hide a column with URLs and connect it with a column.

Parameters:
  • link_col (str) – column with your URLs.
  • target_col (str) – column to connect.
set_tooltips_to_column(tooltips_col, target_col)[source]

Hide a column with tooltips and connect it with a column.

Parameters:
  • tooltips_col (str) – column with your tooltips.
  • target_col (str) – column to connect.