References (genomics: sequences, motifs, RNA)#

Sequence basics #

class DNA(sequence, codons_stop=['TAA', 'TGA', 'TAG'], codons_stop_rev=['TTA', 'TCA', 'CTA'], codons_start=['ATG'], codons_start_rev=['CAT'])[source]#

Simple DNA class

>>> from sequana.sequence import DNA
>>> d = DNA("ACGTTTT")
>>> d.reverse_complement()

Some long computations are done when setting the window size:

d.window = 100

The ORF detection has been validated agains a plasmodium 3D7 ORF file found on plasmodb.org across the 14 chromosomes.

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:

sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.
complement_in
complement_out
letters -- authorise letters. Used in check() only.

Todo

use counter only once as a property

property AT_skew#

property GC_skew#

property ORF_pos#

barplot_count_ORF_CDS_by_frame(alpha=0.5, bins=40, xlabel='Frame', ylabel='#', bar_width=0.35)[source]#

entropy(sequence)[source]#

get_dinucleotide_count(window=100)[source]#

get_dna_flexibility(window=100, step=1, threshold=13.7)[source]#

get_entropy(window)[source]#

get_homopolymers(N, window=100)[source]#

get_informational_entropy(window=500, poly=3)[source]#

get_karlin_signature_difference(window=500, dinucleotide_only=False)[source]#

get_trinucleotide_count(window=100)[source]#

hist_ORF_CDS_linearscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]#

hist_ORF_CDS_logscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]#

plot_all_skews(figsize=(10, 12), fontsize=16, alpha=0.5)[source]#

property threshold#

property type_filter#

property type_window#

property window#

class RNA(sequence)[source]#

Simple RNA class

>>> from sequana.sequence import RNA
>>> d = RNA("ACGUUUU")
>>> d.reverse_complement()

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:

sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.
complement_in
complement_out
letters -- authorise letters. Used in check() only.

Todo

use counter only once as a property

class Repeats(filename_fasta, merge=False, name=None)[source]#

Class for finding repeats in DNA or RNA linear sequences.

Computation is performed each time the threshold is set to a new value.

from sequana import sequana_data, Repeats
rr = Repeats(sequana_data("measles.fa"))
rr.threshold = 4
rr.hist_length_repeats()

(Source code)

Note

Works with shustring package from Bioconda (April 2017)

Todo

use a specific sequence (first one by default). Others can be selected by name

Constructor

Input must be a fasta file with valid DNA or RNA characters

Parameters:

filename_fasta (str) -- a Fasta file, only the first sequence is used !
threshold (int) -- Minimal length of repeat to output
name (str) -- if name is provided, scan the Fasta file and select the corresponding sequence. if you want to analyse all sequences, you need to use a loop by setting _header for each sequence with the sequence name found in sequence header.

Note

known problems. Header with a > character (e.g. in the comment) are left strip and only the comments is kept. Another issue is for multi-fasta where one sequence is ignored (last or first ?)

property begin_end_repeat_position#

property df_shustring#: Return dataframe with shortest unique substring length at each position shortest unique substrings are unique in the sequence and its complement Uses shustring tool

property do_merge#

get_peak_position_and_length(THRESHOLD=3000)[source]#

property header#

hist_length_repeats(bins=20, alpha=0.5, hold=False, fontsize=12, grid=True, title='Repeat length', xlabel='Repeat length', ylabel='#', logy=True)[source]#: Plots histogram of the repeat lengths

property length#

property list_len_repeats#

property longest_shustring#

property names#

plot(clf=True, fontsize=12)[source]#

property threshold#

to_wig(filename, step=1000)[source]#: export repeats into WIG format to import in IGV

class Sequence(sequence, complement_in=b'ACGT', complement_out=b'TGCA', letters='ACGT')[source]#

Abstract base classe for other specialised sequences such as DNA.

Sequenced is the base class for other classes such as DNA and RNA.

from sequana import Sequence
s = Sequence("ACGT")
s.stats()
s.get_complement()

Note

You may use a Fasta file as input (see constructor)

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:

sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.
complement_in
complement_out
letters -- authorise letters. Used in check() only.

Todo

use counter only once as a property

check()[source]#: Check that all letters are valid

complement()[source]#: Alias to get_complement()

gc_content()[source]#: Return mean GC content

get_complement()[source]#: Return complement

get_occurences(pattern, overlap=False)[source]#

Return position of the input pattern in the sequence

>>> from sequana import Sequence
>>> s = Sequence('ACGTTTTACGT')
>>> s.get_occurences("ACGT")
[0, 7]

get_reverse()[source]#: Return reverse sequence

get_reverse_complement()[source]#: Return reverse complement

get_statistics()[source]#

reverse()[source]#: Alias to get_reverse()

reverse_complement()[source]#: Alias to get_reverse_complement

property sequence#

stats()[source]#: Return basic stats about the number of letters

Utilities to manipulate and find codons

class Codon[source]#

Utilities to manipulate codons

The codon contains hard-coded set of start and stop codons (bacteria) for strand plus and minus. Adapt to your needs for other organisms. Based on the scan of Methanothermobacter thermautotrophicus bacteria.

from sequana import Codon
c = Codon()
c.start_codons['+']

codons = {'start': {'+': frozenset({'ATG', 'GTG', 'TTG'}), '-': frozenset({'CAA', 'CAC', 'CAT'})}, 'stop': {'+': frozenset({'TAA', 'TAG', 'TGA'}), '-': frozenset({'CTA', 'TCA', 'TTA'})}}#

find_start_codon_position(sequence, position, strand, max_shift=10000)[source]#

Return starting position and string of closest start codon to a given position

The starting position is on the 5-3 prime direction (see later)

Parameters:

sequence (str)
position (int) -- 0-base position
strand (str) -- '+' or '-'

The search starts at the given position, then +1 base, then -1 base, then +2, -2, +3, etc

Here, we start at position 2 (letter G), then shift by +1 and find the ATG string.

>>> from sequana import Codon
>>> c = Codon()
>>> c.find_start_codon_position("ATGATGATG", 2, "+")
(3, 'ATG')

whereas starting at position 1, a shift or +1 (position 2 ) does not hit a start codon. Next, a shift of -1 (position 0) hits the ATG start codon.:

>>> c.find_start_codon_position("ATGATGATG", 1, "+")
(3, 'ATG')

On strand -, the start codon goes from right to left. So, in the following example, the CAT start codon (reverse complement of ATG) is found at position 3. Developers must take into account a +3 shift if needed:

>>> c.find_start_codon_position("AAACAT", 3, "-")
(3, 'CAT')

>>> c.find_start_codon_position("AAACATCAT", 8, "-")

find_stop_codon_position(sequence, position, strand, max_shift=10000)[source]#

Return position and string of closest stop codon to a given position

Parameters:

sequence (str)
position (int) -- 0-base position
strand (str) --
- or -

See find_start_codon_position() for details.

Only difference is that the search is based on stop codons rather than start codons.

>>> from sequana import Codon
>>> c = Codon()
>>> c.find_stop_codon_position("ATGACCCC", 2, "+")
(1, 'TGA')

get_codons_from_fasta_and_gff(fasta, gff)[source]#

compute_melting_temperature_salt_adjusted(sequence)[source]#

compute melting temperature with salt adjustement

This rule is a quick estimation for sequences greater than 14bp in length (Chester and Marshak 1993)

Tm(Celsius) = 69.3 + 0.41 x %GC - 650 / (sequence length)

This formula accounts for the stability conferred by GC content but does not account for secondary structures or mismatches.

compute_melting_temperature_wallace_rule(sequence)[source]#

Compute mekting temperature Tm of a sequence using Wallace rule

This rule is a quick estimation for short oligonucleotides (20-25 base pairs), based on [Marmu and Doty 1962]:

Tm(Celsius) = 2 ({A} + {T}) + 4 ({G} + {C})

Where A and T bases contribute 2°C each and G and C bases contribute 4°C each.

This formula assumes standard conditions and is less accurate for longer sequences or those with unusual salt concentrations.

Regulatory / regulatory-adjacent #

class ConsensusBuilder(df)[source]#

all_consensus(threshold=0.25, relative=0.8, strong=0.6, majority=0.5, info_threshold=1.0, notation='iupac')[source]#

get_consensus(mode='majority', threshold=0.25, relative=0.8, strong=0.6, majority=0.5, info_threshold=1.0, notation='iupac')[source]#

Parameters:

modestr: majority | threshold | relative | information | max_only
thresholdfloat: used in threshold mode
relativefloat: keep bases >= relative * max_frequency
strongfloat: uppercase if max >= strong
info_thresholdfloat: uppercase if information >= this value
notationstr: how to encode ambiguous positions: iupac (single ambiguity code, e.g. R) or expanded (slash-joined bases, e.g. A/G)

get_consensus_cavener(notation='iupac')[source]#

Consensus following the Cavener (1987) rule.

For each position:

if a single nucleotide exceeds 50% frequency and is more than twice as frequent as the next one, it is reported as a capital letter;
otherwise, if the two most frequent nucleotides have a combined frequency above 75%, a co-consensus is reported as the uppercase encoding of that pair;
otherwise the most frequent nucleotide is reported as a lowercase letter.

Parameters:: notation -- how to encode the co-consensus pair: iupac (single ambiguity code, e.g. R) or expanded (slash-joined bases, e.g. A/G, as in the original Cavener notation).

>>> import pandas as pd
>>> from sequana.kozak import ConsensusBuilder
>>> df = pd.DataFrame({
...     "A": [0.90, 0.45, 0.30, 0.55, 0.10],
...     "C": [0.04, 0.40, 0.30, 0.20, 0.10],
...     "G": [0.03, 0.10, 0.30, 0.15, 0.70],
...     "T": [0.03, 0.05, 0.10, 0.10, 0.10],
... })
>>> cb = ConsensusBuilder(df)
>>> cb.get_consensus_cavener()
'AMaAG'
>>> cb.get_consensus_cavener(notation="expanded")
'AA/CaAG'

Position 1 (A=0.90) is a capital because A exceeds 50% and is more than twice the next base. Position 2 (A=0.45, C=0.40) is a co-consensus (combined 0.85 > 75%). Position 3 (A=C=G=0.30) is lowercase. Position 5 (G=0.70) is a capital again.

Reference: Cavener DR (1987), Nucleic Acids Research 15(4):1353-1361.

class KLAnalysis(df)[source]#

compute_W50()[source]#

get_III()[source]#

get_KSI()[source]#

average across Kozak length (6bp before ATG)

Average across 6 bp (Kozak sequence)

get_peak_position()[source]#: max peak position

get_peak_strength()[source]#: max peak strength

get_power()[source]#

get_signal_concentration()[source]#

get_total_information(min=-1000000.0, max=0)[source]#

Area under the curve for position<0

This removes dilution from averaging. Independent of window scaling. Measures total constraint.

class Kozak("fasta", "gff", feature="gene")[source]#

Filter only to keep ATG since others seems to ncRNA

raw Kozak sequence names and counts
a Kozak is e.g GGCRGG . first position is the less important
for the enumeration of kmers, get of the rid of the Ns
odds ratio have 4 cases depending on the on enumeration:
use entire genome use chromosome by chromosome use of gene on genome use gene on chromosomes

Table of counts of Kozak sequences without dna ambiguities. - across the entire genome - by chromosomes

counts = k.get_all_kmer_counts() counts_chroms = k.get_all_kmer_counts_by_chromosome() counts_genes = k.get_all_kmer_counts_genes_only()

# proportions of kmer in genes: sum(list(counts_genes.values())) / length_genome

# counts in chroms should equal counts in genomes: Sgenes = sum([sum(list(counts_chroms[x].values())) for x in counts_chroms.keys()]) Sgenome = sum(list(counts.values()))

k = Kozak("ecoli_MG1655.fa", "ecoli_MG1655.gff", "gene", "ID")
df = k.get_data()
k.plot_logo(df.query("start_codon=='ATG'"))

bootstrap(df, contexts, n_boot=500, ci=95, sample_size=200)[source]#: Perform bootstrapping to compute confidence intervals for KL divergence.

property collapse_first_cds#

export_meme(filename, name='Kozak')[source]#: PWM compatible with standard motif scanners

filter_dataframe(df, strand=None, query=None, genes_set=None, attribute=None)[source]#

property genetic_type#

get_data()[source]#

get_gc_per_chromosome(quiet=True)[source]#

get_information_content(motif)[source]#

get_random_contexts(Nmax=None, quiet=True)[source]#: Return a background distribution of Kozak contexts.

property include_start_codon#

property keep_ATG_only#

kl_vs_random_atg(motif_df, random_df)[source]#

Compute position-wise divergence using Kullback–Leibler (KL) divergence between Kozak contexts and random (non-annotated) ATG contexts.

This quantifies how specific the Kozak signal is compared to generic ATG neighborhoods.

Compute positional KL divergence between the observed Kozak motif and a background nucleotide distribution.

This method computes, for each position i:

D_KL(P_i || Q) = sum_i (P_i x log2(P_i / Q_i))

where P_i is the observed nucleotide frequency distribution at position i, and Q is a fixed background distribution.

Parameters:

motif_dfpandas.DataFrame: DataFrame with columns ['A', 'C', 'G', 'T'] containing nucleotide frequencies per position.
random_dfpandas.DataFrame: Background distribution.

Returns:

numpy.ndarray: KL divergence (bits) for each motif position.

Notes

This quantity is mathematically related to Shannon entropy.
When Q is uniform, this is equivalent to classical sequence logo information content.
The result is deterministic for a given motif_df and random_df input pair

property left_kozak#

plot_GC_per_chromosome(ylim=[0, 100])[source]#

plot_KL_divergence(df=None, ax=None, n_boot=500, ci=95)[source]#

plot_kozak_chi2(motif=None, GC_mode=None, fontsize=10, noplot=False)[source]#

for GC computation, 2 options:

genomic ATG background excluding annotated starts. Good for GC bias, codon bias, local sequence structure.
genome wide base composition. simple and fast but inflates signal in GC biases genomes. does not control for ATG specific context. This is a simple a genome-wide background is estimated assuming strand symmetry:
G = C = GC / 2 A = T = AT / 2

plot_logo(df=None, ax=None, color_scheme='colorblind')[source]#

plot_logo_bits(df=None, ax=None, color_scheme='colorblind')[source]#

Plot sequence logo with letter heights scaled by information content (bits).

Unlike plot_logo() which shows relative nucleotide frequencies with uniform column heights, this method scales each column by its information content (IC = 2 - Shannon entropy), so highly conserved positions appear tall (up to 2 bits) and variable positions appear short.

Parameters:

df -- output of get_data(). If None, get_data() is called.
ax -- matplotlib axes object. If None, the current axes is used.
color_scheme -- "colorblind" (default) or "classic".

Returns:

the probability logo_data DataFrame (same as plot_logo()).

plot_logo_purine_pyrimidine(df=None, ax=None)[source]#: df is the output of get_data()

property right_kozak#

set_context(left_kozak=6, right_kozak=6, keep_ATG_only=True, include_start_codon=False, background_method='context', collapse_first_cds=True)[source]#

Configure context windows and feature-row collapsing.

Parameters:

left_kozak (int) -- number of nucleotides to keep upstream of the start codon.
right_kozak (int) -- number of nucleotides to keep downstream of the start codon.
keep_ATG_only (bool) -- if True, restrict downstream analyses to rows whose start codon is ATG.
include_start_codon (bool) -- include the start codon itself in the Kozak window when True.
background_method (str) -- one of "context", "shuffled", or "uniform".
collapse_first_cds (bool) -- when True (default), collapse multi-exon CDS rows to one row per transcript (the 5'-most CDS, which is the only CDS row corresponding to a real start codon). See _collapse_to_first_cds() for rationale. Set to False to recover the legacy behaviour where every CDS row is treated as a separate start (useful for benchmarking the bug fix).

temporary_lr(left=None, right=None)[source]#

class KozakAddon(*args, **kwargs)[source]#

builddata()[source]#

property df#

find_kmers(sequence, pattern='GCC[AG]CC')[source]#

>>> k.find_kmers("GCCACC")
True
>>> k.find_kmers("AAAAAA")
False

get_all_kmer_counts(k=6, reverse=False, quiet=True)[source]#: Get all kmers from the entire genome

get_all_kmer_counts_by_chromosome(k=6, reverse=False)[source]#

get_all_kmer_counts_genes_only(k=6, genetic_type='gene', reverse=False)[source]#

get_kmer_counts()[source]#: Get kmer counts for both strands. Returns (minus_strand_counts, plus_strand_counts)

get_odd_ratio(mode='all')[source]#

plot_cumulated()[source]#

plot_logo_all_kmers()[source]#

plot_scatter_odds_ratio_annotated(pattern='GCC[AG]CC')[source]#

plot_scatter_odds_ratio_gene_vs_genome()[source]#

class KozakWeightScore(weight_matrix=None, left_flank=10, right_flank=10)[source]#

Compute Kozak Similarity Score using weight matrix approach.

Weight-matrix based scoring for translation initiation prediction. Implements the algorithm from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256411

Flexible left/right flank parameters allow custom window sizes.

Example:

from sequana.kozak import KozakWeightScore
kss = KozakWeightScore(left_flank=10, right_flank=10)
score = kss.score("CGCCGCCACCATGGCGGCGGAGG")

Initialize weight-based scorer.

Parameters:

weight_matrixnp.ndarray, optional: Shape (n_positions, 5) where columns: A, T, G, C, N. Default: canonical ATG 23bp matrix.
left_flankint, default 10: Bases upstream of codon.
right_flankint, default 10: Bases downstream of codon.

score(sequence)[source]#

Compute KSS for sequence.

Parameters:

sequencestr: Length left_flank + 3 + right_flank. Codon centered.

Returns:

float: Normalized score [0, 1].

score_batch(sequences)[source]#

Score multiple sequences.

Parameters:

sequenceslist of str

Returns:

np.ndarray: Scores for each sequence.

class Motif(motif)[source]#

property entropy#

Compute Shannon entropy for each position in the motif.

Shannon Entropy H(p) = -sum(p * log_base(p)) for DNA (4-letter alphabet).

Parameters:: base -- the base of the logarithm. Common values: 2 (bits, default), e (nats), 10 (bans).
Returns:: numpy array of entropy values per position.

property information_content#

Compute information content (bits) for each position in the motif.

Information Content (IC) = max_entropy - Shannon_entropy = 2 - H(p) for DNA (4-letter alphabet).

Parameters:: base -- the base of the logarithm. Common values: 2 (bits, default), e (nats), 10 (bans).
Returns:: numpy array of information content values per position.

plot_entropy()[source]#: Plot Shannon entropy per position.

plot_information_content()[source]#: Plot information content (IC) per position.

classify_kozak_strength(motif, left=6)[source]#

Classify a Kozak consensus motif as optimal, strong, adequate or weak.

The tiers follow Meijer and Thomas (2002), Biochem J, doi:10.1042/bj20011706. The two key nucleotides are a purine (R = A/G) at the -3 position and a G at the +4 position (the base just after the ATG codon). U is treated as T. The 6 upstream positions (-6..-1) plus the +4 position are compared, the ATG codon being fixed:

GCCRCC-AUG-G -> "optimal" (full optimal context, both key nucleotides present);
NNNRNN-AUG-G -> "strong" (both key nucleotides present);
NNNRNN-AUG-(A/C/U) or NNN(C/U)NN-AUG-G -> "adequate" (only one of the two key nucleotides present; termed "moderate" in Hernandez et al. 2019, Cell);
NNN(C/U)NN-AUG-(A/C/U) -> "weak" (both key nucleotides absent).

Parameters:

motif (str) -- a Kozak motif. The ATG start codon is located automatically (first ATG substring); if absent, left is used as the number of upstream positions and the motif is assumed to be the 6 upstream bases immediately followed by the +4 base (no ATG codon).
left (int) -- number of upstream positions when no ATG is found.

Returns:

one of "optimal", "strong", "adequate" or "weak".

>>> from sequana.kozak import classify_kozak_strength
>>> classify_kozak_strength("GCCRCCATGG")
'optimal'
>>> classify_kozak_strength("AAARAAATGG")
'strong'
>>> classify_kozak_strength("AAACAAATGG")
'adequate'
>>> classify_kozak_strength("AAACAAATGA")
'weak'

kozak_weight_score(sequence, weight_matrix=None, left_flank=10, right_flank=10)[source]#

Quick-score function (convenience wrapper).

Parameters:

sequencestr: DNA sequence centered on codon.
weight_matrixnp.ndarray, optional: Weight matrix. Uses default ATG if None.
left_flankint, default 10
right_flankint, default 10

Returns:

float: Normalized KSS [0, 1].

class TelomerFilter(filename, pattern='AACCCT', threshold=0.8)[source]#

Filter reads based on telomeric repeat content.

Parameters:

filename (str) -- Input FastQ file (can be .gz)
pattern (str) -- Telomeric repeat unit (default: "AACCCT")
threshold (float) -- Fraction of the read that must be telomeric (default: 0.8)

save_non_telomeric_reads(output_filename='non_telomeric.fastq', progress=True)[source]#: Save non-telomeric reads to a file.

save_reads(telomeric_output=None, non_telomeric_output=None, progress=True)[source]#

Identify and save reads list on-the-fly for maximum speed.

Parameters:

telomeric_output (str) -- File to save telomeric reads (optional)
non_telomeric_output (str) -- File to save non-telomeric reads (optional)

save_telomeric_reads(output_filename='telomeric.fastq', progress=True)[source]#: Save telomeric reads to a file.

class Telomere(reference_file=None, peak_height=20, peak_width=50)[source]#

Scan a FASTA file and identify the extent of telomeric regions.

Basic usage:

from sequana.telomere import Telomere
telo = Telomere("ref.fa")
df = telo.run(tag="myrun")

run() processes every contig, writes per-contig PNG plots and a CSV summary, and returns a DataFrame with one row per contig.

Identifying the right k-mers

By default the scanner uses all six rotations of the canonical vertebrate telomere repeat AACCCT plus additional 6-mers that improve signal-to-noise. To discover organism-specific k-mers:

contig = telo.fasta.sequences[0]
candidates = telo.find_representative_kmers(contig, kmers=6)
telo.kmers = [k for k, _ in candidates]

Sliding-window counts

The two core signals used for peak detection:

XX = telo.get_sliding_kmer_count_five_to_three_prime(seq)
YY = telo.get_sliding_kmer_count_three_to_five_prime(seq)

A telomere on a correctly oriented contig will appear as a peak at the left of XX (LHS) and a peak at the right of YY (RHS). Any peak at the opposite end indicates a reversed/mis-oriented contig.

Plotting

Two plot styles are available through the plot_style argument of run():

'annotated' (default)

Per-contig: plot_contig() — single panel, both strand signals overlaid with colour-coded shaded telomere regions and a status badge. Reversed telomeres are highlighted in red.

Summary: plot_summary() — horizontal chromosome map with each contig drawn proportionally to its length, telomere blocks coloured by orientation, and per-row status badges.

'legacy'

Original two-subplot raw signals (per-contig) and two heatmaps (binary presence + length, summary).

Telomere categories

Each contig in the output DataFrame is assigned a telomere column:

`complete`	Both LHS (forward) and RHS (reverse-complement) detected.
`LHS_only`	Only the left-hand telomere detected.
`RHS_only`	Only the right-hand telomere detected.
`none`	No telomeric signal above threshold.

Reversed peaks (signal on the unexpected end) are flagged by run() via logging warnings and highlighted in red in the annotated plots.

Initialize the Telomere scanner.

Parameters:

reference_file -- path to a FASTA file.
peak_height -- minimum peak height for telomere peak detection.
peak_width -- minimum peak width for telomere peak detection.

find_LHS_telomere(XX, plotting=False)[source]#

find_RHS_telomere(XX, plotting=False)[source]#

find_representative_kmers(seq, N=100000, pocc=0.005, n_sigma=5, kmers=6)[source]#

Identify over-represented k-mers in the first N bases of seq.

The probability of occurrence threshold pocc works well for ~100,000 bases. For shorter sequences the threshold may need to be adjusted.

Note

The fitted beta distribution used for plotting is estimated from a Leishmania genome and may not be appropriate for other organisms.

get_sliding_kmer_count_five_to_three_prime(seq, W=100)[source]#: Return sliding kmer counts in the 5' → 3' direction.

get_sliding_kmer_count_three_to_five_prime(seq, W=100)[source]#: Return sliding kmer counts in the 3' → 5' direction.

is_telomeric(seq, W=100)[source]#

plot_contig(XX, YY, chrom, total_length, midpoint, lhs1, rhs1, lhs1_extend, rhs1_extend, lhs2, rhs2, lhs2_extend, rhs2_extend)[source]#

Produce an annotated per-contig telomere figure.

Both sliding-kmer-count signals are shown overlaid in a single panel:

Blue line / shading: 5'→3' signal; telomere expected at the left (LHS). A signal at the right (RHS) indicates a reversed orientation and is highlighted in red.
Orange line / shading: 3'→5' signal; telomere expected at the right (RHS). A signal at the left (LHS) is reversed and shown in red.

A vertical grey bar marks the midpoint when the sequence was trimmed to Nmax. The title carries a status badge (COMPLETE / LHS only / RHS only / NONE).

Parameters:

XX -- 5'→3' sliding kmer count array.
YY -- 3'→5' sliding kmer count array.
chrom -- contig/chromosome name.
total_length -- original full sequence length in bp.
midpoint -- index of the midpoint cut (0 means no trimming).
lhs1 -- LHS telomere boundary position from XX.
rhs1 -- RHS telomere boundary position from XX.
lhs1_extend -- LHS telomere length from XX.
rhs1_extend -- RHS telomere length from XX.
lhs2 -- LHS telomere boundary position from YY.
rhs2 -- RHS telomere boundary position from YY.
lhs2_extend -- LHS telomere length from YY.
rhs2_extend -- RHS telomere length from YY.

plot_summary(df)[source]#

Produce an annotated genome-wide telomere summary figure.

Each contig is drawn as a horizontal bar scaled to its length (so longer chromosomes appear wider). Telomeric blocks are overlaid at the appropriate ends:

Blue: LHS telomere (5'→3', expected orientation)
Orange: RHS telomere (3'→5', expected orientation)
Red: Reversed telomere (unexpected end)

A coloured status badge on the right of each row indicates: COMPLETE (green), LHS only (blue), RHS only (orange), NONE (grey), or ⚠ REVERSED (red).

Contigs are sorted by descending length so the largest chromosomes appear at the top.

Parameters:: df -- DataFrame returned by run().
Returns:: matplotlib Figure.

run(tag, names=None, W=100, Nmax=100000, plot_style='annotated')[source]#

Run telomere detection across all (or selected) contigs.

Parameters:

tag -- output filename prefix (None suppresses file output).
names -- list of chromosome/contig names to process; defaults to all.
W -- sliding window half-width in bp.
Nmax -- maximum bp to examine at each end of the contig.
plot_style -- 'annotated' (default) uses plot_contig() which overlays both strand signals with colour-coded telomeric regions and a status badge; 'legacy' reproduces the original two-subplot raw output.

circular_shifts(sequence)[source]#

Return all circular shifts of sequence as an ordered list.

For example, circular_shifts("AACCCT") returns all 6 rotations of the string. Useful for kmer-based analyses where all phases of a repeat unit must be considered.

factorize_sequences(sequences)[source]#

CpG(sequence, window=200)[source]#

The Sequence Manipulation Suite: CpG Islands Results for 1200 residue sequence "sample sequence" starting "taacatactt". CpG islands search using window size of 200. Range, value 32 to 231, the y-value is 1.75 and the %GC content is 50.5 33 to 232, the y-value is 1.75 and the %GC content is 50.5

Gardiner-Garden M, Frommer M. J Mol Biol. 1987 Jul 20;196(2):261-82.

compute_cpg_content(seq)[source]#

find_restriction_sites(sequence, enzymes)[source]#

Find restriction sites in a DNA sequence.

Args:: sequence (str): The DNA sequence. enzymes (dict): Dictionary of enzyme names and recognition sequences.
Returns:: dict: A dictionary with enzyme names as keys and lists of start positions as values.

build_kmer(length=6, letters='CG')[source]#

Return list of kmer of given length based on a set of letters

Returns:: list of kmers

get_kmer(sequence, k=7)[source]#

Given a sequence, return consecutive kmers

Returns:: iterator of kmers

Non-B DNA / motif detection #

class G4Hunter(fastafile, window=25, score=1)[source]#

This is an implementation of G4hunter that is 1-2 fold faster

Idea to speed up 50% is to use numba.

base_score(line)[source]#

get_G4(line, fileout, scores, header)[source]#

run(outdir)[source]#

write_sequences(line, fileout, liste, LISTE, header)[source]#

class G4HunterReader(filename_merged=None, filename_all=None)[source]#

load_files(file_pattern)[source]#

load_merged_data(filename)[source]#

to_bed(bedfile, cmap='seismic', threshold=0)[source]#

class Cruciforms(fasta_file, min_stem_len=6, max_stem_len=115, min_spacer=0, max_spacer=100, short_spacer_max=4, min_stem_len_long=10)[source]#

Detect inverted repeats (cruciform-forming motifs).

An inverted repeat is two arms that are reverse-complements of each other, optionally separated by a spacer (loop):

5'- ARM ----- spacer ----- revcomp(ARM) -3'

This mirrors the Inverted_Repeat output of the non-B gfa / nBMST tool (gfa_IR.gff). Differences from a naive adjacent-arm scan:

a spacer (loop) between the two arms is allowed (min_spacer .. max_spacer); gfa default is 0..100,
for every (gap, spacer) center the maximal stem is reported instead of one hit per stem length, avoiding nested-register duplicates,
hits fully contained in a longer hit are kept but flagged (subset=1), matching gfa's subset attribute. Set include_subsets=False to drop them.

gfa couples the loop and arm lengths: a short loop tolerates a short arm, a long loop requires a long arm. The two-tier default is reproduced here: spacer <= short_spacer_max (4) needs stem >= min_stem_len (6), otherwise stem >= min_stem_len_long (10).

Validated against gfa on a 30+ Mb Leishmania assembly: 57,702 non-subset inverted repeats vs gfa's 57,740 (0.07% difference), with 100% reciprocal positional overlap on a test slice. With the numba kernel the whole genome is processed in ~25 s.

Parameters:

fasta_file -- input FASTA.
min_stem_len -- minimum arm length for short loops (gfa default 6).
max_stem_len -- maximum arm length (cap on stem extension).
min_spacer -- minimum loop length (gfa default 0).
max_spacer -- maximum loop length (gfa default 100).
short_spacer_max -- largest loop still allowed with min_stem_len.
min_stem_len_long -- minimum arm length once spacer > short_spacer_max.

Note

cost scales as len(genome) * (max_spacer + 1). For a whole genome lower max_spacer or run per-chromosome if too slow.

is_cruciform(left, right)[source]#

run(include_subsets=True, progress=True, processes=None)[source]#

Scan every sequence for inverted repeats.

Uses a numba-compiled kernel when numba is installed, otherwise a numpy-vectorised fallback. Sequences are independent, so the scan is spread over processes workers (None = all CPUs; 1 = serial).

to_bed(output_file)[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_IR.gff (1-based, inclusive).

Detect mirror repeats.

A mirror repeat is two arms that read the same forwards and backwards across a central axis (the right arm is the reverse of the left arm, NOT the reverse complement — that would be an inverted repeat / cruciform), optionally separated by a spacer:

5'- ARM ----- spacer ----- reverse(ARM) -3'

This reproduces the Mirror_Repeat output of the non-B gfa / nBMST tool (gfa_MR.gff). The homopurine / homopyrimidine subset of mirror repeats forms H-DNA (triplex); see sequana.hdna.HDNA.

gfa defaults (from print_usage.c): arm (repeat) >= 10, spacer 0..100.

Maximal arms are reported; hits fully contained in a longer hit are kept but flagged (subset=1); include_subsets=False drops them.

Validated against gfa on a 30+ Mb Leishmania assembly: all 22254 gfa mirror repeats are recovered (100% positional overlap), density within ~2% (22704 vs 22254). The numba kernel processes the whole genome in a few seconds.

class MirrorRepeats(fasta_file, min_repeat=10, max_repeat=200, min_spacer=0, max_spacer=100)[source]#

Detect mirror repeats like gfa gfa_MR.gff.

Parameters:

fasta_file -- input FASTA.
min_repeat -- minimum arm length (gfa default 10).
max_repeat -- maximum arm length (cap on arm extension).
min_spacer -- minimum loop length (gfa default 0).
max_spacer -- maximum loop length (gfa default 100).

run(include_subsets=True, progress=True, processes=None)[source]#

Scan every sequence for mirror repeats.

Sequences are independent, so the scan is spread over processes workers (None = all CPUs; 1 = serial).

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_MR.gff (1-based, inclusive).

Detect intramolecular triplex (H-DNA) forming motifs.

H-DNA forms at mirror repeats whose strands are homopurine or homopyrimidine. This reproduces the Triplex output of the non-B gfa / nBMST tool (gfa_TPX.gff).

A motif is:

5'- ARM ----- spacer ----- mirror(ARM) -3'

where mirror(ARM) is the arm read backwards (same bases, not the reverse complement, unlike a cruciform/inverted repeat), and the whole motif (both arms + spacer) is 100% purine (A/G) or 100% pyrimidine (C/T).

gfa gfa_TPX.gff defaults, reverse-engineered from its output, are reproduced here: arm (repeat) >= 10, spacer 0..8, and each arm is allowed up to 10% impurities (impure * 10 <= arm_length) judged on the arm only, not the spacer. Every gfa Triplex is also a gfa Mirror_Repeat; the triplex set is the homopurine / homopyrimidine subset of mirror repeats.

Validated against gfa on a 30+ Mb Leishmania assembly (gfa_TPX.gff): all 1746 gfa Triplex loci are recovered (100% positional overlap), with density within ~6% (1859 vs 1746). The numba kernel processes the whole genome in ~4 s.

class HDNA(fasta_file, min_repeat=10, max_repeat=100, min_spacer=0, max_spacer=8)[source]#

Detect H-DNA (intramolecular triplex) forming mirror repeats.

Parameters:

fasta_file -- input FASTA.
min_repeat -- minimum arm length (gfa Triplex default 10).
max_repeat -- maximum arm length (cap on arm extension).
min_spacer -- minimum loop length (gfa default 0).
max_spacer -- maximum loop length (gfa default 8).

Hits fully contained in a longer hit are kept but flagged (subset=1), matching gfa's subset attribute; include_subsets=False drops them.

run(include_subsets=True, progress=True)[source]#

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_TPX.gff (1-based, inclusive).

Detect A-phased repeats (intrinsically bent DNA).

A-phased repeats are clusters of short A-tracts (runs of A/T) repeated in phase with the ~10.5 bp helical period, which bends the double helix. This reproduces the A_Phased_Repeat output of the non-B gfa / nBMST tool (gfa_APR.gff), ported from its findAPR.c / getAtracts routines.

Algorithm (gfa defaults shown):

A-tracts — every maximal run of A/T whose length is in [min_tract_len, max_tract_len] = [3, 9]. A run is a valid A-tract when, on the forward or reverse-complement strand, the longest A / AnTn pattern minus the longest pure-T run is >= min_tract_len. Each tract is summarised by the centre of its longest A-run.
Phasing — consecutive tract centres separated by [min_sep, max_sep] (gfa uses 9.9..11.1, i.e. ~10 bp) are chained; a chain of at least min_tracts (3) tracts is reported as one A-phased repeat.

Validated against gfa on a 30+ Mb Leishmania assembly: identical count (889) and 100% reciprocal positional overlap with gfa_APR.gff.

class APhasedRepeats(fasta_file, min_tract_len=3, max_tract_len=9, min_tracts=3, min_sep=9.9, max_sep=11.1)[source]#

Detect A-phased repeats (bent DNA) like gfa gfa_APR.gff.

Parameters:

fasta_file -- input FASTA.
min_tract_len -- minimum A-tract (A/T run) length (gfa default 3).
max_tract_len -- maximum A-tract length (gfa default 9).
min_tracts -- minimum number of phased tracts (gfa default 3).
min_sep -- minimum centre-to-centre tract separation (gfa 9.9).
max_sep -- maximum centre-to-centre tract separation (gfa 11.1).

run(progress=True)[source]#

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_APR.gff (1-based, inclusive).

Detect short tandem repeats (STR / microsatellites).

A short tandem repeat is a short unit (period 1..9 bp) repeated head-to-tail at least min_reps times, spanning at least min_span bp (including a partial trailing copy). This reproduces the Short_Tandem_Repeat output of the non-B gfa / nBMST tool (gfa_STR.gff), ported faithfully from its findSTR.c.

gfa defaults (from gfa.c): period 1..9, min_reps 3, min_span 10. For each start position the smallest period giving a qualifying repeat is taken, then the scan jumps past it; a new STR is kept only if it ends beyond the previous one.

The type attribute is gfa's nonBstr structural code, a 4-bit value 8*isComplementary + 4*isSymmetric + 2*isAlternatingRY + 1*isEven describing which non-B structure the unit can adopt.

Validated against gfa on a 30+ Mb Leishmania assembly: identical coordinates and identical length/num/remainder/type attributes (exact match).

class ShortTandemRepeats(fasta_file, min_period=1, max_period=9, min_span=10, min_reps=3)[source]#

Detect short tandem repeats (microsatellites) like gfa gfa_STR.gff.

Parameters:

fasta_file -- input FASTA.
min_period -- minimum repeat-unit length (gfa default 1).
max_period -- maximum repeat-unit length (gfa default 9).
min_span -- minimum total span in bp (gfa default 10).
min_reps -- minimum number of full copies (gfa default 3).

run(progress=True)[source]#

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_STR.gff (1-based, inclusive).

Detect Z-DNA forming motifs (alternating purine/pyrimidine runs).

Z-DNA is favoured by runs of alternating purine/pyrimidine dinucleotides, specifically CG/GC (strong) and CA/TG/AC/GT (weak); TA/AT are excluded. This reproduces the Z_DNA_Motif output of the non-B gfa / nBMST tool (gfa_Z.gff), ported faithfully from its findZDNA.c.

Each valid dinucleotide contributes to a Kadane/Vasquez (KV) style score (CG/GC = 25, the weak ones = 3). A run of at least min_z consecutive alternating bases is reported with score = sum_of_dinucleotide_scores // 2. A motif is flagged subset=1 when its score reaches min_kvscore (33).

gfa defaults (from gfa.c): min_z = 10, min_kvscore = 33 (the comment in gfa's is_subset.c saying 35 is stale; gfa.c uses 33).

Validated against gfa on a 30+ Mb Leishmania assembly: identical coordinates, length, score and subset for all 60076 motifs. The only field that differs is composition because gfa counts bases over [start, end-1) (an off-by-one that drops the last base); this module counts the full motif, so its A/C/G/T counts are the correct ones and match the reported sequence.

class ZDNA(fasta_file, min_z=10, min_kvscore=33)[source]#

Detect Z-DNA forming motifs like gfa gfa_Z.gff.

Parameters:

fasta_file -- input FASTA.
min_z -- minimum run length in bp (gfa default 10).
min_kvscore -- KV score threshold for the subset flag (gfa default 33); a motif with score >= min_kvscore gets subset=1.

run(progress=True)[source]#

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_Z.gff (1-based, inclusive).

Detect G-quadruplex forming motifs.

A G-quadruplex (G4) forms where four or more runs of guanines (G-islands, each at least min_rep bp) sit close together (separated by loops of at most max_spacer bp). This reproduces the G_Quadruplex_Motif output of the non-B gfa / nBMST tool (gfa_GQ.gff), ported faithfully from its findGQ.c / getGislands.

Both strands are scanned: runs of G give plus-strand motifs, runs of C give minus-strand motifs (the reported sequence is then the reverse complement, i.e. G-rich). For each motif gfa reports:

islands : number of merged G-islands (conIls),
runs : how many minimal (min_rep) runs fit across the islands (npos, must be >= 4),
max : the largest run size for which at least 4 runs still fit.

gfa defaults (from gfa.c): min_rep = 3, max_spacer = 7.

Validated against gfa on a 30+ Mb Leishmania assembly: identical coordinates, strand, islands/runs/max attributes. As for Z-DNA, the only field that differs is composition because gfa's printer counts bases over [start, end-1) (an off-by-one dropping the last base); this module counts the full motif, so its A/C/G/T counts are correct and match the reported sequence.

class GQuadruplex(fasta_file, min_rep=3, max_spacer=7)[source]#

Detect G-quadruplex forming motifs like gfa gfa_GQ.gff.

Parameters:

fasta_file -- input FASTA.
min_rep -- minimum guanine run length / G-island size (gfa default 3).
max_spacer -- maximum loop length between islands (gfa default 7).

run(progress=True)[source]#

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#: Write a gff3 comparable to gfa gfa_GQ.gff (1-based, inclusive).

Detect direct (and slipped / tandem) repeats.

A direct repeat is two identical arms in the same orientation, optionally separated by a spacer:

5'- ARM ----- spacer ----- ARM -3'

When the spacer is zero the arm is extended into a big tandem repeat (BTR). This reproduces the Direct_Repeat output of the non-B gfa / nBMST tool (gfa_DR.gff), ported faithfully from its findDR.c.

gfa defaults (from gfa.c): min arm 10, max arm 300, spacer 0..10 (the usage text saying spacer 100 is stale; gfa.c uses 10). The search is greedy: for each start it tries the largest arm first and the smallest qualifying spacer, records the first full match, and skips ahead so successive repeats must extend beyond the previous one (the end escape variable in gfa).

A motif is flagged subset=1 (gfa's "slipped" set) when its spacer is 0.

Validated against the gfa binary compiled from source on a 30+ Mb Leishmania assembly: identical coordinates and attributes.

class DirectRepeats(fasta_file, min_repeat=10, max_repeat=300, max_spacer=10)[source]#

Detect direct/slipped repeats like gfa gfa_DR.gff.

Parameters:

fasta_file -- input FASTA.
min_repeat -- minimum arm length (gfa default 10).
max_repeat -- maximum arm length (gfa default 300).
max_spacer -- maximum spacer between arms (gfa default 10).

run(progress=True, processes=None)[source]#

Scan every sequence for direct repeats.

Parameters:: processes -- number of worker processes. None (default) uses all CPUs; sequences are independent so this is an exact speed-up. Pass 1 to force a single in-process (serial) scan.

to_bed(output_file, mode='w')[source]#

to_gff(output_file, source='sequana')[source]#

Write a gff3 comparable to gfa gfa_DR.gff (1-based, inclusive).

Like gfa, composition counts only the first arm (repeat bases) while sequence is the full motif span.

Find exact repeats in a genome using the shustring tool.

class Repeats(filename_fasta, merge=False, name=None)[source]#

Class for finding repeats in DNA or RNA linear sequences.

Computation is performed each time the threshold is set to a new value.

from sequana import sequana_data, Repeats
rr = Repeats(sequana_data("measles.fa"))
rr.threshold = 4
rr.hist_length_repeats()

(Source code)

Note

Works with shustring package from Bioconda (April 2017)

Todo

use a specific sequence (first one by default). Others can be selected by name

Constructor

Input must be a fasta file with valid DNA or RNA characters

Parameters:

filename_fasta (str) -- a Fasta file, only the first sequence is used !
threshold (int) -- Minimal length of repeat to output
name (str) -- if name is provided, scan the Fasta file and select the corresponding sequence. if you want to analyse all sequences, you need to use a loop by setting _header for each sequence with the sequence name found in sequence header.

Note

known problems. Header with a > character (e.g. in the comment) are left strip and only the comments is kept. Another issue is for multi-fasta where one sequence is ignored (last or first ?)

property begin_end_repeat_position#

property df_shustring#: Return dataframe with shortest unique substring length at each position shortest unique substrings are unique in the sequence and its complement Uses shustring tool

property do_merge#

get_peak_position_and_length(THRESHOLD=3000)[source]#

property header#

hist_length_repeats(bins=20, alpha=0.5, hold=False, fontsize=12, grid=True, title='Repeat length', xlabel='Repeat length', ylabel='#', logy=True)[source]#: Plots histogram of the repeat lengths

property length#

property list_len_repeats#

property longest_shustring#

property names#

plot(clf=True, fontsize=12)[source]#

property threshold#

to_wig(filename, step=1000)[source]#: export repeats into WIG format to import in IGV

Reverse-complement palindrome detection.

A DNA palindrome is a sequence equal to its own reverse complement (e.g. GAATTC). Such palindromes always have an even length: the central base of an odd-length window would have to be its own complement, which never happens for A/C/G/T. Only even window sizes are therefore scanned.

class Palindromes(fasta_file, min_len=4, max_len=12)[source]#

is_palindrome(seq)[source]#

run()[source]#

to_bed(output_file)[source]#

i-motif (C-rich) secondary-structure detection.

An i-motif is the cytosine-rich counterpart of a G-quadruplex: four or more runs of C (each at least min_tract bases) separated by short loops fold into an intercalated structure. Detection is a single regular expression scan.

class IMotif(fasta_file, min_tract=3, max_loop=7)[source]#

run()[source]#

to_bed(output_file)[source]#

class TRF(filename: str | Path, verbose: bool = False, frmt: str | None = None)[source]#

Tandem Repeat Finder utilities

The input data is the output of trf tool when using the -d option. This is not a CSV file. It contains comments in the middle of the file to indicate the name of the contig.

The output filename has the following filename convention:

test.fa.2.5.7.80.10.50.2000.dat

where the numbers indicate the 7 input parameters:

Match = matching weight
Mismatch = mismatching penalty
Delta = indel penalty
PM = match probability (whole number)
PI = indel probability (whole number)
Minscore = minimum alignment score to report
MaxPeriod = maximum period size to report

You may use -h to suppress html output.

Then, you can use this class to easly identify the pattern you want:

t = TRF("input.dat")
query = "length>100 and period_size==3 and entropy>0 and C>20 and A>20 and G>20"
t.df.query(query)

entropy_distribution_by_period()[source]#: Boxplot of entropy per period size.

get_repeats_in_region(seq_name, start, end)[source]#: Return repeats overlapping a genomic region.

hist_cnvs(bins=50, CNVmin=10, motif=None, color='r', log=True)[source]#: Plot histogram of CNV counts for a given motif list.

hist_entropy(bins=50)[source]#: Histogram of the entropy of all found repeats

hist_length_repetition(bins=50, CNVmin=3, motif=['CAG', 'AGC', 'GCA'], color='r', log=True)[source]#: histogram of the motif found in the list provided by users. As an example, this is triplet CAG. Note that we also add the shifted version AGC and GCA.

hist_period_size(bins=50)[source]#: Length of the repetitions

hist_repetitions_per_sequence()[source]#

plot_count_versus_length()[source]#

plot_motif_logo(motif, n=50)[source]#

plot_period_vs_CNV(log=True)[source]#: Scatter plot: period size vs copy number.

plot_repeat_density(normalize_by_length=True)[source]#: Plot repeat density per sequence.

summary()[source]#: Return a descriptive summary of TRF results.

to_bed(outfile, cmap='autumn')[source]#

to_bedgraph(outfile)[source]#: Export TRF data as a bedGraph of repeat density.

top_motifs(n=10)[source]#: Return the most common repeat motifs.

class FindMotif("cl10/select1.sorted.bam") df=fm.find_motif("CAGCAG") df.query("hit>10")[source]#

local threshold should be window length divided by motif length divided by 2

find_motif_bam(filename, motif, window=200, figure=False, savefig=False, local_threshold=None, global_threshold=None)[source]#

find_motif_fasta(filename, motif, window=200, local_threshold=None, global_threshold=None)[source]#

find_motif_from_sequence(seq, motif, window=None, local_threshold=None)[source]#

plot_alignment(bamfile, motif, window=200, global_th=10, title=None, legend=True, legend_fontsize=11, valid_rnames=[], valid_flags=[])[source]#: plot alignments that match the motif.

plot_specific_alignment(bamfile, query_name, motif, clf=True, show_figure=True, authorized_flags=[0, 16], windows=[10, 50, 100, 150, 200, 250, 500, 1000], local_threshold=5)[source]#

find_motif(bamfile, motif='CAGCAG', window=200, savefig=False, local_th=5, global_th=10)[source]#: If at least 10 position contains at least 5 instances of the motif, then this is a hit and the alignment is kept

RNA structure & rRNA depletion #

calculate_mfe(seq)[source]#: Calculate the Minimum Free Energy (MFE) score.

initialize_matrix(seq)[source]#: Initialize a matrix filled with zeros.

is_complementary(base1, base2)[source]#: Check if two bases are complementary.

Ribodesigner module

class RiboDesigner(fasta, gff=None, output_directory='ribodesigner', seq_type='rRNA', max_n_probes=384, force=False, threads=4, identity_step=0.01, force_clustering=False, **kwargs)[source]#

Design probes for ribosomes depletion.

From a complete genome assembly FASTA file and a GFF annotation file:

Extract genomic sequences corresponding to the selected seq_type.
For these selected sequences, design probes computing probe length and inter probe space according to the length of the ribosomale sequence.
Detect the highest cd-hit-est identity threshold where the number of probes is inferior or equal to max_n_probes.
Report the list of probes in BED and CSV files.

In the CSV, the oligo names are in column 1 and the oligo sequences in column 2.

Parameters:

fasta -- The FASTA file with complete genome assembly to extract ribosome sequences from.
gff -- GFF annotation file of the genome assembly. If none provided, assuming the input FastA is already made of rRNA.
output_directory -- The path to the output directory defaults to ribodesigner.
seq_type -- string describing sequence annotation type (column 3 in GFF) to select rRNA from.
max_n_probes -- Max number of probes to design
force -- If the output_directory already exists, overwrite it.
threads -- Number of threads to use in cd-hit clustering.
identity_step (float) -- step to scan the sequence identity (between 0 and 1) defaults to 0.01.
force_clustering

cluster_probes()[source]#: Use cd-hit-est to cluster highly similar probes.

clustering_needed(force=False)[source]#

Checks if a clustering is needed.

Parameters:: force -- force clustering even if unecessary.

export_all_probes_to_fasta()[source]#: From the self.probes_df, export to FASTA and CSV files.

export_to_csv_bed()[source]#: Export final results to CSV and BED files

export_to_json()[source]#

get_all_probes(method='original')[source]#: Run all probe design and concatenate results in a single DataFrame.

get_rna_pos_from_gff()[source]#: Convert a GFF file into a pandas DataFrame filtered according to the self.seq_type.

run(method='greedy')[source]#

References (genomics: sequences, motifs, RNA)#

Sequence basics#

Regulatory / regulatory-adjacent#

Non-B DNA / motif detection#

RNA structure & rRNA depletion#

Sequence basics #

Regulatory / regulatory-adjacent #

Non-B DNA / motif detection #

RNA structure & rRNA depletion #