References (genomics: sequences, motifs, RNA)#

Sequence basics#

class DNA(sequence, codons_stop=['TAA', 'TGA', 'TAG'], codons_stop_rev=['TTA', 'TCA', 'CTA'], codons_start=['ATG'], codons_start_rev=['CAT'])[source]#

Simple DNA class

>>> from sequana.sequence import DNA
>>> d = DNA("ACGTTTT")
>>> d.reverse_complement()

Some long computations are done when setting the window size:

d.window = 100

The ORF detection has been validated agains a plasmodium 3D7 ORF file found on plasmodb.org across the 14 chromosomes.

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:
  • sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.

  • complement_in

  • complement_out

  • letters -- authorise letters. Used in check() only.

Todo

use counter only once as a property

property AT_skew#
property GC_skew#
property ORF_pos#
barplot_count_ORF_CDS_by_frame(alpha=0.5, bins=40, xlabel='Frame', ylabel='#', bar_width=0.35)[source]#
entropy(sequence)[source]#
get_dinucleotide_count(window=100)[source]#
get_dna_flexibility(window=100, step=1, threshold=13.7)[source]#
get_entropy(window)[source]#
get_homopolymers(N, window=100)[source]#
get_informational_entropy(window=500, poly=3)[source]#
get_karlin_signature_difference(window=500, dinucleotide_only=False)[source]#
get_trinucleotide_count(window=100)[source]#
hist_ORF_CDS_linearscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]#
hist_ORF_CDS_logscale(alpha=0.5, bins=40, xlabel='Length', ylabel='#')[source]#
plot_all_skews(figsize=(10, 12), fontsize=16, alpha=0.5)[source]#
property threshold#
property type_filter#
property type_window#
property window#
class RNA(sequence)[source]#

Simple RNA class

>>> from sequana.sequence import RNA
>>> d = RNA("ACGUUUU")
>>> d.reverse_complement()

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:
  • sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.

  • complement_in

  • complement_out

  • letters -- authorise letters. Used in check() only.

Todo

use counter only once as a property

class Repeats(filename_fasta, merge=False, name=None)[source]#

Class for finding repeats in DNA or RNA linear sequences.

Computation is performed each time the threshold is set to a new value.

from sequana import sequana_data, Repeats
rr = Repeats(sequana_data("measles.fa"))
rr.threshold = 4
rr.hist_length_repeats()

(Source code)

Note

Works with shustring package from Bioconda (April 2017)

Todo

use a specific sequence (first one by default). Others can be selected by name

Constructor

Input must be a fasta file with valid DNA or RNA characters

Parameters:
  • filename_fasta (str) -- a Fasta file, only the first sequence is used !

  • threshold (int) -- Minimal length of repeat to output

  • name (str) -- if name is provided, scan the Fasta file and select the corresponding sequence. if you want to analyse all sequences, you need to use a loop by setting _header for each sequence with the sequence name found in sequence header.

Note

known problems. Header with a > character (e.g. in the comment) are left strip and only the comments is kept. Another issue is for multi-fasta where one sequence is ignored (last or first ?)

property begin_end_repeat_position#
property df_shustring#

Return dataframe with shortest unique substring length at each position shortest unique substrings are unique in the sequence and its complement Uses shustring tool

property do_merge#
get_peak_position_and_length(THRESHOLD=3000)[source]#
property header#
hist_length_repeats(bins=20, alpha=0.5, hold=False, fontsize=12, grid=True, title='Repeat length', xlabel='Repeat length', ylabel='#', logy=True)[source]#

Plots histogram of the repeat lengths

property length#
property list_len_repeats#
property longest_shustring#
property names#
plot(clf=True, fontsize=12)[source]#
property threshold#
to_wig(filename, step=1000)[source]#

export repeats into WIG format to import in IGV

class Sequence(sequence, complement_in=b'ACGT', complement_out=b'TGCA', letters='ACGT')[source]#

Abstract base classe for other specialised sequences such as DNA.

Sequenced is the base class for other classes such as DNA and RNA.

from sequana import Sequence
s = Sequence("ACGT")
s.stats()
s.get_complement()

Note

You may use a Fasta file as input (see constructor)

Constructor

A sequence is just a string stored in the sequence attribute. It has properties related to the type of alphabet authorised.

Parameters:
  • sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.

  • complement_in

  • complement_out

  • letters -- authorise letters. Used in check() only.

Todo

use counter only once as a property

check()[source]#

Check that all letters are valid

complement()[source]#

Alias to get_complement()

gc_content()[source]#

Return mean GC content

get_complement()[source]#

Return complement

get_occurences(pattern, overlap=False)[source]#

Return position of the input pattern in the sequence

>>> from sequana import Sequence
>>> s = Sequence('ACGTTTTACGT')
>>> s.get_occurences("ACGT")
[0, 7]
get_reverse()[source]#

Return reverse sequence

get_reverse_complement()[source]#

Return reverse complement

get_statistics()[source]#
reverse()[source]#

Alias to get_reverse()

reverse_complement()[source]#

Alias to get_reverse_complement

property sequence#
stats()[source]#

Return basic stats about the number of letters

Utilities to manipulate and find codons

class Codon[source]#

Utilities to manipulate codons

The codon contains hard-coded set of start and stop codons (bacteria) for strand plus and minus. Adapt to your needs for other organisms. Based on the scan of Methanothermobacter thermautotrophicus bacteria.

from sequana import Codon
c = Codon()
c.start_codons['+']
codons = {'start': {'+': frozenset({'ATG', 'GTG', 'TTG'}), '-': frozenset({'CAA', 'CAC', 'CAT'})}, 'stop': {'+': frozenset({'TAA', 'TAG', 'TGA'}), '-': frozenset({'CTA', 'TCA', 'TTA'})}}#
find_start_codon_position(sequence, position, strand, max_shift=10000)[source]#

Return starting position and string of closest start codon to a given position

The starting position is on the 5-3 prime direction (see later)

Parameters:
  • sequence (str)

  • position (int) -- 0-base position

  • strand (str) -- '+' or '-'

The search starts at the given position, then +1 base, then -1 base, then +2, -2, +3, etc

Here, we start at position 2 (letter G), then shift by +1 and find the ATG string.

>>> from sequana import Codon
>>> c = Codon()
>>> c.find_start_codon_position("ATGATGATG", 2, "+")
(3, 'ATG')

whereas starting at position 1, a shift or +1 (position 2 ) does not hit a start codon. Next, a shift of -1 (position 0) hits the ATG start codon.:

>>> c.find_start_codon_position("ATGATGATG", 1, "+")
(3, 'ATG')

On strand -, the start codon goes from right to left. So, in the following example, the CAT start codon (reverse complement of ATG) is found at position 3. Developers must take into account a +3 shift if needed:

>>> c.find_start_codon_position("AAACAT", 3, "-")
(3, 'CAT')

>>> c.find_start_codon_position("AAACATCAT", 8, "-")
find_stop_codon_position(sequence, position, strand, max_shift=10000)[source]#

Return position and string of closest stop codon to a given position

Parameters:
  • sequence (str)

  • position (int) -- 0-base position

  • strand (str) --

    • or -

See find_start_codon_position() for details.

Only difference is that the search is based on stop codons rather than start codons.

>>> from sequana import Codon
>>> c = Codon()
>>> c.find_stop_codon_position("ATGACCCC", 2, "+")
(1, 'TGA')
get_codons_from_fasta_and_gff(fasta, gff)[source]#
compute_melting_temperature_salt_adjusted(sequence)[source]#

compute melting temperature with salt adjustement

This rule is a quick estimation for sequences greater than 14bp in length (Chester and Marshak 1993)

Tm(Celsius) = 64.9 + 0.41 x %GC - 650 / (sequence length)

This formula accounts for the stability conferred by GC content but does not account for secondary structures or mismatches.

compute_melting_temperature_wallace_rule(sequence)[source]#

Compute mekting temperature Tm of a sequence using Wallace rule

This rule is a quick estimation for short oligonucleotides (20-25 base pairs), based on [Marmu and Doty 1962]:

Tm(Celsius) = 2 ({A} + {T}) + 4 ({G} + {C})

Where A and T bases contribute 2°C each and G and C bases contribute 4°C each.

This formula assumes standard conditions and is less accurate for longer sequences or those with unusual salt concentrations.

Regulatory / regulatory-adjacent#

class ConsensusBuilder(df)[source]#
all_consensus(mode='majority', threshold=0.25, relative=0.8, strong=0.6, majority=0.5, info_threshold=1.0)[source]#
get_consensus(mode='majority', threshold=0.25, relative=0.8, strong=0.6, majority=0.5, info_threshold=1.0)[source]#
Parameters:
modestr

majority | threshold | relative | information | max_only

thresholdfloat

used in threshold mode

relativefloat

keep bases >= relative * max_frequency

strongfloat

uppercase if max >= strong

info_thresholdfloat

uppercase if information >= this value

class KLAnalysis(df)[source]#
compute_W50()[source]#
get_III()[source]#
get_KSI()[source]#

average across Kozak length (6bp before ATG)

Average across 6 bp (Kozak sequence)

get_peak_position()[source]#

max peak position

get_peak_strength()[source]#

max peak strength

get_power()[source]#
get_signal_concentration()[source]#
get_total_information(min=-1000000.0, max=0)[source]#

Area under the curve for position<0

This removes dilution from averaging. Independent of window scaling. Measures total constraint.

class Kozak("fasta", "gff", feature="gene")[source]#

Filter only to keep ATG since others seems to ncRNA

  • raw Kozak sequence names and counts

  • a Kozak is e.g GGCRGG . first position is the less important

  • for the enumeration of kmers, get of the rid of the Ns

  • odds ratio have 4 cases depending on the on enumeration:

    use entire genome use chromosome by chromosome use of gene on genome use gene on chromosomes

Table of counts of Kozak sequences without dna ambiguities. - across the entire genome - by chromosomes

counts = k.get_all_kmer_counts() counts_chroms = k.get_all_kmer_counts_by_chromosome() counts_genes = k.get_all_kmer_counts_genes_only()

# proportions of kmer in genes: sum(list(counts_genes.values())) / length_genome

# counts in chroms should equal counts in genomes: Sgenes = sum([sum(list(counts_chroms[x].values())) for x in counts_chroms.keys()]) Sgenome = sum(list(counts.values()))

k = Kozak("ecoli_MG1655.fa", "ecoli_MG1655.gff", "gene", "ID")
df = k.get_data()
k.plot_logo(df.query("start_codon=='ATG'"))
bootstrap(df, contexts, n_boot=500, ci=95, sample_size=200)[source]#

Perform bootstrapping to compute confidence intervals for KL divergence.

property collapse_first_cds#
export_meme(filename, name='Kozak')[source]#

PWM compatible with standard motif scanners

filter_dataframe(df, strand=None, query=None, genes_set=None, attribute=None)[source]#
property genetic_type#
get_data()[source]#
get_gc_per_chromosome(quiet=True)[source]#
get_information_content(motif)[source]#
get_random_contexts(Nmax=None, quiet=True)[source]#

Return a background distribution of Kozak contexts.

property include_start_codon#
property keep_ATG_only#
kl_vs_random_atg(motif_df, random_df)[source]#

Compute position-wise divergence using Kullback–Leibler (KL) divergence between Kozak contexts and random (non-annotated) ATG contexts.

This quantifies how specific the Kozak signal is compared to generic ATG neighborhoods.

Compute positional KL divergence between the observed Kozak motif and a background nucleotide distribution.

This method computes, for each position i:

D_KL(P_i || Q) = sum_i (P_i x log2(P_i / Q_i))

where P_i is the observed nucleotide frequency distribution at position i, and Q is a fixed background distribution.

Parameters:
motif_dfpandas.DataFrame

DataFrame with columns ['A', 'C', 'G', 'T'] containing nucleotide frequencies per position.

random_dfpandas.DataFrame

Background distribution.

Returns:
numpy.ndarray

KL divergence (bits) for each motif position.

Notes

  • This quantity is mathematically related to Shannon entropy.

  • When Q is uniform, this is equivalent to classical sequence logo information content.

  • The result is deterministic for a given motif_df and random_df input pair

property left_kozak#
plot_GC_per_chromosome(ylim=[0, 100])[source]#
plot_KL_divergence(df=None, ax=None, n_boot=500, ci=95)[source]#
plot_kozak_chi2(motif=None, GC_mode=None, fontsize=10, noplot=False)[source]#
for GC computation, 2 options:
  1. genomic ATG background excluding annotated starts. Good for GC bias, codon bias, local sequence structure.

  2. genome wide base composition. simple and fast but inflates signal in GC biases genomes. does not control for ATG specific context. This is a simple a genome-wide background is estimated assuming strand symmetry:

    G = C = GC / 2 A = T = AT / 2

plot_logo_bits(df=None, ax=None, color_scheme='colorblind')[source]#

Plot sequence logo with letter heights scaled by information content (bits).

Unlike plot_logo() which shows relative nucleotide frequencies with uniform column heights, this method scales each column by its information content (IC = 2 - Shannon entropy), so highly conserved positions appear tall (up to 2 bits) and variable positions appear short.

Parameters:
  • df -- output of get_data(). If None, get_data() is called.

  • ax -- matplotlib axes object. If None, the current axes is used.

  • color_scheme -- "colorblind" (default) or "classic".

Returns:

the probability logo_data DataFrame (same as plot_logo()).

plot_logo_purine_pyrimidine(df=None, ax=None)[source]#

df is the output of get_data()

property right_kozak#
set_context(left_kozak=6, right_kozak=6, keep_ATG_only=True, include_start_codon=False, background_method='context', collapse_first_cds=True)[source]#

Configure context windows and feature-row collapsing.

Parameters:
  • left_kozak (int) -- number of nucleotides to keep upstream of the start codon.

  • right_kozak (int) -- number of nucleotides to keep downstream of the start codon.

  • keep_ATG_only (bool) -- if True, restrict downstream analyses to rows whose start codon is ATG.

  • include_start_codon (bool) -- include the start codon itself in the Kozak window when True.

  • background_method (str) -- one of "context", "shuffled", or "uniform".

  • collapse_first_cds (bool) -- when True (default), collapse multi-exon CDS rows to one row per transcript (the 5'-most CDS, which is the only CDS row corresponding to a real start codon). See _collapse_to_first_cds() for rationale. Set to False to recover the legacy behaviour where every CDS row is treated as a separate start (useful for benchmarking the bug fix).

temporary_lr(left=None, right=None)[source]#
class KozakAddon(*args, **kwargs)[source]#
builddata()[source]#
property df#
find_kmers(sequence, pattern='GCC[AG]CC')[source]#
>>> k.find_kmers("GCCACC")
True
>>> k.find_kmers("AAAAAA")
False
get_all_kmer_counts(k=6, reverse=False, quiet=True)[source]#

Get all kmers from the entire genome

get_all_kmer_counts_by_chromosome(k=6, reverse=False)[source]#
get_all_kmer_counts_genes_only(k=6, genetic_type='gene', reverse=False)[source]#
get_kmer_counts()[source]#

Get kmer counts for both strands. Returns (minus_strand_counts, plus_strand_counts)

get_odd_ratio(mode='all')[source]#
plot_cumulated()[source]#
plot_logo_all_kmers()[source]#
plot_scatter_odds_ratio_annotated(pattern='GCC[AG]CC')[source]#
plot_scatter_odds_ratio_gene_vs_genome()[source]#
class KozakWeightScore(weight_matrix=None, left_flank=10, right_flank=10)[source]#

Compute Kozak Similarity Score using weight matrix approach.

Weight-matrix based scoring for translation initiation prediction. Implements the algorithm from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256411

Flexible left/right flank parameters allow custom window sizes.

Example:

from sequana.kozak import KozakWeightScore
kss = KozakWeightScore(left_flank=10, right_flank=10)
score = kss.score("CGCCGCCACCATGGCGGCGGAGG")

Initialize weight-based scorer.

Parameters:
weight_matrixnp.ndarray, optional

Shape (n_positions, 5) where columns: A, T, G, C, N. Default: canonical ATG 23bp matrix.

left_flankint, default 10

Bases upstream of codon.

right_flankint, default 10

Bases downstream of codon.

score(sequence)[source]#

Compute KSS for sequence.

Parameters:
sequencestr

Length left_flank + 3 + right_flank. Codon centered.

Returns:
float

Normalized score [0, 1].

score_batch(sequences)[source]#

Score multiple sequences.

Parameters:
sequenceslist of str
Returns:
np.ndarray

Scores for each sequence.

class Motif(motif)[source]#
property entropy#

Compute Shannon entropy for each position in the motif.

Shannon Entropy H(p) = -sum(p * log_base(p)) for DNA (4-letter alphabet).

Parameters:

base -- the base of the logarithm. Common values: 2 (bits, default), e (nats), 10 (bans).

Returns:

numpy array of entropy values per position.

property information_content#

Compute information content (bits) for each position in the motif.

Information Content (IC) = max_entropy - Shannon_entropy = 2 - H(p) for DNA (4-letter alphabet).

Parameters:

base -- the base of the logarithm. Common values: 2 (bits, default), e (nats), 10 (bans).

Returns:

numpy array of information content values per position.

plot_entropy()[source]#

Plot Shannon entropy per position.

plot_information_content()[source]#

Plot information content (IC) per position.

kozak_weight_score(sequence, weight_matrix=None, left_flank=10, right_flank=10)[source]#

Quick-score function (convenience wrapper).

Parameters:
sequencestr

DNA sequence centered on codon.

weight_matrixnp.ndarray, optional

Weight matrix. Uses default ATG if None.

left_flankint, default 10
right_flankint, default 10
Returns:
float

Normalized KSS [0, 1].

class TelomerFilter(filename, pattern='AACCCT', threshold=0.8)[source]#

Filter reads based on telomeric repeat content.

Parameters:
  • filename (str) -- Input FastQ file (can be .gz)

  • pattern (str) -- Telomeric repeat unit (default: "AACCCT")

  • threshold (float) -- Fraction of the read that must be telomeric (default: 0.8)

save_non_telomeric_reads(output_filename='non_telomeric.fastq', progress=True)[source]#

Save non-telomeric reads to a file.

save_reads(telomeric_output=None, non_telomeric_output=None, progress=True)[source]#

Identify and save reads list on-the-fly for maximum speed.

Parameters:
  • telomeric_output (str) -- File to save telomeric reads (optional)

  • non_telomeric_output (str) -- File to save non-telomeric reads (optional)

save_telomeric_reads(output_filename='telomeric.fastq', progress=True)[source]#

Save telomeric reads to a file.

class Telomere(reference_file=None, peak_height=20, peak_width=50)[source]#

Scan a FASTA file and identify the extent of telomeric regions.

Basic usage:

from sequana.telomere import Telomere
telo = Telomere("ref.fa")
df = telo.run(tag="myrun")

run() processes every contig, writes per-contig PNG plots and a CSV summary, and returns a DataFrame with one row per contig.

Identifying the right k-mers

By default the scanner uses all six rotations of the canonical vertebrate telomere repeat AACCCT plus additional 6-mers that improve signal-to-noise. To discover organism-specific k-mers:

contig = telo.fasta.sequences[0]
candidates = telo.find_representative_kmers(contig, kmers=6)
telo.kmers = [k for k, _ in candidates]

Sliding-window counts

The two core signals used for peak detection:

XX = telo.get_sliding_kmer_count_five_to_three_prime(seq)
YY = telo.get_sliding_kmer_count_three_to_five_prime(seq)

A telomere on a correctly oriented contig will appear as a peak at the left of XX (LHS) and a peak at the right of YY (RHS). Any peak at the opposite end indicates a reversed/mis-oriented contig.

Plotting

Two plot styles are available through the plot_style argument of run():

'annotated' (default)

Per-contig: plot_contig() — single panel, both strand signals overlaid with colour-coded shaded telomere regions and a status badge. Reversed telomeres are highlighted in red.

Summary: plot_summary() — horizontal chromosome map with each contig drawn proportionally to its length, telomere blocks coloured by orientation, and per-row status badges.

'legacy'

Original two-subplot raw signals (per-contig) and two heatmaps (binary presence + length, summary).

Telomere categories

Each contig in the output DataFrame is assigned a telomere column:

complete

Both LHS (forward) and RHS (reverse-complement) detected.

LHS_only

Only the left-hand telomere detected.

RHS_only

Only the right-hand telomere detected.

none

No telomeric signal above threshold.

Reversed peaks (signal on the unexpected end) are flagged by run() via logging warnings and highlighted in red in the annotated plots.

Initialize the Telomere scanner.

Parameters:
  • reference_file -- path to a FASTA file.

  • peak_height -- minimum peak height for telomere peak detection.

  • peak_width -- minimum peak width for telomere peak detection.

find_LHS_telomere(XX, plotting=False)[source]#
find_RHS_telomere(XX, plotting=False)[source]#
find_representative_kmers(seq, N=100000, pocc=0.005, n_sigma=5, kmers=6)[source]#

Identify over-represented k-mers in the first N bases of seq.

The probability of occurrence threshold pocc works well for ~100,000 bases. For shorter sequences the threshold may need to be adjusted.

Note

The fitted beta distribution used for plotting is estimated from a Leishmania genome and may not be appropriate for other organisms.

get_sliding_kmer_count_five_to_three_prime(seq, W=100)[source]#

Return sliding kmer counts in the 5' → 3' direction.

get_sliding_kmer_count_three_to_five_prime(seq, W=100)[source]#

Return sliding kmer counts in the 3' → 5' direction.

is_telomeric(seq, W=100)[source]#
plot_contig(XX, YY, chrom, total_length, midpoint, lhs1, rhs1, lhs1_extend, rhs1_extend, lhs2, rhs2, lhs2_extend, rhs2_extend)[source]#

Produce an annotated per-contig telomere figure.

Both sliding-kmer-count signals are shown overlaid in a single panel:

  • Blue line / shading: 5'→3' signal; telomere expected at the left (LHS). A signal at the right (RHS) indicates a reversed orientation and is highlighted in red.

  • Orange line / shading: 3'→5' signal; telomere expected at the right (RHS). A signal at the left (LHS) is reversed and shown in red.

A vertical grey bar marks the midpoint when the sequence was trimmed to Nmax. The title carries a status badge (COMPLETE / LHS only / RHS only / NONE).

Parameters:
  • XX -- 5'→3' sliding kmer count array.

  • YY -- 3'→5' sliding kmer count array.

  • chrom -- contig/chromosome name.

  • total_length -- original full sequence length in bp.

  • midpoint -- index of the midpoint cut (0 means no trimming).

  • lhs1 -- LHS telomere boundary position from XX.

  • rhs1 -- RHS telomere boundary position from XX.

  • lhs1_extend -- LHS telomere length from XX.

  • rhs1_extend -- RHS telomere length from XX.

  • lhs2 -- LHS telomere boundary position from YY.

  • rhs2 -- RHS telomere boundary position from YY.

  • lhs2_extend -- LHS telomere length from YY.

  • rhs2_extend -- RHS telomere length from YY.

plot_summary(df)[source]#

Produce an annotated genome-wide telomere summary figure.

Each contig is drawn as a horizontal bar scaled to its length (so longer chromosomes appear wider). Telomeric blocks are overlaid at the appropriate ends:

  • Blue: LHS telomere (5'→3', expected orientation)

  • Orange: RHS telomere (3'→5', expected orientation)

  • Red: Reversed telomere (unexpected end)

A coloured status badge on the right of each row indicates: COMPLETE (green), LHS only (blue), RHS only (orange), NONE (grey), or REVERSED (red).

Contigs are sorted by descending length so the largest chromosomes appear at the top.

Parameters:

df -- DataFrame returned by run().

Returns:

matplotlib Figure.

run(tag, names=None, W=100, Nmax=100000, plot_style='annotated')[source]#

Run telomere detection across all (or selected) contigs.

Parameters:
  • tag -- output filename prefix (None suppresses file output).

  • names -- list of chromosome/contig names to process; defaults to all.

  • W -- sliding window half-width in bp.

  • Nmax -- maximum bp to examine at each end of the contig.

  • plot_style -- 'annotated' (default) uses plot_contig() which overlays both strand signals with colour-coded telomeric regions and a status badge; 'legacy' reproduces the original two-subplot raw output.

circular_shifts(sequence)[source]#

Return all circular shifts of sequence as an ordered list.

For example, circular_shifts("AACCCT") returns all 6 rotations of the string. Useful for kmer-based analyses where all phases of a repeat unit must be considered.

factorize_sequences(sequences)[source]#
CpG(sequence, window=200)[source]#

The Sequence Manipulation Suite: CpG Islands Results for 1200 residue sequence "sample sequence" starting "taacatactt". CpG islands search using window size of 200. Range, value 32 to 231, the y-value is 1.75 and the %GC content is 50.5 33 to 232, the y-value is 1.75 and the %GC content is 50.5

Gardiner-Garden M, Frommer M. J Mol Biol. 1987 Jul 20;196(2):261-82.

compute_cpg_content(seq)[source]#
find_restriction_sites(sequence, enzymes)[source]#

Find restriction sites in a DNA sequence.

Args:

sequence (str): The DNA sequence. enzymes (dict): Dictionary of enzyme names and recognition sequences.

Returns:

dict: A dictionary with enzyme names as keys and lists of start positions as values.

build_kmer(length=6, letters='CG')[source]#

Return list of kmer of given length based on a set of letters

Returns:

list of kmers

get_kmer(sequence, k=7)[source]#

Given a sequence, return consecutive kmers

Returns:

iterator of kmers

Non-B DNA / motif detection#

class G4Hunter(fastafile, window=20, score=1)[source]#

This is an implementation of G4hunter that is 1-2 fold faster

Idea to speed up 50% is to use numba.

base_score(line)[source]#
get_G4(line, fileout, scores, header)[source]#
run(outdir)[source]#
write_sequences(line, fileout, liste, LISTE, header)[source]#
class G4HunterReader(filename_merged=None, filename_all=None)[source]#
load_files(file_pattern)[source]#
load_merged_data(filename)[source]#
to_bed(bedfile, cmap='seismic', threshold=0)[source]#
class ZDNA(fasta_file, motif='CG', min_repeats=6)[source]#

Support multiple motifs: ["CG", "CA", "GT"]

run()[source]#
to_bed(output_file, append=True, mode='a')[source]#
class IMotif(fasta_file, min_tract=3, max_loop=7)[source]#
run()[source]#
to_bed(output_file)[source]#
class Cruciforms(fasta_file, min_stem_len=6, max_stem_len=12)[source]#
is_cruciform(left, right)[source]#
run()[source]#
to_bed(output_file)[source]#
class Palindromes(fasta_file, min_len=4, max_len=12)[source]#
is_palindrome(seq)[source]#
run()[source]#
to_bed(output_file)[source]#
class FindMotif("cl10/select1.sorted.bam") df=fm.find_motif("CAGCAG") df.query("hit>10")[source]#

local threshold should be window length divided by motif length divided by 2

find_motif_bam(filename, motif, window=200, figure=False, savefig=False, local_threshold=None, global_threshold=None)[source]#
find_motif_fasta(filename, motif, window=200, local_threshold=None, global_threshold=None)[source]#
find_motif_from_sequence(seq, motif, window=None, local_threshold=None)[source]#
plot_alignment(bamfile, motif, window=200, global_th=10, title=None, legend=True, legend_fontsize=11, valid_rnames=[], valid_flags=[])[source]#

plot alignments that match the motif.

plot_specific_alignment(bamfile, query_name, motif, clf=True, show_figure=True, authorized_flags=[0, 16], windows=[10, 50, 100, 150, 200, 250, 500, 1000], local_threshold=5)[source]#
find_motif(bamfile, motif='CAGCAG', window=200, savefig=False, local_th=5, global_th=10)[source]#

If at least 10 position contains at least 5 instances of the motif, then this is a hit and the alignment is kept

RNA structure & rRNA depletion#

calculate_mfe(seq)[source]#

Calculate the Minimum Free Energy (MFE) score.

initialize_matrix(seq)[source]#

Initialize a matrix filled with zeros.

is_complementary(base1, base2)[source]#

Check if two bases are complementary.

Ribodesigner module

class RiboDesigner(fasta, gff=None, output_directory='ribodesigner', seq_type='rRNA', max_n_probes=384, force=False, threads=4, identity_step=0.01, force_clustering=False, **kwargs)[source]#

Design probes for ribosomes depletion.

From a complete genome assembly FASTA file and a GFF annotation file:

  • Extract genomic sequences corresponding to the selected seq_type.

  • For these selected sequences, design probes computing probe length and inter probe space according to the length of the ribosomale sequence.

  • Detect the highest cd-hit-est identity threshold where the number of probes is inferior or equal to max_n_probes.

  • Report the list of probes in BED and CSV files.

In the CSV, the oligo names are in column 1 and the oligo sequences in column 2.

Parameters:
  • fasta -- The FASTA file with complete genome assembly to extract ribosome sequences from.

  • gff -- GFF annotation file of the genome assembly. If none provided, assuming the input FastA is already made of rRNA.

  • output_directory -- The path to the output directory defaults to ribodesigner.

  • seq_type -- string describing sequence annotation type (column 3 in GFF) to select rRNA from.

  • max_n_probes -- Max number of probes to design

  • force -- If the output_directory already exists, overwrite it.

  • threads -- Number of threads to use in cd-hit clustering.

  • identity_step (float) -- step to scan the sequence identity (between 0 and 1) defaults to 0.01.

  • force_clustering

cluster_probes()[source]#

Use cd-hit-est to cluster highly similar probes.

clustering_needed(force=False)[source]#

Checks if a clustering is needed.

Parameters:

force -- force clustering even if unecessary.

export_all_probes_to_fasta()[source]#

From the self.probes_df, export to FASTA and CSV files.

export_to_csv_bed()[source]#

Export final results to CSV and BED files

export_to_json()[source]#
get_all_probes(method='original')[source]#

Run all probe design and concatenate results in a single DataFrame.

get_rna_pos_from_gff()[source]#

Convert a GFF file into a pandas DataFrame filtered according to the self.seq_type.

run(method='greedy')[source]#