References (genomics: sequences, motifs, RNA)#
Sequence basics#
- class DNA(sequence, codons_stop=['TAA', 'TGA', 'TAG'], codons_stop_rev=['TTA', 'TCA', 'CTA'], codons_start=['ATG'], codons_start_rev=['CAT'])[source]#
Simple DNA class
>>> from sequana.sequence import DNA >>> d = DNA("ACGTTTT") >>> d.reverse_complement()
Some long computations are done when setting the window size:
d.window = 100
The ORF detection has been validated agains a plasmodium 3D7 ORF file found on plasmodb.org across the 14 chromosomes.
Constructor
A sequence is just a string stored in the
sequenceattribute. It has properties related to the type of alphabet authorised.- Parameters:
sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.
complement_in
complement_out
letters -- authorise letters. Used in
check()only.
Todo
use counter only once as a property
- property AT_skew#
- property GC_skew#
- property ORF_pos#
- barplot_count_ORF_CDS_by_frame(alpha=0.5, bins=40, xlabel='Frame', ylabel='#', bar_width=0.35)[source]#
- property threshold#
- property type_filter#
- property type_window#
- property window#
- class RNA(sequence)[source]#
Simple RNA class
>>> from sequana.sequence import RNA >>> d = RNA("ACGUUUU") >>> d.reverse_complement()
Constructor
A sequence is just a string stored in the
sequenceattribute. It has properties related to the type of alphabet authorised.- Parameters:
sequence (str) -- May be a string of a Fasta File, in which case only the first sequence is used.
complement_in
complement_out
letters -- authorise letters. Used in
check()only.
Todo
use counter only once as a property
- class Repeats(filename_fasta, merge=False, name=None)[source]#
Class for finding repeats in DNA or RNA linear sequences.
Computation is performed each time the
thresholdis set to a new value.from sequana import sequana_data, Repeats rr = Repeats(sequana_data("measles.fa")) rr.threshold = 4 rr.hist_length_repeats()
Note
Works with shustring package from Bioconda (April 2017)
Todo
use a specific sequence (first one by default). Others can be selected by name
Constructor
Input must be a fasta file with valid DNA or RNA characters
- Parameters:
filename_fasta (str) -- a Fasta file, only the first sequence is used !
threshold (int) -- Minimal length of repeat to output
name (str) -- if name is provided, scan the Fasta file and select the corresponding sequence. if you want to analyse all sequences, you need to use a loop by setting _header for each sequence with the sequence name found in sequence header.
Note
known problems. Header with a > character (e.g. in the comment) are left strip and only the comments is kept. Another issue is for multi-fasta where one sequence is ignored (last or first ?)
- property begin_end_repeat_position#
- property df_shustring#
Return dataframe with shortest unique substring length at each position shortest unique substrings are unique in the sequence and its complement Uses shustring tool
- property do_merge#
- property header#
- hist_length_repeats(bins=20, alpha=0.5, hold=False, fontsize=12, grid=True, title='Repeat length', xlabel='Repeat length', ylabel='#', logy=True)[source]#
Plots histogram of the repeat lengths
- property length#
- property list_len_repeats#
- property longest_shustring#
- property names#
- property threshold#
- class Sequence(sequence, complement_in=b'ACGT', complement_out=b'TGCA', letters='ACGT')[source]#
Abstract base classe for other specialised sequences such as DNA.
Sequenced is the base class for other classes such as
DNAandRNA.from sequana import Sequence s = Sequence("ACGT") s.stats() s.get_complement()
Note
You may use a Fasta file as input (see constructor)
Constructor
A sequence is just a string stored in the
sequenceattribute. It has properties related to the type of alphabet authorised.- Parameters:
Todo
use counter only once as a property
- complement()[source]#
Alias to
get_complement()
- get_occurences(pattern, overlap=False)[source]#
Return position of the input pattern in the sequence
>>> from sequana import Sequence >>> s = Sequence('ACGTTTTACGT') >>> s.get_occurences("ACGT") [0, 7]
- reverse()[source]#
Alias to
get_reverse()
- property sequence#
Utilities to manipulate and find codons
- class Codon[source]#
Utilities to manipulate codons
The codon contains hard-coded set of start and stop codons (bacteria) for strand plus and minus. Adapt to your needs for other organisms. Based on the scan of Methanothermobacter thermautotrophicus bacteria.
from sequana import Codon c = Codon() c.start_codons['+']
- codons = {'start': {'+': frozenset({'ATG', 'GTG', 'TTG'}), '-': frozenset({'CAA', 'CAC', 'CAT'})}, 'stop': {'+': frozenset({'TAA', 'TAG', 'TGA'}), '-': frozenset({'CTA', 'TCA', 'TTA'})}}#
- find_start_codon_position(sequence, position, strand, max_shift=10000)[source]#
Return starting position and string of closest start codon to a given position
The starting position is on the 5-3 prime direction (see later)
The search starts at the given position, then +1 base, then -1 base, then +2, -2, +3, etc
Here, we start at position 2 (letter G), then shift by +1 and find the ATG string.
>>> from sequana import Codon >>> c = Codon() >>> c.find_start_codon_position("ATGATGATG", 2, "+") (3, 'ATG')
whereas starting at position 1, a shift or +1 (position 2 ) does not hit a start codon. Next, a shift of -1 (position 0) hits the ATG start codon.:
>>> c.find_start_codon_position("ATGATGATG", 1, "+") (3, 'ATG')
On strand -, the start codon goes from right to left. So, in the following example, the CAT start codon (reverse complement of ATG) is found at position 3. Developers must take into account a +3 shift if needed:
>>> c.find_start_codon_position("AAACAT", 3, "-") (3, 'CAT') >>> c.find_start_codon_position("AAACATCAT", 8, "-")
- find_stop_codon_position(sequence, position, strand, max_shift=10000)[source]#
Return position and string of closest stop codon to a given position
See
find_start_codon_position()for details.Only difference is that the search is based on stop codons rather than start codons.
>>> from sequana import Codon >>> c = Codon() >>> c.find_stop_codon_position("ATGACCCC", 2, "+") (1, 'TGA')
- compute_melting_temperature_salt_adjusted(sequence)[source]#
compute melting temperature with salt adjustement
This rule is a quick estimation for sequences greater than 14bp in length (Chester and Marshak 1993)
Tm(Celsius) = 64.9 + 0.41 x %GC - 650 / (sequence length)
This formula accounts for the stability conferred by GC content but does not account for secondary structures or mismatches.
- compute_melting_temperature_wallace_rule(sequence)[source]#
Compute mekting temperature Tm of a sequence using Wallace rule
This rule is a quick estimation for short oligonucleotides (20-25 base pairs), based on [Marmu and Doty 1962]:
Tm(Celsius) = 2 ({A} + {T}) + 4 ({G} + {C})
Where A and T bases contribute 2°C each and G and C bases contribute 4°C each.
This formula assumes standard conditions and is less accurate for longer sequences or those with unusual salt concentrations.
Regulatory / regulatory-adjacent#
- class ConsensusBuilder(df)[source]#
- all_consensus(mode='majority', threshold=0.25, relative=0.8, strong=0.6, majority=0.5, info_threshold=1.0)[source]#
- get_consensus(mode='majority', threshold=0.25, relative=0.8, strong=0.6, majority=0.5, info_threshold=1.0)[source]#
- Parameters:
- modestr
majority | threshold | relative | information | max_only
- thresholdfloat
used in threshold mode
- relativefloat
keep bases >= relative * max_frequency
- strongfloat
uppercase if max >= strong
- info_thresholdfloat
uppercase if information >= this value
- class KLAnalysis(df)[source]#
- class Kozak("fasta", "gff", feature="gene")[source]#
Filter only to keep ATG since others seems to ncRNA
raw Kozak sequence names and counts
a Kozak is e.g GGCRGG . first position is the less important
for the enumeration of kmers, get of the rid of the Ns
- odds ratio have 4 cases depending on the on enumeration:
use entire genome use chromosome by chromosome use of gene on genome use gene on chromosomes
Table of counts of Kozak sequences without dna ambiguities. - across the entire genome - by chromosomes
counts = k.get_all_kmer_counts() counts_chroms = k.get_all_kmer_counts_by_chromosome() counts_genes = k.get_all_kmer_counts_genes_only()
# proportions of kmer in genes: sum(list(counts_genes.values())) / length_genome
# counts in chroms should equal counts in genomes: Sgenes = sum([sum(list(counts_chroms[x].values())) for x in counts_chroms.keys()]) Sgenome = sum(list(counts.values()))
k = Kozak("ecoli_MG1655.fa", "ecoli_MG1655.gff", "gene", "ID") df = k.get_data() k.plot_logo(df.query("start_codon=='ATG'"))
- bootstrap(df, contexts, n_boot=500, ci=95, sample_size=200)[source]#
Perform bootstrapping to compute confidence intervals for KL divergence.
- property collapse_first_cds#
- property genetic_type#
- get_random_contexts(Nmax=None, quiet=True)[source]#
Return a background distribution of Kozak contexts.
- property include_start_codon#
- property keep_ATG_only#
- kl_vs_random_atg(motif_df, random_df)[source]#
Compute position-wise divergence using Kullback–Leibler (KL) divergence between Kozak contexts and random (non-annotated) ATG contexts.
This quantifies how specific the Kozak signal is compared to generic ATG neighborhoods.
Compute positional KL divergence between the observed Kozak motif and a background nucleotide distribution.
This method computes, for each position i:
D_KL(P_i || Q) = sum_i (P_i x log2(P_i / Q_i))
where P_i is the observed nucleotide frequency distribution at position i, and Q is a fixed background distribution.
- Parameters:
- motif_dfpandas.DataFrame
DataFrame with columns ['A', 'C', 'G', 'T'] containing nucleotide frequencies per position.
- random_dfpandas.DataFrame
Background distribution.
- Returns:
- numpy.ndarray
KL divergence (bits) for each motif position.
Notes
This quantity is mathematically related to Shannon entropy.
When Q is uniform, this is equivalent to classical sequence logo information content.
The result is deterministic for a given motif_df and random_df input pair
- property left_kozak#
- plot_kozak_chi2(motif=None, GC_mode=None, fontsize=10, noplot=False)[source]#
- for GC computation, 2 options:
genomic ATG background excluding annotated starts. Good for GC bias, codon bias, local sequence structure.
- genome wide base composition. simple and fast but inflates signal in GC biases genomes. does not control for ATG specific context. This is a simple a genome-wide background is estimated assuming strand symmetry:
G = C = GC / 2 A = T = AT / 2
- plot_logo_bits(df=None, ax=None, color_scheme='colorblind')[source]#
Plot sequence logo with letter heights scaled by information content (bits).
Unlike
plot_logo()which shows relative nucleotide frequencies with uniform column heights, this method scales each column by its information content (IC = 2 - Shannon entropy), so highly conserved positions appear tall (up to 2 bits) and variable positions appear short.- Parameters:
df -- output of
get_data(). If None,get_data()is called.ax -- matplotlib axes object. If None, the current axes is used.
color_scheme --
"colorblind"(default) or"classic".
- Returns:
the probability logo_data DataFrame (same as
plot_logo()).
- plot_logo_purine_pyrimidine(df=None, ax=None)[source]#
df is the output of
get_data()
- property right_kozak#
- set_context(left_kozak=6, right_kozak=6, keep_ATG_only=True, include_start_codon=False, background_method='context', collapse_first_cds=True)[source]#
Configure context windows and feature-row collapsing.
- Parameters:
left_kozak (int) -- number of nucleotides to keep upstream of the start codon.
right_kozak (int) -- number of nucleotides to keep downstream of the start codon.
keep_ATG_only (bool) -- if True, restrict downstream analyses to rows whose start codon is
ATG.include_start_codon (bool) -- include the start codon itself in the Kozak window when True.
background_method (str) -- one of
"context","shuffled", or"uniform".collapse_first_cds (bool) -- when True (default), collapse multi-exon CDS rows to one row per transcript (the 5'-most CDS, which is the only CDS row corresponding to a real start codon). See
_collapse_to_first_cds()for rationale. Set to False to recover the legacy behaviour where every CDS row is treated as a separate start (useful for benchmarking the bug fix).
- class KozakAddon(*args, **kwargs)[source]#
-
- property df#
- find_kmers(sequence, pattern='GCC[AG]CC')[source]#
>>> k.find_kmers("GCCACC") True >>> k.find_kmers("AAAAAA") False
- class KozakWeightScore(weight_matrix=None, left_flank=10, right_flank=10)[source]#
Compute Kozak Similarity Score using weight matrix approach.
Weight-matrix based scoring for translation initiation prediction. Implements the algorithm from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256411
Flexible left/right flank parameters allow custom window sizes.
Example:
from sequana.kozak import KozakWeightScore kss = KozakWeightScore(left_flank=10, right_flank=10) score = kss.score("CGCCGCCACCATGGCGGCGGAGG")
Initialize weight-based scorer.
- Parameters:
- weight_matrixnp.ndarray, optional
Shape (n_positions, 5) where columns: A, T, G, C, N. Default: canonical ATG 23bp matrix.
- left_flankint, default 10
Bases upstream of codon.
- right_flankint, default 10
Bases downstream of codon.
- class Motif(motif)[source]#
- property entropy#
Compute Shannon entropy for each position in the motif.
Shannon Entropy H(p) = -sum(p * log_base(p)) for DNA (4-letter alphabet).
- Parameters:
base -- the base of the logarithm. Common values: 2 (bits, default), e (nats), 10 (bans).
- Returns:
numpy array of entropy values per position.
- property information_content#
Compute information content (bits) for each position in the motif.
Information Content (IC) = max_entropy - Shannon_entropy = 2 - H(p) for DNA (4-letter alphabet).
- Parameters:
base -- the base of the logarithm. Common values: 2 (bits, default), e (nats), 10 (bans).
- Returns:
numpy array of information content values per position.
- kozak_weight_score(sequence, weight_matrix=None, left_flank=10, right_flank=10)[source]#
Quick-score function (convenience wrapper).
- Parameters:
- sequencestr
DNA sequence centered on codon.
- weight_matrixnp.ndarray, optional
Weight matrix. Uses default ATG if None.
- left_flankint, default 10
- right_flankint, default 10
- Returns:
- float
Normalized KSS [0, 1].
- class TelomerFilter(filename, pattern='AACCCT', threshold=0.8)[source]#
Filter reads based on telomeric repeat content.
- Parameters:
- save_non_telomeric_reads(output_filename='non_telomeric.fastq', progress=True)[source]#
Save non-telomeric reads to a file.
- class Telomere(reference_file=None, peak_height=20, peak_width=50)[source]#
Scan a FASTA file and identify the extent of telomeric regions.
Basic usage:
from sequana.telomere import Telomere telo = Telomere("ref.fa") df = telo.run(tag="myrun")
run()processes every contig, writes per-contig PNG plots and a CSV summary, and returns a DataFrame with one row per contig.Identifying the right k-mers
By default the scanner uses all six rotations of the canonical vertebrate telomere repeat
AACCCTplus additional 6-mers that improve signal-to-noise. To discover organism-specific k-mers:contig = telo.fasta.sequences[0] candidates = telo.find_representative_kmers(contig, kmers=6) telo.kmers = [k for k, _ in candidates]
Sliding-window counts
The two core signals used for peak detection:
XX = telo.get_sliding_kmer_count_five_to_three_prime(seq) YY = telo.get_sliding_kmer_count_three_to_five_prime(seq)
A telomere on a correctly oriented contig will appear as a peak at the left of
XX(LHS) and a peak at the right ofYY(RHS). Any peak at the opposite end indicates a reversed/mis-oriented contig.Plotting
Two plot styles are available through the plot_style argument of
run():'annotated'(default)Per-contig:
plot_contig()— single panel, both strand signals overlaid with colour-coded shaded telomere regions and a status badge. Reversed telomeres are highlighted in red.Summary:
plot_summary()— horizontal chromosome map with each contig drawn proportionally to its length, telomere blocks coloured by orientation, and per-row status badges.'legacy'Original two-subplot raw signals (per-contig) and two heatmaps (binary presence + length, summary).
Telomere categories
Each contig in the output DataFrame is assigned a
telomerecolumn:completeBoth LHS (forward) and RHS (reverse-complement) detected.
LHS_onlyOnly the left-hand telomere detected.
RHS_onlyOnly the right-hand telomere detected.
noneNo telomeric signal above threshold.
Reversed peaks (signal on the unexpected end) are flagged by
run()vialoggingwarnings and highlighted in red in the annotated plots.Initialize the Telomere scanner.
- Parameters:
reference_file -- path to a FASTA file.
peak_height -- minimum peak height for telomere peak detection.
peak_width -- minimum peak width for telomere peak detection.
- find_representative_kmers(seq, N=100000, pocc=0.005, n_sigma=5, kmers=6)[source]#
Identify over-represented k-mers in the first N bases of seq.
The probability of occurrence threshold pocc works well for ~100,000 bases. For shorter sequences the threshold may need to be adjusted.
Note
The fitted beta distribution used for plotting is estimated from a Leishmania genome and may not be appropriate for other organisms.
- get_sliding_kmer_count_five_to_three_prime(seq, W=100)[source]#
Return sliding kmer counts in the 5' → 3' direction.
- get_sliding_kmer_count_three_to_five_prime(seq, W=100)[source]#
Return sliding kmer counts in the 3' → 5' direction.
- plot_contig(XX, YY, chrom, total_length, midpoint, lhs1, rhs1, lhs1_extend, rhs1_extend, lhs2, rhs2, lhs2_extend, rhs2_extend)[source]#
Produce an annotated per-contig telomere figure.
Both sliding-kmer-count signals are shown overlaid in a single panel:
Blue line / shading: 5'→3' signal; telomere expected at the left (LHS). A signal at the right (RHS) indicates a reversed orientation and is highlighted in red.
Orange line / shading: 3'→5' signal; telomere expected at the right (RHS). A signal at the left (LHS) is reversed and shown in red.
A vertical grey bar marks the midpoint when the sequence was trimmed to Nmax. The title carries a status badge (COMPLETE / LHS only / RHS only / NONE).
- Parameters:
XX -- 5'→3' sliding kmer count array.
YY -- 3'→5' sliding kmer count array.
chrom -- contig/chromosome name.
total_length -- original full sequence length in bp.
midpoint -- index of the midpoint cut (0 means no trimming).
lhs1 -- LHS telomere boundary position from XX.
rhs1 -- RHS telomere boundary position from XX.
lhs1_extend -- LHS telomere length from XX.
rhs1_extend -- RHS telomere length from XX.
lhs2 -- LHS telomere boundary position from YY.
rhs2 -- RHS telomere boundary position from YY.
lhs2_extend -- LHS telomere length from YY.
rhs2_extend -- RHS telomere length from YY.
- plot_summary(df)[source]#
Produce an annotated genome-wide telomere summary figure.
Each contig is drawn as a horizontal bar scaled to its length (so longer chromosomes appear wider). Telomeric blocks are overlaid at the appropriate ends:
Blue: LHS telomere (5'→3', expected orientation)
Orange: RHS telomere (3'→5', expected orientation)
Red: Reversed telomere (unexpected end)
A coloured status badge on the right of each row indicates:
COMPLETE(green),LHS only(blue),RHS only(orange),NONE(grey), or⚠ REVERSED(red).Contigs are sorted by descending length so the largest chromosomes appear at the top.
- Parameters:
df -- DataFrame returned by
run().- Returns:
matplotlib Figure.
- run(tag, names=None, W=100, Nmax=100000, plot_style='annotated')[source]#
Run telomere detection across all (or selected) contigs.
- Parameters:
tag -- output filename prefix (
Nonesuppresses file output).names -- list of chromosome/contig names to process; defaults to all.
W -- sliding window half-width in bp.
Nmax -- maximum bp to examine at each end of the contig.
plot_style --
'annotated'(default) usesplot_contig()which overlays both strand signals with colour-coded telomeric regions and a status badge;'legacy'reproduces the original two-subplot raw output.
- circular_shifts(sequence)[source]#
Return all circular shifts of sequence as an ordered list.
For example,
circular_shifts("AACCCT")returns all 6 rotations of the string. Useful for kmer-based analyses where all phases of a repeat unit must be considered.
- CpG(sequence, window=200)[source]#
The Sequence Manipulation Suite: CpG Islands Results for 1200 residue sequence "sample sequence" starting "taacatactt". CpG islands search using window size of 200. Range, value 32 to 231, the y-value is 1.75 and the %GC content is 50.5 33 to 232, the y-value is 1.75 and the %GC content is 50.5
Gardiner-Garden M, Frommer M. J Mol Biol. 1987 Jul 20;196(2):261-82.
- find_restriction_sites(sequence, enzymes)[source]#
Find restriction sites in a DNA sequence.
- Args:
sequence (str): The DNA sequence. enzymes (dict): Dictionary of enzyme names and recognition sequences.
- Returns:
dict: A dictionary with enzyme names as keys and lists of start positions as values.
Non-B DNA / motif detection#
- class G4Hunter(fastafile, window=20, score=1)[source]#
This is an implementation of G4hunter that is 1-2 fold faster
Idea to speed up 50% is to use numba.
- class ZDNA(fasta_file, motif='CG', min_repeats=6)[source]#
Support multiple motifs: ["CG", "CA", "GT"]
- class FindMotif("cl10/select1.sorted.bam") df=fm.find_motif("CAGCAG") df.query("hit>10")[source]#
local threshold should be window length divided by motif length divided by 2
- find_motif_bam(filename, motif, window=200, figure=False, savefig=False, local_threshold=None, global_threshold=None)[source]#
RNA structure & rRNA depletion#
Ribodesigner module
- class RiboDesigner(fasta, gff=None, output_directory='ribodesigner', seq_type='rRNA', max_n_probes=384, force=False, threads=4, identity_step=0.01, force_clustering=False, **kwargs)[source]#
Design probes for ribosomes depletion.
From a complete genome assembly FASTA file and a GFF annotation file:
Extract genomic sequences corresponding to the selected
seq_type.For these selected sequences, design probes computing probe length and inter probe space according to the length of the ribosomale sequence.
Detect the highest cd-hit-est identity threshold where the number of probes is inferior or equal to
max_n_probes.Report the list of probes in BED and CSV files.
In the CSV, the oligo names are in column 1 and the oligo sequences in column 2.
- Parameters:
fasta -- The FASTA file with complete genome assembly to extract ribosome sequences from.
gff -- GFF annotation file of the genome assembly. If none provided, assuming the input FastA is already made of rRNA.
output_directory -- The path to the output directory defaults to ribodesigner.
seq_type -- string describing sequence annotation type (column 3 in GFF) to select rRNA from.
max_n_probes -- Max number of probes to design
force -- If the output_directory already exists, overwrite it.
threads -- Number of threads to use in cd-hit clustering.
identity_step (float) -- step to scan the sequence identity (between 0 and 1) defaults to 0.01.
force_clustering
- clustering_needed(force=False)[source]#
Checks if a clustering is needed.
- Parameters:
force -- force clustering even if unecessary.
- get_all_probes(method='original')[source]#
Run all probe design and concatenate results in a single DataFrame.