4.1. denovo_assembly

Overview:Denovo Assembly from FASTQ files
Input:FASTQ file(s) from Illumina Sequencing instrument
Output:FASTA, VCF and HTML files

4.1.1. Usage

4.1.1.1. Command line interface

Example:

sequana --pipeline denovo_assembly \
        --input-directory data/ \
        --input-readtag _[12].fastq \
        --extention fastq.gz \
        --working-dir analysis
cd analysis
snakemake -s denovo_assembly.rules --stats stats.txt

4.1.2. Requirements

  • busco
  • bwa
  • khmer
  • freebayes
  • picard (picard-tools)
  • prokka
  • quast
  • spades
  • sambamba
  • samtools
https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/denovo_assembly/denovo_dag.png

4.1.3. Details

Snakemake de-novo assembly pipeline dedicates to small genome like bacteria. It bases on SPAdes. The assembler corrects reads then assemble them using different size of kmer. If the careful option is set, SPAdes corrects mismatches and short INDELs in the contigs using BWA.

The sequencing depth can be normalised with khmer. Digital normalisation converts the existing high coverage regions into a Gaussian distributions centered around a lower sequencing depth. To put it another way, genome regions covered at 200x will be covered at 20x after normalisation. Thus, some reads from high coverage regions are discarded to reduce the quantity of data. Although the coverage is drastically reduce, the assembly will be as good or better than assembling the unnormalised data. Furthermore, SPAdes with normalised data is notably speeder and cost less memory than without digital normalisation. Above all, khmer does this in fixed, low memory and without any reference sequence needed.

The pipeline assess the assembly with several tools and approach. The first one is Quast, a tools for genome assemblies evaluation and comparison. It provides a HTML report with useful metrics like N50, number of mismatch and so on. Furthermore, it creates a viewer of contigs called Icarus. The second approach is to characterise coverage with sequana coverage and to detect mismatchs and short INDELs with Freebayes. The last approach but not the least is BUSCO, that provides quantitative measures for the assessment of genome assembly based on expectations of gene content from near-universal single-copy orthologs selected from OrthoDB.

4.1.4. Rules and configuration details

Here is a documenteted configuration file ../sequana/pipelines/denovo_assembly/config.yaml to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. Here are the rules and their developer and user documentation.

4.1.4.1. De-novo assembly

4.1.4.1.1. Digital normalisation

Digital normalisation is a method to normalise coverage of a sample in fixed, low memory and without any reference. The assembly with normalised data provides as good or better results than assembling the unnormalised data. Furthermore, SPAdes with normalised data is notably speeder and cost less memory than without digital normalisation.

Required input:
  • __digital_normalisation__input: List of paired FASTQ files
Required output:
  • __digital_normalisation__output: List of paired FASTQ files
Required log:
  • __digital_normalisation__log: Log with stdout and sterr of Khmer.
Required parameters:
  • __digital_normalisation__prefix: Prefix for intermediary outputs.
Required configuration:
digital_normalisation:
    ksize: 20 # Kmer size used to normalised the coverage.
    cutoff: 20 # When the median k-mer coverage level is above this number the read is not kept.
    max_memory_usage: 16e9 # Maximum amount of memory to use for data structure.
    threads: 4 # Number of threads to be used.
    options: # any options recognised by normalize-by-median.py.
Reference:

4.1.4.1.2. SPAdes

SPAdes is a de-novo assembler designed for small genomes like bacteria or fungi. It correct reads the assemble them using different size of kmer. With the careful options, SPAdes corrects mismatches and short INDELs in the contigs using BWA. This rule works only with paired-end files.

Required input:
  • __spades__fastq: List of paired FASTQ files.
Required output:
  • __spades__contigs: FASTA file of created contigs.
  • __spades__scaffolds: FASTA file of created scaffolds.
Required log:
  • __spades__log: Log file with stdout and stderr of SPAdes.
Required configuration:
spades:
    k: 21,33,55,77 # Comma-separated list of k-mer sizes (must be odd and less than 128).
    careful: yes # Tries to reduce number of mismatches and short indels.
    only_assembler: no # Runs only assembling (without read error correction).
    memory: 250 # RAM limit for SPAdes in Gb (terminates if exceeded).
    threads: 8 # Number of threads to be used.
    options: "" # Any options recognised by spades.py.
Reference:

4.1.4.1.3. Format contigs

This rule changes contigs name to have sample name and the contig number to reduce the description line length. Otherwise, Prokka will not annotate our contigs. This rule also removes contigs less long than a threshold.

Required input:
  • __format_contigs__input: FASTA file with contigs.
Required output:
  • __format_contigs__output: FASTA file with reformated contigs.
Required configuration:
format_contigs:
    threshold: 500 # remove contigs less long than this threshold.

4.1.4.2. Quality assessment metrics

4.1.4.2.1. QUAST

Quast calculates metrics to evaluate and to compare different de novo assembly. It provides a HTML report with useful metrics like N50, number of mismatch and so on. Furthermore, it creates a viewer of contigs called Icarus.

Required input:
  • __quast__input: One or a list of FASTA files.
  • __quast__reference: Reference genome file. (Optional)
  • __quast__genes: Gene positions in the reference genome. (Optional)
Required output:
  • __quast__done: Empty file to connect rule.
Required parameters:
  • __quast__outdir: Output directory with report HTML created by QUAST.
Required log:
  • __quast__log: Log file with stderr and stdout of QUAST.
Required configuration:
quast:
    reference: "" # FASTA reference genome file. (Optional)
    genes_file: "" # GFF with gene position of the reference. (Optional)
    threads: 4 # Number of threads used by QUAST.
    options: "" # Any options recognised by quast.py.
Reference:

4.1.4.2.2. BUSCO

TODO

4.1.4.3. Genome annotation

4.1.4.3.1. Prokka

Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards compliant output files (genbank, gff and so on).

Required input:
  • __prokka__input: FASTA file of the reference.
Required output:
  • __prokka__gbk: Genbank file of the reference.
Required log:
  • __prokka__log: Log file with stderr and stdout of Prokka.
Required parameters:
  • __prokka__prefix: Prefix name for annotations files.
  • __prokka__outdir: Output directory that contains annotations files.
Required configuration:
prokka:
    kingdom: Bacteria # Annotation mode. (Archaea|Bacteria|Mitochondria|Viruses)
    genus: "" # Genus name of the assembly.
    species: "" # Species name of the assembly.
    threads: 4 # Number of threads used by Prokka.
    options: ""# Any options recognised by Prokka.
Reference:

4.1.4.4. Re-mapping

4.1.4.4.1. BWA

Read mapping with BWA-mem

BWA mem rule aligns single or paired end FASTQ files on a reference genome. This rule is a dynamic rule (see developers section). This means that it can be included multiple times with different names in the same pipeline. The reference file is indexed by BWA index and FASTQ files are mapped on it. The sorting is done with sambamba by reading the stdin. Moreover, Sambamba generates automatically the BAM index.

Required input:
  • __bwa_mem_%(name)s__fastq: FASTQ files single or paired end.
  • __bwa_mem_%(name)s__reference: FASTA file of the reference.
  • __bwa_mem_%(name)s__fai: Index file of the reference.
Required output:
  • __bwa_mem_%(name)s__bam: Sorted BAM file.
Log:
  • __bwa_index_%(name)s__log: Log file of BWA index.
  • __bwa_mem_%(name)s__log: Log file of BWA mem.
Required configuration:
bwa_mem_%(name)s:
    index_algorithm: 'is' # BWA index algorithm (is or bwtsw)
    threads: 4         # Number of threads used by BWA mem.
    options:           # Any options recognised by BWA mem tool.
    tmp_directory: '/tmp' # Temporary directory.
Note:
When a dynamic rule is included, template variables must be replaced.
References:

4.1.4.4.2. Sambamba markdup

This rule marks or removes PCR duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.

Required input:
  • __sambamba_markdup__input: Sorted BAM file.
Required output:
  • __sambamba_markdup__output: BAM file with duplicates marked or removed.
Required log:
  • __sambamba_markdup__log_std: Log file with stdout.
  • __sambamba_markdup__log_err: Log file with stderr.
Required configuration:
sambamba_markdup:
    remove: no # Remove or just mark duplicate reads.
    tmp_directory: /tmp # Temporary directory.
Reference:

4.1.4.4.3. Sambamba filter

This rule uses sambamba view to filter reads with a mapping quality lower than a threshold. It also removes reads with multiple occurrence.

Required input:
  • __sambamba_filter__input: bam file
Required output:
  • __sambamba_filter__output: bam file
Required log:
  • __sambamba_filter__log: Log file with stdout and stderr of sambamba.
Required configuration:
sambamba_filter:
    threshold: 30 # Mapping quality score threshold
Reference:

4.1.4.5. Mismatch detection

4.1.4.5.1. Freebayes

Freebayes is a variant caller designed to find SNPs and short INDELs from a BAM file. It produces a very well-annotated VCF output. Moreover, it provides a quality score calculated by a bayesian model.

Required input:
  • __freebayes__input: Sorted BAM file.
  • __freebayes__reference: FASTA file of the reference genome.
Required output:
  • __freebayes__output: VCF file of detected variants.
Required log:
  • __freebayes__log: Log file with stdout and stderr of Freebayes.
Required configuration:
freebayes:
    ploidy: 1 # The sample ploidy.
    options: # Any options recognised by freebayes.
Note:
The ploidy correspond to the expected ploidy of the sample. For example, the ploidy must be set at 1 for a variant-calling on a bacteria sample.
Reference:

4.1.4.5.2. Freebayes filter

Variant filter rules dedicated to VCF files generated by freebayes. It filters with freebayes quality score, coverage depth, frequency and strand ratio.

Required input:
  • __freebayes_vcf_filter__input: VCF file from freebayes.
Required output:
  • __freebayes_vcf_filter__output: Filtered VCF file.
  • __freebayes_vcf_filter__csv: CSV file of filtered variants.
  • __freebayes_vcf_filter__html: HTML report.
Required parameters:
  • __freebayes_vcf_filter__report_dir: Report directory to copy JS/CSS.
Required configuration:
freebayes_vcf_filter:
    freebayes_score: 20 # threshold for minimum freebayes quality score.
    frequency: 0.7 # threshold for minimum alternative allele frequency.
    min_depth: 10 # threshold for minimum coverage depth.
    forward_depth: 3 # threshold for minimum coverage depth of forward strand.
    reverse_depth: 3 # threshold for minimum coverage depth of reverse strand.
    strand_ratio: 0.2 # threshold for minimum strand ratio between 0 and 0.5.

4.1.4.6. Coverage analysis

4.1.4.6.1. Samtools depth

Samtools Depth creates a BED file with the coverage depth for each base position. It can also compute multiple BAM files and concatenate results in one BED file.

Required input:
  • __samtools_depth__input: Sorted BAM file or list of bam file.
Required output:
  • __samtools_depth__output: BED file with coverage for each base.
Required log:
  • __samtools_depth__log: Log file with stderr of samtools.
Reference:

4.1.4.6.2. Sequana coverage

Sequana coverage detects and characterises automatically low and high genome coverage regions. It provides a useful HTML report with dynamic plot and table. Moreover, a CSV file with all metrics computed is created. This CSV can be used to regenerate the sequana_coverage report.

Required input:
  • __sequana_coverage__bed: BED file with coverage by base position.
Optional input:
  • __sequana_coverage__fasta: FASTA file of the reference to compute GC content.
  • __sequana_coverage__gbk: Genbank file to annotate regions of interest. (Optional)

NOTE: If you do not need an optional input. Initiate it as an empty list.

Required output:
  • __sequana_coverage__csv: CSV file with computed metrics.
  • __sequana_coverage__html: HTML report of results.
Required parameter:
  • __sequana_coverage__report_dir: Report directory to copy JS/CSS.
Required configuration:
sequana_coverage:
    k: 2 # number of gaussian predicted.
    circular: yes # if your genome is circular.
    window_size: 30001 # window size to compute the running median.
    low_threshold: -4 # threshold to detect low coverage regions.
    high_threshold: 4 # threshold to detect high coverage regions.
    gc_window_size: 201 # window size to compute GC content.
Reference: