4.1. denovo_assembly

Overview: De novo assembly from FastQ files
Input: FastQ file(s) from an Illumina sequencing instrument
Output: FastA, VCF and HTML files

4.1.1. Usage

Example:

sequana --pipeline denovo_assembly --file1 R1.fastq.gz --file2 R2.fastq.gz --project denovo
cd denovo
snakemake -s denovo_assembly.rules -p --stats stats.txt -j 4

4.1.2. Requirements

  • khmer
  • quast
  • spades
  • [variant_calling]

4.1.3. Details

Read normalisation is performed with khmer (digital normalisation). It is an optional task defined in the config file. Normalisation is worth considering when the sequencing depth (depth of coverage) is very high. The de novo assembly is then done with SPAdes. Finally, coverage and misassemblies are evaluated with the variant calling pipeline.

https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/denovo_assembly/denovo_dag.png
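As a rough guide to whether normalisation is worth enabling, the expected depth can be estimated as the total number of sequenced bases divided by the genome size. The sketch below is only an illustration: the function names and the ceiling of 200x are hypothetical and are not taken from the pipeline.

```python
def estimated_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def should_normalise(depth: float, max_depth: float = 200.0) -> bool:
    """True when the depth exceeds a user-chosen ceiling (hypothetical default)."""
    return depth > max_depth


# Example: 10 million reads of 150 bp (both mates counted) on a 5 Mb genome.
depth = estimated_depth(n_reads=10_000_000, read_length=150, genome_size=5_000_000)
print(f"estimated depth: {depth:.0f}x, normalise: {should_normalise(depth)}")
```

The appropriate ceiling depends on the organism and the library, so it should be treated as a parameter to tune rather than a fixed rule.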

4.1.4. Rules and configuration details

Here is a documented configuration file (../sequana/pipelines/denovo_assembly/config.yaml) to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. The rules are listed below with their developer and user documentation.

4.1.4.1. samtools_depth

Samtools depth

Creates a bed file with the coverage depth at each base position. It can process multiple bam files and concatenate the results.

Required input:
  • __samtools_depth__input: sorted bam file or list of bam files
Required output:
  • __samtools_depth__output: bed file
Config file:
No section required
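To illustrate what can be done with the per-base depth file, here is a minimal sketch that averages the depth for each contig. It assumes the usual tab-separated layout written by samtools depth for a single bam file (reference name, 1-based position, depth); the function name and the sample values are hypothetical.

```python
import io
from collections import defaultdict


def mean_depth_per_contig(handle):
    """Average the depth column for each reference sequence."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        # Columns: reference name, 1-based position, depth
        chrom, _pos, depth = line.rstrip("\n").split("\t")[:3]
        totals[chrom] += int(depth)
        counts[chrom] += 1
    return {chrom: totals[chrom] / counts[chrom] for chrom in totals}


# Made-up example with two contigs.
example = io.StringIO("contig_1\t1\t10\ncontig_1\t2\t14\ncontig_2\t1\t3\n")
print(mean_depth_per_contig(example))
```

Note that the mean is taken over the positions present in the file. If positions with zero depth were omitted when the file was produced, the value overestimates the true average.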

4.1.4.2. bwa

No documentation is available for this rule (no docstring found for bwa_mem_dynamic).

4.1.4.3. digital_normalisation

Assembly is very hard with ultra-deep sequencing data. Digital normalisation is a method to normalise coverage of a sample without reference. It provides as good or better results than assembling the unnormalised data.

Required input:
  • __digital_normalisation__input: list of paired fastq files
Required output:
  • __digital_normalisation__output: list of paired fastq files
Required parameters:
  • __digital_normalisation__prefix: output prefix
  • config["digital_normalisation"]["ksize"]: k-mer size used to normalise the coverage.
  • config["digital_normalisation"]["cutoff"]: when the median k-mer coverage level is above this number the read is not kept.
  • config["digital_normalisation"]["max_memory_usage"]: maximum amount of memory to use for data structure.
  • config["digital_normalisation"]["threads"]: number of threads to be used.
  • config["digital_normalisation"]["options"]: any options recognised by normalize-by-median.py.

Logs:
  • __digital_normalisation__log
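The roles of ksize and cutoff are easier to see in a toy version of the algorithm. The sketch below is not khmer's implementation: it uses an exact in-memory counter instead of a memory-bounded data structure, handles single reads rather than pairs, and assumes that a read is kept when its median k-mer count is strictly below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, ksize):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]


def digital_normalisation(reads, ksize=20, cutoff=20):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, ksize)
        if not read_kmers:
            continue  # read shorter than ksize: skipped in this sketch
        if median(counts[k] for k in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Five copies of one read plus one unrelated read.
reads = ["ACGTACGTAC"] * 5 + ["TTGCATGCAA"]
print(digital_normalisation(reads, ksize=4, cutoff=2))
```

Because only kept reads update the counts, redundant reads from over-sampled regions are dropped once that region has reached the target coverage, while reads from poorly covered regions still pass.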

4.1.4.4. format_contigs

Changes the names of the contigs to include the sample name and filters out contigs shorter than a threshold.

Required input:
  • __format_contigs__input: fasta file
Required output:
  • __format_contigs__output: fasta file
Required parameter:
  • config["format_contigs"]["threshold"]: when the contig length is lower than this number, the contig is not kept.
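The behaviour described above can be sketched in a few lines of plain Python. This is an illustration only: the function name and the "<sample>_<index>" naming scheme are hypothetical, and the actual rule may name contigs differently.

```python
def format_contigs(fasta_lines, sample, threshold):
    """Return (name, sequence) pairs for contigs with length >= threshold."""
    sequences = []
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences.append("".join(chunks))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        sequences.append("".join(chunks))
    # A contig whose length is lower than the threshold is not kept.
    kept = [seq for seq in sequences if len(seq) >= threshold]
    # Hypothetical naming scheme: sample name plus a running index.
    return [(f"{sample}_{i}", seq) for i, seq in enumerate(kept, start=1)]


fasta = [">NODE_1", "ACGTACGT", "ACGT", ">NODE_2", "ACG", ">NODE_3", "ACGTAC"]
for name, seq in format_contigs(fasta, sample="sampleA", threshold=5):
    print(f">{name}\n{seq}")
```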

4.1.4.5. freebayes

Freebayes is a variant caller designed to find SNPs and short INDELs. It can call multiple variants that segregate on the same read.

Required input:
  • __freebayes__input: sorted bam file
  • __freebayes__reference: reference fasta file
Required output:
  • __freebayes__output: vcf file
Required parameters:
  • config["freebayes"]["ploidy"]: set the ploidy of the sample.
  • config["freebayes"]["max-complex-gap"]: set the maximum distance between polymorphisms that segregate on the same read.
  • config["freebayes"]["options"]: other freebayes options.
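As an illustration of how these settings could map onto a freebayes call, the sketch below builds the command line as a list of arguments. It is an assumption about how the parameters are passed, not the rule's actual code; the function name, the file names and the config values are hypothetical.

```python
import shlex


def freebayes_command(bam, reference, vcf, ploidy, max_complex_gap, options=""):
    """Build a freebayes argument list from the rule's settings (sketch)."""
    cmd = [
        "freebayes",
        "--fasta-reference", reference,
        "--ploidy", str(ploidy),
        "--max-complex-gap", str(max_complex_gap),
        "--vcf", vcf,
    ]
    cmd += shlex.split(options or "")  # free-form extra options
    cmd.append(bam)                    # input bam file goes last
    return cmd


# Hypothetical config section and file names.
config = {"freebayes": {"ploidy": 1, "max-complex-gap": 3, "options": ""}}
cfg = config["freebayes"]
cmd = freebayes_command("sample.sorted.bam", "contigs.fasta", "sample.vcf",
                        cfg["ploidy"], cfg["max-complex-gap"], cfg["options"])
print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be passed directly to subprocess.run without shell quoting issues.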

4.1.4.6. quast

Quast calculates metrics to evaluate and compare de novo assemblies. It produces an HTML report.

Required input:
  • __quast__input: list of fasta files
  • __quast__reference: Reference genome file (optional)
  • __quast__genes: Gene positions in the reference genome.
Required output:
  • __quast__json: json file to create quast link in sequana_report.
Required parameters:
  • __quast__outdir: output directory
Logs:
  • __quast__log

4.1.4.7. snpeff

snpeff annotates the variants listed in a VCF file.

Required input:
  • __snpeff__input: vcf file
Required output:
  • __snpeff__output: vcf file
Required configuration:
snpeff:
    reference:  # the genbank file
    options:    # result filters options

4.1.4.8. spades

SPAdes is a de novo assembler designed for small genomes such as bacteria or fungi. This rule works only with paired-end files.

Required input:
  • __spades__input: list of paired fastq files
Required output:
  • __spades__contigs: fasta file
  • __spades__scaffolds: fasta file
Required parameters:
  • __spades__outdir: output directory
  • config["spades"]["k"]: comma-separated list of k-mer sizes.
  • config["spades"]["careful"]: tries to reduce number of mismatches and short indels.
  • config["spades"]["only_assembler"]: runs only assembling.
  • config["spades"]["memory"]: RAM limit for SPAdes in Gb.
  • config["spades"]["threads"]: number of threads to be used.
  • config["spades"]["options"]: any options recognised by spades.py.

Logs:
  • __spades__log
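The configuration keys above correspond closely to spades.py command-line options. The sketch below shows one plausible mapping. It is an assumption for illustration, not the rule's actual code; the function name, the file names and the config values are hypothetical.

```python
import shlex


def spades_command(fastq_r1, fastq_r2, outdir, cfg):
    """Build a spades.py argument list from a config section (sketch)."""
    cmd = ["spades.py", "-1", fastq_r1, "-2", fastq_r2, "-o", outdir]
    if cfg.get("k"):
        cmd += ["-k", str(cfg["k"])]       # comma-separated k-mer sizes
    if cfg.get("careful"):
        cmd.append("--careful")            # reduce mismatches and short indels
    if cfg.get("only_assembler"):
        cmd.append("--only-assembler")     # run only the assembly stage
    cmd += ["--memory", str(cfg["memory"]), "--threads", str(cfg["threads"])]
    cmd += shlex.split(cfg.get("options") or "")
    return cmd


# Hypothetical config section and file names.
spades_cfg = {"k": "21,33,55", "careful": True, "only_assembler": False,
              "memory": 16, "threads": 4, "options": ""}
cmd = spades_command("R1.fastq.gz", "R2.fastq.gz", "spades_out", spades_cfg)
print(shlex.join(cmd))
```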