4.5. Variant Calling

Overview:Variant calling from FASTQ files
Input:FASTQ files from Illumina Sequencing instrument
Output:VCF and HTML files

4.5.1. Usage

4.5.1.1. Command line interface

Example:

sequana --pipeline variant_calling \
        --input-directory data/ \
        --input-readtag _[12].fastq \
        --extention fastq.gz \
        --reference reference.fasta \
        --working-dir analysis
cd analysis
snakemake -s variant_calling.rules --stats stats.txt

4.5.2. Requirements

  • bwa
  • freebayes
  • picard (picard-tools)
  • sambamba
  • samtools
  • snpEff
https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/variant_calling/variant_calling_dag.png

4.5.3. Details

Snakemake variant calling pipeline is based on tutorial written by Erik Garrison. Input reads (paired or single) are mapped using bwa and sorted with sambamba-sort. PCR duplicates are marked with sambamba-markdup. Freebayes is used to detect SNPs and short INDELs. The INDEL realignment and base quality recalibration are not necessary with Freebayes. For more information, please refer to a post by Brad Chapman on minimal BAM preprocessing methods.

The pipeline provides an analysis of the mapping coverage using sequana coverage. It detects and characterises automatically low and high genome coverage regions.

Detected variants are annotated with SnpEff if a GenBank file is provided. The pipeline does the database building automatically. Although most of the species should be handled automatically, some special cases such as particular codon table will required edition of the snpeff configuration file.

Finally, joint calling is also available and can be switch on if desired.

4.5.4. Rules and configuration details

Here is a documented configuration file ../sequana/pipelines/variant_calling/config.yaml to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. Here are the rules and their developer and user documentation.

4.5.4.1. Mapping

4.5.4.1.1. Add locus in fasta

SnpEff requires the locus names in the annotation file and in the FASTA file (contig name) to be identical. To make this is true, this rule adds locus names of the genbank file into the FASTA file before the mapping.

Required input:
  • __snpeff_add_locus_in_fasta__input: FASTA file of the reference.
Required output:
  • __snpeff_add_locus_in_fasta__output: FASTA file with locus names.
Required configuration:
snpeff:
    reference:  # the genbank file
    options:    # result filters options

4.5.4.1.2. BWA

Read mapping with BWA-mem

BWA mem rule aligns single or paired end FASTQ files on a reference genome. This rule is a dynamic rule (see developers section). This means that it can be included multiple times with different names in the same pipeline. The reference file is indexed by BWA index and FASTQ files are mapped on it. The sorting is done with sambamba by reading the stdin. Moreover, Sambamba generates automatically the BAM index.

Required input:
  • __bwa_mem_%(name)s__fastq: FASTQ files single or paired end.
  • __bwa_mem_%(name)s__reference: FASTA file of the reference.
  • __bwa_mem_%(name)s__fai: Index file of the reference.
Required output:
  • __bwa_mem_%(name)s__bam: Sorted BAM file.
Log:
  • __bwa_index_%(name)s__log: Log file of BWA index.
  • __bwa_mem_%(name)s__log: Log file of BWA mem.
Required configuration:
bwa_mem_%(name)s:
    index_algorithm: 'is' # BWA index algorithm (is or bwtsw)
    threads: 4         # Number of threads used by BWA mem.
    options:           # Any options recognised by BWA mem tool.
    tmp_directory: '/tmp' # Temporary directory.
    sambamba_sort: -N  # -N for QC pipeline and "" for variant and denovo.
Note:
When a dynamic rule is included, template variables must be replaced.
References:

4.5.4.1.3. Sambamba markdup

This rule marks or removes PCR duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.

Required input:
  • __sambamba_markdup__input: Sorted BAM file.
Required output:
  • __sambamba_markdup__output: BAM file with duplicates marked or removed.
Required log:
  • __sambamba_markdup__log_std: Log file with stdout.
  • __sambamba_markdup__log_err: Log file with stderr.
Required configuration:
sambamba_markdup:
    remove: no # Remove or just mark duplicate reads.
    tmp_directory: /tmp # Temporary directory.
Reference:

4.5.4.1.4. Sambamba filter

This rule uses sambamba view to filter reads with a mapping quality lower than a threshold. It also removes reads with multiple occurrence.

Required input:
  • __sambamba_filter__input: bam file
Required output:
  • __sambamba_filter__output: bam file
Required log:
  • __sambamba_filter__log: Log file with stdout and stderr of sambamba.
Required configuration:
sambamba_filter:
    threshold: 30 # Mapping quality score threshold
Reference:

4.5.4.2. Variant Calling

4.5.4.2.1. Freebayes

Freebayes is a variant caller designed to find SNPs and short INDELs from a BAM file. It produces a very well-annotated VCF output. Moreover, it provides a quality score calculated by a bayesian model.

Required input:
  • __freebayes__input: Sorted BAM file.
  • __freebayes__reference: FASTA file of the reference genome.
Required output:
  • __freebayes__output: VCF file of detected variants.
Required log:
  • __freebayes__log: Log file with stdout and stderr of Freebayes.
Required configuration:
freebayes:
    ploidy: 1 # The sample ploidy.
    options: # Any options recognised by freebayes.
Note:
The ploidy correspond to the expected ploidy of the sample. For example, the ploidy must be set at 1 for a variant-calling on a bacteria sample.
Reference:

4.5.4.2.2. Freebayes filter

Variant filter rules dedicated to VCF files generated by freebayes. It filters with freebayes quality score, coverage depth, frequency and strand ratio.

Required input:
  • __freebayes_vcf_filter__input: VCF file from freebayes.
Required output:
  • __freebayes_vcf_filter__output: Filtered VCF file.
  • __freebayes_vcf_filter__csv: CSV file of filtered variants.
  • __freebayes_vcf_filter__html: HTML report.
Required parameters:
  • __freebayes_vcf_filter__report_dir: Report directory to copy JS/CSS.
Required configuration:
freebayes_vcf_filter:
    freebayes_score: 20 # threshold for minimum freebayes quality score.
    frequency: 0.7 # threshold for minimum alternative allele frequency.
    min_depth: 10 # threshold for minimum coverage depth.
    forward_depth: 3 # threshold for minimum coverage depth of forward strand.
    reverse_depth: 3 # threshold for minimum coverage depth of reverse strand.
    strand_ratio: 0.2 # threshold for minimum strand ratio between 0 and 0.5.

4.5.4.3. Joint variants calling

4.5.4.3.1. Joint Freebayes

Freebayes is a variant caller designed to find SNPs and short INDELs from a BAM file. It produces a very well-annotated VCF output. Moreover, it provides a quality score calculated by a bayesian model.

This rules does the variant calling jointly with all samples. It increases the power of the bayesian model.

Required input:
  • __joint_freebayes__input: List of sorted BAM files.
  • __joint_freebayes__reference: FASTA file of the reference genome.
Required output:
  • __joint_freebayes__output: VCF file of detected variants.
Required log:
  • __joint_freebayes__log: Log file with stdout and stderr of Freebayes.
Required parameter:
  • __joint_freebayes__ploidy: The ploidy of the samples.
Required configuration:
freebayes:
    options: # Any options recognised by freebayes.
Reference:

4.5.4.3.2. Joint Freebayes filter

Variant filter rules dedicated to VCF files generated by freebayes with multiple samples. It filters with freebayes quality score, coverage depth and frequency.

Required input:
  • __joint_freebayes_vcf_filter__input: VCF file from freebayes
Required output:
  • __joint_freebayes_vcf_filter__output: Filtered VCF file.
  • __joint_freebayes_vcf_filter__html: HTML report of detected variants.
Required parameters:
  • __freebayes_vcf_filter__report_dir: Report directory to copy JS/CSS.
Required configuration:
freebayes_vcf_filter:
    freebayes_score: 20 # threshold for minimum freebayes quality score.
    frequency: 0.7 # threshold for minimum alternative allele frequency.
    min_depth: 10 # threshold for minimum coverage depth.

4.5.4.4. Annotation

SnpEff adds annotation of variants detected in a VCF file. It annotates using the old 'EFF' field instead of 'ANN' field. The latter does not provide the codon change information.

Required input:
  • __snpeff__input: VCF file of detected variants.
Required output:
  • __snpeff__output: Annotated VCF file.
  • __snpeff__html: HTML report generate by snpEff.
Required configuration:
snpeff:
    reference: 'genes.gb' # The genbank file with annotation of the reference.
    options: '-no-downstream' # Any options
Results filter options:
  • -no-downstream: Do not show DOWNSTREAM changes
  • -no-intergenic: Do not show INTERGENIC changes
  • -no-intron: Do not show INTRON changes
  • -no-upstream: Do not show UPSTREAM changes
  • -no-utr: Do not show 5_PRIME_UTR or 3_PRIME_UTR changes
Reference:

4.5.4.5. Coverage analysis

4.5.4.5.1. Samtools depth

Samtools Depth creates a BED file with the coverage depth for each base position. It can also compute multiple BAM files and concatenate results in one BED file.

Required input:
  • __samtools_depth__input: Sorted BAM file or list of bam file.
Required output:
  • __samtools_depth__output: BED file with coverage for each base.
Required log:
  • __samtools_depth__log: Log file with stderr of samtools.
Reference:

4.5.4.5.2. Sequana coverage

Sequana coverage detects and characterises automatically low and high genome coverage regions. It provides a useful HTML report with dynamic plot and table. Moreover, a CSV file with all metrics computed is created. This CSV can be used to regenerate the sequana_coverage report.

Required input:
  • __sequana_coverage__bed: BED file with coverage by base position.
Optional input:
  • __sequana_coverage__fasta: FASTA file of the reference to compute GC content.
  • __sequana_coverage__gbk: Genbank file to annotate regions of interest. (Optional)

NOTE: If you do not need an optional input. Initiate it as an empty list.

Required output:
  • __sequana_coverage__csv: CSV file with computed metrics.
  • __sequana_coverage__html: HTML report of results.
Required parameter:
  • __sequana_coverage__report_dir: Report directory to copy JS/CSS.
Required configuration:
sequana_coverage:
    k: 2 # number of gaussian predicted.
    circular: yes # if your genome is circular.
    window_size: 30001 # window size to compute the running median.
    low_threshold: -4 # threshold to detect low coverage regions.
    high_threshold: 4 # threshold to detect high coverage regions.
    gc_window_size: 201 # window size to compute GC content.
Reference: