4.2. Quality control

Overview:Quality control, trimming (adapter removal) and taxonomic overview
Input:A set of FastQ files (paired or single-end)
Output:fastqc, cleaned FastQ files, Krona plot

4.2.1. Usage

sequana --pipeline quality_control --input-directory . --working-directory analysis --no-adapters

Or use Sequanix Tutorial interface.

4.2.2. Requirements

  • bwa
  • samtools
  • kraken
  • ktImportText
  • fastqc
  • pigz
  • cutadapt
  • atropos

4.2.3. Details

This pipeline is used to check the quality of the input FastQ files. It id used to remove the Phix (coliphage) that may be present in the data. Low quality reads are trimmed using a dedicated tool such as cutadapt or atropos. If one specifies the quality trimming option in the config file, then we trim low-quality ends from reads BEFORE adapter removal.

The quality trimming algorithm from cutadapt/atropos is the same as in BWA. That is: substract the cutoff (e.g. 30) from all qualities; compute partial sums from the end of the sequence; cut the sequence at the index at which the sum is minimal.

# Original qualities
42, 40, 26, 27, 8, 7, 11, 4, 2, 3
# Subtracting the threshold gives:
32, 30, 16, 17, -2, -3, 1, -6, -8, -7
# Partial sum from the end. Stop early if the sum is greater than zero:
(70), (38), 8, -8, -25, -23, -20, -21, -15, -7

Minimum is -25, we keep the bases 1,2,3,4:

42, 40, 26, 27

Another important point is that all searches for adapter sequences are error tolerant (allowing errors such as mismatches, insertions and deletions). The level of error tolerance is 10% by default.

Another optional step is the taxonomy analysis. This is performed with Kraken using a dedicated database. We provide a couple of database or tools to download them. See sequana_taxonomy tool to download databases. The minikraken database is provided by the author of Kraken while sequana_db1 is a 8Gb database as described here https://github.com/sequana/data/tree/master/sequana_db1

sequana_taxonomy --download minikraken
sequana_taxonomy --download sequana_db1

For the second database, you will need synapseclient:

pip install synapseclient

and an account on synapse website (https://www.synapse.org/).

4.2.4. Rules and configuration details

Here is a documented configuration file ../sequana/pipelines/quality_control/config.yaml to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. In the quality_control pipeline, we use the bwa_mem, fastqc, cutadapt, fastq_stats and kraken rules described here below. FastQC

Calls FastQC on input data sets (paired or not)

This rule is a dynamic rule (see developers guide). It can be included in a pipeline with different names. For instance in the quality_control pipeline, it is used as fastqc_samples and fastqc_phix. Here below, the string %(name)s must be replaced by the appropriate dynamic name.

Required input:
  • __fastqc_%(name)s__input_fastq:
Required output:
  • __fastqc_%(name)s__output_done
Required parameters
  • __fastqc_%(name)s__wkdir: the working directory
  • logs/fastqc_%(name)s/fastqc.log
Required configuration:
    options: "-nogroup"   # a string with fastqc options
References: Cutadapt

Cutadapt (adapter removal)

Required input:
  • __cutadapt__input_fastq
Required output:
  • __cutadapt__output
Required parameters:
  • __cutadapt__fwd: forward adapters as a file, or string
  • __cutadapt__rev: reverse adapters as a file, or string
  • __cutadapt__options,
  • __cutadapt__mode, # g for 5' adapter, a for 3' and b for both 5'/3' (see cutadapt doc for details)
  • __cutadapt__wkdir,
  • __cutadapt__design,
  • __cutadapt__design_adapter,
  • __cutadapt__sample
Other requirements:
  • __cutadapt__log
Required configuration:
    do: yes
    tool_choice: cutadapt
    design: "%(adapter_design)s"
    adapter_choice: "%(adapter_type)s"
    fwd: "%(adapter_fwd)s"
    rev: "%(adapter_rev)s"
    m: 20   # cutoff
    mode: "g"   # g for 5' adapter, a for 3' and b for both 5'/3'
    quality: "30"
    options: "-O 6 --trim-n"
http://cutadapt.readthedocs.io/en/stable/index.html Kraken

Kraken taxonomic sequence classification system

Required input:
  • __kraken__input
Required output:
  • __kraken__output_wkdir: working directory
  • __kraken__output: the kraken final output
  • __kraken__output_csv: summary in csv format
  • __kraken__output_json: summary in json format
    database_directory:  # a valid path to a Kraken database

See KrakenBuilder to build your own database or visit https://github.com/sequana/data for a database toy example.

References: BWA-mem

Read mapping with BWA-mem

BWA mem rule aligns single or paired end FASTQ files on a reference genome. This rule is a dynamic rule (see developers section). This means that it can be included multiple times with different names in the same pipeline. The reference file is indexed by BWA index and FASTQ files are mapped on it. The sorting is done with sambamba by reading the stdin. Moreover, Sambamba generates automatically the BAM index.

Required input:
  • __bwa_mem_%(name)s__fastq: FASTQ files single or paired end.
  • __bwa_mem_%(name)s__reference: FASTA file of the reference.
  • __bwa_mem_%(name)s__fai: Index file of the reference.
Required output:
  • __bwa_mem_%(name)s__bam: Sorted BAM file.
  • __bwa_index_%(name)s__log: Log file of BWA index.
  • __bwa_mem_%(name)s__log: Log file of BWA mem.
Required configuration:
    index_algorithm: 'is' # BWA index algorithm (is or bwtsw)
    threads: 4         # Number of threads used by BWA mem.
    options:           # Any options recognised by BWA mem tool.
    tmp_directory: '/tmp' # Temporary directory.
When a dynamic rule is included, template variables must be replaced.
References: FastQ stats

Analyse FastQ files to extract basic stats

Creates boxplot quality image + GC content + basic stats in JSON file

Required input:
  • __fastq_stats_%(name)s__input_fastq:
Required output:
  • __fastq_stats_%(name)s__output_done
Required parameters
  • __fastq_stats_%(name)s__wkdir: the working directory

This rule creates files that are not in the required output namely a _gc.png _boxplot.png_ and .json files with filename set to the original FastQ filename.

Note: analyses only first 500,000 reads