4.2. Quality control

Overview: Quality control, trimming (adapter removal) and taxonomic overview
Input: A set of FastQ files (paired or single-end)
Output: FastQC reports, cleaned-up FastQ files

4.2.1. Usage

sequana --pipeline quality_control --input-directory . --output-directory analysis

4.2.2. Requirements

  • bwa
  • samtools
  • kraken
  • ktImportText
  • fastqc
  • pigz
  • cutadapt
  • atropos

4.2.4. Details

The adapters are removed using cutadapt. If the quality trimming option is set in the config file, low-quality ends are trimmed from the reads before adapter removal.

The quality trimming algorithm is the same as in BWA: subtract the cutoff (e.g. 30) from all qualities, compute partial sums from the end of the sequence, and cut the sequence at the index at which the sum is minimal.

# Original qualities
42, 40, 26, 27, 8, 7, 11, 4, 2, 3
# Subtracting the threshold (10 in this example) gives:
32, 30, 16, 17, -2, -3, 1, -6, -8, -7
# Partial sum from the end. Stop early if the sum is greater than zero:
(70), (38), 8, -8, -25, -23, -20, -21, -15, -7

The minimum is -25, reached at the fifth base, so the read is cut there and bases 1 to 4 are kept:

42, 40, 26, 27
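
The steps above can be sketched in a few lines of Python. This is a minimal illustration of the algorithm as described in this section, not the code used by cutadapt or BWA; the function name quality_trim_index is chosen here for the example.

```python
def quality_trim_index(qualities, cutoff):
    """Return how many leading bases to keep after trimming the 3' end."""
    running_sum = 0
    min_sum = 0
    keep = len(qualities)  # by default nothing is trimmed
    # Walk from the last base towards the first one
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            break  # stop early once the partial sum becomes positive
        if running_sum < min_sum:
            min_sum = running_sum
            keep = i  # cut the read at the index where the sum is minimal
    return keep


qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(qualities, cutoff=10)
trimmed = qualities[:keep]
print(trimmed)
```

With the qualities of the worked example and a cutoff of 10, the function keeps the first four bases.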

Another important point is that all searches for adapter sequences are error-tolerant: mismatches, insertions and deletions are allowed. The error tolerance is 10% by default.
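
To give a feel for what a 10% tolerance means in practice, here is a small sketch. It assumes, as the cutadapt documentation describes, that the number of tolerated errors is the length of the matched adapter region multiplied by the error rate and rounded down. The helper allowed_errors is hypothetical and only serves the illustration.

```python
def allowed_errors(match_length, error_rate_percent=10):
    """Number of errors tolerated for a match of the given length.

    Assumption: errors = floor(match_length * error_rate).
    Integer arithmetic avoids floating-point rounding surprises.
    """
    return (match_length * error_rate_percent) // 100


for length in (9, 10, 19, 20):
    print(length, allowed_errors(length))
```

Under this assumption, matches shorter than 10 bases must be exact, and a 20-base adapter match may contain up to 2 errors.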

4.2.5. Rules and configuration details

Here is a documented configuration file ../sequana/pipelines/quality_control/config.yaml to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. Here are the rules and their developer and user documentation.

FastQC

Calls FastQC on input data sets (paired or not)

This rule is a dynamic rule, meaning that it can be included in a pipeline under different names. For instance, in the quality_control pipeline it is used as fastqc_samples and fastqc_phix. Below, the string %(name)s must be replaced by the appropriate dynamic name.
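
The %(name)s placeholder follows Python's standard string formatting syntax, so the substitution can be illustrated as follows. This is only a sketch of the naming convention. The list TEMPLATE_KEYS and the function expand are hypothetical names for this example and do not describe how Sequana performs the substitution internally.

```python
# Variable names of the FastQC rule as listed in this section
TEMPLATE_KEYS = [
    "__fastqc_%(name)s__input_fastq",
    "__fastqc_%(name)s__output_done",
    "__fastqc_%(name)s__wkdir",
]


def expand(name):
    """Replace %(name)s by the dynamic name in every template key."""
    return [key % {"name": name} for key in TEMPLATE_KEYS]


for dynamic_name in ("samples", "phix"):
    print(expand(dynamic_name))
```

For the dynamic name samples, the first variable becomes __fastqc_samples__input_fastq.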

Required input:
  • __fastqc_%(name)s__input_fastq
Required output:
  • __fastqc_%(name)s__output_done
Required parameters:
  • __fastqc_%(name)s__wkdir: the working directory
Required configuration:
    options: "-nogroup"   # a string with fastqc options
References:

Cutadapt

Cutadapt (adapter removal)

Required input:
  • __cutadapt__input_fastq
Required output:
  • __cutadapt__output
Required parameters:
  • __cutadapt__fwd: forward adapters as a file, or string
  • __cutadapt__rev: reverse adapters as a file, or string
  • __cutadapt__options
  • __cutadapt__mode: g for 5' adapter, a for 3' and b for both 5'/3' (see cutadapt doc for details)
  • __cutadapt__wkdir
  • __cutadapt__design
  • __cutadapt__design_adapter
  • __cutadapt__sample
Other requirements:
  • __cutadapt__log
Required configuration:
    do: yes
    tool_choice: cutadapt
    design: "%(adapter_design)s"
    adapter_choice: "%(adapter_type)s"
    fwd: "%(adapter_fwd)s"
    rev: "%(adapter_rev)s"
    m: 20   # cutoff
    mode: "g"   # g for 5' adapter, a for 3' and b for both 5'/3'
    quality: "30"
    options: "-O 6 --trim-n"
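
To show how such a configuration section could translate into a cutadapt call, here is a hedged sketch for the single-end case. The helper build_cutadapt_command is hypothetical and is not Sequana's implementation. In particular, mapping the m field to cutadapt's -m (minimum length) option and the quality field to -q are assumptions made for this example; the flags themselves (-g, -a, -b, -m, -q, -O, --trim-n, -o) are standard cutadapt options.

```python
import shlex


def build_cutadapt_command(cfg, fastq_in, fastq_out):
    """Assemble a single-end cutadapt command line from a config dictionary.

    Hypothetical helper for illustration only.
    """
    cmd = ["cutadapt"]
    # mode selects the adapter flag: -g (5'), -a (3') or -b (both)
    cmd += ["-" + cfg["mode"], cfg["fwd"]]
    # Assumption: m is the minimum read length kept after trimming
    cmd += ["-m", str(cfg["m"])]
    # Assumption: quality is the cutoff used for quality trimming
    cmd += ["-q", str(cfg["quality"])]
    # Extra options are passed through unchanged
    cmd += shlex.split(cfg["options"])
    cmd += ["-o", fastq_out, fastq_in]
    return cmd


cfg = {"mode": "g", "fwd": "ACGT", "m": 20, "quality": "30",
       "options": "-O 6 --trim-n"}
print(" ".join(build_cutadapt_command(cfg, "in.fastq", "out.fastq")))
```

The adapter sequence ACGT and the file names are placeholders. Paired-end data would need additional options for the reverse adapter and the second output file.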
References: http://cutadapt.readthedocs.io/en/stable/index.html

Kraken

Kraken taxonomic sequence classification system

Required input:
  • __kraken__input
Required output:
  • __kraken__output_wkdir: working directory
  • __kraken__output: the kraken final output
  • __kraken__output_csv: summary in csv format
  • __kraken__output_json: summary in json format
Required configuration:
    database_directory:  # a valid path to a Kraken database

See KrakenBuilder to build your own database, or visit https://github.com/sequana/data for a toy example database.