4.3. RNA-seq

Overview:RNASeq
Input:FastQ raw data from Illumina Sequencer (either paired or not)
Output:BAM, count and HTML files

4.3.1. Usage

Example:

sequana --pipeline rnaseq --input-dir . --output-directory analysis --no-adapters
cd analysis
srun snakemake -s rnaseq.rules --stats stats.txt -p -j 12 --nolock --cluster-config cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads}"

4.3.2. Requirements

  • cutadapt
  • picard-tools
  • bowtie
  • bowtie2
  • multiqc
  • STAR
  • fastq_screen
  • featureCounts [subread]
https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/rnaseq/dag.png

4.3.3. Details

Snakemake RNA-seq pipeline based on workflow use at Biomics Pole in Institut Pasteur. The pipeline runs some QC, such as FastQC, fastq_screen (you need your own base). Reads could be trimmed by several tools (cutadapt, atropos, clean_ngs) and mapped against reference genome (with bowtie or STAR) and ribosomal RNA (with bowtie1). After, reads are counted with feature-counts (HTSeq-count soon available) against a GFF file. All results are summarized by multiQC.

4.3.4. Rules and configuration details

Here is a documented configuration file ../sequana/pipelines/rnaseq/config.yaml to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. Here are the rules and their developer and user documentation.

4.3.4.1. FastQC

Calls FastQC on input data sets (paired or not)

This rule is a dynamic rule. Meaning that it can be included in a pipeline with different names. For instance in the quality_control pipeline, it is used as fastqc_samples and fastqc_phix. Here below, the string %(name)s must be replaced by the appropriate dynamic name.

Required input:
  • __fastqc_%(name)s__input_fastq:
Required output:
  • __fastqc_%(name)s__output_done
Required parameters
  • __fastqc_%(name)s__wkdir: the working directory
Required configuration:
fastqc:
    options: "-nogroup"   # a string with fastqc options
References:

4.3.4.2. Fastq_screen

FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

Required input:
__fastq_screen__input: a output fastq_screen directory
Required output:
__fastq_screen__output: fastq_screen directory results
Config:
fastq_screen:
    conf:  # a valid path to a fastq_screen config file

4.3.4.3. Cutadapt

Cutadapt (adapter removal)

Required input:
  • __cutadapt__input_fastq
Required output:
  • __cutadapt__output
Required parameters:
  • __cutadapt__fwd: forward adapters as a file, or string
  • __cutadapt__rev: reverse adapters as a file, or string
  • __cutadapt__options,
  • __cutadapt__mode, # g for 5' adapter, a for 3' and b for both 5'/3' (see cutadapt doc for details)
  • __cutadapt__wkdir,
  • __cutadapt__design,
  • __cutadapt__design_adapter,
  • __cutadapt__sample
Other requirements:
  • __cutadapt__log
Required configuration:
cutadapt:
    do: yes
    tool_choice: cutadapt
    design: "%(adapter_design)s"
    adapter_choice: "%(adapter_type)s"
    fwd: "%(adapter_fwd)s"
    rev: "%(adapter_rev)s"
    m: 20   # cutoff
    mode: "g"   # g for 5' adapter, a for 3' and b for both 5'/3'
    quality: "30"
    options: "-O 6 --trim-n"
References:
http://cutadapt.readthedocs.io/en/stable/index.html

4.3.4.4. Mapping on rRNA

no docstring found for bowtie1_mapping_dynamic

4.3.4.5. Mapping on reference genome

4.3.4.6. Bowtie1

no docstring found for bowtie1_mapping_dynamic

4.3.4.7. STAR

Read mapping for either single end and paired end data using RNA-STAR.

Required input:
__star_mapping__input: list with one or two fastq.gz
Required output:
__star_mapping__sort: output sorted bam file

params:

__star_mapping__output_prefix1: output prefix for first-pass mapping (temporary file) __star_mapping__output_prefix2: output prefix for second-pass mapping __star_mapping__genome_dir: name of directory for new genome indexation (temporary file) __star_mapping__splice_file: name of the file containing splicing information get during firt-pass alignment

config:

star_mapping:
    prefix_index: "" #path to the index file of reference genome
    ref: "" #path to the reference genome file in fasta format
    options:  "" #options for bowtie1 you want use

4.3.4.8. Counting

Feature counts (subread) is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. :reference: http://bioinf.wehi.edu.au/featureCounts/

Required input:
__feature_counts__input: sorted bam file
Required output:
__feature_counts__output_count: output tabulated-delimited file __feature_counts__output_gene_count: output formatted tab-delimited file

Config:

feature_counts:
    gff: " " #path to the GFF/GTF annotation file
    options:  " " #options for featureCounts you want use

4.3.4.9. Reporting

MultiQC aggregates results from bioinformatics analyses across many samples into a single report.

It searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.

reference:http://multiqc.info/
Required input:
__multiqc__input_dir: an input directory where to find data and logs
Required output:
__multiqc__output: multiqc_report.html in the input directory

Config:

multiqc:
    excluded: "-x *.zip " Ignore analysis files (glob expression)
    output-directory:  " " #name of the output directory where to write results
note:if the directory exists, it is overwritten