4.3. RNA-seq

Overview:RNASeq: Differential expressed genes analysis
Input:FastQ raw data from Illumina Sequencer (either paired or not)
Output:BAM, count and HTML files

4.3.1. Usage

Example:

sequana --pipeline rnaseq -i data/ -o analysis --no-adapters
cd analysis
sbatch snakemake -s rnaseq.rules --stats stats.txt -p -j 12 --nolock --cluster-config cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads}"

Or use Sequanix Tutorial interface.

4.3.2. Requirements

  • cutadapt
  • picard-tools
  • bowtie
  • bowtie2
  • multiqc
  • STAR
  • fastq_screen
  • featureCounts [subread]
https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/rnaseq/dag.png

4.3.3. Details

Snakemake RNA-seq pipeline s based on a workflow used at Biomics Pole in Institut Pasteur. The pipeline runs some quality control (e.g., FastQC), fastq_screen (you need your own database). Reads could be trimmed by several tools (cutadapt, atropos, clean_ngs) and mapped against a reference genome (with bowtie or STAR, bowtie2 is used by fastq_screen) and ribosomal RNA (with bowtie1). Then, reads are counted with feature-counts (HTSeq-count soon available) against a GFF file. All results are summarized using multiQC.

Warning

The statistical analysis is not included in our pipeline because it is a step that is difficult to automate before the exploration of the data. However, you can perform this analysis with SARTools (https://github.com/PF2-pasteur-fr/SARTools).

4.3.4. Rules and configuration details

Here is a documented configuration file ../sequana/pipelines/rnaseq/config.yaml to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. Here are the rules and their developer and user documentation.

4.3.4.1. FastQC

FastQC is used to check quality of sequenced reads.

Calls FastQC on input data sets (paired or not)

This rule is a dynamic rule (see developers guide). It can be included in a pipeline with different names. For instance in the quality_control pipeline, it is used as fastqc_samples and fastqc_phix. Here below, the string %(name)s must be replaced by the appropriate dynamic name.

Required input:
  • __fastqc_%(name)s__input_fastq:
Required output:
  • __fastqc_%(name)s__output_done
Required parameters
  • __fastqc_%(name)s__wkdir: the working directory
Log:
  • logs/fastqc_%(name)s/fastqc.log
Required configuration:
fastqc:
    options: "-nogroup"   # a string with fastqc options
References:

4.3.4.2. Fastq_screen

Fastq_screen is used to search any contamination in the data. An internal database is mandatory (with bowtie2 indexes).

FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

Required input:
__fastq_screen__input: a output fastq_screen directory
Required output:
__fastq_screen__output: fastq_screen directory results
Config:
fastq_screen:
    conf:  # a valid path to a fastq_screen config file

4.3.4.3. Cutadapt

Cutadapt is used to trim and filter sequences.

Cutadapt (adapter removal)

Required input:
  • __cutadapt__input_fastq
Required output:
  • __cutadapt__output
Required parameters:
  • __cutadapt__fwd: forward adapters as a file, or string
  • __cutadapt__rev: reverse adapters as a file, or string
  • __cutadapt__options,
  • __cutadapt__mode, # g for 5' adapter, a for 3' and b for both 5'/3' (see cutadapt doc for details)
  • __cutadapt__wkdir,
  • __cutadapt__design,
  • __cutadapt__design_adapter,
  • __cutadapt__sample
Other requirements:
  • __cutadapt__log
Required configuration:
cutadapt:
    do: yes
    tool_choice: cutadapt
    design: "%(adapter_design)s"
    adapter_choice: "%(adapter_type)s"
    fwd: "%(adapter_fwd)s"
    rev: "%(adapter_rev)s"
    m: 20   # cutoff
    mode: "g"   # g for 5' adapter, a for 3' and b for both 5'/3'
    quality: "30"
    options: "-O 6 --trim-n"
References:
http://cutadapt.readthedocs.io/en/stable/index.html

4.3.4.4. Mapping on rRNA

In order to estimate rRNA rate, a bowtie alignment on ribosomal sequences is performed

bowtie1_mapping_dynamic

Read mapping for either single end and paired end data using Bowtie1.

Required input:
__bowtie1_mapping_%(name)s__input: list with one or two fastq.gz
Required output:
__bowtie1_mapping_%(name)s__bam: output bam file __bowtie1_mapping_%(name)s__sort: output sorted bam file

params:

__bowtie1_mapping_%(name)s__prefix_index: path to the index file of reference genome

config:

bowtie:
    options:  "" #options for bowtie1 you want use

4.3.4.5. Mapping on reference genome

According to your genome, you can choose to align sequences with bowtie1 or STAR, or both.

4.3.4.6. Bowtie1

bowtie1_mapping_dynamic

Read mapping for either single end and paired end data using Bowtie1.

Required input:
__bowtie1_mapping_%(name)s__input: list with one or two fastq.gz
Required output:
__bowtie1_mapping_%(name)s__bam: output bam file __bowtie1_mapping_%(name)s__sort: output sorted bam file

params:

__bowtie1_mapping_%(name)s__prefix_index: path to the index file of reference genome

config:

bowtie:
    options:  "" #options for bowtie1 you want use

4.3.4.7. STAR

Read mapping for either single end and paired end data using RNA-STAR.

Required input:
__star_mapping__input: list with one or two fastq.gz
Required output:
__star_mapping__sort: output sorted bam file

params:

__star_mapping__output_prefix1: output prefix for first-pass mapping (temporary file) __star_mapping__output_prefix2: output prefix for second-pass mapping __star_mapping__genome_dir: name of directory for new genome indexation (temporary file) __star_mapping__splice_file: name of the file containing splicing information get during firt-pass alignment

config:

star_mapping:
    prefix_index: "" #path to the index file of reference genome
    ref: "" #path to the reference genome file in fasta format
    options:  "" #options for bowtie1 you want use

4.3.4.8. Counting

FeatureCounts is used for counting

Feature counts (subread) is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. :reference: http://bioinf.wehi.edu.au/featureCounts/

Required input:
__feature_counts__input: sorted bam file
Required output:
__feature_counts__output_count: output tabulated-delimited file __feature_counts__output_gene_count: output formatted tab-delimited file

Config:

feature_counts:
    gff: " "       #path to the GFF/GTF annotation file
    options:  " "  #options for featureCounts you want use

4.3.4.9. Reporting

MultiQC allows to report all bioinformatics tools in a same html file.

MultiQC aggregates results from bioinformatics analyses across many samples into a single report.

It searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.

reference:http://multiqc.info/
Required input:
__multiqc__input_dir: an input directory where to find data and logs
Required output:
__multiqc__output: multiqc_report.html in the input directory

Config:

multiqc:
    options: "-c multiqc_config.yaml -f -x *.zip -e htseq" #any options recognised by multiqc
    output-directory:  " " #name of the output directory where to write results
note:if the directory exists, it is overwritten