4.4. Small RNA-seq

Overview: Small RNA-seq (miRNA) analysis
Input: FastQ raw data from an Illumina sequencer (single-end only)
Output: BAM, count and HTML files

4.4.1. Usage

Example:

sequana --pipeline smallrnaseq --input-dir .  --output-directory analysis --adapters TruSeq
cd analysis
srun snakemake -s smallrnaseq.rules --stats stats.txt -p -j 12 --nolock --cluster-config cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads}" --restart-times 2
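
The --cluster-config option above points to a JSON file whose keys fill the {cluster.ram} placeholder in the --cluster string. The following is a minimal sketch of how such a file could be generated. The "__default__" section is the Snakemake convention for settings applied to every rule; the helper name write_cluster_config and the memory value are assumptions made for illustration, not values shipped with the pipeline.

```python
import json


def write_cluster_config(path="cluster_config.json", default_ram_mb=4000):
    """Write a minimal Snakemake cluster configuration file.

    The "ram" key matches the {cluster.ram} placeholder used in the
    --cluster string. The default value is an illustrative assumption.
    """
    config = {
        # "__default__" applies to every rule that has no section of its own
        "__default__": {"ram": default_ram_mb},
    }
    with open(path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

Per-rule sections can be added next to "__default__" when a given rule needs more memory than the others.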

4.4.2. Requirements

  • cutadapt
  • picard-tools
  • bowtie
  • multiqc
  • fastq_screen
Pipeline DAG (image): https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/smallrnaseq/dag.png

4.4.3. Details

This pipeline maps and counts reads on mature and hairpin sequences (to be downloaded from miRBase) and performs some QC on the data. All results are summarized by MultiQC.

4.4.4. Rules and configuration details

A documented configuration file, ../sequana/pipelines/smallrnaseq/config.yaml, is provided for use with the pipeline. Each rule used in the pipeline may have a section in the configuration file. The rules are listed below with their developer and user documentation.

4.4.4.1. FastQC

Calls FastQC on input data sets (paired or not)

This rule is a dynamic rule (see the developers guide). It can be included in a pipeline under different names. For instance, in the quality_control pipeline it is used as fastqc_samples and fastqc_phix. In the names below, the string %(name)s must be replaced by the appropriate dynamic name.
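
The %(name)s placeholder follows Python's mapping-based string formatting, so the substitution can be sketched as below. This assumes the dynamic name is the suffix of the rule name (for example "samples" for fastqc_samples); the helper expand_rule_variables is hypothetical and only illustrates the naming scheme.

```python
def expand_rule_variables(name):
    """Return the FastQC rule variable names for a given dynamic name."""
    templates = [
        "__fastqc_%(name)s__input_fastq",
        "__fastqc_%(name)s__output_done",
        "__fastqc_%(name)s__wkdir",
        "logs/fastqc_%(name)s/fastqc.log",
    ]
    # %-formatting with a mapping fills every %(name)s placeholder
    return [template % {"name": name} for template in templates]


for variable in expand_rule_variables("samples"):
    print(variable)
```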

Required input:
  • __fastqc_%(name)s__input_fastq
Required output:
  • __fastqc_%(name)s__output_done
Required parameters:
  • __fastqc_%(name)s__wkdir: the working directory
Log:
  • logs/fastqc_%(name)s/fastqc.log
Required configuration:
fastqc:
    options: "-nogroup"   # a string with fastqc options
References:

4.4.4.2. Fastq_screen

FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

Required input:
__fastq_screen__input: an output fastq_screen directory
Required output:
__fastq_screen__output: fastq_screen directory results
Config:
fastq_screen:
    conf:  # a valid path to a fastq_screen config file

4.4.4.3. Cutadapt

Cutadapt (adapter removal)

Required input:
  • __cutadapt__input_fastq
Required output:
  • __cutadapt__output
Required parameters:
  • __cutadapt__fwd: forward adapters as a file, or string
  • __cutadapt__rev: reverse adapters as a file, or string
  • __cutadapt__options
  • __cutadapt__mode: g for 5' adapter, a for 3' and b for both 5'/3' (see cutadapt doc for details)
  • __cutadapt__wkdir
  • __cutadapt__design
  • __cutadapt__design_adapter
  • __cutadapt__sample
Other requirements:
  • __cutadapt__log
Required configuration:
cutadapt:
    do: yes
    tool_choice: cutadapt
    design: "%(adapter_design)s"
    adapter_choice: "%(adapter_type)s"
    fwd: "%(adapter_fwd)s"
    rev: "%(adapter_rev)s"
    m: 20   # cutoff
    mode: "g"   # g for 5' adapter, a for 3' and b for both 5'/3'
    quality: "30"
    options: "-O 6 --trim-n"
References:
http://cutadapt.readthedocs.io/en/stable/index.html
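
To make the role of each configuration key concrete, here is a rough sketch of how the cutadapt section could translate into a single-end command line. It is not the rule's actual implementation: the function build_cutadapt_command is hypothetical, the mapping of m to cutadapt's -m (minimum length) and of quality to -q is an assumption, and fwd is assumed to hold a literal adapter sequence rather than a file.

```python
import shlex


def build_cutadapt_command(config, fastq_in, fastq_out):
    """Compose a single-end cutadapt command line from a config section."""
    section = config["cutadapt"]
    command = ["cutadapt"]
    # mode selects the adapter flag: -g (5'), -a (3') or -b (both)
    command += ["-" + section["mode"], section["fwd"]]
    # m: assumed to be the minimum read length kept after trimming
    command += ["-m", str(section["m"])]
    # quality: assumed to be the quality-trimming cutoff
    command += ["-q", str(section["quality"])]
    # free-form extra options, e.g. "-O 6 --trim-n"
    command += shlex.split(section["options"])
    command += ["-o", fastq_out, fastq_in]
    return command


example_config = {
    "cutadapt": {
        "mode": "a",
        "fwd": "ACGTACGT",  # made-up adapter sequence
        "m": 20,
        "quality": "30",
        "options": "-O 6 --trim-n",
    }
}
print(" ".join(build_cutadapt_command(example_config, "in.fastq.gz", "out.fastq.gz")))
```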

4.4.4.4. Bowtie1

Mapping on hairpin and mature sequences. The corresponding fasta files must be specified in the config file.

bowtie1_mapping_dynamic

Read mapping for either single-end or paired-end data using Bowtie1.

Required input:
__bowtie1_mapping_%(name)s__input: list with one or two fastq.gz
Required output:
__bowtie1_mapping_%(name)s__bam: output bam file
__bowtie1_mapping_%(name)s__sort: output sorted bam file

params:

__bowtie1_mapping_%(name)s__prefix_index: path to the index file of reference genome

config:

bowtie:
    options:  "" # options for bowtie1 that you want to use
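
The input for this rule is a list with one or two FASTQ files, which decides between single-end and paired-end mapping. A possible way to assemble the corresponding bowtie call is sketched below. The function build_bowtie1_command and its threads argument are hypothetical, and the conversion of the SAM output to the BAM and sorted BAM files is deliberately left out.

```python
import shlex


def build_bowtie1_command(prefix_index, fastq_files, options="", threads=1):
    """Compose a bowtie (version 1) command line with SAM output."""
    if len(fastq_files) == 1:
        # single-end: the reads file is a positional argument
        reads = [fastq_files[0]]
    elif len(fastq_files) == 2:
        # paired-end: mates are passed with -1 and -2
        reads = ["-1", fastq_files[0], "-2", fastq_files[1]]
    else:
        raise ValueError("expected one or two FASTQ files")
    # -S requests SAM output, -p sets the number of threads
    command = ["bowtie", "-S", "-p", str(threads)]
    command += shlex.split(options)  # extra options from the config file
    command += [prefix_index] + reads
    return command
```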

4.4.4.5. Counting

miRNA_count_dynamic

The miRNA_count rule counts how many reads are mapped on each feature in a BAM file. This can be done for mappings on miRNA (hairpin and mature sequences from miRBase), but also on a transcriptome (cDNA file).

Required input:
__miRNA_count__input: bam file
Required output:
__miRNA_count__output: tabulated file
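
As an illustration of what counting reads per feature means, the sketch below tallies mapped records per reference sequence from SAM-formatted text lines (for instance the output of samtools view -h on the BAM file). It is not the rule's actual implementation; the function count_reads_per_feature and the example records are made up for this illustration.

```python
from collections import Counter


def count_reads_per_feature(sam_lines):
    """Count mapped SAM records per reference sequence name."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a valid alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        counts[fields[2]] += 1  # third column is the reference name
    return counts


example = [
    "@SQ\tSN:featureA\tLN:72",
    "read1\t0\tfeatureA\t5\t255\t22M\t*\t0\t0\t*\t*",
    "read2\t16\tfeatureA\t8\t255\t21M\t*\t0\t0\t*\t*",
    "read3\t0\tfeatureB\t1\t255\t22M\t*\t0\t0\t*\t*",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
for feature, count in sorted(count_reads_per_feature(example).items()):
    print(feature, count, sep="\t")
```

Note that this counts alignment records, so a read reported at several locations would be counted once per location.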

4.4.4.6. Reporting

MultiQC aggregates results from bioinformatics analyses across many samples into a single report.

It searches a given directory for analysis logs and compiles an HTML report. It is a general-purpose tool, well suited to summarising the output from numerous bioinformatics tools.

reference: http://multiqc.info/
Required input:
__multiqc__input_dir: an input directory where to find data and logs
Required output:
__multiqc__output: multiqc_report.html in the input directory

Config:

multiqc:
    options: "-c multiqc_config.yaml -f -x *.zip -e htseq" #any options recognised by multiqc
    output-directory:  " " #name of the output directory where to write results
note: if the directory exists, it is overwritten