4.7. Quality control Pacbio

Overview: Quality control for PacBio BAM data (raw data)
Input: A set of BAM files
Output: report_{sample}.html, where sample is the name of an input file

4.7.1. Usage

First copy the BAM files into a directory. Start Sequanix in that directory as follows (input BAM files must have the extension .bam):

sequanix -w analysis -i . -p pacbio_qc

You are ready to go. To exclude some BAM files, use the pattern field in the ‘input data’ tab.
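As a rough illustration of what such a pattern selects, the sketch below filters file names with a shell-style glob and derives the report name expected for each match. It uses only the Python standard library. The helper names `select_bams` and `report_name` are hypothetical and not part of Sequanix, and the sketch assumes that the sample name is the file name without its `.bam` extension.

```python
from fnmatch import fnmatch
from pathlib import Path


def select_bams(filenames, pattern="*.bam"):
    """Return the file names matching a shell-style pattern, sorted."""
    return sorted(name for name in filenames if fnmatch(Path(name).name, pattern))


def report_name(bam_filename):
    """Build report_{sample}.html, assuming sample = file name minus '.bam'."""
    sample = Path(bam_filename).name[: -len(".bam")]
    return f"report_{sample}.html"


if __name__ == "__main__":
    # Hypothetical directory content, for illustration only.
    files = ["runA.subreads.bam", "runB.subreads.bam", "runA.scraps.bam", "notes.txt"]
    for bam in select_bams(files, "*subreads.bam"):
        print(bam, "->", report_name(bam))
```

Checking the selection this way before starting a run can help avoid processing files that were not meant to be analysed.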

In the configuration tab, add as many databases as you wish in the kraken section. To skip the taxonomy step, which is experimental, simply unset the first database.

Save the project and press Run. Once done, open the HTML report for the BAM of interest.

4.7.2. Requirements

https://raw.githubusercontent.com/sequana/sequana/master/sequana/pipelines/pacbio_qc/dag.png

4.7.3. Details

This pipeline takes as input a set of BAM files from PacBio sequencers. It computes basic statistics on the read lengths and plots histograms of the GC content, the SNR of the diodes, and the so-called ZMW values. Finally, a quick taxonomic classification can be performed using Kraken. An HTML report is created for each sample, along with a MultiQC summary page.
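To make the kind of statistics involved concrete, here is a minimal sketch of read-length and GC-content computations on plain sequence strings. It is not the code used by the pipeline (which relies on sequana.pacbio); the function names `gc_percent` and `read_length_stats` and the choice of reported statistics are assumptions made for illustration.

```python
from statistics import mean, median


def gc_percent(sequence):
    """Percentage of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def read_length_stats(sequences):
    """Basic read-length statistics for a non-empty list of reads."""
    lengths = [len(s) for s in sequences]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


if __name__ == "__main__":
    # Toy reads; real PacBio reads are typically thousands of bases long.
    reads = ["ACGT", "GGCC", "AT", "ACGTACGTAA"]
    print(read_length_stats(reads))
    print([gc_percent(r) for r in reads])
```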

4.7.4. Rules and configuration details

A documented configuration file to be used with the pipeline is available at ../sequana/pipelines/pacbio_qc/config.yaml.

4.7.4.1. bam_to_fasta

BAM to FASTA conversion for PacBio BAM files

Required input:
  • __bam_to_fasta__input_bam
Required output:
  • __bam_to_fasta__output_fasta
Configuration:
  • config['bam_to_fasta']['thread']
References:
sequana.pacbio
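
For readers unfamiliar with the target format, the sketch below shows only the FASTA-writing half of such a conversion: it formats (name, sequence) pairs as FASTA text. Reading the records from a BAM file is deliberately left out, since it requires a BAM parser. The function name `to_fasta` and the line width of 60 are assumptions for illustration; the actual rule may format its output differently.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping at `width`."""
    lines = []
    for name, seq in records:
        # Header line: '>' followed by the read name.
        lines.append(f">{name}")
        # Sequence lines: fixed-width chunks of the read.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical read names and sequences.
    print(to_fasta([("read_1", "ACGTAC"), ("read_2", "GG")], width=4), end="")
```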

4.7.4.2. pacbio_quality

PacBio quality control

Required input:
  • __pacbio_quality__input: the input BAM file
Required output:
  • __pacbio_quality__output_summary: summary_{sample}.json
Optional parameters:
  • __pacbio_quality__images_directory: where to save PNG images (defaults to ./images)

In addition to a summary file with basic statistics, this rule creates five images: histograms of the read lengths, the GC content, the ZMW information, and the SNR of the A, C, G, T nucleotides, plus a 2D histogram of GC content versus read length.

References:
sequana.pacbio
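
The 2D histogram of GC content versus read length can be pictured as a simple binning of reads along two axes. The sketch below counts reads per (length bin, GC bin) using only the standard library. It is an illustration under stated assumptions, not the rule's implementation: the function name `gc_length_histogram`, the default bin sizes, and the use of bin lower edges as keys are all hypothetical choices.

```python
from collections import Counter


def gc_length_histogram(sequences, length_bin=1000, gc_bin=10):
    """Count reads per (length bin, GC% bin); keys are bin lower edges.

    `length_bin` and `gc_bin` are positive integers (bases and percent).
    """
    # Lower edge of the last GC bin, so that 100% GC falls into it.
    last_gc_edge = ((100 - 1) // gc_bin) * gc_bin
    counts = Counter()
    for seq in sequences:
        if not seq:
            continue  # skip empty reads, which have no defined GC content
        s = seq.upper()
        gc = 100.0 * (s.count("G") + s.count("C")) / len(s)
        gc_edge = min(int(gc // gc_bin) * gc_bin, last_gc_edge)
        length_edge = (len(s) // length_bin) * length_bin
        counts[(length_edge, gc_edge)] += 1
    return counts


if __name__ == "__main__":
    # Toy reads with small bins so the output is easy to inspect.
    reads = ["ACGT" * 3, "GGCC", "AT" * 5, "GC" * 6]
    for key, n in sorted(gc_length_histogram(reads, length_bin=10, gc_bin=50).items()):
        print(key, n)
```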