2. Overview

Sequana provides standalone applications (e.g., sequana_coverage, sequana_taxonomy) and pipelines in the form of Snakefiles. Although the standalone applications are usually simpler, they may not have all features or parameters offered by the pipelines. Since version 0.8.0, most pipelines have been moved to different repository with one repository per pipeline. This was done to make pipelines independe and the Sequana more modular and effective for deployment in production mode.

The Tutorial, Pipelines, Case Examples sections provide many examples on their usage. Check also the Gallery section for code snippets.

This section will not describe all available standalones and pipelines. We will focus on one example (coverage) to show how one can use the Sequana library, or standalone application, or pipeline to get information about the coverage of a set of mapped reads onto a reference.

2.1. Sequana is a Python library

2.1.1. Example 1 : running median on coverage

Sequana is a Python library. It contains many functionalities, which are fully documented and available in the References section. We can first look at the coverage contained within a BED file using the library. First, we need some data. Sequana provides some test examples, which can be accessed using sequana_data() function. The test case is a virus (about 18,000 bases):

from sequana import sequana_data
filename = sequana_data('JB409847.bed')

We can then use the SequanaCoverage class to read the file:

from sequana import SequanaCoverage
gc = SequanaCoverage(filename)

Select a chromosome (first one) and compute the running median:

chrom = gc[0]
chrom.running_median(n=5001, circular=True)
chrom.compute_zscore()

and finally plot the coverage together with confidence interval (3 sigma):

chrom.plot_coverage()

(Source code, png, hires.png, pdf)

_images/userguide-1.png

See also

notebook available in the github repository

As you can see, Sequana is a standard Python library where developers can select functions, classes, modules to help them design new tools and pipelines.

2.1.2. Example2: read a fastq file

Let us use the FastQC class to get the distribution of the bases ACGT across all reads of a FastQ file.

from sequana import FastQC
from sequana import sequana_data
filename = sequana_data("test.fastq")

fastqc = FastQC(filename)
print(fastqc.fastq)
for x in 'ACGT':
    fastqc.get_actg_content()[x].hist(alpha=0.5, label=x, histtype='step', lw=3, bins=10)

from pylab import legend
legend()

(Source code, png, hires.png, pdf)

_images/userguide-2.png

Many more functionalities are available. The reference guide should help you.

2.2. Sequana provides standalone applications

The Python example about the coverage is actually quite useful. We therefore decided to provide a standalone application. There are other standalone applications listed in Applications (standalone) section.

The one related to the coverage example shown above is named sequana_coverage. If you have a BED file, type:

sequana_coverage  -i <BEDFILENAME>

If your organism has a circular DNA, add -o. You can play with the window size for the running median using -w.

Using the BED file and reference mentionned in the previous section you should obtain the same figure as above.

An additional feature is the report using --show-html option.

2.3. Sequana, a pipeline construction facilitator

In Sequana, in addition to the library and standalone applications, we also provide a set of pipelines (see Pipelines section). Originally, pipeline were provided with Sequana, inside the same source repository. Since version 0.8.0, pipeline have their own repository. For instance, Variant Calling is available on https://github.com/sequana/variant_calling. We will not describe all pipelines here below since new ones may appear now and then. Instead, let us explain the way pipelines can be designed and run.

2.3.1. Installation of a pipeline

With the new design implemented in v0.8.0, pipelines are independent Python packages posted on Pypi. You can now install a pipeline (e.g., variant calling) as follows in your virtual environment:

pip install sequana_variant_calling --upgrade

The --upgrade option is to make sure you install the newest version.

To check if the installation is successful, just type:

sequana_variant_calling --help

2.3.2. Usage

A very simple and useful pipeline for this explanation is the sequana_fastqc pipeline. Install it as follows:

pip install sequana_fastqc

and check the help message:

sequana_fastqc --help

You will see 4 sections some of which are common to all Sequana pipelines.

The generic section allows use to print the help with --help, to set the level of information printed to the screen (--level), the version (--version). Pipelines can be run locally or on a SLURM clusters. This can be set with the --run-mode option. Note, however, that this option is set automatically to slurm-mode if slurm commands are found (e.g. sbatch).

The slurm section can be used to set slurm options for Snakemake. If you do not know what it means, let it be the defaults values. Just note that memory usage is set to 4Gb by default and number of cores is limited to 4 per job.

The snakemake section allows you to set to maximum number of jobs to be used, which is set to 4 (if run-mode is set to local) and 40 (if run-mode is set to slurm).

The --working-directory is set to the name of the pipeline and is the important parameter. It tells sequana where to store the pipeline files (e.g., snakemake, configuration files). You can change it to your will but if it exists already, the pipeline zill not be set up and you will need to use the --force option to overwrite existing files.

The next section is about your input data. Most of the pipelines expect to find Illumina data with single or paired-end data sets. The directory where to find the data is defined by the --input-directory parameter. You can refine the search by providing an input pattern, which is set to *fastq.gz by default. Since, Illumina data may be paired, we have a mechanism to check and discovered paired data for each sample. By default, the paired data are differentiate thanks to a pattern _R1_ or _R2_ to be found in the filenames. The common pattern set with --input-readtag is set to _R[12]_ but can be easily changed. For instance if your files do not contain the R or if the _R1 is to be found at the end of the file, just change it accordingly.

So, let us now perform the fastqc of a bunch of samples. You could type:

sequana_fastqc --input-directory my_data_directory --working-directory test1

This will copy the snakefile, the configuration files and useful files to run the analysis. Follow the instructions that is:

cd test1

In this directory, you can find The configuration file called config.yaml. This pipeline is very simple but you can see the parameters related to your input data:

input_directory: /home/login_example/data_example/my_data_directory
input_readtag: _R[12]_
input_pattern: '*fastq.gz'

So you can edit this file to correct it or change other parameters. If you are happy with those choices, it is now time to run the pipeline. If you know snakemake, you can just use it. For example:

snakemake -s fastqc.rules

or just type:

sh fastq.sh

Wait and see. Once done. If every went well, you can keep the configuration files and pipeline-related files, or delete them using:

make clean

See also

Pipelines section for more information.

2.3.3. Using Sequanix standalone

An even easier way is to use our graphical interface named Sequanix. A snapshot can be found in the Sequanix: GUI for snakemake workflows section and a tutorial in Sequanix Tutorial.

Note, however, that the Sequanix interface is slightly different. The content of the working directory may differ slightly for the time being. The advantage of using Sequanix is that complex configuration pipeline can be tuned easily through its graphical interface.

2.4. Sequana Reports

Pipelines and standalone make use of internal reporting. Since they are part of the Sequana library, they can also be used with your own code. For instance, if you have a BAM file, you can use the following code to create a basic report:

from sequana import BAM, sequana_data
from sequana.modules_report.bamqc import BAMQCModule
filename = sequana_data("test.bam", "testing")
r = BAMQCModule(filename, "bam.html")

that results can be shown in bam.html