Sequana provides standalone applications (e.g., sequana_coverage, sequana_taxonomy) and pipelines in the form of Snakefiles. Although the standalone applications are usually simpler, they may not have all features or parameters offered by the pipelines. Since version 0.8.0, most pipelines have been moved to different repository with one repository per pipeline. This was done to make pipelines independe and the Sequana more modular and effective for deployment in production mode.
This section will not describe all available standalones and pipelines. We will focus on one example (coverage) to show how one can use the Sequana library, or standalone application, or pipeline to get information about the coverage of a set of mapped reads onto a reference.
Sequana is a Python library. It contains many functionalities, which are
fully documented and available in the References section. We can first
look at the coverage contained within a BED file using the library. First, we
need some data. Sequana provides some test examples, which can be accessed
sequana_data() function. The test case is a virus (about
from sequana import sequana_data filename = sequana_data('JB409847.bed')
We can then use the
SequanaCoverage class to read the
from sequana import SequanaCoverage gc = SequanaCoverage(filename)
Select a chromosome (first one) and compute the running median:
chrom = gc chrom.running_median(n=5001, circular=True) chrom.compute_zscore()
and finally plot the coverage together with confidence interval (3 sigma):
notebook available in the github repository
As you can see, Sequana is a standard Python library where developers can select functions, classes, modules to help them design new tools and pipelines.
Let us use the
FastQC class to get the distribution of the bases ACGT
across all reads of a FastQ file.
from sequana import FastQC from sequana import sequana_data filename = sequana_data("test.fastq") fastqc = FastQC(filename) print(fastqc.fastq) for x in 'ACGT': fastqc.get_actg_content()[x].hist(alpha=0.5, label=x, histtype='step', lw=3, bins=10) from pylab import legend legend()
Many more functionalities are available. The reference guide should help you.
The Python example about the coverage is actually quite useful. We therefore decided to provide a standalone application. There are other standalone applications listed in Applications (standalone) section.
The one related to the coverage example shown above is named sequana_coverage. If you have a BED file, type:
sequana_coverage -i <BEDFILENAME>
If your organism has a circular DNA, add
-o. You can play with the window
size for the running median using
Using the BED file and reference mentionned in the previous section you should obtain the same figure as above.
An additional feature is the report using
In Sequana, in addition to the library and standalone applications, we also provide a set of pipelines (see Pipelines section). Originally, pipeline were provided with Sequana, inside the same source repository. Since version 0.8.0, pipeline have their own repository. For instance, Variant Calling is available on https://github.com/sequana/variant_calling. We will not describe all pipelines here below since new ones may appear now and then. Instead, let us explain the way pipelines can be designed and run.
With the new design implemented in v0.8.0, pipelines are independent Python packages posted on Pypi. You can now install a pipeline (e.g., variant calling) as follows in your virtual environment:
pip install sequana_variant_calling --upgrade
The --upgrade option is to make sure you install the newest version.
To check if the installation is successful, just type:
A very simple and useful pipeline for this explanation is the sequana_fastqc pipeline. Install it as follows:
pip install sequana_fastqc
and check the help message:
You will see 4 sections some of which are common to all Sequana pipelines.
The generic section allows use to print the help with --help, to set the level of information printed to the screen (--level), the version (--version). Pipelines can be run locally or on a SLURM clusters. This can be set with the --run-mode option. Note, however, that this option is set automatically to slurm-mode if slurm commands are found (e.g. sbatch).
The slurm section can be used to set slurm options for Snakemake. If you do not know what it means, let it be the defaults values. Just note that memory usage is set to 4Gb by default and number of cores is limited to 4 per job.
The snakemake section allows you to set to maximum number of jobs to be used, which is set to 4 (if run-mode is set to local) and 40 (if run-mode is set to slurm).
The --working-directory is set to the name of the pipeline and is the important parameter. It tells sequana where to store the pipeline files (e.g., snakemake, configuration files). You can change it to your will but if it exists already, the pipeline zill not be set up and you will need to use the --force option to overwrite existing files.
The next section is about your input data. Most of the pipelines expect to find Illumina data with single or paired-end data sets. The directory where to find the data is defined by the --input-directory parameter. You can refine the search by providing an input pattern, which is set to *fastq.gz by default. Since, Illumina data may be paired, we have a mechanism to check and discovered paired data for each sample. By default, the paired data are differentiate thanks to a pattern _R1_ or _R2_ to be found in the filenames. The common pattern set with --input-readtag is set to _R_ but can be easily changed. For instance if your files do not contain the R or if the _R1 is to be found at the end of the file, just change it accordingly.
So, let us now perform the fastqc of a bunch of samples. You could type:
sequana_fastqc --input-directory my_data_directory --working-directory test1
This will copy the snakefile, the configuration files and useful files to run the analysis. Follow the instructions that is:
In this directory, you can find The configuration file called config.yaml. This pipeline is very simple but you can see the parameters related to your input data:
input_directory: /home/login_example/data_example/my_data_directory input_readtag: _R_ input_pattern: '*fastq.gz'
So you can edit this file to correct it or change other parameters. If you are happy with those choices, it is now time to run the pipeline. If you know snakemake, you can just use it. For example:
snakemake -s fastqc.rules
or just type:
Wait and see. Once done. If every went well, you can keep the configuration files and pipeline-related files, or delete them using:
Pipelines section for more information.
Note, however, that the Sequanix interface is slightly different. The content of the working directory may differ slightly for the time being. The advantage of using Sequanix is that complex configuration pipeline can be tuned easily through its graphical interface.
Pipelines and standalone make use of internal reporting. Since they are part of the Sequana library, they can also be used with your own code. For instance, if you have a BAM file, you can use the following code to create a basic report:
from sequana import BAM, sequana_data from sequana.modules_report.bamqc import BAMQCModule filename = sequana_data("test.bam", "testing") r = BAMQCModule(filename, "bam.html")
that results can be shown in bam.html