PROGRAM
=======

PhageTerm.py - run as command line in a shell


VERSION
=======

Version 4.0.0
Compatible with python 3.7


INTRODUCTION
============

PhageTermVirome is a tool to determine phage genome termini and genome packaging mode, either for a single phage genome or for multiple contigs at once.
The software uses phage and virome sequencing reads obtained from libraries prepared with randomly fragmented DNA (e.g. Covaris fragmentation
and library preparation using Illumina TruSeq). Phage or virome sequencing reads (fastq files) are aligned to the assembled phage genome or assembled
virome (fasta or multifasta files) in order to calculate two types of coverage values: the whole genome coverage and the Starting Position Coverage (SPC). The SPC is used to perform a detailed termini and packaging mode analysis.
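
To make the two coverage values concrete, here is a minimal Python sketch. It is not PhageTerm's implementation: the read representation (leftmost position, length, strand), the function name and the linear-genome simplification are all assumptions made for illustration. Whole genome coverage counts the reads overlapping each position, while the SPC counts the reads that start at each position (for reverse-strand reads, the start is taken as the rightmost aligned base).

```python
from typing import List, Tuple

# Hypothetical read representation: (leftmost 0-based position, read length, strand).
Read = Tuple[int, int, str]


def coverage_and_spc(reads: List[Read], genome_len: int):
    """Return (coverage, spc_forward, spc_reverse) as per-position lists."""
    coverage = [0] * genome_len
    spc_fwd = [0] * genome_len
    spc_rev = [0] * genome_len
    for left, length, strand in reads:
        # Assumption: the genome is treated as linear, so reads are clipped at its end.
        right = min(left + length, genome_len)
        for pos in range(left, right):
            coverage[pos] += 1
        if strand == "+":
            spc_fwd[left] += 1        # a forward read starts at its leftmost base
        else:
            spc_rev[right - 1] += 1   # a reverse read starts at its rightmost base
    return coverage, spc_fwd, spc_rev


reads = [(0, 4, "+"), (0, 4, "+"), (2, 4, "+"), (3, 4, "-")]
cov, fwd, rev = coverage_and_spc(reads, genome_len=8)
print(cov)  # [2, 2, 3, 4, 2, 2, 1, 0]
print(fwd)  # [2, 0, 1, 0, 0, 0, 0, 0]
print(rev)  # [0, 0, 0, 0, 0, 0, 1, 0]
```

In this toy example, position 0 stands out in the forward SPC even though its whole genome coverage is unremarkable, which is the kind of signal a termini analysis looks for.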

Mu-type phage analysis: this analysis can be run if the user suspects the phage genome to be of the Mu-like type. To do so, the user also provides the host (bacterial) genome sequence. The Mu-type phage analysis takes the reads that do not match the phage
genome and aligns them to the bacterial genome using the same mapping function. This analysis is available only when a single phage genome is provided (it is not possible with a multifasta file containing multiple assembled phage contigs).


The previous PhageTerm program (single phage analysis only) is still available at https://sourceforge.net/projects/phageterm/ (for versions <3.0.0)


A Galaxy wrapper is also available for the previous version at https://galaxy.pasteur.fr (only for the first version of PhageTerm;
PhageTermVirome is not implemented on Galaxy yet).

Since version 3.0.0, PhageTerm can work in 2 modes:
- the usual mono machine mode (parallelization on several cores on the same machine).
- a new multi machine mode (advanced users) with parallelization on several machines, using intermediate files for data exchange.

The default mode is mono machine.
Versions from 3.0.0 up to (but not including) version 4.0 work with python 2.7

Since version 4.0, PhageTerm (now PhageTermVirome) works with python 3.7


PREREQUISITES
=============


For version 4.0

Unix/Linux

  - backports
  - backports.functools_lru_cache
  - backports_abc
  - cycler
  - libwebp-base
  - lz4-c
  - matplotlib-base
  - matplotlib
  - numpy
  - openssl
  - pandas
  - patsy
  - pillow
  - pip
  - pyparsing
  - python=3.7
  - python-dateutil
  - python_abi
  - pytz
  - readline
  - reportlab
  - scikit-learn
  - scipy
  - setuptools
  - singledispatch
  - statsmodels
  - tk
  - tornado

A conda virtualenv containing python3.7 and all dependencies is provided for convenience, so that users
do not need to install anything other than miniconda or conda (see below).


FOR IMPATIENT USERS : INSTALLING PHAGETERMVIROME USING THE CONDA VIRTUALENV (easiest option)
============================================================================================

First install miniconda if you don't have it already (you don't even need to have python 2.7 or python 3.7 installed on your machine for that since
miniconda contains it): https://docs.conda.io/en/latest/miniconda.html

Download and decompress/extract the PhageTermVirome directory available at https://gitlab.pasteur.fr/vlegrand/ptv.

Then go to the PTV directory and create the conda environment using the PhageTerm_env_3.yml file for version >=4.0 (python3):

    $ conda env create -f PhageTerm_env_3.yml

Then activate the environment so you can launch PhageTermVirome:

    $ conda activate PhageTerm_env_py3
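
Once the environment is active, a quick import check can confirm that the main Python dependencies are visible to the interpreter. The sketch below is not part of PhageTermVirome; the function name and the choice of packages are assumptions. It only covers prerequisites that are importable Python modules and maps conda package names to their import names (for example scikit-learn is imported as sklearn and pillow as PIL).

```python
import importlib.util
from typing import Dict, List

# Conda package name -> importable module name.
# Only prerequisites that are Python modules are listed (assumed selection).
MODULES: Dict[str, str] = {
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "reportlab": "reportlab",
    "pillow": "PIL",
}


def missing_modules(modules: Dict[str, str]) -> List[str]:
    """Return the package names whose module cannot be found by the interpreter."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


missing = missing_modules(MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the list is not empty, the environment is probably not activated or its creation did not complete.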


NOTE:

You can still use the old PhageTerm under python 2.7 (but no multi-fasta analysis is possible) with the miniconda environment from the PhageTerm_env.yml file for version <4.0 (python2), using the following commands:

    $ conda env create -f PhageTerm_env.yml

    $ conda activate PhageTerm_env


COMMAND LINE USAGE
==================

Basic usage with mandatory options (PhageTermVirome needs at least one read file, but the user can provide a second, corresponding paired-end read file if available, using the -p option).

	./PhageTerm.py -f reads.fastq -r phage_sequence(s).fasta
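
When PhageTermVirome is launched from a script, the same command line can be assembled programmatically. The helper below is a sketch, not part of the software: the function name and its parameters are hypothetical, and it only uses options documented in this file (-f, -r, -p, -c, -g, --report_title).

```python
import shlex
from typing import List, Optional


def build_command(fastq: str, ref: str, paired: Optional[str] = None,
                  cores: int = 1, host: Optional[str] = None,
                  report_title: Optional[str] = None) -> List[str]:
    """Assemble a PhageTerm.py argument list (hypothetical helper)."""
    cmd = ["./PhageTerm.py", "-f", fastq, "-r", ref]  # mandatory options
    if paired is not None:
        cmd += ["-p", paired]                 # second paired-end read file
    if cores != 1:
        cmd += ["-c", str(cores)]             # default is 1 core
    if host is not None:
        cmd += ["-g", host]                   # single phage genome analysis only
    if report_title is not None:
        cmd += ["--report_title", report_title]
    return cmd


cmd = build_command("reads_R1.fastq", "phage.fasta", paired="reads_R2.fastq", cores=4)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the PTV directory with the conda environment activated.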


	Help:

        ./PhageTerm.py -h
        ./PhageTerm.py --help


	After installation, we recommend performing a software test run, using any of the following values:
    -t TEST_VALUE, --test=TEST_VALUE
                        TEST_VALUE=C5   : Test run for a 5' cohesive end (e.g. Lambda)
                        TEST_VALUE=C3   : Test run for a 3' cohesive end (e.g. HK97)
                        TEST_VALUE=DS   : Test run for a short Direct Terminal Repeats end (e.g. T7)
                        TEST_VALUE=DL   : Test run for a long Direct Terminal Repeats end (e.g. T5)
                        TEST_VALUE=H    : Test run for a Headful packaging (e.g. P1)
                        TEST_VALUE=M    : Test run for a Mu-like packaging (e.g. Mu)
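
To go through all six test runs in one pass, the commands can be generated from the list of test values. This is a sketch under one assumption that should be checked against `./PhageTerm.py -h`: that -t can be given on its own, without -f and -r. The helper name is hypothetical.

```python
import shlex
from typing import Dict, List

# Test values and descriptions copied from the option list above.
TEST_VALUES: Dict[str, str] = {
    "C5": "5' cohesive end (e.g. Lambda)",
    "C3": "3' cohesive end (e.g. HK97)",
    "DS": "short Direct Terminal Repeats end (e.g. T7)",
    "DL": "long Direct Terminal Repeats end (e.g. T5)",
    "H": "Headful packaging (e.g. P1)",
    "M": "Mu-like packaging (e.g. Mu)",
}


def build_test_run_command(test_value: str) -> List[str]:
    """Return the argument list for one software test run (hypothetical helper)."""
    if test_value not in TEST_VALUES:
        raise ValueError("unknown TEST_VALUE: " + test_value)
    # Assumption: -t is sufficient on its own, without -f and -r.
    return ["./PhageTerm.py", "-t", test_value]


for value, description in TEST_VALUES.items():
    line = " ".join(shlex.quote(arg) for arg in build_test_run_command(value))
    print(line + "    # " + description)
```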


Non-mandatory options

[-p reads_paired -c nbr_core_threads --report_title name_to_write_on_report_outputs -s seed_length -d surrounding -g host.fasta -l contig_size_limit_multi-fasta -v virome_run_time_estimation]


Additional advanced options (only for multi-machine users)


[--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_length -d surrounding -l limit_multi-fasta]
[--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_length -d surrounding -l limit_multi-fasta]
[--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_length -d surrounding -l limit_multi-fasta]


   Detailed options:


	Raw reads file in fastq format:
    -f INPUT_FILE, --fastq=INPUT_FILE
                        Fastq reads
                        (NGS sequences from random fragmentation DNA only,
                        e.g. Illumina TruSeq)

	Phage genome(s) in fasta format:
    -r INPUT_FILE, --ref=INPUT_FILE
                        Reference phage genome(s) as unique contig in fasta format


    Other options common to both modes:

	Paired reads file in fastq format:
    -p INPUT_FILE, --paired=INPUT_FILE
                        Paired fastq reads
                        (NGS sequences from random fragmentation DNA only,
                        e.g. Illumina TruSeq)

	Analysis name to write on output reports:
    --report_title REPORT_NAME, --report_title=REPORT_NAME
                        Manually enter the name you want to have on your report outputs.
                        Used as prefix for output files.

	Length of the seed used for reads in the mapping process:
    -s SEED_LENGTH, --seed=SEED_LENGTH
                        Manually enter the length of the seed used for reads
                        in the mapping process (Default: 20).

	Number of nucleotides around the main peak to consider for merging adjacent significant peaks (set to 1 to discover secondary terminus sites):
    -d SURROUNDING_LENGTH, --surrounding=SURROUNDING_LENGTH
                        Manually enter the length of the surrounding region used to
                        merge close peaks in the analysis process (Default: 20).
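
The effect of the surrounding length can be illustrated with a small sketch. This is one plausible reading of "merging adjacent significant peaks", not PhageTerm's actual algorithm: the peak representation, the function name, the inclusive distance test and the linear (non-circular) distance are all assumptions. The strongest peak absorbs the values of the peaks lying within the surrounding distance, and the procedure repeats on what is left.

```python
from typing import List, Tuple

Peak = Tuple[int, float]  # (genome position, peak value, e.g. an SPC count)


def merge_close_peaks(peaks: List[Peak], surrounding: int = 20) -> List[Peak]:
    """Merge peaks located within `surrounding` nt of a stronger peak (illustrative)."""
    remaining = sorted(peaks, key=lambda p: p[1], reverse=True)
    merged: List[Peak] = []
    while remaining:
        main_pos, main_val = remaining.pop(0)   # strongest remaining peak
        far = []
        for pos, val in remaining:
            if abs(pos - main_pos) <= surrounding:
                main_val += val                 # neighbor absorbed into the main peak
            else:
                far.append((pos, val))
        remaining = far
        merged.append((main_pos, main_val))
    return merged


peaks = [(100, 50.0), (105, 10.0), (500, 30.0), (130, 5.0)]
print(merge_close_peaks(peaks, surrounding=20))
# [(100, 60.0), (500, 30.0), (130, 5.0)]
print(merge_close_peaks(peaks, surrounding=1))
# [(100, 50.0), (500, 30.0), (105, 10.0), (130, 5.0)]
```

With a small surrounding value the peak at position 105 stays separate, which is why lowering -d can reveal secondary sites close to the main terminus.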

	Host genome in fasta format (option available only for analysis with a single phage genome):
    -g INPUT_FILE, --host=INPUT_FILE
                        Genome of reference host (bacterial genome) in fasta format
                        Warning: drastically increases processing time
                        This option can be used only when analyzing a single phage genome (not available for virome contigs as multifasta)

	Define phage mean coverage:
    -m MEAN_NBR, --mean=MEAN_NBR
                        Phage mean coverage to use (Default: 250).

	Define minimum phage contig length:
    -l LIMIT_FASTA, --limit=LIMIT_FASTA
                        Minimum phage fasta length (Default: 500).


    Options for mono machine (default) mode only

	Software run test:
    -t TEST_VALUE, --test=TEST_VALUE
                        TEST_VALUE=C5   : Test run for a 5' cohesive end (e.g. Lambda)
                        TEST_VALUE=C3   : Test run for a 3' cohesive end (e.g. HK97)
                        TEST_VALUE=DS   : Test run for a short Direct Terminal Repeats end (e.g. T7)
                        TEST_VALUE=DL   : Test run for a long Direct Terminal Repeats end (e.g. T5)
                        TEST_VALUE=H    : Test run for a Headful packaging (e.g. P1)
                        TEST_VALUE=M    : Test run for a Mu-like packaging (e.g. Mu)

    Number of processor cores to use:
    -c CORE_NBR, --core=CORE_NBR
                        Number of processor cores to use (Default: 1).


    Options for multi machine mode only

    Indicate that PhageTerm should run on several machines:
    --mm


    Options for step 1 of multi-machine mode (calculating reads coverage) on several machines

    Directory for coverage results:
    --dir_cov_mm=DIR_PATH/DIR_NAME
                        Directory where to put coverage results.
                        Note: it is up to the user to delete the files in this directory.

    Total number of cores to use:
    -c CORE_NBR, --core=CORE_NBR
                        Total number of cores used across all machines.

    Index of read chunk to process on current core
    --core_id=IDX
                A number between 0 and CORE_NBR-1

    Directory for checkpoint files:
    --dir_chk=DIR_PATH/DIR_NAME
                    Directory where PhageTerm will put its checkpoints.
                    Note: the directory must exist before launching PhageTerm.
                    If the directory already contains a file, PhageTerm will start from the results contained in this file.

    --chk_freq=FREQUENCY
                    The frequency in minutes at which checkpoints must be created.
                    Note: default value is 0 which means that no checkpoint is created.
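
Step 1 is launched once per read chunk, with --core_id going from 0 to CORE_NBR-1. The sketch below generates the corresponding command lines, for example to submit them to a job scheduler. It is an illustration only: the helper name and parameters are hypothetical, and it assumes that -f and -r remain mandatory in multi-machine mode.

```python
import shlex
from typing import List, Optional


def step1_commands(fastq: str, ref: str, dir_cov: str, total_cores: int,
                   dir_chk: Optional[str] = None,
                   chk_freq: int = 0) -> List[List[str]]:
    """One step 1 command per read chunk (hypothetical helper)."""
    commands = []
    for core_id in range(total_cores):          # 0 .. CORE_NBR-1
        cmd = ["./PhageTerm.py", "-f", fastq, "-r", ref, "--mm",
               "--dir_cov_mm=" + dir_cov,
               "-c", str(total_cores),          # total over all machines
               "--core_id=" + str(core_id)]
        if dir_chk is not None and chk_freq > 0:
            # The checkpoint directory must exist before launching.
            cmd += ["--dir_chk=" + dir_chk, "--chk_freq=" + str(chk_freq)]
        commands.append(cmd)
    return commands


for cmd in step1_commands("reads.fastq", "contigs.fasta", "cov_results", 3):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```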


    Options for step 2 of multi-machine mode (calculating per sequence statistics from reads coverage results) on several machines

    Directory for coverage results:
    --dir_cov_mm=DIR_PATH/DIR_NAME
                        Directory where to put coverage results.
                        Note: it is up to the user to delete the files in this directory.

    Directory for per sequence results
    --dir_seq_mm=DIR_PATH/DIR_NAME
                        Directory where to put the information if no match was found for one/several sequences.
                        Note: it is up to the user to delete the files in this directory.

    Directory for DR results
    --DR_path=DIR_PATH/DIR_NAME
                        Directory where to put the information necessary to step 3 (final report generation).
                        This information typically includes names of phage found and per sequence statistics.
                        Note: it is up to the user to delete the files in this directory.

    Sequence identifier
    --seq_id=IDX
            Index of the sequence to be processed by the current phageTerm process.
            Let N be the number of sequences given at the end of step 1.
            Then IDX is a number between 0 and N-1.

    Number of pieces
    --nb_pieces=NP
            Number of parts in which the reads were divided.
            Must be the same value as given via -c at step 1 (CORE_NBR).


    Options for step 3 of multi-machine mode (final report generation)

    Directory for DR results
    --DR_path=DIR_PATH/DIR_NAME
                        Directory where to read the information necessary to step 3 (final report generation).
                        This information typically includes names of phage found and per sequence statistics.
                        Note: it is up to the user to delete the files in this directory.

    Directory for per sequence results
    --dir_seq_mm=DIR_PATH/DIR_NAME
                        Directory where to get the information if no match was found for one/several sequences.
                        Note: it is up to the user to delete the files in this directory.
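
Step 3 is a single command, after which the intermediate files can be deleted, since PhageTerm leaves that to the user. The sketch below builds the step 3 command and removes the regular files found in the given directories. Both helper names are hypothetical, and -f and -r are assumed to remain mandatory. The deletion is irreversible, so it should only be run once the final report has been checked.

```python
import os
import tempfile
from typing import Iterable, List


def step3_command(fastq: str, ref: str, dr_path: str, dir_seq: str) -> List[str]:
    """Final report generation command (hypothetical helper)."""
    return ["./PhageTerm.py", "-f", fastq, "-r", ref, "--mm",
            "--DR_path=" + dr_path, "--dir_seq_mm=" + dir_seq]


def remove_intermediate_files(directories: Iterable[str]) -> List[str]:
    """Delete the regular files directly inside each directory; return their paths."""
    removed = []
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path):            # subdirectories are left untouched
                os.remove(path)
                removed.append(path)
    return removed


# Demonstration on a throwaway directory rather than on real results.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "chunk_0.tmp"), "w").close()
    print(remove_intermediate_files([demo_dir]))
    print(os.listdir(demo_dir))  # []
```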


OUTPUT FILES
============

	(i) Report (.pdf)

	(ii) Statistical table (.csv)

	(iii) File containing the sequence(s) re-organized to start at the predicted termini (.fasta)


CONTACT
=======

Julian Garneau <julian.garneau@usherbrooke.ca>
Marc Monot <marc.monot@pasteur.fr>
David Bikard <david.bikard@pasteur.fr>
Véronique Legrand <vlegrand@pasteur.fr>
author	bioit_sciensano
date	Fri, 11 Mar 2022 15:06:20 +0000