comparison README.txt @ 0:69e8f12c8b31 draft

"planemo upload"
author bioit_sciensano
date Fri, 11 Mar 2022 15:06:20 +0000
1 PROGRAM
2 =======
3
4 PhageTerm.py - run as command line in a shell
5
6
7 VERSION
8 =======
9
10 Version 4.0.0
11 Compatible with python 3.7
12
13
14 INTRODUCTION
15 ============
16
17 PhageTermVirome is a tool that determines phage genome termini and genome packaging mode for a single phage genome or for multiple contigs at once.
18 The software uses phage and virome sequencing reads obtained from libraries prepared with DNA fragmented randomly (e.g. Covaris fragmentation,
19 and library preparation using Illumina TruSeq). Phage or virome sequencing reads (fastq files) are aligned to the assembled phage genome or assembled
20 virome (fasta or multifasta files) in order to calculate two types of coverage values (whole genome coverage and the Starting Position Coverage (SPC)). The starting position coverage is used to perform a detailed termini and packaging mode analysis.
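To make the two coverage values concrete, here is a minimal sketch, not PhageTerm's actual code. It assumes the reads are already aligned and given as hypothetical 0-based, half-open (start, end) intervals on one strand of a linear genome; the function name and the input format are illustrative only.

```python
def coverage_and_spc(aligned_reads, genome_length):
    """Return (whole_coverage, starting_position_coverage) as two lists.

    aligned_reads: iterable of (start, end) tuples, 0-based, end exclusive
                   (an assumed input format, chosen for illustration).
    """
    coverage = [0] * genome_length
    spc = [0] * genome_length
    for start, end in aligned_reads:
        # SPC counts only the position where a read starts.
        spc[start] += 1
        # Whole coverage counts every position spanned by the read.
        for pos in range(start, end):
            coverage[pos] += 1
    return coverage, spc


if __name__ == "__main__":
    reads = [(0, 3), (0, 2), (2, 5)]
    cov, spc = coverage_and_spc(reads, genome_length=5)
    print("coverage:", cov)
    print("SPC:     ", spc)
```

Positions where many reads start relative to the whole coverage are the candidates that a termini analysis would look at.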
21
22 Mu-type phage analysis: can be run if the user suspects the phage genome to be Mu-like (single phage genome analysis only, not possible with a multifasta file).
23 The user can also provide the host (bacterial) genome sequence. The Mu-type phage analysis takes the reads that do not match the phage
24 genome and aligns them to the bacterial genome using the same mapping function. The analysis to identify Mu-like phages is available only when a single phage genome is provided (not possible if the user provides a multifasta file with multiple assembled phage contigs).
25
26
27 The previous PhageTerm program (single phage analysis only) is still available at https://sourceforge.net/projects/phageterm/ (for versions <3.0.0)
28
29
30 A Galaxy wrapper is also available at https://galaxy.pasteur.fr, but only for the first version of PhageTerm.
31 PhageTermVirome is not implemented on Galaxy yet.
32
33 Since version 3.0.0, PhageTerm can work in 2 modes:
34 - the usual mono machine mode (parallelization on several cores on the same machine).
35 - a new multi machine mode (advanced users) with parallelization on several machines, using intermediate files for data exchange.
36
37 The default mode is mono machine.
38 Versions from 3.0.0 up to, but not including, 4.0 work with python 2.7
39
40 Since version 4.0, PhageTerm (now PhageTermVirome) works with python 3.7
41
42
43 PREREQUISITES
44 =============
45
46
47 For version 4.0
48
49 Unix/Linux
50
51 - backports
52 - backports.functools_lru_cache
53 - backports_abc
54 - cycler
55 - libwebp-base
56 - lz4-c
57 - matplotlib-base
58 - matplotlib
59 - numpy
60 - openssl
61 - pandas
62 - patsy
63 - pillow
64 - pip
65 - pyparsing
66 - python=3.7
67 - python-dateutil
68 - python_abi
69 - pytz
70 - readline
71 - reportlab
72 - scikit-learn
73 - scipy
74 - setuptools
75 - singledispatch
76 - statsmodels
77 - tk
78 - tornado
79
80 A conda virtualenv containing python3.7 and all dependencies is provided for convenience so that users
81 don't need to install anything else than miniconda or conda. (See below)
82
83
84 FOR IMPATIENT USERS : INSTALLING PHAGETERMVIROME USING THE CONDA VIRTUALENV (easiest option)
85 ============================================================================================
86
87 First install miniconda if you don't have it already (you don't even need to have python 2.7 or python 3.7 installed on your machine for that since
88 miniconda contains it): https://docs.conda.io/en/latest/miniconda.html
89
90 Download and decompress/extract the PhageTermVirome directory available at https://gitlab.pasteur.fr/vlegrand/ptv.
91
92 Then go to the PTV directory and create the conda environment using the PhageTerm_env_3.yml file for version >=4.0 (python3)
93
94 $ conda env create -f PhageTerm_env_3.yml
95
96 Then activate the environment so you can launch PhageTermVirome:
97
98 $ conda activate PhageTerm_env_py3
99
100
101 NOTE:
102
103 You can still use the old PhageTerm under python 2.7 (no multifasta analysis possible) with the miniconda environment from the PhageTerm_env.yml file for version<4.0 (python2), using the following commands:
104
105 $ conda env create -f PhageTerm_env.yml
106
107 $ conda activate PhageTerm_env
108
109
110
111 COMMAND LINE USAGE
112 ==================
113
114 Basic usage with mandatory options (PhageTermVirome needs at least one read file, but user can provide a second corresponding paired-end read file if available, using the -p option).
115
116 ./PhageTerm.py -f reads.fastq -r phage_sequence(s).fasta
117
118
119 Help:
120
121 ./PhageTerm.py -h
122 ./PhageTerm.py --help
123
124
125 After installation, we recommend running a software test with any of the following values:
126 -t TEST_VALUE, --test=TEST_VALUE
127 TEST_VALUE=C5 : Test run for a 5' cohesive end (e.g. Lambda)
128 TEST_VALUE=C3 : Test run for a 3' cohesive end (e.g. HK97)
129 TEST_VALUE=DS : Test run for a short Direct Terminal Repeats end (e.g. T7)
130 TEST_VALUE=DL : Test run for a long Direct Terminal Repeats end (e.g. T5)
131 TEST_VALUE=H : Test run for a Headful packaging (e.g. P1)
132 TEST_VALUE=M : Test run for a Mu-like packaging (e.g. Mu)
133
134
135 Non-mandatory options
136
137 [-p reads_paired -c nbr_core_threads --report_title name_to_write_on_report_outputs -s seed_length -d surrounding -g host.fasta -l contig_size_limit_multi-fasta -v virome_run_time_estimation]
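As an illustration of how the mandatory and optional flags combine, the following sketch assembles a mono machine command line from Python. Only flags listed in this README are used; the helper name build_command and the file names are hypothetical, and the script path ./PhageTerm.py assumes you are in the PhageTermVirome directory.

```python
import shlex


def build_command(fastq, ref, paired=None, cores=None, title=None,
                  seed=None, surrounding=None, host=None):
    """Build the argument list for a mono machine PhageTerm.py run."""
    cmd = ["./PhageTerm.py", "-f", fastq, "-r", ref]
    optional = [
        ("-p", paired),            # second paired-end read file
        ("-c", cores),             # number of cores
        ("--report_title", title), # prefix for output files
        ("-s", seed),              # seed length
        ("-d", surrounding),       # surrounding length for peak merging
        ("-g", host),              # host genome (single phage genome only)
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_command("reads_R1.fastq", "phage.fasta",
                        paired="reads_R2.fastq", cores=4, title="my_phage")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the conda environment is active.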
138
139
140 Additional advanced options (only for multi-machine users)
141
142
143 [--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_length -d surrounding -l limit_multi-fasta]
144 [--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_length -d surrounding -l limit_multi-fasta] [--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_length -d surrounding -l limit_multi-fasta]
145
146
147
148
149 Detailed options:
150
151
152 Raw reads file in fastq format:
153 -f INPUT_FILE, --fastq=INPUT_FILE
154 Fastq reads
155 (NGS sequences from random fragmentation DNA only,
156 e.g. Illumina TruSeq)
157
158 Phage genome(s) in fasta format:
159 -r INPUT_FILE, --ref=INPUT_FILE
160 Reference phage genome(s) as unique contig in fasta format
161
162
163
164 Other options common to both modes:
165
166 Paired raw reads file in fastq format:
167 -p INPUT_FILE, --paired=INPUT_FILE
168 Paired fastq reads
169 (NGS sequences from random fragmentation DNA only,
170 e.g. Illumina TruSeq)
171
172 Analysis_name to write on output reports:
173 --report_title USER_REPORT_NAME, --report_title=REPORT_NAME
174 Manually enter the name you want to have on your report outputs.
175 Used as prefix for output files.
176
177 Length of the seed used for reads in the mapping process:
178 -s SEED_LENGTH, --seed=SEED_LENGTH
179 Manually enter the length of the seed used for reads
180 in the mapping process (Default: 20).
181
182 Number of nucleotides around the main peak to consider for merging adjacent significant peaks (set to 1 to discover secondary terminus sites).
183 -d SURROUNDING_LENGTH, --surrounding=SURROUNDING_LENGTH
184 Manually enter the length of the surrounding used to
185 merge close peaks in the analysis process (Default: 20).
186
187 Host genome in fasta format (option available only for analysis with a single phage genome):
188 -g INPUT_FILE, --host=INPUT_FILE
189 Genome of reference host (bacterial genome) in fasta format
190 Warning: drastically increases processing time
191 This option can be used only when analyzing a single phage genome (not available for virome contigs as multifasta)
192
193 Define phage mean coverage:
194 -m MEAN_NBR, --mean=MEAN_NBR
195 Phage mean coverage to use (Default: 250).
196
197 Define minimum phage fasta length:
198 -l LIMIT_FASTA, --limit=LIMIT_FASTA
199 Minimum phage fasta length (Default: 500).
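The effect of the -d (surrounding) option can be pictured with a small sketch. This is an assumption-laden illustration of "merging adjacent significant peaks", not PhageTerm's implementation: peaks are given as a hypothetical dict mapping genome position to read-start count, the genome is treated as linear, and the greedy strategy (strongest peak first) is a choice made for this example.

```python
def merge_peaks(peaks, surrounding=20):
    """Greedily merge peaks lying within `surrounding` nt of a stronger peak.

    peaks: dict {position: count} (illustrative input format).
    Returns a dict {main_position: summed_count}.
    """
    remaining = dict(peaks)
    merged = {}
    while remaining:
        # Pick the strongest remaining peak (lowest position breaks ties).
        main = max(remaining, key=lambda p: (remaining[p], -p))
        close = [p for p in remaining if abs(p - main) <= surrounding]
        # Absorb every peak inside the window into the main peak.
        merged[main] = sum(remaining.pop(p) for p in close)
    return merged


if __name__ == "__main__":
    peaks = {100: 50, 105: 10, 300: 5}
    print(merge_peaks(peaks, surrounding=20))
    print(merge_peaks(peaks, surrounding=1))
```

With a small window such as 1, nearby secondary peaks stay separate instead of being absorbed into the main peak, which is why a low value helps to reveal them.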
200
201
202 Options for mono machine (default) mode only
203
204 Software run test:
205 -t TEST_VALUE, --test=TEST_VALUE
206 TEST_VALUE=C5 : Test run for a 5' cohesive end (e.g. Lambda)
207 TEST_VALUE=C3 : Test run for a 3' cohesive end (e.g. HK97)
208 TEST_VALUE=DS : Test run for a short Direct Terminal Repeats end (e.g. T7)
209 TEST_VALUE=DL : Test run for a long Direct Terminal Repeats end (e.g. T5)
210 TEST_VALUE=H : Test run for a Headful packaging (e.g. P1)
211 TEST_VALUE=M : Test run for a Mu-like packaging (e.g. Mu)
212
213 Number of processor cores to use:
214 -c CORE_NBR, --core=CORE_NBR
215 Number of processor cores to use (Default: 1).
216
217
218
219 Options for multi machine mode only
220
221 Indicate that PhageTerm should run on several machines:
222 --mm
223
224
225 Options for step 1 of multi-machine mode (calculating reads coverage) on several machines
226
227 Directory for coverage results:
228 --dir_cov_mm=DIR_PATH/DIR_NAME
229 Directory where to put coverage results.
230 Note: it is up to the user to delete the files in this directory.
231
232 Total number of cores to use
233 -c CORE_NBR, --core=CORE_NBR
234 Total number of cores used across all machines.
235
236 Index of read chunk to process on current core
237 --core_id=IDX
238 A number between 0 and CORE_NBR-1
239
240 Directory for checkpoint files:
241 --dir_chk=DIR_PATH/DIR_NAME
242 Directory where phageTerm will put its checkpoints.
243 Note: the directory must exist before launching phageTerm.
244 If the directory already contains a file, phageTerm will start from the results contained in this file.
245
246 --chk_freq=FREQUENCY
247 The frequency in minutes at which checkpoints must be created.
248 Note: default value is 0 which means that no checkpoint is created.
249
250
251
252 Options for step 2 of multi-machine mode (calculating per sequence statistics from reads coverage results) on several machines
253
254 Directory for coverage results:
255 --dir_cov_mm=DIR_PATH/DIR_NAME
256 Directory where to put coverage results.
257 Note: it is up to the user to delete the files in this directory.
258
259 Directory for per sequence results
260 --dir_seq_mm=DIR_PATH/DIR_NAME
261 Directory where to put the information if no match was found for one/several sequences.
262 Note: it is up to the user to delete the files in this directory.
263
264 Directory for DR results
265 --DR_path=DIR_PATH/DIR_NAME
266 Directory where to put the information necessary to step 3 (final report generation).
267 This information typically includes names of phage found and per sequence statistics.
268 Note: it is up to the user to delete the files in this directory.
269
270 Sequence identifier
271 --seq_id=IDX
272 Index of the sequence to be processed by the current phageTerm process.
273 Let N be the number of sequences given at the end of step 1.
274 Then IDX is a number between 0 and N-1.
275
276 Number of pieces
277 --nb_pieces=NP
278 Number of parts in which the reads were divided.
279 Must be the same value as given via -c at step 1 (CORE_NBR).
280
281
282 Options for step 3 of multi-machine mode (final report generation)
283
284 Directory for DR results
285 --DR_path=DIR_PATH/DIR_NAME
286 Directory where to read the information necessary to step 3 (final report generation).
287 This information typically includes names of phage found and per sequence statistics.
288 Note: it is up to the user to delete the files in this directory.
289
290 Directory for per sequence results
291 --dir_seq_mm=DIR_PATH/DIR_NAME
292 Directory where to get the information if no match was found for one/several sequences.
293 Note: it is up to the user to delete the files in this directory.
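The three multi-machine steps described above can be laid out as lists of commands, one per job. The sketch below only generates the command lines; how they are dispatched (job scheduler, ssh, etc.) is left open. It uses only flags documented in this README. Passing -f and -r at every step is an assumption, as are the function name, the directory names and the file names.

```python
def multi_machine_plan(fastq, ref, n_cores, n_sequences,
                       dir_cov="cov_results", dir_seq="seq_results",
                       dr_path="DR_results"):
    """Return (step1, step2, step3) lists of PhageTerm.py argument lists."""
    # Assumption: reads and reference are given at every step.
    base = ["./PhageTerm.py", "-f", fastq, "-r", ref, "--mm"]

    # Step 1: one job per read chunk, --core_id from 0 to CORE_NBR-1.
    step1 = [base + ["--dir_cov_mm", dir_cov, "-c", str(n_cores),
                     "--core_id", str(i)]
             for i in range(n_cores)]

    # Step 2: one job per sequence, --seq_id from 0 to N-1;
    # --nb_pieces must equal the -c value used at step 1.
    step2 = [base + ["--dir_cov_mm", dir_cov, "--dir_seq_mm", dir_seq,
                     "--DR_path", dr_path, "--seq_id", str(j),
                     "--nb_pieces", str(n_cores)]
             for j in range(n_sequences)]

    # Step 3: a single job that generates the final report.
    step3 = [base + ["--DR_path", dr_path, "--dir_seq_mm", dir_seq]]
    return step1, step2, step3


if __name__ == "__main__":
    s1, s2, s3 = multi_machine_plan("reads.fastq", "virome.fasta",
                                    n_cores=4, n_sequences=3)
    for step in (s1, s2, s3):
        for cmd in step:
            print(" ".join(cmd))
```

Here n_sequences stands for N, the number of sequences reported at the end of step 1, so step 2 can only be planned once step 1 has finished.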
294
295
296
297
298
299
300 OUTPUT FILES
301 ============
302
303 (i) Report (.pdf)
304
305 (ii) Statistical table (.csv)
306
307 (iii) File containing the phage sequence(s) re-organized to start at the predicted termini (.fasta)
308
309
310 CONTACT
311 =======
312
313 Julian Garneau <julian.garneau@usherbrooke.ca>
314 Marc Monot <marc.monot@pasteur.fr>
315 David Bikard <david.bikard@pasteur.fr>
316 Véronique Legrand <vlegrand@pasteur.fr>