comparison README.md @ 1:c8e1543546f8 draft

planemo upload for repository https://github.com/ncbi/egapx commit 8173d01b08d9a91c9ec5f6cb50af346edc8020c4-dirty
author fubar
date Sat, 03 Aug 2024 12:10:13 +0000
parents d9c5c5b87fec
children
0:d9c5c5b87fec 1:c8e1543546f8
1 # Eukaryotic Genome Annotation Pipeline - External (EGAPx) 1 # Galaxy tool wrapping the Eukaryotic Genome Annotation Pipeline - External (EGAPx)
2
3 **Warning**
4 This is a very simple and crude way to run the EGAPx workflow inside Galaxy.
5
6 EGAPx requires huge resources to run with useful data. 128GB and 32 cores are the minimum; 256GB and 64 cores are recommended.
7
8 There is a special test minimal example that can be run in 6GB with 4 cores.
9
10 The user must supply a YAML configuration file in this initial proof of concept.
11 Samples are available in the EGAPx GitHub repository, and one is shown below for cut/paste into a history dataset in the upload tool.
12
13 This is not intended for production. Just a proof of concept.
14 It is possibly too inefficient to be useful although it may turn out not to be a problem if run on a dedicated workstation.
15 At least the efficiency can now be more easily estimated.
16
17 This is not recommended for public deployment because of the resource demands.
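
The resource figures quoted in this warning can be turned into a quick pre-flight check. What follows is a minimal sketch and not part of EGAPx or the Galaxy wrapper: the function name `classify_resources` and the tier labels are hypothetical, and only the thresholds (6GB and 4 cores for the minimal test, 128GB and 32 cores as the minimum, 256GB and 64 cores as recommended) come from the text.

```python
import os

# Thresholds quoted in the warning: (label, minimum cores, minimum memory in GB).
# The tier labels are hypothetical names chosen for this sketch.
TIERS = [
    ("recommended", 64, 256),
    ("minimum", 32, 128),
    ("test-only", 4, 6),
]


def classify_resources(cores, mem_gb):
    """Return the highest tier that the given cores and memory both satisfy."""
    for label, need_cores, need_mem in TIERS:
        if cores >= need_cores and mem_gb >= need_mem:
            return label
    return "insufficient"


if __name__ == "__main__":
    # os.cpu_count() may return None, so fall back to zero cores.
    cores = os.cpu_count() or 0
    # Memory is supplied by hand here (an assumption of this sketch) because
    # the standard library has no portable call for total RAM.
    print(classify_resources(cores, mem_gb=128))
```

A machine that only reaches the `test-only` tier should be limited to the special minimal example mentioned above.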
18
19
2 20
3 EGAPx is the publicly accessible version of the updated NCBI [Eukaryotic Genome Annotation Pipeline](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/). 21 EGAPx is the publicly accessible version of the updated NCBI [Eukaryotic Genome Annotation Pipeline](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/).
4 22
5 EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. The final annotation for the input assembly is produced as a `gff` file. 23 EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by _ab-initio_ predictions based on HMM models. The final annotation for the input assembly is produced as a `gff` file.
6
7 We currently have protein datasets posted that are suitable for most vertebrates and arthropods:
8 - Chordata - Mammalia, Sauropsida, Actinopterygii (ray-finned fishes)
9 - Insecta - Hymenoptera, Diptera, Lepidoptera, Coleoptera, Hemiptera
10 - Arthropoda - Arachnida, other Arthropoda
11
12 We will be adding datasets for plants and other invertebrates in the next couple of months. Fungi, protists and nematodes are currently out-of-scope for EGAPx pending additional refinements.
13
14 We currently have protein datasets posted for most vertebrates (mammals, sauropsids, ray-finned fishes) and arthropods. We will be adding datasets for more arthropods, vertebrates and plants in the next couple of months. Fungi, protists and nematodes are currently out-of-scope for EGAPx pending additional refinements.
15 24
16 **Warning:** 25 **Warning:**
17 The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. You can also write to cgr@nlm.nih.gov to give us your feedback or if you have any questions. 26 The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. You can also write to cgr@nlm.nih.gov to give us your feedback or if you have any questions.
18 27
19 28
20 **Security Notice:** 29 **Security Notice:**
21 EGAPx has dependencies in and outside of its execution path that include several thousand files from the [NCBI C++ toolkit](https://www.ncbi.nlm.nih.gov/toolkit), and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team on risk and if there is concern, consider mitigating options like running via VM or cloud instance. 30 EGAPx has dependencies in and outside of its execution path that include several thousand files from the [NCBI C++ toolkit](https://www.ncbi.nlm.nih.gov/toolkit), and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team on risk and if there is concern, consider mitigating options like running via VM or cloud instance.
22
23 **License:**
24 See the EGAPx license [here](https://github.com/ncbi/egapx/blob/main/LICENSE).
25
26
27
28 ## Prerequisites
29
30 - Docker or Singularity
31 - AWS batch, UGE cluster, or a r6a.4xlarge machine (32 CPUs, 256GB RAM)
32 - Nextflow v.23.10.1
33 - Python v.3.9+
34
35 Notes:
36 - General configuration for AWS Batch is described in the Nextflow documentation at https://www.nextflow.io/docs/latest/aws.html
37 - See Nextflow installation at https://www.nextflow.io/docs/latest/getstarted.html
38
39 ## The workflow files
40
41 - Clone the EGAPx repo:
42 ```
43 git clone https://github.com/ncbi/egapx.git
44 cd egapx
45 ```
46
47 ## Input data format
48
49 Input to EGAPx is in the form of a YAML file.
50
51 - The following are the _required_ key-value pairs for the input file:
52
53 ```
54 genome: path to assembled genome in FASTA format
55 taxid: NCBI Taxonomy identifier of the target organism
56 reads: RNA-seq data
57 ```
58 You can obtain taxid from the [NCBI Taxonomy page](https://www.ncbi.nlm.nih.gov/taxonomy).
59
60
61 - RNA-seq data can be supplied in any one of the following ways:
62
63 ```
64 reads: [ array of paths to reads FASTA or FASTQ files]
65 reads: [ array of SRA run IDs ]
66 reads: [SRA Study ID]
67 reads: SRA query for reads
68 ```
69 - If you are using your local reads, then the FASTA/FASTQ files should be provided in the following format:
70 ```
71 reads:
72 - path_to_Sample1_R1.gz
73 - path_to_Sample1_R2.gz
74 - path_to_Sample2_R1.gz
75 - path_to_Sample2_R2.gz
76 ```
77
78 - If you provide an SRA Study ID, all the SRA run IDs belonging to that Study ID will be included in the EGAPx run.
79
80 - The following are the _optional_ key-value pairs for the input file:
81
82 - A protein set. A taxid-based protein set will be chosen if no protein set is provided.
83 ```
84 proteins: path to proteins data in FASTA format.
85 ```
86
87 - HMM file used in Gnomon training. A taxid-based HMM will be chosen if no HMM file is provided.
88 ```
89 hmm: path to HMM file
90 ```
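
The required and optional keys described in this section can be assembled programmatically. The sketch below is an illustration only: `render_input_yaml` is a hypothetical helper, not part of EGAPx, and it assumes that paths and IDs contain no characters that need YAML quoting. It uses only the standard library, so it writes the YAML text directly instead of relying on PyYAML.

```python
def render_input_yaml(genome, taxid, reads, proteins=None, hmm=None):
    """Build the text of an EGAPx input YAML file from plain Python values.

    `reads` is either a list (file paths or SRA run IDs) or a single
    string holding an SRA query, matching the forms listed above.
    """
    if not genome or not reads:
        raise ValueError("genome and reads are required")
    lines = [f"genome: {genome}", f"taxid: {int(taxid)}"]
    if isinstance(reads, str):
        # Single-quote the query so brackets are not parsed as a YAML list;
        # a literal single quote is escaped by doubling it.
        escaped = reads.replace("'", "''")
        lines.append(f"reads: '{escaped}'")
    else:
        lines.append("reads:")
        lines.extend(f"  - {item}" for item in reads)
    # Optional keys are written only when a value was supplied.
    if proteins:
        lines.append(f"proteins: {proteins}")
    if hmm:
        lines.append(f"hmm: {hmm}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_input_yaml(
        "genome.fa", 6954,
        ["path_to_Sample1_R1.gz", "path_to_Sample1_R2.gz"],
    ))
```

The resulting text can be saved to a file and passed to `egapx.py`, or pasted into a Galaxy history dataset.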
91
92
93 31
94 ## Input example 32 ## Input example
95 33
96 - A test example YAML file `./examples/input_D_farinae_small.yaml` is included in the `egapx` folder. Here, the RNA-seq data is provided as paths to the reads FASTA files. These FASTA files are a sampling of the reads from the complete SRA read files to expedite testing. 34 - A test example YAML file `./examples/input_D_farinae_small.yaml` is included in the `egapx` folder. Here, the RNA-seq data is provided as paths to the reads FASTA files. These FASTA files are a sampling of the reads from the complete SRA read files to expedite testing.
97 35
118 reads: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )' 56 reads: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )'
119 ``` 57 ```
120 58
121 **Note:** Both of the above examples will have more RNA-seq data than the `input_D_farinae_small.yaml` example. To make sure the Entrez query does not produce a large number of SRA runs, please run it first at the [NCBI SRA page](https://www.ncbi.nlm.nih.gov/sra). If there are too many SRA runs, select a few of them and list them in the input YAML. 59 **Note:** Both of the above examples will have more RNA-seq data than the `input_D_farinae_small.yaml` example. To make sure the Entrez query does not produce a large number of SRA runs, please run it first at the [NCBI SRA page](https://www.ncbi.nlm.nih.gov/sra). If there are too many SRA runs, select a few of them and list them in the input YAML.
122 60
123 - First, test EGAPx on the example provided (`input_D_farinae_small.yaml`, a dust mite) to make sure everything works. This example usually runs in under 30 minutes, depending on resource availability. There are other examples you can try: `input_C_longicornis.yaml`, a green fly, and `input_Gavia_stellata.yaml`, a bird. These will take close to two hours. You can prepare your input YAML file following these examples.
124
125 ## Run EGAPx
126
127 - The `egapx` folder contains the following directories:
128 - examples
129 - nf
130 - test
131 - third_party_licenses
132 - ui
133
134 - The runner script is within the ui directory (`ui/egapx.py`). 
135
136 - Create a virtual environment where you can run EGAPx. There is a `requirements.txt` file. PyYAML will be installed in this environment.
137 ```
138 python -m venv /path/to/new/virtual/environment
139 source /path/to/new/virtual/environment/bin/activate
140 pip install -r ui/requirements.txt
141 ```
142
143
144
145
146
147 - Run EGAPx for the first time to copy the config files so you can edit them:
148 ```
149 python3 ui/egapx.py ./examples/input_D_farinae_small.yaml -o example_out
150 ```
151 - When you run `egapx.py` for the first time it copies the template config files to the directory `./egapx_config`.
152 - You will need to edit these templates to reflect the actual parameters of your setup.
153 - For AWS Batch execution, set up AWS Batch Service following advice in the AWS link above. Then edit the value for `process.queue` in `./egapx_config/aws.config` file.
154 - For execution on the local machine you don't need to adjust anything.
155
156
157 - Run EGAPx with the following command for real this time.
158 - For AWS Batch execution, replace temp_datapath with an existing S3 bucket.
159 - For local execution, use a local path for `-w`
160 ```
161 python3 ui/egapx.py ./examples/input_D_farinae_small.yaml -e aws -w s3://temp_datapath/D_farinae -o example_out
162 ```
163
164 - use `-e aws` for AWS batch using Docker image
165 - use `-e docker` for using Docker image
166 - use `-e singularity` for using the Singularity image
167 - use `-e biowulf_cluster` for Biowulf cluster using Singularity image
168 - use `-e slurm` for using SLURM on your HPC cluster.
169 - Note that for this option, you have to edit `./egapx_config/slurm.config` according to your cluster specifications.
170 - type `python3 ui/egapx.py  -h ` for the help menu
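
The executor options listed above differ only in the value passed to `-e` and in whether a working directory is needed. The following sketch shows one way to assemble the command line from Python; `build_egapx_command` is a hypothetical helper, and the only flags it uses are `-e`, `-w`, and `-o` as they appear in the commands above.

```python
import shlex

# Executor names taken from the list above.
EXECUTORS = {"aws", "docker", "singularity", "biowulf_cluster", "slurm"}


def build_egapx_command(input_yaml, executor, output, workdir=None):
    """Return the argument list for one egapx.py run."""
    if executor not in EXECUTORS:
        raise ValueError(f"unknown executor: {executor}")
    cmd = ["python3", "ui/egapx.py", input_yaml, "-e", executor, "-o", output]
    if workdir is not None:
        # An S3 location for AWS Batch, a local path otherwise.
        cmd += ["-w", workdir]
    return cmd


if __name__ == "__main__":
    cmd = build_egapx_command(
        "examples/input_D_farinae_small.yaml",
        "aws",
        "example_out",
        workdir="s3://temp_datapath/D_farinae",
    )
    # Print the command for inspection; hand the list to subprocess.run
    # to actually launch it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if a path contains spaces.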
171
172 ```
173 $ ui/egapx.py -h
174
175
176 !!WARNING!!
177 This is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use.
178
179 usage: egapx.py [-h] [-o OUTPUT] [-e EXECUTOR] [-c CONFIG_DIR] [-w WORKDIR] [-r REPORT] [-n] [-st]
180 [-so] [-dl] [-lc LOCAL_CACHE] [-q] [-v] [-fn FUNC_NAME]
181 [filename]
182
183 Main script for EGAPx
184
185 optional arguments:
186 -h, --help show this help message and exit
187 -e EXECUTOR, --executor EXECUTOR
188 Nextflow executor, one of docker, singularity, aws, or local (for NCBI
189 internal use only). Uses corresponding Nextflow config file
190 -c CONFIG_DIR, --config-dir CONFIG_DIR
191 Directory for executor config files, default is ./egapx_config. Can be also
192 set as env EGAPX_CONFIG_DIR
193 -w WORKDIR, --workdir WORKDIR
194 Working directory for cloud executor
195 -r REPORT, --report REPORT
196 Report file prefix for report (.report.html) and timeline (.timeline.html)
197 files, default is in output directory
198 -n, --dry-run
199 -st, --stub-run
200 -so, --summary-only Print result statistics only if available, do not compute result
201 -lc LOCAL_CACHE, --local-cache LOCAL_CACHE
202 Where to store the downloaded files
203 -q, --quiet
204 -v, --verbose
205 -fn FUNC_NAME, --func_name FUNC_NAME
206 func_name
207
208 run:
209 filename YAML file with input: section with at least genome: and reads: parameters
210 -o OUTPUT, --output OUTPUT
211 Output path
212
213 download:
214 -dl, --download-only Download external files to local storage, so that future runs can be
215 isolated
216
217
218 ```
219
220
221 ## Test run
222
223 ```
224 $ python3 ui/egapx.py examples/input_D_farinae_small.yaml -e aws -o example_out -w s3://temp_datapath/D_farinae
225
226 !!WARNING!!
227 This is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use.
228
229 N E X T F L O W ~ version 23.10.1
230 Launching `/../home/user/egapx/ui/../nf/ui.nf` [golden_mercator] DSL2 - revision: c134f40af5
231 in egapx block
232 executor > awsbatch (67)
233 [f5/3007b8] process > egapx:setup_genome:get_genome_info [100%] 1 of 1 ✔
234 [32/a1bfa5] process > egapx:setup_proteins:convert_proteins [100%] 1 of 1 ✔
235 [96/621c4b] process > egapx:miniprot:run_miniprot [100%] 1 of 1 ✔
236 [6d/766c2f] process > egapx:paf2asn:run_paf2asn [100%] 1 of 1 ✔
237 [56/f1dd6b] process > egapx:best_aligned_prot:run_best_aligned_prot [100%] 1 of 1 ✔
238 [c1/ccc4a3] process > egapx:align_filter_sa:run_align_filter_sa [100%] 1 of 1 ✔
239 [e0/5548d0] process > egapx:run_align_sort [100%] 1 of 1 ✔
240 [a8/456a0e] process > egapx:star_index:build_index [100%] 1 of 1 ✔
241 [d5/6469a6] process > egapx:star_simplified:exec (1) [100%] 2 of 2 ✔
242 [64/99ab35] process > egapx:bam_strandedness:exec (2) [100%] 2 of 2 ✔
243 [98/a12969] process > egapx:bam_strandedness:merge [100%] 1 of 1 ✔
244 [78/0d7007] process > egapx:bam_bin_and_sort:calc_assembly_sizes [100%] 1 of 1 ✔
245 [74/bb014e] process > egapx:bam_bin_and_sort:bam_bin (2) [100%] 2 of 2 ✔
246 [39/3cdd00] process > egapx:bam_bin_and_sort:merge_prepare [100%] 1 of 1 ✔
247 [01/f64e38] process > egapx:bam_bin_and_sort:merge (1) [100%] 1 of 1 ✔
248 [aa/47a002] process > egapx:bam2asn:convert (1) [100%] 1 of 1 ✔
249 [45/6661b3] process > egapx:rnaseq_collapse:generate_jobs [100%] 1 of 1 ✔
250 [64/68bc37] process > egapx:rnaseq_collapse:run_rnaseq_collapse (3) [100%] 9 of 9 ✔
251 [18/bff1ac] process > egapx:rnaseq_collapse:run_gpx_make_outputs [100%] 1 of 1 ✔
252 [a4/76a4a5] process > egapx:get_hmm_params:run_get_hmm [100%] 1 of 1 ✔
253 [3c/b71c42] process > egapx:chainer:run_align_sort (1) [100%] 1 of 1 ✔
254 [e1/340b6d] process > egapx:chainer:generate_jobs [100%] 1 of 1 ✔
255 [c0/477d02] process > egapx:chainer:run_chainer (16) [100%] 16 of 16 ✔
256 [9f/27c1c8] process > egapx:chainer:run_gpx_make_outputs [100%] 1 of 1 ✔
257 [5c/8f65d0] process > egapx:gnomon_wnode:gpx_qsubmit [100%] 1 of 1 ✔
258 [34/6ab0c9] process > egapx:gnomon_wnode:annot (1) [100%] 10 of 10 ✔
259 [a9/e38221] process > egapx:gnomon_wnode:gpx_qdump [100%] 1 of 1 ✔
260 [bc/8ebca4] process > egapx:annot_builder:annot_builder_main [100%] 1 of 1 ✔
261 [5f/6b72c0] process > egapx:annot_builder:annot_builder_input [100%] 1 of 1 ✔
262 [eb/1ccdd0] process > egapx:annot_builder:annot_builder_run [100%] 1 of 1 ✔
263 [4d/6c33db] process > egapx:annotwriter:run_annotwriter [100%] 1 of 1 ✔
264 [b6/d73d18] process > export [100%] 1 of 1 ✔
265 Waiting for file transfers to complete (1 files)
266 Completed at: 27-Mar-2024 11:43:15
267 Duration : 27m 36s
268 CPU hours : 4.2
269 Succeeded : 67
270 ```
271 ## Output 61 ## Output
272 62
273 Look at the output in the output directory (`example_out`) that was supplied on the command line. The annotation file is called `accept.gff`. 63 Look at the output in the output directory (`example_out`) that was supplied on the command line. The annotation file is called `accept.gff`.
274 ``` 64 ```
275 accept.gff 65 accept.gff
305 2024-03-27 11:20:25 1 .exitcode 95 2024-03-27 11:20:25 1 .exitcode
306 $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/output/ 96 $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/output/
307 2024-03-27 11:20:24 17127134 aligns.paf 97 2024-03-27 11:20:24 17127134 aligns.paf
308 ``` 98 ```
309 99
310 ## Offline mode
311
312 If you do not have internet access from your cluster, you can run EGAPx in offline mode. To do this, first pull the Singularity image, then download the necessary files from the NCBI FTP site using the `egapx.py` script, and finally pass the path of the downloaded folder in the run command. Here is an example of how to download the files and execute EGAPx on the Biowulf cluster.
313
314
315 - Download the Singularity image:
316 ```
317 rm egap*sif
318 singularity cache clean
319 singularity pull docker://ncbi/egapx:0.2-alpha
320 ```
321
322 - Clone the repo:
323 ```
324 git clone https://github.com/ncbi/egapx.git
325 cd egapx
326 ```
327
328 - Download EGAPx related files from NCBI:
329 ```
330 python3 ui/egapx.py -dl -lc ../local_cache
331 ```
332
333 - Download SRA reads:
334 ```
335 prefetch SRR8506572
336 prefetch SRR9005248
337 fasterq-dump --skip-technical --threads 6 --split-files --seq-defline ">\$ac.\$si.\$ri" --fasta -O sradir/ ./SRR8506572
338 fasterq-dump --skip-technical --threads 6 --split-files --seq-defline ">\$ac.\$si.\$ri" --fasta -O sradir/ ./SRR9005248
339
340 ```
341 You should see the downloaded files inside the `sradir` folder:
342 ```
343 ls sradir/
344 SRR8506572_1.fasta SRR8506572_2.fasta SRR9005248_1.fasta SRR9005248_2.fasta
345 ```
346 Now edit the file paths of SRA reads files in `examples/input_D_farinae_small.yaml` to include the above SRA files.
347
348 - Run `egapx.py` once to create the config files, then append the Singularity image path to `biowulf_cluster.config`:
349 ```
350 ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
351 echo "process.container = '/path_to_/egapx_0.2-alpha.sif'" >> egapx_config/biowulf_cluster.config
352 ```
353
354 - Run `egapx.py`:
355 ```
356 ui/egapx.py examples/input_D_farinae_small.yaml -e biowulf_cluster -w dfs_work -o dfs_out -lc ../local_cache
357
358 ```
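
Before launching an offline run, it can help to confirm that the read files produced in the "Download SRA reads" step are where the input YAML expects them. The sketch below is a hypothetical helper, not part of EGAPx. It assumes paired-end runs that `fasterq-dump --split-files` wrote as `<run>_1.fasta` and `<run>_2.fasta`, as in the `ls sradir/` listing above.

```python
import os


def paired_fasta_paths(run_ids, sradir="sradir"):
    """Return the _1/_2 FASTA paths expected for each SRA run ID."""
    paths = []
    for run_id in run_ids:
        for mate in (1, 2):
            paths.append(f"{sradir}/{run_id}_{mate}.fasta")
    return paths


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    expected = paired_fasta_paths(["SRR8506572", "SRR9005248"])
    for path in missing_files(expected):
        print(f"missing: {path}")
```

The paths returned by `paired_fasta_paths` are the ones to list under `reads:` in the edited `examples/input_D_farinae_small.yaml`.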
359
360
361 ## References
362
363 Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021 Apr;18(4):366-368. doi: 10.1038/s41592-021-01101-x. Epub 2021 Apr 7. PMID: 33828273; PMCID: PMC8026399.
364
365 Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.
366
367 Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PMID: 23104886; PMCID: PMC3530905.
368
369 Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023 Jan 1;39(1):btad014. doi: 10.1093/bioinformatics/btad014. PMID: 36648328; PMCID: PMC9869432.
370
371 Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.
372
373
374
375 ## Contact us
376
377 Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. You can also write to cgr@nlm.nih.gov to give us your feedback or if you have any questions.