comparison egapx_runner.xml @ 2:a3b158471bd3 draft

planemo upload for repository https://github.com/ncbi/egapx commit 98875ef7eda9323fc9991970103954e9097d9e73
author fubar
date Sun, 04 Aug 2024 00:06:43 +0000
parents c8e1543546f8
children 6592ae57bb8b
1:c8e1543546f8 2:a3b158471bd3
36 </tests> 36 </tests>
37 37
38 38
39 39
40 <help><![CDATA[ 40 <help><![CDATA[
41 **What it Does** 41 Galaxy tool wrapping the Eukaryotic Genome Annotation Pipeline (EGAPx)
42 =================================================================================================
43
44 **A very simple and crude way to run the EGAPx workflows inside Galaxy**
45
46 EGAPx requires huge resources to run with useful data. *128GB and 32 cores* are the minimum requirement; *256GB and 64 cores* are recommended.
47
48 A special minimal test example can be run in 6GB with 4 cores.
49
50 In this implementation, the user supplies a YAML configuration file, as an initial proof of concept.
51
52 This wrapper does not use computational resources as efficiently as converting the Nextflow (NF) workflow components into Galaxy tools would, but it is very simple to maintain until EGAPx becomes stable. It will also make it possible to measure the actual loss of efficiency from the crude wrapping method.
53
54
55 Sample yaml configurations
56 ===========================
57
58 Sample YAML configurations can be uploaded into your Galaxy history from the `EGAPx GitHub repository <https://github.com/ncbi/egapx/tree/main/examples/>`_.
59 The simplest possible example is shown below; it can be copied and pasted into a history dataset using the upload tool.
60
61
62 *./examples/input_D_farinae_small.yaml* is included in the examples linked above. RNA-seq data are provided as URIs to the read FASTA files.
63 These FASTA files contain a sample of the reads from the complete SRA read files, to expedite testing.
64
65 ::
66
67 genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz
68 taxid: 6954
69 reads:
70 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.1
71 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.2
72 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.1
73 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.2
74
75
76
77 Purpose
78 ========
79
80 **This is not intended for production**
81
82 This tool is just a proof of concept.
83 It may be too inefficient to be useful, although that may turn out not to be a problem when it is run on a dedicated workstation.
84 At least the efficiency can now be estimated more easily.
85
86 This tool is not recommended for public deployment because of the resource demands.
87
88 EGAPx Overview
89 ===============
90
91 .. image:: $PATH_TO_IMAGES/Pipeline_sm_ncRNA_CAGE_80pct.png
92
93 **Warning:**
94 The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub `Issue <https://github.com/ncbi/egapx/issues>`_ if you encounter any problems with EGAPx. You can also write to cgr@nlm.nih.gov to give us your feedback or if you have any questions.
95
96 EGAPx is the publicly accessible version of the updated NCBI `Eukaryotic Genome Annotation Pipeline <https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/>`_.
97
98 EGAPx takes an assembly FASTA file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs ``miniprot`` to align protein sequences, and ``STAR`` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to ``Gnomon`` for gene prediction. In the first step of ``Gnomon``, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by *ab initio* predictions based on HMM models. The final annotation for the input assembly is produced as a ``gff`` file.
99
100 **Security Notice:**
101
102 EGAPx has dependencies in and outside of its execution path that include several thousand files from the `NCBI C++ toolkit <https://www.ncbi.nlm.nih.gov/toolkit>`_, and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team on risk and, if there is concern, consider mitigating options like running via VM or cloud instance.
103
104
105 *To specify an array of NCBI SRA datasets in yaml*
106
107 ::
108
109 reads:
110 - SRR8506572
111 - SRR9005248
112
113
114 *To specify an SRA entrez query*
115
116 ::
117
118 reads: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )'
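
The three ways of giving reads in this help text (URIs, SRA run accessions, or a single Entrez query) differ only in the value of the ``reads`` key. The following is a minimal sketch, not part of EGAPx or of this wrapper, that renders such a configuration as YAML text without extra dependencies. The function name ``render_egapx_config`` is hypothetical, and the sketch assumes only the three keys shown in the examples (``genome``, ``taxid``, ``reads``).

```python
def render_egapx_config(genome, taxid, reads):
    """Return EGAPx input YAML text (illustrative helper, not an EGAPx API).

    `reads` is either a list of URIs or SRA run accessions,
    or a single Entrez query string.
    """
    lines = [f"genome: {genome}", f"taxid: {taxid}"]
    if isinstance(reads, str):
        # A query is written as a single-quoted YAML scalar, as in the
        # Entrez example; YAML escapes an embedded single quote by doubling it.
        escaped = reads.replace("'", "''")
        lines.append(f"reads: '{escaped}'")
    else:
        # A list becomes a YAML block sequence, one item per line.
        lines.append("reads:")
        lines.extend(f"  - {item}" for item in reads)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    genome_url = (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/"
        "GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz"
    )
    print(render_egapx_config(genome_url, 6954, ["SRR8506572", "SRR9005248"]))
```

The printed text can be pasted into a history dataset in the same way as the sample configuration.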
119
120
121 **Note:** Both of the above examples will have more RNA-seq data than the ``input_D_farinae_small.yaml`` example. To make sure the Entrez query does not produce a large number of SRA runs, please run it first at the `NCBI SRA page <https://www.ncbi.nlm.nih.gov/sra>`_. If there are too many SRA runs, select a few of them and list them in the input YAML.
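
As an alternative to the web page, the size of a query result can be checked with the NCBI E-utilities ``esearch`` endpoint. This is a rough sketch only: the function names are hypothetical, and the count returned is the number of matching SRA records, which need not equal the number of runs EGAPx would download.

```python
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(query):
    """Build an esearch URL for the SRA database that returns only the hit count."""
    params = {"db": "sra", "term": query, "retmode": "json", "retmax": "0"}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def count_sra_records(query):
    """Fetch the number of SRA records matching `query` (needs network access)."""
    with urllib.request.urlopen(esearch_url(query), timeout=30) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])
```

Calling ``count_sra_records`` with the query string from the ``reads`` entry before launching a run gives an early indication of whether the query is too broad.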
122
123 Output
124 =======
125
126 EGAPx output will appear as a collection in the user history. The main annotation file is called *accept.gff*.
127
128 ::
129
130 accept.gff
131 annot_builder_output
132 nextflow.log
133 run.report.html
134 run.timeline.html
135 run.trace.txt
136 run_params.yaml
137
138
139 *nextflow.log* is the log file that records every process and its work directory. ``run_params.yaml`` has all the parameters that were used in the EGAPx run. More information about process times and resources can be found in the other ``run.*`` files.
140
141 Intermediate files
142 ===================
143 Each line in the log denotes a process that completed in the workflow. The first column (*e.g.* ``[96/621c4b]``) is the subdirectory where the intermediate output files and logs are found for the process on the same line, *i.e.*, ``egapx:miniprot:run_miniprot``. To see the intermediate files for that process, go to the work directory path that you supplied and traverse to the subdirectory ``96/621c4b``:
144
145 ::
146
147 $ aws s3 ls s3://temp_datapath/D_farinae/96/
148 PRE 06834b76c8d7ceb8c97d2ccf75cda4/
149 PRE 621c4ba4e6e87a4d869c696fe50034/
150 $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/
151 PRE output/
152 2024-03-27 11:19:18 0
153 2024-03-27 11:19:28 6 .command.begin
154 2024-03-27 11:20:24 762 .command.err
155 2024-03-27 11:20:26 762 .command.log
156 2024-03-27 11:20:23 0 .command.out
157 2024-03-27 11:19:18 13103 .command.run
158 2024-03-27 11:19:18 129 .command.sh
159 2024-03-27 11:20:24 276 .command.trace
160 2024-03-27 11:20:25 1 .exitcode
161 $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/output/
162 2024-03-27 11:20:24 17127134 aligns.paf
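
The short hash in the log is only a prefix of the real directory name, as the listing above shows. When the work directory is on a local filesystem rather than S3 (an assumption, since the listing uses S3), the lookup can be sketched as follows. ``find_work_dirs`` is a hypothetical helper, not part of EGAPx or Nextflow.

```python
from pathlib import Path


def find_work_dirs(work_root, short_hash):
    """Return work directories matching a short hash such as '[96/621c4b]'.

    The log shows a two-character directory followed by the first characters
    of the task directory name, so the second part is matched as a prefix.
    """
    prefix, _, rest = short_hash.strip("[]").partition("/")
    matches = Path(work_root).glob(f"{prefix}/{rest}*")
    return sorted(p for p in matches if p.is_dir())
```

Each returned directory should contain the ``.command.*`` files and the ``output`` folder for that process.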
163
164
42 ]]></help> 165 ]]></help>
43 <citations> 166 <citations>
44 <citation type="doi">10.1093/bioinformatics/bts573</citation> 167 <citation type="doi">10.1093/bioinformatics/bts573</citation>
45 </citations> 168 </citations>
46 </tool> 169 </tool>