comparison toolfactory/README.md @ 121:2050b2475ae5 draft

Uploaded
author fubar
date Thu, 07 Jan 2021 09:24:17 +0000
parents
children 63d15caea378
120:0c6c3e10a8f4 121:2050b2475ae5
1 **Breaking news! Docker container is recommended as at August 2020**
2
3 A Docker container can be built - see the docker directory.
4 It is highly recommended for isolation. It also has an integrated toolshed to allow installation of new tools back
5 into the Galaxy being used to generate them.
6
7 The container is built from quay.io/bgruening/galaxy:20.05 but updates the
8 Galaxy code to the dev branch - it seems to work fine with updated bioblend>=0.14,
9 planemo and the right version of gxformat2 needed by the ToolFactory (TF).
10
11 The runclean.sh script run from the docker subdirectory of your local clone of this repository
12 should create a container (eventually) and serve it at localhost:8080 with a toolshed at
13 localhost:9009.
14
15 Once it's up, please restart Galaxy in the container with
16 ```docker exec [container name] supervisorctl restart galaxy:```
17 Jobs just do not seem to run properly otherwise and the next steps won't work!
18
19 The generated container includes a workflow and 2 sample data sets for the workflow.
20
21 Load the workflow. Adjust the inputs for each as labelled. The perl example counts GC in phiX.fasta.
22 The python scripts use the rgToolFactory.py as their input - any text file will work but I like the
23 recursion. The BWA example has some mitochondrial reads and reference. Run the workflow and watch.
24 This should fill the history with some sample tools you can rerun and play with.
25 Note that each new tool will have been tested using Planemo. In the workflow, in Galaxy.
26 Extremely cool to watch.
27
28 *WARNING*
29
30 Install this tool on a throw-away private Galaxy or Docker container ONLY.
31 Please NEVER install it on a public or production instance.
32
33 *Short Story*
34
35 Galaxy is easily extended to new applications by adding a new tool. Each new scientific computational package added as
36 a tool to Galaxy requires some special instructions to be written. This is sometimes termed "wrapping" the package
37 because the instructions tell Galaxy how to run the package as a new Galaxy tool. Any tool in a Galaxy is
38 readily available to all the users through a consistent and easy to use interface.
39
40 Most Galaxy tool wrappers have been manually prepared by skilled programmers, many using Planemo because it
41 automates much of the basic boilerplate and makes the process much easier. The ToolFactory (TF)
42 uses Planemo under the hood for many functions, but hides the command
43 line complexities from the TF user.
44
45 *More Explanation*
46
47 The TF is an unusual Galaxy tool, designed to allow a skilled user to make new Galaxy tools.
48 It appears in Galaxy just like any other tool but outputs include new Galaxy tools generated
49 using instructions provided by the user and the results of Planemo lint and tool testing using
50 small sample inputs provided by the TF user. The small samples become tests built in to the new tool.
51
52 It offers a familiar Galaxy form driven way to define how the user of the new tool will
53 choose input data from their history, and what parameters the new tool user will be able to adjust.
54 The TF user must know, or be able to read, enough about the tool to be able to define the details of
55 the new Galaxy interface and the ToolFactory offers little guidance on that other than some examples.
56
57 Tools always depend on other things. Most tools in Galaxy depend on third party
58 scientific packages, so TF tools usually have one or more dependencies. These can be
59 scientific packages such as BWA or scripting languages such as Python and are
60 usually managed by Conda. If the new tool relies on a system utility such as bash or awk
61 where the importance of version control on reproducibility is low, these can be used without
62 Conda management - but remember the potential risks of unmanaged dependencies on computational
63 reproducibility.
64
65 The TF user can optionally supply a working script where scripting is
66 required and the chosen dependency is a scripting language such as Python or a system
67 scripting executable such as bash. Whatever the language, the script must correctly parse the command line
68 arguments it receives at tool execution, as they are defined by the TF user. The
69 text of that script is "baked in" to the new tool and will be executed each time
70 the new tool is run. It is highly recommended that scripts and their command lines be developed
71 and tested until proven to work before the TF is invoked. Using Galaxy as a software development
72 environment is possible, but not recommended because it is somewhat clumsy and inefficient.
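
As a minimal sketch of the kind of script described above, the following Python counts G and C bases in a FASTA file, loosely mirroring the perl GC example in the sample workflow. The argument names `--input` and `--output` and the file names are assumptions for illustration only; a real script must parse whatever names or positions were defined on the TF form.

```python
import argparse
import os
import tempfile


def count_gc(lines):
    """Return (gc_count, total_bases) for an iterable of FASTA lines."""
    gc = total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        seq = line.upper()
        gc += seq.count("G") + seq.count("C")
        total += len(seq)
    return gc, total


def main(argv=None):
    """Parse the command line, count GC and write a one-line report."""
    parser = argparse.ArgumentParser(description="Count GC bases in a FASTA file")
    parser.add_argument("--input", required=True, help="FASTA file to read")
    parser.add_argument("--output", required=True, help="path for the text report")
    args = parser.parse_args(argv)
    with open(args.input) as handle:
        gc, total = count_gc(handle)
    with open(args.output, "w") as out:
        out.write(f"GC={gc} total={total}\n")


# Demonstration with a throw-away file. A real script would simply call
# main() so that argparse reads the arguments from sys.argv.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "sample.fasta")
    report = os.path.join(tmp, "report.txt")
    with open(fasta, "w") as handle:
        handle.write(">seq1\nACGT\nGGCC\n")
    main(["--input", fasta, "--output", report])
    with open(report) as handle:
        report_text = handle.read()
print(report_text)
```

Testing a script like this from a shell with small sample files, before pasting it into the TF form, is much quicker than debugging it through Galaxy.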
73
74 Tools nearly always take one or more data sets from the user's history as input. TF tools
75 allow the TF user to define what Galaxy datatypes the tool end user will be able to choose and what
76 names or positions will be used to pass them on a command line to the package or script.
77
78 Tools often have various parameter settings. The TF allows the TF user to define how each
79 parameter will appear on the tool form to the end user, and what names or positions will be
80 used to pass them on the command line to the package. At present, parameters are limited to
81 simple text and number fields. Pull requests for other kinds of parameters that galaxyxml
82 can handle are welcomed.
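
The difference between passing parameters by name and by position can be sketched as follows. This is a hypothetical illustration, not the TF's own code: the `build_command` helper, the dictionary keys and the `--name value` convention are all assumptions made for the example.

```python
def build_command(executable, params):
    """Build an argument list from hypothetical parameter definitions.

    Each parameter is a dict with a "value" and either a "position"
    (positional argument) or a "name" (passed as --name value).
    """
    positional = sorted(
        (p for p in params if "position" in p), key=lambda p: p["position"]
    )
    named = [p for p in params if "position" not in p]
    command = [executable]
    for param in named:
        command.extend([f"--{param['name']}", str(param["value"])])
    # Positional values are appended in their declared order.
    command.extend(str(param["value"]) for param in positional)
    return command


example = build_command(
    "myscript.py",
    [
        {"position": 2, "value": "output.txt"},
        {"name": "min_length", "value": 20},
        {"position": 1, "value": "input.fasta"},
    ],
)
print(" ".join(example))
```

Whichever convention is chosen on the form, the package or script has to expect exactly that layout on its command line.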
83
84 Best practice Galaxy tools have one or more automated tests. These should use small sample data sets and
85 specific parameter settings so when the tool is tested, the outputs can be compared with their expected
86 values. The TF will automatically create a test for the new tool. It will use the sample data sets
87 chosen by the TF user when they built the new tool.
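
The idea of comparing a test output with its expected value can be sketched in a few lines of Python. This is only an illustration of the principle, assuming plain text outputs; it is not how Planemo or Galaxy implement their comparisons, and `compare_outputs` is a hypothetical helper.

```python
import difflib


def compare_outputs(expected_text, actual_text):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            expected_text.splitlines(),
            actual_text.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


matching = compare_outputs("GC=6 total=8\n", "GC=6 total=8\n")
differing = compare_outputs("GC=6 total=8\n", "GC=5 total=8\n")
print("match" if not matching else "\n".join(matching))
print("match" if not differing else "\n".join(differing))
```

Small sample inputs keep such comparisons fast and make any difference easy to read.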
88
89 The TF works by exposing *unrestricted* and therefore extremely dangerous scripting
90 to all designated administrators of the host Galaxy server, allowing them to
91 run scripts in R, python, sh and perl. For this reason, a Docker container is
92 available to help manage the associated risks.
93
94 *Scripting uses*
95
96 To use a scripting language to create a new tool, you must first prepare and properly test a script. Use small sample
97 data sets for testing. When the script is working correctly, upload the small sample datasets
98 into a new history, start configuring a new ToolFactory tool, and paste the script into the script text box on the TF form.
99
100 *Outputs*
101
102 Once the script runs successfully, a new Galaxy tool that runs your script
103 can be generated. Select the "generate" option and supply some help text and
104 names. The new tool will be generated in the form of a new Galaxy datatype
105 *tgz* - as the name suggests, it's an archive ready to upload to a
106 Galaxy ToolShed as a new tool repository.
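
Because the generated tool is an ordinary gzipped tar archive, its contents can be inspected before it is uploaded anywhere. The sketch below builds a tiny archive in memory and lists it; the file name `example/example.xml` and its contents are made up for the demonstration and do not reflect the exact layout the TF produces.

```python
import io
import tarfile


def list_archive(fileobj):
    """Return the sorted member names of a gzipped tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted(archive.getnames())


# Build a small stand-in archive in memory so the example is self-contained.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b'<tool id="example" name="example" version="0.1"/>\n'
    info = tarfile.TarInfo(name="example/example.xml")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

names = list_archive(buffer)
print(names)
```

For a real archive downloaded from the history, `tarfile.open("path/to/archive.tgz", "r:gz")` would be used instead of the in-memory buffer.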
107
108 It is also possible to run a tool to generate test outputs, then test it
109 using planemo. A toolshed is built in to the Docker container and configured
110 so a tool can be tested, sent to that toolshed, then installed in the Galaxy
111 where the TF is running.
112
113 If the tool requires a command or test XML override, then planemo is
114 needed to generate test outputs to make a complete tool. The tool is then rerun to test it
115 and, if required, uploaded to the local toolshed and installed in the Galaxy
116 where the TF is running.
117
118 Once it's in a ToolShed, it can be installed into any local Galaxy server
119 from the server administrative interface.
120
121 Once the new tool is installed, local users can run it - each time, the
122 package and/or script that was supplied when it was built will be executed with the input chosen
123 from the user's history, together with user supplied parameters. In other words, the tools you generate with the
124 ToolFactory run just like any other Galaxy tool.
125
126 TF generated tools work as normal workflow components.
127
128
129 *Limitations*
130
131 The TF is flexible enough to generate wrappers for many common scientific packages
132 but the inbuilt automation will not cope with all possible situations. Users can
133 supply overrides for two tool XML segments - tests and command. The BWA
134 example in the supplied samples workflow illustrates their use.
135
136 *Installation*
137
138 The Docker container is the best way to use the TF because it is preconfigured
139 to automate new tool testing and has a built in local toolshed where each new tool
140 is uploaded. If you grab the docker container, it should just work.
141
142 If you build the container, there are some things to watch out for. Let it run for 10 minutes
143 or so once you build it - check with top until conda has finished fussing. Once everything quietens
144 down, find the container with
145 ```docker ps```
146 and use
147 ```docker exec [containername] supervisorctl restart galaxy:```
148 That colon is not a typographical mistake.
149 Not restarting after first boot seems to leave the job/workflow system confused and the workflow
150 just will not run properly until Galaxy has restarted.
151
152 Login as admin@galaxy.org with password "password". Feel free to change it once you are logged in.
153 There should be a companion toolshed at localhost:9009. The history should have some sample data for
154 the workflow.
155
156 Run the workflow and make sure the right dataset is selected for each of the input files. Most of the
157 examples use text files so should run, but the bwa example needs the right ones to work properly.
158
159 When the workflow is finished, you will have half a dozen examples to rerun and play with. They have also
160 all been tested and installed so you should find them in your tool menu under "Generated Tools"
161
162 It is easy to install without Docker, but you will need to make some
163 configuration changes (TODO write a configuration). You can install it most conveniently using the
164 administrative "Search and browse tool sheds" link. Find the Galaxy Main
165 toolshed at https://toolshed.g2.bx.psu.edu/ and search for the toolfactory
166 repository in the Tool Maker section. Open it and review the code and select the option to install it.
167
168 Otherwise, if not already there pending an accepted PR,
169 please add:
170 <datatype extension="tgz" type="galaxy.datatypes.binary:Binary"
171 mimetype="multipart/x-gzip" subclass="True" />
172 to your local datatypes_conf.xml.
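
A small Python sketch can check whether the `tgz` datatype is already registered and add it if not. It assumes the usual layout of a `<datatypes>` root containing a `<registration>` element; the sample configuration text and the `ensure_tgz` helper are illustrative assumptions, and a real configuration file should be backed up before it is edited.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of a local datatypes configuration file.
SAMPLE_CONF = """<datatypes>
  <registration>
    <datatype extension="txt" type="galaxy.datatypes.data:Text"/>
  </registration>
</datatypes>"""

# Attributes copied from the datatype element shown above.
TGZ_ATTRS = {
    "extension": "tgz",
    "type": "galaxy.datatypes.binary:Binary",
    "mimetype": "multipart/x-gzip",
    "subclass": "True",
}


def ensure_tgz(xml_text):
    """Return the configuration text with the tgz datatype registered."""
    root = ET.fromstring(xml_text)
    registration = root.find("registration")
    if registration is None:
        raise ValueError("no <registration> element found")
    extensions = {d.get("extension") for d in registration.findall("datatype")}
    if "tgz" not in extensions:
        ET.SubElement(registration, "datatype", TGZ_ATTRS)
    return ET.tostring(root, encoding="unicode")


updated = ensure_tgz(SAMPLE_CONF)
print(updated)
```

Running the function a second time on its own output leaves the configuration unchanged, so it is safe to repeat.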
173
174
175 *Restricted execution*
176
177 The tool factory tool itself will then be usable ONLY by admin users -
178 people with IDs in admin_users. **Yes, that's right. ONLY
179 admin_users can run this tool** Think about it for a moment. If allowed to
180 run any arbitrary script on your Galaxy server, the only thing that would
181 impede a miscreant bent on destroying all your Galaxy data would probably
182 be lack of appropriate technical skills.
183
184 **Generated tool Security**
185
186 Once you install a generated tool, it's just
187 another tool - assuming the script is safe. Generated tools run normally and their
188 users cannot do anything unusually insecure, but please practice safe toolshed.
189 Read the code before you install any tool. Especially this one - it is really scary.
190
191 **Send Code**
192
193 Pull requests and suggestions are welcome, as git issues please.
194
195 **Attribution**
196
197 Creating re-usable tools from scripts: The Galaxy Tool Factory
198 Ross Lazarus; Antony Kaspi; Mark Ziemann; The Galaxy Team
199 Bioinformatics 2012; doi: 10.1093/bioinformatics/bts573
200
201 http://bioinformatics.oxfordjournals.org/cgi/reprint/bts573?ijkey=lczQh1sWrMwdYWJ&keytype=ref
202
203 **Licensing**
204
205 Copyright Ross Lazarus 2010
206 ross lazarus at g mail period com
207
208 All rights reserved.
209
210 Licensed under the LGPL
211