diff toolfactory/README.md @ 30:6f48315c32c1 draft
Uploaded
author | fubar |
---|---|
date | Fri, 07 Aug 2020 07:54:23 -0400 |
parents | |
children | 4d578c8c1613 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/toolfactory/README.md Fri Aug 07 07:54:23 2020 -0400
@@ -0,0 +1,418 @@
+toolfactory_2
+=============
+
+This is an upgrade to the tool factory with added parameters
+(optionally editable in the generated tool form - otherwise fixed) and
+multiple input files.
+
+Specify any number of parameters - well, at
+least up to the limit of your patience with repeat groups.
+
+Parameter values supplied at tool generation time are defaults and
+can optionally be made editable by the user - names cannot be changed once
+a tool has been generated.
+
+If not editable, they act as hidden parameters passed to the script
+and do not appear on the tool form.
+
+Note! Galaxy applies its default sanitization to all
+user input parameters, which your script may need to dance around.
+
+Any number of input files can be passed to your script, but of course it
+has to deal with them. Both path and metadata name are supplied either in the environment
+(bash/sh) or as command line parameters (python, perl, rscript) that need to be parsed and
+dealt with in the script. This is complicated by the common use case of needing file names
+for (eg) column headers, as well as paths. Try the examples shown on the tool factory
+form to see how Galaxy file and user supplied parameter values can be recovered in each
+of the 4 scripting environments supported.
+
+The best way to deal with multiple outputs is to let the tool factory generate an HTML
+page for your users. It automagically lays out pdf images as thumbnail galleries
+and can have separate results sections gathering all similarly prefixed files, such as
+a Foo section taking text and results from text (foo_whatever.log) and
+artifact (eg foo_MDS_plot.pdf) files that share the foo prefix. All artifacts are linked for download.
+A copy of the actual script is provided for provenance - be warned, it exposes
+real file paths.
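The prefix-based grouping described above can be pictured with a small sketch. This is not the tool factory's own code: the function name `group_outputs_by_prefix` is hypothetical, and it assumes the section prefix is whatever precedes the first underscore in a file name.

```python
import os
from collections import defaultdict


def group_outputs_by_prefix(filenames):
    """Group output file names into report sections keyed by name prefix.

    Assumption (not taken from the tool factory source): the prefix is the
    part of the file stem before the first underscore.
    """
    sections = defaultdict(list)
    for name in filenames:
        stem = os.path.splitext(name)[0]
        prefix = stem.split("_", 1)[0]
        sections[prefix].append(name)
    # Sections, and the files inside each section, in alphabetical order
    return {prefix: sorted(names) for prefix, names in sorted(sections.items())}


example = ["foo_whatever.log", "foo_MDS_plot.pdf", "bar_summary.log", "bar_plot.pdf"]
grouped = group_outputs_by_prefix(example)
for prefix, names in grouped.items():
    print(prefix.capitalize(), names)
```

Naming script outputs with a consistent prefix is what makes this kind of grouping possible, so it is worth settling on the prefixes before writing the script.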
+
+**WARNING before you start**
+
+Install this tool on a private Galaxy ONLY.
+Please NEVER on a public or production instance.
+
+Please cite the resource at
+http://bioinformatics.oxfordjournals.org/cgi/reprint/bts573?ijkey=lczQh1sWrMwdYWJ&keytype=ref
+if you use this tool in your published work.
+
+
+*Short Story*
+
+This is an unusual Galaxy tool capable of generating new Galaxy tools.
+It works by exposing *unrestricted* and therefore extremely dangerous scripting
+to all designated administrators of the host Galaxy server, allowing them to
+run scripts in R, python, sh and perl over multiple selected input data sets,
+writing a single new data set as output.
+
+*Differences between TF2 and the original Tool Factory*
+
+1. TF2 (this one) allows any number of either fixed or user-editable parameters to be defined
+for the new tool. If these are editable, the user can change them; otherwise, they are passed
+as fixed and invisible parameters for each execution. Obviously, there are substantial security
+implications with editable parameters, but these are always sanitized by Galaxy's inbuilt
+parameter sanitization so you may need to "unsanitize" characters - eg translate all "__lt__"
+into "<" for certain parameters where that is needed. Please practise safe toolshed.
+
+2. Any number of input files (all of the same datatype) may be defined.
+
+These changes substantially complicate the way your script is supplied with
+all the new and variable parameters. Examples in each scripting language are shown
+in the tool help.
+
+*Automated outputs in named sections*
+
+If your script writes an arbitrary mix of (eg) pdfs, tabular analysis results
+and run logs to the current directory, the tool factory can optionally
+auto-generate a linked Html page with separate sections showing a thumbnail
+grid for all pdfs and the log text, grouping all artifacts sharing a file
+name and log name prefix. If "foo.log" is emitted then *all* other outputs matching foo_* will
+be grouped together - eg
+
+- foo_baz.pdf
+- foo_bar.pdf and
+- foo_zot.xls
+
+would all be displayed and linked in the same section with foo.log's contents to form the "Foo" section of the Html page.
+Sections appear in alphabetic order and there are no limits on the number of files or sections.
+
+*Automated generation of new Galaxy tools for installation into any Galaxy*
+
+Once a script is working correctly, this tool optionally generates a
+new Galaxy tool, effectively freezing the supplied script into a new,
+ordinary Galaxy tool that runs it over one or more input files selected by
+the user. Generated tools are installed via a tool shed by an administrator
+and work exactly like all other Galaxy tools for your users.
+
+If you use the Html output option, please ensure that sanitize_all_html is
+set to False and uncommented in universe_wsgi.ini. The comment there notes that
+by default, all tool output served as 'text/html' will be sanitized.
+Change the setting to ```sanitize_all_html = False```
+
+Disabling sanitization opens potential security risks and may not be acceptable for public
+sites. If it is left on, the lack of stylesheets may make Html pages damage onlookers'
+eyeballs, but the content should still be correct.
+
+*More Detail*
+
+To use the ToolFactory, you should have prepared a script to paste into a
+text box, and a small test input example ready to select from your history
+to test your new script.
+
+There is an example in each scripting language on the Tool Factory form. You
+can just cut and paste these to try it out - remember to select the right
+interpreter please. You'll also need to create a small test data set using
+the Galaxy history add new data tool.
+
+If the script fails somehow, use the "redo" button on the tool output in
+your history to recreate the form complete with broken script. Fix the bug
+and execute again. Rinse, wash, repeat.
+
+Once the script runs successfully, a new Galaxy tool that runs your script
+can be generated. Select the "generate" option and supply some help text and
+names. The new tool will be generated in the form of a new Galaxy datatype
+- toolshed.gz. As the name suggests, it's an archive ready to upload to a
+Galaxy ToolShed as a new tool repository.
+
+Once it's in a ToolShed, it can be installed into any local Galaxy server
+from the server administrative interface.
+
+Once the new tool is installed, local users can run it - each time, the script
+that was supplied when it was built will be executed with the input chosen
+from the user's history. In other words, the tools you generate with the
+ToolFactory run just like any other Galaxy tool, but run your script every time.
+
+Tool factory tools are perfect for workflow components. One input, one output,
+no variables.
+
+*To fully and safely exploit the awesome power* of this tool,
+Galaxy and the ToolShed, you should be a developer installing this
+tool on a private/personal/scratch local instance where you are an
+admin_user. Then, if you break it, you get to keep all the pieces - see
+https://bitbucket.org/fubar/galaxytoolfactory/wiki/Home
+
+**Installation**
+
+This is a Galaxy tool. You can install it most conveniently using the
+administrative "Search and browse tool sheds" link. Find the Galaxy Main
+toolshed at https://toolshed.g2.bx.psu.edu/ and search for the toolfactory
+repository. Open it, review the code and select the option to install it.
+
+
+If you can't get the tool that way, the xml and py files here need to be
+copied into a new tools subdirectory such as tools/toolfactory.
+Your tool_conf.xml needs a new entry pointing to the xml file - something like
+```
+  <section name="Tool building tools" id="toolbuilders">
+    <tool file="toolfactory/rgToolFactory.xml"/>
+  </section>
+```
+If not already there (I just added it to datatypes_conf.xml.sample),
+please add:
+
+```
+<datatype extension="toolshed.gz" type="galaxy.datatypes.binary:Binary"
+mimetype="multipart/x-gzip" subclass="True" />
+```
+to your local datatypes_conf.xml.
+
+
+Of course, R, python, perl etc are needed on your path if you want to test
+scripts using those interpreters. Adding new ones to this tool code should
+be easy enough. Please make suggestions as bitbucket issues and code. The
+HTML file code automatically shrinks R's bloated pdfs, and depends on
+ghostscript. The thumbnails require imagemagick.
+
+*Restricted execution*
+
+The tool factory tool itself will then be usable ONLY by admin users -
+people with IDs in admin_users in universe_wsgi.ini. **Yes, that's right. ONLY
+admin_users can run this tool.** Think about it for a moment. If allowed to
+run any arbitrary script on your Galaxy server, the only thing that would
+impede a miscreant bent on destroying all your Galaxy data would probably
+be lack of appropriate technical skills.
+
+*What it does* This is a tool factory for simple scripts in python, R and
+perl currently. Functional tests are automatically generated. How cool is that.
+
+LIMITED to simple scripts that read one input from the history. They can optionally
+write one new history dataset, and optionally collect any number of outputs
+into links on an autogenerated HTML index page for the user to navigate -
+useful if the script writes images and output files. Pdf outputs are shown
+as thumbnails and R's bloated pdfs are shrunk with ghostscript, so both
+ghostscript and imagemagick need to be available.
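Since the interpreters, ghostscript and imagemagick all have to be on the path, a quick check before testing scripts can save a failed run. The following is a minimal sketch, not part of the tool factory. The executable names (`gs` for ghostscript, `convert` for imagemagick, `Rscript` for R) are assumptions that may differ on your system, and `missing_executables` is a hypothetical helper.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names - adjust these for your own installation
required = ["python", "Rscript", "perl", "gs", "convert"]
missing = missing_executables(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables found.")
```

Run it as the same user that runs the Galaxy server, because that user's PATH is the one the tool factory will see.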
+
+Generated tools can be edited and enhanced like any Galaxy tool, so start
+small and build up, since a generated script gets you a serious leg up to a
+more complex one.
+
+*What you do* You paste and run your script, you fix the syntax errors and
+eventually it runs. You can use the redo button and edit the script before
+trying to rerun it as you debug - it works pretty well.
+
+Once the script works on some test data, you can generate a toolshed compatible
+gzip file containing your script ready to run as an ordinary Galaxy tool in
+a repository on your local toolshed. That means safe and largely automated
+installation in any production Galaxy configured to use your toolshed.
+
+*Generated tool Security* Once you install a generated tool, it's just
+another tool - assuming the script is safe. Generated tools just run normally and their
+users cannot do anything unusually insecure but please, practice safe toolshed.
+Read the fucking code before you install any tool. Especially this one -
+it is really scary.
+
+If you opt for an HTML output, you get all the script outputs arranged
+as a single Html history item - all output files are linked, with thumbnails for
+all the pdfs. Ugly but really inexpensive.
+
+Patches and suggestions are welcome as bitbucket issues, please.
+
+copyright ross lazarus (ross stop lazarus at gmail stop com) May 2012
+
+all rights reserved
+Licensed under the LGPL - if you want to improve it, feel free
+https://bitbucket.org/fubar/galaxytoolfactory/wiki/Home
+
+Material for our more enthusiastic and voracious readers continues below -
+we salute you.
+
+**Motivation** Simple transformation, filtering or reporting scripts get
+written, run and lost every day in most busy labs - even ours where Galaxy is
+in use. This 'dark script matter' is pervasive and generally not reproducible.
+
+**Benefits** For our group, this allows Galaxy to fill that important dark
+script gap - all those "small" bioinformatics tasks. Once a user has a working
+R (or python or perl) script that does something Galaxy cannot currently do
+(eg transpose a tabular file) and takes parameters the way Galaxy supplies
+them (see example below), they:
+
+1. Install the tool factory on a personal private instance
+
+2. Upload a small test data set
+
+3. Paste the script into the 'script' text box and iteratively run the
+insecure tool on test data until it works right - there is absolutely no
+reason to do this anywhere other than on a personal private instance.
+
+4. Once it works right, set the 'Generate toolshed gzip' option and run
+it again.
+
+5. A toolshed style gzip appears ready to upload and install like any other
+Toolshed entry.
+
+6. Upload the new tool to the toolshed
+
+7. Ask the local admin to check the new tool to confirm it's not evil and
+install it in the local production Galaxy
+
+
+
+**Parameter passing and file inputs**
+
+Your script will receive up to 3 named parameters:
+
+- INPATHS is a comma separated list of input file paths
+- INNAMES is a comma separated list of input file names in the same order
+- OUTPATH is optional - if a file is being generated, your script should write there
+
+Your script should open and write files in the provided working directory if you are using the Html
+automatic presentation option.
+
+Python script command lines will have --INPATHS and --additional_arguments etc. to make it easy to use argparse.
+
+Rscript will need to use commandArgs(TRUE) - see the example below - additional arguments will
+appear as themselves - eg foo="bar" will mean that foo is defined as "bar" for the script.
+
+Bash and sh will see any additional parameters on their command lines and the 3 named parameters
+in their environment magically - well, using env on the CL.
+
+***python***
+
+```
+# argparse for 3 possible comma separated lists
+# additional parameters need to be parsed !
+# then echo parameters to the output file
+import sys
+import argparse
+argp = argparse.ArgumentParser()
+argp.add_argument('--INNAMES', default=None)
+argp.add_argument('--INPATHS', default=None)
+argp.add_argument('--OUTPATH', default=None)
+argp.add_argument('--additional_parameters', default=[], action="append")
+argp.add_argument('otherargs', nargs=argparse.REMAINDER)
+args = argp.parse_args()
+# write the parsed arguments and the raw command line to the output file
+with open(args.OUTPATH, 'w') as f:
+    f.write('### args=%s\n' % str(args))
+    f.write('sys.argv=%s\n' % sys.argv)
+```
+
+***Rscript***
+
+```
+# tool factory Rscript parser suggested by Forester
+# http://www.r-bloggers.com/including-arguments-in-r-cmd-batch-mode/
+# additional parameters will appear in the ls() below - they are available
+# to your script
+# echo parameters to the output file
+ourargs = commandArgs(TRUE)
+if(length(ourargs)==0){
+    print("No arguments supplied.")
+}else{
+    for(i in 1:length(ourargs)){
+        eval(parse(text=ourargs[[i]]))
+    }
+    sink(OUTPATH)
+    cat('INPATHS=',INPATHS,'\n')
+    cat('INNAMES=',INNAMES,'\n')
+    cat('OUTPATH=',OUTPATH,'\n')
+    x=ls()
+    cat('all objects=',x,'\n')
+    sink()
+}
+sessionInfo()
+print.noquote(date())
+```
+
+***bash/sh***
+
+```
+# tool factory sets up these environmental variables
+# this example writes those to the output file
+# additional params appear on command line
+if [ ! -f "$OUTPATH" ] ; then
+    touch "$OUTPATH"
+fi
+echo "INPATHS=$INPATHS" >> "$OUTPATH"
+echo "INNAMES=$INNAMES" >> "$OUTPATH"
+echo "OUTPATH=$OUTPATH" >> "$OUTPATH"
+echo "CL=$@" >> "$OUTPATH"
+```
+
+***perl***
+
+```
+# echo the 3 parameters to the output file
+(my $INPATHS,my $INNAMES,my $OUTPATH ) = @ARGV;
+open(my $fh, '>', $OUTPATH) or die "Could not open file '$OUTPATH' $!";
+print $fh "INPATHS=$INPATHS\n INNAMES=$INNAMES\n OUTPATH=$OUTPATH\n";
+close $fh;
+```
+
+**Galaxy as an IDE for developing API scripts**
+
+If you need to develop Galaxy API scripts and you like to live dangerously,
+please read on.
+
+**Galaxy as an IDE?**
+
+Amazingly enough, blend-lib API scripts run perfectly well *inside*
+Galaxy when pasted into a Tool Factory form. No need to generate a new
+tool. Galaxy+Tool_Factory = IDE. I think we need a new t-shirt. Seriously,
+it is actually quite usable.
+
+**Why bother - what's wrong with Eclipse?**
+
+Nothing. But, compared with developing API scripts in the usual way outside
+Galaxy, you get persistence and other framework benefits plus, at absolutely
+no extra charge, a ginormous security problem if you share the history or
+any outputs, because they contain the api script with its key - so development
+servers only please!
+
+**Workflow**
+
+Fire up the Tool Factory in Galaxy.
+
+Leave the input box empty, set the interpreter to python, paste and run an
+api script - eg the working example (substitute the url and key) below.
+
+It took me a few iterations to develop the example below because I know
+almost nothing about the API. I started with very simple code from one of the
+samples and after each run, the (edited..) api script is conveniently recreated
+using the redo button on the history output item. So each successive version
+of the developing api script you run is persisted - ready to be edited and
+rerun easily. It is *very* handy to be able to add a line of code to the
+script and run it, then view the output to (eg) inspect dicts returned by
+API calls, working progressively deeper with each iteration.
+
+Give the below a whirl on a private clone (install the tool factory from
+the main toolshed) and try adding complexity with a few rerun/edit/rerun cycles.
+
+**Eg tool factory api script**
+
+```
+import sys
+from blend.galaxy import GalaxyInstance
+ourGal = 'http://x.x.x.x:xxxx'
+ourKey = 'xxx'
+gi = GalaxyInstance(ourGal, key=ourKey)
+libs = gi.libraries.get_libraries()
+res = []
+# libs looks like
+# u'url': u'/galaxy/api/libraries/441d8112651dc2f3', u'id':
+# u'441d8112651dc2f3', u'name':.... u'Demonstration sample RNA data',
+for lib in libs:
+    res.append('%s:\n' % lib['name'])
+    res.append(str(gi.libraries.show_library(lib['id'],contents=True)))
+# write the collected library listings to the output path
+with open(sys.argv[2],'w') as outf:
+    outf.write('\n'.join(res))
+```
+
+**Attribution**
+
+Creating re-usable tools from scripts: The Galaxy Tool Factory
+Ross Lazarus; Antony Kaspi; Mark Ziemann; The Galaxy Team
+Bioinformatics 2012; doi: 10.1093/bioinformatics/bts573
+
+http://bioinformatics.oxfordjournals.org/cgi/reprint/bts573?ijkey=lczQh1sWrMwdYWJ&keytype=ref
+
+**Licensing**
+
+Copyright Ross Lazarus 2010
+ross lazarus at g mail period com
+
+All rights reserved.
+
+Licensed under the LGPL
+
+**screenshot**
+
+![example run](/images/dynamicScriptTool.png)