view gstf_preparation.xml @ 7:9ef7661e8e9c draft
planemo upload for repository https://github.com/TGAC/earlham-galaxytools/tree/master/tools/gstf_preparation commit 79c2cb2362b64134df778cc484e426642eb6895e
author | earlhaminst |
---|---|
date | Wed, 25 Apr 2018 11:06:03 -0400 |
parents | 56bbdbfe3eaa |
children | 92f3966d5bc3 |
<tool id="gstf_preparation" name="GeneSeqToFamily preparation" version="0.4.1">
    <description>converts data for the workflow</description>
    <command detect_errors="exit_code">
<![CDATA[
python '$__tool_directory__/gstf_preparation.py'
#for $q in $queries
    --gff3 '${q.genome}:${q.gff3_input}'
#end for
#if str($json) != 'None'
    #for $v in $json
        --json '$v'
    #end for
#end if
#for $fasta_input in $fasta_inputs
    --fasta '${fasta_input}'
#end for
#if $headers
    --headers
#end if
#if $longestCDS
    -l
#end if
#if $regions
    --regions '$regions'
#end if
-o '$output_db'
--of '$output_fasta'
--ff '$filtered_fasta'
]]>
    </command>

    <inputs>
        <repeat name="queries" title="GFF3 dataset">
            <param name="gff3_input" type="data" format="gff3" label="GFF3 dataset" />
            <param name="genome" type="text" label="Genome name" help="Genome name without whitespaces or special characters">
                <validator type="empty_field" />
            </param>
        </repeat>
        <param name="json" type="data" format="json" multiple="true" optional="true" label="Gene features in JSON format generated by 'Get features by Ensembl ID' tool" />
        <param name="fasta_inputs" type="data" format="fasta" multiple="true" label="Corresponding FASTA datasets" help="Each FASTA header line should start with a transcript id" />
        <param name="longestCDS" type="boolean" checked="false" label="Keep only the longest CDS per gene" />
        <param name="headers" type="boolean" checked="true" label="Change the header line of the FASTA sequences to the >TranscriptId_species format" help="As required by TreeBest, part of the GeneSeqToFamily workflow" />
        <param name="regions" type="text" optional="true" label="Comma-separated list of region IDs (e.g. chromosomes or scaffolds) for which FASTA sequences should be filtered" help="Region IDs are in the `seqid` column for GFF3 and in the `seq_region_name` field in JSON. This is typically used to filter chromosomes with a non-standard genetic code, like mitochondria, to be analysed separately" />
    </inputs>

    <outputs>
        <data name="output_db" format="sqlite" label="${tool.name} on ${on_string}: SQLite" />
        <data name="output_fasta" format="fasta" label="${tool.name} on ${on_string}: FASTA" />
        <data name="filtered_fasta" format="fasta" label="${tool.name} on ${on_string}: filtered sequences" />
    </outputs>

    <tests>
        <test>
            <param name="fasta_inputs" ftype="fasta" value="Caenorhabditis_elegans.WBcel235.cds.all.shortened.fa" />
            <param name="gff3_input" ftype="gff3" value="Caenorhabditis_elegans.WBcel235.87.chromosome.I.shortened.gff3" />
            <param name="genome" value="caenorhabditis_elegans" />
            <param name="longestCDS" value="false" />
            <param name="headers" value="true" />
            <output name="output_db" file="test1.sqlite" compare="sim_size" />
            <output name="output_fasta" file="test1.fasta" />
            <output name="filtered_fasta" file="test1.ns.fasta" />
        </test>
        <test>
            <param name="fasta_inputs" ftype="fasta" value="Caenorhabditis_elegans.WBcel235.cds.all.shortened.fa" />
            <param name="gff3_input" ftype="gff3" value="Caenorhabditis_elegans.WBcel235.87.chromosome.I.shortened.gff3" />
            <param name="genome" value="caenorhabditis_elegans" />
            <param name="longestCDS" value="true" />
            <param name="headers" value="true" />
            <output name="output_db" file="test1.sqlite" compare="sim_size" />
            <output name="output_fasta" file="test1_longest.fasta" />
            <output name="filtered_fasta" file="test1.ns.fasta" />
        </test>
        <test>
            <param name="fasta_inputs" ftype="fasta" value="Caenorhabditis_elegans.WBcel235.cds.all.shortened.fa" />
            <param name="gff3_input" ftype="gff3" value="Caenorhabditis_elegans.WBcel235.87.chromosome.I.shortened.gff3" />
            <param name="genome" value="caenorhabditis_elegans" />
            <param name="longestCDS" value="false" />
            <param name="headers" value="false" />
            <output name="output_db" file="test1.sqlite" compare="sim_size" />
            <output name="output_fasta" file="Caenorhabditis_elegans.WBcel235.cds.all.shortened.fa" />
            <output name="filtered_fasta" file="test1.ns.fasta" />
        </test>
        <test>
            <param name="fasta_inputs" ftype="fasta" value="CDS.fasta" />
            <param name="json" ftype="json" value="gene.json" />
            <param name="longestCDS" value="false" />
            <param name="headers" value="true" />
            <output name="output_db" file="test4.sqlite" compare="sim_size" />
            <output name="output_fasta" file="test4.fasta" />
            <output name="filtered_fasta" file="test4.ns.fasta" />
        </test>
        <test>
            <param name="fasta_inputs" ftype="fasta" value="CDS.fasta" />
            <param name="json" ftype="json" value="gene.json" />
            <param name="longestCDS" value="false" />
            <param name="headers" value="true" />
            <param name="regions" value="X" />
            <output name="output_db" file="test5.sqlite" compare="sim_size" />
            <output name="output_fasta" file="test5_filtered.fasta" />
            <output name="filtered_fasta" file="test5.ns.fasta" />
        </test>
    </tests>

    <help>
<![CDATA[
**What it does**

This tool converts a set of GFF3 and/or JSON gene feature information datasets into SQLite format.

It also filters a CDS FASTA dataset to keep only the transcripts present in the gene feature information. Optionally it can also keep only the longest CDS per gene and/or change the header line of the FASTA sequences to the >TranscriptId_species format (as required by TreeBest, part of the GeneSeqToFamily workflow).

Example GFF3 file::

    scaffold_0	MYZPE13164_Clone_G006_v1.0	gene	44968	69413	.	-	.	ID=MYZPE13164_G006_v1.0_000000030;Name=MYZPE13164_G006_v1.0_000000030;biotype=protein_coding
    scaffold_0	MYZPE13164_Clone_G006_v1.0	mRNA	44968	69413	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1;Parent=MYZPE13164_G006_v1.0_000000030;Name=MYZPE13164_G006_v1.0_000000030.1;biotype=protein_coding;_AED=0.31
    scaffold_0	MYZPE13164_Clone_G006_v1.0	three_prime_utr	44968	46637	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.3utr1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	exon	44968	47432	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.exon1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	CDS	46638	47432	.	-	0	ID=MYZPE13164_G006_v1.0_000000030.1.cds1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	exon	53325	53539	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.exon2;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	CDS	53325	53539	.	-	2	ID=MYZPE13164_G006_v1.0_000000030.1.cds2;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	exon	54614	54719	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.exon3;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	CDS	54614	54719	.	-	0	ID=MYZPE13164_G006_v1.0_000000030.1.cds3;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	CDS	54852	55106	.	-	0	ID=MYZPE13164_G006_v1.0_000000030.1.cds4;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	exon	54852	55117	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.exon4;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	five_prime_utr	55107	55117	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.5utr1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	five_prime_utr	68851	69413	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.5utr2;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0	MYZPE13164_Clone_G006_v1.0	exon	68851	69413	.	-	.	ID=MYZPE13164_G006_v1.0_000000030.1.exon5;Parent=MYZPE13164_G006_v1.0_000000030.1

The following features are parsed: **gene**, **mRNA**, **transcript**, **exon**, **five_prime_utr**, **three_prime_utr** and **CDS**, all others are ignored. Also, **ID** and **Parent** attributes in the 9th column are needed to create relations among features.

.. class:: warningmark

If a value in the **ID** or **Parent** attribute contains a colon, everything up to the first colon will be discarded.
]]>
    </help>
    <citations>
    </citations>
</tool>
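
The parsing rules stated in the help text (which feature types are kept, and how a colon in an **ID** or **Parent** value is handled) can be illustrated with a minimal Python sketch. This is not the code of `gstf_preparation.py`: the names `PARSED_FEATURE_TYPES`, `strip_prefix`, `parse_attributes` and `parse_gff3_line` are hypothetical, and the sketch assumes that the colon itself is discarded together with the text before it and that each feature has a single parent.

```python
# Illustrative sketch only: these helpers are hypothetical and are not taken
# from gstf_preparation.py.

# Feature types listed in the help text; all other types are ignored.
PARSED_FEATURE_TYPES = {
    "gene", "mRNA", "transcript", "exon",
    "five_prime_utr", "three_prime_utr", "CDS",
}


def strip_prefix(value):
    """Discard everything up to the first colon (assumed: colon included)."""
    return value.split(":", 1)[1] if ":" in value else value


def parse_attributes(column9):
    """Split the 9th GFF3 column into a dict of key=value attributes."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def parse_gff3_line(line):
    """Return (type, id, parent) for a parsed feature, or None if ignored."""
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] not in PARSED_FEATURE_TYPES:
        return None
    attrs = parse_attributes(fields[8])
    feature_id = strip_prefix(attrs["ID"]) if "ID" in attrs else None
    parent = strip_prefix(attrs["Parent"]) if "Parent" in attrs else None
    return fields[2], feature_id, parent
```

Under these assumptions, an Ensembl-style value such as `transcript:Y74C9A.3` would be reduced to `Y74C9A.3`, while the identifiers in the example GFF3 file above contain no colon and would pass through unchanged.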